
How to build a smart RasPi Bot with Cloud Vision and Speech API - Google I/O 2016

Cloud Vision API and Speech API let you take advantage of fully-trained machine learning models from Google. These technologies enable the powerful search behind Google Photos, and voice search in the mobile Google App. In this session we learn how to build a smart Raspberry Pi robot that can detect faces and objects and follow voice commands using these APIs.

See all the talks from Google I/O 2016 here: https://goo.gl/olw6kV
Watch more Android talks at I/O 2016 here: https://goo.gl/Uv3jls
Watch more Chrome talks at I/O 2016 here: https://goo.gl/JoMLpB
Watch more Firebase talks at I/O 2016 here: https://goo.gl/JTH9Fr

#io16 #GoogleIO #GoogleIO2016

Google for Developers

7 years ago

Hello, and thank you for taking the time to attend this session, "How to build a smart Raspberry Pi bot with Cloud Vision and Speech API." I'm Kaz Sato, a developer advocate for Google Cloud Platform. And I'm Glenn Shires, a software engineer on the Cloud Speech API.

First, I'd like to show a demonstration video of the Cloud Vision bot. Cloud Vision provides powerful image analytics capabilities as easy-to-use APIs. It enables application developers to build the next generation of applications that can see and understand the content within images. The service is built on powerful computer vision models that power several different Google services. It enables developers to detect a broad set of entities within an image, from everyday objects to faces and product logos. The service is very easy to use: as one example of a use case, you can have a Raspberry Pi robot like GoPiGo call the Cloud Vision API directly, so the bot can send the images taken by its camera to the cloud and get the analysis results back in real time. It detects faces in the image along with the associated emotions, and it is also able to detect entities within the image. Let's see how facial detection works. Cloud Vision detects faces in the picture and returns the positions of the eyes, nose, and mouth, so you can program the bot to follow a face. It also detects emotions such as joy, anger, surprise, and sorrow, so the bot can move toward smiling faces, or avoid angry or surprised faces. One of the most interesting features of the Cloud Vision API is entity detection, which means it detects whatever objects you like — in the video it recognizes glasses. Cloud Vision lets developers take advantage of Google's latest machine learning technologies quite easily; please go to cloud.google.com/vision to learn more.

So that was the demonstration video for the Cloud Vision bot. I'd like to discuss how to build this bot using the Vision API in this session, but before that, let me briefly discuss the machine intelligence working behind the bot.

We are using a technology called neural networks. So what is a neural network? A neural network is a function that can learn from a training data set. It is designed to mimic the behavior of the neurons inside the human brain by using matrix operations. For example, if you want to do image recognition with neural networks, you can convert your input image — say, a cat image — into a large vector, and then feed that vector to the neural network, where it performs a massive number of matrix operations, such as multiplications and additions between vectors and matrices. Eventually you get another large vector as output, which represents the labels of the objects detected in the image, such as a cat, an automobile, or a human face.
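For reference, here is a minimal sketch of what one such layer computes, written in Python with NumPy. The shapes, weights, and input values are made up purely for illustration; a real network has many layers and learns W and b during training.

    import numpy as np

    # A toy fully connected layer: output = activation(W @ x + b).
    # Shapes are illustrative: a 4-pixel "image" flattened into a vector,
    # mapped onto 3 output labels (e.g. cat / automobile / face).
    x = np.array([0.2, 0.9, 0.4, 0.1])   # flattened input image
    W = np.random.randn(3, 4) * 0.1      # weight matrix (found during training)
    b = np.zeros(3)                      # bias vector

    def softmax(z):
        # Turn raw scores into one confidence value per label.
        e = np.exp(z - z.max())
        return e / e.sum()

    scores = W @ x + b                   # the multiplications and additions
    probs = softmax(scores)
    print(probs)                         # e.g. [0.31 0.42 0.27]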
Let's take a look at another example of how neural networks work, using a data set called the double spiral. In this double spiral data set we have two groups of data points: one is the orange group, and the other is the blue group. If you are a programmer and you are asked to classify those data points, what kind of program code would you write? Would you write many if statements or switch statements to classify each point's location using conditions and thresholds? I don't want to write that kind of code. Instead, I would use machine learning, or neural networks, and let the computer try to find a way to solve this problem.

So let's take a look at a demonstration called Playground, where you can actually run and train a neural network to solve this problem. Now you are seeing the computer trying to find the optimal combination of the parameters inside the neural network. Actually, it's not working — let me restart it. Sometimes machine learning fails, so you have to try it several times. Now, here you go: the computer found a way to combine the parameters in an optimal way to classify the double spiral data set. This is how neural networks solve your problem — the computer finds the way, rather than a human instructing the computer how to solve it.
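The demo here is the TensorFlow Playground; as a rough Python analogue of the same idea — letting training find the parameters instead of hand-writing if statements — here is a sketch that fits a small network to a generated two-spiral data set using scikit-learn. The spiral generator and the network size are assumptions for illustration, not what the Playground demo uses.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Generate a two-spiral data set: orange group (0) and blue group (1).
    rng = np.random.default_rng(0)
    n = 500
    t = np.linspace(0.5, 3 * np.pi, n)

    def spiral(flip):
        x = np.c_[t * np.cos(t), t * np.sin(t)] * flip
        return x + rng.normal(scale=0.2, size=x.shape)

    X = np.vstack([spiral(1), spiral(-1)])
    y = np.array([0] * n + [1] * n)

    # Instead of if/switch statements, let training find the parameters.
    clf = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))  # near 1.0 when training succeeds

As in the live demo, training can fail to converge for an unlucky initialization; rerunning with a different random_state mirrors the restart shown on stage.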
You can apply this neural network technology to much more complex problems, such as recognizing a cat in an image or recognizing pedestrians walking along a street. You can do that, but you have to have many more hidden layers between the input vector and the output vector, so it takes a long time to finish training. That is called deep neural networks, or deep learning. The largest problem right now for deep neural networks is computational resources: it usually takes a few days, or sometimes a few weeks, to finish training a deep neural network. That's the reason Google has been researching distributed training on Google Cloud, using GPUs or TPUs, which can shorten training times to one tenth or one hundredth. And that is why Google has been so successful at applying deep neural network technologies to many consumer services, such as voice recognition on Android devices, image recognition in Google Photos, and ranking in Google Search. We now have over 20 production services at Google using deep learning technologies underneath.

Now we have started to externalize the power of the neural networks running on Google Cloud to external developers. The first product is called the Cloud Vision API, and the second is called the Cloud Speech API. What is the Cloud Vision API? It is an image analysis service that provides a pre-trained model, so you don't have to train your own neural network or machine learning model. Instead, you can just use the REST API: send your own images to the API and you'll receive the analysis result in JSON format within a couple of seconds. You don't need any machine learning skill set or experience, and it's inexpensive — it costs only two and a half dollars per 1,000 units, or images — and there's no charge to start trying out the API.

So let's look at another demo, this one for the Cloud Vision API.
Here I'm launching a demonstration called Cloud Vision Explorer, where we have imported over eighty thousand images from Wikimedia Commons, uploaded them to Google Cloud Storage, and already applied Vision API analysis. Using the results from the Vision API, we have done a clustering analysis, so you are seeing clusters of the images grouped by what was detected. If you take a look at the cluster for "cat," you'll see the many images classified as a cat, like this. If you send these images to the API, it will send back results such as "cat" or "pet" — or even "this must be a British Shorthair." All of those results are returned in JSON format like this, but in this demonstration you can also see them in a graphical UI.

If your image has text inside it, the API can convert the text in the image to a string — for example, a road sign warning about a crossing over the next two kilometers — and you get it back as a string. If there are faces in your images, the API can detect the faces, the location of each face, and the emotions of each face, such as joy, sorrow, anger, or surprise, so you can easily find which faces are smiling. If your image contains a very popular location, it can give you the name of that location — such as Citi Field stadium in New York City — with longitude and latitude, so you can even put markers on Google Maps. You can also use the logo detection feature, so you can easily tell which corporate logo the image contains. So that was the demonstration of the Vision API.

Now let's take a look at how you can bring this machine intelligence into the Raspberry Pi bot. The Cloud Vision bot is based on a Raspberry Pi robot called GoPiGo, which is produced by Dexter Industries. You can go to Dexter's website to buy a GoPiGo for around two hundred dollars, and you may also want to buy a fisheye camera to capture a wider range of view around the bot. We have written a few hundred lines of Python code to capture images with the camera and send them to the API.
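The talk doesn't show the bot's capture code itself; as an assumption about how that step might look on a Raspberry Pi, here is a minimal sketch using the picamera module that ships with Raspberry Pi OS.

    import io
    from picamera import PiCamera

    # Capture one frame from the Pi camera into memory as JPEG bytes,
    # ready to be base64-encoded and sent to the Vision API.
    def capture_jpeg():
        stream = io.BytesIO()
        with PiCamera() as camera:
            camera.resolution = (640, 480)
            camera.capture(stream, format="jpeg")
        return stream.getvalue()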
It's really easy to get started with the Vision API: just go to cloud.google.com/vision and follow the getting-started quickstart tutorial, which should take no more than thirty minutes.

This slide shows sample Python code to send your image data to the API. You have to convert the image binary into base64 text and embed that text in the content property of the request, and you also have to specify the types of features you want to detect. In this case it specifies label detection as the feature, so you'll receive labels as the result of the API analysis. Then you make the call to the API, receive the result JSON in a few seconds, and can easily dig into the JSON result to get the labels out. If you specify face detection instead, you'll receive the positions of the face and its landmarks — such as the bounding polygon properties, where you have the x and y positions — and you'll also get the joyLikelihood property, where you can find out whether each face is smiling or not. So it's really easy to write Python code that turns the bot toward the face and, if the face is smiling, runs the motors so the bot follows the person. A sketch of both steps follows below.
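The slide itself isn't reproduced in the transcript, so here is a sketch of the same flow against the Vision API v1 REST endpoint, requesting both label and face detection in one call. The API key and image file name are placeholders.

    import base64
    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder
    ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

    def annotate(image_bytes):
        # Embed the base64-encoded image in the "content" property and
        # list the feature types you want detected.
        body = {"requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "LABEL_DETECTION", "maxResults": 5},
                         {"type": "FACE_DETECTION", "maxResults": 5}],
        }]}
        resp = requests.post(ENDPOINT, params={"key": API_KEY}, json=body)
        return resp.json()["responses"][0]

    result = annotate(open("frame.jpg", "rb").read())

    # Dig the labels out of the JSON result.
    for label in result.get("labelAnnotations", []):
        print(label["description"], label["score"])

    # For faces, read the bounding polygon (x/y positions) and joyLikelihood.
    for face in result.get("faceAnnotations", []):
        xs = [v.get("x", 0) for v in face["fdBoundingPoly"]["vertices"]]
        smiling = face["joyLikelihood"] in ("LIKELY", "VERY_LIKELY")
        print("face center x:", sum(xs) / len(xs), "smiling:", smiling)

From there, the bot logic is just a comparison of the face's center x against the frame center to decide which way to turn, driving the motors forward only when smiling is true.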
So let's take a look at a real demonstration of the Vision API bot. I'm showing the console — this is the web console user interface for the bot, showing the vision results it is capturing right now. Hmm, I'm not sure it's working... it must be working... I hope it's working. Let me try: if I show it the flowers — it's not saying anything, so it looks like it's not working anymore. Maybe I can try fixing this; while I'm doing that, let me pass things over to Glenn.

Hi. I'd also like to talk about the Cloud Speech API, which is actually a rather new API — it was released about a month and a half ago — and it joins a number of speech APIs that Google has had for quite a number of years. You're probably familiar with the Android speech API, which lets you do speech-to-text and text-to-speech on Android devices — phones, tablets, autos, TVs, and so on — and which is a Java-based API. There's also the Web Speech API in Chrome, which is a JavaScript-based API that lets you do speech on the web. The Cloud Speech API is a new API we've released that lets you put speech on any device: whatever device or server you'd like to put speech into, we've made it very easy to integrate into all sorts of different clients and servers, and we support quite a number of languages.

The Cloud Speech API is powered by Google's machine learning. We've got a lot of experience with speech, and we've built all of that into the API, so the same powerful engine you have on Android and on Chrome is now available for whatever project you'd like to use it on. The models are pre-trained, so you do not have to learn machine learning to use it — you can just get up and running immediately — and it supports over 80 languages and language variants. It also has real-time streaming, and what real-time streaming means is that as I'm talking, the text is actually coming out while I'm talking.

What I'd like to do is give you a quick demo of that. I'm sure you're familiar with the Google search page, which has speech on it; what I'm pulling up here is a demonstration page for the Chrome Web Speech API — let me make it a little bit bigger. As I'm talking, I want you to notice that the words are coming out: at first they appear as gray words, because it's not quite sure yet, and when it's very confident of the words it turns them black. So you can see, as I'm speaking, the words appear on the screen and turn black once it's confident of what I said.
Okay — so that's an example of the real-time streaming that's built in. We have a limited preview right now, for which you can sign up and start using the Cloud Speech API.

The Cloud Speech API is actually two different APIs — or two versions of the same API is probably a better way to put it. There's a REST API, which is a very, very simple way to use it: you can get started immediately, and it's as simple as writing a curl command and some JSON. And then there are remote procedure calls, which give you more power.

Let me show you the REST API — this is literally everything you need to know about it, on one slide. On the left there's a JSON request. You can make it more complicated if you want: add languages, add different ways you want the audio processed, and you can even add context. A new thing we're releasing this week is the ability to add new words and phrases to the vocabulary. But the simplest request would be exactly that: a couple of lines of JSON, with a content field where you insert your audio file or your audio data. At the bottom there's a curl command — kind of long, but basically all it's doing is posting a content type and posting that JSON to a URL. What you get back is the response on the right, and again, this is the simplest type of response: if all you want is one alternative, with no multiple alternatives and no interim results, you'll get something that looks just like this.
So as I said, there are actually two types of APIs. The other one is the remote procedure call (RPC) API, and what that means is you can do everything by simply calling methods in your favorite language — Java, C++, or any of the ten languages that are supported — so you don't have to worry about the network. It also supports the bidirectional streaming I mentioned, so that as you're talking, you're getting the data back. We support quite a number of languages for the remote procedure calls, and this is actually free and open source, so if a language doesn't appear on the list, you can certainly build from source and use it from any language. It also uses HTTP/2 over a secure connection, which allows very robust bidirectional streaming.
So what I'd like to do is demonstrate this. "Go straight." "Stand still." "Thank you." "Sit." "Go to sleep." Let me start that over — while we're waiting for the bot to reboot, let me show you exactly what we're sending with the RPC calls. We send an initial request. This is similar to the JSON that I showed on the last slide, but here we're actually using protocol buffers to send it, so it's a very compact format — in other words, you're not sending extra bytes over the wire the way you are with JSON. So you send your request, then you capture audio from the microphone, and in this case, on a thread, we're reading a buffer full of audio and sending that buffer of audio. What we're calling here is requestObserver.onNext — code that's automatically supplied in any of the ten languages — so you can keep passing new buffers to it. And finally you've got the response, which you can be running on a different thread, because you're sending audio and receiving data at the same time. Again, there's an onNext callback here that's provided automatically for you and that delivers the data; in this case we're printing out the results of what we're receiving. "Go forward." "Spin left." "Spin right." "Go backwards." "Do a dance." "Go to sleep." "Play dead." I like that one too. So there we are. One thing I wanted to point out is that it's responding very quickly, and that's because it's streaming the speech as I'm speaking: it doesn't capture all the speech, wait, and then send one big chunk of data up. As I'm speaking, the audio is going up bidirectionally, and that's why it can respond so quickly.
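The walkthrough above is the Java gRPC client, where requestObserver.onNext() pushes each audio buffer; in the Python client the same bidirectional stream is idiomatically expressed as a generator. A minimal sketch, assuming the google-cloud-speech package, application default credentials, and an iterator of 16 kHz LINEAR16 audio chunks from the microphone thread:

    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    def requests_from_mic(chunks):
        # Each yielded request plays the role of requestObserver.onNext():
        # one buffer of raw audio per message, sent while results stream back.
        for chunk in chunks:
            yield speech.StreamingRecognizeRequest(audio_content=chunk)

    def recognize(chunks):
        # Responses arrive while audio is still being sent — this is the
        # bidirectional streaming that makes the bot respond so quickly.
        responses = client.streaming_recognize(streaming_config, requests_from_mic(chunks))
        for response in responses:
            for result in response.results:
                print(result.alternatives[0].transcript,
                      "(final)" if result.is_final else "")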
What I'd like to do next is easier to show than to talk about, so let me show it. These are experimental features — what I've shown you so far is what's available right now in the Cloud Speech API; these are experimental features that will be available soon. "What time is it in Tokyo?" "The time in Tokyo, Japan is 9:29 a.m." "How do you say 'when is the next train' in French?" (The answer comes back in French.) "Turn on the table lamp." There we go. "Go to sleep."

As you can see, I demonstrated two different things in that demo. The first is spoken answers: as I ask questions, it provides answers. The second is something we've actually integrated with IFTTT (If This Then That), which is a way you can integrate with all sorts of different devices — in this case, I've integrated with a light module controller — so that you can set up your own triggers. This is exactly what I did: there's a web page I went to, and I typed in what I wanted to say as voice triggers, the various ways I want to say it, and what should happen when I do say it. IFTTT goes out and configures that, and now when I speak that phrase, or one of those phrases, it triggers whatever IFTTT action you'd like.
Back at the Vision bot: let me try it again — I'm not sure if it's working or not. I can show the console... oh, looks like it's working now. Yes, it's moving. It detects that I'm smiling — and if you keep smiling, it rolls toward you. Can you see this? Thank you — so it works, finally. That was our demonstration.

One thing I failed to mention about the IFTTT triggers: you can actually add parameters to those triggers. You could add a number or a string parameter, and you could, for example, tell the robot to turn left a certain number of steps or degrees — so you can make the triggers quite interesting.
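Since the transcript that comes back is plain text, parameterized triggers like that reduce to simple parsing on the bot side. A hypothetical sketch — the phrase grammar and the default of 90 degrees are made up for illustration, not from the talk:

    import re

    # Map a recognized phrase like "turn left 45" onto a bot action.
    PATTERN = re.compile(r"turn (left|right)(?: (\d+))?")

    def handle(transcript):
        m = PATTERN.search(transcript.lower())
        if m:
            direction, degrees = m.group(1), int(m.group(2) or 90)
            print(f"turning {direction} {degrees} degrees")

    handle("Turn left 45")   # -> turning left 45 degrees
    handle("turn right")     # -> turning right 90 degrees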
So, we have a number of resources that you should go and take a look at: the Cloud Vision API and the Cloud Speech API are both ready for you to sign up. We also have a number of other sessions coming up that you might be interested in: code labs that cover machine learning, several machine learning presentations, and cloud office hours throughout the next few days if you want to learn more about integrating with the cloud APIs. And there's the sandbox. Thank you very much.

Comments

@negi2u

You were running short of breath the entire session... Don't worry about things so much, you are doing a wonderful job!

@stevelevine1151

Working hard to replicate this using GCP - any GitHub repos available for the experiment, guys?

@alexisortega8927

Hello, any hint of what voice synthesizer they used? Thanks in advance...

@xorinzor

that must've been embarrassing that both demos didn't work.. next time perhaps have a screen backstage showing a live log feed? might help fix these kinds of things :)

@user-gz2po7dx3k

Jin Yang, you again!

@user-hf2dr7sh4y

what project produced this tree picture?

@thoughtwave5130

What are the pricing plans, if I want to add this to an application pipeline I am building?

@AkshayAradhya

Awkward moment when a Google developer thinks the ultrasonic sensor is the camera

@neeraj26jan

I think it would be better to use OpenCV and TensorFlow. It will give a more accurate and quicker response once you train the model.

@sothirithhem3656

can anyone tell me why I cannot combine the camera Vision API with the Speech API? (I used two different Gmail accounts)

@foxioi

does anyone actually have access to the cloud speech API (outside google)?

@whatcani2

so in the end, you have to pay $ to Google for all the robot's questions.

@alph4966

It's someone from Google Japan!

@jafar1607

awesome idea.. please Google.. better stick with software.. don't do robots.. :)