[MUSIC] MARCO CASALAINA:
Hello everybody, and thank you for
joining us on this, the last day of Build, the last breakout session
of the last day at Build. I guarantee you; I'm going
to make it worth your while. I'm Marco Casalaina, and I am VP of Cognitive
Services at Microsoft. Cognitive Services includes
the Azure OpenAI Service. It also includes vision,
language, decision, content safety, and
responsible AI and Speech. We have all these capabilities
and we're going to be taking a look at a whole
bunch of them today. Today's session is about the new capabilities
in Azure AI, and how you can
use them to bring new types of applications
to your customers. When we think about Azure
AI, what is Azure AI? It's a complete suite
of services that's made to be accessible
both to developers, people without a data science background, and to data scientists, people who want to
be able to build their own machine
learning models. Increasingly, it's repeatable. That means that
nowadays, you don't even necessarily need to
train a model anymore. In the past, with AI, you had to train up all your own models and stuff like that. That's not so true
anymore, and we're going to see it in a minute. Finally, it's made to be responsible right out of
the box, so that your AI behaves responsibly and
doesn't go off the rails in front of your employees
and your customers. Now, this is the layout
of the Azure AI Services. But rather than
talking to this slide, we are going to
leave slide land. Give me just a moment. My
computer locked itself. That's not awesome. There we go. We're going to leave
slide land and we're going to go straight
into the product because this is going to be
much more fun. Here we go. I'm going to start with Speech, and one of the new
capabilities of Speech. Now, I have a custom neural
voice of me in Speech. I'm just going to play
this here and you'll hear. SPEAKER 1:
"The bear said, I am so angry." Suddenly a fairy voice
appeared. "Don't be angry. I am here to help you."
MARCO CASALAINA: That definitely
sounds like me, because it is me, and even my mother
thinks it sounds like me. But that was pretty flat. It really lacked
emotional affect, there's nothing there. We have this new feature now; it's called auto predict. I can auto predict
on this stuff. Now, as you can
see, it's actually added some emotional
content here. I'm going to put it
back at the beginning and let's listen to that again. SPEAKER 1:
"The bear said, I am so angry." Suddenly a fairy voice appe
ared, "Don't be angry. I am here to help you," "Really, you can help me? That would be great," said the bear hopefully. MARCO CASALAINA: One of the
themes of this presentation is that these models, these AI capabilities,
are going multimodal. ChatGPT has kind of
conditioned everybody to believe that it's just a
text in, text out interface. But already, that's not true. Already, we have
capabilities like Speech, and some of
the vision things that I'll show you later on, that
will allow you to handle all kinds of different content,
not just text. This is just one of the capabilities that
we have in Speech. There's lots more, but I don't
have time for all of that right now, because we got
lots of Azure AI to cover.
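If you want to try that kind of expressive synthesis yourself, here is a minimal sketch using the Speech SDK for Python. The key, region, and voice are placeholders; a custom neural voice like the one in the demo would substitute its own voice name and deployment endpoint ID, and the express-as style shown here assumes a voice that supports speaking styles.

```python
# Minimal sketch: synthesize speech with an expressive style via SSML.
# Key, region, and voice name are placeholders; a custom neural voice would
# also set speech_config.endpoint_id to its own deployment ID.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      Don't be angry. I am here to help you.
    </mstts:express-as>
  </voice>
</speak>
"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished.")
```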
The next thing we're going to take a look at is language. Before I did this session, I did a dry run of it. I took the transcript
of this dry run and I put it into this new capability, which is the summarizer API. In language, we are releasing our new document and
conversation
summarizers. Here we have the transcript. This is from the file
that I loaded in, and I set it to do the Chapter
title and the narrative. Basically, it's chapterising, it's breaking the meeting
up into different segments. Then, it writes a little
narrative about that. Here we see how it broke
it up, right on the screen. In fact, we are in
step 2 right now, summarizing a 30-minute
meeting with language, which gets me to this. Why would we do this? Why would we make a
summarization API when, as you probably all know, you can totally summarize
stuff with GPT, with the Azure OpenAI Service. Well, there's a really
good reason for that. Well, there's two, actually. One is that this summarization API does what it says. It
says what it does. You don't have to do
anything with prompts or anything like that
to make it work. You just say, I want to do a conversation
summarization and it does it; it does it in a much more inexpensive
way than the GPT models. But moreover, let's
take a closer look.
It said summarizing a 30-minute
meeting with language. Well, this was actually
about a 45-minute meeting. You'll notice that it has
31,000 characters in it. The token limit of most of the GPT models is about 16,000
characters, give or take. It's actually
measured in tokens, but it's roughly 16,000
Latin characters. That means that I
couldn't really feed this whole conversation into GPT and expect it to be
able to summarize that, or I could with GPT-4 32K, which is the top-end model. Now, that's
an expensive
way to do it. This thing has a much
larger character limit. You can feed meetings into here, or conversations in
here, that are as long as like an
hour and a half or two hours and it'll still work. There is a place here for what we call task-specific models. That's what we have here.
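As a rough sketch of what calling the conversation summarizer looks like from Python (the endpoint, key, and exact payload keys are assumptions based on the preview documentation and may differ in your API version):

```python
# Sketch: chapter titles and narrative summaries for a meeting transcript.
# Endpoint, key, and the task payload shape are assumptions from the preview docs.
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

client = ConversationAnalysisClient(
    "https://<your-language-resource>.cognitiveservices.azure.com/",
    AzureKeyCredential("<language-key>"),
)

task = {
    "displayName": "Meeting summarization",
    "analysisInput": {
        "conversations": [{
            "id": "dry_run",
            "language": "en",
            "modality": "text",
            "conversationItems": [
                {"id": "1", "participantId": "speaker_1", "text": "Welcome everyone..."},
                # ...the rest of the transcript turns...
            ],
        }]
    },
    "tasks": [{
        "taskName": "chapters",
        "kind": "ConversationalSummarizationTask",
        "parameters": {"summaryAspects": ["chapterTitle", "narrative"]},
    }],
}

poller = client.begin_conversation_analysis(task=task)
result = poller.result()
for item in result["tasks"]["items"]:
    for conversation in item["results"]["conversations"]:
        for summary in conversation["summaries"]:
            print(summary["aspect"], "->", summary["text"])
```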
Before I move on, because it's about to get really interesting, I would note that there is this QR code over here. I think many of you are
familiar with them now. As I go through
this presentation, there will be time at
the end for questions. If you'd like, you
can feel free to use that QR code to ask questions, and I will take them
at the end of this. Next, we're going to get
to Form Recognizer. Form Recognizer, if
you've never seen it before, Form Recognizer, well, it also does what it says and says what it does. Here is a form, and
it's recognized it. This is a form of a
bank application. Into this application,
I have hand written some of my
own information. Some of this is not real. My mother's maiden
name is not "Jones." But what Form Recognizer does, it's not just OCR, it reads the document, and then it divides it
up into key-value pairs. That's what it's done here. You'll notice when I
mouse over this that it's discovered that this is
a key called "Surname." It's found correctly my
handwritten name here, Casalaina; that is,
indeed, my last name.
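For reference, a minimal sketch of pulling those key-value pairs out with the Form Recognizer SDK for Python; the endpoint, key, and file name are placeholders:

```python
# Sketch: extract key-value pairs (e.g., "Surname" -> "Casalaina") from a form.
# Endpoint, key, and file name are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient(
    "https://<your-form-recognizer>.cognitiveservices.azure.com/",
    AzureKeyCredential("<key>"),
)

with open("bank_application.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-document", document=f)
result = poller.result()

for pair in result.key_value_pairs:
    key = pair.key.content if pair.key else ""
    value = pair.value.content if pair.value else ""
    print(f"{key!r} -> {value!r} (confidence {pair.confidence:.2f})")
```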
Now, one of the challenges that our customers used to have with Form Recognizer
that we have just recently fixed
is, consider this,
it says "Surname." There are lots of different ways that you can say last name, you can say surname,
you can say last name, you can say family name, there's all of different
ways that you can say that. For our customers
that were using Form Recognizer on a
diversity of forms, a lot of times they said
things in different ways. That made it difficult
to map that to whatever workflow or database they were trying
to send it into. Now, you'll notice that
there's a new key here. It's called "CommonName." What we'll do here is we will attempt to map this
to a common name. Whether it says surname, or
family name, or last name, we will emit this extra piece of metadata that says LastName. You can take that
and map that in without having to do
that mapping manually. That's Form Recognizer
as it is today. It recognizes forms. What if the content
is not a form? What if it's just free text, a contract, something like that. You want to extract
information from that. Well, something's
coming for that. This here, let's see
if I can get this to scroll down a little
bit. There we go. This is one of my own
documents, actually. This is, I was buying
a storage unit. I live in San Francisco, California; I was buying
a storage unit in San Francisco last year. This is the grant
deed from that. Now, this grant deed is
clearly not a form, so it's not arranged nicely like that other form in
key value pairs. The information is
in the prose here. We have this new capability
called "query fields." Really,
what's happening here
is that in the background, Azure Form Recognizer is
orchestrated to Azure OpenAI. I have defined four
fields here that I want Azure OpenAI to pull
out from this document. The buyer, the seller, the transfer tax, and the city. When I ran this analysis, as I did just before
I came up here, you'll notice along
the side here, it did indeed pull
this information out. The buyer is myself and
my wife, Karen Bird, the city, San
Francisco, the seller, SLATS Investors Three; that really was the company I bought it from;
I didn't make that up. And the transfer
tax 60 bucks. All of that is embedded
here, in this paragraph. None of that is in
key-value form. Yet, I can use this
capability to derive that structure from
this unstructured data and drive downstream workflow.
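Query fields can be requested in a similar call. A sketch is below, but treat it as an approximation: the feature flag and keyword argument names have shifted across the preview SDK versions, so check the current Form Recognizer / Document Intelligence documentation for the exact spelling.

```python
# Sketch only: ask for ad-hoc "query fields" on an unstructured document.
# The feature flag and keyword names below reflect a preview SDK and may differ.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import AnalysisFeature, DocumentAnalysisClient

client = DocumentAnalysisClient(
    "https://<your-form-recognizer>.cognitiveservices.azure.com/",
    AzureKeyCredential("<key>"),
)

with open("grant_deed.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-document",
        document=f,
        features=[AnalysisFeature.QUERY_FIELDS_PREMIUM],  # preview name; may differ
        query_fields=["Buyer", "Seller", "TransferTax", "City"],
    )
result = poller.result()

for name in ["Buyer", "Seller", "TransferTax", "City"]:
    field = result.documents[0].fields.get(name)
    print(name, "->", field.content if field else None)
```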
Orchestration: that's the other theme of this session. The first theme is multimodality, the second theme is orchestration. From this point forward, we're going to be talking
about orchestration, because the
next generation of AI applications will
all be orchestrated, it won't just be a single model. Consider Bing, for example, I'm sure all of you
have used ChatGPT. If I press this button in Bing, it's going to do something
different than ChatGPT does, and you can see it already. Actually, Bing is nice about
this, because it tells you exactly what it's
orchestrating as it does it. It's going to go
give me a response and I'll just let
it keep doing that, but what I'm trying
to show here is this. If I
were to ask this
of just raw ChatGPT, not orchestrated to anything, the model by itself
would give a response. It would spin a tale
about orchestration and AI, but it wouldn't do this. It wouldn't do a search. Bing is orchestrated and it's
orchestrated in such a way, actually, this is a three-step orchestration. I ask this question. The first step is it goes to GPT once and it says, hey GPT, I'm getting
this question; what should I search? GPT comes back and says, you should
search
orchestration and AI. The second step
is that Bing runs this search in its
own search index. Then the third step is, now,
it's got these results from that search index from all of these
different websites, Medium and eWeek, and Databricks, and all
these other things. It takes chunks of
all these webpages, puts them all together, sends them back to GPT and says, GPT make me a response, and the response is
what we see here.
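That three-step pattern is easy to sketch in plain Python against an Azure OpenAI deployment. Everything below is illustrative: the deployment name is hypothetical, and web_search is a stand-in for whatever search index you call in step two.

```python
# Illustrative three-step orchestration: ask GPT what to search, run the search,
# then hand the retrieved chunks back to GPT to compose the answer.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com/",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)
DEPLOYMENT = "gpt-35-turbo"  # hypothetical deployment name


def web_search(query: str) -> list[str]:
    """Stand-in for the search index call (Bing, Cognitive Search, etc.)."""
    raise NotImplementedError


def answer(question: str) -> str:
    # Step 1: ask the model what to search for.
    query = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a short search query."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # Step 2: run the search and collect chunks of the results.
    chunks = web_search(query)

    # Step 3: send the chunks back and ask for a grounded response.
    sources = "\n\n".join(chunks)
    return client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[
            {"role": "system", "content": "Answer using only the provided sources."},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content
```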
Now, we're going to peek behind the covers of this and how this actually works. Hopefully, by this point
in Build, you've all had a chance to have a look at the Azure AI Studio and the playgrounds
that we have in there, but if you haven't, here we are. This is the chat playground, so I'm doing a
conversational interface here, and one really important part of this is the system message. The system message,
this is where you set the tone of your
conversational AI, this is where you set
the domain restriction: I only want it to talk about certain topics. This is where you set the rules: don't make jokes
about Microsoft. This is where you set also
formatting instructions, which is why, for example, Bing just bolded
some of those things, some of those topics,
when it came up. I have done none of that here, so my system message here is the bare, raw default
system message. That means that right now, this ChatGPT model will
talk about anything. I'll say, tell me about Santorini, and in a
second it's going to spin me a tale about the
Greek island of Santorini. If you're a business or a
government, you probably don't want your
conversational AI to just talk about whatever; that's probably not
such a good idea. Let's say that I am building a conversational
AI for my healthcare plan; here, I have a different
system message. The system message
is effectively the program for
these GPT models. And so, here, my system message
says something different. It says, "You are an AI assistant that helps people find information about
their healthcare plan... Your responses will be in clear and uncomplicated
language," and so on. These are the instructions
that I'm giving it, which means if I go down
here and I say again, "Tell me about Santorini," we
will get a different result. It tells me a little bit about Santorini, but then it's like, but if you have any
questions about healthcare plans,
I'll talk about that. Unlike the last one, it didn't spin a whole big tale
about, it's like, I just want to talk
about healthcare plans.
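In code, the system message is simply the first message in the conversation. A minimal sketch against an Azure OpenAI deployment (the deployment name, endpoint, and key are placeholders):

```python
# Sketch: the system message steers tone, domain, and formatting.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com/",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

system_message = (
    "You are an AI assistant that helps people find information about their "
    "healthcare plan. Only answer questions about the plan. Your responses "
    "will be in clear and uncomplicated language."
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # hypothetical deployment name
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Tell me about Santorini"},
    ],
)
print(response.choices[0].message.content)
```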
Now, let's say that I do ask it a question
about a healthcare plan. "Does my plan cover new
glasses?" Let's say. Now, this is within the rules,
and it does, in fact, give me a pretty decent response,
but not a wonderful one. It says, "To determine
if your plan covers new glasses, we'd
need to know which plan you have," and
all this stuff. It's talking in generalities. Premera, actually, is our US healthcare
plan for Microsoft, but it doesn't know
that right now. Right now, I am talking to the raw model, and
the raw model has no idea which healthcare plan I'm talking
about from Premera. It knows I'm talking about
Premera, but Premera has thousands of healthcare plans. It's willing to talk, but
only in generalities. If I wanted to be
more specific, then I can ground it to my own data, which we will do right now. Now we have this
button up at the top here, called "Add your data," and I'm going to add
a data source here. Now, there's all kinds
of different data sources I can add, even now, and we'll
add more soon, but we have a Cognitive Search, previously set up, into
which we loaded all of our Microsoft US
health plan documents. I've set up this
Cognitive Search and now I just got to give it a little bit of
information because, what this is going to do, it's going to render
citations, just like Bing did. I need to let it know
which metadata to use to render these
citations, like so. Now, I'm going to let it
use semantic search; I would note that we are
also releasing here at Build, although
it's difficult to demonstrate, Vector search. Vector search is a different
type of search that effectively vectorizes
your query. What that means is that, you can give a question in a totally different language or using totally
different words, and the vector representations
of that is the same, and then it will allow it
to find that more reliably. That's a
behind-the-scenes thing, but that's all I need to do, so I've linked this now
to my Cognitive Search. Now, let's try that
same question again. Does my
plan cover new glasses? Now, we're getting a
different response. Yes, your plan covers
new glasses and it's telling me that
they're covered up to 100 percent with the
maximum benefit. This is grounded to my data. Now, let me take a moment to talk a little bit
about how this works, because this is a common
source of questions. First of all, there's not just one, like ChatGPT;
there are many, and this one is mine. You can make your own
instance of a GPT model, and that could be
the ChatGPT model,
it could be GPT-4; you can make your own instance
in Azure OpenAI, and the data that
goes in and out of that instance is yours. Microsoft can't see it, OpenAI can't see it. Nobody can use it to train any more models, or
anything like that. That data is yours. Furthermore, what
I just did here, I didn't do anything
to the model. This is the same
model. This is, in fact, the ChatGPT model. If I scroll down
here we'll see it, GPT Turbo, which is
the ChatGPT model. This is the same model, I didn't
do anything to it, I didn't train it, I
didn't fine-tune it. What I am doing is I am injecting this
content at run-time, when I make this query, it does a search and it injects that right into the prompt. I don't see it, but
it's happening. That's really what it's
doing, just like Bing was. That data is ephemeral. The GPT models by themselves, the Azure OpenAI Service, doesn't store this data, so if you're concerned
about HIPAA or GDPR, these various regulations
around the storage of data. This
doesn't store the data; once this conversation ends, this data is gone,
unless you as the user, unless you choose to store
it, we don't store it. This data will be gone. Now, that's one way that you can ground this to your data and that's
the super easy button for making that happen.
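Programmatically, the same "Add your data" grounding is expressed as an extra data_sources section on the chat completion request. The sketch below is an approximation: the payload keys have changed across preview API versions (earlier previews used dataSources and AzureCognitiveSearch), so check the current API reference for your version.

```python
# Approximate sketch: ground a chat completion on an Azure Cognitive Search index.
# Payload keys varied across preview API versions; endpoints, keys, and index
# names are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com/",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # hypothetical deployment name
    messages=[{"role": "user", "content": "Does my plan cover new glasses?"}],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://<your-search>.search.windows.net",
                "index_name": "health-plan-docs",
                "authentication": {"type": "api_key", "key": "<search-key>"},
                "query_type": "semantic",
            },
        }]
    },
)
print(response.choices[0].message.content)
```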
Now, let's say that you want to be able to integrate to other data sources or even other things altogether,
like systems of action. Well, we're also
announcing here at Build the ability to use
ChatGPT plugins. This
is an API that OpenAI
introduced in March, and we are supporting the very same API now
in Azure OpenAI. I'm going to ask a question here that I know that a GPT
model by itself could not possibly answer,
because the GPT models by themselves have no access
to real-time information. It can't possibly
know if there's an Audi Q7 available for
sale in Seattle right now. But it can now, because I've
enabled the Bing search plugin. We have this Bing
plugin to ChatGPT. What the model does
is, it can actually decide when to call the plugin. I don't have to
orchestrate this, per se. I don't have to define
the orchestration; I just declare the plugin. I say I would like you to be
able to use this Bing plugin, and when it sees a
question that it knows it can't
answer by itself-- I'm anthropomorphizing
a little bit, it knows-- then it will go and
hit the plugin. That's exactly what
happened here. It went, and it went to Bing, and you can see it used
the Bing Search plugin, and it found these various
different Audi Q7s that are available from all
of these different vendors. That's how these plugins work. Now, also here at Build, we're releasing a new
means of creating your own orchestrations, because you can use the grounding
thing and as I said, it's the easy button to
ground this to your own data. You can use these plugins. In that case, the model
decides when it wants to use an external system. But in many cases, you may
want to have that control. You may want to decide how
this orchestration works. For that, we have prompt flow. Prompt flow is
actually two things. It's not just one thing. What I'm showing, this
particular prompt flow is actually the analog of the query grounded to Cognitive
Search that I did earlier. What's happening
here is, this is actually the individual steps, like what I described with Bing. We take the input, we embed the question that
you asked into a vector. This one is using
Vector search. We then go search that
in Cognitive Search, we take the output of Cognitive Search, and we put it together in such a way that we can make a
prompt out of it. We actually make the
prompt out of it, like so. Finally, the last step is
that we feed it to a model. One of the things that you
can do with prompt flow is you can build your
own orchestration, just like I did here. You can decide, and you can put, by the way, not just
search in here, you can put other AI
services altogether in here, you can put speech in here. You could put vision in here, or tran
slation, for example, or something else entirely. Something that has
nothing to do with AI. You can use this to do
your orchestration itself, but it has another and
very important capability, and that is testing
and evaluations. As I said, I am
generating a prompt. We approach these
large language models using natural language, but that means that
your prompt matters. What you actually put
in this English text here can change the output. One of the things you can
do with prompt flow is you can take each individual block, I'm taking this
prompt block here, and I'm going to show some
variants of this block. This is my variant_0,
my initial variant. This is the prompt that we have running in the
orchestration right now. But down here, I have a second one that's
written differently. This one is more
specific to healthcare. The question is, will this one work better
than my variant_0? To test that, what I can do, and I'm not going to do this
in this demo right now, but I can test that by pressing this bulk test button, I can give it a whole
bunch of sample prompts of a whole bunch of
questions that people are asking about healthcare. It will automatically try
both variant_0 and variant_1. Based on a metric that I choose, it will do an
evaluation and give me an idea of which one
is performing better. Those metrics can be
groundedness, like is the second prompt
giving me something that's better
grounded to my data? It can be relevance; is the second prompt
giving me something more relevant to what I'm asking? There's all manner of
evaluation metrics that you can use on the
backend of the bulk test, that sadly, I don't have
time to demo here today. But stay tuned, because
we'll be putting up more in-depth content on
prompt flow soon.
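To make the idea concrete, here is a plain-Python sketch of that kind of bulk test, not the prompt flow tooling itself: two prompt variants run over a batch of sample questions and scored with a toy groundedness metric (prompt flow's built-in evaluators are far more sophisticated). The deployment name and prompt templates are illustrative.

```python
# Plain-Python sketch of a bulk test over two prompt variants; not prompt flow itself.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com/",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)
DEPLOYMENT = "gpt-35-turbo"  # hypothetical deployment name

VARIANTS = {
    "variant_0": "Answer the question using the context below.\n\n"
                 "Context:\n{context}\n\nQuestion: {question}",
    "variant_1": "You are a healthcare-plan assistant. Using only the plan documents "
                 "below, answer the member's question.\n\n"
                 "Plan documents:\n{context}\n\nQuestion: {question}",
}


def run(template: str, question: str, context: str) -> str:
    prompt = template.format(context=context, question=question)
    return client.chat.completions.create(
        model=DEPLOYMENT, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content


def groundedness(answer: str, context: str) -> float:
    """Toy metric: fraction of answer words that also appear in the context."""
    context_words = set(context.lower().split())
    words = answer.lower().split()
    return sum(w in context_words for w in words) / max(len(words), 1)


def bulk_test(samples: list[dict]) -> dict:
    """samples: [{"question": ..., "context": ...}, ...]; returns average score per variant."""
    scores = {name: [] for name in VARIANTS}
    for sample in samples:
        for name, template in VARIANTS.items():
            answer = run(template, sample["question"], sample["context"])
            scores[name].append(groundedness(answer, sample["context"]))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```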
Now, it's not just the prompt though, because the other thing
that's very common here is that people want to test not just different
prompts, but different models. What if you're currently using the GPT Turbo model and you
want to try it with GPT-4? How do you know that that works? Well, the same deal applies. Basically, what I can
do here is I can click on my show
variants button here. Here's my variant_0
with text-davinci-002; that's good
old-fashioned GPT-3.5, and let's say I can try
it also with variant_1, text-davinci-003, that's
GPT-3.5.1 in this case. Once again, I press
that bulk test button. I have 1,000 different
prompts that it's going to try, and it will test these models
side-by-side to determine which one of these models
is giving
the better result. Does upgrading to the new
model give me a better result? Does it produce more latency? Does it produce
different artifacts? Does it hallucinate less? All of those kinds of things. Prompt flow is a
very important tool, both for orchestrating
your AI services, but also to test them and
evaluate them at scale. Now, once you've got this
thing up and running, the next thing you're going
to want to think about is, how do I ensure that this
is behaving responsibly? To that
end, we have our new
Azure Content Safety System. So, in our Azure
Content Safety system, we have these sliders here. I could actually choose
different levels of content that are permissible
or impermissible. Now, in this case, I have it set down the middle for all of them, for violence, self-harm,
sexual, and hate content. I have it set down the middle. I'm going to put in
something here that is not such a bad thing to say, "I need an ax to cut a tree." This is not very violent,
unless you're the tree. This is considered safe on all axes, and so I get
the green checks. That means that this will pass. Now, let's bring our bear back
from our story of earlier, "I need an ax to cut a bear," instead. Now, we're getting
a little violent. Now, it's gone up
to medium, and I get the red blocker symbol. This would be blocked,
according to my settings. Now, why would you want
to move the slider bars? Let's say that you are building a conversational AI for
a police department, for people to actually describe crimes. In that case, there
might actually be permissible violent
content in there. You want to take
that violent content, in this case, because it's
talking about crimes. You might want to
turn these sliders all the way up to accept them. If, on the other
hand, you're writing an application like
GitHub Copilot, all the way down. GitHub Copilot should be talking about none of this stuff. You don't want any of
these things in your code. In that type of a situation, and we do use this
same
system by the way. This is the system
that's behind Bing, behind our copilots,
and all those things. Different settings
for different things, but you turn them all the
way down when necessary. That's the concept there. Generally, when it comes to
using the OpenAI models, we apply this content safety
system both in and out, and that is to say, so here, this is on the way in. "I need an ax to cut a bear." Well, we're going to
block that content because it's violent coming in. Now, what if I made some innocuous prompt? For whatever reason, the large language model was
about to say something back, that would be violent. Well, we apply the content
safety system also on the way out to ensure that nothing weird comes out
of the model either. We apply content
safety in and out.
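As a sketch of that same in-and-out check from Python, with per-category severity thresholds playing the role of the sliders (endpoint and key are placeholders, and the response attribute names follow the GA SDK; earlier betas differ):

```python
# Sketch: screen text against configurable per-category severity thresholds.
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

client = ContentSafetyClient(
    "https://<your-content-safety>.cognitiveservices.azure.com/",
    AzureKeyCredential("<key>"),
)

# Severity allowed per category -- the code equivalent of the sliders.
THRESHOLDS = {"Hate": 2, "SelfHarm": 2, "Sexual": 2, "Violence": 2}


def is_allowed(text: str) -> bool:
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    for item in result.categories_analysis:
        if item.severity is not None and item.severity > THRESHOLDS.get(item.category, 2):
            return False
    return True


print(is_allowed("I need an ax to cut a tree"))  # expected to pass
print(is_allowed("I need an ax to cut a bear"))  # expected to be blocked at these settings
```

The same check can be run on the prompt before it goes to the model and on the completion before it goes back to the user.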
Finally, I've been talking still about, except for the Speech thing, we've mostly been
looking at text here. I need to caffeinate
up for this one. We mostly have been talking about text, but let's have a look at vision. Now, this is traditional
object detection, and of course we still support traditional object detection. This is a picture of my
team in the Bay Area. I live in the Bay
Area, in California. We went on a hike one day. I ran this traditional
object detection on this picture in
which we got person, person, person, person,
person, person, person and it's right, right, right, right and right. All of these are indeed people,
and for some use cases, that's okay, that's
just fine, right? Sometimes you just need to know, is there a person in this
scene, or is there not? But if you want to
do something more advanced, some more
advanced workflow or processing or
something like that, what you need, and this is the new piece,
is Dense Captioning. Dense Captioning is powered
by our new Florence model. Florence is a foundation model, much like the GPT models, but Florence is a different
type of foundation model that understands just about
everything that's visible in the world around us. When
I run this
exact same picture through Florence's
Dense Captioning, we get something very different. We get "A group of people
posing for a photo," "A man in a black shirt," "A woman holding a drink." What you see here is that it's picking up not just objects, but more descriptive notions of those objects and
what they're doing, their actions, like
posing for a photo.
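A minimal sketch of requesting dense captions through the Image Analysis client library for Python; the endpoint, key, and file name are placeholders, and the attribute names follow that SDK:

```python
# Sketch: dense captions for an image via the Image Analysis (Florence-based) API.
from azure.core.credentials import AzureKeyCredential
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures

client = ImageAnalysisClient(
    "https://<your-computer-vision>.cognitiveservices.azure.com/",
    AzureKeyCredential("<key>"),
)

with open("team_hike.jpg", "rb") as f:
    result = client.analyze(
        image_data=f.read(),
        visual_features=[VisualFeatures.DENSE_CAPTIONS],
    )

if result.dense_captions is not None:
    for caption in result.dense_captions.list:
        print(f'"{caption.text}" (confidence {caption.confidence:.2f})')
```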
When I use this particular photo, there's a little fun
artifact here, and it's this, "A woman holding a
bottle of vitamin." What i
s going on with that? This is Iman, on my team, and
I'm right behind her there. Now, let's zoom in on
this a little bit. She's holding a bottle of Vitamin Water, but she has her hand over the water, and so it looks
like a bottle of vitamin. In fact, the model is correct, this is a bottle of vitamin and that serves to
show that the model can read. But now let's say that
you're a retailer and it really matters to you. Like in this case,
the model by itself doesn't really know that this is specifically Vitamin Water, it just knows that
this is a bottle of something and that it
says "vitamin" on it. If you're a retailer
or a grocer, you might need to know
whether this thing, I mean, whether this
is Vitamin Water, whether it's Coca-Cola, whether it's Pepsi
or whatever, and so, these models are fine-tunable. You can give it just a few different images of Vitamin Water in
different configurations, maybe somebody with their
hand over the water, and a couple of other
images of Coke and Pepsi,
and pretty
quickly, it will learn that this is not
just a bottle of vitamin, this is Vitamin Water. If this is something that matters to you in your workflow, you can fine-tune this to
adapt to that workflow. Now, it's not just images, we can also run this on video. Here's a video of
some folks working in a warehouse, and I can
summarize this video. Why would I summarize
this video? One reason is for captioning
for people who are visually impaired, and so I can basically summarize what's
happeni
ng in the video. But another, and maybe
more common reason is for search metadata. Let's say that I would like
to be able to search through my video archive for instances of people
carrying a ladder, climbing a ladder, and
doing those kinds of things. Well then, I can use this summarization here, and feed that right into my search, whether that's Cognitive
Search or something else, and that would allow me to snap to this
very video here. Now, with this model, I can actually search in the video itself and I'm going to make a
custom search query. I'm going to say,
"person falling" over here, and I know that somewhere
in this video there is, this guy is going down. It snaps me right to the point that this person has fallen. Now, when I do this, it often looks like I'm
doing a super fancy, like, ooh, that's a cute demo, that's really nice
and all that stuff. Just to prove that this
is not just a demo, I got this other
video over here. I took it right outside over there, with Dayana, who's
sitting right over there, and Ikenna, I
don't know if he's in here right now, in which they were throwing some stuff at each other, and I'm going to
do the same thing here. I'm going to say,
"person throwing a box," and here we have Dayana
throwing the box, right here. But what was really crazy
is that Ikenna had, for some reason, in his pocket, a banana, and so if we just scroll this back
and there he is, throwing the banana,
but watch this. He pulls it out of his pocket
and throws it at Dayana. That was right there,
like an hour ago. My point is, this stuff is real. This stuff is here today
and you can use it today. When you think about
how this might be orchestrated, to finish this up, how this might be
orchestrated to applications that you might build
in the real world. I mean, it's not just for
people throwing bananas. If we take a look at the
next generation of Bing, if those of you who were
paying attention, you might have
noticed, Bing made an announcement a
couple of weeks ago
that they're going multimodal. The next generation of Bing, I gave a picture
here of an Audi Q7. I say, "Where can I buy
one of these in Seattle?" I don't say it's a Q7. The next generation of Bing is orchestrated in such
a way that it does just this type of image analysis up front, and like I said, Bing is nice to us,
it tells us what it's doing as it does it, it analyzed the image
and it's figured out that it's an Audi Q7. Now it's searching for Audi Q7 dealers in
Seattle, and in a moment, as
you might expect,
there we go, it's going to give me all of
these results about where I can buy an Audi Q7 in Seattle. These are the types of orchestrations that you can
build for your employees, for your customers,
for your businesses, and for your governments. This stuff is here now. For the folks in the back,
if you would please switch me back to slide land. We're just going to
wrap this up and then we're going to
get to some questions. We have thousands of
customers who are now using Azure
AI and the Azure
OpenAI Service. Thermo Fisher
Scientific, for example, is using Copy.ai, and
Copy.ai, in turn, is using the Azure
OpenAI Service to generate content about Thermo Fisher
Scientific instruments, and materials for their
manuals, and stuff like that; eBay is using the
Azure OpenAI Service to generate this thing
called a magical listing. As you probably know, eBay is like this online auction site and you can list things for sale on there and now, the magical listing uses the
Azure OpenAI
Service to generate a complete listing for you in eBay, rather than you
having to type it all out. These kinds of applications, you're going to
see them more and more in our Copilots and third-party tools,
they're going to be everywhere you look and many of them are indeed based on
Azure OpenAI Service. One customer that's
using content safety, by the way, is Koo. Koo is like Twitter, but it's big in India, it's
starting to get big in Brazil as well, and Koo is using our
content safety system, the very one that you
just saw, to ensure that the content coming in and out of Koo is safe, non-violent, not including self-harm or abusive, and that kind of
stuff, and that works very well for that
company and it can work well for you also. As I mentioned, I
couldn't really demo it here because there's not
a great way to show it, Vector search is coming to Cognitive Search, and this is
going to change the game. This is a key piece of how grounded large language
models are going to
work. Across the board, what
we covered today, we looked at some stuff
from Azure Speech, we looked at summarization
in language. We looked at both
the current and the new capabilities of
Azure Form Recognizer. We grounded Azure
OpenAI on my data, in this case, with the
Azure Data button. We added a plugin
to Azure OpenAI. We looked at orchestration, evaluation, and testing
with prompt flow, content safety, and Vector
Search and Cognitive Search. These are some of
the new features, not even all
of them,
that we've added to Azure AI just in the
last few months. Stay tuned, because
there's more coming. We'll see you again at Ignite, and I'm looking forward to that. What comes next on
learn.Microsoft.com, you can start your
certification journey and we have all
kinds of content there. One thing, in particular,
that I would recommend, by the way, we just published
on learn.Microsoft.com, our new prompt
engineering guide, using all the best practices
we've learned from Bing and from the Copilot and stuff like that, how to write effective prompts. That's something you're
going to want to search on learn.Microsoft.com.
You can join the AI tech community to
stay connected, both to us and to others of you
that are working with AI. I strongly encourage you to explore all of the
capabilities across Azure AI. There is much more than
I was able to show here today, and the more you mess
around, the more you find out, so check
this stuff out, get into Speech,
get into language, get into Azure OpenAI, and Vision. Try these things for yourself. They are amazing. With that, I think we
are going to questions. Here, again, is that QR code. Now, on the big screen, we've got about
six-and-a-half minutes. So, we have a question from Gustavo, "Will all these features work in Spanish or other languages?" In general, the answer is yes. The vast majority
of these features work in a great
deal of languages. In fact, what I didn't
show here today is, there is a version of my voice that speaks Chinese. I've always wanted to speak Chinese, but I can't do it, but my voice can on there, but yeah. And another person,
John asks, "What languages are currently
supported in Form Recognizer?" There are 250 languages
supported by Form Recognizer. I forget the exact list, but there is a vast number of languages that Form
Recognizer supports. The full list is available in the Form Recognizer
documentation, and it's a long list. "How does Vector search differ from existing
Cognitive Search? How is
it leveraging innovations in Vision?" Oh,
there's a good question. Existing Cognitive Search, so Cognitive Search has
two modes right now. The first mode is, I'll call it keyword search, where it's literally looking for a word and it can do
things like stemming, e.g., if I searched the
word "running," it'll search for "running,"
and "run," and "ran," variants of that word, but it's very keyword-based. The second mode is
semantic search, where it will search for
things that are like that, but not exactly that. Vectors are a different
way of doing things. When you make a vector, you are actually mapping concepts into this
multidimensional space. One super easy example is
discussion and argument. If I take the two
words, "discussion" and "argument," in one sense
they're the same thing. An argument is a type
of a discussion. In another sense, and in another part of the
vector space, they're pointing in different
directions because an argument is angry and a discussion is not. So, vectorization is the
mapping of concepts to this multidimensional
vector space and you end up with thousands
of numbers, actually, when you've given a sentence,
you get a bunch of numbers. What that means, though, if I say the same sentence
in Spanish and in English, it maps to the same spot
in the vector space. The words don't match at all, but the concepts do, and so
I'm able to find that spot.
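A small sketch of that idea with an Azure OpenAI embeddings deployment (the deployment name is hypothetical): the same question in English and Spanish should land much closer together in the vector space than an unrelated sentence.

```python
# Sketch: cross-lingual similarity with embeddings and cosine distance.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com/",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)


def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # hypothetical deployment name
        input=[text],
    )
    return np.array(response.data[0].embedding)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


english = embed("Does my plan cover new glasses?")
spanish = embed("¿Mi plan cubre lentes nuevos?")
unrelated = embed("The weather in Seattle is cloudy today.")

print(cosine(english, spanish))    # same concept, different languages: high similarity
print(cosine(english, unrelated))  # different concept: lower similarity
```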
The person also asked about Vision, well, we can also vectorize
images in the same way. So, I can take
a picture of a hat and a
description of a hat, and that maps to the same
point in vector space, and that means that I can
do a search, like I did here with an image, and it's
still able to find it. "When Bing summarizes
resulting search documents, is that summarization AI-based?" Also, Bing is using
GPT-4, actually. When it summarizes the
information that it finds, it's not using this summarization API, the one that I demoed, which is ideal for document and conversation summarization; Bing is using the full power
of GPT-4 in that case. Guido asks, "Can we use
our OpenAI plugins also with Azure AI, instead of using Blob storage or
Cognitive Search?" At the moment, the
"Azure Data" button doesn't support plugins,
although of course, yes, you can add your own plugins in there and it will use it. Whatever you have a plugin to, whatever data source
or system of action, you can actually
add that in here and use it with Azure OpenAI. The Azure Data capability, we will continue to add more methodologies of
adding data sources there. Stay tuned for that. But already, it is immensely
powerful, as it is today, and we have
a whole bunch of customers that are using it. One of which, by the
way, I've got to say this, Dynamics Copilot for
customer service is actually using that
exact same thing. In fact, like when it goes and creates an email, for those
of you who have seen it, you can log a case in Dynamics and you can press this
button that says, "Make an email with a
resolution," in which it goes and searches your
knowledge base and writes up the email.
That is exactly Azure OpenAI on your data, in fact, so some of
our Copilots are using the same underlying technology. For Form Recognizer, how do I put a human-in-the-loop when there's low
confidence from the AI? Form Recognizer does have a
human-in-the-loop mode to it, so that you can find in the documentation
for Form Recognizer, but you could do it
with Form Recognizer itself, or you could do it as
part of your own workflow. When you integrate
Form Recognizer into whatever application
you're integrating it to, you probably actually
want to have the human check and say, is this the thing that
you expected it to say? And so, they can actually
correct that if needed. That's just a best practice
whenever you are doing that type of automation,
whenever it's feasible. "Is it possible to
ground Florence with your own data for domain-specific attribution and labeling," is another question. And the answer is yes, you can. You can fine-tune Florence
with your own data. So, if you have your
own labeled data set, as with the Vitamin
Water example, if you've got a
bunch of pictures of Vitamin Water, that says
this is Vitamin Water, not a bottle of
vitamins, then it will actually be
able to learn how to vectorize the Vitamin
Water and how to work with it, in the context of all of
these other orchestrations. Ashley asks, we've
got one minute left, so if you're going to ask the
question, now's the time. "Can all these features
be used on-premise? Can some of them be used on-premise, or do they
have to be in the Cloud?" Well, many of these
features actually can be used in containers, and that
can be used on-premise. So, we have two modes
of containers, disconnected containers,
and connected containers. Connected containers
have the advantage that they phone home and
will get the latest model. So, as we add new languages,
new capabilities, better features, the
connected containers will pick those up. Disconnected containers
don't do that, but they can be
run in things like an air gap environment. Some of these things, like
Azure OpenAI, can only be run in a Cloud because they require a supercomputer to run, which is what Azure is. So, most of you
probably don't have a supercomputer at
home, but I do. All right, I'm going to take one last question. I
got just a few seconds. "How do the latest
announcements fit into Microsoft's Responsible
AI mission?" That is core to all of this. For
Speech, I had to
give my consent for it to mimic my voice. For content safety, that's used across the
board in OpenAI; Responsible AI is
absolutely core to our mission and it's part of everything we do in Azure AI. And with that, I'd like to thank you all, and have a wonderful
rest of Build.