
Yann LeCun | Objective-Driven AI: Towards AI systems that can learn, remember, reason, and plan

Ding Shum Lecture, 3/28/2024
Speaker: Yann LeCun, New York University & Meta
Title: Objective-Driven AI: Towards AI systems that can learn, remember, reason, and plan
Abstract: How could machines learn as efficiently as humans and animals? How could machines learn how the world works and acquire common sense? How could machines learn to reason and plan? Current AI architectures, such as auto-regressive large language models, fall short. I will propose a modular cognitive architecture that may constitute a path towards answering these questions. The centerpiece of the architecture is a predictive world model that allows the system to predict the consequences of its actions and to plan sequences of actions that optimize a set of objectives. The objectives include guardrails that guarantee the system's controllability and safety. The world model employs a Hierarchical Joint Embedding Predictive Architecture (H-JEPA) trained with self-supervised learning. The JEPA learns abstract representations of the percepts that are simultaneously maximally informative and maximally predictable. The corresponding working paper is available here: https://openreview.net/forum?id=BZ5a1r-kVsf

Harvard CMSA


I'm Dan Freed, director of the Center of Mathematical Sciences and Applications here at Harvard. This is a center that was founded ten years ago by S.-T. Yau. It's a mathematics center: we engage in mathematics, and in mathematics in two-way interaction with science. We have quite a crew of postdocs doing research in mathematics, and in mathematics in physics, in economics, in computer science, and in biology. We run programs, workshops, and conferences, and a few times a year we have special lectures, and today is one of them. This is the fifth annual Ding Shum Lecture, and we're very pleased today to have Yann LeCun, who is the Chief AI Scientist at Meta and a professor at New York University, an expert on machine learning in many, many forms. Today he'll talk to us about objective-driven AI.

Thank you very much, and thank you for inviting me and hosting me. It seems to me like I give a talk at Harvard every six months or so, at least for the last few years, but to different crowds: the physics department, the center for mathematics, psychology, everything. I'm going to talk, obviously, about AI, but more about the future than about the present, and a lot of it is going to be proposals rather than results, with some preliminary results on the way. I wrote a paper that I put online about two years ago on what this program is about, and you're basically going to hear a little bit of what we have accomplished in the last two years towards that program. If you're wondering about the picture here on the right, this is my amateurish connection with physics: I do astrophotography, and this was taken from my backyard in New Jersey. It's M51, a beautiful galaxy.

OK: machine learning sucks, at least compared to what we observe in humans and animals. It really isn't that good. Animals and humans can learn new tasks extremely quickly, with very few samples or trials. They understand how the world works, which
is not the case for AI systems today. They can reason and plan, which is not the case for AI systems today. They have common sense, which is not the case for AI systems today. And their behavior is driven by objectives, which is also not the case for most AI systems today. Objectives means you set an objective that you try to accomplish, and you plan a sequence of actions to accomplish that goal; AI systems like LLMs don't do this at all. As for the paradigms of learning: supervised learning has been very popular, and a lot of the success of machine learning, at least until fairly recently, was mostly with supervised learning. Reinforcement learning gave some people a lot of hope, but it turned out to be so inefficient as to be almost impractical in the real world, at least in isolation, unless you rely much more on something called self-supervised learning, which is really what has brought about the big revolution that we've seen in AI over the last few years.

The goal of AI really is to build systems that are as smart as humans, if not more. We have systems that are better than humans at various tasks today; they're just not very general. Hence, people call human-level intelligence "artificial general intelligence", AGI. I hate that term, because human intelligence is actually not general at all; it's very specialized. So talking about general intelligence when we mean human-level intelligence is complete nonsense, but unfortunately the term has stuck. We do need systems that have human-level intelligence, because in the near future, every single one of our interactions with the digital world will be mediated by an AI system. We'll have AI assistants that are with us at all times. I'm actually wearing smart glasses right now; I can take a picture of you guys. I can click a button, or I can say, "Hey Meta, take a picture," and it takes a picture. Or I can ask it a question, and there is an LLM that will answer that question. You're not going to hear it, because it's bone conduction, but it's pretty cool. So pretty soon we'll have those things, and they will be basically the main way we interact with the digital world. Eventually those systems will have displays, which this pair of glasses doesn't have, and we'll use those AI assistants all the time. The way for them to be non-frustrating is for them to be as smart as human assistants. So we need human
level intelligence just for reasons of product design, basically. But of course, there's a more interesting scientific question of what human intelligence really is and how we can reproduce it in machines. It's one of a small number of areas where there are people who want a product and are ready to pay for its development, but where at the same time there is a really great scientific question to work on, and there are not a lot of domains where that's the case. Once we have smart assistants with human-level intelligence, this will amplify humanity's global intelligence, if you want; I'll come back to this later. We're very far from that, unfortunately, despite all the hype you hear, mostly from Silicon Valley. People will tell you AGI is just around the corner; we're not actually that close, and it's because the systems that we have at the moment are extremely limited in some of their capabilities. If we had systems that approached human intelligence, we would have systems that can learn to drive a car in 20 hours of practice, like any 17-year-old, and we don't. We do have self-driving cars, but they are heavily engineered; they cheat by using maps and all kinds of expensive sensors, active sensors, and they certainly use a lot more than 20 hours of training data. So obviously we're missing something big. If we had human-level intelligence, we would have domestic robots that could do simple tasks that a 10-year-old can learn in one shot, like clearing the dinner table and filling up the dishwasher; and unlike 10-year-olds, it wouldn't be difficult to convince them to do it. In fact, it's not even about humans: what a cat can do, no AI system at the moment can do, in terms of planning complex sequences of actions to jump on a piece of furniture or catch a small animal. We're missing something big, and basically what we're missing is systems that are able to learn how the world works, not just from text but also from, let's say, video or other sensory inputs; systems that have internal world models; systems that have memory, that can reason, that can plan hierarchically, like every human and animal. So that's the list of requirements: systems that learn world models from sensory inputs, learning intuitive physics, for example, which babies learn in the first few months of life; systems that have persistent memory, which current systems don't have; systems that can plan actions so as to fulfill an objective; and systems that are controllable and safe, perhaps through the specification of guardrail objectives. This is the idea of objective-driven AI architectures. But before I talk about this, I'm going to lay the groundwork for how we can go about that.

The first thing is that self-supervised learning has taken over the world, and I first need to explain what self-supervised learning is, or perhaps a special case of it. The success of LLMs and all that stuff, and even image recognition these days, and speech recognition, translation, all the cool stuff in AI, is really due to self-supervised learning and generalizations of it. A particular way of doing it is this: you take a piece of data, let's say a text, and you transform it or corrupt it in some way; for a piece of text, that would be replacing some of the words with blank markers, for example,
and then you train some gigantic neural net to predict the words that are missing, basically to reconstruct the original input. This is how an LLM is trained. It's got a particular architecture that only lets the system look at words to the left of the word to be predicted, but that's pretty much what it is. And this is a generative architecture, because it produces parts of the input. There are systems of this type that have been trained to produce images, using other techniques like diffusion models, which I'm not going to go into. I played with one. Meta has one, of course, that you can talk to through WhatsApp and Messenger, and there's a paper that describes the system Meta has built. I typed the prompt here, up there: "a photo of a Harvard mathematician proving the Riemann hypothesis on a blackboard with the help of an intelligent robot," and that's what it produces. I checked the proof; it's not correct. There are symbols here that I have no idea what they are.

OK, so everybody is excited about generative AI, and particularly a type of it called auto-regressive LLMs. They are trained very much like I described, but, as I said, the system can only use words to the left of a particular word to predict it during training. The result is that once the system is trained, you can show it a sequence of words and ask it to produce the next word. Then you can inject that next word into the input; you shift the input by one, so the stuff that was produced by the system now becomes part of the input, and you ask it to produce the second word; shift that in, produce the next word, shift that in, et cetera. That's called auto-regressive prediction. It's not a new concept; it's very old in statistics and signal processing, and in economics, actually. That's the way an LLM works: it's auto-regressive, and it uses its own predictions as inputs. Those things work amazingly well given the conceptual simplicity of how they're trained, which is just predicting missing words. Modern ones are typically trained on a few trillion tokens; this slide is too old now, I should add a zero: it's not one to two trillion, it's more like 20 trillion. A token is a subword unit, really; it's on average three quarters of a word. And there's a bunch of those models that have appeared in the last few years; it's
not just in the last year and a half since ChatGPT came out. That's what made it known to the wider public, but those things have been around for quite a while: things like BlenderBot, Galactica, LLaMA, Llama 2, and Code Llama, which were produced by FAIR; Mistral and Mixtral from a small French company formed by former FAIR people; various others, Gemma more recently from Google; and then the proprietary models: Meta AI, which is built on top of Llama 2, Gemini from Google, ChatGPT, GPT-4, et cetera. Those things make stupid mistakes. They don't really understand logic very well: if you tell them that A is the same thing as B, they don't necessarily know that B is the same as A, for example. They don't really understand transitivity of ordering relationships and things like this. They don't do logic; you have to explicitly teach them to do arithmetic, or have them call tools to do arithmetic. And they don't have any knowledge of the underlying reality. They've only been trained on text; some of them have also been trained on images, but basically by treating images like text, so it's very limited. Still, it's very useful to have those things open-sourced and available to everyone, because everyone can experiment with them and do all kinds of stuff, and there are literally millions of people using Llama as a basic platform.

Self-supervised learning is not just used to produce text, but also to do things like translation. There's a system produced by my colleagues a few months ago called Seamless M4T. It can translate 100 languages into 100 languages, and it can do text to text, text to speech, speech to text, and speech to speech; and for speech to speech, it can actually translate languages that are not written, which is pretty cool. It's also available; you can play with it. It's pretty amazing; that's kind of superhuman, in some ways. There are few humans who can translate 100 languages into 100 languages in any direction. We actually had a previous system that could do 200 languages, but only from text, not from speech.

But there are dire limitations to these systems. The first is that auto-regressive prediction is basically an exponentially divergent process: every time the system produces a word, there is some chance that this word is outside of the set of proper answers, and there's no way to come back and correct mistakes. So the probability that a
sequence of words will be a correct answer to the question decreases exponentially with the length of the answer, which is not a good thing. There are various technical papers on this, not by me, that tend to show it. There is also a lot of criticism of the fact that those LLMs can't really plan. The amount of computation that an LLM devotes to producing a token is fixed: you give it a prompt, it runs through however many layers it has in the architecture, and then it produces the token. So per token, the amount of computation is fixed. The only way to get the system to think more about something is to trick it into producing more tokens, which is a very circuitous way of getting it to do work. And so there's been quite a bit of research on the question of whether those systems are actually capable of planning, and the answer is no, they really can't plan. Whenever they can produce a plan, it's basically because they've been trained on
a very similar situation; they already saw a plan, and they basically regurgitate a very similar plan. But they can't really use tools in new ways. And then there is the last limitation, which is that they're trained on language, and so they only know whatever knowledge is contained in language. This may sound surprising, but most of human knowledge actually has nothing to do with language. So they can be used as writing assistants, giving you ideas if you have white-page anxiety or something like that. They're not good so far at producing factual content and consistent answers, although they're being modified for that. And we are easily fooled into thinking that they are intelligent because they are fluent, but really they're not that smart, and they really don't understand how the world works. So we're still far from human-level AI. As I said, most of human and animal knowledge is certainly non-verbal.

So what are we missing? Again, I'm reusing those examples of learning to drive or learning to clear the dinner table: we're not going to have human-level AI before we have domestic robots that can do those things. This is called Moravec's paradox: there are things that appear complex for humans, like playing chess and Go or planning a complex trajectory, that are fairly simple for computers; but things that we take for granted, that we think don't require intelligence, like what a cat can do, are actually fiendishly complicated. And the reason might be this: the data bandwidth of text is actually very low. A 10-trillion-token dataset is basically the totality of the publicly available text on the internet; that's about 10 to the 13 tokens. A token is typically two bytes (there are about 30,000 possible tokens in a typical language), so that's 2 times
10 to the 13 bytes. For an LLM's training data, it would take a human 170,000 years to read, at eight hours a day and 250 words per minute, or 100,000 years if you read fast and you read 12 hours a day. Now consider a human child. A four-year-old child has been awake 16,000 hours; at least, that's what psychologists tell us. Which, by the way, is only 30 minutes of YouTube uploads. We have two million optic nerve fibers going into our visual cortex, about a million from each eye, and each fiber carries maybe about 10 bytes per second, give or take; this is an upper bound. So the data volume that a four-year-old has seen through vision is probably on the order of 10 to the 15 bytes. That's way more than the totality of all text publicly available on the internet: 50 times more. By the time you're four, you have seen 50 times more data through vision. That tells you a number of things, but the first thing it tells you is that we're never going to get to human-level AI by just training on language. It's just not happening. There's just too much background knowledge about the world that we get from observing the world, and current AI systems don't get it.

So that leads me to this idea of objective-driven AI systems. What is it that makes humans, or animals for that matter, capable of using tools, objects, and situations in new ways, and of inventing new ways of behaving? I wrote a fairly readable, fairly long paper on this; you see the URL here. It's not on arXiv, because it's on this OpenReview site, where you can comment and tell me how wrong this is. The basic architecture is shown here. Every time you see an arrow, that means there are signals going through, but it also means there might be gradients going backwards; I'm assuming everything in there is differentiable. There is a perception module that observes the world and turns it into representations of the world; a memory, which might be persistent memory, factual memory, things like that; a world model, which is really the centerpiece of this system; an actor; and a cost module, the objective functions. The configurator I'm not going to talk about, at least not for now.

Here is how the system works. In a typical episode, the system observes the world and feeds this through the perception system, which produces some idea of the current state of the world, or at least of the part of the world that is currently observable. Maybe it can combine this with the content of a memory that contains the rest of the state of the world, as previously observed. So you get some pretty good idea of what the current state of the world is. Then the role of the world model is to take the current state of the world and a hypothesized sequence of actions, and to produce a prediction of the future state of the world resulting from taking those actions: state of the world at time t, plus a sequence of actions, gives the state of the world at time t-plus-whatever. Now, that predicted state of the world goes into a number of modules whose role is to compute a scalar objective. Each of those red square boxes is basically a scalar-valued function that takes a representation of the state of the world and tells you how far that state is from a particular goal, objective, or target; or it takes a sequence of predicted states and tells you to what extent that sequence of states is dangerous, toxic, whatever. Those are the guardrail objectives. So an episode consists of the following: the way the system operates, the way it produces its output, which is going to be an action sequence, is by optimizing the objectives, whatever comes out of the red
boxes, with respect to the action sequence. So there is going to be an optimization process that searches for an action sequence such that the predicted outcome, the predicted state of the world, satisfies the objectives. This is an intrinsically very different principle from just running through a bunch of layers in a neural net; this is intrinsically more powerful. You can express pretty much any algorithmic problem in terms of an optimization problem,
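To make this concrete, here is a minimal toy sketch of planning as inference-time optimization. Everything in it is an illustrative assumption of mine, not the system described in the talk: the world model is trivial additive dynamics, the task objective is squared distance to a goal, the guardrail is a quadratic penalty on action magnitudes, and the gradients are derived by hand for this toy case.

```python
# Toy sketch: the system's output is an action sequence found by
# minimizing scalar objectives through a (differentiable) world model.

def world_model(state, action):
    """Predict the next state. Toy assumption: s[t+1] = s[t] + a[t]."""
    return state + action

def objective(s0, actions, goal, guard_weight=0.1):
    """Task cost (squared distance to the goal at the end of the rollout)
    plus a toy 'guardrail' cost penalizing large actions."""
    s, guard = s0, 0.0
    for a in actions:
        s = world_model(s, a)
        guard += guard_weight * a * a
    return (s - goal) ** 2 + guard

def plan(s0, goal, horizon=5, steps=200, lr=0.1, guard_weight=0.1):
    """Inference by gradient descent over the action sequence.
    For these additive dynamics the final state is s0 + sum(actions),
    so the gradient of the objective can be written out by hand."""
    actions = [0.0] * horizon
    for _ in range(steps):
        s_final = s0 + sum(actions)          # roll the world model forward
        for t in range(horizon):
            # d/da_t [ (s_final - goal)^2 + guard_weight * a_t^2 ]
            grad = 2.0 * (s_final - goal) + 2.0 * guard_weight * actions[t]
            actions[t] -= lr * grad
    return actions

actions = plan(s0=0.0, goal=1.0)
# The total displacement approaches the goal but is held back slightly
# by the guardrail penalty on action magnitudes.
print(sum(actions))
```

The point is only the shape of the computation: the output is itself the result of an optimization, a search over action sequences through the world model, rather than a single feed-forward pass. A learned, differentiable world model and richer objectives keep exactly the same structure.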
and this is basically an optimization problem. I'm not specifying here exactly what optimization algorithm to use, but if the action-sequence space, the space in which we do this inference, is continuous, we can use gradient-based methods: all of those modules are differentiable, so we can back-propagate gradients backwards through those arrows and update the action sequence so as to minimize the objectives, converging to an optimal action sequence for the objective we're looking for, according to the world model. If the world model is something like a discrete-time differential equation, we might have to run it for multiple steps: the initial world state is fed to the world model together with an initial action, which predicts the next state; from that next state we feed in another action, which produces the state after that; the entire sequence can be fed to the guardrail objectives, and the end result is fed to
the task objective, essentially. Now, this is the ideal situation, where the world model is deterministic, because parts of the world are close to deterministic: there's very little uncertainty about what's going to happen if I perform a sequence of actions to grab this bottle; I'm in control. But most of the world is not completely predictable, so you probably need some sort of latent variable that you feed to your world model to account for all the things you don't know about the world. You have to sample those latent variables from a distribution to make multiple predictions about what might happen in the future, because of the uncertainties in the world.

And really, what you want to do ultimately is not this kind of one-level planning; you want to do hierarchical planning: a system that can produce multiple representations of the state of the world, at multiple levels of abstraction, so that you can make predictions more or less long-term into the future. Here's an example. Let's say I'm sitting in my office at NYU, in New York, and I want to go to Paris. I'm not going to plan my entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. That's impossible: it would be intractable in terms of optimization, obviously, but it's also impossible because I don't know the conditions that will occur. Do I have to avoid a particular obstacle that I haven't seen yet? Is the streetlight
going to be red or green? How long am I going to wait to grab a taxi? So I can't plan everything from the start, but what I can do is high-level planning. At a very abstract level, I know that I need to get to the airport and catch a plane; those are two macro-actions. That determines a subgoal for the lower level: how do I get to the airport? I'm in New York, so I need to go down to the street and hail a taxi. That sets a goal for the level below: how do I get to the street? I have to take the elevator down and then walk out onto the street. How do I get to the elevator? I need to stand up from my chair, open the door of my office, walk to the elevator, push the button. How do I get up from my chair? That I can't describe, because it's muscle control and everything. So you can imagine there is this hierarchical planning process going on. We do this completely effortlessly, absolutely all the time, and animals do this very well; no AI system today is capable of doing it. Some robotic systems do hierarchical planning, but it's hardwired, handcrafted. If you want a walking robot to walk from here to the door, down the stairs, you first do high-level planning of the trajectory: you're not going to walk directly through here, you're going to have to go through the stairs, et cetera; and then at the lower level you're
going to plan the motion of the legs to follow that trajectory. But that's handcrafted; it's not like the system has learned to do this. It was built by hand. So how do we get systems to spontaneously learn the appropriate levels of abstraction to represent action plans? We really don't know how to do this, or at least we don't have any demonstration of a system that does this and actually works.

OK, so the next question is: if we're going to build a system of this type, how are we going to build the world model? Again, a world model is: state of the world at time t, plus an action, gives the predicted state of the world at time t plus one, whatever the unit of time is. And the question is: how do humans or animals do this? Look at the ages at which babies learn basic concepts. I saw this chart from Emmanuel Dupoux, who is a psychologist in Paris. Basic things like object categories are learned pretty early on, without language: babies don't really understand language at the age of four months, but they develop the notion of object categories spontaneously. Then come things like solidity and rigidity of objects, and the difference between animate and inanimate objects; and then intuitive physics pops up around nine months. It takes about nine months for babies to learn that objects that are not supported fall because of gravity, and more concepts in intuitive physics come later. It's not fast; we take a long time to
learn this. Most of this, at least in the first few months of life, is learned mostly by observation, with very little interaction with the world, because a baby until three or four months can't really manipulate anything or affect the world beyond their own limbs. So most of what they learn about the world, they learn by observation, and the question is: what type of learning is taking place when babies do this? This is what we need to reproduce. There is a natural idea, which is to just transpose the idea of self-supervised learning from text and use it for video. Take a video; call it y, the full video. Then corrupt it by masking a piece of it, let's say the second half of the video, and call this masked video x. Then train some gigantic neural net to predict the part of the video that is missing, hoping that if the system can predict what's going to happen in the video, it probably has a good idea of what the underlying nature of the physical world is.
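As a toy illustration of that recipe (entirely my own simplification, not an actual video model): the "video" below is a 1-D ramp with a random slope, the masked second half plays the role of the missing frames, and a single linear map stands in for the gigantic neural net, trained by hand-written SGD on the reconstruction error.

```python
# Toy masked prediction: observe the first half of a "clip" and train a
# linear predictor to reconstruct the masked second half.
import random

random.seed(0)
HALF = 4  # observe 4 "frames", predict the next 4

def make_clip():
    """A trivially predictable 'video': a ramp with a random slope."""
    slope = random.uniform(-1.0, 1.0)
    return [slope * t for t in range(2 * HALF)]

# Linear predictor: predicted[j] = sum_i W[j][i] * observed[i]
W = [[0.0] * HALF for _ in range(HALF)]

def predict(x):
    return [sum(W[j][i] * x[i] for i in range(HALF)) for j in range(HALF)]

lr = 0.01
for _ in range(2000):
    y = make_clip()                    # full clip
    x, target = y[:HALF], y[HALF:]     # corruption: mask the second half
    pred = predict(x)
    for j in range(HALF):              # SGD on squared reconstruction error
        err = pred[j] - target[j]
        for i in range(HALF):
            W[j][i] -= lr * 2.0 * err * x[i]
```

On this deterministic toy, the predictor does recover the continuation. The failure mode described next shows up when several futures are consistent with the same observed frames: the squared-error optimum is then their average, which for video means blur.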
It's a very natural concept; in fact, neuroscientists have been thinking about this kind of thing for a very long time. It's called predictive coding, and this idea that you learn by prediction is really very standard. You do this, and it doesn't work. We've tried; my colleagues and I have been trying to do this for 10 years, and you don't get good representations of the world, and you don't get good predictions. The kind of predictions you get are very blurry, like the video at the top here, where the first four frames of the video are observed and the last two are predicted by a neural net: it predicts very blurry images. The reason is that it can't really predict what's going to happen, so it predicts the average of all the plausible things that might happen, and that's a very blurry video. So it doesn't work.

The solution is to basically abandon the idea of generative models. That might seem shocking, given that this is the most popular thing in machine learning at the moment, but we're going to have to do that. And the solution, that I'm proposing at least, is to replace this with something I call Joint Embedding Predictive Architectures, JEPA. This is what a JEPA is: you take y, and you corrupt it, same story, or you transform it in some way; but instead of reconstructing y from x, you run both x and y through encoders, and what you reconstruct is the representation of y from the representation of x. So you're not trying to predict every pixel; you're only trying to predict a representation of the input, which may not contain all the information about the input, only partial information. That's the difference between those two architectures: on the left, generative architectures that reconstruct y; on the right, joint embedding architectures that embed x and y into a representation space and do the prediction in representation space. There are various flavors of this joint embedding architecture. The one on the left is an old idea called Siamese networks; it goes back to the early '90s, and I worked on it. Then there are deterministic and non-deterministic versions of those JEPA architectures; I'm not going to go into the details. The reason you might need latent variables in the predictor is that the world may be intrinsically unpredictable, or not fully observable, or stochastic, and so you need some way of making multiple predictions for a single
observation. So the z variable here basically parameterizes the set of things you don't know about the world, the things you have not observed in the state of the world, and it thereby parameterizes the set of potential predictions. Now, there's another variable here called a, and that's what turns the joint embedding architecture into a world model. This is a world model: x is an observation; s_x is the representation of that observation; a is an action that you take; and s_y is the prediction of the representation of the state of the world after you've taken the action. The way you train the system is by minimizing the prediction error: y is a future observation of the world (x is the past and the present, y is the future; you just have to wait a little bit before you observe it). You take an action, or you observe someone taking an action; you make a prediction about what the future state of the world is going to be; and then you can compare the actual state of the world that you observe with the predicted state, and train the system to minimize the prediction error. But there's an issue with this, which is that the system can collapse. If you only minimize the prediction error, what it can do is ignore x and y completely and produce an s_x and an s_y that are constant; then the prediction problem becomes trivial. So you cannot train a system of this type by just minimizing the
So you cannot train a system of this type by just minimizing the prediction error; you have to be a little smarter about how you do it. To understand how this works, you basically have to use a concept called energy-based models, which you can think of as a weakened version of probabilistic modeling. For the physicists in the room: the way to go from energies to probabilities is you take the exponential of minus the energy and normalize, but if you manipulate the energy function directly you don't need that normalization; that's the advantage. So what is an energy-based model? It's basically an implicit function F(x, y) that measures the degree of incompatibility between X and Y: whether Y is a good continuation for X in the case of video, whether Y is a good set of missing words for X, things like that. That function takes the two arguments X and Y and gives you a scalar value that indicates to what extent X and Y are compatible or incompatible. It gives you zero, or a small value, if X and Y are compatible, and it gives you a larger value if they're not. So imagine that those two variables are scalars, and the observations, the black dots, are your training data. You want to train this energy function so that it takes low values on and around the training data, and higher values everywhere else; what I've represented here are the contours of equal energy. So how are we going
to do this? Okay, so the energy function is not a function you minimize by training; it's a function you minimize by inference. If I want to find a Y that is compatible with an X, I search over the space of Y for a value that minimizes F(x, y). So the inference process does not consist of running feed-forward through a neural net; it consists of minimizing an energy function with respect to Y, and computationally this is intrinsically more powerful than running through a fixed number of layers in a neural net. That gets around the limitation of auto-regressive LLMs, which spend a fixed amount of computation per token: this way of doing inference can spend an unlimited amount of resources figuring out a good Y that minimizes F(x, y), depending on the nature of F and the nature of Y. If Y is a continuous variable, and your function hopefully is differentiable, you can minimize it using gradient-based methods. If it's discrete, you would have to do some sort of combinatorial search, but that would be way less efficient, so if you can make everything continuous and differentiable you're much better off. And by the way, I forgot to mention something when I talked about world models. This idea that you have a world model that can predict what's going to happen as a consequence of a sequence of actions, and then you have an objective you want to minimize, and you plan a sequence of actions that minimizes the objective, is completely classical optimal control. It's called model predictive control, and it's been around since the early 60s, if not the late 50s. So it's completely standard; the main difference with what we want to do here is that the world model is going to be learned from sensory data, as opposed to a bunch of equations you write down for the dynamics of a rocket or something. Here we're just going to learn it from sensory data.
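Model predictive control as just described can be sketched in a few lines. This is a minimal illustration under my own assumptions (a hand-written one-dimensional dynamics standing in for a learned world model): inference is an optimization over the action sequence, here by finite-difference gradient descent on the predicted cost, not a single forward pass.

```python
def world_model(s, a):
    # Stand-in for a learned predictor: next state from state and action.
    return 0.9 * s + 0.5 * a

def rollout_cost(s0, actions, target):
    """Predicted cost of an action sequence under the world model."""
    s, cost = s0, 0.0
    for a in actions:
        s = world_model(s, a)
        cost += (s - target) ** 2 + 0.01 * a ** 2  # objective + effort penalty
    return cost

def plan(s0, target, horizon=5, steps=300, lr=0.1, eps=1e-5):
    """Minimize the predicted cost with respect to the actions (MPC-style)."""
    actions = [0.0] * horizon
    for _ in range(steps):
        for i in range(horizon):
            up = actions[:]; up[i] += eps
            dn = actions[:]; dn[i] -= eps
            g = (rollout_cost(s0, up, target)
                 - rollout_cost(s0, dn, target)) / (2 * eps)
            actions[i] -= lr * g   # gradient step on one action
    return actions

acts = plan(s0=0.0, target=1.0)
planned = rollout_cost(0.0, acts, 1.0)
do_nothing = rollout_cost(0.0, [0.0] * 5, 1.0)   # cost of the zero-action plan
```

The same loop works unchanged if `world_model` is a trained network, which is the whole point of learning the model from data.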
Okay, so there are really two methods to train those energy functions so that they take the right shape. Now we're going to talk about learning: how do you shape the energy surface so that it gives you low energy on the data points and high energy outside? These are two classes of methods to prevent the collapse I was telling you about. The collapse is the situation where you just minimize the energy for whatever training samples you have, and what you get in the end is an energy function that is zero everywhere; that's not a good model. You want an energy function that takes low values on the data points and higher values outside. So, two methods. Contrastive methods consist of generating those flashing green points, contrastive samples, and pushing their energy up: you back-propagate gradients through the entire system and tweak the parameters so that the output energy goes up for a green point and goes down for a blue point, a data point. But those tend to be inefficient in high dimension, so I'm more in favor of another set of methods, called regularized methods, which basically work by minimizing the volume of space that can take low energy, so that when you push down the energy of a particular region it has to go up in other places, because there is only a limited amount of low-energy stuff to go around. So those are the two classes of methods. I'm going to argue for the regularized methods, but really you should think of these as the two classes of methods for training energy-based models.
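In loss-function terms, the two families can be caricatured like this. This is a schematic of my own, using a margin-style contrastive loss; the actual criteria vary from method to method:

```python
def contrastive_loss(energy, x, y_data, y_neg, margin=1.0):
    """Push energy down on a data pair and up on a contrastive (negative)
    pair, until the negative sits at least `margin` high."""
    return energy(x, y_data) + max(0.0, margin - energy(x, y_neg))

def regularized_loss(energy, x, y_data, volume_term):
    """No negatives: push energy down on the data while a regularizer
    (here an abstract `volume_term`) limits how much of the space can
    take low energy, e.g. by keeping representations informative."""
    return energy(x, y_data) + volume_term

# A toy energy: squared distance between x and y.
e = lambda x, y: (x - y) ** 2
well_separated = contrastive_loss(e, 0.0, 0.0, 2.0)   # negative already high
not_separated = contrastive_loss(e, 0.0, 0.0, 0.5)    # negative too close
```

The contrastive loss needs a supply of negatives, which is what becomes inefficient in high dimension; the regularized loss replaces them with a term that shrinks the low-energy volume.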
And when I say energy-based models, this also applies to probabilistic models, which are essentially a special case of energy-based models. Okay, there's a particular type of energy-based model called latent-variable models. These are models with a latent variable Z that is not given to you during training or at test time, and whose value you have to infer. You can do this by minimizing the energy with respect to Z: if you have an energy function E(x, y, z), you minimize it with respect to Z, you put that Z back into the energy function, and the resulting function does not depend on Z anymore; I call this F(x, y). So having latent-variable models is really a very simple thing in many ways. If you are a Bayesian or a probabilist, instead of inferring a single value for Z you infer a distribution, but I might talk about this a little later. Depending on which architecture you use for your system, it may or may not collapse, and if it can collapse, you have to use one of those objective functions that prevent collapse, either through contrastive training or through regularization. If you're a physicist, you probably already know that it's very easy to turn energies into probability distributions: you compute P(y|x), if you know the energy of X and Y, as the exponential of minus some constant times F(x, y), and then you normalize by the integral, over the whole space of Y, of the numerator. That gives you a normalized distribution over Y, and it's a perfectly fine way of parameterizing a distribution if you really want one. The problem, of course, as in a lot of statistical physics, is that the denominator, called the partition function, is intractable. So here I'm basically circumventing the problem by directly manipulating the energy function and not worrying about the normalization.
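For a discrete toy case, the energies-to-probabilities recipe is just a softmax over negative energies; the denominator is the partition function, which becomes an intractable integral when Y is continuous and high-dimensional. A small sketch of my own:

```python
import math

def gibbs(energies, beta=1.0):
    """p_i = exp(-beta * E_i) / Z, where Z is the partition function."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)   # tractable only because y ranges over a small discrete set
    return [w / Z for w in weights]

# Three candidate y values with energies 0, 1, 2:
p = gibbs([0.0, 1.0, 2.0])
# Lower energy -> higher probability; the distribution sums to one.
```

Working with the energy directly skips computing `Z`, which is exactly the point being made here.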
Basically, this idea of pushing down and pushing up the energy, and of minimizing the volume of stuff that can take low energy, plays the same role that normalization plays in a probabilistic model. I'm not going to go through this, it's an eye chart; you can take a picture if you want. It's basically a list of all kinds of classical methods, classified by whether they are contrastive or regularized; all of them can be interpreted as some sort of energy-based model of one type or the other. And the idea used in LLMs, which is basically a particular version of something called a denoising autoencoder, is a contrastive method. So the way we train LLMs today is contrastive: we take a piece of data, we corrupt it, and we train the system to reconstruct the missing information. That's actually a special case of a denoising autoencoder, a very old idea that's been revived multiple times since. This framework also allows us to interpret a lot of classical models, like K-means and sparse coding, things like that, but I don't want to spend too much time on this. You can do probabilistic inference too, but I want to skip that; it's for things like free energies and variational free energies. But here are the recommendations I'm making: abandon generative models in favor of those joint embedding architectures; abandon probabilistic modeling in favor of energy-based models; abandon contrastive methods in favor of regularized methods, and I'm going to describe one in a minute; and also abandon reinforcement learning, but I've been saying that for 10 years. Those are the four most popular things in machine learning today, which doesn't make me very popular. So how do you train a JEPA with regularized methods? There are a number of different methods; I'm going to describe two classes, one for which we really understand why it works, and another that works really well but we don't understand why. The first class of methods consists of preventing the collapse I was telling you about, where the output of the encoder is constant or carries very little information about the input. What we're going to do is have a criterion during training that tries to maximize the amount of information coming out of the encoders, to prevent this collapse. And the bad news is that to maximize the information content coming out of a neural net, we would need some sort of lower bound on the information content of the output, and then push up
on it. The bad news is that we don't have lower bounds on information content, only upper bounds. So we're going to cross our fingers, take an upper bound on the information content, push it up, and hope that the actual information content follows. It actually works really well, but for that reason it's not well justified theoretically. How do we do this? The first thing we can do is make sure the variables that come out of the encoders are not constant: over a batch of samples, you want each variable of the encoder's output vector to have some nonzero variance, let's say one. So you have a cost function that says: I really want the variance, or the standard deviation, to be larger than one. But the system can still produce a non-informative output by making all the outputs equal, or highly correlated. So you have a second criterion that says: in addition to this, I want the different components of the output vector to be uncorrelated. So
basically, I want a criterion that brings the covariance matrix of the vectors coming out of the encoder as close to the identity matrix as possible. But that's still not enough, because you can get uncorrelated variables that are still very dependent. So there's another trick, which consists of taking the representation vector S_x, running it through a neural net that expands the dimension in a nonlinear way, and then decorrelating those expanded variables; we can show that under certain conditions this has the effect of making pairs of variables independent, not just uncorrelated. There's a paper on this on arXiv. Okay, so now we have a way of training one of those joint embedding architectures while preventing collapse, and it's really a regularized method: we don't need contrastive samples, we don't need to pull things away from each other or anything like that; we just train on training samples with this criterion. Once we've trained that system, we can take the representation learned by the system, S_x, and feed it to a subsequent classifier that we train supervised for a particular task, for example object recognition; we can train a linear classifier or something more sophisticated. I'm not going to bore you with the results, but every row here is a different way of doing self-supervised learning. Some of them are generative, some are joint embedding; they use different types of criteria, different types of distortions and corruptions of the images. The top systems give you on the order of 70% correct on ImageNet when you train only the head on ImageNet: you don't fine-tune the entire network, you just use the features. And what's interesting about self-supervised learning is that those systems work really well and don't require a lot of data to learn a new task.
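The two anti-collapse criteria just described, variance and covariance, can be written down directly. Here is a small plain-Python sketch for two-dimensional embeddings; this is my own simplification of the VICReg terms, and the real method also includes an invariance term and the nonlinear-expansion trick:

```python
def variance_covariance_terms(batch, gamma=1.0):
    """Anti-collapse terms over a batch of 2-D embeddings [(a, b), ...]:
    a hinge pushing each dimension's std above gamma, plus the squared
    off-diagonal covariance pushing the two dimensions to be uncorrelated."""
    n = len(batch)
    mean_a = sum(a for a, _ in batch) / n
    mean_b = sum(b for _, b in batch) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in batch) / n
    var_b = sum((b - mean_b) ** 2 for _, b in batch) / n
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in batch) / n
    variance_term = (max(0.0, gamma - var_a ** 0.5)
                     + max(0.0, gamma - var_b ** 0.5))
    covariance_term = cov_ab ** 2
    return variance_term, covariance_term

# A spread-out, decorrelated batch versus the collapsed constant output:
spread = [(-1.5, 1.5), (1.5, 1.5), (-1.5, -1.5), (1.5, -1.5)]
collapsed = [(0.3, 0.3)] * 4
v_spread, c_spread = variance_covariance_terms(spread)
v_collapsed, _ = variance_covariance_terms(collapsed)
```

The collapsed batch is heavily penalized by the variance hinge, while the informative batch pays nothing on either term.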
So it's really good for things like transfer learning or multitask learning: you learn generic features and then use them as input to subsequent tasks. This method, with variations of the idea, is called VICReg, which stands for variance-invariance-covariance regularization: variance and covariance because of this covariance-matrix criterion, invariance because we want the representations of the corrupted and uncorrupted inputs to be identical. There are versions of this that work for object detection and localization and so on. But here's another set of methods, and for those, I have to admit I don't completely understand why they work. There are people like Yuandong Tian at FAIR and Surya Ganguli at Stanford who claim they understand; they'll have to explain it to me, because I'm not entirely convinced. These are distillation methods. You have two encoders; they have to be more or less identical in architecture, actually exactly identical, and they need to have the same parameters, which are shared between them. There is something called weight EMA; EMA means exponential moving average. The encoder on the right gets weights that are basically a running average, with an exponentially decaying coefficient, of the weight vectors produced by the encoder on the left as learning takes place; so it's a smoothed-out version of the weights. Surya and Yuandong have an explanation for why this prevents the system from collapsing; I encourage you to read the paper, if you can figure it out. And there are a number of different methods that use this kind of self-supervised pre-training and work really well: older methods like BYOL, Bootstrap Your Own Latents, from DeepMind; SimSiam from FAIR; and then DINOv2, a one-year-old method by my colleagues at FAIR in Paris, which is probably the best system that produces generic features for images. If you have a vision problem and need generic features to feed to some classifier that you can train with a small amount of data, use DINOv2 today; that's the best thing we have. It produces really nice features and really good performance with very small amounts of data, for all kinds of things. You can train it to do segmentation, depth estimation, object recognition, to estimate the height of the tree canopy over the entire Earth, to detect tumors in chest X-rays, all kinds of stuff. And it's open source.
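The weight-EMA scheme just mentioned is simple to state in code. This is a generic sketch in my own notation, with the target encoder's weights trailing the online encoder's as a running average; in the actual methods, gradients flow only through the online branch:

```python
def ema_update(target, online, decay=0.99):
    """target <- decay * target + (1 - decay) * online, per weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

# Toy run: pretend training has moved the online weight to 1.0 and holds it
# there; the smoothed target weight converges toward it without any gradients.
target, online = [0.0], [1.0]
for _ in range(1000):
    target = ema_update(target, online)
gap = abs(target[0] - 1.0)   # 0.99**1000, a very small number
```

The target encoder therefore changes slowly and smoothly, which is the ingredient these distillation methods rely on to avoid collapse.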
So a lot of people have been using it for all kinds of things; it's really cool. A particular instantiation of those distillation methods is something called I-JEPA. This is a JEPA architecture that has been trained using this distillation method, but it's different from DINO, and it works extremely well, in fact better than DINO for the same amount of training, and it's very fast to train as well. So this is the best method we have, and it compares very favorably to competing methods that use generative models trained by reconstruction. There's something called MAE, masked autoencoder, which is the hollow squares on this graph. This is a method also developed at Meta, at FAIR, but it works by reconstructing a photo: you take a photo, you mask some parts of it, and you train what amounts to an autoencoder to reconstruct the parts that are missing. It's very difficult to predict what's missing in an image, because you can have complicated textures and things like that, and in fact that system is much more expensive to train and doesn't work as well as the joint embedding methods. So the one lesson from this talk is: generative methods for images are bad. They're good for text, but not good for images, whereas joint embedding methods are good for images and not yet good for text. The reason is that images are high-dimensional and continuous, so generating them is actually hard. It's possible to build image-generation systems that produce nice images, but they don't produce good internal representations of images. On the other hand, generative models for text work, because text is discrete. Language is simple, essentially, because it's discrete. We have this idea that language is the most sophisticated stuff because only humans can do it; in fact it's simple, and the real world is really what's hard. So I-JEPA works really well for all kinds of
tasks, and people have used it for all kinds of things. There's some mathematics to do here which I'm going to have to skip, to talk about V-JEPA. This is a version of I-JEPA, but for video, that was put online fairly recently. The idea is that you take a piece of video, you mask part of it, and again you train one of those joint embedding architectures to predict the representation of the full video from the representation of the partially masked or corrupted video. This works really well, in the sense that when you take the representation learned by that system and feed it to a classifier, to classify the action taking place in the video, you get really good performance, better than any other self-supervised learning technique. With a lot of training data it doesn't work quite as well as purely supervised training with all kinds of tricks and data augmentation, but it comes really close, and it doesn't require labeled data, or not much. So that's a bit of a breakthrough: the fact that we can train a system to learn from video in a self-supervised manner. Because now we might be able to use this to learn world models, where the masking of the video is: take a video, mask the second half of it, and ask the system to predict what's going to happen, feeding it an action that is being taken in the video. If you have that, you have a world model. If you have a world model, you can put it in a planning system. And if you have a system that can plan, then you might have systems that are a lot smarter than current systems, and they might be able to plan actions, not just words. They're not going to predict auto-regressively anymore; they're going to plan their answer, kind of like what we do when we speak. We don't produce one word after another without thinking; we usually plan what we're going to say in advance, at least some of us do. So this works really well, in the sense that we get really good performance on lots of different types of video, for classifying the action and various other tasks, better than basically anything else people have tried before, certainly better than any system trained on video; and the pre-training here is on a relatively small amount of video, not a huge dataset. These are reconstructions of missing parts of a video by that system (this is sped up), done by training a separate decoder: it's not part of the initial training, but in the end we can use the representation as input to a decoder that we train to reconstruct the part of the image that's missing. These are the results of completions where basically the entire middle of the image is missing, and the system fills in things that are reasonable: it's a cooking video, so there's a hand, and a knife, and some ingredients. Okay, there's another topic I want to talk about, because I know there are mathematicians and physicists in the room: a recent paper, a collaboration between some of us at FAIR and Bobak Kiani, who's a student at MIT with Seth Lloyd, and a bunch of people from MIT. This system basically uses this joint embedding idea to learn something about partial differential equations that we observe through a solution. So look at the thing at the bottom: we have a PDE, Burgers' equation, and what you see are, you know, space-time
diagrams, basically, of a solution of that PDE. What we're going to do is take two separate windows on the solution of that PDE. Of course, the solution depends on the initial condition; you get different solutions for different initial conditions. So we take two windows over two different solutions of that PDE, and we do a joint embedding: we train an encoder to produce representations such that the representation of one piece of the solution can be predicted from the representation of the other piece. And what the system ends up doing in that case is basically representing the coefficients of the equation being solved, because the only thing that's common between one region of the space-time solution of the PDE and another region is that it's the same equation with the same coefficients; what's different is the initial condition, while the equation itself is the same. So the system basically discovers a representation, and when we then train a supervised system to predict the coefficients of the equation, it does a really good job, in fact a better job than if we trained it completely supervised from scratch. So that's really interesting. There are various tricks in this work involving transformations of the solution according to invariance properties of the equation, which I'm not going to go into, but it's using the VICReg procedure I described earlier. We applied this to a bunch of different PDEs: Kuramoto-Sivashinsky, where we try to identify some of the coefficients of the equation, and Navier-Stokes, where we try to identify the buoyancy parameter, which is a constant term at the end. And this works better, again, than just training a supervised system to predict what the buoyancy is from observing the behavior. So this is pretty cool; there are already papers that have recycled this idea in other contexts. Okay, so that's the end of the technical part. For the conclusion: we have a lot of problems to solve, some of which are mathematical, like the mathematical foundations of energy-based learning, which I think are not completely worked out. The idea that the dependency between sets of variables is represented by an energy function that takes low values on the data manifold and high values outside is a very general idea; it breaks the whole hypothesis framework of probabilistic modeling, and I think we need to understand better what the properties of such things are. We need to work on JEPA architectures with regularized latent variables; I didn't talk much about this, but it's kind of a necessity. Planning algorithms in the presence of uncertainty, hopefully using gradient-based methods. Learning cost modules to guarantee safety. Planning in the presence of inaccuracies of the world model: if your world model is wrong, you're going to plan wrong sequences of actions, because you're
not going to predict the right outcomes; so how do you deal with that? And then exploration mechanisms, to adjust the world model in regions of the space where it is not very good. So we're working on self-supervised learning from video, as I told you; on LLMs that can reason and plan, driven by objectives, according to the objective-driven architecture I showed, but for text as well as for robotic control; and then on trying to figure out if we can do this sort of hierarchical planning idea I was telling you about earlier. Let's see. So, in this future where every one of our interactions is mediated by AI systems, what that means is that AI systems will essentially constitute a repository of all human knowledge, something everyone will use, sort of like a Wikipedia you can talk to, one that possibly knows more than Wikipedia. And every one of those systems is necessarily biased. It's trained on data that is available on the internet; there's more data in English than in any other language, and there are a lot of languages for which there is very little data. So those systems are going to be biased, necessarily, and we've seen pretty dramatic examples recently with the Gemini system from Google, where they spent so much effort making sure the system was not biased that it ended up biased in another obnoxious way. So bias is inevitable, and it's the same as with the media and the press: every journal, every news magazine, every newspaper is biased. The way we fix this is that we have a high diversity of very different magazines and newspapers; we don't get our information from a single system, we have a choice between various biased systems. Basically, this is what is going to have to happen for AI as well. We're not going to have unbiased AI systems, so the solution is to have lots and lots of biased systems: biased for your language, your culture, your value system, your centers of interest, whatever it is. So
what we need is a very simple platform that allows basically anyone to fine-tune an open-source AI system, an open-source LLM, for their own language, culture, value system, and centers of interest: basically a Wikipedia-like effort, but not a wiki where you write articles, a wiki where you fine-tune LLMs. That's the future of AI that I see, that I want to see. A future in which all of our interactions are mediated by AI systems produced by three companies on the West Coast of the US is not a good future. And I work for one of those companies, but I'm happy to say that Meta has completely bought into this idea that AI platforms need to be open, and is committed to open-sourcing the various incarnations of Llama, the next one being Llama 3, coming soon. So open-source platforms are necessary; they're necessary even for the preservation of democracy, for the same reason that diversity of the press is necessary for democracy. And one big danger is that open-source AI platforms will be regulated out of existence, because some people think AI is dangerous, so they say you can't put AI in the hands of everyone, it's too dangerous, you need to regulate it; and that would kill open-source AI platforms. I think the dangers of that are much, much higher than the dangers of putting AI in the hands of everybody. And how long is it going to take for us to reach human-level AI? It's not going to be next year, like Elon Musk says; well, Musk says before the end of the year, and that's BS. It's not going to be next year, despite what you might hear from him, and it's probably not going to be in the next five years. It's going to take a while before the program I've described here works to the level that we want. And it's not going to be an event. It's not going to be "AGI achieved internally" or anything like that; it's not going to be an event where all of a sudden we discover the secret to AI and all of a sudden we have superintelligent systems. It's not going to happen that way. We're going to build systems of the type I described, make them bigger and bigger, teach them more and more stuff, put in more and more guardrails and objectives, and work our way up, so that as they become smarter and smarter, they also become more secure, safe, and well-behaved. So it's not going to be an event; it's going to be progressive motion towards more and more powerful, and more and more safe, AI systems. And we need contributions from everyone, which is why we need open-source models. I'll stop here; thank you very much. Thank you for a wonderful, thought-provoking talk; we have time for a few questions. Hello, yeah, I've been trying to figure out why you put an encoder in front of Y, because you're getting the representation of the output image and you're losing information, and does that mean your architecture is only as good as your encoder?
so like uh I I couldn't figure out why you put it there that way so can you help me to understand sure I have two answers to this um are you physicist by any chance uh computer science computer scientists okay there are physicists in the room okay but this is very basic physics um if you want to predict the trajectory of planets most of the information about any planet is completely irrelevant to the prediction right the shape the size the den the composition all of that is completely relevant
the only thing that matters is six variables which are position and velocities right and you can you can predict the trajectory so the big question in making predictions and planning and stuff like that uh is what is the appropriate information and the appropriate abstraction level to make the prediction you want to make and then everything else eliminated because if you spend all of your your resources trying to predict those things that are irrelevant you're completely wasting your time right
um uh so that's the first answer the second answer is imagine that the video I'm I'm training the system on is video of this room where I point the camera this way and I pan slowly and I stop right before you um and I ask the system I predict what's going to happen next in the video the system will probably predict that you know the the panning is going to continue there's going to be people sitting and at some point there's going to be a wall there's absolutely no way you can predict what you l
ook like or what anybody uh you know is will look like no way it's going to predict like how many steps there are in the the stairs no no way it's going to predict the precise texture of the wall or the carpet uh right so there's all kinds of details here that are completely unpredictable yet if you train a generative system to predict y it's going to have to actually devote a lot of resources to predict those details right so the the the whole question of machine learning and to some extent sci
ence is what is the appropriate representation that allows you to make predictions that are useful right so Jad gives you that generative models don't hello my name is Morris and I'm a PhD student at MIT and I noticed that your JEA architecture looks a lot like the comman filter like uh you have a sequence of measurements and even when you want a common filter there is often a problem which is that you need a condition called observability and you have a very clever way of getting around this co
ndition of observability because in your lateen space you can come up with a clever regularizer for the things that you cannot see does the world model help in coming up with these regularizers and secondly your control would probably come in on the latent state is that how you think it would work out in the end or or I mean I yeah that's my question yeah yeah okay actually it's not like a like a common filter a common filter uh the the encoders are reversed they're not encoders they decoders uh
I'm looking for the slide with the general picture of the world model; yeah, this one is probably the best. Okay, so here you get a sequence of observations, and the observation goes into an encoder that produces the estimate of the state. A Kalman filter is actually the other way around: you have a hypothesized state, you run it into a decoder that produces the observation, and what you do is invert that. You're learning hidden dynamics, so in that sense it's similar, but you're generating the observation from the hidden states, so it's a bit reversed. And then there is a constraint, at least in traditional Kalman filters, that the dynamics is linear; there are extended Kalman filters where it's nonlinear, and there's a particular provision to handle the uncertainty: you assume Gaussian distributions of everything, basically.
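The reversed-arrow point can be sketched with a standard textbook linear Kalman filter (not code from the talk): the observation matrix `H` acts as a *decoder*, generating the expected measurement from the hypothesized hidden state, and the update step inverts it. A JEPA runs the arrow the other way, with an encoder mapping observations into a learned latent space where prediction happens. The constant-velocity toy below also illustrates the observability issue the questioner raised: velocity is never measured, yet the filter recovers it.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle for x' = F x + w,  z = H x + v."""
    # Predict: propagate the hidden state and its covariance.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: compare the *decoded* observation H @ x_pred with z.
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# 1-D constant-velocity toy: hidden state [position, velocity],
# but we only ever observe position.
F = np.array([[1.0, 1.0], [0.0, 1.0]])      # dynamics
H = np.array([[1.0, 0.0]])                  # "decoder": state -> observation
Q = 1e-4 * np.eye(2)                        # process noise
R = np.array([[0.25]])                      # measurement noise
x, P = np.zeros(2), np.eye(2)
for t in range(1, 20):
    z = np.array([2.0 * t])                 # true velocity is 2
    x, P = kalman_step(x, P, z, F, H, Q, R)
# The filter recovers the unobserved velocity from position alone,
# which is what the observability condition guarantees here.
```

Note how the whole machinery leans on the generative direction (state to observation) plus the linear-Gaussian assumptions just described; the JEPA replaces that inversion with a learned encoder and a latent-space regularizer.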
But yes, there is a connection, because there is a connection with optimal control, and Kalman filters kind of set you up for control.

Hi, so I have a bit of a less technical question. Given that you're also a citizen of France and, broadly, the EU, and given everything you said about having open models, and about one of the main problems for these systems potentially being regulatory capture or legislative problems: what do you think about the new EU AI Act? Does it influence, or might it influence, how Europe is going to proceed with R&D and AI development, and potentially Meta's presence in France?

Well, there are good things and bad things in the EU AI Act. The good things are things like: you can't use AI to give a social score to people, which is a good idea; you can't put cameras that do face recognition in public spaces unless there are special conditions, like the Paris Olympic Games or whatever. Those are good things for privacy protection and the like. What is less good is that at the last minute there were discussions where they started putting provisions into it for what they call frontier models. This is because of ChatGPT: they say, if you have a powerful model it's potentially dangerous, so we need to regulate research and development, not just regulate products. Regulating research and development, I think, is completely wrong; I think it's very destructive, depending on how it's applied. It might be applied in ways that in the end are benign, but it could be that they're a little too tight about it, and what that's going to cause is that companies like Meta will say: well, we're not going to open source to Europe; we're going to open source to the rest of the world, but if you're from Europe you can't download it. And that would be really, really bad. Some companies are probably going to move out. So I think we're at a fork in the road where things could go bad. There's a similar phenomenon in the US with the executive order from the White House, where it could go one way or the other depending on how it's applied. In fact, the NTIA had a request for comment; Meta submitted one and said: make sure you don't legislate open source AI out of existence, because the reason to do so would be imaginary existential risks that are really completely crazy, you know, nuts, pardon my French. The idea that somehow, all of a sudden, you're going to discover the secret to AGI, and a superintelligent system is going to take over the world within minutes, is just completely ridiculous. This is not how the world works at all. But there are people with a lot of money who have funded a lot of think tanks that have lobbied governments into thinking this, so governments have organized meetings asking, are we all going to be dead next year, or things like that. So you have to tell them, first: we're far away from human-level intelligence; don't believe the guys who tell you it's just around the corner. And second: we can build these systems in ways that are not dangerous. It's not going to be an event; it's going to be gradual and progressive, and we have ways to build those things safely. Don't rely on the fact that current LLMs are unreliable and hallucinate; don't project that onto future systems. Future systems will have a completely different architecture, perhaps of the type that I described, and that makes them controllable, because you can put in guardrails and objectives and everything. So discussing the existential risk of superintelligent AI systems today is insane, because they haven't been invented yet; we don't know what they would look like. It's like discussing the safety of transatlantic flight on a jet airliner in 1925: the turbojet hadn't been invented yet. And it didn't happen in one day; it took decades before you could fly halfway around the world in complete
safety with a twin-engine jet plane. That's amazing, incredibly safe, and it took decades. It's going to be the same thing.

So that's a good place to wrap it up; let's thank Yann again for a wonderful talk. Thank you.
