Webinar: Transfer Learning for Image and Text Classification

Hello everyone, can you hear me? Yes. Can you hear me? Wait a minute. Can everyone hear me? Yes, I do. Okay, now I can also hear. Is the voice quality okay, Manoj? It's much better than before. Okay, awesome, thank you. Let me just do some setup.

Okay, let's see. Are all of you students from the previous year? Okay, Manoj. So what we will do is this: I will cover transfer learning today, and we'll record it. I will have a Google Doc, which we will use more like a whiteboard. I will first cover the content at length and then allow you to ask questions. We can also record any useful questions and answers. Feel free to give your honest feedback: if certain things are not clear, we can re-record them, because we plan to use this for the course.

I'm going to present my screen. Just let me know, as I write something here as well: is the doc visible to you properly? Yes sir, it is visible. Is the font fine? Yeah, it's okay, sir. All right, all this feedback will be very useful.

Okay, Manoj, let's get started. Good morning everyone. Today I will talk about transfer learning. Why do we really need transfer learning? That will be the first point. So let's write down the topics that we'll be covering today. Yes, go ahead.

Is there a question? Okay, let me restart.

Good morning everyone. We will be talking about transfer learning today. We'll be covering topics like an introduction to transfer learning, then feature extraction using transfer learning, and we'll look at some examples. So let's start.

In this course we have mainly dealt with tabular, structured data. What do I mean by that? We typically have data in this format: a table with a few columns. Let us say each entry is a real number; this column is feature one, this is feature two, this is feature three, and so on, and we have m such features. So we have typically looked at data sets that have m features and n examples: this row is example one, this is example two, and so on up to example n. What we have is n examples and m features. What do we mean by this? All of us are very well aware of it: each example is represented with m features. We often call this the n × m matrix, or the feature matrix, where every example is represented with m features. What kind of data sets did we look at? We looked at the housing price prediction data set, and we also looked at the digits data set. All those data sets we somehow represented in the form of a feature matrix.
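
As a small illustration of the n × m feature matrix just described, here is a sketch in Python. The numbers are invented for illustration and do not come from any data set used in the course; the only point is that rows are examples and columns are features.

```python
# A toy feature matrix: each row is one example, each column is one feature.
# All values are invented for illustration.
X = [
    [1200.0, 3.0, 10.0],   # example 1
    [850.0, 2.0, 25.0],    # example 2
    [1500.0, 4.0, 5.0],    # example 3
    [700.0, 1.0, 40.0],    # example 4
]

n = len(X)       # number of examples (rows)
m = len(X[0])    # number of features (columns)

# Every example must be described by the same m features.
assert all(len(row) == m for row in X)

print(f"feature matrix shape: {n} x {m}")
```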

Now, given an n × m feature matrix, our job is to learn a mapping between features and labels. I have not talked about labels yet, but all of you know that the label could be a vector or a matrix, depending on whether we are solving a single-label or multi-label problem, or in general a single-output or multi-output problem. With a single output we have a label vector; with multiple outputs we have a label matrix. And then our machine learning process followed. This was a very nice case for us: the data was represented in the form of an n × m feature matrix, and we were asked to solve a machine learning problem, which is to train a machine learning model to learn the mapping between features and labels. In this manner we have trained many machine learning models for regression and classification tasks.

This was working out very well, but we also need to understand how to process so-called unstructured data sets. This representation is challenging for processing unstructured data types like text, images, audio and video. What is the challenge here? The data is not present in feature format; instead we have a different form of data. For example, in the case of text you might have a sentence like "I am online BSc student". Let's say we want to predict the sentiment of this statement. There is no sentiment here, so the sentiment in this case would be neutral. Another example could be "I like MLP course content". Here the sentiment is positive. Here the data is given in this particular format, where you have a text and a label.

That was an example for text, one modality: the training data is in the form of pairs of text and label. For sentiment classification we will have data of the following form: this is a statement and this is its sentiment, and you will have many such statements.

Now let's take an example of an image data set. Let's say we want to classify images into two types, cats and dogs. How will this data look? We'll have images, so let me get some images from Google image search. If you don't like cats, I hope at least you like dogs. So we have an image like this, labelled as cat, and then this one, also labelled as cat. Let's get a couple of dog images: this image is labelled dog, and this one is also labelled dog. The same thing will happen for audio and video.

You may also have examples where there are multiple labels assigned to a picture. In an image there may be both cats and dogs, and you have to identify all the animals; then the labels will be whatever animals are present in the picture. So even that type of data set is possible. I have purposefully selected images that are of different length and breadth. This becomes another challenge: all images are not of the same size, so we need to see how to process that. Okay, I'll wait for a minute and take a question.

Is there a question? I got some notifications, so I just wanted to make sure. No sir, no question yet. Okay, sounds good.

I hope this gives you a good idea about how these data sets are represented. I talked about text and image data sets; audio and video data sets will be in a very similar form. You would have a video and some classification of this video: whether it is an adult video or not, or whether it is a sports video or a movie, something like that. You can also process audio and try to detect whether it is a podcast or a song. So you can solve interesting classification problems with that.

Now look at the difference between this representation and what we have been used to so far. In this course we saw data sets which were mainly in a tabular, structured format, where we had examples and features. Now we need to understand how to process these unstructured data sets. And what form did the models that you learned have? Do you remember the regression formula? y is equal to something like θᵀx; that is one such formula. These are regression models, for example. What do you see here? This is a mathematical formula, so x, θ and y should all be numbers, in numerical format. So the features have to be in numerical format, and we have seen that for images, text and all these types of unstructured data sets, the features are not in numerical format. So what do we need to do? We need to somehow take the unstructured inputs and convert them to numbers, because we know that our models can process numbers. So how do we really do it?
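
To make the point about numerical inputs concrete, here is a minimal Python sketch of the linear model y = θᵀx. The weights and feature values are arbitrary numbers chosen for illustration, and the function name `predict` is just a label for this sketch. The formula is a sum of products, so every feature has to be a number before the model can use it.

```python
def predict(theta, x):
    """Linear model: the dot product of the weight vector and the feature vector."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return sum(t * xi for t, xi in zip(theta, x))

# Made-up weights and one made-up example with three numerical features.
theta = [0.5, -1.0, 2.0]
x = [4.0, 1.0, 0.5]
y = predict(theta, x)   # 0.5*4.0 + (-1.0)*1.0 + 2.0*0.5
print(y)

# Raw text cannot be plugged into the same formula: multiplying a float by a
# string is not defined, so Python raises a TypeError.
try:
    predict(theta, ["I", "like", "MLP"])
    text_worked = True
except TypeError:
    text_worked = False
print("text accepted by the model:", text_worked)
```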

In the pre-deep-learning era there used to be multiple disciplines. Text was mainly studied in a course called natural language processing, or statistical natural language processing, or NLP; there is one entire work stream dedicated to this. Then there was computer vision, or image processing, for images. Then we have audio: for audio there was speech recognition, an entire speech area dedicated to this. So NLP is for text, speech is for audio, and images and videos are covered in computer vision.

What is a video? A video is nothing but a bunch of images arranged in a sequence, but there are lots and lots of images. These are image frames, and the frames are arranged in a sequence to make a meaningful video. Of course there is a smooth transition between images in a video; it is not really a stitching together of multiple images, you need a very smooth transition from one image to the next. So video can also be studied using computer vision techniques; specifically, the techniques that we apply to images can also be applied to video. Whenever we are processing videos, we sample different images from the video, apply image processing to them, and then apply some kind of sequential model on top to get the prediction. I know you have not studied sequential models; that is something you will study in the deep learning course. For now, just remember that videos and images are processed in a similar manner.

So there used to be these disciplines, or courses, where researchers systematically analyzed these data types, and they developed a bunch of techniques to extract features from them. What used to happen is that we had the data, then some specialized processing or transformation, either automatic or with humans involved (humans might look at the image and give you features), and then we got features. Let me highlight this specialized processing or transformation, so that we know where we are going to focus. But with deep learning, what has happened is that... Even before that, let us try to understand our machine learning process.

What we have is this: we have data, we get features, then we build or train a model, and we evaluate. This particular process basically happens automatically, so let us mark it in green. This one was more of a manual process: we have seen in the course that we require human experts to give us features. Now the idea is: can we automate the feature extraction, or featurization, part, given the data? This is really the question. And the answer turns out to be yes: deep learning techniques can be used for automatic feature extraction.

Now let us take a quick recap of neural networks. What did we do in neural networks? Let me draw a network here. I'm thinking that we should break one part of the video here.

Let's get questions, if there are any, on this basic part. It will be interesting to know your questions, and whether we can record this question-and-answer part; then we'll do a quick recap of neural networks. Any questions? Feel free to ask questions or pass on feedback: whether this is useful, not useful, confusing, anything that you feel. Understood up to here? Is everything clear? Yes. Okay. There was no noise that you were hearing from the background, right? No sir, no background noise. Okay, great. So let's move on. If there are any questions, feel free to ask; these questions will also be useful for your fellow students.

So now we will do a quick recap of neural networks and see how deep learning can be used for automatic feature extraction. What do we see here on the screen? We have an input and we have an output, and between input and output there are some layers, which we call hidden layers. If you recall from our feed-forward neural network discussion, this neural network has two hidden layers and one output layer with exactly one unit. So this could be a classification or regression problem with exactly one output; specifically, a binary classification problem or a regression problem with a single output. Here the input has four features, the first hidden layer has three units, and the second hidden layer has three units. In this figure we have not shown the bias unit, but every node in the hidden layers has one more arrow coming in, which is the bias, not explicitly shown here. There will also be a bias on the output.

So how many parameters are there? We started with four input features and we have one output. If you just used traditional linear regression with these four features and one output, removing everything in between, you would have exactly five weights: one for every feature and one additional one for the bias.

So there will be five weights in all. Now let's count how many weights, or parameters, there will be in this model. You can see that there are four inputs coming into every node of the first hidden layer, plus a bias, which means five inputs per node. There are three nodes, so there are 15 parameters for the first layer. Then in the second layer you again get three inputs for each unit plus one bias, which is 4 per unit, so 4 × 3 is 12, and 15 + 12 is 27. And there are four more at the output: three from the second hidden layer and one more for the bias. So there are 31 parameters. Just by adding these two hidden layers with three units each, we converted this problem into a model containing 31 parameters. Getting it?

Now, you can think of these layers as doing some kind of transformation on the input. What is happening after the first layer? Let us try to write this down. In the first layer, each unit or neuron performs the following operation; in plain English, a linear combination plus a nonlinear activation. You have seen linear combination: it is what we do in linear regression, and even in logistic regression we do a linear combination and apply a sigmoid activation on top of that. So the linear combination is basically θᵀx, and for the nonlinear activation there are different choices like sigmoid, ReLU or tanh. You can also use a linear activation. This operation happens in every layer.

So what is happening is that the original inputs are getting transformed into something else. For these four inputs there are three units; the four inputs are passed through the three units and you get three numbers after the first layer. So after the first layer, the four inputs have been transformed into three outputs. These three outputs are then fed into the second layer, where again the linear combination followed by nonlinear activation happens in every neuron, and you get three outputs. Those three outputs are then fed into the final layer, where again the same kind of linear combination plus nonlinear activation happens, and we get the final result.
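
The parameter count and the layer-by-layer transformation described above can be sketched in plain Python. This is only an illustrative sketch: the weights are random placeholders (a real network would learn them), the sigmoid is used in every layer as one of the activations mentioned, and the helper names are invented for this example.

```python
import math
import random

def count_parameters(layer_sizes):
    """Weights plus one bias per unit, summed over consecutive layer pairs."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each unit: linear combination of the inputs plus bias, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
            for unit_weights, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Random placeholder weights; a real network would learn these."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

layer_sizes = [4, 3, 3, 1]            # 4 inputs, two hidden layers of 3, 1 output
print(count_parameters([4, 1]))       # plain linear regression: 5
print(count_parameters(layer_sizes))  # 15 + 12 + 4 = 31

rng = random.Random(0)
layers = [random_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

activations = [1.0, 0.5, -0.5, 2.0]   # one made-up example with 4 features
sizes_seen = [len(activations)]
for weights, biases in layers:
    activations = layer_forward(activations, weights, biases)
    sizes_seen.append(len(activations))

print(sizes_seen)                     # [4, 3, 3, 1]
```

The list `sizes_seen` traces how the four inputs become three numbers, then three numbers again, then a single output.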

Now, why am I telling you all these things? Because we are going to make use of this description for generating features from deep learning models. What is a deep learning model? Deep learning models have multiple hidden layers; the hidden layers could number in the tens or in the hundreds. There are some models with 200 hidden layers, or 50 hidden layers, and these models have a huge number of parameters. As we add more layers, the number of parameters goes up. Just for this simple network we had 31 parameters; if you add one more layer here, you are going to add 12 more parameters, and so on. So as you add more layers, more parameters get added. Then there are different architectures also. This is a very simple feed-forward architecture; there are other architectures, which are out of scope for our syllabus and which you are going to study in your deep learning course. But all architectures roughly follow the same kind of principle, where you have an input layer, an output layer, and multiple hidden layers in between.

So now, what if I use this particular architecture to somehow generate features? Think about this: I take, say, text, and I feed in each text token. Let us take a concrete example of text: how can we use this architecture to get features for text? We will make some assumptions. Let's say I have "I am online BSc student", and we decide that we are not going to use words like "I" and "am", because they are common words that do not mean much. So we decide to use the words "online", "BSc" and "student"; let's say we use these three words. The second one is "I like MLP course", and we decide to use the words "MLP" and "course". So the vocabulary here consists of five words.

This is just a very simple toy example that I'm showing you so that you understand this very clearly. So the vocabulary consists of five words. And we already know some very simple representations, like one-hot encoding. We know how one-hot encoding works. Let's say we want to represent the word "online": let's say the first word is "online", so it is 1 0 0 0 0, because there are five words. I'm actually making this super simple by keeping aside all kinds of complexities. Now let's say the word "BSc" is there: we can represent it as 0 1 0 0 0. Then we have the word "student", which we can represent as 0 0 1 0 0; "MLP" we can represent as 0 0 0 1 0; and finally "course" as 0 0 0 0 1. These are our one-hot encoding representations.

Now let's say I want to get some feature representation for these words. The one-hot representation has a lot of zeros, so it is not a great representation as such. We call it a sparse representation, since the majority of its entries are zero. A sparse representation is not good, because it may not give us good prediction accuracy; it doesn't contain much information. So we want to come up with some kind of dense representation, which has not five entries but maybe three entries, and all three entries are non-zero.

So how can you use a neural network for this? Let's take this network. Here there are four input units, so we'll draw a new network for this. We have a neural network with five inputs, since every word is represented with five features; so these are the five input units. Let's say we decide to use three hidden units in the first layer. (These look elliptical, so I'm just changing them to perfect squares.) So we have one hidden layer with three hidden units, and let's say we again output five different things. Let's say the task that we are trying to solve is to predict the next word, for example. I'll explain this, in case it is a bit difficult for you to follow. Sorry, let me not actually draw this; you understand this part, and it will just take too much time.
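
The one-hot representation of the five-word toy vocabulary described above can be sketched in a few lines of Python. The word order follows the order used in the talk; the helper name `one_hot` is just a label for this sketch.

```python
vocabulary = ["online", "BSc", "student", "MLP", "course"]

def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    index = vocabulary.index(word)   # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]

encodings = {word: one_hot(word, vocabulary) for word in vocabulary}
for word, vector in encodings.items():
    print(f"{word:8s} {vector}")

# Sparsity: every vector has exactly one non-zero entry out of five.
zero_fraction = encodings["online"].count(0) / len(vocabulary)
print("fraction of zeros per vector:", zero_fraction)
```

With a realistic vocabulary of many thousands of words, the fraction of zeros gets even closer to one, which is why a short dense vector is attractive.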

Instead, we'll have a very simple representation. So this is the representation, for example. How can we use a simple feed-forward network to extract features for every word? You see that this is the input layer, this is the hidden layer, and this is the output layer. Here we are trying to solve a problem: this is an example network to predict the next word. How will the data look? You will have data as pairs: word and next word. These are the examples. See what the next word in the statement is: the word is "I" and the next word is "am"; then I'm putting in the word "am" and the next word is "online"; the word is "online" and I'm trying to predict "BSc"; the word is "BSc" and the next word is "student". Then again, the word is "I" and the next word is "like"; for "like" the next word is "MLP"; for "MLP" the next one is "course". These are the examples: this is the input and this is the output. We are trying to predict the next word.
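
The (word, next word) training pairs listed above can be generated mechanically from the two example sentences. The following Python sketch assumes simple whitespace splitting as the tokenizer, which is an assumption of this example and not something specified in the talk.

```python
sentences = [
    "I am online BSc student",
    "I like MLP course",
]

def next_word_pairs(sentence):
    """Pair every word with the word that follows it in the sentence."""
    words = sentence.split()
    return list(zip(words, words[1:]))

pairs = [pair for sentence in sentences for pair in next_word_pairs(sentence)]
for word, next_word in pairs:
    print(f"{word} -> {next_word}")

# Distinct words in order of first appearance across both sentences.
vocabulary = list(dict.fromkeys(w for s in sentences for w in s.split()))
print(len(pairs), "pairs,", len(vocabulary), "distinct words")
```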

Let's say this is a very simple network in which every word is represented as a one-hot vector with five entries, or five components. Getting it? Now, if you convert these pairs into the one-hot representation, then for "I", "am" and "like" we don't have any entry, so we have to add them as well. So we now have an eight-word vocabulary: "I", "am", "like", and the five words from before. Okay, so we have this representation now, and we need to modify the figure for it. We have an input which is a one-hot encoding vector with eight components, so we have eight inputs: 1, 2, 3, 4, 5, 6, 7, 8. We are passing it through one hidden layer with three units, and then we are again trying to predict eight different outputs, and these eight outputs correspond to the next word. I'll show you what an input-output pair will look like: "I" is represented by this vector, and the output is, for example, "am", which is this vector. You can construct vectors for all of the pairs like this. This is how the input and output will look, and we want to use a neural network to solve this particular problem.

Of course, I'm showing you this only on a small data set, but imagine that we solve such a problem at a very large scale, say at web-corpus scale. We download each and every statement from the web and set up this problem of which word comes after which word. We have this corpus, and let's say we apply a neural network like this, where we represent every word as a one-hot encoding vector and we have a hidden layer with, say, 128 units, or 256 units, or 700-odd units, something like that. Now let's say we train this network. How can we use it to create our features? What we are going to do is: step one, train the model; step two, use the model to extract features. How do we do this?
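
Step one, training the next-word network, can be sketched on the toy data in plain Python. The talk does not specify a training procedure, so everything beyond the 8-3-8 shape is an assumption of this sketch: tanh in the hidden layer, a softmax over the eight output words, cross-entropy loss, per-example gradient descent, and the particular learning rate and epoch count.

```python
import math
import random

words = ["I", "am", "like", "online", "BSc", "student", "MLP", "course"]
index = {w: i for i, w in enumerate(words)}
pairs = [("I", "am"), ("am", "online"), ("online", "BSc"), ("BSc", "student"),
         ("I", "like"), ("like", "MLP"), ("MLP", "course")]

V, H = len(words), 3                      # 8 inputs/outputs, 3 hidden units
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(H)]  # H x V
b1 = [0.0] * H
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)]  # V x H
b2 = [0.0] * V

def forward(i):
    """Forward pass for the one-hot input with a 1 at position i."""
    # With a one-hot input, the linear combination just picks column i of W1.
    h = [math.tanh(W1[j][i] + b1[j]) for j in range(H)]
    z = [sum(W2[k][j] * h[j] for j in range(H)) + b2[k] for k in range(V)]
    zmax = max(z)
    e = [math.exp(v - zmax) for v in z]
    total = sum(e)
    p = [v / total for v in e]            # softmax over the 8 candidate words
    return h, p

def mean_loss():
    """Average cross-entropy of the true next word over all training pairs."""
    return sum(-math.log(forward(index[a])[1][index[b]])
               for a, b in pairs) / len(pairs)

initial_loss = mean_loss()
lr = 0.1
for epoch in range(300):
    for a, b in pairs:
        i, t = index[a], index[b]
        h, p = forward(i)
        dz = [p[k] - (1.0 if k == t else 0.0) for k in range(V)]   # dLoss/dz
        dh = [sum(dz[k] * W2[k][j] for k in range(V)) for j in range(H)]
        for k in range(V):
            for j in range(H):
                W2[k][j] -= lr * dz[k] * h[j]
            b2[k] -= lr * dz[k]
        for j in range(H):
            da = dh[j] * (1.0 - h[j] ** 2)                         # tanh'
            W1[j][i] -= lr * da
            b1[j] -= lr * da
final_loss = mean_loss()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loss cannot reach zero on this data, because "I" is followed by "am" in one sentence and by "like" in the other, so the best the network can do for that input is split the probability between the two.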

Let us demonstrate this. Again, I'll first state this process, and then we'll try to put up a diagram which is a modification of this diagram. See, this model has one hidden layer, and then this is the output layer. What we do is basically ignore the output layer and take whatever output the hidden layer is giving us. How many numbers will we get out of this layer? You know that it has three units, so it is going to output exactly three distinct numbers, and those numbers are obtained by a linear combination followed by a nonlinear activation. This unit gives you one number, this one gives you another number, and this one gives you the third number. In this way you have three numbers for every word. Every word, which was originally represented with an eight-component one-hot encoding vector, is converted into a dense vector with three components. Getting it? I'm going to illustrate this in a picture: we'll copy the diagram here and then edit it. Now we are going to get rid of this output layer, and these are your features; let us write this down. These are what you get.

Are you getting it? Is this process clear? Yeah. Is there any question? I'll pause for a minute. Anyway, since I have paused, I'll ask: how are you finding this so far? Is any improvement required? Okay, looks good. Fine.

Yes, okay. I'll just have a small water break and come back, and we'll continue this discussion. Just give me a couple of minutes.

Okay, I'm back; let us continue our discussion. So you have now understood the process of getting features from deep learning models. What is happening here, essentially, is that we are removing the final layer, or the output layer. We generally remove the output layer and read the outputs of the previous layer, which become our features.
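
Reading features from the hidden layer can be sketched as follows. This is an illustration under stated assumptions: the hidden-layer weights are random placeholders standing in for trained values, tanh is assumed as the activation, and the output layer is simply never computed.

```python
import math
import random

words = ["I", "am", "like", "online", "BSc", "student", "MLP", "course"]
V, H = len(words), 3

# Placeholder weights: in practice these would come from a trained network.
rng = random.Random(42)
hidden_weights = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(H)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(H)]
# The output layer's weights exist in the full model, but feature extraction
# never touches them, so they are not even defined here.

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in words]

def extract_features(word):
    """Run only the hidden layer: linear combination, then tanh activation."""
    x = one_hot(word)
    return [math.tanh(sum(w * xi for w, xi in zip(unit_weights, x)) + b)
            for unit_weights, b in zip(hidden_weights, hidden_biases)]

features = {word: extract_features(word) for word in words}
for word, vector in features.items():
    print(f"{word:8s}", [round(v, 3) for v in vector])
```

Each word goes in as an eight-component one-hot vector and comes out as a dense vector with three components, one per hidden unit.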

Now, if you want, say, four features for every word, then add one more unit here and you will get four features. You have to add this unit and then retrain the network; then you will start getting four features. What is essentially happening here is that we are not interested in the output of this model; we are interested in the knowledge that this particular network has captured by looking at the training data.

The relationship between different features in the training data that is captured by the model is what we are interested in, and we are using that to construct the features. Getting it? Now, some of you might have a question: why remove only the output layer? Why not remove the last few layers? Yes, you can also remove the last few layers. Then the question is: where should I cut this network? These are very interesting questions, so let's tackle them one by one.

First, let me write this point down: we can remove the last few layers. Of course, in this network we don't have any scope for that, but we will have networks with many, many layers, and in that case we can remove the last few layers and read the output from the layer just before the last removed layer. Concretely, let us assume we have a network with 10 hidden layers and one output layer, and we remove the output layer and two hidden layers; then the output of layer number eight will serve as features.

Now the question is: how do we decide how many layers to remove? But before that, let me tell you that this network that we saw is the same kind of network that is used in autoencoders. You will learn more about autoencoders in your deep learning class; I am just telling you that this simple idea was used in autoencoders.

So how do we decide how many layers to remove? Again, all of this will be covered in detail in the deep learning course and is outside the scope of this course, but I still want to give you some high-level understanding of how things work in deep networks. Typically, the initial layers of a deep neural network (let me write DNN) learn simple concepts. For example, in the case of images, the initial layers will learn simple geometric concepts like lines, which are then composed into more complex concepts like corners or arbitrary shapes. As we go deeper into the network, we start learning more and more specialized concepts. So we basically want to decide where to cut. If we want to use basic concepts as the basis for our features, then we should cut early in the network, keeping only the first few layers. If you are interested in more specialized concepts, we should cut far later in the network.
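
The example of a network with 10 hidden layers and one output layer, cut after hidden layer eight, can be sketched in plain Python. The layer widths, the input size, the tanh activation and the random weights are all assumptions made for this illustration; each hidden layer is given a different width only so that it is easy to see which layer the features come from.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random placeholder weights and biases standing in for trained ones."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def run_layers(x, layers):
    """Pass x through the given layers: linear combination, then tanh."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

rng = random.Random(0)
hidden_widths = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3]   # hidden layers 1..10
sizes = [6] + hidden_widths + [1]                    # 6 inputs, 1 output unit
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes, sizes[1:])]

x = [0.2, -0.4, 0.1, 0.9, -0.7, 0.5]                 # one made-up input

removed = 3                                          # output layer + 2 hidden layers
kept = layers[:len(layers) - removed]                # hidden layers 1..8
features = run_layers(x, kept)

print("layers kept:", len(kept))
print("feature vector length:", len(features))      # width of hidden layer 8
```

Changing `removed` moves the cut point: a larger value reads features from an earlier layer, a smaller value from a later one.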

Repeating it again: for using simple concepts as features, read the output of the network after the first few layers, which means removing the later layers. For using more specialized concepts as features, read the output of later layers in the network, removing only the last few layers. What do I mean by concept? A concept here is just an abstract notion: you are going to get some numbers which represent some very specialized features of the training data.

I'll give you an example. Let's say we have trained a network to recognize cats and dogs, and you want to use this network to generate features for different varieties of dogs. This network was already trained to recognize complex features of dogs, and we want to use it for classification between different varieties of dog. In this case you will cut the network far later: you will probably remove only the last layer and get the features from the second-last layer. Now let's say you want to use this network, trained on cats and dogs, to detect faulty machine parts. This network is not at all aware of machine parts, but machine parts are again geometrical objects, so some very simple concepts that were learned by this network in the initial layers would be useful for getting features for the machine parts. So in this case, if the network is trained on cats and dogs and you want to use it for detecting faulty machine parts, you will select the first few layers for generating your features and discard many more layers at the end. Getting it?

This is how you should see it. And again, where to cut the network is a hyperparameter. These are some very simple heuristics that I told you, but you need to do some experiments to determine exactly where you should cut your network to get effective features for your downstream task. Downstream task means whatever task you want to do, say classification or regression or clustering; the features should be useful for that task.

This idea of using a network which was trained for some other task to generate features is called transfer learning. Let me write this down: the idea of using a network trained for some other task to generate representations, or features, for a given task is called transfer learning. Now there are some terms. The network that we are using for generating features is called the pre-trained model. The task that we are trying to accomplish with these features is called the downstream task.
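
The three terms can be tied together in a small Python sketch. Everything concrete here is an assumption for illustration: the "pre-trained model" is a stand-in with random weights rather than a really trained network, the four-number inputs and the labels "ok" and "faulty" are made up, and the downstream model is a nearest-centroid classifier chosen only because it is short to write. The point is the structure: the pre-trained weights stay frozen, and only the downstream model is fitted on the extracted features.

```python
import copy
import math
import random

rng = random.Random(1)

# "Pre-trained model": a stand-in with random weights. In real transfer
# learning these weights would come from a network trained on another task.
pretrained_layers = [
    ([[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)],
     [rng.uniform(-1, 1) for _ in range(3)]),
    ([[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
     [rng.uniform(-1, 1) for _ in range(2)]),
]
frozen_copy = copy.deepcopy(pretrained_layers)   # to check nothing changes

def extract_features(x, layers):
    """Features = output of the kept layers of the pre-trained network."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

# Downstream task: toy two-class data (made up).
train = [([1.0, 0.0, 0.5, 0.2], "ok"), ([0.9, 0.1, 0.4, 0.3], "ok"),
         ([-1.0, 0.8, -0.5, 0.0], "faulty"), ([-0.8, 1.0, -0.6, 0.1], "faulty")]

def fit_centroids(examples, layers):
    """Downstream model: the mean feature vector for each label."""
    grouped = {}
    for x, label in examples:
        grouped.setdefault(label, []).append(extract_features(x, layers))
    return {label: [sum(col) / len(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}

def predict(x, centroids, layers):
    """Assign the label whose centroid is closest in feature space."""
    f = extract_features(x, layers)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))

centroids = fit_centroids(train, pretrained_layers)
prediction = predict([0.95, 0.05, 0.45, 0.25], centroids, pretrained_layers)
print("predicted label:", prediction)
```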

features is called the downstream task. Remember these concepts. Getting it? So we talked about these three important concepts. Now the next question is, how do we get these pre-trained models? Of course, there are again two choices: one is that we train them ourselves, or we use someone else's model for which the architecture and weights are available, right? So this particular thing, training them ourselves, is difficult. We talked about web-scale training of the model to predict the next word, right? Now, in order to train that type of model, which has so much training data and a complex neural network, we will require a huge amount of compute resources, and it is also very expensive to train these models. So it is not efficient to do it for our problem. So this is difficult owing to the high compute cost. So we are going to use this route. This route is very interesting, where we are going to

use someone else's model, where the architecture and weights are available. So we are going to use this type of pre-trained model.

Okay, so the flow will be as follows. Let us quickly write the flow of how these things are going to proceed. But before that, we'll write the flow of traditional ML, right? In traditional ML, what happens? We have data, we have features, we train models, and we evaluate. Now how is the new flow going to look? Let us write this flow. We are going to have data. Then we'll get the output from the pre-trained model up to the selected layers. This becomes our features. And then, between these two, let's see what changed. We had features here; pay attention to this. Now we are going to get features from this particular step: we are going to first select a pre-trained model, and we are going to decide up to which layer we will use this model to get the features. Then we train the model and evaluate it, right? So this is traditional ML, and this is using deep networks as automatic feature extractors.

Fortunately, these types of models are available. There are open-source repositories where such pre-trained models are available. There are pre-trained models for vision, audio, and text, that is, NLP. So what we will do in the next class is perform hands-on experiments with each of these models. We will do at least one example each from text and image, and then as homework you can do the same thing with audio and video, where we will provide you with the pre-trained models for audio and video, and we'll experience this particular thing.

So how will our pipeline typically look? I'm going to just give you some code snippets right now, but we will do a more thorough job in the next class. Code snippets for transfer learning, right? So let me write down some code snippets which will be very useful. Okay, let us make this a sub-part.

Here's example code for text. A lot of things will not be known to you, but we will do it in the next class. I'm putting this snippet just to explain how the flow is end to end, but we will actually do this in Colab in the next class, where we will code this up and do it step by step. So here what is happening is that we want a pre-trained model. We now know that we need a pre-trained model, so we are going to choose BERT as the pre-trained model for this task, where we want to represent this particular text with features. So we'll see how to do that. This BERT model comes with its own tokenizer, meaning that it needs to know what type of tokens it should be forming from this text. A token need not correspond to a word: one token could be "thi", another token could be "s-p", these types of tokens which BERT has already learned from its own corpus. We are going to use the same tokenization, and we are going to use the BERT model as the pre-trained model. We're going to take this sample text, and we are going to first use this tokenizer to encode the text, right? So this is part of the input to the BERT model: we encode our text using BERT's tokenizer, and then we pass this input into the model and literally

get the output of this model. You can think of this as a prediction, but we are not going to get the final prediction from the model; we are going to read out the output of, let's say, the last-but-one layer of the model, and we call it an embedding. So we get an embedding for every word. We are going to take the average of these embeddings, and that becomes the embedding for the entire sentence. This is one way of doing it; there could be multiple ways, but here we have taken the mean as a very simple way to combine the embeddings for every word, and this becomes the embedding for your sentence. You can now take these embeddings and use them as input to, let's say, a classification model or a regression model or clustering. You're free to use these features for any of those downstream tasks. Okay, so this was for text. Now let us see for image. Image will also be very similar, and we'll see one for audio.
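
The text flow just described (tokenize, pass through the pre-trained model, read a late layer, average over tokens) can be sketched roughly as below. This is a minimal sketch, not the snippet shown in the session: it assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, and the function names `mean_pool` and `embed_sentence` are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def embed_sentence(text, model_name="bert-base-uncased"):
    """Return one embedding for `text` from the second-to-last BERT layer.

    Assumption: `transformers` and `torch` are installed and the checkpoint
    can be downloaded. The checkpoint name is an illustrative choice.
    """
    # Imported here so that mean_pool can be tried without these libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # the model's own tokenizer
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference only, no training

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states holds one tensor per layer; [-2] is the last-but-one layer.
    # Shape is (batch, tokens, hidden); take the single sentence in the batch.
    # Note: the special [CLS] and [SEP] tokens are included in the average.
    token_vectors = outputs.hidden_states[-2][0].tolist()
    return mean_pool(token_vectors)
```

Calling `embed_sentence("some sample text")` would return a plain list of numbers that can be used as the feature vector for a downstream classifier, regressor, or clustering step. Taking the mean is only one possible way of combining token embeddings, as noted above.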

Okay, so let's do image. This is your image code, right? So you're going to use ResNet-50 as the pre-trained model. In the next class we will have a list of well-known models, and I will also give you the list of repositories where you can see these models. So for now, for image, we are going to go the ResNet route. We are going to take the ResNet model and set the model in evaluation mode, where it will give us the prediction. Then

we perform some very simple transforms on the image. We want to resize the image because ResNet accepts images of a specific size, which is 256 by 256. Sorry, ResNet accepts, I think, 224. There is some standard dimension in which the image should be present. So we are going to perform some basic transformations on the image, and then we are going to pass this image into the ResNet model and take the features from this model. So we perform pre-processing on the image (the pre-processing is defined over here), and then we get the output of the fourth layer of the model and take the average pooling of that output. These are slightly advanced concepts which you will learn in the deep learning course. This average becomes our feature. So in your work, what you basically need to do is this: we are going to give you

these types of code. You just need to use this code to get features, treating it as a black box, and then solve other downstream tasks like classification, regression, clustering, any of those downstream tasks.

Okay, now, similarly, can you work out an audio model, and we can discuss it in the next class? Do your research and find out what model we can use as a pre-trained model for audio. The audio model will also proceed in the same way: we'll select the audio model, do any necessary pre-processing for this model, and then select the layer of this model from which we are going to read the output. We will take probably the mean of those outputs, and that becomes our feature for the audio model.

So I hope now you have an idea of transfer learning and how we can use transfer learning for generating features for unstructured data types like text, images, audio, and videos. For videos, as I said earlier, you can think of a video as a sequence of images, so you can get an embedding for every image and then process those embeddings in some kind of sequential manner to get predictions for the video.

Okay, so in the next class we will implement a classification task using the representations learned from these pre-trained models and actually experience it in action. I'm going to stop here, Manoj, and take any questions that you may have.

Any questions?

When will you conduct the next session, sir?

Next weekend, next Sunday, 11:00.

Okay, sir.

So there's a question in the chat: what do we do if there are new words in our data which were not present in the pre-training? That's an interesting problem. There is something called the out-of-vocabulary concept, and

then we use that particular concept, or we sometimes ignore that word; that could be another way. Or in the model itself we have a place for unknown words: all low-frequency words are marked as unknown words, and a representation is learned for them. So again, there are different strategies.

So is there not a way we can feed our corpus to the model?

It is difficult, because the model is already trained, right? You are not retraining the model, right?
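
Because the vocabulary is fixed once a model is trained, the unknown-word strategy mentioned above can be sketched as follows. This is a toy word-level illustration only: the `<unk>` marker, the `min_count` threshold, and the function names are assumptions made for this example, and BERT's own tokenizer works on sub-word pieces rather than whole words.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical marker for unknown or rare words


def build_vocab(sentences, min_count=2):
    """Map each sufficiently frequent word to an id; id 0 is reserved for UNK."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    vocab = {UNK: 0}
    for word, count in counts.items():
        if count >= min_count:  # low-frequency words are left out on purpose
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into ids, sending unseen or rare words to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.lower().split()]
```

With a vocabulary built from "the cat sat" and "the cat ran", the words "sat" and "ran" occur only once and fall below the threshold, so they share the unknown id with any word never seen at all, such as "dog".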

So you're just using that model for prediction purposes; you are not training the model again. If you train the model, that's a different story, but that is not possible here. The other thing you can do is something called fine-tuning of the model, a concept we will learn in the deep learning class, where you can build your own vocabulary into the model and then try to fine-tune it. That is possible. What happens is that the last few layers of

the model you can retrain with your own data set and your own task. But we are not going to do that in this course, because here we just want to give you a tool by which you can get representations for text, images, and audio, which are very interesting data sets for many of you. Right now you do not have any tool set to get features out of them, so we are handicapped; we are not able to solve problems on these data types. So we thought that we will give you some black box to generate features and tell you something about the concepts behind it. That's why we have this transfer learning session. Does it make sense?

Yes, thank you.

Hello, sir.

Yes.

Sir, I was looking into some code by TensorFlow. What they were doing is style transfer: the style from one image is transferred to another image. They used a pre-trained model and extracted some features from

a style image. They manually took a list of some layers and said that these layers are style layers, and of some other layers they said that these are classification layers. I did not understand how they decided that these particular layers belong to style features and these to classification features.

So it could be by doing some deeper study of the model, of what kind of output

it generates. Based on that they might have studied it. They would have done some deeper analysis, maybe through human experts, and then decided those layers are important for style. You can also do that; those are valid hacks, or valid ways of generating features. Make sense?

Okay, sir.

Hello, sir.

Yes.

Sir, in the case of YOLO, is it also an example of transfer learning if we don't train it on our images?

If you don't train on your images and you use those models directly to get the prediction, it is not transfer learning. But if you use those models to get features and then train your other model on the downstream task, then it becomes transfer learning.

Okay. So if I'm using the default model, then this is not transfer learning, but when I'm trying to train the model, it only learns the parameters from our images, so then it becomes transfer learning. Is that right?

Yeah. So basically what happens is that you have, let's say, images, and you want to generate features for those images, right?

Yes.

So how can you do that? You can either manually inspect images and write down features, or apply some standard computer vision libraries to generate features, right? Or the third option is to use transfer learning, where you take a pre-trained model, remove the last few layers of the model, pass the image through the model, and read the output of a pre-selected layer. That output will be nothing but a bunch of numbers, right? Those numbers you're going to use for, let's say, building a regression model or a classification model, a logistic regression model for example. They become your features, and then you train your logistic regression model on top of that. When you use YOLO for object detection, you are basically feeding an image and asking YOLO to localize objects or tell you what objects are there.

Yes.

It is not exactly transfer learning. You're basically using YOLO mainly for prediction. It's called zero-shot.

Yes, zero-shot prediction. That's why I got confused, because when we use YOLO we don't change the final classification.

Yeah, it's not really transfer learning.

Okay, thank you, sir.

This is a good question, by the way. Any other questions? I think Manoj has left, so let's stop this session today. Thank you very much for participating; I hope it was useful. Do you have any feedback about the way this was taught? I would appreciate honest feedback, even if it is bad. Bad feedback is in fact more important, because then I can course-correct. You can also tell me what worked well. Okay, anything you want me to do better? And the speed is fine, right? Speed is fine, volume is fine?

Yes, sir.

Okay, great. Thank you. We will then meet again next week and do hands-on practice of this. We might require a couple more sessions, so if you can keep yourselves free for the next couple of weeks, that will be great. Thank you.

Thank you, sir. Thank you, bye. Have a nice weekend, sir. Bye.
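
For reference, the transfer learning flow described in the last answer (take a pre-trained image model, drop the final layer, read out a feature vector, train logistic regression on top) can be sketched as below. This is only a sketch under assumptions: it uses `torchvision`'s ResNet-50 with its default ImageNet weights and the usual ImageNet pre-processing values, and the small pure-Python logistic regression is a stand-in so the example is self-contained; in practice a library classifier would normally be used. All function names are illustrative.

```python
import math


def extract_image_features(image_path):
    """Return a 2048-number feature vector from a pre-trained ResNet-50.

    Assumption: torch, torchvision and Pillow are installed and the
    pre-trained weights can be downloaded.
    """
    import torch
    from PIL import Image
    from torchvision import models, transforms

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.eval()  # evaluation mode: we only read outputs, no training
    # Keep everything up to and including average pooling; drop the classifier.
    extractor = torch.nn.Sequential(*list(model.children())[:-1])

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),  # ResNet's standard 224 x 224 input
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    batch = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features = extractor(batch)  # shape (1, 2048, 1, 1)
    return features.flatten(1)[0].tolist()


def predict_proba(weights, bias, x):
    """Probability of class 1 for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.1, epochs=200):
    """Fit binary logistic regression with plain batch gradient descent."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

Given a hypothetical list of labelled image files, the features would come from calling `extract_image_features` on each path, and those vectors together with their 0/1 labels would be passed to `train_logistic_regression`. Only the small model on top is trained; the pre-trained network is used as a fixed feature extractor, which is what separates this from using a model such as YOLO directly for prediction.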
