
Data Science Tutorial | Linear regression Diabetes data set | A project in Linear regression.

Careerera, the best institute for data science in Noida (India), the US, and Singapore. In this video, we will learn how to handle linear regression over diabetic data. In this project, we will analyze the condition of the diabetic patient by using regression. We will analyze the data in detail and arrive at an output that tells you the patient's diabetes level.

About Careerera: Careerera provides online training & certification on PGP in Data Science and other technologies. Certification is provided to all students once the course is complete. The program highlights are:
~ Live sessions and classroom training.
~ Delivery of student academic material, learning notes, and sample papers.
~ 24/7 guidance for students. Students can revisit the LMS portal for recorded live sessions to clear doubts and queries that come up.
~ Job assistance.
~ Fixed course curriculum.
~ Hands-on training experience.
~ Multiple simulation exams.

Learn about PG course in data science: https://www.careerera.com/data-science/post-graduate-program-in-data-science
Explore other technology courses at: https://www.careerera.com/courses-list
For any query call: India: +91-92-5000-4000 USA: +1-844-889-4054 Singapore: +65-31-591-123

Careerera

2 years ago

Hello and welcome to Careerera. Today we continue the data science tutorial series. This is video number 2.3, and we will be covering linear regression. We already did one video on linear regression, where we discussed how to implement it: we created a dummy data set and worked on it. Today we will import a data set from sklearn itself, the diabetes data set, which already has the values. Because I want to keep it a little simpler initially, we are not going to use pandas anywhere at this point. That is, we are not going to do the wrangling or the data cleaning part, which is also a very important part of a machine-learning-oriented project. We are getting a data set that is already in numerical form, so all the columns and their values are numerical. It is not human friendly, but it will make a lot of sense when you read it.

We will use this data set to find the target value. We will have an X and a y: y is my target value and X holds my individual columns. I am going to give these to my model. I will import LinearRegression, and once my model is fitted and trained I will check the accuracy. Once I'm done with all the processing and with checking how the model is performing, I will also try to plot it. Plotting is not a step we use every time we do this work, but a plot is an easy thing to understand, especially for the human eye. We don't really plot every time, so plotting is just a one-or-two-time demo; we will not be doing it again and again. We will focus more on the performance and hyperparameter part rather than on plotting. So, concluding all this, let's move to the code and start with it. [Music]

Let's continue with the project. We are going to import the important libraries first. First of all, this project uses the diabetes data set, so let me take this link from
here to save my time and just set it. We will be using linear regression here; let me make it capital and give the heading some size, and this is here. So we have this diabetes data set project, we are going to use linear regression, and we will be using matplotlib too, which is why I'm importing it. It was imported further below; I just cut it and pasted it here so that it is visible at the top. Usually you can keep all the library imports in a single cell, so once I'm done with the project I simply cut them and paste them here, that is all. Having all the imports in the same cell saves a lot of time.

Mean squared error is not something I'm going to rely on, I believe; the R2 score is what we are going to prefer. Mean squared error gives you a single error number in squared units of the target, and to be very frank MSE rarely makes sense on its own, because you are not going to sit and work out what that number means for each and every element. That is not practical to do, and it will not make sense to most people. The R2 score, on the other hand, gives you a decent score and gives you clarity: okay, this is the score. It runs from 0 to 100, and the closer it is to 100, the better your project is. The values actually come out as decimals, something like 0.95 or 0.98, meaning 95 or 98 percent of the way towards 100. In normal terms the range is 0 to 1, and the closer the score is to 1, the better the model is performing. R2 score, mean squared error, all these methods are basically used to check the performance of your model, and that all comes later.

Let's start with the project rather than talking, and import the libraries. We import matplotlib and we import numpy. We are going to use numpy in one or two steps, which is why I'm importing it. As you can see, I'm not importing pandas anywhere, because I'm not going to use pandas. The reason is that when you print the diabetes data you will see that the values are all numerical, so you don't require pandas at the moment. Once you're done with that: from sklearn import datasets and linear_model, then from sklearn.metrics import mean_squared_error and r2_score. These particular libraries are going to be used to check the performance of the model, which is the very last step. For the main step you will use datasets
and linear_model. Again, guys, this is one way of doing it. You can also split this into separate lines, that is, from sklearn.linear_model import LinearRegression, and that is also a practical way of doing it. You can save time either way, so whichever way you want to do it is up to you. Let me run this cell and move forward.

Now I'm just loading the data set. I'm loading the diabetes data and saying return_X_y has to be True, so that we get X and y separately. It gives us two values, and we catch them in two variables, X and y. Once we are done with loading, we print diabetes X. I am using transpose here. Transpose basically turns rows into columns and columns into rows. The reason is that otherwise the printout is hard to read, so I use transpose to make it easier to understand. So this is the kind of value we have. Let me remove the transpose for a moment so it is clear. Okay, these are the values. We have y too, so I'm printing y as well; these are my y values. So we have X and we have y, with a lot of values of course.

We can simply find the shape and check how the values are laid out, so you can add one more step. This is my X shape, and let's check the diabetes y shape too; let me copy the line, paste it and make it y. All right, so we have the shapes: there are 10 columns and 442 rows, and diabetes y is our target. This is my target value.
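The loading and shape-checking steps described above can be sketched roughly as follows. This is a minimal sketch: the variable names diabetes_X and diabetes_y are assumed from how they are spoken in the video, and the printout is trimmed to the first five samples to keep it readable.

```python
from sklearn import datasets

# return_X_y=True returns the features and the target as two separate arrays.
# The variable names are assumptions based on how they are read out in the video.
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Transposed view: one row per feature, showing only the first five samples.
print(diabetes_X.T[:, :5])

# First few target values.
print(diabetes_y[:5])

# Shapes: 442 rows with 10 feature columns, and one target value per row.
print(diabetes_X.shape)
print(diabetes_y.shape)
```

Printing the shapes right after loading is a cheap way to confirm that X and y have the same number of rows before any splitting is done.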
I'm getting 442 values in my target. Now, it is in our hands how many features we want to use. In case you want to drop a feature, you can drop it; it is completely your call. I am going to use all the features rather than dropping any; I don't really see any sense in dropping any columns.

Moving forward, I'm splitting the data into training and test sets. Rather than using train_test_split, I'm doing it manually, in this way. This is, you can say, the plain way, but I strongly recommend that you go for from sklearn.model_selection import train_test_split. It will make your life easy and it saves a lot of time too. You can simply import it and use it. You can copy these steps or use your own train_test_split, whichever way you want; in the case of train_test_split you would set the test size to 0.2.

Once you're done with the splitting part, you move forward. As you can see, I'm making X train and X test, and then y train and y test. X train has y train as its output and X test has y test as its output. If we check the size, y test and X test hold 20 values, and that is what we are getting. Let's run this cell and move on.

I can also remove this and use train_test_split instead; it's up to me. If I use train_test_split, I just need to write it in the next cell. I hope you already know this drill: you keep the variables X train, X test, y train and y test, you pass X and y, you set the test size to 20 percent, and the random state can be anything; that is how we control it. Since I'm not using it, I'm not going to keep the cell like this; I will make it a comment to avoid confusion. Here we are splitting manually, and that is what you need to keep in mind.

Once you've done the splitting, you can check the shape of everything: diabetes X train, diabetes y train, X test and y test. Let me copy this line, make it y train, then X test (the X is capital) and y test. Now I have the shapes, so you get the idea: X train is this and X test is this, and the dependent column has 422 and 20 values. The sizes match, and if the sizes are correct there will be no issue.

Next I'm just importing linear_model.LinearRegression. Once you're done with this, you create an object of it, which is why I'm creating regr. You can call it whichever name you want; even if you call it lr, that is okay, and that is the name we usually use. lr stands for linear regression. Of course, if I change the name, I have to change it in every cell; that is one thing I need to mind. From now on it is lr.

Once I'm done changing the name, I fit it: lr.fit with diabetes X train and diabetes y train. As you can see, I'm getting an output, which means the fitting part is done. The name will be lr here again: lr.predict, and I predict the values for diabetes X test. Let me print the predicted values; this is the output I'm getting.

This will be lr too. These are the coefficients. After fitting, a coefficient is the value that gets multiplied in the equation, and whatever coefficients you have get printed here. This is really not required; you use it when you want to get insight into how the model is really solving everything. We don't frequently go for the coefficients. If you talk about actual projects, we check the null values, we check the categorical values, and once we are done with that we start modeling. Once we are done with the modeling part, we move forward and check the accuracy and so on; the coefficients never really come into the picture at that point. Moving forward, let me print the results with format so they are easier to understand.
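The split, fit, predict and coefficient steps above might look roughly like this. It is a sketch under assumptions: the manual split is taken to be "last 20 rows for testing" (which matches the 422 and 20 shapes mentioned), the variable names are guessed from the narration, and random_state=42 in the alternative split is an arbitrary illustrative value.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Manual split (assumed): everything except the last 20 rows for training,
# the last 20 rows for testing. All 10 features are kept.
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]
print(diabetes_X_train.shape, diabetes_X_test.shape)
print(diabetes_y_train.shape, diabetes_y_test.shape)

# Recommended alternative: let scikit-learn do a shuffled 80/20 split.
# These variables are only here to show the call; they are not used below.
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes_X, diabetes_y, test_size=0.2, random_state=42
)

# Create the model object, fit it on the training part, predict on the test part.
lr = linear_model.LinearRegression()
lr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = lr.predict(diabetes_X_test)
print(diabetes_y_pred)

# One coefficient per feature: the values multiplied with each column.
print(lr.coef_)
```

The manual slice keeps the original row order, while train_test_split shuffles before splitting, so the two approaches give different test sets and therefore different scores.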
Let me put it like this: I delete the old part and add .format with a bracket here. We have one, two, three brackets open and only two closed, so we need a third to close it. If you run this, you get the mean squared error value. So I get an idea: this is my mean squared error, computed from y test and y predict. This value usually goes down after training, and that is what we expect.

In the second cell I am printing with format again. I am using r2_score, passing diabetes y test and diabetes y predict, and printing the value for the score. Once the score and the mean squared error are printed, I move forward to the last cell that I'm going to write. Again, guys, we are getting an R2 value of 58, which is not a very decent score, but you can say it's an okay score. A score around 50 is not very decent; I usually expect at least a 75 score for a decent model.
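A sketch of the two metric cells, again assuming the last-20-rows split and the variable names used in the narration. The hand-written versions at the end are an addition for illustration: they show that mean squared error is the average of the squared residuals and that R2 compares those residuals with the spread of the test targets around their mean.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]

lr = linear_model.LinearRegression().fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = lr.predict(diabetes_X_test)

# Library metrics, printed with str.format as in the walkthrough.
mse = mean_squared_error(diabetes_y_test, diabetes_y_pred)
r2 = r2_score(diabetes_y_test, diabetes_y_pred)
print("Mean squared error: {:.2f}".format(mse))
print("R2 score: {:.2f}".format(r2))

# The same two numbers computed by hand (illustrative addition).
residuals = diabetes_y_test - diabetes_y_pred
manual_mse = np.mean(residuals ** 2)
total_spread = np.sum((diabetes_y_test - diabetes_y_test.mean()) ** 2)
manual_r2 = 1.0 - np.sum(residuals ** 2) / total_spread
print("By hand: {:.2f} and {:.2f}".format(manual_mse, manual_r2))
```

Because the mean squared error is in squared units of the target, it is hard to judge in isolation, which is the practical reason the R2 score is easier to read.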
When the model is good, you can expect an even better score. Right now, as we have not done this in a very careful manner and have not really worried about the values themselves, we are getting a slightly low score, which is okay at the moment. We can live with this, no issue.

Moving forward, we need to plot. To plot, we already have matplotlib in scope, so we will use the same library; matplotlib.pyplot is imported as plt. In the first line I'm using a scatter plot. Anyway, plotting is just for us and is not required: you can literally end the project in the cell above, where you printed the coefficients, and you don't really need to plot. That is one thing. Now, the scatter plot takes diabetes X test and diabetes y test, and the color is black. For the second plot I'm using a line, so I'm using plt.plot and giving it X test and y predict. Notice I'm changing the value to y predict here: it is not y test, it is y pred this time. It will create a line that takes its values from the prediction, that is, what the model has predicted for the coming output. Once you're done with these two steps, you just show it. The xticks and yticks calls are used where you have to put names on the axes; we are not using them anywhere, so you can say it's a removable step, and we get the output like that. Let me also remove this, as it is not required, and if I
run this, okay, it is showing some error. Let me check: x and y must be the same size, and it is saying the size is not the same. Let me check whether I missed a name somewhere. lr, okay. In that case this should not be giving this error; I'm not really sure why. So x and y must be of the same size; let me check the error, one second. I'm giving the correct values, I'm not doubting the values: I have to give one X test and one y test, and this other value is not required anyway. Which line is giving me the error? It is the scatter plot part, the first line, where we are giving X test and y test. It is saying x and y must be the same size, and that they are not the same size, so let me check the size first. If I am giving diabetes X test and diabetes y test, both should be of the same size. Let me print the values; that will save my time rather than hunting like this. I'm just going to print these two shapes, with some space in between so I can tell them apart. [Music]

Okay, so it's 20, 10 and 20. The scatter plot is getting decent values; I don't think there is an issue with the values, and diabetes X is also correct. Let me rerun the cells; I think there is some issue there. Let me restart, so all variables are lost now, and take it from the top, starting with the first cell. Okay, I'm getting the same error, so there is definitely some issue: x and y must have the same size, which is not the case. This is my X and y with their shapes; the number of rows is equal, which is what we need to check, and I don't see any issue.
I must have mixed up a value somewhere, so let me check. Let me also print these two shapes, and I also want to check the pred part. Yes, the shape is okay, so this should not be an issue; the size is correct. [Music]

So, getting back to the code, guys: what was I missing there? I took a pause and checked the code, and I was missing this part. The main thing is that if we are just going for linear regression, it was okay, but for plotting we need a single feature axis. So don't forget to notice this second axis: the shape should be 20 comma 1, and only when you have this will you be able to plot it. With that in place, if you run it you will get the plot. That's the only thing that was missing; apart from that everything was okay.

So I hope you get the idea with the plotting. The one thing we missed is that we have to keep it to one axis; we cannot give multiple axes in a single plot. We are giving an x axis and a y axis, basically. You cannot give multiple values per point for the y axis, and you cannot do that for the x axis either. If you give multiple values for an axis, it is hard to plot; it is not practical, and that is why we were having that issue. I hope it is resolved and clear.

This is the plot you are getting, and this is all for the code. We will discuss further what regression can do for us and what we can do with the coding. Take a moment to get the run part done. You can pause, do the coding and the plotting first, and then we can move forward. So I hope you get the idea.
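The fix described above (an X with shape 20 comma 1 instead of 20 comma 10) can be sketched like this. Which column was kept is not stated in the video, so index 2 is purely an assumption for illustration, as are the variable names and the output file name. The Agg backend and savefig are used only so the sketch runs without a display; in a notebook you would call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Keep a single feature but preserve the second axis: shape (442, 1).
# Column index 2 is an assumed choice, not confirmed by the video.
diabetes_X_one = diabetes_X[:, np.newaxis, 2]

X_train, X_test = diabetes_X_one[:-20], diabetes_X_one[-20:]
y_train, y_test = diabetes_y[:-20], diabetes_y[-20:]

lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# One x value per y value now, so scatter and plot accept the arrays.
print(X_test.shape, y_test.shape, y_pred.shape)

plt.scatter(X_test, y_test, color="black")           # actual test points
plt.plot(X_test, y_pred, color="blue", linewidth=3)  # fitted line
plt.savefig("diabetes_fit.png")                      # hypothetical file name
```

With all 10 features, X test has 200 values against 20 in y test, which is what triggers the "x and y must be the same size" error from the scatter call.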
As we have discussed up to this point, regression works in this fashion. We are plotting the data set, basically, and we create a best-fit line through it. Just repeating what we have done up to this point: we collected a data set and we fit our model. Fitting means only this: think of it as plotting the points. We give this data set to our model and it fits, and once the fitting is done, this best-fit line is found.

Now we will talk about the procedure, about what is going on at the back end. We understood the mathematics and we understood the coding part, so how to code and implement regression is clear up to this point. Now we will discuss how things really go on. Let me put it here, one second.

All right, I'm taking any one random point. This is the point I'm focusing on, the one I'm going to discuss. We will calculate the distance of this point from the line that we have plotted. Now, this is my line. The first question that should be asked is where this line comes from and where it starts. Do not think that it starts from this point, no. You see this x axis: the line starts parallel to it. I'm just drawing a rough line here, and this line keeps moving.

Question number two: how does it start moving? It is the angle it has; this angle keeps changing, and this angle is represented by m. The convergence that we discussed is in this angle. As the angle changes, the line keeps moving, and it keeps moving in a fashion that reduces the loss. Now what is the loss? The error, the residue: the distance of each and every point from this line. So we calculate the distance of every point from the line. At this moment the line is here, so we calculate the distance from this line. Sorry, the line goes a little up; it is up to this point only. So we take the distance from this line: how far away each point is. And the line keeps moving down, and so on, until it reaches a location where the total residue is smallest. Now,
guys, what is the residue? We add up the complete distance and call it the loss. Loss, residue: it's all the same thing. So we can calculate this residue, and it keeps getting smaller and smaller. Initially, at the starting position, this value is extremely high. It starts dropping as the line starts moving, as we change the value of m, and it keeps dropping until we reach a location that is called the global minima. Let me move this, one second.

All right, so until it reaches the global minima. The global minima is basically just a location at the center. As the line keeps moving, the error traces out this curve, this parabolic curve, and this curve has a global minimum point. Global minimum means the error keeps reducing till you reach this point, and the moment the error is minimum is your global minima. How do you find this point? The moment you cross it, the error starts increasing again. So you come one step back and declare this point as the global minima. This is called the global minima, and the process is called gradient descent. We have already discussed how to find it and how to reach it with gradient descent.
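The picture above (a flat line whose slope m keeps changing while the loss drops towards a minimum) can be illustrated with a small hand-written loop. This is only a sketch of the idea on made-up data: the true slope of 3, the noise level, the learning rate of 0.1 and the 200 iterations are all assumptions, and the loss is taken to be the mean squared distance. Note that scikit-learn's LinearRegression solves the least-squares problem directly rather than with a loop like this, so the sketch illustrates the concept, not the library's internals.

```python
import numpy as np

# Made-up data: points scattered around a line through the origin with slope 3.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + rng.normal(0.0, 0.1, size=50)

def loss(m):
    # Mean squared distance between the points and the line y = m * x.
    return np.mean((y - m * x) ** 2)

def gradient(m):
    # Derivative of the loss with respect to the slope m.
    return -2.0 * np.mean(x * (y - m * x))

m = 0.0               # start flat, parallel to the x axis
learning_rate = 0.1   # assumed value
losses = []
for _ in range(200):
    losses.append(loss(m))
    m = m - learning_rate * gradient(m)  # new m from old m

print("final slope:", m)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Plotting the recorded losses against the slope would trace the parabolic curve described in the video, with its lowest point at the best-fit slope.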
This is how the work goes on at the back end. We have seen the implementation in the coding, and this is how it is solved at the back end. Once it reaches the global minima, it declares the best-fit line, and that is the result we get after fitting. Then what do we do? We go for prediction.

Now let's talk about what really happens in the prediction part. Let me clean all this and draw two lines. Suppose this was our training data set; this is my training part. And this is my test part; test part, validation part, whatever you want to call it at the moment is okay. We will test our model with this. We give these points to the model and extend the best-fit line. The moment we extend the best-fit line, we calculate the distance. This error should be very small; if it is, the model is performing well. Then we check how much score we have. The R2 score basically tells us how well the model is performing when you check the residue on the test part. If the model is performing well, this score will be high on the 0 to 100 scale, or the 0 to 1 scale: it can be, say, 0.95, and if it is 0.95, your model is performing really well.

Now we move forward to how the prediction goes in the future, that is, the actual testing. Suppose this was validation. When you go for the actual testing, the model predicts that the future points will come here and here. That is how prediction happens in linear regression. This is the way the model gets trained and tested, and then we do the main testing part. The testing we do during training is called validation, so we divide the data set into three parts: train, test and validation. That is how the process goes on. Let me clean this.
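One common way to get the three parts mentioned above is to call train_test_split twice. The sketch below is an assumption-laden illustration: the roughly 70/15/15 ratios and random_state=0 are arbitrary choices, not something stated in the video.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# First hold out 30% of the rows, then cut that held-out part in half:
# roughly 70% train, 15% validation, 15% test (assumed ratios).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

model = linear_model.LinearRegression().fit(X_train, y_train)

# Validation score: checked while building the model.
print("validation R2:", r2_score(y_val, model.predict(X_val)))
# Test score: the final check on rows the model has never been tuned against.
print("test R2:", r2_score(y_test, model.predict(X_test)))
```

Keeping the test rows untouched until the very end is what makes the final score an honest estimate of how the model behaves on future points.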
I hope the logic is clear on this. Moving further, we talk about residue and total error. When we talk about residue, we are calculating this distance from each and every point; that is why I have taken a very small number of dots, so it is easy to explain. Every distance is calculated. If you think logically, the initial distance will be very high, and as the line moves closer to the points, the error, the residue, will go down; that is quite obvious. The moment the line crosses the best-fit position, these points will definitely get closer, but other points will start getting further away, and that is where the error increases. The line can be fitted from any angle; it is not just one particular angle.

This m value is changed by the convergence theory. We only discussed the mathematics for this: I'm calling them old m and new m. From the old m we subtract the learning rate multiplied by the derivative of the loss, that is, the error you are getting, with respect to m. Once you subtract this value from the old m, you get the new m. This new m is applied every single time: we keep finding a new m, the line keeps moving, and we reach the best-fit line location. So I hope it makes sense how the residue really works and how the work goes on.

Moving forward, there is a scenario that we will discuss again. We will now discuss these two points in depth, as we went through them quite fast last time: what is the issue when lambda, the learning rate, is small, and what is the issue when it is big? We are considering this because of what usually happens when the line is moving, as we discussed.
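Written out, the update described above would look roughly like this. The symbols are partly assumptions: the video names old m, new m and the learning rate lambda, while the mean-squared form of the loss and the intercept c are filled in here for illustration.

```latex
L(m) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + c) \bigr)^2
\qquad
m_{\text{new}} = m_{\text{old}} - \lambda \, \frac{\partial L}{\partial m}
```

The size of each step is the learning rate times the slope of the loss curve, so the same formula gives large moves far from the minimum and small moves close to it.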
I hope you understand why the global minima point is important: it gives us the best-fit line location, and it is the area we most want to reach. To reach this point, we do the fit, and when we do the fit, the line starts like this and starts moving in the downward direction. The m value changes every iteration: at every iteration this formula is processed and gives a new value for m, and the line keeps moving.

Now think about it. Suppose it is taking very small steps, say baby steps, and the lambda value is very small, say 0.000001, five or six zeros and then a one. That means the value we are going to multiply by it will be very small too, so the new m will not be very far from the old m. The lambda value is small and the steps are small. It will definitely reach the global minima someday, but it will take forever; an extreme amount of time is required to reach the global minima with this step size.

Let me give you an example, guys. Suppose you want to reach a market that is 10 kilometers away from you, and you decide to go by walking, taking baby steps. This is the market location, and you are standing somewhere here, and you want to go in this direction. You start walking in baby steps, half a foot at a time. If you keep taking baby steps, it might take a very long time, but you will reach the market; that is a sure thing. But again, guys, it might take you three days just to cover 10 kilometers. That is a lot of time, and you definitely don't want to give three days just to reach a market. Suppose you just want to eat something small; you don't want to waste that kind of time. So how can you improve that?

Second scenario: suppose I want to increase my speed and I decide to take an airplane to reach there. Taking an airplane might sound like a very good idea at the moment, but it is not. A plane will take you at an extremely high speed. If the market is just 10 kilometers away, the plane will take off and will not come down in time, because 10 kilometers is barely a runway path. It will take off, come down, and definitely cross the market, so you will end up far away from it on the other side. What will you do at this moment? You badly want to reach the market, so you take the plane again and come back, and suppose you land at a location that is eight kilometers away. You take off again and land somewhere past the market, say five kilometers away. You come back again and try to get closer, and you land, say, seven kilometers away. I hope you get the idea: you will keep jumping back and forth forever, guys, and you might never reach the market.

So taking a plane is not a good idea, and walking with baby steps is not a good idea. What is the conclusion? What can you really do in such a scenario? The best way that has been presented is to take something that is not too slow and not too fast, say a car. A car can be a very good option to reach this market. It will save you time.
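The three scenarios (baby steps, the plane, the car) can be tried out on a toy loss. This is a sketch with made-up numbers: the loss (m - 3)^2, the starting point m = 0, the 50 steps and the three learning rates are all assumptions chosen only to make the behaviour visible.

```python
def gradient(m, target=3.0):
    # Derivative of the toy loss (m - target) ** 2 with respect to m.
    return 2.0 * (m - target)

def descend(learning_rate, steps=50, m=0.0):
    # Repeat the update: new m = old m - learning rate * gradient.
    for _ in range(steps):
        m = m - learning_rate * gradient(m)
    return m

baby_steps = descend(0.000001)  # tiny learning rate: barely moves in 50 steps
plane = descend(1.1)            # too large: overshoots further on every jump
car = descend(0.1)              # moderate: settles close to the minimum at 3

print("baby steps:", baby_steps)
print("plane:", plane)
print("car:", car)
```

With the moderate rate the steps also shrink on their own as the gradient gets smaller near the minimum, which is the "drive most of the way, then walk" behaviour described next.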
You will reach there on time, at a good speed, and it will save a lot of time for you. So I'm using a car to reach the market, and now I will reach the market for sure, at a decent speed. But again, guys, there is another concept that has been introduced. Linear regression does not really go very fast, and it does not go at a constant speed either, because we want to save some time. If you drive all the way to the market, you might not get parking, and you might have to park somewhere else, maybe a little away from the market, and then walk. So to save time, you can cover, say, 9.5 kilometers by car and the remaining 0.5 kilometers by walking. This ensures that you reach the market in the most optimal way: you drive for 9.5 kilometers, park the car, and then start walking. Even if you are taking baby steps, you will reach there in time. That saves time, it is the most optimal way, and you will never miss the global minima; missing the global minima is not a practical outcome. So this is what is happening: it takes leap steps initially, then takes the minimum steps, and reaches the global minima at the end. I hope this example makes sense to you. [Music]

This is what linear regression is doing at the back end. I hope it is clearer now why not to keep lambda very small and why not to keep lambda very high. It is written as alpha at the moment; usually we denote the learning rate with lambda. Lambda and alpha both mean the same thing, the learning rate. So, guys, whether the learning rate is very small or very large, it will cause an issue. We don't want that, so we avoid it. That is how gradient descent and linear regression work together. Let me clean this and move forward.

In the coming time we will be discussing logistic regression. It belongs to the regression family only, so a little bit of introduction is required. I'm just introducing it here; I will cover it properly in the next class, where we will again do a little bit of practical and a little bit of theory and enhance our knowledge. In logistic regression we again have regression, because it belongs to the regression family. It uses the S curve and it is a binary classification. The reason it is called binary classification is that it decides between two classes, zero or one. We will discuss in more depth why these two classes. Again, there are several extensions of linear regression and logistic regression too, which you can say come under advanced machine learning. Once we are done with linear regression and logistic regression, we will come back and discuss those methods too. So in linear regression you have, you can say, L1 and L2, which
is a regularization technique. Now, we have to understand what a regularization technique is, why we even use it in the first place, and what the use of it is. We will discuss that clearly. Regularization is very important, especially in the deep learning part, where we will discuss other techniques such as dropout and other topics; let's not talk about that at the moment.

Coming back to the topic: the sigmoid function formula is 1 by 1 plus e to the power minus z, and this function takes values from 0 to 1. So you have zero or you have one, and it is decided by the p value: 0.5 is, you can say, the threshold. Above it the class is one; below it or at the same level the class is zero. This is how logistic regression really works.

This is all for the regression part. In the coming video we will discuss logistic regression and its implementation, and once we are done with that we will come back again and discuss in even more depth what we can do with regression. This is all from my side today. Thank you, guys, thank you for your time. [Music]
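The sigmoid formula and the 0.5 threshold mentioned above can be sketched in a few lines. The function names and the sample z values are made up for illustration; following the description in the video, a value exactly at the threshold is assigned class zero.

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)): squashes any real number into the range 0 to 1.
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    # Class 1 above the threshold, class 0 at or below it.
    return (sigmoid(z) > threshold).astype(int)

z = np.array([-2.0, 0.0, 2.0])  # made-up inputs
print(sigmoid(z))
print(predict_class(z))
```

Plotting sigmoid(z) over a range of z values gives the S curve that logistic regression uses to turn a score into a probability.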
