Welcome to today's workshop on three must-have projects for your data science portfolio. One of the most common questions we get asked is: which projects should I work on? Typically this comes from two perspectives. First, it is from the perspective of learning the right skills. Second, it is from the perspective of building a good portfolio, because before you apply for an internship or a job in data science, you will first have to create a strong resume and a strong portfolio. And by portfolio, I mean a list of projects that you have put up somewhere publicly: on your GitHub profile, on your Jovian profile, on a blog, or on your own personal website. So today I want to cover three must-have projects that you should do before you try and apply for a data science role, specifically if you're looking to get into machine learning. I will also touch briefly on what you should do if you're looking for more of a data analyst kind of role; in that case, maybe you do not need to do all three of these. Maybe the first two might be sufficient, and the third one you might want to substitute with something else. Now, if you have to pick three projects to showcase on your portfolio and on your resume, these are the three that I would recommend. The first one would be exploratory data analysis and visualization, and we'll talk about it in more detail. And then the second one should
be a classical machine learning problem, on tabular data. By classical machine learning, we mean algorithms that came before deep learning, so roughly pre-2012; these are also sometimes called shallow machine learning techniques. A distinction is also sometimes made between structured and unstructured data: structured data typically refers to tabular data, whereas unstructured data typically refers to things like images, text, audio, and video, where deep learning is more applicable and more widely used. And then the third one should be a deep
learning project, because today deep learning is so prevalent that in pretty much any machine learning work you do, at some point you will want to at least try deep learning. Neural networks work well not just on unstructured data like images and text, but also really well on tabular data. So these are the three projects that you should do, and this should in some sense be your goal
when you're getting started. Let's say you have a six-month timeline for learning: try to make sure that at the end of six months you have one project, or maybe more than one, in each of these categories in your portfolio. Then, when you start preparing your resume and applying for jobs, you have these to showcase, both to have a better chance of getting shortlisted and to have something to talk about during interviews. And of course, doing these projects cultivates the skills you will need to fulfill your job responsibilities properly. Okay. So the first project that we are
going to look at is exploratory data analysis and visualization. This is probably the first thing that you learn; if you've taken our Zero to Pandas course on data analysis with Python, you'll know exactly what I'm talking about. The basic idea here is that you first need to go ahead and find a real-world dataset of your choice online, and I'll show you how to find datasets as well. So you find a dataset that you find interesting; maybe there's a topic of interest to you. Just make sure that it is a large enough dataset, that it has enough variety, and that it has enough information in it so that you can do more than just one or two graphs or one or two questions and answers. It is also ideal if you find a dataset which has some similarity to what you might see in the real world, where maybe there is some missing data, maybe there is some incorrect data, maybe there are some things that you need to clean up. For example, there is a
column called date, but the date is actually written as text, and you need to convert it into a proper date and maybe break out things like month, year, and day of the week.
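As a rough sketch of that conversion (the column and values here are made up for illustration), parsing a text date column with pandas might look like:

```python
import pandas as pd

# Hypothetical example: a "date" column stored as plain text
df = pd.DataFrame({"date": ["2021-01-15", "2021-02-03", "2021-02-28"]})

# Convert the text column into proper datetime values
df["date"] = pd.to_datetime(df["date"])

# Break out useful parts: year, month, day of the week
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.day_name()
```

Once the column is a real datetime, all of pandas' `.dt` accessors become available for exactly this kind of breakdown.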
The more you can showcase your skills at parsing a dataset (how to read a CSV file, how to clean the dataset, how to look for missing values, how to look for incorrect values, how to fill in missing values), the better your project gets. Because then you are, first, learning these skills, and second, showcasing to a potential employer that you know how to clean data and how to work with messy data. And then also use Matplotlib and Seaborn to create visualizations. And again, if you've done the Zero
to Pandas course, you know we had a full lecture dedicated to the different types of visualizations you can do. You definitely want to showcase that you understand the different kinds of graphs that can be created, that you understand when they are used, and that you understand when they should not be used. Picking the appropriate visualization for the appropriate use case is very important. You also need to be able to showcase your skills in asking and answering interesting questions about the data. This is one of the key things: the purpose of data analysis is to get insights from data, and to get insights you need to first be able to ask good questions. Okay. And finally, presentation is a very
important part of data analysis. If possible, at least add some documentation in your Jupyter notebook, in your GitHub repository, or in your blog post, and showcase it in a way that is pleasant for somebody reading. Somebody should feel interested just by looking at the title, by reading the first few lines, and by looking at the first few graphs. So try to make it engaging. So this is the first project,
and now we'll look at some examples. Once again, if you have any questions at any point, please feel free to ask. And as I'm doing this, feel free to also follow along and maybe open up these links. I have posted a link to this Dropbox Paper document in the Slack channel, but I will also post it here in the chat. So as I'm going through them, feel free to look through it yourself too. So what we'll do is we'll probably
look at a few example projects first. Then we'll talk about how you can acquire the skills that you need to do these projects, and finally we'll also look at some ways you can find interesting datasets. Again, this is something that a lot of people have asked: how do I find a good dataset to work on? So here are some examples; I've picked three. One is called "WhatsApp Data Exploratory Data Analysis", so let me open this up. This was a course project done by Michael, one of the participants
in the Zero to Pandas course. The idea here is that the dataset Michael is using is his own WhatsApp data. On WhatsApp, you have the option to export your data, and he has mentioned where you can get this information: you go into Settings, and you have an option of exporting chats. You can either export the chat from a specific group, maybe a group that you've been part of for several years where you have tens of thousands of messages, or you can export your entire WhatsApp data as well. Then the first thing is to import the dataset, and you can see here there is a text file, chat.txt, which can be imported like this using pandas. So Michael reads the data and shows
that it is a data frame. The next step is to present your understanding of the data. For example, here Michael has described that there are three columns in the dataset: date, text, and a null-value column. It turns out that there are around 23,000 rows of data, so there are 23,000 chat messages here, and there are also some unknown values, some NaNs here and there, that may need to be cleaned up. The next step after
that is to clean the data. There are cases where a chat is just an image, and we want to analyze textual data, not image data. So Michael has done some cleaning here: he has written some code to drop all the chats that refer to an image message being sent, which carry the placeholder text "Media omitted".
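A minimal sketch of that filtering step, assuming the export marks media messages with the placeholder text "&lt;Media omitted&gt;" as typical WhatsApp exports do (the column names here are made up):

```python
import pandas as pd

# Hypothetical chat data; media messages appear as "<Media omitted>"
df = pd.DataFrame({
    "user": ["alice", "bob", "alice"],
    "text": ["hello!", "<Media omitted>", "see you soon"],
})

# Keep only rows that are actual text messages
df = df[df["text"] != "<Media omitted>"].reset_index(drop=True)
```

Boolean indexing like this is the standard pandas way to drop rows matching a condition.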
We will not get into the details yet, but this is the idea: you look into the dataset and you understand that some cleaning is required. Maybe you need to remove some rows; maybe you need to fill in some missing values. And you need to not only do that activity, but also document it. If you go through this post, and I hope you are following along on your own, you will see how Michael has explained really well which rows he is going to exclude and why he has chosen to do that. Giving a rationale for your data cleaning and your missing-data imputation is pretty important, and only then do you get started with exploratory analysis. There are many different
ways of doing exploratory data analysis. Some people just like to take each column and create graphs for it; other people like to start with questions. This is totally up to you, because whichever way you start, you will end up doing pretty much all the analysis one way or the other. In this case, for example, Michael has started by asking questions. He started with: which users have the most chat messages in the group? This is also a useful skill. Given a dataset, say the WhatsApp chat history of a particular group
, maybe the first thing you should try to do is think about the questions you would ask. You can start with obvious questions. How many total messages are there in the dataset? (That's something already clear from looking at the data frame.) Which users have the most messages in the group? What is the number of messages per user? Who was the earliest member of the group, and who was the last? In what order did the members join? (This is something you can analyze based on the first message sent by each user.) There are a lot of interesting questions you can ask, and the idea is that you ask a question and then figure out how to answer it using pandas. So, for instance, for which users
how you're going to do that using finders. So for instance, for which users
have the most chat messages in the group, Michael h
as said that we can
use panders to first group the data. So he's using a group by here. We can use finders to first group the
data, and then he's going to count, or he's going to group by user, and then he's
going to count the number of messages. And then he's going to
sort it in ascending order. So here you can see this is the, order. There are three people in this
group and it turns out that COVID has sent over 10,000 images. And you can also visualize this. So this is in tabular format, but t
he
same thing can be visualized here. So here, Michael has chosen to
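That group-count-sort pipeline might be sketched like this (the column names are assumptions, not the actual code from the post):

```python
import pandas as pd

# Hypothetical chat data frame, as in the WhatsApp example
df = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol", "alice"],
    "text": ["hi", "hello", "how are you?", "hey", "bye"],
})

# Group by user, count messages, and sort by count
message_counts = (
    df.groupby("user")["text"]
      .count()
      .sort_values(ascending=False)
)
```

The resulting Series maps each user to their message count, with the most active user first, ready to plot with `message_counts.plot.bar()`.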
Here, Michael has chosen to show this using a line graph. I would say that maybe a bar chart would also be a good choice here: showing three bars side by side might be more informative. And so on. Then you have question two: which emojis are the most used? You can see here that the most used are the laughing emoji, the laughing-with-head-turned emoji, and the strong-arm emoji, and here a pie chart is a good choice. Then you have the distribution of emojis per user, and how that differs. Then: most active hours on WhatsApp. Here we are looking at which time of day people are most active in terms of sending messages, and it turns out that most of the messages are sent between around 1:00 PM and 5:00 PM.
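Counting emoji usage can be sketched with Python's `collections.Counter`; a real project might use the `emoji` package to detect emoji characters, but here a hand-picked set (an assumption for illustration) keeps the sketch self-contained:

```python
from collections import Counter

# Hypothetical messages; a real project would detect emojis
# programmatically rather than with a hand-picked set.
messages = ["good one 😂", "😂😂 nice", "💪 let's go", "see you 😂"]
emoji_set = {"😂", "💪"}

# Count each emoji occurrence across all messages
counts = Counter(ch for msg in messages for ch in msg if ch in emoji_set)

# counts.most_common(1) gives the single most-used emoji
```

The same `Counter` output feeds directly into a pie chart via Matplotlib.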
Then there are questions four and five, and at the end maybe some inferences as well. Here a word cloud is again a very interesting thing to showcase. So this is an exploratory data analysis project, and that's it. It's not very difficult. The skills you need for this are basic Python programming and an understanding of pandas, Matplotlib, and Seaborn. Once you have these, it's simply a
matter of picking a dataset, and you really don't have to spend too much time picking one. Start with the first thing that you can find online (I'll talk about how to find datasets) and just start exploring. There's no right way to do it: just start messing around with the data, start drawing graphs. And when you feel confident, maybe you want to go back and ask: do I need a bigger dataset? I want to showcase cleaning, I want to showcase analysis, I want to showcase maybe merging multiple datasets, I want to showcase visualizations, I want to showcase my presentation skills. Does the dataset I've just analyzed satisfy all these criteria? If not, then maybe go back and find another dataset: a large enough dataset, a complex enough dataset. And build your project. And once you build your project,
for example, Michael has written a blog post, and a blog post is a great way to showcase your project. So you can create a blog post, or you can just put all of this information into a Jupyter notebook and put it up on GitHub or on Jovian. You can see here the Jupyter notebook that Michael was using. Now, a Jupyter notebook is good for a technical audience, but a blog post is often better when you are presenting it to, let's say, potential employers, because in a Jupyter notebook there is a lot of code that is unnecessary for them, things like helper functions and imports. What you can do is, when you're
writing a blog post, remove a lot of those things and talk more about the insights, more about the dataset, and more about the inferences that you are drawing, which is really the whole purpose of data analysis. The purpose of data analysis is not to write code; it is to gather insights. And that's why putting your code in the Jupyter notebook and then putting the results, maybe just the outputs, in a blog post is a good idea: it's like a report for your project. Okay. So that's one example project:
WhatsApp data analysis, done on your own personal data. So this is something that all of you can do, and you can all ask different questions. Try asking some other interesting questions you can think of: maybe try picking a big group rather than a small group, or maybe try picking a personal conversation. And if you want to try this on Messenger, you can probably export your data from there as well; and if you want to try it on Snapchat
or something, you can do that as well. So that was one example, and then there are a couple more here. There is one called "Analyzing Browsing Patterns Using Pandas". This is another very interesting blog post that I highly recommend checking out, by one of our mentors, Karthik. What Karthik did was use a Google service called Google Takeout: you can go to takeout.google.com and from there download your entire browsing history, if you have been using Google Chrome and you have been logged in. You can download it as a JSON file, and this is what it looks like: you have the time at which you visited a page, the title of the page, and the URL of the page. It's a very simple dataset,
but it is a very personal dataset, so there are a lot of interesting insights that you can gather from it. What Karthik has then done is take the single timestamp column, the number of microseconds since the epoch (this is the UTC time), and extract things like the year, the month, and the date out of it. That shows data processing skills: taking this microsecond time and converting it into date and month values. Similarly, he has parsed the URL and captured the root domain out of it, because maybe you want to know how much time you have spent on Gmail, how much time you have spent on Facebook, or how much activity you have done on Twitter, for instance. That's where you can see there is a get-domain function, a get-day-of-the-week function, and a convert-time function. So this is another thing the project showcases: how do you start with raw data and process it into more structured data?
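A sketch of that kind of processing; the function name `get_domain` mirrors the post's description, but everything else here is an assumption:

```python
from urllib.parse import urlparse
import pandas as pd

def get_domain(url: str) -> str:
    """Extract the domain (network location) from a URL."""
    return urlparse(url).netloc

# Hypothetical Takeout-style record: visit time in microseconds
# since the epoch (UTC), plus the visited URL
df = pd.DataFrame({
    "time_usec": [1609459200000000],  # 2021-01-01 00:00:00 UTC
    "url": ["https://mail.google.com/mail/u/0/"],
})

# Convert the microsecond timestamp and derive features
df["visited_at"] = pd.to_datetime(df["time_usec"], unit="us", utc=True)
df["day_of_week"] = df["visited_at"].dt.day_name()
df["domain"] = df["url"].apply(get_domain)
```

The `unit="us"` argument tells pandas the integers are microseconds, which is how Takeout stores visit times.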
After doing all of these things, you can see that he starts to do visualizations. For example, one very simple thing he is checking is how many of the visited websites are secure versus unsecured, meaning HTTP versus HTTPS. Then he analyzes weekday versus weekend browser usage. And here is another very interesting one: a heat map, showing, on each day
of a particular month, I believe, his usage throughout the day. Dark means a lot of usage and light means very little usage. You can see that he spends a lot of time browsing between 10:00 AM and 2:00 PM, probably for work, and then some time at night from 7:00 PM to 11:00 PM, probably for leisure. Okay. So there's a lot you can do here, and this is barely scratching the surface. He also had many questions about Stack Overflow, so he tried to look up the most common Stack Overflow questions he was asking, and it turns out that in his case they are actually related to Python, and so on. And then there's a word cloud, et cetera
, et cetera. So you can check out the third one here as well. The third one is a more standard kind of analysis, on a dataset that is publicly available. It's not something that you have to gather, and it's not a personal dataset; this is most likely what your first project is going to look like, where you find a dataset from some online source, such as Kaggle. So do check out the third one as well. This is analyzing the Google Play Store. Let's look down: how do the app ratings vary? So this is data of all the
apps on the Google Play Store. You can see that ratings are approximately in the range of 4.0 to 4.4; anything more than that is an exceptional app, anything less than that is a bad app, and so on; counts of apps in each category, et cetera. Okay. And so now I want to spend some time talking about where you get these skills if you don't have them already. There are a couple of places, and these are very simple, beginner-level skills. One is the Zero to Pandas course. This course is now in self-paced mode, and there are six lessons here. You can work through the lessons; each lesson has a video. You can watch the video, and you
can open up the accompanying Jupyter notebook; each lesson comes with one. You can run the notebook online and make changes. Then there are also three weekly assignments: one for you to practice Python, one for you to practice pandas, and one for visualization with Matplotlib. And finally,
as part of this course, you do a course project on exploratory data analysis. In fact, the final lesson is actually a case study of an exploratory data analysis project. So definitely do this course if you haven't yet; if you are doing some of the other courses, you may want to do this alongside, or maybe before you complete the other course. It is really helpful. And then definitely do a project, and definitely try to write a blog post as well. I'm telling you, a lot of people stop at building the project, but it really helps to write a blog post. I cannot stress it enough. Okay. Another course that I
will recommend is the "Python for Data Science and Machine Learning Bootcamp". This course on Udemy is great; it is how I learned many of these things, so you can do this as well. It comes at a small price, about $10 or so, but I think it's a pretty good course, and it is pretty comprehensive in the things it covers. If you're doing Zero to Pandas, you do not need to do this course, but some people prefer a different style of teaching, so this is another option for you if you do not find Zero to Pandas good
enough for your purposes. Okay. So that was the first project. Moving forward, one thing that I wanted to cover was where to find datasets. I think we've gone over this many times, but I just want to show you once again: go to kaggle.com/datasets. This is probably the best source for datasets. Over here, go to the data tab. It may take some searching, so it may not be very easy to find the right dataset here; sometimes you may have to look for the right keywords, and sometimes you may have to add filters. What I like to do is sort by most votes, and then filter depending on the type of data I'm looking for. For exploratory data analysis, typically what you want is some kind of CSV file, so I select CSV files. And you want at least maybe a few thousand rows of data, in which case the dataset might have to be larger than 10 MB in size. So you may want to just specify that you
want a dataset of size larger than 10 MB; if not 10, maybe try 1 MB. And you can see here, already we're starting to see pretty good datasets. You have US accidents, 3.5 million records; Zomato Bangalore restaurants, where there's a lot of information about each restaurant captured by Zomato; the Netflix Prize data, from a competition organized by Netflix; gun violence data; anime recommendations; artworks; used cars; crimes; Spotify. All of these are great datasets for data analysis. And you can use a library we've created to make it easy for you to download these datasets. All you need to do is install it with "pip install opendatasets", import the library, grab the dataset URL (let's say the Spotify data here), and simply call its download function. When you do that, you will be asked to enter your Kaggle credentials, and you can follow the instructions here: you have to go into your Kaggle account and get a kaggle.json file, et cetera. You can follow that, but
it's pretty straightforward. Okay. How do you get started with the project? One thing you can do is just go to Jovian (jovian.ai), click on New Notebook, choose a blank notebook, and give it a title; let's say "Spotify tracks EDA". I will keep it public, create the file, and click Run. You can run it on any of these options; normally for data analysis I like to use Binder. It'll take a second or two to start up, not long. So you start up Binder, you download the dataset in the Jupyter notebook that opens up, and then you just save it back to your Jovian profile. You can do all of this online. If you want to, you can install things on your own computer, but the more important part is to build the project, so to get started you can just start online. You can use online resources: create the project on Jovian, run it on Binder, run it on Google Colab, run it on Kaggle. I think you get the idea. In any case, I'll keep that aside. So Kaggle is the best place,
but there are also other places. On Kaggle there are also previous competitions: go to Kaggle Competitions and look at the completed competitions. There have been, I think, over a hundred or so competitions on Kaggle, and the competition data is actual real-world data that some company has put together and shared with Kaggle for creating a crowdsourced project. So this is actual real-world data: you will find all the challenges that come with real data, which is missing values, incorrect values, mislabeled data. Sometimes the datasets are very large; sometimes the datasets are quite small, but you have to make predictions on a larger dataset. So this is a great source of datasets, both for data analysis and for machine learning. Which brings us to the second
project that you should be doing. So after you do a project on exploratory data analysis, and you understand how to process a dataset, how to clean a dataset, and how to analyze a dataset, the next thing you should do is try to build a classical machine learning project. By classical machine learning, we mean all the machine learning techniques that came before today's deep learning and neural networks, so probably before 2013 or 2014. This includes
techniques like regression: linear regression, logistic regression, polynomial regression. It includes techniques like k-means; it includes things like decision trees, random forests, and gradient boosting. Those are some supervised learning techniques (k-means being an exception), but you also have unsupervised learning techniques like clustering, and then you have things like collaborative filtering. And if these concepts and terms don't make sense right now, that's okay: I will point you to the right courses where you can learn about these machine learning algorithms and techniques. It's important to cover these techniques because when you start working as a data analyst or as a machine learning practitioner, you will most likely be working with what is called tabular data, which is data that looks like a spreadsheet or a database table. And on tabular data, especially when you have smaller
data sets, it is often these classical machine learning techniques that give good results, or good enough results. The other thing is that these classical machine learning techniques have better explainability compared to deep learning. So although deep learning seems like the hot thing right now, although everybody's talking about it and most papers are coming out around deep learning, deep learning models are largely a black box, and their applications are better suited to unstructured data at the moment. For most real business problems, you will have to do some classical machine learning with the techniques I talked about, and you will have to work on the explainability aspect as well. You should be able to tell why your model is giving the results it gives, especially if you're working in areas like finance, or in anything that is regulated, for example insurance, where we need to explain why our model gives a result. And often you will be asked, by whoever you're presenting to, to explain how your model works. So that's why that is important. And the way to do it, once
again, is to find a dataset online, and for machine learning projects I would suggest that past Kaggle competitions are great. You can see here, for example, Santander customer transaction prediction: as you can probably guess, this is information about customer transactions, and the objective is to identify who will make a transaction. Predicting box office revenue: based on, let's say, the title of the movie, the director of the movie, the budget of the movie, can you predict the box office revenue? That's an interesting problem, and a classical machine learning problem. Let's see, analyzing NFL game data. Then PUBG: can you predict the battle royale finish of PUBG players? So, looking at past matches
and predicting who's going to win a particular match, or at what position somebody is going to finish. These are all good problems. Before you do any modeling, though, you should first be able to understand and describe the modeling objective. For example, you need to be able to tell what type of data it is. Normally for classical machine learning you will be working with tabular data, not with images, but there is still some variation in the data: is it time-series data? Is it just regular database-column data? Is it maybe sensor measurements that have to be interpreted in a certain way? There is always some information that you need to know about the data before you can actually start working on it. You should be able to document that, and you should be able to identify what type of problem it is. I
s it regression? Is it classification? Is it unsupervised? Is it something like a recommendation problem, collaborative filtering, things like that? You need to be able to identify what type of problem it is, and then you need to perform any data cleaning, if required. So in some sense, an exploratory data analysis project is included within a machine learning project, but it still helps to first have a separate exploratory analysis project, so that you can pick and apply the right skills when you need to. Okay. Not every machine learning project
will involve all of these steps, or involve a lot of data cleaning or a lot of exploratory analysis, but it is always helpful to do some EDA: plot some graphs, look for correlations, ask questions about the data. The more you understand the data, the better you will get at model building. One important lesson here is feature engineering. Once you do exploratory analysis, you can figure out what new features you can create. And this is one of the things about
classical machine learning algorithms: there is a lot of feature engineering involved, because the algorithms themselves are quite shallow. There's not much the algorithm can do on its own; you basically put the data into the algorithm and get a result out of it. Now, if your data is poor, if your
features are not strong, then your algorithm cannot do a good job. For instance, if you have timestamps in microseconds, then your machine learning algorithm will not be able to detect weekly patterns. So what you need to do is also introduce a column called day of the week, a column called month of the year, maybe a column called hour of the day, and so on. When you create these new columns, you are able to train your model better. So that's feature engineering.
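As a quick sketch of this kind of feature engineering with pandas (the dataset and column names here are hypothetical, just to show the idea):

```python
import pandas as pd

# Hypothetical dataset with a microsecond-resolution timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-04 09:15:23.123456",
        "2021-01-09 18:40:01.654321",
    ]),
    "sales": [120, 340],
})

# Derive coarser calendar features so the model can pick up
# weekly, monthly, and daily patterns.
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour

print(df[["day_of_week", "month", "hour"]])
```

Each new column gives the model a signal it could not easily recover from a raw microsecond timestamp.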
Then you do the modeling. You pick a type of model: maybe a random forest, maybe linear regression, maybe gradient boosting. Then you train the model, you make some predictions, and you evaluate it on the test dataset. Let's say you record the metrics of the model, and then you try different hyperparameters and different types of models. This is one thing in classical machine learning: you must almost always try multiple approaches. So you will almost always try regression, random forest, gradient boosting, et cetera, and you will also try different
kinds of hyperparameters. Here again, if you're using a library like scikit-learn, you can use hyperparameter optimization tools, such as what is called grid search, where for each model you define different sets of parameters to try. You can set up your model to be trained with each set of parameters, and then pick the best model out of it.
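To give a rough idea of what grid search looks like in scikit-learn (a sketch on synthetic data, not the exact setup of any particular project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try every combination of these hyperparameters with cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
# Evaluate the best model on the held-out test set.
print(search.best_estimator_.score(X_test, y_test))
```

The same pattern works for gradient boosting, logistic regression, and so on; you just swap the estimator and the parameter grid.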
And finally, you look back at all the different approaches you have tried, summarize your learnings, draw your inferences, and identify how you can further improve. Because you cannot go on working on a project forever, you stop at a certain point and say: okay, we've achieved a good enough accuracy or a good enough loss, and I'm ready to publish this project, where I've tried many different ideas, tuned hyperparameters, and gotten a good result. But at the same time, if you had time
to continue it, what would you do? Or if somebody wants to build on
your work, what should they do? And finally you take this
and you publish a notebook. You can put it up on GitHub,
put it up on Jovian, and again, if possible, write a blog post to describe your experiments and summarize your work. Okay. So that's a classical machine learning project, and we'll stop for questions at this point. Okay, so there's one question: is there a course on Jovian which covers classical ML using Python? Not yet; we are working on one, so
you will probably be able to register for it sometime in February, but it
is something that is coming soon. But in the meantime, though,
there are a couple of courses that I would recommend. The first is Machine Learning on Coursera, the quintessential course on machine learning. I'm sure many of you have done this already; if you haven't, you should do it, and if you did it long ago, it's always a good course to revisit. Although this course is not in Python (it is taught in a language called Octave), which might be a little tricky, what you can do is learn the concepts from the course and then reimplement them in Python, using, let's say, the scikit-learn library. So you can do this course; at the least, it's a good course that sets the intuition. You will,
for example, get to know: what is linear regression? What is logistic regression? It talks about neural networks a little bit, and then support vector machines, unsupervised learning, dimensionality reduction, anomaly detection, recommender systems. So there are a bunch of fundamental topics covered in this course, and it's one that I would recommend. Another one that is very practical
is called mlcourse.ai. This is a set of videos created by one person, but it is pretty solid, because this person has a long history of working in machine learning and also of participating in Kaggle competitions. Although I wouldn't say it is sufficient today to just do the Coursera course, especially because that course is in Octave, not in Python, you can do it to get an idea of what machine learning is, what some of the most popular techniques are, and what different kinds of problems you can solve with it, and then go on to do this course, the Open Machine Learning Course, or mlcourse.ai. This is a community course
organized by Open Data Science, or ods.ai. You can check out the lectures here; they are all available on YouTube,
and all the code and datasets for this course are also available on Kaggle. And finally, if you want to dig in further, I would also recommend the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, published by O'Reilly. The most recent edition was released in 2019, and it covers scikit-learn, Keras, and TensorFlow. You can focus on the scikit-learn part for classical machine learning, and then for deep learning you can look at the Keras and TensorFlow parts. So that's your second project:
classical machine learning. And I would recommend doing a
classical machine learning project before you do a deep learning project,
because even though deep learning is probably the most popular thing right now in data science, the fact of the matter is that if you start working as a data analyst or a machine learning practitioner, you will be working primarily with tabular data and using classical machine learning algorithms, for two reasons. One: for tabular data, and especially slightly smaller datasets, classical machine learning algorithms often perform just as well as, and sometimes better than, deep learning algorithms, especially things like gradient boosting. The other reason is explainability and interpretability, because your machine learning
models will be used somewhere to make decisions, or to understand the behavior of a particular application or website, so you will be asked to explain why the model gives a certain result. For instance, in a use case like insurance, before you reject somebody's application, you have to give them a reason why it was rejected. In such cases, a simple decision tree or a logistic regression might make more sense than a complicated deep learning model, or even a random forest. So keep that in mind; that's why I focus on
classical machine learning. Classical machine learning
is by no means dead. In fact, it is more widely
used today than ever before. So definitely spend some time doing all
of these: learn these techniques and do some projects with them. In fact, to land your first job, you may not even need a deep learning project, although it's good to have one. So that brings us to our third project topic: deep learning. We did a workshop on building a
deep learning project from scratch. So you can just follow along with the
workshop; it's two and a half hours, and if you simply follow along with all the instructions, it goes through the entire process of a deep learning project. It starts with finding a dataset online (we do that on Kaggle), then understanding and describing the modeling objective: identifying what type of data you're working with, identifying what type of problem it is, doing any cleaning if required, and performing exploratory analysis. Then we look at modeling,
where we define a model, a simple network architecture. Then we pick some hyperparameters and train the model, then we evaluate it, and finally we iterate with different hyperparameters and different regularization techniques. So do check out this workshop. If you want to start a project, you can have the video running in the background, pause at different places, work on your project, and then come back and continue. So do use it as a reference guide. Now, some example projects in
deep learning; I'm sure you may have seen a lot of these if you have been taking the Zero to GANs course that is currently ongoing. I'll give three or four examples here, of different kinds. The most common one that you tend
to see is image classification. Here's one: blindness detection using image classification. In courses, you'll typically see images of everyday objects, but in this case the dataset contains images that look like this. These are, if I'm not mistaken, pictures of the retina of the human eye, and what you are required to predict is the severity of the blindness. The severity
changes as follows: the severity of diabetic retinopathy is graded on a scale of zero to four, from no diabetic retinopathy to mild, moderate, severe, or proliferative DR. So here's how it goes. You create a training set, and then you create a test set and a validation set out of it. Then you add some transforms, because here we are
looking at medical images, and medical imaging is one huge area where deep learning is helping make huge advances. There are also a lot of roles, a lot of opportunities, in healthcare and medical imaging, whether it is cell images, images of the eye, or even things like CT scans. So here, what we're doing is
applying some transforms to the images: resizing them, doing some randomized transforms like a random horizontal flip, converting into a tensor, and normalizing. Then we create training and validation sets, and training and
validation data loaders. Then we define a model architecture. In this case, I think this is done by Shaunak, and he has picked the ResNet-152 architecture. So this is one interesting
thing that you can do: if the dataset is on Kaggle, check out the Notebooks tab on the dataset's Kaggle page. In the Notebooks tab, try not to just copy the code; look at the code and understand what people are doing. Look through three or four notebooks, gather some ideas, and try to implement them on your own. So try, in general, not to copy-paste code. Even if you have to use somebody else's
code, you can put it on the side and type it out, because when you type out the code, you automatically type each variable and each operation, and that forces you to think about it. So at the very least type out the code; better still, look at the code, understand what it does, and try to replicate it by looking at the documentation, at other notebooks you have written in the past, or at other resources. Wherever you pick up good ideas, try to replicate and understand them. So here he is creating a model, and then changing the head of the model. Here we are using a pre-trained
model, the ResNet-152 model, and what he has done is remove the final fully connected layer and replace it with a new one. This is a technique called transfer learning, once again something we've covered in the Deep Learning with PyTorch course. Then he's training the model here, and you can see that the model is being trained. You train the model, you track the losses, and you try different architectures. It seems like different learning rates have been tried here, and different models as well: here you have ResNet-152, this is ResNet-152 as well, and then there's a ResNet-101. You can compare how the loss changes across the different models. On ResNet-101 the validation loss stops at around 0.35; on ResNet-152 the validation loss goes far lower, to 0.125; and on ResNet-101, once again, the validation loss is higher. So you want to experiment with different model architectures and with different hyperparameters.
And then finally, you make some predictions and verify whether they make sense. In this case, this particular blog post does not contain predictions, but if you're working with images, you should always verify on maybe five, ten, fifteen, twenty individual images. And then, if you have a test set and you need to submit the predictions somewhere, generate the predictions for the test set and submit them. So that's roughly the structure of a
deep learning project, and it's always important to write a conclusion: summarize your approach, what you've learned, what worked, what did not work, and what may be some other ideas to try out. If you have borrowed code or ideas from different places, you can also add references; once again, a great thing to do. So that's one deep learning project, and similarly you can check out a couple of others. We'll very briefly cover
this one, called classifying environmental audio recordings. You do not always have to work with images; in this case, audio can actually be turned into images. Here we are working with audio files, which are basically waveforms. This one seems to be the tick of a clock, and this is what it looks like in waveform format. You can use tools and libraries like librosa to convert that into this kind of image, and after applying certain normalizations it becomes an image like this. So now what you can do is transform audio files into images, and once you've done that, you can use the same deep learning techniques, convolutional neural networks, to classify them. So this is what you end up with: essentially different images for
different audio clips. This is one batch of training data, and each pattern of audio that you see belongs to a certain category: one could be a clock ticking, one could be a dog barking, one could be the sound of a bell, and so on. Then you can perform image classification. So again, an interesting project.
Here's another: generating art using GANs. This is also very interesting. You take some artwork and put it as input to a generative adversarial network, and you end up with something like this. I would say this is pretty impressive; it's not anything in particular, but it looks nice. So this is a good start. In fact, generating art using GANs is
an entire area where hundreds of people are trying many different things, changing the inputs to GANs to generate interesting pictures, and artists, especially digital artists, are actively using GANs in their work. So do check it out. I will not go through
it in a lot of detail. You can see this is a
pretty big blog post. And finally, here's one more. We looked at one problem that was classification (classification can be single-label or multi-label), and we looked at an unsupervised learning problem, generating art using GANs, which is a generative modeling problem. We can also do what is called regression, where we come up with, not the class an image belongs to, but specific points or coordinates on an image. For instance, here Aditya has worked on something called pose estimation, and for it he has used the TensorFlow pose estimation package. Once you know a little bit of deep learning, you can look up what pose estimation is. You may have to read a tutorial, you may have to read some code on GitHub, but you have all the right skills to figure out what
the code means and how it works. You may even have to read a paper,
but in most cases you can find a blog post explaining the paper. Even in this case, Aditya used a particular notebook to learn more about pose estimation. In any case, the idea here is not to predict a single class but to predict a bunch of key points, numbered zero to seventeen; using those eighteen key points, you can then estimate the pose of the person in an image. For instance, here are some examples. So you can see that these are
images of cricketers, with all the key points marked. Using the pose, you can then classify which specific shot they're playing, or whether this is cricket or some other sport. So this is interesting: it's a multi-step problem, where you take these images, convert them into pose coordinates, and then run a classification model on the pose coordinates. Sometimes you can use a pre-trained model to come up with the pose coordinates, so then you simply have to build a simple feedforward neural network for classification. You can check out his notebook here. So this is one other thing that you can
try: multi-step deep learning problems, where you use one model to convert the data into a particular format, let's say coordinates or embeddings, and then use that as the input to another deep learning model, or possibly a classical machine learning model. One other thing that you can
also do, which we've not covered here, is working with text data. This is where you will have to use recurrent neural networks and transformers. Now, some learning resources for deep learning. Obviously, we have the course Deep Learning with PyTorch: Zero to GANs. I hope you've done the course; if you have not, you can still sign up for it, watch all the lectures, do the assignments, and build a course project. Another good set of courses is the Deep
Learning Specialization on Coursera. This is good if you want to dig deeper into the theoretical side of things, if you want to actually look at how the math works, and if you want to become familiar with some of the terminology used in deep learning. It also contains a lot of practical tips for building good deep learning models. So I would recommend the Deep Learning Specialization on Coursera; it's pretty good. Another good course to do is the Practical Deep Learning for Coders course, also known as fast.ai. Now, you don't have to
do all of these courses; you can do one of them, maybe two. If you've done the Zero to GANs course, you can complement it with the Deep Learning Specialization; or if you've done fast.ai, you can do the Deep Learning Specialization; or if you've done the Deep Learning Specialization, you can read the Deep Learning with Python book. Do one or two of these. The more important thing is that you should be
able to understand most, if not all, of the terminology used in deep learning. If you're reading a blog post, you shouldn't feel lost, and you should be able to read code: if someone gives you a link to a GitHub repository, you should be able to look into it and understand the code in one of the frameworks, either TensorFlow or PyTorch. Just make sure to learn one framework; it doesn't have to be TensorFlow or PyTorch specifically, it could be either one. And have some familiarity with the other, enough that you can understand it, if not write code in it. This book is also pretty
good: Deep Learning with Python. It is written by François Chollet, the author of the Keras library, which is part of TensorFlow. So if you want a book for reference, you can check out this one as well. So that's the third project. Now, apart from these, there are a
few more projects that you can build. One good project is something with SQL, because SQL, or relational databases, are very commonly used for storing information, so at some point you will have to work with a SQL database. If you can work on a mini project,
or maybe even just a blog post, where you demonstrate that you know how to write advanced, complex SQL queries to address different use cases, that will be pretty helpful.
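For instance, here's a small sketch using Python's built-in sqlite3 module and a hypothetical orders table, with a query that aggregates, filters groups, and sorts:

```python
import sqlite3

# In-memory database with a made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 50.0), ("carol", 200.0)],
)

# A slightly more advanced query: aggregate per customer,
# keep only big spenders, and rank them.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total, COUNT(*) AS n_orders
    FROM orders
    GROUP BY customer
    HAVING total > 100
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('carol', 200.0, 1), ('alice', 170.0, 2)]
```

In a blog post you could build this up step by step, from simple SELECTs to joins, window functions, and subqueries.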
Then another project that you can work on is web crawling and web scraping. This is essentially using a library like Scrapy to fetch web pages, download information from them, and then follow the links on those pages to crawl further. For instance, you could get a page from Amazon and extract information about the product; to actually parse the information, you can use a library called Beautiful Soup. Then you can follow all the other products linked from the product page, create a database of products in a particular category, and do analysis on that. So you can use web scraping as
a technique to generate your own dataset
by scraping a website. In fact, a lot of the datasets that you find on Kaggle were created this way. So that's one other thing you can add. Once again, all of these additional projects I'm mentioning are optional.
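A tiny sketch of the parsing step with Beautiful Soup (parsing a made-up product page snippet; in a real scraper you would fetch the HTML first, e.g. with the requests library or Scrapy):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for two product listings; a real page would be
# fetched with something like: html = requests.get(url).text
html = """
<div class="product"><h2>Blue Kettle</h2><span class="price">$24.99</span>
  <a href="/product/42">details</a></div>
<div class="product"><h2>Red Mug</h2><span class="price">$9.50</span>
  <a href="/product/43">details</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract (name, price, link) for each product; the links are what a
# crawler would follow next to build up a dataset.
products = [
    (div.h2.text, div.find("span", class_="price").text, div.a["href"])
    for div in soup.find_all("div", class_="product")
]
print(products)
```

The class names and URL paths here are invented; on a real site you would inspect the page and adapt the selectors.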
One more thing that you can possibly look into is web development. You can look at the Flask framework, which is a very simple web server for Python. A minimal application looks something like this: you write these four lines of code in a Python program, you run it, and you can open it up in a browser and interact with it. For example, if you open the root route, slash, on localhost at whatever port your application is running at, you will see the words "hello world", and you can go from there.
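Those few lines look roughly like this (the standard minimal Flask sketch; the route and message are just examples):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Whatever this function returns becomes the HTTP response body.
    return "Hello, World!"

# Start it with app.run() (or the `flask run` command) and open
# http://localhost:5000/ in your browser to see the message.
```

From here you can add more routes, accept form uploads, and return full HTML pages instead of plain strings.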
Instead of returning a string, you can return an HTML page. You can also learn some HTML, CSS, and JavaScript, if that interests you, and create an entire web application. Now, you may not want to do an entire web application project where you're building a full website or web app. What you can do is take
a machine learning model that you have created and put it behind a server with a simple user interface, like a simple form. For instance, if it is a flower classification model, you can create a simple web application that allows the user to upload a picture of a flower and then tells them which flower the picture represents. Something simple like that would require only a couple of hundred lines of code, with Flask, HTML, CSS, and JavaScript included, and you can deploy it to platforms like Heroku. Heroku is a simple platform for deploying Python web applications; just look up "Getting Started on Heroku with Python". It seems complicated at first, but once you spend some time with it, you become familiar with it, and it's actually rather simple. So these are other things that
you can do. Once again, these are optional, but good to have. They will definitely set your profile apart, especially when you're applying for internships or jobs, or reaching out to people on LinkedIn. We'll talk at some point about the best way to reach out when you're applying for jobs: cold emails, referrals, and such things. Yeah, so that pretty much covers it. There are also some more areas, things like Spark, Hadoop, Hive, and NoSQL, that you can explore, but if you're looking to just get into data science right now, I wouldn't say that you need any of those. These are the three important
projects for you to do: exploratory data analysis, classical machine learning, and deep learning. And finally, how to find datasets. As we've mentioned several times, Kaggle Datasets is a great place to look; just go to kaggle.com/datasets. A couple of tips, because I've mentioned this so many times. One good thing you can do is
set a minimum file size. If you're looking for slightly bigger datasets, ones with at least a few thousand rows of data, you probably want to put in a filter like this, maybe 10 MB. If you're looking for image datasets, you may want a filter of 100 MB, because each image is tens of KB in size, maybe up to 100 KB, so to get to a thousand images, the dataset needs to be at least 100 MB in size. Another thing you can do is filter by file type, like CSV, and you can also search by tags. So make use of the filters here; they're pretty powerful. One more thing to do would be the sort order:
this is normally sorted by "hottest", but you may want to sort by most votes. Now, with some filters and sorting by votes, we have some pretty big datasets: the COVID dataset is about 7 GB, the Bitcoin dataset is about 96 MB, US Accidents is about 300 MB, Zomato Restaurants is 89 MB, and so on. So explore different tags as well. Okay, let's say we search for sales or business-related data. Okay, that doesn't turn up much, but the tags help as well. This is interesting: Uber pickups, SF salaries. So just by looking at these datasets,
I'm sure you're probably thinking about the different things you can do with
them, whether it is data analysis, visualization, or prediction of some kind. Movie reviews: this is again interesting. Education statistics around the world. Yeah, Kaggle is a great place. There are a few more sources too; you can check out all of these. There are enough datasets on the internet, so do use these resources. Okay. So that's where we will end the workshop.