Main

3 Beginner-friendly Data Science Projects to make your Resume UNIQUE | EDA, ML & DL

πŸ’» For real-time updates on events, connections & resources, join our community on WhatsApp: https://jvn.io/wTBMmV0 πŸ™‹β€β™‚οΈ We’re launching an exclusive part-time career-oriented certification program called the Zero to Data Science Bootcamp with a limited batch of 100 participants. Learn more and enroll here: https://www.jovian.ai/zero-to-data-science-bootcamp πŸ’Ό Before you start applying for data science jobs, you should make sure to complete at least one project in each of three important domains and host them on your public Jovian/GitHub profile. πŸ”— Resources used in the workshop: https://paper.dropbox.com/doc/3-Must-Have-Projects-for-Your-Data-Science-Portfolio--BDIenNsF5vSatHDJ0mvGXpsEAQ-WuvL3fhBPiUcxjXnR6gOC In this video, we’ll look at example projects in these three areas of data science: 00:00 - 01:54 - Introduction 01:54 - 28:04 βœ… Exploratory Data Analysis and Visualization 28:04 - 41:11 βœ… Classical Machine Learning on Tabular Data 41:11 - 01:02:43 βœ… Deep Learning (Computer Vision/NLP) The video will also cover the process for acquiring the skills required for these projects, identify good project topics and datasets and tips on documentation and presentation. - 🎀 About the speaker: Aakash N S is the co-founder and CEO of Jovian - a community learning platform for data science & ML. Previously, Aakash has worked as a software engineer (APIs & Data Platforms) at Twitter in Ireland & San Francisco and graduated from the Indian Institute of Technology, Bombay. He’s also an avid blogger, open-source contributor, and online educator. - Let's Build a Python Web Scraping Project from Scratch - https://youtu.be/RKsLLG-bzEY Let’s Build an Exploratory Data Analysis Project from Scratch - https://youtu.be/kLDTbavcmd0 Let's Build a Deep Learning Project From Scratch - https://youtu.be/MQGHl3E8QA0 How to get a data science job with no experience - https://youtu.be/9gEvswBA2d4 - Check out our Free Certification Courses referenced in the session πŸ”½ πŸ‘‰ Data Analysis with Python: Zero to Pandas, a self-based course offered by Jovian and is a practical, beginner-friendly, and coding-focused introduction to data analysis covering the basics of Python, Numpy, Pandas, Data Visualization and Exploratory Data Analysis for beginners. You can learn more and register for a Free Certificate of Accomplishment at http://zerotopandas.com. πŸ‘‰ Deep Learning with PyTorch: Zero to GANs is a beginner-friendly online course offering a practical and coding-focused introduction to deep learning using the PyTorch framework. You can learn more and register for a Free Certificate of Accomplishment at http://zerotogans.com. #DataScience #MachineLearning #Projects - πŸ”— Learn Data Science the right way at https://www.jovian.ai πŸ”— Interact with a global community of like-minded learners https://jovian.ai/whatsapp/ πŸ”— Get the latest news and updates on Machine Learning at https://twitter.com/jovianml πŸ”— Connect with us professionally on https://linkedin.com/company/jovianml πŸ”— Subscribe for new videos on Artificial Intelligence https://youtube.com/jovianml

Jovian

2 years ago

Welcome to today's workshop on three must-have projects for your data science portfolio. One of the most common questions we get asked is: which projects should I work on? Typically this comes from two perspectives. First, it is about learning the right skills. Second, it is about building a good portfolio, because before you apply for an internship or a job in data science, you will first have to create a strong resume and a strong portfolio. By portfolio I mean a list of projects you have put up somewhere publicly, either on your GitHub profile, your Jovian profile, a blog, or your own personal website. So today I want to cover three must-have projects that you should do before you apply for a data science role, specifically if you're looking to get into machine learning. I will also touch briefly upon what to do if you're looking for more of a data analyst kind of role; in that case you may not need all three of these — the first two might be sufficient, and you might substitute the third with something else.

If you have to pick three projects to showcase on your portfolio and your resume, these are the three I would recommend. The first one is exploratory data analysis and visualization, and we'll talk about it in more detail. The second one should be a classical machine learning problem on tabular data. By classical machine learning we mean algorithms that came before deep learning, so roughly pre-2012; these are also sometimes called shallow machine learning techniques. A distinction is also made between structured and unstructured data: structured data typically refers to tabular data, whereas unstructured data refers to things like images, text, audio and video, where deep learning is more applicable and more widely used. The third one should be a deep learning project, because today deep learning is so prevalent that in pretty much any machine learning work you do, at some point you will want to at least try deep learning. Neural networks work well not just on unstructured data like images and text, but often on tabular data too.

So these are the three projects you should do, and in some sense this should be your goal when you're getting started. Say you have a six-month timeline for learning: try to make sure that at the end of six months you have at least one project in each of these categories in your portfolio, so that when you start preparing your resume and applying for jobs, you have these to showcase — to have a better chance of getting shortlisted, to have something to talk about during interviews, and of course to cultivate the skills you will need to fulfil your job responsibilities properly.

The first project we're going to look at is exploratory data analysis and visualization. This is probably the first thing you learn — if you've taken our Data Analysis with Python: Zero to Pandas course, you'll know exactly what I'm talking about. The basic idea is that you first need to find a real-world dataset of your choice online, and I'll show you how to find datasets as well.
Find a dataset on a topic that interests you. Just make sure it is large enough and has enough variety and information in it so that you can do more than one or two graphs or answer more than one or two questions. Ideally, find a dataset with some resemblance to what you might see in the real world: maybe there is some missing data, some incorrect data, some things that need to be cleaned up. For example, there might be a column called date where the dates are actually stored as text, and you need to convert them into proper dates and maybe break out things like month, year and day of the week. The more you can showcase your skills in parsing a dataset — reading a CSV file, cleaning the data, looking for missing or incorrect values, filling in missing values — the better your project gets, because you are both learning these skills and showcasing to a potential employer that you know how to work with messy data.

Then use matplotlib and Seaborn to create visualizations. If you've done the Zero to Pandas course, we had a full lecture dedicated to the different types of visualizations you can create. You want to show that you understand the different kinds of graphs, when they should be used and when they should not be used — picking the appropriate visualization for the appropriate use case is very important. You also need to showcase your skills in asking and answering interesting questions about the data. This is one of the key things: the purpose of data analysis is to get insights from data, and to get insights you first need to be able to ask good questions. Finally, presentation is a very important part of data analysis. If possible, add some documentation in your Jupyter notebook, GitHub repository or blog post, and present it in a way that is pleasant for somebody reading it. They should feel interested just by looking at the title, by reading the first few lines, by looking at the first few graphs — try to make it engaging.

So that's the first project, and now let's look at some examples. If you have any questions at any point, please feel free to ask, and feel free to follow along and open up these links as I go through them. I have posted a link to this Dropbox Paper document in the Slack channel, and I will also post it here in the chat. We'll look at a few example projects first, then we'll talk about how you can acquire the skills needed for these projects, and finally we'll look at some places where you can find interesting datasets — something a lot of people have asked about.
How do I find a good dataset to work on? Here are some examples, and I've picked three. The first is a WhatsApp exploratory data analysis. This was a course project by Michael, one of the participants in the Zero to Pandas course. The dataset Michael is using is his own WhatsApp data. On WhatsApp you have the option to export your data — he has mentioned where you can get this information: you go into settings and there is an option to export chats. You can export the chat history of a specific group, maybe one you've been part of for several years with tens of thousands of messages, or you can export your entire WhatsApp data.

The first step is to import the dataset. The export is a text file, chat.txt, and it can be read into pandas and turned into a data frame. The next step is to present your understanding of the data. Michael describes that there are three columns in the dataset — a date, the text, and a mostly empty column — and that there are around 23,000 rows, so about 23,000 chat messages. There are also some NaN values here and there that need to be cleaned up. The step after that is to clean the data. Some rows represent an image being sent rather than text, and since we want to analyse textual data, Michael has written some code to drop all the chats that just carry the "media omitted" placeholder. We won't get into the details yet, but the idea is that you look into the dataset and understand what cleaning is required: maybe you need to remove some rows, maybe you need to fill in some missing values — and you need to not only do that but also document it. If you go through the post, you will see that Michael has explained really well which rows he chose to exclude and why.
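As a rough illustration — not the author's actual code — here is a minimal sketch of loading a WhatsApp export into pandas and dropping the media-placeholder rows. The line format of the export varies by phone and locale, so the regex and the "<Media omitted>" marker below are assumptions.

import re
import pandas as pd

# Assumed export format: "12/31/20, 9:41 PM - Alice: Happy new year!"
pattern = re.compile(r'^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}.*?) - (.*?): (.*)$')

rows = []
with open('chat.txt', encoding='utf-8') as f:
    for line in f:
        match = pattern.match(line.strip())
        if match:
            date, time, user, text = match.groups()
            rows.append({'date': date, 'time': time, 'user': user, 'text': text})

df = pd.DataFrame(rows)
df = df[df['text'] != '<Media omitted>']  # drop image/video placeholder messages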
Giving a rationale for your data cleaning and your missing-data imputation is important, and only then do you get started with the exploratory analysis itself. There are many ways of doing exploratory data analysis. Some people like to take each column and create graphs for it; others like to start with questions. That is totally up to you, because whichever way you start, you will end up doing pretty much all of the analysis one way or another. In this case Michael started by asking questions. His first question was: which users have the most chat messages in the group? Framing questions is a useful skill — given a dataset, here the WhatsApp chat history of a particular group, the first thing you should do is think about the questions you would ask. Start with the obvious ones: how many total messages are there in the dataset? Which users have sent the most messages? How many messages per user? Who was the earliest member of the group, who was the last, and in what order did members join? That last one you can work out from the first message sent by each user. There are a lot of interesting questions you can ask, and the idea is that you ask a question and then figure out how to answer it using pandas.

For instance, for "which users have the most chat messages in the group", you can use pandas to group the data by user, count the number of messages, and sort the result. In this group there are three people, and it turns out one of them has sent over 10,000 messages. You can show this as a table, but the same thing can also be visualized. Michael chose a line graph here; I would say a bar chart would also be a good choice, because showing three bars side by side might be more informative.
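A minimal sketch of that group-and-count step, assuming the DataFrame built in the earlier snippet with a 'user' column:

import matplotlib.pyplot as plt

message_counts = df.groupby('user')['text'].count().sort_values(ascending=False)
print(message_counts)  # messages per user, as a table

message_counts.plot(kind='bar', title='Messages per user')
plt.ylabel('Number of messages')
plt.show()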
Question two is which emojis are used the most, and you can see that the most used ones are the laughing emoji and the strong-arm emoji; here a pie chart is a good choice. Then there's the distribution of emojis per user and how that differs, and the most active hours on WhatsApp — which time of day are people most active in terms of sending messages? It turns out most of the messages are sent between around 1:00 PM and 5:00 PM. Then there are questions four and five, and at the end some inferences, including a word cloud, which is again a very interesting thing to showcase.

So that's an exploratory data analysis project, and it's not very difficult. The skills you need are basic Python programming, pandas, matplotlib and Seaborn. Once you have those, it's simply a matter of picking a dataset, and you really don't have to spend too much time picking one either — start with the first thing you can find online (I'll talk about how to find datasets) and just start exploring. There's no single right way to do it: start messing around with the data, start drawing graphs, and when you feel more confident, go back and ask whether the dataset lets you showcase everything you want — cleaning, analysis, maybe merging multiple datasets, visualizations, presentation skills. If it doesn't satisfy those criteria, go back and find another dataset: a larger one, a more complex one, a richer one — and build your project.

Once you've built your project, a blog post is a great way to showcase it, which is what Michael has done here; alternatively you can put all of this into a Jupyter notebook and put it up on GitHub or Jovian. A Jupyter notebook is good for a technical audience, but a blog post is often better when you are presenting to, say, potential employers, because a notebook contains a lot of code that is unnecessary for them — helper functions, imports and so on. In a blog post you can remove those things and talk more about the dataset, the insights and the inferences you're drawing, which is really the whole purpose of data analysis. The purpose of data analysis is not to write code; it is to gather insights. So putting your code in a Jupyter notebook and then putting the results and outputs in a blog post — like a report for your project — is a good idea.

That's the first example project, WhatsApp data analysis, done on a personal dataset. This is something all of you can do, and you can all ask different questions.
e instead of analyzing, maybe try picking a big group rather than a small group, maybe try picking a personal conversation so that if, and maybe if you want to try this on messenger, you can probably export from messenger as well. If you want to try this on Snapchat or something, you can do that as well. So that was one example. And then there are a couple more here. So there is one called analyzing browsing patterns using pandas. This is also another very interesting blog post. I highly recomme
nd checking it out. This is by one of our mentors Karthik. So here, what Karthik did was he used and Google base, a Google application called Google takeout. So you can go to. takeout.google.com. And from there, you can download your entire browsing history. If you have been using Google Chrome and you have been logged in, so you can download it as a CSV file or sorry, as a Jason file. So you can see here that it is downloaded as a Jason file. And this is what it looks like. So you have just a,
the time at which you visited a page, the title of the page and the URL of the page. And it's a very simple dataset, but it is a very personal dataset. So there are a lot of interesting insights that you can gather based on these. And then based on this there are some, what Karthik has then done is to take the single take the single column, which is the number of microseconds from S so this is called the UTC time and he has converted out of it. He has picked out things like the year, the month,
the date, so that shows. Data processing skills, where you're taking this particular microsecond time and converting into converting it into your date month. And then similarly, what he's also done is he has passed the URL and captured the root domain out of it, because maybe you want to know how much time have you spent on g-mail. How much time have you spent on Facebook? How much time have you spent or how much activity have you done on Twitter for instance. So that's, where you can see that t
here is this get domain function. There is this get day of the week function. There is this convert time function. So this is, another thing that this is showcasing. How do you process, or do you start with some raw data and process it into more structured data? And after doing all of these things, after doing all of these things, you can see that now he started to do visualizations. So here, for example, one very simple thing is doing is checking. How many of the websites are secured versus uns
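Here is a rough sketch of that kind of processing — loading the Takeout JSON, converting the microsecond timestamp, and pulling out the root domain. The 'Browser History', 'time_usec' and 'url' field names are assumptions about the export format, not taken from Karthik's post.

import json
from urllib.parse import urlparse
import pandas as pd

with open('BrowserHistory.json', encoding='utf-8') as f:
    history = json.load(f)['Browser History']

df = pd.DataFrame(history)
df['visit_time'] = pd.to_datetime(df['time_usec'], unit='us')   # microseconds -> datetime
df['day_of_week'] = df['visit_time'].dt.day_name()
df['domain'] = df['url'].apply(lambda u: urlparse(u).netloc)     # e.g. mail.google.com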
After doing all of this, he starts on visualizations. One very simple one is checking how many of the visited websites are secure versus insecure — that is, HTTPS versus HTTP. He also analyses weekday versus weekend browser usage. Another very interesting one is a heat map: for each day of a particular month, he looks at his usage throughout the day, where dark means a lot of usage and light means very little. You can see that he spends a lot of time browsing between about 10:00 AM and 2:00 PM, probably for work, and then some time at night from 7:00 PM to 11:00 PM, probably for leisure. There's a lot you can do here, and this is barely scratching the surface. He also had questions about Stack Overflow, so he looked up his most common Stack Overflow searches — it turns out that in his case they are mostly related to Python — and then there's a word cloud, and so on.

Do check out the third example as well. This one is a more standard kind of analysis on a dataset that is publicly available — not something you have to gather yourself, not a personal dataset — and this is most likely what your first project is going to look like: you find a dataset from an online source such as Kaggle. This one analyses the Google Play Store, covering data for all the apps on the store. For example: how do app ratings vary? You can see that ratings are mostly in the range of 4.0 to 4.4 — anything above that is an exceptional app, anything below is a poorly rated one — along with counts of apps in each category, and so on.
Now I want to spend some time talking about where you get these skills if you don't have them already. There are a couple of places, and these are beginner-level skills. One is the Zero to Pandas course — Data Analysis with Python: Zero to Pandas at zerotopandas.com. This course is now in self-paced mode. There are six lessons; each lesson has a video you can watch and a Jupyter notebook you can open, run online and make changes to. There are also three weekly assignments — one to practise Python, one to practise NumPy and one to practise pandas — plus data visualization with matplotlib. Finally, as part of the course, you do a course project on exploratory data analysis, and in fact the final lesson is a case study of an exploratory data analysis project. So definitely do this course if you haven't yet; if you are doing some of the other courses, you may want to do this one alongside or before them — it's really helpful. Then definitely do a project, and definitely try to write a blog post as well. A lot of people stop at building the project, but it really helps to write a blog post; I cannot stress this enough.

Another course I recommend is the Python for Data Science and Machine Learning Bootcamp on Udemy. This course is great — it is how I learned many of these things. It costs a small price, about $10 or so, and it is pretty comprehensive in what it covers. If you're doing Zero to Pandas, you do not need to do this course, but some people prefer a different style of teaching, so this is another option if you don't find Zero to Pandas sufficient for your purposes.
So that was the first project. Moving forward, one thing I wanted to cover was where to find datasets. We've gone over this many times, but I want to show you once again: go to kaggle.com/datasets. This is probably the best source of datasets. It may take some searching — it's not always easy to find the right dataset here, and sometimes you have to look for the right keywords or add filters — but what I like to do is sort by most votes. Depending on the type of data you're looking for: for exploratory data analysis you typically want some kind of CSV file, so select CSV as the file type, and you want at least a few thousand rows of data, which usually means the dataset should be larger than about 10 MB in size. So put in a minimum file size of 10 MB, or if that's too restrictive, try 1 MB. Already you start to see pretty good datasets: US accidents with 3.5 million records, Zomato Bangalore restaurants (with a lot of information captured about each restaurant), the Netflix prize data from the competition Netflix organised, gun violence data, anime recommendations, artworks, used cars, crimes, Spotify — all of these are great datasets for data analysis.

We've created a library to make it easy to download these datasets: opendatasets. All you need to do is install the library with pip, import it, grab the dataset URL — let's say the Spotify data here — and call the download function. When you do that, you will be asked to enter your Kaggle credentials; you can follow the instructions shown, which involve going into your Kaggle account and downloading a kaggle.json file. It's pretty straightforward.
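A minimal sketch of that flow with the opendatasets library — the dataset URL here is just an illustrative Kaggle link, not necessarily the one used in the workshop:

# pip install opendatasets
import opendatasets as od

dataset_url = 'https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-160k-tracks'
od.download(dataset_url)  # prompts for your Kaggle username and API key (from kaggle.json)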
How do you get started with the project itself? One thing you can do is go to jovian.ai, click New Notebook, choose a blank notebook, and give it a title — say "Spotify tracks EDA". Keep it public, create the file, and click Run. You can run it on any of the available options; for data analysis I normally like to use Binder. It takes a little while to start up — not long. Once Binder starts, you download the dataset inside the Jupyter notebook that opens up, and then you save the notebook back to your Jovian profile. You can do all of this online. If you want, you can install things on your own computer, but the more important part is to build the project, so to get started you can just work online: create the project on Jovian and run it on Binder, Google Colab or Kaggle — you get the idea.

So Kaggle datasets is the best place, but there are other sources on Kaggle too, such as previous competitions. Go to Kaggle competitions and look at the completed ones. There have been, I think, over a hundred competitions on Kaggle, and competition data is actual real-world data that some company has put together and shared with Kaggle to run a crowdsourced project. You will find all the challenges you find with real data: missing values, incorrect values, mislabelled data. Sometimes the datasets are very large; sometimes the training data is quite small but you have to make predictions on a larger dataset. So this is a great source of datasets, both for data analysis and for machine learning.
Which brings us to the second project you should be doing. After an exploratory data analysis project — so now you understand how to process, clean and analyse a dataset — the next thing to do is build a classical machine learning project. By classical machine learning we mean all the machine learning techniques that came before today's deep learning and neural networks, so roughly before 2013-2014. This includes regression techniques — linear regression, logistic regression, polynomial regression — and things like decision trees, random forests and gradient boosting. Those are supervised learning techniques, but there are also unsupervised learning techniques like clustering (k-means, for example), and there are things like collaborative filtering. If these terms don't make sense right now, that's okay — I'll point you to the right courses where you can learn about these algorithms and techniques.

It's important to cover these techniques because when you start working as a data analyst or a machine learning practitioner, you will most likely be working with what is called tabular data — data that looks like a spreadsheet or a database table — and on tabular data, especially with smaller datasets, it is often these classical machine learning techniques that give good results, or at least good-enough results. The other thing is that classical machine learning techniques have better explainability than deep learning. Although deep learning seems like the hot thing right now — everybody's talking about it and most papers are coming out around it — deep learning is a black box, and its applications are better suited to unstructured data at the moment. For most real business problems you will have to do some classical machine learning with the techniques I mentioned, and you will have to work on the explainability aspect as well: you should be able to explain why your model gives the results it gives, especially if you're working in areas like finance or anything else that is regulated.
In insurance, for example, before you reject somebody's application you need to be able to explain why it was rejected, and you will often be asked by whoever you're presenting results to how your model works. That's why explainability is important.

The process, once again, starts with finding a dataset online, and for machine learning projects I would suggest that past Kaggle competitions are great. For example, the Santander customer transaction prediction competition — as you can probably guess, this is information about customer transactions, and the objective is to identify which customers will make a transaction. Or predicting box office revenue: given who is in a movie, its title, its director, its budget, can you predict its box office revenue? That's an interesting problem, and it's a classical machine learning problem. There's analysing NFL game data, or PUBG: can you predict the battle royale finish of PUBG players — looking at past matches and predicting who is going to win a particular match, or at what position somebody is going to finish? These are all good problems.

Before you do any modelling, though, you should first be able to understand and describe the modelling objective. You need to be able to say what type of data it is. For classical machine learning you will normally be working with tabular data rather than images, but there is still some variation within that: is it time-series data? Is it regular database-style columns? Is it sensor measurements that have to be interpreted in a certain way? There is always some information you need to know about the data before you can actually start working on it, and you should be able to document and identify that. You should also be able to identify what type of problem it is: is it regression, is it classification, is it unsupervised, is it a recommendation or collaborative filtering problem, and so on. Then you need to perform any data cleaning that is required. In some sense, an exploratory data analysis project is included within a machine learning project, but it still helps to do a separate exploratory analysis project first, so that you can pick and apply the right skills when you need to.
Not every machine learning project will involve all of these steps — data cleaning, or a lot of exploratory analysis — but it is always helpful to do some EDA: plot some graphs, look for correlations, ask questions about the data. The more you understand the data, the better you will get at model building. One important lesson here is feature engineering. Once you do the exploratory analysis, you can figure out what new features you can create, and this is one of the characteristics of classical machine learning algorithms: there is a lot of feature engineering involved, because the algorithms themselves are quite shallow — you basically put the data into the algorithm and get a result out of it. If your features are not strong, the algorithm cannot do a good job. For instance, if you only have a time in microseconds, your machine learning algorithm will not be able to detect weekly patterns. What you need to do is introduce a column for the day of the week, a column for the month of the year, maybe a column for the hour of the day, and so on. When you create these new columns, you are able to train your model better.
Then you do the modelling. You pick a type of model — maybe a random forest, maybe regression, maybe gradient boosting — you train it, make predictions, evaluate it on the test dataset and record the metrics. Then you try different hyperparameters and different types of models. This is one thing about classical machine learning: you should almost always try multiple approaches. You will almost always try regression, random forests, gradient boosting and so on, and you will also try different hyperparameters. If you're using a library like scikit-learn, you can use hyperparameter optimization tools such as grid search, where for each model you define different sets of parameters to tune, train the model with each combination, and then pick the best model out of it.
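Here is a minimal grid-search sketch with scikit-learn; the dataset, model and parameter grid are illustrative choices, not from any particular project:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try every combination of these hyperparameters with 3-fold cross-validation
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_, search.score(X_test, y_test))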
Finally, look back at all the different approaches you have tried, summarize your learnings, draw your inferences, and identify how the work could be improved further — because you cannot go on working on a project forever. At some point you stop and say: we've achieved a good-enough accuracy or a good-enough loss, I've tried many different ideas, tuned hyperparameters and gotten a good result, and I'm ready to publish this project. At the same time, note what you would do if you had more time to continue, or what somebody building on your work should do next. Then you publish the notebook — put it up on GitHub or Jovian — and, if possible, write a blog post to describe your experiments and summarize your work. So that's a classical machine learning project; let's stop for questions at this point.

There's one question: is there a course on Jovian that covers classical ML using Python?
Not yet — we are working on one, and you will probably be able to register for it sometime in February; it is coming soon. In the meantime, there are a couple of courses I would recommend. One is Andrew Ng's Machine Learning course on Coursera. This is the quintessential course on machine learning, and I'm sure many of you have done it already; if you haven't, you should, and if you did it long ago it's always a good course to revise. The one caveat is that this course is not in Python — it uses a language called Octave — so that might be a little tricky, but what you can do is learn the concepts from the course and then reimplement them in Python, using, say, the scikit-learn library. It's a good course for building intuition: you'll get to know what linear regression is, what logistic regression is, it talks about neural networks a little bit, and it covers support vector machines, unsupervised learning, dimensionality reduction, anomaly detection and recommender systems — a pretty broad set of topics.

Another one that is very practical is mlcourse.ai. I wouldn't say it's sufficient today to do only the Coursera course, especially because it's in Octave rather than Python, so do the Coursera course to get an idea of what machine learning is, what the most popular techniques are and what kinds of problems you can solve with them, and then go on to mlcourse.ai — the Open Machine Learning Course. It is largely a set of videos created by one person, but it is pretty solid, because he has a long history of working in machine learning and participating in Kaggle competitions, and the course is run as a community course organized by Open Data Science (ods.ai). The lectures are all available on YouTube, and all the code and datasets for the course are available on Kaggle. And finally, if you want to dig in further, I also recommend the book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow.
It's published by O'Reilly, the most recent edition was released in 2019, and it covers scikit-learn, Keras and TensorFlow. You can focus on the scikit-learn part for classical machine learning, and then look at the Keras/TensorFlow part for deep learning.

So that's your second project: classical machine learning. I would recommend doing a classical machine learning project before a deep learning project, because even though deep learning is probably the most popular thing in data science right now, the fact of the matter is that if you start working as a data analyst or a machine learning practitioner, you will be working primarily with tabular data and using classical machine learning algorithms, for two reasons. One, for tabular data, and especially slightly smaller datasets, classical machine learning algorithms — particularly things like gradient boosting — often perform just as well as, and sometimes better than, deep learning. Two, explainability and interpretability: your machine learning models will be used somewhere to make decisions or to understand the behaviour of a particular application or website, so you will be asked to explain why the model gives a certain result. In a use case like insurance, before you reject somebody's application, you have to give them a reason why it was rejected, and in such cases using a simple decision tree or a logistic regression might make more sense than a complicated deep learning model, or even a random forest. So keep that in mind — that's why I focus on classical machine learning. Classical machine learning is by no means dead; in fact, it is more widely used today than ever before. Definitely spend some time learning these techniques and doing some projects with them. In fact, to land your first job you may not even need a deep learning project, although it's good to have one.
That brings us to our third project topic: deep learning. We did a full workshop on building a deep learning project from scratch — it's about two and a half hours — and if you simply follow along with all the instructions, it walks through the entire process of a deep learning project. It starts with finding a dataset online (we do that on Kaggle), then understanding and describing the modelling objective: identifying what type of data you're working with and what type of problem it is, doing any cleaning if required, and performing exploratory analysis. Then we look at modelling, where we define a simple network architecture, pick some hyperparameters and train the model; then we evaluate the model, and finally we iterate with different hyperparameters and different regularization techniques. So do check out that workshop. If you want to start a project, you can have the video running in the background, pause it at different steps, work on your project, and then come back and continue — use it as a reference guide.

Now let's look at some example projects in deep learning; you may have seen a lot of these if you've been taking the Zero to GANs course that is currently ongoing. I'll give three or four examples of different kinds.
The most common kind you tend to see is image classification, and here's one: blindness detection using image classification. In courses you'll often see images of everyday objects, but in this case the dataset contains pictures of the retina of the human eye, and what you are required to predict is the severity of diabetic retinopathy on a scale of zero to four: no diabetic retinopathy, mild, moderate, severe, or proliferative DR. Here's how it goes. You create a training set, then you create a test set and a validation set out of the data, and you add some transforms. Since we're working with medical images, it's worth noting that medical imaging is one huge area where deep learning is helping make big advances — there are a lot of roles and opportunities in healthcare and medical imaging, whether it is cell images, images of the eye like these, or even CT scans. The transforms applied to the images here are resizing, a randomized transform like a random horizontal flip, converting to a tensor, and normalizing; then the training and validation sets and data loaders are created.
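A hedged sketch of that kind of transform pipeline using torchvision — the exact image size and normalization statistics here are assumptions, not the values from the original notebook:

import torchvision.transforms as T

train_tfms = T.Compose([
    T.Resize((224, 224)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics, commonly used
                std=[0.229, 0.224, 0.225]),
])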
Then comes defining a model architecture. In this case — I think this project is by Shaunak — he has picked the ResNet-152 architecture. One interesting thing you can do when working with a Kaggle dataset is to check out the Notebooks tab on the dataset page. In those notebooks, try not to just copy the code: look at the code and understand what people are doing, look through three or four notebooks, gather some ideas, and try to implement it on your own. In general, try not to copy-paste code. Even if you have to use somebody else's code, put it on the side and type it out yourself, because when you type out the code you type each variable and each operation, and that forces you to think about it. So at the very least type the code out, or better still, look at the code, understand what it does, and try to replicate it by looking at the documentation, at other notebooks you have written in the past, or at other resources — pick up good ideas, but replicate and understand them.

So here he creates a model and then changes the head of the model: he uses the pre-trained ResNet-152 model, but removes and replaces its final fully connected layer — a technique called transfer learning.
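A minimal transfer-learning sketch along those lines — using a smaller ResNet-34 here for illustration rather than ResNet-152, with 5 output classes matching the 0-4 severity scale described above:

import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)        # weights learned on ImageNet
model.fc = nn.Linear(model.fc.in_features, 5)   # replace the final fully connected layer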
Transfer learning is something we've covered in the Deep Learning with PyTorch course. Then he trains the model — you can see the model being trained and the losses being tracked — and he tries different learning rates and different architectures: ResNet-152 and ResNet-101. You can compare how the loss changes across the different models: on ResNet-101 the validation loss stops at around 0.35, while on ResNet-152 it goes far lower, to about 0.125. So you want to experiment with different model architectures and different hyperparameters. Finally, you make some predictions and verify whether those predictions make sense. This particular blog post does not show individual predictions, but if you're working with images you should always verify on maybe five, ten, fifteen or twenty individual images. And if there is a test set and you need to submit predictions somewhere, generate the predictions for the test set and submit them.

That's roughly the structure of a deep learning project, and it's always important to write a conclusion: summarize your approach, what you've learned, what worked, what did not work, and what some other ideas to try out might be. If you have borrowed code or ideas from different places, also add references — once again, a great thing to do. So that's one deep learning project.
Similarly, you can check out a couple of others. We'll very briefly cover this one, classifying environmental audio recordings, because you do not always have to work with images — audio can actually be turned into images. Here we are working with audio files, which are basically waveforms; this one, for example, seems to be the tick of a clock, and that's what you see in the waveform plot. Using certain transformations and libraries like librosa, you can convert that waveform into a spectrogram-style image, and after applying some normalization it becomes an image. Once you've transformed audio files into images, you can then use the same deep learning techniques we use for images — convolutional neural networks — to classify them.
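A rough sketch of that conversion with librosa — the file name and parameters are placeholders, not taken from the original project:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('clock_tick.wav')                 # hypothetical audio file
S = librosa.feature.melspectrogram(y=y, sr=sr)         # mel spectrogram
S_db = librosa.power_to_db(S, ref=np.max)              # convert power to decibels
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.savefig('clock_tick.png')                          # an image a CNN can classify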
What you end up with is essentially a different image for each audio clip. Looking at one batch of training data, each audio pattern you see belongs to a certain category — one could be a clock ticking, one could be a dog barking, one could be the sound of a bell, and so on — and then you can perform image classification. So again, an interesting project.

Here's another one: generating art using GANs. This is also very interesting. You take some artwork, put it in as the input to a generative adversarial network, and you end up with generated images. I would say the results are pretty impressive — they aren't anything in particular, but they look nice, so it's a good start. In fact, generating art using GANs is an entire area where hundreds of people are trying different things, changing the inputs to GANs to generate interesting pictures, and artists — especially digital artists — are actively using GANs in their work. Do check it out; I won't go through it in a lot of detail, but you can see it's a pretty big blog post.

And finally, here's one more. We've looked at classification, which can be single-label or multi-label, and we've looked at a generative modelling problem, generating art with GANs. You can also do what is essentially regression on images.
Here, instead of predicting which class an image belongs to, you predict specific points or coordinates on the image — in this case, something called pose estimation. For pose estimation, Aditya has used a TensorFlow pose estimation package. Once you know a little bit of deep learning, you can look up what pose estimation is; you may have to read a tutorial or some code on GitHub, but you have all the right skills to figure out what the code means and how it works. You may even have to read a paper, though in most cases you can find a blog post explaining the paper — and even in this case, Aditya used a particular notebook to learn more about pose estimation. In any case, the idea here is not to predict a single class, but to predict a set of key points — numbered zero to seventeen, so eighteen key points — and using those key points you can then estimate the pose of the person in an image. For instance, here are some examples: these are images of cricketers, and you can see all the key points marked out. Using the poses, you can then classify which specific shot they're playing, or whether this is cricket or some other sport. So this is a multi-step problem: you take the images, convert them into pose coordinates, and then run a classification model on the pose coordinates. And sometimes you can use a pre-trained model to come up with the pose coordinates, so you only have to build a simple feedforward neural network for the classification. You can check out his notebook for the details.
This is one other thing you can try: multi-step deep learning problems, where you use one model to convert the data into a particular format — say, coordinates or embeddings — and then use that as the input to another deep learning model, or possibly a classical machine learning model. One other thing we haven't covered here is working with text data, which is where you would use recurrent neural networks and transformers.

Some learning resources for deep learning: obviously we have the course Deep Learning with PyTorch: Zero to GANs. I hope you've done it; if not, you can still sign up, watch all the lectures, do the assignments and build a course project. Another good set of courses is the Deep Learning Specialization on Coursera. This is good if you want to dig deeper into the theoretical side of things — if you want to actually see how the math works and become familiar with the terminology used in deep learning — and it also contains a lot of practical tips for building good deep learning models, so I would recommend it. Another good course is Practical Deep Learning for Coders, also known as the fast.ai course. Now, you don't have to do all of these courses — do one of them, maybe two. If you've done the Zero to GANs course, you can complement it with the Deep Learning Specialization; if you've done fast.ai, you can do the Deep Learning Specialization; and so on. The more important thing is that you should be able to understand most, if not all, of the terminology used in deep learning — you shouldn't feel lost reading a blog post — and you should be able to read code: if someone gives you a link to a GitHub repository in either TensorFlow or PyTorch, you should be able to look into it and understand the code. You only need to learn one framework well — it doesn't have to be TensorFlow or PyTorch specifically, it can be either — and just have enough familiarity with the other that you can understand it, even if you can't write code in it. The book Deep Learning with Python is also pretty good; it is written by François Chollet, the author of the Keras library, which is now part of TensorFlow.
If you want a book for reference, check this one out as well. So that's the third project.

Apart from these, there are a few more projects you can build. One good project is something with SQL — relational databases are very commonly used for storing information, so you will have to work with a SQL database at some point, and if you can work on a mini project, or even just a blog post, where you demonstrate that you know how to write advanced, complex SQL queries to address different use cases, that will be pretty helpful. Another project you can work on is web crawling and web scraping. This essentially means using a library — there's a library called Scrapy — to fetch web pages, download information from them, collect the links on those pages and crawl them in turn. For instance, you could fetch a product page from Amazon and extract information about the product — to actually parse the page you would use a library called Beautiful Soup — and then follow the other products linked from that page, create a database of products in a particular category, and do analysis on that. So you can use web scraping as a technique to generate your own dataset by scraping a website.
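A hedged sketch of fetching a page and parsing it with requests and Beautiful Soup — the URL and the tags being selected are placeholders, since real product pages need site-specific selectors:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/some-product-page')
soup = BeautifulSoup(response.text, 'html.parser')

title = soup.find('h1')                                        # hypothetical: title lives in an <h1>
links = [a['href'] for a in soup.find_all('a', href=True)]    # links you could crawl next
print(title.get_text(strip=True) if title else None, len(links))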
In fact, a lot of the datasets you find on Kaggle have been created in exactly this way. Once again, all of these extra projects are optional.

One more thing you can look into is web development, starting with the Flask framework. Flask is a very simple web framework for Python — the minimal application is just a few lines of code in a Python program. You run it, open it up in a browser, and you can interact with it: if you open the root route, "/", at localhost on whatever port your application is running on, you will see the words "Hello, World!". A sketch of that minimal app is below.
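This is essentially the standard Flask quickstart example:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World!'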
From there, instead of returning a string you can return an HTML page, and you can also learn some HTML, CSS and JavaScript if that interests you and create an entire web application. Now, you may not want to do a full web application project where you're building a whole website or web app. What you can do instead is take a machine learning model that you have created and put it behind a server and a simple user interface — like a simple form. For instance, if it is a flower classification model, you can create a simple web application that allows the user to upload a picture of a flower and then tells them which flower the picture represents. Something simple like that would require roughly 50 to 200 lines of code, Flask, HTML, CSS and JavaScript included, and you can deploy it to platforms like Heroku. Heroku is a simple platform for deploying Python web applications — just search for "getting started on Heroku with Python". It seems complicated at first, but once you spend some time with it, it becomes familiar and it's actually rather simple.

So these are other things you can do. Once again, they are optional but good to have. They will definitely set your profile apart, especially when you're applying for internships or jobs, or reaching out to people on LinkedIn — and we'll talk at some point about the best way to reach out, whether that's applying for jobs, cold emails, or getting referrals.
Yeah, so that pretty much covers it. There are also a few more areas — things like Spark, Hadoop, Hive, NoSQL — that you can explore, but if you're looking to just get into data science right now, I wouldn't say you need to do any of those. The three important projects for you to do are exploratory data analysis, classical machine learning, and deep learning.

And finally, how to find datasets. As I've mentioned several times, Kaggle datasets is a great place to look — just go to kaggle.com/datasets. A couple of tips, since I've mentioned this so many times. One good thing you can do is set a minimum file size: if you're looking for slightly bigger datasets — say, more than a few thousand rows of data — put in a filter of around 10 MB; if you're looking for image datasets, put in a filter of around 100 MB, because each image is on the order of tens of kilobytes, maybe up to a hundred kilobytes, so to get to a thousand images the dataset needs to be at least around 100 MB in size. Another thing you can do is filter by file types like CSV, and you can also search by tags — make use of the filters, they're pretty powerful. One more thing is to sort by most votes; the list is normally sorted by "hottest", but you want the most voted. With some filters and votes applied you can see, for example, that the COVID dataset is about 7 GB, the Bitcoin dataset is about 96 MB, the US accidents dataset is about 300 MB, the Zomato restaurants dataset is 89 MB, and so on. Explore different tags as well — you can search for, say, sales or business-related data, and the tags help too. There are interesting ones like Uber pickups and SF salaries, and just by looking at these datasets I'm sure you're already thinking about the different things you can do with them, whether it is data analysis, visualization, or prediction of some kind. Movie reviews, education statistics around the world — Kaggle is a great place, and there are a few more sources you can check out too. There are enough datasets on the internet, so do use these resources. Okay, so that's where we will end the workshop.

Comments

@subhambijarnia3143

a lot of information for beginners, I appreciate your hard work Aakash.

@shubhampandey648

Top of the top content by Jovian. Thanks alot

@vidulakamat6564

Great information source. Thank you.

@yogeetakhatri4015

thanku for the detail explanation

@janicejose4725

Amazing content. Can u pls tell me a way to use pretrained lstm cnn model I am working on video classification problem

@muhammedshabeel1351

Good explanation sir

@musakhan9779

Thanks sir

@shiva16

golden