Trending December 2023 # Comprehensive Learning Path – Data Science In Python # Suggested January 2024 # Top 17 Popular

You are reading the article Comprehensive Learning Path – Data Science In Python updated in December 2023 on the website We hope that the information we have shared is helpful to you. If you find the content interesting and meaningful, please share it with your friends and continue to follow and support us for the latest updates. Suggested January 2024 Comprehensive Learning Path – Data Science In Python

Journey from a Python noob to a Kaggler on Python

So, you want to become a data scientist or may be you are already one and want to expand your tool repository. You have landed at the right place. The aim of this page is to provide a comprehensive learning path to people new to Python for data science. This path provides a comprehensive overview of steps you need to learn to use Python for data science. If you already have some background, or don’t need all the components, feel free to adapt your own paths and let us know how you made changes in the path.

Reading this in 2023? We have designed an updated learning path for you! Check it out on our courses portal and start your data science journey today.

Step 0: Warming up

Before starting your journey, the first question to answer is:

Why use Python?


How would Python be useful?

Watch the first 30 minutes of this talk from Jeremy, Founder of DataRobot at PyCon 2014, Ukraine to get an idea of how useful Python could be.

Step 1: Setting up your machine

Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download Anaconda from chúng tôi . It comes packaged with most of the things you will need ever. The major downside of taking this route is that you will need to wait for Continuum to update their packages, even when there might be an update available to the underlying libraries. If you are a starter, that should hardly matter.

If you face any challenges in installing, you can find more detailed instructions for various OS here.

Step 2: Learn the basics of Python language

You should start by understanding the basics of the language, libraries and data structure. The free course by Analytics Vidhya on Python is one of the best places to start your journey. This course focuses on how to get started with Python for data science and by the end you should be comfortable with the basic concepts of the language.

Assignment: Take the awesome free Python course by Analytics Vidhya

Alternate resources: If interactive coding is not your style of learning, you can also look at The Google Class for Python. It is a 2 day class series and also covers some of the parts discussed later.

Step 3: Learn Regular Expressions in Python

You will need to use them a lot for data cleansing, especially if you are working on text data. The best way to learn Regular expressions is to go through the Google class and keep this cheat sheet handy.

Assignment: Do the baby names exercise

If you still need more practice, follow this tutorial for text cleaning. It will challenge you on various steps involved in data wrangling.

Step 4: Learn Scientific libraries in Python – NumPy, SciPy, Matplotlib and Pandas

This is where fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.

Practice the NumPy tutorial thoroughly, especially NumPy arrays. This will form a good foundation for things to come.

Next, look at the SciPy tutorials. Go through the introduction and the basics and do the remaining ones basis your needs.

If you guessed Matplotlib tutorials next, you are wrong! They are too comprehensive for our need here. Instead look at this ipython notebook till Line 68 (i.e. till animations)

Finally, let us look at Pandas. Pandas provide DataFrame functionality (like R) for Python. This is also where you should spend good time practicing. Pandas would become the most effective tool for all mid-size data analysis. Start with a short introduction, 10 minutes to pandas. Then move on to a more detailed tutorial on pandas.

You can also look at Exploratory Data Analysis with Pandas and Data munging with Pandas

Additional Resources:

If you need a book on Pandas and NumPy, “Python for Data Analysis by Wes McKinney”

There are a lot of tutorials as part of Pandas documentation. You can have a look at them here

Assignment: Solve this assignment from CS109 course from Harvard.

Step 5: Effective Data Visualization

Go through this lecture form CS109. You can ignore the initial 2 minutes, but what follows after that is awesome! Follow this lecture up with this assignment.

Step 6: Learn Scikit-learn and Machine Learning

Now, we come to the meat of this entire process. Scikit-learn is the most useful library on python for machine learning. Here is a brief overview of the library. Go through lecture 10 to lecture 18 from CS109 course from Harvard. You will go through an overview of machine learning, Supervised learning algorithms like regressions, decision trees, ensemble modeling and non-supervised learning algorithms like clustering. Follow individual lectures with the assignments from those lectures.

You should also check out the ‘Introduction to Data Science‘ course to give yourself a big boost in your quest to land a data scientist role.

Additional Resources:

If there is one book, you must read, it is Programming Collective Intelligence – a classic, but still one of the best books on the subject.

Additionally, you can also follow one of the best courses on Machine Learning course from Yaser Abu-Mostafa. If you need more lucid explanation for the techniques, you can opt for the Machine learning course from Andrew Ng and follow the exercises on Python.

Tutorials on Scikit learn

Step 7: Practice, practice and Practice

Congratulations, you made it!

You now have all what you need in technical skills. It is a matter of practice and what better place to practice than compete with fellow Data Scientists on the DataHack platform. Go, dive into one of the live competitions currently running on DataHack and Kaggle and give all what you have learnt a try!

Step 8: Deep Learning

Now that you have learnt most of machine learning techniques, it is time to give Deep Learning a shot. There is a good chance that you already know what is Deep Learning, but if you still need a brief intro, here it is.

I am myself new to deep learning, so please take these suggestions with a pinch of salt. The most comprehensive resource is chúng tôi You will find everything here – lectures, datasets, challenges, tutorials. You can also try the course from Geoff Hinton a try in a bid to understand the basics of Neural Networks.

Get Started with Python: A Complete Tutorial To Learn Data Science with Python From Scratch

P.S. In case you need to use Big Data libraries, give Pydoop and PyMongo a try. They are not included here as Big Data learning path is an entire topic in itself.

You're reading Comprehensive Learning Path – Data Science In Python

A Complete Python Tutorial To Learn Data Science From Scratch


This article is a complete tutorial to learn data science using python from scratch

It will also help you to learn basic data analysis methods using python

You will also be able to enhance your knowledge of machine learning algorithms


It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn’t take me long to decide – Python was my appetizer.

I always had an inclination for coding. This was the time to do what I really loved. Code. Turned out, coding was actually quite easy!

I learned the basics of Python within a week. And, since then, I’ve not only explored this language to the depth, but also have helped many other to learn this language. Python was originally a general purpose language. But, over the years, with strong community support, this language got dedicated library for data analysis and predictive modeling.

Due to lack of resource on python for data science, I decided to create this tutorial to help many others to learn python faster. In this tutorial, we will take bite sized information about how to use Python for Data Analysis, chew it till we are comfortable and practice it at our own end.

A complete python tutorial from scratch in data science.

Are you a beginner looking for a place to start your journey in data science and machine learning? Presenting a comprehensive course, full of knowledge and data science learning, curated just for you!

You can also check out the ‘Introduction to Data Science‘ course – a comprehensive introduction to the world of data science. It includes modules on Python, Statistics and Predictive Modeling along with multiple practical projects to get your hands dirty.

Basics of Python for Data Analysis Why learn Python for data analysis?

Python has gathered a lot of interest recently as a choice of language for data analysis. I had basics of Python some time back. Here are some reasons which go in favour of learning Python:

Open Source – free to install

Awesome online community

Very easy to learn

Can become a common language for data science and production of web based analytics products.

Needless to say, it still has few drawbacks too:

It is an interpreted language rather than compiled language – hence might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4

This is one of the most debated topics in Python. You will invariably cross paths with it, specially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your need to use. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?

Awesome community support! This is something you’d need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.

Plethora of third-party libraries! Though many libraries have provided 3.x support but still a large number of modules work only on 2.x versions. If you plan to use Python for specific applications like web-development with high reliance on external modules, you might be better off with 2.7.

Some of the features of 3.x versions have backward compatibility and can work with 2.7 version.

Why Python 3.4?

Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.

It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for past 5 years and will continue the same.

There is no clear winner but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!

How to install Python?

There are 2 approaches to install Python:

Download Python

You can download Python directly from its project site and install individual components and libraries you want

Install Package

Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.

Second method provides a hassle free installation and hence I’ll recommend that to beginners

The imitation of this approach is you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter until and unless, until and unless, you are doing cutting edge statistical research.

Choosing a development environment

Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:

Terminal / Shell based

IDLE (default environment)

iPython notebook – similar to markdown in R

IDLE editor for Python

While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good features for documenting while writing the code itself and you can choose to run the code in blocks (rather than the line by line execution)

We will use iPython environment for this complete tutorial.

Warming up: Running your first Python program

You can use Python as a simple calculator to start with:

Few things to note

You can start iPython notebook by writing “ipython notebook” on your terminal / cmd, depending on the OS you are working on

The interface shows In [*] for inputs and Out[*] for output.

You can execute a code by pressing “Shift + Enter” or “ALT + Enter”, if you want to insert an additional row after.

Before we deep dive into problem solving, lets take a step back and understand the basics of Python. As we know that data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let’s take a look at some of these.

Python libraries and Data Structures Python Data Structures

Following are some data structures, which are used in Python. You should be familiar with them in order to use them as appropriate.

Lists – Lists are one of the most versatile data structure in Python. A list can simply be defined by writing a list of comma separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable and individual elements of a list can be changed.

Here is a quick example to define a list and then access it:

Strings – Strings can simply be defined by use of single ( ‘ ), double ( ” ) or triple ( ”’ ) inverted commas. Strings enclosed in tripe quotes ( ”’ ) can span over multiple lines and are used frequently in docstrings (Python’s way of documenting functions). is used as an escape character. Please note that Python strings are immutable, so you can not change part of strings.

Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.

Since Tuples are immutable and can not change, they are faster in processing as compared to lists. Hence, if your list is unlikely to change, you should use tuples, instead of lists.

Dictionary – Dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.

Python Iteration and Conditional Constructs

Like most languages, Python also has a FOR-loop which is the most widely used method for iteration. It has a simple syntax:

for i in [Python Iterable]: expression(i) fact=1 for i in range(1,N+1): fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with following syntax:

if [condition]: __execution if true__ else: __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N%2 == 0: print ('Even') else: print ('Odd')

Now that you are familiar with Python fundamentals, let’s take a step further. What if you have to perform the following tasks:

Multiply 2 matrices

Find the root of a quadratic equation

Plot bar charts and histograms

Make statistical models

Access web-pages

If you try to write code from scratch, its going to be a nightmare and you won’t stay on Python for more than 2 days! But lets not worry about that. Thankfully, there are many libraries with predefined which we can directly import into our code and make our life easy.

For example, consider the factorial example we just saw. We can do that in a single step as:


Off-course we need to import the math library for that. Lets explore the various libraries next.

Python Libraries

Lets take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m from math import *

In the first manner, we have defined an alias m to library math. We can now use various functions from math library (e.g. factorial) by referencing it using the alias m.factorial().

In the second manner, you have imported the entire name space in math i.e. you can directly use factorial() without referring to math.

Tip: Google recommends that you use first style of importing libraries, as you will know where the functions have come from.

Following are a list of libraries, you will need for any scientific computations and data analysis:

SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful library for variety of high level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.

Matplotlib for plotting vast variety of graphs, starting from histograms to line plots to heat plots.. You can use Pylab feature in ipython notebook (ipython notebook –pylab = inline) to use these plotting features inline. If you ignore the inline option, then pylab converts ipython environment to an environment, very similar to Matlab. You can also use Latex commands to add math to your plot.

Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas were added relatively recently to Python and have been instrumental in boosting Python’s usage in data scientist community.

Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.

Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.

Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to generate elegant and concise graphics in the style of chúng tôi Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.

Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.

Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web-pages within the website to gather information.

SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.

Requests for accessing the web. It works similar to the the standard python library urllib2 but is much easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more convenient.

Additional libraries, you might need:

os for Operating system and file operations

networkx and igraph for graph based data manipulations

regular expressions for finding patterns in text data

BeautifulSoup for scrapping web. It is inferior to Scrapy as it will extract information from just a single webpage in a run.

Now that we are familiar with Python fundamentals and additional libraries, lets take a deep dive into problem solving through Python. Yes I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:

Data Exploration – finding out more about the data we have

Data Munging – cleaning the data and playing with it to make it better suit statistical modeling

Predictive Modeling – running the actual algorithms and having fun 🙂

Exploratory analysis in Python using Pandas

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas

Image Source: Wikipedia

Pandas is one of the most useful data analysis library in Python (I know these names sounds weird, but hang on!). They have been instrumental in increasing the use of Python in data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.

Before loading the data, lets understand the 2 key data structures in Pandas – Series and DataFrames

Introduction to Series and Dataframes

Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.

A dataframe is similar to Excel workbook – you have column names referring to columns and you have rows, which can be accessed with use of row numbers. The essential difference being that column names and row numbers are known as column and row index, in case of dataframes.

Series and dataframes form the core data model for Pandas in Python. The data sets are first to read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to its columns.

More: 10 Minutes to Pandas

Practice data set – Loan Prediction Problem

You can download the dataset from here. Here is the description of the variables:

VARIABLE DESCRIPTIONS: Variable Description Loan_ID Unique Loan ID Gender Male/ Female Married Applicant married (Y/N) Dependents Number of dependents Education Applicant Education (Graduate/ Under Graduate) Self_Employed Self employed (Y/N) ApplicantIncome Applicant income CoapplicantIncome Coapplicant income LoanAmount Loan amount in thousands Loan_Amount_Term Term of loan in months Credit_History credit history meets guidelines Property_Area Urban/ Semi Urban/ Rural Loan_Status Loan approved (Y/N) Let’s begin with the exploration

To begin, start iPython interface in Inline Pylab mode by typing following on your terminal/windows command prompt:

ipython notebook --pylab=inline

This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):


I am currently working in Linux, and have stored the dataset in the following location:

Importing libraries and the data set:

Following are the libraries we will use during this tutorial:




Please note that you do not need to import matplotlib and numpy because of Pylab environment. I have still kept them in the code, in case you use the code in a different environment.

After importing the library, you read the dataset using function read_csv(). This is how the code looks like till this stage:

import pandas as pd import numpy as np import matplotlib as plt %matplotlib inline Quick Data Exploration

Once you have read the dataset, you can have a look at few top rows by using the function head()


This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.

Next, you can look at summary of numerical fields by using describe() function


describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)

Here are a few inferences, you can draw by looking at the output of describe() function:

LoanAmount has (614 – 592) 22 missing values.

Loan_Amount_Term has (614 – 600) 14 missing values.

Credit_History has (614 – 564) 50 missing values.

We can also look that about 84% applicants have a credit_history. How? The mean of Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)

The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.

For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distribution to understand whether they make sense or not. The frequency table can be printed by following command:


Similarly, we can look at unique values of port of credit history. Note that dfname[‘column_name’] is a basic indexing technique to acess a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the “10 Minutes to Pandas” resource shared above.

Distribution analysis

Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount

Lets start by plotting the histogram of ApplicantIncome using the following commands:


Here we observe that there are few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.

Next, we look at box plots to understand the distributions. Box plot for fare can be plotted by:


This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by = 'Education')

We can see that there is no substantial different between the mean income of graduate and non-graduates. But there are a higher number of graduates with very high incomes, which are appearing to be the outliers.

Now, Let’s look at the histogram and boxplot of LoanAmount using the following command:


Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing and well as extreme values values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.

Categorical variable analysis

Now that we understand distributions for ApplicantIncome and LoanIncome, let us understand categorical variables in more details. We will use Excel style pivot table and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:

Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting loan.

Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.

temp1 = df['Credit_History'].value_counts(ascending=True) temp2 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x:{'Y':1,'N':0}).mean()) print ('Frequency Table for Credit History:') print (temp1) print ('nProbility of getting loan for each Credit History class:') print (temp2)

Now we can observe that we get a similar pivot_table like the MS Excel one. This can be plotted as a bar chart using the “matplotlib” library with following code:

import matplotlib.pyplot as plt fig = plt.figure(figsize=(8,4)) ax1 = fig.add_subplot(121) ax1.set_xlabel('Credit_History') ax1.set_ylabel('Count of Applicants') ax1.set_title("Applicants by Credit_History") temp1.plot(kind='bar') ax2 = fig.add_subplot(122) temp2.plot(kind = 'bar') ax2.set_xlabel('Credit_History') ax2.set_ylabel('Probability of getting loan') ax2.set_title("Probability of getting loan by credit history")

This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.

Alternately, these two plots can also be visualized by combining them in a stacked chart::

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status']) temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)

You can also add gender into the mix (similar to the pivot table in Excel):

If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while other on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.

We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now – given the amount of help, the library can provide you in analyzing datasets.

Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

Data Munging in Python : Using Pandas Data munging – recap of the need

While our exploration of the data, we found a few problems in the data set, which needs to be solved before the data is ready for a good model. This exercise is typically referred as “Data Munging”. Here are the problems, we are already aware of:

There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.

While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, but should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields i.e. Gender, Property_Area, Married, Education and Dependents to see, if they contain any useful information.

If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.

Check missing values in the dataset

Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset

df.apply(lambda x: sum(x.isnull()),axis=0)

This command should tell us the number of missing values in each column as isnull() returns 1, if the value is null.

Though the missing values are not very high in number, but many variables have them and each one of these should be estimated and added in the data. Get a detailed view on different imputation techniques through this article.

Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it makes sense or would you consider that missing? I suppose your answer is missing and you’re right. So we should check for values which are unpractical.

How to fill missing values in LoanAmount?

There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by following code:

df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables and then use age along with other variables to predict survival.

Since, the purpose now is to bring out the steps in data munging, I’ll rather take an approach, which lies some where in between these 2 extremes. A key hypothesis is that the whether a person is educated or self-employed can combine to give a good estimate of loan amount.

First, let’s look at the boxplot to see if a trend exists:

Thus we see some variations in the median of loan amount for each group and this can be used to impute the values. But first, we have to ensure that each of Self_Employed and Education variables should not have a missing values.

As we say earlier, Self_Employed has some missing values. Let’s look at the frequency table:

Since ~86% values are “No”, it is safe to impute the missing values as “No” as there is a high probability of success. This can be done using the following code:


Now, we will create a Pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount:

table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median) # Define function to return value of this pivot_table def fage(x): return table.loc[x['Self_Employed'],x['Education']] # Replace missing values df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

This should provide you a good way to impute missing values of loan amount.

NOTE : This method will work only if you have not filled the missing values in Loan_Amount variable using the previous approach, i.e. using mean.

How to treat for extreme values in distribution of LoanAmount and ApplicantIncome?

Let’s analyze LoanAmount first. Since the extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to nullify their effect:

df['LoanAmount_log'] = np.log(df['LoanAmount']) df['LoanAmount_log'].hist(bins=20)

Now the distribution looks much closer to normal and effect of extreme values has been significantly subsided.

Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong support Co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome'] df['TotalIncome_log'] = np.log(df['TotalIncome']) df['LoanAmount_log'].hist(bins=20)

Now we see that the distribution is much better than before. I will leave it upto you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term, Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back his loan.

Next, we will look at making predictive models.

Building a Predictive Model in Python

After, we have made the data useful for modeling, let’s now look at the python code to create a predictive model on our data set. Skicit-Learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail. I encourage you to get a refresher on sklearn through this article.

Since, sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. Before that we will fill all the missing values in the dataset. This can be done using the following code:

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True) df['Married'].fillna(df['Married'].mode()[0], inplace=True) df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True) df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True) df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True) from sklearn.preprocessing import LabelEncoder var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status'] le = LabelEncoder() for i in var_mod: df[i] = le.fit_transform(df[i]) df.dtypes

Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to this article for getting details of the algorithms with R and Python codes. Also, it’ll be good to get a refresher on cross-validation through this article, as it is a very important measure of power performance.

#Import models from scikit learn module: from sklearn.linear_model import LogisticRegression from sklearn.cross_validation import KFold #For K-fold cross validation from sklearn.ensemble import RandomForestClassifier from chúng tôi import DecisionTreeClassifier, export_graphviz from sklearn import metrics #Generic function for making a classification model and accessing performance: def classification_model(model, data, predictors, outcome): #Fit the model:[predictors],data[outcome]) #Make predictions on training set: predictions = model.predict(data[predictors]) #Print accuracy accuracy = metrics.accuracy_score(predictions,data[outcome]) print ("Accuracy : %s" % "{0:.3%}".format(accuracy)) #Perform k-fold cross-validation with 5 folds kf = KFold(data.shape[0], n_folds=5) error = [] for train, test in kf: # Filter training data train_predictors = (data[predictors].iloc[train,:]) # The target we're using to train the algorithm. train_target = data[outcome].iloc[train]    # Training the algorithm using the predictors and target., train_target)    #Record error from each cross-validation run    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test])) print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error))) #Fit the model again so that it can be refered outside the function:[predictors],data[outcome]) Logistic Regression

Let’s make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well. Read more about Logistic Regression.

We can easily make some intuitive hypothesis to set the ball rolling. The chances of getting a loan will be higher for:

Applicants having a credit history (remember we observed this in exploration?)

Applicants with higher applicant and co-applicant incomes

Applicants with higher education level

Properties in urban areas with high growth perspectives

So let’s make our first model with ‘Credit_History’.

outcome_var = 'Loan_Status' model = LogisticRegression() predictor_var = ['Credit_History'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%

#We can try different combination of variables: predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%

Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the mode. We have two options now:

Feature Engineering: dereive new information and try to predict those. I will leave this to your creativity.

Better modeling techniques. Let’s explore this next.

Decision Tree

Decision tree is another method for making a predictive model. It is known to provide higher accuracy than logistic regression model. Read more about Decision Trees.

model = DecisionTreeClassifier() predictor_var = ['Credit_History','Gender','Married','Education'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 81.930% Cross-Validation Score : 76.656%

Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let’s try a few numerical variables:

#We can try different combination of variables: predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 92.345% Cross-Validation Score : 71.009%

Here we observed that although the accuracy went up on adding variables, the cross-validation error went down. This is the result of model over-fitting the data. Let’s try an even more sophisticated algorithm and see if it helps:

Random Forest

Random forest is another algorithm for solving the classification problem. Read more about Random Forest.

model = RandomForestClassifier(n_estimators=100) predictor_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'LoanAmount_log','TotalIncome_log'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 100.000% Cross-Validation Score : 78.179%

Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

Reducing the number of predictors

Tuning the model parameters

Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.

#Create a series with feature importances: featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False) print (featimp)

Let’s use the top 5 variables for creating a model. Also, we will modify the parameters of random forest model a little bit:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1) predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area'] classification_model(model, df,predictor_var,outcome_var)

Accuracy : 82.899% Cross-Validation Score : 81.461%

Notice that although accuracy reduced, but the cross-validation score is improving showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.

You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:

Using a more sophisticated model does not guarantee better results.

Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting thus making your models less interpretable

Feature Engineering is the key to success. Everyone can use an Xgboost models but the real art and creativity lies in enhancing your features to better suit the model.

You can access the dataset and problem statement used in this post at this link: Loan Prediction Challenge


Now, its time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Accelerate your data science journey with the following Practice Problems:

Frequently Asked Questions

Q1. How to learn python programming?

A. To learn Python programming, you can start by familiarizing yourself with the language’s syntax, data types, control structures, functions, and modules. You can then practice coding by solving problems and building projects. Joining online communities, attending workshops, and taking online courses can also help you learn Python. With regular practice, persistence, and a willingness to learn, you can become proficient in Python and start developing software applications.

Q2. Why Python is used?

A. Python is used for a wide range of applications, including web development, data analysis, scientific computing, machine learning, artificial intelligence, and automation. Python is a high-level, interpreted, and dynamically-typed language that offers ease of use, readability, and flexibility. Its vast library of modules and packages makes it a popular choice for developers looking to create powerful, efficient, and scalable software applications. Python’s popularity and versatility have made it one of the most widely used programming languages in the world today.

Q3. What are the 4 basics of Python?

A. The four basics of Python are variables, data types, control structures, and functions. Variables are used to store values, data types define the type of data that can be stored, control structures dictate the flow of execution, and functions are reusable blocks of code. Understanding these four basics is essential for learning Python programming and developing software applications.

Q4. Can I teach myself Python?

A. Yes, you can teach yourself Python. Start by learning the basics and practicing coding regularly. Join online communities to get help and collaborate on projects. Building projects is a great way to apply your knowledge and develop your skills. Remember to be persistent, learn from mistakes, and keep practicing.

End Notes

I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today.

You should also check out our free Python course and then jump over to learn how to apply it for Data Science.

Python is really a great tool and is becoming an increasingly popular language among the data scientists. The reason being, it’s easy to learn, integrates well with other databases and tools like Spark and Hadoop. Majorly, it has the great computational intensity and has powerful data analytics libraries.

So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and finally making predictions.

Note – The discussions of this article are going on at AV’s Discuss portal. Join here! If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.


The 25 Best Data Science And Machine Learning Github Repositories From 2023


What’s the best platform for hosting your code, collaborating with team members, and also acts as an online resume to showcase your coding skills? Ask any data scientist, and they’ll point you towards GitHub. It has been a truly revolutionary platform in recent years and has changed the landscape of how we host and even do coding.

But that’s not all. It acts as a learning tool as well. How, you ask? I’ll give you a hint – open source!

The world’s leading tech companies open source their projects on GitHub by releasing the code behind their popular algorithms. 2023 saw a huge spike in such releases, with the likes of Google and Facebook leading the way. The best part about these releases is that the researchers behind the code also provide pretrained models so folks like you and I don’t have to waste time building difficult models from scratch.

2023 was a transcendent one in a lot of data science sub-fields, as we will shortly see. Natural Language Processing (NLP) was easily the most talked about domain within the community with the likes of ULMFiT and BERT being open-sourced. In my quest to bring the best to our awesome community, I ran a monthly series throughout the year where I hand-picked the top 5 projects every data scientist should know about. You can check out the entire collection below:

There will be some overlap here with my article covering the biggest breakthroughs in AI and ML in 2023. Do check out that article as well – it is essentially a list of all the major developments I feel everyone in this field needs to know about. As a bonus, there are predictions from experts as well – not something you want to miss. 🙂

Topics we will cover in this article

Tools and Frameworks

Computer Vision

Generative Adversarial Networks (GANs)

Other Deep Learning Projects

Natural Language Processing (NLP)

Automated Machine Learning (AutoML)

Reinforcement Learning

Tools and Frameworks

Let’s get the ball rolling with a look at the top projects in terms of tools, libraries and frameworks. Since we are speaking about a software repository platform, it feels right to open things with this section.

How about all you .NET developers wanting to learn a bit of machine learning to complement your existing skills? Here’s the perfect repository to get that idea started! chúng tôi a Microsoft project, is an open-source machine learning framework that allows you design and develop models in .NET.

You can even integrate existing ML models into your application, all without requiring explicit knowledge of how ML models are developed. chúng tôi is actually used in multiple Microsoft products, like Windows, Bing Search, MS Office, among others.

ML.NET runs on Windows, Linux and MacOS.

Machine learning in the browser! A fictional thought a few years back, a stunning reality now. A lot of us in this field are welded to our favorite IDEs, but chúng tôi has the potential to change your habits. It’s become a very popular release since it’s release earlier this year and continues to amaze with its flexibility.

As the repository states, there are primarily three major features of TensorFlow.js:

Develop machine learning and deep learning models in your browser itself

Run pre-existing TensorFlow models within the browser

Retrain or fine-tune these pre-existing models as well

If you’re familiar with Keras, the high-level layers API will seem quite familiar. There are plenty of examples available on the GitHub repository, so check those out to quicken your learning curve.

What a year it has been for PyTorch. It has won the hearts and now projects of data scientists and ML researchers around the globe. It is easy to grasp, flexible, and is already being implemented across high profile researches (as you’ll see later in this article). The latest version (v1.0) already powers many Facebook products and services at scale, including performing 6 billion text translations a day. If you’ve been wondering when to start dabbling with PyTorch, the time is NOW.

If you’re new to this field, ensure you check out Faizan Shaikh’s guide to getting started with PyTorch.

While not strictly a tool or framework, this repository is a gold mine for all data scientists. Most of us struggle with reading through a paper and then implementing it (at least I do). There are a lot of moving parts that don’t seem to work on our machines.

And that’s where ‘Papers with Code’ comes in. As the name suggests, they have a code implementation of all the major papers that have been released in the last 6 years or so. It is a mind-blowing collection that you will find yourself fawning over. They have even added code from papers presented at NIPS (NeurIPS) 2023. Get yourself over there now!

Computer Vision

Thanks to falling computational costs and a surge of breakthroughs from the top researchers (something tells me those two might be linked), deep learning is accessible to more people than ever before. And within deep learning, computer vision projects are ubiquitous – most of the repositories you’ll see in this section will cover one computer vision technique or another.

It is simply the hottest field in deep learning right now and will continue to be so for the foreseeable future. Whether it’s object detection or pose estimation, there’s a repository for seemingly all computer vision tasks. Never a better time to get acquainted with these developments – a lot of job openings might come your way soon.

Detectron made a HUGE splash when it was launched in early 2023. Developed by Facebook’s AI Research team (FAIR), it implements state-of-the-art object detection frameworks. It is (surprise, surprise) written in Python and has helped enable multiple projects, including DensePose (which we will talk about soon).

This repository contains the code and over 70 pretrained models. Too good an opportunity pass up, would’t you agree?

Object detection in images is awesome, but what about doing it in videos? And not just that, can we extend this concept and translate the style of one video to another? Yes, we can! It is a really cool concept and NVIDIA have been generous enough to release the PyTorch implementation for you to play around with.

The repository contains videos of how the technique looks, the full research paper, and of course the code. The Cityscapes dataset, available publicly post registration, is used in NVIDIA’s examples. One of my favorite projects from 2023.

Training a deep learning model in 18 minutes? While not having access to high-end computational resources? Believe me, it’s already been done.’s Jeremy Howard and his team of students built a model on the popular ImageNet dataset that even outperformed Google’s approach.

I encourage you to at least go through this project to get a sense of how these researchers structured their code. Not everyone has access to multiple GPUs (or even one) so this was quite a win for the minnows.

Another research paper collection repository! It’s always helpful to know how your subject of choice has evolved over a span of multiple years, and this one-stop shop will help you do just that for object detection. It’s a comprehensive collection of papers from 2014 till date, and even include code wherever possible.

Let’s turn our attention to the field of pose detection. I came across this concept this year itself and have been fascinated with it ever since. That above image captures the essence of this repository – dense human pose estimation in the wild.

The code to train and evaluate your own DensePose-RCNN model is included here. There are notebooks available as well to visualize the DensePose COCO dataset. Pretty good place to kick off your pose estimation learning.

The above image (taken from a video) really piqued my interest. I covered the release of the research paper back in August and have continued to be in awe of this technique. This technique enables us to transfer the motion between human objects in different videos. The video I mentioned is available within the repository – it will blow your mind!

This repository further contains the PyTorch implementation of this approach. The amount of intricate details this approach is capable of picking up and replicating is incredible.


I’m sure most of you must have come across a GAN application (even if you perhaps didn’t realize it at the time). GANs, or Generative Adversarial Networks, were introduced by Ian Goodfellow back in 2014 and have caught fire since. They specilize in performing creative tasks, especially artistic ones. Check out this amazing introductory guide by Faizan Shaikh to the world of GANs, along with an implementation in Python.

We saw a plethora of GAN based projects in 2023 and hence I wanted to create a separate section for this.

Let’s start off with one of my favorites. I want you to take a moment to just admire the above images. Can you tell which one was done by a human and which one by a machine? I certainly couldn’t. Here, the first frame is the input image (original) and the third frame has been generated by this technique.

Amazing, right? The algorithm adds an external object of your choosing to any image and manages to make it look like nothing touched it. Make sure you check out the code and try to implement it on a different set of images yourself. It’s really, really fun.

What if I gave you an image and asked you to extend the boundaries by imagining what it would look like when the entire scene was captured? You would understandably turn to some image editing software. But here’s the awesome news – you can achieve it in a few lines of code!

This project is a Keras implementation of Stanford’s Image Outpainting paper (incredibly cool and illustrated paper – this is how most research papers should be!). You can either build a model from scratch or use the one provided by this repository’s author. Deep learning wonders never cease to amaze.

If you haven’t got a handle on GANs yet, try out this project. Pioneered by researchers from MIT’s CSAIL division, it helped you visualize and understand GANs. You can explore what your GAN model has learned by inspecting and manipulating it’s neurons.

I would like to point you towards the official MIT project page, which has plenty of resources to get you familiar with the concept, including a video demo.

This algorithm enables you to change the facial expression of any person in an image. It’s as exciting as it is concerning. The images above inside the green border at the originals, the rest have been generated by GANimation.

The link contains a beginner’s guide, data preparation resources, prerequisites, and the Python code. As the author mentioned, do NOT use it for immoral purposes.

This project is quite similar to the Deep Painterly Harmonization one we saw earlier. But it deserved a mention given it came from NVIDIA themselves. As you can see in the image above, the FastPhotoStyle algorithm requires two inputs – a style photo and a content photo. The algorithm then works in one of two ways to generate the output – it either uses photorealistic image stylization code or uses semantic label maps.

Other Deep Learning Projects

The computer vision field has the potential to overshadow other work in deep learning but I wanted to highlight a few projects outside it.

Audio processing is another field where deep learning has started to make it’s mark. It’s not just limited to generating music, you can do tasks like audio classification, fingerprinting, segmentation, tagging, etc. There is a lot that’s still yet to be explored and who knows, perhaps you could use these projects to pioneer your way to the top.

Here are two intuitive articles to help you get acquainted with this line of work:

And here comes NVIDIA again. WaveGlow is a flow-based network capable of generating really high quality audio. It is essentially a single network for speech synthesis.

This repository includes a PyTorch implementation of WaveGlow along with a pre-trained model which you can download. The researchers have also listed down the steps you can follow if you want to train your own model from scratch.

Want to discover your own planet? That might perhaps be overstating things a bit, but this AstroNet repository will definitely get you close. The Google Brain team discovered two new planets in December 2023 by applying AstroNet. It’s a deep neural network meant for working with astronomical data. It goes to show the far-ranging applications of machine learning and was a truly monumental development.

And now the team behind the technology has open sourced the entire code (hint: the model is based on CNNs!) that powers AstroNet.

Who doesn’t love visualizations? But it can get a tad bit intimidating to imagine how a deep learning model works – there are too many moving parts involved. But VisualDL does a great job mitigating those challenges by designing specific deep learning jobs.

VisualDL currently supports the below components for visualizing jobs (you can see examples of each in the repository):






high dimensional

Natural Language Processing (NLP)

Surprised to see NLP so down in this list? That’s primarily because I covered almost all the major open source releases in this article. I highly recommend checking out that list to stay on top of your NLP game. The frameworks I have mentioned here include ULMFiT, Google’s BERT, ELMo, and Facebook’s PyText. I will briefly mention BERT and a couple of other respositories here as I found them very helpful.

I couldn’t possibly let this section pass by without mentioning BERT. Google AI’s release has smashed records on it’s way to winning the hearts of NLP enthusiasts and experts alike. Following ULMFiT and ELMo, BERT really blew away the competition with it’s performance. It obtained state-of-the-art results on 11 NLP tasks.

Apart from the official Google repository I have linked to above, a PyTorch implementation of BERT is worth checking out. Whether it marks a new era of not in NLP we will soon find out.

It often helps to know how well your model is performing against a certain benchmark. For NLP, and specifically deep text matching models, I have found the MatchZoo toolkit quite reliable. Potential tasks related to MatchZoo include:


Question Answer

Textual Entailment

Information Retrieval

Paraphrase Identification

MatchZoo 2.0 is currently under development so expect to see a lot more being added to this already useful toolkit.

This repository was created by none other than Sebastian Ruder. The aim of this project is to track the latest progress in NLP. This includes both datasets and state-of-the-art models.

Automated Machine Learning (AutoML)

What an year for AutoML. With industries look to integrate machine learning into their core mission, the need to data science specialists continues to grow. There is currently a massive gap between the demand and the supply. This gap could potentially be filled by AutoML tools.

These tools are designed for those people who do not have data science expertise. While there are certainly some incredible tools out there, most of them are priced significantly higher than most individuals can afford. So our amazing open source community came to the rescue in 2023, with two high profile releases.

This made quite a splash upon it’s release a few months ago. And why wouldn’t it? Deep learning has been long considered a very specialist field, so a library that can automate most tasks came as a welcome sign. Quoting from their official site, “The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background”.

You can install this library from pip:

pip install autokeras

The repository contains a simple example to give you a sense of how the whole thing works. You’re welcome, deep learning enthusiasts. 🙂

AdaNet is a framework for automatically learning high-quality models without requiring programming expertise. Since it’s a Google invention, the framework is based on TensorFlow. You can build ensemble models using AdaNet, and even extend it’s use to training a neural network.

The GitHub page contains the code, an example, the API documentation, and other things to get your hands dirty. Trust me, AutoML is the next big thing in our field.

Reinforcement Learning

Since I already covered a few reinforcement learning releases in my 2023 overview article, I will keep this section fairly brief. My hope in including a RL section where I can is to foster a discussion among our community and to hopefully accelerate research in this field.

First, make sure you check out OpenAI’s Spinning Up repository, an exhaustive educational resource for beginners. Then head over to Google’s Dopamine page. It is a research framework for accelerating research in this still nascent field. Now let’s look at a couple of other resources as well.

If you follow a few researchers on social media, you must have come across the above images in video form. A stick human running across a terrain, or trying to stand up, or some such sort. That, dear reader, is reinforcement learning in action.

Here is a signature example of it – a framework to train a simulated humanoid to imitate multiple motion skills. You can get the code, examples, and a step-by-step run-through on the above link.

This repository is a collection of reinforcement learning algorithms from Richard Sutton and Andrew Barto’s book and other research papers. These algorithms are presented in the form of Python notebooks.

As the author of this repo mentioned, you will only truly learn if you implement the learning as you go along. It’s a complex topic, and giving up or reading the resources like a storybook will lead you nowhere.

End Notes

And that bring us to the end of our journey for 2023. What a year! It was a joyful ride putting this article together and I learned a lot of new stuff along the way.


Data Science Vs Big Data: Key Differences

Data Science vs BigData: The key difference is in areas of focus, data size, tools, technologies used, and applications

Data Science and Big data are two interrelated concepts that have gained significant importance in recent years. Data science vs Big data is a trending topic. In the data analytics field, both play a vital role in leveraging data for decision-making, innovation, and gaining a competitive edge in today’s data-driven world.

The growth trend in the data segment of the industry suggests that data science and Big data analytics are the future. Data Science and Big data are two related but distinct concepts in the data analytics field. Data Science focuses on the application of statistical and machine learning techniques to extract insights from data and solve complex problems. It encompasses data acquisition, cleaning, exploration, and interpretation. Whereas, Big data refers to large, complex datasets that exceed the capacity of traditional data processing methods. Applications are in real-time processing and analysis fields like fraud detection, sentiment analysis, internet traffic analysis, etc.

Let’s delve into the key differences between Data Science and Big Data: Key Concept and Characteristics

Data Science is a multidisciplinary field combining scientific methods, algorithms, and systems for extracting valuable insights from structured and unstructured data. It emphasizes the use of data as the primary resource for analysis, decision-making, etc. To do so, they employ statistical techniques and ML algorithms. These data analysis techniques aim to solve real-world problems.

Scope and Methodology

Data science includes statistical analysis, ML, data visualization, and exploratory data analysis. These are employed to understand the patterns of data, make predictions and solve problems.

In big data, large datasets are handled using technologies and infrastructure. It involves distributed storage and processing frameworks like Hadoop and Spark. To manage vast volumes and high velocities of data, it enables parallel processing, scalability, etc.


The primary goal of data science is to gain insights, extract valuable knowledge, and solve complex problems using data.

The main objective of big data is to store, process and analyze massive volumes of data efficiently.


Data Science is extensively used in business intelligence to analyze customer behavior, market trends, and sales data. In healthcare, it plays a crucial role in analyzing patient data for diagnosing diseases and treatment outcome prediction. It also aids in clinical decision support, personalized medicine, and identifying patterns for disease outbreaks. Data science is utilized in financial institutions for fraud detection, risk modeling, algorithm trading, and making informed investment decisions. They are applied to analyze the human language that enables applications like chatbots, voice assistants, and machine translation.

Big data analyze customer preference, behavior, and purchasing patterns to improve product recommendation, inventory management, pricing strategies, and personalized marketing campaigns. It handles massive amounts of data generated by IoT devices such as wearables and sensors. These technologies are employed to analyze social media data including user interactions, sentiment analysis, and trending topics.


Data science helps organizations to make informed decisions by extracting meaningful insights from data. This is done through statistical analysis, ML techniques, and data visualization techniques. The wide range of applications including in finance, healthcare, business, etc. Efficient data management and analysis in data science offer significant cost savings.

Data science requires skilled professionals in the field. Due to the need for preprocessing and data cleaning, this technique is time-consuming and needs more resources. Since it deals with sensitive data, ethical concerns may be a problem.

Big data need skill and expertise in the field. Security and privacy are a concern when handling sensitive data. It can sometimes be expensive due to the need for specialized infrastructure and software.


Data science uses tools like Apche Hadoop, DataRovit, Tableau, QlikView, Microsoft HD Insights, TensorFlow, Jupyter Notebooks to effectively handle and analyze huge data.

Data Science Vs. Decision Science: What’s The Difference?

Data science Vs decision science: A primer on what makes them unique and what unites them

Data science and decision science are two different yet closely related disciplines. For a company or enterprise with a number of operational areas, certain characteristics overlap between them. This exactly creates confusion among aspiring data scientists and decision scientists. They take the functional areas of either fields for granted only to repent later. Having similar-sounding names and common expertise areas, it is not uncommon that they are mistaken for one another. Decision science and data science are two-data driven fields that have risen to prominence in the past few decades and therefore it is equally important to understand the differences by companies and individuals equally. While data science is all about providing insights, decision science is about putting those insights into an application to achieve better outputs. Let’s deep dive into the data science Vs decision science question.

What is Data Science?

Data science essentially involves processing large chunks of data using various mathematical, statistical, and analytical tools and machine learning algorithms. Data scientists interpret data for a better understanding of underlying data from a gigantic number of transactions. The ultimate goal is to provide actionable insights for the decision-makers to choose a course of action. Technically speaking, they write complex algorithms and build statistical models. Data science applications broadly lie in the area of finance, banking, healthcare, e-commerce, education, manufacturing etc.

What is Decision Science?

Decision science is about sketching the optimal strategy to solve the problem at hand with the insights provided. In a way, it involves taking a 360-degree view of the business challenge by taking into account the type of analysis, visualization methods, behavioral understanding, and feasibility of the strategy. Technically speaking a decision scientist applies complex quantitative data-driven methods combined with cognitive science and managerial capabilities. As decision science involves taking course-changing steps, it is mainly applied to public healthcare and policy, law and education, military science, environmental regulation, business, and management.

Key Comparative differences between Data Science and Decision Science View on Data:

For data scientists, data is a tool for innovation for interpreting and analyzing situations. It helps in building result-oriented solutions thereby leading to adopting data-driven methods. Decision scientists, in contrast, use data as a tool only as a suggestive mechanism for taking decisions. They apply data to design not one but different approaches to a problem. Though data is equally important, the mechanism differs vastly. While data scientists are focused on finding insights, decision scientists use data to reveal those insights.


A data scientist’s USP lies in processing structured as well as unstructured data and putting the derived information into an easily understandable format so that it reveals a certain pattern. On the other hand, decision scientists though do not work with big data, but using the insights, they arrive at a principled framework for decision-makers to align with a certain mindset.


The challenge lies in complexity. Data scientists have to process large amounts of data and hence have to address the issues related to it such as the development of sourcing, data security protocols etc. For decision scientists, given the techniques they apply are complex, the lack of reliable data and the right data environments stand as hindrances.

Estimators – An Introduction To Beginners In Data Science

This article was published as a part of the Data Science Blogathon.

Not having much information about the distribution of a random variable can become a major problem for data scientists and statisticians. Consider, a researcher trying to understand the distribution of Choco-chips in a cookie (a very popular example of Poisson distribution). The researcher is well aware that the distribution of Choco-chips follows a Poisson distribution, but does not know how to estimate the parameter λ of the distribution.

A parameter is essentially a numerical characteristic of a distribution (or any statistical model in general). Normal distributions have µ & σ as parameters, uniform distributions have a & b as parameters, and binomial distributions have n & p as parameters. These numerical characteristics are vital for understanding the size, shape, spread, and other properties of a distribution. In the absence of the true value of the parameter, it seems that the researcher may not be able to continue her investigation. But that’s when estimators step in.

Estimators are functions of random variables that can help us find approximate values for these parameters. Think of these estimators like any other function, that takes an input, processes it, and renders an output. So, the process of estimation goes as follows:

1) From the distribution, we take a series of random samples.

2) We input these random samples into the estimator function.

3) The estimator function processes it and gives a set of outputs.

4) The expected value of that set is the approximate value of the parameter.


Let’s take an example. Consider a random variable X showing a uniform distribution. The distribution of X can be represented as U[0, θ]. This has been plotted below:

(Figure A)

We have the random variable X and its distribution. But we don’t know how to determine the value of θ. Let’s use estimators. There are many ways to approach this problem. I’ll discuss two of them:

1) Using Sample Mean

We know that for a U[a, b] distribution, the mean µ is given by the following equation:

For U[0, θ] distribution, a = 0 & b = θ, we get:

Thus, if we estimate µ, we can estimate θ. To estimate µ, we use a very popular estimator called the sample mean estimator. The sample mean is the sum of the random sample value drawn divided by the size of the sample. For instance, if we have a random sample S = {4, 7, 3, 2}, then the sample mean is (4+7+3+2)/4 = 4 (the average value). In general, the sample mean is defined using the following notation:

Here, µ-hat is the sample mean estimator & n is the size of the random sample that we take from the distribution. A variable with a hat on top of it is the general notation for an estimator. Since our unknown parameter θ is twice of µ, we arrive at the following estimator for θ:

We take a random sample, plug it into the above estimator, and get a number. We repeat this process and get a set of numbers. The following figure illustrates the process:

(Figure B)

The lines on the x-axes correspond to the values present in the sample taken from the distribution. The red lines in the middle indicate the average value of the sample, and the red lines at the end are twice that average value i.e., the expected value of θ for one sample. Many such samples are taken, and the estimated value of θ for each sample is noted. The expected value/mean of that set of numbers gives the final estimate for θ. It can be mathematically proved (using properties of expectation):

It is seen that the expectation of the estimator is equal to the true value of the parameter. This amazing property that certain estimators have is called unbiasedness, which is a very useful criterion for assessing estimators.

2) Maximum Value Method

This time, instead of using mean, we’ll use order statistics, particularly the nth order statistic. The nth order statistic is defined as the nth smallest value of a random sample of size n. In other words, it’s the maximum value of a random sample. For instance, if we have a random sample S = {4, 7, 3, 2}, then the nth order statistic is 7 (the largest value). The estimator is now defined as follows:

We follow the same procedure- take random samples, input them, collect the output and find the expectation. The following figure illustrates the process:

(Figure C)

As noted previously, the lines on the x-axes are the values present in one sample. The red lines at the end are the maximum value for that sample i.e., the nth order statistic. Two random samples are shown for reference. However, we need to take much larger samples. Why? To prove it, we’ll use the general expression for the PDF (Probability Distribution Function) of nth order statistics for U[a, b] distribution:

For U[0, θ] distribution, a = 0 & b = θ, we get:

Using the integral form of expectation of a continuous variable,

Does that mean that we cannot use this estimator? Certainly not. As discussed earlier, the estimator bias can be significantly lowered by taking large n. For large values of n, n = n+1 (approximately). Thus, we get:

The Bottom Line

Hence, we have successfully solved our problem through estimators. We also learned a very important property of estimators- unbiasedness. While this may have been an extensive read, it’s imperative to acknowledge that the study of estimators is not restricted to just the above-explained concepts. Various other properties of estimators such as their efficiency, robustness, mean squared error, and consistency are also vital to deepen our understanding of them.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion. 



Update the detailed information about Comprehensive Learning Path – Data Science In Python on the website. We hope the article's content will meet your needs, and we will regularly update the information to provide you with the fastest and most accurate information. Have a great day!