An Awesome Tutorial To Learn Outlier Detection In Python Using PyOD Library


Introduction

My latest data science project involved predicting the sales of each product in a particular store. There were several ways I could approach the problem. But no matter which model I used, my accuracy score would not improve.

I figured out the problem after spending some time inspecting the data – outliers!

This is a commonly overlooked mistake we tend to make. The temptation is to start building models on the data you’ve been given. But that’s essentially setting yourself up for failure.

There are no shortcuts to data exploration. Building models will only get you so far if you've skipped this stage of your data science project. After a point, you'll hit the accuracy ceiling – the model's performance just won't budge.

Data exploration consists of many things, such as variable identification, treating missing values, feature engineering, etc. Detecting and treating outliers is also a major cog in the data exploration stage. The quality of your inputs decides the quality of your output!

PyOD is one such library to detect outliers in your data. It provides access to more than 20 different algorithms to detect outliers and is compatible with both Python 2 and 3. An absolute gem!

In this article, I will take you on a journey to understand outliers and how you can detect them using PyOD in Python.

This article assumes you have a basic knowledge of machine learning algorithms and the Python language. You can refer to the article "Essentials of Machine Learning" to understand or refresh these concepts.

What is an Outlier?

An outlier is any data point which differs greatly from the rest of the observations in a dataset. Let’s see some real life examples to understand outlier detection:

When one student averages over 90% while the rest of the class is at 70% – a clear outlier

How about Usain Bolt? Those record breaking sprints are definitely outliers when you factor in the majority of athletes

Outliers are of two types: Univariate and Multivariate. A univariate outlier is a data point that consists of extreme values in one variable only, whereas a multivariate outlier is a combined unusual score on at least two variables. Suppose you have three different variables – X, Y, Z. If you plot a graph of these in a 3-D space, they should form a sort of cloud. All the data points that lie outside this cloud will be the multivariate outliers.

I would highly recommend you to read this amazing guide on data exploration which covers outliers in detail.

Why do we need to Detect Outliers?

Outliers can impact the results of our analysis and statistical modeling in a drastic way. Check out the below image to visualize what happens to a model when outliers are present versus when they have been dealt with:

But here’s the caveat – outliers aren’t always a bad thing. It’s very important to understand this. Simply removing outliers from your data without considering how they’ll impact the results is a recipe for disaster.

“Outliers are not necessarily a bad thing. These are just observations that are not following the same pattern as the other ones. But it can be the case that an outlier is very interesting. For example, if in a biological experiment, a rat is not dead whereas all others are, then it would be very interesting to understand why. This could lead to new scientific discoveries.  So, it is important to detect outliers.”

– Pierre Lafaye de Micheaux, Author and Statistician

Our tendency is to use straightforward methods like box plots, histograms and scatter-plots to detect outliers. But dedicated outlier detection algorithms are extremely valuable in fields which process large amounts of data and require a means to perform pattern recognition in larger datasets.

Applications like fraud detection in finance and intrusion detection in network security require intensive and accurate techniques to detect outliers. Can you imagine how embarrassing it would be if you detected an outlier and it turned out to be genuine?

The PyOD library can step in to bridge this gap. Let’s see what it’s all about.

Why should we use PyOD for Outlier Detection?

Numerous outlier detection packages exist in various programming languages, and I found them particularly helpful in R. But when I switched to Python, there was a glaring lack of an outlier detection library. How was this even possible?!

Existing implementations like PyNomaly are not specifically designed for outlier detection (though it’s still worth checking out!). To fill this gap, Yue Zhao, Zain Nasrullah, and Zheng Li designed and implemented the PyOD library.

PyOD is a scalable Python toolkit for detecting outliers in multivariate data. It provides access to around 20 outlier detection algorithms under a single well-documented API.

Features of PyOD

Open Source with detailed documentation and examples across various algorithms

Optimized performance with JIT (Just in Time) and parallelization using numba and joblib

Compatible with both Python 2 & 3

Installing PyOD in Python

Time to power up our Python notebooks! Let’s first install PyOD on our machines:

pip install pyod

pip install --upgrade pyod  

# to make sure that the latest version is installed!

As simple as that!

Note that PyOD also contains some neural network based models which are implemented in Keras. PyOD will NOT install Keras or TensorFlow automatically. You will need to install Keras and other libraries manually if you want to use neural net based models.

Outlier Detection Algorithms used in PyOD

Let’s see the outlier detection algorithms that power PyOD. It’s well and good implementing PyOD but I feel it’s equally important to understand how it works underneath. This will give you more flexibility when you’re using it on a dataset.

Note: We will be using the term outlying score in this section. It means that every model, in some way, scores a data point and then uses a threshold value to determine whether the point is an outlier or not.

Angle-Based Outlier Detection (ABOD)

ABOD considers the relationship between each point and its neighbors; the variance of its weighted cosine scores to all neighbors can be viewed as the outlying score. ABOD performs well on multi-dimensional data

PyOD provides two different versions of ABOD:

Fast ABOD: Uses k-nearest neighbors to approximate the outlying score

Original ABOD: Considers all training points, with high time complexity
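As a quick, illustrative sketch (not from the original article): both variants are selected through the ABOD detector's method argument, and the data here is randomly generated purely for demonstration.

import numpy as np
from pyod.models.abod import ABOD

X = np.random.randn(200, 2)   # hypothetical 2-D data, for illustration only

# Fast ABOD approximates the score using the k nearest neighbors
fast_abod = ABOD(contamination=0.1, n_neighbors=10, method='fast')
fast_abod.fit(X)

# Original ABOD considers all training points (much slower)
full_abod = ABOD(contamination=0.1, method='default')
full_abod.fit(X)

# After fitting, labels_ marks each training point as 0 (inlier) or 1 (outlier)
print(fast_abod.labels_[:10], full_abod.labels_[:10])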

k-Nearest Neighbors Detector

For any data point, the distance to its kth nearest neighbor could be viewed as the outlying score

PyOD supports three kNN detectors:

Largest: Uses the distance to the kth neighbor as the outlier score

Mean: Uses the average of the distances to all k neighbors as the outlier score

Median: Uses the median of the distance to k neighbors as the outlier score
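For illustration only (not part of the original article), these three variants map onto the method parameter of PyOD's KNN detector; the data below is randomly generated.

import numpy as np
from pyod.models.knn import KNN

X = np.random.randn(300, 2)   # hypothetical 2-D data

for variant in ['largest', 'mean', 'median']:
    clf = KNN(method=variant, n_neighbors=5, contamination=0.05)
    clf.fit(X)
    # decision_scores_ holds the outlying score of every training point
    print(variant, clf.decision_scores_[:3])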

Isolation Forest

It uses the scikit-learn library internally. In this method, data partitioning is done using a set of trees. Isolation Forest provides an anomaly score looking at how isolated the point is in the structure. The anomaly score is then used to identify outliers from normal observations

Isolation Forest performs well on multi-dimensional data
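A minimal sketch of PyOD's Isolation Forest wrapper on assumed random data (the parameter values are illustrative, not from the article):

import numpy as np
from pyod.models.iforest import IForest

X = np.random.randn(500, 3)   # hypothetical 3-D data

clf = IForest(n_estimators=100, contamination=0.05, random_state=42)
clf.fit(X)

# decision_scores_ are the raw anomaly scores; labels_ applies the
# contamination-based threshold (1 = outlier, 0 = inlier)
print(clf.decision_scores_[:5])
print(clf.labels_.sum(), "points flagged as outliers")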

Histogram-based Outlier Detection

It is an efficient unsupervised method which assumes feature independence and calculates the outlier score by building histograms

It is much faster than multivariate approaches, but at the cost of less precision

Local Correlation Integral (LOCI)

LOCI is very effective for detecting outliers and groups of outliers. It provides a LOCI plot for each point which summarizes a lot of the information about the data in the area around the point, determining clusters, micro-clusters, their diameters, and their inter-cluster distances

None of the existing outlier-detection methods can match this feature because they output only a single number for each point

Feature Bagging

A feature bagging detector fits a number of base detectors on various sub-samples of the dataset. It uses averaging or other combination methods to improve the prediction accuracy

By default, Local Outlier Factor (LOF) is used as the base estimator. However, any estimator could be used as the base estimator, such as kNN and ABOD

Feature bagging first constructs n sub-samples by randomly selecting a subset of features. This brings out the diversity of base estimators. Finally, the prediction score is generated by averaging or taking the maximum of all base detectors
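To illustrate swapping the base estimator, here is a hedged sketch on random data; the article itself uses LOF as the base later on, and kNN is shown here only as an example.

import numpy as np
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.knn import KNN

X = np.random.randn(300, 4)   # hypothetical 4-D data

# kNN as the base detector instead of the default LOF
clf = FeatureBagging(KNN(), n_estimators=10, contamination=0.1, random_state=42)
clf.fit(X)
print(clf.labels_[:10])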

Clustering Based Local Outlier Factor

It classifies the data into small clusters and large clusters. The anomaly score is then calculated based on the size of the cluster the point belongs to, as well as the distance to the nearest large cluster

Extra Utilities provided by PyOD

PyOD provides a utility function, generate_data, that can be used to generate random data with outliers. Inlier data is generated by a multivariate Gaussian distribution, and outliers are generated by a uniform distribution. We can provide our own values for the outlier fraction and the total number of samples we want in our dataset. We will use this utility function to create data in the implementation part.
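For example, a call to generate_data might look like the sketch below (the sample count, contamination and random seed are illustrative assumptions):

from pyod.utils.data import generate_data

# 300 two-dimensional training samples, of which roughly 10% are outliers
X, y = generate_data(n_train=300, train_only=True, n_features=2,
                     contamination=0.1, random_state=0)
print(X.shape, int(y.sum()))   # y is 1 for outliers and 0 for inliers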

Implementation of PyOD

Enough talk – let’s see some action. In this section, we’ll implement the PyOD library in Python. I’m going to use two different approaches to demonstrate PyOD:

Using a simulated dataset

Using a real-world dataset – The Big Mart Sales Challenge

PyOD on a Simulated Dataset

First, let’s import the required libraries:

import numpy as np

from scipy import stats

import matplotlib.pyplot as plt

%matplotlib inline

import matplotlib.font_manager

Now, we’ll import the models we want to use to detect the outliers in our dataset. We will be using ABOD (Angle Based Outlier Detector) and KNN (K Nearest Neighbors):

from pyod.models.abod import ABOD
from pyod.models.knn import KNN
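The original data-generation step for this example is not shown above, yet the code that follows references X_train, Y_train, outlier_fraction, n_outliers, xx and yy. A minimal sketch of what that setup could look like using generate_data (the exact sample count, contamination and grid range are assumptions):

from pyod.utils.data import generate_data

outlier_fraction = 0.1   # assumed fraction of outliers in the generated data

# 200 two-dimensional training points; outliers are placed at the end of the array
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=2,
                                 contamination=outlier_fraction, random_state=42)
n_outliers = int(outlier_fraction * len(X_train))

# grid over which the decision function will be evaluated for plotting
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))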



Create a dictionary and add all the models that you want to use to detect the outliers:

classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K Nearest Neighbors (KNN)': KNN(contamination=outlier_fraction)
}

Now we fit the data to each model we have added to the dictionary, and see how each model detects outliers:

#set the figure size
plt.figure(figsize=(10, 10))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X_train) * -1
    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X_train)
    # no of errors in prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)

    # rest of the code is to create the visualization
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # decision function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 2, i + 1)
    # fill blue colormap from minimum anomaly score to threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10), cmap=plt.cm.Blues_r)
    # draw red contour line where anomaly score is equal to threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')
    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))

plt.show()

Looking good!

PyOD on the Big Mart Sales Problem

Now, let’s see how PyOD does on the famous Big Mart Sales Problem.

Go ahead and download the dataset from the above link. Let’s start with importing the required libraries and loading the data:

import pandas as pd
import numpy as np

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

# reading the big mart sales training data
df = pd.read_csv("train.csv")

Let’s plot Item MRP vs Item Outlet Sales to understand the data:

df.plot.scatter('Item_MRP','Item_Outlet_Sales')

The range of Item_Outlet_Sales is 0 to 12000, while the range of Item_MRP is 0 to 250. We will scale both features down to a range between 0 and 1. This is required to create an interpretable visualization – the plot would become far too stretched otherwise, and working on the original scale would also take much more time to create the visualization.

Note: If you don’t want the visualization, you can use the same scale to predict whether a point is an outlier or not.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['Item_MRP','Item_Outlet_Sales']] = scaler.fit_transform(df[['Item_MRP','Item_Outlet_Sales']])
df[['Item_MRP','Item_Outlet_Sales']].head()

Store these values in a NumPy array to use in our models later:

X1 = df['Item_MRP'].values.reshape(-1, 1)
X2 = df['Item_Outlet_Sales'].values.reshape(-1, 1)
X = np.concatenate((X1, X2), axis=1)

Again, we will create a dictionary. But this time, we will add some more models to it and see how each model predicts outliers.

You can set the value of the outlier fraction according to your problem and your understanding of the data. In our example, I want to detect the 5% of observations that are least similar to the rest of the data. So, I'm going to set the outlier fraction to 0.05.

random_state = np.random.RandomState(42)
outliers_fraction = 0.05

# Define seven outlier detection tools to be compared
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
    'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(contamination=outliers_fraction, check_estimator=False, random_state=random_state),
    'Feature Bagging': FeatureBagging(LOF(n_neighbors=35), contamination=outliers_fraction, check_estimator=False, random_state=random_state),
    'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
    'Isolation Forest': IForest(contamination=outliers_fraction, random_state=random_state),
    'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
    'Average KNN': KNN(method='mean', contamination=outliers_fraction)
}

Now, we will fit the data to each model one by one and see how differently each model predicts the outliers.

xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))

    # copy of dataframe
    dfx = df
    dfx['outlier'] = y_pred.tolist()

    # IX1 - inlier feature 1, IX2 - inlier feature 2
    IX1 = np.array(dfx['Item_MRP'][dfx['outlier'] == 0]).reshape(-1, 1)
    IX2 = np.array(dfx['Item_Outlet_Sales'][dfx['outlier'] == 0]).reshape(-1, 1)
    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 = dfx['Item_MRP'][dfx['outlier'] == 1].values.reshape(-1, 1)
    OX2 = dfx['Item_Outlet_Sales'][dfx['outlier'] == 1].values.reshape(-1, 1)
    print('OUTLIERS : ', n_outliers, 'INLIERS : ', n_inliers, clf_name)

    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)
    # decision function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    # fill blue colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)
    # draw red contour line where anomaly score is equal to threshold
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    b = plt.scatter(IX1, IX2, c='white', s=20, edgecolor='k')
    c = plt.scatter(OX1, OX2, c='black', s=20, edgecolor='k')
    plt.axis('tight')
    # loc=2 is used for the top left corner
    plt.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'inliers', 'outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()

OUTPUT

OUTLIERS : 447 INLIERS : 8076 Angle-based Outlier Detector (ABOD)
OUTLIERS : 427 INLIERS : 8096 Cluster-based Local Outlier Factor (CBLOF)
OUTLIERS : 386 INLIERS : 8137 Feature Bagging
OUTLIERS : 501 INLIERS : 8022 Histogram-base Outlier Detection (HBOS)
OUTLIERS : 427 INLIERS : 8096 Isolation Forest
OUTLIERS : 311 INLIERS : 8212 K Nearest Neighbors (KNN)
OUTLIERS : 176 INLIERS : 8347 Average KNN

In the above plots, the white points are inliers surrounded by red lines, and the black points are outliers in the blue zone.

End Notes

That was an incredible learning experience for me as well. I spent a lot of time researching PyOD and implementing it in Python. I would encourage you to do the same. Practice using it on different datasets – it’s such a useful library!

PyOD already supports around 20 classical outlier detection algorithms which can be used in both academic and commercial projects. Its contributors are planning to enhance the toolbox by implementing models that will work well with time series and geospatial data.

Check out the awesome courses below to learn data science and its various aspects:


A Complete Python Tutorial To Learn Data Science From Scratch

Overview

This article is a complete tutorial to learn data science using python from scratch

It will also help you to learn basic data analysis methods using python

You will also be able to enhance your knowledge of machine learning algorithms

Introduction

It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn’t take me long to decide – Python was my appetizer.

I always had an inclination for coding. This was the time to do what I really loved. Code. Turned out, coding was actually quite easy!

I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but have also helped many others learn it. Python was originally a general-purpose language. But over the years, with strong community support, it gained dedicated libraries for data analysis and predictive modeling.

Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for data analysis, chew on it till we are comfortable, and practice it at our own end.

A complete python tutorial from scratch in data science.

Are you a beginner looking for a place to start your journey in data science and machine learning? Presenting a comprehensive course, full of knowledge and data science learning, curated just for you!

You can also check out the ‘Introduction to Data Science‘ course – a comprehensive introduction to the world of data science. It includes modules on Python, Statistics and Predictive Modeling along with multiple practical projects to get your hands dirty.

Basics of Python for Data Analysis

Why learn Python for data analysis?

Python has gathered a lot of interest recently as a choice of language for data analysis. I covered the basics of Python some time back. Here are some reasons which go in favour of learning Python:

Open Source – free to install

Awesome online community

Very easy to learn

Can become a common language for data science and production of web based analytics products.

Needless to say, it still has a few drawbacks too:

It is an interpreted language rather than a compiled language – hence it might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4

This is one of the most debated topics in Python. You will invariably cross paths with it, especially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your needs. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?

Awesome community support! This is something you’d need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.

Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with a high reliance on external modules, you might be better off with 2.7.

Some features of the 3.x versions are backward compatible and can work with version 2.7.

Why Python 3.4?

Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.

It is the future! 2.7 is the last release of the 2.x family, and eventually everyone has to shift to the 3.x versions. Python 3 has had stable releases for the past 5 years and will continue in the same way.

There is no clear winner but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!

How to install Python?

There are 2 approaches to install Python:

Download Python

You can download Python directly from its project site and install individual components and libraries you want

Install Package

Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.

The second method provides a hassle-free installation, so I recommend it for beginners.

The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting-edge statistical research.

Choosing a development environment

Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:

Terminal / Shell based

IDLE (default environment)

iPython notebook – similar to markdown in R

IDLE editor for Python

While the right environment depends on your needs, I personally prefer iPython Notebooks a lot. They provide a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution).

We will use iPython environment for this complete tutorial.

Warming up: Running your first Python program

You can use Python as a simple calculator to start with:

Few things to note

You can start iPython notebook by writing “ipython notebook” on your terminal / cmd, depending on the OS you are working on

The interface shows In [*] for inputs and Out[*] for output.

You can execute a code block by pressing "Shift + Enter", or "Alt + Enter" if you want to insert an additional cell after it.

Before we deep dive into problem solving, let's take a step back and understand the basics of Python. Data structures, iteration, and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loops, while-loops, if-else, etc. Let's take a look at some of these.

Python Libraries and Data Structures

Python Data Structures

Following are some data structures, which are used in Python. You should be familiar with them in order to use them as appropriate.

Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable, and individual elements of a list can be changed.

Here is a quick example to define a list and then access it:
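The original example appears to have been an embedded screenshot that did not survive; a small stand-in sketch:

squares_list = [0, 1, 4, 9, 16, 25]   # define a list
print(squares_list[0])                # first element -> 0
print(squares_list[2:4])              # slicing -> [4, 9]
squares_list[0] = -1                  # lists are mutable
print(squares_list)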



Strings – Strings can simply be defined by the use of single ( ' ), double ( " ) or triple ( ''' ) quotes. Strings enclosed in triple quotes ( ''' ) can span multiple lines and are used frequently in docstrings (Python's way of documenting functions). The backslash (\) is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
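Again, the original screenshot is missing; a quick stand-in sketch of string behaviour:

greeting = 'Hello'
print(greeting[0])            # indexing -> 'H'
print(greeting + ' World')    # concatenation builds a new string
# greeting[0] = 'J'           # would raise TypeError – strings are immutable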



Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.

Since tuples are immutable and cannot change, they are faster to process than lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
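A brief stand-in sketch for tuples (the original inline example is not present):

tup = (1, 'two', 3.0)    # define a tuple
print(tup[1])            # access by index -> 'two'
# tup[1] = 2             # would raise TypeError – tuples are immutable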

Dictionary – Dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
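And a small stand-in sketch for dictionaries (again, the original example image is missing):

extensions = {'python': '.py', 'r': '.R'}   # create a dictionary
extensions['julia'] = '.jl'                 # add a new key: value pair
print(extensions['python'])                 # look up by key -> '.py'
print(list(extensions.keys()))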



Python Iteration and Conditional Constructs

Like most languages, Python also has a FOR-loop which is the most widely used method for iteration. It has a simple syntax:

for i in [Python Iterable]:
    expression(i)

# For example, computing the factorial of a number N:
fact = 1
for i in range(1, N+1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let’s take a step further. What if you have to perform the following tasks:

Multiply 2 matrices

Find the root of a quadratic equation

Plot bar charts and histograms

Make statistical models

Access web-pages

If you try to write code for all of these from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code to make our life easy.

For example, consider the factorial example we just saw. We can do that in a single step as:

math.factorial(N)

Of course, we need to import the math library for that. Let's explore the various libraries next.

Python Libraries

Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m
from math import *

In the first manner, we have defined an alias m to library math. We can now use various functions from math library (e.g. factorial) by referencing it using the alias m.factorial().

In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.

Tip: Google recommends the first style of importing libraries, as you will always know where a function has come from.

Following is a list of libraries you will need for any scientific computations and data analysis:

NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, and advanced random number capabilities.

SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful library for variety of high level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.

Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in iPython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the iPython environment to an environment very similar to Matlab. You can also use Latex commands to add math to your plot.

Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.

Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.

Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.

Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.

Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.

Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web-pages within the website to gather information.

SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.

Requests for accessing the web. It works similar to the standard Python library urllib2 but is much easier to code. You will find subtle differences with urllib2, but for beginners, Requests might be more convenient.

Additional libraries, you might need:

os for Operating system and file operations

networkx and igraph for graph based data manipulations

regular expressions for finding patterns in text data

BeautifulSoup for scraping the web. It is inferior to Scrapy in the sense that it will extract information from just a single webpage in a run.

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we will use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:

Data Exploration – finding out more about the data we have

Data Munging – cleaning the data and playing with it to make it better suit statistical modeling

Predictive Modeling – running the actual algorithms and having fun 🙂

Exploratory analysis in Python using Pandas

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas

Image Source: Wikipedia

Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.

Before loading the data, lets understand the 2 key data structures in Pandas – Series and DataFrames

Introduction to Series and Dataframes

Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.

A dataframe is similar to an Excel workbook – you have column names referring to columns, and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.

Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes, and then various operations (e.g. group by, aggregation, etc.) can be applied very easily to their columns.

More: 10 Minutes to Pandas

Practice data set – Loan Prediction Problem

You can download the dataset from here. Here is the description of the variables:

VARIABLE DESCRIPTIONS:

Loan_ID – Unique Loan ID
Gender – Male/Female
Married – Applicant married (Y/N)
Dependents – Number of dependents
Education – Applicant Education (Graduate/Under Graduate)
Self_Employed – Self employed (Y/N)
ApplicantIncome – Applicant income
CoapplicantIncome – Coapplicant income
LoanAmount – Loan amount in thousands
Loan_Amount_Term – Term of loan in months
Credit_History – Credit history meets guidelines
Property_Area – Urban/Semi Urban/Rural
Loan_Status – Loan approved (Y/N)

Let's begin with the exploration

To begin, start iPython interface in Inline Pylab mode by typing following on your terminal/windows command prompt:

ipython notebook --pylab=inline

This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):

plot(arange(5))

I am currently working in Linux, and have stored the dataset in the following location:

Importing libraries and the data set:

Following are the libraries we will use during this tutorial:

numpy

matplotlib

pandas

Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.

After importing the library, you read the dataset using function read_csv(). This is how the code looks like till this stage:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Quick Data Exploration

Once you have read the dataset, you can have a look at few top rows by using the function head()

df.head(10)

This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.

Next, you can look at summary of numerical fields by using describe() function

df.describe()

describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)

Here are a few inferences, you can draw by looking at the output of describe() function:

LoanAmount has (614 – 592) 22 missing values.

Loan_Amount_Term has (614 – 600) 14 missing values.

Credit_History has (614 – 564) 50 missing values.

We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History field is 0.84 (remember, Credit_History has value 1 for those who have a credit history and 0 otherwise).

The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.

For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distribution to understand whether they make sense or not. The frequency table can be printed by following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.

Distribution analysis

Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount

Lets start by plotting the histogram of ApplicantIncome using the following commands:

df['ApplicantIncome'].hist(bins=50)

Here we observe that there are few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.

Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')

This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by = 'Education')

We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear to be the outliers.

Now, Let’s look at the histogram and boxplot of LoanAmount using the following command:

df['LoanAmount'].hist(bins=50)

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in the coming sections.

Categorical variable analysis

Now that we understand the distributions of ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:

Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting loan.

Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.

temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y':1, 'N':0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)

Now we can observe that we get a similar pivot_table like the MS Excel one. This can be plotted as a bar chart using the “matplotlib” library with following code:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar')
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")

This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.

Alternately, these two plots can also be visualized by combining them in a stacked chart:

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)

You can also add gender into the mix (similar to the pivot table in Excel):
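The corresponding code is not shown here; one plausible way to reproduce it (a sketch – the exact call is assumed) is a crosstab on both Credit_History and Gender:

temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)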

If you have not realized it already, we have just created two basic classification algorithms here, one based on credit history, and the other on two categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.

We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide in analyzing datasets.

Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

Data Munging in Python: Using Pandas

Data munging – recap of the need

During our exploration of the data, we found a few problems in the data set which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:

There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.

While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields i.e. Gender, Property_Area, Married, Education and Dependents to see, if they contain any useful information.

If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.

Check missing values in the dataset

Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset

df.apply(lambda x: sum(x.isnull()),axis=0)

This command should tell us the number of missing values in each column as isnull() returns 1, if the value is null.

Though the missing values are not very high in number, many variables have them, and each one of these should be estimated and added to the data. Get a detailed view of different imputation techniques through this article.

Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense, or would you consider that missing? I suppose your answer is missing, and you're right. So we should check for values which are impractical.

How to fill missing values in LoanAmount?

There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by following code:

df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables, and then use those variables to fill in the missing values.

Since the purpose now is to bring out the steps in data munging, I'll take an approach that lies somewhere in between these two extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.

First, let’s look at the boxplot to see if a trend exists:

Thus we see some variations in the median loan amount for each group, and this can be used to impute the values. But first, we have to ensure that the Self_Employed and Education variables do not have missing values themselves.

As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
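The command for the frequency table is not shown; it can be reproduced with value_counts (a sketch):

df['Self_Employed'].value_counts()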

Since ~86% values are “No”, it is safe to impute the missing values as “No” as there is a high probability of success. This can be done using the following code:

df['Self_Employed'].fillna('No',inplace=True)

Now, we will create a Pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount:

table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)

# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]

# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

This should provide you a good way to impute missing values of loan amount.

NOTE : This method will work only if you have not filled the missing values in Loan_Amount variable using the previous approach, i.e. using mean.

How to treat for extreme values in distribution of LoanAmount and ApplicantIncome?

Let’s analyze LoanAmount first. Since the extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to nullify their effect:

df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)

Now the distribution looks much closer to normal and effect of extreme values has been significantly subsided.

Coming to ApplicantIncome, one intuition can be that some applicants have a lower income but strong supporting co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)

Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back his loan.

Next, we will look at making predictive models.

Building a Predictive Model in Python

Now that we have made the data useful for modeling, let's look at the Python code to create a predictive model on our data set. Scikit-Learn (sklearn) is the most commonly used library in Python for this purpose, and we will follow its trail. I encourage you to get a refresher on sklearn through this article.

Since, sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. Before that we will fill all the missing values in the dataset. This can be done using the following code:

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['Married'].fillna(df['Married'].mode()[0], inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)

from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes

Next, we will import the required modules. Then we will define a generic classification function which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to this article for the details of the algorithms with R and Python codes. Also, it'll be good to get a refresher on cross-validation through this article, as it is a very important measure of model performance.

#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and accessing performance:
def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors], data[outcome])

    #Make predictions on training set:
    predictions = model.predict(data[predictors])

    #Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    #Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        # Filter training data
        train_predictors = (data[predictors].iloc[train, :])
        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]
        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)
        #Record error from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    #Fit the model again so that it can be referred outside the function:
    model.fit(data[predictors], data[outcome])

Logistic Regression

Let’s make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well. Read more about Logistic Regression.

We can easily make some intuitive hypothesis to set the ball rolling. The chances of getting a loan will be higher for:

Applicants having a credit history (remember we observed this in exploration?)

Applicants with higher applicant and co-applicant incomes

Applicants with higher education level

Properties in urban areas with high growth perspectives

So let’s make our first model with ‘Credit_History’.

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%

#We can try different combination of variables:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%

Generally, we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:

Feature Engineering: derive new information and try to predict those. I will leave this to your creativity.

Better modeling techniques. Let’s explore this next.

Decision Tree

Decision tree is another method for making a predictive model. It is known to provide higher accuracy than logistic regression model. Read more about Decision Trees.

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 81.930% Cross-Validation Score : 76.656%

Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let’s try a few numerical variables:

#We can try different combination of variables:
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 92.345% Cross-Validation Score : 71.009%

Here we observe that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:

Random Forest

Random forest is another algorithm for solving the classification problem. Read more about Random Forest.

model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Loan_Amount_Term',
                 'Credit_History', 'Property_Area', 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 100.000% Cross-Validation Score : 78.179%

Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

Reducing the number of predictors

Tuning the model parameters

Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.

#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

Let’s use the top 5 variables for creating a model. Also, we will modify the parameters of random forest model a little bit:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 82.899% Cross-Validation Score : 81.461%

Notice that although the accuracy reduced, the cross-validation score improved, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization, but the output should stay in the ballpark.

You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:

Using a more sophisticated model does not guarantee better results.

Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting, thus making your models less interpretable.

Feature Engineering is the key to success. Everyone can use an XGBoost model, but the real art and creativity lies in enhancing your features to better suit the model.

You can access the dataset and problem statement used in this post at this link: Loan Prediction Challenge

Projects

Now, it's time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Accelerate your data science journey with the following practice problems:

Frequently Asked Questions

Q1. How to learn python programming?

A. To learn Python programming, you can start by familiarizing yourself with the language’s syntax, data types, control structures, functions, and modules. You can then practice coding by solving problems and building projects. Joining online communities, attending workshops, and taking online courses can also help you learn Python. With regular practice, persistence, and a willingness to learn, you can become proficient in Python and start developing software applications.

Q2. Why Python is used?

A. Python is used for a wide range of applications, including web development, data analysis, scientific computing, machine learning, artificial intelligence, and automation. Python is a high-level, interpreted, and dynamically-typed language that offers ease of use, readability, and flexibility. Its vast library of modules and packages makes it a popular choice for developers looking to create powerful, efficient, and scalable software applications. Python’s popularity and versatility have made it one of the most widely used programming languages in the world today.

Q3. What are the 4 basics of Python?

A. The four basics of Python are variables, data types, control structures, and functions. Variables are used to store values, data types define the type of data that can be stored, control structures dictate the flow of execution, and functions are reusable blocks of code. Understanding these four basics is essential for learning Python programming and developing software applications.

Q4. Can I teach myself Python?

A. Yes, you can teach yourself Python. Start by learning the basics and practicing coding regularly. Join online communities to get help and collaborate on projects. Building projects is a great way to apply your knowledge and develop your skills. Remember to be persistent, learn from mistakes, and keep practicing.

End Notes

I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure it not only gave you an idea about basic data analysis methods but also showed you how to implement some of the more sophisticated techniques available today.

You should also check out our free Python course and then jump over to learn how to apply it for Data Science.

Python is really a great tool and is becoming an increasingly popular language among data scientists. It’s easy to learn, integrates well with databases and tools like Spark and Hadoop, and offers powerful data analytics libraries along with strong computational performance.

So, learn Python to perform the full life cycle of any data science project: reading, analyzing, and visualizing data, and finally making predictions.


Related

How To Build An Awesome Website With Wix

Whether you’re an aspiring online influencer in need of a homebase for your brand, a professional who needs a portfolio to show off your work, or a small business trying to spread word of your products and services beyond the town you set up shop in, building an awesome website is crucial to your success.

Benefits range from improving your likelihood of being discovered in today’s digital-oriented society to boosting your credibility as someone who cares enough to not only build an online presence but also take time to create an aesthetically pleasing portal into what you’re offering others.


The beauty of a cloud-based web development platform like Wix is that it gives you the ability to accomplish all of this and more with incredible ease at an affordable price. While other platforms are similar, they aren’t necessarily as intuitive.

Others may offer more potential, but they also require more skill, i.e., coding knowledge. Often, platforms on both ends of the spectrum can end up costing more while offering relatively fewer benefits to those who aren’t programmers yet want a site that looks like a programmer built it.

After trying multiple web platforms to build my own portfolio site, here’s why I’ve come to love Wix and how you too can leverage its technology to build an awesome website for you or your business.

Free or Freedom

First things first: what are you looking for in your website? The answer could depend on how comfortable you are with building a site yourself.

If you’re not at all comfortable or are short on time, Wix Artificial Design Intelligence (ADI) can build a free website for you. But if you have the skills or want to build them and have the time, Wix also gives you the option to have full control over design with the Wix Editor.

Wix ADI

Wix ADI doesn’t just give generalized templates to choose from; it gathers information about you and your needs and legitimately builds a customized space based on what you tell it to create. Once the skeleton of the site is produced, with some additional input from you on thematic choices, Wix hands over control to you to start adding additional pages and making changes to themes, fonts, colors, and more.

The upside is you get everything you need in a basic website that requires zero effort on your end to actually set up–and it’s free. All you need to worry about is providing the content for your site and making sure it looks how you want it to.

The downside is you might find yourself wishing for functionality or options you don’t get with Wix ADI; those come with the premium plans, which is where you can really experience design freedom.

Drag and Drop Editor

If you decide to pay for a site you build with Wix Editor, like I did, your options expand considerably. There are several premium options available, ranging from just $5/month to $25/month, with each offering additional services. However, each premium plan includes:

Free hosting

Domain connection

500MB+ storage

Google Analytics

Premium support

No setup fee

To begin with, there are several Wix templates other users have created that you can leverage for your own design. The site I built, for instance, was originally a layout from someone else who had published their design for others to use. Now, it looks nothing like the original, as I’ve made several changes both aesthetically and to the flow.

Once you have the layout you’re looking for in a website, whether you start from scratch or build off someone else’s available design, you can really start transforming your site into something unique with the drag and drop editor Wix offers.

Add Elements

There’s way more to choose from, and Wix helps you identify whether something might be of good use for your site by providing detailed information about each element when you hover over/select it.

Almost anything you can imagine a professional, aesthetically appealing site needing, Wix makes available for you to simply select, drop in, and move around.

This process is so intuitive and iterative, it’s easy to experiment with different design or logistical elements on your site, preview them, then decide to keep or delete with practically no work on your end other than deciding where to put something.

Add Apps

Another great feature of the Wix Editor is the ability to add apps, several of which are free. There are over 300 to choose from (Wix itself offers over 100 apps developed in-house), from social tools to chat apps to business tools.

Adding these apps to your site is as easy as adding any other design element you drag and drop in. They’re an easy way to make your site less static, more dynamic, and more interactive to really capture the attention of visitors, and to make you look like a total web dev pro.

Make It Mobile Friendly

Once you’ve made finishing touches to your website, check to make sure it’s displaying properly on a mobile interface using Wix’s mobile editor. This is of growing importance, as most people today access websites and services via their mobile phones more than any other device.

Ensuring you provide an elegant user experience (UX) that transitions seamlessly from a ~13” to a ~5” display is critical. Bad UX can ruin your chances at securing confidence from a site visitor.

You will lose their attention or their business and end up increasing your bounce rate, a metric that measures how quickly someone visits and leaves your site. The higher the bounce rate, the less likely you are to appear in Google searches.

Once you’re in the Wix mobile editor, you can rearrange images and text and hide elements you don’t want to display to ensure your site’s user experience is as elegant in the palm of a hand as it is on tablet or monitor. It only takes a few minutes to clean things up before hitting publish.

Email Marketing

One of the coolest extras Wix offers is email marketing. Once you’ve built your site, you might want to promote it–especially if you’re an aspiring influencer trying to garner followers; a creative professional trying to land a new job or freelance gigs with a sharp new portfolio; or a business trying to attract new customers.

As with everything, Wix makes this super easy, providing you with a flashy email template that only requires you to input your information and tweak it to fit your brand. Besides that, your biggest job is providing a contacts list to send the email to.

It’s possible you already have hundreds of contacts you could promote your new site to. If not, there are data-collection companies out there from which you can purchase email lists (it sounds sketchy, but this is how the world works today).

Send the email out and you may see your site stats spike over the next few days. By the way, make sure you have a submission form somewhere on your site that visitors can use to contact you. You’ll definitely want to provide a way for visitors to communicate with you if you’re putting yourself out there like this.

Wix also offers an app that can help you grow your subscribers list if that makes sense for your site. This is a great way to keep yourself or your business in the spotlight with a growing list of contacts, increasing your visibility and potentially your profit or status.

View Stats and Build SEO

We’ve talked a little about the importance of making your brand–whether it’s personal or business–more visible to the online world with a stellar website. But building a great site can only take you so far.

A lot of the hard work involves playing the game of search engine optimization (SEO). Google and other search engines use algorithms to decide what results appear first in a search. There are specific best practices to follow if you want Google and other search engines to put your site higher in its search index.

Learning these best practices and understanding how to effectively apply them takes time and trial and error. Fortunately for Wix users, Wix SEO Wiz makes this much easier. It gives you a step-by-step process to follow to ensure your site is optimized correctly before even allowing you to connect it to Google to be indexed and ranked in searches. Talk about due diligence!

Once you’re connected to Google, Wix will take you to your SEO plan where you can continue to optimize the rest of your site. You’re never alone in this; Wix SEO Wiz continues to give you a checklist of items to complete to help your site perform better.

When you’re ready to see how well your site is performing, you can connect it to Google Analytics, one of the most powerful analytics platforms available for measuring organic and paid search performance, and start monitoring other metrics.

End Your Site Search

If you’re looking for an all-inclusive cloud-based web development platform, Wix offers quite possibly everything you need without the burden of mastering technical skills to derive the utmost value from it, whether you need a lot or a little for your personal or business site.

How To Create An Ogive Graph In Python?

An ogive graph, sometimes referred to as a cumulative frequency curve, graphically represents the cumulative distribution function (CDF) of a set of data. It is used to examine how data are distributed and to spot patterns and trends. Matplotlib, Pandas, and NumPy are just a few of the libraries and tools offered by Python to create ogive graphs. In this tutorial, we’ll look at how to use Matplotlib to generate an ogive graph in Python.

To create an ogive graph, we need to import the required libraries. In this example, we will use Matplotlib, Pandas, and Numpy. Matplotlib is a popular data visualization library used in Python to create interactive plots and charts. Numpy, on the other hand, is used for performing complex mathematical operations. Pandas is another widely used Python library specifically designed for data manipulation and analysis.

Syntax

freq, bin_edges = np.histogram(data, bins)
plt.plot(bin_edges[1:], np.cumsum(freq), 'o-')

In this syntax, 'data' is the dataset used to create the ogive graph. The 'np.histogram' function determines the data’s frequency distribution and returns the per-bin frequencies together with the bin edges. 'np.cumsum' converts those frequencies into cumulative frequencies, and 'plt.plot' draws the ogive, using the 'o-' format string to plot the data points and connect them with lines; the cumulative frequencies are plotted against the upper edge of each bin.

Example

Here’s a simple example that creates an ogive graph to visualize the cumulative frequency distribution of a list of dice rolls.

import numpy as np
import matplotlib.pyplot as plt

# List of dice rolls
rolls = [1, 2, 3, 4, 5, 6, 3, 6, 2, 5, 1, 6, 4, 2, 3, 5, 1, 4, 6, 3]

# Calculate the cumulative frequency
bins = np.arange(0, 8, 1)
freq, bins = np.histogram(rolls, bins=bins)
cumulative_freq = np.cumsum(freq)

# Create the ogive graph
plt.plot(bins[1:], cumulative_freq, '-o')
plt.xlabel('Dice Rolls')
plt.ylabel('Cumulative Frequency')
plt.title('Ogive Graph of Dice Rolls')
plt.show()

This example first imports the necessary modules, NumPy and Matplotlib. The code then defines a list of dice rolls and uses NumPy’s histogram function to bin the data, specifying the bin edges and therefore the range of values the rolls can take. Next, the cumulative frequency of the data is computed with NumPy’s ‘cumsum’ function.

Lastly, the cumulative frequency is plotted against the upper bound of each bin to form the ogive graph using Matplotlib’s “plot” function. The resulting ogive graph shows the cumulative frequency distribution of the dice rolls, where the x-axis represents the value of the rolls and the y-axis represents the cumulative frequency of those values up to a certain point. This graph can be used to analyze the frequency and distribution of dice rolls.

Output

Example

This example demonstrates an ogive graph to visualize a random distribution of 500 numbers between 0 and 100.

import numpy as np
import matplotlib.pyplot as plt

# Generate random data
data = np.random.randint(0, 100, 500)

# Calculate the cumulative frequency
bins = np.arange(0, 110, 10)
freq, bins = np.histogram(data, bins=bins)
cumulative_freq = np.cumsum(freq)

# Create the ogive graph
plt.plot(bins[1:], cumulative_freq, '-o')
plt.xlabel('Data')
plt.ylabel('Cumulative Frequency')
plt.title('Ogive Graph of Random Data')
plt.show()

In this example, we first generate a random dataset of 500 numbers between 0 and 100 using NumPy. The cumulative frequency of the data is then determined using NumPy with a bin width of 10. Lastly, using Matplotlib, we plot the cumulative frequency against the upper bound of each bin to produce the ogive graph. This example demonstrates how to create an ogive graph in Python using randomly generated data.

Output

Creating an ogive graph in Python with the Matplotlib library is a simple process. By loading your data, calculating the cumulative frequency, and plotting the results, you can easily visualize the distribution of your dataset and identify any patterns or trends. You can customize your graph with labels, titles, and styles to make it more visually appealing and informative. Ogive graphs are useful tools in statistical analysis and can represent a wide range of data, from income distributions to test scores.
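
Pandas was listed among the libraries at the start of this section but not used in the examples above. As a hedged sketch of how the same kind of ogive could be built from a Pandas Series (the random data and the column name 'score' are illustrative assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data held in a pandas Series called 'score'
score = pd.Series(np.random.randint(0, 100, 500), name='score')

# Bin the values, count per bin, and accumulate
bins = np.arange(0, 110, 10)
counts = pd.cut(score, bins=bins, include_lowest=True).value_counts().sort_index()
cumulative = counts.cumsum()

# Plot the cumulative counts against the upper edge of each bin
plt.plot(bins[1:], cumulative.values, '-o')
plt.xlabel('Score')
plt.ylabel('Cumulative Frequency')
plt.title('Ogive Graph from a Pandas Series')
plt.show()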

Top 10 Python Skills To Learn To Get Hired In Meta Or Microsoft

These top 10 Python skills will help you get hired in Meta or Microsoft

Most top companies like Meta and Microsoft look for developers and coders who have a great understanding of the Python language. To become a great developer and land a high-paying job, one should study a variety of Python skills. Here are the top 10 Python skills one must master:

Object-Relational Mapping

Object-Relational Mapping (ORM) is a technology that uses an object-oriented paradigm to query and manipulate data from a database. An ORM library is a standard library written in your preferred language that encapsulates the code needed to manipulate the data. Instead of using SQL, users directly interact with an object in the same language.
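
For instance, here is a minimal sketch using SQLAlchemy, one common Python ORM; the User model and the in-memory SQLite database are illustrative assumptions, not part of the original article.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

# A Python class mapped to a database table; no hand-written SQL needed
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name='Ada'))  # INSERT expressed as object creation
    session.commit()
    first = session.query(User).filter_by(name='Ada').first()  # SELECT expressed as a query on the class
    print(first.name)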

Artificial Intelligence and Machine Learning Skills

AI and ML are considered among the most important Python skills. Good knowledge of artificial intelligence and machine learning is required to become a developer in data science. Neural networks should be well understood, as should the ability to obtain insights from data, along with data visualization, data analysis, and data collection skills.

Deep Learning and Python Libraries

Deep learning is a sort of machine learning based on the structure of the human brain. Deep learning algorithms analyze data with a predefined logical structure to reach conclusions similar to those a human would. Once you’ve figured out what deep learning is, you should be able to use your newfound skills to develop deep learning-powered systems.
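
As a hedged illustration only (assuming TensorFlow/Keras is installed; the toy data and layer sizes are arbitrary choices, not from the original article), a tiny neural network in Python might look like this:

import numpy as np
import tensorflow as tf

# Toy data: 100 samples with 4 features and a binary label
X = np.random.rand(100, 4)
y = (X.sum(axis=1) > 2).astype(int)

# A small fully connected network
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))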

Analytical Skills

In the world of data science, a good Python developer must also excel at analytical skills. These may include a deep understanding of building useful websites for web development, better visualization of datasets for data science, algorithm optimization while coding, writing clean, non-redundant code, and so on.

Good Grasp of Web Frameworks

A good Python web developer has a strong command of at least one of the two main web frameworks, Django or Flask, and ideally both. Django is a high-level Python web framework that encourages good, clean, and pragmatic design, and Flask is a widely used Python micro web framework. Sound knowledge of front-end technologies like HTML, CSS, and JavaScript is also expected.
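
For example, a minimal Flask application might look like the sketch below (assuming Flask is installed; the route and message are arbitrary):

from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    # Return a simple response for the site root
    return 'Hello from Flask!'

if __name__ == '__main__':
    app.run(debug=True)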

Object Relational Mappers

ORM is a programming technique in computer science, that comes in handy when we convert data between two incompatible type systems using Object Oriented programming languages. It creates a “virtual object database” that can be used from within any programming language. There are customized ORM tools used by programmers.

Understand Multi-Process Architecture

Your team may include a design engineer, but you should also know how the code works in deployment and release. As a Python developer, you should definitely know about the MVC (Model View Controller) and MVT (Model View Template) architectures. Once you understand the multi-process architecture, you can solve issues related to the core framework and more.
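
As a rough sketch of the MVT idea in Django (the Article model and template path are hypothetical, added only for illustration):

# views.py: the "view" in Django's MVT pattern pulls data from the model layer
# and hands it to a template for rendering
from django.shortcuts import render
from .models import Article  # hypothetical model

def article_list(request):
    articles = Article.objects.all()  # query the Model
    return render(request, 'articles/list.html', {'articles': articles})  # render the Template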

Design Skills

Developers should be able to design scalable products and implement servers in such a way that they are highly available. One should also keep in mind Python frameworks like Django or Flask while designing a website, since Python can be used for both client-side and server-side programming.

Communication Skills

One of the most important aspects of any profession is having really good communication skills. If you are able to contribute to the team, do peer code reviews, and communicate efficiently, then half of your job is done there.

How To Save Storage On Your Mac By Using Itunes In Referenced Library Mode

When importing songs into desktop iTunes, check your settings to ensure the media-management app is not set to create copies of any imported items in your library.

I have a ton of music on my computer that I ripped from my personal CD collection.

It isn’t uncommon for some people to have a vast music/video collection spanning multiple volumes or external disks. As you know, iTunes doesn’t automagically know about your media unless you import the items so they appear in your library.

The iTunes library is an .ITL file in your iTunes folder that the computer uses to keep track of your imported media and metadata such as play counts, ratings, and more.

TUTORIAL: Using Photos for Mac in referenced library mode

When importing music from a CD, the files get automatically added to the iTunes Media folder.

When you add MP3s another way (i.e. by choosing the Add to Library option from the File menu or by dragging them into iTunes), iTunes may or may not create copies of your source files. If it’s set to copy imports into the media folder, iTunes keeps your originals intact. You can even delete any imported files in their original locations because iTunes now has copies.

TUTORIAL: Where are the media files from the Photos app saved on my Mac?

Here’s how to switch to a referenced iTunes library that doesn’t create duplicates when importing stuff that’s already stored somewhere on your computer.

How to use iTunes in referenced library mode

Follow these steps to put your iTunes in referenced library mode:

1) Open iTunes on your Mac or Windows PC.

2) Choose Preferences from the iTunes menu (Mac) or the Edit menu (Windows).

3) Click the Advanced tab.

4) Untick the box next to “Copy files to iTunes Media folder when adding to library.”

TIP: Hold down the Option (⌥) key while you drag files to the iTunes window to temporarily override this setting.

Now when you import an item to iTunes, a reference (or pointer) to the item is created rather than a copy of the item itself. Referenced library mode is great if you prefer manual file management and organization without worrying about duplicates.

I like to meticulously nest my music and other media manually into multiple sub-folders. I find it easier to manage, copy, share and back up my media this way, so I’m using iTunes in referenced library mode.

TUTORIAL: How to move iTunes library to an external drive

This mode should be indispensable to those of you who prefer to keep multi-gigabyte video files on an external disk rather than in the iTunes Media folder on your computer.

Consolidating your iTunes library

If you use a referenced library, it’s easy to forget that moving the original files to another folder or disk will confuse iTunes because it expects them in the old locations. This is a major issue when moving your library to a new computer or an external drive. Thankfully, that’s what the library consolidation feature was designed for.

Consolidating your library keeps the originals in their original location and creates copies placed in the iTunes Media folder. This lets you safely move your iTunes folder to a new computer or external disk without losing anything.

Here’s how to consolidate your iTunes library

1) Open iTunes on your Mac or Windows PC.

2) Choose Organize Library from the File menu’s Library submenu.

3) Tick the option labeled “Consolidate files”.

TIP: To have your media files organized into sub-folders (like Music, Movies, Podcasts and so forth), tick the option labeled “Reorganize files in the folder iTunes Media”.

Any referenced items will be copied into appropriate sub-folders in your iTunes Media folder.

This may take a while depending on the number of the source files being consolidated, their size, the speed of your computer, available storage space and other factors.

