

This article was published as a part of the Data Science Blogathon.

Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with human language, enabling systems to understand and respond to it. Data, being the most important part of any data science project, should always be represented in a way that supports understanding and modeling, and this is especially true in NLP. It is often said that good features fed to a simple model will outperform poor features fed to a well-optimized model. So in this article, we will study how features can be extracted from text data and used in the NLP modeling process, and why feature extraction from text is harder than it is for other types of data.

Table of Contents

Brief Introduction on Text Representation

Why Feature Extraction from text is difficult?

Common Terms you should know

Techniques for Feature Extraction from text data

One-Hot Encoding

Bag of words Technique

N-Grams

TF-IDF

End Notes

Introduction to Text Representation

The first question that arises is: what is feature extraction from text? Feature extraction is a general term, also known as text representation or text vectorization, for the process of converting text into numbers. We call it vectorization because the converted text is represented as numeric vectors.

The second question is: why do we need feature extraction? Machines can only understand numbers, so to make them work with language we need to convert it into numeric form.

Why is Feature Extraction from Textual Data Difficult?

If you ask any NLP practitioner or experienced data scientist, the answer will be yes, handling textual data is difficult. Let us compare feature extraction from text with feature extraction from other types of data. In an image dataset, say digit recognition, where you have images of digits and the task is to predict the digit, feature extraction is easy because images are already numbers (pixels). With audio, say emotion prediction from speech, the data comes as waveform signals, and features can be extracted over time intervals. But if I give you a sentence and ask you to predict its sentiment, how will you represent it in numbers? Image and speech data are the simple cases; with text you have to think a little harder. In this article, we study exactly these techniques.

Common Terms Used

These are common terms that we will use in further techniques so I want you to be familiar with these four basic terms

Corpus (C) ~ The collection of all the text in the whole dataset is known as the corpus. In simple words, concatenating all the text records of the dataset forms the corpus.

Vocabulary (V) ~ The set of distinct words that make up the corpus is known as the vocabulary.

Document(D) ~ There are multiple records in a dataset so a single record or review is referred to as a document.

Word (W) ~ The individual words used in a document.

Techniques for Feature Extraction

1. One-Hot Encoding

Now, to perform all the techniques in Python, let us open a Jupyter notebook and create a sample dataframe of a few sentences.

import numpy as np
import pandas as pd

# `sentences` is assumed to be a list of three short example sentences, for instance:
sentences = ["people read analytics vidhya", "analytics vidhya is great", "people write comment"]
df = pd.DataFrame({"text": sentences, "output": [1, 1, 0]})

Now we can perform one-hot encoding using scikit-learn's pre-built class, or implement it ourselves in Python. After implementation, each sentence becomes a 2-D array whose shape depends on the sentence, as shown in the sample image of one sentence below.
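As a minimal pure-Python sketch of the idea (not the author's original notebook code, whose output was shown only as an image), each vocabulary word gets its own basis vector:

vocab = sorted({word for sent in df["text"] for word in sent.split()})

def one_hot_sentence(sentence):
    # one row per word; the array shape is (words in the sentence, vocabulary size)
    rows = []
    for word in sentence.split():
        vec = [0] * len(vocab)
        vec[vocab.index(word)] = 1
        rows.append(vec)
    return np.array(rows)

print(one_hot_sentence(df["text"][0]))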

Disadvantages

1) Sparsity – A single sentence already creates a vector of size n x m, where n is the length of the sentence and m is the number of unique words in the vocabulary, and the vast majority of the values in that vector (often 80 percent or more) are zero.

2) No fixed size – Each document has a different length, which creates vectors of different sizes that cannot be fed to a model.

3) Does not capture semantics – The core idea is to convert text into numbers while preserving the actual meaning of the sentence, and that meaning is not reflected in one-hot encoding.

2. Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(df['text'])

Now, to see the vocabulary and the vectors it has created, you can use code like the following.
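A small inspection snippet using standard scikit-learn attributes (the original article showed the output only as an image):

print(cv.vocabulary_)   # the word-to-column-index mapping learned from the corpus
print(bow.toarray())    # one fixed-size count vector per document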

Advantages

1) Simple and intuitive – Only a few lines of code are required to implement the technique.

2) Fixed-size vectors – The problem we saw with one-hot encoding, where each sentence produced a different-sized vector that could not be fed to a machine learning model, is solved here: the vectorizer ignores new words and counts only the words in its vocabulary, so every document maps to a vector of the same fixed size.

Disadvantages

1) Sparsity – When the vocabulary is large and a document contains only a few of its terms, the resulting vector is sparse (mostly zeros).

2) Ordering is ignored – Because word order is discarded, it is difficult to estimate the semantics of the document.

3. N-Grams

This technique is similar to Bag of Words. In the techniques so far, the vocabulary is made up of single words, which limits how much meaning we can capture. The N-gram technique addresses this by building the vocabulary from sequences of multiple words. When we build an N-gram model, we need to specify what we want: bigrams, trigrams, and so on. If the corpus cannot form the requested N-grams, the vectorizer will throw an error; in our three-sentence example, we cannot go up to a 4-gram or 5-gram model. Let us try bigrams and observe the output.

# Bigram model
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2))
bow = cv.fit_transform(df['text'])

You can try trigrams with a range like (3, 3), and experiment with other N ranges to get more clarity on the technique; also try transforming a new document and observe how it is handled.
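As a quick, hypothetical check along those lines (get_feature_names_out requires scikit-learn 1.0 or newer, and the new sentence is made up):

print(cv.get_feature_names_out())              # the bigram vocabulary learned from the corpus
new_doc = ["people watch and write comment"]   # a made-up new document
print(cv.transform(new_doc).toarray())         # bigrams unseen during fitting are simply ignored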

Advantages

1) Able to capture some semantic meaning of the sentence – Using bigrams or trigrams keeps short word sequences together, which makes it easier to capture relationships between words.

2) Intuitive and easy to implement – implementation of N-Gram is straightforward with a little bit of modification in Bag of words.

Disadvantages

1) As we move from unigrams to higher-order N-grams, the dimensionality of the vocabulary (and hence the vectors) increases, so computation and prediction take a little more time.

2) No solution for out-of-vocabulary terms – There is no option other than ignoring the new words that appear in a new sentence.

4. TF-IDF (Term Frequency and Inverse Document Frequency)

The technique we study now does not work the same way as the techniques above. It assigns a different value (weight) to each word in a document. The core idea is that a word which appears many times in a document but rarely in the rest of the corpus is very important for that document, so it receives a higher weight. The weight is calculated from two terms, TF and IDF: to weight any word, we compute its TF and IDF and multiply them.

Term Frequency (TF) – The number of occurrences of a word in a document divided by the total number of terms in that document. For example, the term frequency of "people" in the sentence below is 1/5. It tells how frequently a particular word occurs in a particular document.

People read on Analytics Vidhya

Inverse Document Frequency (IDF) – The total number of documents in the corpus divided by the number of documents containing the term T, with the log taken of that fraction. If a word appears in every document, the log evaluates to zero. In practice scikit-learn uses a slightly different implementation: because a zero IDF would make the word's contribution vanish, it adds one to the result, which is why the TF-IDF values you observe are a bit higher. A word that appears in only a single document gets the highest IDF.
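Written out, the classic definitions the article describes are below; note that scikit-learn's TfidfVectorizer actually uses the raw term count as TF and, with the default smooth_idf=True, the smoothed IDF $\ln\frac{1+n}{1+\mathrm{df}(t)}+1$, followed by L2 normalization of each row vector.

$$\mathrm{tf}(t,d)=\frac{\text{count of } t \text{ in } d}{\text{total terms in } d},\qquad \mathrm{idf}(t)=\log\frac{N}{\mathrm{df}(t)},\qquad \text{tf-idf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t)$$

where N is the number of documents in the corpus and df(t) is the number of documents containing term t.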

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

So one term keeps track of how frequently the term occurs while the other keeps track of how rarely the term occurs.

End Notes

Connect with me on Linkedin

Check out my other articles here and on Blogger

Thanks for giving your time!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 



Automated Machine Learning For Supervised Learning (Part 1)

This article was published as a part of the Data Science Blogathon                      

This article aims to demonstrate automated Machine Learning, also referred to as AutoML. To be specific, AutoML will be applied to problem statements requiring supervised learning, i.e., regression and classification on tabular data. This article does not discuss other kinds of Machine Learning problems, such as clustering, dimensionality reduction, time series forecasting, Natural Language Processing, recommendation systems, or image analysis.

Understanding the problem statement and dataset

Before jumping to AutoML, we will cover the basics of the conventional Machine Learning workflow. After getting the dataset and understanding the problem statement, we need to identify the goal of the task. This article, as mentioned above, focuses on regression and classification tasks, so make sure that the dataset is tabular. Other data formats, such as time series, spatial, image, or text, are not the main focus here.

Next, explore the dataset to understand some basic information, such as the following (a short pandas sketch appears after the list):

Descriptive statistics (count, mean, standard deviation, minimum, maximum, and quartile) using .describe();

Data type of each feature using .info() or .dtypes;

Count of values using .value_counts();

Null value existence using .isnull().sum();

Correlation test using .corr();

etc.
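A minimal pandas sketch of that exploration, assuming the data has been loaded into a DataFrame called df (the file and column names here are hypothetical):

import pandas as pd

df = pd.read_csv("train.csv")            # hypothetical file name
print(df.describe())                     # descriptive statistics
df.info()                                # data type of each feature
print(df["target"].value_counts())       # count of values in a (hypothetical) target column
print(df.isnull().sum())                 # null values per column
print(df.corr(numeric_only=True))        # correlations between numeric features (pandas 1.5+)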

 

Pre-processing

After understanding the dataset, do the data pre-processing. This part is very important because it produces the training dataset used for Machine Learning fitting. Data pre-processing can start with handling missing data. Users should decide whether to remove observations with missing data or apply data imputation, which means filling the missing value with the mean, median, a constant, or the most frequent value. Users should also pay attention to outliers or bad data and remove them so that they do not become noise.
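For instance, a hedged sketch of median imputation with scikit-learn's SimpleImputer (the column names are made up for illustration):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")            # or "mean", "most_frequent", "constant"
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])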

Feature scaling is a very important process in data preprocessing. Feature scaling aims to scale the value range in each feature so that features with higher values and small variance do not dominate other features with low values and high variance. Some examples of feature scaling are standardization, normalization, log normalization, etc.
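As an illustration of the two most common choices, again on hypothetical columns:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = df[["age", "income"]]
X_standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
X_normalized = MinMaxScaler().fit_transform(X)       # rescaled into the [0, 1] range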

Feature scaling is suitable for gradient descent-based and distance-based Machine Learning algorithms, while tree-based algorithms do not need feature scaling. The following table shows examples of algorithms.

Table 1. Examples of algorithms

Gradient descent-based: Linear Regression, Ridge Regression, Lasso Regression, Elasticnet Regression, Neural Network (Deep Learning)

Distance-based: K Nearest Neighbors, Support Vector Machine, K-means, Hierarchical clustering

Tree-based: Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting

Notice that there are also clustering algorithms in the table. K-means and hierarchical clustering are unsupervised learning algorithms.

Feature engineering (generation, selection, and extraction) refers to the activities of creating new features expected to help the prediction, removing low-importance features or noise, and deriving new features from partial information of combined existing features, respectively. This part is very important because adding or removing features can improve model accuracy, and cutting the number of features can also reduce the running time.

Creating model, hyperparameter-tuning, and model evaluation

The main part of Machine Learning is choosing an algorithm and building a model with it. The algorithm needs the training dataset features, a target or label feature, and some hyperparameters as arguments. After the model is built, it is used to predict on the validation or test dataset to check the score. To improve the score, hyperparameter tuning is performed: the hyperparameters of the algorithm are changed repeatedly until a satisfactory model with a particular set of hyperparameters is obtained. The model is evaluated with scoring metrics such as Root Mean Squared Error, Mean Squared Error, or R2 for regression problems, and accuracy, Area Under the ROC Curve, or F1-score for classification problems, typically using cross-validation. To read more about hyperparameter tuning, please find this article.
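A minimal sketch of cross-validated hyperparameter tuning with scikit-learn's GridSearchCV; the estimator, parameter grid, and the X_train/y_train names are assumptions for illustration, not the author's notebook code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="accuracy", cv=5)   # 5-fold cross-validation
search.fit(X_train, y_train)                                  # X_train / y_train prepared earlier
print(search.best_params_, search.best_score_)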

After getting the optimum model with a set of hyperparameters, we may want to try other Machine Learning algorithms, along with the hyperparameter-tuning. There are many algorithms for regression and classification problems with their pros and cons. Different datasets have different Machine Learning algorithms to build the best prediction models. I have made notebooks containing a number of commonly used Machine Learning algorithms using the steps mentioned above. Please check it here:

The datasets are provided by Kaggle. The regression task is to predict house prices using the parameters of the houses. The notebook contains the algorithms: Linear Regression, Ridge Regression, Lasso Regression, Elastic-net Regression, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Light GBM, Extreme Gradient Boosting (XGBoost), and Neural Network (Deep Learning).

The binary classification task is to predict whether the Titanic passengers would survive or not. This is a newer dataset published in April 2023 (not the old Titanic dataset for Kaggle newcomers). The goal is to classify each observation into the class "survived" or "not survived" without probability. If there are more than 2 classes, it is called multi-class classification; however, the techniques are similar. The notebook contains the algorithms: Logistic Regression, Naive Bayes, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting, and Neural Network (Deep Learning). Notice that some algorithms can perform both regression and classification.

Another notebook I created predicts a binary classification with probability: whether each observation of location, date, and time was high-traffic or not. If the probability of being high traffic is, for example, 0.8, then the probability of not being high traffic is 0.2. There is also multi-class classification, which predicts the probabilities of more than two classes.

If you have seen my notebooks from the hyperlinks above, many algorithms are used to build prediction models for the same dataset. But which model should be used, since the models predict different outputs? The simplest way is to pick the model with the best score (lowest RMSE or highest accuracy). Or we can use ensemble methods. Ensemble methods combine multiple different machine learning algorithms that predict the same dataset; the final output is the average of the predicted outputs in regression, or the majority vote in classification. Actually, Random Forest, GBM, and XGBoost are themselves ensemble methods, but they develop the same type of model, a Decision Tree, from different subsets of the training data.
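A hedged sketch of such an ensemble with scikit-learn's VotingClassifier (the choice of base models and the X_train/y_train/X_val/y_val names are assumptions):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier()),
                ("svc", SVC(probability=True))],
    voting="soft")                       # average the predicted probabilities
vote.fit(X_train, y_train)
print(vote.score(X_val, y_val))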

Finally, we can save the model if it is satisfactory. The saved model can be loaded again in other notebooks to make the same kind of predictions.
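For example, persisting a fitted model with joblib, a common approach for scikit-learn models (the file name and the vote model from the sketch above are illustrative):

import joblib

joblib.dump(vote, "final_model.joblib")        # save the fitted model to disk
loaded = joblib.load("final_model.joblib")     # reload it in another notebook
print(loaded.predict(X_val[:5]))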

Fig. 1 Machine Learning Workflow. Source: created by the author

 

Automated Machine Learning

The process of building Machine Learning models and choosing the best one is very long; it takes many lines of code and much time to complete. However, Data Science and Machine Learning are associated with automation, and so we have automated Machine Learning, or AutoML. AutoML needs only a few lines of code to do most, but not all, of the steps above. Figure 1 shows the Machine Learning workflow. AutoML covers only data pre-processing, model selection, and hyperparameter tuning; users still have to understand the goal, explore the dataset, and prepare the data.

There are many AutoML packages for regression and classification tasks on structured tabular data, images, text, and other prediction problems. Below is the code for one of these AutoML packages, named Auto-Sklearn. The dataset is Titanic Survival, the same as in the previous notebooks. Auto-Sklearn was developed by Matthias Feurer et al. in the paper "Efficient and Robust Automated Machine Learning" (NeurIPS 2015). Auto-Sklearn is available as an open-source Python package. Yes, Sklearn or Scikit-learn is the common package for performing Machine Learning in the Python language, and almost all of the algorithms in the notebooks above are from Sklearn.

# Install and import packages
!apt install -y build-essential swig curl
!pip install auto-sklearn
from autosklearn.classification import AutoSklearnClassifier

# Create the AutoSklearnClassifier
sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))

Output:

Dataset name: da588f6e-c217-11eb-802c-0242ac130202
Metric: accuracy
Best validation score: 0.769936
Number of target algorithm runs: 26
Number of successful target algorithm runs: 7
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 19
Number of target algorithms that exceeded the memory limit: 0
Accuracy: 0.7710593242331447

# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn)))
print(classification_report(y_val, pred_sklearn))

Output:

Confusion Matrix
      0     1
0  8804  2215
1  2196  6052

              precision    recall  f1-score   support
           0       0.80      0.80      0.80     11019
           1       0.73      0.73      0.73      8248
    accuracy                           0.77     19267
   macro avg       0.77      0.77      0.77     19267
weighted avg       0.77      0.77      0.77     19267

The code is set to run for 3 minutes, with no single algorithm run taking longer than the per-run time limit (15 seconds in the code above). See, with only a few lines, we can create a classification model automatically. We do not even need to think about which algorithm to use or which hyperparameters to set; even a beginner in Machine Learning can do it right away and just get the final result. The code above ran 26 algorithms, but only 7 of them completed; the other 19 exceeded the set time limit. It achieves an accuracy of 0.771. To see which models were tried and selected, run this line:

print(sklearn.show_models())

The following code is also Auto-Sklearn, but for regression work. It develops an autoML model to predict the House Prices dataset. It can find a model with RMSE of 28,130 from successful 16 algorithms out of the total 36 algorithms.

# Install and import packages
!apt install -y build-essential swig curl
!pip install auto-sklearn
from autosklearn.regression import AutoSklearnRegressor

# Create the AutoSklearnRegressor
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the RMSE
rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
print('RMSE: ' + str(rmse_sklearn))

Output:

Dataset name: 71040d02-c21a-11eb-803f-0242ac130202
Metric: r2
Best validation score: 0.888788
Number of target algorithm runs: 36
Number of successful target algorithm runs: 16
Number of crashed target algorithm runs: 1
Number of target algorithms that exceeded the time limit: 15
Number of target algorithms that exceeded the memory limit: 4
RMSE: 28130.17557050461

# Scatter plot true and predicted values
plt.scatter(pred_sklearn, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(pred_sklearn, y_val)[0,1], 2)))
plt.show()

Output:


Fig. 2 Scatter plot from autoSklearnRegressor. Source: created by the author

So, do you think that Machine Learning Scientists/Engineers are still needed?

There are still other AutoML packages to discuss, like Hyperopt-Sklearn, Tree-based Pipeline Optimization Tool (TPOT), AutoKeras, MLJAR, and so on. But we will discuss them in part 2.

About Author

Connect with me here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Google Colab For Machine Learning And Deep Learning

“Memory Error” – that all too familiar dreaded message in Jupyter notebooks when we try to execute a machine learning or deep learning algorithm on a large dataset. Most of us do not have access to unlimited computational power on our machines. And let’s face it, it costs an arm and a leg to get a decent GPU from existing cloud providers. So how do we build large deep learning models without burning a hole in our pockets? Step up – Google Colab!

It’s an incredible online browser-based platform that allows us to train our models on machines for free! Sounds too good to be true, but thanks to Google, we can now work with large datasets, build complex models, and even share our work seamlessly with others. That’s the power of Google Colab.

What is Google Colab?

Google Colaboratory is a free online cloud-based Jupyter notebook environment that allows us to train our machine learning and deep learning models on CPUs, GPUs, and TPUs.

Here’s what I truly love about Colab. It does not matter which computer you have, what its configuration is, or how ancient it might be. You can still use Google Colab! All you need is a Google account and a web browser. And here’s the cherry on top – you get access to GPUs like the Tesla K80 and even a TPU, for free!

TPUs are even more expensive than GPUs, and you can use them for free on Colab. It’s worth repeating again and again – it’s an offering like no other.

Are you still using that same old Jupyter notebook on your system for training models? Trust me, you’re going to love Google Colab.

What is a Notebook in Google Colab?

Google Colab Features

Colab provides users free access to GPUs and TPUs, which can significantly speed up the training and inference of machine learning and deep learning models.

Colab’s interface is web-based, so installing any software on your local machine is unnecessary. The interface is also intuitive and user-friendly, making it easy to get started with coding.

Colab allows multiple users to work on the same notebook simultaneously, making collaborating with team members easy. Colab also integrates with other Google services, such as Google Drive and GitHub, making it easy to share your work.

Colab notebooks support markdown, which allows you to include formatted text, equations, and images alongside your code. This makes it easier to document your work and communicate your ideas.

Colab comes pre-installed with many popular libraries and tools for machine learning and deep learning, such as TensorFlow and PyTorch. This saves time and eliminates the need to manually install and configure these tools.

GPUs and TPUs on Google Colab

Ask anyone who uses Colab why they love it. The answer is unanimous – the availability of free GPUs and TPUs. Training models, especially deep learning ones, takes numerous hours on a CPU. We’ve all faced this issue on our local machines. GPUs and TPUs, on the other hand, can train these models in a matter of minutes or seconds.

If you still need a reason to work with GPUs, check out this excellent explanation by Faizan Shaikh.

It gives you a decent GPU for free, which you can continuously run for 12 hours. For most data science folks, this is sufficient to meet their computation needs. Especially if you are a beginner, then I would highly recommend you start using Google Colab.

Google Colab gives us three types of runtime for our notebooks:

CPUs,

GPUs, and

TPUs

As I mentioned, Colab gives us 12 hours of continuous execution time. After that, the whole virtual machine is cleared and we have to start again. We can run multiple CPU, GPU, and TPU instances simultaneously, but our resources are shared between these instances.

Let’s take a look at the specifications of different runtimes offered by Google Colab:

It will cost you A LOT to buy a GPU or TPU from the market. Why not save that money and use Google Colab from the comfort of your own machine?

How to Use Google Colab?

You can go to Google Colab using this link. This is the screen you’ll get when you open Colab:

You can also import your notebook from Google Drive or GitHub, but they require an authentication process.

Google Colab Runtimes – Choosing the GPU or TPU Option

The ability to choose different types of runtimes is what makes Colab so popular and powerful. Here are the steps to change the runtime of your notebook:

Step 1: Open the Runtime menu and select "Change runtime type".

Step 2: Here you can change the runtime according to your need:

A wise man once said, “With great power comes great responsibility.” I implore you to shut down your notebook after you have completed your work so that others can use these resources because various users share them. You can terminate your notebook like this:

Using Terminal Commands on Google Colab

You can use the Colab cell for running terminal commands. Most of the popular libraries come installed by default on Google Colab. Yes, Python libraries like Pandas, NumPy, scikit-learn are all pre-installed.

If you want to run a different Python library, you can always install it inside your Colab notebook like this:

!pip install library_name

Pretty easy, right? Everything is similar to how it works in a regular terminal; you just have to put an exclamation mark (!) before each command, like:

!ls

or:

!pwd

Cloning Repositories in Google Colab

You can also clone a Git repo inside Google Colaboratory. Just go to your GitHub repository and copy the clone link of the repository:

Then, simply run:
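The command is the standard git clone invocation prefixed with an exclamation mark; the URL below is just a placeholder:

!git clone https://github.com/<username>/<repository>.git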

And there you go!

Uploading Files and Datasets

Here’s a must-know aspect for any data scientist. The ability to import your dataset into Colab is the first step in your data analysis journey.

The most basic approach is to upload your dataset to Colab directly:
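One way to do this is with the google.colab helper module; the CSV file name below is hypothetical:

from google.colab import files
import io
import pandas as pd

uploaded = files.upload()                                # opens a file picker in the notebook
df = pd.read_csv(io.BytesIO(uploaded["my_data.csv"]))    # read the uploaded (hypothetical) file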

You can also upload your dataset to any other platform and access it using its link. I tend to go with the second approach more often than not (when feasible).

Saving Your Notebook

All the notebooks on Colab are stored on your Google Drive. The best thing about Colab is that your notebook is automatically saved after a certain time period and you don’t lose your progress.

If you want, you can export and save your notebook in both *.py and *.ipynb formats:

Not just that, you can also save a copy of your notebook directly on GitHub, or you can create a GitHub Gist:

I love the variety of options we get.

Exporting Data/Files from Google Colab

You can export your files directly to Google Drive, or you can export it to the VM instance and download it by yourself:

Exporting directly to the Drive is a better option when you have bigger files or more than one file. You’ll pick up these nuances as you work on bigger projects in Colab.

Sharing Your Notebook

Google Colab also gives us an easy way of sharing our work with others. This is one of the best things about Colab:

What’s Next?

Google Colab now also provides a paid platform called Google Colab Pro, priced at $9.99 a month. In this plan, you can get the Tesla T4 or Tesla P100 GPU, and an option of selecting an instance with a high RAM of around 27 GB. Also, your maximum computation time is doubled from 12 hours to 24 hours. How cool is that?

You can consider this plan if you need high computation power because it is still quite cheap when compared to other cloud GPU providers like AWS, Azure, and even GCP.

Recommendations

If you’re new to the world of Deep Learning, I have some excellent resources to help you get started in a comprehensive and structured manner:


Toolset For Using Machine Learning Without Matlab

Although Matlab is a popular programming language in the field of machine learning, it is expensive. Nowadays, many programmers are looking for substitute toolkits to build machine learning algorithms. Thankfully, there are a number of open-source, economical solutions that can provide comparable features. This post will examine some of the top toolkits for employing machine learning outside of Matlab, including R packages like caret and randomForest as well as Python libraries like scikit-learn and TensorFlow.

List of toolset

There are many tools available for using machine learning without MATLAB. Here are some popular options −

1. Python

Python is a powerful and flexible programming language that has gained popularity for application in data analysis and machine learning. There are a number of machine-learning frameworks and tools that have been developed using this free and open-source language, which has a substantial and active development community.

One of the best-known Python machine learning libraries is PyTorch. Facebook created PyTorch, an open-source machine learning framework that offers a powerful tensor library for deep learning. Its dynamic computational graph makes it more adaptable and user-friendly than rival frameworks.

Scikit-learn is another popular Python machine-learning package. It is a straightforward and effective data mining and data analysis tool that offers a variety of supervised and unsupervised learning methods for applications like classification, regression, and clustering.
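As a quick, generic illustration of that API style (using a built-in toy dataset, not one from this post):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))    # classification accuracy on the held-out split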

Together with these libraries, Python also provides a wide range of additional beneficial machine-learning tools including Keras, Theano, and Pandas. Theano is a deep learning framework for numerical computing, Pandas is a data manipulation library that offers data structures for effective data analysis, and Keras is a high-level neural network library.

Generally, Python’s appeal in machine learning may be attributed to its simplicity, adaptability, and abundance of libraries and frameworks. Building and training machine learning models as well as analyzing and manipulating data for diverse applications are made simpler by these tools and frameworks.

2. R

R is a software environment and programming language for statistical computation and graphics. It also features several packages, like caret and randomForest, and is frequently used for machine learning applications.

It has grown to be a well-liked option for data analysis, machine learning, and statistical modeling thanks to its large library of statistical and graphical tools.

R’s extensive library of packages created especially for data analysis and machine learning is one of the key factors contributing to its appeal in machine learning. Caret and randomForest are two of these tools that are frequently used for machine learning.

The R package Caret (Classification And Regression Training) offers a uniform interface for training and fine-tuning a wide range of machine-learning models. It supports a broad range of methods, including linear and nonlinear regression, decision trees, and support vector machines, and provides functions for data splitting, preprocessing, feature selection, and model training.

Another well-liked R package, RandomForest, offers the random forest technique implementation for classification and regression problems. Because of its capacity to manage high-dimensional data, cope with missing values, and handle relationships between variables, it is a preferred option for machine learning applications.

R features a wide range of other helpful machine learning packages, like the caretEnsemble package, which offers tools for merging several machine learning models, and the glmnet package, which offers effective generalized linear model implementations.

Overall, R’s large library of packages for statistical computation and data analysis makes it a popular language for machine learning.

3. RapidMiner

An integrated environment for model deployment, machine learning, and data preparation is provided by the data science platform RapidMiner. The interface is drag-and-drop and supports a wide range of data sources and formats.

Model deployment, machine learning, and data preparation can all be done in one integrated environment with the help of the potent data science platform RapidMiner. It seeks to simplify for users the processes of data collecting, machine learning model construction, and application in real-world scenarios.

The essential aspect is the rapid miner’s ability to use workflows to automate the machine learning process. The best-performing models may be generated quickly and easily, tested, and then put into production via a number of approaches.

Overall, RapidMiner is a capable and flexible data science platform that can be used for many machine learning and data analysis tasks. It is a well-liked option for both novice and experienced users because of its user-friendly drag-and-drop interface, wide selection of machine-learning algorithms, and compatibility with a number of data sources and formats.

4. KNIME

An open-source platform for data analytics called KNIME offers a graphical user interface for creating data pipelines and processes. It may be expanded with plugins and customized nodes in addition to having several built-in nodes for data preparation, machine learning, and visualization.

An open-source platform for data analytics called KNIME offers a visual interface for creating data pipelines and processes. Even those without programming skills may use it easily, yet it nonetheless has cutting-edge features for applications involving machine learning and data analytics.

Moreover, KNIME enables distinctive nodes and plugins that are developed and shared by the user base. Now, users may enhance platform features to meet their own demands.

KNIME’s capacity to interact with various platforms and data sources like Hadoop, Spark, and R is another important aspect. As a result, working with big, complicated datasets and incorporating KNIME into current data ecosystems are made simple.

A variety of machine learning methods, such as decision trees, clustering, and regression models, are offered by KNIME. With the use of a straightforward drag-and-drop interface, these can be set up, trained, and then applied to fresh data inside the platform.

Last but not least, KNIME offers a wide selection of charts, graphs, and other visualizations as part of its rich support for data visualization. This enables users to study and comprehend their data in a number of ways and successfully share their conclusions with others.

Conclusion

Python is a powerful and flexible programming language that has gained popularity in data analysis and machine learning due to its simplicity, adaptability, and abundance of libraries and frameworks. RapidMiner provides an integrated environment for model deployment, machine learning, and data preparation, and KNIME offers a graphical user interface for creating data pipelines and processes. KNIME is a powerful and adaptable framework for data analytics that is suitable for both novice and expert users due to its large library of built-in nodes, support for new nodes, and plugins.

Learning Different Techniques Of Anomaly Detection

This article was published as a part of the Data Science Blogathon.

Introduction

As a data scientist, you will come across anomaly detection in many situations, such as fraud detection for bank transactions or anomaly detection in smart-meter readings.

Have you ever thought about the bank where you make several transactions and how the bank helps you by identifying fraud?

Someone logs in to your account, and a message is sent to you immediately after the bank notices suspicious activity at a place, to confirm whether it is you or someone else.

What is Anomaly Detection?

Suppose we run a business and notice errors in the front-end data: even though our company supplies the same service, the recorded sales are declining. Such unexpected data points are termed anomalies or outliers.

Let’s take an example that will further clarify what an anomaly means.

Source: Canvas

 Here in this example, a bird is an outlier or noise.

Coming back to the bank example: how does the bank help you by identifying fraud?

If the bank notices unusual behavior in your account, it can block the card. For example, spending a large amount in one day, or another unusual sum on another day, will trigger an alert message or a card block, because it does not match how you were spending previously.

AI firms such as Feedzai and Ayasdi provide anomaly detection solutions for banks.

Let’s take another example: shopping. At the end of the month, the shopkeeper puts certain items on sale and offers a scheme where you can buy two at a lower rate.

Now how do we describe the sale-period data compared to the start-of-month data? Does the sale data look consistent with the regular monthly sales from the start of the month? It does not.

Outliers are often described as “something I should remove from the dataset so that it doesn’t skew the model I’m building”, typically because the analyst suspects that the data in question is flawed and that the model they want to construct shouldn’t need to take it into account.

Outliers are most commonly caused by:

Intentional (dummy outliers created to test detection methods)

Data processing errors (data manipulation or data set unintended mutations)

Sampling errors (extracting or mixing data from wrong or various sources)

Natural (not an error, novelties in the data)

An actual data point significantly outside a distribution’s mean or median is an outlier.

An anomaly is a false data point made by a different process than the rest of the data.

If you construct a linear regression model, points far from the regression line are unlikely to have been generated by the model. This is also called the likelihood of the data.

Outliers are data points with a low likelihood, according to your model. They are identical from the perspective of modeling.

For instance, you could construct a model that describes a trend in the data and then actively looks for existing or new values with a very low likelihood. When people say “anomalies,” they mean these things. The anomaly detection of one person is the outlier of another!

Extreme values in your data series are called outliers. They are questionable. One student can be much more brilliant than other students in the same class, and it is possible.

However, anomalies are unquestionably errors: for example, a reading of one million degrees outside, or the air temperature staying exactly the same for two weeks. As a result, you disregard this data.

An outlier is a valid data point, and can’t be ignored or removed, whereas noise is garbage that needs removal. Let’s take another example to understand noise.

Suppose you wanted to take the average salary of employees, and the data included the pay of Ratan Tata or Bill Gates; every average computed over the employees' salaries would be inflated, giving an incorrect picture.

Outliers can be categorized as:

Uni-variate – outliers observed in a single variable of the dataset.

Multi-variate – outliers defined over more than one variable, i.e., an unusual combination of values across variables.

We will now use various techniques which will help us to find outliers.

Anomaly Detection by Scikit-learn

We will import the required library and read our data.

import seaborn as sns
import pandas as pd

titanic = pd.read_csv('titanic.csv')
titanic.head()



We can see in the image many null values. We will fill the null values with mode.

titanic['age'].fillna(titanic['age'].mode()[0], inplace=True)
titanic['cabin'].fillna(titanic['cabin'].mode()[0], inplace=True)
titanic['boat'].fillna(titanic['boat'].mode()[0], inplace=True)
titanic['body'].fillna(titanic['body'].mode()[0], inplace=True)
titanic['sex'].fillna(titanic['sex'].mode()[0], inplace=True)
titanic['survived'].fillna(titanic['survived'].mode()[0], inplace=True)
titanic['home.dest'].fillna(titanic['home.dest'].mode()[0], inplace=True)

Let’s see our data in more detail. When we look at our data in statistics, we prefer to know its distribution types, whether binomial or other distributions.

titanic['age'].plot.hist( bins = 50, title = "Histogram of the age" )

This distribution is Gaussian distribution and is often called a normal distribution.

Mean and standard deviation are its two parameters. As the mean changes, the distribution curve shifts to the left or right.

A standard normal distribution has mean μ = 0 and standard deviation σ = 1. A Z-table is readily available for looking up the corresponding probabilities.

Z-Scores

We can calculate Z-scores with the formula z = (x − μ) / σ, where x is a random variable, μ is the mean, and σ is the standard deviation.

Why do we need Z-Scores to be calculated?

It helps to know how a single or individual value lies in the entire distribution.

For example, suppose the mean maths score is 82, the standard deviation σ is 4, and we have a value x of 75. The Z-score is (75 − 82)/4 = −1.75, so the value 75 lies 1.75 standard deviations below the mean. Z-scores help determine whether a value is higher than, lower than, or equal to the mean, and by how far.

Now, we will calculate Z-Score in python and look at outliers.

We imported zscore from SciPy, calculated the Z-scores, and then filtered the data by applying a lambda. This gives us the outliers, which range in age from 66 to 80.

from scipy.stats import zscore

titanic["age_zscore"] = zscore(titanic["age"])
# flag ages whose z-score is at least 2.8 (threshold assumed; it matches the 66-80 range reported above)
titanic["outlier"] = titanic["age_zscore"].apply(lambda x: x >= 2.8)
titanic[titanic["outlier"]]

We will now look at another method based on clustering called Density-based spatial clustering of applications with noise (DBSCAN).

DBSCAN

As the name indicates, this outlier detection method is based on clustering. In this method, we calculate the distance between points.

Let’s continue with our Titanic data and plot a graph of fare against age. In the scatter plot of the age and fare variables we find three points far away from the others.

Before we proceed further, we will normalize our data variables.

There are many ways to normalize our data; for example, we can import StandardScaler or MinMaxScaler from sklearn.

titanic['fare'].fillna(titanic['fare'].mean(), inplace=True)

from sklearn.preprocessing import StandardScaler

fage = titanic[["age", "fare"]]   # select the two columns (this step is assumed; it was not shown in the original)
scale = StandardScaler()
fage = scale.fit_transform(fage)
fage = pd.DataFrame(fage, columns=["age", "fare"])
fage.plot.scatter(x="age", y="fare")

We used Standard Scaler to make our data normal and plotted a scatter graph.

Now we will import DBSCAN to assign points to clusters. Points that do not fit into any cluster are labelled -1.

from sklearn.cluster import DBSCAN

outlier = DBSCAN(eps=0.5, metric="euclidean", min_samples=3, n_jobs=-1)
clusters = outlier.fit_predict(fage)
clusters

array([0, 1, 1, ..., 1, 1, 1])

Now we have the results, but how do we check which value is the smallest and whether -1 values are present? We will use argmin to find the position of the smallest value among the cluster labels.

import numpy as np   # needed for np.min and np.where

value = -1
index = clusters.argmin()
print("The element is at:", index)
small_num = np.min(clusters)
print("The small number is:", small_num)
print(np.where(clusters == small_num))

The element is at: 14
The small number is: -1
(array([ 14,  50,  66,  94, 285, 286], dtype=int64),)

We can see from the result six values which are -1.

Let's now plot a scatter graph.

from matplotlib import cm

c = cm.get_cmap('magma_r')
fage.plot.scatter(x="age", y="fare", c=clusters, cmap=c, colorbar=True)

The above methods we applied are on uni-variate outliers.

For Multi-variates outliers detections, we need to understand the multi-variate outliers.

For example, take car readings. A car has two relevant meters: the speedometer, which records the speed at which the vehicle is moving, and the rpm meter, which records the number of rotations made by the car wheel per minute.

Suppose the speedometer shows values in the range of 0-60 mph and the rpm in 0-750, and we assume all the readings should correlate with each other. If the speedometer shows a speed of 50 but the rpm shows 0, the readings are inconsistent: if the speedometer is above zero, the car is moving, so the rpm should also be higher, yet in our case it shows 0. Such an inconsistent combination of values is a multi-variate outlier.

Mahalanobis Distance Method

In DBSCAN we used the Euclidean distance metric, but here we discuss the Mahalanobis distance method. We can also use Mahalanobis distance with DBSCAN.

DBSCAN(eps=0.5, min_samples=3, metric='mahalanobis', metric_params={'V':np.cov(X)}, algorithm='brute', leaf_size=30, n_jobs=-1)

Why is Euclidean distance unfit for variables that are correlated with each other? Because it ignores the correlation, Euclidean distance can give a misleading picture of how close two points really are.

The Mahalanobis method measures the distance between a point and a distribution of clean data. Euclidean distance is computed between two points, and the z-score is x minus the mean divided by the standard deviation; in the Mahalanobis distance, the deviation from the mean is instead scaled by the inverse of the covariance matrix.
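Written as a formula (a standard definition; it is not spelled out in the original article), the Mahalanobis distance of a point x from a distribution with mean vector μ and covariance matrix Σ is:

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})}$$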

So what effect does dividing by the covariance matrix have? If the variables in your dataset are highly correlated, the covariance values will be high, and the distance is reduced accordingly.

Similarly, if the covariance values are low because the data are not correlated, the distance is not significantly reduced. In this way the method addresses both the scale and the correlation of the variables.

Code

df = pd.read_csv('caret.csv').iloc[:, [0, 4, 6]]
df.head()

We define the function distance with arguments x=None, data=None, and cov=None. Inside the function we subtract the mean of the data, and we use the covariance matrix if one is passed in; otherwise we compute the covariance matrix from the data. T stands for transpose.

For example, if the array size is five or six and you want it to be in two variables, then we need to transpose the matrix.

np.random.multivariate_normal(mean, cov, size=5)
array([[0.0509196, 0.536808 ],
       [0.1081547, 0.9308906],
       [0.4545248, 1.4000731],
       [0.9803848, 0.9660610],
       [0.8079491, 0.9687909]])

np.random.multivariate_normal(mean, cov, size=5).T
array([[ 0.0586423, 0.8538419, 0.2910855, 5.3047358, 0.5449706],
       [ 0.6819089, 0.8020285, 0.7109037, 0.9969768, -0.7155739]])

We use sp.linalg, SciPy's linear algebra module, which provides the inv function for matrix inversion, and NumPy's dot for matrix multiplication.

import scipy as sp

def distance(x=None, data=None, cov=None):
    x_m = x - np.mean(data)
    if not cov:
        cov = np.cov(data.values.T)
    inv_cov = sp.linalg.inv(cov)
    left = np.dot(x_m, inv_cov)
    m_distance = np.dot(left, x_m.T)
    return m_distance.diagonal()

df_g = df[['carat', 'depth', 'price']].head(50)
df_g['m_distance'] = distance(x=df_g, data=df[['carat', 'depth', 'price']])
df_g.head()

B. Tukey's Method for Outlier Detection

The Tukey method is also often called the Box and Whisker or Box plot method.

It uses an upper and a lower range:

Upper range = 75th percentile + k * IQR

Lower range = 25th percentile − k * IQR

Let us see our Titanic data with age variable using a box plot.

sns.boxplot(titanic['age'].values)

The box plot created by Seaborn shows many points between the ages of 55 and 80 that lie outside the whiskers and are therefore outliers. We will detect the lower and upper range by writing a function outliers_detect.

def outliers_detect(x, k=1.5):
    x = np.array(x).copy().astype(float)
    first = np.quantile(x, .25)
    third = np.quantile(x, .75)
    # IQR calculation
    iqr = third - first
    # Upper range and lower range
    lower = first - (k * iqr)
    upper = third + (k * iqr)
    return lower, upper

outliers_detect(titanic['age'], k=1.5)
(2.5, 54.5)

Detection by PyCaret

We will be using the same dataset for detection by PyCaret.

from pycaret.anomaly import *

setup_anomaly_data = setup(df)

PyCaret is an open-source machine learning library that uses unsupervised learning models to detect outliers. It has a get_data method for loading datasets bundled with PyCaret itself, and a setup function for the preprocessing tasks before detection; setup usually takes a data frame but also has many other parameters, such as ignore_features.

Other methods, such as create_model, apply a specific algorithm. We will first use Isolation Forest.

ifor = create_model("iforest") plot_model(ifor) ifor_predictions = predict_model(ifor, data = df) print(ifor_predictions) ifor_anomaly = ifor_predictions[ifor_predictions["Anomaly"] == 1] print(ifor_anomaly.head()) print(ifor_anomaly.shape)

Anomaly 1 indicates outliers, and Anomaly 0 shows no outliers.

The yellow color here indicates outliers.

Now let us see another algorithm, K Nearest Neighbors (KNN)

knn = create_model("knn") plot_model(knn) knn_pred = predict_model(knn, data = df) print(knn_pred) knn_anomaly = knn_pred[knn_pred["Anomaly"] == 1] knn_anomaly.head() knn_anomaly.shape

Now we will use a clustering algorithm.

clus = create_model("cluster") plot_model(clus) clus_pred = predict_model(clus, data = df) print(clus_pred) clus_anomaly = clus_predictions[clus_pred["Anomaly"] == 1] print(clus_anomaly.head()) clus_anomaly.shape Anomaly Detection by PyOD

PyOD is a Python library for detecting outliers in multivariate data. It works in both supervised and unsupervised settings.

from pyod.models.iforest import IForest
from pyod.models.knn import KNN

We imported the library and algorithm.

from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

train = 300
test = 100
contaminate = 0.1
X_train, X_test, y_train, y_test = generate_data(n_train=train, n_test=test, n_features=2,
                                                 contamination=contaminate, random_state=42)
cname_alg = 'KNN'   # the name of the algorithm is K Nearest Neighbors
c = KNN()
c.fit(X_train)      # fit the algorithm
y_train_pred = c.labels_
y_train_scores = c.decision_scores_
y_test_pred = c.predict(X_test)
y_test_scores = c.decision_function(X_test)
print("Training Data:")
evaluate_print(cname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(cname_alg, y_test, y_test_scores)
visualize(cname_alg, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred, show_figure=True, save_figure=True)

We will use the IForest algorithm.

fname_alg = 'IForest'   # the name of the algorithm is Isolation Forest
f = IForest()
f.fit(X_train)          # fit the algorithm
y_train_pred = f.labels_
y_train_scores = f.decision_scores_
y_test_pred = f.predict(X_test)
y_test_scores = f.decision_function(X_test)
print("Training Data:")
evaluate_print(fname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(fname_alg, y_test, y_test_scores)
visualize(fname_alg, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred, show_figure=True, save_figure=True)

Anomaly Detection by Prophet

import prophet
from prophet import forecaster
from prophet import Prophet

m = Prophet()
data = pd.read_csv('air_pass.csv')
data.head()
data.columns = ['ds', 'y']
data['y'] = np.where(data['y'] != 0, np.log(data['y']), 0)

Taking the log of the y column ensures there are no negative values. We split our data into train and test sets and store the prediction in the variable forecast.

from sklearn.model_selection import train_test_split   # needed for the split below

train, test = train_test_split(data, random_state=42)
m.fit(train[['ds', 'y']])
forecast = m.predict(test)

def detect(forecast):
    forcast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    forcast['real'] = data['y']
    forcast['anomaly'] = 0
    forcast.loc[forcast['real'] < forcast['yhat_lower'], 'anomaly'] = -1
    forcast.loc[forcast['real'] > forcast['yhat_upper'], 'anomaly'] = 1   # assumed symmetric rule for the upper bound
    forcast['imp'] = 0
    in_range = forcast['yhat_upper'] - forcast['yhat_lower']
    forcast.loc[forcast['anomaly'] == 1, 'imp'] = (forcast['real'] - forcast['yhat_upper']) / in_range
    forcast.loc[forcast['anomaly'] == -1, 'imp'] = (forcast['yhat_lower'] - forcast['real']) / in_range
    return forcast

detect(forecast)

Points below the lower prediction bound are flagged with an anomaly value of -1 (and points above the upper bound with 1).

Conclusion

The process of finding outliers in a given dataset is called anomaly detection. Outliers are data objects that stand out from the rest of the object values in the dataset and don’t behave normally.

Anomaly detection tasks can use distance-based and density-based clustering methods to identify outliers as a cluster.

Here we discussed various anomaly detection methods and demonstrated them with code on three datasets: Titanic, Air Passengers, and the carat (diamonds) data.

Key Points

1. Outliers or anomaly detection can be detected using the Box-Whisker method or by DBSCAN.

2. The Euclidean distance method is used when the variables are not correlated with each other.

3. Mahalanobis method is used with Multivariate outliers.

4. Not all extreme values are outliers to be discarded. Some points are noise that ought to be treated as garbage and removed, whereas outliers are valid data that need to be handled or adjusted rather than simply deleted.

5. We used PyCaret for outlier detection with different algorithms; points with an Anomaly value of 1, shown in yellow, are outliers, and points with 0 are not.

6. We used PyOD, the Python Outlier Detection library, which has more than 40 algorithms and supports both supervised and unsupervised techniques.

7. We used Prophet and defined a detect function to flag the outliers.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Interview Questions On Bagging Algorithms In Machine Learning

This article was published as a part of the Data Science Blogathon.

Introduction

Bagging is a famous ensemble technique in machine learning, widely used for its performance and results. It is one of the most important and best-performing ensemble techniques, easy to use and accurate. Because it lifts the performance of even weak machine learning algorithms, it has become a popular ensemble technique and is often compared with other strong machine learning algorithms.

Machine learning interviews frequently include questions about bagging algorithms. This article discusses the top interview questions on bagging that are most often asked in machine learning interviews. Practicing these questions will help you understand the concept of bagging deeply and answer related interview questions efficiently.

1. What is Bagging and How Does it Work? Explain it with Examples.

Bagging stands for Bootstrap Aggregation. Bootstrapping generally means randomly selecting samples from a dataset, and aggregation stands for combining the outputs obtained from those samples. In bagging we generally take multiple machine learning models of the same algorithm, meaning that the same machine learning algorithm is simply used multiple times.

For example, if we use SVM as the base algorithm and the number of models is 5, then all 5 models will be SVMs. Once the base model is decided, bootstrapping selects random samples from the dataset and feeds them to the machine learning models.

The data is fed to the models by bootstrapping, and each model is trained separately. Once all the models are trained, there is a prediction phase in which every model predicts individually; the aggregation step then combines the multiple predictions (5 in this example). The common approach is to take the mean of the predictions for regression, or the majority vote for classification.
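A minimal scikit-learn sketch of this idea with an SVM base model (illustrative only; note that the base-model argument is named base_estimator in scikit-learn versions before 1.2 and estimator from 1.2 onward):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)    # toy data for illustration
bag = BaggingClassifier(estimator=SVC(), n_estimators=5,      # 5 SVM base models
                        bootstrap=True, random_state=0)       # rows sampled with replacement
bag.fit(X, y)
print(bag.score(X, y))    # predictions are aggregated by majority vote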

2. How is Bagging Different from the Random Forest Algorithm?

The most basic difference between bagging and random forest relates to the base models. In bagging, the base model can be any machine learning algorithm, selected through the base_estimator parameter.

In random forest, the base estimators are always decision trees; there is no option to select any other machine learning algorithm as the base estimator.

Another difference is that in bagging, all the features are used for training the base models, whereas in random forest only a subset of the features is considered for each base model, and only the best-performing of those are chosen as the final features.

3. What is the Difference Between Bootstrapping and Pasting in Bagging?

The main difference between bootstrapping and pasting is in the data sampling. As we know, in bagging the main dataset is sampled (either rows or columns), and those samples are provided to the base models for training.

In bagging, or bootstrapping, the samples taken from the main dataset and fed to the first model can be used again for training other models; here, the sampling is done with replacement.

In pasting, a sample is taken from the main dataset, but once it has been used for training one model, the same sample will not be used again for training any other model. So here, the sampling is done without replacement.
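In scikit-learn's BaggingClassifier, the difference comes down to a single flag (continuing the illustrative setup above):

paste = BaggingClassifier(estimator=SVC(), n_estimators=5,
                          max_samples=0.8,       # each model sees 80% of the rows
                          bootstrap=False,       # sampling without replacement -> pasting
                          random_state=0)
paste.fit(X, y)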

4. Why does Bagging Perform Well on Low Bias, High Variance Datasets?

In general, a low bias, high variance situation is one where the model performs very well on the training data and poorly on the testing data, i.e., the case of overfitting. Data prone to overfit on a model is preferred for bagging, because bagging reduces variance. Suppose we have a dataset with very high variance: 10,000 rows, of which 100 samples have extreme values. If this data is fed to a single algorithm, those 100 samples will disproportionately affect the training, and the algorithm will perform poorly. In bagging, there are multiple models of the same algorithm, and because of bootstrapping (sampling of the data) there is no case in which all 100 of those rows are fed to the same model. Each model therefore experiences roughly the same small share of the high-variance points, and in the end the high variance of the dataset does not dominate the final predictions.

5. What is the Difference between Bagging and Boosting? Which is Better?

In bagging algorithms, the main dataset is sampled into parts, and multiple base models of the same type are trained on different samples. In the final aggregation stage, the output of every base model is considered, and the final output is the mean or the most frequent prediction across all trained models. It is also known as parallel learning, since all the weak learners learn at the same time. Boosting, in contrast, is a stage-wise additive method in which multiple weak learners of the same algorithm are trained in sequence; the errors made by the previously trained weak learner are taken into account so that the next weak learner can avoid them. It is also known as sequential learning, since the weak learners learn one after another.

We cannot say which algorithm will always perform better, but generally bagging is preferred when there is low bias and high variance in the dataset (overfitting), whereas boosting is preferred in the case of high bias and low variance (underfitting).

Conclusion

This article discusses the top 5 interview questions with the core idea and intuition behind them. Reading and preparing these questions will help one understand the bagging algorithm’s core intuition and how it differs from other algorithms.

Some Key Takeaways from this article are:

1. Random forest is a bagging algorithm with decision trees as base models.

2. Bagging uses sampling of the data with replacement, whereas pasting uses sampling of the data without replacement.

3. Bagging performs well on the high variance dataset and boosting performs well on high-bias datasets.

Want to Contact the Author?

Follow Parth Shukla @AnalyticsVidhya, LinkedIn, Twitter, and Medium for more content.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

