Learning Different Techniques Of Anomaly Detection
This article was published as a part of the Data Science Blogathon.
Introduction
As a data scientist, you meet anomalies in many situations, such as fraud detection for bank transactions or anomaly detection in smart meter readings.
Have you ever thought about the bank where you make several transactions and how the bank helps you by identifying fraud?
If someone logs in to your account from an unusual place, the bank notices the suspicious activity and immediately sends you a message to confirm whether it was you or someone else.
What is Anomaly Detection?
Suppose we run a company that supplies the same service as before, yet we notice unexpected errors in the front-end data and declining sales. Such unexpected data points are termed anomalies or outliers.
Let’s take an example that will further clarify what an anomaly means.
Source: Canvas
Here in this example, a bird is an outlier or noise.
Returning to the bank where you make several transactions: how does the bank identify fraud?
If the bank manager notices unusual behavior in your account, they can block the card. For example, spending an unusually large amount in a single day, compared with how you were spending previously, will trigger an alert message or a block on your card.
Two AI firms offer anomaly detection solutions for banks: one from Fedzai and another from Ayasdi.
Let’s take another example of shopping. At the end of the month, the shopkeeper puts certain items on sale and offers you a scheme where you can buy two at a lower rate.
Now, how do we describe the sale-period data compared to the start-of-month data? Does the sale data look valid when compared with the regular monthly sales? It does not, which is why such points stand out from the rest.
Practitioners often treat outliers as “something I should remove from the dataset so that it doesn’t skew the model I’m building,” typically because they suspect that the data in question is flawed and that the model shouldn’t need to account for it.
Outliers are most commonly caused by:
Intentional (dummy outliers created to test detection methods)
Data processing errors (data manipulation or data set unintended mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
Natural (not an error, novelties in the data)
An actual data point significantly outside a distribution’s mean or median is an outlier.
An anomaly is a false data point made by a different process than the rest of the data.
If you construct a linear regression model, points far from the regression line are less likely to have been generated by the model; this is referred to as the likelihood of the data.
Outliers are data points with a low likelihood according to your model. From the modeling perspective, outliers and anomalies are identical.
For instance, you could construct a model that describes a trend in the data and then actively looks for existing or new values with a very low likelihood. When people say “anomalies,” they mean these things. The anomaly detection of one person is the outlier of another!
Extreme values in your data series are called outliers. They are questionable but possible: one student can be much more brilliant than the other students in the same class.
However, anomalies are unquestionably errors. For example, an air temperature of one million degrees, or a temperature that stays exactly the same for two weeks, cannot be real. As a result, you disregard such data.
An outlier is a valid data point, and can’t be ignored or removed, whereas noise is garbage that needs removal. Let’s take another example to understand noise.
Suppose you want the average salary of employees, and the data accidentally includes the pay of Ratan Tata or Bill Gates; the average salary will then be inflated, which gives an incorrect picture.
Outliers can be described at two levels:
1. Uni-variate – defined on a single variable with different values in the dataset.
2. Multi-variate – defined on a dataset having more than one variable, each with its own set of values.
We will now use various techniques which will help us to find outliers.
Anomaly Detection by Scikit-learn
We will import the required libraries and read our data.
import seaborn as sns
import pandas as pd
import numpy as np   # used later for argmin, min, and where
titanic = pd.read_csv('titanic.csv')
titanic.head()
We can see in the image many null values. We will fill the null values with mode.
titanic['age'].fillna(titanic['age'].mode()[0], inplace=True)
titanic['cabin'].fillna(titanic['cabin'].mode()[0], inplace=True)
titanic['boat'].fillna(titanic['boat'].mode()[0], inplace=True)
titanic['body'].fillna(titanic['body'].mode()[0], inplace=True)
titanic['sex'].fillna(titanic['sex'].mode()[0], inplace=True)
titanic['survived'].fillna(titanic['survived'].mode()[0], inplace=True)
titanic['home.dest'].fillna(titanic['home.dest'].mode()[0], inplace=True)
Let’s see our data in more detail. When we look at data in statistics, we prefer to know its distribution type, whether binomial or some other distribution.
titanic['age'].plot.hist(bins = 50, title = "Histogram of the age")
This distribution is a Gaussian distribution, often called a normal distribution.
Mean and standard deviation are its two parameters. As the mean changes, the distribution curve shifts to the left or right accordingly.
The standard normal distribution has mean μ = 0 and standard deviation σ = 1. Probabilities for it can be looked up in the readily available Z-table.
Z-Scores
We can calculate the z-score with the formula z = (x − μ) / σ, where x is a random variable, μ is the mean, and σ is the standard deviation.
Why do we need Z-Scores to be calculated?
It helps to know how a single or individual value lies in the entire distribution.
For example, if the mean of the maths scores is 82 and the standard deviation σ is 4, and we have a value x of 75, then the z-score is (75 − 82) / 4 = −1.75. The negative z-score shows that the value 75 lies below the mean. Z-scores help determine whether a value is higher than, lower than, or equal to the mean, and by how far.
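As a quick check of that arithmetic, here is a minimal sketch of the same calculation in Python (the numbers are the ones from the example above):

mu, sigma = 82, 4            # mean and standard deviation of the maths scores
x = 75                       # the observed score
z = (x - mu) / sigma
print(z)                     # -1.75, i.e. 1.75 standard deviations below the mean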
Now, we will calculate Z-Score in python and look at outliers.
We imported Z-Scores from Scipy. We calculated Z-Score and then filtered the data by applying lambda. It gives us the number of outliers ranging from the age of 66 to 80.
from scipy.stats import zscore
titanic["age_zscore"] = zscore(titanic["age"])
titanic["outlier"] = titanic["age_zscore"].apply(lambda x: x >= 2.8)   # flag ages whose z-score is 2.8 or more
titanic[titanic["outlier"]]
We will now look at another method, based on clustering, called Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
DBSCAN
As the name indicates, this outlier detection method is based on clustering. In this method, we calculate the distance between points.
Let’s continue our titanic data and plot a graph between fare and age. We made a scatter graph between age and fare variables. We found three dots far away from the others.
Before we proceed further, we will normalize our data variables.
There are many ways to normalize our data. We can import StandardScaler or MinMaxScaler from sklearn.
titanic['fare'].fillna(titanic['fare'].mean(), inplace=True)
from sklearn.preprocessing import StandardScaler
fage = titanic[["age", "fare"]]           # select the two columns to scale
scale = StandardScaler()
fage = scale.fit_transform(fage)
fage = pd.DataFrame(fage, columns = ["age", "fare"])
fage.plot.scatter(x = "age", y = "fare")
We used StandardScaler to normalize our data and plotted a scatter graph.
Now we will import DBSCAN to assign points to clusters. Points that cannot be assigned to any cluster are labeled -1.
from sklearn.cluster import DBSCAN
outlier = DBSCAN(eps = 0.5, metric = "euclidean", min_samples = 3, n_jobs = -1)
clusters = outlier.fit_predict(fage)
clusters
array([0, 1, 1, ..., 1, 1, 1])
Now we have the results, but how do we check which value is the minimum, which is the maximum, and whether we have any -1 values? We will use argmin to find the index of the smallest cluster label.
value = -1
index = clusters.argmin()
print("The element is at ", index)
small_num = np.min(clusters)
print("The small number is : ", small_num)
print(np.where(clusters == small_num))
The element is at: 14
The small number is : -1
(array([ 14, 50, 66, 94, 285, 286], dtype=int64),)
We can see from the result that six points carry the label -1, i.e., they were not assigned to any cluster.
Let’s now plot a scatter graph.
from matplotlib import cm
c = cm.get_cmap('magma_r')
fage.plot.scatter(x = "age", y = "fare", c = clusters, cmap = c, colorbar = True)
The methods we have applied so far deal with uni-variate outliers.
For multi-variate outlier detection, we first need to understand what multi-variate outliers are.
For example, take car readings. A car has two relevant meters: the speedometer, which measures the speed at which the vehicle is moving, and the rpm gauge, which records the number of rotations made by the car wheel per minute.
Suppose the speedometer reads in the range 0-60 mph and the rpm in the range 0-750. We assume that all the values should correlate with each other. If the speedometer shows 50 and the rpm shows 0, the readings are inconsistent: if the speedometer shows a value greater than zero, the car is moving, so the rpm should also be higher, but here it shows 0. This pair of readings is a multi-variate outlier.
Mahalanobis Distance Method
In DBSCAN, we used the Euclidean distance metric, but here we discuss the Mahalanobis distance method. We can also use the Mahalanobis distance with DBSCAN.
DBSCAN(eps=0.5, min_samples=3, metric='mahalanobis', metric_params={'V': np.cov(X)}, algorithm='brute', leaf_size=30, n_jobs=-1)
Here X stands for the data matrix. Why is the Euclidean distance unfit when the entities are correlated with each other? With correlated variables, Euclidean distance gives a misleading picture of how close two points really are.
The Mahalanobis method measures the distance between a point and a distribution (the clean data), whereas Euclidean distance is measured between two points. For a single variable, the z-score is x minus the mean divided by the standard deviation; in the Mahalanobis distance, the deviation from the mean is instead scaled by the inverse of the covariance matrix.
Therefore, what effect does dividing by the covariance matrix have? The covariance values will be high if the variables in your dataset are highly correlated.
Similarly, if the covariance values are low because the variables are not correlated, the distance is not significantly reduced. In this way the method accounts for both the scale and the correlation of the variables.
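Before looking at the article's own implementation, here is a minimal sketch of the idea using scipy's built-in mahalanobis function on synthetic, correlated data; the speed and rpm numbers are made up purely for illustration and are not from the article's dataset.

import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical correlated readings (speed and rpm) generated only for illustration
rng = np.random.default_rng(0)
speed = rng.normal(40, 10, 500)
rpm = speed * 12 + rng.normal(0, 30, 500)
data = np.column_stack([speed, rpm])

mean = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data.T))     # inverse covariance matrix of the clean data

point = np.array([50, 0])                   # moving at 50 but showing 0 rpm: inconsistent
print(mahalanobis(point, mean, inv_cov))    # large Mahalanobis distance flags the point
print(np.linalg.norm(point - mean))         # Euclidean distance, which ignores scale and correlation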
Code
df = pd.read_csv('caret.csv').iloc[:, [0, 4, 6]]
df.head()
We define a function distance with the parameters x=None, data=None, and cov=None. Inside the function, we subtract the mean of the data from x; if a covariance matrix is passed in we use it, otherwise we calculate it from the data. T stands for transpose.
For example, if the array size is five or six and you want it to be in two variables, then we need to transpose the matrix.
np.random.multivariate_normal(mean, cov, size = 5)
array([[ 0.0509196, 0.536808 ],
       [ 0.1081547, 0.9308906],
       [ 0.4545248, 1.4000731],
       [ 0.9803848, 0.9660610],
       [ 0.8079491, 0.9687909]])
np.random.multivariate_normal(mean, cov, size = 5).T
array([[ 0.0586423, 0.8538419, 0.2910855, 5.3047358, 0.5449706],
       [ 0.6819089, 0.8020285, 0.7109037, 0.9969768, -0.7155739]])
We use sp.linalg, SciPy's linear algebra module, which provides the inv function for matrix inversion, and NumPy's dot for matrix multiplication.
import scipy as sp
def distance(x=None, data=None, cov=None):
    x_m = x - np.mean(data)
    if cov is None:                       # compute the covariance matrix only if none was passed
        cov = np.cov(data.values.T)
    inv_cov = sp.linalg.inv(cov)
    left = np.dot(x_m, inv_cov)
    m_distance = np.dot(left, x_m.T)
    return m_distance.diagonal()

df_g = df[['carat', 'depth', 'price']].head(50)
df_g['m_distance'] = distance(x=df_g, data=df[['carat', 'depth', 'price']])
df_g.head()
Tukey's Method for Outlier Detection
The Tukey method is also often called the Box and Whisker or box plot method.
The Tukey method uses an upper and a lower range:
Upper range = 75th percentile + k * IQR
Lower range = 25th percentile − k * IQR
Let us see our Titanic data with age variable using a box plot.
sns.boxplot(titanic['age'].values)
We can see in the image that the box plot created by Seaborn shows many dots between the ages of 55 and 80; these are outliers because they fall outside the whiskers. We will compute the lower and upper range with a function called outliers_detect.
def outliers_detect(x, k = 1.5):
    x = np.array(x).copy().astype(float)
    first = np.quantile(x, .25)
    third = np.quantile(x, .75)
    # IQR calculation
    iqr = third - first
    # Upper range and lower range
    lower = first - (k * iqr)
    upper = third + (k * iqr)
    return lower, upper

outliers_detect(titanic['age'], k = 1.5)
(2.5, 54.5)
Detection by PyCaret
We will be using the same dataset for detection by PyCaret.
from pycaret.anomaly import *
setup_anomaly_data = setup(df)
PyCaret is an open-source machine learning library that uses unsupervised learning models to detect outliers. It has a get_data method for loading the datasets bundled with PyCaret and a setup function for the preprocessing task before detection; setup usually takes a data frame but also has many other parameters, such as ignore_features.
Another method, create_model, applies a detection algorithm. We will first use Isolation Forest.
ifor = create_model("iforest")
plot_model(ifor)
ifor_predictions = predict_model(ifor, data = df)
print(ifor_predictions)
ifor_anomaly = ifor_predictions[ifor_predictions["Anomaly"] == 1]
print(ifor_anomaly.head())
print(ifor_anomaly.shape)
Anomaly 1 indicates outliers, and Anomaly 0 shows no outliers.
The yellow color here indicates outliers.
Now let us see another algorithm, K Nearest Neighbors (KNN)
knn = create_model("knn")
plot_model(knn)
knn_pred = predict_model(knn, data = df)
print(knn_pred)
knn_anomaly = knn_pred[knn_pred["Anomaly"] == 1]
knn_anomaly.head()
knn_anomaly.shape
Now we will use a clustering algorithm.
clus = create_model("cluster")
plot_model(clus)
clus_pred = predict_model(clus, data = df)
print(clus_pred)
clus_anomaly = clus_pred[clus_pred["Anomaly"] == 1]
print(clus_anomaly.head())
clus_anomaly.shape
Anomaly Detection by PyOD
PyOD is a Python library for the detection of outliers in multivariate data. It works for both supervised and unsupervised settings.
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
We imported the library and the algorithms.
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

train = 300
test = 100
contaminate = 0.1
X_train, X_test, y_train, y_test = generate_data(n_train=train, n_test=test, n_features=2, contamination=contaminate, random_state=42)

cname_alg = 'KNN'   # the name of the algorithm is K Nearest Neighbors
c = KNN()
c.fit(X_train)      # Fit the algorithm
y_train_pred = c.labels_
y_train_scores = c.decision_scores_
y_test_pred = c.predict(X_test)
y_test_scores = c.decision_function(X_test)
print("Training Data:")
evaluate_print(cname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(cname_alg, y_test, y_test_scores)
visualize(cname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=True)
We will use the IForest algorithm.
fname_alg = 'IForest'   # the name of the algorithm is Isolation Forest
f = IForest()
f.fit(X_train)          # Fit the algorithm
y_train_pred = f.labels_
y_train_scores = f.decision_scores_
y_test_pred = f.predict(X_test)
y_test_scores = f.decision_function(X_test)
print("Training Data:")
evaluate_print(fname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(fname_alg, y_test, y_test_scores)
visualize(fname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=True)
Anomaly Detection by Prophet
from prophet import Prophet
m = Prophet()
data = pd.read_csv('air_pass.csv')
data.head()
data.columns = ['ds', 'y']
data['y'] = np.where(data['y'] != 0, np.log(data['y']), 0)
Taking the log of the y column avoids negative values. We split our data into train and test sets and store the prediction in the variable forecast.
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, random_state=42)
m.fit(train[['ds', 'y']])
forecast = m.predict(test)

def detect(forecast):
    forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    forecast['real'] = data['y']
    forecast['anomaly'] = 0
    forecast.loc[forecast['real'] > forecast['yhat_upper'], 'anomaly'] = 1
    forecast.loc[forecast['real'] < forecast['yhat_lower'], 'anomaly'] = -1
    # importance of each anomaly relative to the width of the uncertainty interval
    forecast['imp'] = 0
    in_range = forecast['yhat_upper'] - forecast['yhat_lower']
    forecast.loc[forecast['anomaly'] == 1, 'imp'] = (forecast['real'] - forecast['yhat_upper']) / in_range
    forecast.loc[forecast['anomaly'] == -1, 'imp'] = (forecast['yhat_lower'] - forecast['real']) / in_range
    return forecast

detect(forecast)
Points below the lower bound are marked as -1 and points above the upper bound as 1.
Conclusion
The process of finding outliers in a given dataset is called anomaly detection. Outliers are data objects that stand out from the rest of the object values in the dataset and don’t behave normally.
Anomaly detection tasks can use distance-based and density-based clustering methods to identify outliers as a cluster.
We discussed various anomaly detection methods and explained them with code on three datasets: Titanic, Air Passengers, and Caret.
Key Points
1. Outliers or anomalies can be detected using the Box-Whisker (Tukey) method or by DBSCAN.
2. The Euclidean distance method is used when the variables are not correlated with each other.
3. The Mahalanobis method is used for multivariate outliers.
4. Not all extreme values or points are outliers. Some are noise, which is garbage that ought to be removed, whereas outliers are valid data points that deserve attention.
5. We used PyCaret for outlier detection with different algorithms; points with an Anomaly value of 1 (shown in yellow) are outliers, and points with 0 are not.
6. We used PyOD, the Python Outlier Detection library, which has more than 40 algorithms and supports both supervised and unsupervised techniques.
7. We used Prophet and defined the function detect to flag the outliers.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Reinforcement Learning Techniques Based On Types Of Interaction
This article was published as a part of the Data Science Blogathon.
Introduction
With the ubiquitous adoption of deep learning, reinforcement learning (RL) has seen a sharp rise in popularity, scaling to problems that were intractable in the past, such as controlling robotic agents and autonomous vehicles, playing complex games from pixel observations, etc.
This article will cover what reinforcement learning is and different types of reinforcement learning paradigms based on the types of interaction.
Now, let’s begin…
Highlights
Reinforcement Learning (RL) is a general framework that enables an agent to discover the best way to maximize a given reward signal through trial and error using feedback from its actions and experiences, i.e., actively interacting with the environment by taking actions and observing the reward.
In online RL, the agent is free to interact with the environment and must gather new experiences with the latest policy before updating.
In off-policy RL, an agent interacts with the environment and appends its new experiences to a replay buffer, which can then be sampled to update the policy. This paradigm allows for the reuse of prior experiences while relying on a steady stream of fresh ones.
In offline RL, a behavior policy is used to collect experiences, which are stored in a static dataset. A new policy is then learned without any further interactions with the environment.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a general framework for adaptive control that enables an agent to learn to maximize a specified reward signal through trial and error using feedback from its actions and experiences, i.e., actively interacting with the environment by taking actions and observing the reward.
Figure 1: Diagram illustrating Reinforcement Learning
In figure 1, we consider that the agent interacts with the environment. Even though the agent and the environment are separately drawn, we can also picture the agent to be somewhere existing inside the environment. Imagine a huge world in which the agent exists somewhere inside that world and interacts with it, e.g., the Super Mario game.
For instance, in the animation shown below, Mario exists inside the game environment and can move and jump in both left and right directions. When Mario interacts with the flower, he gets a positive reward; however, if Mario comes in contact with the monster, he gets penalized (negative reward). So Mario learns by actively interacting with the environment by taking actions and observing the rewards. Furthermore, it learns to maximize positive rewards through trial and error.
Animation 1: Example of Reinforcement Learning
In essence, reinforcement learning provides a mathematical formalism for learning-based control. By using reinforcement learning, we can automatically develop near-optimal behavioral skills, represented by policies, to optimize user-specified reward functions. The reward function determines what an agent should do, and a reinforcement learning algorithm specifies how to do it.
Now that we are familiar with Reinforcement Learning, let’s explore different RL paradigms based on the type of interaction.
Different RL Techniques Based on the Type of Interaction
In this section, we will focus on the following types of Reinforcement Learning techniques based on interaction type:
i) Online/On-policy Reinforcement Learning
ii) Off-policy Reinforcement Learning
iii) Offline Reinforcement Learning
i) Online/On-policy RL: The reinforcement learning process involves gathering experience by interacting with the environment, generally with the latest learned policy, and then using that experience to improve the policy. In online RL, the agent is free to interact with the environment and must gather new experiences with the latest policy before updating.
That is, in the on-policy setting, each policy πk is updated only with streaming data collected by πk itself.
Figure 2: Diagram illustrating Online Reinforcement Learning (Source: Arxiv)
ii) Off-policy RL: In off-policy, the agent is still free to interact with the environment. However, it can update its current policy by leveraging experiences gathered from any previous policies. As a result, the sample efficiency of training increases because the agent doesn’t have to discard all of its prior interactions and can rather maintain a buffer where old interactions can be sampled multiple times.
In Figure 3, shown below, the agent interacts with the environment and appends its new experiences to a data buffer (also called a replay buffer) D. Each new policy πk collects additional data, such that D comprises samples from π0, π1, . . . , πk, and all of this data is used to train an updated new policy πk+1.
Figure 3: Diagram illustrating Off-policy Reinforcement Learning (Source: Arxiv)
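To make the replay-buffer idea concrete, here is a minimal, library-agnostic sketch in Python; the class and method names are illustrative assumptions, not part of any specific RL framework.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        # append the newest experience collected by the current policy
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # transitions gathered by earlier policies can be reused many times
        return random.sample(self.buffer, batch_size)

# Typical usage inside a training loop (policy.update is a hypothetical method):
# buffer.add(s, a, r, s_next, done)
# batch = buffer.sample(64)
# policy.update(batch)      # off-policy update, e.g. Q-learning style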
iii) Offline RL / Batch RL: In offline RL, a behavior policy is used to collect experiences, which are stored in a static dataset. A new policy is then learned without any further interactions with the environment. After learning an offline policy, one can opt to fine-tune it either via online or off-policy RL methods, with the added benefit that the initial policy is likely safer and cheaper to run in the environment than an initial random policy.
In figure 4, offline reinforcement learning employs a dataset D gathered by some (potentially unknown) behavior policy πβ. The dataset is collected once and is not altered during training, which makes it feasible to use large, previously collected datasets. The training process doesn’t interact with the MDP, and the policy is only deployed after being fully trained.
Figure 4: Diagram illustrating Offline Reinforcement Learning (Source: Arxiv)
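As a rough illustration of this setting, the sketch below trains only from a fixed dataset and never calls the environment; the file name, the array layout, and the simple behavior-cloning objective are all assumptions made for the example, not the method described in the article.

import numpy as np

# Hypothetical logs collected once by the behavior policy pi_beta
D = np.load("behavior_policy_rollouts.npz")
states, actions = D["states"], D["actions"]    # shapes: (N, state_dim), (N, action_dim)

# Simplest possible offline objective: behavior cloning, i.e. fit a linear policy
# that imitates the behavior policy. Real offline RL methods add value learning
# plus a penalty for distributional shift between pi_beta and the learned policy.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(state):
    return state @ W                            # deployed only after training finishes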
Caveats in Reinforcement Learning
1. The offline RL paradigm can be incredibly beneficial in settings where online interaction is impractical, either because data collection is expensive (e.g., in healthcare, educational agents, or robotics) or dangerous (e.g., in autonomous driving). Furthermore, even in domains where online interaction is viable, one might still prefer to utilize previously collected data for improved generalization in complex domains.
2. A central challenge is the distributional shift between the behavior policy that collected the data and the policy being learned in offline RL. Moreover, this issue is further exacerbated by the ubiquitous use of high-capacity function approximators. To navigate this, most offline RL algorithms propose different losses or training methods that reduce distributional shift.
3. After learning an offline policy, one can still opt to tune the policy online, with the added benefit that their initial policy is more likely safer and cheaper to interact with the environment than an initial random policy.
ConclusionTo sum up, in this article, we learned the following:
1. Reinforcement Learning (RL) is a general framework for adaptive control that enables an agent to learn to maximize a specified reward signal through trial and error using feedback from its actions and experiences, i.e., actively interacting with the environment by taking actions and observing the reward.
2. In online RL, the agent is free to interact with the environment and must gather new experiences with the latest policy before updating.
3. In off-policy RL, an agent interacts with the environment and appends its new experiences to a replay buffer, which can then be sampled to update the policy. This paradigm allows for the reuse of prior experiences while relying on a steady stream of fresh ones.
4. In offline RL, a behavior policy is used to collect experiences, which are stored in a static dataset. A new policy is then learned without any further interactions with the environment.
5. The offline RL paradigm can be incredibly beneficial in settings where online interaction is impractical due to expensive or dangerous data collection. Also, even in domains where online interaction is viable, one might still prefer to utilize previously collected data for improved generalization in complex domains.
6. After learning an offline policy, one can still opt to tune the policy online, with the added benefit that their initial policy is more likely safer and cheaper to interact with the environment than an initial random policy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Learn The Different Test Techniques In Detail
Introduction to Test techniques
List of Test techniques
There are various techniques available; each has its own strengths and weaknesses. Each technique is good at finding particular types of defects and relatively poor at finding other types. In this section, we are going to discuss the various techniques.
1. Static testing techniques
2. Specification-based test techniques
All specification-based techniques share the common characteristic that they are based on a model of some aspect of the specification, which enables test cases to be derived systematically. There are 4 specification-based sub-techniques, which are as follows:
Equivalence partitioning: It is a specification-based technique in which test cases are designed to execute representatives from equivalence partitions. In principle, cases are designed to cover each partition at least once.
Boundary value analysis: It is a technique in which cases are designed based on boundary values. A boundary value is an input or output value on the edge of an equivalence partition, or at the smallest incremental distance on either side of an edge, for example the minimum and maximum values (a short test sketch illustrating this follows the list below).
Decision table testing: It is a technique in which cases are designed to execute the combination of inputs and causes shown in a decision table.
State transition testing: It is a technique in which cases are designed to execute valid and invalid state transitions.
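As promised above, here is a minimal sketch of equivalence partitioning and boundary value analysis expressed as an automated test. The is_valid_age function and its 18–60 range are hypothetical examples chosen for illustration, not taken from the article.

import pytest

# Hypothetical function under test: accepts ages in the inclusive range 18..60.
def is_valid_age(age: int) -> bool:
    return 18 <= age <= 60

# Equivalence partitioning picks one representative per partition (invalid-low,
# valid, invalid-high); boundary value analysis adds the edges and their neighbours.
@pytest.mark.parametrize("age, expected", [
    (17, False), (18, True),               # lower boundary and the value just below it
    (60, True), (61, False),               # upper boundary and the value just above it
    (5, False), (35, True), (99, False),   # one representative per partition
])
def test_is_valid_age(age, expected):
    assert is_valid_age(age) == expected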
3. Structure-based testing
Test coverage: It is a degree that is expressed as a percentage to which a specified coverage item has been exercised by a test suite.
Statement coverage: It is a percentage of executable statements that the test suite has exercised.
Decision Coverage: It is a percentage of decision outcomes that a test suite has exercised. 100% decision coverage implies both 100% branch coverage and 100% statement coverage.
Branch coverage: It is a percentage of the branches that the test suite has exercised. 100% branch coverage implies both 100% decision coverage and 100% statement coverage.
4. Experience-based testing
The experience-based technique is a procedure to derive and select test cases based on the experience and knowledge of the tester. All experience-based techniques share the common characteristic that they rely on human experience and knowledge, both of the system itself and of likely defects. Cases are derived less systematically but may be more effective. The experience of both technical people and business people is a key factor in an experience-based technique.
Conclusion
The most important thing to understand here is that there is no single best testing technique, as each technique is good at finding one specific class of defect. Using just a single technique will help ensure that defects of that particular class are found, but it may also mean that defects of other classes are missed. So using a variety of techniques will help you ensure that a variety of defects are found and will result in more effective testing. For the same reason, a combination of techniques is most often applied when testing the source code.
Machine Learning Techniques For Text Representation In Nlp
This article was published as a part of the Data Science Blogathon.
Introduction
Natural Language Processing is a branch of artificial intelligence that deals with human language, enabling a system to understand and respond to it. Data, being the most important part of any data science project, should always be represented in a way that helps easy understanding and modeling, especially in NLP. It is said that if we provide very good features to a bad model and bad features to a well-optimized model, the bad model will perform far better than the optimized one. So in this article, we will study how features can be extracted from text data and used in our NLP modeling process, and why feature extraction from text is a bit more difficult than for other types of data.
Table of Contents
Brief Introduction on Text Representation
Why Feature Extraction from text is difficult?
Common Terms you should know
Techniques for Feature Extraction from text data
One-Hot Encoding
Bag of words Technique
N-Grams
TF-IDF
End Notes
Introduction to Text Representation
The first question that arises is: what is feature extraction from text? Feature extraction, also known as text representation or text vectorization, is the process of converting text into numbers. We call it vectorization because the converted text takes the form of vectors.
The second question would be: why do we need feature extraction? Machines can only understand numbers, so to make a machine able to work with language we need to convert it into numeric form.
Why is Feature Extraction from Textual Data Difficult?
If you ask any NLP practitioner or experienced data scientist, the answer will be yes: handling textual data is difficult. Let us first compare text feature extraction with feature extraction for other types of data. In an image dataset, say digit recognition, where you have images of digits and the task is to predict the digit, feature extraction is easy because images are already present in the form of numbers (pixels). With audio, say emotion prediction from speech, the data comes as waveform signals from which features can be extracted over some time interval. But if I have a sentence and want to predict its sentiment, how will you represent it in numbers? Image and speech datasets are the simple cases, but with text data you have to think a little. In this article, we are going to study exactly these techniques.
Common Terms Used
These are common terms that we will use in the further techniques, so I want you to be familiar with these four basic terms.
Corpus(C) ~ The collection of all the words in the whole dataset is known as the corpus. In simple words, concatenating all the text records of the dataset forms the corpus.
Vocabulary(V) ~ The total number of distinct words that form your corpus is known as the vocabulary.
Document(D) ~ There are multiple records in a dataset so a single record or review is referred to as a document.
Word(W) ~ Words that are used in a document are known as Word.
Techniques for Feature Extraction
1 One-Hot Encoding
Now, to perform all the techniques using Python, let us open a Jupyter notebook and create a sample dataframe of some sentences.
import numpy as np
import pandas as pd
# sample sentences are illustrative; any small labeled text corpus works here
sentences = ["the food was good", "the service was good", "the food was bad"]
df = pd.DataFrame({"text": sentences, "output": [1, 1, 0]})
Now we can perform one-hot encoding using sklearn's pre-built class, or implement it ourselves in Python. After encoding, each sentence becomes a 2-D array of its own shape, as shown in the sample image of one sentence below.
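Here is one possible sketch of that encoding with sklearn's OneHotEncoder; splitting on whitespace and lowercasing are simplifying assumptions, and the exact array shape depends on the corpus above.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Build the vocabulary from the whole corpus, then encode each word of one document
# as a one-hot vector; stacking the word vectors gives the 2-D array described above.
vocab = sorted({w for sent in df["text"] for w in sent.lower().split()})
encoder = OneHotEncoder(categories=[vocab], handle_unknown="ignore")

doc = df["text"][0].lower().split()
one_hot = encoder.fit_transform(np.array(doc).reshape(-1, 1)).toarray()
print(one_hot.shape)   # (number of words in the document, vocabulary size) -- mostly zeros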
Disadvantages
1) Sparsity – You can see that a single sentence creates a vector of n*m size, where n is the length of the sentence and m is the number of unique words in the document, and around 80 percent of the values in the vector are zero.
2) No fixed Size – Each document is of a different length which creates vectors of different sizes and cannot feed to the model.
3) Does not capture semantics – The core idea is we have to convert text into numbers by keeping in mind that the actual meaning of a sentence should be observed in numbers that are not seen in one-hot encoding.
2 Bag Of Words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bow = cv.fit_transform(df['text'])
Now, to see the vocabulary and the vectors it has created, you can use the code below, as shown in the results image.
Advantages
1) Simple and intuitive – Only a few lines of code are required to implement the technique.
2) Fixed-size vector – In one-hot encoding we could not feed the data to a machine learning model because each sentence formed a different-sized vector. Here, new words are ignored and only the vocabulary words are counted, so every document gets a vector of fixed size.
Disadvantages
1) Sparsity – When we have a large vocabulary and the document contains only a few of its terms, the result is a sparse array.
2) Ordering is not considered – Ignoring word order makes it difficult to capture the semantics of the document.
3 N-Grams
The technique is similar to Bag of Words. All the techniques we have seen so far build the vocabulary from single words, which limits how much context they capture. The N-Gram technique addresses this by constructing the vocabulary from sequences of multiple words. When we build an N-gram model, we need to specify whether we want bigrams, trigrams, etc. If the requested N is not possible for the corpus, an error is thrown; in our small example, we cannot build a 4-gram or 5-gram model. Let us try bigrams and observe the output.
# Bigram model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2, 2))
bow = cv.fit_transform(df['text'])
You can try trigrams with ngram_range=(3, 3), experiment with other N ranges to get more clarity on the technique, and try transforming a new document to observe how it performs.
Advantages
1) Able to capture some semantic meaning of the sentence – Bigrams or trigrams keep sequences of words together, which makes it easier to capture relationships between words.
2) Intuitive and easy to implement – implementation of N-Gram is straightforward with a little bit of modification in Bag of words.
Disadvantages
1) As we move from unigrams to N-grams, the vocabulary size and therefore the vector dimension increase, so computation and prediction take a little more time.
2) No solution for out-of-vocabulary terms – we have no option other than ignoring new words in a new sentence.
4 TF-IDF (Term Frequency and Inverse Document Frequency)
The technique we will study now does not work in the same way as the techniques above. It assigns a different value (weight) to each word in a document. The core idea is that a word appearing many times in a document but rarely in the corpus is very important for that document, so it receives more weight. The weight is the product of two terms, TF and IDF: to find the weight of any word, we compute its TF and IDF and multiply them.
Term Frequency (TF) – The number of occurrences of a word in a document divided by the total number of terms in the document. For example, the term frequency of "people" in the sentence below is 1/5. It tells us how frequently a particular word occurs in a particular document.
"People read on Analytics Vidhya"
Inverse Document Frequency (IDF) – The log of the total number of documents in the corpus divided by the number of documents containing the term T. If a word occurs in every document, the log evaluates to zero and the word's contribution would be ignored, so sklearn adds one to the result (which is why its TF-IDF values look a bit higher). If a word occurs in only a single document, its IDF is at its highest.
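To make the two factors tangible, here is a small hand-rolled sketch of the weighting on our sample dataframe. It uses sklearn's smoothed IDF formula but, unlike TfidfVectorizer below, it divides TF by the document length and skips the final L2 normalization, so the numbers will differ slightly from the library's.

import numpy as np

docs = [d.lower().split() for d in df["text"]]
N = len(docs)

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)              # how frequent the word is in this document
    df_t = sum(word in d for d in docs)          # how many documents contain the word
    idf = np.log((1 + N) / (1 + df_t)) + 1       # smoothed IDF, as used by sklearn
    return tf * idf

print({w: round(tfidf(w, docs[0]), 3) for w in docs[0]})   # weights for the first document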
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()
So one term keeps track of how frequently the term occurs, while the other keeps track of how rarely it occurs.
End Notes
Connect with me on LinkedIn
Check out my other articles here and on Blogger
Thanks for giving your time!
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
Different Uses Of Evernote Notes
Introduction to Evernote Notes
Evernote Uses
We all have warranty cards and user manuals for the different products we own; we cannot throw them out, yet we don't open them even once in years. Now, we can download their PDF versions and save them to Evernote. After saving them in Evernote, we can search for any word in those warranty cards and user manuals and find it in seconds.
Make Evernote your memorabilia. We can use Evernote's camera to save our memories, whether they are ticket stubs, love letters, theatre programs, or travel brochures, and the list never ends. After saving these memories in Evernote, we can rewatch them any time and get a peek into the past. This will turn your Evernote into a digital scrapbook.
There is an increasing trend of keeping pet animals but we fail to keep the documents related to our pet safe which makes it difficult for us when our pet needs care. Now, using Evernote, we can save all of the pet information. We can save the pet’s adoption papers, pet sitter’s contact, licenses, veterinary care information, vaccination records data in Evernote which would help us to keep every document related to our furry friends in a single place.
Save your kid’s work of art. We can create a virtual refrigerator door that can be shared to our friends and family. We can scan our kid’s works of art and can save them for our lives, moreover, it also saves us from all the clutter. Using Evernote, we can make a scrapbook of our kid’s art and can gift them on their graduation day.
Use Evernote as your spending tracker. We can track our everyday expenses on Evernote which would eventually help in making a personal financial budget. We can use the email forwarding or camera feature of Evernote to send our receipts to Evernote’s notebook. If we buy a large number of goods from a single seller then we can also make a tag for the seller. Now at the end of the year, we can check our spending in our favorite stores.
We all have discount coupons from the different stores of e-commerce platforms. Now, there are different ways to use coupons with Evernote. We can use web clipper to snap the photos of coupons, we can also take screenshots of the coupons or can use email forwarding. Now, we can keep a tag with coupons stored on Evernote and can set a reminder for the expiry date of those coupons.
Gone are the days of writing dear diaries offline. Now people find it more helpful to write journals online and Evernote is one such application that can help us in this. Make Evernote your journal which can be accessed anywhere and anytime.
Make Evernote your day planner or calendar. Now, if we need to keep track of something in terms of date, we can do it through Evernote. Moreover, Evernote has some awesome templates for us to keep an eye on our year, month, week, or day. This will help us in keeping the track of various things even if we are not at home or at the workplace.
Keep your secret recipe collection. Scan and save the old recipes of your family, extract good recipes from the web or snap photos of your favorite recipes from a book. We can also add notes for our help while preparing the dishes. If required, we can share these recipes to our family and friends as well.
Make Evernote your idea box. Save the things which inspire you by grabbing articles, social media posts or images via web clipper. Use the saved articles and images whenever looking out for inspiration or ideas anytime and anyplace. We can use web clippers on our mobile phones as well making it comfortable for the users to clip anything and anywhere.
We can make Evernote our newsletter destination. We can use Evernote’s email address for receiving all of the newsletters directly in Evernote instead of cluttering and spamming our mailboxes.
Conclusion
On the basis of the above article, we understood Evernote. We went through the different uses of Evernote, which help us use it in the most efficient way. This article will help anyone looking for an online platform where they can store important documents, set reminders, monitor their meals and calories, and do many other tasks.
Learn The Different Examples Of Sqlite Function
Introduction to SQLite functions
SQLite provides different kinds of functions to the user. SQLite has several types of built-in functions that we can use and handle whenever we need them. All SQLite functions work on string and numeric data, and function names are case insensitive, which means we can write them in either uppercase or lowercase. By using SQLite functions, we can sort and process data as per the user's requirements. SQLite functions fall into different categories, such as aggregate functions, date functions, string functions, and window functions, and we can use each as required.
SQLite functions
Now let’s see the different functions in SQLite as follows.
1. Aggregate Functions
AVG: It is used to calculate the average value of a non-null column in a group.
COUNT: It is used to return how many rows there are in the table.
MAX: It is used to return the maximum value from a specified column.
MIN: It is used to return the minimum value from a specified column.
SUM: It is used to calculate the sum of the non-null values of a specified column.
GROUP_CONCAT: It is used to concatenate the non-null values of a column into a string.
2. String Functions
SUBSTR: It is used to extract and return the substring from the specified column with predefined length and also its specified position.
TRIM: It is used to return a copy of the string with the specified characters removed from the start and the end.
LTRIM: It is used to return a copy of the string with the specified characters removed from the start of the string.
RTRIM: It is used to return a copy of the string with the specified characters removed from the end of the string.
LENGTH: It is used to return how many characters are in the string.
REPLACE: It is used to return a copy of the string in which every instance of a substring is replaced by another specified string.
UPPER: It is used to return the string in uppercase, that is, to convert all characters to upper case.
LOWER: It is used to return the string in lowercase, that is, to convert all characters to lower case.
INSTR: It is used to return an integer indicating the position of the first occurrence of the substring.
3. Control Flow Functions
COALESCE: It is used to return the first non-null argument.
IFNULL: It is used to return the first argument if it is not null, otherwise the second one; this lets us implement if-else logic around null values.
IIF: By using this, we can add if-else logic directly into queries.
NULLIF: It is used to return null if the first and second arguments are equal, otherwise it returns the first argument. A short Python sketch demonstrating these four functions follows this list.
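As a quick illustration, the sketch below calls these functions through Python's built-in sqlite3 module, which keeps the examples in the same language as the rest of the article; note that IIF requires SQLite 3.32 or newer, so the bundled SQLite version on your system is an assumption here.

import sqlite3

con = sqlite3.connect(":memory:")
print(con.execute("SELECT COALESCE(NULL, NULL, 'first non-null')").fetchone())  # ('first non-null',)
print(con.execute("SELECT IFNULL(NULL, 'fallback')").fetchone())                # ('fallback',)
print(con.execute("SELECT IIF(2 > 1, 'yes', 'no')").fetchone())                 # ('yes',)
print(con.execute("SELECT NULLIF(5, 5), NULLIF(5, 3)").fetchone())              # (None, 5)
con.close()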
4. Data and Time Function
DATE: It is used to compute a date based on one or more date modifiers.
TIME: It is used to compute a time based on one or more time modifiers.
DATETIME: It is used to compute a date and time based on one or more modifiers.
STRFTIME: It returns the date and time formatted according to the specified format string. A short Python sketch of these functions follows this list.
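Here is a small sketch of the date and time functions with a couple of modifiers, again run through Python's sqlite3 module; the sample dates are arbitrary.

import sqlite3

con = sqlite3.connect(":memory:")
print(con.execute("SELECT DATE('2023-01-15', '+1 month')").fetchone())            # ('2023-02-15',)
print(con.execute("SELECT TIME('12:30:00', '+45 minutes')").fetchone())           # ('13:15:00',)
print(con.execute("SELECT DATETIME('2023-01-15 12:30:00', '-1 day')").fetchone()) # ('2023-01-14 12:30:00',)
print(con.execute("SELECT STRFTIME('%Y/%m/%d', '2023-01-15')").fetchone())        # ('2023/01/15',)
con.close()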
5. Math Functions
ABS: It is used to return the absolute value of a number.
RANDOM: It is used to return a pseudo-random integer between -9223372036854775808 and +9223372036854775807.
ROUND: It is used to round a floating-point value to a specified number of digits after the decimal point.
Examples
Now let’s see different examples of SQLite functions as follows.
create table comp_worker(worker_id integer primary key, worker_name text not null, worker_age text, worker_address text, worker_salary text);
Explanation
In the above example, we use the create table statement to create a new table named comp_worker with different attributes such as worker_id, worker_name, worker_age, worker_address, and worker_salary, with the data types shown above.
Now insert some records for the function examples by using the following INSERT INTO statement.
insert into comp_worker(worker_id, worker_name, worker_age, worker_address, worker_salary)
values(1, "Jenny", "23", "Mumbai", "21000.0"),
(2, "Sameer", "31", "Pune", "25000.0"),
(3, "John", "19", "Mumbai", "30000.0"),
(4, "Pooja", "26", "Ranchi", "50000.0"),
(5, "Mark", "29", "Delhi", "45000.0");
Explanation
In the above statement, we use INSERT INTO to add the records. The output of the above statement is illustrated in the following screenshot.
Now we can perform the SQLite different functions as follows.
a. COUNT Function
Suppose users need to know how many rows are present in the table; in that case, we can use the following statement.
select count(*) from comp_worker;
Explanation
In the above example, we use the count function. The output of the above statement is illustrated in the following screenshot.
b. MAX Function
Suppose we need to know the highest salary of a worker; then we can use the following statement.
select max(worker_salary) from comp_worker;
Explanation
In the above example, we use the max function to find the maximum salary of a worker from the comp_worker table. The output of the above statement is illustrated in the following screenshot.
c. MIN Function
select min(worker_salary) from comp_worker;
Explanation
The output of the above statement is illustrated in the following screenshot.
d. AVG Function
Suppose users need to know the average salary of the workers in comp_worker; then we can use the following statement.
select avg(worker_salary) from comp_worker;
Explanation
The output of the above statement is illustrated in the following screenshot.
e. SUM Function
Suppose users need to know the total salary of all workers in comp_worker; then we can use the following statement.
select sum(worker_salary) from comp_worker;
Explanation
The output of the above statement is illustrated in the following screenshot.
f. Random Function
select random() AS Random;
The output of the above statement is illustrated in the following screenshot.
g. Upper Function
Suppose we need to return the worker_name column in upper case; then we can use the following statement.
select upper(worker_name) from comp_worker;
Explanation
The output of the above statement is illustrated in the following screenshot.
h. Length Function
select worker_name, length(worker_name) from comp_worker;
Explanation
The output of the above statement is illustrated in the following screenshot.
Conclusion
We hope that from this article you have understood SQLite functions. We covered the basic syntax of the function statements, saw different examples, and learned how and when to use SQLite functions.
Recommended ArticlesWe hope that this EDUCBA information on “SQLite functions” was beneficial to you. You can view EDUCBA’s recommended articles for more information.
Update the detailed information about Learning Different Techniques Of Anomaly Detection on the Kientrucdochoi.com website. We hope the article's content will meet your needs, and we will regularly update the information to provide you with the fastest and most accurate information. Have a great day!