Multicollinearity: Problem, Detection And Solution


This article was published as a part of the Data Science Blogathon.

Multicollinearity, that is, a high degree of correlation among the explanatory variables in a regression model, causes the following two primary issues:

1. Multicollinearity inflates the variance of the estimated coefficients, so the coefficient estimates corresponding to the interrelated explanatory variables do not give us an accurate picture. They can become very sensitive to small changes in the model.

2. Consequently, the t-ratios for the individual slopes may be affected, leading to insignificant coefficients. It is also possible that the adjusted R-squared for a model is pretty good and even the overall F-test statistic is significant, yet some of the individual coefficients are statistically insignificant. This scenario can indicate the presence of multicollinearity, because multicollinearity affects the coefficients and their corresponding p-values but does not affect the goodness-of-fit statistics or the overall model significance.
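To see both effects concretely, here is a small illustrative sketch on purely synthetic data (not the article's own analysis): two predictors are made almost collinear, and although the overall fit stays strong, the individual coefficients end up with very large standard errors.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly identical to x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# R-squared stays high, but the individual slopes have huge standard
# errors and may come out statistically insignificant.
print(model.rsquared)
print(model.bse)       # standard errors of the coefficient estimates
print(model.pvalues)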

How do we measure Multicollinearity?

A very simple test known as the VIF test is used to assess multicollinearity in our regression model. The variance inflation factor (VIF) identifies the strength of correlation among the predictors.

We may wonder why we need VIFs at all and why we cannot simply use the pairwise correlations.

Since multicollinearity is correlation among the explanatory variables, it seems logical to use the pairwise correlations between all predictors in the model to assess the degree of correlation. However, we may have, say, five predictors where no pairwise correlation is exceptionally high, and yet three predictors together could still explain a very high proportion of the variance in a fourth predictor.

I know this sounds like a multiple regression model itself, and this is exactly what VIFs do. Of course, the original model has a dependent variable (Y), but we don't need to worry about it while calculating multicollinearity. The formula of VIF is VIFj = 1 / (1 - Rj²).

Here Rj² is the R-squared of the model that regresses one individual predictor against all the other predictors. The subscript j indicates the predictor, and each predictor has one VIF. So, more precisely, VIFs use a multiple regression model to calculate the degree of multicollinearity. Suppose we have four predictors: X1, X2, X3, and X4. To calculate the VIFs, each independent variable becomes the dependent variable one by one. Each model produces an R-squared value indicating the percentage of the variance in that individual predictor that the set of other predictors explains.
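To make the idea of regressing each predictor on all the others concrete, here is a minimal sketch; the DataFrame X and the commented-out column names are placeholders rather than the article's actual data.

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_by_hand(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# hypothetical usage on a table of predictors only (no target column)
# X = df[["X1", "X2", "X3", "X4"]]
# print(vif_by_hand(X))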

The name "variance inflation factor" was coined because the VIF tells us the factor by which the correlations among the predictors inflate the variance. For example, a VIF of 10 indicates that the existing multicollinearity is inflating the variance of a coefficient 10 times compared to a model with no multicollinearity. This variance is reflected in the standard error of the coefficient estimate (the standard error is the square root of the variance), which indicates the precision of the estimate. These standard errors are used to calculate the confidence intervals of the coefficient estimates.

Larger standard errors produce wider confidence intervals and hence less precise coefficient estimates. Additionally, wide confidence intervals may sometimes flip the coefficient signs as well.

VIFs do not have any upper limit. The lower the value the better. VIFs between 1 and 5 suggest that the correlation is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity where the coefficient estimates may not be trusted and the statistical significance is questionable. Well, the need to reduce multicollinearity depends on its severity.

How can we fix Multi-Collinearity in our model?

The potential solutions include the following:

1. Simply drop some of the correlated predictors. From a practical point of view, there is no point in keeping two very similar predictors in our model. Hence, VIF is also widely used as a variable selection criterion when we have a lot of predictors to choose from.

2. Apply a linear transformation, e.g., add or subtract two correlated predictors to create a new bespoke predictor.

3. As an extension of the previous two points, another very popular technique is Principal Component Analysis (PCA). PCA is used when we want to reduce the number of variables in our data but are not sure which variable to drop. It is a type of transformation that combines the existing predictors in a way that keeps only the most informative part.

It then creates new variables, known as principal components, that are uncorrelated. So, if we have 10-dimensional data, a PCA transformation will give us 10 principal components and will squeeze the maximum possible information into the first component, then the maximum remaining information into the second component, and so on. The primary limitation of this method is the interpretability of the results, as the original predictors lose their identity, and there is a chance of information loss. At the end of the day, it is a trade-off between accuracy and interpretability.
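A minimal sketch of this idea with scikit-learn, using synthetic data as a stand-in for a block of correlated predictors (the number of predictors and the scaling choice are illustrative, not from the original article):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a block of highly correlated predictors
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + rng.normal(scale=0.3, size=(100, 1)) for _ in range(5)])

# standardize first so that no single predictor dominates the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()                       # one component per original predictor
components = pca.fit_transform(X_scaled)

# the components are mutually uncorrelated, and the first few carry
# most of the information of the original correlated predictors
print(pca.explained_variance_ratio_.round(3))
print(np.corrcoef(components, rowvar=False).round(3))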

How to calculate VIF (R and Python Code):

I am using a subset of the house price data from Kaggle. The dependent/target variable in this dataset is "SalePrice". There are around 80 predictors (both quantitative and qualitative) in the actual dataset. For simplicity, I have selected 10 predictors that I feel, based on intuition, will be suitable predictors for the sale price of the houses. Please note that I did not do any treatment, e.g., creating dummies for the qualitative variables. This example is just for representation purposes.

The following table describes the predictors I chose and their description.

The below code shows how to calculate VIF in R. For this we need to install the ‘car’ package. There are other packages available in R as well.

The output is shown below. As we can see most of the predictors have VIF <= 5

Now if we want to do the same thing in python then please see the code and output below
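A minimal sketch of the usual statsmodels approach is shown below; the file name and the predictor list are placeholders for whichever columns you selected, not the article's exact code.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

house = pd.read_csv("train.csv")                          # Kaggle house price data (assumed file name)
predictors = ["LotArea", "OverallQual", "YearBuilt",
              "TotalBsmtSF", "GrLivArea", "GarageArea"]   # illustrative subset of predictors
X = house[predictors].dropna()

# add an explicit intercept column, because variance_inflation_factor
# does not assume one by itself
X = sm.add_constant(X)

vif = pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)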

Please note that in the Python code I have added a column of intercept/constant to my data set before calculating the VIFs. This is because the variance_inflation_factor function in statsmodels does not include an intercept by default while calculating the VIFs, so without it we may see very different results in the R and Python outputs.



Top 5 Intrusion Detection And Prevention Systems To Consider

A breach or data leak may cause a significant loss of reputation and brand trust. Furthermore, due to the liabilities that come with data storage, companies may face serious fines and penalties.

To that end, intrusion detection and prevention systems — IDPS are critical. In this post, we will look at the most important IDS and IPS capabilities and technologies. But first, let’s define these systems.

What is Intrusion Detection and Prevention System?

An intrusion detection system (IDS) is a system that scans and evaluates both incoming and outgoing packets for fraudulent actions using recognized intrusion patterns. It can be implemented as hardware or software.

IDS scans and tracks apps, services, and capabilities by analyzing malware patterns in system files, scanning algorithms that may indicate risky sequences, tracking endpoints’ actions to find fraudulent intents, and analyzing parameters and variables.

On the other hand, an intrusion prevention system, or intrusion detection and prevention system (IPS/IDPS), is a cybersecurity mechanism that continually observes all the systems for fraudulent activity and takes preventative measures. This mechanism is more sophisticated than an IDS. IPS services, which are mostly autonomous, help eliminate harmful activity before it affects other segments of the network. This increases efficiency while reducing effort.

While IDS service only discovers suspicious activities but does little more than warn an operator, IDPS services warn security teams, filter harmful packets, block source address activity, reconstruct connections, and utilize different security protection services to keep companies away from possible risks. The Intrusion Detection and Prevention System utilizes three different techniques to function properly.

Signature-based: This method avoids potential hazards by analyzing identified viruses and other dangerous sequences already stored in datasets.

Anomaly-based: This technology discovers potentially dangerous operations in the business network to eliminate internal hazards.

Protocol-based: This technology monitors and evaluates business infrastructure according to pre-defined security policies to mitigate potential hazards.


Intrusion Detection and Prevention Systems

Companies may encounter different variations of IDP systems. When deciding which sort to use, consider aspects such as the company's attributes, the aims and intentions for implementing an IDPS, and existing organizational security regulations such as threat prevention.

1- Network-based Intrusion Detection & Prevention System — NIPS:

Our first variant looks for fraudulent activity on whole apps, services, capabilities, and segregations. This is commonly accomplished by examining procedure compliance. If the procedure action fits a list of recognized hazards, the relevant data is denied access.

NIPS are typically used at network edges. This type of IDPS tracks both incoming and outgoing traffic to prevent possible cyber attacks. This type monitors and protects a network’s privacy, authenticity, and reliability. Its primary duties involve safeguarding the network against attacks.

2- Advanced Threat Prevention — ATP:

3- Network Behaviour Analyst — NBA:

While NIPS regulates variations in procedure action, this one observes unexpected operational outcomes to mark risks. The NBA gathers and evaluates business confidential information to locate fraudulent or anomalous activities. These technologies examine data from a variety of inputs and avert possible cyber threats.

This type may aid in the security of systems. It continuously analyzes companies’ apps, services, and online actions and warns them of any unusual actions or anomalies. With this companies can promptly address any possible concerns before they escalate.

4- Wireless Intrusion Detection & Prevention System — WIPS:

This approach analyzes wifi specifications to evaluate wireless systems. This service is installed within the wireless capabilities and in regions where unwanted wireless networking is possible. This sort of IPS merely regulates Wi-Fi features for unwanted admission and disconnects illegitimate endpoints.

5- Host-based Intrusion Detection & Prevention System — HIPS:

Final Words

Also, with the right measure, a company can minimize the costs related to cybersecurity while increasing brand trust. As creating secure connections is of great importance for modern businesses, adopting the right solutions is critical.

Learning Different Techniques Of Anomaly Detection

This article was published as a part of the Data Science Blogathon.

Introduction

As a data scientist, you will come across many cases that call for anomaly detection, such as fraud detection for bank transactions or anomaly detection in smart-meter readings.

Have you ever thought about the bank where you make several transactions and how the bank helps you by identifying fraud?

Someone logs in to your account, and a message is sent to you immediately after the bank notices suspicious activity at a place, asking you to confirm whether it is you or someone else.

What is Anomaly Detection?

Suppose we run a company and notice unexpected errors in the data: even though we supply the same service as before, the sales are declining. These unexpected values are what we term anomalies or outliers.

Let's take an example that will further clarify what an anomaly means.

Source: Canvas

 Here in this example, a bird is an outlier or noise.


If the bank notices unusual behavior in your account, it can block the card. For example, suddenly spending a very large amount in one day, when it is not related to how you were spending previously, will trigger an alert message or a block on your card.

Two AI firms offer anomaly detection for banks: one solution is from Feedzai and another from Ayasdi.

Let's take another example from shopping. At the end of the month, the shopkeeper puts certain items on sale and offers you a scheme where you can buy two at a lower rate.

Now, how does the sale-period data compare with the start-of-month data? If we validate the sale data against the regular monthly sales from the start of the month, it does not look valid: the sale figures are anomalies relative to the usual pattern.

Outliers are often treated as "something to remove from the dataset so that it doesn't skew the model being built," typically because we suspect that the data point in question is flawed and that the model should not need to account for it.

Outliers are most commonly caused by:

Intentional (dummy outliers created to test detection methods)

Data processing errors (data manipulation or data set unintended mutations)

Sampling errors (extracting or mixing data from wrong or various sources)

Natural (not an error, novelties in the data)

An actual data point significantly outside a distribution’s mean or median is an outlier.

An anomaly is a false data point made by a different process than the rest of the data.

If you construct a linear regression model, points far from the regression line are unlikely to have been generated by the model. This probability is also called the likelihood of the data.

Outliers are data points with a low likelihood, according to your model. They are identical from the perspective of modeling.

For instance, you could construct a model that describes a trend in the data and then actively looks for existing or new values with a very low likelihood. When people say “anomalies,” they mean these things. The anomaly detection of one person is the outlier of another!
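As a small illustration of this model-based view, here is a sketch on synthetic data (not from the article): fit a trend, then flag the points whose residuals make them very unlikely under the fitted model.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.5 * x.ravel() + rng.normal(scale=1.0, size=200)
y[[30, 120]] += 15                     # inject two anomalous points

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# points with a very low likelihood under the fitted model are those
# whose residuals sit far out in the tails
z = (residuals - residuals.mean()) / residuals.std()
print(np.where(np.abs(z) > 3)[0])      # indices of the flagged points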

Extreme values in your data series are called outliers. They are questionable but possible: one student can be much more brilliant than the other students in the same class.

However, anomalies are unquestionably errors: for example, a reading of one million degrees outside, or an air temperature that stays exactly the same for two weeks. As a result, you disregard such data.

An outlier is a valid data point, and can’t be ignored or removed, whereas noise is garbage that needs removal. Let’s take another example to understand noise.

Suppose you want to take the average salary of employees, and the data includes the pay of Ratan Tata or Bill Gates; the average salary will then show a large increase, which misrepresents the data.

Outliers can be of the following kinds:

1. Uni-variate – an outlier in a single variable, i.e., an extreme value of one variable in the dataset.

2. Multi-variate – an outlier defined across more than one variable, i.e., an unusual combination of values over several variables.

We will now use various techniques which will help us to find outliers.

Anomaly Detection by Scikit-learn

We will import the required library and read our data.

import seaborn as sns
import pandas as pd

titanic = pd.read_csv('titanic.csv')
titanic.head()



We can see that there are many null values. We will fill the null values with the mode.

titanic['age'].fillna(titanic['age'].mode()[0], inplace=True)
titanic['cabin'].fillna(titanic['cabin'].mode()[0], inplace=True)
titanic['boat'].fillna(titanic['boat'].mode()[0], inplace=True)
titanic['body'].fillna(titanic['body'].mode()[0], inplace=True)
titanic['sex'].fillna(titanic['sex'].mode()[0], inplace=True)
titanic['survived'].fillna(titanic['survived'].mode()[0], inplace=True)
titanic['home.dest'].fillna(titanic['home.dest'].mode()[0], inplace=True)

Let’s see our data in more detail. When we look at our data in statistics, we prefer to know its distribution types, whether binomial or other distributions.

titanic['age'].plot.hist( bins = 50, title = "Histogram of the age" )

This distribution is a Gaussian distribution, often called a normal distribution.

Mean and standard deviation are its two parameters. As the mean changes, the distribution curve shifts to the left or right.

A standard normal distribution has mean μ = 0 and standard deviation σ = 1. Z-tables are readily available for looking up the corresponding probabilities.

Z-Scores

We can calculate the z-score with the formula z = (x - μ) / σ, where x is a random variable, μ is the mean, and σ is the standard deviation.

Why do we need Z-Scores to be calculated?

It helps to know how a single or individual value lies in the entire distribution.

For example, suppose the mean of the maths scores is 82 and the standard deviation σ is 4, and we have a value x of 75. The z-score is (75 - 82) / 4 = -1.75, which shows that the value 75 lies 1.75 standard deviations below the mean. Z-scores help determine whether values are higher, lower, or equal to the mean and how far away they are.

Now, we will calculate Z-Score in python and look at outliers.

We import zscore from SciPy, calculate the z-score, and then filter the data by applying a lambda. It flags the outliers, whose ages range from about 66 to 80.

from scipy.stats import zscore

titanic["age_zscore"] = zscore(titanic["age"])
titanic["outlier"] = titanic["age_zscore"].apply(lambda x: x >= 2.8)
titanic[titanic["outlier"]]

We will now look at another method based on clustering called Density-based spatial clustering of applications with noise (DBSCAN).

DBSCAN

As the name indicates, outlier detection here is based on density clustering. In this method, we calculate the distance between points and group points that lie close together.

Let's continue with our Titanic data and plot a graph between fare and age. In the scatter plot of the age and fare variables, we find three dots far away from the others.

Before we proceed further, we will normalize our data variables.

There are many ways to normalize our data; for example, we can use sklearn's StandardScaler or MinMaxScaler.

titanic['fare'].fillna(titanic['fare'].mean(), inplace=True)

from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
fage = scale.fit_transform(titanic[["age", "fare"]])
fage = pd.DataFrame(fage, columns=["age", "fare"])
fage.plot.scatter(x="age", y="fare")

We used StandardScaler to standardize our data and plotted a scatter graph.

Now we will use DBSCAN to assign points to clusters. Points that cannot be assigned to any cluster are labeled -1.

from sklearn.cluster import DBSCAN

outlier = DBSCAN(eps=0.5, metric="euclidean", min_samples=3, n_jobs=-1)
clusters = outlier.fit_predict(fage)
clusters
# array([0, 1, 1, ..., 1, 1, 1])

Now we have the results, but how do we check which value is the minimum and whether we have any -1 values? We will use argmin to find the index of the smallest value among the cluster labels.

import numpy as np

index = clusters.argmin()
print("The element is at:", index)
small_num = np.min(clusters)
print("The small number is:", small_num)
print(np.where(clusters == small_num))

# The element is at: 14
# The small number is: -1
# (array([ 14, 50, 66, 94, 285, 286], dtype=int64),)

We can see from the result six values which are -1.

Let's now plot a scatter graph.

from matplotlib import cm

c = cm.get_cmap('magma_r')
fage.plot.scatter(x="age", y="fare", c=clusters, cmap=c, colorbar=True)

The above methods we applied are on uni-variate outliers.

For Multi-variates outliers detections, we need to understand the multi-variate outliers.

For example, consider car readings. A car has two meters: a speedometer, which records the speed at which the vehicle is moving, and an rpm gauge, which records the number of rotations per minute.

Suppose the speedometer shows values in the range 0-60 mph and the rpm gauge 0-750, and we assume the two readings should be correlated with each other. If the speedometer shows a speed of 50 while the rpm gauge shows 0, the readings are inconsistent: the car is moving, so the rpm should be well above zero. Such an inconsistent combination of individually plausible values is a multi-variate outlier.

Mahalanobis Distance Method

In DBSCAN, we used euclidean distance metrics, but in this case, we are talking about the Mahalanobis distance method. We can also use Mahalanobis distance with DBSCAN.

DBSCAN(eps=0.5, min_samples=3, metric='mahalanobis', metric_params={'V':np.cov(X)}, algorithm='brute', leaf_size=30, n_jobs=-1)

Why is Euclidean distance unfit when the variables are correlated with each other? Because it ignores the correlation between them and can therefore give a misleading picture of how close two points really are.

The Mahalanobis method measures the distance between a point and a distribution of clean data. A univariate z-score is (x - mean) / standard deviation; the Mahalanobis distance generalizes this to several variables by replacing the division by the standard deviation with multiplication by the inverse of the covariance matrix: D² = (x - μ)ᵀ Σ⁻¹ (x - μ).

Therefore, what effect does dividing by the covariance matrix have? The covariance values will be high if the variables in your dataset are highly correlated.

Similarly, if the data are not correlated, the covariance values are low and the distance is not significantly reduced. In this way, the method addresses both the scale and the correlation of the variables.

Code

df = pd.read_csv('caret.csv').iloc[:, [0, 4, 6]]
df.head()

We define the function distance with the arguments x=None, data=None, and cov=None. Inside the function, we subtract the mean of the data from x; if a covariance matrix is passed in, we use it, otherwise we calculate the covariance matrix from the data. T stands for transpose.

For example, if the array size is five or six and you want it to be in two variables, then we need to transpose the matrix.

np.random.multivariate_normal(mean, cov, size=5)
# array([[ 0.0509196, 0.536808 ],
#        [ 0.1081547, 0.9308906],
#        [ 0.4545248, 1.4000731],
#        [ 0.9803848, 0.9660610],
#        [ 0.8079491, 0.9687909]])

np.random.multivariate_normal(mean, cov, size=5).T
# array([[ 0.0586423, 0.8538419, 0.2910855, 5.3047358, 0.5449706],
#        [ 0.6819089, 0.8020285, 0.7109037, 0.9969768, -0.7155739]])

We use sp.linalg, SciPy's linear algebra module, which provides the inv function for inverting a matrix, and NumPy's dot for matrix multiplication.

import scipy as sp
import scipy.linalg   # make sp.linalg available

def distance(x=None, data=None, cov=None):
    x_m = x - np.mean(data)
    if cov is None:
        cov = np.cov(data.values.T)
    inv_cov = sp.linalg.inv(cov)
    left = np.dot(x_m, inv_cov)
    m_distance = np.dot(left, x_m.T)
    return m_distance.diagonal()

df_g = df[['carat', 'depth', 'price']].head(50)
df_g['m_distance'] = distance(x=df_g, data=df[['carat', 'depth', 'price']])
df_g.head()

Tukey's Method for Outlier Detection

Tukey method is also often called Box and Whisker or Box plot method.

Tukey method utilizes the Upper and lower range.

Upper range = 75th percentile + k * IQR

Lower range = 25th percentile - k * IQR

Let us see our Titanic data with age variable using a box plot.

sns.boxplot(titanic['age'].values)

We can see in the image that the box plot created by Seaborn shows many dots between the ages of about 55 and 80 that lie outside the whiskers, i.e., outliers. We will detect the lower and upper range by writing a function outliers_detect.

def outliers_detect(x, k=1.5):
    x = np.array(x).copy().astype(float)
    first = np.quantile(x, .25)
    third = np.quantile(x, .75)
    # IQR calculation
    iqr = third - first
    # Upper range and lower range
    lower = first - (k * iqr)
    upper = third + (k * iqr)
    return lower, upper

outliers_detect(titanic['age'], k=1.5)
# (2.5, 54.5)

Detection by PyCaret

We will be using the same dataset for detection by PyCaret.

from pycaret.anomaly import *

setup_anomaly_data = setup(df)

PyCaret is an open-source machine learning library whose anomaly module uses unsupervised learning models to detect outliers. It has a get_data method for loading datasets bundled with PyCaret and a setup function for the preprocessing step before detection; setup usually takes a data frame but also has many other parameters, like ignore_features, etc.

Another method, create_model, fits a chosen algorithm. We will first use Isolation Forest.

ifor = create_model("iforest")
plot_model(ifor)

ifor_predictions = predict_model(ifor, data=df)
print(ifor_predictions)

ifor_anomaly = ifor_predictions[ifor_predictions["Anomaly"] == 1]
print(ifor_anomaly.head())
print(ifor_anomaly.shape)

Anomaly 1 indicates outliers, and Anomaly 0 shows no outliers.

The yellow color here indicates outliers.

Now let us see another algorithm, K Nearest Neighbors (KNN)

knn = create_model("knn")
plot_model(knn)

knn_pred = predict_model(knn, data=df)
print(knn_pred)

knn_anomaly = knn_pred[knn_pred["Anomaly"] == 1]
knn_anomaly.head()
knn_anomaly.shape

Now we will use a clustering algorithm.

clus = create_model("cluster")
plot_model(clus)

clus_pred = predict_model(clus, data=df)
print(clus_pred)

clus_anomaly = clus_pred[clus_pred["Anomaly"] == 1]
print(clus_anomaly.head())
clus_anomaly.shape

Anomaly Detection by PyOD

PyOD is a python library for the detection of outliers in multivariate data. It is good both for supervised and unsupervised learning.

from pyod.models.iforest import IForest
from pyod.models.knn import KNN

We imported the library and algorithm.

from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

train = 300
test = 100
contaminate = 0.1
X_train, X_test, y_train, y_test = generate_data(n_train=train, n_test=test, n_features=2,
                                                 contamination=contaminate, random_state=42)

cname_alg = 'KNN'   # the name of the algorithm is K Nearest Neighbors
c = KNN()
c.fit(X_train)      # fit the algorithm
y_train_pred = c.labels_
y_train_scores = c.decision_scores_
y_test_pred = c.predict(X_test)
y_test_scores = c.decision_function(X_test)

print("Training Data:")
evaluate_print(cname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(cname_alg, y_test, y_test_scores)

visualize(cname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred,
          show_figure=True, save_figure=True)

We will use the IForest algorithm.

fname_alg = 'IForest'   # the name of the algorithm is Isolation Forest
f = IForest()
f.fit(X_train)          # fit the algorithm
y_train_pred = f.labels_
y_train_scores = f.decision_scores_
y_test_pred = f.predict(X_test)
y_test_scores = f.decision_function(X_test)

print("Training Data:")
evaluate_print(fname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(fname_alg, y_test, y_test_scores)

visualize(fname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred,
          show_figure=True, save_figure=True)

Anomaly Detection by Prophet

import prophet
from prophet import forecaster
from prophet import Prophet

m = Prophet()
data = pd.read_csv('air_pass.csv')
data.head()
data.columns = ['ds', 'y']
data['y'] = np.where(data['y'] != 0, np.log(data['y']), 0)

We take the log of the y column (guarding the zero values) to keep the series on a stable scale. We split our data into train and test sets and store the predictions in the variable forecast.

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, random_state=42)
m.fit(train[['ds', 'y']])
forecast = m.predict(test)

def detect(forecast):
    forcast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    forcast['real'] = data['y']
    forcast['anomaly'] = 0
    forcast.loc[forcast['real'] > forcast['yhat_upper'], 'anomaly'] = 1
    forcast.loc[forcast['real'] < forcast['yhat_lower'], 'anomaly'] = -1
    forcast['imp'] = 0
    in_range = forcast['yhat_upper'] - forcast['yhat_lower']
    forcast.loc[forcast['anomaly'] == 1, 'imp'] = (forcast['real'] - forcast['yhat_upper']) / in_range
    forcast.loc[forcast['anomaly'] == -1, 'imp'] = (forcast['yhat_lower'] - forcast['real']) / in_range
    return forcast

detect(forecast)

Points below the lower prediction bound are flagged as -1 (and points above the upper bound as 1).

Conclusion

The process of finding outliers in a given dataset is called anomaly detection. Outliers are data objects that stand out from the rest of the object values in the dataset and don’t behave normally.

Anomaly detection tasks can use distance-based and density-based clustering methods to identify outliers as a cluster.

We discussed various methods of anomaly detection and explained them using code on three datasets: Titanic, Air Passengers, and Caret.

Key Points

1. Outliers or anomalies can be detected using the Box-and-Whisker (Tukey) method or by DBSCAN.

2. The Euclidean distance method is used when the variables are not correlated.

3. The Mahalanobis method is used for multivariate outliers.

4. Not all extreme values or points are outliers. Some are noise, which is garbage that ought to be removed, whereas outliers are valid data points that need careful treatment rather than deletion.

5. We used PyCaret for outlier detection with different algorithms, where points with Anomaly equal to 1 (shown in yellow) are outliers and points with Anomaly equal to 0 are not.

6. We used PyOD, the Python Outlier Detection library, which has more than 40 algorithms and is used with both supervised and unsupervised techniques.

7. We used Prophet and defined the function detect to flag the outliers.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Web Therapy For Problem Preschoolers

Web Therapy for Problem Preschoolers: Advice for parents from a long-distance observer

Jonathan Comer, director of CARD’s early childhood intervention program (left), and Jami Furr, clinical director of its child program, lead clinical trials for families of children with preschool oppositional defiant disorder. Photos by Vernon Doucette

“He was running the show,” says Renee, who asked BU Today not to use her full name. “All day was spent trying to get him to do those two things he really needed to do.”

When her son turned three, a pediatrician recommended a neuropsychological evaluation, which yielded a diagnosis of attention deficit hyperactivity disorder (ADHD) and sensory processing disorder, with both conditions exacerbated by his temperament.

Renee was then handed a flyer for Boston University’s Center for Anxiety & Related Disorders (CARD), where Jonathan Comer, director of the early childhood intervention program, and Jami Furr, clinical director of the child program, are conducting trials on preschool oppositional defiant disorder. These behavioral therapy trials, supported by more than $1 million from the National Institutes of Health and the Charles H. Hood Foundation, address a condition that afflicts about 5 percent of preschool-aged children, and they help alleviate symptoms for up to 75 percent of participants.

Children with preschool ODD, says Furr, a postdoctoral associate in the College of Arts & Sciences psychology department, often also have ADHD. They refuse to comply with requests, like to annoy others, and may hit, bite, or throw things, and in more severe cases, they are physically cruel to animals. For obvious reasons, they are often kicked out of preschool. It’s no surprise that parents of such children “are often pretty defeated by the time they get to us,” she says.

Therapists typically use parent-child interaction therapy (PCIT) to treat preschool ODD in a clinical setting, where the therapist observes behind a one-way mirror and coaches parents, via an earpiece, as they play with their child. But such therapy isn’t an option for families who live far from clinics. Children in those families are particularly likely to be prescribed medicine with unfortunate side effects.

“What we’re seeing is skyrocketing rates of antipsychotic medications to treat aggression in young kids, in preschoolers even,” says Comer, a CAS research assistant professor of psychology, adding that such medications are associated with concerning metabolic, circulatory, and endocrine effects in young children.

As a postdoc at Columbia University, Comer had been part of an innovative team that provided PCIT to military families at Fort Drum in Jefferson County, N.Y., near the Canadian border. Distance (more than five hours by car) and the lack of appropriately trained local clinicians made traditional therapy impossible, so doctors turned to technology. They gave each family a web camera, scheduled meeting times, and conducted telehealth sessions online.

Watch this video on YouTube

Back at BU, Comer and Furr—a husband and wife team with a toddler themselves—are halfway through five-year trials that compare the two options: clinical therapy versus telehealth, and telehealth versus delayed treatment. More than 20 families have signed up already, Comer says, and “it looks like the internet-delivered options are equally credible.” It’s possible, he says, that telehealth could be a lifeline for families of defiant preschoolers nationwide.

Before being admitted to the trials, families undergo a thorough assessment to determine whether their child has preschool ODD. Renee remembers when she and her three-year-old arrived at CARD and were put in a small observation room. “It was the first time ever that I was hoping he’d act out,” she says. She didn’t have to worry. Complying with the therapist’s request, she asked her son to take blocks and build a tower. Instead, he flipped the table, emptied nearby shelves, and started hitting her.

“You’re in!” the therapist told them.

Once families are “in,” they receive a technology packet with a webcam, a room microphone, a Bluetooth device, and a mobile hotspot, if necessary. Therapists schedule weekly one-hour sessions with the family, preferably when both parents are available, and everyone signs in through an encrypted site called WebEx.

Once they master those skills, families establish house rules (“No hitting, kicking, or spitting”) and start practicing a detailed timeout sequence when their child misbehaves. “This is really a learning process not only for the parents, but also for the children,” Furr says. “What we’re really emphasizing is consistency, predictability, and follow-through.”

Amy, another parent in the trial, says her son would never sit through a timeout. But after beginning therapy, he started putting himself in timeout for violating a house rule, like hitting his younger sister. “He wasn’t able to control the action, but he knew it wasn’t right,” she says.

The last stage is practicing timeout sessions outside the home, a realm most families avoid for fear of explosive and embarrassing tantrums. Therapists use walkie-talkies to coach parents as they practice their newfound skills in the middle of a restaurant, store, or park.

Most families master these skills within six months. And the earlier families start therapy, the better. “When left untreated,” Comer says, “these behaviors can snowball into more serious conduct problems and then antisocial problems later on in adulthood.”

Amy and her husband completed therapy last year, and she says they learned valuable lessons about parenting and have strengthened their relationship with their son. They still struggle at times, she says, but “we’re able to connect on a different level and understand each other.”

Meanwhile, Renee’s family is seven weeks into therapy, and they are already seeing results. Her son now asks for special playtime and even helps with simple tasks like putting away toys, which used to be a daylong battle.

“It saved our life,” Renee says. “As a family, we were being thrown apart. It’s been a savior for us to deal with him, understand him, and know that this kid has potential, and we want to help him achieve it.”

For more information about CARD’s telehealth program, contact the center at 617-353-9610. Evaluations are free. Qualifying families also receive free therapy.


Kaggle Solution: What’s Cooking ? (Text Mining Competition)

Introduction

Tutorial on Text Mining, XGBoost and Ensemble Modeling in R

I came across the What's Cooking competition on Kaggle last week. At first, I was intrigued by its name. I checked it out and realized that the competition was about to finish. My bad! It was a text mining competition. The competition was live for 103 days and ended on 20th December 2015.

Still, I decided to test my skills. I downloaded the data set, built a model, and managed to get a score of 0.79817 in the end. Even though my submission wasn't accepted after the competition was over, I could still check my score. This got me into the top 20 percentile.

I used Text Mining, XGBoost and Ensemble Modeling to get this score. And, I used R. It took me less than 6 hours to achieve this milestone. I teamed up with Rohit Hinduja, who is currently interning at Analytics Vidhya.

To help beginners in R, here is my solution in a tutorial format. In the article below, I've adopted a step-by-step methodology to explain the solution. This tutorial requires prior knowledge of R and machine learning.

I am confident that this tutorial can improve your R coding skills and approaches.

Let’s get started.

Before you start…

Here's a quick approach for beginners to put up a tough fight in any Kaggle competition:

What’s Cooking ?

Yeah! I could smell that it was a text mining competition. The data set had a list of id, ingredients, and cuisine. There were 20 types of cuisine in the data set. The participants were asked to predict a cuisine based on the available ingredients.

The ingredients were available in the form of a text list. That’s where text mining was used. Before reaching to the modeling stage, I cleaned the text using pre-processing methods. And, finally with available set of variables, I used an ensemble of XGBoost Models.

Note: My system configuration is core i5 processor, 8GB RAM and 1TB Hard Disk. 

Solution

Below is my solution to this competition:

Step 1. Hypothesis Generation

Though many people don't believe in this, this step does wonders when done intuitively. Hypothesis generation can help you to think 'out of the data'. It also helps you understand the data and the relationships between the variables. It should ideally be done after you've looked at the problem statement (but not the data).

Before exploring data, you must think smartly on the problem statement. What could be the features which can influence your outcome variable? Think on these terms and write down your findings. I did the same. Below is my list of findings which I thought could help me in determining a cuisine:

Taste: Different cuisines are cooked to taste different. If you know the taste of the food, you can estimate the type of cuisine.

Smell: With smell also, we can determine a cuisine type

Serving Type: We can identify the cuisine by looking at the way it is being served. What are the dips it is served with?

Hot or Cold: Some cuisines are served hot while some cold.

Group of ingredients and spices: Precisely, after one has tasted, we can figure out the cuisine by the mix of ingredients used. For example, you are unlikely to find pasta as an ingredient in any Indian cooking.

Liquid Served: Some cuisines are represented by the type of drinks served with food.

Location: The location of eating can be a factor in determining cuisine.

Duration of cooking: Some cuisines tend to have longer cooking cycles. Others might have more fast food style cooking.

Order of pouring ingredients: At times, the same set of ingredients are poured in a different order in different cuisines.

Percentage of ingredients which are staple crops / animals in the country of the cuisine:  A lot of cooking historically has been developed based on the availability of the ingredients in the country. A high percentage here could be a good indicator.

Step 2. Download and Understand the Data Set

The data set shows a list of id, cuisine, and ingredients. The data set is available in JSON format. The dependent variable is cuisine. The independent variable is ingredients. The train data set is used for creating the model, and the test data set is used for checking the accuracy of the model. If you are still confused between the two, remember that the test data set does not have the dependent variable.

Since the data is available in text format, I was determined to quickly build a corpus of ingredients (next step). Here is a snapshot of data set for your perusal in json format:

Step 3. Basics of Text Mining

For this solution, I’ve used R (precisely R Studio 0.99.484) in Windows environment.

Text Mining / Natural Language Processing helps computers to understand text and derive useful information from it. Several brands use this technique to analyse customer sentiments on social media. It consists of pre-defined set of commands used to clean the data. Since, text mining is mainly used to verify sentiments, the incoming data can be loosely structured, multilingual, textual or might have poor spellings.

Some of the commonly used techniques in text mining are:

Bag of Words: This technique creates a 'bag' or group of words by counting the number of times each word appears and uses these counts as independent variables.

Deal with Punctuation : This can be tricky at times. Your tool(R or Python) would read ‘data mining’ & ‘data-mining’ as two different words. But they are same. Hence, we should remove the punctuation elements also.

Remove Stopwords : Stopwords are nothing but the words which add no value to text. They don’t describe any sentiment. Examples are ‘i’,’me’,’myself’,’they’,’them’ and many more. Hence, we should remove such words too. In addition to stopwords, you may find other words which are repeated  but add no value. Remove them as well.

Stemming or Lemmatization: This means bringing a word back to its root. It is generally used for words which are similar but differ only by tense. For example, 'play', 'playing', and 'played' can be stemmed to the single word 'play', since all three connote the same action.

I’ve used these techniques in my solution too.

Step 4. Importing and Combining Data Set

Since the data set is in json format, I require different set of libraries to perform this step. jsonlite offers an easy way to import data in R. This is how I’ve done:

1. Import Train and Test Data Set

setwd('D:/Kaggle/Cooking')
install.packages('jsonlite')
library(jsonlite)
train <- fromJSON("train.json")
test <- fromJSON("test.json")

2. Combine both train and test data set. This will make our text cleaning process less painful. If I do not combine, I’ll have to clean train and test data set separately. And, this would take a lot of time.

But I need to add the dependent variable to the test data set first. The data can then be combined using the rbind (row-bind) function.

#add dependent variable
test$cuisine <- NA

#combine data set
combi <- rbind(train, test)

Step 5. Pre-Processing using tm package (Text Mining)

As explained above, here are the steps used to clean the list of ingredients. I’ve used tm package for text mining.

1. Create a Corpus of Ingredients (Text)

#install package
library(tm)

#create corpus
corpus <- Corpus(VectorSource(combi$ingredients))

2.  Convert text to lowercase

corpus <- tm_map(corpus, tolower)
corpus[[1]]

3. Remove Punctuation

corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]

4.  Remove Stopwords

corpus <- tm_map(corpus, removeWords, c(stopwords('english')))
corpus[[1]]

5. Remove Whitespaces

corpus <- tm_map(corpus, stripWhitespace)
corpus[[1]]

6. Perform Stemming

corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

7. After we are done with pre-processing, it is necessary to convert the text into a plain text document. This helps in processing the documents as text documents.

corpus <- tm_map(corpus, PlainTextDocument)

8. For further processing, we'll create a document-term matrix where the text is categorized in columns.

#document matrix
frequencies <- DocumentTermMatrix(corpus)
frequencies

Step 6. Data Exploration

1. Computing frequency column wise to get the ingredient with highest frequency

#organizing frequency of terms
freq <- colSums(as.matrix(frequencies))
length(freq)
ord <- order(freq)
ord

#if you wish to export the matrix (to see how it looks) to an excel file
m <- as.matrix(frequencies)
dim(m)
write.csv(m, file = 'matrix.csv')

#check most and least frequent words
freq[head(ord)]
freq[tail(ord)]

#check our table of 20 frequencies
head(table(freq), 20)
tail(table(freq), 20)

We see that there are many terms (ingredients) which occur only once, twice, or thrice. Such ingredients won't add any value to the model. However, we need to be careful about removing these ingredients, as it might cause a loss of information. Hence, I'll remove only the terms having a frequency of less than 3.

#remove sparse terms
sparse <- removeSparseTerms(frequencies, 1 - 3/nrow(frequencies))
dim(sparse)

2. Let’s visualize the data now. But first, we’ll create a data frame.

#create a data frame for visualization
wf <- data.frame(word = names(freq), freq = freq)
head(wf)

#plot terms which appear atleast 10,000 times
library(ggplot2)
chart <- ggplot(subset(wf, freq > 10000), aes(x = word, y = freq))
chart <- chart + geom_bar(stat = 'identity', color = 'black', fill = 'white')
chart <- chart + theme(axis.text.x = element_text(angle = 45, hjust = 1))
chart

Here we see that salt, oil, pepper are among the highest occurring ingredients. You can change the freq values (in graph above) to visualize the frequency of ingredients.

3. We can also find the level of correlation between two ingredients. For example, if you have any ingredient in mind which can be highly correlated with others, we can find it. Here I am checking the correlation of salt and oil with other variables. I’ve assigned the correlation limit as 0.30. It means, I’ll only get the value which have correlation higher than 0.30.

#find associated terms
findAssocs(frequencies, c('salt', 'oil'), corlimit = 0.30)

4. We can also create a word cloud to check the most frequent terms. It is easy to build and gives an enhanced understanding of ingredients in this data. For this, I’ve used the package ‘wordcloud’.

#create wordcloud
library(wordcloud)
set.seed(142)

#plot word cloud
wordcloud(names(freq), freq, max.words = 2500, scale = c(6, .1), colors = brewer.pal(4, "BuPu"))

#plot 5000 most used words
wordcloud(names(freq), freq, max.words = 5000, scale = c(6, .1), colors = brewer.pal(6, 'Dark2'))

5. Now I’ll make final structural changes in the data.

#create sparse as data frame
newsparse <- as.data.frame(as.matrix(sparse))
dim(newsparse)

#check if all words are appropriate
colnames(newsparse) <- make.names(colnames(newsparse))

#check for the dominant dependent variable
table(train$cuisine)

Here I find that, ‘italian’ is the most popular in all the cuisine available. Using this information, I’ve added the dependent variable ‘cuisine’ in the data frame newsparse as ‘italian’.

#add cuisine
newsparse$cuisine <- as.factor(c(train$cuisine, rep('italian', nrow(test))))

#split data
mytrain <- newsparse[1:nrow(train),]
mytest <- newsparse[-(1:nrow(train)),]

Step 7. Model Building

As my first attempt, I couldn't think of any algorithm better than naive bayes. Since I have a multi-class categorical variable, I expected naive bayes to do wonders. But, to my surprise, the naive bayes model kept running in perpetuity and never finished. Perhaps my machine specifications aren't powerful enough.

Next, I tried boosting. Thankfully, the model computed without any trouble. Boosting is a technique which converts weak learners into strong learners. In simple terms, I built three XGBoost models. All three were weak, meaning their individual accuracy wasn't good. I then combined (ensembled) the predictions of the three models to produce a strong model. To know more about boosting, you can refer to this introduction.

The reason I used boosting is because, it works great on sparse matrices. Since, I’ve a sparse matrix here, I expected it to give good results. Sparse Matrix is a matrix which has large number of zeroes in it. It’s opposite is dense matrix. In a dense matrix, we have very few zeroes. XGBoost, precisely, deliver exceptional results on sparse matrices.

I did parameter tuning on XGBoost model to ensure that every model behaves in a different way. To read more on XGBoost, here’s a comprehensive documentation: XGBoost

Below is my complete code. I’ve used the packages xgboost and matrix. The package ‘matrix’ is used to create sparse matrix quickly.

library(xgboost)
library(Matrix)

Now, I’ve created a sparse matrix using xgb.DMatrix of train data set. I’ve kept the set of independent variables and removed the dependent variable.

# creating the matrix for training the model ctrain <- xgb.DMatrix(Matrix(data.matrix(mytrain[,!colnames(mytrain) %in% c('cuisine')])), label = as.numeric(mytrain$cuisine)-1)

I've created a sparse matrix for the test data set too. This is done to create a watchlist. A watchlist is a list of the sparse forms of the train and test data sets. It is passed as a parameter to the xgboost model to report the train and test error as the model runs.

dtest <- xgb.DMatrix(Matrix(data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')])))
watchlist <- list(train = ctrain, test = dtest)

To understand the modeling part, I suggest you read this document. I've built just 3 models with different parameters. You can even create 40-50 models for ensembling. In the code below, I've used 'objective = multi:softmax', because this is a case of multi-class classification.

Among other parameters, eta, min_child_weight, max.depth, and gamma directly control the model complexity. These parameters prevent the model from overfitting. The model will be more conservative if these values are chosen larger.

#train multiclass model using softmax

#first model
xgbmodel <- xgboost(data = ctrain, max.depth = 25, eta = 0.3, nround = 200,
                    objective = "multi:softmax", num_class = 20,
                    verbose = 1, watchlist = watchlist)

#second model
xgbmodel2 <- xgboost(data = ctrain, max.depth = 20, eta = 0.2, nrounds = 250,
                     objective = "multi:softmax", num_class = 20,
                     watchlist = watchlist)

#third model
xgbmodel3 <- xgboost(data = ctrain, max.depth = 25, gamma = 2, min_child_weight = 2,
                     eta = 0.1, nround = 250, objective = "multi:softmax",
                     num_class = 20, verbose = 2, watchlist = watchlist)

#predict 1
xgbmodel.predict <- predict(xgbmodel, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict.text <- levels(mytrain$cuisine)[xgbmodel.predict + 1]

#predict 2
xgbmodel.predict2 <- predict(xgbmodel2, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict2.text <- levels(mytrain$cuisine)[xgbmodel.predict2 + 1]

#predict 3
xgbmodel.predict3 <- predict(xgbmodel3, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict3.text <- levels(mytrain$cuisine)[xgbmodel.predict3 + 1]

#data.table is used below to build the submission tables
library(data.table)

#data frame for predict 1
submit_match1 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict.text))
colnames(submit_match1) <- c('id', 'cuisine')
submit_match1 <- data.table(submit_match1, key = 'id')

#data frame for predict 2
submit_match2 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict2.text))
colnames(submit_match2) <- c('id', 'cuisine')
submit_match2 <- data.table(submit_match2, key = 'id')

#data frame for predict 3
submit_match3 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict3.text))
colnames(submit_match3) <- c('id', 'cuisine')
submit_match3 <- data.table(submit_match3, key = 'id')

Now I have three weak learners. You can check their accuracy using:

sum(diag(table(mytest$cuisine, xgbmodel.predict))) / nrow(mytest)
sum(diag(table(mytest$cuisine, xgbmodel.predict2))) / nrow(mytest)
sum(diag(table(mytest$cuisine, xgbmodel.predict3))) / nrow(mytest)

The simple key is ensembling. Now I have three data frames for the predictions: predict, predict2, and predict3. I've extracted the 'cuisine' columns from predict and predict2 into the data frame for predict3. With this step, I get all the predicted 'cuisine' values in one data frame, so I can easily ensemble the predictions.

#ensembling
submit_match3$cuisine2 <- submit_match2$cuisine
submit_match3$cuisine1 <- submit_match1$cuisine

I’ve used the MODE function to extract the predicted value with highest frequency per id.

#function to find the maximum value row wise
Mode <- function(x) {
  u <- unique(x)
  u[which.max(tabulate(match(x, u)))]
}

x <- Mode(submit_match3[, c("cuisine", "cuisine2", "cuisine1")])
y <- apply(submit_match3, 1, Mode)
final_submit <- data.frame(id = submit_match3$id, cuisine = y)

#view submission file
data.table(final_submit)

#final submission
write.csv(final_submit, 'ensemble.csv', row.names = FALSE)

After following the steps mentioned above, you can easily get the same score as mine (0.798). As you will have seen, I haven't used any brainy method to improve this model; I just applied the basics. Since I've just started, I would like to see if I can now push this score further towards the top.

End Notes

With this, I finish this tutorial for now! There are many things in this data set which you can try at your end. Due to time constraints, I couldn't spend much time on it during the competition. But it's time you put on your thinking boots. I failed at naive bayes, so why don't you create an ensemble of naive bayes models? Or maybe create a cluster of ingredients and build a model over it?

I’m sure this strategy might give you a better score. Perhaps, more knowledge. In this tutorial, I’ve built a predictive model on What’s Cooking ? data set hosted by Kaggle. I took a step wise approach to cover various stages of model building. I used text mining and ensemble of 3 XGBoost models. XGBoost in itself is a deep topic. I plan to cover it deeply in my forthcoming articles. I’d suggest you to practice and learn.

Did you find the article useful? Share with us if you have done similar kind of analysis before. Do let us know your thoughts about this article in the box below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.


Apple Acquires Emotient For Ai Emotion Detection


Word of this acquisition comes from CNBC, who suggests that Apple has acquired Emotient, but that no pricing or terms have been disclosed. It is unclear what CNBC’s source is at this time, but based on the changes that’ve gone down on Emotient’s public web presence recently, it looks like something is really, truly going down.

Emotient began work several years ago with a facial expression recognition system that was originally made at the Machine Perception Lab at UCSD.

Emotient worked (and we assume still works) with emotion awareness with what they suggest (via an archived version of their website) with the following markets:

• Media/Entertainment – gain a deeper understanding of whether and how consumers will emotionally connect to your existing and planned programming.• Advertising – measure the attention, engagement and consumer sentiment your creative options provide before you launch.• Retail – uncover insights into customer satisfaction that can improve traffic and sales.• Online Experience – discover how your ecommerce or application experience connects with affects customers.• Healthcare – understand quantitatively how patients feel about their treatment or doctors view new equipment or pharmaceuticals.• Legal – Get beyond the transcription to a deeper discovery of depositions and testimony• Academic Research – use detailed facial expression data including facial expression action units (AUs) and evidence data to overlay emotional measurement onto your studies.

Emotient describes itself as a company that’s a leader in both emotion detection and sentiment analysis. They suggest that this is “part of a neuromarketing wave that is driving a quantum leap in customer understanding.”

“Our services quantify emotional response,” says Emotient, “leading to insights and actions that improves your products and how you market them.”

According to archived internet data, the company took down much of their website several months ago, leaving only the most basic of descriptions and no contact information for the public. Emotient works in software that’s able to detect emotions and outward indicators of intent in a wide variety of ways.

Below you'll see some of the "Features And Benefits" Emotient listed for their software earlier last year (2015):

• High Accuracy: Behavioral Experts Processing Large Datasets = Highest Quality Output• Real-time: Automatic Processing Across All Available Platforms Without Manual Intervention• Robust: Range of Facial Types, Ethnicities, Lighting Conditions, Occlusions, and Backgrounds• Sensitive: Detects Natural, Low Intensity Expressions• Microexpressions: Frame-by-Frame Analysis to Detect Rapid, Subtle Emotions• Multiple Faces: Valuable for Group Applications, such as Digital Signage and Kiosks in Retail Environments• Cross Platform Support: Native Windows and Linux Applications, with a Small Memory Footprint

Emotient works with what they call Facial Action Units, 28 of which can be seen in the chart below.

As late as February of 2015, Emotient listed the following on their "Customer Success" site: "Our technology is used currently to automate focus group evaluations and to test product preference by one of the world's largest consumer packaged goods companies."

They also suggest that they’ve partnered with a company by the name of iMotions Inc, aiming to bring their tech to “customers in the usability, gaming, market research and scientific research markets.”

Sound like something Apple might use with Siri? Or perhaps something new entirely?

Apple wants to know what you’re thinking when you’re using the iPhone – or maybe the MacBook – that’s for certain.
