Granger Causality In Time Series – Explained Using Chicken And Egg Problem
This article was published as a part of the Data Science Blogathon
Introduction
Time series analysis provides insight into the pattern or characteristics of time series data. Time series data can be decomposed into three components:
Trend – This shows the tendency of the data over a long period of time, it may be upward, downward, or stable.
Seasonality – It is the variation that occurs in a periodic manner and repeats each year.
Noise or Random – Fluctuations in the data that are erratic.
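To make the decomposition concrete, here is a minimal, hedged sketch using statsmodels' seasonal_decompose. The synthetic monthly series and the period of 12 below are illustrative assumptions, not data from this article.

# Minimal sketch of classical decomposition with statsmodels (illustrative data).
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118] * 4,
    index=pd.date_range("2000-01-01", periods=48, freq="MS"),
)
# 'period' is the argument name in recent statsmodels; older versions call it 'freq'
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())   # long-run tendency
print(result.seasonal.head(12))       # repeating yearly pattern
print(result.resid.dropna().head())   # leftover random fluctuations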
Forecasting the data for a particular time in the future is known as Time Series Forecasting. It is a powerful technique widely used in fields such as finance, weather forecasting, healthcare, environmental studies, business, and retail to support strategic decisions.
In practice, most datasets contain multiple variables, and the independent variables may depend on one another; these relationships affect the predictions or forecasts. People often get misled in such cases and build multiple linear regression models, where a high R-squared value misleads further while the predictions remain poor.
Spurious Regression
Linear regression might indicate a strong relationship between two or more variables, but these variables may be totally unrelated in reality. Predictions fail once domain knowledge is applied; this scenario is known as spurious regression.
The graph below shows a strong relationship between chicken consumption and crude oil exports even though the two are unrelated.
A strong trend (non-stationarity) and a high R-squared are typical signs of spurious regression. Spurious regressions have to be eliminated while building the model since the variables are unrelated and have no causal relationship.
Multiple linear regression uses a correlation matrix to check the dependency between the independent variables: if the correlation coefficient between two variables is high, one is retained and the other is discarded. When time is a confounding factor in the dataset, this approach fails, because the correlation coefficient used to eliminate a variable is not time-aware; it only measures the contemporaneous correlation between the two variables.
Consider the above time-series graph: variable X has a direct influence on variable Y, but with a lag of 5 between X and Y, in which case we cannot use the correlation matrix. For example, an increase in coronavirus-positive cases in a city is followed by an increase in the number of people getting hospitalized. For better forecasting, we would like to know whether there is a causal relationship.
Granger Causality comes to the Rescue
It is essentially an econometric hypothesis test for verifying whether one variable is useful in forecasting another in multivariate time series data at a particular lag.
A prerequisite for performing the Granger Causality test is that the data need to be stationary i.e it should have a constant mean, constant variance, and no seasonal component. Transform the non-stationary data to stationary data by differencing it, either first-order or second-order differencing. Do not proceed with the Granger causality test if the data is not stationary after second-order differencing.
Let us consider three variables Xt, Yt, and Wt present in the time series data.
Case 1: Forecast Xt+1 based on past values Xt .
Case 2: Forecast Xt+1 based on past values Xt and Yt.
Case 3: Forecast Xt+1 based on past values Xt, Yt, and Wt, where variable Yt has a direct dependency on variable Wt.
Here Case 1 is univariate time series also known as the autoregressive model in which there is a single variable and forecasting is done based on the same variable lagged by say order p.
Xt = α + γ1 Xt−1 + γ2 Xt−2 + ⋯ + γp Xt−p
where p parameters (degrees of freedom) are to be estimated.
In Case 2 the past values of Y contain information for forecasting Xt+1. Yt is said to “Granger cause” Xt+1 provided Yt occurs before Xt+1 and it contains data for forecasting Xt+1.
Equation using a predictor Yt (UNRESTRICTED MODEL, UM)
Xt = α + γ1 Xt−1 + γ2 Xt−2 + ⋯ + γp Xt−p + α1 Yt−1 + ⋯ + αp Yt−p
Here 2p parameters (degrees of freedom) are to be estimated.
If Yt causes Xt, then Y must precede X which implies:
Lagged values of Y should be significantly related to X.
Lagged values of X should not be significantly related to Y.
Case 3 can not be used to find Granger causality since variable Yt is influenced by variable Wt.
Hypothesis test
Null Hypothesis (H0): Yt does not “Granger cause” Xt+1, i.e., α1 = α2 = ⋯ = αp = 0
Alternate Hypothesis(HA): Yt does “Granger cause” Xt+1, i.e., at least one of the lags of Y is significant.
Calculate the F-statistic
Fp, n−2p−1 = (Estimate of Explained Variance) / (Estimate of Unexplained Variance)
Fp, n−2p−1 = ((SSE_RM − SSE_UM) / p) / (SSE_UM / (n − 2p − 1))
SSE is the Sum of Squared Errors; RM and UM denote the restricted and unrestricted models respectively.
Try different lags (p). The optimal lag can be determined using AIC.
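As a hedged illustration of the F-test above (a sketch, not the exact routine used later in the article), the snippet below fits the restricted and unrestricted OLS regressions for a chosen lag p and computes the F-statistic. The simulated arrays x and y are assumptions standing in for stationary series.

# Hedged sketch: restricted vs. unrestricted regressions behind the Granger F-test.
import numpy as np
import statsmodels.api as sm
from scipy import stats

def granger_f_test(x, y, p):
    n = len(x)
    target = x[p:]
    # lagged design matrices: column i holds the series lagged by i+1
    x_lags = np.column_stack([x[p - i - 1:n - i - 1] for i in range(p)])
    y_lags = np.column_stack([y[p - i - 1:n - i - 1] for i in range(p)])
    rm = sm.OLS(target, sm.add_constant(x_lags)).fit()                       # restricted model
    um = sm.OLS(target, sm.add_constant(np.hstack([x_lags, y_lags]))).fit()  # unrestricted model
    f_stat = ((rm.ssr - um.ssr) / p) / (um.ssr / um.df_resid)
    p_value = stats.f.sf(f_stat, p, um.df_resid)
    return f_stat, p_value

rng = np.random.default_rng(0)
y = rng.normal(size=200)
x = 0.5 * np.roll(y, 2) + rng.normal(scale=0.1, size=200)  # x driven by y with a lag of 2
print(granger_f_test(x[2:], y[2:], p=4))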
Limitation
Granger causality does not provide any insight into the relationship between the variables; hence it is not true causality, unlike ’cause and effect’ analysis.
Granger causality fails to forecast when there is an interdependency between two or more variables (as stated in Case 3).
Granger causality test can’t be performed on non-stationary data.
Resolving the Chicken and Egg problem
Let us apply Granger causality to check whether the egg or the chicken came first.
Importing libraries

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

Loading Dataset
The data is from the U.S. Department of Agriculture. It consists of two time series from 1930 to 1983: U.S. egg production and the estimated U.S. chicken population.
df = pd.read_csv('chickegg.csv')

Exploring the Dataset

df.head()
df.dtypes
df.shape
(53, 3)
df.describe()

Check whether the data is stationary; if it is not, make it stationary before proceeding.
# Draw plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.Year, y=df.chicken, title='Population of the chicken across US')
plot_df(df, x=df.Year, y=df.egg, title='Egg Production')

By visual inspection, neither the chicken nor the egg series is stationary. Let us confirm this by running the Augmented Dickey-Fuller test (ADF test).
ADF test
The ADF test is a popular statistical test for checking whether a time series is stationary, based on the unit root test. The number of unit roots present in the series indicates the number of differencing operations required to make it stationary.
Consider the hypothesis test where:
Null Hypothesis (H0): Series has a unit root and is non-stationary.
Alternative Hypothesis (HA): Series has no unit root and is stationary.
from statsmodels.tsa.stattools import adfuller

result = adfuller(df['chicken'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
# p-value above 0.05: fail to reject the unit-root null hypothesis
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")
result = adfuller(df['egg'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

The p-values of both the egg and chicken series are greater than the significance level (0.05), so we fail to reject the null hypothesis: the series are not stationary.
Data Transformation
The Granger causality test is carried out only on stationary data, hence we need to transform the data by differencing to make it stationary. Let us perform first-order differencing on the chicken and egg data.
df_transformed = df.diff().dropna()
df = df.iloc[1:]
print(df.shape)
df_transformed.shape
df_transformed.head()
plot_df(df_transformed, x=df.Year, y=df_transformed.chicken, title='Population of the chicken across US')
plot_df(df_transformed, x=df.Year, y=df_transformed.egg, title='Egg Production')

Repeat the ADF test on the differenced data to check for stationarity.
result = adfuller(df_transformed['chicken'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

result = adfuller(df_transformed['egg'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

The transformed chicken and egg series are stationary, hence there is no need for second-order differencing.
Test the Granger Causality
There are several ways to find the optimal lag, but for simplicity let us consider a lag of 4 for now (a hedged sketch of AIC-based lag selection follows below).
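If you would rather not fix the lag by hand, one possible approach (an assumption on my part, not part of the original walkthrough) is to let statsmodels' VAR model pick the order that minimises the AIC:

# Hedged sketch: picking the Granger-test lag via AIC with a VAR model.
from statsmodels.tsa.api import VAR

var_model = VAR(df_transformed[['chicken', 'egg']])
order_selection = var_model.select_order(maxlags=8)   # try lags 1..8
print(order_selection.summary())
print('Lag chosen by AIC:', order_selection.aic)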
Do eggs granger cause chickens?
Null Hypothesis (H0) : eggs do not granger cause chicken.
Alternative Hypothesis (HA) : eggs granger cause chicken.
from statsmodels.tsa.stattools import grangercausalitytests

grangercausalitytests(df_transformed[['chicken', 'egg']], maxlag=4)

The p-value is very low, so the null hypothesis is rejected: eggs Granger-cause chickens.
That implies eggs came first.
Now repeat the Granger causality test in the opposite direction.
Do chickens granger cause eggs at lag 4?
Null Hypothesis (H0) : chicken does not granger cause eggs.
Alternative Hypothesis (HA) : chicken granger causes eggs.
grangercausalitytests(df_transformed[['egg', 'chicken']], maxlag=4)

The p-value is considerably high, thus chickens do not Granger-cause eggs.
The above analysis concludes that the egg came first and not the chicken.
Once the analysis is done the next step is to begin forecasting using time series forecasting models.
EndNote
Thank you for reading!
About the Author
Hello, I am Pallavi Padav from Mangalore, holding a PG degree in Data Science from INSOFE. I am passionate about participating in Data Science hackathons, blogathons and workshops.
Would love to catch you on Linkedin. Mail me here for any queries.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Stock Market Price Trend Prediction Using Time Series Forecasting
This article was published as a part of the Data Science Blogathon.
Introduction
Time series forecasting is used to predict future values based on previously observed values, and it is one of the best tools for trend analysis and future prediction.
What is time-series data?
Time-series data is recorded at regular time intervals, and the order of these data points is important. Therefore, any predictive model based on time series data will have time as an independent variable. The output of a model is the predicted value or classification at a specific time.
Time series analysis vs time series forecasting
Let's clear up some possible confusion between time series analysis and forecasting: time series forecasting is a form of predictive modeling, whereas time series analysis is a form of descriptive modeling.
For a new investor, general research on the stock or share market is not enough to make a decision. Following the common market trend is highly risky, so most people cannot make investment decisions based on it alone. Understanding the seasonal variance and the steady flow of an index helps both existing and new investors decide whether to invest in the share market.
To solve this kind of problem time series forecasting is the best technique.
Stock market
Stock markets are where individual and institutional investors come together to buy and sell shares in a public venue. Nowadays these exchanges exist as electronic marketplaces.
Supply and demand help determine the price for each security, or the levels at which stock market participants (investors and traders) are willing to buy or sell.
The concept behind how the stock market works is pretty simple. Operating much like an auction house, the stock market enables buyers and sellers to negotiate prices and make trades.
Definition of ‘Stock’
A stock or share (also known as a company’s “equity”) is a financial instrument that represents ownership in a company.
Machine learning in the stock market
The stock market is very unpredictable: any geopolitical change can impact the trend of stocks, and we have recently seen how COVID-19 affected stock prices. This is why reliable trend analysis on financial data is very difficult. The most efficient way to approach this kind of problem is with the help of machine learning and deep learning.
In this tutorial, we will be solving this problem with ARIMA Model.
To learn about seasonality, please refer to my previous blog, and to get a basic understanding of ARIMA I would recommend going through this blog; it will help you better understand how time series analysis works.
Implementing stock price forecasting
I will be using the nsepy library to extract the historical data for SBIN.
Imports and Reading Data
Python Code:
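The code block that originally appeared here did not survive extraction. Below is a hedged reconstruction of what the imports and data pull might look like; the use of nsepy.get_history, the date range, and the extra imports needed by the later snippets are my assumptions, not the author's exact code.

# Hedged sketch: pulling SBIN history with nsepy (date range assumed for illustration).
from datetime import date
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nsepy import get_history
from pmdarima import auto_arima
# older statsmodels API (< 0.13), matching the fit(disp=-1)/forecast() calls used below
from statsmodels.tsa.arima_model import ARIMA

sbin = get_history(symbol='SBIN', start=date(2000, 1, 1), end=date(2020, 11, 1))
print(sbin.head())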
The data shows the stock price of SBIN from 2023-1-1 to 2023-11-1. The goal is to create a model that forecasts the closing price of the stock. Let us create a visualization showing the per-day closing price of the stock:
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Dates')
plt.ylabel('Close Prices')
plt.plot(sbin['Close'])
plt.title('SBIN closing price')
plt.show()

plt.figure(figsize=(10,6))
df_close = sbin['Close']
df_close.plot(style='k.')
plt.title('Scatter plot of closing price')
plt.show()

plt.figure(figsize=(10,6))
df_close = sbin['Close']
df_close.plot(style='k.', kind='hist')
plt.title('Histogram of closing price')
plt.show()

First, we need to check whether the series is stationary, because time series analysis only works with stationary data.
Testing For Stationarity:
To identify the nature of the data, we will be using the null hypothesis.
H0: The null hypothesis: It is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.
H1: The alternative hypothesis: It is a claim about the population that is contradictory to H0 and what we conclude when we reject H0.
#Ho: It is non-stationary
#H1: It is stationary
If we fail to reject the null hypothesis, we can say that the series is non-stationary, i.e., it may contain a unit root and needs differencing before modeling.
If both mean and standard deviation are flat lines(constant mean and constant variance), the series becomes stationary.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    # Determine rolling statistics
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    # Plot rolling statistics
    plt.plot(timeseries, color='yellow', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)

    print("Results of Dickey-Fuller test")
    adft = adfuller(timeseries, autolag='AIC')
    # adfuller returns a plain tuple, so we label the values manually
    output = pd.Series(adft[0:4], index=['Test Statistics', 'p-value', 'No. of lags used', 'Number of observations used'])
    for key, values in adft[4].items():
        output['critical value (%s)' % key] = values
    print(output)

test_stationarity(sbin['Close'])

After analysing the above graph, we can see the increasing mean and standard deviation, hence our series is not stationary.

Results of Dickey-Fuller test
Test Statistics                  -1.914523
p-value                           0.325260
No. of lags used                  3.000000
Number of observations used    5183.000000
critical value (1%)              -3.431612
critical value (5%)              -2.862098
critical value (10%)             -2.567067
dtype: float64

We see that the p-value is greater than 0.05, so we cannot reject the null hypothesis. Also, the test statistic is greater than the critical values, so the data is non-stationary.
For time series analysis we separate Trend and Seasonality from the time series.
from statsmodels.tsa.seasonal import seasonal_decompose
from pylab import rcParams

rcParams['figure.figsize'] = 10, 6
result = seasonal_decompose(df_close, model='multiplicative', freq=30)
fig = result.plot()
fig.set_size_inches(16, 9)

df_log = np.log(sbin['Close'])
moving_avg = df_log.rolling(12).mean()
std_dev = df_log.rolling(12).std()
plt.title('Moving Average')
plt.plot(std_dev, color="black", label="Standard Deviation")
plt.plot(moving_avg, color="red", label="Mean")
plt.legend()
plt.show()

Now we are going to create an ARIMA model and train it on the closing price from the training data. So let us split the data into training and test sets and visualize it.
train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Dates')
plt.ylabel('Closing Prices')
plt.plot(df_log, 'green', label='Train data')
plt.plot(test_data, 'blue', label='Test data')
plt.legend()

model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0,
                             test='adf',         # use adftest to find optimal 'd'
                             max_p=3, max_q=3,   # maximum p and q
                             m=1,                # frequency of series
                             d=None,             # let model determine 'd'
                             seasonal=False,     # no seasonality
                             start_P=0, D=0,
                             trace=True,
                             error_action='ignore',
                             suppress_warnings=True,
                             stepwise=True)
print(model_autoARIMA.summary())

Performing stepwise search to minimize aic
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=-16607.561, Time=2.19 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=-16607.961, Time=0.95 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=-16608.035, Time=2.27 sec
ARIMA(0,1,0)(0,0,0)[0]           : AIC=-16609.560, Time=0.39 sec
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=-16606.477, Time=2.77 sec
Best model: ARIMA(0,1,0)(0,0,0)[0]
Total fit time: 9.079 seconds

SARIMAX Results (abridged)
Dep. Variable: y                 No. Observations: 4665
Model: SARIMAX(0, 1, 0)          Log Likelihood: 8305.780
AIC: -16609.560   BIC: -16603.113   HQIC: -16607.293
Covariance Type: opg
sigma2: 0.0017 (std err 1.06e-06, z = 1566.660, p = 0.000)
Ljung-Box (Q): 24.41, Prob(Q): 0.98
Jarque-Bera (JB): 859838819.58, Prob(JB): 0.00
Heteroskedasticity (H): 7.16, Prob(H) (two-sided): 0.00
Skew: -37.54, Kurtosis: 2105.12

model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

model = ARIMA(train_data, order=(3, 1, 2))
fitted = model.fit(disp=-1)
print(fitted.summary())

ARIMA Model Results (abridged)
Dep. Variable: D.Close           No. Observations: 4664
Model: ARIMA(3, 1, 2)            Log Likelihood: 8309.178
Method: css-mle                  S.D. of innovations: 0.041
AIC: -16604.355   BIC: -16559.222   HQIC: -16588.481

Coefficients (coef, std err, z, p-value, 95% interval):
const            8.761e-06   0.001    0.015   0.988   [-0.001,  0.001]
ar.L1.D.Close    1.3689      0.251    5.460   0.000   [ 0.877,  1.860]
ar.L2.D.Close   -0.7118      0.277   -2.567   0.010   [-1.255, -0.168]
ar.L3.D.Close    0.0094      0.021    0.445   0.657   [-0.032,  0.051]
ma.L1.D.Close   -1.3468      0.250   -5.382   0.000   [-1.837, -0.856]
ma.L2.D.Close    0.6738      0.282    2.391   0.017   [ 0.122,  1.226]

Roots (real, imaginary, modulus, frequency):
AR.1   0.9772  -0.6979j   1.2008   -0.0987
AR.2   0.9772  +0.6979j   1.2008    0.0987
AR.3  74.0622  -0.0000j  74.0622   -0.0000
MA.1   0.9994  -0.6966j   1.2183   -0.0969
MA.2   0.9994  +0.6966j   1.2183    0.0969

# Forecast
fc, se, conf = fitted.forecast(519, alpha=0.05)  # 95% confidence
fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)
plt.figure(figsize=(12,5), dpi=100)
plt.plot(train_data, label='training')
plt.plot(test_data, color='blue', label='Actual Stock Price')
plt.plot(fc_series, color='orange', label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.10)
plt.title('SBIN Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Actual Stock Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()
Time series forecasting is really useful when we have to make future decisions or carry out analysis. ARIMA lets us do this quickly; there are plenty of other models for time series forecasting, but ARIMA is really easy to understand.
I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.
HAPPY CODING.
Prabhat Pathak (Linkedin profile) is a Senior Analyst and innovation Enthusiast.
How Does Google Know My Location Using Vpn? (Explained)
Privacy and security while browsing the internet are growing concerns for most of us. Why?
Fortunately, VPN services are an effective solution. They hide your real IP address so that the websites you visit won’t know where you’re located. They also encrypt your traffic so that your ISP and employer can’t log your browsing history.
But they don’t seem to fool Google. Many people report that Google seems to know users’ real locations even when using a VPN.
For example, Google sites show the language of the user’s original country, and Google Maps initially displays a location close to where the user lives.
Google hasn’t published how they determine your location, so I can’t give you a definitive answer.
But here are three methods they are likely to use.
1. You’re Logged Into Your Google Account
If you’re logged into your Google account, Google knows who you are, or at least who you told them you are. At some point, you may have given them some information about which part of the world you live in.
Perhaps you told Google Maps your home and work locations. Even navigating using Google Maps lets the company know where you are.
If you’re an Android user, Google probably knows where you are. Your phone’s GPS sends that information to them. It may continue to let them know even after you turn GPS tracking off.
The IDs of the cell phone towers you connect to can give away your location. Some Android features are location-specific and may provide clues to your whereabouts.
2. The Wireless Networks You’re Near Give Away Your Location
It’s possible to work out your location by triangulating from the wireless networks you’re closest to. Google has a massive database of where many network names are. Your computer or device’s Wi-Fi card provides a list of every network you’re close to.
Those databases were built in part by Google Street View cars. They collected Wi-Fi data as they drove around taking photos—something they found themselves in trouble for in 2010 and again in 2023.
They also use this information combined with your phone’s GPS to verify your location when using Google Maps.
3. They May Ask Your Web Browser to Reveal Your Local IP address
Your web browser knows your local IP address. It’s possible to store that information in a cookie accessible by Google’s websites and services.
If you have Java installed on your computer, a webmaster just needs to insert a single line of code into their website to read your real IP address without asking your permission.
So What Should You Do?
Realize that a VPN will fool most people most of the time, but probably not Google. You could go to a lot of trouble to try to fake them out, but I don’t think it’s worth the effort.
You’d have to sign out of your Google account and change the name of your home network. Then, you’ll need to convince your neighbors to change theirs, too.
If you have an Android phone, you’ll need to install a GPS spoofing app that gives Google a false location. After that, you need to surf using your browser’s private mode so that no cookies are saved.
Even then, you’ll probably miss something. You could spend a few hours Googling the topic for more clues, and then Google would be aware of your searches.
Personally, I accept that Google knows a great deal about me, and in return, I receive quite a lot of value from their services.
Multicollinearity: Problem, Detection And Solution
This article was published as a part of the Data Science Blogathon.
Multicollinearity causes the following 2 primary issues –
1. Multicollinearity generates high variance of the estimated coefficients and hence, the coefficient estimates corresponding to those interrelated explanatory variables will not be accurate in giving us the actual picture. They can become very sensitive to small changes in the model.
2. Consequently, the t-ratios for each of the individual slopes might get impacted, leading to insignificant coefficients. It is also possible that the adjusted R-squared for a model is pretty good and even the overall F-test statistic is significant, yet some of the individual coefficients are statistically insignificant. This scenario can be a possible indication of the presence of multicollinearity, as multicollinearity affects the coefficients and corresponding p-values, but it does not affect the goodness-of-fit statistics or the overall model significance.
How do we measure Multicollinearity?
A very simple test known as the VIF test is used to assess multicollinearity in our regression model. The variance inflation factor (VIF) identifies the strength of correlation among the predictors.
Now we may think about why we need to use ‘VIF’s and why we are simply not using the Pairwise Correlations.
Since multicollinearity is the correlation amongst the explanatory variables it seems quite logical to use the pairwise correlation between all predictors in the model to assess the degree of correlation. However, we may observe a scenario when we have five predictors and the pairwise correlations between each pair are not exceptionally high and it is still possible that three predictors together could explain a very high proportion of the variance in the fourth predictor.
I know this sounds like a multiple regression model itself, and this is exactly what VIFs do. Of course, the original model has a dependent variable (Y), but we don't need to worry about it while calculating multicollinearity. The formula of VIF is
VIF_j = 1 / (1 − R_j²)
Here R_j² is the R-squared of the model of one individual predictor regressed against all the other predictors. The subscript j indicates the predictor, and each predictor has one VIF. So more precisely, VIFs use a multiple regression model to calculate the degree of multicollinearity. Suppose we have four predictors – X1, X2, X3, and X4. To calculate VIF, each independent variable becomes the dependent variable one by one. Each model produces an R-squared value indicating the percentage of the variance in that individual predictor that the set of other predictors explains.
The name “variance inflation factor” was coined because VIF tells us the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 10 indicates that the existing multicollinearity is inflating the variance of the coefficients 10 times compared to a no multicollinearity model. The variances that we are talking about here are the standard errors of the coefficient estimates which indicates the precision of these estimates. These standard errors are used to calculate the confidence interval of the coefficient estimates.
Larger standard errors produce wider confidence intervals, leading to less precise coefficient estimates. Additionally, wide confidence intervals may sometimes flip the coefficient signs as well.
VIFs do not have any upper limit. The lower the value the better. VIFs between 1 and 5 suggest that the correlation is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity where the coefficient estimates may not be trusted and the statistical significance is questionable. Well, the need to reduce multicollinearity depends on its severity.
How can we fix Multicollinearity in our model?
The potential solutions include the following:
1. Simply drop some of the correlated predictors. From a practical point of view, there is no point in keeping 2 very similar predictors in our model. Hence, VIF is widely used as variable selection criteria as well when we have a lot of predictors to choose from.
3. Do some linear transformation e.g., add/subtract 2 predictors to create a new bespoke predictor.
4. As an extension of the previous 2 points, another very popular technique is to perform Principal components analysis (PCA). PCA is used when we want to reduce the number of variables in our data but we are not sure which variable to drop. It is a type of transformation where it combines the existing predictors in a way only to keep the most informative part.
It then creates new variables known as Principal components that are uncorrelated. So, if we have 10-dimensional data then a PCA transformation will give us 10 principal components and will squeeze maximum possible information in the first component and then the maximum remaining information in the second component and so on. The primary limitation of this method is the interpretability of the results as the original predictors lose their identity and there is a chance of information loss. At the end of the day, it is a trade-off between accuracy and interpretability.
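As a hedged illustration of this idea (a sketch under assumed synthetic data, not the article's own code), here is how one might replace correlated predictors with uncorrelated principal components using scikit-learn:

# Hedged sketch: replacing correlated predictors with principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 highly correlated predictors
X_scaled = StandardScaler().fit_transform(X)             # PCA is sensitive to scale
pca = PCA(n_components=0.95)                              # keep 95% of the variance
components = pca.fit_transform(X_scaled)
print(components.shape)                                   # far fewer, uncorrelated columns
print(pca.explained_variance_ratio_)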
How to calculate VIF (R and Python Code):
I am using a subset of the house price data from Kaggle. The dependent/target variable in this dataset is “SalePrice”. There are around 80 predictors (both quantitative and qualitative) in the actual dataset. For simplicity's sake, I have selected 10 predictors based on my intuition that I feel will be suitable predictors for the sale price of the houses. Please note that I did not do any treatment, e.g., creating dummies for the qualitative variables. This example is just for representation purposes.
The following table describes the predictors I chose and their description.
The below code shows how to calculate VIF in R. For this we need to install the ‘car’ package. There are other packages available in R as well.
The output is shown below. As we can see most of the predictors have VIF <= 5
Now if we want to do the same thing in python then please see the code and output below
Please note that in the python code I have added a column of intercept/constant to my data set before calculating the VIFs. This is because the variance_inflation_factor function in python does not assume the intercept by default while calculating the VIFs. Hence, often we may come across very different results in R and Python output. For details, please see this discussion here.
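The Python snippet referenced above was lost in extraction, so here is a hedged sketch of how the calculation might look, with the intercept added explicitly as described; the DataFrame name predictors is a placeholder assumption for the 10 chosen columns.

# Hedged sketch: computing VIFs in Python (predictors = DataFrame of the 10 chosen columns).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(predictors)   # add the intercept column explicitly
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif)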
Difference Between Series And Vectors In Python Pandas
Pandas is a well-known open-source Python library that provides a wide range of capabilities to make data analysis more effective. The Pandas package is mostly utilised for pre-processing data activities, including cleaning, transforming, and manipulating data. As a result, it is a highly useful tool for analysts and data scientists. The two most popular data structures in Pandas—Series, and DataFrame—as well as the comparison of Series and vectors, are discussed in this article.
Python Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type. Labels must be a hashable type but do not need to be unique. The object has a variety of methods for working with the index and supports integer- and label-based indexing.
It has the following parameters −
data − Any list, dictionary, or scalar value can be used as data.
index − The index values should be unique, hashable, and the same length as the data. If no index is provided, np.arange(n) will be used by default.
dtype − It refers to the Series' data type.
copy − It is used to copy the input data.
Creating a Series
We can create a Series in four ways −
Using the pd.Series function from the Pandas library

import pandas as pd
import numpy as np

# Create a series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

This will create a Pandas Series with the values 1, 3, 5, NaN, 6, 8.
Creating a Series directly from a NumPy array

import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([1, 3, 5, np.nan, 6, 8])
# Create a series from the array
s = pd.Series(data)
print(s)

Output
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Both of these methods will create a Pandas Series with an index that is a range of integers starting from 0. You can also specify your own index values when creating the Series.
Creating a Series From Scalar Values
The last approach we'll examine is making a Series from a scalar value. In this case, you provide a single value as the data, and it is repeated for the length of the index.
Example

import pandas as pd

if __name__ == '__main__':
    series = pd.Series(data=3., index=['a', 'b', 'c', 'd'], name='series_from_scalar')
    print(series)

Output
a    3.0
b    3.0
c    3.0
d    3.0
Name: series_from_scalar, dtype: float64

Creating a Series From ndarray
NumPy's random.randint() function, which creates an ndarray filled with random numbers, is one of the easiest ways to create a Series.
Example

import numpy as np
import pandas as pd

if __name__ == '__main__':
    data = np.random.randint(0, 10, 5)
    series = pd.Series(data=data, index=['a', 'b', 'c', 'd', 'e'], name='series_from_ndarray')
    print(series)

Output
a    5
b    7
c    0
d    8
e    5
Name: series_from_ndarray, dtype: int64

Dataframes
On the other hand, a vector is a one-dimensional array of numerical values. In Pandas, a vector can be represented as a Series with a single dtype (e.g., integer, float, or object). Vectors are commonly used in mathematical and statistical operations, and can be created using the pd.to_numeric() function or by selecting a single column from a data frame.
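A tiny hedged example of the two ways just mentioned for obtaining such a vector; the column names here are illustrative assumptions, not data from the article.

# Hedged sketch: obtaining a numeric "vector" from a DataFrame (illustrative column names).
import pandas as pd

df = pd.DataFrame({'item': ['a', 'b', 'c'], 'price': ['10', '20', '30']})
vector1 = df['price']                 # selecting a single column gives a Series
vector2 = pd.to_numeric(df['price'])  # coercing it to a numeric dtype
print(vector2.dtype)                  # int64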
Using the pd.DataFrame constructor, you may generate a DataFrame from several data sources, including dictionaries, 2D NumPy arrays, and Series.

Creating a Pandas DataFrame Using a Dictionary of Pandas Series
The index must be the same length as the Series. If the index is not specified, it will be created automatically with values: [0, …, len(data) – 1].
# Creating a DataFrame from a dictionary of Series
import pandas as pd

data = pd.DataFrame({
    "Class 1": pd.Series([22, 33, 38], index=["math avg", "science avg", "english avg"]),
    "Class 2": pd.Series([45, 28, 36], index=["math avg", "science avg", "english avg"]),
    "Class 3": pd.Series([32, 41, 47], index=["math avg", "science avg", "english avg"])
})
print(data)

Output
             Class 1  Class 2  Class 3
math avg          22       45       32
science avg       33       28       41
english avg       38       36       47

The following table summarizes the differences between Series and DataFrame in Python Pandas.
Feature                                   DataFrame    Series
Data structure                            2D table     1D array
Can contain heterogeneous data            Yes          Yes
Can contain column labels                 Yes          No
Can contain row labels                    Yes          No
Can be indexed by column or row labels    Yes          Yes
Can be sliced by column or row labels     Yes          Yes
Supports arithmetic operations            Yes          Yes
Conclusion
In summary, the main differences between series and vectors in Python Pandas are −
Series can hold any data type, while vectors can only hold numerical values
Series have a label index, while vectors do not
Series can be accessed using labels or indices, while vectors can only be accessed using indices
Understanding the difference between series and vectors can be useful for selecting the appropriate data structure for your data and for manipulating and analyzing it in Pandas.
Pagerank Explained In Simple Terms!
In my previous article, we talked about information retrieval and how machines can read the context from free text. Let’s talk about the biggest web information retrieval engine, Google, and the algorithm that powers its search results: the Google PageRank algorithm. Imagine you were to create a Google search in a world devoid of any search engine. What basic rules would you code to build such a search engine? If your answer is to use a Term Frequency or TF-IDF framework, consider the following case:
But, do search engines like Google face this challenge today? Obviously not! This is because they take help of an algorithm known as PageRank. In this article, we will discuss the concept of PageRank. In the next article, we will take this algorithm a step forward by leveraging it to find the most important packages in R.
An artificial web world
Imagine a web which has only 4 web pages, which are linked to each other. Each of the boxes below represents a web page. The words written in black and italics are the links between pages.
For instance, the web page “Tavish” has 3 outgoing links: one to each of the other three web pages. Now, let's draw a simpler directed graph of this ecosystem.
Here is how Google ranks a page: the page with the maximum number of incoming links is the most important page. In the current example, the “Kunal Jain” page comes out as the most significant page.
Mathematical Formulation of Google Page Rank
The first step of the formulation is to build a direction matrix. This matrix has each cell as the proportion of the outflow. For instance, Tavish (TS) has 3 outgoing links, which makes each proportion 1/3.
Now imagine a bot that keeps following the outgoing links: what proportion of its time would it spend on each of these pages? This can be expressed mathematically as the following equation:
A * X = X
Here A is the proportions matrix mentioned above, and
X is the probability of the bot being on each of these pages
Clearly, we see that Kunal Jain’s page in this universe comes out to be most important which goes in the same direction as our intuition.
Teleportation adjustments
Now, imagine a scenario where we have only two web pages: A and B. A has a link to B, but B has no external links. In such cases, if you try solving the matrix, you will get a zero matrix. This looks unreasonable, as B appears to be more important than A, yet our algorithm gives both the same importance. To solve this problem, the concept of teleportation was introduced: we include a constant probability alpha for each of these pages, to account for instances where a user jumps from one webpage to another without following any link. Hence, the equation is modified to the following:
(1 - alpha) * A * X + alpha * b = X
Here, b is a constant unit column matrix and alpha is the teleportation proportion. The most common value taken for alpha is 0.15 (but it can vary from case to case).
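To make the adjusted equation concrete, here is a minimal hedged sketch of solving it by power iteration. The 4-page matrix below is an illustrative stand-in for the example web above, and b is taken as a uniform vector; both are assumptions for demonstration only.

# Hedged sketch: PageRank by power iteration with teleportation.
# A must be column-stochastic: A[i, j] is the probability of moving from page j to page i.
import numpy as np

def pagerank(A, alpha=0.15, tol=1e-8, max_iter=100):
    n = A.shape[0]
    x = np.ones(n) / n   # start with equal probability on each page
    b = np.ones(n) / n   # uniform teleportation vector (assumption)
    for _ in range(max_iter):
        x_new = (1 - alpha) * A @ x + alpha * b
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x / x.sum()

# Toy 4-page web, columns sum to 1
A = np.array([[0,   0,   1/2, 1/3],
              [1/3, 0,   0,   1/3],
              [1/3, 1/2, 0,   1/3],
              [1/3, 1/2, 1/2, 0  ]])
print(pagerank(A))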
Other uses of PageRank Algorithm & End Notes
In this article we discussed the most significant use of PageRank. But the use of PageRank is in no way restricted to search engines. Here are a few other uses of PageRank:
Finding how well connected a person is on social media: One of the unexplored territories in social media analytics is network information. Using this network information we can estimate how influential a user is, and therefore prioritize our efforts to please the most influential customers. Networks can be easily analyzed using the PageRank algorithm.
Fraud detection in the pharmaceutical industry: Many countries, including the US, struggle with a high percentage of medical frauds. Such frauds can be spotted using the PageRank algorithm.
Understand the importance of packages in any programming language : Page Rank algorithm can also be used to understand the layers of packages used in languages like R and Python. We will take up this topic in our next article.
Thinkpot: Can you think of more usage of Page Rank algorithm? Share with us useful links to leverage Page Rank algorithm in various fields.
Did you find this article useful? Do let us know your thoughts about this article in the box below.
If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.