**Stock Prices Prediction Using Machine Learning And Deep Learning** (updated February 2024)


Introduction

Predicting how the stock market will perform is one of the most difficult things to do. There are so many factors involved in the prediction – physical factors vs. psychological, rational and irrational behavior, etc. All these aspects combine to make share prices volatile and very difficult to predict with a high degree of accuracy.

Can we use machine learning as a game-changer in this domain? Using features like the latest announcements about an organization, their quarterly revenue results, etc., machine learning techniques have the potential to unearth patterns and insights we didn’t see before, and these can be used to make more accurate predictions.

The core idea behind this article is to showcase how these algorithms are implemented. I will briefly describe the technique and provide relevant links to brush up on the concepts as and when necessary. In case you’re a newcomer to the world of time series, I suggest going through the following articles first:


Understanding the Problem Statement

We’ll dive into the implementation part of this article soon, but first it’s important to establish what we’re aiming to solve. Broadly, stock market analysis is divided into two parts – Fundamental Analysis and Technical Analysis.

Fundamental Analysis involves analyzing the company’s future profitability on the basis of its current business environment and financial performance.

Technical Analysis, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market.

As you might have guessed, our focus will be on the technical analysis part. We’ll be using a dataset from Quandl (you can find historical data for various stocks here) and for this particular project, I have used the data for ‘Tata Global Beverages’. Time to dive in!

Note: Here is the dataset I used for the code: Download

We will first load the dataset and define the target variable for the problem:

Python Code:
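The loading code is not reproduced in this copy of the article. As a minimal sketch, assuming the download is a CSV file with the columns described in this section, loading it with pandas might look like the following. The sample rows, their values, and the dates are made up for illustration and stand in for the real file, whose name is not given here.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for the downloaded CSV file;
# the values and dates are invented for illustration only.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)\n"
    "2020-01-06,208.0,222.25,206.85,216.0,215.15,4642146,10062.83\n"
    "2020-01-03,217.0,218.6,205.9,210.25,209.2,3519515,7407.06\n"
    "2020-01-02,223.5,227.8,216.15,217.25,218.2,1728786,3815.79\n"
)

# With the real file this would be pd.read_csv("<path to the downloaded CSV>")
df = pd.read_csv(sample_csv)

# Parse the dates so they can later be used as the index
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
print(df.head())

# The closing price is the target variable
target = df["Close"]
```

The later snippets in the article refer to this dataframe as `df`.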

There are multiple variables in the dataset – date, open, high, low, last, close, total_trade_quantity, and turnover.

The columns Open and Close represent the starting and final price at which the stock is traded on a particular day.

High, Low and Last represent the maximum, minimum, and last price of the share for the day.

Total Trade Quantity is the number of shares bought or sold in the day and Turnover (Lacs) is the turnover of the particular company on a given date.

Another important thing to note is that the market is closed on weekends and public holidays. Look at the table above again: some date values are missing – 2/10/2024, 6/10/2024, 7/10/2024. Of these dates, the 2nd is a national holiday, while the 6th and 7th fall on a weekend.

The profit or loss calculation is usually determined by the closing price of a stock for the day, hence we will consider the closing price as the target variable. Let’s plot the target variable to understand how it’s shaping up in our data:

    #setting index as date
    df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
    df.index = df['Date']

    #plot
    plt.figure(figsize=(16,8))
    plt.plot(df['Close'], label='Close Price history')

In the upcoming sections, we will explore these variables and use different techniques to predict the daily closing price of the stock.

Moving Average

Introduction

‘Average’ is easily one of the most common things we use in our day-to-day lives. For instance, calculating the average marks to determine overall performance, or finding the average temperature of the past few days to get an idea about today’s temperature – these all are routine tasks we do on a regular basis. So this is a good starting point to use on our dataset for making predictions.

The predicted closing price for each day will be the average of a set of previously observed values. Instead of using the simple average, we will be using the moving average technique which uses the latest set of values for each prediction. In other words, for each subsequent step, the predicted values are taken into consideration while removing the oldest observed value from the set. Here is a simple figure that will help you understand this with more clarity.

We will implement this technique on our dataset. The first step is to create a dataframe that contains only the Date and Close price columns, then split it into train and validation sets to verify our predictions.

Implementation
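The implementation code is missing from this copy of the article. The following is a minimal sketch of the idea described above, not the author's original code: each prediction is the mean of the latest `window` values, and each new prediction replaces the oldest value in that window. The function names, the toy prices, and the window size are assumptions chosen for illustration.

```python
import math

def moving_average_forecast(train_close, n_steps, window):
    """Forecast n_steps values; each one is the mean of the latest `window`
    values, with earlier forecasts pushed into the window as we move forward."""
    if len(train_close) < window:
        raise ValueError("need at least `window` training values")
    history = list(train_close[-window:])
    preds = []
    for _ in range(n_steps):
        pred = sum(history) / window
        preds.append(pred)
        history.pop(0)        # drop the oldest value
        history.append(pred)  # add the newest prediction
    return preds

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Toy closing prices, invented for illustration
train_close = [100.0, 102.0, 104.0, 106.0]
valid_close = [108.0, 110.0]

preds = moving_average_forecast(train_close, n_steps=len(valid_close), window=2)
print(preds)
print(rmse(valid_close, preds))
```

On the real data, `train_close` and `valid_close` would be the `Close` columns of the train and validation sets, and the resulting list would be what the plotting code refers to as `preds`.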

Just checking the RMSE does not help us in understanding how the model performed. Let’s visualize this to get a more intuitive understanding. So here is a plot of the predicted values along with the actual values.

    #plot
    valid['Predictions'] = 0
    valid['Predictions'] = preds
    plt.plot(train['Close'])
    plt.plot(valid[['Close', 'Predictions']])

Inference

The RMSE value is close to 105 but the results are not very promising (as you can gather from the plot). The predicted values are of the same range as the observed values in the train set (there is an increasing trend initially and then a slow decrease).

In the next section, we will look at two commonly used machine learning techniques – Linear Regression and kNN, and see how they perform on our stock market data.

Linear Regression

Introduction

The most basic machine learning algorithm that can be implemented on this data is linear regression. The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable.

The equation for linear regression can be written as:

Y = θ1x1 + θ2x2 + … + θnxn

Here, x1, x2,….xn represent the independent variables while the coefficients θ1, θ2, …. θn represent the weights. You can refer to the following article to study linear regression in more detail:

For our problem statement, we do not have a set of independent variables. We have only the dates instead. Let us use the date column to extract features like – day, month, year, mon/fri etc. and then fit a linear regression model.

Implementation

We will first sort the dataset in ascending order and then create a separate dataset so that any new feature created does not affect the original data.

    #setting index as date values
    df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
    df.index = df['Date']

    #sorting
    data = df.sort_index(ascending=True, axis=0)

    #creating a separate dataset
    new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
    for i in range(0,len(data)):
        new_data['Date'][i] = data['Date'][i]
        new_data['Close'][i] = data['Close'][i]

    #create features
    from fastai.structured import add_datepart
    add_datepart(new_data, 'Date')
    new_data.drop('Elapsed', axis=1, inplace=True)  #elapsed will be the time stamp

This creates features such as:

‘Year’, ‘Month’, ‘Week’, ‘Day’, ‘Dayofweek’, ‘Dayofyear’, ‘Is_month_end’, ‘Is_month_start’, ‘Is_quarter_end’, ‘Is_quarter_start’, ‘Is_year_end’, and ‘Is_year_start’.

Note: I have used add_datepart from the fastai library. If you do not have it installed, you can simply use the command pip install fastai. Otherwise, you can create these features using simple for loops in Python. I have shown an example below.
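If fastai is not available, a few of these date features can be derived with the standard library alone. This is a rough sketch under that assumption; the helper name `date_features` is hypothetical, only a subset of the columns is built, and the keys are chosen to mirror the feature names listed above.

```python
from datetime import date

def date_features(d):
    """Return a few calendar features for a datetime.date."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),            # Monday=0 ... Sunday=6
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_start": d.day == 1,
    }

# Example dates, invented for illustration
dates = [date(2020, 1, 1), date(2020, 1, 3), date(2020, 2, 3)]
rows = [date_features(d) for d in dates]
for row in rows:
    print(row)
```

Each dictionary could then become one row of the feature table, in place of the columns that add_datepart generates.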

Apart from this, we can add our own set of features that we believe would be relevant for the predictions. For instance, my hypothesis is that the first and last days of the week could potentially affect the closing price of the stock far more than the other days. So I have created a feature that identifies whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday. This can be done using the following lines of code:

    new_data['mon_fri'] = 0
    for i in range(0,len(new_data)):
        if (new_data['Dayofweek'][i] == 0 or new_data['Dayofweek'][i] == 4):
            new_data['mon_fri'][i] = 1
        else:
            new_data['mon_fri'][i] = 0

We will now split the data into train and validation sets to check the performance of the model.

    #split into train and validation
    train = new_data[:987]
    valid = new_data[987:]

    x_train = train.drop('Close', axis=1)
    y_train = train['Close']
    x_valid = valid.drop('Close', axis=1)
    y_valid = valid['Close']

    #implement linear regression
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(x_train,y_train)

Results

    #make predictions and find the rmse
    preds = model.predict(x_valid)
    rms=np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
    rms

    121.16291596523156

The RMSE value is higher than the previous technique, which clearly shows that linear regression has performed poorly. Let’s look at the plot and understand why linear regression has not done well:

    #plot
    valid['Predictions'] = 0
    valid['Predictions'] = preds

    valid.index = new_data[987:].index
    train.index = new_data[:987].index

    plt.plot(train['Close'])
    plt.plot(valid[['Close', 'Predictions']])

Inference

As seen from the plot above, for January 2024 and January 2023, there was a drop in the stock price. The model has predicted the same for January 2023. A linear regression technique can perform well for problems such as Big Mart sales where the independent features are useful for determining the target value.

k-Nearest Neighbours

Introduction

Another interesting ML algorithm that one can use here is kNN (k nearest neighbours). Based on the independent variables, kNN finds the similarity between new data points and old data points. Let me explain this with a simple example.

Consider the height and age for 11 people. On the basis of given features (‘Age’ and ‘Height’), the table can be represented in a graphical format as shown below:

To determine the weight for ID #11, kNN considers the weight of the nearest neighbors of this ID. The weight of ID #11 is predicted to be the average of its neighbors. If we consider three neighbours (k=3) for now, the weight for ID #11 would be (77+72+60)/3 = 69.67 kg.
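The averaging step can be sketched in a few lines of plain Python. The rows below are hypothetical, since the article's table is not reproduced here; they are chosen only so that the three nearest neighbours have the weights 77, 72 and 60 used in the worked example. Plain Euclidean distance on unscaled features is also an assumption.

```python
import math

# Hypothetical (height_ft, age_years, weight_kg) rows, invented for illustration
people = [
    (5.0, 41, 77),
    (5.6, 30, 55),
    (5.9, 34, 59),
    (4.8, 40, 72),
    (5.8, 36, 60),
    (5.3, 19, 40),
]

# Person whose weight is unknown: (height_ft, age_years)
query = (5.5, 38)

def knn_predict(rows, query, k):
    """Average the target (last column) of the k rows closest to the query."""
    by_distance = sorted(rows, key=lambda r: math.dist(r[:2], query))
    nearest = by_distance[:k]
    return sum(r[2] for r in nearest) / k

predicted = knn_predict(people, query, k=3)
print(round(predicted, 2))
```

Because age spans a much larger range than height, it dominates the distance here, which is one reason the kNN implementation later in this section scales the features first.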

For a detailed understanding of kNN, you can refer to the following articles:

Implementation

    #importing libraries
    from sklearn import neighbors
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler(feature_range=(0, 1))

Using the same train and validation set from the last section:

    #scaling data
    x_train_scaled = scaler.fit_transform(x_train)
    x_train = pd.DataFrame(x_train_scaled)
    #note: this refits the scaler on the validation set, which is how the result below was produced;
    #scaler.transform(x_valid) would reuse the training-set scaling instead
    x_valid_scaled = scaler.fit_transform(x_valid)
    x_valid = pd.DataFrame(x_valid_scaled)

    #using gridsearch to find the best parameter
    params = {'n_neighbors':[2,3,4,5,6,7,8,9]}
    knn = neighbors.KNeighborsRegressor()
    model = GridSearchCV(knn, params, cv=5)

    #fit the model and make predictions
    model.fit(x_train,y_train)
    preds = model.predict(x_valid)

Results

    #rmse
    rms=np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
    rms

    115.17086550026721

There is not a huge difference in the RMSE value, but a plot of the predicted and actual values should provide a clearer understanding.

    #plot
    valid['Predictions'] = 0
    valid['Predictions'] = preds
    plt.plot(valid[['Close', 'Predictions']])
    plt.plot(train['Close'])

Inference

The RMSE value is similar to that of the linear regression model, and the plot shows the same pattern. Like linear regression, kNN also identified a drop in January 2023 since that has been the pattern for the past years. We can safely say that regression algorithms have not performed well on this dataset.

Let’s go ahead and look at some time series forecasting techniques to find out how they perform when faced with this stock prices prediction challenge.

Auto ARIMA

Introduction

ARIMA is a very popular statistical method for time series forecasting. ARIMA models take into account the past values to predict the future values. There are three important parameters in ARIMA:

p (past values used for forecasting the next value)

q (past forecast errors used to predict the future values)

d (order of differencing)
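Of the three parameters, d is the easiest to show concretely. As a small sketch with invented prices, differencing once replaces each value with its change from the previous value, and differencing twice repeats that on the result. The helper name `difference` is hypothetical.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a list of numbers."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Toy prices with a steadily growing trend, invented for illustration
prices = [100, 102, 105, 109, 114]

print(difference(prices, d=1))  # [2, 3, 4, 5]
print(difference(prices, d=2))  # [1, 1, 1]
```

In this toy series the second difference is constant, so the trend has been removed; ARIMA then models the differenced series with the p and q terms.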

Parameter tuning for ARIMA consumes a lot of time. So we will use auto ARIMA which automatically selects the best combination of (p,q,d) that provides the least error. To read more about how auto ARIMA works, refer to this article:

Implementation

    #the pyramid-arima package was later renamed pmdarima
    from pyramid.arima import auto_arima

    data = df.sort_index(ascending=True, axis=0)

    train = data[:987]
    valid = data[987:]

    training = train['Close']
    validation = valid['Close']

    model = auto_arima(training, start_p=1, start_q=1,max_p=3, max_q=3, m=12,start_P=0, seasonal=True,d=1, D=1, trace=True,error_action='ignore',suppress_warnings=True)
    model.fit(training)

    forecast = model.predict(n_periods=248)
    forecast = pd.DataFrame(forecast,index = valid.index,columns=['Prediction'])

Results

    rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-np.array(forecast['Prediction'])),2)))
    rms

    44.954584993246954

    #plot
    plt.plot(train['Close'])
    plt.plot(valid['Close'])
    plt.plot(forecast['Prediction'])

Inference

As we saw earlier, an auto ARIMA model uses past data to understand the pattern in the time series. Using these values, the model captured an increasing trend in the series. Although the predictions using this technique are far better than those of the previously implemented machine learning models, these predictions are still not close to the real values.

As is evident from the plot, the model has captured a trend in the series, but does not focus on the seasonal part. In the next section, we will implement a time series model that takes both trend and seasonality of a series into account.

Prophet

Introduction

There are a number of time series techniques that can be implemented on the stock prediction dataset, but most of these techniques require a lot of data preprocessing before fitting the model. Prophet, designed and pioneered by Facebook, is a time series forecasting library that requires no data preprocessing and is extremely simple to implement. The input for Prophet is a dataframe with two columns: date and target (ds and y).

Prophet tries to capture the seasonality in the past data and works well when the dataset is large. Here is an interesting article that explains Prophet in a simple and intuitive manner:

Implementation

    #importing prophet (the package is published as `prophet` from version 1.0 onward)
    from fbprophet import Prophet

    #creating dataframe
    new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
    for i in range(0,len(data)):
        new_data['Date'][i] = data['Date'][i]
        new_data['Close'][i] = data['Close'][i]

    new_data['Date'] = pd.to_datetime(new_data.Date,format='%Y-%m-%d')
    new_data.index = new_data['Date']

    #preparing data
    new_data.rename(columns={'Close': 'y', 'Date': 'ds'}, inplace=True)

    #train and validation
    train = new_data[:987]
    valid = new_data[987:]

    #fit the model
    model = Prophet()
    model.fit(train)

    #predictions
    close_prices = model.make_future_dataframe(periods=len(valid))
    forecast = model.predict(close_prices)

Results

    #rmse
    forecast_valid = forecast['yhat'][987:]
    rms=np.sqrt(np.mean(np.power((np.array(valid['y'])-np.array(forecast_valid)),2)))
    rms

    57.494461930575149

    #plot
    valid['Predictions'] = 0
    valid['Predictions'] = forecast_valid.values

    plt.plot(train['y'])
    plt.plot(valid[['y', 'Predictions']])

Inference

Prophet (like most time series forecasting techniques) tries to capture the trend and seasonality from past data. This model usually performs well on time series datasets, but fails to live up to its reputation in this case.

As it turns out, stock prices do not have a particular trend or seasonality. They depend heavily on what is currently going on in the market, and the prices rise and fall accordingly. Hence forecasting techniques like ARIMA, SARIMA and Prophet would not show good results for this particular problem.

Long Short Term Memory (LSTM)

Introduction

LSTMs are widely used for sequence prediction problems and have proven to be extremely effective. They work so well because an LSTM is able to store past information that is important and forget the information that is not. LSTM has three gates:

The input gate: The input gate adds information to the cell state

The forget gate: It removes the information that is no longer required by the model

The output gate: It selects the information to be shown as output
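For reference, the three gates are usually written with the standard equations below. The notation is the common textbook convention rather than something defined in this article: x_t is the input at step t, h_t the hidden state, c_t the cell state, W and b are learned weights and biases, σ is the sigmoid function, and ⊙ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
```

Read against the list above: the forget gate scales down the old cell state, the input gate decides how much of the candidate is added, and the output gate decides how much of the cell state is exposed as output.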

For a more detailed understanding of LSTM and its architecture, you can go through the below article:

For now, let us implement LSTM as a black box and check its performance on our particular data.

Implementation

    #importing required libraries
    from sklearn.preprocessing import MinMaxScaler
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, LSTM

    #creating dataframe
    data = df.sort_index(ascending=True, axis=0)
    new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
    for i in range(0,len(data)):
        new_data['Date'][i] = data['Date'][i]
        new_data['Close'][i] = data['Close'][i]

    #setting index
    new_data.index = new_data.Date
    new_data.drop('Date', axis=1, inplace=True)

    #creating train and test sets
    dataset = new_data.values

    train = dataset[0:987,:]
    valid = dataset[987:,:]

    #converting dataset into x_train and y_train
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_data = scaler.fit_transform(dataset)

    x_train, y_train = [], []
    for i in range(60,len(train)):
        x_train.append(scaled_data[i-60:i,0])
        y_train.append(scaled_data[i,0])
    x_train, y_train = np.array(x_train), np.array(y_train)

    x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

    # create and fit the LSTM network
    model = Sequential()
    model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
    model.add(LSTM(units=50))
    model.add(Dense(1))

    #a Keras model must be compiled before fit; mean squared error and Adam are assumed here
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)

    #predicting 246 values, using past 60 from the train data
    inputs = new_data[len(new_data) - len(valid) - 60:].values
    inputs = inputs.reshape(-1,1)
    inputs = scaler.transform(inputs)

    X_test = []
    for i in range(60,inputs.shape[0]):
        X_test.append(inputs[i-60:i,0])
    X_test = np.array(X_test)

    X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
    closing_price = model.predict(X_test)
    closing_price = scaler.inverse_transform(closing_price)

Results

    rms=np.sqrt(np.mean(np.power((valid-closing_price),2)))
    rms

    11.772259608962642

    #for plotting
    train = new_data[:987]
    valid = new_data[987:]
    valid['Predictions'] = closing_price
    plt.plot(train['Close'])
    plt.plot(valid[['Close','Predictions']])

Inference

Wow! The LSTM model can be tuned for various parameters such as changing the number of LSTM layers, adding dropout value or increasing the number of epochs. But are the predictions from LSTM enough to identify whether the stock price will increase or decrease? Certainly not!

As I mentioned at the start of the article, stock price is affected by the news about the company and other factors like demonetization or merger/demerger of the companies. There are certain intangible factors as well which can often be impossible to predict beforehand.

Conclusion

Time series forecasting is a very intriguing field to work with, as I have realized during my time writing these articles. There is a perception in the community that it’s a complex field, and while there is a grain of truth in there, it’s not so difficult once you get the hang of the basic techniques.

Frequently Asked Questions

Q1. Is it possible to predict the stock market with Deep Learning?

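A. To a degree. This article applies moving average, linear regression, kNN, Auto ARIMA, Prophet, and LSTM to the task; of these, only LSTM is a deep learning model, and it gave the lowest RMSE here. Even so, such predictions are not enough to tell whether a price will rise or fall, because news and other external factors also drive prices.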

Q2. What can you use to predict stock prices in Deep Learning?

A. LSTM (Long Short Term Memory) is the deep learning model used in this article. Moving average, linear regression, kNN (k-nearest neighbours), Auto ARIMA, and Prophet are statistical or classical machine learning techniques that serve as baselines.

Q3. What are the two methods to predict stock price?

A. Fundamental Analysis and Technical Analysis are the two ways of analyzing and predicting stock prices.


## Google Colab For Machine Learning And Deep Learning

“Memory Error” – that all too familiar dreaded message in Jupyter notebooks when we try to execute a machine learning or deep learning algorithm on a large dataset. Most of us do not have access to unlimited computational power on our machines. And let’s face it, it costs an arm and a leg to get a decent GPU from existing cloud providers. So how do we build large deep learning models without burning a hole in our pockets? Step up – Google Colab!

It’s an incredible online browser-based platform that allows us to train our models on machines for free! Sounds too good to be true, but thanks to Google, we can now work with large datasets, build complex models, and even share our work seamlessly with others. That’s the power of Google Colab.

What is Google Colab?

Google Colaboratory is a free online cloud-based Jupyter notebook environment that allows us to train our machine learning and deep learning models on CPUs, GPUs, and TPUs.

Here’s what I truly love about Colab. It does not matter which computer you have, what its configuration is, and how ancient it might be. You can still use Google Colab! All you need is a Google account and a web browser. And here’s the cherry on top – you get access to GPUs like Tesla K80 and even a TPU, for free!

TPUs are much more expensive than GPUs, and you can use one for free on Colab. It’s worth repeating again and again – it’s an offering like no other.

Are you still using that same old Jupyter notebook on your system for training models? Trust me, you’re going to love Google Colab.

What is a Notebook in Google Colab?

Google Colab Features

Colab provides users free access to GPUs and TPUs, which can significantly speed up the training and inference of machine learning and deep learning models.

Colab’s interface is web-based, so installing any software on your local machine is unnecessary. The interface is also intuitive and user-friendly, making it easy to get started with coding.

Colab allows multiple users to work on the same notebook simultaneously, making collaborating with team members easy. Colab also integrates with other Google services, such as Google Drive and GitHub, making it easy to share your work.

Colab notebooks support markdown, which allows you to include formatted text, equations, and images alongside your code. This makes it easier to document your work and communicate your ideas.

Colab comes pre-installed with many popular libraries and tools for machine learning and deep learning, such as TensorFlow and PyTorch. This saves time and eliminates the need to manually install and configure these tools.

GPUs and TPUs on Google Colab

Ask anyone who uses Colab why they love it. The answer is unanimous – the availability of free GPUs and TPUs. Training models, especially deep learning ones, takes numerous hours on a CPU. We’ve all faced this issue on our local machines. GPUs and TPUs, on the other hand, can train these models in a matter of minutes or seconds.

If you still need a reason to work with GPUs, check out this excellent explanation by Faizan Shaikh.

It gives you a decent GPU for free, which you can continuously run for 12 hours. For most data science folks, this is sufficient to meet their computation needs. Especially if you are a beginner, then I would highly recommend you start using Google Colab.

Google Colab gives us three types of runtime for our notebooks:

CPUs,

GPUs, and

TPUs

As I mentioned, Colab gives us 12 hours of continuous execution time. After that, the whole virtual machine is cleared and we have to start again. We can run multiple CPU, GPU, and TPU instances simultaneously, but our resources are shared between these instances.

Let’s take a look at the specifications of different runtimes offered by Google Colab:

It will cost you A LOT to buy a GPU or TPU from the market. Why not save that money and use Google Colab from the comfort of your own machine?

How to Use Google Colab?

You can go to Google Colab using this link. This is the screen you’ll get when you open Colab:

You can also import your notebook from Google Drive or GitHub, but they require an authentication process.

Google Colab Runtimes – Choosing the GPU or TPU Option

The ability to choose different types of runtimes is what makes Colab so popular and powerful. Here are the steps to change the runtime of your notebook:

Step 2: Here you can change the runtime according to your need:

A wise man once said, “With great power comes great responsibility.” I implore you to shut down your notebook after you have completed your work so that others can use these resources because various users share them. You can terminate your notebook like this:

Using Terminal Commands on Google Colab

You can use the Colab cell for running terminal commands. Most of the popular libraries come installed by default on Google Colab. Yes, Python libraries like Pandas, NumPy, scikit-learn are all pre-installed.

If you want to run a different Python library, you can always install it inside your Colab notebook like this:

!pip install library_name

Pretty easy, right? Everything is similar to how it works in a regular terminal. We just have to put an exclamation mark (!) before each command, like:

!ls

or:

!pwd

Cloning Repositories in Google Colab

You can also clone a Git repo inside Google Colaboratory. Just go to your GitHub repository and copy the clone link of the repository:

Then, simply run:

And there you go!

Uploading Files and Datasets

Here’s a must-know aspect for any data scientist. The ability to import your dataset into Colab is the first step in your data analysis journey.

The most basic approach is to upload your dataset to Colab directly:

You can also upload your dataset to any other platform and access it using its link. I tend to go with the second approach more often than not (when feasible).

Saving Your Notebook

All the notebooks on Colab are stored on your Google Drive. The best thing about Colab is that your notebook is automatically saved after a certain time period and you don’t lose your progress.

If you want, you can export and save your notebook in both *.py and *.ipynb formats:

Not just that, you can also save a copy of your notebook directly on GitHub, or you can create a GitHub Gist:

I love the variety of options we get.

Exporting Data/Files from Google Colab

You can export your files directly to Google Drive, or you can save them on the VM instance and download them yourself:

Exporting directly to the Drive is a better option when you have bigger files or more than one file. You’ll pick up these nuances as you work on bigger projects in Colab.
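As a hedged sketch of the second route, the snippet below writes a small file to the VM's local disk and, when it is running inside Colab, asks the browser to download it with `files.download` from the `google.colab` module. The file name and its contents are hypothetical, and the import guard is only there so that the same cell also runs outside Colab.

```python
import os

try:
    # This module is only importable inside a Colab runtime
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Write a small result file on the VM's local disk (hypothetical name and content)
output_path = "predictions.csv"
with open(output_path, "w") as f:
    f.write("Date,Prediction\n2020-01-02,215.1\n")

if IN_COLAB:
    # Triggers a browser download of the file from the VM
    files.download(output_path)
else:
    print("Not running in Colab; file saved at", os.path.abspath(output_path))
```

For the Drive route, the same module offers `drive.mount('/content/drive')`, after which files written under that folder appear in your Google Drive.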

Sharing Your Notebook

Google Colab also gives us an easy way of sharing our work with others. This is one of the best things about Colab:

What’s Next?

Google Colab now also provides a paid platform called Google Colab Pro, priced at $9.99 a month. In this plan, you can get the Tesla T4 or Tesla P100 GPU, and an option of selecting an instance with a high RAM of around 27 GB. Also, your maximum computation time is doubled from 12 hours to 24 hours. How cool is that?

You can consider this plan if you need high computation power because it is still quite cheap when compared to other cloud GPU providers like AWS, Azure, and even GCP.

Recommendations

If you’re new to the world of Deep Learning, I have some excellent resources to help you get started in a comprehensive and structured manner:


## Classification Of Crops Using Machine Learning!

This article was published as a part of the Data Science Blogathon

Introduction

to think about these projects rather than their implementation, as many of us have trouble starting and finishing projects.

In this article, we do simple classification modeling and try to get good accuracy.

You can download the dataset from here.

Aim:

To determine the outcome of the harvest season, i.e. whether the crop would be healthy (alive), damaged by pesticides, or damaged by other reasons.

Data Description:

We have two datasets given to train and test.

ID: UniqueID

Estimated_Insects_Count: Estimated insects count per square meter

Crop_Type: Category of Crop (0,1)

Soil_Type: Category of Soil (0,1)

Pesticide_Use_Category: Type of pesticides uses (1- Never, 2-Previously Used, 3-Currently Using)

Number_Doses_Week: Number of doses per week

Number_Weeks_Used: Number of weeks used

Number_Weeks_Quit: Number of weeks quit

Season: Season Category (1,2,3)

Crop_Damage: Crop Damage Category (0=alive, 1=Damage due to other causes, 2=Damage due to Pesticides)

1. Importing Libraries and Dataset

We then load the dataset from CSV format, convert it into a pandas DataFrame, and check the top five rows to analyze the data.

2. Cleaning Dataset

1. Checking Null Values: Using dataset.isnull().sum(), we find 9000 missing values in the dataset, all in the Number_Weeks_Used variable.

2. Checking Datatypes: We checked Datatypes of all columns, to see any inconsistencies in the data.

3. Checking Unique Values: Now we have to understand unique values if present in columns, which will help to reduce dimensionality in future processing.

4. Replacing missing values: As there are 9000 missing values in the Number_Weeks_Used column, we replace them with the mode of that column. Checking for null values again confirms that none remain in the dataset.
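A minimal sketch of mode imputation, using only the standard library and a short made-up list in place of the real Number_Weeks_Used column (the variable names are hypothetical):

```python
from statistics import mode

# Made-up sample of Number_Weeks_Used values; None marks a missing entry
weeks_used = [20, None, 30, 20, None, 25, 20]

# The mode is computed from the observed values only
observed = [v for v in weeks_used if v is not None]
fill_value = mode(observed)

# Replace every missing entry with the mode
filled = [fill_value if v is None else v for v in weeks_used]
print(filled)
```

With pandas, the equivalent step is typically a `fillna` call using the first value returned by the column's `mode()`.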

3. Exploratory Data Analysis:

First, we will get information by using the info() method.

Correlation:

Checking correlation with sns.heatmap() .

Inferences drawn from heatmap:

3. Number_Weeks_Quit is highly negatively correlated with Pesticide_Use_Category and Number_Weeks_Used.

Data Visualization:

For gathering insights, data visualization is a must.

a. Univariate Analysis:

For univariate analysis, I plotted a countplot of Crop_Damage.

Observations:

Crop damage due to pesticides is less in comparison to damage due to other causes.

Crop type 0 has a higher chance of survival compared to crop type 1.

Now I plotted countplot for Crop_Damage vs Insect count for Crop Type.

Observations:

1. Type 2 pesticide is much safer to use as compared to Type 3 pesticide.

2. Type 3 pesticide shows most pesticide-related damage to crops.

Another plot in univariate analysis for gathering more insights.

Observations:

1. From Graph 1 we can conclude that till 20-25 weeks damage due to pesticide is negligible.

2. From Graph 3 we can see that after 20 weeks damage due to the use of pesticide increases significantly.

b. Bivariate Analysis:

Plotted barplot between Crop_Damage vs Estimated_Insects_Count.

The plot above clearly shows that most insect attacks occur on crop type 0.

Barplot between Crop_Type vs Number_Weeks_Used.

Observations:

1. Crop Type 0 is more vulnerable to pesticide-related and other damages as compared to Type 1.

2. Avg. duration of pesticide-related damage is lower for Crop type 1.

4. Data Pre-processing :

Outliers Analysis :

Now we will check for outliers using Boxplot.

To remove these outliers, I find the mean value of each column and replace each outlier in that column with it.
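The article does not say how an outlier is defined or which mean is used. As one possible sketch, the code below assumes the usual boxplot rule (values more than 1.5 × IQR outside the quartiles) and, as a further assumption, fills with the mean of the non-outlier values so that the extreme value does not distort its own replacement. The helper name and sample data are hypothetical.

```python
from statistics import mean, quantiles

def replace_outliers_with_mean(values):
    """Replace values outside the 1.5 * IQR fences with the mean of the rest."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in values if low <= v <= high]
    fill = mean(inliers)
    return [v if low <= v <= high else fill for v in values]

# Made-up column with one obvious outlier
doses = [10, 12, 11, 13, 12, 11, 100]
cleaned = replace_outliers_with_mean(doses)
print(cleaned)
```

Using the mean of the whole column instead, as the text literally describes, would pull the replacement value toward the outliers themselves.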

After removing outliers,

Skew Analysis :

Checking the skewness of our data using histplot, we observed that all the data is normally distributed.

Now, our dataset is ready to be put into the machine learning model for classification analysis.

5. Building Machine Learning Model:

Scaling Dataset:

As usual, the first step is to drop the target variable and then scale the features using StandardScaler, which standardizes each feature to zero mean and unit variance.
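As a rough illustration of what standard scaling does to a single column, here is a plain-Python sketch. The feature values and the function name are made up for the example. Scikit-learn's StandardScaler applies the same formula per feature, using the population standard deviation.

```python
import statistics

def standardize(column):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]

# Hypothetical feature values
insect_counts = [150, 300, 450, 600, 1500]
scaled = standardize(insect_counts)
print([round(z, 2) for z in scaled])
```

The scaled column has mean 0 and standard deviation 1. The shape of its distribution is unchanged, so a skewed feature stays skewed after scaling.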

Splitting Dataset:

After preprocessing, we now split the data into training/testing subsets.

Evaluating Models:

We now check various classification models and calculate metrics such as the precision, recall, and F1 score.

The models we will run here are:

Random Forest

K Nearest Neighbor(KNN)

Decision Tree Classifier

Gaussian NB

From the initial model accuracy values, we see that KNN performs better than the others, with 83% accuracy. It has the highest accuracy score and the lowest standard deviation.

Next, I searched for the best value of n_neighbors for our KNN model using GridSearchCV, with ‘n_neighbors’ in the range (2, 30), cv = 5 and scoring = ‘accuracy’, and found n_neighbors = 22.

Running the KNN model again with its best parameter, i.e. n_neighbors = 22.
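A search like the one described might look like the sketch below. It runs on synthetic data from make_classification rather than the crop dataset, and it reads the stated range as range(2, 30); both are assumptions, so the best value it finds will not necessarily be 22.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled crop features (assumption)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Try every n_neighbors from 2 to 29 with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(2, 30))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
print("best n_neighbors:", best_k)
# GridSearchCV refits the best model on the full training set by default
print("test accuracy:", search.score(X_test, y_test))
```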

Result :

To check model performance, we will now plot different performance metrics.

a. Confusion Matrix:

From observation, we found decent accuracy ( ~0.84), precision, and recall for the model. This indicates that the model is a good fit for the prediction.
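To make the link between the confusion matrix and these metrics concrete, here is a small sketch for the binary case. The labels are invented for the example and the helper name is hypothetical; for a target with more than two classes, the same counts are computed per class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```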

For better results, one can do further hyperparameter tuning, which may help increase the accuracy of the model.

Overall, KNN gives the best accuracy among all the models, so we save it as our final model.

For your reference you can check my complete solution and dataset used from this link:

Thanks for reading !!! Cheers!

About the author:

Priyal Agarwal


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


## Toolset For Using Machine Learning Without Matlab

Although MATLAB is a popular programming language in the field of machine learning, it is expensive. Nowadays, many programmers are looking for substitute toolkits to build machine learning algorithms. Thankfully, there are a number of open-source, economical solutions that can provide comparable features. This post will examine some of the top toolkits for employing machine learning outside of MATLAB, including R packages like caret and randomForest as well as Python libraries like scikit-learn and TensorFlow.

List of toolsets

There are many tools available for using machine learning without MATLAB. Here are some popular options −

1. Python

Python is a powerful and flexible programming language that has gained popularity for application in data analysis and machine learning. There are a number of machine-learning frameworks and tools that have been developed using this free and open-source language, which has a substantial and active development community.

One well-known Python machine learning library is PyTorch. Facebook created PyTorch, an open-source machine learning framework that offers a powerful tensor library for deep learning. Its dynamic computational graph makes it more adaptable and user-friendly than many rival frameworks.

Scikit-learn is another popular Python machine-learning package. It is a straightforward and effective data mining and data analysis tool that offers a variety of supervised and unsupervised learning methods for applications like classification, regression, and clustering.

Together with these libraries, Python also provides a wide range of additional beneficial machine-learning tools including Keras, Theano, and Pandas. Theano is a deep learning framework for numerical computing, Pandas is a data manipulation library that offers data structures for effective data analysis, and Keras is a high-level neural network library.

Generally, Python’s appeal in machine learning may be attributed to its simplicity, adaptability, and abundance of libraries and frameworks. Building and training machine learning models as well as analyzing and manipulating data for diverse applications are made simpler by these tools and frameworks.

2. R

R is a widely used programming language and software environment for statistical computation and graphics. It has grown to be a well-liked option for data analysis, machine learning, and statistical modeling thanks to its large library of statistical and graphical tools.

R’s extensive library of packages created especially for data analysis and machine learning is one of the key factors contributing to its appeal in machine learning. Caret and randomForest are two of these tools that are frequently used for machine learning.

The R package Caret (Classification And Regression Training) offers a uniform interface for training and fine-tuning a wide range of machine-learning models. It supports a broad range of methods, including linear and nonlinear regression, decision trees, and support vector machines, and provides functions for data splitting, preprocessing, feature selection, and model training.

Another well-liked R package, randomForest, implements the random forest technique for classification and regression problems. Because of its capacity to manage high-dimensional data, cope with missing values, and handle relationships between variables, it is a preferred option for machine learning applications.

R features a wide range of other helpful machine learning packages, like the caretEnsemble package, which offers tools for merging several machine learning models, and the glmnet package, which offers effective generalized linear model implementations.

Overall, R’s large library of packages for statistical computation and data analysis makes it a popular language for machine learning.

3. RapidMiner

RapidMiner is a data science platform that provides an integrated environment for data preparation, machine learning, and model deployment. Its interface is drag-and-drop and supports a wide range of data sources and formats. It seeks to simplify the processes of collecting data, building machine learning models, and applying them in real-world scenarios.

A key feature is RapidMiner’s ability to use workflows to automate the machine learning process. Models can be generated and tested quickly, and the best-performing ones can then be put into production via a number of approaches.

Overall, RapidMiner is a capable and flexible data science platform that can be used for many machine learning and data analysis tasks. It is a well-liked option for both novice and experienced users because of its user-friendly drag-and-drop interface, wide selection of machine-learning algorithms, and compatibility with a number of data sources and formats.

4. KNIME

KNIME is an open-source platform for data analytics that offers a graphical user interface for creating data pipelines and processes. It has several built-in nodes for data preparation, machine learning, and visualization, and it can be expanded with plugins and customized nodes. Even those without programming skills can use it easily, yet it still has advanced features for machine learning and data analytics applications.

Moreover, KNIME supports custom nodes and plugins that are developed and shared by the user base, so users can extend the platform to meet their own needs.

KNIME’s capacity to interact with various platforms and data sources like Hadoop, Spark, and R is another important aspect. As a result, working with big, complicated datasets and incorporating KNIME into current data ecosystems are made simple.

A variety of machine learning methods, such as decision trees, clustering, and regression models, are offered by KNIME. With the use of a straightforward drag-and-drop interface, these can be set up, trained, and then applied to fresh data inside the platform.

Last but not least, KNIME offers a wide selection of charts, graphs, and other visualizations as part of its rich support for data visualization. This enables users to study and comprehend their data in a number of ways and successfully share their conclusions with others.

Conclusion

Python is a powerful and flexible programming language that has gained popularity in data analysis and machine learning due to its simplicity, adaptability, and abundance of libraries and frameworks. RapidMiner provides an integrated environment for model deployment, machine learning, and data preparation, and KNIME offers a graphical user interface for creating data pipelines and processes. KNIME is a powerful and adaptable framework for data analytics that is suitable for both novice and expert users due to its large library of built-in nodes, support for new nodes, and plugins.

## Improving Your Deep Learning Model Using Model Checkpointing

Introduction


Model checkpointing helps us in two ways:

Saves the best model for us.

In case of system failure, not everything is lost

We’ll discuss each one in detail. Let’s begin!

1. Saving the Best Model

Let’s discuss what we mean by “Best Model” and how it can be saved. Let’s say that this is the visualization of the performance of a model-

Here the blue line represents the training loss and the orange line represents the validation loss. On the X-axis, we have the number of epochs and on the Y-axis we have the loss values. Now, while making the predictions, the weights and biases stored at the very last epoch will be used. So the model will train completely till the specified number of epochs, which is 50 in this case. And the parameters learned during the last epoch will be used in order to make the predictions.

But if you look closely in this particular graph, the best validation loss is around this epoch, which is epoch number 45-

Let me take the model history in order to elaborate on this a bit more. So here is the model history for a model which has been trained for 50 epochs-

And you can see the epoch numbers here. Now we can see that we have the training loss, training accuracy, validation loss, and validation accuracy shown here. Let’s look at the validation loss as highlighted here-

So what we generally do is take the parameters of the model at the last epoch, which is epoch 50 here, and make the predictions. Now, in this case, we can see that the validation loss at epoch number 50 is 0.629, whereas the lowest validation loss was 0.61, at epoch 45.

So through the model checkpointing, instead of saving the last model or the parameters of the last epoch, we are going to save the model which produces the best results. And this model is called the Best Model. So basically Model Checkpointing will help us save the best model.

2. In case of system failure, not everything is lost

To decide which model counts as the best, in Keras we have to define two parameters. One is “Monitor” and the other one is “Mode”.

The first one refers to the quantity that we wish to monitor, such as validation loss or validation accuracy and “Mode” refers to the mode of that quantity. Let me explain this with an example. So let’s say we wish to monitor the validation loss in this case. While we are monitoring the validation loss, the mode will be minimum because we want to minimize the loss.

Similarly, if we are monitoring the validation accuracy, the mode will be maximum since we want the maximum accuracy for the validation set.

So after every epoch, we will monitor either the validation loss or the validation accuracy and save the model, if these values have improved from the previous model.
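The monitor and mode logic can be sketched in plain Python as below. This is not Keras code: BestCheckpoint is a hypothetical stand-in written for illustration, and the loss values are invented, apart from 0.61 at epoch 45 and 0.629 at epoch 50, which come from the example above. In Keras itself, the ModelCheckpoint callback accepts monitor, mode, and save_best_only arguments for this purpose.

```python
import math

class BestCheckpoint:
    """Hypothetical stand-in: remember the weights from the best epoch so far."""

    def __init__(self, monitor="val_loss", mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.monitor = monitor
        self.mode = mode
        # Start from the worst possible value for the chosen mode
        self.best = math.inf if mode == "min" else -math.inf
        self.best_epoch = None
        self.best_weights = None

    def on_epoch_end(self, epoch, logs, weights):
        current = logs[self.monitor]
        if self.mode == "min":
            improved = current < self.best
        else:
            improved = current > self.best
        if improved:
            self.best, self.best_epoch = current, epoch
            self.best_weights = list(weights)  # stands in for writing a file
        return improved

# Validation loss for the last epochs of a 50-epoch run (mostly invented)
val_losses = {44: 0.640, 45: 0.610, 46: 0.625, 47: 0.632,
              48: 0.620, 49: 0.631, 50: 0.629}

ckpt = BestCheckpoint(monitor="val_loss", mode="min")
for epoch, val_loss in val_losses.items():
    ckpt.on_epoch_end(epoch, {"val_loss": val_loss}, weights=[epoch])

print(ckpt.best_epoch, ckpt.best)
```

With mode set to "min", the weights kept at the end are those from epoch 45, not from the final epoch.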

Now, these are the common steps that we perform while creating any deep learning model, and we set up model checkpointing at the time of model training-

1. Loading the dataset

2. Pre-processing the data

3. Creating training and validation set

4. Defining the model architecture

5. Compiling the model

6. Training the model

Setting up model checkpointing

7. Evaluating model performance

End Notes

After reading this article, you should have an intuition for the Model Checkpointing technique, which can be really helpful if you’re looking to improve your deep learning model. For the implementation of this technique, stay tuned! I’m going to cover its implementation in the next article.


## Automated Machine Learning For Supervised Learning (Part 1)

This article was published as a part of the Data Science Blogathon

This article aims to demonstrate automated Machine Learning, also referred to as AutoML. To be specific, AutoML will be applied to supervised learning problems, namely regression and classification on tabular data. This article does not discuss other kinds of Machine Learning problems, such as clustering, dimensionality reduction, time series forecasting, Natural Language Processing, recommender systems, or image analysis.

Understanding the problem statement and dataset

Before jumping to the AutoML, we will cover the basic knowledge of conventional Machine Learning workflow. After getting the dataset and understanding the problem statement, we need to identify the goal of the task. This article, as mentioned above, focuses on regression and classification tasks. So, make sure that the dataset is tabular. Other data formats, such as time series, spatial, image, or text, are not the main focus here.

Next, explore the dataset to understand some basic information, such as the:

Descriptive statistics (count, mean, standard deviation, minimum, maximum, and quartile) using .describe();

Data type of each feature using .info() or .dtypes;

Count of values using .value_counts();

Null value existence using .isnull().sum();

Correlation test using .corr();

etc.
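The exploration calls listed above can be tried on a tiny, made-up table. The column names and values here are assumptions for illustration only, not the dataset used in the notebooks.

```python
import pandas as pd

# Hypothetical passenger table with one missing age
df = pd.DataFrame({
    "age": [22, 38, None, 35, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "sex": ["male", "female", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0],
})

print(df.describe())                  # descriptive statistics of numeric columns
df.info()                             # column dtypes and non-null counts
sex_counts = df["sex"].value_counts() # count of each category
print(sex_counts)
null_counts = df.isnull().sum()       # missing values per column
print(null_counts)
# Correlation is only defined for numeric columns, so select them first
print(df.select_dtypes(include="number").corr())
```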

Pre-processing

After understanding the dataset, do the data pre-processing. This part is very important because it produces the training dataset used for fitting the Machine Learning model. Data pre-processing can start with handling the missing data. Users should decide whether to remove the observations with missing data or apply data imputation. Data imputation means filling the missing value with the average, median, a constant, or the most frequent value. Users should also look for outliers or bad data and remove them so that they do not add noise.
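The imputation strategies mentioned above can be sketched in plain Python as follows. The function name, the strategy labels, and the sample columns are hypothetical, and missing values are represented by None for simplicity.

```python
import statistics

def impute(column, strategy="median", fill_value=None):
    """Fill None entries using the observed (non-missing) values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

# Hypothetical columns with missing entries
ages = [22, 38, None, 35, 54, None]
ports = ["S", "C", None, "S"]

print(impute(ages, "median"))
print(impute(ages, "mean"))
print(impute(ports, "most_frequent"))
print(impute(ages, "constant", fill_value=0))
```

The median is less sensitive to extreme values than the mean, which is why it is a common default for skewed numeric features.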

Feature scaling is a very important process in data preprocessing. Feature scaling aims to scale the value range in each feature so that features with higher values and small variance do not dominate other features with low values and high variance. Some examples of feature scaling are standardization, normalization, log normalization, etc.

Feature scaling is suitable to apply to gradient descent- and distance-based Machine Learning algorithms. Tree-based algorithms do not need feature scaling. The following table shows the examples of algorithms.

Table 1 Examples of algorithms

| Machine Learning Type | Algorithms |
| --- | --- |
| Gradient descent-based | Linear Regression, Ridge Regression, Lasso Regression, Elasticnet Regression, Neural Network (Deep Learning) |
| Distance-based | K Nearest Neighbors, Support Vector Machine, K-means, Hierarchical clustering |
| Tree-based | Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting |

Notice that there are also clustering algorithms in the table. K-means and hierarchical clustering are unsupervised learning algorithms.
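To see why distance-based algorithms are sensitive to feature ranges, consider this small sketch. The houses and their values are invented, and min-max normalization is used as one possible scaling method.

```python
import math

def min_max_scale(column):
    """Rescale a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical houses: (area in square feet, number of bedrooms)
houses = [(1500, 1), (1520, 5), (1700, 1)]

# Raw distances from the first house: area dominates because of its range
raw_d01 = math.dist(houses[0], houses[1])
raw_d02 = math.dist(houses[0], houses[2])
print("raw:", round(raw_d01, 2), round(raw_d02, 2))

# Scale each feature separately, then recompute the distances
areas = min_max_scale([h[0] for h in houses])
bedrooms = min_max_scale([h[1] for h in houses])
scaled = list(zip(areas, bedrooms))
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])
print("scaled:", round(scaled_d01, 3), round(scaled_d02, 3))
```

Before scaling, the second house looks far closer to the first one even though it has four more bedrooms. After scaling, both features contribute on the same footing and the two distances become comparable.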

Feature engineering covers feature generation, selection, and extraction. These refer, respectively, to creating new features that are expected to help the prediction, removing low-importance or noisy features, and deriving new features by extracting partial information from, or combining, existing features. This part is very important because adding or removing features can improve model accuracy. Cutting the number of features can also reduce the running time.

Creating model, hyperparameter-tuning, and model evaluation

The main part of Machine Learning is choosing an algorithm and building a model with it. The algorithm needs the training dataset features, a target or label feature, and some hyperparameters as its arguments. After the model is built, it is used to predict the validation or test dataset to check the score. To improve the score, hyperparameter-tuning is performed. Hyperparameter-tuning is the activity of repeatedly changing the hyperparameters of a Machine Learning algorithm until a satisfactory model is obtained. The model is evaluated using scoring metrics, such as Root Mean Squared Error, Mean Squared Error, or R2 for regression problems and accuracy, Area Under the ROC Curve, or F1-score for classification problems. The model score is evaluated using cross-validation. To read more about hyperparameter-tuning, please find this article.
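For reference, the regression metrics named above can be computed by hand as in this sketch. The true and predicted values are invented and the function name is hypothetical.

```python
import math

def regression_scores(y_true, y_pred):
    """Compute MSE, RMSE and R2 for paired true and predicted values."""
    n = len(y_true)
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(squared_errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    total_variation = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(squared_errors) / total_variation
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical house prices (in thousands) and model predictions
y_true = [200, 250, 300, 350]
y_pred = [210, 240, 310, 340]
scores = regression_scores(y_true, y_pred)
print(scores)
```

RMSE is in the same unit as the target, which makes it easier to interpret than MSE, while R2 expresses how much of the variation in the target the model explains.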

After getting the optimum model with a set of hyperparameters, we may want to try other Machine Learning algorithms, along with the hyperparameter-tuning. There are many algorithms for regression and classification problems with their pros and cons. Different datasets have different Machine Learning algorithms to build the best prediction models. I have made notebooks containing a number of commonly used Machine Learning algorithms using the steps mentioned above. Please check it here:

The datasets are provided by Kaggle. The regression task is to predict house prices using the parameters of the houses. The notebook contains the algorithms: Linear Regression, Ridge Regression, Lasso Regression, Elastic-net Regression, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Light GBM, Extreme Gradient Boosting (XGBoost), and Neural Network (Deep Learning).

The binary classification task is to predict whether the Titanic passengers would survive or not. This is a newer dataset published in April 2021 (not the old Titanic dataset for Kaggle newcomers). The goal is to classify each observation into class “survived” or “not survived” without probability. If there are more than 2 classes, it is called multi-class classification. However, the techniques are similar. The notebook contains the algorithms: Logistic Regression, Naive Bayes, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting, and Neural Network (Deep Learning). Notice that some algorithms can perform both regression and classification.

Another notebook I created is to predict binary classification with probability. It predicts whether each observation of location, date, and time was in high traffic or not with probability. If the probability of being high traffic is, for example, 0.8, the probability of not being high traffic is 0.2. There is also multi-class classification with probability, which predicts a probability for each of more than two classes.

If you have seen my notebooks from the hyperlinks above, there are many algorithms used to build the prediction models for the same dataset. But which model should be used, since the models predict different outputs? The simplest way is to pick the model with the best score (lowest RMSE or highest accuracy). Or, we can use ensemble methods. Ensemble methods use multiple different machine learning algorithms for predicting the same dataset. The final output is determined by averaging the predicted outputs in regression or majority voting in classification. Actually, Random Forest, GBM, and XGBoost are also ensemble methods. However, they build many models of the same type, which is a Decision Tree, from different subsets of the training data.
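The two ways of combining predictions described above, averaging for regression and majority voting for classification, can be sketched like this. The predictions are invented and the function names are hypothetical.

```python
from collections import Counter
from statistics import fmean

def average_predictions(per_model_predictions):
    """Regression ensemble: average the models' outputs per observation."""
    return [fmean(preds) for preds in zip(*per_model_predictions)]

def majority_vote(per_model_predictions):
    """Classification ensemble: take the most common class per observation."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_predictions)]

# Hypothetical outputs of three models (one inner list per model)
price_predictions = [[200.0, 310.0], [220.0, 290.0], [210.0, 300.0]]
class_predictions = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]

averaged = average_predictions(price_predictions)
voted = majority_vote(class_predictions)
print(averaged)
print(voted)
```

Using an odd number of models avoids ties in binary majority voting.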

Finally, we can save the model if it is satisfying. The saved model can be loaded again in other notebooks to do the same prediction.

Fig. 1 Machine Learning Workflow. Source: created by the author

Automated Machine Learning

The process of building Machine Learning models and choosing the best one is long. It takes many lines of code and much time to complete. However, Data Science and Machine Learning are closely associated with automation, which brings us to automated Machine Learning, or autoML. AutoML needs only a few lines of code to do most, but not all, of the steps above. Figure 1 shows the workflow of Machine Learning. AutoML covers only the parts of data pre-processing, choosing the model, and hyperparameter-tuning. The users still have to understand the goals, explore the dataset, and prepare the data.

There are many autoML packages for regression and classification tasks on structured tabular data, as well as for image, text, and other predictions. Below is the code for one of the autoML packages, named Auto-Sklearn. The dataset is Titanic Survival, the same as in the previous notebooks. Auto-Sklearn was developed by Matthias Feurer, et al. (2015) in the paper “Efficient and Robust Automated Machine Learning”. Auto-Sklearn is openly available as a Python package. Sklearn, or Scikit-learn, is the common package for performing Machine Learning in the Python language. Almost all of the algorithms in the notebooks above are from Sklearn.

    # Install and import packages (the ! lines are notebook shell commands)
    !apt install -y build-essential swig curl
    !pip install auto-sklearn
    import pandas as pd
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    from autosklearn.classification import AutoSklearnClassifier

    # Create the AutoSklearnClassifier
    # (the variable is named sklearn here; it holds the autoML model, not the scikit-learn package)
    sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)

    # Fit the training data
    sklearn.fit(X_train, y_train)

    # Sprint Statistics
    print(sklearn.sprint_statistics())

    # Predict the validation data
    pred_sklearn = sklearn.predict(X_val)

    # Compute the accuracy
    print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))

Output:

    Dataset name: da588f6e-c217-11eb-802c-0242ac130202
    Metric: accuracy
    Best validation score: 0.769936
    Number of target algorithm runs: 26
    Number of successful target algorithm runs: 7
    Number of crashed target algorithm runs: 0
    Number of target algorithms that exceeded the time limit: 19
    Number of target algorithms that exceeded the memory limit: 0
    Accuracy: 0.7710593242331447

    # Prediction results
    print('Confusion Matrix')
    print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn)))
    print(classification_report(y_val, pred_sklearn))

Output:

    Confusion Matrix
          0     1
    0  8804  2215
    1  2196  6052
                  precision    recall  f1-score   support

               0       0.80      0.80      0.80     11019
               1       0.73      0.73      0.73      8248

        accuracy                           0.77     19267
       macro avg       0.77      0.77      0.77     19267
    weighted avg       0.77      0.77      0.77     19267

The code is set to run for 3 minutes (time_left_for_this_task=3*60), with no single algorithm running for more than 15 seconds (per_run_time_limit=15). See, with only a few lines, we can create a classification model automatically. We do not even need to think about which algorithm to use or which hyperparameters to set. Even a beginner in Machine Learning can do it right away. We can just get the final result. The code above ran 26 algorithms, but only 7 of them completed. The other 19 exceeded the set time limit. It achieves an accuracy of 0.771. To see which models were selected, run this line:

    print(sklearn.show_models())

The following code also uses Auto-Sklearn, but for a regression task. It develops an autoML model for the House Prices dataset. It finds a model with an RMSE of 28,130, with 16 of the 36 algorithm runs completing successfully.

    # Install and import packages (the ! lines are notebook shell commands)
    !apt install -y build-essential swig curl
    !pip install auto-sklearn
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_absolute_error, mean_squared_error as MSE
    from autosklearn.regression import AutoSklearnRegressor

    # Create the AutoSklearnRegressor
    sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)

    # Fit the training data
    sklearn.fit(X_train, y_train)

    # Sprint Statistics
    print(sklearn.sprint_statistics())

    # Predict the validation data
    pred_sklearn = sklearn.predict(X_val)

    # Compute the RMSE
    rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
    print('RMSE: ' + str(rmse_sklearn))

Output:

    Dataset name: 71040d02-c21a-11eb-803f-0242ac130202
    Metric: r2
    Best validation score: 0.888788
    Number of target algorithm runs: 36
    Number of successful target algorithm runs: 16
    Number of crashed target algorithm runs: 1
    Number of target algorithms that exceeded the time limit: 15
    Number of target algorithms that exceeded the memory limit: 4
    RMSE: 28130.17557050461

    # Scatter plot true and predicted values
    plt.scatter(pred_sklearn, y_val, alpha=0.2)
    plt.xlabel('predicted')
    plt.ylabel('true value')
    plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
    plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
    plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(pred_sklearn, y_val)[0,1],2)))
    plt.show()

Output:

Fig. 2 Scatter plot from autoSklearnRegressor. Source: created by the author

So, do you think that Machine Learning Scientists/Engineers are still needed?

There are still other autoML packages to discuss, like Hyperopt-Sklearn, Tree-based Pipeline Optimization Tool (TPOT), AutoKeras, MLJAR, and so on. But we will discuss them in part 2.
