Automated Machine Learning For Supervised Learning (Part 1)


This article was published as a part of the Data Science Blogathon                      

This article aims to demonstrate automated Machine Learning, also referred to as AutoML. Specifically, AutoML will be applied to problem statements requiring supervised learning, such as regression and classification on tabular data. This article does not discuss other kinds of Machine Learning problems, such as clustering, dimensionality reduction, time series forecasting, Natural Language Processing, recommendation systems, or image analysis.

Understanding the problem statement and dataset

Before jumping to AutoML, we will cover the basics of the conventional Machine Learning workflow. After getting the dataset and understanding the problem statement, we need to identify the goal of the task. This article, as mentioned above, focuses on regression and classification tasks, so make sure that the dataset is tabular. Other data formats, such as time series, spatial, image, or text, are not the main focus here.

Next, explore the dataset to understand some basic information (a short code sketch follows this list), such as the:

Descriptive statistics (count, mean, standard deviation, minimum, maximum, and quartile) using .describe();

Data type of each feature using .info() or .dtypes;

Count of values using .value_counts();

Null value existence using .isnull().sum();

Correlation test using .corr();

etc.
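For illustration, a minimal exploration sketch using pandas might look like the following (df is a hypothetical DataFrame name, and 'train.csv' and 'target' are placeholder names, not from the original notebooks):

import pandas as pd

# Hypothetical example: load a tabular dataset into a DataFrame
df = pd.read_csv('train.csv')

print(df.describe())                 # descriptive statistics (count, mean, std, min, max, quartiles)
df.info()                            # data type of each feature and non-null counts
print(df['target'].value_counts())   # count of values in one column
print(df.isnull().sum())             # number of null values per feature
print(df.corr(numeric_only=True))    # pairwise correlation of numeric features (recent pandas versions)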

 

Pre-processing

After understanding the dataset, do the data pre-processing. This part is very important because it produces the training dataset used for Machine Learning fitting. Data pre-processing can start with handling missing data. Users should decide whether to remove observations with missing data or apply data imputation. Data imputation means filling the missing values with the mean, median, a constant, or the most frequent value. Users should also pay attention to outliers or bad data and remove them so that they do not become noise.
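As a sketch of what imputation could look like with scikit-learn (continuing with the hypothetical DataFrame df; the column names 'Age' and 'Embarked' are illustrative assumptions):

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')         # fill numeric gaps with the median
cat_imputer = SimpleImputer(strategy='most_frequent')  # fill categorical gaps with the most frequent value

df[['Age']] = num_imputer.fit_transform(df[['Age']])
df[['Embarked']] = cat_imputer.fit_transform(df[['Embarked']])

# Alternatively, simply drop rows that contain missing values
df_dropped = df.dropna()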

Feature scaling is a very important process in data preprocessing. Feature scaling adjusts the value range of each feature so that features with large value ranges do not dominate features with small value ranges. Some examples of feature scaling are standardization, normalization, log normalization, etc.
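A minimal scaling sketch with scikit-learn (X_train and X_val are assumed to be the training and validation feature sets):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: zero mean, unit variance
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_val_std = std_scaler.transform(X_val)      # reuse the statistics learned from the training set

# Normalization: rescale each feature to the [0, 1] range
minmax_scaler = MinMaxScaler()
X_train_norm = minmax_scaler.fit_transform(X_train)
X_val_norm = minmax_scaler.transform(X_val)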

Feature scaling is suitable for gradient descent-based and distance-based Machine Learning algorithms. Tree-based algorithms do not need feature scaling. The following table shows examples of algorithms.

Table 1. Examples of algorithms

| Machine Learning Type | Algorithms |
| --- | --- |
| Gradient descent-based | Linear Regression, Ridge Regression, Lasso Regression, Elasticnet Regression, Neural Network (Deep Learning) |
| Distance-based | K Nearest Neighbors, Support Vector Machine, K-means, Hierarchical clustering |
| Tree-based | Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting |

Notice that there are also clustering algorithms in the table. K-means and hierarchical clustering are unsupervised learning algorithms.

Feature engineering covers feature generation, selection, and extraction: creating new features (expected to help the prediction), removing low-importance features or noise, and deriving new features from partial information of existing features, respectively. This part is very important because adding or removing features can improve model accuracy. Cutting the number of features can also reduce the running time.
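As a small, hypothetical sketch of feature generation and selection (the column names below are assumptions, not taken from the original notebooks):

# Feature generation: derive a new feature from existing ones
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1   # hypothetical Titanic-style columns

# Feature selection: drop a feature judged to be noise or of low importance
df = df.drop(columns=['Cabin'])                    # 'Cabin' is a hypothetical example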

Creating model, hyperparameter-tuning, and model evaluation

The main part of Machine Learning is choosing an algorithm and building a model with it. The algorithm needs the training dataset features, a target or label feature, and some hyperparameters as arguments. After the model is built, it is used to predict the validation or test dataset to check the score. To improve the score, hyperparameter-tuning is performed. Hyperparameter-tuning is the activity of repeatedly changing the hyperparameter(s) of a Machine Learning algorithm until a satisfactory model is obtained with a set of hyperparameters. The model is evaluated using scoring metrics, such as Root Mean Squared Error, Mean Squared Error, or R2 for regression problems and accuracy, Area Under the ROC Curve, or F1-score for classification problems. The model score is evaluated using cross-validation. To read more about hyperparameter-tuning, please see this article.
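A minimal hyperparameter-tuning sketch with cross-validation, using scikit-learn's GridSearchCV (the parameter grid and the choice of Random Forest are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # mean cross-validated accuracy of the best model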

After getting the optimum model with a set of hyperparameters, we may want to try other Machine Learning algorithms, along with hyperparameter-tuning. There are many algorithms for regression and classification problems, each with pros and cons. Different datasets need different Machine Learning algorithms to build the best prediction models. I have made notebooks containing a number of commonly used Machine Learning algorithms following the steps mentioned above. Please check them here:

The datasets are provided by Kaggle. The regression task is to predict house prices using the parameters of the houses. The notebook contains the algorithms: Linear Regression, Ridge Regression, Lasso Regression, Elastic-net Regression, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Light GBM, Extreme Gradient Boosting (XGBoost), and Neural Network (Deep Learning).

The binary classification task is to predict whether the Titanic passengers would survive or not. This is a newer dataset published just in April 2023 (not the old Titanic dataset for Kaggle newcomers). The goal is to classify each observation into the class “survived” or “not survived” without probability. If there are more than 2 classes, it is called multi-class classification; however, the techniques are similar. The notebook contains the algorithms: Logistic Regression, Naive Bayes, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting, and Neural Network (Deep Learning). Notice that some algorithms can perform both regression and classification.

Another notebook I created predicts a binary classification with probability. It predicts whether each observation of location, date, and time was in high traffic or not, with a probability. If the probability of being high traffic is, for example, 0.8, the probability of not being high traffic is 0.2. There is also multi-class classification, which predicts the probabilities of more than two classes.

If you have seen my notebooks from the hyperlinks above, there are many algorithms used to build prediction models for the same dataset. But which model should be used, since the models predict different outputs? The simplest way is to pick the model with the best score (lowest RMSE or highest accuracy). Or, we can apply ensemble methods. Ensemble methods use multiple different Machine Learning models to predict the same dataset. The final output is determined by averaging the predicted outputs for regression or by majority voting for classification. Actually, Random Forest, GBM, and XGBoost are also ensemble methods, but they build many models of the same type (Decision Trees) from different subsets of the training data.
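A minimal sketch of a voting ensemble with scikit-learn (the choice of base models here is just an illustration, not the combination used in the notebooks):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Combine three different algorithms; the majority vote decides the final class
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(random_state=42)),
    ('knn', KNeighborsClassifier())
], voting='hard')

ensemble.fit(X_train, y_train)
pred_ensemble = ensemble.predict(X_val)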

Finally, we can save the model if it is satisfactory. The saved model can be loaded again in other notebooks to do the same prediction.
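A common way to save and reload a fitted scikit-learn model is with joblib (a sketch, assuming a fitted model object named model; the file name is a placeholder):

import joblib

joblib.dump(model, 'final_model.joblib')            # save the fitted model to disk
loaded_model = joblib.load('final_model.joblib')    # reload it later in another notebook
new_predictions = loaded_model.predict(X_val)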

Fig. 1 Machine Learning Workflow. Source: created by the author

 

Automated Machine Learning

The process of building Machine Learning models and choosing the best one is very long. It takes many lines of code and much time to complete. However, Data Science and Machine Learning are associated with automation. Hence, we have automated Machine Learning, or AutoML. AutoML needs only a few lines of code to do most of the steps above, but not all of them. Figure 1 shows the workflow of Machine Learning. AutoML covers only data pre-processing, model selection, and hyperparameter-tuning. The users still have to understand the goals, explore the dataset, and prepare the data.

There are many AutoML packages for regression and classification tasks on structured tabular data, images, text, and other predictions. Below is the code for one of the AutoML packages, named Auto-Sklearn. The dataset is Titanic Survival, the same as in the previous notebooks. Auto-Sklearn was developed by Matthias Feurer et al. (2015) in the paper “Efficient and Robust Automated Machine Learning”. Auto-Sklearn is available as an open-source Python package. Yes, Sklearn, or Scikit-learn, is the common package for performing Machine Learning in the Python language; almost all of the algorithms in the notebooks above come from Sklearn.

# Install and import packages
!apt install -y build-essential swig curl
!pip install auto-sklearn
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import accuracy_score

# Create the AutoSklearnClassifier
sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))

Output:

Dataset name: da588f6e-c217-11eb-802c-0242ac130202
Metric: accuracy
Best validation score: 0.769936
Number of target algorithm runs: 26
Number of successful target algorithm runs: 7
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 19
Number of target algorithms that exceeded the memory limit: 0
Accuracy: 0.7710593242331447

# Prediction results
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn)))
print(classification_report(y_val, pred_sklearn))

Output:

Confusion Matrix
      0     1
0  8804  2215
1  2196  6052

              precision    recall  f1-score   support
           0       0.80      0.80      0.80     11019
           1       0.73      0.73      0.73      8248
    accuracy                           0.77     19267
   macro avg       0.77      0.77      0.77     19267
weighted avg       0.77      0.77      0.77     19267

The code is set to run for 3 minutes, with no single algorithm run taking more than 15 seconds (the per_run_time_limit argument). See, with only a few lines, we can create a classification model automatically. We do not even need to think about which algorithm to use or which hyperparameters to set; even a beginner in Machine Learning can do it right away and just get the final result. The code above ran 26 algorithms, but only 7 of them completed; the other 19 exceeded the set time limit. It achieved an accuracy of 0.771. To see which models were selected, run this line:

print(sklearn.show_models())

The following code is also Auto-Sklearn, but for a regression task. It builds an AutoML model to predict the House Prices dataset. It finds a model with an RMSE of 28,130 from 16 successful algorithms out of 36 in total.

# Install and import packages
!apt install -y build-essential swig curl
!pip install auto-sklearn
from autosklearn.regression import AutoSklearnRegressor
from sklearn.metrics import mean_squared_error as MSE

# Create the AutoSklearnRegressor
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the RMSE
rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
print('RMSE: ' + str(rmse_sklearn))

Output:

Dataset name: 71040d02-c21a-11eb-803f-0242ac130202
Metric: r2
Best validation score: 0.888788
Number of target algorithm runs: 36
Number of successful target algorithm runs: 16
Number of crashed target algorithm runs: 1
Number of target algorithms that exceeded the time limit: 15
Number of target algorithms that exceeded the memory limit: 4
RMSE: 28130.17557050461

# Scatter plot of true and predicted values
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
plt.scatter(pred_sklearn, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(pred_sklearn, y_val)[0,1], 2)))
plt.show()

Output:


Fig. 2 Scatter plot from autoSklearnRegressor. Source: created by the author

So, do you think that Machine Learning Scientists/Engineers are still needed?

There are still other AutoML packages to discuss, like Hyperopt-Sklearn, Tree-based Pipeline Optimization Tool (TPOT), AutoKeras, MLJAR, and so on. But we will discuss them in part 2.

About Author

Connect with me here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.



Google Colab For Machine Learning And Deep Learning

“Memory Error” – that all too familiar dreaded message in Jupyter notebooks when we try to execute a machine learning or deep learning algorithm on a large dataset. Most of us do not have access to unlimited computational power on our machines. And let’s face it, it costs an arm and a leg to get a decent GPU from existing cloud providers. So how do we build large deep learning models without burning a hole in our pockets? Step up – Google Colab!

It’s an incredible online browser-based platform that allows us to train our models on machines for free! Sounds too good to be true, but thanks to Google, we can now work with large datasets, build complex models, and even share our work seamlessly with others. That’s the power of Google Colab.

What is Google Colab?

Google Colaboratory is a free online cloud-based Jupyter notebook environment that allows us to train our machine learning and deep learning models on CPUs, GPUs, and TPUs.

Here’s what I truly love about Colab. It does not matter which computer you have, what its configuration is, or how ancient it might be. You can still use Google Colab! All you need is a Google account and a web browser. And here’s the cherry on top – you get access to GPUs like the Tesla K80 and even a TPU, for free!

TPUs are much more expensive than GPUs, and you can use them for free on Colab. It’s worth repeating again and again – it’s an offering like no other.

Are you still using that same old Jupyter notebook on your system for training models? Trust me, you’re going to love Google Colab.

What is a Notebook in Google Colab?

Google Colab Features

Colab provides users free access to GPUs and TPUs, which can significantly speed up the training and inference of machine learning and deep learning models.

Colab’s interface is web-based, so installing any software on your local machine is unnecessary. The interface is also intuitive and user-friendly, making it easy to get started with coding.

Colab allows multiple users to work on the same notebook simultaneously, making collaborating with team members easy. Colab also integrates with other Google services, such as Google Drive and GitHub, making it easy to share your work.

Colab notebooks support markdown, which allows you to include formatted text, equations, and images alongside your code. This makes it easier to document your work and communicate your ideas.

Colab comes pre-installed with many popular libraries and tools for machine learning and deep learning, such as TensorFlow and PyTorch. This saves time and eliminates the need to manually install and configure these tools.

GPUs and TPUs on Google Colab

Ask anyone who uses Colab why they love it. The answer is unanimous – the availability of free GPUs and TPUs. Training models, especially deep learning ones, takes numerous hours on a CPU. We’ve all faced this issue on our local machines. GPUs and TPUs, on the other hand, can train these models in a matter of minutes or seconds.

If you still need a reason to work with GPUs, check out this excellent explanation by Faizan Shaikh.

It gives you a decent GPU for free, which you can continuously run for 12 hours. For most data science folks, this is sufficient to meet their computation needs. Especially if you are a beginner, then I would highly recommend you start using Google Colab.

Google Colab gives us three types of runtime for our notebooks:

CPUs,

GPUs, and

TPUs

As I mentioned, Colab gives us 12 hours of continuous execution time. After that, the whole virtual machine is cleared and we have to start again. We can run multiple CPU, GPU, and TPU instances simultaneously, but our resources are shared between these instances.

Let’s take a look at the specifications of different runtimes offered by Google Colab:

It will cost you A LOT to buy a GPU or TPU from the market. Why not save that money and use Google Colab from the comfort of your own machine?

How to Use Google Colab?

You can go to Google Colab using this link. This is the screen you’ll get when you open Colab:

You can also import your notebook from Google Drive or GitHub, but they require an authentication process.

Google Colab Runtimes – Choosing the GPU or TPU Option

The ability to choose different types of runtimes is what makes Colab so popular and powerful. Here are the steps to change the runtime of your notebook:

Step 2: Here you can change the runtime according to your need:

A wise man once said, “With great power comes great responsibility.” I implore you to shut down your notebook after you have completed your work so that others can use these resources because various users share them. You can terminate your notebook like this:

Using Terminal Commands on Google Colab

You can use the Colab cell for running terminal commands. Most of the popular libraries come installed by default on Google Colab. Yes, Python libraries like Pandas, NumPy, scikit-learn are all pre-installed.

If you want to run a different Python library, you can always install it inside your Colab notebook like this:

!pip install library_name

Pretty easy, right? Everything is similar to how it works in a regular terminal. We just have to put an exclamation mark (!) before each command, like:

!ls

or:

!pwd

Cloning Repositories in Google Colab

You can also clone a Git repo inside Google Colaboratory. Just go to your GitHub repository and copy the clone link of the repository:

Then, simply run:
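For example (the URL below is a placeholder for the clone link you copied from your repository):

!git clone https://github.com/<username>/<repository>.git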

And there you go!

Uploading Files and Datasets

Here’s a must-know aspect for any data scientist. The ability to import your dataset into Colab is the first step in your data analysis journey.

The most basic approach is to upload your dataset to Colab directly:

You can also upload your dataset to any other platform and access it using its link. I tend to go with the second approach more often than not (when feasible).

Saving Your Notebook

All the notebooks on Colab are stored on your Google Drive. The best thing about Colab is that your notebook is automatically saved after a certain time period and you don’t lose your progress.

If you want, you can export and save your notebook in both *.py and *.ipynb formats:

Not just that, you can also save a copy of your notebook directly on GitHub, or you can create a GitHub Gist:

I love the variety of options we get.

Exporting Data/Files from Google Colab

You can export your files directly to Google Drive, or you can export it to the VM instance and download it by yourself:

Exporting directly to the Drive is a better option when you have bigger files or more than one file. You’ll pick up these nuances as you work on bigger projects in Colab.

Sharing Your Notebook

Google Colab also gives us an easy way of sharing our work with others. This is one of the best things about Colab:

What’s Next?

Google Colab now also provides a paid platform called Google Colab Pro, priced at $9.99 a month. In this plan, you can get the Tesla T4 or Tesla P100 GPU, and an option of selecting an instance with a high RAM of around 27 GB. Also, your maximum computation time is doubled from 12 hours to 24 hours. How cool is that?

You can consider this plan if you need high computation power because it is still quite cheap when compared to other cloud GPU providers like AWS, Azure, and even GCP.

Recommendations

If you’re new to the world of Deep Learning, I have some excellent resources to help you get started in a comprehensive and structured manner:


Toolset For Using Machine Learning Without Matlab

Although Matlab is a popular programming language in the field of machine learning, it is expensive. Nowadays, many programmers are looking for substitute toolkits to build machine learning algorithms. Thankfully, there are a number of open-source, economical solutions that can provide comparable features. This post will examine some of the top toolkits for employing machine learning outside of Matlab, including R packages like caret and randomForest as well as Python libraries like scikit-learn and TensorFlow.

List of toolsets

There are many tools available for using machine learning without MATLAB. Here are some popular options −

1. Python

Python is a powerful and flexible programming language that has gained popularity for application in data analysis and machine learning. There are a number of machine-learning frameworks and tools that have been developed using this free and open-source language, which has a substantial and active development community.

One well-known Python machine learning library is PyTorch. Facebook created PyTorch, an open-source machine learning framework that offers a powerful tensor library for deep learning. Compared to rival frameworks, it is more adaptable and user-friendly due to its dynamic computational graph.

Scikit-learn is another popular Python machine-learning package. It is a straightforward and effective data mining and data analysis tool that offers a variety of supervised and unsupervised learning methods for applications like classification, regression, and clustering.

Together with these libraries, Python also provides a wide range of additional beneficial machine-learning tools including Keras, Theano, and Pandas. Theano is a deep learning framework for numerical computing, Pandas is a data manipulation library that offers data structures for effective data analysis, and Keras is a high-level neural network library.

Generally, Python’s appeal in machine learning may be attributed to its simplicity, adaptability, and abundance of libraries and frameworks. Building and training machine learning models as well as analyzing and manipulating data for diverse applications are made simpler by these tools and frameworks.

2. R

R is a software environment and programming language for statistical computation and graphics. It also features several packages, like caret and randomForest, and is frequently used for machine learning applications.

R is a widely used programming language and computing environment for statistical computation and graphics. It has grown to be a popular option for data analysis, machine learning, and statistical modeling thanks to its large library of statistical and graphical tools.

R’s extensive library of packages created especially for data analysis and machine learning is one of the key factors contributing to its appeal in machine learning. Caret and randomForest are two of these tools that are frequently used for machine learning.

The R package Caret (Classification And Regression Training) offers a uniform interface for training and fine-tuning a wide range of machine-learning models. It supports a broad range of methods, including linear and nonlinear regression, decision trees, and support vector machines, and provides functions for data splitting, preprocessing, feature selection, and model training.

Another well-liked R package, randomForest, provides an implementation of the random forest technique for classification and regression problems. Because of its capacity to manage high-dimensional data, cope with missing values, and handle relationships between variables, it is a preferred option for machine learning applications.

R features a wide range of other helpful machine learning packages, like the caretEnsemble package, which offers tools for merging several machine learning models, and the glmnet package, which offers effective generalized linear model implementations.

Overall, R’s large library of packages for statistical computation and data analysis makes it a popular language for machine learning.

3. RapidMiner

An integrated environment for model deployment, machine learning, and data preparation is provided by the data science platform RapidMiner. The interface is drag-and-drop and supports a wide range of data sources and formats.

Model deployment, machine learning, and data preparation can all be done in one integrated environment with the help of the potent data science platform RapidMiner. It seeks to simplify for users the processes of data collecting, machine learning model construction, and application in real-world scenarios.

An essential aspect is RapidMiner’s ability to automate the machine learning process using workflows. The best-performing models may be generated, tested, and put into production quickly and easily via a number of approaches.

Overall, RapidMiner is a capable and flexible data science platform that can be used for many machine learning and data analysis tasks. It is a well-liked option for both novice and experienced users because of its user-friendly drag-and-drop interface, wide selection of machine-learning algorithms, and compatibility with a number of data sources and formats.

4. KNIME

An open-source platform for data analytics called KNIME offers a graphical user interface for creating data pipelines and processes. It may be expanded with plugins and customized nodes in addition to having several built-in nodes for data preparation, machine learning, and visualization.

An open-source platform for data analytics called KNIME offers a visual interface for creating data pipelines and processes. Even those without programming skills may use it easily, yet it nonetheless has cutting-edge features for applications involving machine learning and data analytics.

Moreover, KNIME enables distinctive nodes and plugins that are developed and shared by the user base. Now, users may enhance platform features to meet their own demands.

KNIME’s capacity to interact with various platforms and data sources like Hadoop, Spark, and R is another important aspect. As a result, working with big, complicated datasets and incorporating KNIME into current data ecosystems are made simple.

A variety of machine learning methods, such as decision trees, clustering, and regression models, are offered by KNIME. With the use of a straightforward drag-and-drop interface, these can be set up, trained, and then applied to fresh data inside the platform.

Last but not least, KNIME offers a wide selection of charts, graphs, and other visualizations as part of its rich support for data visualization. This enables users to study and comprehend their data in a number of ways and successfully share their conclusions with others.

Conclusion

Python is a powerful and flexible programming language that has gained popularity in data analysis and machine learning due to its simplicity, adaptability, and abundance of libraries and frameworks. RapidMiner provides an integrated environment for model deployment, machine learning, and data preparation, and KNIME offers a graphical user interface for creating data pipelines and processes. KNIME is a powerful and adaptable framework for data analytics that is suitable for both novice and expert users due to its large library of built-in nodes, support for new nodes, and plugins.

Stock Prices Prediction Using Machine Learning And Deep Learning


Introduction

Predicting how the stock market will perform is one of the most difficult things to do. There are so many factors involved in the prediction – physical factors vs. psychological, rational and irrational behavior, etc. All these aspects combine to make share prices volatile and very difficult to predict with a high degree of accuracy.

Can we use machine learning as a game-changer in this domain? Using features like the latest announcements about an organization, their quarterly revenue results, etc., machine learning techniques have the potential to unearth patterns and insights we didn’t see before, and these can be used to make unerringly accurate predictions.

The core idea behind this article is to showcase how these algorithms are implemented. I will briefly describe the technique and provide relevant links to brush up on the concepts as and when necessary. In case you’re a newcomer to the world of time series, I suggest going through the following articles first:

Are you a beginner looking for a place to start your data science journey? Presenting a comprehensive course, full of knowledge and data science learning, curated just for you! This course covers everything from basics of Machine Learning to Advanced concepts of ML, Deep Learning and Time series.

Understanding the Problem Statement

We’ll dive into the implementation part of this article soon, but first it’s important to establish what we’re aiming to solve. Broadly, stock market analysis is divided into two parts – Fundamental Analysis and Technical Analysis.

Fundamental Analysis involves analyzing the company’s future profitability on the basis of its current business environment and financial performance.

Technical Analysis, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market.

As you might have guessed, our focus will be on the technical analysis part. We’ll be using a dataset from Quandl (you can find historical data for various stocks here) and for this particular project, I have used the data for ‘Tata Global Beverages’. Time to dive in!

Note: Here is the dataset I used for the code: Download

We will first load the dataset and define the target variable for the problem:

Python Code:
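The embedded code cell from the original article is not reproduced here; as a rough sketch of what this step does (the file name 'stock_data.csv' is a placeholder, not the original), it might look like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the historical stock price data into a DataFrame
df = pd.read_csv('stock_data.csv')

# Inspect the first few rows; the closing price will be the target variable
print(df.head())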



There are multiple variables in the dataset – date, open, high, low, last, close, total_trade_quantity, and turnover.

The columns Open and Close represent the starting and final price at which the stock is traded on a particular day.

High, Low and Last represent the maximum, minimum, and last price of the share for the day.

Total Trade Quantity is the number of shares bought or sold in the day and Turnover (Lacs) is the turnover of the particular company on a given date.

Another important thing to note is that the market is closed on weekends and public holidays. Notice the above table again: some date values are missing – 2/10/2023, 6/10/2023, 7/10/2023. Of these dates, the 2nd is a national holiday while the 6th and 7th fall on a weekend.

The profit or loss calculation is usually determined by the closing price of a stock for the day, hence we will consider the closing price as the target variable. Let’s plot the target variable to understand how it’s shaping up in our data:

#setting index as date
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
df.index = df['Date']

#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')

In the upcoming sections, we will explore these variables and use different techniques to predict the daily closing price of the stock.

Moving Average Introduction

‘Average’ is easily one of the most common things we use in our day-to-day lives. For instance, calculating the average marks to determine overall performance, or finding the average temperature of the past few days to get an idea about today’s temperature – these all are routine tasks we do on a regular basis. So this is a good starting point to use on our dataset for making predictions.

The predicted closing price for each day will be the average of a set of previously observed values. Instead of using the simple average, we will be using the moving average technique which uses the latest set of values for each prediction. In other words, for each subsequent step, the predicted values are taken into consideration while removing the oldest observed value from the set. Here is a simple figure that will help you understand this with more clarity.

We will implement this technique on our dataset. The first step is to create a dataframe that contains only the Date and Close price columns, then split it into train and validation sets to verify our predictions.

Implementation
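The original implementation cell is not shown in this copy of the article; below is a minimal sketch of the moving-average approach described above, under the assumption of a 987-row training split (used later in this article) and a 248-observation averaging window (the size of the validation set):

import numpy as np

# Create a dataframe with only Date and Close, then split into train and validation
data = df.sort_index(ascending=True, axis=0)
new_data = data[['Date', 'Close']].reset_index(drop=True)

train = new_data[:987]
valid = new_data[987:]

# Predict each validation value as the average of the preceding 248 values
preds = []
for i in range(0, valid.shape[0]):
    a = train['Close'][len(train) - 248 + i:].sum() + sum(preds)
    b = a / 248
    preds.append(b)

# RMSE of the moving-average predictions
rms = np.sqrt(np.mean(np.power((np.array(valid['Close']) - preds), 2)))
print(rms)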



Just checking the RMSE does not help us in understanding how the model performed. Let’s visualize this to get a more intuitive understanding. So here is a plot of the predicted values along with the actual values.

#plot
valid['Predictions'] = 0
valid['Predictions'] = preds
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])

Inference

The RMSE value is close to 105 but the results are not very promising (as you can gather from the plot). The predicted values are of the same range as the observed values in the train set (there is an increasing trend initially and then a slow decrease).

In the next section, we will look at two commonly used machine learning techniques – Linear Regression and kNN, and see how they perform on our stock market data.

Linear Regression Introduction

The most basic machine learning algorithm that can be implemented on this data is linear regression. The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable.

The equation for linear regression can be written as: y = θ1 x1 + θ2 x2 + … + θn xn

Here, x1, x2,….xn represent the independent variables while the coefficients θ1, θ2, …. θn  represent the weights. You can refer to the following article to study linear regression in more detail:

For our problem statement, we do not have a set of independent variables. We have only the dates instead. Let us use the date column to extract features like – day, month, year,  mon/fri etc. and then fit a linear regression model.

Implementation

We will first sort the dataset in ascending order and then create a separate dataset so that any new feature created does not affect the original data.

#setting index as date values
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
df.index = df['Date']

#sorting
data = df.sort_index(ascending=True, axis=0)

#creating a separate dataset
new_data = pd.DataFrame(index=range(0,len(df)), columns=['Date', 'Close'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]

#create features
from fastai.structured import add_datepart
add_datepart(new_data, 'Date')
new_data.drop('Elapsed', axis=1, inplace=True)  #elapsed will be the time stamp

This creates features such as:

‘Year’, ‘Month’, ‘Week’, ‘Day’, ‘Dayofweek’, ‘Dayofyear’, ‘Is_month_end’, ‘Is_month_start’, ‘Is_quarter_end’, ‘Is_quarter_start’,  ‘Is_year_end’, and  ‘Is_year_start’.

Note: I have used add_datepart from the fastai library. If you do not have it installed, you can simply use the command pip install fastai. Otherwise, you can create these features using simple for loops in Python; an example is shown below.

Apart from this, we can add our own set of features that we believe would be relevant for the predictions. For instance, my hypothesis is that the first and last days of the week could potentially affect the closing price of the stock far more than the other days. So I have created a feature that identifies whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday. This can be done using the following lines of code:

new_data['mon_fri'] = 0
for i in range(0,len(new_data)):
    if (new_data['Dayofweek'][i] == 0 or new_data['Dayofweek'][i] == 4):
        new_data['mon_fri'][i] = 1
    else:
        new_data['mon_fri'][i] = 0

We will now split the data into train and validation sets to check the performance of the model.

#split into train and validation
train = new_data[:987]
valid = new_data[987:]

x_train = train.drop('Close', axis=1)
y_train = train['Close']
x_valid = valid.drop('Close', axis=1)
y_valid = valid['Close']

#implement linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

Results

#make predictions and find the rmse
preds = model.predict(x_valid)
rms = np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
rms
121.16291596523156

The RMSE value is higher than the previous technique, which clearly shows that linear regression has performed poorly. Let’s look at the plot and understand why linear regression has not done well:

#plot
valid['Predictions'] = 0
valid['Predictions'] = preds

valid.index = new_data[987:].index
train.index = new_data[:987].index

plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])

Inference

As seen from the plot above, there was a drop in the stock price in January of the two previous years, and the model has predicted the same for the following January. A linear regression technique can perform well for problems such as Big Mart sales, where the independent features are useful for determining the target value.

k-Nearest Neighbours Introduction

Another interesting ML algorithm that one can use here is kNN (k nearest neighbours). Based on the independent variables, kNN finds the similarity between new data points and old data points. Let me explain this with a simple example.

Consider the height and age for 11 people. On the basis of given features (‘Age’ and ‘Height’), the table can be represented in a graphical format as shown below:

To determine the weight for ID #11, kNN considers the weights of this ID’s nearest neighbors. The weight of ID #11 is predicted to be the average of its neighbors. If we consider three neighbours (k=3) for now, the weight for ID #11 would be = (77+72+60)/3 = 69.66 kg.

For a detailed understanding of kNN, you can refer to the following articles:

Implementation

#importing libraries
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

Using the same train and validation set from the last section:

#scaling data
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)
x_valid_scaled = scaler.fit_transform(x_valid)
x_valid = pd.DataFrame(x_valid_scaled)

#using gridsearch to find the best parameter
params = {'n_neighbors':[2,3,4,5,6,7,8,9]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)

#fit the model and make predictions
model.fit(x_train, y_train)
preds = model.predict(x_valid)

Results

#rmse
rms = np.sqrt(np.mean(np.power((np.array(y_valid)-np.array(preds)),2)))
rms
115.17086550026721

There is not a huge difference in the RMSE value, but a plot for the predicted and actual values should provide a more clear understanding.

#plot
valid['Predictions'] = 0
valid['Predictions'] = preds
plt.plot(valid[['Close', 'Predictions']])
plt.plot(train['Close'])

Inference

The RMSE value is almost similar to the linear regression model and the plot shows the same pattern. Like linear regression, kNN also identified a drop in January since that has been the pattern in past years. We can safely say that regression algorithms have not performed well on this dataset.

Let’s go ahead and look at some time series forecasting techniques to find out how they perform when faced with this stock prices prediction challenge.

Auto ARIMA Introduction

ARIMA is a very popular statistical method for time series forecasting. ARIMA models take into account the past values to predict the future values. There are three important parameters in ARIMA:

p (past values used for forecasting the next value)

q (past forecast errors used to predict the future values)

d (order of differencing)

Parameter tuning for ARIMA consumes a lot of time. So we will use auto ARIMA which automatically selects the best combination of (p,q,d) that provides the least error. To read more about how auto ARIMA works, refer to this article:

Implementation

from pyramid.arima import auto_arima

data = df.sort_index(ascending=True, axis=0)
train = data[:987]
valid = data[987:]

training = train['Close']
validation = valid['Close']

model = auto_arima(training, start_p=1, start_q=1, max_p=3, max_q=3, m=12, start_P=0, seasonal=True, d=1, D=1, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(training)

forecast = model.predict(n_periods=248)
forecast = pd.DataFrame(forecast, index=valid.index, columns=['Prediction'])

Results

rms = np.sqrt(np.mean(np.power((np.array(valid['Close'])-np.array(forecast['Prediction'])),2)))
rms
44.954584993246954

#plot
plt.plot(train['Close'])
plt.plot(valid['Close'])
plt.plot(forecast['Prediction'])

Inference

As we saw earlier, an auto ARIMA model uses past data to understand the pattern in the time series. Using these values, the model captured an increasing trend in the series. Although the predictions using this technique are far better than that of the previously implemented machine learning models, these predictions are still not close to the real values.

As is evident from the plot, the model has captured a trend in the series but does not focus on the seasonal part. In the next section, we will implement a time series model that takes both the trend and seasonality of a series into account.

Prophet Introduction

There are a number of time series techniques that can be implemented on the stock prediction dataset, but most of these techniques require a lot of data preprocessing before fitting the model. Prophet, designed and pioneered by Facebook, is a time series forecasting library that requires no data preprocessing and is extremely simple to implement. The input for Prophet is a dataframe with two columns: date and target (ds and y).

Prophet tries to capture the seasonality in the past data and works well when the dataset is large. Here is an interesting article that explains Prophet in a simple and intuitive manner:

Implementation

#importing prophet
from fbprophet import Prophet

#creating dataframe
new_data = pd.DataFrame(index=range(0,len(df)), columns=['Date', 'Close'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]

new_data['Date'] = pd.to_datetime(new_data.Date, format='%Y-%m-%d')
new_data.index = new_data['Date']

#preparing data
new_data.rename(columns={'Close': 'y', 'Date': 'ds'}, inplace=True)

#train and validation
train = new_data[:987]
valid = new_data[987:]

#fit the model
model = Prophet()
model.fit(train)

#predictions
close_prices = model.make_future_dataframe(periods=len(valid))
forecast = model.predict(close_prices)

Results

#rmse
forecast_valid = forecast['yhat'][987:]
rms = np.sqrt(np.mean(np.power((np.array(valid['y'])-np.array(forecast_valid)),2)))
rms
57.494461930575149

#plot
valid['Predictions'] = 0
valid['Predictions'] = forecast_valid.values
plt.plot(train['y'])
plt.plot(valid[['y', 'Predictions']])

Inference

Prophet (like most time series forecasting techniques) tries to capture the trend and seasonality from past data. This model usually performs well on time series datasets, but fails to live up to its reputation in this case.

As it turns out, stock prices do not have a particular trend or seasonality. It highly depends on what is currently going on in the market and thus the prices rise and fall. Hence forecasting techniques like ARIMA, SARIMA and Prophet would not show good results for this particular problem.

Long Short Term Memory (LSTM) Introduction

LSTMs are widely used for sequence prediction problems and have proven to be extremely effective. The reason they work so well is because LSTM is able to store past information that is important, and forget the information that is not. LSTM has three gates:

The input gate: The input gate adds information to the cell state

The forget gate: It removes the information that is no longer required by the model

The output gate: Output Gate at LSTM selects the information to be shown as output

For a more detailed understanding of LSTM and its architecture, you can go through the below article:

For now, let us implement LSTM as a black box and check its performance on our particular data.

Implementation

#importing required libraries
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)), columns=['Date', 'Close'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]

#setting index
new_data.index = new_data.Date
new_data.drop('Date', axis=1, inplace=True)

#creating train and test sets
dataset = new_data.values
train = dataset[0:987,:]
valid = dataset[987:,:]

#converting dataset into x_train and y_train
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)

x_train, y_train = [], []
for i in range(60,len(train)):
    x_train.append(scaled_data[i-60:i,0])
    y_train.append(scaled_data[i,0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50))
model.add(Dense(1))

# the model must be compiled before fitting (this line appears to be missing from this copy)
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)

#predicting 246 values, using past 60 from the train data
inputs = new_data[len(new_data) - len(valid) - 60:].values
inputs = inputs.reshape(-1,1)
inputs = scaler.transform(inputs)

X_test = []
for i in range(60,inputs.shape[0]):
    X_test.append(inputs[i-60:i,0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))

closing_price = model.predict(X_test)
closing_price = scaler.inverse_transform(closing_price)

Results

rms = np.sqrt(np.mean(np.power((valid-closing_price),2)))
rms
11.772259608962642

#for plotting
train = new_data[:987]
valid = new_data[987:]
valid['Predictions'] = closing_price
plt.plot(train['Close'])
plt.plot(valid[['Close','Predictions']])

Inference

Wow! The LSTM model can be tuned for various parameters such as changing the number of LSTM layers, adding dropout value or increasing the number of epochs. But are the predictions from LSTM enough to identify whether the stock price will increase or decrease? Certainly not!

As I mentioned at the start of the article, stock price is affected by the news about the company and other factors like demonetization or merger/demerger of the companies. There are certain intangible factors as well which can often be impossible to predict beforehand.

Conclusion

Time series forecasting is a very intriguing field to work with, as I have realized during my time writing these articles. There is a perception in the community that it’s a complex field, and while there is a grain of truth in there, it’s not so difficult once you get the hang of the basic techniques.

Frequently Asked Questions

Q1. Is it possible to predict the stock market with Deep Learning?

A. Yes, it is possible to predict the stock market with machine learning and deep learning techniques such as moving average, linear regression, Auto ARIMA, LSTM, and more.

Q2. What can you use to predict stock prices in Deep Learning?

A. Moving average, linear regression, kNN (k-nearest neighbours), Auto ARIMA, and LSTM (Long Short Term Memory) are some of the most common techniques used to predict stock prices, with LSTM being the deep learning approach among them.

Q3. What are the two methods to predict stock price?

A. Fundamental Analysis and Technical Analysis are the two ways of analyzing and predicting stock prices.


Data Scientist Vs Machine Learning

Differences Between Data Scientist vs Machine Learning


Data Scientist

Standard tasks:

Allocate, aggregate, and synthesize data from various structured and unstructured sources.

Explore, develop, and apply intelligent learning to real-world data and provide essential findings and successful actions based on them.

Analyze and provide data collected in the organization.

Design and build new processes for modeling, data mining, and implementation.

Develop prototypes, algorithms, and predictive models.

Carry out requests for data analysis and communicate their findings and decisions.

In addition, there are more specific tasks depending on the domain in which the employer works, or the project is being implemented.

Machine Learning

The Machine Learning Engineer position is more “technical.” ML Engineers have more in common with classical Software Engineering than Data Scientists. Machine learning helps you learn the objective function, which maps the inputs (independent variables) to the target (dependent) variable.

The standard tasks of ML Engineers are generally similar to those of Data Scientists. You also need to be able to work with data, experiment with various Machine Learning algorithms that will solve the task, and create prototypes and ready-made solutions.

Strong programming skills in one or more popular languages (usually Python and Java) and databases.

Less emphasis on the ability to work in data analysis environments but more emphasis on Machine Learning algorithms.

R and Python for modeling are preferable to Matlab, SPSS, and SAS.

Ability to use ready-made libraries for various stacks in the application, for example, Mahout, Lucene for Java, and NumPy / SciPy for Python.

Ability to create distributed applications using Hadoop and other solutions.

As you can see, the position of ML Engineer (or a narrower specialization) requires more knowledge of Software Engineering and, accordingly, is well suited for experienced developers. It is also common for a regular developer to be handed an ML task as part of their duties and then pick up the necessary algorithms and libraries along the way.

Head-to-Head Comparison Between Data Scientist and Machine Learning (Infographics)

Below are the top 5 differences between Data scientists and Machine Learning:

Key Difference Between Data Scientist and Machine Learning

Below are the lists of points that describe the key Differences Between Data Scientist and Machine Learning:

Machine learning and statistics are part of data science. The word learning in machine learning means that the algorithms depend on data used as a training set to fine-tune some model or algorithm parameters. This encompasses many techniques, such as regression, naive Bayes, or supervised clustering. But not all techniques fit in this category. For instance, unsupervised clustering – a statistical and data science technique – aims at detecting clusters and cluster structures without any a priori knowledge or training set to help the classification algorithm. A human being is needed to label the clusters found. Some techniques are hybrid, such as semi-supervised classification. Some pattern detection or density estimation techniques also fit into this category.

Data science is much more than machine learning, though. Data in data science may or may not come from a machine or mechanical process (survey data could be manually collected, and clinical trials involve a specific type of small data), and it might have nothing to do with learning, as I have just discussed. But the main difference is that data science covers the whole spectrum of data processing, not just the algorithmic or statistical aspects. Data science also covers data integration, distributed architecture, automated machine learning, data visualization, dashboards, and Big data engineering.

Data Scientist and Machine Learning Comparison Table

| Feature | Data Scientist | Machine Learning |
| --- | --- | --- |
| Data | Mainly focuses on extracting details from data in tabular form or images. | Mainly focuses on algorithms, polynomial structures, and word adding. |
| Complexity | Handles unstructured data and works with a scheduler. | Uses algorithms and mathematical concepts, statistics, and spatial analysis. |
| Hardware Requirement | Systems are horizontally scalable and have high disk and RAM storage. | Requires graphics processors and tensor processors, i.e., very high-end hardware. |
| Skills | Data profiling, ETL, NoSQL, reporting. | Python, R, maths, stats, SQL models. |
| Focus | Focuses on abilities to handle the data. | Algorithms are used to gain knowledge from huge amounts of data. |

Conclusion

Machine learning helps you learn the objective function, which maps the inputs (independent variables) to the target (dependent) variable.

A Data Scientist does a lot of data exploration and arrives at a broad strategy for tackling it. He is responsible for asking questions about the data and finding what answers one can reasonably draw from it. Feature engineering belongs to the realm of Data Scientists, and creativity also plays a role here. A Machine Learning engineer knows more tools and can build models given a set of features and data – as per directions from the Data Scientist. The realm of data preprocessing and feature extraction belongs to ML engineers.

Data science uses machine learning for this kind of model creation and validation. It is vital to note that not all the algorithms used in model creation come from machine learning; they can come from numerous other fields. The model needs to be kept relevant at all times: if the situation changes, the model we created earlier may become irrelevant. The model must be checked for accuracy periodically and adapted if its confidence reduces.

Data science is a whole extensive domain. If we try to put it in a pipeline, it would have data acquisition, data storage, data preprocessing or cleaning, learning patterns in data (via machine learning), and using knowledge for predictions. This is one way to understand how machine learning fits into data science.


How Machine Learning Improves Cybersecurity?

Here is how machine learning improves cybersecurity

Today, deploying robust cybersecurity solutions is unfeasible without significantly depending on machine learning. Simultaneously, without a thorough, rich, and full approach to the data set, it is difficult to properly use machine learning. ML can be used by cybersecurity systems to recognise patterns and learn from them in order to detect and prevent repeated attacks and adjust to different behaviour. It can assist cybersecurity teams in being more proactive in preventing dangers and responding to live attacks. It can help businesses use their assets more strategically by reducing the amount of time spent on mundane tasks.

Machine Learning in Cyber Security

ML may be used in different areas within Cyber Security to improve security procedures and make it simpler for security analysts to swiftly discover, prioritise, cope with, and remediate new threats in order to better comprehend previous cyber-attacks and build appropriate defence measures.  

Automating Tasks

The potential of machine learning in cyber security to simplify repetitive and time-consuming processes like triaging intelligence, malware detection, network log analysis, and vulnerability analysis is a significant benefit. By adding machine learning into the security workflow, businesses may complete activities quicker and respond to and remediate risks at a rate that would be impossible to do with only manual human capabilities. By automating repetitive operations, customers may simply scale up or down without changing the number of people required, lowering expenses. AutoML is a term used to describe the process of using machine learning to automate activities. When repetitive processes in development are automated to help analysts, data scientists, and developers be more productive, this is referred to as AutoML.  

Threat Detection and Classification

In order to identify and respond to threats, machine learning techniques are employed in applications. This may be accomplished by analysing large data sets of security events and finding harmful behaviour patterns. When comparable occurrences are recognised, ML works to autonomously deal with them using the trained ML model. For example, a database of Indicators of Compromise (IOCs) may be constructed to feed a machine learning model. These can aid in real-time monitoring, identification, and response to threats. Malware activity may be classified using ML classification algorithms and IOC data sets. A study by Darktrace, a machine learning based Enterprise Immune Solution, claims to have stopped attacks during the WannaCry ransomware outbreak as an example of such an application.

Phishing

Traditional phishing detection algorithms aren’t fast enough or accurate enough to identify and distinguish between innocent and malicious URLs. Predictive URL categorization methods based on the latest machine learning algorithms can detect trends that signal fraudulent emails. To accomplish so, the models are trained on characteristics such as email headers, body data, punctuation patterns, and more in order to categorise and distinguish the harmful from the benign.  

WebShell

WebShell is a malicious piece of software that is injected into a website and allows users to make changes to the server’s web root folder. As a result, attackers gain access to the database and are able to acquire personal details. Normal shopping cart behaviour may be recognised using machine learning, and the system can be programmed to distinguish between normal and malicious behaviour.

Network Risk Scoring

Quantitative methods for assigning risk rankings to network segments aid organisations in prioritising resources. ML may be used to examine prior cyber-attack datasets and discover which network regions were more frequently targeted in certain assaults. With regard to a specific network region, this score can help assess the likelihood and impact of an attack, so organisations are less likely to be targets of future assaults. When doing company profiling, you must determine which areas, if compromised, can ruin your company. It might be a CRM system, accounting software, or a sales system. It’s all about determining which areas of your business are the most vulnerable. If, for example, HR suffers a setback, your firm may have a low-risk rating. However, if your oil trading system goes down, your entire industry may go down with it. Every business has its own approach to security. And once you grasp the intricacies of a company, you’ll know what to safeguard. And if a hack occurs, you’ll know what to prioritise.

Human Interaction

Computers, as we all know, are excellent at solving complex problems and automating tasks that people could also accomplish, often much faster and at greater scale. Although AI is primarily concerned with computers, people are required to make educated judgments and give directions. As a result, we may conclude that people cannot be replaced by machines. Machine learning algorithms are excellent at interpreting spoken language and recognising faces, but they still require people in the end.

Conclusion

Machine learning is a powerful technology. However, it is not a magic bullet. It’s crucial to remember that, while technology is improving and AI and machine learning are progressing at a rapid pace, technology is only as powerful as the brains of the analysts who manage and use it.
