Predicting Stock Prices using Machine Learning (XGBoost)

Today, I'll show you how to create a machine learning program to predict stock prices. Machine learning is used in a variety of fields, and we can utilize its ability to learn and predict from data to predict useful variables, with minimal error. Therefore in theory, we can apply the same on stocks to predict the next closing price, so that we can make a killing gain.

However, stock markets are highly unstable. Their price movement often depends on decisions taken by the company, favor and reaction of investors, social impact, human emotions, and price movements of some other related stocks. These types of data cannot be made available to a program. The prediction can be close only if the market remains relatively stable.

You don't actually need to learn machine learning using python to understand how this program works, if not minimal. I'll describe each machine learning process. And don't forget to download the source code of this program, link provided at the end. Let's jump in right away!

Data Source

The data should comprise of stock open and close prices, traded volumes and values and the corresponding date. We have already made a "Stock Market Recorder" program, to store stock price data everyday, which you can read here. If you run this program everyday and collect a good size of stock data in your database, you can predict stock prices conveniently. On the other hand, you can also download bulk data from the web and make appropriate changes. I shall feed the data from my database of the stock recorder program.

Note: The "Stock Recorder" program collects and stores data in the database one day after the actual market session. Therefore, to predict the stock prices, you need to run the "Stock Recorder" program, then run this "Stock Price Predictor" program on the same day of the next market session, essentially before the market begins. For example, If there is a market session today, I'll first run the stock data collector program and then this predictor today itself, which will give me the predicted price of the stock when the market closes for the day.

Required Modules and Libraries

Apart from time module, you'll have to install all other libraries if you haven't already done so. Firstly, we need xgboost for the main machine learning model. Secondly, we need sklearn to perform various processes on our machine learning model. Install mysql.connector for accessing data from the Mysql database, and pandas for getting the data in the form of a DataFrame.

The Code

Firstly, we set about importing all our required modules:

Our main body, which will consist of user input, data collection, data processing and result delivery is as follows:

So we have initiated our MySQL connection to get hold of our database, and a report_status input from the customer (If report status is enabled, we shall provide the user with some additional information along the process). The code after that comes in a while loop, so that we can keep predicting for as many stocks as we want.

First, we ask the user for the stock ticker, which is under a set of try-except statement for dealing with invalid stock tickers. Subsequently, we run a query to get all the data of the stock ticker entered by the user from our database, and record them in various lists, using a for loop.

Then we come to data processing. The concept of this model is we use today's stock data to predict the price tomorrow. We use all the rows except the last row of our stock table as training data and the last row as test data, which is the required data to predict tomorrow's price. The data is divided into X and y, where X contains 'features', which influence the value of the variable we want to predict, and y contains the column of the variable to be predicted, the 'result' arising from the set of 'features'.

The X and y part of the data is further divided into train (the rows that will be used to train the model) and valid (the rows that contain the result already, and therefore we compare our prediction with it to get the mean error in our prediction (No prediction is completely perfect, every prediction has at least a minimal error). Then we try different parameters of our machine learning model to lower the error. Such a method of trying to lower error in testing data may not necessarily lower the error in actual prediction, and in some cases might even lead to increase in error in actual prediction. This problem is called Overfitting, where our model depends too much on the testing data and gets less error, but fails to perform similarly for actual prediction. You may decrease the size of training data and therefore increase the size of testing data to reduce this effect, but it shouldn't cause much of a problem.

categorical_cols refers to columns in the data with non-numerical values like strings (Machine learning typically works with numerical data). numerical_cols refers to numerical columns. The following few lines of code uses Pipelines to pre-process the data, along with the categorical data, to make a model out of it.

Next, we initiate the learning rate ( a small constant multiplied to individual predictions to give better overall prediction) from 0.02. We run a for loop to send the pre-processed data to a defined function model_testing() to get the mean absolute error from all learning rates from 0.02 to 0.06 (a sweet spot, that hopefully is correct). We then use the learning rate corresponding to the lowest error to make our final model and perform the prediction using another function final_model(). The last lines are a couple of if-else statements to provide the result to the user appropriately. Moving on to our own defined function model_testing():

So we create a XGBRegressor model with the learning rate obtained from the for loop discussed above. It makes the prediction, compares it with the answer that we have from reality, and returns the mean absolute error, a.k.a score (The lower the score, the better the model). Moving on to the function final_model():

This is similar to the previous function model_testing(), except that it takes in the learning rate with the lowest score and returns the final prediction. Finally, we also have another function time_taken() which records the time taken to find out the best learning rate, and displays it to the user if report_status is True.

And so our program is complete and ready to run! Make sure that your program reflects upon any changes you might have made like password of your MySQL, name of database, file directories etc. Also make a change in the model parameter n_jobs and set it equal to the number of cores in your processor (for quick working).

Quick Important Note

If you read the blog post :Stock Market Recorder: Stock Market Data collection using Python, you will know that the stock collection program stores the stock data the next day after the market activity. So suppose you want to predict a stock price today, you will ideally run the program in the morning before the market opens today, and it will predict the price for today.

The Working

A sample input of a stock ticker, along with report_status set to True shows the output as follows:

Note: Your PC might run its cooling fans at higher speed as the program runs the model multiple times. This is normal and expected

And there you have it! A Stock Price Predictor that works at your ease. Here's the Source code of the entire program: Download Python Program. Stay tuned with this blog for more.

Comments

Kate Austen20 June 2022 at 00:16
Very Informative and creative contents. This concept is a good way to enhance knowledge. Thanks for sharing. Continue to share your knowledge through articles like these.

Data Engineering Services

Artificial Intelligence Services

Data Analytics Services

Data Modernization Services
Kate Austen20 June 2022 at 00:16
Very Informative and creative contents. This concept is a good way to enhance knowledge. Thanks for sharing. Continue to share your knowledge through articles like these.

Data Engineering Services

Artificial Intelligence Services

Data Analytics Services

Data Modernization Services

Fun Python Code!

Search This Blog