Predicting Stock Prices using Machine Learning (XGBoost)

 

Today, I'll show you how to create a machine learning program to predict stock prices. Machine learning is used in a variety of fields because a model can learn patterns from data and predict useful variables with small error. In theory, we can apply the same idea to stocks and predict the next closing price, which would help us trade at a profit.

However, stock markets are highly volatile. Price movements often depend on decisions taken by the company, investor sentiment and reactions, social impact, human emotions, and the price movements of related stocks. Data of this kind is not available to our program, so the prediction is likely to be close only when the market remains relatively stable.



You don't need much machine learning experience in Python to understand how this program works; a minimal background is enough, and I'll describe each machine learning step as we go. Don't forget to download the source code of this program from the link at the end. Let's jump right in!

Data Source

The data should consist of stock open and close prices, traded volumes and values, and the corresponding date. We have already made a "Stock Market Recorder" program to store stock price data every day, which you can read here. If you run that program every day and collect a good amount of stock data in your database, you can predict stock prices conveniently. Alternatively, you can download bulk data from the web and adapt it to the same layout. I shall feed the data from the database of my stock recorder program.

Note: The "Stock Recorder" program collects and stores data in the database one day after the actual market session. Therefore, to predict stock prices, you need to run the "Stock Recorder" program and then run this "Stock Price Predictor" program on the day of the next market session, before the market opens. For example, if there is a market session today, I'll first run the stock data collector program and then this predictor today itself, which will give me the predicted price of the stock when the market closes for the day.

Required Modules and Libraries

Apart from the built-in time module, you'll have to install all the other libraries if you haven't already done so. Firstly, we need xgboost for the main machine learning model. Secondly, we need sklearn (installed as the scikit-learn package) for pre-processing the data and scoring the model. Install mysql.connector (the mysql-connector-python package) for accessing data from the MySQL database, and pandas for getting the data in the form of a DataFrame.


The Code

Firstly, we set about importing all our required modules:

Our main body, which will consist of user input, data collection, data processing and result delivery is as follows:


So we have initiated our MySQL connection to get hold of our database, and taken a report_status input from the user (if report status is enabled, we provide the user with some additional information along the way). The code after that sits in a while loop, so that we can keep predicting for as many stocks as we want.

First, we ask the user for the stock ticker, inside a try-except block that deals with invalid stock tickers. Subsequently, we run a query to get all the data for the ticker entered by the user from our database, and record the columns in separate lists using a for loop.
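The original listing is not reproduced here, so the following is only a minimal sketch of this step. The table name stock_prices, the column names, and the helper names fetch_stock_frame() and ask_for_frame() are my own assumptions, and the rows go straight into a DataFrame rather than separate lists. The cursor can be any DB-API cursor, such as the one returned by a mysql.connector connection.

```python
import pandas as pd

# Hypothetical column names; the real table layout is not shown in this post.
COLUMNS = ["trade_date", "open_price", "close_price", "volume", "traded_value"]


def fetch_stock_frame(cursor, ticker):
    """Return every stored row for `ticker` as a DataFrame, oldest first."""
    query = (
        "SELECT trade_date, open_price, close_price, volume, traded_value "
        "FROM stock_prices WHERE ticker = %s ORDER BY trade_date"
    )
    # Pass the ticker as a parameter instead of formatting it into the SQL.
    cursor.execute(query, (ticker,))
    rows = cursor.fetchall()
    if not rows:
        raise ValueError(f"No data stored for ticker {ticker!r}")
    return pd.DataFrame(rows, columns=COLUMNS)


def ask_for_frame(cursor, read_input=input):
    """Keep asking until the user enters a ticker that has stored data."""
    while True:
        ticker = read_input("Enter stock ticker: ").strip().upper()
        try:
            return ticker, fetch_stock_frame(cursor, ticker)
        except ValueError as err:
            print(err)  # invalid ticker: report it and ask again
```

Passing read_input as an argument is just a convenience so the loop can be exercised without a keyboard.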

Then we come to data processing. The concept of this model is that we use today's stock data to predict tomorrow's price. We use all the rows of our stock table except the last one as training data, and the last row as test data, since it holds the data required to predict tomorrow's price. The data is divided into X and y, where X contains the 'features', which influence the value of the variable we want to predict, and y contains the column of the variable to be predicted, the 'result' arising from the set of 'features'.
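One way to build X and y along these lines is sketched below. It assumes that the target for each row is the following row's closing price, and the function name make_training_data() and the column names are made up for illustration; the original program may arrange this differently.

```python
import pandas as pd


def make_training_data(frame, feature_cols, target_col):
    """Pair each day's features with the next day's value of `target_col`.

    Returns (X, y, X_latest): X and y cover every row except the last,
    and X_latest is the most recent row, whose result is not known yet.
    """
    features = frame[feature_cols]
    next_value = frame[target_col].shift(-1)  # row i gets the value from row i + 1
    X = features.iloc[:-1]
    y = next_value.iloc[:-1]
    X_latest = features.iloc[[-1]]            # double brackets keep it a DataFrame
    return X, y, X_latest


# Tiny made-up table, oldest row first.
frame = pd.DataFrame({
    "open_price": [10.0, 11.0, 12.0, 13.0],
    "close_price": [11.0, 12.0, 13.0, 14.0],
})
X, y, X_latest = make_training_data(frame, ["open_price", "close_price"], "close_price")
```

Here X has three rows, y holds the closing price of the day after each of them, and X_latest is the single row we would feed to the finished model.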

The X and y parts of the data are further divided into train (the rows used to train the model) and valid (rows whose result is already known, so we can compare our predictions against it and compute the mean error). No prediction is perfect; every prediction has at least a small error. We then try different parameters of our machine learning model to lower that error. Lowering the error on the validation data does not necessarily lower the error in the actual prediction, and in some cases it may even increase it. This problem is called overfitting: the model is tuned so closely to the validation data that it scores a low error there but fails to perform similarly on new data. You can decrease the size of the training data, and thereby increase the size of the validation data, to reduce this effect, but it shouldn't cause much of a problem here.

categorical_cols refers to columns in the data with non-numerical values such as strings (machine learning models typically work with numerical data). numerical_cols refers to the numerical columns. The next few lines of code use Pipelines to pre-process the data, including the categorical columns, so that a model can be built from it.
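To make the train/valid split and the error measure concrete, here is a small sketch on toy numbers. The 80/20 ratio and the random_state are assumptions of mine, not settings taken from the original program.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy data: 10 feature rows and their known results.
X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]

# Hold back 20% of the rows as validation data (assumed ratio).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

# Mean absolute error is the average size of the misses:
# (|100 - 101| + |102 - 100|) / 2
actual = [100.0, 102.0]
predicted = [101.0, 100.0]
mae = mean_absolute_error(actual, predicted)
print(len(X_train), len(X_valid), mae)  # prints: 8 2 1.5
```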
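A pre-processing step of this kind could look roughly like the sketch below, built from scikit-learn's ColumnTransformer, SimpleImputer, and OneHotEncoder. The imputation strategies, the helper name build_preprocessor(), and the example columns are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(X):
    """Split the columns by type and build a transformer for each group."""
    numerical_cols = [c for c in X.columns if pd.api.types.is_numeric_dtype(X[c])]
    categorical_cols = [c for c in X.columns if c not in numerical_cols]

    # Numbers: fill any gaps with the column mean.
    numerical_transformer = SimpleImputer(strategy="mean")

    # Strings: fill gaps with the most common value, then one-hot encode.
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ])
    return preprocessor, numerical_cols, categorical_cols


# Made-up example: one numerical column with a gap, one string column.
X = pd.DataFrame({
    "open_price": [1.0, None, 3.0],
    "series": ["EQ", "BE", "EQ"],
})
preprocessor, numerical_cols, categorical_cols = build_preprocessor(X)
transformed = preprocessor.fit_transform(X)
```

The transformed data has one column for open_price and one column for each distinct value of series, so every entry is now numerical.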

Next, we initialise the learning rate (a small constant that scales down the contribution of each tree's prediction, which usually gives a better overall prediction) at 0.02. We run a for loop that sends the pre-processed data to our own function model_testing() to get the mean absolute error for each learning rate from 0.02 to 0.06 (a range I expect to contain a good value). We then use the learning rate with the lowest error to build our final model and perform the prediction using another function, final_model(). The last lines are a couple of if-else statements that present the result to the user appropriately. Moving on to our own function model_testing():
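The search over learning rates can be sketched as follows. The step of 0.01 and the helper name pick_learning_rate() are assumptions, and score_for stands in for a call to model_testing() with the pre-processed data.

```python
def pick_learning_rate(score_for, start=0.02, stop=0.06, step=0.01):
    """Score every learning rate from start to stop and return the best one.

    `score_for` is any function that maps a learning rate to an error;
    the lowest error wins.
    """
    steps = round((stop - start) / step)
    # Round each rate so float arithmetic does not produce values like 0.030000000000000002.
    rates = [round(start + i * step, 4) for i in range(steps + 1)]
    scores = {rate: score_for(rate) for rate in rates}
    best_rate = min(scores, key=scores.get)
    return best_rate, scores[best_rate]


# Stand-in scoring function whose error is lowest at 0.04.
best_rate, best_score = pick_learning_rate(lambda rate: abs(rate - 0.04))
print(best_rate, best_score)  # prints: 0.04 0.0
```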


So we create an XGBRegressor model with the learning rate passed in from the for loop discussed above. It makes its predictions, compares them with the real results we already have, and returns the mean absolute error, a.k.a. the score (the lower the score, the better the model). Moving on to the function final_model():


This is similar to the previous function model_testing(), except that it takes in the learning rate with the lowest score and returns the final prediction. Finally, we also have another function time_taken() which records the time taken to find out the best learning rate, and displays it to the user if report_status is True.


And so our program is complete and ready to run! Make sure you update the program to match your own setup, such as your MySQL password, the name of your database, and any file directories. Also set the model parameter n_jobs to the number of cores in your processor so that training runs faster.


Quick Important Note

If you have read the blog post "Stock Market Recorder: Stock Market Data collection using Python", you will know that the stock collection program stores each day's stock data on the day after the market activity. So if you want to predict a stock price today, you should ideally run the program in the morning before the market opens, and it will predict today's closing price.


The Working

A sample input of a stock ticker, along with report_status set to True shows the output as follows: 


Note: Your PC might run its cooling fans at a higher speed, since the program trains the model multiple times. This is normal and expected.

And there you have it! A Stock Price Predictor that works at your ease. Here's the Source code of the entire program: Download Python Program. Stay tuned with this blog for more.
