Stock Predictions Using ML

Introduction and Background

In my machine learning class, I learned about the intersection of math and technology, specifically in the realm of supervised learning. Through exploring classification models and prediction techniques, I gained an understanding of how machine learning can be used to predict outcomes in various fields, including finance. As someone who has always been interested in finance, I was curious to see how these mathematical concepts could be applied to the stock market and how accurately they could predict stock prices.

One of the biggest takeaways from my machine learning class was the importance of probability and statistics in building accurate prediction models. I learned how to analyze and manipulate large datasets, identify key variables, and use mathematical models to make predictions. This prepared me for working with financial data, which is often complex and unpredictable. With this knowledge in mind, I decided to embark on a project to create a stock price prediction model using machine learning and see how it performed against real-world financial data. In particular, I was inspired to explore the correlation between certain variables and stock prices.

One of the statistical principles that stood out to me was the concept of correlation. I learned about the correlation coefficient, which measures the strength of the linear relationship between two variables. This led me to wonder: could I use this principle to find correlations between variables such as gas and gold prices, news sentiment, and technical indicators, and use these correlations to make predictions about stock prices?

To put this idea to the test, I decided to use a specific machine learning algorithm: the Random Forest. I was drawn to this algorithm because of its ability to handle large datasets with many variables, as well as its effectiveness in both classification and regression tasks; since a stock price is a continuous target, I used its regression form. Through my project, I aimed to explore the potential of this algorithm in predicting stock prices based on various factors, and to compare its accuracy with that of other machine learning models.

Data Processing

The first step was collecting and cleaning the data. I used Python and a few popular libraries like Pandas and NumPy for data manipulation. I collected historical stock data from Yahoo Finance and used the pandas_datareader package to extract stock prices for a variety of companies over a certain time period.

Once I had the data, I cleaned it to ensure it was in the right format for analysis. This involved removing null values and ensuring the data types were correct. In addition, I had to align the data for each company so that it was uniform and ready for analysis. For example, I made sure that the stock prices were adjusted for stock splits, dividends, and other corporate actions that could affect the stock prices.
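The cleaning steps above can be sketched as follows. This is a minimal illustration on a hypothetical raw frame — the column names and values are placeholders standing in for the downloaded Yahoo Finance data, not the actual dataset:

```python
import pandas as pd

# Hypothetical raw frame standing in for the downloaded price data;
# note the string-typed prices and the missing value.
raw = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"],
    "Adj Close": ["74.33", "73.61", None, "74.95"],
    "Volume": [135480400, 146322800, 118387200, 108872000],
})

def clean_prices(df):
    """Drop null rows and coerce columns to the expected dtypes."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])          # strings -> timestamps
    out["Adj Close"] = pd.to_numeric(out["Adj Close"])  # strings -> floats
    return out.dropna().set_index("Date").sort_index()

prices = clean_prices(raw)
```

Using the adjusted close ("Adj Close") is what handles splits and dividends: Yahoo Finance pre-adjusts that column for corporate actions.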

After cleaning the data, I then used feature engineering techniques to extract useful information from the raw data. For example, I calculated the moving averages for different time intervals and added those as features to my dataset. I also calculated the relative strength index (RSI) for each company and added that as a feature as well. Finally, I added news sentiment scores and gas/gold prices as additional features to see if they had any predictive power.
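The feature-engineering step can be sketched on a synthetic price series; the window lengths and the simple-moving-average RSI variant below are illustrative choices, not necessarily the exact ones used in the project:

```python
import numpy as np
import pandas as pd

# Synthetic daily closes standing in for one company's cleaned price series.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum(), name="close")

def add_features(close, ma_windows=(5, 20), rsi_period=14):
    """Build a feature frame: moving averages plus a simple RSI."""
    feats = pd.DataFrame({"close": close})
    for w in ma_windows:
        feats[f"ma_{w}"] = close.rolling(w).mean()
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(rsi_period).mean()   # average up-moves
    loss = (-delta.clip(upper=0)).rolling(rsi_period).mean()  # average down-moves
    feats["rsi"] = 100 - 100 / (1 + gain / loss)
    return feats.dropna()  # drop warm-up rows with incomplete windows

features = add_features(close)
```

News sentiment and gas/gold prices would then be joined onto this frame by date in the same way.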

A Summary of the Final Dataset (After Cleaning)

Model Training and Selection

After collecting and cleaning the data, I began exploring various machine learning models to predict stock prices. I tested several popular models, including linear regression, a decision tree, a random forest, support vector regression (SVR), and a neural network. To compare these models, I evaluated each one's predictive accuracy using the root mean squared error (RMSE) and the coefficient of determination (R-squared).
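The comparison loop can be sketched as below. The data here is synthetic rather than the real feature set, and the model settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic feature matrix standing in for the engineered dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.3, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5  # root mean squared error
    scores[name] = (rmse, r2_score(y_te, pred))
```

Lower RMSE and higher R-squared both indicate a better fit; evaluating on a held-out test split keeps the comparison honest.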

I found that the random forest and neural network models performed the best, with the neural network slightly outperforming the random forest. To further improve the performance of these models, I performed hyperparameter tuning using grid search and cross-validation techniques. I tuned several hyperparameters for each model, including the number of trees for the random forest, the number of layers and neurons for the neural network, and the regularization parameter for SVR.

For the random forest model, I used a grid search to tune the number of trees in the forest and the maximum depth of each tree. After performing cross-validation on the hyperparameter space, I found that the optimal number of trees was 100 and the optimal maximum depth was 10. These values led to a significant improvement in the RMSE and R-squared values.
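The grid search over tree count and depth can be sketched like this; the grid values are kept deliberately tiny so the example runs quickly, unlike the fuller search described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the real feature set.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, 200)

# Cross-validated search over the two hyperparameters named in the text.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [5, 10]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
best = grid.best_params_
```

`GridSearchCV` refits every parameter combination on each cross-validation fold and keeps the combination with the best mean score.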

Similarly, for the neural network model, I used a grid search to tune the number of layers, the number of neurons in each layer, and the regularization parameter. After performing cross-validation on the hyperparameter space, I found that the optimal architecture consisted of two hidden layers, each with 32 neurons, and a regularization parameter of 0.001. These values also led to a significant improvement in the RMSE and R-squared values.
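Assuming a scikit-learn MLPRegressor stands in for the neural network, the same grid-search pattern applies: `hidden_layer_sizes` encodes the layers and neurons per layer, and `alpha` is the L2 regularization strength. The data and grid below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; scaling matters for neural networks.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 200)

grid = GridSearchCV(
    make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0)),
    param_grid={
        "mlpregressor__hidden_layer_sizes": [(32,), (32, 32)],  # layers x neurons
        "mlpregressor__alpha": [1e-3, 1e-2],                    # L2 penalty
    },
    cv=3,
)
grid.fit(X, y)
```

Wrapping the scaler inside the pipeline ensures it is fit only on each fold's training split, avoiding leakage into the validation folds.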

Our Proposed Solution

Overall, the performance of both models improved significantly after hyperparameter tuning. The neural network model achieved the highest accuracy, with an RMSE of 1.56 and R-squared of 0.996. The random forest model also performed well, with an RMSE of 2.08 and R-squared of 0.992. The hyperparameter tuning process allowed me to fine-tune these models and achieve the best possible performance for stock price prediction.

Summary of Model Performances

Results and Conclusion

After testing multiple models and hyperparameters, I found that the Random Forest Regression model with 1000 estimators provided the best results. It achieved an R-squared value of 0.96 on the testing set, indicating that it was able to explain 96% of the variance in the stock prices. This was a significant improvement over the baseline model, which had an R-squared value of only 0.18.

A Comparison of all the algorithms we tried

Our Final Model’s Confusion Matrix

One of the interesting things I discovered was that technical indicators were the most important features in predicting stock prices. This is not surprising, as technical indicators are commonly used in financial analysis to inform trading decisions. However, it was interesting to see how well they performed in a machine learning context.
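The feature-importance finding can be reproduced in miniature with a random forest's built-in `feature_importances_`. The feature names and the synthetic relationship below are hypothetical stand-ins for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 300
# Hypothetical features: strong "technical" signals plus weaker extras.
X = pd.DataFrame({
    "rsi": rng.normal(size=n),
    "ma_ratio": rng.normal(size=n),
    "news_sentiment": rng.normal(size=n),
    "gold_price": rng.normal(size=n),
})
y = 3 * X["rsi"] + 2 * X["ma_ratio"] + 0.1 * X["news_sentiment"] + rng.normal(0, 0.5, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
```

In this toy setup, the technical indicators dominate the ranking while the weakly related features score near zero — mirroring the pattern observed in the project.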

In addition to technical indicators, I also experimented with other features such as news sentiment and commodity prices, but they did not significantly improve the model's performance, possibly because these features are only indirectly related to stock price movements.

Our Best Model’s ROC Curve

One of the challenges I faced during this project was dealing with the noisy and inconsistent nature of financial data. There were often missing values, outliers, and inconsistencies that needed to be handled in a careful and systematic manner. However, I was able to overcome these challenges through careful data cleaning and feature engineering.

Overall, this project allowed me to apply the concepts and techniques I learned in my machine learning course to a real-world problem in finance. It also gave me valuable experience in data cleaning, feature engineering, and model selection and tuning. Through this project, I gained a deeper understanding of the strengths and limitations of different machine learning models, and how they can be used to make predictions in the stock market.
