Sentiment Analysis of the Stock Market Using News Headlines
Welcome to this complete capstone project on sentiment analysis of stocks using news. The article is based on Natural Language Processing (NLP): we will build a model that predicts stock price movement from news headlines. Stock prices fluctuate in response to many kinds of news. We will analyze the headlines with NLP-based sentiment analysis and then predict whether the stock price will increase or decrease.
This is a complete capstone project that requires knowledge of Python and some statistical concepts. The environment for this project includes:
- Programming language — Python
- IDE — JupyterLab
Dataset Description
I have used a Kaggle dataset that combines world news and stock prices. The data frame contains a Date column, a Label column (the dependent feature), and 25 columns of top news headlines for each day. The data ranges from 2008 to 2016, and the data from 2000 to 2008 was scraped from different web sources. The Label column has two classes:
- Class 1 — The stock price increased.
- Class 0 — The stock price stayed the same or decreased.
Implementation:
Let’s first import required libraries.
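The original import cell is not reproduced here, so the following is only a minimal sketch of the imports the later steps rely on. It assumes pandas for the data frame and scikit-learn for the vectorizers, the classifier, and the metrics.

```python
# Data handling
import pandas as pd

# Text-to-vector converters (bag of words and TF-IDF)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Classifier used in this article
from sklearn.ensemble import RandomForestClassifier

# Evaluation helpers
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```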
Let’s read the dataset.
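As a rough sketch of this step, the snippet below reads a tiny made-up CSV with `pd.read_csv`. The column names `Top1` to `Top3` and all rows are assumptions for illustration; the real file has 25 headline columns, and you would pass the path of the downloaded file instead of an in-memory buffer.

```python
import io

import pandas as pd

# Toy stand-in for the Kaggle CSV. Column names and rows are made up;
# the real data has 25 headline columns rather than 3.
csv_text = """Date,Label,Top1,Top2,Top3
2014-12-30,0,Markets slide on oil fears,Bank posts loss,Talks stall
2014-12-31,1,Stocks close year higher,Tech shares rally,Deal signed
2015-01-02,1,New year rally continues,Jobs data beats forecast,Exports rise
"""

# With the real data, replace the buffer with the path to the CSV file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```

If the real file is not UTF-8 encoded, the `encoding` parameter of `read_csv` may be needed.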
Let’s look at our dataset. Here Label is our dependent feature (the target value), and the remaining 26 columns (Date plus the 25 headline columns) are the independent features. A label of 1 means the stock price increased on the day these 25 headlines appeared. We are going to use NLP on this problem: we apply sentiment analysis to the headlines and then predict whether the stock price will increase or decrease.
Dividing the dataset into the training and testing parts:
We divide the dataset by date. Rows with Date < 20150101 form the training dataset, and rows with Date > 20141231 form the testing dataset.
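One way to sketch this split is shown below on a made-up four-row data frame. As an assumption of this sketch, the dates are parsed with `pd.to_datetime` before comparing, so the result does not depend on how the date strings are formatted.

```python
import pandas as pd

# Toy data frame; the rows are made up for illustration.
df = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [0, 1, 1, 0],
    "Top1": ["Markets slide", "Stocks close higher", "Rally continues", "Oil drops"],
})

# Parse to real dates so the comparison does not rely on string ordering.
dates = pd.to_datetime(df["Date"])
cutoff = pd.Timestamp("2015-01-01")

train = df[dates < cutoff]    # everything up to 2014-12-31
test = df[dates >= cutoff]    # everything from 2015-01-01 onward

print(len(train), len(test))
```

Comparing raw strings can give surprising results when the stored format (for example with dashes) differs from the cutoff format, which is why the sketch compares parsed dates.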
Let’s perform the pre-processing steps to refine the dataset:
Now let’s remove punctuation such as full stops and exclamation marks from the text, because it is not needed for sentiment analysis. I take all 25 news columns and apply a regular expression that replaces everything except a-z and A-Z with a blank space, so any special character is removed automatically.
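The cleaning step can be sketched as follows on a toy training frame with two headline columns. The column names and headlines are made up, and using `str.replace` column by column is just one way to apply the regular expression.

```python
import pandas as pd

# Toy training data; the real frame has 25 headline columns.
train = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31"],
    "Label": [0, 1],
    "Top1": ["Markets slide 3% on oil fears!", "Stocks close year higher."],
    "Top2": ["Bank posts $2bn loss", "Tech shares rally; deal signed"],
})

# Keep only the headline columns (everything except Date and Label).
headlines = train.drop(columns=["Date", "Label"])

# Replace each character that is not a-z or A-Z with a space.
cleaned = headlines.apply(
    lambda col: col.str.replace(r"[^a-zA-Z]", " ", regex=True)
)

print(cleaned.loc[0, "Top1"])
```

Digits are removed along with punctuation, because the pattern keeps letters only.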
The updated dataset after pre-processing looks like this:
Combining all news headlines based on the index:
CountVectorizer (the bag-of-words model) and TF-IDF expect one piece of text per sample, so we can apply them only if all headlines for a particular date are treated as one paragraph. I therefore go through every index and combine that row’s headlines into a single paragraph.
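A minimal sketch of this combining step, assuming the cleaned headlines sit in a data frame with one row per day (the three columns and their contents here are made up):

```python
import pandas as pd

# Toy cleaned headlines: one row per day, one column per headline.
cleaned = pd.DataFrame({
    "Top1": ["markets slide on oil fears", "stocks close year higher"],
    "Top2": ["bank posts loss", "tech shares rally"],
    "Top3": ["talks stall", "deal signed"],
})

# One string per row: that day's headlines joined by spaces.
combined = [
    " ".join(str(text) for text in row)
    for row in cleaned.itertuples(index=False)
]

print(combined[0])
```

The result is a plain Python list with one paragraph per day, which is the input shape the vectorizers expect.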
The headlines list now looks like this:
Let’s apply CountVectorizer and RandomForestClassifier:
Here the count vectorizer converts these sentences into vectors of word counts. That is what a bag of words means. The random forest classifier is then trained on these vectors.
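To make the bag-of-words idea concrete, here is a small sketch on four made-up paragraphs. The vectorizer settings are the defaults, and `n_estimators` and `random_state` are illustrative choices rather than the settings behind the results reported in this article.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy combined headlines and labels (made up for illustration).
train_texts = [
    "stocks rally on strong earnings",
    "markets slide on oil fears",
    "tech shares rally after deal",
    "bank posts heavy loss",
]
y_train = [1, 0, 1, 0]

# Learn the vocabulary and count how often each word occurs per paragraph.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# vocabulary_ maps each word to its column index in the count matrix.
print(sorted(vectorizer.vocabulary_))
print(X_train.toarray()[0])

# Train the random forest on the count vectors.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
```

Each row of `X_train` is one day, and each column is one word from the training vocabulary.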
Let’s predict for the testing dataset:
For the testing dataset, we apply the same feature transformation that we applied to the training dataset.
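The key detail in this step is that the test text is passed through `transform`, not `fit_transform`, so it is mapped onto the vocabulary learned from the training data. A sketch on made-up data, with illustrative classifier settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (made up for illustration).
train_texts = [
    "stocks rally on strong earnings",
    "markets slide on oil fears",
    "tech shares rally after deal",
    "bank posts heavy loss",
]
y_train = [1, 0, 1, 0]
test_texts = ["shares rally on earnings surprise", "oil fears hit markets"]

# Fit the vocabulary on the training text only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# transform reuses the training vocabulary, so the test matrix has the same
# columns; words not seen in training, such as "surprise", are ignored.
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)

print(X_test.shape, predictions)
```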
Finally, let’s check the accuracy:
Here we use the classification report, confusion matrix, and accuracy score to evaluate our model.
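The three metrics can be sketched on hand-written labels, so the numbers are easy to check by eye. The two lists below are made up and are not results from the model in this article.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions for eight days.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true classes (0, 1); columns are the predicted classes (0, 1).
matrix = confusion_matrix(y_test, predictions)

# Fraction of days where the prediction matches the true label.
accuracy = accuracy_score(y_test, predictions)

# Per-class precision, recall, and F1 as a formatted text table.
report = classification_report(y_test, predictions)

print(matrix)
print(accuracy)
print(report)
```

In this toy case six of the eight predictions are correct, so the accuracy is 0.75, with one wrong prediction in each class.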
Let’s Try Again with TFIDF Vectorizer and RandomForestClassifier:
TfidfVectorizer and CountVectorizer are both methods for converting text data into vectors, since models can process only numerical data.
So let’s try the TF-IDF vectorizer and compare the scores. The procedure is much the same as with the count vectorizer, with only a slight difference in the code.
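To illustrate the difference, this sketch runs both vectorizers with default settings on the same made-up paragraphs. In code, the only change is the class name; in the output, raw counts become weights that are lower for words appearing in many paragraphs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy combined headlines (made up for illustration).
texts = [
    "stocks rally on strong earnings",
    "markets slide on oil fears",
    "tech shares rally after deal",
    "bank posts heavy loss",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(texts).toarray()
tfidf = tfidf_vec.fit_transform(texts).toarray()

# Same vocabulary, so both matrices have the same shape.
print(counts.shape, tfidf.shape)

# Counts are integers; TF-IDF values are floats.
print(counts[0])
print(np.round(tfidf[0], 2))

# "on" occurs in two paragraphs and "stocks" in one, so within the first
# paragraph "on" receives the lower TF-IDF weight.
on_col = tfidf_vec.vocabulary_["on"]
stocks_col = tfidf_vec.vocabulary_["stocks"]
print(tfidf[0, on_col] < tfidf[0, stocks_col])
```

With the default settings, each TF-IDF row is also scaled to unit length, so long and short paragraphs become comparable.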
Let’s predict for the testing dataset:
For the testing dataset, we apply the same feature transformation as before, now using the TF-IDF vectorizer instead of the count vectorizer.
Finally, let’s check the accuracy:
Here we again use the classification report, confusion matrix, and accuracy score to evaluate our model.
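Since both runs follow the same steps, they can be sketched with one small helper that takes the vectorizer as an argument. The function name `fit_and_score`, the toy data, and the classifier settings are all assumptions of this sketch, and the accuracies it prints on such tiny data say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score


def fit_and_score(vectorizer, train_texts, y_train, test_texts, y_test):
    """Hypothetical helper: vectorize, train a random forest, return test accuracy."""
    X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary on train
    X_test = vectorizer.transform(test_texts)        # reuse it on test
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Toy data (made up for illustration).
train_texts = [
    "stocks rally on strong earnings",
    "markets slide on oil fears",
    "tech shares rally after deal",
    "bank posts heavy loss",
]
y_train = [1, 0, 1, 0]
test_texts = ["shares rally on earnings surprise", "oil fears hit markets"]
y_test = [1, 0]

# Run the identical procedure with each vectorizer.
scores = {}
for name, vec in [("CountVectorizer", CountVectorizer()),
                  ("TfidfVectorizer", TfidfVectorizer())]:
    scores[name] = fit_and_score(vec, train_texts, y_train, test_texts, y_test)
    print(name, scores[name])
```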
Conclusions:
We have gone through all the steps and obtained an accuracy of 85% with the count vectorizer and 82% with the TF-IDF vectorizer, using the random forest model. Now suppose you have to predict for tomorrow: take the top 25 news headlines, apply all the transformation steps, and give the result to your model. The model will output 1 or 0, meaning the stock price is predicted to increase or not. This is how you can do stock sentiment analysis using news headlines.
That’s it. Thanks for staying with the article, and please do not forget to give your feedback!
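The prediction for a new day can be sketched as one function that repeats the earlier steps: clean each headline, join them into one paragraph, transform with the fitted vectorizer, and predict. The name `predict_direction`, the toy training data, and the two sample headlines are assumptions of this sketch; in real use you would pass all 25 headlines and the vectorizer and model fitted on the full training set.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def predict_direction(headlines, vectorizer, model):
    """Hypothetical helper: return 1 (increase) or 0 for one day's headlines."""
    # Same cleaning as in training: keep letters only.
    cleaned = [re.sub(r"[^a-zA-Z]", " ", text) for text in headlines]
    # Same combining step: one paragraph per day.
    combined = " ".join(cleaned)
    # transform (not fit_transform) so the training vocabulary is reused.
    features = vectorizer.transform([combined])
    return int(model.predict(features)[0])


# Toy fitted vectorizer and model (made-up data, illustrative settings).
train_texts = [
    "stocks rally on strong earnings",
    "markets slide on oil fears",
    "tech shares rally after deal",
    "bank posts heavy loss",
]
y_train = [1, 0, 1, 0]
vectorizer = CountVectorizer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(vectorizer.fit_transform(train_texts), y_train)

# Two made-up headlines standing in for tomorrow's top 25.
tomorrow = ["Stocks rally on strong earnings!", "Tech shares rally after deal."]
result = predict_direction(tomorrow, vectorizer, model)
print(result)
```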