Covid Vaccine Company Stock Prediction
Authors: Bangxi Xiao, Yunxuan Zeng, Zuxuan Huai, Siyu Shen, Daxin Niu
Covid-19 has been influencing the world for a little more than a year. The pandemic has heavily impacted the world, and many have suffered from the changes it brought to society. Biotechnology companies have been racing to develop vaccines to help society get back on track. At the same time, there are great profits at stake in this race. We would like to take a closer look at the major biotechnology companies developing Covid-19 vaccines and aim to predict their stock prices.
Before we start, we would like to credit this paper by Alex K. Fine. It inspired our project, and we apply a similar approach to a different task.
Links for our project are attached here:
Datalore - Online Data Science Notebook by JetBrains
Tasks and Goals
In this project, we look at six biotechnology companies that are developing Covid-19 vaccines: BioNTech (BNTX), Pfizer (PFE), Moderna (MRNA), Novavax (NVAX), Johnson & Johnson (JNJ), and AstraZeneca (AZN). We aim to combine a variety of sources into a comprehensive model that predicts the stock prices of these six companies.
To achieve this goal, we need to analyze several different datasets. For the first stage, we decided to look at the stock prices of, and tweets related to, the six selected companies. The stock price data gives us a better understanding of each company's past performance. The tweet data plays a central role in this project: we aim to train an RNN model on it to predict whether each of the six chosen stocks will go up or down.
Exploratory Data Analysis
We break down our EDA into two big sections based on the two datasets we obtained for the current stage of the project.
Stock Price EDA
In this section, we focus on the stock price data for the six biotechnology companies. We first look at the price changes for the six companies starting from June 2020. We picked June 2020 as the starting date because most companies reacted to the Covid-19 crisis around that time, so starting there gives us a better view of how things have changed since. The overall stock price changes are shown below:
We can see that NVAX (Novavax), MRNA (Moderna), and BNTX (BioNTech) have changed quite significantly over the past several months. Some big changes seem to have happened in August 2020, February 2021, and March 2021. On the other hand, JNJ (Johnson & Johnson), AZN (AstraZeneca), and PFE (Pfizer) do not seem to have changed much over the same period. But something to keep in mind is that all six companies are plotted on the same graph: changes in companies with a small stock price may not look significant next to companies with much larger stock prices. Therefore, we were hesitant to draw any conclusions from this single price plot.
With this doubt in mind, we decided to look at the volume changes over the past few months for all the companies. The volume overview looks like the following:
Contrary to the previous graph, where we didn't find much fluctuation in PFE, we do see a huge increase in volume for PFE in November 2020. Another company that shows some fluctuation is MRNA, with volume increases in July 2020 and December 2020. To avoid information being hidden by differences in scale, we also looked at the percentage fluctuation for the six companies. The graph is shown below:
We can see that NVAX had some fluctuation around July 2020, December 2020, and February 2021. All the companies fluctuated over time, but none of the swings seem extremely significant.
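The percentage fluctuation above can be computed directly with pandas; a minimal sketch on toy prices (the real series came from our scraped data):

```python
import pandas as pd

# Toy daily closing prices standing in for one company's real series
close = pd.Series([100.0, 104.0, 98.8, 103.74])

# Day-over-day percentage fluctuation (first entry is NaN by construction)
pct = close.pct_change() * 100
```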
After comparing the six companies together, we have created individual plots for each company showing their change in stock price and a moving average of the stock price.
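The moving average in those per-company plots is a one-liner with pandas `rolling`; a small sketch (the 3-day window here is purely illustrative):

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# 3-day moving average; the first two entries are NaN until the window fills
ma3 = close.rolling(window=3).mean()
```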
Tweets EDA
This section focuses on the tweet data we obtained for the six companies. We decided to run sentiment analysis on the tweets: we calculated a sentiment score for each tweet related to each company and counted the number of tweets at each score to create a histogram. The graph is shown below:
We see that most of the tweets we obtained have a sentiment score of zero, and the majority of the data lies between scores of 0 and 0.5. Across the score range, PFE seems to account for the largest share of tweets.
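Given per-tweet sentiment scores (the scoring itself is a separate step), the histogram counts can be built by binning the scores; a minimal sketch with made-up scores:

```python
from collections import Counter

# Hypothetical precomputed sentiment scores in [-1, 1], one per tweet
scores = [0.0, 0.0, 0.3, -0.2, 0.5, 0.0, 0.3]

# Bin to one decimal place and count tweets per bin for the histogram
bins = Counter(round(s, 1) for s in scores)
```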
Beyond the sentiment score histogram, we also created a correlation graph from the tweet features: sentiment, reply count, retweet count, and like count.
From the graph, we can see that retweets are highly correlated with likes, and replies are highly correlated with likes and retweets.
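The correlation graph boils down to a pandas correlation matrix over those four columns; a sketch on toy engagement numbers standing in for the scraped features:

```python
import pandas as pd

# Toy stand-ins for the per-tweet features
df = pd.DataFrame({
    "sentiment": [0.1, 0.0, 0.5, -0.2, 0.3],
    "replies":   [2, 0, 10, 1, 4],
    "retweets":  [5, 1, 40, 2, 9],
    "likes":     [20, 3, 150, 6, 30],
})

# Pairwise Pearson correlations; feed this matrix to a heatmap for the graph
corr = df.corr()
```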
We also created a wordcloud analyzing our tweet data. It features the most used words from our data and puts them together to form the Twitter logo.
From the graph, we can see that "mrna" and "nvax" are among the most used words.
The above analysis concludes our EDA on the existing data. Next, we build baseline models to work toward our goal.
Data Preprocessing
Since text data cannot be fed directly into our RNN models, we first need to clean the raw data. We mainly use the Gensim library for this; it focuses on topic modeling and is widely used for natural language preprocessing. With gensim.parsing.preprocessing, we removed punctuation, duplicated white space, numbers, and stopwords, and stemmed the text.
Next, we tokenize the sentences before model building; tokenizing turns each cleaned tweet into a sequence of word tokens the model can consume. We are then ready to embed the words from the text corpus. We use Word2Vec to convert words to vectors; these vectors capture word features and make the relationships between words explicit.
We also preprocess the non-text data obtained from web scraping, applying MinMaxScaler and StandardScaler to the continuous variables. Feature scaling normalizes the features and ensures they contribute to the model on comparable scales. With that, our data is ready for model training!
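The scaling step uses scikit-learn directly; a minimal sketch on a single toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy continuous feature, shaped (n_samples, n_features)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # rescaled into [0, 1]
X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance
```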
Baseline Models
With some basic analysis completed on our existing data, we decided to test out some baseline models and see whether we could get decent results.
The first model we decided to test is an LSTM. The model structure looks like the following:
The training metrics are shown below:
We have also tracked the MSE and MAE of the model. The graph is attached here:
We see that the model has quite a high MSE and MAE at the very beginning, but the loss decreases steadily and appears to converge after 20 epochs. The next two models behave quite differently.
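A Keras sketch of an LSTM baseline in this spirit (the window length and layer sizes here are illustrative guesses, not our exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

# 30-day lookback window with 1 feature per day, predicting the next price
model = keras.Sequential([
    layers.Input(shape=(30, 1)),
    layers.LSTM(64),
    layers.Dense(1),
])

# Track MSE and MAE during training, as in the plots above
model.compile(optimizer="adam", loss="mse", metrics=["mse", "mae"])
```

Swapping `layers.LSTM` for `layers.SimpleRNN` or `layers.GRU` gives the other two baselines.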
The second model we decided to use is a SimpleRNN. The model structure is attached here:
The training process is recorded below:
As we did for the previous model, we have also tracked the MSE and MAE for the model.
The graphs look a bit different from the previous model's: the loss and MAE decreased dramatically at the very beginning, and the model statistics then started to converge.
The third baseline model we attempted is a GRU model. The model structure is attached here:
The training process is attached below:
Similarly, we tracked the MSE and MAE for the model.
This model behaves similarly to the previous one: the MSE and MAE both converged to very small values, which is quite desirable.
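For reference, the two metrics tracked for all three baselines are simply the mean squared and mean absolute differences between predictions and targets:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
```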
Tri-Attention GRU Model
After the previous baseline models, we decided to try out something more advanced. We decided to construct a Tri-Attention Model. We took our inspiration from this paper:
We chose this particular paper because its authors were performing a task similar to ours: they filtered news articles to build a model that predicts stock prices.
The authors of the above paper applied different attention layers for different purposes. Similarly, we constructed a Tri-Attention Model whose attention layers each have a different focus. Our first set of attention layers sits at the input level, where we process the document-level inputs: after the data passes through a Bi-directional GRU layer, we add a self-attention layer. This is the word-level attention. The structure for one document input is shown below.
After processing the inputs from all the documents for a given day, we add another self-attention layer, which serves as the document-level attention. The graph below shows how it looks.
After the document-level attention, we combine everything and add a day-level attention layer at the very end. This tri-attention structure completes our full model. A diagram of our model is shown below.
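The same pooling operation sits behind the word-, document-, and day-level attention layers: score each position against a learned query, softmax the scores, and take the weighted sum. A minimal numpy sketch (random vectors stand in for the Bi-GRU states and the learned query):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(states, query):
    """Collapse a (T, d) sequence of hidden states into one (d,) vector."""
    scores = states @ query        # alignment score per position
    weights = softmax(scores)      # attention distribution over positions
    return weights @ states, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))   # e.g. 5 words with hidden size 8
query = rng.normal(size=8)         # stand-in for the learned attention vector
pooled, weights = attention_pool(states, query)
```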
To see the code for our full model, please feel free to check out this link:
We have tracked the MSE, loss, and MAE for all of our companies. We have selected the statistics from Johnson & Johnson to illustrate how they look. The graphs are displayed below.
From the training curves above, we can see that all statistics decreased dramatically in the first several epochs. These are very encouraging results for the tri-attention model.
Tri-Attention Model (BERT)
After trying the previous model, we decided to try BERT for word processing: we replace the Bi-directional GRU layers entirely with BERT. The following plot shows our document-level processing structure.
As we can see from the above graph, BERT is used to get the word vectors instead of using Bi-directional GRU layers. The full structure of the model looks like the following.
We have tracked the loss, MAE, and MSE for the BERT-based tri-attention model as well. We will again show Johnson & Johnson's statistics as our example.
From the plot above, we can see that the BERT-based Tri-Attention Model also gives us decent results.
Conclusion
In this project, we used tweet data to predict stock prices. We attempted several models, including LSTM, SimpleRNN, GRU, and Tri-Attention models (GRU- and BERT-based), and obtained some decent results in the end.
In the future, we are considering turning the regression task into a classification problem. Another idea is to stratify our results, which would help us build more detailed and accurate trading strategies. Finally, we would like to fine-tune our BERT model so that it fits our data better.
Thank you for reading about our project, and feel free to contact us on LinkedIn if you have any questions or suggestions.
Bangxi Xiao - Brown University - Providence, Rhode Island, United States | LinkedIn
Yunxuan Zeng - Research Assistant - RailTEC at Illinois | LinkedIn
Zuxuan Huai - Data Scientist Intern, Investment Banking - Guggenheim Partners | LinkedIn
Siyu Shen - Brown University - Greater Boston Area | LinkedIn