Covid Vaccine Stock Prediction

Daxin Niu
8 min read · Apr 5, 2021

Authors: Bangxi Xiao, Yunxuan Zeng, Zuxuan Huai, Siyu Shen, Daxin Niu

Motivation

Covid-19 has been affecting the world for a little over a year. The pandemic has hit society hard, and many people have suffered from the changes it brought. Biotechnology companies have been racing to develop vaccines to help society get back on track. There is also a great deal of profit at stake in this race. We would like to take a closer look at the major biotechnology companies developing Covid-19 vaccines and try to predict their stock prices.

Before we start, we would like to cite this paper from Alex K. Fine.

This paper inspired our project, and we apply a similar approach to a different task.

Our notebook for this project is linked below:

Tasks and Goals

In this project, we look at six biotechnology companies that are developing Covid-19 vaccines: BioNTech (BNTX), Pfizer (PFE), Moderna (MRNA), Novavax (NVAX), Johnson & Johnson (JNJ), and AstraZeneca (AZN). We aim to combine several data sources into a comprehensive model that helps us predict the stock prices of these six companies.

To achieve this goal, we need to analyze several datasets. In the first stage, we look at the stock prices and the tweets related to the six selected companies. The stock price data gives us a better understanding of each company's past performance. The tweet data plays an important role in this project: we aim to train an RNN model on it to help us predict whether each of the six stocks will go up or down.

Exploratory Data Analysis

We break our EDA down into two sections, based on the two datasets we obtained for the current stage of the project.

Stock Price EDA

In this section, we focus on the stock price data for the six biotechnology companies. We first look at how the price of each company has changed since June 2020. We picked June 2020 as the starting date because most companies were reacting to the Covid crisis around that time, so starting there gives us a better view of how things have changed since. The overall stock price change is shown below:

We can see that NVAX (Novavax), MRNA (Moderna), and BNTX (BioNTech) changed significantly over the past several months. Some big changes seem to happen in August 2020, February 2021, and March 2021. On the other hand, JNJ (Johnson & Johnson), AZN (AstraZeneca), and PFE (Pfizer) do not seem to change much over the same period. Keep in mind, however, that we are putting all six companies on the same graph. Changes in companies with a low stock price can look insignificant next to companies with a much higher stock price. Therefore, we were hesitant to draw any conclusions from this price plot alone.

With this doubt in mind, we decided to look at how trading volume changed over the past few months for all the companies. The volume overview looks like the following:

In contrast to the previous graph, where we did not find much fluctuation in PFE, we do see a huge increase in volume for PFE in November 2020. Another company that shows some fluctuation is MRNA, whose volume increased in July 2020 and December 2020. To avoid missing information that is hidden by differences in scale, we decided to look at the percentage fluctuation for the six companies. The graph is shown below:

We can see that NVAX had some fluctuation around July 2020, December 2020, and February 2021. All the companies fluctuated somewhat over time, but none of the swings seems extremely significant.
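As a rough illustration of the percentage view discussed above, here is a minimal pandas sketch. It assumes that "percentage fluctuation" means the day-over-day percentage change of the closing price, which the post does not state explicitly. The `prices` table and its values are made up for the example and are not our project data.

```python
import pandas as pd


def daily_pct_change(prices: pd.DataFrame) -> pd.DataFrame:
    """Return the day-over-day percentage change for each ticker column."""
    # pct_change() gives fractional change; the first row has no previous
    # day, so it is NaN and gets dropped.
    return prices.pct_change().dropna() * 100


# Hypothetical closing prices (illustrative values only).
prices = pd.DataFrame(
    {"PFE": [36.0, 36.9, 36.18], "NVAX": [100.0, 110.0, 99.0]},
    index=pd.date_range("2020-06-01", periods=3, freq="D"),
)

pct = daily_pct_change(prices)
print(pct)
```

Because every ticker is expressed relative to its own previous price, a low-priced and a high-priced stock can be compared on the same axis.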

After comparing the six companies together, we created an individual plot for each company showing its stock price and a moving average of that price.

Pfizer
Novavax
BioNTech
Moderna
AstraZeneca
Johnson&Johnson
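The moving average in the per-company plots can be sketched with a pandas rolling window. The window length is an assumption here (the post does not say which one we used), and `add_moving_average` is a hypothetical helper name used only for this example.

```python
import pandas as pd


def add_moving_average(close: pd.Series, window: int = 20) -> pd.DataFrame:
    """Pair a closing-price series with its simple moving average."""
    # rolling(window).mean() is NaN until `window` observations are available.
    return pd.DataFrame(
        {"close": close, f"ma_{window}": close.rolling(window).mean()}
    )


# Hypothetical closing prices (illustrative values only).
close = pd.Series(
    [10.0, 12.0, 14.0, 16.0, 18.0],
    index=pd.date_range("2020-06-01", periods=5, freq="D"),
)

out = add_moving_average(close, window=3)
print(out)
```

Plotting the two columns together gives the same kind of chart as above: the raw price plus a smoother line that lags behind it.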

Tweets EDA

This section focuses on the tweet data we obtained for the six companies. We decided to run sentiment analysis on these tweets. We calculated a sentiment score for each tweet related to each company, then counted the number of tweets at each score to create a histogram. The graph is shown below:

We see that most of the tweets we obtained have a sentiment score of zero, and the majority of the data lies between a sentiment score of 0 and 0.5. Across the scores, PFE appears to account for the largest share of the tweet counts.
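The counting step behind a histogram like this can be sketched as follows. The sketch assumes each tweet already has a sentiment score in the range [-1, 1]; the scoring method itself is not shown, and the column names, bin edges, and values are illustrative assumptions rather than our actual data.

```python
import numpy as np
import pandas as pd


def sentiment_histogram(tweets: pd.DataFrame, bins: np.ndarray) -> dict:
    """Count tweets per sentiment bin, separately for each ticker."""
    return {
        ticker: np.histogram(group["sentiment"], bins=bins)[0]
        for ticker, group in tweets.groupby("ticker")
    }


# Hypothetical scored tweets (illustrative values only).
tweets = pd.DataFrame(
    {
        "ticker": ["PFE", "PFE", "PFE", "MRNA", "MRNA", "NVAX"],
        "sentiment": [0.0, 0.0, 0.5, 0.0, -0.25, 0.8],
    }
)

# Eight equal-width bins covering [-1, 1].
bins = np.linspace(-1.0, 1.0, 9)
counts = sentiment_histogram(tweets, bins)
for ticker, values in counts.items():
    print(ticker, values.tolist())
```

Stacking the per-ticker counts in one bar chart makes it easy to see which company contributes the most tweets at each score.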

Besides the sentiment score histogram, we also created a correlation graph from the tweet features. We looked at sentiment, reply count, retweet count, and like count, and plotted the correlations between them.

From the graph, we can see that retweets are highly correlated with likes, and replies are highly correlated with likes and retweets.
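A correlation matrix over these tweet features can be computed with pandas in one call. This is a minimal sketch assuming Pearson correlation; the column names and values below are hypothetical and only illustrate the shape of the computation.

```python
import pandas as pd

# Hypothetical tweet features (illustrative values only).
tweets = pd.DataFrame(
    {
        "sentiment": [0.0, 0.5, -0.2, 0.1, 0.0],
        "replies_count": [1, 4, 0, 10, 2],
        "retweets_count": [2, 9, 1, 21, 3],
        "likes_count": [5, 20, 2, 50, 8],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = tweets.corr(method="pearson")
print(corr.round(2))
```

The resulting square table can be passed to any heatmap plotting function to produce a graph like the one above.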

We also created a word cloud from our tweet data. It takes the most used words in our data and arranges them in the shape of the Twitter logo.

From the word cloud, we can see that "mrna" and "nvax" are among the most used words.

This concludes our EDA on the existing data. Next, we build baseline models to work toward our goal.

Baseline Models

With some basic analysis completed for our existing data, we decided to test out some baseline models and see if we can get some decent results.

Preprocessing

For preprocessing, we generated sequence data with a sequence length of 20, giving arrays of shape (num_samples, sequence_length, num_features). We then performed the train/test split along the num_samples axis so that the order within each sequence is not broken. We hold out 20% of the data for testing.
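The windowing and split described above can be sketched with NumPy. Two details are assumptions made for this example: the target for each window is the value at the step right after the window, and the test set is the last 20% of samples in time order. The function names are hypothetical.

```python
import numpy as np


def make_sequences(features: np.ndarray, targets: np.ndarray, seq_len: int = 20):
    """Slice a (time, num_features) array into overlapping windows.

    Returns X with shape (num_samples, seq_len, num_features) and y with
    shape (num_samples,), where y[i] is the target at the step immediately
    after window i (an assumption made for this sketch).
    """
    X, y = [], []
    for start in range(len(features) - seq_len):
        X.append(features[start:start + seq_len])
        y.append(targets[start + seq_len])
    return np.asarray(X), np.asarray(y)


def chronological_split(X: np.ndarray, y: np.ndarray, test_fraction: float = 0.2):
    """Split along the sample axis without shuffling; the test set is the tail."""
    n_test = int(round(len(X) * test_fraction))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]


# Synthetic data: 120 time steps, 3 features (illustrative only).
features = np.arange(120 * 3, dtype=float).reshape(120, 3)
targets = features[:, 0]

X, y = make_sequences(features, targets, seq_len=20)
X_train, X_test, y_train, y_test = chronological_split(X, y, test_fraction=0.2)
print(X.shape, X_train.shape, X_test.shape)
```

Splitting on the sample axis without shuffling keeps every test window later in time than every training window's starting point.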

Now we can look at our baseline models.

LSTM

The first model we decided to test is an LSTM. The model structure looks like the following:

The training metrics are shown below:

We have also tracked the MSE and MAE of the model. The graph is attached here:

The model starts with quite high MSE and MAE, but the loss keeps decreasing and seems to converge after 20 epochs.
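For reference, the two metrics tracked here are simple to compute by hand. The sketch below uses NumPy and made-up predictions; it only illustrates the definitions and is not output from our model.

```python
import numpy as np


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: average of squared differences."""
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: average of absolute differences."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up values for illustration.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 5.0])

mse_value = mse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(mse_value, mae_value)
```

MSE penalizes large errors more heavily than MAE, which is why the two curves can move at different rates during training.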

SimpleRNN

The second model we decided to use is a SimpleRNN. The model structure is attached here:

The training process is recorded below:

As we did for the previous model, we have also tracked the MSE and MAE for the model.

The curves look a bit different from those of the LSTM. The loss and MAE drop sharply at the very beginning, after which the metrics start to converge.

GRU

The third baseline model we attempted is a GRU model. The model structure is attached here:

The training process is attached below:

Similarly, we tracked the MSE and MAE for the model.

This model behaves similarly to the SimpleRNN, and both perform reasonably well.
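The three baselines above differ mainly in their recurrent layer, so they can share one builder. The following Keras sketch is not our exact architecture: the number of units, the single Dense output, the Adam optimizer, and the feature count are all assumptions for illustration, and `build_baseline` is a hypothetical helper name. Only the input sequence length of 20 and the MSE/MAE metrics come from the text.

```python
from tensorflow import keras


def build_baseline(cell: str, seq_len: int = 20, n_features: int = 4,
                   units: int = 32) -> keras.Model:
    """Build a small recurrent regressor; `cell` is 'lstm', 'rnn', or 'gru'."""
    layer_cls = {
        "lstm": keras.layers.LSTM,
        "rnn": keras.layers.SimpleRNN,
        "gru": keras.layers.GRU,
    }[cell]
    model = keras.Sequential([
        keras.Input(shape=(seq_len, n_features)),
        layer_cls(units),          # recurrent layer returns the last hidden state
        keras.layers.Dense(1),     # single regression output
    ])
    # MSE as the loss and MAE as an extra tracked metric.
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


models = {name: build_baseline(name) for name in ("lstm", "rnn", "gru")}
for name, model in models.items():
    print(name, model.count_params())
```

With identical settings, the LSTM has the most parameters and the SimpleRNN the fewest, which is one reason their training curves can look different.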

Future Goals

At the current stage, we are seeing different behaviors from different models. Ultimately, we want our price predictions to be as accurate as possible.

As a next step, we will try out other models and see whether they lead to interesting findings. We will also fine-tune the existing models to see whether their results improve.

Thank you for reading about our project, and feel free to contact us on LinkedIn if you have any questions or suggestions.
