Machine Learning Solutions

Collecting the dataset

In order to build the model, we first need to collect the data. We will use the following two data sources:

Dow Jones Industrial Average (DJIA) index prices
News articles

DJIA index prices give us an overall picture of the stock market's movement on a particular day, whereas news articles help us find out how news affects stock prices. We will build our model using these two data sources. Now let's collect the data.

Collecting DJIA index prices

In order to collect the DJIA index prices, we will use Yahoo Finance. You can visit this link: https://finance.yahoo.com/quote/%5EDJI/history?period1=1196706600&period2=1512325800&interval=1d&filter=history&frequency=1d. Once you open the link, the page shows the historical price data. You can change the time period and click on the Download Data link to get all the data in .csv file format. Refer to the following screenshot of the Yahoo Finance DJIA index price page:

Figure 2.1: Yahoo Finance page for DJIA index price
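The time period in the Yahoo Finance link is carried by the period1 and period2 query parameters. The following is a minimal sketch of how to inspect them, assuming (this is not stated in the text) that both values are Unix timestamps in seconds. The function name decode_period is hypothetical and used only for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# The history link quoted in the text.
HISTORY_URL = (
    "https://finance.yahoo.com/quote/%5EDJI/history"
    "?period1=1196706600&period2=1512325800"
    "&interval=1d&filter=history&frequency=1d"
)


def decode_period(url):
    """Return (start, end) as UTC datetimes.

    Assumption: period1 and period2 are Unix timestamps in seconds.
    """
    query = parse_qs(urlsplit(url).query)
    start = datetime.fromtimestamp(int(query["period1"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(query["period2"][0]), tz=timezone.utc)
    return start, end


start, end = decode_period(HISTORY_URL)
print(start.isoformat(), end.isoformat())
```

If the decoded range does not match the period you want, adjust the time period on the page before downloading.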

Here, we have downloaded the dataset for the years 2007-2016, which means we have 10 years of data for DJIA index prices. You can see this in Figure 2.1, as well. You can find this dataset using this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/data/DJIA_data.csv.
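Once the .csv file is downloaded, it can be read with a few lines of Python. The following is only a sketch: it assumes the file uses the column names of a typical Yahoo Finance export (Date, Open, High, Low, Close, Adj Close, Volume), which you should check against your own file. The function name load_prices is hypothetical, and the two sample rows are placeholder values, not real DJIA prices.

```python
import csv
import io
from datetime import date

# Placeholder rows for illustration only; these are NOT real DJIA prices.
# The header is an assumption based on a typical Yahoo Finance export.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2007-01-04,101.0,103.0,100.0,102.5,102.5,1200
2007-01-03,100.0,102.0,99.0,101.0,101.0,1000
"""


def load_prices(handle):
    """Parse price rows from an open CSV file and sort them by date."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "date": date.fromisoformat(record["Date"]),
            "close": float(record["Close"]),
            "adj_close": float(record["Adj Close"]),
            "volume": int(record["Volume"]),
        })
    rows.sort(key=lambda row: row["date"])
    return rows


prices = load_prices(io.StringIO(SAMPLE_CSV))
print(len(prices), prices[0]["date"], prices[-1]["date"])
```

To read the real file, open it with open("DJIA_data.csv", newline="") and pass the file object to load_prices instead of the in-memory sample.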

Just bear with me for a while; we will go over the meaning of each data attribute in the Understand the dataset section of this chapter. Now, let's look at how we can collect the news articles.

Collecting news articles

We want to collect news articles so that we can establish how news correlates with the DJIA index value. We are going to perform sentiment analysis on the news articles. You may wonder why we need sentiment analysis. If news about the financial market is negative, then it is likely that stock prices will go down, and if the news is positive, then it is likely that stock prices will go up. For this dataset, we will use news articles from the New York Times (NYTimes). In order to collect the news articles, we will use the New York Times' developer API. So, let's start coding!

First of all, register yourself on the NYTimes developer website and generate your API key. The link is https://developer.nytimes.com/signup. I have generated the API key for the Archive API. Here, we are using the newsapi, json, requests, and sys dependencies. You can also refer to the NYTimes developer documentation using this link: https://developer.nytimes.com/archive_api.json#/Documentation/GET/%7Byear%7D/%7Bmonth%7D.json.

You can find the code at this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/getdata_NYtimes.py. You can see the code snippet in the following screenshot:

Figure 2.2: Code snippet for getting the news article data from the New York Times

As you can see in the code, there are three functions. The first two handle exceptions, and the third validates its input and requests the URL that returns the news article data for us. This NYTimes API URL takes three parameters, which are given as follows:

Year
Month
API key
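The three parameters above can be combined into a request URL along the following lines. This is a sketch and not the code from the repository: the function name build_archive_url and its validation rules are hypothetical, and the URL pattern follows the Archive API documentation ({year}/{month}.json with an api-key query parameter), so check it against the current documentation before relying on it.

```python
from urllib.parse import urlencode

# URL pattern of the Archive API; verify against the current NYTimes docs.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def build_archive_url(year, month, api_key):
    """Validate the three parameters and return the request URL.

    Hypothetical helper for illustration; the validation rules are assumptions.
    """
    if not isinstance(year, int) or year <= 0:
        raise ValueError("year must be a positive integer")
    if not isinstance(month, int) or not 1 <= month <= 12:
        raise ValueError("month must be an integer between 1 and 12")
    if not api_key:
        raise ValueError("api_key must not be empty")
    base = ARCHIVE_URL.format(year=year, month=month)
    return base + "?" + urlencode({"api-key": api_key})


# "YOUR_API_KEY" is a placeholder; use the key generated on the developer site.
print(build_archive_url(2016, 1, "YOUR_API_KEY"))
```

Validating before sending the request keeps obviously bad input, such as month 13 or a missing key, from using up an API call.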

After this step, we call the third function for each year from 2007 to 2016 and save the data in JSON format. You can refer to the code snippet in the following screenshot:

Figure 2.3: Code snippet for getting news article data from the New York Times
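The collection loop can be sketched as follows. It assumes one file per year and month, named like the 2016-01.json file in the repository. The names save_monthly_archives and placeholder_fetch are hypothetical: the fetch argument stands in for whatever function performs the real API request, and the placeholder used here returns dummy data so that the example runs without a network connection or an API key.

```python
import json
import tempfile
from pathlib import Path


def save_monthly_archives(fetch, out_dir, years, months=range(1, 13)):
    """Call fetch(year, month) for each pair and save the result as JSON.

    fetch is any callable returning JSON-serializable data (an assumption
    made here so that the network code stays separate from the loop).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for year in years:
        for month in months:
            payload = fetch(year, month)
            target = out_dir / f"{year}-{month:02d}.json"
            with target.open("w", encoding="utf-8") as handle:
                json.dump(payload, handle)
            saved.append(target)
    return saved


def placeholder_fetch(year, month):
    """Stand-in for the real API call; returns dummy data, not real articles."""
    return {"year": year, "month": month, "docs": []}


with tempfile.TemporaryDirectory() as tmp:
    saved = save_monthly_archives(placeholder_fetch, tmp, years=[2016], months=[1, 2])
    names = [path.name for path in saved]
    first = json.loads(saved[0].read_text(encoding="utf-8"))

print(names)
```

For the full dataset, pass years=range(2007, 2017) and a fetch function that sends the real request and returns the parsed response.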

You can find the raw JSON dataset using this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/data/2016-01.json.

Now let's move on to the next section, in which we will understand the dataset and the attributes that we have collected so far.