Let's fetch the headline data from the API provided by the Guardian, as follows:
from bs4 import BeautifulSoup
import urllib.request
import json

dates = []
titles = []
for i in range(100):
    try:
        url = ('https://content.guardianapis.com/search?from-date=2010-01-01'
               '&section=business&page-size=200&order-by=newest'
               '&page=' + str(i+1) + '&q=amazon&api-key=207b6047-a2a6-4dd2-813b-5cd006b780d7')
        response = urllib.request.urlopen(url)
        encoding = response.info().get_content_charset('utf8')
        data = json.loads(response.read().decode(encoding))
        for j in range(len(data['response']['results'])):
            dates.append(data['response']['results'][j]['webPublicationDate'])
            titles.append(data['response']['results'][j]['webTitle'])
    except:
        # Stop fetching once a page request fails or no more results are returned
        break
Once the titles and dates are extracted, we shall preprocess the data to convert the date values into a datetime format, as follows:
import pandas as pd

# Combine the dates and titles into a single DataFrame
data = pd.DataFrame({'date': dates, 'title': titles})

# Keep only the YYYY-MM-DD part of the timestamp and convert it to datetime
data['date'] = data['date'].str[:10]
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')

# Sort chronologically and keep one headline per date
data = data.sort_values(by='date')
data_final = data.groupby('date').first().reset_index()
Now that we have the most recent headline for every date on which we are trying to predict the stock price, we will integrate the two data sources, as follows:
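A minimal sketch of this merge is shown below. The stock price data is not defined in this section, so the DataFrame name data2 (with a date column) is an assumption; the merged result is stored in data3, which the following steps use:

# data2 is assumed to hold the stock price data with a 'date' column
# (the name is illustrative); data_final holds the daily headlines
data3 = pd.merge(data2, data_final, on='date', how='left')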
Once the datasets are merged, we will go ahead and normalize the text data by performing the following steps:
Convert all words in a text into lowercase so that words such as Text and text are treated the same.
Remove punctuation so that words such as text. and text are treated the same.
Remove stop words such as a, and, the, which do not add much context to the text:
import nltk
import re

nltk.download('stopwords')
stop = nltk.corpus.stopwords.words('english')

def preprocess(text):
    text = str(text)
    text = text.lower()
    # Replace anything that is not alphanumeric with a space
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)
    # Drop stop words and rejoin the remaining words into a single string
    words = [w for w in text.split() if w not in stop]
    return ' '.join(words)

data3['title'] = data3['title'].apply(preprocess)
Replace all the null values in the title column with a hyphen (-):
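A one-line sketch, assuming the merged DataFrame is named data3 as above:

# Fill days that have no headline with a hyphen
data3['title'] = data3['title'].fillna('-')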
Now that we have preprocessed the text data, let's assign an ID to each word. Once this is done, we can perform text analysis in much the same way as we did in the Categorizing news articles into topics section, as follows:
docs = data3['title'].values
from collections import Counter

# Count how often each word appears across all headlines
counts = Counter()
for i, review in enumerate(docs):
    counts.update(review.split())

# Sort the vocabulary by frequency and map each word to an integer ID (starting at 1)
words = sorted(counts, key=counts.get, reverse=True)
vocab_size = len(words)
word_to_int = {word: i for i, word in enumerate(words, 1)}
Given that we have assigned an ID to every word, let's replace each word in the original text with its corresponding ID:
encoded_docs = []
for doc in docs:
    encoded_docs.append([word_to_int[word] for word in doc.split()])
Next, we one-hot encode each document so that every headline becomes a fixed-length vector of width vocab_size + 1, as follows:

import numpy as np

def vectorize_sequences(sequences, dimension=vocab_size):
    # One-hot encode each document: set the position of every word ID to 1
    results = np.zeros((len(sequences), dimension + 1))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

vectorized_docs = vectorize_sequences(encoded_docs)
Now that we have encoded the texts, let's look at how we will combine the two data sources in a single model.
First, we shall prepare the training and test datasets, as follows:
# x and y are assumed to be the stock price inputs and targets prepared earlier
x1 = np.array(x)
x2 = np.array(vectorized_docs[5:])
y = np.array(y)
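These arrays can then be split chronologically into training and test sets; the 80/20 ratio below is an illustrative assumption rather than the exact split used here:

# Chronological train/test split (the 80/20 ratio is an assumption)
split = int(0.8 * len(y))
x1_train, x1_test = x1[:split], x1[split:]
x2_train, x2_test = x2[:split], x2[split:]
y_train, y_test = y[:split], y[split:]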
Typically, we would use the functional API when multiple inputs or multiple outputs are expected. In this case, given that there are multiple inputs, we will leverage the functional API.
Essentially, the functional API does away with the strictly sequential way of building a model: we take the vectorized documents as one input and extract an output from them, do the same for the stock price input, and then merge the two branches, as shown in the sketch below:
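The following is a minimal sketch of such a two-input model built with the Keras functional API; the layer sizes, activations, optimizer, and number of epochs are illustrative assumptions rather than the exact architecture used to produce the results quoted below:

from keras.layers import Input, Dense, concatenate
from keras.models import Model

# Branch 1: the one-hot encoded headlines (width vocab_size + 1)
input_text = Input(shape=(vocab_size + 1,))
text_branch = Dense(32, activation='relu')(input_text)    # layer size is an assumption

# Branch 2: the stock price inputs; assumes x1 is a 2-D array of price features
input_price = Input(shape=(x1.shape[1],))
price_branch = Dense(32, activation='relu')(input_price)  # layer size is an assumption

# Merge the two branches and predict the stock price
merged = concatenate([text_branch, price_branch])
output = Dense(1, activation='linear')(merged)

model = Model(inputs=[input_text, input_price], outputs=output)
model.compile(optimizer='adam', loss='mean_squared_error')

# Epochs and batch size are illustrative
model.fit([x2_train, x1_train], y_train,
          validation_data=([x2_test, x1_test], y_test),
          epochs=100, batch_size=32, verbose=1)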
The preceding code results in a mean squared error of ~5000 and clearly shows that the model overfits, as the training dataset loss is much lower than the test dataset loss.
Potentially, the overfitting is a result of a very high number of dimensions in the vectorized text data. We will look at how we can improve upon this in Chapter 11, Building a Recurrent Neural Network.