
How to do it...

The previous strategy is coded as follows:

  1. Let's fetch the headline data from the API provided by the Guardian, as follows:
from bs4 import BeautifulSoup
import urllib.request, json

dates = []
titles = []
# Page through the Guardian search API, collecting headline dates and titles
for i in range(100):
    try:
        url = 'https://content.guardianapis.com/search?from-date=2010-01-01&section=business&page-size=200&order-by=newest&page='+str(i+1)+'&q=amazon&api-key=207b6047-a2a6-4dd2-813b-5cd006b780d7'
        response = urllib.request.urlopen(url)
        encoding = response.info().get_content_charset('utf8')
        data = json.loads(response.read().decode(encoding))
        for j in range(len(data['response']['results'])):
            dates.append(data['response']['results'][j]['webPublicationDate'])
            titles.append(data['response']['results'][j]['webTitle'])
    except:
        # Stop once a page request fails (for example, when no more pages are available)
        break
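As a quick check (an addition to the recipe), we can confirm how many headlines were collected; the exact counts depend on what the Guardian API returns when the query is run:

# Both lists grow together, so their lengths should match
print(len(dates), len(titles))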
  2. Once titles and dates are extracted, we shall preprocess the data to convert the date values to a date format, as follows:
import pandas as pd

# Build a DataFrame with explicit 'date' and 'title' columns
data = pd.DataFrame({'date': dates, 'title': titles})
data['date'] = data['date'].str[:10]
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.sort_values(by='date')
data_final = data.groupby('date').first().reset_index()
  3. Now that we have the most recent headline for every date on which we are trying to predict the stock price, we will integrate the two data sources, as follows:
# data2 is the stock price DataFrame prepared earlier in this recipe
data2['Date'] = pd.to_datetime(data2['Date'], format='%Y-%m-%d')
data3 = pd.merge(data2, data_final, left_on='Date', right_on='date', how='left')
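Given that we performed a left join, any date for which no headline was fetched will have a null title in the merged DataFrame. A quick check (an addition to the recipe) shows how many such dates exist before we deal with them in a later step:

# Count the dates that have no matching headline after the merge
print(data3['title'].isnull().sum())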
  4. Once the datasets are merged, we will go ahead and normalize the text data, as follows:
    • Convert all the words in a title to lowercase so that words such as Text and text are treated the same.
    • Remove punctuation so that words such as text. and text are treated the same.
    • Remove stop words, such as a, and, and the, which do not add much context to the text:
import nltk
import re
nltk.download('stopwords')
stop = nltk.corpus.stopwords.words('english')

def preprocess(text):
    text = str(text)
    text = text.lower()
    # Replace anything that is not a letter or a digit with a space
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)
    words = text.split()
    words2 = [w for w in words if w not in stop]
    words4 = ' '.join(words2)
    return words4

data3['title'] = data3['title'].apply(preprocess)
  5. Replace all the null values in the title column with a hyphen, -, and prefix the remaining titles with a hyphen as well:
import numpy as np

data3['title'] = np.where(data3['title'].isnull(), '-', '-' + data3['title'])

Now that we have preprocessed the text data, let's assign an ID to each word. Once we have finished this assignment, we can perform text analysis in a way that is very similar to what we did in the Categorizing news articles into topics section, as follows:

docs = data3['title'].values

from collections import Counter
counts = Counter()
for i, review in enumerate(docs):
    counts.update(review.split())
words = sorted(counts, key=counts.get, reverse=True)
vocab_size = len(words)
word_to_int = {word: i for i, word in enumerate(words, 1)}
  6. Given that we have encoded all the words, let's replace the words in the original text with their corresponding IDs, and then vectorize the encoded documents:
encoded_docs = []
for doc in docs:
    encoded_docs.append([word_to_int[word] for word in doc.split()])

# Convert each encoded document into a one-hot vector of length vocab_size + 1
def vectorize_sequences(sequences, dimension=vocab_size):
    results = np.zeros((len(sequences), dimension + 1))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

vectorized_docs = vectorize_sequences(encoded_docs)

Now that we have encoded the texts, let's look at how we will integrate the two data sources.

  7. First, we shall prepare the training and test datasets, as follows:
# x and y are assumed to hold the previous five days' stock prices and the target
# prices prepared earlier; the headline vectors are offset by 5 to align with them
x1 = np.array(x)
x2 = np.array(vectorized_docs[5:])
y = np.array(y)

X1_train = x1[:2100,:]
X2_train = x2[:2100, :]
y_train = y[:2100]
X1_test = x1[2100:,:]
X2_test = x2[2100:,:]
y_test = y[2100:]

Typically, we would use the functional API when multiple inputs or multiple outputs are expected. In this case, given that there are multiple inputs, we will be leveraging the functional API.

  8. Essentially, the functional API replaces the sequential process of building the model with explicitly defined connections between layers. First, take the vectorized documents as input and extract an output from them:
from keras.layers import Input, Dense
from keras.models import Model

input1 = Input(shape=(2406,))
input1_hidden = Dense(100, activation='relu')(input1)
input1_output = Dense(1, activation='tanh')(input1_hidden)

In the preceding code, note that we have not used the sequential modeling process; instead, we have defined the connections by calling each Dense layer on the output of the layer before it.

Note that the input has a shape of 2406, as there are 2406 unique words that remain after the filtering process.
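The value 2406 is specific to the headlines that the query returned when this recipe was written; a fresh run of the Guardian query may yield a different vocabulary size. As a more robust alternative (a sketch that reuses the variables defined above rather than the hard-coded value), the input dimension can be taken from the vectorized documents themselves:

# vectorized_docs has vocab_size + 1 columns, as built by vectorize_sequences
input_dim = vectorized_docs.shape[1]
input1 = Input(shape=(input_dim,))
input1_hidden = Dense(100, activation='relu')(input1)
input1_output = Dense(1, activation='tanh')(input1_hidden)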
  9. Take the previous five stock prices as input and build the second branch of the model:
input2 = Input(shape=(5,))
input2_hidden = Dense(100, activation='relu')(input2)
input2_output = Dense(1, activation='linear')(input2_hidden)
  10. We will multiply the outputs of the two branches:
from keras.layers import multiply

out = multiply([input1_output, input2_output])
  11. Now that we have defined the output, we will build the model, as follows:
model = Model([input1, input2], out)
model.summary()

Note that, in the preceding step, we used the Model class to define the inputs (passed as a list) and the output.

A visualization of the preceding model's architecture can be generated as shown below.
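This sketch uses Keras's plot_model utility (not part of the original steps); it assumes that the pydot and graphviz packages are installed:

from keras.utils import plot_model

# Writes a diagram of the two input branches and the multiplied output to disk
plot_model(model, to_file='model.png', show_shapes=True)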

  12. Compile and fit the model:
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x=[X2_train, X1_train], y=y_train, epochs=100, batch_size=32,
          validation_data=([X2_test, X1_test], y_test))

The preceding code results in a mean squared error of ~5000 and clearly shows that the model overfits, as the training dataset loss is much lower than the test dataset loss.
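To see this gap for ourselves, we can assign the output of model.fit in the preceding step to a variable, say history (a small, assumed change to the original code), and plot the loss recorded for each epoch on the training and test datasets:

import matplotlib.pyplot as plt

# Assumes the previous step was written as:
# history = model.fit(x=[X2_train, X1_train], y=y_train, epochs=100, batch_size=32,
#                     validation_data=([X2_test, X1_test], y_test))
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='test loss')
plt.title('Training versus test loss')
plt.xlabel('Epoch')
plt.ylabel('Mean squared error')
plt.legend()
plt.show()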

Potentially, the overfitting is a result of a very high number of dimensions in the vectorized text data. We will look at how we can improve upon this in Chapter 11, Building a Recurrent Neural Network.
