
How to do it...

The previous strategy is coded as follows:

  1. Let's fetch the headline data from the API provided by the Guardian, as follows:
from bs4 import BeautifulSoup
import urllib.request, json

dates = []
titles = []
for i in range(100):
    try:
        url = 'https://content.guardianapis.com/search?from-date=2010-01-01&section=business&page-size=200&order-by=newest&page='+str(i+1)+'&q=amazon&api-key=207b6047-a2a6-4dd2-813b-5cd006b780d7'
        response = urllib.request.urlopen(url)
        encoding = response.info().get_content_charset('utf8')
        data = json.loads(response.read().decode(encoding))
        for j in range(len(data['response']['results'])):
            dates.append(data['response']['results'][j]['webPublicationDate'])
            titles.append(data['response']['results'][j]['webTitle'])
    except Exception:
        # Stop once a page request fails (typically when there are no more pages)
        break
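
For reference, the loop above relies on only two fields of the JSON returned by the Guardian API. The following hand-written sketch (not an actual API response) illustrates just the structure that the code reads:
# Illustrative only: the parts of the response that the loop accesses
sample_response = {
    'response': {
        'results': [
            {'webPublicationDate': '2010-01-04T10:30:00Z',
             'webTitle': 'An example headline mentioning Amazon'},
            # ... up to 200 results per page
        ]
    }
}
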
  2. Once the titles and dates have been extracted, we will preprocess the data to convert the date values into a date format, as follows:
import pandas as pd
# Build a DataFrame with one row per headline
data = pd.DataFrame({'date': dates, 'title': titles})
data['date'] = data['date'].str[:10]
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.sort_values(by='date')
data_final = data.groupby('date').first().reset_index()
  3. Now that we have the most recent headline for every date on which we are trying to predict the stock price, we will integrate the two data sources, as follows:
# data2 is the stock price dataset (with a 'Date' column) prepared earlier
data2['Date'] = pd.to_datetime(data2['Date'], format='%Y-%m-%d')
data3 = pd.merge(data2, data_final, left_on='Date', right_on='date', how='left')
  4. Once the datasets are merged, we will go ahead and normalize the text data by doing the following:
    • Convert all the words in a text to lowercase so that words such as Text and text are treated the same.
    • Remove punctuation so that words such as text. and text are treated the same.
    • Remove stop words, such as a, and, and the, which do not add much context to the text:
import nltk
import re
nltk.download('stopwords')
stop = nltk.corpus.stopwords.words('english')

def preprocess(text):
    text = str(text)
    text = text.lower()
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)
    words = text.split()
    words = [w for w in words if w not in stop]
    return ' '.join(words)

data3['title'] = data3['title'].apply(preprocess)
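
As a quick illustration of all three normalizations at once, calling preprocess on a made-up headline (the input string is purely an example) behaves as follows:
# Lowercasing, punctuation removal, and stop-word removal in a single call
print(preprocess('Amazon, profits and the cloud.'))
# prints: amazon profits cloud
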
  5. Replace all the null values in the title column with a hyphen, -:
import numpy as np
# Dates with no headline get '-'; the remaining titles are prefixed with '-'
data3['title'] = np.where(data3['title'].isnull(), '-', '-'+data3['title'])

Now that we have preprocessed the text data, let's assign an ID to each word. Once we have finished this assignment, we can perform text analysis in a way that is very similar to what we did in the Categorizing news articles into topics section, as follows:

docs = data3['title'].values

from collections import Counter
counts = Counter()
for i, review in enumerate(docs):
    counts.update(review.split())
words = sorted(counts, key=counts.get, reverse=True)
vocab_size=len(words)
word_to_int = {word: i for i, word in enumerate(words, 1)}
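
To get a feel for the mapping, we can optionally inspect the vocabulary size and the IDs of the most frequent words; this check is purely illustrative and uses only the variables defined above:
# The most frequent words receive the smallest IDs, because `words`
# is sorted by frequency before being enumerated
print('vocabulary size:', vocab_size)
print({word: word_to_int[word] for word in words[:5]})
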
  6. Given that we have encoded all the words, let's replace each word in the original text with its corresponding ID:
encoded_docs = []
for doc in docs:
    encoded_docs.append([word_to_int[word] for word in doc.split()])

def vectorize_sequences(sequences, dimension=vocab_size):
    results = np.zeros((len(sequences), dimension+1))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

vectorized_docs = vectorize_sequences(encoded_docs)
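
Each row of vectorized_docs is a binary bag-of-words vector of length vocab_size + 1, with a 1 at every word ID that occurs in the corresponding headline. A quick, optional check of this (again using only the variables defined above) is as follows:
# One row per headline, with vocab_size + 1 columns
print(vectorized_docs.shape)
# Number of distinct words present in the first headline
print(int(vectorized_docs[0].sum()))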

Now that we have encoded the texts, let's look at how we will integrate the two data sources.

  7. First, we shall prepare the training and test datasets, as follows:
# x holds the previous five stock prices for each date and y the price to be
# predicted; both were prepared earlier when building the stock price model
x1 = np.array(x)
# Offset the headlines by 5 so that they line up with the five-day price windows
x2 = np.array(vectorized_docs[5:])
y = np.array(y)

X1_train = x1[:2100, :]
X2_train = x2[:2100, :]
y_train = y[:2100]
X1_test = x1[2100:, :]
X2_test = x2[2100:, :]
y_test = y[2100:]

Typically, we use the functional API when multiple inputs or multiple outputs are expected. Given that this model takes multiple inputs, we will be leveraging the functional API.

  8. Essentially, the functional API does away with the sequential process of building a model. Take the vectorized documents as input and extract an output from them, as follows:
from keras.layers import Input, Dense

input1 = Input(shape=(2406,))
input1_hidden = (Dense(100, activation='relu'))(input1)
input1_output = (Dense(1, activation='tanh'))(input1_hidden)

In the preceding code, note that we have not used the sequential modeling process; instead, we defined the connections between layers by calling each Dense layer on the output of the layer before it.

Note that the input has a shape of 2406, as there are 2406 unique words that remain after the filtering process.
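
For contrast, the text branch on its own could also have been written with the Sequential API. The following sketch is purely illustrative and is not part of the recipe; it shows what the functional calls above replace, and why the Sequential API does not suffice here (it cannot express two input branches that are merged later):
# Illustrative only: the same text branch expressed with the Sequential API
from keras.models import Sequential
from keras.layers import Dense

text_branch = Sequential()
text_branch.add(Dense(100, activation='relu', input_shape=(2406,)))
text_branch.add(Dense(1, activation='tanh'))
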
  9. Take the input of the previous five stock prices and build the model:
input2 = Input(shape=(5,))
input2_hidden = (Dense(100, activation='relu'))(input2)
input2_output = (Dense(1, activation='linear'))(input2_hidden)
  10. We will multiply the outputs of the two branches:
from keras.layers import multiply
out = multiply([input1_output, input2_output])
  11. Now that we have defined the output, we will build the model as follows:
from keras.models import Model

model = Model([input1, input2], out)
model.summary()

Note that, in the preceding step, we used the Model class to define the inputs (passed as a list) and the output.

A visualization of the preceding model can be produced as follows:
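One way to generate such a visualization is Keras's plot_model utility; the sketch below assumes that the pydot and graphviz packages are installed, and the output file name is arbitrary:
# Requires pydot and graphviz; writes the architecture diagram to disk
from keras.utils import plot_model
plot_model(model, to_file='headline_stock_model.png', show_shapes=True)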

  12. Compile and fit the model:
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x=[X2_train, X1_train], y=y_train, epochs=100, batch_size=32,
          validation_data=([X2_test, X1_test], y_test))

The preceding code results in a mean squared error of ~5000 and clearly shows that the model overfits, as the training dataset loss is much lower than the test dataset loss.

Potentially, the overfitting is a result of a very high number of dimensions in the vectorized text data. We will look at how we can improve upon this in Chapter 11, Building a Recurrent Neural Network.
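
To see the overfitting directly, we can plot the training and validation losses that Keras records during training. This is a minimal sketch, assuming that matplotlib is installed and that the model.fit call above is assigned to a variable, for example history = model.fit(...):
import matplotlib.pyplot as plt

# history = model.fit(...) from the compile-and-fit step above
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()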
