Neural Networks with Keras Cookbook
V Kishore Ayyadevara
How to do it...
The previous strategy is coded as follows:
- Let's fetch the headline data from the API provided by the Guardian, as follows:
import urllib.request
import json

dates = []
titles = []
for i in range(100):
    try:
        url = 'https://content.guardianapis.com/search?from-date=2010-01-01&section=business&page-size=200&order-by=newest&page=' + str(i+1) + '&q=amazon&api-key=207b6047-a2a6-4dd2-813b-5cd006b780d7'
        response = urllib.request.urlopen(url)
        encoding = response.info().get_content_charset('utf8')
        data = json.loads(response.read().decode(encoding))
        # Collect the publication date and title of every article on this page
        for j in range(len(data['response']['results'])):
            dates.append(data['response']['results'][j]['webPublicationDate'])
            titles.append(data['response']['results'][j]['webTitle'])
    except Exception:
        # Stop once the API stops returning pages
        break
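As a quick, optional sanity check, we can confirm how many headlines came back and inspect the first one (the exact counts will vary with the API's contents):
print(len(titles), 'headlines fetched')
print(dates[0], '->', titles[0])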
- Once titles and dates are extracted, we shall preprocess the data to convert the date values to a date format, as follows:
import pandas as pd

# Build a DataFrame with one row per headline
data = pd.DataFrame({'date': dates, 'title': titles})
# Keep only the YYYY-MM-DD part of the timestamp
data['date'] = data['date'].str[:10]
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.sort_values(by='date')
# Keep one headline per date (the API returns newest first)
data_final = data.groupby('date').first().reset_index()
- Now that we have the most recent headline for every date on which we are trying to predict the stock price, we will integrate the two data sources, as follows:
# data2 is the stock price DataFrame prepared in the earlier steps
data2['Date'] = pd.to_datetime(data2['Date'], format='%Y-%m-%d')
data3 = pd.merge(data2, data_final, left_on='Date', right_on='date', how='left')
- Once the datasets are merged, we will normalize the text data as follows:
- Convert all the words in a headline to lowercase, so that words such as Text and text are treated the same.
- Remove punctuation, so that words such as text. and text are treated the same.
- Remove stop words, such as a, and, and the, which do not add much context to the text:
import nltk
import re
nltk.download('stopwords')
stop = nltk.corpus.stopwords.words('english')

def preprocess(text):
    text = str(text)
    text = text.lower()
    # Replace every run of non-alphanumeric characters with a space
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)
    words = text.split()
    # Drop the stop words
    words2 = [w for w in words if w not in stop]
    return ' '.join(words2)

data3['title'] = data3['title'].apply(preprocess)
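As a quick check, applying preprocess to a hypothetical headline shows all three normalizations at work (the exact output depends on NLTK's stop word list):
print(preprocess("Amazon's new warehouse, and the future of retail"))
# prints something along the lines of: amazon new warehouse future retail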
- Replace all the null values in the title column with a hyphen -:
import numpy as np
# Fill missing titles with '-', and prefix all other titles with '-'
data3['title'] = np.where(data3['title'].isnull(), '-', '-' + data3['title'])
- Now that we have preprocessed the text data, let's assign an ID to each word. Once we have finished this assignment, we can perform text analysis in a way that is very similar to what we did in the Categorizing news articles into topics section, as follows:
docs = data3['title'].values

from collections import Counter
counts = Counter()
for i, review in enumerate(docs):
    counts.update(review.split())

# Assign IDs in decreasing order of frequency; the most frequent word gets ID 1
words = sorted(counts, key=counts.get, reverse=True)
vocab_size = len(words)
word_to_int = {word: i for i, word in enumerate(words, 1)}
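As an optional check, the most frequent word should map to ID 1:
print(words[:5])             # the five most frequent words
print(word_to_int[words[0]]) # prints 1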
- Given that we have encoded all the words, let's replace each word in the original text with its corresponding ID:
encoded_docs = []
for doc in docs:
    encoded_docs.append([word_to_int[word] for word in doc.split()])

def vectorize_sequences(sequences, dimension=vocab_size):
    # One-hot encode each document: place a 1 in the column of every word ID it contains
    results = np.zeros((len(sequences), dimension + 1))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

vectorized_docs = vectorize_sequences(encoded_docs)
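The resulting matrix has one row per headline and vocab_size + 1 columns (column 0 stays unused, as the word IDs start from 1):
print(vectorized_docs.shape) # (number of headlines, vocab_size + 1)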
Now that we have encoded the texts, let's look at how we will integrate the two data sources.
- First, we shall prepare the training and test datasets, as follows:
# x and y are the arrays of the previous five stock prices and the target
# price, prepared in the earlier steps of this recipe
x1 = np.array(x)
# Skip the first five dates, for which no five previous prices exist
x2 = np.array(vectorized_docs[5:])
y = np.array(y)

X1_train = x1[:2100, :]
X2_train = x2[:2100, :]
y_train = y[:2100]
X1_test = x1[2100:, :]
X2_test = x2[2100:, :]
y_test = y[2100:]
Typically, we use a functional API when multiple inputs or multiple outputs are expected. In this case, given that there are multiple inputs, we will leverage the functional API.
- Essentially, the functional API replaces the sequential process of building the model with explicitly defined connections between layers. Take the vectorized documents as the first input and extract an output from them:
from keras.layers import Input, Dense
from keras.models import Model

# First input: the one-hot encoded headlines (vocab_size + 1 = 2406 columns)
input1 = Input(shape=(2406,))
input1_hidden = Dense(100, activation='relu')(input1)
input1_output = Dense(1, activation='tanh')(input1_hidden)
In the preceding code, note that we have not used the sequential modeling process; instead, we defined each connection by calling a Dense layer on the output of the previous layer.
- Take the input of the previous 5 stock prices and build the model:
# Second input: the previous five stock prices
input2 = Input(shape=(5,))
input2_hidden = Dense(100, activation='relu')(input2)
input2_output = Dense(1, activation='linear')(input2_hidden)
- We will multiply the outputs of the two branches:
from keras.layers import multiply
# Combine the two branches by element-wise multiplication of their outputs
out = multiply([input1_output, input2_output])
- Now that we have defined the output, we will build the model as follows:
model = Model([input1, input2], out)
model.summary()
Note that, in the preceding step, we used the Model class to define the inputs (passed as a list) and the output.
[Figure: model.summary() output and a visualization of the two-input architecture]
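If you want to reproduce such a visualization yourself, Keras ships a plot_model utility (a minimal sketch; it assumes the pydot and graphviz packages are installed):
from keras.utils import plot_model
# Writes a diagram of the two-input architecture to disk
plot_model(model, to_file='model_architecture.png', show_shapes=True)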
- Compile and fit the model:
model.compile(optimizer='adam', loss='mean_squared_error')
# Inputs are passed in the same order as Model([input1, input2]):
# vectorized documents first, then the five previous stock prices.
# We capture the return value so we can inspect the loss curves later.
history = model.fit(x=[X2_train, X1_train], y=y_train,
                    epochs=100, batch_size=32,
                    validation_data=([X2_test, X1_test], y_test))
The preceding code results in a mean squared error of ~5000 and clearly shows that the model overfits, as the training dataset loss is much lower than the test dataset loss.
Potentially, the overfitting is a result of a very high number of dimensions in the vectorized text data. We will look at how we can improve upon this in Chapter 11, Building a Recurrent Neural Network.
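To see the gap for yourself, the following minimal sketch (assuming the fit call above was captured as history, as shown, and that matplotlib is available) plots the training loss against the test loss per epoch:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='test loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()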