import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
SEED = 2017
Load dataset:
data = pd.read_csv('Data/winequality-red.csv', sep=';')
y = data['quality']
X = data.drop(['quality'], axis=1)
Split data for training and testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
Print average quality and first rows of training set:
print('Average quality training set: {:.4f}'.format(y_train.mean()))
X_train.head()
In the following screenshot, we can see an example of the output of the training data:
Figure 2-8: Training data
An important next step is to normalize the input data, for example with the StandardScaler imported earlier.
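The normalization step can be sketched as follows. This is a minimal example under stated assumptions: it uses a small synthetic stand-in for the wine features so that it runs on its own, and it assumes the scaler should be fitted on the training split only, so that no statistics from the test split leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 2017

# Synthetic stand-in for the wine features (assumption: illustration only)
rng = np.random.RandomState(SEED)
X = pd.DataFrame({
    'alcohol': rng.normal(10.0, 1.0, size=200),
    'pH': rng.normal(3.3, 0.15, size=200),
})
y = pd.Series(rng.randint(3, 9, size=200), name='quality')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)
```

Wrapping the result in a DataFrame with the original index keeps the rows aligned with y_train and y_test, and keeps calls such as X_test.values working.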
Before training a network, determine a baseline by predicting the mean quality of the training data for every test input:
# Predict the mean quality of the training data for each test input
print('MSE:', np.mean((y_test - y_train.mean()) ** 2).round(4))
## MSE: 0.594
Now, let's build our neural network by defining the network architecture:
model = Sequential()
# First hidden layer with 200 hidden units
model.add(Dense(200, input_dim=X_train.shape[1], activation='relu'))
# Second hidden layer with 25 hidden units
model.add(Dense(25, activation='relu'))
# Output layer
model.add(Dense(1, activation='linear'))
# Set optimizer
opt = Adam()
# Compile model
model.compile(loss='mse', optimizer=opt, metrics=['accuracy'])
Let's define the callback for early stopping and saving the best model:
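One possible way to set up these callbacks is sketched below. The monitored quantity (val_loss), the patience, the batch size, the number of epochs, and the validation split are all assumptions chosen for illustration, not values from the recipe. The sketch trains a small demo network on synthetic data and writes to a hypothetical demo path so that it runs on its own; in the recipe itself, the checkpoint path would be the file that is loaded in the next step.

```python
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Synthetic stand-in data so the sketch runs on its own (not the wine data)
rng = np.random.RandomState(2017)
X_demo = rng.normal(size=(200, 11)).astype('float32')
y_demo = (5.6 + X_demo[:, 0] + rng.normal(scale=0.1, size=200)).astype('float32')

# Small demo network (assumption: sizes chosen only to keep the run short)
demo_model = Sequential()
demo_model.add(Dense(16, input_dim=X_demo.shape[1], activation='relu'))
demo_model.add(Dense(1, activation='linear'))
demo_model.compile(loss='mse', optimizer='adam')

# Hypothetical demo path; the target directory must exist before training
checkpoint_path = 'checkpoints/demo_best_model.h5'
os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)

callbacks = [
    # Stop once val_loss has not improved for 5 consecutive epochs
    EarlyStopping(monitor='val_loss', patience=5, verbose=0),
    # Overwrite the checkpoint only when val_loss improves
    ModelCheckpoint(checkpoint_path, monitor='val_loss',
                    save_best_only=True, verbose=0),
]

# Hold out 20% of the training data as the validation set for both callbacks
history = demo_model.fit(X_demo, y_demo, batch_size=64, epochs=20,
                         validation_split=0.2, verbose=0,
                         callbacks=callbacks)
epochs_run = len(history.history['val_loss'])
```

Monitoring val_loss is used here because its name is the same across Keras versions; the name of the validation accuracy metric differs between releases.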
We can now print the performance on the test set after loading the optimal weights:
best_model = model
best_model.load_weights('checkpoints/multi_layer_best_model.h5')
best_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# Evaluate on test set
score = best_model.evaluate(X_test.values, y_test, verbose=0)
print('Test accuracy: %.2f%%' % (score[1]*100))
## Test accuracy: 66.25%
## Benchmark accuracy on dataset 62.4%
With a small dataset, it's advisable to retrain on the complete training set (without a validation set) and increase the number of epochs in proportion to the additional data. Another option is to use cross-validation and average the results when making predictions.
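Both options can be sketched in a few lines. This is a hedged illustration, not the recipe's own code: scaled_epochs and cv_average_predict are hypothetical helper names, the data is synthetic, and scikit-learn's LinearRegression stands in for the neural network so that the example runs quickly. Any object with fit and predict methods could be passed in its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def scaled_epochs(best_epoch, validation_fraction):
    """Scale the epoch count for retraining on the full training set."""
    return int(round(best_epoch / (1.0 - validation_fraction)))


def cv_average_predict(make_model, X_train, y_train, X_test,
                       n_splits=5, seed=2017):
    """Train one model per fold and average their test predictions."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_predictions = []
    for train_idx, _ in kfold.split(X_train):
        model = make_model()  # fresh model for every fold
        model.fit(X_train[train_idx], y_train[train_idx])
        fold_predictions.append(np.ravel(model.predict(X_test)))
    return np.mean(fold_predictions, axis=0)


# Synthetic stand-in data (assumption: illustration only)
rng = np.random.RandomState(2017)
X = rng.normal(size=(120, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 5.6 + rng.normal(scale=0.1, size=120)
X_tr, X_te, y_tr, y_te = X[:100], X[100:], y[:100], y[100:]

# LinearRegression is a stand-in for the network in this sketch
pred = cv_average_predict(LinearRegression, X_tr, y_tr, X_te)
print('MSE of averaged predictions:', np.mean((pred - y_te) ** 2).round(4))

# If early stopping picked epoch 80 with 20% of the data held out,
# retraining on all of the data would use 80 / 0.8 = 100 epochs
print('Epochs for full retraining:', scaled_epochs(80, 0.2))
```

Averaging over folds tends to reduce the variance of the predictions, at the cost of training one model per fold.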