
Loading the dataset

While maybe not the most fun part of a machine learning problem, loading the data is an important step.  I'm going to cover my data loading methodology here so that you can get a feel for how I handle loading a dataset.

from sklearn.preprocessing import StandardScaler
import pandas as pd

TRAIN_DATA = "./data/train/train_data.csv"
VAL_DATA = "./data/val/val_data.csv"
TEST_DATA = "./data/test/test_data.csv"

def load_data():
    """Loads the train, val, and test datasets from disk."""
    train = pd.read_csv(TRAIN_DATA)
    val = pd.read_csv(VAL_DATA)
    test = pd.read_csv(TEST_DATA)

    # Use sklearn's StandardScaler to scale the data to zero mean, unit variance.
    scaler = StandardScaler()
    train = scaler.fit_transform(train)
    val = scaler.transform(val)
    test = scaler.transform(test)

    # Use a dict to keep all of this data tidy.
    data = dict()

    # The target, alcohol, is the last of the 10 columns (index 9);
    # columns 0 through 8 are the features.
    data["train_y"] = train[:, 9]
    data["train_X"] = train[:, 0:9]
    data["val_y"] = val[:, 9]
    data["val_X"] = val[:, 0:9]
    data["test_y"] = test[:, 9]
    data["test_X"] = test[:, 0:9]

    # It's a good idea to keep the scaler (or at least its mean/variance)
    # so we can unscale predictions later.
    data["scaler"] = scaler
    return data

When I'm reading data from a CSV file, Excel, or even a DBMS, my first step is usually to load it into a pandas DataFrame.
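To make that first step concrete, here's a minimal, self-contained sketch. The column names and values are hypothetical, and I use an in-memory `StringIO` buffer as a stand-in for a file on disk; `pd.read_csv` accepts a path or any file-like object, so the same call works against `./data/train/train_data.csv`.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical columns).
csv_text = """fixed_acidity,residual_sugar,alcohol
7.0,20.7,8.8
6.3,1.6,9.5
8.1,6.9,10.1
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (3, 3)
print(list(df.columns))  # ['fixed_acidity', 'residual_sugar', 'alcohol']
```

Once the data is in a DataFrame, `df.shape`, `df.head()`, and `df.describe()` are quick ways to sanity-check what you loaded before going any further.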

It's important to normalize our data so that each feature is on a comparable scale, and so that those scales fall within the bounds of our activation functions. Here, I used scikit-learn's StandardScaler to accomplish this task.
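Under the hood, StandardScaler just subtracts each column's mean and divides by its standard deviation. The short sketch below, on made-up data, verifies that `fit_transform` matches that formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# fit_transform is equivalent to (X - column mean) / column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))  # True

# After scaling, each column has zero mean and unit variance.
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
```

Because the transformation is just a per-column shift and scale, it's also trivially invertible, which is why keeping the fitted scaler around pays off when we want predictions back in their original units.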

This gives us an overall dataset with shape (4898, 10). Our target variable, alcohol, is given as a percentage between 8% and 14.2%.

I've randomly sampled and divided the data into train, val, and test datasets prior to loading the data, so we don't have to worry about that here.
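If you need to do that split yourself, one common approach (not necessarily the exact one I used) is to apply scikit-learn's `train_test_split` twice: first to carve out the test set, then to split the remainder into train and val. The data below is random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical full dataset: 100 rows, 10 columns.
rng = np.random.RandomState(42)
full = rng.rand(100, 10)

# First carve out 20% as the test set...
train_and_val, test = train_test_split(full, test_size=0.2, random_state=42)
# ...then take 25% of the remaining 80% as val (i.e. 20% of the original).
train, val = train_test_split(train_and_val, test_size=0.25, random_state=42)

print(train.shape, val.shape, test.shape)  # (60, 10) (20, 10) (20, 10)
```

Fixing `random_state` makes the split reproducible, which matters when you want to compare models against the same held-out data across runs.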

Lastly, the load_data() function returns a dictionary that keeps everything tidy and in one place. If you see me reference data["train_X"] later, just know that I'm referencing the training features stored in that dictionary of data.
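Keeping the scaler in that dictionary is what makes unscaling predictions possible later. Since the scaler was fit on all columns at once, you can invert the transform for just the target column using the fitted `mean_` and `scale_` attributes. This sketch uses random stand-in data and assumes, as in the loading code, that the target sits in column 9:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: 50 rows, 10 columns, target in column 9.
rng = np.random.RandomState(0)
raw = rng.rand(50, 10) * 10

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

# A model trained on scaled data produces scaled predictions;
# invert z = (x - mean) / std for the target column only.
target_col = 9
preds_scaled = scaled[:, target_col]  # stand-in for model output
preds = preds_scaled * scaler.scale_[target_col] + scaler.mean_[target_col]

print(np.allclose(preds, raw[:, target_col]))  # True
```

This is why the comment in load_data() suggests keeping the scaler (or at least its mean and variance): without it, a predicted alcohol value stays in z-score units instead of a percentage.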

The code and data for this project are both available on the book's GitHub site (https://github.com/mbernico/deep_learning_quick_reference).
