
Loading the dataset

While loading data may not be the most fun part of a machine learning problem, it's an important step. I'm going to cover my data loading methodology here so that you can get a feel for how I handle loading a dataset.

from sklearn.preprocessing import StandardScaler
import pandas as pd

TRAIN_DATA = "./data/train/train_data.csv"
VAL_DATA = "./data/val/val_data.csv"
TEST_DATA = "./data/test/test_data.csv"

def load_data():
    """Loads train, val, and test datasets from disk."""
    train = pd.read_csv(TRAIN_DATA)
    val = pd.read_csv(VAL_DATA)
    test = pd.read_csv(TEST_DATA)

    # use sklearn's StandardScaler to scale our data to 0 mean, unit variance;
    # fit on train only, then apply the same transform to val and test
    scaler = StandardScaler()
    train = scaler.fit_transform(train)
    val = scaler.transform(val)
    test = scaler.transform(test)

    # we will use a dict to keep all this data tidy
    data = dict()

    # the dataset has 10 columns: features in columns 0-8,
    # target (alcohol) in column 9
    data["train_y"] = train[:, 9]
    data["train_X"] = train[:, 0:9]
    data["val_y"] = val[:, 9]
    data["val_X"] = val[:, 0:9]
    data["test_y"] = test[:, 9]
    data["test_X"] = test[:, 0:9]

    # it's a good idea to keep the scaler (or at least the mean/variance)
    # so we can unscale predictions later
    data["scaler"] = scaler
    return data

When I'm reading data from a CSV file, an Excel spreadsheet, or even a DBMS, my first step is usually loading it into a pandas DataFrame.

It's important to normalize our data so that each feature is on a comparable scale and those scales fall within the bounds of our activation functions. Here, I use scikit-learn's StandardScaler to accomplish this task.
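To make the scaling step concrete, here is a small, self-contained sketch (the toy array is mine, not from the book's dataset) showing what StandardScaler does: it learns each column's mean and standard deviation from the data passed to fit(), then maps values to z-scores, (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two toy features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and unit variance
```

Note that fit_transform() is only called on the training data; the validation and test sets are transformed with the statistics learned from train, which is what load_data() does above.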

This gives us an overall dataset with shape (4898, 10). Our target variable, alcohol, is given as a percentage between 8% and 14.2%.

I've randomly sampled and divided the data into train, val, and test datasets prior to loading the data, so we don't have to worry about that here.
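For completeness, a random split like the one described could be done along these lines; the 80/10/10 ratios and the synthetic stand-in DataFrame are my assumptions, as the book performed this step upstream of load_data().

```python
import numpy as np
import pandas as pd

# stand-in for the full dataset; in practice this would be read from disk
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(100, 10))

# shuffle the rows, then slice into train/val/test
df = df.sample(frac=1.0, random_state=42)
n = len(df)
train = df.iloc[:int(0.8 * n)]
val = df.iloc[int(0.8 * n):int(0.9 * n)]
test = df.iloc[int(0.9 * n):]
```

Each slice would then be written out to its own CSV so that load_data() can read the three files back independently.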

Lastly, the load_data() function returns a dictionary that keeps everything tidy and in one place. If you see me reference data["train_X"] later, just know that I'm referencing the training dataset that I've stored in that dictionary.
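Since the scaler was fit on all 10 columns at once, unscaling just the target column of a model's predictions is easiest with the per-column statistics the scaler stores (scaler.mean_ and scaler.scale_). The helper below is my own sketch of that idea, verified here on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def unscale_target(preds, scaler, target_col=9):
    """Invert standard scaling for one column: x = z * std + mean."""
    return preds * scaler.scale_[target_col] + scaler.mean_[target_col]

# quick check on synthetic data in roughly the alcohol range
rng = np.random.RandomState(0)
raw = rng.rand(20, 10) * 6.0 + 8.0
scaler = StandardScaler().fit(raw)
scaled = scaler.transform(raw)

recovered = unscale_target(scaled[:, 9], scaler)
print(np.allclose(recovered, raw[:, 9]))  # True
```

This is handy at prediction time: the network outputs z-scored alcohol values, and this maps them back to percentages.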

The code and data for this project are both available on the book's GitHub site (https://github.com/mbernico/deep_learning_quick_reference).
