
Normalization or standardization

This technique rescales each feature of the dataset so that it has the properties of a standard normal distribution, that is, a mean of 0 and a standard deviation of 1.

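The way to obtain these properties is to calculate the so-called z-scores of the dataset samples with the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Here, x is a sample value, \mu is the mean of the feature, and \sigma is its standard deviation.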
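The z-score calculation can also be worked through by hand with NumPy. The following is only a minimal sketch: the sample values are hypothetical and chosen for illustration, and the population standard deviation (ddof=0) is used, which is also what scikit-learn's StandardScaler computes.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
z = (x - mu) / sigma

# After standardization, the mean is (numerically) 0 and the
# standard deviation is (numerically) 1.
print(z)
print(z.mean(), z.std())
```

Note that the transformation only shifts and rescales the values; the shape of their distribution is unchanged.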

Let's visualize and practice this new concept with the help of scikit-learn, reading a file from the MPG dataset, which describes city-cycle fuel consumption in miles per gallon and contains the following columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

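from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt

# Load the MPG dataset and list its columns
df = pd.read_csv("data/mpg.csv")
print(df.columns)

# Keep only the two features we want to compare
partialcolumns = df[['acceleration', 'mpg']]

# Fit the scaler (per-column mean and standard deviation), then compute the z-scores
std_scale = preprocessing.StandardScaler().fit(partialcolumns)
df_std = std_scale.transform(partialcolumns)

# Plot the original data (grey triangles) and the standardized data on the same axes
plt.figure(figsize=(10, 8))
plt.scatter(partialcolumns['acceleration'], partialcolumns['mpg'], color="grey", marker='^')
plt.scatter(df_std[:, 0], df_std[:, 1])
plt.show()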

The following picture allows us to compare the non-normalized and normalized data representations:

Depiction of the original dataset and its normalized counterpart.

It's very important to account for denormalizing the resulting data at evaluation time so that the results keep their original meaning. This matters especially when the model is applied to regression, because the predicted values won't be useful unless they are scaled back to the original units.
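As a rough illustration of that denormalizing step, a fitted StandardScaler can map standardized values back to the original units with inverse_transform. This is only a sketch under stated assumptions: the target values are hypothetical, and the "predictions" are simply the standardized targets, standing in for the output of a real regression model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical target values (for example, mpg), used only for illustration.
y = np.array([[18.0], [15.0], [26.0], [32.0], [24.0]])

# Fit the scaler on the target and standardize it.
scaler = StandardScaler().fit(y)
y_std = scaler.transform(y)

# Stand-in for model predictions made in the standardized space.
y_pred_std = y_std.copy()

# Map the predictions back to the original units before evaluating them.
y_pred = scaler.inverse_transform(y_pred_std)
print(y_pred.ravel())
```

Keeping the fitted scaler object around is what makes this possible, since it stores the mean and standard deviation that were learned from the original data.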