
  • Machine Learning Algorithms
  • Giuseppe Bonaccorso
  • 2021-07-02 18:53:30

Data scaling and normalization

A generic dataset (assumed here to be numerical) is made up of values that can be drawn from different distributions and have different scales; sometimes it also contains outliers. A machine learning algorithm can't natively distinguish among these situations, so it's usually preferable to standardize a dataset before processing it. A very common problem is a non-zero mean combined with a variance greater than one. The following figure compares a raw dataset with the same dataset after scaling and centering:

This result can be achieved using the StandardScaler class:

from sklearn.preprocessing import StandardScaler

>>> ss = StandardScaler()
>>> scaled_data = ss.fit_transform(data)
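
As a rough check of what the class computes, the following sketch compares StandardScaler with a manual per-feature standardization. The synthetic data, the random seed, and the variable name manual are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): 500 samples, 2 features,
# with non-zero means and different spreads.
rng = np.random.default_rng(1000)
data = rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5], size=(500, 2))

ss = StandardScaler()
scaled_data = ss.fit_transform(data)

# Manual equivalent: subtract the per-feature mean and divide by the
# per-feature population standard deviation (ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled_data, manual))
print(scaled_data.mean(axis=0), scaled_data.std(axis=0))
```

After the transformation, each column should have a mean close to zero and a standard deviation close to one, up to floating-point error.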

Whether the scaling process centers the data and scales it to unit variance is controlled by the parameters with_mean=True/False and with_std=True/False (both are enabled by default). If you need scaling that is more robust to outliers, with the option of selecting a quantile range, there's also the RobustScaler class. Here are some examples with different quantile ranges:

from sklearn.preprocessing import RobustScaler

>>> rb1 = RobustScaler(quantile_range=(15, 85))
>>> scaled_data1 = rb1.fit_transform(data)

>>> rb2 = RobustScaler(quantile_range=(25, 75))
>>> scaled_data2 = rb2.fit_transform(data)

>>> rb3 = RobustScaler(quantile_range=(30, 60))
>>> scaled_data3 = rb3.fit_transform(data)
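
To make the role of the quantile range more concrete, here is a minimal sketch that reproduces RobustScaler by hand, assuming its default behavior of centering on the median and dividing by the spread between the two chosen quantiles. The synthetic data with injected outliers and the names median, q_low, q_high, and manual are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic data (assumed for illustration) with a few extreme rows.
rng = np.random.default_rng(1000)
data = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
data[:10] *= 50.0  # turn the first ten samples into outliers

rb = RobustScaler(quantile_range=(25, 75))
scaled = rb.fit_transform(data)

# Manual equivalent: center on the per-feature median and divide by the
# distance between the 25th and 75th percentiles.
median = np.median(data, axis=0)
q_low, q_high = np.percentile(data, [25, 75], axis=0)
manual = (data - median) / (q_high - q_low)

print(np.allclose(scaled, manual))
```

Because the median and the quantiles are barely affected by a handful of extreme values, the bulk of the data keeps a comparable scale even when outliers are present.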

The results are shown in the following figures:

Other options include MinMaxScaler and MaxAbsScaler, which rescale each feature linearly to a given range (the former, [0, 1] by default) or divide each feature by its maximum absolute value (the latter).
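
A minimal sketch of what the two classes compute, using a small hand-made array (an assumption for illustration): MinMaxScaler maps each feature linearly into feature_range, while MaxAbsScaler divides each feature by its largest absolute value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Small hand-made dataset (assumed for illustration): 3 samples, 2 features.
data = np.array([[-2.0, 10.0],
                 [ 0.0, 20.0],
                 [ 4.0, 40.0]])

# Each feature is mapped linearly so that its minimum becomes 0 and its maximum 1.
mm = MinMaxScaler(feature_range=(0, 1))
mm_scaled = mm.fit_transform(data)

# Each feature is divided by its maximum absolute value (4 and 40 here),
# so the results lie in [-1, 1] and zeros stay zeros.
ma = MaxAbsScaler()
ma_scaled = ma.fit_transform(data)

print(mm_scaled)
print(ma_scaled)
```

Neither class discards any samples; both only rescale the values, so the output has the same shape as the input.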

scikit-learn also provides a class for per-sample normalization, Normalizer. It can apply the max, l1, and l2 norms to each sample (row) of a dataset. In a Euclidean space, they are defined in the following way:
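
For a sample $x = (x_1, \ldots, x_n)$, the standard definitions of the three norms can be written as follows (the symbols $x$, $x_i$, and $n$ are notation assumed here):

```latex
\|x\|_{\max} = \max_{i} |x_i|, \qquad
\|x\|_{1} = \sum_{i=1}^{n} |x_i|, \qquad
\|x\|_{2} = \sqrt{\sum_{i=1}^{n} x_i^{2}}
```

Normalizing a sample means dividing each of its components by the chosen norm, so that the norm of the resulting vector equals one.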

An example of each normalization is shown next:

import numpy as np
from sklearn.preprocessing import Normalizer

>>> data = np.array([1.0, 2.0])

>>> n_max = Normalizer(norm='max')
>>> n_max.fit_transform(data.reshape(1, -1))
[[ 0.5, 1. ]]

>>> n_l1 = Normalizer(norm='l1')
>>> n_l1.fit_transform(data.reshape(1, -1))
[[ 0.33333333, 0.66666667]]

>>> n_l2 = Normalizer(norm='l2')
>>> n_l2.fit_transform(data.reshape(1, -1))
[[ 0.4472136 , 0.89442719]]
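
Since Normalizer works per sample rather than per feature, a sketch with several rows may help. The array X and the names manual and row_norms are assumptions for illustration; the manual version divides every row by its own Euclidean norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three samples (rows) with two features each, assumed for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [0.0, 5.0]])

n_l2 = Normalizer(norm='l2')
X_l2 = n_l2.fit_transform(X)

# Manual equivalent: divide each row by its own l2 norm.
manual = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)

# After normalization, every row has unit l2 norm.
row_norms = np.linalg.norm(X_l2, axis=1)

print(np.allclose(X_l2, manual))
print(row_norms)
```

This differs from the scalers shown earlier, which compute their statistics column by column across all samples.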