- Scala for Machine Learning (Second Edition)
- Patrick R. Nicolas
- 2021-07-08 10:43:05
Profiling data
The choice of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.
Immutable statistics
The mean and standard deviation are the most commonly used statistics.
Note
Mean and variance
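Arithmetic mean:
$$\mu = \frac{1}{n}\sum_{i=0}^{n-1} x_i$$
Variance:
$$\sigma^2 = \frac{1}{n}\sum_{i=0}^{n-1} (x_i - \mu)^2$$
Variance adjusted for sampling bias:
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \mu)^2$$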
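The Stats class introduced next computes the bias-adjusted variance from two running sums (the sum of the values and the sum of their squares) rather than from the deviations themselves. The short derivation below sketches why the two forms agree, writing n for the number of observations, x_i for the observations, and μ for their arithmetic mean. It is added here as a worked step and is not part of the original text.

```latex
\begin{aligned}
\sum_{i=0}^{n-1} (x_i - \mu)^2
  &= \sum_{i=0}^{n-1} x_i^2 \;-\; 2\mu \sum_{i=0}^{n-1} x_i \;+\; n\mu^2 \\
  &= \sum_{i=0}^{n-1} x_i^2 \;-\; 2n\mu^2 \;+\; n\mu^2
     \qquad \text{since } \sum_{i=0}^{n-1} x_i = n\mu \\
  &= \sum_{i=0}^{n-1} x_i^2 \;-\; n\mu^2
\end{aligned}
\qquad\Longrightarrow\qquad
\hat{\sigma}^2 = \frac{\sum_{i=0}^{n-1} x_i^2 - n\mu^2}{n-1}
```

The single-pass form is convenient, but it subtracts two potentially large numbers, so it can lose precision when the mean is large relative to the spread of the data.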
Let's extend the MinMax class with some basic statistics capabilities in a new class, Stats:
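import scala.math.sqrt

class Stats[T: ToDouble](values: Vector[T]) extends MinMax[T](values) {
  val zero = (0.0, 0.0)
  // Accumulates (sum of values, sum of squared values) in a single pass.
  // x is used as a Double here, which relies on the ToDouble[T] context bound.
  val sums = values.foldLeft(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  // Variance adjusted for sampling bias (n - 1 denominator)
  lazy val variance = (sums._2 - mean*mean*values.size)/(values.size - 1)
  lazy val stdDev = sqrt(variance)
  …
}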
The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares, stored as the pair sums (line 1). Statistics such as mean and variance are declared lazy, so each is computed only once, the first time it is needed (line 2). The Stats class inherits the normalization functions of MinMax.
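The single-pass accumulation and the lazy evaluation described above can be tried in isolation. The following is a minimal sketch, not the book's class: SimpleStats and StatsDemo are hypothetical names, the class is restricted to Vector[Double], and it does not extend MinMax, so it avoids any assumption about the ToDouble type class.

```scala
// Illustrative sketch only: SimpleStats is a hypothetical, simplified
// stand-in for Stats, restricted to Vector[Double].
final class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "At least two values are needed for the sample variance")

  // Single pass over the data: (sum of values, sum of squared values)
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // Each statistic is evaluated on first access, then cached
  lazy val mean: Double = sums._1 / values.size
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)
  lazy val stdDev: Double = math.sqrt(variance)
}

object StatsDemo {
  // Sum = 40 and sum of squares = 232 for these eight values
  val stats = new SimpleStats(Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
}
```

For this sample, the mean is 40/8 = 5 and the bias-adjusted variance is (232 - 25·8)/7 = 32/7. Because the statistics are lazy, a caller that only reads mean never pays for the square root in stdDev.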
Z-score and Gauss
The Gaussian distribution of input data is implemented by the gauss method of the Stats class:
Note
Gaussian distribution
M1: Gaussian transformation for a mean μ and a standard deviation σ:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
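// Normalization constant 1/sqrt(2*pi)
val INV_SQRT_2PI = 1.0/Math.sqrt(2.0*Math.PI)

def gauss(mu: Double, sigma: Double, x: Double): Double = {
  val y = (x - mu)/sigma
  INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}

// Standard normal: mean 0.0 and standard deviation 1.0 (sigma must be non-zero)
val normal = gauss(0.0, 1.0, _: Double)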
The normal distribution is defined as a partially applied function: the mean and standard deviation are fixed, and the observation x is left as the only argument. The Z-score normalizes the raw data by subtracting the mean and dividing by the standard deviation.
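To make the partial application concrete, here is a small self-contained sketch. GaussSketch and standardNormal are illustrative names that do not come from the book, and the sketch assumes the standard normal parameters (mean 0.0, standard deviation 1.0).

```scala
object GaussSketch {
  // Normalization constant 1/sqrt(2*pi); the name mirrors the one in the text
  val INV_SQRT_2PI: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  // Gaussian density for mean mu and standard deviation sigma (sigma > 0)
  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    require(sigma > 0.0, "sigma must be strictly positive")
    val y = (x - mu) / sigma
    INV_SQRT_2PI * math.exp(-0.5 * y * y) / sigma
  }

  // Partial application: mu and sigma are fixed, x remains open
  val standardNormal: Double => Double = gauss(0.0, 1.0, _: Double)

  val peak: Double = standardNormal(0.0)   // density at the mean
  val left: Double = standardNormal(-1.5)
  val right: Double = standardNormal(1.5)
}
```

The resulting value has type Double => Double, so it can be passed directly to map over a collection of observations. The density peaks at the mean and is symmetric around it, which gives a quick sanity check on the implementation.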
Note
Z-score normalization
M2: Z-score for a mean μ and a standard deviation σ:
$$z = \frac{x - \mu}{\sigma}$$
The Z-score is computed by the zScore method of Stats:
def zScore: DblVec = values.map(x => (x - mean)/stdDev )
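The effect of the Z-score can be checked on a small sample. The sketch below is a standalone approximation of zScore that works on Vector[Double] rather than on the Stats class; ZScoreSketch is a hypothetical name, and the guard against constant data is an addition that the original method does not show.

```scala
object ZScoreSketch {
  // Standardizes the values using the mean and the bias-adjusted
  // (n - 1 denominator) standard deviation
  def zScore(values: Vector[Double]): Vector[Double] = {
    require(values.size > 1, "At least two values are needed")
    val n = values.size
    val mean = values.sum / n
    val variance = values.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    // Assumption of this sketch: reject constant data instead of returning NaN
    require(stdDev > 0.0, "Z-score is undefined for constant data")
    values.map(x => (x - mean) / stdDev)
  }

  val z: Vector[Double] = zScore(Vector(1.0, 2.0, 3.0, 4.0, 5.0))
}
```

After the transformation, the values are centered on zero and their bias-adjusted variance is one, whatever the scale of the input was. This is what makes features measured in different units comparable.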
The following chart illustrates the relative behavior of the normalization, zScore, and normal transformation:
[Figure: Comparative analysis of linear, Gaussian, and Z-score normalization]