
Profiling data

The selection of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.
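As a reminder of what normalizing with the minimum and maximum values means, here is a minimal sketch. The object MinMaxSketch and the function minMaxNormalize are hypothetical names chosen for illustration; they are not the MinMax class from Chapter 1, whose API is not reproduced in this section.

```scala
object MinMaxSketch {
  // Hypothetical helper, not the book's MinMax class:
  // rescales values linearly into the interval [low, high].
  def minMaxNormalize(
      values: Vector[Double],
      low: Double = 0.0,
      high: Double = 1.0): Vector[Double] = {
    require(values.nonEmpty, "values must not be empty")
    val (mn, mx) = (values.min, values.max)
    val range = mx - mn
    // A constant dataset has no spread; map every value to low
    if (range == 0.0) values.map(_ => low)
    else values.map(x => low + (x - mn) * (high - low) / range)
  }
}
```

With the default bounds, the smallest observation maps to 0.0 and the largest to 1.0.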

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

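Arithmetic mean:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

Variance adjusted for sampling bias:

$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$$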

Let's extend the MinMax class with some basic statistics capabilities, Stats:

class Stats[T: ToDouble](values: Vector[T]) 
extends MinMax[T](values) {

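  val zero = (0.0, 0.0)
  // Accumulate (sum of values, sum of squared values) in a single pass
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  // Variance adjusted for sampling bias: divide by (n - 1)
  lazy val variance = 
     (sums._2 - mean*mean*values.size)/(values.size - 1)
  lazy val stdDev = sqrt(variance)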
…
}

The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares, sums (line 1). Statistics such as the mean and variance are declared lazy, so each is computed only once, on first access (line 2). The Stats class inherits the normalization functions of MinMax.
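Because MinMax and ToDouble are defined elsewhere, the same idea can be tried in isolation with a simplified stand-in. The following is a sketch under stated assumptions: SimpleStats and SimpleStatsDemo are hypothetical names, the class accepts only Vector[Double], and it does not extend MinMax.

```scala
import scala.math.sqrt

// Simplified stand-in for Stats (assumption: Vector[Double] only,
// no MinMax parent, no ToDouble context bound).
class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "at least two values are needed")

  // Single pass: (sum of values, sum of squared values)
  private val sums = values.foldLeft((0.0, 0.0)) {
    case ((s, sq), x) => (s + x, sq + x * x)
  }

  // lazy: evaluated on first access, then cached
  lazy val mean: Double = sums._1 / values.size
  // Variance adjusted for sampling bias: divide by (n - 1)
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)
  lazy val stdDev: Double = sqrt(variance)
}

object SimpleStatsDemo {
  // sum = 40, sum of squares = 232, n = 8
  // mean = 40/8 = 5.0, variance = (232 - 25*8)/7 = 32/7
  val stats = new SimpleStats(
    Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
}
```

Computing the variance from the two running sums avoids a second pass over the data, although it can lose precision when the mean is large relative to the spread of the values.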

Z-score and Gauss

The Gaussian distribution of input data is implemented by the gauss method of the Stats class:

Note

Gaussian distribution

M1: Gaussian distribution for a mean μ and a standard deviation σ:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

def gauss(mu: Double, sigma: Double, x: Double): Double = {
   val y = (x - mu)/sigma
   INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
// Standard normal distribution: mu = 0.0, sigma = 1.0
val normal = gauss(0.0, 1.0, _: Double)

The normal distribution is defined as a partially applied function of gauss: the mean and standard deviation are fixed, and the input x is left open. The Z-score normalizes the raw data by subtracting the mean and dividing by the standard deviation.
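The partial application can be exercised on its own with the following sketch. It assumes the standard normal distribution (mean 0, standard deviation 1) and makes two further assumptions: INV_SQRT_2PI is defined here as 1/sqrt(2π) because the excerpt uses the constant without showing its definition, and the require guard on sigma is an addition that is not part of the original method. GaussSketch is a hypothetical container name.

```scala
object GaussSketch {
  // Assumed definition: 1/sqrt(2*Pi)
  val INV_SQRT_2PI: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    // Added guard (assumption): sigma = 0.0 would divide by zero
    require(sigma > 0.0, "sigma must be positive")
    val y = (x - mu) / sigma
    INV_SQRT_2PI * math.exp(-0.5 * y * y) / sigma
  }

  // Partially applied: mu and sigma are fixed, x is left open
  val normal: Double => Double = gauss(0.0, 1.0, _: Double)

  // Density at the mean, equal to 1/sqrt(2*Pi), roughly 0.3989
  val peak: Double = normal(0.0)
}
```

The resulting normal value has the type Double => Double, so it can be passed directly to map over a collection of observations.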

Note

Z-score normalization

M2: Z-score for a mean μ and a standard deviation σ:

$$z = \frac{x - \mu}{\sigma}$$

The computation of the Z-score is implemented by the method zScore of Stats:

def zScore: DblVec = values.map(x => (x - mean)/stdDev )
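For a quick check outside the Stats class, the Z-score can be sketched as a standalone function. ZScoreSketch is a hypothetical name, the function takes a plain Vector[Double] instead of relying on the class members, and it assumes the standard deviation adjusted for sampling bias, dividing by n - 1.

```scala
object ZScoreSketch {
  // Hypothetical standalone version of zScore for Vector[Double]
  def zScore(values: Vector[Double]): Vector[Double] = {
    require(values.size > 1, "at least two values are needed")
    val n = values.size
    val mean = values.sum / n
    // Variance adjusted for sampling bias: divide by (n - 1)
    val variance =
      values.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    require(stdDev > 0.0, "values must not all be equal")
    values.map(x => (x - mean) / stdDev)
  }

  // mean = 3.0, variance = 10/4 = 2.5, stdDev = sqrt(2.5)
  val z: Vector[Double] = zScore(Vector(1.0, 2.0, 3.0, 4.0, 5.0))
}
```

The normalized values are centered on zero, and the observation equal to the mean maps to exactly 0.0.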

The following chart illustrates the relative behavior of the normalization, zScore, and normal transformation:

Comparative analysis of linear, Gaussian, and Z-score normalization
