
Profiling data

The choice of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class, which normalizes a dataset using its minimum and maximum values.
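As a reminder of what normalization by minimum and maximum does, here is a minimal sketch. It is not the MinMax class from Chapter 1: the object name MinMaxSketch, the normalize function, and its low and high parameters are hypothetical and chosen for illustration only.

```scala
object MinMaxSketch {
  // Hypothetical helper, not the MinMax class of the book:
  // linearly rescales values into the range [low, high].
  def normalize(
      values: Vector[Double],
      low: Double = 0.0,
      high: Double = 1.0): Vector[Double] = {
    require(values.nonEmpty, "Cannot normalize an empty vector")
    require(high > low, "The target range must not be empty")

    val (mn, mx) = (values.min, values.max)
    require(mx > mn, "All values are identical; the scale is undefined")

    val scale = (high - low) / (mx - mn)
    values.map(x => low + (x - mn) * scale)
  }

  def main(args: Array[String]): Unit =
    println(normalize(Vector(2.0, 4.0, 6.0)))
}
```

The smallest observation maps to low and the largest to high, so a single outlier compresses every other value. That sensitivity is one reason to profile the data before choosing a normalization.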

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

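Arithmetic mean:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

Variance adjusted for sampling bias:

$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$$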

Let's extend the MinMax class with some basic statistics capabilities, Stats:

class Stats[T: ToDouble](values: Vector[T]) 
extends MinMax[T](values) {

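  val zero = (0.0, 0.0)
  // Accumulate the sum of values and the sum of squared values in one pass
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  // Variance adjusted for sampling bias (divides by n - 1)
  lazy val variance = 
     (sums._2 - mean*mean*values.size)/(values.size - 1)
  lazy val stdDev = sqrt(variance)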
…
}

The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares in a single pass and stores them in sums (line 1). Statistics such as mean and variance are declared lazy, so each is computed only once, on first access (line 2). Stats also inherits the normalization functions of MinMax.
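The single-pass computation can be tried in isolation with the following sketch. It is a simplified stand-in, not the Stats class itself: SimpleStats is a hypothetical name, it accepts only Double values, and it does not extend MinMax.

```scala
// Hypothetical, simplified stand-in for Stats (Double only, no MinMax parent)
final class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "The sample variance needs at least two values")

  // Single pass over the data: (sum of x, sum of x*x)
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // lazy: evaluated on first access, then cached
  lazy val mean: Double = sums._1 / values.size

  // Variance adjusted for sampling bias: (sum(x^2) - n*mean^2)/(n - 1)
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)

  lazy val stdDev: Double = math.sqrt(variance)
}

object SimpleStatsDemo {
  def main(args: Array[String]): Unit = {
    val stats = new SimpleStats(Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
    println(s"mean=${stats.mean} variance=${stats.variance} stdDev=${stats.stdDev}")
  }
}
```

The variance expression relies on the identity Σ(x − μ)² = Σx² − nμ², which is what allows a single pass. When the mean is large relative to the spread of the data, subtracting two nearly equal sums can lose precision, so a two-pass computation may be preferable in that case.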

Z-score and Gauss

The Gaussian distribution of input data is implemented by the gauss method of the Stats class:

Note

Gaussian distribution

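M1: Gaussian transformation for a mean μ and a standard deviation σ:

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$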

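// INV_SQRT_2PI is the constant 1/sqrt(2π), defined elsewhere
def gauss(mu: Double, sigma: Double, x: Double): Double = {
   val y = (x - mu)/sigma
   INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
// Standard normal distribution: mean 0.0 and standard deviation 1.0
val normal = gauss(0.0, 1.0, _: Double)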

The normal distribution is defined as a partially applied function of gauss: the mean and the standard deviation are fixed and x is left open. The Z-score normalizes the raw data by subtracting the mean and dividing by the standard deviation.
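The following self-contained sketch shows the partial application at work for the standard normal distribution (mean 0, standard deviation 1). The enclosing object GaussDemo is hypothetical, and the definition of INV_SQRT_2PI as 1/sqrt(2π) is an assumption inferred from its name.

```scala
object GaussDemo {
  // Assumed definition of the constant: 1/sqrt(2*pi)
  val INV_SQRT_2PI: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    val y = (x - mu) / sigma
    INV_SQRT_2PI * math.exp(-0.5 * y * y) / sigma
  }

  // Partially applied function: mu and sigma are fixed, x remains open.
  // The resulting value has the type Double => Double.
  val normal: Double => Double = gauss(0.0, 1.0, _: Double)

  def main(args: Array[String]): Unit =
    Vector(-1.0, 0.0, 1.0).foreach(x => println(s"normal($x) = ${normal(x)}"))
}
```

Because normal is an ordinary function value, it can be passed directly to higher-order methods, for example values.map(normal) on a Vector[Double]. The standard deviation must be strictly positive, since gauss divides by sigma.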

Note

Z-score normalization

M2: Z-score for a mean μ and a standard deviation σ:

$$z = \frac{x - \mu}{\sigma}$$

The computation of the Z-score is implemented by the method zScore of Stats:

def zScore: DblVec = values.map(x => (x - mean)/stdDev )
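To see the effect of the Z-score on a small dataset, the sketch below reimplements it as a standalone function. ZScoreDemo and its zScore helper are hypothetical; they compute the mean and the bias-adjusted standard deviation locally instead of reusing the Stats class.

```scala
object ZScoreDemo {
  // Hypothetical standalone helper; in the text, zScore is a method of Stats
  def zScore(values: Vector[Double]): Vector[Double] = {
    require(values.size > 1, "The Z-score needs at least two values")

    val n = values.size
    val mean = values.sum / n
    // Standard deviation adjusted for sampling bias (divides by n - 1)
    val stdDev = math.sqrt(values.map(x => (x - mean) * (x - mean)).sum / (n - 1))
    require(stdDev > 0.0, "All values are identical; the Z-score is undefined")

    values.map(x => (x - mean) / stdDev)
  }

  def main(args: Array[String]): Unit =
    println(zScore(Vector(1.0, 2.0, 3.0, 4.0, 5.0)))
}
```

The transformed values have a mean of zero and a unit standard deviation. Unlike normalization by minimum and maximum, they are not confined to a fixed range.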

The following chart illustrates the relative behavior of the normalization, zScore, and normal transformation:

Comparative analysis of linear, Gaussian, and Z-score normalization
