
Profiling data

The choice of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The "Step 3 – pre-processing data" subsection in the "Let's kick the tires" section of Chapter 1, "Getting Started", introduced the MinMax class for normalizing a dataset using its minimum and maximum values.

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

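Arithmetic mean: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$

Variance adjusted for sampling bias: $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$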
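The bias-adjusted variance can be rewritten so that it depends only on the sum and the sum of squares of the observations, both of which can be accumulated in a single pass over the data. The short derivation below is a sketch that uses $n$ for the number of observations, $\hat{\mu}$ for the arithmetic mean, and $\hat{\sigma}^2$ for the adjusted variance:

```latex
\begin{aligned}
\hat{\sigma}^2
  &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 \\
  &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2
       - 2\hat{\mu}\sum_{i=1}^{n} x_i + n\hat{\mu}^2\right) \\
  &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\hat{\mu}^2\right)
  \qquad \text{since } \sum_{i=1}^{n} x_i = n\hat{\mu}
\end{aligned}
```

This form is convenient, but it subtracts two potentially large numbers, so it can lose precision when the mean is large relative to the spread of the data.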

Let's extend the MinMax class with some basic statistics capabilities, Stats:

class Stats[T: ToDouble](values: Vector[T]) 
extends MinMax[T](values) {

  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  lazy val variance = 
     (sums._2 - mean*mean*values.size)/(values.size-1)
  lazy val stdDev = sqrt(variance)
…
}

The class Stats implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares, sums (line 1). Statistics such as the mean and the variance are declared lazy, so each one is computed only once, when it is first needed (line 2). The class Stats inherits the normalization functions of MinMax.
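As a self-contained illustration of the same idea, the following sketch computes the statistics in one pass over a Vector[Double]. SimpleStats is a hypothetical, simplified stand-in: it does not extend MinMax and does not use the ToDouble context bound, neither of which is defined in this section.

```scala
// Hypothetical, simplified stand-in for Stats: restricted to Vector[Double]
// and independent of MinMax and ToDouble.
final class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "SimpleStats needs at least two values")

  // Single pass over the data: accumulate (sum of x, sum of x*x).
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // Each lazy value is evaluated at most once, on first access.
  lazy val mean: Double = sums._1 / values.size

  // Variance adjusted for sampling bias (divides by n - 1).
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)

  lazy val stdDev: Double = math.sqrt(variance)
}

object SimpleStatsDemo {
  def main(args: Array[String]): Unit = {
    val stats = new SimpleStats(Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
    println(s"mean=${stats.mean} variance=${stats.variance} stdDev=${stats.stdDev}")
  }
}
```

Because the instance holds no mutable state, it can be shared freely; the cost is that a new instance is needed whenever the data changes.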

Z-score and Gauss

The Gaussian distribution of input data is implemented by the gauss method of the Stats class:

Note

Gaussian distribution

M1: Gaussian transformation for a mean μ and a standard deviation σ: $p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

def gauss(mu: Double, sigma: Double, x: Double): Double = {
   val y = (x - mu)/sigma
   INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
val normal = gauss(0.0, 1.0, _: Double)

The normal distribution is defined as a partially applied function of gauss: the mean and the standard deviation are fixed, and the observation x remains the only argument. The Z-score normalizes the raw data by centering it on the mean and scaling it by the standard deviation.
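The partial application can be tried in isolation with the sketch below. It assumes the standard normal distribution (mean 0, standard deviation 1). The GaussSketch object, the InvSqrt2Pi constant, and the guard on sigma are illustrative additions, not part of the Stats class.

```scala
object GaussSketch {
  // 1 / sqrt(2 * pi), the normalization constant of the Gaussian density.
  private val InvSqrt2Pi: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    // Added guard (an assumption of this sketch): sigma = 0 would yield NaN.
    require(sigma > 0.0, "sigma must be strictly positive")
    val y = (x - mu) / sigma
    InvSqrt2Pi * math.exp(-0.5 * y * y) / sigma
  }

  // Fixing mu and sigma leaves a function of the observation x only.
  val standardNormal: Double => Double = gauss(0.0, 1.0, _: Double)

  def main(args: Array[String]): Unit =
    Vector(-1.0, 0.0, 1.0).foreach { x =>
      println(s"$x -> ${standardNormal(x)}")
    }
}
```

The resulting function value can be passed to map or composed with other transformations like any other Double => Double.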

Note

Z-score normalization

M2: Z-score for a mean μ and a standard deviation σ: $z = \frac{x - \mu}{\sigma}$

The computation of the Z-score is implemented by the method zScore of Stats:

def zScore: DblVec = values.map(x => (x - mean)/stdDev )
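A standalone version of the Z-score can be sketched as follows. ZScoreSketch is a hypothetical helper that works directly on Vector[Double], uses the variance adjusted for sampling bias, and adds a guard against a zero standard deviation; none of this is taken from the Stats class itself.

```scala
object ZScoreSketch {
  def zScore(values: Vector[Double]): Vector[Double] = {
    require(values.size > 1, "zScore needs at least two values")
    val n = values.size
    val mean = values.sum / n
    // Variance adjusted for sampling bias (divides by n - 1).
    val variance = values.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    // Added guard (an assumption of this sketch): constant data has no Z-score.
    require(stdDev > 0.0, "zScore is undefined for constant data")
    values.map(x => (x - mean) / stdDev)
  }

  def main(args: Array[String]): Unit =
    println(zScore(Vector(1.0, 2.0, 3.0)))
}
```

After the transformation the data has a mean of 0 and a standard deviation of 1, which makes features measured on different scales comparable.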

The following chart illustrates the relative behavior of the normalization, zScore, and normal transformation:

Comparative analysis of linear, Gaussian, and Z-score normalization
