- Scala for Machine Learning (Second Edition)
- Patrick R. Nicolas
- 1878字
- 2021-07-08 10:43:09
Expectation-Maximization (EM)
EM was originally introduced to estimate the maximum likelihood in the case of incomplete data [4:7]. The EM algorithm is an iterative method that computes the model features maximizing the likelihood estimate for the observed values, taking unobserved values into account.
The iterative algorithm consists of computing:
- The expectation, E, of the maximum likelihood for the observed data by inferring the latent values (E-step)
- The model features that maximize the expectation E (M-step)
The EM algorithm is applied to solve clustering problems by assuming that each latent variable follows a Normal or Gaussian distribution. This is similar to the K-means algorithm, for which the distance of each data point to the center of each cluster follows a Gaussian distribution [4:8]. Therefore, a set of latent variables is a mixture of Gaussian distributions.
Gaussian mixture model
The latent variables, Zi, can be visualized as the root causes of the behavior (or symptoms) of the observed model, X:

Visualization of observed and latent features
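As an illustration of this generative view, consider the following sketch, in which the latent component z is drawn first and the observation x is then drawn from the Gaussian associated with z. All parameter values here are made up for illustration:

import scala.util.Random

val rng = new Random(42)
val weights = Array(0.3, 0.7)   // p(z): probabilities of the latent components
val means = Array(-1.0, 2.0)
val stdDevs = Array(0.5, 1.5)

def sampleMixture(): Double = {
  val z = if (rng.nextDouble() < weights(0)) 0 else 1 // draw the latent variable z
  means(z) + stdDevs(z) * rng.nextGaussian()          // draw the observation x given z
}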
The latent values Z follow a Gaussian distribution. For the statisticians among us, the mathematics of a mixture model are described here:
Note
Maximization of log likelihood
M7: If x = {xi} is a set of observed features associated with latent features z = {zi}, the probability for the feature xi of observation x given a model parameter θ is:

p(xi | θ) = Σz p(xi | z, θ) p(z | θ)
M8: The objective is to maximize the log likelihood L(θ):

L(θ) = Σi log p(xi | θ) = Σi log Σz p(xi | z, θ) p(z | θ)
EM overview
As far as the implementation is concerned, the Expectation-Maximization algorithm can be broken down into three stages:
- The computation of the log likelihood for the model features given some latent variables (Initial-step).
- The computation of the expectation of the log likelihood at iteration t (E-step).
- The maximization of the expectation at iteration t (M-step).
Note
E-step
M9: The expectation Q of the complete data log likelihood for the model parameters θ(n) at iteration n is computed using the posterior distribution of the latent variable z, p(z|x, θ), and the joint probability of the observation and the latent variable:

Q(θ | θ(n)) = Σz p(z | x, θ(n)) log p(x, z | θ)
M-step
M10: The expectation function is maximized with respect to the model features θ to compute the model parameters θ(n+1) for the next iteration:

θ(n+1) = argmaxθ Q(θ | θ(n))
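To make M9 and M10 concrete, the following self-contained Scala sketch implements a single E-step/M-step iteration for a univariate Gaussian mixture. It is an illustration only; none of these names (EMSketch, gaussianPdf, emStep) belong to the book's library:

object EMSketch {
  def gaussianPdf(x: Double, mu: Double, sigma2: Double): Double =
    math.exp(-0.5 * (x - mu) * (x - mu) / sigma2) / math.sqrt(2.0 * math.Pi * sigma2)

  def emStep(
      x: Array[Double],
      weights: Array[Double],    // mixing coefficients p(z = k)
      means: Array[Double],
      variances: Array[Double]): (Array[Double], Array[Double], Array[Double]) = {
    val K = weights.length

    // E-step (M9): posterior responsibilities p(z = k | xi, θ(n))
    val resp = x.map { xi =>
      val p = Array.tabulate(K)(k => weights(k) * gaussianPdf(xi, means(k), variances(k)))
      val total = p.sum
      p.map(_ / total)
    }

    // M-step (M10): re-estimate weights, means, and variances from the responsibilities
    val nk = Array.tabulate(K)(k => resp.map(_(k)).sum)
    val newWeights = nk.map(_ / x.length)
    val newMeans = Array.tabulate(K)(k =>
      x.zip(resp).map { case (xi, r) => r(k) * xi }.sum / nk(k))
    val newVariances = Array.tabulate(K)(k =>
      x.zip(resp).map { case (xi, r) => r(k) * (xi - newMeans(k)) * (xi - newMeans(k)) }.sum / nk(k))

    (newWeights, newMeans, newVariances)
  }
}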
A formal, detailed, yet short mathematical formulation of the EM algorithm can be found in S. Borman's tutorial [4:9].
Implementation
Let's implement the three steps (Initial-step, E-step, and M-step) in Scala. The internal calculations of the EM algorithm are a bit complex and overwhelming. You may not benefit much from the details of a specific implementation, such as the computation of the eigenvalues of the covariance matrix of the expectation of the log likelihood. This implementation hides some of the complexity by using the Apache Commons Math library package [4:10].
Tip
Inner workings of EM
You may want to download the source code of the implementation of the EM algorithm in the Apache Commons Math library if you need to understand the conditions under which an exception is thrown.
The expectation-maximization algorithm of type MultivariateEM is implemented as a data transformation of type ITransform, as described in the Monadic data transformation section of Chapter 2, Hello World! The two arguments of the constructor are the number of clusters (or Gaussian distributions), K, and the training set, xt (line 2). The constructor initializes the output type V as EMCluster (line 1):

type V = EMCluster //1

class MultivariateEM[@specialized(Double) T: ToDouble](
    K: Int,
    xt: Vector[Array[T]]) extends ITransform[Array[T], V] with Monitor[T] { //2

  val model: Option[EMModel] = train //3

  override def |> : PartialFunction[Array[T], Try[V]]
}
The multivariate expectation-maximization class has a model that consists of a list of EM clusters of type EMCluster. The Monitor trait is used to collect profiling information during training (refer to the Monitor section under Helper classes in the Appendix).
The information about an EM cluster, EMCluster, is defined by a key, the centroid or means value, and the density of the cluster, that is, the standard deviation of the distance of all the data points to the mean (line 4):
case class EMCluster(
key: Double,
means: Array[Double],
density: Array[Double]) //4
type EMModel = List[EMCluster]
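For instance, a trained model for two clusters in two dimensions might look as follows; the values are made up for illustration:

val model: EMModel = List(
  EMCluster(key = 0.57, means = Array(0.12, 0.45), density = Array(0.05, 0.11)),
  EMCluster(key = 0.43, means = Array(0.88, 0.32), density = Array(0.07, 0.09))
)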
The implementation of the EM algorithm in the method train uses the Apache Commons Math MultivariateNormalMixture for the Gaussian mixture model and MultivariateNormalMixtureExpectationMaximization for the EM algorithm:

def train: Option[EMModel] = Try {
  val data: DblMatrix = xt.map(
    _.map(implicitly[ToDouble[T]].apply(_))
  ).toArray //5
  val multivariateEM = new EM(data)
  multivariateEM.fit(estimate(data, K)) //6

  val newMixture = multivariateEM.getFittedModel //7
  val components = newMixture.getComponents.toList //8
  components.map(p => EMCluster(p.getKey,
    p.getValue.getMeans,
    p.getValue.getStandardDeviations)) //9
} match { /* ... */ }
Let's look at the main method, train, of the wrapper class MultivariateEM. The first step is to convert the time series into a primitive matrix of Double values, with historical quotes as rows and stock symbols as columns. The time series xt of type Vector[Array[T]] is converted to a DblMatrix through an induced implicit conversion (line 5).
The initial mixture of Gaussian distributions can be provided by the user or extracted from the dataset with estimate (line 6). The call to getFittedModel triggers the M-step (line 7).
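For readers who want to experiment with the library directly, here is a minimal standalone sketch of the same Apache Commons Math calls outside the wrapper class. The two-dimensional data is made up, and a real dataset needs enough observations to keep the covariance matrices non-singular:

import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization

val data: Array[Array[Double]] = Array(
  Array(0.10, 0.20), Array(0.15, 0.22), Array(0.12, 0.18), Array(0.14, 0.25),
  Array(0.90, 0.80), Array(0.95, 0.85), Array(0.88, 0.79), Array(0.92, 0.83)
)
val em = new MultivariateNormalMixtureExpectationMaximization(data)
// Initial mixture estimated from the data, then fit runs the E/M iterations
em.fit(MultivariateNormalMixtureExpectationMaximization.estimate(data, 2))
val mixture = em.getFittedModel // the fitted MixtureMultivariateNormalDistribution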
Note
Conversion between Java and Scala collections
Java collections need to be converted to Scala collections using the import scala.collection.JavaConversions. For example, a java.util.List is converted to a scala.collection.immutable.List by invoking the method asScalaIterator of the class WrapAsScala, one of the base traits of JavaConversions.
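A minimal example of this conversion (the list content is made up):

import scala.collection.JavaConversions._

val javaList: java.util.List[Double] = java.util.Arrays.asList(1.0, 2.0, 3.0)
val scalaList: List[Double] = javaList.toList // the implicit conversion supplies toList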
The Apache Commons Math method getComponents returns a java.util.List that is converted to a scala.collection.immutable.List by invoking the method toList (line 8). Finally, the data transform returns a list of cluster information of type EMCluster (line 9).
Tip
Third-party library exceptions
Scala does not enforce the declaration of exceptions as part of the signature of a method. Therefore, there is no guarantee that all types of exceptions will be caught locally. This problem occurs when exceptions are thrown from a third-party library in two scenarios:
- The documentation of the API does not list all the types of exceptions
- The library is updated and a new type of exception is added to a method
An easy workaround is to leverage the Scala exception-handling mechanism:

Try { ... } match {
  case Success(results) => ...
  case Failure(exception) => ...
}
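As a generic illustration, a hedged wrapper around any third-party call might look like this (safeCall is not part of the book's library):

import scala.util.{Failure, Success, Try}

def safeCall[A](call: => A): Option[A] = Try(call) match {
  case Success(result) => Some(result)
  case Failure(exception) => None // log or report the exception here
}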
Classification
The classification of new observations or data points is implemented by the method |>:
override def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if isModel && x.length == dimension(xt) =>
    Try(
      model.map(_.minBy(c =>
        euclidean[Double, Double](
          c.means,
          x.map(z => implicitly[ToDouble[T]].apply(z)))
      )).get
    )
}
The method |> is similar to the KMeans.|> classifier.
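A hypothetical invocation, assuming a training set xt and a new observation obs of the proper dimension (the companion apply is used as in the test code later in this section):

val em = MultivariateEM[Double](3, xt) // 3 clusters
val pfnEM = em |>
if (pfnEM.isDefinedAt(obs))
  pfnEM(obs).foreach(cluster => println(s"Assigned to cluster ${cluster.key}"))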
Testing
Let's apply the MultivariateEM class to the clustering of the same 127 stocks used in evaluating the K-means algorithm.
As discussed in the paragraph related to the curse of dimensionality, the number of stocks (127) to analyze restricts the number of observations that the EM algorithm can use. A simple option is to filter out some of the noise in the stock prices and apply a simple sampling method. The maximum sampling rate is limited by the frequency spectrum of the different types of noise in each stock's historical prices.
Tip
Filtering and sampling
The pre-processing of the data using a combination of a simple moving average and fixed-interval sampling prior to clustering is very rudimentary in this example. For instance, we cannot assume that the historical prices of all the stocks share the same noise characteristics. The noise pattern in the price of momentum and heavily traded stocks is certainly different from that of blue-chip securities with strong ownership, held by large mutual funds.
The sampling rate should take into account the frequency spectrum of the noise. It should be set to at least twice the frequency of the lowest-frequency noise.
The objective of the test is to evaluate the impact of the sampling rate, samplingRate, and the number of clusters, K, used in the EM algorithm:
val K = 4; val period = 8
val smAve = SimpleMovingAverage[Double](period) //10
val pfnSmAve = smAve |> //11

val obs = symbolFiles.map(sym => {
  val pfnSrc = DataSource(sym, path, true, 1) |>
  (for {
    xs <- pfnSrc(extractor) //12
    if pfnSmAve.isDefined
    values <- pfnSmAve(xs.head) //13
    y <- Try {
      values.view.indices.drop(period + 1).toVector
        .filter(_ % samplingRate == 0)
        .map(values(_)).toArray //14
    }
  } yield y).getOrElse(Array.empty[Double])
})
em(K, obs) //15
The first step is to create a simple moving average with a predefined period (line 10), as described in the Simple moving average section of Chapter 3, Data Pre-Processing. The test code instantiates the partial function pfnSmAve that implements the moving average computation (line 11). The symbols of the stocks under consideration are extracted from the names of the files in the path directory (line 12).
The execution of the moving average (line 13) generates a set of smoothed values that are sampled at the given sampling rate, samplingRate (line 14). Finally, the expectation-maximization algorithm is instantiated to cluster the sampled data in the em method (line 15):
def em(K: Int, obs: DblMatrix): Int = {
  val em = MultivariateEM[Double](K, obs.toVector) //16
  show(s"${em.toString}") //17
}
The method em instantiates the EM algorithm for a specific number of clusters, K (line 16). The content of the model is displayed by invoking MultivariateEM.toString. The results are aggregated and then displayed in a textual format on the standard output (line 17).
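To evaluate the impact of both parameters systematically, the pipeline could be wrapped in a simple sweep; run is a hypothetical helper that executes the smoothing, sampling, and clustering steps above for one (K, samplingRate) pair:

for {
  k <- 2 to 5
  samplingRate <- Seq(5, 10, 20)
} run(k, samplingRate) // hypothetical: smooth, sample, then cluster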
The first test is to execute the EM algorithm with K=3 clusters and a sampling period of 10 on data smoothed by a simple moving average with a period of 8. Sampling the historical prices of the 127 stocks between January 1, 2013 and December 31, 2013 at a frequency of 0.1 Hz produces 24 data points. The following chart displays the mean of each of the three clusters:

Chart of the normalized means per cluster using EM K=3
The mean vectors of clusters 2 and 3 have similar patterns, which may suggest that a set of three clusters is accurate enough to provide us with a first insight into the similarity within groups of stocks. The following is a chart of the normalized standard deviation per cluster using EM K = 3:

Chart of the normalized standard deviation per cluster using EM K=3
The distribution of the standard deviation along the mean vector of each cluster can be explained by the fact that the prices of stocks from a couple of industries declined in tandem, while others rose as a semi-homogeneous group, following the announcement from the Federal Reserve that the monthly quantity of bonds purchased as part of the quantitative easing program would soon be reduced.
Note
Relation to K-Means
You may wonder what the relation is between EM and K-means, as both techniques address the same problem. The K-means algorithm assigns each observation uniquely to one and only one cluster. The EM algorithm assigns an observation based on posterior probability. K-means is a special case of the EM for Gaussian mixtures [4:11].
Online EM
Online learning is a powerful strategy for training a clustering model on very large datasets, and it has regained interest among researchers lately. A description of online EM is beyond the scope of this tutorial, but you should know that several online variants of EM are available if you ever have to deal with large datasets: batch EM, stepwise EM, incremental EM, and Monte Carlo EM [4:12].