- Scala for Machine Learning (Second Edition)
- Patrick R. Nicolas
Monadic data transformation
The first step is to define a trait and a method that describe the transformation of data by the computation units of a workflow. The data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results.
There are two symbolic models for defining a data processing or data transformation:
- Explicit model: The developer creates a model explicitly from a set of configuration parameters. Most deterministic algorithms and unsupervised learning techniques use an explicit model.
- Implicit model: The developer provides a training set that is a set of labeled observations (observations with expected outcome). A classifier extracts a model through the training set. Supervised learning techniques rely on a model implicitly generated from labeled data.
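The contrast between the two models can be sketched in a few lines. This is a minimal, hypothetical illustration (the names `ExplicitModel`, `predictExplicit`, and `train` are not part of the book's API); the "training" here is a deliberately trivial extraction of one parameter from labeled observations:

```scala
// Explicit model: the caller supplies the model parameters directly,
// typically from a set of configuration parameters.
case class ExplicitModel(weights: Vector[Double])

def predictExplicit(m: ExplicitModel, x: Vector[Double]): Double =
  m.weights.zip(x).map { case (w, v) => w * v }.sum

// Implicit model: the model is extracted from a training set of labeled
// observations (feature, label). Here the "model" is just the mean
// label/feature ratio, standing in for a real learning algorithm.
def train(labeled: Seq[(Double, Double)]): Double =
  labeled.map { case (x, y) => y / x }.sum / labeled.size

val implicitModel = train(Seq((1.0, 2.0), (2.0, 4.0)))  // 2.0
```

The point is only structural: in the explicit case the developer builds the model; in the implicit case a classifier derives it from labeled data.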
Error handling
The simplest form of data transformation is a morphism between two types T and A. The data transformation enforces a contract for validating the input and returning either a value or an error. From now on, we will use the following convention:
- Input value: The validation is implemented through a partial function, of type PartialFunction, that is returned by the data transformation. A MatchError is thrown in case the input value does not meet the required condition (contract).
- Output value: The return type is Try[A], for which an exception is returned in case of an error.
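This convention can be illustrated with a minimal, hypothetical example (the function `sqrtT` is illustrative, not part of the book's code): a partial function that validates its input through its guard and wraps its output in Try:

```scala
import scala.util.{Try, Success}

// Hypothetical transformation: square root, defined only for non-negative
// input. The guard of the partial function enforces the input contract;
// Try captures any error on the output side.
val sqrtT: PartialFunction[Double, Try[Double]] = {
  case x if x >= 0.0 => Try(math.sqrt(x))
}

val ok = sqrtT(4.0)                   // Success(2.0)
val defined = sqrtT.isDefinedAt(-1.0) // false: input contract not met
```

Applying `sqrtT` to a negative value would throw a MatchError, since the input falls outside the domain of the partial function.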
Tip
Partial function reusability
Reusability is another benefit of partial functions, as illustrated in the following code snippet:
class F {
  def f: PartialFunction[Int, Try[Double]] = { .. }
}
val pfn = (new F).f
pfn(4)
pfn(10)
Partial functions enable developers to implement methods that address the most common (primary) use case, for which the input values have been tested. All other non-trivial use cases (or input values) generate a MatchError exception. At a later stage in the development cycle, the developer may implement the code to handle the less common use cases.
Note
Runtime validation of a partial function
It is good practice to validate whether a partial function is defined for a specific value of the argument:
if( pfn.isDefinedAt(input) )
  for( value <- pfn(input) ) yield { … }

This preemptive approach allows the developer to select an alternative method or a full function [2:3]. It is an efficient alternative to catching a MatchError exception.
The validation of partial functions is omitted throughout the book for the sake of clarity.
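A complete, runnable version of this preemptive check might look as follows (a sketch with a hypothetical partial function `pfn` and wrapper `safeApply`, neither of which appears in the book's code):

```scala
import scala.util.{Try, Success}

// Hypothetical partial function, defined only for positive input
val pfn: PartialFunction[Int, Try[Double]] = {
  case n if n > 0 => Try(1.0 / n)
}

// Preemptive validation: test isDefinedAt before applying the partial
// function, instead of catching a MatchError after the fact
def safeApply(input: Int): Option[Try[Double]] =
  if (pfn.isDefinedAt(input)) Some(pfn(input)) else None

safeApply(4)   // Some(Success(0.25))
safeApply(-1)  // None: no exception is ever thrown
```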
Therefore, the signature of a data transformation is defined as follows:
def |> : PartialFunction[T, Try[A]]
F# language reference
The notation |>, used as the signature of the transform, is borrowed from the F# language [2:2].
Monads to the rescue
The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation.
Note
This section illustrates the concept of monadic data transformation, which is not essential to the understanding of machine learning algorithms as described throughout the book. You can safely skip to the Workflow computational models section.
Implicit models
Supervised learning models are extracted from a training set. Transformations such as classification or regression use the implicit models to process input data, as illustrated in the following diagram:

Visualization of implicit models
trait ITransform[T, A] { //1
  self =>
  def |> : PartialFunction[T, Try[A]] //2
  def map[B](f: A => B): ITransform[T, B]
  def flatMap[B](f: A => ITransform[T, B]): ITransform[T, B]
  def andThen[B](tr: ITransform[A, B]): ITransform[T, B]
}
An implicit transformation has the type ITransform, with two parameter types (line 1):
- T: Type of an element of the input collection
- A: Type of an element of the output collection
For instance, the moving average on a time series of a single variable is computed by an ITransform[Double, Double]. The input collection is the time series and the output is a smoothed time series.
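A minimal concrete implementation of the trait makes the contract tangible. The following sketch reduces the trait to its |> member and implements a hypothetical element-wise transform (a logistic squashing of a raw score; a moving average would follow the same pattern but carry window state):

```scala
import scala.util.Try

// Reduced version of the ITransform trait: only the |> member
trait ITransform[T, A] {
  def |> : PartialFunction[T, Try[A]]
}

// Hypothetical implicit transform: logistic squashing of a raw score,
// defined only for finite input values (the input contract)
class Logistic extends ITransform[Double, Double] {
  override def |> : PartialFunction[Double, Try[Double]] = {
    case x if !x.isNaN && !x.isInfinity => Try(1.0 / (1.0 + math.exp(-x)))
  }
}

val logistic = new Logistic
logistic.|>(0.0)  // Success(0.5)
```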
Note
Apache Spark ML transformers
The concept behind ITransform is somewhat similar to the Apache Spark MLlib transformers on data frames described in the ML Reusable Pipelines section of Chapter 17, Apache Spark MLlib.
The method |> declares the transformation that is defined by implementing the trait ITransform (line 2). Let's look at the monadic operators.
The map method applies a function to each element of the output of the transformation |>. It generates a new ITransform by overriding the |> method (line 3). A new implementation of the data transformation |>, returning an instance of PartialFunction[T, Try[B]] (line 4), is created by overriding the methods isDefinedAt (line 5) and apply (line 6):
def map[B](f: A => B): ITransform[T, B] =
  new ITransform[T, B] {
    override def |> : PartialFunction[T, Try[B]] =   //3
      new PartialFunction[T, Try[B]] {               //4
        override def isDefinedAt(t: T): Boolean =    //5
          self.|>.isDefinedAt(t)
        override def apply(t: T): Try[B] = self.|>(t).map(f)  //6
      }
  }
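The behavior of map can be verified with a small, self-contained sketch. The trait is reduced here to |> and a concrete map, and the `positive` transform is a hypothetical example (identity on positive doubles):

```scala
import scala.util.Try

// Minimal re-creation of the trait with the map operator from the text
trait ITransform[T, A] { self =>
  def |> : PartialFunction[T, Try[A]]
  def map[B](f: A => B): ITransform[T, B] = new ITransform[T, B] {
    override def |> : PartialFunction[T, Try[B]] =
      new PartialFunction[T, Try[B]] {
        override def isDefinedAt(t: T): Boolean = self.|>.isDefinedAt(t)
        override def apply(t: T): Try[B] = self.|>(t).map(f)
      }
  }
}

// Hypothetical transform: identity on positive doubles
val positive = new ITransform[Double, Double] {
  override def |> : PartialFunction[Double, Try[Double]] = {
    case x if x > 0 => Try(x)
  }
}

// map post-processes the output without touching the input contract
val doubled = positive.map(_ * 2.0)
doubled.|>(3.0)               // Success(6.0)
doubled.|>.isDefinedAt(-1.0)  // false: original contract preserved
```

Note that map transforms only the output: the domain of the partial function, and therefore the validation of the input, is inherited unchanged from the original transform.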
The overridden methods for the instantiation of ITransform in flatMap follow the same design pattern as the map method. The argument f converts each output element into an implicit transformation of type ITransform[T, B] (line 7), and outputs a new instance of ITransform after flattening (line 8). As with the map method, it overrides the implementation of the data transformation |>, returning a new partial function (line 9) after overriding the isDefinedAt and apply methods:
def flatMap[B](
    f: A => ITransform[T, B]  //7
): ITransform[T, B] = new ITransform[T, B] {  //8
  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] {  //9
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t)
      override def apply(t: T): Try[B] =
        self.|>(t).flatMap(f(_).|>(t))
    }
}
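The key detail of flatMap is that the transform derived from the first output is re-applied to the original input. The following self-contained sketch (the helpers `const` and `half` are hypothetical) demonstrates this:

```scala
import scala.util.Try

// Minimal re-creation of the trait with the flatMap operator from the text
trait ITransform[T, A] { self =>
  def |> : PartialFunction[T, Try[A]]
  def flatMap[B](f: A => ITransform[T, B]): ITransform[T, B] =
    new ITransform[T, B] {
      override def |> : PartialFunction[T, Try[B]] =
        new PartialFunction[T, Try[B]] {
          override def isDefinedAt(t: T): Boolean = self.|>.isDefinedAt(t)
          override def apply(t: T): Try[B] = self.|>(t).flatMap(f(_).|>(t))
        }
    }
}

// Hypothetical transforms used to exercise flatMap
def const[A](a: A) = new ITransform[Double, A] {
  override def |> : PartialFunction[Double, Try[A]] = { case _ => Try(a) }
}
val half = new ITransform[Double, Double] {
  override def |> : PartialFunction[Double, Try[Double]] =
    { case x => Try(x / 2) }
}

// The transform produced by f is applied to the original input (8.0)
val chained = half.flatMap(h => const(h + 1.0))
chained.|>(8.0)  // Success(5.0): half(8.0) = 4.0, then const(5.0)
```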
The method andThen is not a proper element of a monad. Its meaning is similar to that of the Scala method Function1.andThen, which chains a function with another one. It is indeed useful for creating chains of implicit transformations. The method applies the transformation tr (line 10) to the output of this transformation. The output type of the first transformation is the input type of the next transformation, tr.
The implementation of the method andThen follows a pattern similar to the implementation of map and flatMap:
def andThen[B](
    tr: ITransform[A, B]  //10
): ITransform[T, B] = new ITransform[T, B] {
  override def |> : PartialFunction[T, Try[B]] =
    new PartialFunction[T, Try[B]] {
      override def isDefinedAt(t: T): Boolean =
        self.|>.isDefinedAt(t) &&
        tr.|>.isDefinedAt(self.|>(t).get)
      override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
    }
}
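Chaining two transforms of different types shows andThen at work. In this self-contained sketch, the two stages (`parse` and `sqrt`) are hypothetical examples, not code from the book:

```scala
import scala.util.Try

// Minimal re-creation of the trait with the andThen operator from the text
trait ITransform[T, A] { self =>
  def |> : PartialFunction[T, Try[A]]
  def andThen[B](tr: ITransform[A, B]): ITransform[T, B] =
    new ITransform[T, B] {
      override def |> : PartialFunction[T, Try[B]] =
        new PartialFunction[T, Try[B]] {
          override def isDefinedAt(t: T): Boolean =
            self.|>.isDefinedAt(t) && tr.|>.isDefinedAt(self.|>(t).get)
          override def apply(t: T): Try[B] = tr.|>(self.|>(t).get)
        }
    }
}

// Hypothetical stages: parse a string to Int, then take the square root.
// The output type of parse (Int) is the input type of sqrt.
val parse = new ITransform[String, Int] {
  override def |> : PartialFunction[String, Try[Int]] = {
    case s if s.nonEmpty && s.forall(_.isDigit) => Try(s.toInt)
  }
}
val sqrt = new ITransform[Int, Double] {
  override def |> : PartialFunction[Int, Try[Double]] = {
    case n if n >= 0 => Try(math.sqrt(n))
  }
}

val pipeline = parse andThen sqrt
pipeline.|>("16")  // Success(4.0)
```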
Note
andThen and compose
The reader is invited to implement a compose method, which executes the |> transformations in the reverse order of andThen.
Explicit models
The transformation on a dataset is performed using a model or configuration fully defined by the user, as illustrated in the following diagram:

Visualization of explicit models
The execution of a data transformation may depend on some context or external configuration. Such transformations are defined with the type ETransform, parameterized by the type T of the elements of the input collection and the type A of the elements of the output collection (line 11). The context or configuration is defined by the trait Config (line 12).
An explicit transformation is a transformation with the extra capability of using a set of external configuration parameters to generate an output. Therefore, ETransform inherits from ITransform (line 13):
abstract class ETransform[T, A](  //11
    config: Config  //12
) extends ITransform[T, A] {  //13
  self =>
  def map[B](f: A => B): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }
  def flatMap[B](f: A => ETransform[T, B]): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }
  def andThen[B](tr: ETransform[A, B]): ETransform[T, B] =
    new ETransform[T, B](config) {
      override def |> : PartialFunction[T, Try[B]] = super.|>
    }
}
The client code is responsible for specifying the type and value of the configuration used by a given explicit transformation. Here are a few examples of configuration classes:
trait Config
case class ConfigInt(iParam: Int) extends Config
case class ConfigDouble(fParam: Double) extends Config
case class ConfigArrayDouble(fParams: Array[Double])
  extends Config
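Putting the pieces together, an explicit transform carries its model in the configuration passed to its constructor. This is a reduced, hypothetical sketch (the `Scaler` class is illustrative; ETransform is stripped down to its |> member):

```scala
import scala.util.Try

trait Config
case class ConfigDouble(fParam: Double) extends Config

// Reduced sketch of an explicit transform: the model (here a scale factor)
// is fully supplied by the caller through the configuration
abstract class ETransform[T, A](val config: Config) {
  def |> : PartialFunction[T, Try[A]]
}

// Hypothetical explicit transform: scales every input by a configured factor
class Scaler(cfg: ConfigDouble) extends ETransform[Double, Double](cfg) {
  override def |> : PartialFunction[Double, Try[Double]] = {
    case x => Try(x * cfg.fParam)
  }
}

val scaler = new Scaler(ConfigDouble(2.5))
scaler.|>(4.0)  // Success(10.0)
```

Because the configuration is a constructor argument, the model of an explicit transform is fixed for the lifetime of the instance, which anticipates the immutability requirement discussed at the end of this section.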
Tip
Memory cleaning
Instances of ITransform and ETransform do not release the memory allocated for the input data. The client code is responsible for the memory management of input and output data. However, the method |> is expected to release any memory associated with the temporary data structure(s) used for the transformation.
The supervised learning models described in future chapters, such as logistic regression, support vector machines, Naïve Bayes, or the multilayer perceptron, are defined as implicit transformations and implement the ITransform trait. Filtering and data processing algorithms, such as data extractors, moving averages, or Kalman filters, inherit from the ETransform abstract class.
Note
Immutable transformations
The model for a data transformation (or processing unit or classifier) class should be immutable: any modification would alter the integrity of the model or of the parameters used to process data. In order to ensure that the same model is used to process input data for the entire lifetime of a transformation:
- A model for an ETransform is defined as an argument of its constructor.
- The constructor of an ITransform generates the model from a given training set. The model has to be rebuilt from the training set (not altered) if it starts to provide incorrect outcomes or predictions.
Models are created by the constructor of classifiers or data transformation classes to ensure their immutability. The design of immutable transformation is described in the Design template for classifiers subsection in the Scala programming section of the Appendix.