- Scala for Machine Learning(Second Edition)
- Patrick R. Nicolas
Workflow computational model
Monads are very useful for manipulating and chaining data transformations using implicit configurations or explicit models. However, they are restricted to a single morphism type T => U. More complex and flexible workflows require weaving together transformations of different types, using a generic factory pattern.
Traditional factory patterns rely on a combination of composition and inheritance and do not provide developers with the same level of flexibility as stackable traits.
In this section, we introduce the concept of modeling using mixins and a variant of the cake pattern to provide a workflow with three degrees of configurability.
Supporting mathematical abstractions
Stackable traits enable developers to follow a strict mathematical formalism while implementing a model in Scala. Scientists use a universally accepted template to solve mathematical problems:
- Declare the variables relevant to the problem.
- Define a model (equation, algorithm, formulas…) as the solution to the problem.
- Instantiate the variables and execute the model to solve the problem.
Let's consider the example of kernel functions (see the Kernel functions section of Chapter 12, Kernel Models and Support Vector Machines), a model that consists of the composition of two mathematical functions, and its potential implementation in Scala.
Step 1 – variable declaration
The implementation consists of wrapping (scope) the two functions into traits and defining these functions as abstract values.
The mathematical formalism is as follows:
Formalism f: ℝⁿ → ℝⁿ, g: ℝⁿ → ℝ
The Scala implementation is represented here:
```scala
type V = Vector[Double]

trait F { val f: V => V }
trait G { val g: V => Double }
```
Step 2 – model definition
The model is defined as the composition of the two functions. The stack of traits G, F describes the type of compatible functions that can be composed using the self-referenced constraint self: G with F:
Formalism h = f o g
The implementation is as follows:
```scala
class H { self: G with F =>
  def apply(v: V): Double = g(f(v))
}
```
Step 3 – instantiation
The model is executed once the variables f and g are instantiated.
The formalism is as follows:
Formalism f(v) = (exp(v₀), …, exp(vₙ₋₁)), g(v) = Σᵢ vᵢ, so h(v) = Σᵢ exp(vᵢ)
The implementation is as follows:
```scala
import scala.math.exp

val h = new H with G with F {
  val f: V => V = (v: V) => v.map(exp(_))
  val g: V => Double = (v: V) => v.sum
}
```
Tip
Lazy value trigger
In the preceding example, the value of h(v) = g(f(v)) can be automatically computed as soon as g and f are initialized, by declaring h a lazy value.
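A minimal sketch of this trigger, reusing the F, G, and H definitions from the preceding steps (the sample input v is illustrative):

```scala
import scala.math.exp

// h is only instantiated, and g(f(v)) only computed, on first use
lazy val h = new H with G with F {
  val f: V => V = (v: V) => v.map(exp)
  val g: V => Double = (v: V) => v.sum
}

val v: V = Vector(1.0, 2.0, 3.0)   // illustrative input
h(v)                               // first access instantiates h and evaluates g(f(v))
```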
Clearly, Scala preserves the formalism of mathematical models, making it easier for scientists and developers to migrate their existing projects written in scientifically oriented languages such as R.
Note
Emulation of R
Most data scientists use the language R to create models and apply learning strategies. They may consider Scala as an alternative to R in some cases, as Scala preserves the mathematical formalism used in models implemented in R.
Let's extend the concept of preserving mathematical formalism to the dynamic creation of workflows using traits. The design pattern described in the next section is sometimes referred to as the cake pattern.
Composing mixins to build workflow
This section presents the key constructs behind the cake pattern. A workflow composed of configurable data transformations requires a dynamic modularization (substitution) of the different stages of the workflow.
Note
Traits and mixins
Mixins are traits that are stacked against a class. The composition of mixins and the cake pattern described in this section are important for defining sequences of data transformation. However, the topic is not directly related to machine learning and the reader can skip this section.
The cake pattern is an advanced class composition pattern that uses mixin traits to meet the demands of a configurable computation workflow. It is also known as stackable modification traits [2:4].
This is not an in-depth analysis of stackable trait injection and self-references in Scala. There are a few interesting articles on dependency injection that are worth a look [2:5].
Java relies on packages, which are tightly coupled with the directory structure and the package prefix, to modularize the code base. Scala provides developers with a more flexible and reusable approach to creating and organizing modules: traits. Traits can be nested, mixed in with classes, stacked, and inherited.
Understanding the problem
Dependency injection is a fancy name for a reverse look-up and binding to dependencies. Let's consider a simple application that requires data preprocessing, classification, and validation.
A simple implementation using traits looks like this:
```scala
val app = new Classification with Validation with PreProcessing {
  val filter = ???
}
```
If, at a later stage, you need to use an unsupervised clustering algorithm instead of a classifier, then the application has to be re-wired:
```scala
val app = new Clustering with Validation with PreProcessing {
  val filter = ???
}
```
This approach results in code duplication and lack of flexibility. Moreover, the class member filter needs to be redefined for each new class in the composition of the application. The problem arises when there is a dependency between the traits used in the composition. Let's consider the case where filter depends on the validation methodology.
Tip
Mixins linearization [2:6]
The linearization, or order of invocation of methods between mixins, follows a right-to-left and base-to-subtype pattern:
- Trait B extends A
- Trait C extends A
- Class M extends N with C with B
- The Scala compiler implements the linearization as follows: M => B => C => A => N (see the sketch below)
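The following minimal sketch makes the invocation order visible by chaining super calls; the trace method is purely illustrative, and A is assumed to be a trait so that it can be mixed in:

```scala
trait A { def trace: List[String] = List("A") }
trait B extends A { override def trace: List[String] = "B" :: super.trace }
trait C extends A { override def trace: List[String] = "C" :: super.trace }
class N                            // base class, does not define trace

class M extends N with C with B

new M().trace                      // List(B, C, A): B is invoked first, then C, then A
```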
Although you can define filter as an abstract value, it still has to be redefined each time a new validation type is introduced. The solution is to use a self-type in the definition of the new composed trait PreProcessingWithValidation:
```scala
trait PreProcessingWithValidation extends PreProcessing {
  self: Validation =>
  val filter = ???
}
```
The application is built by stacking the PreProcessingWithValidation mixin against the class Classification:
```scala
val app = new Classification with PreProcessingWithValidation {
  val validation: Validation = ???
}
```
Tip
Overriding def with val
It is advantageous to override the declaration of a method with a declaration of a value with the same signature. Contrary to a value, which is assigned once and for all during instantiation, a method may return a different value at each invocation.
A def is a procedure that can be redefined as a def, a val, or a lazy val. Therefore, you should not override a value declaration with a method with the same signature:
```scala
trait Validator { val g = (n: Int) => ??? }
trait MyValidator extends Validator { def g(n: Int) = ??? }  // WRONG
```
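Conversely, a parameterless def may legitimately be overridden, or implemented, by a val or a lazy val, as in this illustrative sketch (the trait and member names here are hypothetical):

```scala
trait Metric { def threshold: Double }           // abstract, parameterless def

trait StrictMetric extends Metric {
  override val threshold: Double = 0.95          // a val may implement the def
}

trait DeferredMetric extends Metric {
  override lazy val threshold: Double = 0.95     // so may a lazy val
}
```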
Let's adapt and generalize this pattern to construct a boilerplate template in order to create dynamic computational workflows.
Defining modules
The first step is to generate different modules to encapsulate different types of data transformation.
Tip
Use case for describing the cake pattern
It is difficult to build an example of a real-world workflow using the classes and algorithms introduced later in the book. The following simple example is realistic enough to illustrate the different components of the cake pattern.
Let's define a sequence of three parameterized modules that each define a specific data transformation using the explicit configuration of type ETransform:
- Sampling to extract a sample from raw data
- Normalization to normalize the sampled data over [0, 1]
- Aggregation to aggregate or reduce the data

```scala
trait Sampling[T,A] { val sampler: ETransform[T, A] }
trait Normalization[T,A] { val normalizer: ETransform[T, A] }
trait Aggregation[T,A] { val aggregator: ETransform[T, A] }
```
The modules contain a single abstract value. One characteristic of the cake pattern is to enforce strict modularity by initializing the abstract values with the type encapsulated in the module. One of the objectives in building the framework is to allow developers to create data transformations (inherited from ETransform) independently of any workflow.
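ETransform is the explicit transform introduced in the Monadic data transformation section of Chapter 1. As a reminder of the shape relied upon in this section, here is a simplified stand-in inferred from its usage below; it is a sketch for readability, not the library's actual definition:

```scala
import scala.util.Try

// Hypothetical, simplified stand-in for the explicit configuration and transform types
// used below; the actual ETransform, Config, ConfigInt, and ConfigDouble are defined
// in Chapter 1 of the book.
sealed trait Config
case class ConfigInt(n: Int) extends Config
case class ConfigDouble(value: Double) extends Config

abstract class ETransform[T, A](val config: Config) {
  def |> : PartialFunction[T, Try[A]]   // explicit transformation from T to A
}
```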
Tip
Scala traits and Java packages
There is a major difference between Scala and Java in terms of modularity. Java packages constrain developers to follow a strict syntax, requiring, for instance, that the source file has the same name as the public class it contains. Scala modules based on stackable traits are far more flexible.
Instantiating the workflow
The next step is to wire the different modules into a workflow. This is achieved by using a self-reference to the stack of the three traits defined in the previous paragraph.
Here's the code:
```scala
class Workflow[T,U,V,W] {
  self: Sampling[T,U] with Normalization[U,V] with Aggregation[V,W] =>

  def |> (t: T): Try[W] = for {
    u <- sampler |> t
    v <- normalizer |> u
    w <- aggregator |> v
  } yield w
}
```
A picture is worth a thousand words; the following UML class diagram illustrates the workflow factory (or cake) design pattern:

UML class diagram of the workflow factory
Finally, the workflow is instantiated by dynamically initializing the abstract values sampler, normalizer, and aggregator of the transformations, as long as the signature (input and output types) matches the parameterized types defined in each module (line 1):
```scala
type DblF = Double => Double     //3
type DblVec = Vector[Double]     //4
val samples = 100; val normRatio = 10; val splits = 4

val workflow = new Workflow[DblF, DblVec, DblVec, Int]
    with Sampling[DblF, DblVec]
    with Normalization[DblVec, DblVec]
    with Aggregation[DblVec, Int] {
  val sampler = ???       //1
  val normalizer = ???
  val aggregator = ???
}
```
Let's implement the data transformation function for each of the three modules/traits by assigning a transformation to the abstract values.
The first transformation, sampler, samples a function f with frequency 1/samples over the interval [0, 1]. The second transformation, normalizer, normalizes the data over the range [0, 1] using the Stats class introduced in the next chapter. The last transformation, aggregator, extracts the index of the largest sample (of value 1.0):
```scala
val sampler = new ETransform[DblF, DblVec](ConfigInt(samples)) {   //2
  override def |> : PartialFunction[DblF, Try[DblVec]] = {
    case f: DblF =>
      Try(Vector.tabulate(samples)(n => f(1.0*n/samples)))          //5
  }
}
```
The transformation sampler uses a single model or configuration parameter, samples (line 2). The type DblF of the input is defined as Double => Double (line 3) and the type of the output as a vector of floating point values, DblVec (line 4). In this particular case, the transformation consists of applying the input function f to a vector of increasing normalized values (line 5).
The normalizer and aggregator transforms follow the same design pattern as the sampler:
```scala
val normalizer = new ETransform[DblVec, DblVec](ConfigDouble(normRatio)) {
  override def |> : PartialFunction[DblVec, Try[DblVec]] = {
    case x: DblVec if (x.size > 0) => Try(Stats[Double](x).normalize)
  }
}

val aggregator = new ETransform[DblVec, Int](ConfigInt(splits)) {
  override def |> : PartialFunction[DblVec, Try[Int]] = {
    case x: DblVec if (x.size > 0) =>
      Try((0 until x.size).find(x(_) == 1.0).getOrElse(-1))
  }
}
```
The instantiation of the transformation function follows the template described in the Monadic data transformation section of Chapter 1, Getting Started.
The workflow is now ready to process any function as input:
```scala
import scala.util.Random.nextDouble

val g = (x: Double) => Math.log(x + 1.0) + nextDouble

Try(workflow |> g)   //6
```
The workflow is executed by providing the input function g to the first mixin, sampler (line 6).
Scala's strong type checking catches any inconsistent data types at compilation time, which shortens the development cycle because runtime errors are more difficult to track down than compilation errors.
Note
Mixin composition for ITransform
We arbitrarily selected a data transformation using an explicit configuration, ETransform, to illustrate the concept of mixin composition. The same pattern applies to the implicit data transformation, ITransform.
Modularizing
The last step is the modularization of the workflow. For complex scientific computations, you need to be able to do the following:
- Select the appropriate workflow as a sequence of modules or tasks according to the objective of the execution (regression, classification, clustering…)
- Select the appropriate algorithm to fulfill a task according to the quality of the data (noisy, incomplete, …)
- Select the appropriate implementation of the algorithm according to the environment (distributed with high latency network, single host…):
Illustration of the dynamic creation of workflow from modules/traits
Let's consider a simple preprocessing task, defined in the module PreprocessingModule. The module (or task) is declared as a trait to hide its internal workings from other modules. The pre-processing task is executed by a preprocessor of type Preprocessor. We have arbitrarily listed two algorithms, the exponential moving average of type ExpMovingAverage and the discrete Fourier transform low-pass filter of type DFTFilter, as potential pre-processors:
```scala
trait PreprocessingModule[T] {
  trait Preprocessor[T] {                                    //7
    def execute(x: Vector[T]): Try[DblVec]
  }

  val preprocessor: Preprocessor[T]                          //8

  class ExpMovingAverage[T: ToDouble](p: Int)                //9
      (implicit num: Numeric[T]) extends Preprocessor[T] {
    val expMovingAvg = filtering.ExpMovingAverage[T](p)      //10
    val pfn = expMovingAvg |>                                //11
    override def execute(x: Vector[T]): Try[DblVec] = pfn(x)
  }

  class DFTFilter[T: ToDouble](
      fc: Double,
      g: (Double, Double) => Double
  ) extends Preprocessor[T] {                                //12
    val filter = filtering.DFTFir[T](g, fc, 1e-5)
    val pfn = filter |>
    override def execute(x: Vector[T]): Try[DblVec] = pfn(x)
  }
}
```
The generic pre-processor trait Preprocessor declares a single method, execute, whose purpose is to filter an input vector x of elements of type T for noise (line 7). The instance of the pre-processor is declared as an abstract value, to be instantiated as one of the filtering algorithms (line 8).
The first filtering algorithm, of type ExpMovingAverage, implements the Preprocessor trait and overrides the execute method (line 9). The class declares the algorithm but delegates its implementation to a class with an identical signature, org.scalaml.filtering.ExpMovingAverage (line 10). Data of the generic type T is automatically converted into a vector of Double using a context bound with the syntax T: ToDouble. The context bound is implemented by the following trait:
```scala
trait ToDouble[T] { def apply(t: T): Double }
```
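For illustration, a ToDouble instance can be supplied as an implicit value, and the context bound T: ToDouble desugars into an implicit parameter. The instance and the helper toDblVec below are hypothetical; the library provides its own conversions:

```scala
// Hypothetical implicit instance; org.scalaml defines its own ToDouble conversions.
implicit val intToDouble: ToDouble[Int] = new ToDouble[Int] {
  def apply(t: Int): Double = t.toDouble
}

// A context bound [T: ToDouble] is syntactic sugar for an implicit ToDouble[T] argument
def toDblVec[T: ToDouble](x: Vector[T]): Vector[Double] =
  x.map(implicitly[ToDouble[T]].apply(_))

toDblVec(Vector(1, 2, 3))   // Vector(1.0, 2.0, 3.0)
```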
The partial function returned by the |> method is instantiated as a value, pfn, so it can be applied multiple times (line 11). The same design pattern is used for the discrete Fourier transform filter (line 12).
The filtering algorithm (ExpMovingAverage or DFTFir) is selected according to the profile or characteristics of the input data. Its implementation in the org.scalaml.filtering package depends on the environment (single host, Akka cluster, Apache Spark…).
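Wiring a concrete pre-processor into the module then amounts to initializing the abstract value preprocessor. The following is a minimal sketch, assuming a ToDouble[Double] instance is in scope; the averaging period of 3 and the input vector are illustrative:

```scala
val module = new PreprocessingModule[Double] {
  // Select the exponential moving average as the pre-processing algorithm
  val preprocessor: Preprocessor[Double] = new ExpMovingAverage[Double](3)
}

val smoothed = module.preprocessor.execute(Vector(1.0, 2.0, 4.0, 8.0))
```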
Note
Filtering algorithms
The filtering algorithms used to illustrate the concept of modularization in the context of the cake pattern are described in detail in Chapter 3, Data Pre-processing.