Scala for Data Science
Pascal Bugnion
Towards re-usable code
In the previous section, we performed all of the computation in a single script. While this is fine for data exploration, it means that we cannot reuse the logistic regression code that we have built. In this section, we will start the construction of a machine learning library that you can reuse across different projects.
We will factor the logistic regression algorithm out into its own class. We construct a LogisticRegression class:
import breeze.linalg._
import breeze.numerics._
import breeze.optimize._

class LogisticRegression(
    val training: DenseMatrix[Double],
    val target: DenseVector[Double]) {
The class takes, as input, a matrix representing the training set and a vector denoting the target variable. Notice how we assign these to vals, meaning that they are set on class creation and will remain the same until the instance is destroyed. Of course, the DenseMatrix and DenseVector objects themselves are mutable, so the values that training and target point to might change. Since mutable state makes reasoning about program behavior difficult, best practice dictates that we avoid taking advantage of this mutability.
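For instance, here is a minimal illustration (the variable name is ours, not from the listing) of how a DenseMatrix bound to a val can still be modified in place:

import breeze.linalg._

val m = DenseMatrix((1.0, 2.0), (3.0, 4.0))
// m cannot be rebound to another matrix, but its elements can be updated:
m(0, 0) = 99.0
println(m) // the top-left entry is now 99.0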
Let's add a method that calculates the cost function and its gradient:
def costFunctionAndGradient(coefficients: DenseVector[Double])
    : (Double, DenseVector[Double]) = {
  val xBeta = training * coefficients
  val expXBeta = exp(xBeta)
  // negative log-likelihood of the training set
  val cost = -sum((target :* xBeta) - log1p(expXBeta))
  // gradient of the cost with respect to the coefficients
  val probs = sigmoid(xBeta)
  val grad = training.t * (probs - target)
  (cost, grad)
}
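For reference, cost here is the negative log-likelihood of the logistic model and grad is its gradient; in matrix form:

\[
C(\beta) = -\sum_i \left[ y_i \, x_i^\top \beta - \log\left(1 + e^{x_i^\top \beta}\right) \right],
\qquad
\nabla_\beta C = X^\top \left( \sigma(X\beta) - y \right)
\]

where X is the training matrix, y the target vector, and σ the sigmoid function. The expressions sum((target :* xBeta) - log1p(expXBeta)) and training.t * (probs - target) map directly onto these two formulas.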
We are now all set up to run the optimization to calculate the coefficients that best reproduce the training set. In traditional object-oriented languages, we might define a getOptimalCoefficients method that returns a DenseVector of the coefficients. Scala, however, is more elegant. Since we have defined the training and target attributes as vals, there is only one possible set of optimal coefficients. We could, therefore, define a val optimalCoefficients = ??? class attribute that holds the optimal coefficients. The problem with this is that it forces all the computation to happen when the instance is constructed. This would be unexpected for the user and might be wasteful: if the user is only interested in accessing the cost function, for instance, the time spent minimizing it would be wasted. The solution is to use a lazy val. This value will only be evaluated when the client code requests it:
lazy val optimalCoefficients = ???
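As a minimal illustration of the difference (this toy class is ours, not part of the regression code):

class Example {
  val eager = { println("computing eager"); 1 }            // evaluated at construction
  lazy val deferred = { println("computing deferred"); 2 } // evaluated on first access
}

val e = new Example // prints "computing eager"
e.deferred          // prints "computing deferred" and returns 2
e.deferred          // result is cached: returns 2 without printing again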
To help with the calculation of the coefficients, we will define a private helper method:
private def calculateOptimalCoefficients: DenseVector[Double] = {
  // wrap the cost function and gradient in Breeze's DiffFunction interface
  val f = new DiffFunction[DenseVector[Double]] {
    def calculate(parameters: DenseVector[Double]) =
      costFunctionAndGradient(parameters)
  }
  // start the minimization from the zero vector
  minimize(f, DenseVector.zeros[Double](training.cols))
}

lazy val optimalCoefficients = calculateOptimalCoefficients
} // closes class LogisticRegression
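With the class complete, a short usage sketch might look as follows (the toy data values here are made up for illustration):

import breeze.linalg._

// A toy training set: a column of ones for the intercept, plus one feature.
val training = DenseMatrix(
  (1.0,  0.5),
  (1.0, -0.2),
  (1.0,  1.3),
  (1.0, -0.8))
val target = DenseVector(1.0, 0.0, 1.0, 1.0)

val regression = new LogisticRegression(training, target)
// The minimization only runs when optimalCoefficients is first accessed:
println(regression.optimalCoefficients)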
We have refactored the logistic regression into its own class, which we can reuse across different projects.
If we were planning on reusing the height-weight data, we could, similarly, refactor it into a class of its own that facilitates data loading, feature scaling, and any other functionality that we find ourselves reusing often.
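A hypothetical sketch of what such a class might look like (the name HeightWeightData and its members are ours, assuming Breeze's breeze.stats for mean and stddev):

import breeze.linalg._
import breeze.stats._

class HeightWeightData(
    val heights: DenseVector[Double],
    val weights: DenseVector[Double]) {

  // centre each feature and scale it to unit variance
  private def rescale(v: DenseVector[Double]) = (v - mean(v)) / stddev(v)

  lazy val rescaledHeights = rescale(heights)
  lazy val rescaledWeights = rescale(weights)

  // feature matrix: an intercept column plus the rescaled features
  lazy val featureMatrix = DenseMatrix.horzcat(
    DenseMatrix.ones[Double](heights.length, 1),
    rescaledHeights.toDenseMatrix.t,
    rescaledWeights.toDenseMatrix.t)
}

As in LogisticRegression, the rescaled features are lazy vals, so feature scaling only happens if a client actually asks for the scaled data.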