
Towards re-usable code

In the previous section, we performed all of the computation in a single script. While this is fine for data exploration, it means that we cannot reuse the logistic regression code that we have built. In this section, we will start the construction of a machine learning library that you can reuse across different projects.

We will factor the logistic regression algorithm out into its own LogisticRegression class:

import breeze.linalg._
import breeze.numerics._
import breeze.optimize._

class LogisticRegression(
    val training:DenseMatrix[Double], 
    val target:DenseVector[Double])
{

The class takes, as input, a matrix representing the training set and a vector denoting the target variable. Notice how we assign these to vals, meaning that they are set on instance creation and cannot be reassigned for the lifetime of the instance. Of course, the DenseMatrix and DenseVector objects are themselves mutable, so the values that training and target point to might change. Since mutable state makes reasoning about program behavior difficult, best practice dictates that we avoid taking advantage of this mutability.
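To see the distinction, consider this toy snippet (not part of the class we are building; it assumes the breeze.linalg._ import above):

val vector = DenseVector(1.0, 2.0, 3.0)
vector(0) = 99.0 // allowed: the binding is a val, but the contents are mutable
// vector = DenseVector(4.0, 5.0, 6.0) // does not compile: a val cannot be reassigned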

Let's add a method that calculates the cost function and its gradient:

  def costFunctionAndGradient(coefficients:DenseVector[Double])
  :(Double, DenseVector[Double]) = {
    val xBeta = training * coefficients
    val expXBeta = exp(xBeta)
    // Cost: the negative log-likelihood of the logistic model.
    val cost = - sum((target :* xBeta) - log1p(expXBeta))
    // Gradient of the cost with respect to the coefficients.
    val probs = sigmoid(xBeta)
    val grad = training.t * (probs - target)
    (cost, grad)
  }
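For reference, the cost computed by this method is the negative log-likelihood of the logistic model, and the gradient follows by differentiating it with respect to the coefficients (a standard result, stated here so that the code is easier to check):

$$C(\beta) = -\sum_{i=1}^{n} \left[ y_i \, \mathbf{x}_i^T \beta - \log\left(1 + e^{\mathbf{x}_i^T \beta}\right) \right], \qquad \nabla_\beta C = X^T \left( \mathrm{sigmoid}(X\beta) - y \right)$$

The first term corresponds to target :* xBeta, the second to log1p(expXBeta), and the gradient expression to training.t * (probs - target).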

We are now all set up to run the optimization to calculate the coefficients that best reproduce the training set. In traditional object-oriented languages, we might define a getOptimalCoefficients method that returns a DenseVector of the coefficients. Scala, however, is more elegant. Since we have defined the training and target attributes as vals, there is only one possible set of optimal coefficients. We could, therefore, define a val optimalCoefficients = ??? class attribute that holds the optimal coefficients. The problem with this is that it forces all the computation to happen when the instance is constructed. This would be unexpected for the user and might be wasteful: if the user is only interested in accessing the cost function, for instance, the time spent minimizing it would be wasted. The solution is to use a lazy val, which is only evaluated when the client code requests it:

lazy val optimalCoefficients = ???
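To illustrate the difference in evaluation order, consider this toy class (unrelated to our regression):

class LazyDemo {
  lazy val answer = { println("computing..."); 42 }
}

val demo = new LazyDemo // prints nothing
demo.answer // prints "computing..." and returns 42
demo.answer // returns the cached 42 without recomputing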

To help with the calculation of the coefficients, we will define a private helper method:

private def calculateOptimalCoefficients
:DenseVector[Double] = {
  // Wrap the cost function and its gradient in a Breeze
  // DiffFunction, which the optimizer evaluates at each step.
  val f = new DiffFunction[DenseVector[Double]] {
    def calculate(parameters:DenseVector[Double]) =
      costFunctionAndGradient(parameters)
  }

  // Minimize the cost, starting from a vector of zeros.
  minimize(f, DenseVector.zeros[Double](training.cols))
}

lazy val optimalCoefficients = calculateOptimalCoefficients

}
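As a quick check that everything fits together, here is a sketch of how client code might use the class; the data is made up purely for illustration:

val trainingData = DenseMatrix(
  (1.0, -0.5),
  (1.0,  0.3),
  (1.0,  1.2),
  (1.0, -1.1))
val targetData = DenseVector(0.0, 1.0, 0.0, 1.0)

val regression = new LogisticRegression(trainingData, targetData)

// The optimization only runs here, on first access:
println(regression.optimalCoefficients)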

We have refactored the logistic regression into its own class, which we can reuse across different projects.

If we were planning on reusing the height-weight data, we could, similarly, refactor it into a class of its own that facilitates data loading, feature scaling, and any other functionality that we find ourselves reusing often.
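For instance, a minimal sketch of such a class might look like the following; the class name, and the decision to expose a feature matrix with an intercept column, are our own assumptions rather than something fixed by the preceding text:

import breeze.linalg._
import breeze.stats._

// Hypothetical wrapper around the height-weight dataset.
class HeightWeightData(
    val heights:DenseVector[Double],
    val weights:DenseVector[Double])
{

  // Feature scaling: subtract the mean and divide by the
  // standard deviation.
  private def rescale(v:DenseVector[Double]) =
    (v - mean(v)) / stddev(v)

  lazy val rescaledHeights = rescale(heights)
  lazy val rescaledWeights = rescale(weights)

  // Feature matrix: a column of ones for the intercept term,
  // followed by the rescaled features.
  lazy val featureMatrix = DenseMatrix.horzcat(
    DenseMatrix.ones[Double](heights.length, 1),
    rescaledHeights.toDenseMatrix.t,
    rescaledWeights.toDenseMatrix.t)
}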
