
Labeled point vector

Before running any supervised machine learning algorithm in Spark MLlib, we must convert our dataset into labeled point vectors, each of which maps a set of features to a given label (response). Labels are stored as doubles, which lets the same structure serve both classification and regression tasks. For all binary classification problems, labels should be stored as either 0 or 1; the preceding summary statistics confirmed that this holds for our example.

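import org.apache.spark.mllib.regression.LabeledPoint

// Pair each label with its feature vector and wrap the pair in a LabeledPoint
val higgs = response.zip(features).map {
  case (response, features) => LabeledPoint(response, features)
}

// Name the RDD (the name is shown in the Spark UI) and cache it for reuse
higgs.setName("higgs").cache()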

An example of a labeled point vector follows:

(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789]) 

In the preceding example, the doubles inside the brackets are the features, and the single number outside the brackets is the label. Note that we have not yet told Spark that we are performing a classification task rather than a regression task; that happens later.
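To make the structure concrete, the following minimal sketch builds a single labeled point by hand and reads its two parts back. The feature values and the name `point` are hypothetical and chosen only for illustration; they are not taken from the Higgs dataset.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical values, for illustration only
val point = LabeledPoint(1.0, Vectors.dense(0.123, 0.456, 0.567))

println(point.label)     // the label, stored as a Double
println(point.features)  // the feature vector
```

Nothing in the object itself marks the label as a class rather than a continuous target, which is why the choice between classification and regression has to be made when the algorithm is configured.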

In this example, all input features contain only numeric values, but in many situations the data contains categorical values or strings. Any such non-numeric representation needs to be converted into numbers, which we will show later in this book.
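As a rough idea of what such a conversion can look like, the sketch below maps each distinct string in a column to a numeric index using plain Scala collections. This is only one simple encoding and not necessarily the approach used later in the book; the column `colours` and its values are hypothetical.

```scala
// Hypothetical categorical column, for illustration only
val colours = Seq("red", "green", "blue", "green", "red")

// Give each distinct category an index, in order of first appearance
val index: Map[String, Double] =
  colours.distinct.zipWithIndex.map { case (c, i) => c -> i.toDouble }.toMap

// Replace every string with its numeric index
val encoded: Seq[Double] = colours.map(index)
```

Bear in mind that a plain index implies an ordering among the categories that may not exist in the data, so whether this encoding is appropriate depends on the algorithm that consumes it.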