Labeled point vector

Before running any supervised machine learning algorithm in Spark MLlib, we must convert our dataset into a labeled point vector, which maps features to a given label (response). Labels are stored as doubles, so the same structure serves both classification and regression tasks. For binary classification problems, labels should be stored as either 0 or 1, which the preceding summary statistics confirm holds for our example.

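import org.apache.spark.mllib.regression.LabeledPoint

// Pair each response value with its feature vector
val higgs = response.zip(features).map {
  case (response, features) => LabeledPoint(response, features)
}

// Name the RDD and cache it for reuse
higgs.setName("higgs").cache()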

An example of a labeled point vector follows:

(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789]) 

In the preceding example, the doubles inside the brackets are the features, and the single number outside the brackets is the label. Note that we have not yet told Spark that we are performing a classification task rather than a regression task; that happens later.
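To make the structure concrete, the following is a minimal sketch that builds a single labeled point by hand and reads its parts back. It assumes a spark-shell session with Spark MLlib on the classpath; the feature values are invented for illustration and are not taken from the Higgs dataset.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative values only (an assumption, not real Higgs data)
val point = LabeledPoint(1.0, Vectors.dense(0.123, 0.456, 0.567))

// The label is a plain Double, whether the task is classification or regression
val label: Double = point.label

// The features are an MLlib Vector; size gives the number of features
val featureCount: Int = point.features.size

// Individual features can be read by index
val firstFeature: Double = point.features(0)
```

Nothing in the labeled point itself records whether the label is a class or a continuous target; that is decided by the algorithm we train on it.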

In this example, all input features are numeric, but in many situations the data contains categorical values or strings. All such non-numeric data needs to be converted into numbers, which we will show later in this book.
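As a rough idea of what such a conversion can look like, here is a small sketch in plain Scala that maps each distinct string to a numeric index. The column name `colors` and its values are hypothetical, and this is only one simple encoding, not necessarily the approach used later in the book.

```scala
// Hypothetical categorical column; the values are made up for illustration
val colors = Seq("red", "green", "blue", "green", "red")

// Assign each distinct category an index in order of first appearance
val index: Map[String, Double] =
  colors.distinct.zipWithIndex.map { case (c, i) => c -> i.toDouble }.toMap

// Replace every string with its numeric index
val encoded: Seq[Double] = colors.map(index)
```

An index like this imposes an arbitrary order on the categories, so it is best suited to algorithms that can be told a feature is categorical; one-hot encoding is a common alternative.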