
Creating a training and testing set

As with most supervised learning tasks, we split our dataset so that we train a model on one subset and then test its ability to generalize to new data on the holdout set. For this example, we split the data 80/20, but there is no hard rule on what the ratio should be or, for that matter, how many splits there should be in the first place:

// Create Train & Test Splits 
val trainTestSplits = higgs.randomSplit(Array(0.8, 0.2)) 
val (trainingData, testData) = (trainTestSplits(0), trainTestSplits(1)) 

By creating our 80/20 split on the dataset, we are taking a random sample of roughly 8.8 million examples as our training set and the remaining 2.2 million or so as our testing set. We could just as easily take another random 80/20 split and generate a new training set of about the same size (8.8 million) but containing different data. This type of hard splitting of the original dataset is often referred to as the holdout method for model validation. It introduces a sampling bias, which means that our model will learn to fit the training data, but the training data may not be representative of "reality". Given that we are working with 11 million examples, this bias is far less pronounced than it would be if our original dataset had only 100 rows, for example.
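To make the point about repeated random splits concrete, here is a minimal plain-Scala sketch that runs without a Spark cluster. The helper holdoutSplit and the 1,000-element toy dataset are assumptions made for illustration; this is not Spark's implementation. It mimics the idea of assigning each example to the training set independently with probability 0.8, which is why split sizes are approximate rather than exact. Spark's randomSplit also accepts an optional seed argument, which plays the same role as the seed parameter here.

```scala
import scala.util.Random

// Hypothetical helper (not part of Spark): holdout split on a local collection.
// Each example lands in the training set with probability trainFraction,
// so the resulting sizes are only approximately trainFraction of the data.
def holdoutSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Vector[A], Vector[A]) = {
  val rng = new Random(seed)
  // Tag every example with a random train/test flag in one strict pass
  val tagged = data.toVector.map(x => (x, rng.nextDouble() < trainFraction))
  val (train, test) = tagged.partition(_._2)
  (train.map(_._1), test.map(_._1))
}

// Toy stand-in for the dataset (assumption: 1,000 integer "examples")
val examples = (1 to 1000).toVector

val (trainA, testA) = holdoutSplit(examples, 0.8, seed = 1L)
val (trainB, testB) = holdoutSplit(examples, 0.8, seed = 2L)
val (trainA2, _)    = holdoutSplit(examples, 0.8, seed = 1L)

println(s"split A: ${trainA.size} train / ${testA.size} test")
println(s"split B: ${trainB.size} train / ${testB.size} test")
println(s"same seed reproduces the split: ${trainA == trainA2}")
println(s"different seeds select different data: ${trainA != trainB}")
```

Fixing the seed makes a split reproducible, while changing it yields a training set of similar size drawn from different rows, which is the source of the sampling bias described above.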

You can also use the H2O Flow to split the data:

  1. Publish the Higgs data as an H2OFrame:
val higgsHF = h2oContext.asH2OFrame(higgs.toDF, "higgsHF") 
  2. Split the data in the Flow UI using the splitFrame command (see Figure 7).
  3. Publish the results back as an RDD.
Figure 7 - Splitting Higgs dataset into two H2O frames representing 80 and 20 percent of data.

In contrast to Spark's lazy evaluation, the H2O computation model is eager. That means the splitFrame invocation processes the data right away and creates two new frames, which can be accessed directly.
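The difference between lazy and eager evaluation can be illustrated with a small plain-Scala analogy. This sketch uses neither Spark nor H2O: a collection view stands in for a lazy transformation, and a strict map stands in for an eager operation. The counter evaluated is only there to show when the work actually happens.

```scala
// Counts how many times the mapping function has actually run
var evaluated = 0
val data = Vector(1, 2, 3, 4, 5)

// Lazy (analogous to a Spark transformation): the view only records
// what should be computed; nothing runs yet.
val lazyDoubled = data.view.map { x => evaluated += 1; x * 2 }
val afterLazyDefinition = evaluated

// Eager (analogous to the behavior described for splitFrame):
// the work is done immediately, at the point of the call.
val eagerDoubled = data.map { x => evaluated += 1; x * 2 }
val afterEager = evaluated

// Forcing the view (analogous to a Spark action) finally triggers the work
val forced = lazyDoubled.toVector
val afterForce = evaluated

println(s"after defining the lazy view: $afterLazyDefinition evaluations")
println(s"after the eager map:          $afterEager evaluations")
println(s"after forcing the lazy view:  $afterForce evaluations")
```

Both paths produce the same values; they differ only in when the computation is carried out.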
