- Mastering Machine Learning with Spark 2.x
- Alex Tellez Max Pumperla Michal Malohlava
- 2021-07-02 18:46:08
Creating a training and testing set
As with most supervised learning tasks, we will split our dataset so that we train a model on one subset and then test its ability to generalize to new data using the holdout set. For the purposes of this example, we split the data 80/20, but there is no hard rule on what the ratio should be or, for that matter, how many splits there should be in the first place:
// Create train and test splits
val trainTestSplits = higgs.randomSplit(Array(0.8, 0.2))
val (trainingData, testData) = (trainTestSplits(0), trainTestSplits(1))
With an 80/20 split, we take a random sample of roughly 8.8 million examples as our training set and keep the remaining 2.2 million or so as our testing set. The counts are approximate because randomSplit treats the weights as sampling probabilities rather than exact quotas. This approach is often referred to as the holdout method for model validation. We could just as easily take another random 80/20 split and obtain a training set of about the same size but with different examples. Fixing a single split in this way introduces a sampling bias: the model learns to fit the training data, but that particular sample may not be representative of "reality". Given that we are working with 11 million examples, this bias is far less prominent than it would be if our original dataset had only 100 rows, for example.
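To make the holdout split concrete, here is a minimal sketch that can be pasted into spark-shell. It assumes a small synthetic RDD of 10,000 integers (the hypothetical `rows`) as a stand-in for `higgs`, and it passes a fixed `seed` (the value 42 is arbitrary) so that the same split comes back on reruns over the same data and partitioning.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("holdout-split-sketch")
  .getOrCreate()

// Hypothetical stand-in for the `higgs` RDD: 10,000 integer "rows".
val rows = spark.sparkContext.parallelize(1 to 10000)

// A fixed seed makes the split repeatable for the same data and partitioning.
val Array(trainingData, testData) = rows.randomSplit(Array(0.8, 0.2), seed = 42L)

val trainCount = trainingData.count()
val testCount = testData.count()

// The weights are sampling probabilities, so the counts land near
// 8,000 and 2,000 without being guaranteed to match them exactly.
println(s"train = $trainCount, test = $testCount")
```

Every row ends up in exactly one of the two subsets, so the two counts always add up to the size of the input even though the individual counts vary with the seed.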
You can also use H2O Flow to split the data:
- Publish the Higgs data as an H2OFrame:
val higgsHF = h2oContext.asH2OFrame(higgs.toDF, "higgsHF")
- Split the data in the Flow UI using the splitFrame command (see Figure 07).
- Publish the results back to an RDD.

In contrast to Spark's lazy evaluation, the H2O computation model is eager. This means that the splitFrame invocation processes the data right away and creates two new frames, which can be accessed directly.
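The Spark half of that contrast can be observed directly. The following sketch covers only the Spark side (no H2O calls are involved) and assumes a local session plus a hypothetical accumulator named `processed` that counts how many rows the map function has actually touched.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell, or starts a local one otherwise.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-evaluation-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Counts how many rows the map function has actually processed.
val processed = sc.longAccumulator("processed rows")

// A transformation: Spark only records it in the lineage, nothing runs yet.
val doubled = sc.parallelize(1 to 1000).map { x =>
  processed.add(1)
  x * 2
}
val seenBeforeAction: Long = processed.value // no job has run, so still 0

// An action: this triggers the job, and the map function runs now.
val total = doubled.count()
val seenAfterAction: Long = processed.value // all 1,000 rows were processed
```

An eager engine would have done the work at the point where the transformation was declared, which is the behavior the text describes for splitFrame.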