- Mastering Machine Learning with Spark 2.x
- Alex Tellez Max Pumperla Michal Malohlava
- 2021-07-02 18:46:08
What about cross-validation?
Often, in the case of smaller datasets, data scientists employ a technique known as cross-validation, which is also available to you in Spark. The CrossValidator class starts by splitting the dataset into N folds (a user-declared number); each fold is used N-1 times as part of the training set and once for model validation. For example, if we declare that we wish to use 5-fold cross-validation, the CrossValidator class will create five pairs of datasets (training and testing), each using four-fifths of the dataset as the training set and the remaining fifth as the test set, as shown in the following figure.
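The 5-fold split described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Spark's CrossValidator is implemented; the helper name `k_fold_splits` and its `seed` parameter are hypothetical, and the rows are simply the integers 0 to 99.

```python
import random


def k_fold_splits(rows, num_folds, seed=42):
    """Yield (train, validation) pairs; each row is validated exactly once.

    Hypothetical helper for illustration only (not part of Spark).
    """
    shuffled = list(rows)
    if not 2 <= num_folds <= len(shuffled):
        raise ValueError("num_folds must be between 2 and the number of rows")
    # Shuffle with a fixed seed so the split is random but reproducible.
    random.Random(seed).shuffle(shuffled)
    # Deal rows round-robin into folds so fold sizes differ by at most one.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        # Training set = every fold except the one held out for validation.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, validation


pairs = list(k_fold_splits(range(100), num_folds=5))
for train, validation in pairs:
    print(len(train), len(validation))  # prints "80 20" for each of the five pairs
```

Each row appears in exactly one validation set and in the training set of the other four pairs, which matches the "N-1 times for training, once for validation" description.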
The idea is that we see the performance of our algorithm across different, randomly sampled datasets, which accounts for the inherent sampling bias of a single training/testing split that trains on 80% of the data. An example of a model that does not generalize well would be one whose accuracy (as measured by overall error, for example) is all over the map, with wildly different error rates from fold to fold, which would suggest that we need to rethink our model.
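One rough way to quantify "all over the map" is to compare the spread of the per-fold error rates with their mean. The sketch below assumes you have already collected one error value per fold; the two lists are made-up numbers chosen purely for illustration, not results from any real model, and `summarize_fold_errors` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarize_fold_errors(errors):
    """Return (mean error, standard deviation) across folds."""
    return mean(errors), pstdev(errors)


# Made-up per-fold error rates for two imaginary models (illustration only).
stable_errors = [0.21, 0.19, 0.20, 0.22, 0.18]
erratic_errors = [0.05, 0.40, 0.12, 0.55, 0.08]

stable_mean, stable_spread = summarize_fold_errors(stable_errors)
erratic_mean, erratic_spread = summarize_fold_errors(erratic_errors)

print(f"stable:  mean={stable_mean:.2f} spread={stable_spread:.3f}")
print(f"erratic: mean={erratic_mean:.2f} spread={erratic_spread:.3f}")
```

A large spread relative to the mean, as in the second list, is the kind of signal that would prompt a second look at the model or the features.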

There is no set rule on how many folds you should use, as this question is highly individual with respect to the type of data being used, the number of examples, and so on. In some cases, it makes sense to use extreme cross-validation, where N is equal to the number of data points in the input dataset. In this case, the test set contains only one row. This method is called Leave-One-Out (LOO) validation and is more computationally expensive, since one model is trained per data point.
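To see why Leave-One-Out is costly, here is a minimal sketch in plain Python, assuming a tiny in-memory dataset. The helper `leave_one_out_splits` is hypothetical and not a Spark API; it simply holds out one row at a time.

```python
def leave_one_out_splits(rows):
    """Yield (train, test) pairs where each test set holds exactly one row."""
    rows = list(rows)
    for i in range(len(rows)):
        # Everything except row i is used for training.
        yield rows[:i] + rows[i + 1:], [rows[i]]


data = ["a", "b", "c", "d"]
loo_pairs = list(leave_one_out_splits(data))
for train, test in loo_pairs:
    print(train, test)

# One model per row: the number of training runs grows with the dataset size.
print(len(loo_pairs), "training runs for", len(data), "rows")
```

With N rows this produces N training runs, each on N-1 rows, compared with only five or ten runs for 5-fold or 10-fold cross-validation.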
In general, it is recommended that you perform some cross-validation (often 5-fold or 10-fold) during model construction to validate the quality of a model, especially when the dataset is small.