- Apache Spark Machine Learning Blueprints
- Alex Liu
- 2021-07-16 10:39:51
Summary
Machine learning professionals and data scientists often spend 80% or more of their time on data preparation, which makes it the most important task to perform, even though it can also be the most boring one.
In this chapter, after discussing how to locate datasets and load them into Apache Spark, we covered methods for completing six critical data preparation tasks:
- Treating dirty data with a focus on missing cases
- Resolving entity problems to match datasets
- Reorganizing datasets, with creating subsets and aggregating data as examples
- Joining tables together
- Developing features
- Organizing data preparation workflows and automating them
In covering these, we studied Spark SQL and R as our two primary tools, in combination with some special Spark packages, such as SampleClean, and some R packages, such as reshape. We also explored ways of making data preparation easy and fast.
After this chapter, we should have mastered all the necessary data preparation methods, plus a few advanced ones, and be capable of cleaning datasets such as the four used as examples in this chapter. From now on, we should be able to complete data preparation tasks quickly with a workflow approach and be ready for practical machine learning tasks.
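As a quick recap, the six tasks listed in this summary can be sketched on a toy example. This is only an illustration in plain Python, not the Spark SQL or R code used in the chapter. The records, the field names (`id`, `name`, `spend`, `region`), and the helpers `normalize` and `prepare` are all hypothetical, and mean imputation is just one of several ways to treat missing cases.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; None marks a missing case.
customers = [
    {"id": 1, "name": "Acme Corp.", "spend": 120.0},
    {"id": 2, "name": "acme corp", "spend": None},
    {"id": 3, "name": "Globex", "spend": 80.0},
]
regions = [
    {"name": "acme corp", "region": "East"},
    {"name": "globex", "region": "West"},
]


def normalize(name):
    """Crude entity key: lowercase, keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def prepare(customers, regions):
    """Run the preparation steps in a fixed order (a tiny workflow)."""
    # Treat missing cases: impute the mean of the observed values.
    observed = [r["spend"] for r in customers if r["spend"] is not None]
    fill = mean(observed)
    cleaned = [
        dict(r, spend=fill if r["spend"] is None else r["spend"])
        for r in customers
    ]

    # Resolve entities and join: match both tables on the normalized name.
    region_by_key = {normalize(r["name"]): r["region"] for r in regions}
    joined = [
        dict(r, region=region_by_key.get(normalize(r["name"])))
        for r in cleaned
    ]

    # Reorganize: aggregate total spend per region.
    totals = defaultdict(float)
    for r in joined:
        totals[r["region"]] += r["spend"]

    # Develop a feature: flag rows at or above the imputed mean spend.
    featured = [dict(r, high_spend=r["spend"] >= fill) for r in joined]
    return featured, dict(totals)


rows, totals = prepare(customers, regions)
east_rows = [r for r in rows if r["region"] == "East"]  # a simple subset
```

Wrapping the steps in one function is the simplest form of the workflow idea: the same sequence can be rerun unchanged whenever new data arrives.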