- Apache Spark Machine Learning Blueprints
- Alex Liu
- 2021-07-16 10:39:51
Summary
Machine learning professionals and data scientists often spend 80% or more of their time on data preparation, which makes it the most important task to perform, even though it can also be the most boring one.
In this chapter, after discussing locating datasets and loading them into Apache Spark, we covered the methods of completing the six critical data preparation tasks, which include:
- Treating dirty data with a focus on missing cases
- Resolving entity problems to match datasets
- Reorganizing datasets, with creating subsets and aggregating data as examples
- Joining tables together
- Developing features
- Organizing data preparation workflows and automating them
In covering these, we studied Spark SQL and R as the two primary tools, in combination with some special Spark packages, such as SampleClean, and some R packages, such as reshape. We also explored ways of making data preparation easy and fast.
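As a reminder of what several of these tasks look like in SQL, here is a minimal sketch that fills in a missing value, joins two tables, and aggregates the result. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs anywhere, and the `users` and `orders` tables, their columns, and the mean-imputation rule are all hypothetical rather than taken from the datasets used in this chapter.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Spark SQL session (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical tables: one missing age and one missing order amount.
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, age INTEGER);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 34), (2, NULL), (3, 28);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, NULL);
""")

query = """
    SELECT u.user_id,
           -- Treat missing data: replace a missing age with the rounded mean age.
           COALESCE(u.age,
                    (SELECT CAST(ROUND(AVG(age)) AS INTEGER) FROM users)) AS age,
           -- Aggregate: COUNT and SUM skip NULL amounts.
           COUNT(o.amount) AS n_orders,
           COALESCE(SUM(o.amount), 0.0) AS total_amount
    FROM users AS u
    -- Join tables: keep every user, even one without usable orders.
    LEFT JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.user_id, u.age
    ORDER BY u.user_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same pattern of `COALESCE`, `LEFT JOIN`, and `GROUP BY` is standard SQL, but the exact syntax accepted by Spark SQL may differ in details, so the query should be checked against the Spark version in use.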
After this chapter, we should have mastered all the necessary data preparation methods, plus a few advanced ones, and be capable of cleaning datasets such as the four used as examples in this chapter. From now on, we should be able to complete data preparation tasks quickly with a workflow approach and be ready for practical machine learning tasks.