Summary

Machine learning professionals and data scientists often spend 80% or more of their time on data preparation, which makes it the most important task to get right, even though it can also be the most boring one.

In this chapter, after discussing how to locate datasets and load them into Apache Spark, we covered methods for completing six critical data preparation tasks:

  • Treating dirty data with a focus on missing cases
  • Resolving entity problems to match datasets
  • Reorganizing datasets, with creating subsets and aggregating data as examples
  • Joining tables together
  • Developing features
  • Organizing data preparation workflows and automating them
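As a compact reminder of how several of these tasks chain into one workflow, here is a minimal sketch in plain Python. It is not the Spark SQL or R code used in the chapter: the toy records, column names, and helper functions (`impute_mean`, `aggregate_sum`, `left_join_totals`, `add_feature`, `prepare`) are all hypothetical and chosen only for illustration, and entity resolution is left out.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
users = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 46},
]
orders = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 30.0},
    {"user_id": 2, "amount": 10.0},
]

def impute_mean(rows, column):
    """Treat missing cases: fill None in `column` with the mean of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def aggregate_sum(rows, key, column):
    """Reorganize by aggregating: sum `column` for each value of `key`."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0.0) + r[column]
    return totals

def left_join_totals(rows, totals, key, new_column):
    """Join: keep every row on the left, using 0.0 when there is no match."""
    return [{**r, new_column: totals.get(r[key], 0.0)} for r in rows]

def add_feature(rows):
    """Develop a feature: total spend divided by age."""
    return [{**r, "spend_per_year": r["total_spend"] / r["age"]} for r in rows]

def prepare(users, orders):
    """Run the steps in a fixed order so the whole workflow can be repeated."""
    cleaned = impute_mean(users, "age")
    totals = aggregate_sum(orders, "user_id", "amount")
    joined = left_join_totals(cleaned, totals, "user_id", "total_spend")
    return add_feature(joined)

prepared = prepare(users, orders)
```

Wrapping the steps in a single `prepare` function is one simple way to make a workflow repeatable: rerunning it on refreshed inputs applies the same cleaning, joining, and feature steps in the same order.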

In covering these, we used Spark SQL and R as our two primary tools, in combination with some special Spark packages, such as SampleClean, and some R packages, such as reshape. We also explored ways of making data preparation easy and fast.

After this chapter, we should have a command of the necessary data preparation methods, plus a few advanced ones, and be able to clean datasets such as the four used as examples in this chapter. From now on, we should be able to complete data preparation tasks quickly with a workflow approach and be ready for practical machine learning tasks.
