Summary

Machine learning professionals and data scientists often spend 80% or more of their time on data preparation, which makes it the most important task to get right, even though it can also be the most boring one.

In this chapter, after discussing how to locate datasets and load them into Apache Spark, we covered methods for completing six critical data preparation tasks:

  • Treating dirty data with a focus on missing cases
  • Resolving entity problems to match datasets
  • Reorganizing datasets, with creating subsets and aggregating data as examples
  • Joining tables together
  • Developing features
  • Organizing data preparation workflows and automating them

In covering these, we used Spark SQL and R as our two primary tools, in combination with some special Spark packages, such as SampleClean, and some R packages, such as reshape. We also explored ways of making data preparation easy and fast.

After this chapter, we should have mastered all the necessary data preparation methods, plus a few advanced ones, and be capable of cleaning datasets such as the four used as examples in this chapter. From now on, we should be able to complete data preparation tasks quickly with a workflow approach and be ready for practical machine learning tasks.
