Summary

This chapter looked at the problems that commonly arise in the large, messy datasets found in machine learning projects. These include, but are not limited to, the following:

  • Missing or invalid values
  • Novel levels in a categorical feature that first show up once the algorithm is in production
  • High cardinality in categorical features such as zip code
  • High dimensionality
  • Duplicate observations

This chapter provided a disciplined approach to dealing with these problems by showing how to explore the data, treat it, and create a dataframe that you can use for developing your learning algorithm. The approach is also flexible enough that you can modify the code to suit your circumstances. This methodology should make what many feel is the most arduous, time-consuming, and least enjoyable part of the job far more manageable.
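As a reminder of what the exploration step looks like for several of the problems listed above, here is a minimal sketch. It is not the chapter's own code: it uses plain Python on a small made-up list of records rather than a dataframe, and the data and every function name (`missing_counts`, `duplicate_count`, `cardinality`, `novel_levels`) are hypothetical, chosen only for illustration. High dimensionality is not covered here.

```python
from collections import Counter

# Hypothetical toy data; None marks a missing value.
train_rows = [
    {"zip": "10001", "color": "red", "price": 10.0},
    {"zip": "10002", "color": "blue", "price": None},
    {"zip": "10001", "color": "red", "price": 10.0},  # exact repeat of the first row
    {"zip": "10003", "color": None, "price": 7.5},
]
# Rows arriving after training; "green" was never seen in train_rows.
new_rows = [
    {"zip": "10001", "color": "green", "price": 3.0},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return dict(counts)


def duplicate_count(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    dupes = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # column names are unique, so this sorts by name
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes


def cardinality(rows, col):
    """Number of distinct non-missing levels in a categorical column."""
    return len({row[col] for row in rows if row[col] is not None})


def novel_levels(train, new, col):
    """Levels present in new data that never appeared in the training data."""
    known = {row[col] for row in train if row[col] is not None}
    return {row[col] for row in new if row[col] is not None} - known


print(missing_counts(train_rows))
print(duplicate_count(train_rows))
print(cardinality(train_rows, "zip"))
print(novel_levels(train_rows, new_rows, "color"))
```

Checks like these only flag the problems. How to treat them (imputation, collapsing rare levels, dropping duplicates, and so on) still depends on the dataset and the modeling goal.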

With this task behind us, we can now get started on our first modeling task using linear regression in the following chapter.
