Obtaining a dataset

As you can imagine, one of the most important aspects of the model building process is obtaining a high-quality dataset. In supervised learning, the dataset is used to train the model on what the output should be, so the data must be labeled. In unsupervised learning, labeling is not required. A common misconception when creating a dataset is that bigger is better. In many cases, this is far from the truth. Continuing the preceding example, what if every poll respondent answered every single question the same way? At that point, your dataset is composed of identical data points, and your model will not be able to properly predict any of the other candidates. The model has fit that narrow data too closely to generalize, an outcome called overfitting. A diverse but representative dataset is required for machine learning algorithms to properly build a production-ready model.
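One quick way to spot the problem described in the poll example is to look at how the labels are distributed before training. The following is a minimal Python sketch, not a method taken from this book: the function names, the candidate names, the sample counts, and the 0.95 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def is_degenerate(labels, max_share=0.95):
    """Flag a dataset that is empty or where one label exceeds max_share of the rows.

    The 0.95 default is an arbitrary illustrative threshold, not a standard value.
    """
    if not labels:
        return True
    return max(label_distribution(labels).values()) > max_share


# Hypothetical poll responses: one label per respondent.
uniform_poll = ["Candidate A"] * 1000
varied_poll = ["Candidate A"] * 400 + ["Candidate B"] * 350 + ["Candidate C"] * 250

print(is_degenerate(uniform_poll))  # True: every respondent gave the same answer
print(is_degenerate(varied_poll))   # False: three candidates are represented
print(label_distribution(varied_poll))
```

A check like this only catches the most obvious lack of diversity in the labels. It says nothing about whether the dataset is representative of the population the model will be used on.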

In Chapter 11, Training and Building Production Models, we will take a deep dive into the methodology of obtaining quality datasets, looking at helpful resources, ways to manage your datasets, and ways to transform data, a process commonly referred to as data wrangling.
