
Obtaining a dataset

As you can imagine, one of the most important aspects of the model building process is obtaining a high-quality dataset. In the case of the aforementioned supervised learning, the dataset is used to train the model on what the output should be, so the data must be labeled. In the case of unsupervised learning, no labeling is required. A common misconception when creating a dataset is that bigger is better. In many cases, this is far from the truth. Continuing the preceding example, what if every poll respondent answered every single question the same way? At that point, your dataset is composed of identical data points, and your model will not be able to properly predict any of the other candidates. A model that learns such a narrow dataset too closely and then fails to generalize to new data is said to be overfitting. A diverse but representative dataset is required for machine learning algorithms to properly build a production-ready model.
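To make the point about diversity concrete, the following is a minimal sketch, in Python, of a check you might run before training. It assumes the labels are available as a plain list; the function names, the 0.9 threshold, and the poll data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def is_degenerate(labels, max_share=0.9):
    """Flag a dataset in which one label accounts for more than max_share of rows.

    The 0.9 default is an arbitrary illustrative threshold, not a recommended value.
    """
    if not labels:
        return True
    return max(label_distribution(labels).values()) > max_share


# Hypothetical poll: 1,000 respondents who all picked the same candidate.
uniform_poll = ["Candidate A"] * 1000

# Hypothetical poll: only 100 respondents, but spread across three candidates.
varied_poll = ["Candidate A"] * 40 + ["Candidate B"] * 35 + ["Candidate C"] * 25

print(is_degenerate(uniform_poll))
print(is_degenerate(varied_poll))
print(label_distribution(varied_poll))
```

In this sketch, the larger dataset is flagged because it contains a single label, while the smaller one passes the check. A check like this only catches the most extreme lack of diversity; it says nothing about whether the data is representative.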

In Chapter 11, Training and Building Production Models, we will take a deep dive into the methodology of obtaining quality datasets, looking at helpful resources, ways to manage your datasets, and how to transform data, a process commonly referred to as data wrangling.
