
Obtaining a dataset

As you can imagine, one of the most important aspects of the model building process is obtaining a high-quality dataset. In the aforementioned case of supervised learning, the dataset is used to train the model on what the output should be, so each data point must be labeled with the expected answer. In the case of unsupervised learning, labeling is not required for the dataset. A common misconception when creating a dataset is that bigger is better. In many cases, this is far from the truth. Continuing the preceding example, what if every poll respondent answered every single question the same way? At that point, your dataset is composed of identical data points, and your model will not be able to properly predict any of the other candidates. This failure to generalize beyond the training data is commonly called overfitting. A diverse but representative dataset is required for machine learning algorithms to properly build a production-ready model.
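To make the point about diversity concrete, the following minimal sketch checks how the labels in a dataset are distributed before any training happens. The function names, the candidate labels, and the 0.95 threshold are all assumptions chosen for illustration; they do not come from the poll example itself or from any particular library.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_degenerate(labels, max_share=0.95):
    """Flag a dataset in which a single label exceeds max_share of the rows.

    max_share is a hypothetical threshold; pick one that suits your problem.
    """
    if not labels:
        return True  # an empty dataset cannot train anything
    shares = label_distribution(labels)
    return max(shares.values()) > max_share


# Hypothetical poll results: every respondent gave the same answer.
uniform_poll = ["Candidate A"] * 1000

# Hypothetical poll results: smaller, but several candidates are represented.
mixed_poll = ["Candidate A"] * 40 + ["Candidate B"] * 35 + ["Candidate C"] * 25

print(is_degenerate(uniform_poll))  # the larger dataset is flagged
print(is_degenerate(mixed_poll))    # the smaller, more diverse one is not
```

In this sketch, the 1,000-row dataset is flagged even though it is ten times larger than the 100-row one, which illustrates why size alone is a poor measure of dataset quality.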

In Chapter 11, Training and Building Production Models, we will take a deep dive into the methodology of obtaining quality datasets, looking at helpful resources, ways to manage your datasets, and ways to transform data, commonly referred to as data wrangling.
