
Acquiring and exploring data

We argued earlier that it is critical to understand the input dataset before specifying project objectives, particularly objectives related to accuracy. As a general rule, ML algorithms produce their best results when large training datasets are available: the more data used to train them, the better they tend to perform.

Acquiring data is, therefore, a key step in the ML development life cycle, and one that can be very time-consuming and fraught with difficulty. In certain industries, privacy legislation may limit the availability of personal data, making it difficult to create personalized products or requiring source data to be anonymized before it can be used. Other datasets may be available but require such extensive preparation, or even manual labeling, that they put the project timeline or budget under stress.

Even if you do not have a proprietary dataset to apply to your problem, you may be able to find public datasets to use. Public datasets have often received attention from researchers, so you may find that the particular problem you are attempting to tackle has already been solved and that the solution is open source. Some good sources of public datasets are as follows:

Once the dataset has been acquired, it should be explored to gain a basic understanding of how the different features (independent variables) may affect the desired output. For example, when attempting to predict correct height and weight from self-reported figures, researchers determined from initial exploration that older subjects were more likely to under-report obesity, and that age was therefore a relevant feature when building their model. Attempting to build a model from all available data, including features that may not be relevant, leads to longer training times in the best case and, in the worst case, severely hampers accuracy by introducing noise.
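One simple way to get a first impression of whether a feature is related to the output is to compute a correlation coefficient between the two columns. The following is a minimal sketch that uses the Pearson coefficient and only the Go standard library. The function name `pearson`, the variable names, and the sample values are all made up for illustration; they are not taken from the study mentioned above, and Pearson correlation is only one of several exploration techniques you might choose.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient between xs and ys.
// It is a hypothetical helper written for this example, not a library API.
func pearson(xs, ys []float64) (float64, error) {
	if len(xs) != len(ys) || len(xs) < 2 {
		return 0, errors.New("pearson: need two equal-length slices with at least 2 values")
	}

	// First pass: compute the mean of each column.
	n := float64(len(xs))
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/n, sumY/n

	// Second pass: accumulate the co-deviation and the squared deviations.
	var cov, varX, varY float64
	for i := range xs {
		dx, dy := xs[i]-meanX, ys[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return 0, errors.New("pearson: correlation is undefined for a constant column")
	}
	return cov / math.Sqrt(varX*varY), nil
}

func main() {
	// Made-up illustrative values: subject age and the amount (in kg)
	// by which the self-reported weight fell short of the measured weight.
	age := []float64{25, 32, 41, 50, 58, 67}
	underReportKg := []float64{0.5, 0.8, 1.1, 1.9, 2.4, 3.0}

	r, err := pearson(age, underReportKg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("correlation(age, under-report) = %.2f\n", r)
}
```

A coefficient close to 1 or -1 suggests a strong linear relationship and marks the feature as a candidate for the model, while a value near 0 suggests that the feature may contribute mostly noise. It only detects linear relationships, so a low value does not by itself prove that a feature is irrelevant.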

It is worth spending a bit more time processing and transforming a dataset, as this will improve the accuracy of the end result and may even reduce the training time. All the code examples in this book include data processing and transformation.
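As a small illustration of what a transformation step can look like, the sketch below rescales a numeric column into the range [0, 1], a technique often called min-max scaling. The text above does not name a specific transformation, so this choice is an assumption made for the example, as are the function name `minMaxScale` and the sample values.

```go
package main

import "fmt"

// minMaxScale rescales values linearly into [0, 1] and returns a new slice,
// leaving the input untouched. It is a hypothetical helper for this example.
// A constant column is mapped to all zeros, since it has no spread to scale.
func minMaxScale(values []float64) []float64 {
	scaled := make([]float64, len(values))
	if len(values) == 0 {
		return scaled
	}

	// Find the smallest and largest value in the column.
	lo, hi := values[0], values[0]
	for _, v := range values[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return scaled // avoid dividing by zero
	}

	for i, v := range values {
		scaled[i] = (v - lo) / (hi - lo)
	}
	return scaled
}

func main() {
	// Made-up heights in centimeters.
	heightsCm := []float64{150, 165, 180, 195}
	fmt.Println(minMaxScale(heightsCm))
}
```

Putting features on a comparable scale in this way can help algorithms that are sensitive to the magnitude of their inputs. Whether it is the right transformation depends on the data and the model being trained.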

In Chapter 2, Setting Up the ML Environment, we will see how to explore data using Go and an interactive browser-based tool called Jupyter.
