
Acquiring and exploring data

We argued earlier that it is critical to understand the input dataset before specifying project objectives, particularly objectives related to accuracy. As a general rule, ML algorithms produce their best results when large training datasets are available: the more data used to train them, the better they tend to perform.

Acquiring data is, therefore, a key step in the ML development life cycle, and one that can be very time-consuming and fraught with difficulty. In certain industries, privacy legislation may restrict the availability of personal data, making it difficult to create personalized products or requiring source data to be anonymized before it can be used. Other datasets may be available but require such extensive preparation, or even manual labeling, that they put the project timeline or budget under stress.

Even if you do not have a proprietary dataset to apply to your problem, you may be able to find public datasets to use. Often, public datasets will have received attention from researchers, so you may find that the particular problem you are attempting to tackle has already been solved and the solution is open source. Some good sources of public datasets are as follows:

Once the dataset has been acquired, it should be explored to gain a basic understanding of how the different features (independent variables) may affect the desired output. For example, when attempting to predict correct height and weight from self-reported figures, researchers determined from initial exploration that older subjects were more likely to under-report obesity, and therefore that age was a relevant feature when building their model. Attempting to build a model from all available data, including features that may not be relevant, leads to longer training times in the best case and can severely hamper accuracy in the worst case by introducing noise.
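One simple way to start this kind of exploration is to measure how strongly a candidate feature moves together with the output. The following is a minimal sketch, assuming the feature and the output are already loaded as numeric slices. It uses the Pearson correlation coefficient, which is our choice for illustration and not necessarily what the researchers used. The `pearson` function, the variable names, and the sample values are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient between x and y.
// A value near +1 or -1 suggests a strong linear relationship; a value
// near 0 suggests little linear relationship.
func pearson(x, y []float64) (float64, error) {
	if len(x) != len(y) || len(x) < 2 {
		return 0, errors.New("pearson: need two equal-length slices with at least 2 values")
	}
	n := float64(len(x))

	// First pass: compute the mean of each column.
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	// Second pass: accumulate the co-deviation and squared deviations.
	var cov, varX, varY float64
	for i := range x {
		dx, dy := x[i]-meanX, y[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return 0, errors.New("pearson: a constant column has no defined correlation")
	}
	return cov / math.Sqrt(varX*varY), nil
}

func main() {
	// Made-up illustrative values, not data from any real study.
	age := []float64{25, 32, 41, 48, 56, 63}
	underReportKg := []float64{0.5, 0.9, 1.4, 1.6, 2.3, 2.8}

	r, err := pearson(age, underReportKg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("correlation(age, under-report) = %.3f\n", r)
}
```

A coefficient close to zero does not prove that a feature is irrelevant, because the relationship may be non-linear, so a check like this is best treated as a first filter rather than a final decision.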

It is worth spending a bit more time processing and transforming a dataset, as this can improve the accuracy of the end result and may even reduce training time. All the code examples in this book include data processing and transformation.
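As one small illustration of what a transformation can look like, the sketch below rescales a numeric feature column to the range [0, 1] (min-max scaling) so that features measured in different units become comparable. This is only one of many possible transformations, chosen here as an assumption for the example. The `minMaxScale` function and the sample values are hypothetical.

```go
package main

import "fmt"

// minMaxScale returns a copy of col with every value rescaled to [0, 1].
// A constant (or empty) column has no spread to rescale, so it is mapped
// to all zeros.
func minMaxScale(col []float64) []float64 {
	out := make([]float64, len(col))
	if len(col) == 0 {
		return out
	}

	// Find the smallest and largest values in the column.
	lo, hi := col[0], col[0]
	for _, v := range col[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if hi == lo {
		return out
	}

	// Shift by the minimum and divide by the range.
	for i, v := range col {
		out[i] = (v - lo) / (hi - lo)
	}
	return out
}

func main() {
	// Made-up illustrative heights in centimeters.
	heightsCm := []float64{150, 165, 180, 195}
	fmt.Println(minMaxScale(heightsCm))
}
```

The function returns a new slice instead of modifying its input, which keeps the raw column available for further exploration.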

In Chapter 2, Setting Up the ML Environment, we will see how to explore data using Go and an interactive browser-based tool called Jupyter.
