
Acquiring and exploring data

We argued earlier that it is critical to understand the input dataset before specifying project objectives, particularly objectives related to accuracy. As a general rule, ML algorithms produce their best results when large training datasets are available: the more data used to train them, the better they tend to perform.

Acquiring data is, therefore, a key step in the ML development life cycle, and one that can be very time-consuming and fraught with difficulty. In certain industries, privacy legislation may restrict the availability of personal data, making it difficult to create personalized products or requiring source data to be anonymized before it can be used. Some datasets may be available but require such extensive preparation, or even manual labeling, that they put the project timeline or budget under stress.

Even if you do not have a proprietary dataset to apply to your problem, you may be able to find public datasets to use. Often, public datasets will have received attention from researchers, so you may find that the particular problem you are attempting to tackle has already been solved and that the solution is open source. Some good sources of public datasets are as follows:

Once the dataset has been acquired, it should be explored to gain a basic understanding of how the different features (independent variables) may affect the desired output. For example, when attempting to predict correct height and weight from self-reported figures, researchers determined from initial exploration that older subjects were more likely to under-report obesity, and therefore that age was a relevant feature when building their model. Attempting to build a model from all available data, including features that may not be relevant, leads to longer training times in the best case and, in the worst case, severely hampers accuracy by introducing noise.

It is worth spending extra time processing and transforming a dataset, as this will improve the accuracy of the end result and may even reduce the training time. All the code examples in this book include data processing and transformation.

In Chapter 2, Setting Up the ML Environment, we will see how to explore data using Go and an interactive browser-based tool called Jupyter.
