- Machine Learning with Go Quick Start Guide
- Michael Bironneau Toby Coleman
- 2021-06-24 13:34:00
Acquiring and exploring data
We argued earlier that it is critical to understand the input dataset before specifying project objectives, particularly objectives related to accuracy. As a general rule, ML algorithms produce their best results when large training datasets are available: the more data used to train them, the better they tend to perform.
Acquiring data is therefore a key step in the ML development life cycle, and one that can be very time-consuming and fraught with difficulty. In certain industries, privacy legislation may restrict the availability of personal data, making it difficult to create personalized products or requiring source data to be anonymized before it can be used. Some datasets may be available but require such extensive preparation, or even manual labeling, that they put the project timeline or budget under stress.
Even if you do not have a proprietary dataset to apply to your problem, you may be able to find public datasets to use. Public datasets have often received attention from researchers, so you may find that the particular problem you are attempting to tackle has already been solved and that the solution is open source. Some good sources of public datasets are as follows:
- Awesome datasets: https://github.com/awesomedata/awesome-public-datasets
- Skymind open datasets: https://skymind.ai/wiki/open-datasets
- OpenML: https://www.openml.org/
- Kaggle: https://www.kaggle.com/datasets
- UK Government open data: https://data.gov.uk/
- US Government open data: https://www.data.gov/
Once the dataset has been acquired, it should be explored to gain a basic understanding of how the different features (independent variables) may affect the desired output. For example, when attempting to predict correct height and weight from self-reported figures, researchers determined from initial exploration that older subjects were more likely to under-report obesity, and therefore that age was a relevant feature when building their model. Attempting to build a model from all available data, including features that may not be relevant, leads to longer training times in the best case and, in the worst case, can severely hamper accuracy by introducing noise.
In Chapter 2, Setting Up the ML Environment, we will see how to explore data using Go and an interactive browser-based tool called Jupyter.