- Mastering Machine Learning with R
- Cory Lesmeister
- 2021-07-09 21:28:18
Data understanding
After enduring the all-important pain of the first step, you can now get your hands on the data. The tasks in this process consist of the following:
- Collect the data
- Describe the data
- Explore the data
- Verify the data quality
This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data is missing. This may drive the learning method that you use and/or determine whether imputation of the missing data is necessary and feasible.
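As a rough illustration of the describe-and-explore tasks, the following base R sketch measures how much of each variable is missing and how many zeros each numeric variable holds. The data frame `df` and its columns are hypothetical stand-ins, and the share of zeros is only one possible proxy for sparsity.

```r
# Hypothetical toy data standing in for a freshly collected data set
df <- data.frame(
  customer_id = 1:6,
  age         = c(34, NA, 45, 29, NA, 51),
  purchases   = c(0, 0, 3, 0, 0, 1),
  region      = c("east", "west", NA, "east", "west", "east"),
  stringsAsFactors = FALSE
)

# Structure and basic summaries: a first pass at describing the data
str(df)
summary(df)

# Share of missing values per variable
missing_share <- colMeans(is.na(df))

# Share of zeros per numeric variable, as a rough sparsity indicator
is_num <- vapply(df, is.numeric, logical(1))
zero_share <- vapply(df[is_num],
                     function(x) mean(x == 0, na.rm = TRUE),
                     numeric(1))

print(round(missing_share, 2))
print(round(zero_share, 2))
```

Variables with a high missing share are candidates for imputation or removal, and a high zero share may point toward methods that tolerate sparse inputs. Any cut-off values are a judgment call for the project at hand.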
Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This matters most with time series, where the business rules for classifying the data often change over time. Finally, it is a good idea to begin documenting your code at this step. As part of the documentation process, if a data dictionary is not available, save yourself the heartache later on and make one.
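If no data dictionary exists, a skeleton can be generated from the data itself and then completed by hand. The sketch below uses base R only. The data frame `df`, its columns, and the output file name are assumptions made for illustration.

```r
# Hypothetical toy data; replace with the data set under review
df <- data.frame(
  order_date = as.Date(c("2020-01-05", "2020-02-11", NA)),
  amount     = c(19.99, 5.00, 42.50),
  channel    = c("web", "store", "web"),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage class, missing count, distinct values,
# plus an empty description column to be filled in manually
data_dictionary <- data.frame(
  variable    = names(df),
  class       = vapply(df, function(x) class(x)[1], character(1)),
  n_missing   = vapply(df, function(x) sum(is.na(x)), integer(1)),
  n_distinct  = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
  description = "",
  stringsAsFactors = FALSE,
  row.names   = NULL
)

print(data_dictionary)

# Optionally save it next to the code (the file name is an assumption)
# write.csv(data_dictionary, "data_dictionary.csv", row.names = FALSE)
```

The `description` column is where the who, how, and why of collection can be recorded, along with any known changes in business rules.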