Data understanding

After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:

  1. Collecting the data.
  2. Describing the data.
  3. Exploring the data.
  4. Verifying the data quality.

This step is the classic case of Extract, Transform, Load (ETL), and there are some considerations to keep in mind. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data may be missing. What you find may drive the learning method that you use and determine whether imputing the missing data is necessary and feasible.
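As a rough illustration of checking how much data is missing, the following minimal sketch computes the share of missing values per variable. It assumes the data has already been loaded as a list of records, and that `None` or an empty string marks a missing value. The function name `missing_fraction` and the sample records are hypothetical, invented only for this example.

```python
from typing import Any, Dict, List


def missing_fraction(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the share of missing values (None or empty string) per column."""
    # Collect every column name that appears in at least one record.
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    result: Dict[str, float] = {}
    for col in columns:
        # A column absent from a record counts as missing for that record.
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        result[col] = missing / total if total else 0.0
    return result


# Hypothetical sample records, for illustration only.
rows = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": 48000, "region": ""},
    {"age": 29, "income": None, "region": "south"},
    {"age": 41, "income": None, "region": "east"},
]

report = missing_fraction(rows)
for column, share in report.items():
    print(f"{column}: {share:.0%} missing")
```

A variable with a high share of missing values is a candidate either for imputation or for exclusion; where to draw that line depends on the problem and is not something this sketch decides.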

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This matters especially in time series, where the business rules on how the data is classified often change over time. Finally, it is a good idea to begin documenting any code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.
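If no data dictionary exists, one way to start is to generate a skeleton from the data itself and fill in the descriptions by hand. The sketch below is one possible approach, not a prescribed format: the function name `draft_data_dictionary`, the entry fields, and the sample records are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def draft_data_dictionary(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build a skeleton data dictionary: one entry per column."""
    columns = sorted({key for row in rows for key in row})
    entries = []
    for col in columns:
        # Keep only values that are present (not None, not empty string).
        present = [row[col] for row in rows if row.get(col) not in (None, "")]
        entries.append({
            "name": col,
            "types": sorted({type(value).__name__ for value in present}),
            "n_missing": len(rows) - len(present),
            "example": present[0] if present else None,
            # Left blank on purpose: filled in by someone who knows
            # who collects the field, how, and why.
            "description": "",
        })
    return entries


# Hypothetical sample records, for illustration only.
records = [
    {"customer_id": 1, "segment": "retail", "balance": 120.5},
    {"customer_id": 2, "segment": "", "balance": None},
    {"customer_id": 3, "segment": "corporate", "balance": 87.0},
]

dictionary = draft_data_dictionary(records)
for entry in dictionary:
    print(entry)
```

The generated entries only capture what can be observed mechanically. The description field, and any notes on when business rules changed, still have to come from the people who own the data.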
