
Data understanding

After enduring the all-important pain of the first step, you can now get your hands on the data. The tasks in this process consist of the following:

  1. Collect the data
  2. Describe the data
  3. Explore the data
  4. Verify the data quality

This step is the classic case of ETL: Extract, Transform, Load. There are some considerations here. First, make an initial determination of whether the available data is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data is missing. This may drive the learning method that you use and determine whether imputation of the missing data is necessary and feasible.
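As a rough illustration of checking how much data is missing, the following sketch computes the share of missing values per variable. It is a minimal example using only the Python standard library; the `records` list, its column names, and the `missing_fraction` helper are hypothetical and stand in for whatever data you have collected.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": 61000, "region": "south"},
    {"age": 45, "income": None, "region": None},
    {"age": 29, "income": None, "region": "east"},
]


def missing_fraction(rows):
    """Return the share of missing (None or absent) values for each column."""
    total = len(rows)
    if total == 0:
        return {}
    # Collect every column name that appears in any row.
    columns = sorted({key for row in rows for key in row})
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


fractions = missing_fraction(records)
for col, frac in fractions.items():
    print(f"{col}: {frac:.0%} missing")
```

A variable with a high missing share, such as `income` in this toy data, is a candidate either for imputation or for exclusion, depending on the learning method you plan to use.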

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially important with time series, where the business rules for classifying the data often change over time. Finally, it is a good idea to begin documenting any code at this step. As part of the documentation process, if a data dictionary is not available, save yourself the heartache later on and make one.
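One possible way to start a data dictionary is to generate a skeleton from the data itself and then fill in the descriptions by hand. The sketch below assumes the data is a list of dictionaries. The `build_data_dictionary` helper, the sample rows, and the field names are all hypothetical and only show the idea.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"customer_id": 1, "segment": "retail", "balance": 120.5},
    {"customer_id": 2, "segment": "retail", "balance": None},
    {"customer_id": 3, "segment": "corporate", "balance": 9800.0},
]

# Descriptions written by hand; anything not listed is flagged for follow-up.
descriptions = {
    "customer_id": "Unique customer identifier",
    "segment": "Business segment assigned by the sales team",
}


def build_data_dictionary(rows, descriptions):
    """Summarize each variable: observed types, missing count, distinct count."""
    columns = sorted({key for row in rows for key in row})
    dictionary = []
    for col in columns:
        # Keep only the values that are actually present.
        values = [row[col] for row in rows if row.get(col) is not None]
        dictionary.append({
            "variable": col,
            "types": sorted({type(v).__name__ for v in values}),
            "n_missing": len(rows) - len(values),
            "n_distinct": len(set(values)),
            "description": descriptions.get(col, "TODO: document"),
        })
    return dictionary


data_dictionary = build_data_dictionary(rows, descriptions)
for entry in data_dictionary:
    print(entry)
```

The "TODO: document" placeholder makes undocumented variables easy to spot, which is a useful prompt to ask the data owners how and why each field is collected.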
