官术网_书友最值得收藏!

Common data issues

We can categorize data difficulties into several groups. The most generally accepted groupings (of data issues) include:

  • Accuracy: There are many varieties of data inaccuracies and the most common examples include poor math, out-of-range values, invalid values, duplication, and more.
  • Completeness: Data sources may be missing values from particular columns, missing entire columns, or even missing complete transactions.
  • Update status: As part of your quality assurance, you need to establish the cadence of data refresh or update, as well as have the ability to determine when the data was last saved or updated. This is also referred to as latency.
  • Relevance: It is identification and elimination of information that you don't need or care about, given your objectives. An example would be removing sales transactions for pickles if you are intending on studying personal grooming products.
  • Consistency: It is common to have to cross-reference or translate information from data sources. For example, recorded responses to a patient survey may require translation to a single consistent indicator to later make processing or visualizing easier.
  • Reliability: This is chiefly concerned with making sure that the method of data gathering leads to consistent results. A common data assurance process involves establishing baselines and ranges, and then routinely verifying that the data results fall within the established expectations. For example, districts that typically have a mix of both registered Democrat and Republican voters would warrant the investigation if the data suddenly was 100 percent single partied.
  • Appropriateness: Data is considered appropriate if it is suitable for the intended purpose; this can be subjective. For example, it is considered a fact that holiday traffic affects purchasing habits (an increase in US Flags Memorial day week does not indicate an average or expected weekly behavior).
  • Accessibility: Data of interest may be watered down in a sea of data you are not interested in, thereby reducing the quality of the interesting data since it would be mostly inaccessible. This is particularly common in big data projects. Additionally, security may play a role in the quality of your data. For example, particular computers might be excluded from captured logging files or certain health-related information may be hidden and not part of shared patient data.
主站蜘蛛池模板: 河北省| 西充县| 偃师市| 慈溪市| 通许县| 青冈县| 凉城县| 无极县| 安宁市| 保康县| 阜城县| 宁陕县| 泗洪县| 迁西县| 奉节县| 通榆县| 理塘县| 韶山市| 兴城市| 新津县| 裕民县| 大丰市| 商洛市| 五峰| 上犹县| 天全县| 南京市| 河池市| 乌鲁木齐县| 益阳市| 兴安盟| 潜山县| 新乡县| 天全县| 乐山市| 龙川县| 陆丰市| 定结县| 华坪县| 贡觉县| 彭州市|