Common data issues

We can categorize data issues into several groups. The most generally accepted groupings include:

  • Accuracy: Data inaccuracies come in many varieties; the most common examples include calculation errors, out-of-range values, invalid values, and duplication.
  • Completeness: Data sources may be missing values from particular columns, missing entire columns, or even missing complete transactions.
  • Update status: As part of your quality assurance, you need to establish the cadence of data refresh or update, and be able to determine when the data was last saved or updated. This is also referred to as latency.
  • Relevance: This is the identification and elimination of information that you don't need or care about, given your objectives. An example would be removing sales transactions for pickles if you intend to study personal grooming products.
  • Consistency: It is common to have to cross-reference or translate information from data sources. For example, recorded responses to a patient survey may require translation to a single consistent indicator to make later processing or visualization easier.
  • Reliability: This is chiefly concerned with making sure that the method of data gathering leads to consistent results. A common data assurance process involves establishing baselines and ranges, and then routinely verifying that the data results fall within the established expectations. For example, a district that typically has a mix of registered Democrat and Republican voters would warrant investigation if the data suddenly showed 100 percent of voters registered to a single party.
  • Appropriateness: Data is considered appropriate if it is suitable for the intended purpose; this can be subjective. For example, it is generally accepted that holiday traffic affects purchasing habits (an increase in sales of US flags during Memorial Day week does not indicate average or expected weekly behavior).
  • Accessibility: Data of interest may be buried in a sea of data you are not interested in, which reduces the quality of the interesting data because it becomes mostly inaccessible. This is particularly common in big data projects. Additionally, security may play a role in the quality of your data. For example, particular computers might be excluded from captured log files, or certain health-related information may be hidden and not part of shared patient data.
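
Several of the categories above (accuracy, completeness, update status, and reliability) lend themselves to simple automated checks. The following is a minimal sketch using only the Python standard library, not a definitive implementation. The sample records, the field names (id, amount, region, updated), the function names, and all thresholds are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample records; the field names are illustrative assumptions.
records = [
    {"id": 1, "amount": 25.0, "region": "East", "updated": datetime(2024, 1, 10)},
    {"id": 2, "amount": -5.0, "region": "West", "updated": datetime(2024, 1, 10)},
    {"id": 2, "amount": -5.0, "region": "West", "updated": datetime(2024, 1, 10)},
    {"id": 3, "amount": None, "region": "East", "updated": datetime(2023, 11, 1)},
]

def out_of_range(rows, field, low, high):
    """Accuracy: rows whose value is present but falls outside [low, high]."""
    return [r for r in rows
            if r[field] is not None and not (low <= r[field] <= high)]

def duplicate_keys(rows, key="id"):
    """Accuracy: key values that appear more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def missing_values(rows, field):
    """Completeness: rows where the field is absent or None."""
    return [r for r in rows if r.get(field) is None]

def stale(rows, as_of, max_age):
    """Update status: rows last updated more than max_age before as_of."""
    return [r for r in rows if as_of - r["updated"] > max_age]

def outside_baseline(value, low, high):
    """Reliability: True when a metric leaves its established range."""
    return not (low <= value <= high)

# Example usage with assumed thresholds.
bad_amounts = out_of_range(records, "amount", 0, 1000)
repeated_ids = duplicate_keys(records)
incomplete = missing_values(records, "amount")
old_rows = stale(records, datetime(2024, 1, 15), timedelta(days=30))
# A district whose single-party share jumps to 100 percent, against an
# assumed historical baseline of 30 to 70 percent, should be investigated.
needs_review = outside_baseline(1.0, 0.3, 0.7)
```

Relevance, appropriateness, and accessibility depend on the objectives of the analysis, so they are harder to reduce to generic checks like these and usually call for domain-specific rules.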