- Statistics for Data Science
- James D. Miller
- 2021-07-02 14:58:57
Common data issues
Data issues can be grouped into several categories. The most generally accepted groupings are:
- Accuracy: Data can be inaccurate in many ways. The most common examples are calculation errors, out-of-range values, invalid values, and duplication.
- Completeness: Data sources may be missing values from particular columns, missing entire columns, or even missing complete transactions.
- Update status: As part of your quality assurance, you need to establish the cadence of data refresh or update, and be able to determine when the data was last saved or updated. This is also referred to as latency.
- Relevance: This is the identification and elimination of information that you don't need or care about, given your objectives. An example would be removing sales transactions for pickles if you intend to study personal grooming products.
- Consistency: It is common to have to cross-reference or translate information from different data sources. For example, recorded responses to a patient survey may require translation to a single consistent indicator to make later processing or visualization easier.
- Reliability: This is chiefly concerned with making sure that the method of data gathering leads to consistent results. A common data assurance process involves establishing baselines and ranges, and then routinely verifying that the data falls within the established expectations. For example, a district that typically has a mix of registered Democratic and Republican voters would warrant investigation if its data suddenly showed 100 percent registration for a single party.
- Appropriateness: Data is considered appropriate if it is suitable for the intended purpose; this can be subjective. For example, holiday traffic is known to affect purchasing habits: an increase in US flag sales during Memorial Day week does not indicate average or expected weekly behavior.
- Accessibility: Data of interest may be buried in a sea of data you are not interested in, which lowers its effective quality because it is hard to reach. This is particularly common in big data projects. Security can also affect the quality of your data. For example, particular computers might be excluded from captured log files, or certain health-related information might be hidden and left out of shared patient data.
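
Several of the categories above lend themselves to simple automated checks. The following is a minimal sketch in plain Python, not a method taken from the book: the sample records, the field names (`id`, `age`, `response`), the response mapping, and the baseline range are all hypothetical and chosen only to illustrate accuracy, completeness, consistency, and reliability checks.

```python
from collections import Counter

# Hypothetical sample data: field names and values are made up for illustration.
records = [
    {"id": 1, "age": 34, "response": "Yes"},
    {"id": 2, "age": 210, "response": "y"},     # out-of-range age
    {"id": 3, "age": None, "response": "N"},    # missing age
    {"id": 3, "age": None, "response": "N"},    # duplicate of id 3
    {"id": 4, "age": 51, "response": "maybe"},  # unrecognized response
]


def out_of_range(rows, field, low, high):
    """Accuracy: return rows whose value is present but outside [low, high]."""
    return [
        r for r in rows
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]


def duplicate_keys(rows, key):
    """Accuracy: return key values that occur more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def missing_counts(rows, fields):
    """Completeness: count rows where each field is absent or None."""
    return {f: sum(1 for r in rows if r.get(f) is None) for f in fields}


# Consistency: translate free-form answers to one indicator (assumed mapping).
RESPONSE_MAP = {"yes": True, "y": True, "no": False, "n": False}


def normalize_response(value):
    """Return True/False for known answers, None for anything unrecognized."""
    return RESPONSE_MAP.get(str(value).strip().lower())


def within_baseline(value, low, high):
    """Reliability: check a summary value against an established range."""
    return low <= value <= high


print(out_of_range(records, "age", 0, 120))
print(duplicate_keys(records, "id"))
print(missing_counts(records, ["age", "response"]))
print([normalize_response(r["response"]) for r in records])
# A district reporting a 100 percent share for one party falls outside an
# assumed historical range of 35 to 65 percent and would be flagged.
print(within_baseline(1.00, 0.35, 0.65))
```

Each check only flags suspect records; deciding whether a flagged value is a genuine error or a legitimate outlier still requires knowledge of the data and the objectives of the analysis.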