- Mastering Predictive Analytics with R (Second Edition)
- James D. Miller, Rui Miguel Forte
Categorizing data quality
It is perhaps an accepted notion that issues with data quality may be categorized into one of the following areas:
- Accuracy
- Completeness
- Update status
- Relevance
- Consistency (across sources)
- Reliability
- Appropriateness
- Accessibility
The quality or level of quality of your data can be affected by the way it is entered, stored, and managed. The process of addressing data quality (referred to most often as data quality assurance, or DQA) requires routine and regular review and evaluation of the data, along with ongoing processes termed profiling and scrubbing (vital even if the data is stored in multiple disparate systems, which makes these processes more difficult).
Here, tidying the data will be much more project centric, in that we're probably not concerned with creating a formal DQA process, but only with making certain that the data is correct for our particular predictive project.
In statistics, data unobserved or not yet reviewed by the data scientist is considered raw and cannot be reliably used in predictive projects. The process of tidying the data will usually involve several steps. Taking the extra time to break out the work is strongly recommended (rather than haphazardly addressing multiple data issues together).
The first step
The first step requires bringing the data to what may be called mechanical correctness. In this first step, you focus on things such as:
- File format and organization: Field order, column headers, number of records, and so on
- Record data typing (such as numeric values stored as strings)
- Date and time processing (typically reformatting values into standard formats or consistent formats)
- Incorrect content: Wrong category labels, unknown or unexpected character encoding, and so on
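A minimal sketch of these mechanical fixes might look as follows; the data frame `raw` and its column names are invented for illustration:

```r
# Hypothetical raw data: numeric values and dates stored as strings
raw <- data.frame(id       = c("1", "2", "3"),
                  amount   = c("10.5", "20.1", "7.9"),
                  saledate = c("15/01/2020", "03/02/2020", "21/03/2020"),
                  stringsAsFactors = FALSE)

str(raw)  # inspect organization: column headers, record count, data types

raw$amount   <- as.numeric(raw$amount)                      # string -> numeric
raw$saledate <- as.Date(raw$saledate, format = "%d/%m/%Y")  # standardize dates
```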
The next step
The second step is to address the statistical soundness of the data. Here we correct issues that may be mechanically correct but will most likely (depending upon the subject matter) impact a statistical outcome.
These issues may include:
- Positive/negative mismatch: Age variables may be reported as negative
- Invalid (based on accepted logic) data: An under-aged person may be registered to possess a driver's license
- Missing data: Key data values may just be missing from the data source
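As a sketch, each of these statistical-soundness issues can be flagged with simple logical tests; the `participants` data frame and its values here are invented for illustration:

```r
# Hypothetical participant data containing statistically unsound records
participants <- data.frame(age     = c(34, -27, 10, NA),
                           license = c(TRUE, TRUE, TRUE, FALSE))

# Positive/negative mismatch: ages reported as negative
which(participants$age < 0)

# Invalid by accepted logic: license holders below a plausible driving age
which(participants$license & participants$age >= 0 & participants$age < 16)

# Missing data: key values absent from the source
which(is.na(participants$age))
```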
The final step
Finally, the last step (before actually attempting to use the data) may be the re-formatting step. In this step, the data scientist will determine the form that the data must be in in order to most efficiently process it, based upon the intended use or objective.
For example, one might decide to:
- Reorder or repeat columns; that is to say, some final processing may require redundant or repeated data be generated within a file source to be correctly or more easily processed
- Drop columns and/or records (based upon specific criteria)
- Set decimal places
- Pivot data
- Truncate or rename values
- And so on
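A few of these re-formatting decisions can be sketched with base R; the `sales` data frame and its column names are made up for illustration:

```r
# Hypothetical sales data to be re-formatted before modeling
sales <- data.frame(region  = c("East", "East", "West", "West"),
                    quarter = c("Q1", "Q2", "Q1", "Q2"),
                    revenue = c(100.456, 200.789, 150.123, 175.321))

sales$revenue <- round(sales$revenue, 2)             # set decimal places
sales <- sales[, c("revenue", "region", "quarter")]  # reorder columns
sales <- sales[sales$revenue > 120, ]                # drop records by criteria

# Pivot: one row per region, one revenue column per quarter
wide <- reshape(sales, idvar = "region", timevar = "quarter",
                direction = "wide")
```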
There are a variety of somewhat routine methods for using R to resolve the aforementioned data errors.
For example:
- Changing a data type: Also referred to as "data type conversion," one can utilize the R `is.*` functions (such as `is.numeric`) to test for an object's data type and the `as.*` functions (such as `as.numeric`) for an explicit conversion.
- Date and time: There are multiple ways to manage date information with R. In fact, we can extend the preceding example and mention the `as.Date` function. Typically, date values are important to a statistical model, and therefore it is important to take the time to understand the format of a model's date fields and ensure that they are properly dealt with. Mostly, dates and times will appear in raw data as strings, which can be converted and formatted as required; for example, string fields containing a `saledate` and a `returndate` can be converted to date type values and used with a common time function, `difftime`.
- Category labels: Labels are critical to statistical modeling as well as data visualization. An example of using labels with a sample of categorized data might be assigning a label to a participant in a study, perhaps by level of education: 1 = Doctoral, 2 = Masters, 3 = Bachelors, 4 = Associates, 5 = Nondegree, 6 = Some College, 7 = High School, or 8 = None:
```r
> participant <- c(1, 2, 3, 4, 5, 6, 7, 8)
> recode <- c(Doctoral=1, Masters=2, Bachelors=3, Associates=4,
+             Nondegree=5, SomeCollege=6, HighSchool=7, None=8)
> (participant <- factor(participant, levels=recode, labels=names(recode)))
[1] Doctoral    Masters     Bachelors   Associates  Nondegree   SomeCollege HighSchool  None
Levels: Doctoral Masters Bachelors Associates Nondegree SomeCollege HighSchool None
```
- Assigning labels to data not only helps with readability, but allows a machine learning algorithm to learn from the sample, and apply the same labels to other, unlabeled data.
- Missing data parameters: Many times, missing data can be excluded from a calculation simply by setting an appropriate parameter value. For example, the R functions `var`, `cov`, and `cor` compute the variance, covariance, or correlation of variables. `var` accepts an `na.rm = TRUE` argument, while `cov` and `cor` take a `use` argument (for example, `use = "complete.obs"`); either tells R to exclude any and all records or cases with missing values.
- Various other data tidying nuisances can exist within your data, such as incorrectly signed numeric data (that is, a negative value for data such as a participant's age), invalid data values based upon accepted scenario logic (for example, a participant's age versus level of education, in that it isn't feasible that a 10-year-old would have earned a Master's degree), and data values that are simply missing (is a participant's lack of response an indication of a not-applicable question or an error?). Thankfully, there are several approaches to these data scenarios with R.
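The date-conversion and missing-data points above can be sketched in a few lines; the date values and the `ages` vector are invented for illustration:

```r
# Date strings converted with as.Date, then compared with difftime
saledate   <- as.Date("2020-01-15")
returndate <- as.Date("2020-01-29")
difftime(returndate, saledate, units = "days")  # Time difference of 14 days

# Excluding missing values from a calculation
ages <- c(34, 27, NA, 41)
var(ages, na.rm = TRUE)                                   # 49
cor(c(1, 2, NA, 4), c(2, 4, 6, 8), use = "complete.obs")  # 1
```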