
Data cleaning

In this section, we will review some methods for data cleaning on Spark, with a focus on data incompleteness. We will then discuss some of Spark's special features for this work, as well as some data cleaning solutions made easy with Spark.

After this section, we will be able to clean data and make datasets ready for machine learning.

Dealing with data incompleteness

For machine learning, the more data the better. However, as is often the case, more data can also mean dirtier data, and therefore more work to clean it.

There are many data quality issues to deal with, which can be as simple as data entry errors or data duplications. In principle, the methods of treating them are similar: for example, utilizing data logic for discovery and subject matter knowledge and analytical logic to correct them. For this reason, in this section we will focus on missing value treatment to illustrate the use of Spark for this topic. More broadly, data cleaning covers data accuracy, completeness, uniqueness, timeliness, and consistency.

Treating missing values and dealing with incompleteness is not an easy task, though it may sound simple. It involves many issues and often requires the following steps:

  1. Counting the missing percentage.

    If the percentage is lower than 5% or 10%, then, depending on the study, we may not need to spend time on it. A short R sketch for this counting step is given after this list.

  2. Studying the missing patterns.

    Broadly, data can be missing completely at random or not at random. If the values are missing completely at random, we can usually ignore this issue or handle it with simple treatments.

  3. Deciding on the methods to deal with the missing values.

    There are several commonly used methods to deal with missing cases. Filling with the mean, deleting the missing cases, and model-based imputation are among the main ones.

  4. Performing the filling of missing values.

    To deal with missing cases and incompleteness, data scientists and machine learning professionals often utilize their familiar SQL tools or R programming. Fortunately, within the Spark environment, there are Spark SQL and R notebooks that let users continue on their familiar paths; we will review both in detail in the following two sections.
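
As a minimal sketch of step 1, the missing percentage for each variable can be computed in a few lines of base R once the data (or a sample of it) is available as a local data.frame. The data.frame df and its columns below are made up purely for illustration:

# A toy data.frame with some missing values (hypothetical data)
df <- data.frame(age    = c(25, NA, 31, NA, 42),
                 income = c(50000, 62000, NA, 48000, 55000))

# Missing percentage for each variable
missing_pct <- colMeans(is.na(df)) * 100
missing_pct

Columns whose missing percentage comes out well above the 5% to 10% range are candidates for the treatments described in steps 2 to 4.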

There are also other issues with data cleaning, such as treating data entry errors and outliers.

Data cleaning in Spark

In the preceding section, we discussed working with data incompleteness.

With Spark installed, we can easily use Spark SQL and R notebooks in the Databricks Workspace for the data cleaning work described in the previous section.

In particular, the sql function on sqlContext enables applications to run SQL queries programmatically and return the results as a DataFrame.

For example, in an R notebook, we can use the following to run SQL commands and get the results back as a DataFrame:

# Create a SQLContext from the existing SparkContext (sc)
sqlContext <- sparkRSQL.init(sc)
# Run a SQL query; the result comes back as a DataFrame
df <- sql(sqlContext, "SELECT * FROM table")
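
If we then want to work with the results as a local R data.frame, we can collect the DataFrame; this is a small usage note:

# Pull the query results into a local R data.frame for inspection
localDf <- collect(df)
head(localDf)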

Data cleaning is very tedious and time-consuming work, and in this section we would like to bring your attention to SampleClean, which can make data cleaning, and especially distributed data cleaning, easy for machine learning professionals.

SampleClean is a scalable data cleaning library built on the AMPLab Berkeley Data Analytics Stack (BDAS). The library uses Apache Spark SQL 1.2.0 and above, as well as Apache Hive, to support distributed data cleaning operations and related query processing on dirty data. SampleClean implements a set of interchangeable and composable physical and logical data cleaning operators, which makes the quick construction and adaptation of data cleaning pipelines possible.

To get our work started, let's first import Spark and SampleClean with the following commands:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import sampleclean.api.SampleCleanContext

To begin using SampleClean, we need to create a SampleCleanContext object, and then use this context to manage all of the information for the working session and to provide the API primitives for interacting with the data. SampleCleanContext is constructed with a SparkContext object, as follows:

// Construct the SampleCleanContext from an existing SparkContext
val scc = new SampleCleanContext(sparkContext)

Data cleaning made easy

With SampleClean and Spark together, we can make data cleaning easy, which means writing less code and using less data.

Overall, SampleClean employs a good strategy: it uses asynchrony to hide latency and sampling to hide scale. SampleClean also combines all three elements (algorithms, machines, and people) in one system, which makes it more efficient than other approaches.

Note

For more information on using SampleClean, go to: http://sampleclean.org/guide/ and http://sampleclean.org/release.html.

For the purposes of illustration, let's imagine a machine learning project with four data tables:

  • Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
  • Events(userId INT, action INT, Default)
  • WebLog(userId, webAction)
  • Demographic(memberId, age, edu, income)

To clean this dataset, we need to:

  • Count how many values are missing for each variable, using either SQL or R commands (a SQL sketch follows this list)
  • Fill in the missing cases with the mean value, if that is the strategy we agree on
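
As a sketch of the counting step in SQL, the missing counts for several variables of the Users table can be gathered in a single query. This assumes the Users data has already been registered as a Spark SQL table; the column names follow the illustrative schema above:

# Count the missing values for selected variables of the Users table
missingCounts <- collect(sql(sqlContext, "
  SELECT SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END)   AS missing_age,
         SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_email,
         COUNT(*)                                       AS total_rows
  FROM Users"))
missingCounts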

Even though the preceding steps are very easy to implement, they can be very time consuming if our data is huge. Therefore, for efficiency, we may need to divide the data into many subsets and complete the previous steps in parallel, which makes Spark the best computing platform to use.

In the Databricks R notebook environment, we can first create notebooks that use the R command sum(is.na(x)) to count the missing cases.

To replace the missing cases with the mean, we can use the following code:

# Replace the NAs in each (numeric) column with that column's mean
for(i in 1:ncol(data)){
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
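
When the data is too large to handle in a single R session, the same mean-filling step can be pushed down to Spark SQL so that it runs on the cluster. The following is a minimal sketch, again assuming the Users table is registered in Spark SQL and that age is the column being filled:

# Compute the column mean on the cluster (AVG ignores NULLs)
meanAge <- collect(sql(sqlContext,
  "SELECT AVG(age) AS mean_age FROM Users"))$mean_age

# Replace missing ages with the computed mean using COALESCE
usersFilled <- sql(sqlContext, paste0(
  "SELECT userId, COALESCE(age, ", meanAge, ") AS age FROM Users"))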

With Spark, we can easily schedule such R notebooks to run across all the data clusters.
