
Dataset reorganizing

In this section, we will cover dataset reorganization techniques. We will then discuss some of Spark's special features for reorganizing data, as well as some of R's methods for this task that can be used with Spark notebooks.

After reading this section, we will be able to reorganize datasets for various machine learning needs.

Dataset reorganizing tasks

Reorganizing datasets sounds easy, but it can be very challenging and is often very time consuming.

Two common data reorganizing tasks are, firstly, obtaining a subset of the data for modeling and, secondly, aggregating the data to a higher level. For example, we may have data about individual students but need a dataset at the classroom level. For this, we will need to calculate some attributes per student and then aggregate them into a new classroom-level dataset.
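For instance, here is a minimal base R sketch of the second task, using a small, hypothetical students data frame (the column names are illustrative only):

# A hypothetical student-level data frame
students <- data.frame(
  classroomId = c(1, 1, 2, 2, 2),
  score = c(85, 90, 70, 75, 80),
  absences = c(0, 2, 1, 3, 0)
)

# Aggregate to one row per classroom: the average score and average absences
classrooms <- aggregate(cbind(score, absences) ~ classroomId, data = students, FUN = mean)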

To reorganize data, data scientists and machine learning professionals often use their familiar SQL or R programming tools. Fortunately, the Spark environment provides Spark SQL and R notebooks so that users can continue along these familiar paths; we will review both in detail in the following two sections.

Overall, we recommend using Spark SQL to reorganize datasets. However, for learning purposes, our focus in this section will be on the use of the R notebook from the Databricks Workspace.

R and Spark nicely complement each other for several important use cases in statistics and data science. Databricks R notebooks include the SparkR package by default, so that data scientists can effortlessly benefit from the power of Apache Spark in their R analyses. In addition to SparkR, any R package can easily be installed into the notebook. In what follows, we will highlight a few of the features of these R notebooks.

Getting started with R notebooks

To get started with R in Databricks, simply choose R as the language when creating a notebook. Since SparkR is a recent addition to Spark, remember to attach the R notebook to any cluster running Spark version 1.4 or higher. The SparkR package is imported and configured by default. You can run Spark queries in R.
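For instance, the following minimal sketch uses the Spark 1.4-era SparkR API and the sqlContext predefined in Databricks notebooks (the table name cars is just an example):

# Expose a local R data frame to Spark and register it as a temporary table
carsDF <- createDataFrame(sqlContext, mtcars)
registerTempTable(carsDF, "cars")

# Run a Spark SQL query from R and preview the result
result <- sql(sqlContext, "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl")
head(result)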

Dataset reorganizing with Spark SQL

In the last section, we mentioned using Spark SQL to reorganize datasets.

SQL can be a powerful tool for performing complex aggregations, and many of its techniques are already familiar to machine learning professionals.

The SELECT statement, combined with a WHERE clause, can be used to obtain data subsets.

For data aggregation, machine learning professionals may use some of Spark SQL's simple aggregate functions or its window functions.
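For example, assuming that the Users table listed later in this section has been registered as a table named users, the following sketch, run from an R notebook through SparkR's sql function, shows a subset query, a simple aggregation, and a window function:

# Subsetting with SELECT ... WHERE
youngUsers <- sql(sqlContext, "SELECT userId, age FROM users WHERE subscribed = TRUE AND age < 30")

# Aggregating to a higher level with GROUP BY
usersByAge <- sql(sqlContext, "SELECT age, COUNT(*) AS userCount FROM users GROUP BY age")

# A window function: rank users by age within each subscription group
# (in Spark 1.4, window functions require a HiveContext, which Databricks notebooks provide)
rankedUsers <- sql(sqlContext, "SELECT userId, age, RANK() OVER (PARTITION BY subscribed ORDER BY age DESC) AS ageRank FROM users")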

Note

For more information about Spark SQL's various aggregation functions, go to https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$.

For more information on Spark SQL's window functions, go to https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html.

Dataset reorganizing with R on Spark

R has a subset function to create subsets, used in the following format:

# using the subset function
newdata <- subset(olddata, var1 >= 20, select = c(ID, var2))

Also, we may use R's aggregate function, as follows:

# mean of every column in mtcars, grouped by number of cylinders and engine type
aggdata <- aggregate(mtcars, by = list(mtcars$cyl, mtcars$vs), FUN = mean, na.rm = TRUE)

However, data often has multiple levels of grouping (nested treatments, split-plot designs, or repeated measurements) and typically requires investigation at multiple levels. For example, in a long-term clinical study, we may be interested in investigating relationships over time, between patients, or between treatments. To make the job even more difficult, the data has probably been collected and stored in a way optimized for ease and accuracy of collection, and in no way resembles the form needed for statistical analysis. We need to be able to fluently and fluidly reshape the data to meet our needs, but most software packages make it difficult to generalize these tasks, and new code needs to be written for each new case.

In particular, R has a reshape package that was specially designed for data reorganization. The reshape package uses a paradigm of melting and casting: the data is first melted into a form that distinguishes measured variables from identifying variables, and is then cast into a new shape, whether a data frame, a list, or a high-dimensional array.
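As a brief sketch of this melt-and-cast paradigm, using R's built-in airquality dataset (note that the newer reshape2 package uses dcast in place of cast):

library(reshape)

# Melt: Month and Day identify each observation; the remaining columns become measured variables
aqm <- melt(airquality, id = c("Month", "Day"), na.rm = TRUE)

# Cast: reshape the molten data into a new form, here the monthly mean of each measured variable
cast(aqm, Month ~ variable, mean)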

As we may recall, in the Data cleaning made easy section, we had four tables for the purposes of illustration:

  • Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
  • Events(userId INT, action INT, Default)
  • WebLog(userId, webAction)
  • Demographic(memberId, age, edu, income)

For this example, we often need to obtain a subset of the first table (Users) and aggregate the fourth table (Demographic).
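A minimal sketch of these two steps in base R, assuming the Users and Demographic tables have been collected into local data frames named users and demographic (hypothetical names), might look as follows:

# Subset the Users table: subscribed users aged 20 or older, keeping a few columns
userSubset <- subset(users, subscribed == TRUE & age >= 20,
                     select = c(userId, age, latitude, longitude))

# Aggregate the Demographic table: the mean age and income by education level
demoAgg <- aggregate(demographic[, c("age", "income")],
                     by = list(edu = demographic$edu), FUN = mean, na.rm = TRUE)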
