- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
Spark SQL
Spark SQL is one of the four components on top of the Spark platform, as we saw earlier in the chapter. It can be used to execute SQL queries or read data from any existing Hive installation, where Hive is a database implementation also from Apache. Querying with Spark SQL looks very similar to working with MySQL or PostgreSQL. The following code snippet is a good example:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+
Here we select all the columns from the people view: using the SparkSession object, we feed in a very standard-looking SQL statement, and the result is shown much as you would expect from a normal SQL implementation. Any standard SQL clause can be used against the registered view, as the sketch below shows.
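For instance, a minimal sketch of two more queries against the same people view, using the age and name columns from the output above (the WHERE threshold of 21 is just an illustrative value):

# Filter rows and compute an aggregate with plain SQL against the temp view
spark.sql("SELECT name FROM people WHERE age >= 21").show()
spark.sql("SELECT COUNT(*) AS n FROM people").show()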
Let's now look at datasets and DataFrames. A dataset is a distributed collection of data. It is an interface added in Spark 1.6 that provides benefits on top of RDDs. A DataFrame, on the other hand, will be very familiar to those who have used pandas or R. A DataFrame is simply a dataset organized into named columns, similar to a table in a relational database or a DataFrame in pandas. The main difference between a dataset and a DataFrame is that DataFrames have column names. As you can imagine, this is very convenient for machine learning work and for feeding data into tools such as scikit-learn.
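To make the idea of named columns concrete, here is a minimal sketch that builds a small DataFrame from local rows; the rows and the people_df name are made up for the example:

# Build a DataFrame from local rows, giving each column an explicit name
rows = [("Martin", 30), ("Melvin", 19), ("Jackson", None)]
people_df = spark.createDataFrame(rows, ["name", "age"])
people_df.printSchema()        # name: string, age: long (nullable)
people_df.select("name").show()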
Let's look at how DataFrames can be used. The following code snippet is a quick example of a DataFrame:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+
In much the same way as pandas or R would, spark.read.json allows us to load data from a JSON file, and df.show() displays the contents of the DataFrame, much like pandas does.
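Beyond show(), the DataFrame API offers the familiar column-wise operations; a small sketch using the same df (the filter threshold of 21 is just an illustrative value):

# Column-wise operations, much like pandas
df.printSchema()                   # inspect the inferred schema
df.select("name").show()           # project a single column
df.filter(df["age"] > 21).show()   # keep rows with age above 21
df.groupBy("age").count().show()   # count rows per distinct age value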
MLlib, as we know, is used to make machine learning scalable and easy. MLlib lets you carry out common machine learning tasks, such as featurization; creating pipelines; and saving and loading algorithms, models, and pipelines; it also provides utilities for linear algebra, statistics, and data handling. The other thing to note is that Spark and RDDs are almost inseparable concepts. However, if your main use case for Spark is machine learning, Spark now encourages you to use the DataFrame-based API for MLlib, which is quite beneficial to us because we are already familiar with pandas, and this means a smooth transition into Spark.
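To give a flavour of the DataFrame-based MLlib API, here is a minimal pipeline sketch; the training DataFrame train_df and its label column are hypothetical and not from the example data above:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble raw numeric columns into a single feature vector column
assembler = VectorAssembler(inputCols=["age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Chain featurization and the estimator into one pipeline
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)          # train_df is a hypothetical labelled DataFrame
predictions = model.transform(train_df)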
In the next section, we will see how we can set up Spark on Windows, and set up PySpark as the interface.