
Spark SQL

Spark SQL is the Spark component for working with structured and semi-structured data such as Hive tables, MySQL tables, Parquet files, Avro files, JSON files, CSV files, and more. Hive is an alternative for processing structured data: it processes data stored on HDFS using Hive Query Language (HQL) and internally uses MapReduce for its processing. We shall see how Spark can deliver better performance than MapReduce. In early versions of Spark, structured data was represented as a SchemaRDD (a special type of RDD). When data comes with a schema, SQL becomes the first choice for processing it, and Spark SQL enables developers to process such data with Structured Query Language (SQL).
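As a rough illustration of processing schema-backed data with SQL, the following PySpark sketch registers a small in-memory dataset as a temporary view and aggregates it with a plain SQL query. The `employees` view, its columns, the sample rows, and the helper names are assumptions made up for this example; they do not come from the text above.

```python
# Sketch only: table, column, and function names are illustrative assumptions.
ROWS = [("Alice", "eng", 120), ("Bob", "eng", 100), ("Cara", "ops", 90)]
SCHEMA = "name STRING, dept STRING, salary INT"  # DDL-style schema string

QUERY = """
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept
ORDER BY dept
"""


def average_salary_by_dept(spark):
    """Register ROWS as a temporary view and aggregate it with plain SQL."""
    df = spark.createDataFrame(ROWS, schema=SCHEMA)
    df.createOrReplaceTempView("employees")
    return [(row["dept"], row["avg_salary"]) for row in spark.sql(QUERY).collect()]


def main():
    # Imported lazily so the module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        print(average_salary_by_dept(spark))
    finally:
        spark.stop()
```

Calling `main()` on a machine with PySpark installed runs the query on a local Spark session. The same SQL text could equally be pointed at a view backed by Parquet, JSON, or a Hive table.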

Using Spark SQL, business logic can be written in SQL and HQL. This enables data warehouse engineers with a good knowledge of SQL to use Spark for their extract, transform, load (ETL) processing. Hive projects can be migrated to Spark using Spark SQL, typically without changing the Hive scripts.

Spark SQL is also well suited to data analysis and data warehousing. It lets data analysts write ad hoc queries for exploratory analysis. Spark provides the Spark SQL shell (spark-sql), where you can run SQL-like queries that are executed on Spark. Spark internally converts the query into a chain of RDD computations, whereas Hive converts an HQL job into a series of MapReduce jobs. With Spark SQL, developers can also use caching (a Spark feature that keeps data in memory), which can significantly improve the performance of queries that reuse the same data.
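The caching and query-planning behavior described above can be sketched in PySpark as follows. This is a minimal example under stated assumptions: the `events` view, its synthetic contents, and the function names are hypothetical, and any speed-up from caching depends on the workload.

```python
# Sketch only: the view name, data, and function names are illustrative assumptions.
BUCKET_QUERY = """
SELECT bucket, COUNT(*) AS n
FROM events
GROUP BY bucket
ORDER BY bucket
"""


def run_with_cache(spark):
    """Cache a temporary view, query it with SQL, and print the physical plan."""
    events = spark.range(0, 1000).selectExpr("id", "id % 10 AS bucket")
    events.createOrReplaceTempView("events")

    # Marks the view for in-memory caching; data is materialized on first use.
    spark.catalog.cacheTable("events")

    result = spark.sql(BUCKET_QUERY)
    result.explain()  # prints the physical plan Spark derived from the SQL text
    rows = [(row["bucket"], row["n"]) for row in result.collect()]

    was_cached = spark.catalog.isCached("events")
    spark.catalog.uncacheTable("events")  # release the cached data
    return rows, was_cached


def main():
    # Imported lazily so the module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-sql-cache-sketch")
        .getOrCreate()
    )
    try:
        print(run_with_cache(spark))
    finally:
        spark.stop()
```

Calling `main()` requires a local PySpark installation. Repeated queries against the cached view can then read from memory instead of recomputing the source data.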
