- Apache Spark 2.x for Java Developers
- Sourav Gulati, Sumit Kumar
Exploring the Spark ecosystem
Apache Spark is considered a general-purpose system in the big data world. It ships with a set of libraries that help you perform various kinds of analytics on your data, providing built-in support for batch analytics, real-time analytics, machine learning, and much more.
The following built-in libraries are available in Spark:
- Spark Core: As its name suggests, the Spark Core library contains the core modules of Spark. It covers the basics of the Spark model, including the RDD and the various transformations and actions that can be performed on it. Essentially, all batch analytics that can be expressed in the Spark programming model using the MapReduce paradigm belong to this library. It can also be used to analyze many different varieties of data.
- Spark Streaming: The Spark Streaming library contains modules that let users run near real-time stream processing on incoming data, addressing the velocity dimension of big data. It provides connectors for listening to various streaming sources and performing analytics in near real time on the data received from them.
- Spark SQL: The Spark SQL library helps to analyze structured data using SQL queries. It includes the Dataset API, which presents structured data in tabular form and provides the ability to run SQL on top of it. The library ships with many of the functions available in the SQL dialects of typical RDBMSs, and it also lets you write your own functions, called User Defined Functions (UDFs).
- MLlib: Spark MLlib helps to apply machine learning techniques to your data while leveraging the distributed and scalable execution of Spark. It bundles many learning algorithms and utilities, providing algorithms for classification, regression, clustering, decomposition, collaborative filtering, and so on.
- GraphX: The Spark GraphX library provides APIs for graph-based computations. With the help of this library, users can perform parallel computations on graph-structured data. GraphX is one of the fastest ways of performing graph-based computations.
- SparkR: The SparkR library is used to run R scripts or commands on a Spark cluster, giving R scripts a distributed environment in which to execute. Spark ships with a shell called sparkR that can be used to run R scripts on a Spark cluster. Users who are more familiar with R can use tools such as RStudio or the R shell and execute R scripts that run on the Spark cluster.
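The Spark Core bullet above can be sketched with a minimal RDD example in Spark 2.x's Java API. This is an illustrative sketch, not production code: the class name is made up, and a local master is assumed purely for demonstration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class CoreSketch {
    public static void main(String[] args) {
        // Local-mode context for illustration; on a real cluster the master URL differs.
        SparkConf conf = new SparkConf().setAppName("core-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        // map is a lazy transformation; reduce is an action that triggers execution.
        int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
        System.out.println("sumOfSquares=" + sumOfSquares); // 1+4+9+16+25 = 55
        sc.close();
    }
}
```

The key point the sketch illustrates is laziness: `map` only records the computation, and nothing runs on the executors until the `reduce` action is called.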
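The Spark Streaming bullet can likewise be sketched in Java. To keep the example self-contained, a `queueStream` stands in for a real source such as Kafka or a socket; the class name, timeout, and batch interval are illustrative assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        // At least two local threads: one to receive, one to process.
        SparkConf conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        // queueStream stands in for a real source (Kafka, socket, files, ...).
        Queue<JavaRDD<Integer>> queue = new LinkedList<>();
        queue.add(jssc.sparkContext().parallelize(Arrays.asList(1, 2, 3, 4)));
        JavaDStream<Integer> doubled = jssc.queueStream(queue).map(x -> x * 2);
        List<Integer> results = Collections.synchronizedList(new ArrayList<>());
        // Each micro-batch is handed to us as an ordinary RDD.
        doubled.foreachRDD(rdd -> results.addAll(rdd.collect()));
        jssc.start();
        jssc.awaitTerminationOrTimeout(5000);
        jssc.stop(true, true);
        int sum = results.stream().mapToInt(Integer::intValue).sum();
        System.out.println("sum=" + sum); // 2+4+6+8 = 20
    }
}
```

The sketch shows the micro-batch model: the incoming stream is chopped into small RDDs, and the same transformations used in Spark Core (`map` here) apply to each batch.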
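The Spark SQL bullet, including UDF registration, can be sketched as follows. The view name `words`, the UDF name `strlen`, and the class name are illustrative choices, not names from the book.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.util.Arrays;

public class SqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-sketch").master("local[*]").getOrCreate();
        // Build a small Dataset and expose it to SQL as a temporary view.
        Dataset<String> words = spark.createDataset(
                Arrays.asList("spark", "sql", "udf"), Encoders.STRING());
        words.createOrReplaceTempView("words");
        // Register a User Defined Function returning a string's length.
        spark.udf().register("strlen",
                (UDF1<String, Integer>) String::length, DataTypes.IntegerType);
        Dataset<Row> result = spark.sql(
                "SELECT value, strlen(value) AS len FROM words");
        result.show();
        // Only "spark" has five letters.
        System.out.println("len5=" + result.filter("len = 5").count());
        spark.stop();
    }
}
```

Note that a `Dataset<String>` created with `Encoders.STRING()` exposes its contents in a column named `value`, which is why the query refers to `value`.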
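Finally, the MLlib bullet can be sketched with a tiny clustering example using the DataFrame-based `spark.ml` API. The data points and `k=2` are made-up illustrative values; the aim is only to show the fit-a-model workflow.

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.Arrays;
import java.util.List;

public class MllibSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mllib-sketch").master("local[*]").getOrCreate();
        // KMeans expects a "features" vector column by default.
        StructType schema = new StructType(new StructField[]{
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
        List<Row> rows = Arrays.asList(
                RowFactory.create(Vectors.dense(0.0, 0.0)),
                RowFactory.create(Vectors.dense(0.1, 0.1)),
                RowFactory.create(Vectors.dense(9.0, 9.0)),
                RowFactory.create(Vectors.dense(9.1, 9.1)));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        // Two obvious clusters in the toy data, so k = 2.
        KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(df);
        System.out.println("centers=" + model.clusterCenters().length);
        spark.stop();
    }
}
```

The same pattern (build a DataFrame, configure an estimator, call `fit`) applies across MLlib's classification, regression, and clustering algorithms.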