
What is ETL?

ETL stands for Extraction, Transformation, and Loading. The term has been around for decades, and it describes the industry-standard process of moving and transforming data to build the pipelines that deliver BI and analytics. ETL processes are also widely used in data migration and master data management initiatives. Since the focus of our book is on Spark, we'll touch only lightly on the subject of ETL and will not go into more detail.

Extraction

Extraction is the first part of the ETL process, covering the retrieval of data from source systems. It is often one of the most important parts of the process, since it sets the stage for all downstream processing. There are a few major things to consider during an extraction (a short Spark sketch follows the list):

  • The source system type (RDBMS, NoSQL, flat files, Twitter/Facebook streams)
  • The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
  • The frequency of the extract (daily, hourly, every second)
  • The size of the extract
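
To make the format question concrete, here is a minimal sketch of reading a few of these formats into Spark through a SparkSession; the file paths and the session name are illustrative assumptions, not fixed conventions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExtractionFormats")
      .getOrCreate()

    // CSV with a header row; Spark can infer column types when asked to
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/extracts/customers.csv")

    // JSON: one record per line by default
    val jsonDf = spark.read.json("/data/extracts/events.json")

    // Parquet carries its own schema, so no options are needed
    val parquetDf = spark.read.parquet("/data/extracts/orders.parquet")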

Loading

Once the data is extracted, the next logical step is to load it into the relevant framework for processing. The point of loading the data into a framework or tool before transformation is to let the transformations happen on the system best suited to that processing. For example, suppose you extract data from a system for which Spark has no connector, say an Ingres database, and save it as a text file. You may then need to apply a few transformations before the data is usable. You have two options: transform the extracted file directly, or first load the data into a framework such as Spark for processing. The benefit of the latter approach is that a distributed processing framework like Spark will be far more performant than running the same processing on the filesystem.
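
As a rough sketch of the latter option, assuming the Ingres extract was saved as pipe-delimited text at a hypothetical path, you could load it into Spark and let the cluster do the work instead of transforming the raw file in place:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LoadExtract")
      .getOrCreate()

    // Load the raw extract as an RDD of lines; Spark partitions the
    // file across the cluster for parallel processing
    val lines = spark.sparkContext.textFile("/data/extracts/ingres_dump.txt")

    // Split each line into fields (the pipe delimiter is an assumption)
    val records = lines.map(_.split('|'))

From this point on, every transformation runs in parallel across the cluster rather than on a single machine.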

Transformation

Once the data is available inside the framework, you can apply the relevant transformations. Since the core abstraction within Spark is the RDD, we have already seen the transformations available on RDDs.
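
For instance, here is a minimal sketch chaining a few of those familiar RDD transformations, using a small in-memory stand-in for extracted log data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("TransformExample")
      .getOrCreate()
    val sc = spark.sparkContext

    // A small RDD standing in for extracted data
    val lines = sc.parallelize(Seq("error disk full", "info ok", "error timeout"))

    // Classic RDD transformations: filter, flatMap, map, reduceByKey
    val wordCounts = lines
      .filter(_.startsWith("error"))      // keep only error lines
      .flatMap(_.split(" "))              // break lines into words
      .map(word => (word, 1))             // pair each word with a count
      .reduceByKey(_ + _)                 // sum the counts per word

    wordCounts.collect().foreach(println)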

Spark provides connectors to certain systems, which essentially combine extraction and loading into a single activity, since the data streams directly from the source system into Spark. In many cases, given the huge variety of source systems, Spark will not provide such a connector, which means you will have to extract the data using the tools made available by the particular system or by third parties.
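
One such connector is Spark's built-in JDBC data source, which collapses extraction and loading into a single read; the connection URL, table name, and credentials below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("JdbcExtract")
      .getOrCreate()

    // Extraction and loading happen in one step: Spark reads the table
    // directly over JDBC into a DataFrame (connection details are placeholders;
    // the matching JDBC driver must be on the classpath)
    val ordersDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl_user")
      .option("password", "secret")
      .load()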
