
What is ETL?

ETL stands for Extraction, Transformation, and Loading. The term has been around for decades, and it describes the industry-standard process of moving and transforming data to build the pipelines that deliver BI and analytics. ETL processes are also widely used in data migration and master data management initiatives. Since the focus of our book is on Spark, we'll touch only lightly on the subject of ETL and will not go into more detail.

Extraction

Extraction is the first part of the ETL process, covering the retrieval of data from source systems. It is often one of the most important parts of the process, since it sets the stage for all downstream processing. There are a few major things to consider during an extraction (a short Spark sketch follows the list):

  • The source system type (RDBMS, NoSQL, flat files, Twitter/Facebook streams)
  • The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
  • The frequency of the extract (daily, hourly, every second)
  • The size of the extract
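
To make the format question concrete, here is a minimal sketch of reading a few of these formats into Spark through a SparkSession; the file paths and the session name are illustrative assumptions, not fixed conventions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExtractionFormats")
      .getOrCreate()

    // CSV with a header row; Spark can infer column types when asked to
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/extracts/customers.csv")

    // JSON: one record per line by default
    val jsonDf = spark.read.json("/data/extracts/events.json")

    // Parquet carries its own schema, so no options are needed
    val parquetDf = spark.read.parquet("/data/extracts/orders.parquet")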

Loading

Once the data is extracted, the next logical step is to load it into the relevant framework for processing. The point of loading the data into a framework or tool before transformation is to let the transformations happen on the system best suited to that processing. For example, suppose you extract data from a system for which Spark has no connector, say an Ingres database, and save it as a text file. You may then need to apply a few transformations before the data is usable. You have two options: transform the extracted file directly, or first load the data into a framework such as Spark for processing. The benefit of the latter approach is that a distributed processing framework like Spark will be far more performant than running the same processing on the filesystem.
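
As a rough sketch of the latter option, assuming the Ingres extract was saved as pipe-delimited text at a hypothetical path, you could load it into Spark and let the cluster do the work instead of transforming the raw file in place:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LoadExtract")
      .getOrCreate()

    // Load the raw extract as an RDD of lines; Spark partitions the
    // file across the cluster for parallel processing
    val lines = spark.sparkContext.textFile("/data/extracts/ingres_dump.txt")

    // Split each line into fields (the pipe delimiter is an assumption)
    val records = lines.map(_.split('|'))

From this point on, every transformation runs in parallel across the cluster rather than on a single machine.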

Transformation

Once the data is available inside the framework, you can apply the relevant transformations. Since the core abstraction within Spark is the RDD, we have already seen the transformations available on RDDs.
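
For instance, here is a minimal sketch chaining a few of those familiar RDD transformations, using a small in-memory stand-in for extracted log data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("TransformExample")
      .getOrCreate()
    val sc = spark.sparkContext

    // A small RDD standing in for extracted data
    val lines = sc.parallelize(Seq("error disk full", "info ok", "error timeout"))

    // Classic RDD transformations: filter, flatMap, map, reduceByKey
    val wordCounts = lines
      .filter(_.startsWith("error"))      // keep only error lines
      .flatMap(_.split(" "))              // break lines into words
      .map(word => (word, 1))             // pair each word with a count
      .reduceByKey(_ + _)                 // sum the counts per word

    wordCounts.collect().foreach(println)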

Spark provides connectors to certain systems, which essentially combine extraction and loading into a single activity, since the data streams directly from the source system into Spark. In many cases, given the huge variety of source systems, Spark will not provide such a connector, which means you will have to extract the data using the tools made available by the particular system or by third parties.
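
One such connector is Spark's built-in JDBC data source, which collapses extraction and loading into a single read; the connection URL, table name, and credentials below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("JdbcExtract")
      .getOrCreate()

    // Extraction and loading happen in one step: Spark reads the table
    // directly over JDBC into a DataFrame (connection details are placeholders;
    // the matching JDBC driver must be on the classpath)
    val ordersDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl_user")
      .option("password", "secret")
      .load()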
