- Learning Apache Spark 2
- Muhammad Asif Abbasi
What is ETL?
ETL stands for Extraction, Transformation, and Loading. The term has been around for decades and represents an industry-standard process for moving and transforming data to build the pipelines that deliver BI and analytics. ETL processes are also widely used in data migration and master data management initiatives. Since the focus of this book is Spark, we'll touch on ETL only lightly and will not go into much detail.
Extraction
Extraction is the first part of the ETL process, representing the extraction of data from source systems. This is often one of the most important parts of the ETL process, as it sets the stage for all further downstream processing. There are a few major things to consider during an extraction process (a short read sketch for some of the common file formats follows this list):
- The source system type (RDBMS, NoSQL, FlatFiles, Twitter/Facebook streams)
- The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
- The frequency of the extract (Daily, Hourly, Every second)
- The size of the extract
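To make the file-format consideration concrete, here is a minimal sketch of reading a few of these formats with Spark 2's SparkSession API. The file paths, the local master setting, and the CSV options are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ExtractionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; in a real pipeline the master and
    // app name would come from your cluster configuration
    val spark = SparkSession.builder()
      .appName("extraction-sketch")
      .master("local[*]")
      .getOrCreate()

    // CSV with a header row; schema inference costs an extra pass over the data
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/exports/customers.csv")

    // Line-delimited JSON
    val jsonDf = spark.read.json("/data/exports/events.json")

    // Parquet carries its own schema, so no options are needed
    val parquetDf = spark.read.parquet("/data/exports/orders.parquet")

    csvDf.printSchema()
    jsonDf.printSchema()
    parquetDf.printSchema()

    spark.stop()
  }
}
```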
Loading
Once the data is extracted, the next logical step is to load it into the relevant framework for processing. The objective of loading the data into the relevant framework or tool before transformation is to allow the transformations to happen on the system that is best suited to, and most performant for, that processing. For example, suppose you extract data from a system for which Spark does not have a connector, say an Ingres database, and save it as a text file. You may then need to apply a few transformations before the data is usable. You have two options here: either perform the transformations directly on the extracted file, or first load the data into a framework such as Spark for processing. The benefit of the latter approach is that an MPP framework like Spark will be much more performant than running the same processing against the filesystem.
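A minimal sketch of that second option might look like the following, assuming the extract has already landed as a pipe-delimited text file at a hypothetical staging path; the path, delimiter, column name, and view name are all placeholders.

```scala
import org.apache.spark.sql.SparkSession

object LoadExtractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-sketch")
      .master("local[*]")
      .getOrCreate()

    // The file path and delimiter are assumptions; the extract could come
    // from any system Spark cannot connect to directly (e.g. an Ingres dump)
    val extracted = spark.read
      .option("sep", "|")
      .option("header", "true")
      .csv("/staging/ingres_export.txt")

    // Once loaded, transformations run on Spark's distributed engine
    // rather than against the raw file on disk
    extracted.createOrReplaceTempView("staged_orders")
    val cleaned = spark.sql(
      "SELECT * FROM staged_orders WHERE order_id IS NOT NULL")

    cleaned.show(10)
    spark.stop()
  }
}
```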
Transformation
Once the data is available inside the framework, you can then apply the relevant transformations. Since the core abstraction within Spark is the RDD, we have already covered the transformations available on RDDs.
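As a brief reminder of what those RDD transformations look like in practice, here is a small sketch; the sample records and the category field are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RddTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-transform-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A small in-memory RDD standing in for data already loaded into Spark
    val lines = sc.parallelize(Seq("10,books", "25,music", "7,books"))

    // Transformations (map, filter) followed by an action (reduce)
    val bookTotal = lines
      .map(_.split(","))
      .filter(fields => fields(1) == "books")
      .map(fields => fields(0).toInt)
      .reduce(_ + _)

    println(s"Total for books: $bookTotal")
    spark.stop()
  }
}
```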
Spark provides connectors to certain systems, which essentially combine the processes of extraction and loading into a single activity, because the data streams directly from the source system into Spark. Given the huge variety of source systems, however, in many cases Spark will not provide such a connector, which means you will have to extract the data using the tools made available by the particular system or by third-party vendors.
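Where a connector does exist, the combined extract-and-load step can look like the JDBC sketch below. The connection URL, table name, and credentials are placeholders, and the appropriate JDBC driver jar would need to be on the Spark classpath.

```scala
import org.apache.spark.sql.SparkSession

object JdbcConnectorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-connector-sketch")
      .master("local[*]")
      .getOrCreate()

    // URL, table, and credentials are hypothetical; substitute the details
    // of whatever source database you are extracting from
    val ordersDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "report_user")
      .option("password", "secret")
      .load()

    // Extraction and loading collapse into one step: rows stream from the
    // source database straight into a DataFrame, ready for transformation
    ordersDf.groupBy("status").count().show()

    spark.stop()
  }
}
```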