- The Artificial Intelligence Infrastructure Workshop
- Chinmay Arankalle Gareth Dwyer Bas Geerdink Kunal Gera Kevin Liao Anand N.S.
ETL
ETL is the standard term for Extract, Transform, Load. In traditional data warehousing systems, the entire data pipeline consists of multiple ETL steps, executed one after another, that bring the data from the source to the target (usually a report on a dashboard). Let's explore this in more detail:
E: Data is extracted from a source. This can be a file, a database, or a direct call to an API or web service. Once extracted (for example, with a query), the data is held in memory, ready to be transformed. For example, a daily export file from a source system that produces client orders is read every day at 01:00.
T: The data that was captured in memory during the extraction phase (or in the loading phase with ELT) is transformed using calculations, aggregations, and/or filters into a target dataset. For example, the customer order data is cleaned, enriched, and narrowed down per region.
L: The data that was transformed is loaded (stored) into a data store.
This completes an ETL step. Similarly, in ELT, all the extracted data gets stored in the data store and then later transformed.
The following figure is an example of a full data pipeline, from a source system to a trained model:

Figure 3.1: An example of a typical ETL data pipeline
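The three steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the book's own code: the CSV content, column names, and the use of an in-memory SQLite database as the target data store are all assumptions made for the example.

```python
import csv
import sqlite3
from io import StringIO

# Extract: read the daily export of client orders.
# (Hypothetical CSV content stands in for the real export file.)
raw_export = StringIO(
    "order_id,region,amount\n"
    "1,EMEA,120.50\n"
    "2,APAC,75.00\n"
    "3,EMEA,200.00\n"
)
orders = list(csv.DictReader(raw_export))

# Transform: clean the region field and aggregate order totals per region.
totals = {}
for row in orders:
    region = row["region"].strip().upper()
    totals[region] = totals.get(region, 0.0) + float(row["amount"])

# Load: store the transformed dataset in a data store (SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region_totals (region TEXT, total REAL)")
conn.executemany("INSERT INTO region_totals VALUES (?, ?)", totals.items())
conn.commit()

print(sorted(conn.execute("SELECT region, total FROM region_totals")))
```

Note that the transformed result is what gets persisted; the raw extract lives only in memory for the duration of the step.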
In modern systems such as data lakes, the ETL chain is often replaced by ELT. Rather than having, say, five ETL steps in which data is slowly refined from a raw format into a queryable form ready for analysis, all the data is loaded into one large data store. Then, a series of transformations that are mostly virtual (not stored on disk) runs directly on top of the stored data to produce a similar outcome for analytics. In this way, gains in storage space and performance can be achieved, since modern (cloud-based) storage systems are capable of handling massive amounts of data. The data pipeline becomes somewhat simpler, although the various T (Transform) steps still have to be managed as separate pieces of software:

Figure 3.2: An example of an ELT data pipeline
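The ELT variant can be sketched the same way. Again, this is an illustrative assumption, not the book's code: the raw rows are loaded into the store untransformed, and the transformation is a SQL view, which is virtual in the sense described above — it occupies no extra storage and runs on top of the raw data at query time.

```python
import sqlite3

# Load: raw order rows go straight into the data store, untransformed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "EMEA", 120.50), (2, "APAC", 75.00), (3, "emea", 200.00)],
)

# Transform: a view is virtual (not stored on disk); cleaning and
# aggregation are evaluated against the raw data whenever it is queried.
conn.execute(
    "CREATE VIEW region_totals AS "
    "SELECT UPPER(region) AS region, SUM(amount) AS total "
    "FROM raw_orders GROUP BY UPPER(region)"
)

print(sorted(conn.execute("SELECT region, total FROM region_totals")))
```

Because the raw table is kept, the same stored data can feed several such virtual transformations, each managed as its own piece of software.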
In the remainder of this chapter, we will look in detail at the ETL and ELT steps. Use the text and exercises to form a good understanding of the possibilities for preparing your data. Remember that there is no silver bullet; every use case will have specific needs when it comes to data processing and storage. There are many tools and techniques that can be used to get data from A to B; pick the ones that suit your company best, and whatever you pick, never forget the best practices of software development, such as version control, test-driven development, clean code, documentation, and common sense.