
ETL

ETL stands for Extract, Transform, and Load. In traditional data warehousing systems, the data pipeline consists of multiple ETL steps that follow one another to bring the data from the source to the target (usually a report on a dashboard). Let's explore each step in more detail:

E: Data is extracted from a source. This can be a file, a database, or a direct call to an API or web service. Once read (for example, with a query), the data is kept in memory, ready to be transformed. For example, a source system that produces client orders writes a daily export file, which is read every day at 01:00.

T: The data that was captured in memory during the extraction phase (or, with ELT, the data that was stored during the loading phase) is transformed into a target dataset using calculations, aggregations, and/or filters. For example, the customer order data is cleaned, enriched, and narrowed down per region.

L: The data that was transformed is loaded (stored) into a data store.

This completes an ETL step. In ELT, by contrast, all the extracted data is first stored in the data store and only transformed later.
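The three phases described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the sample export, its column names (`order_id`, `region`, `amount`), the `region_summary` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the target data store.

```python
import csv
import io
import sqlite3

# Hypothetical daily export of client orders; in practice this would be
# a file written by the source system.
SAMPLE_EXPORT = """order_id,region,amount
1,EU,100.50
2,US,20.00
3,EU,
4,EU,49.50
"""


def extract(source):
    """E: read all rows from the export into memory."""
    return list(csv.DictReader(source))


def transform(rows, region):
    """T: drop rows without an amount, keep one region, total the amounts."""
    cleaned = [row for row in rows if row["amount"]]
    selected = [row for row in cleaned if row["region"] == region]
    total = sum(float(row["amount"]) for row in selected)
    return {"region": region, "order_count": len(selected), "total_amount": total}


def load(summary, conn):
    """L: store the transformed result in the target data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS region_summary "
        "(region TEXT, order_count INTEGER, total_amount REAL)"
    )
    conn.execute(
        "INSERT INTO region_summary VALUES (:region, :order_count, :total_amount)",
        summary,
    )
    conn.commit()


def run_etl(source, conn, region="EU"):
    """Run one ETL step: extract, then transform, then load."""
    load(transform(extract(source), region), conn)


warehouse = sqlite3.connect(":memory:")  # stand-in for the real data store
run_etl(io.StringIO(SAMPLE_EXPORT), warehouse)
etl_result = warehouse.execute(
    "SELECT region, order_count, total_amount FROM region_summary"
).fetchall()
print(etl_result)
```

Note that only the transformed summary reaches the data store here; the raw rows exist only in memory while the step runs.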

The following figure is an example of a full data pipeline, from a source system to a trained model:

Figure 3.1: An example of a typical ETL data pipeline

In modern systems such as data lakes, the ETL chain is often replaced by ELT. Rather than having, say, five ETL steps in which data is gradually refined from a raw format into a queryable form, all the data is loaded into one large data store. Then, a series of transformations that are mostly virtual (not stored on disk) runs directly on top of the stored data to produce a similar outcome for analytics. This can save storage space and improve performance, since intermediate results are not written to disk and modern (cloud-based) storage systems are capable of handling massive amounts of data. The data pipeline becomes somewhat simpler, although the various T (Transform) steps still have to be managed as separate pieces of software:

Figure 3.2: An example of an ELT data pipeline
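To make the idea of a virtual transformation concrete, here is a small ELT sketch in Python. It is an illustration under assumptions, not a description of any particular data lake: an in-memory SQLite database stands in for the large data store, the `raw_orders` table and `region_summary` view are hypothetical names, and a SQL view plays the role of the transformation that is not stored on disk.

```python
import sqlite3

# Hypothetical raw order records, stored exactly as they were extracted
# (amounts are still text and one of them is missing).
RAW_ORDERS = [
    (1, "EU", "100.50"),
    (2, "US", "20.00"),
    (3, "EU", None),
    (4, "EU", "49.50"),
]

lake = sqlite3.connect(":memory:")  # stand-in for the large data store

# E + L: load the extracted data without transforming it first.
lake.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount TEXT)")
lake.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

# T: a view is a stored query, not a stored result. It is evaluated on top
# of the raw data every time it is read.
lake.execute(
    """
    CREATE VIEW region_summary AS
    SELECT region,
           COUNT(*) AS order_count,
           SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    WHERE amount IS NOT NULL
    GROUP BY region
    """
)

elt_result = lake.execute(
    "SELECT region, order_count, total_amount FROM region_summary ORDER BY region"
).fetchall()
print(elt_result)
```

The raw table keeps all four records, including the incomplete one, so a different transformation can be defined later without extracting the data again.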

In the remainder of this chapter, we will look in detail at the ETL and ELT steps. Use the text and exercises to form a good understanding of the possibilities for preparing your data. Remember that there is no silver bullet; every use case will have specific needs when it comes to data processing and storage. There are many tools and techniques that can be used to get data from A to B; pick the ones that suit your company best. Whatever you pick, never forget the best practices of software development, such as version control, test-driven development, clean code, documentation, and common sense.
