The ETL process

The early stages of what is now called big data processing evolved over several decades under the name of data mining, before the field adopted the popular name of big data.

One of the best outcomes of these disciplines is the specification of the Extract, Transform, Load (ETL) process.

This process starts with a mix of many data sources from business systems, then moves to a system that transforms the data into a readable state, and finishes by generating a data mart with well-structured and documented data types.

To apply this concept, we will combine the elements of this process with its final outcome: a structured dataset that includes, in its final form, an additional label column (in the case of supervised learning problems).

This process is depicted in the following diagram: 

Depiction of the ETL process, from raw data to a useful dataset

The diagram illustrates the first stages of the data pipeline. It starts with all of the organization's data, whether commercial transactions, raw values from IoT devices, or information from other valuable data sources, which commonly come in very different types and formats. The ETL process is in charge of gathering the raw information from these sources using different software filters, applying the necessary transforms to arrange the data in a useful manner, and finally presenting the data in tabular format (we can think of this as a single database table with a final feature or result column, or as a big CSV file with consolidated data). The processes that follow can then use the final result with practically no concern for the many quirks of data formatting, because these have been standardized into a very clear table structure.
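The three stages described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a production pipeline. The two in-memory sources, the column names (`customer_id`, `amount`, `temp`), and the rule used to derive the `label` column are all hypothetical and chosen just to show the shape of the process.

```python
import csv
import io
import json

# Hypothetical raw sources: a CSV export of commercial transactions
# and a JSON dump of IoT device readings (both invented for this sketch).
TRANSACTIONS_CSV = """customer_id,amount
1,20.5
2,5.0
1,4.5
"""

SENSOR_JSON = '[{"customer_id": 1, "temp": 21.0}, {"customer_id": 2, "temp": 35.5}]'


def extract():
    """Extract: read each raw source into plain Python records."""
    transactions = list(csv.DictReader(io.StringIO(TRANSACTIONS_CSV)))
    readings = json.loads(SENSOR_JSON)
    return transactions, readings


def transform(transactions, readings):
    """Transform: cast types, aggregate per customer, and join both sources."""
    totals = {}
    for row in transactions:
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    temps = {int(r["customer_id"]): float(r["temp"]) for r in readings}
    rows = []
    for cid in sorted(totals):
        total = totals[cid]
        rows.append({
            "customer_id": cid,
            "total_amount": total,
            "temp": temps.get(cid),
            # Hypothetical labeling rule for a supervised problem.
            "label": int(total > 10.0),
        })
    return rows


def load(rows):
    """Load: write the consolidated table as CSV text, label column last."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["customer_id", "total_amount", "temp", "label"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_etl():
    """Run the three stages in order and return the final table as CSV text."""
    return load(transform(*extract()))
```

Calling `run_etl()` returns a single consolidated table as text, with one row per customer and the label in the last column, which is the form the later stages of the pipeline would consume.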