
Building complex data pipelines

As soon as the data tools that we're building grow into something bigger than a simple script, it's useful to split the data pre-processing tasks into small units, in order to map all the steps and dependencies of the data pipeline.

By data pipeline, we mean a sequence of data processing operations that cleans, augments, and manipulates the original data, transforming it into something digestible by the analytics engine. Any non-trivial data analytics project will require a data pipeline composed of a number of steps.

In the prototyping phase, it is common to split these steps into different scripts, which are then run individually, for example:

$ python download_some_data.py
$ python clean_some_data.py
$ python augment_some_data.py

Each script in this example produces the output consumed by the following script, so there are dependencies between the different steps. We can refactor the data processing scripts into one large script that does everything, and then run it in one go:

$ python do_everything.py

The content of such a script might look similar to the following code:

if __name__ == '__main__':
    download_some_data()
    clean_some_data()
    augment_some_data()

Each of the preceding functions contains the main logic of the corresponding individual script. The problem with this approach is that errors can occur anywhere in the data pipeline, so we also need a fair amount of boilerplate code, with try and except clauses, to keep control over the exceptions that might occur. Moreover, parameterizing this kind of code can feel a little clumsy.
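As a rough illustration of that boilerplate, the monolithic script might grow into something like the following sketch. It assumes the three functions from the previous snippet are defined in the same file; the logging setup and the exit-on-first-failure strategy are assumptions for the sake of the example, not part of the original scripts:

import logging
import sys

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    # Run the steps in order, stopping at the first failure
    for step in (download_some_data, clean_some_data, augment_some_data):
        try:
            step()
        except Exception:
            logger.exception("Pipeline failed at step %s", step.__name__)
            sys.exit(1)

Even this small example shows how quickly the error handling starts to dominate the script, which is one of the reasons to reach for a dedicated tool.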

In general, when moving from prototyping to something more stable, it's worth considering the use of a data orchestrator, also called a workflow manager. A good example of this kind of tool in Python is Luigi, an open source project introduced by Spotify. The advantages of using a data orchestrator such as Luigi include the following:

  • Task templates: Each data task is defined as a class with a few methods that specify how the task runs, its dependencies, and its output (see the sketch after this list)
  • Dependency graph: Visual tools help the data engineer visualize and understand the dependencies between tasks
  • Recovery from intermediate failure: If the data pipeline fails halfway through the tasks, it can be restarted from the last consistent state
  • Integration with the command line, as well as with system job schedulers such as cron
  • Customizable error reporting
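To give a flavour of what a task template looks like, here is a minimal sketch of two Luigi tasks mirroring the first two steps of our pipeline. The file names and the bodies of the run() methods are hypothetical placeholders, not working download or cleaning logic:

import luigi

class DownloadSomeData(luigi.Task):
    # Hypothetical task wrapping the logic of download_some_data.py

    def output(self):
        # The file produced by this task; Luigi skips the task if it already exists
        return luigi.LocalTarget('data/raw_data.csv')

    def run(self):
        with self.output().open('w') as f:
            f.write('...')  # placeholder for the real download logic

class CleanSomeData(luigi.Task):
    # Hypothetical task wrapping the logic of clean_some_data.py

    def requires(self):
        # Declares the dependency: Luigi runs DownloadSomeData first
        return DownloadSomeData()

    def output(self):
        return luigi.LocalTarget('data/clean_data.csv')

    def run(self):
        with self.input().open() as raw, self.output().open('w') as out:
            out.write(raw.read())  # placeholder for the real cleaning logic

if __name__ == '__main__':
    luigi.run()

Assuming the sketch is saved as pipeline.py, running python pipeline.py CleanSomeData --local-scheduler asks Luigi to build the whole dependency chain, and rerunning it after a failure only executes the tasks whose output is still missing.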

We won't dig into all the features of Luigi, as a detailed discussion would go beyond the scope of this book, but readers are encouraged to take a look at this tool and use it to produce more elegant, reproducible, and easily maintainable and expandable data pipelines.
