  • Learning Apache Apex
  • Thomas Weise Munagala V. Ramanath David Yan Kenneth Knowles

Development process and methodology

Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.

This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the required connectors are available in the library, which supports common sources such as files and Kafka, along with many other external systems that are part of the Apache big data ecosystem.

With the comprehensive operator library and the set of examples covering frequently used I/O cases and transformations, it is often possible to quickly assemble a preliminary end-to-end flow that covers a subset of the functionality, before building out the complete business logic in detail.
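Such a preliminary flow might look like the following sketch, which composes a Malhar file input operator, a custom operator, and a console output operator into a DAG. The `LineFilter` operator and the application name are hypothetical stand-ins for application-specific logic; `LineByLineFileInputOperator` and `ConsoleOutputOperator` come from the Malhar library:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

import org.apache.apex.malhar.lib.fs.LineByLineFileInputOperator;

@ApplicationAnnotation(name = "FilterDemo")
public class Application implements StreamingApplication
{
  // Hypothetical custom operator: passes through non-blank lines.
  public static class LineFilter extends BaseOperator
  {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

    public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
    {
      @Override
      public void process(String line)
      {
        if (!line.trim().isEmpty()) {
          output.emit(line);
        }
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Add the operators, then wire their ports together with streams.
    LineByLineFileInputOperator in = dag.addOperator("input", new LineByLineFileInputOperator());
    LineFilter filter = dag.addOperator("filter", new LineFilter());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("lines", in.output, filter.input);
    dag.addStream("filtered", filter.output, console.input);
  }
}
```

A skeleton like this can later be extended operator by operator, replacing the console sink with the real output connector once the business logic is in place.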

Examples that show how to work with frequently used library operators and accelerate the path to an initial running application can be found at https://github.com/apache/apex-malhar/tree/master/examples.

Having a basic pipeline working early on in the target environment (or at least close to it) allows various important integration and operational requirements, such as security and access control, to be evaluated in parallel. It also establishes a baseline for iterative and parallel development, and for testing the full-featured operators. Experience from working on complex pipelines shows that having an early basic pipeline reduces risk and provides better visibility into the progress of a bigger project, especially one with many integration points and a larger development team. Essentially, development dependencies can follow the modular structure of the DAG, allowing the full pipeline to be gradually built up and functions further downstream to be developed in parallel with mocked input, when needed.
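Mocking the input for downstream development can be as simple as an operator that implements the `InputOperator` interface and emits a fixed set of test tuples; the engine calls `emitTuples()` repeatedly within each streaming window. The class and sample data below are illustrative:

```java
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical mock source: emits a fixed set of lines so that
// downstream operators can be developed before the real connector
// is available; swapping in the real source is a one-line change
// in populateDAG().
public class MockLineInput extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  private final String[] samples = {"alpha", "", "beta"};
  private int next;

  @Override
  public void emitTuples()
  {
    if (next < samples.length) {
      output.emit(samples[next++]);
    }
  }
}
```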

A large project broken down into a series of smaller and more manageable milestones would roughly involve the following sequence of steps:

  1. Writing the Java code for new or customized operators.
  2. Unit testing (in IDE, no cluster environment needed).
  3. Integrating the operators into the DAG.
  4. Integration testing (testing the DAG with potentially mocked data, in IDE).
  5. Configuring operator properties for the target environment (connector settings, and so on).
  6. End-to-end testing with realistic data set in the target environment.
  7. Tuning (optimizing resource utilization, configuring appropriate platform attributes such as processing locality, memory and CPU allocation, scaling and so on).
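For steps 1 and 2, one way to keep unit tests free of any cluster dependencies is to place the operator's business logic in a plain method that the input port's `process()` callback delegates to; such a method can be tested in the IDE without the Apex runtime at all. The class and filtering rule below are hypothetical:

```java
// Hypothetical business logic for a line-filtering operator, kept in
// a plain static method so it can be unit tested in isolation; the
// operator's input port would simply delegate to accept() and emit
// the line when it returns true.
public class LineFilterLogic
{
  // Accept non-blank lines that are not comments (lines starting with '#').
  public static boolean accept(String line)
  {
    String trimmed = line.trim();
    return !trimmed.isEmpty() && !trimmed.startsWith("#");
  }
}
```

A unit test then reduces to plain assertions on `accept()`, with no mini-cluster or container setup required.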

Following a similar sequence will ensure that basic functional issues are discovered early on (ideally within the IDE environment where it is far more efficient to debug and fix) before fully packaging and deploying the pipeline to a cluster.
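The configuration in step 5 and the tuning attributes in step 7 are typically supplied through the application's XML properties file. The fragment below is a sketch with illustrative operator and stream names: entries following the `dt.operator.<name>.prop.<property>` pattern map to operator setters, while `attr` entries set platform attributes such as memory allocation and stream locality:

```xml
<configuration>
  <!-- illustrative names; prop.* entries map to operator setters -->
  <property>
    <name>dt.operator.input.prop.directory</name>
    <value>/data/incoming</value>
  </property>
  <!-- platform attribute: per-operator memory allocation -->
  <property>
    <name>dt.operator.filter.attr.MEMORY_MB</name>
    <value>512</value>
  </property>
  <!-- processing locality for a stream -->
  <property>
    <name>dt.stream.lines.prop.locality</name>
    <value>CONTAINER_LOCAL</value>
  </property>
</configuration>
```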

In subsequent sections, we will look at each of these phases in more detail.
