
Development process and methodology

Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.
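To illustrate the idea of composing smaller functional building blocks, here is a deliberately simplified plain-Java sketch. The `Operator` interface and `connect` method below are hypothetical stand-ins for illustration only; the real Apex API (operators with input/output ports wired together by streams in a DAG) is considerably richer.

```java
import java.util.function.Function;

// Toy model of operator composition: each "operator" is a named transformation,
// and connecting two of them mirrors wiring an output port to an input port.
// These types are illustrative, not part of the Apex API.
public class PipelineSketch {
    interface Operator<I, O> extends Function<I, O> {}

    // Connect an upstream operator's output to a downstream operator's input.
    static <A, B, C> Operator<A, C> connect(Operator<A, B> up, Operator<B, C> down) {
        return a -> down.apply(up.apply(a));
    }

    static int demo() {
        Operator<String, String[]> parse = line -> line.split(",");
        Operator<String[], Integer> count = fields -> fields.length;
        // Compose the two building blocks into a small pipeline.
        return connect(parse, count).apply("a,b,c");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3
    }
}
```

In Apex itself, the equivalent wiring happens declaratively when operators are added to the DAG and their ports are joined by streams, rather than through direct function composition.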

This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the required connectors will be available from the library, which supports common sources such as files and Kafka, along with many other external systems that are part of the Apache Big Data ecosystem.

With the comprehensive operator library and the set of examples covering frequently used I/O and transformation cases, it is often possible to quickly assemble a preliminary end-to-end flow that covers a subset of the functionality, before building out the complete business logic in detail.

Examples that show how to work with frequently used library operators and accelerate the path to an initial running application can be found at https://github.com/apache/apex-malhar/tree/master/examples.

Having a basic pipeline working early on in the target environment (or at least close to it) allows various important integration and operational requirements, such as security and access control, to be evaluated in parallel. It also establishes a baseline for iterative and parallel development, and for testing the full-featured operators. Experience from working on complex pipelines shows that having an early basic pipeline reduces risk and provides better visibility into the progress of a bigger project, especially one with many integration points and a larger development team. Essentially, development dependencies can follow the modular structure of the DAG, allowing the full pipeline to be gradually built up and functions further downstream to be developed in parallel with mocked input, when needed.
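Developing downstream logic against mocked input can be as simple as substituting a fixed batch of records for the live source. The sketch below illustrates the idea in plain Java; the class and method names are illustrative, not part of the Apex API.

```java
import java.util.Arrays;
import java.util.List;

// Downstream aggregation logic developed against mocked input, standing in
// for records that a real upstream connector (e.g. Kafka) would deliver.
public class MockedInputSketch {
    // Hypothetical downstream logic: sum the amounts of incoming records.
    static long sumAmounts(List<Long> amounts) {
        return amounts.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Mocked input: a hard-coded batch instead of a live source connector.
        List<Long> mocked = Arrays.asList(10L, 20L, 12L);
        System.out.println(sumAmounts(mocked)); // prints 42
    }
}
```

Once the real upstream operators are ready, the mocked batch is replaced by the actual stream without changing the downstream logic.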

A large project, broken down into a series of smaller and more manageable milestones, would roughly involve the following sequence of steps:

  1. Writing the Java code for new or customized operators.
  2. Unit testing (in the IDE, no cluster environment needed).
  3. Integrating the operators into the DAG.
  4. Integration testing (testing the DAG, potentially with mocked data, in the IDE).
  5. Configuring operator properties for the target environment (connector settings, and so on).
  6. End-to-end testing with a realistic data set in the target environment.
  7. Tuning (optimizing resource utilization, configuring appropriate platform attributes such as processing locality, memory and CPU allocation, scaling and so on).
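Steps 5 and 7 above are typically handled through an external configuration file rather than code. A sketch of what such a file might look like follows; the operator name `kafkaInput` and its `topic` property are illustrative, assuming Apex's `dag.operator.<name>.prop.<property>` key convention:

```xml
<configuration>
  <!-- Connector setting for a hypothetical Kafka input operator -->
  <property>
    <name>dag.operator.kafkaInput.prop.topic</name>
    <value>transactions</value>
  </property>
  <!-- Platform attribute: memory allocated to the application master -->
  <property>
    <name>dag.attr.MASTER_MEMORY_MB</name>
    <value>1024</value>
  </property>
</configuration>
```

Keeping such settings outside the code allows the same DAG to be promoted across environments without repackaging.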

Following a similar sequence will ensure that basic functional issues are discovered early on (ideally within the IDE environment, where it is far more efficient to debug and fix) before the pipeline is fully packaged and deployed to a cluster.

In subsequent sections, we will look at each of these phases in more detail.
