
Development process and methodology

Development of an Apex application starts with mapping the functional specification to operators (smaller functional building blocks), which can then be composed into a DAG to collectively provide the functionality required for the use case.

This involves identifying the data sources, formats, transformations, and sinks for the application, and finding matching operators in the Apex library (which will be covered in the next chapter). In most cases, the required connectors will be available from the library, which supports common sources such as files and Kafka, along with many other external systems that are part of the Apache big data ecosystem.
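To make the mapping concrete, the following is a minimal sketch of how operators are composed into a DAG via Apex's `StreamingApplication.populateDAG()` API. The operator classes and port names (`KafkaInput`, `TransformOperator`, `FileOutput`, and their ports) are placeholders for whichever library connectors and custom transformations the use case requires:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// Sketch only: the three operator classes below stand in for a
// source connector, a business-logic transformation, and a sink.
@ApplicationAnnotation(name = "MyFirstApplication")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Source: ingest records from an external system (e.g. Kafka).
    KafkaInput input = dag.addOperator("kafkaInput", new KafkaInput());

    // Transformation: the business logic of the pipeline.
    TransformOperator transform = dag.addOperator("transform", new TransformOperator());

    // Sink: write results to an external system (e.g. files on HDFS).
    FileOutput output = dag.addOperator("fileOutput", new FileOutput());

    // Streams connect an output port to one or more input ports.
    dag.addStream("records", input.outputPort, transform.input);
    dag.addStream("results", transform.output, output.input);
  }
}
```

Each `addStream` call wires an upstream operator's output port to a downstream input port, so the DAG structure directly mirrors the data flow identified in the functional specification.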

With the comprehensive operator library and a set of examples covering frequently used I/O and transformation cases, it is often possible to quickly assemble a preliminary end-to-end flow that covers a subset of the functionality, before building out the complete business logic in detail.

Examples that show how to work with frequently used library operators and accelerate the path to an initial running application can be found at https://github.com/apache/apex-malhar/tree/master/examples.

Having a basic pipeline working early on in the target environment (or at least close to it) allows various important integration and operational requirements, such as security and access control, to be evaluated in parallel. It also establishes a baseline for iterative and parallel development, and for testing the full-featured operators. Experience from working on complex pipelines shows that an early basic pipeline can reduce risk and provide better visibility into the progress of a bigger project, especially when it has many integration points and a larger development team. Essentially, development dependencies can follow the modular structure of the DAG, allowing the full pipeline to be built up gradually and functions further downstream to be developed in parallel with mocked input, when needed.
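Mocked input can be as simple as a hypothetical input operator that emits synthetic records shaped like the real data, so that downstream operators can be developed before the actual connector is in place. A minimal sketch (the record format is illustrative):

```java
import java.util.Random;

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical mock source: emits synthetic records so downstream
// operators can be developed and tested in parallel. The platform
// calls emitTuples() repeatedly within each streaming window.
public class MockInputOperator extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  private final transient Random random = new Random();

  @Override
  public void emitTuples()
  {
    // Emit a record shaped like the expected real input.
    output.emit("user" + random.nextInt(100) + ",purchase," + random.nextInt(1000));
  }
}
```

When the real connector becomes available, only the `addOperator`/`addStream` wiring in the DAG needs to change; the downstream operators remain untouched.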

A large project, broken down into a series of smaller and more manageable milestones, roughly involves the following sequence of steps:

  1. Writing the Java code for new or customized operators.
  2. Unit testing (in the IDE, no cluster environment needed).
  3. Integrating the operators into the DAG.
  4. Integration testing (testing the DAG, potentially with mocked data, in the IDE).
  5. Configuring operator properties for the target environment (connector settings, and so on).
  6. End-to-end testing with a realistic data set in the target environment.
  7. Tuning (optimizing resource utilization, and configuring appropriate platform attributes such as processing locality, memory and CPU allocation, scaling, and so on).
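For step 5, operator properties are typically supplied through a configuration file rather than hardcoded, using the `dt.operator.<name>.prop.<property>` naming convention, which maps onto the operator's setter methods. The operator name and property values below are hypothetical:

```xml
<configuration>
  <!-- Settings for the operator added to the DAG as "kafkaInput";
       each entry invokes the corresponding setter on the operator. -->
  <property>
    <name>dt.operator.kafkaInput.prop.topics</name>
    <value>transactions</value>
  </property>
  <property>
    <name>dt.operator.kafkaInput.prop.clusters</name>
    <value>broker1:9092,broker2:9092</value>
  </property>
</configuration>
```

Keeping such settings in configuration makes it easy to maintain separate property files for the IDE, test, and production environments without touching the application code.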

Following a similar sequence ensures that basic functional issues are discovered early (ideally within the IDE environment, where it is far more efficient to debug and fix them) before the pipeline is fully packaged and deployed to a cluster.
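The in-IDE integration testing mentioned in steps 2 and 4 is supported by Apex's `LocalMode`, which runs the full DAG in a single JVM without a cluster. A sketch of such a test, assuming the `Application` class and a test-scoped properties file exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

import com.datatorrent.api.LocalMode;

// Runs the DAG embedded in the test JVM; no cluster environment needed,
// so failures can be debugged directly in the IDE.
public class ApplicationTest
{
  @Test
  public void testApplication() throws Exception
  {
    LocalMode lma = LocalMode.newInstance();
    Configuration conf = new Configuration(false);
    // Test-specific properties, e.g. pointing connectors at mocked data.
    conf.addResource("META-INF/properties.xml");
    lma.prepareDAG(new Application(), conf);
    LocalMode.Controller lc = lma.getController();
    // Run the DAG for a bounded time, then verify the expected output.
    lc.run(10000);
  }
}
```

Once such a test passes locally, the same application package can be deployed unchanged to the target cluster for the end-to-end and tuning phases.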

In subsequent sections, we will look at each of these phases in more detail.
