
Stream processing systems

The first open source stream processing framework in the big data ecosystem was Apache Storm. Since then, several other Apache projects for stream processing have emerged. Next-generation, streaming-first architectures such as Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They not only process data with low latency, but also provide state management (for data that an operation may require across individual events), strong processing guarantees (correctness), fault tolerance, scalability, and high performance.

Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks, and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally important, next-generation frameworks should cater to aspects such as operability, security, and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for successful production launch and uptime.

Streaming can do it all!

Limitations of early stream processing systems led to the so-called Lambda Architecture, essentially a parallel setup of stream and batch processing paths that obtains fast but potentially unreliable results through the stream processor and, in parallel, correct but slow results through a batch processing system such as Apache Hadoop MapReduce:

The fast processing path in the preceding diagram can potentially produce incorrect results, hence the need to recompute the same results in an alternate batch processing path. Correctness issues are caused by earlier technical limitations of stream processing, not by the paradigm itself. For example, if events are processed multiple times or lost, the result is double counting or undercounting, which is a problem for any application that relies on accurate results, such as those in the financial sector.
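The double-counting problem can be illustrated with a minimal sketch. It assumes a hypothetical delivery log in which one event is redelivered after a simulated failure, and it compares a naive counter with one that deduplicates by event ID. The function names and the data are invented for illustration; this is not how Apache Apex or any other framework implements its processing guarantees.

```python
from collections import Counter

def count_naive(deliveries):
    """Count every delivery, including redeliveries of the same event."""
    counts = Counter()
    for _event_id, key in deliveries:
        counts[key] += 1
    return counts

def count_deduplicated(deliveries):
    """Count each event ID at most once, ignoring redeliveries."""
    seen = set()  # assumption: all IDs fit in memory, fine for a toy example
    counts = Counter()
    for event_id, key in deliveries:
        if event_id in seen:
            continue
        seen.add(event_id)
        counts[key] += 1
    return counts

# Hypothetical log: event 2 is delivered twice (replay after a failure).
deliveries = [(1, "payment"), (2, "payment"), (2, "payment"), (3, "refund")]

naive = count_naive(deliveries)
exact = count_deduplicated(deliveries)
print(naive["payment"], exact["payment"])
```

The naive counter reports one payment too many, which is the kind of error that motivated the batch recomputation path. Keeping every event ID in a set does not scale to unbounded streams; it only serves to make the effect visible.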

This setup requires the same functionality to be implemented with two different frameworks, as well as extra infrastructure and operational skills, and therefore results in a longer time to production and a higher Total Cost of Ownership (TCO). With recent advances in stream processing, the Lambda Architecture is no longer necessary. Instead, a unified streaming architecture can be used for reliable processing at a much lower TCO.
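To make the duplication concrete, the following toy sketch models the two Lambda paths as plain functions over an in-memory event list. The names `batch_view`, `speed_view`, and `query`, the timestamp cutoff, and the sample events are assumptions made for illustration, not part of any framework.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch path: recompute counts from the full log up to the cutoff."""
    return Counter(key for ts, key in events if ts <= cutoff)

def speed_view(events, cutoff):
    """Speed path: count only events newer than the last batch run."""
    return Counter(key for ts, key in events if ts > cutoff)

def query(key, batch, speed):
    """Serving step: merge the two views to answer a query."""
    return batch[key] + speed[key]

# Hypothetical (timestamp, key) events; the last batch run covered ts <= 2.
events = [(1, "a"), (2, "b"), (3, "a"), (4, "a")]
batch = batch_view(events, cutoff=2)
speed = speed_view(events, cutoff=2)
print(query("a", batch, speed))
```

Even in this small sketch the counting logic exists twice, once per path, plus a merge step. In a real deployment each copy would live in a different framework, which is the implementation and operational overhead described above.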

This approach based on a single system was outlined in 2014 as Kappa Architecture, and today there are several stream processing technology options, including Apache Apex, that support batch as a special case of streaming.

To learn more about the Kappa Architecture, refer to the following link: https://www.oreilly.com/ideas/questioning-the-lambda-architecture.
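The idea of batch as a special case of streaming can be sketched with a single operator that runs unchanged over bounded and unbounded input. The example below uses a Python generator as a stand-in for a streaming operator; `running_totals` is a hypothetical name, and the sketch says nothing about how Apache Apex represents bounded input internally.

```python
import itertools

def running_totals(stream):
    """Yield the running sum after each event; accepts any iterable."""
    total = 0
    for value in stream:
        total += value
        yield total

# Bounded input (a "batch"): the stream ends, so the last value is final.
batch_result = list(running_totals([3, 1, 4]))

# Unbounded input: itertools.count(1) never ends, so only a prefix is taken.
unbounded = running_totals(itertools.count(1))
first_three = list(itertools.islice(unbounded, 3))

print(batch_result, first_three)
```

The same logic handles both cases. The only difference is whether the input eventually signals its end, which is why a single streaming system can cover workloads that previously required a separate batch framework.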

These newer systems are fault-tolerant, produce correct results, can achieve low latency as well as high throughput, and provide options for enterprise-grade operability and support. Potential users are no longer confronted with the shortcomings that previously justified a parallel batch processing system. We will later see how Apache Apex ensures correct processing, including its support for exactly-once processing.
