官术网_书友最值得收藏!

Stream processing

Stream processing means processing event by event, as soon as it is available. Because there is no waiting for more input after an event arrives, there is no artificially added latency (unlike with batching). This is important for real-time use cases, where information should be processed and results available with minimum latency or delay. However, stream processing is not limited to real-time data. We will see there are benefits to applying this continuous processing in a uniform manner to historical data as well.

Consider data that is stored in a file. By reading the file line by line and processing each line as soon as it is read, subsequent processing steps can be performed while the file is still being read, instead of waiting for the entire input to be read before initiating the next stage. Stream processing is a pipeline, and each item can be acted upon immediately. Apart from low latency, this can also lead to even resource consumption (memory, CPU, network) with steady (versus bursty) throughput, when operations performed inherently don't require any blocking:

An example of a data pipeline

Data flows through the pipeline as inpidual events, and all processing steps are active at the same time. In a distributed system, operations are performed on different nodes and data flows through the system, allowing for parallelism and high throughput. Processing is decentralized and without inherent bottlenecks, in contrast to architectures that attempt to move processing to where the data resides.

Stream processing is a natural fit for how events occur in the real world. Sources generate data continuously (mobile devices, financial transactions, web traffic, sensors, and so on). It therefore makes sense to also process them that way instead of artificially breaking the processing into batches (or micro-batches).

The meaning of real time, or time for fast decision making, varies significantly between businesses. Some use cases, such as online fraud detection, may require processing to complete within milliseconds, but for others multiple seconds or even minutes might be sufficiently fast. In any case, the underlying platform needs to be equipped for fast and correct low-latency processing.

Streaming applications can process data fast, with low latency. Stream processing has gained popularity along with growing demand for faster processing of current data, but it is not a synonym for real-time processing. Input data does not need to be real-time. Older data can also be processed as stream (for example, reading from a file) and results are not always emitted in real-time either. Stream processing can perform operations such as sum, average, or top, that are performed over multiple events before the result becomes available.

To perform such operations, the stream needs to be sliced at temporal boundaries. This is called windowing. It demarcates finite datasets for computations. All data belonging to a window needs to be observed before a result can be emitted and windowing provides these boundaries. There are different strategies to define such windows over a data stream, and these will be covered in Chapter 3, The Apex Library:

Windowing of a stream

In the preceding diagram we see the sum of incoming readings computed over tumbling (non-overlapping) and sliding (overlapping) windows. At the end of each window, the result is emitted.

With windowing, the final result of an operation for a given window is only known after all its data elements are processed. However, many windowed operations still benefit from event-by-event arrival of data and incremental computation. Windowing doesn't always mean that processing can only start once all input has arrived. In our example, the sum can be updated whenever the next event arrives vs. storing all inpidual events and deferring computation until the end of the window. Sometimes, even the intermediate result of a windowed computation is of interest and can be made available for downstream consumption and subsequently refined with the final result.

主站蜘蛛池模板: 新泰市| 新竹市| 河津市| 罗城| 石阡县| 峨山| 台北市| 兴仁县| 芦山县| 建湖县| 乌拉特后旗| 泗阳县| 平利县| 广德县| 沽源县| 大英县| 汉阴县| 盘锦市| 泸溪县| 新宾| 文山县| 深州市| 邛崃市| 祁连县| 嘉荫县| 铜梁县| 油尖旺区| 灌南县| 湖口县| 太保市| 和龙市| 广西| 娱乐| 额济纳旗| 淮阳县| 年辖:市辖区| 左权县| 和顺县| 桑日县| 澄江县| 都昌县|