官术网_书友最值得收藏!

  • Learning Apache Apex
  • Thomas Weise Munagala V. Ramanath David Yan Kenneth Knowles
  • 450字
  • 2021-07-02 22:38:37

Windowing and time

Streams of unbounded data require windowing to establish boundaries to process data and emit results. Processing always occurs in a window and there are different types of windows and strategies to assign inpidual data records to windows.

Often, the relationship of data processed and time is explicit, with the data containing a timestamp identifying the event time or when an event occurred. This is usually the case with streaming sources that emit inpidual events. However, there are also cases where time can be derived from a container. For example, when data arrives batched in hourly files, time may be derived from the file name instead of inpidual records. Sometimes, data may arrive without any timestamp, and the processor at the source needs to assign a timestamp based on arrival time or processing time in order to perform stateful windowed operations.

The windowing support provided by the Apex library largely follows the Apache Beam model. It is flexible and broadly applicable to different use cases. It is also completely different from and not to be confused with the Apex engine's native arrival time based streaming window mechanism.

The streaming window is a processing interval that can be applied to use cases that don't require handling of out-of-order inputs based on event time. It assumes that the stream can be sliced into fixed time intervals (default 500 ms), at which the engine performs callbacks that the operator can use to (globally) perform aggregate operations over multiple records that arrived in that interval.

The intervals are aligned with the internal checkpointing mechanism and suitable for processing optimizations such as flushing files or batching writes to a database. It cannot support transformation and other processing based on event time, because events in the real world don't necessarily arrive in order and perfectly aligned with these internal intervals. The windowing support provided by the Apex library is more flexible and broadly applicable, including for processing based on event time with out-of-order arrival of events.

The preceding example shows a sequence of events, each with a timestamp (t) and key (k) and their processing order. Note the difference between processing and event time. It should be possible to process the same sequence at different times with the same result. That's only possible when the transformations understand event time and are capable maintaining the computational state (potentially multiple open windows at the same time with high key cardinality). The example shows how the state tracks multiple windows (w). Each window has an associated timestamp (4:00 and 5:00 and global for a practically infinite window) and accumulates its own counts regardless of processing time based on the timestamps of the events (and optionally by key).

主站蜘蛛池模板: 合肥市| 邵阳市| 太谷县| 长顺县| 自治县| 淳安县| 东乡| 瓮安县| 永州市| 涟水县| 弥勒县| 恩平市| 锡林浩特市| 颍上县| 博爱县| 普兰县| 岢岚县| 响水县| 凤台县| 防城港市| 保定市| 益阳市| 永宁县| 靖江市| 托克逊县| 福建省| 商城县| 嘉黎县| 科尔| 榆林市| 宜君县| 蒲城县| 安西县| 宝鸡市| 瓮安县| 林周县| 绥宁县| 龙海市| 阿拉尔市| 财经| 德昌县|