Practical Real-time Data Processing and Analytics
Shilpi Saxena, Saurabh Gupta
Collection
Now that we have identified the source of data, along with its characteristics and frequency of arrival, we next need to consider the various collection tools available for tapping live data into the application:
- Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. (Source: https://flume.apache.org/). The salient features of Flume are as follows (a short ingestion sketch appears after this list):
- Streaming reads with buffering: It can easily read streaming data and has built-in failure recovery. Its memory and disk channels handle surges or spikes in incoming data without impacting the downstream processing system.
- Guaranteed delivery: It has a built-in channel mechanism that works on acknowledgments, thus ensuring that the messages are delivered.
- Scalability: Like all other Hadoop components, Flume is easily horizontally scalable.
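A minimal sketch of feeding events into Flume is shown below. It assumes a Flume agent whose HTTP source (with the default JSONHandler) is listening on localhost:44444; the endpoint, port, headers, and event bodies are illustrative assumptions, not part of any specific deployment:

```python
import json
import requests  # third-party HTTP client

# Assumed Flume agent with an HTTP source (default JSONHandler)
# listening on localhost:44444.
FLUME_URL = "http://localhost:44444"

events = [
    {"headers": {"source": "web-01", "type": "access-log"},
     "body": "GET /index.html 200 12ms"},
    {"headers": {"source": "web-01", "type": "access-log"},
     "body": "GET /cart 500 87ms"},
]

# The JSONHandler expects a JSON array of events, each with optional
# headers and a string body; a 200 response means the events reached
# the channel (memory- or file-backed, per the agent configuration).
response = requests.post(FLUME_URL, data=json.dumps(events),
                         headers={"Content-Type": "application/json"})
response.raise_for_status()
```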
- FluentD: FluentD is an open source data collector which lets you unify data collection and consumption for a better use and understanding of data. (Source: http://www.fluentd.org/architecture). The salient features of FluentD are as follows (a short forwarding sketch appears after this list):
- Reliability: It comes with both memory- and file-based channel configurations, which can be selected based on the reliability needs of the use case in consideration.
- Low infrastructure footprint: The component is written in Ruby and C and has a very low memory and CPU footprint.
- Pluggable architecture: Its plugin-based design has led to an ever-growing set of community-contributed plugins and keeps the project growing.
- Uses JSON: It structures data as JSON as much as possible, thus making unification, transformation, and filtering easier.
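The following is a minimal forwarding sketch using the third-party fluent-logger Python package. It assumes a local Fluentd agent with a forward input listening on the default port 24224; the tag and record fields are illustrative assumptions:

```python
from fluent import sender  # from the third-party fluent-logger package

# Assumed setup: a local Fluentd agent with a forward input
# (<source> @type forward) listening on the default port 24224.
logger = sender.FluentSender("app", host="localhost", port=24224)

# Each record is emitted as structured (JSON-like) data; Fluentd tags it
# as "app.access" and routes it to the configured outputs.
if not logger.emit("access", {"user": "demo", "path": "/cart", "status": 500}):
    # emit() returns False if buffering or sending failed; the last error
    # can be inspected and then cleared for troubleshooting.
    print(logger.last_error)
    logger.clear_last_error()

logger.close()
```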
- Logstash: Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite stash (ours is Elasticsearch, naturally). (Source: https://www.elastic.co/products/logstash). The salient features of Logstash are as follows (a short shipping sketch appears after this list):
- Variety: It supports a wide variety of input sources in streaming fashion, ranging from metrics and application logs to real-time sensor data, social media feeds, and so on.
- Filtering the incoming data: Logstash can parse, filter, and transform data on the fly using very low-latency operations. There are often situations where data arriving from a variety of sources needs to be filtered and parsed into a predefined, common format before landing in the broker or stash. Converging on a common format keeps the overall development approach decoupled and easy to work with. Logstash can parse and format highly complex data, and the overall processing time is largely independent of source, format, complexity, or schema.
- It can route the transformed output to a variety of storage, processing, or downstream application systems such as Spark, Storm, HDFS, Elasticsearch, and so on.
- Robust, scalable, and extensible: Developers can choose from a wide variety of available plugins or write their own custom plugins; plugins can be scaffolded using the Logstash plugin generator tool.
- Monitoring API: It enables the developers to tap into the Logstash clusters and monitor the overall health of the data pipeline.
- Security: It provides the ability to encrypt data in motion to ensure that the data is secure.
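As referenced above, here is a minimal sketch of shipping events to Logstash over TCP. It assumes a pipeline configured with a tcp input using the json_lines codec on port 5000; the host, port, and field names are illustrative assumptions:

```python
import json
import socket

# Assumed Logstash pipeline with a tcp input and json_lines codec, e.g.
#   input { tcp { port => 5000 codec => json_lines } }
LOGSTASH_HOST, LOGSTASH_PORT = "localhost", 5000

event = {
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
}

with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as sock:
    # json_lines expects one JSON document per newline-terminated line;
    # Logstash filters can then parse, enrich, and route the event.
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
```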

- Cloud APIs for data collection: This is yet another method of data collection, where most cloud platforms offer a variety of data collection APIs, such as the following (a short Firehose sketch appears after this list):
- Amazon Kinesis Firehose (AWS)
- Google Stackdriver Monitoring API
- Data Collector API
- IBM Bluemix Data Connect API
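To illustrate the cloud-API route, below is a minimal sketch of pushing a record to Amazon Kinesis Firehose with boto3. The delivery stream name, region, and record fields are illustrative assumptions, and valid AWS credentials must already be configured:

```python
import json
import boto3  # AWS SDK for Python

# Assumed region and delivery stream; replace with your own values.
firehose = boto3.client("firehose", region_name="us-east-1")

record = {"sensor_id": "s-42", "temperature_c": 21.7, "ts": "2021-07-08T10:23:10Z"}

# put_record hands a single record to the delivery stream; Firehose then
# batches and delivers it to the configured destination (S3, Redshift,
# Elasticsearch, and so on).
response = firehose.put_record(
    DeliveryStreamName="demo-collection-stream",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
print(response["RecordId"])
```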