- Practical Real-time Data Processing and Analytics
- Shilpi Saxena, Saurabh Gupta
Flume
Flume is one of Apache's best-known projects for log collection and processing. To download it, refer to the following link: https://flume.apache.org/download.html. Download the apache-flume-1.7.0-bin.tar.gz setup file and unzip it, as follows:
cp apache-flume-1.7.0-bin.tar.gz ~/demo/
tar -xvf ~/demo/apache-flume-1.7.0-bin.tar.gz
The extracted folders and files will be as per the following screenshot:
[Screenshot: listing of the extracted apache-flume-1.7.0-bin directory]
We will demonstrate the same example that we executed for the previous tools: reading from a file and pushing the entries to a Kafka topic. First, let's write the Flume configuration file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/ubuntu/demo/flume/tail_dir.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/ubuntu/demo/files/test

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = flume-example
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 6

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume has three components that define a flow. The first is the source, from which the logs or events come. Flume ships with multiple source types, a few of which are Kafka, TAILDIR, and HTTP, and you can also define your own custom source, as sketched after this paragraph. The second component is the sink, the destination where events are delivered and consumed. The third is the channel, which defines the medium between source and sink. The most commonly used channels are memory, file, and Kafka, but there are many more. Here, we will use TAILDIR as the source, Kafka as the sink, and memory as the channel. In the preceding configuration, a1 is the agent name, r1 is the source, k1 is the sink, and c1 is the channel.
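To give a sense of what writing your own custom source involves, here is a minimal, hypothetical sketch against Flume's developer API; the class name GreetingSource and its message property are invented for illustration. Packaged into a JAR on Flume's classpath, it could be referenced by its fully qualified class name in a1.sources.r1.type:

import java.nio.charset.StandardCharsets;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// Hypothetical custom source: emits one fixed message per poll.
public class GreetingSource extends AbstractSource
        implements Configurable, PollableSource {

    private String message;

    @Override
    public void configure(Context context) {
        // "message" is an invented property, read from the agent configuration
        message = context.getString("message", "hello");
    }

    @Override
    public Status process() throws EventDeliveryException {
        // Hand one event to the channel(s) attached to this source
        getChannelProcessor().processEvent(
                EventBuilder.withBody(message, StandardCharsets.UTF_8));
        return Status.READY;
    }

    // Back-off hints used by Flume's polling runner between polls
    @Override
    public long getBackOffSleepIncrement() {
        return 1000L;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 5000L;
    }
}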
Let's start with the source configuration. First of all, you have to define the type of the source using <agent-name>.<sources/sinks/channels>.<alias-name>.type. The next parameter is positionFile, which is required to keep track of the file being tailed. filegroups indicates a set of files to be tailed, and filegroups.<filegroup-name> is the absolute path of the file or directory. The sink configuration is simple and straightforward: the Kafka sink requires the bootstrap servers and the topic name. The channel configuration has many parameters, but here we used only the most important ones: capacity is the maximum number of events stored in the channel, and transactionCapacity is the maximum number of events the channel will take from a source or give to a sink per transaction.
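As an aside, if durability mattered more than throughput, the memory channel above could be swapped for a file channel, which persists events to disk; the directory paths below are illustrative:

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/ubuntu/demo/flume/checkpoint
a1.channels.c1.dataDirs = /home/ubuntu/demo/flume/data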
Now, save the preceding configuration as conf/flume-conf.properties and start the Flume agent using the following command; note that --name must match the agent name a1 used in the configuration:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console
The agent will start, and the output will be as follows:
[Screenshot: Flume agent startup logs]
Create a Kafka topic and name it flume-example:
bin/kafka-topics.sh --create --topic flume-example --zookeeper localhost:2181 --partitions 1 --replication-factor 1
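To verify that the topic was created, you can list the topics registered in ZooKeeper:

bin/kafka-topics.sh --list --zookeeper localhost:2181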
Next, start the Kafka console consumer:
bin/kafka-console-consumer.sh --topic flume-example --bootstrap-server localhost:9092
Now, push some messages into the file /home/ubuntu/demo/files/test, as in the following screenshot:
[Screenshot: appending test messages to /home/ubuntu/demo/files/test]
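As the screenshot is not reproduced here, appending a couple of lines with echo works just as well; the message text is arbitrary:

echo "first flume event" >> /home/ubuntu/demo/files/test
echo "second flume event" >> /home/ubuntu/demo/files/test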
The output from Kafka will be as seen in the following screenshot:
[Screenshot: Kafka console consumer output]
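If you appended the two lines above, the console consumer should print the message bodies, along the lines of:

first flume event
second flume event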