
An overview of Storm

If someone asked me to describe Storm in a line, I would use the well-known statement, "Storm is the Hadoop of real time." Hadoop addresses the volume dimension of Big Data, but it's essentially a batch processing platform; it doesn't offer speed or immediate results and analysis. Though Hadoop has been a turning point for the data storage and computation regime, it cannot be a solution to problems requiring real-time analysis and results.

Storm addresses the velocity aspect of Big Data. This framework provides the capability to execute distributed computations at lightning-fast speed over real-time streaming data, and it's a widely used solution for real-time alerts and analytics over high-velocity streams. Storm is now an Apache project; it is proven and known for its capabilities, and, being under the Apache canopy, it is free and open source. It is a distributed compute engine that is highly scalable, reliable, and fault tolerant, and it comes with a guaranteed processing mechanism. It's capable of processing unbounded streaming data and provides extensive computation and micro-batching facilities on top of it.

Storm provides solutions to a very wide spectrum of use cases, ranging from real-time alerts and business analytics to machine learning and ETL, to name a few.

Storm has a wide variety of input and output integration points. On the input side, it integrates well with leading queuing systems such as RabbitMQ, JMS, Kafka, Kestrel, Amazon Kinesis, and so on. On the output side, it connects well with most traditional databases such as Oracle and MySQL. Its adaptability is not limited to traditional RDBMS systems; Storm also interfaces very well with Big Data stores such as Cassandra, HBase, and so on.

All these capabilities make Storm the most sought-after framework when it comes to providing real-time solutions around high-velocity data. The following diagram captures and describes Storm as a black box:

The journey of Storm

Now that we know about the capabilities of Storm, a little history about this marvel of engineering will show us how great frameworks are built and how small experiments become huge successes. Here is an excerpt of Storm's journey, which I learned from Nathan Marz's blog (http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html).

Storm is Nathan's brainchild, and in his words, there are two important aspects any successful initiative needs to abide by (Nathan said this in the context of Storm, but I see these as generic commandments that apply to the success of any invention):

  • There should be a real problem that this solution/invention can solve effectively
  • There must be a considerable number of people who believe in the solution/invention as a tool to handle the problem

The problem at hand that led to the idea of Storm was the real-time analytics solution being used back in 2010-2011 for analyzing social media data and providing insights to businesses. The question here is: what were the issues with the existing solution that led to the inception of a framework such as Storm? To make a long answer short, as it's said, "all good things come to an end," and so all good solutions run into problems and eventually reach end of life. The existing solution at BackType was very complex; it did not scale effectively, and the business logic was hardwired into the application framework, making life a bigger challenge for developers. They were using a solution wherein data from the Twitter firehose was pushed to a set of queues, and Python processes subscribed to these queues, read and deserialized the messages, and then published them to other queues and workers. Though the application had a distributed set of workers and brokers, the orchestration and management was a very tedious and complicated code segment.

Step 1

That's when the idea struck Nathan Marz, and he came up with a few abstract concepts:

  • Stream: This is a distributed abstraction of endless events, which can be produced and processed in parallel
  • Spout: This is the abstract component that produces the streams
  • Bolt: This is the abstract component that processes the streams

Step 2

Once the abstractions were in place, the next big leap in realizing the idea was figuring out solutions for the following:

  • Eliminating the need for intermediate brokers
  • Solving the dilemma that comes with the promise of guaranteed processing in a real-time paradigm

The solution to these two aspects gave the vision to build the smartest components of this framework. Thereafter, the path to success was planned out; of course, it was not simple, but the vision, thoughts, and ideas were in place.

It took around 5 months to get the first version in place. A few key aspects that were kept in mind right from the beginning are as follows:

  • It was to be open source
  • All APIs were in Java, making it easily accessible to the majority of the community and applications
  • It was developed using Clojure, which made project development fast and efficient
  • It used Thrift data structures, lending usability to non-JVM platforms and applications

In May 2011, BackType was acquired by Twitter, and Storm thus became a prodigy of the Twitter powerhouse. It was officially released in September 2011, was readily adopted by the industry, and was later taken up by majors such as Apache, HDP, Microsoft, and so on.

That was a short synopsis of the long journey of Apache Storm. Storm was a success because of the following six KPIs, which no other tool, technology, or framework could offer in a single package at that point in time:

  • Performance: Storm is best suited for the velocity dimension of Big Data, where data arriving at tremendous rates has to be processed within strict SLAs of the order of seconds.
  • Scalability: This is another dimension that's key to the success of Storm. It provides massively parallel processing with linear scalability, which makes it very easy to build systems that can be scaled up or down depending on the needs of the business.
  • Fail safe: This is another differentiating factor of Storm. Being distributed, it comes with the advantage of fault tolerance: the application won't go down or stop processing if some component or node of the cluster snaps. The same process or work unit will be handled by another worker on another node, making processing seamless even in the event of failure.
  • Reliability: This framework provides the application with the ability to process each event once and exactly once.
  • Ease: This framework is built over very logical abstractions that are easy to understand and comprehend. Thus, adoption of or migration to Storm is simple, and that is one of the key attributes of its phenomenal success.
  • Open source: This is the most lucrative feature that any software framework can offer—a reliable and scalable piece of excellent engineering without the tag of license fees.

After the synopsis of the long journey of Storm, let's understand some of its abstractions before delving deeper into its internals.

Storm abstractions

Now that we understand how Storm was born and how it completed the long journey from BackType Storm to Twitter Storm to Apache Storm, in this section, we'll introduce you to certain abstractions that are part of Storm.

Streams

A stream is one of the most basic abstractions and a core concept of Storm. It's basically an unbounded (no start or end) sequence of data. Storm as a framework assumes that streams can be created in parallel and processed in parallel by a set of distributed components. As an analogy, it can be compared to a stream of water: continuous and never-ending.

In the context of Storm, a stream is an endless sequence of tuples or data. A tuple is data comprising key-value pairs. These streams conform to a schema, which acts as a template for interpreting and understanding the data/tuples in the stream, and it has specifications such as fields. The fields can have primitive data types such as integer, long, bytes, string, Boolean, double, and so on, but Storm also provides a facility for developers to define support for custom data types by writing their own serializers. The relevant constructs are listed below (a short sketch follows this list):

  • Tuple: This is the data that forms streams. It's the core data structure of Storm, and it's an interface under the backtype.storm.tuple package.
  • OutputFieldsDeclarer: This is an interface under backtype.storm.topology. It is used for defining the schemas and declaring corresponding streams.
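
To make the schema idea concrete, here is a minimal sketch, not tied to any particular topology, of how a component declares the fields of its output stream and how an emitted tuple conforms to that schema. The field names "word" and "count" are purely illustrative:

```java
// Minimal sketch: declaring a stream schema and building a matching tuple.
// The field names ("word", "count") are hypothetical illustrations.
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SchemaSketch {

    // Every spout or bolt declares the fields of the tuples it emits.
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Tuples on the default stream will carry two fields: "word" and "count".
        declarer.declare(new Fields("word", "count"));
    }

    // A tuple conforming to that schema is an ordered list of values,
    // interpreted against the declared fields: {word: "storm", count: 1}.
    public Values sampleTuple() {
        return new Values("storm", 1);
    }
}
```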
Topology

The entire Storm application is packaged and bundled together into a Storm topology. An analogy can be drawn with a MapReduce job; the only difference between a Storm topology and a MapReduce job is that the latter terminates while the former runs forever. Yes, you read that correctly: Storm topologies don't end unless terminated, because they are there to process data streams, and streams don't end. Thus, topologies stay alive forever, processing all incoming tuples.

A topology is actually a DAG (short for Directed Acyclic Graph), where spouts and bolts form the nodes and streams are the edges of the graph. Storm topologies are basically Thrift structures with a wrapper layer of Java APIs to access and manipulate them. TopologyBuilder is the template class that exposes the Java API for creating and submitting topologies.
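
As an illustration, here is a minimal sketch of how TopologyBuilder wires spouts and bolts into a DAG and runs it locally. The component classes SentenceSpout and SplitBolt are hypothetical; simple versions of them are sketched in the Spouts and Bolts sections that follow:

```java
// Minimal sketch: building a topology DAG and submitting it to a local cluster.
// SentenceSpout and SplitBolt are hypothetical components used for illustration.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class WordTopologySketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Spouts are the source nodes of the DAG.
        builder.setSpout("sentence-spout", new SentenceSpout(), 2);

        // Bolts subscribe to upstream components via groupings (the DAG edges).
        builder.setBolt("split-bolt", new SplitBolt(), 4)
               .shuffleGrouping("sentence-spout");

        Config conf = new Config();
        conf.setDebug(true);

        // Run the topology in an in-process cluster for local testing.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", conf, builder.createTopology());
    }
}
```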

Spouts

While describing the topology, we touched upon the spout abstraction of Storm. There are two types of node elements in a topology DAG; one of them is the spout, the feeder of data streams into the topology. These are the components that connect to an external source such as a queue, read tuples from it, and push them into the topology for processing. A topology essentially operates on tuples, so the DAG always starts from one or more spouts, which are the source elements that get tuples into the system.

Spouts come in two flavors in the Storm framework: reliable and unreliable, and they function exactly as their names suggest. A reliable spout remembers all the tuples it pushes into the topology through the acknowledgement (acking) mechanism and thus replays failed tuples. This is not possible with an unreliable spout, which doesn't remember or keep track of the tuples it pushes into the topology. The key interfaces and methods are as follows (a minimal spout sketch follows the list):

  • IRichSpout: This interface is used for implementation of spouts.
  • declareStream(): This is a method under the OutputFieldsDeclarer interface and it's used to specify and bind the stream for emission to a spout.
  • emit(): This is a method of the SpoutOutputCollector class, created to expose Java APIs for emitting tuples from IRichSpout instances.
  • nextTuple(): This is the live-wire method of the spout, and it is called continuously. When tuples have been read from the external source, they are emitted into the topology; otherwise, it simply returns.
  • ack(): This method is called by the Storm framework when the emitted tuple is successfully processed by the topology.
  • fail(): This method is called by the spout when the emitted tuple fails to be processed completely by the topology. Reliable spouts line up such failed tuples for replay.
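
Putting these pieces together, here is a minimal spout sketch. It extends BaseRichSpout (a convenience base class that implements IRichSpout) and emits sentences from a hard-coded array purely for illustration; a real spout would read from an external queue such as Kafka or Kestrel:

```java
// Minimal spout sketch: emits sentences from an in-memory array.
// The class name, field name, and data are hypothetical illustrations.
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the cow jumped over the moon",
                                        "an apple a day keeps the doctor away"};
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Called continuously by the framework; emit one tuple per call.
        // The second argument is a message ID, which enables acking/replay.
        collector.emit(new Values(sentences[index]), index);
        index = (index + 1) % sentences.length;
    }

    @Override
    public void ack(Object msgId) {
        // Tuple fully processed by the topology; nothing to replay.
    }

    @Override
    public void fail(Object msgId) {
        // A reliable spout would re-queue the tuple identified by msgId here.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```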
Bolts

Bolts are the second type of node that exists in a Storm topology. These abstract components of the Storm framework are the key players in processing; they are the components that perform the processing in the topology. Thus, they are the hub of all logic, processing, merging, and transformation.

Bolts subscribe to streams for input tuples and, once done with their execution logic, emit the processed tuples to streams. These processing powerhouses have the capability to subscribe and emit to multiple streams (a minimal bolt sketch follows the list):

  • declareStream(): This method from OutputFieldsDeclarer is used to declare all the streams a particular bolt will emit to. Please note that a bolt emitting to a stream works identically to a spout emitting to a stream.
  • emit(): Similar to a spout, a bolt also uses an emit method to emit into the stream, but the difference here is that the bolt calls the emit method from OutputCollector.
  • InputDeclarer: This interface has methods for defining the various groupings (subscriptions to streams) that control the binding of bolts to input streams, thus ensuring bolts read tuples from the streams based on the grouping definitions.
  • execute(): This method holds the crux of all processing, and it is executed for each input tuple of the bolt. Once the bolt completes processing, it calls the ack() method.
  • IRichBolt and IBasicBolt: These are the two interfaces provided by the Storm framework for implementation of bolts.
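
Here is a matching minimal bolt sketch. It extends BaseRichBolt (which implements IRichBolt) and splits incoming sentences into words; the stream and field names are hypothetical and line up with the spout sketch shown earlier:

```java
// Minimal bolt sketch: splits each incoming sentence into words.
// Field names ("sentence", "word") are hypothetical illustrations.
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Called once for every input tuple the bolt subscribes to.
        for (String word : input.getStringByField("sentence").split(" ")) {
            // Anchor the emitted tuple to the input tuple for reliable processing.
            collector.emit(input, new Values(word));
        }
        collector.ack(input);  // acknowledge successful processing
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```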

Now that we have understood the basic paradigms and their functions with respect to Storm, the following diagram summarizes all of it:

Tasks

Within bolts and spouts, the actual processing work is broken into smaller items that are executed or computed in parallel. These threads of execution that actually perform the computation within the bolts or spouts are called tasks. A single bolt or spout can spawn any number of tasks (of course, each node has a limit on resources in terms of RAM and CPU, but the framework itself does not impose any limit).
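
As a sketch of how this looks in code, reusing the hypothetical components from the earlier sketches: the parallelism hint passed to setSpout()/setBolt() controls how many parallel executions run the component, and setNumTasks() sets the total number of task instances spread across them:

```java
// Minimal sketch: controlling the number of tasks for a component.
// SentenceSpout and SplitBolt are the hypothetical components from earlier sketches.
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout(), 2);   // parallelism hint of 2
        builder.setBolt("split-bolt", new SplitBolt(), 4)             // parallelism hint of 4...
               .setNumTasks(8)                                        // ...executing 8 tasks in total
               .shuffleGrouping("sentence-spout");
        return builder;
    }
}
```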

Workers

These are the processes that are spawned to cater to the execution of the topology. Each worker executes in isolation from the other workers in the topology; to achieve this, each worker runs in a separate JVM.
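
A minimal sketch of how worker processes are requested for a topology via the Config API; the count of three is an arbitrary illustration:

```java
// Minimal sketch: asking the cluster to run a topology across 3 worker JVMs.
import backtype.storm.Config;

public class WorkerConfigSketch {
    public static Config workerConfig() {
        Config conf = new Config();
        conf.setNumWorkers(3);  // each worker is a separate JVM process
        return conf;
    }
}
```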
