
Back to our sentiment analysis of Twitter hashtags project

The quick data pipeline prototype we built gave us a good understanding of the data, but then we needed to design a more robust architecture and make our application enterprise-ready. Our primary goal was still to gain experience in building data analytics, and not spend too much time on the data engineering part. This is why we tried to leverage open source tools and frameworks as much as possible:

  • Apache Kafka: to process the high volume of tweets in a reliable and fault-tolerant way.
  • Apache Spark: to provide a programming interface that abstracts the complexity of parallel computing.
  • Jupyter Notebooks: to let users remotely connect to a computing environment (Kernel) and create advanced data analytics. Jupyter Kernels support a variety of programming languages (Python, R, Java/Scala, and so on) as well as multiple computing frameworks (Apache Spark, Hadoop, and so on).
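The publish/subscribe pattern at the heart of this architecture can be illustrated with a minimal in-memory sketch. This is a toy stand-in for Kafka, not its actual API; the `Broker` class and the `"tweets"` topic name are invented for illustration:

```python
from collections import defaultdict

class Broker:
    """Toy stand-in for a Kafka broker: topics group events together,
    and every subscriber to a topic receives each published event."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self._subscribers[topic]:
            callback(event)

broker = Broker()
received = []
broker.subscribe("tweets", received.append)   # downstream consumer
broker.publish("tweets", {"text": "hello"})   # upstream producer
```

The point of the pattern is decoupling: the producer only knows the topic name, never the consumers, which is what lets components 1, 2, and 4 of the pipeline evolve independently.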

For the sentiment analysis part, we decided to replace the code we wrote using the textblob Python library with the Watson Tone Analyzer service, which provides sentiment analysis including detection of emotional, language, and social tones. Even though the Tone Analyzer is not open source, a free version that can be used for development and trial is available on IBM Cloud (https://www.ibm.com/cloud).
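Before the tone scores can be merged into each tweet record, the service's nested JSON response has to be flattened. The response shape below is a simplified assumption modeled on the service's documented format, not a captured response; consult the IBM Cloud documentation for the exact fields:

```python
import json

# Simplified, assumed response shape (real responses are richer).
sample_response = json.loads("""
{
  "document_tone": {
    "tones": [
      {"tone_id": "joy", "tone_name": "Joy", "score": 0.82},
      {"tone_id": "analytical", "tone_name": "Analytical", "score": 0.61}
    ]
  }
}
""")

def flatten_tones(response):
    """Turn the nested tone list into a flat {tone_id: score} dict."""
    return {t["tone_id"]: t["score"]
            for t in response["document_tone"]["tones"]}

scores = flatten_tones(sample_response)
```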

Our architecture now looks like this:

Twitter sentiment analysis data pipeline architecture

In the preceding diagram, we can break down the workflow into the following steps:

  1. Produce a stream of tweets and publish them into a Kafka topic, which can be thought of as a channel that groups events together. In turn, a receiver component can subscribe to this topic/channel to consume these events.
  2. Enrich the tweets with emotional, language, and social tone scores: use Spark Streaming to subscribe to Kafka topics from component 1 and send the text to the Watson Tone Analyzer service. The resulting tone scores are added to the data for further downstream processing. This component was implemented using Scala and, for convenience, was run using a Jupyter Scala Notebook.
  3. Data analysis and exploration: For this part, we decided to go with a Python Notebook simply because Python offers a more attractive ecosystem of libraries, especially around data visualization.
  4. Publish results back to Kafka.
  5. Implement a real-time dashboard as a Node.js application.
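The enrichment in step 2 was implemented in Scala with Spark Streaming; purely to illustrate the logic of that step in a self-contained way, here is a minimal Python sketch with the Watson call stubbed out (the `fake_tone_scores` helper is invented, not the real service client):

```python
def fake_tone_scores(text):
    """Stub for the Watson Tone Analyzer call; a real implementation
    would send `text` to the service and parse its JSON response."""
    return {"anger": 0.1, "joy": 0.7}

def enrich_tweet(tweet, analyze=fake_tone_scores):
    """Step 2: add tone scores to a tweet record for downstream processing,
    leaving the original record untouched."""
    enriched = dict(tweet)
    enriched.update(analyze(tweet["text"]))
    return enriched

tweet = {"user": "@alice", "text": "Loving this conference!"}
enriched = enrich_tweet(tweet)
```

In the actual pipeline this per-record function would run inside the Spark Streaming job subscribed to the Kafka topic from step 1.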

With a team of three people, it took us about 8 weeks to get the dashboard working with real-time Twitter sentiment data. There are multiple reasons for this seemingly long time:

  • Some of the frameworks and services, such as Kafka and Spark Streaming, were new to us and we had to learn how to use their APIs.
  • The dashboard frontend was built as a standalone Node.js application using the Mozaïk framework (https://github.com/plouc/mozaik), which made it easy to build powerful live dashboards. However, we found a few limitations with the code, which forced us to dive into the implementation and write patches, hence adding delays to the overall schedule.

The results are shown in the following screenshot:

Twitter sentiment analysis real-time dashboard
