- Data Analysis with Python
- David Taieb
- 2021-06-11 13:31:43
Back to our sentiment analysis of Twitter hashtags project
The quick data pipeline prototype we built gave us a good understanding of the data, but then we needed to design a more robust architecture and make our application enterprise ready. Our primary goal was still to gain experience in building data analytics, and not spend too much time on the data engineering part. This is why we tried to leverage open source tools and frameworks as much as possible:
- Apache Kafka: to process the high volume of tweets in a reliable and fault-tolerant way.
- Apache Spark: provides a programming interface that abstracts the complexity of parallel computing.
- Jupyter Notebooks: let users remotely connect to a computing environment (kernel) to create advanced data analytics. Jupyter kernels support a variety of programming languages (Python, R, Java/Scala, and so on) as well as multiple computing frameworks (Apache Spark, Hadoop, and so on).
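At the heart of this stack is Kafka's publish/subscribe model. As a minimal sketch of that pattern (an in-memory toy with hypothetical names, not a Kafka client; a real deployment would use a Kafka client library against a running broker):

```python
# Toy publish/subscribe broker: one queue per subscriber per topic.
# Illustrates the pattern only; Kafka adds durability, partitioning,
# and fault tolerance on top of this idea.
from collections import defaultdict, deque

class MiniBroker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new subscriber queue on a topic and return it."""
        q = deque()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic, event):
        """Deliver an event to every queue subscribed to the topic."""
        for q in self._subscribers[topic]:
            q.append(event)

broker = MiniBroker()
inbox = broker.subscribe("tweets")
broker.publish("tweets", {"text": "Loving this conference! #data"})
print(inbox.popleft()["text"])  # the published tweet text
```

Each subscriber gets its own queue, so multiple downstream components can consume the same stream independently, which is exactly how the enrichment and dashboard components both read from the tweet topic.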
For the sentiment analysis part, we decided to replace the code we wrote using the textblob Python library with the Watson Tone Analyzer service, which provides sentiment analysis including detection of emotional, language, and social tones. Even though the Tone Analyzer is not open source, a free version that can be used for development and trial is available on IBM Cloud (https://www.ibm.com/cloud).
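Conceptually, the enrichment step just merges tone scores into each tweet record. A minimal sketch, assuming a flat score dictionary (the field names and values below are made up for illustration, not the actual Tone Analyzer payload):

```python
# Hypothetical enrichment helper: copy a tweet record and flatten
# tone scores into it for downstream processing.
def enrich_tweet(tweet, tone_scores):
    enriched = dict(tweet)          # keep the original record untouched
    for tone, score in tone_scores.items():
        enriched[tone] = score
    return enriched

tweet = {"text": "Loving this conference! #data", "user": "alice"}
scores = {"joy": 0.91, "anger": 0.02, "analytical": 0.35}  # made-up values
print(enrich_tweet(tweet, scores)["joy"])  # -> 0.91
```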
Our architecture now looks like this:

Twitter sentiment analysis data pipeline architecture
In the preceding diagram, we can break down the workflow into the following steps:
- Produce a stream of tweets and publish them into a Kafka topic, which can be thought of as a channel that groups events together. In turn, a receiver component can subscribe to this topic/channel to consume these events.
- Enrich the tweets with emotional, language, and social tone scores: use Spark Streaming to subscribe to Kafka topics from component 1 and send the text to the Watson Tone Analyzer service. The resulting tone scores are added to the data for further downstream processing. This component was implemented using Scala and, for convenience, was run using a Jupyter Scala Notebook.
- Data analysis and exploration: For this part, we decided to go with a Python Notebook simply because Python offers a more attractive ecosystem of libraries, especially around data visualization.
- Publish results back to Kafka.
- Implement a real-time dashboard as a Node.js application.
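The five steps above can be sketched end to end as a chain of stages (a toy in-memory run with hypothetical helper names and a stubbed tone score; the real pipeline spans Kafka, Spark Streaming, the Tone Analyzer, and a Node.js dashboard):

```python
def produce():                      # step 1: produce a stream of tweets
    return [{"text": "Great keynote! #ai"}, {"text": "Long queues :( #ai"}]

def enrich(tweets):                 # step 2: add (stubbed) tone scores
    return [{**t, "joy": 0.9 if "!" in t["text"] else 0.1} for t in tweets]

def analyze(tweets):                # step 3: simple aggregate analysis
    return sum(t["joy"] for t in tweets) / len(tweets)

def publish(result):                # step 4: publish results downstream
    return {"avg_joy": result}

def render_dashboard(payload):      # step 5: render for the dashboard
    return f"avg joy: {payload['avg_joy']:.2f}"

print(render_dashboard(publish(analyze(enrich(produce())))))  # avg joy: 0.50
```

The value of the architecture is that each stage is decoupled by the Kafka topics between them, so any stage can be rewritten (as we did when swapping textblob for the Tone Analyzer) without touching the others.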
With a team of three people, it took us about 8 weeks to get the dashboard working with real-time Twitter sentiment data. There are multiple reasons for this seemingly long time:
- Some of the frameworks and services, such as Kafka and Spark Streaming, were new to us and we had to learn how to use their APIs.
- The dashboard frontend was built as a standalone Node.js application using the Mozaïk framework (https://github.com/plouc/mozaik), which made it easy to build powerful live dashboards. However, we found a few limitations with the code, which forced us to dive into the implementation and write patches, hence adding delays to the overall schedule.
The results are shown in the following screenshot:

Twitter sentiment analysis real-time dashboard