- Practical Real-time Data Processing and Analytics
- Shilpi Saxena Saurabh Gupta
- 2021-07-08 10:23:08
Data collection
This is where the journey of all data processing begins. Whether the processing is batch or real-time, the foremost challenge is getting the data from its source to the systems that process it. We can treat the processing unit as a black box, and the data sources and consumers as its publishers and subscribers. This is captured in the following diagram:
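As a rough code-level companion to that picture, the sketch below models the processing unit as a black box that sources publish into and consumers subscribe to. It is only an in-memory illustration: the `ProcessingUnit` class, its `publish`, `subscribe`, and `drain` methods, and the sample records are all hypothetical names chosen for this example, not part of any particular collection tool.

```python
import queue
from typing import Any, Callable, List


class ProcessingUnit:
    """Hypothetical black box: sources publish records, consumers subscribe to results."""

    def __init__(self, transform: Callable[[Any], Any]) -> None:
        self._inbox: "queue.Queue[Any]" = queue.Queue()
        self._subscribers: List[Callable[[Any], None]] = []
        self._transform = transform  # the opaque processing step

    def publish(self, record: Any) -> None:
        """Called by a data source to hand a record to the processing unit."""
        self._inbox.put(record)

    def subscribe(self, callback: Callable[[Any], None]) -> None:
        """Called by a consumer that wants to receive processed records."""
        self._subscribers.append(callback)

    def drain(self) -> int:
        """Process everything currently queued and fan results out to subscribers."""
        processed = 0
        while True:
            try:
                record = self._inbox.get_nowait()
            except queue.Empty:
                break
            result = self._transform(record)
            for callback in self._subscribers:
                callback(result)
            processed += 1
        return processed


# Usage: one source publishing two records, one consumer collecting results.
received: List[str] = []
unit = ProcessingUnit(transform=str.upper)
unit.subscribe(received.append)
unit.publish("sensor-1,21.5")
unit.publish("sensor-2,19.0")
count = unit.drain()
```

A real deployment would replace the in-process queue with a durable, distributed transport, but the roles of publisher, black box, and subscriber stay the same.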

The key criteria for data collection tools, in the general context of big data and for real-time processing specifically, are as follows:
- Performance and low latency
- Scalability
- Ability to handle structured and unstructured data
Apart from these, any data collection tool should be able to handle data from a variety of sources, such as:
- Data from traditional transactional systems: The industry has been collecting and collating data in traditional warehouses for a long time. This data may sit in sequential files on tape or in systems such as Oracle, Teradata, and Netezza. So, when starting with a real-time application and its associated data collection, system architects have three options:
  - Duplicate the ETL process of these traditional systems and tap the data from the source
  - Tap the data from these ETL systems
  - Adopt a virtual data lake architecture for data replication, which is the third and better approach
- Structured data from IoT devices, sensors, or call detail records (CDRs): This data arrives at a very high velocity and in a fixed format, and it can come from a variety of sensors and telecom devices. The main challenge in collecting and ingesting it is the variety of sources and the speed of arrival, so collection tools should be able to handle both the variety and the velocity. One advantage of this kind of data for downstream processing is that the formats are fairly standardized and fixed.
- Unstructured data from media files, text data, social media, and so on: This is the most complex of all incoming data, and the complexity comes from the dimensions of volume, velocity, variety, and structure. The data formats may vary widely and may be non-text, such as audio or video. Data collection tools should be capable of collecting this data and assimilating it for processing.
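To make the contrast between fixed-format and unstructured sources concrete, here is a minimal sketch of a collector that wraps both kinds of input in a common envelope before handing them on for processing. Everything in it is an assumption made for illustration: the `Envelope` type, the `collect_cdr` and `collect_blob` helpers, and especially the four-field comma-separated CDR layout, which does not correspond to any real telecom format.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical fixed layout for a CDR line; real layouts are vendor-specific.
CDR_FIELDS = ("caller", "callee", "start_time", "duration_s")


@dataclass
class Envelope:
    """Common wrapper so downstream processing sees one shape for all sources."""
    source: str
    kind: str                          # "structured" or "unstructured"
    fields: Optional[Dict[str, str]]   # parsed fields, only for structured data
    payload: bytes                     # raw bytes as received


def collect_cdr(line: str, source: str) -> Envelope:
    """Parse a fixed-format record; reject lines that do not match the layout."""
    values = line.strip().split(",")
    if len(values) != len(CDR_FIELDS):
        raise ValueError(f"expected {len(CDR_FIELDS)} fields, got {len(values)}")
    return Envelope(source, "structured", dict(zip(CDR_FIELDS, values)),
                    line.encode("utf-8"))


def collect_blob(data: bytes, source: str) -> Envelope:
    """Accept opaque content (audio, video, free text) without interpreting it."""
    return Envelope(source, "unstructured", None, data)


# Usage: one structured record and one opaque media fragment.
cdr = collect_cdr("15550100,15550199,2021-07-08T10:23:08,42\n", "telecom-switch")
clip = collect_blob(b"\x00\x01\x02", "media-upload")
```

The structured path can validate and parse at collection time because the format is fixed, while the unstructured path only carries the bytes along and leaves interpretation to later stages.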