Mastering Spark for Data Science
Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Chapter 2. Data Acquisition
One of a data scientist's most important tasks is loading data into the data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data. We walk through a configuration and demonstrate how it delivers vital feed-management information under a variety of running conditions.
Readers will learn how to construct a content register and use it to track all input loaded into the system and to deliver metrics on ingestion pipelines, so that these flows can run reliably as automated, lights-out processes.
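As a taste of what follows, a content register can be as simple as an append-only table with one row per ingested file. The sketch below is illustrative only, assuming a plain case class model and Parquet storage; the names `ContentRegisterEntry`, `ContentRegister.record`, and `registerPath` are hypothetical, not the framework built later in this chapter.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// One row per ingested file, so downstream jobs and dashboards
// can audit exactly what was loaded, when, and with what outcome.
case class ContentRegisterEntry(
  feedName: String,      // logical name of the input feed
  sourceUri: String,     // where the raw file was picked up
  ingestTime: Timestamp, // when the pipeline processed it
  recordCount: Long,     // rows successfully parsed
  status: String         // e.g. "OK" or "FAILED"
)

object ContentRegister {
  // Append one entry per ingested file to a Parquet-backed register.
  def record(spark: SparkSession, entry: ContentRegisterEntry, registerPath: String): Unit = {
    import spark.implicits._
    Seq(entry).toDS()
      .write
      .mode("append")
      .parquet(registerPath)
  }
}
```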
In this chapter, we will cover the following topics:
- Introducing the Global Database of Events, Language, and Tone (GDELT) dataset
- Data pipelines
- Universal ingestion framework
- Real-time monitoring for new data
- Receiving streaming data via Kafka (a minimal sketch follows this list)
- Registering new content and vaulting for tracking purposes
- Visualization of content metrics in Kibana to monitor ingestion processes and data health
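To illustrate the Kafka topic above, here is a minimal sketch of subscribing to a topic with Spark Structured Streaming and landing the raw payload for downstream processing. It assumes the `spark-sql-kafka-0-10` package is on the classpath; the topic name `gdelt`, the broker address, and the landing paths are placeholders, and the chapter itself may take a different approach.

```scala
import org.apache.spark.sql.SparkSession

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gdelt-ingest")
      .getOrCreate()

    // Read raw records as they arrive on the "gdelt" topic (placeholder name).
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "gdelt")
      .load()

    // Kafka delivers key/value as binary; cast the payload to a string
    // before handing it to the parsing stage of the pipeline.
    val lines = stream.selectExpr("CAST(value AS STRING) AS line")

    // Land the raw lines in a staging area; paths here are placeholders.
    val query = lines.writeStream
      .format("parquet")
      .option("path", "/data/raw/gdelt")
      .option("checkpointLocation", "/data/chk/gdelt")
      .start()

    query.awaitTermination()
  }
}
```

Casting the binary `value` column to a string keeps the parsing stage decoupled from Kafka, so the same downstream code can serve batch and streaming feeds alike.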