
Chapter 2. Data Acquisition

One of a data scientist's most important tasks is loading data into the data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data. We walk through a configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.

Readers will learn how to construct a content register and use it to track all input loaded into the system and to deliver metrics on the ingestion pipelines, so that these flows can be run reliably as an automated, lights-out process.
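To make the idea concrete before the chapter's full Spark-based implementation, a content register can be pictured as a keyed store of per-feed ingestion records from which metrics are derived. The sketch below is purely illustrative; the class and field names (`ContentRegister`, `FeedRecord`, `register`, `metrics`) are hypothetical, not the book's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative sketch only: names and fields are assumptions, not the
# chapter's actual implementation, which is built on Spark.
@dataclass
class FeedRecord:
    source: str        # where the content came from, e.g. a feed name
    ingested_at: str   # ingestion timestamp (ISO 8601, UTC)
    records: int       # number of records loaded in this run
    status: str = "ok" # outcome of the ingestion run

@dataclass
class ContentRegister:
    feeds: dict = field(default_factory=dict)

    def register(self, feed_id: str, source: str, records: int,
                 status: str = "ok") -> None:
        """Record an ingestion run so metrics can be reported later."""
        self.feeds[feed_id] = FeedRecord(
            source=source,
            ingested_at=datetime.now(timezone.utc).isoformat(),
            records=records,
            status=status,
        )

    def metrics(self) -> dict:
        """Aggregate metrics over everything ingested so far."""
        return {
            "feeds": len(self.feeds),
            "records": sum(f.records for f in self.feeds.values()),
            "failures": sum(1 for f in self.feeds.values()
                            if f.status != "ok"),
        }
```

A monitoring dashboard (such as the Kibana views discussed later) would be fed from exactly this kind of aggregate: number of feeds seen, total records loaded, and how many runs failed.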

In this chapter, we will cover the following topics:

  • The Global Database of Events, Language, and Tone (GDELT) dataset
  • Data pipelines
  • A universal ingestion framework
  • Real-time monitoring for new data
  • Receiving streaming data via Kafka
  • Registering new content and vaulting it for tracking purposes
  • Visualizing content metrics in Kibana to monitor ingestion processes and data health