Introduction
In the previous chapter, we discussed the layers of a data-driven system and explained the important storage requirements for each layer. The storage containers in the data layers of AI solutions serve one main purpose: to build and train models that can run in a production environment. In this chapter, we will discuss how to transfer data between the layers in a pipeline so that it is ready to train a model that produces an actual forecast (a step known as the execution, or scoring, of the model).
In an Artificial Intelligence (AI) system, data is continuously updated. Once data enters the system via an upload, an application programming interface (API), or a data stream, it has to be stored securely and typically goes through a few extract, transform, and load (ETL) steps. In systems that handle streaming data, the incoming data has to be directed into a stable and usable data pipeline. Data transformations have to be managed, scheduled, and orchestrated. Furthermore, the lineage of the data has to be stored so that the origins of any data point in a report or application can be traced back. This chapter explains the data preparation (sometimes called pre-processing) mechanisms that ensure raw data can be used for machine learning by data scientists. This is important since raw data is rarely in a form that models can consume directly. We will elaborate on the architecture and technology as explained by the layered model in Chapter 1, Data Storage Fundamentals. To start with, let's dive into the details of ETL.
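To make the ETL idea concrete before we go deeper, here is a minimal sketch of a single extract-transform-load pass in Python with pandas. The file names and column names (`raw_sales.csv`, `amount`, and so on) are hypothetical placeholders used purely for illustration:

```python
import pandas as pd

# Extract: read raw data from the landing zone
# (hypothetical file with columns order_id, amount, country).
raw = pd.read_csv("raw_sales.csv")

# Transform: clean and reshape the data so a model can consume it.
clean = (
    raw.dropna(subset=["amount"])                          # drop rows missing the value
       .assign(amount=lambda df: df["amount"].astype(float))  # enforce a numeric type
       .query("amount > 0")                                # remove obviously invalid rows
)

# Load: write the prepared data to the next storage layer
# (Parquet output requires the pyarrow or fastparquet package).
clean.to_parquet("prepared_sales.parquet", index=False)
```

In a real pipeline, a step like this would be scheduled and orchestrated rather than run by hand, and each run would be recorded so that the lineage of the output can be traced back to the raw input.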