- Machine Learning with Spark(Second Edition)
- Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
- 2021-07-09 21:07:54
Data ingestion and storage
The first step in our machine learning pipeline is to ingest the data that we require for training our models. Like many other businesses, MovieStream's data is typically generated by user activity, by other systems (commonly referred to as machine-generated data), and by external sources (for example, the time of day and the weather during a particular user's visit to the site).
This data can be ingested in various ways, for example, gathering user activity data from the browser and mobile application event logs or accessing external web APIs to collect data on geolocation or weather.
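As a rough illustration of ingesting user activity from event logs, the following minimal sketch parses newline-delimited JSON events and counts events per user. The log format, the field names (`user_id`, `event`, `title`), and the sample records are assumptions made for this example; they are not MovieStream's actual schema.

```python
import json
from collections import Counter

# Hypothetical newline-delimited JSON event log.
# The field names and values are illustrative assumptions only.
SAMPLE_LOG = """\
{"user_id": 1, "event": "page_view", "title": "Movie A"}
{"user_id": 2, "event": "play", "title": "Movie B"}
not valid json
{"user_id": 1, "event": "play", "title": "Movie A"}
"""


def parse_events(lines):
    """Yield one dict per well-formed line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Real logs often contain truncated or corrupt lines; skip them here.
            continue
        if isinstance(record, dict):
            yield record


events = list(parse_events(SAMPLE_LOG.splitlines()))
events_per_user = Counter(e["user_id"] for e in events)
print(len(events), dict(events_per_user))
```

Skipping malformed lines silently keeps the sketch short; a production ingestion job would more likely count or quarantine such lines so that data loss is visible.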
Once the collection mechanisms are in place, the data usually needs to be stored. This includes the raw data, data resulting from intermediate processing, and final model results to be used in production.
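One simple way to picture these three kinds of stored data is as separate areas of a storage location. The sketch below uses a temporary local directory as a stand-in for a real store; the directory names (`raw`, `intermediate`, `models`), the file names, and the sample ratings are hypothetical, and the "model" is only a placeholder global mean.

```python
import json
import tempfile
from pathlib import Path


def write_json(path, payload):
    """Write payload as JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path


# A temporary local directory stands in for the real storage system.
root = Path(tempfile.mkdtemp())

# 1. Raw data, stored exactly as collected (hypothetical sample records).
raw = [
    {"user_id": 1, "rating": 4.0},
    {"user_id": 1, "rating": 2.0},
    {"user_id": 2, "rating": 5.0},
]
write_json(root / "raw" / "ratings.json", raw)

# 2. Intermediate data: mean rating per user, derived from the raw records.
ratings_by_user = {}
for record in raw:
    ratings_by_user.setdefault(record["user_id"], []).append(record["rating"])
mean_by_user = {str(u): sum(v) / len(v) for u, v in ratings_by_user.items()}
write_json(root / "intermediate" / "mean_rating_by_user.json", mean_by_user)

# 3. Final model results: a placeholder "model" consisting of the global mean.
global_mean = sum(r["rating"] for r in raw) / len(raw)
write_json(root / "models" / "baseline.json", {"global_mean": global_mean})

stored = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.json"))
print(stored)
```

Keeping the raw data untouched and writing derived data elsewhere means that intermediate results and models can be regenerated if the processing logic changes.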
Data storage can be complex and can involve a wide variety of systems: filesystems such as HDFS and Amazon S3; SQL databases such as MySQL or PostgreSQL; distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB; search engines such as Solr or Elasticsearch; and streaming data systems such as Kafka, Flume, or Amazon Kinesis.
For the purposes of this book, we will assume that the relevant data is available to us, so we will focus on the processing and modeling steps in the following pipeline.