
Data ingestion and storage

The first step in our machine learning pipeline will be taking in the data that we require for training our models. Like many other businesses, MovieStream's data is typically generated by user activity, other systems (this is commonly referred to as machine-generated data), and external sources (for example, the time of day and weather during a particular user's visit to the site).

This data can be ingested in various ways, for example, gathering user activity data from the browser and mobile application event logs or accessing external web APIs to collect data on geolocation or weather.
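As a rough illustration of the event-log case, the following minimal sketch parses a handful of log lines into records. It assumes the logs are written as one JSON object per line; the field names (`user_id`, `event`, `movie_id`, `rating`, `ts`) and the sample data are hypothetical and are not MovieStream's actual log format.

```python
import json
from collections import Counter

# Hypothetical sample of an application event log, one JSON object per line.
# The field names are assumptions made for illustration only.
SAMPLE_LOG = """\
{"user_id": 1, "event": "page_view", "movie_id": 42, "ts": "2014-01-01T10:00:00"}
{"user_id": 2, "event": "rating", "movie_id": 42, "rating": 4, "ts": "2014-01-01T10:05:00"}
not valid json
{"user_id": 1, "event": "rating", "movie_id": 7, "rating": 5, "ts": "2014-01-01T10:07:00"}
"""


def parse_events(lines):
    """Yield one dict per well-formed line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Real logs often contain truncated or corrupt lines; skip them here.
            continue
        if isinstance(event, dict):
            yield event


events = list(parse_events(SAMPLE_LOG.splitlines()))
counts = Counter(e["event"] for e in events)
print(len(events), dict(counts))
```

Skipping malformed lines rather than failing is one possible design choice; a production pipeline might instead route such lines to a separate location for later inspection.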

Once the collection mechanisms are in place, the data usually needs to be stored. This includes the raw data, data resulting from intermediate processing, and final model results to be used in production.

Data storage can be complex and involve a wide variety of systems, including filesystems such as HDFS and Amazon S3; SQL databases such as MySQL or PostgreSQL; distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB; search engines such as Solr or Elasticsearch; and streaming data systems such as Kafka, Flume, or Amazon Kinesis.

For the purposes of this book, we will assume that the relevant data is available to us, so we will focus on the processing and modeling steps in the following pipeline.
