
ML project pipeline

Most of the content available on ML projects, whether in books, blogs, or tutorials, explains the mechanics of machine learning in terms of splitting the available dataset into training, validation, and test datasets. Models are built using the training dataset, and they are improved iteratively through hyperparameter tuning on the validation data. Once a model has been built and improved to an acceptable point, it is tested for goodness on unseen test data and the results are reported. Most publicly available content ends at this point.
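The train, validation, and test workflow described above can be sketched in a few lines. This is a minimal illustration only, not a method taken from any particular tutorial: the toy dataset, the 60/20/20 split, the helper names (`split_dataset`, `knn_predict`, `accuracy`), and the choice of a one-feature k-nearest-neighbours classifier are all assumptions made for the example.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and split them into train, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (single feature)."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > len(neighbors) else 0


def accuracy(train, data, k):
    """Fraction of rows in data that the k-NN model labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


# Hypothetical toy data: one numeric feature; the label is 1 when it exceeds 50.
rows = [(float(x), 1 if x > 50 else 0) for x in range(100)]
train, val, test = split_dataset(rows)

# Hyperparameter tuning: choose k using the validation data only.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Final check on unseen test data; this is the figure that gets reported.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

The test rows are touched only once, after `best_k` has been fixed, which is what keeps the reported figure an estimate of performance on unseen data.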

In reality, ML projects in a business setting go beyond this step. If one stops at testing and reporting the performance of a built model, the model is of no real use for making predictions on data that arrives in the future. The point of building a model is to deploy it in production and obtain predictions on new data so that the business can take appropriate action.

In a nutshell, the model needs to be saved and reused. This also means that any new data on which predictions are to be made needs to be preprocessed in the same way as the training data. This ensures that the new data has the same number and types of columns as the training data. This productionalization of models built in the lab is largely ignored when ML is taught. This section covers an end-to-end pipeline, from data preprocessing, through building the models in the lab, to productionalization of the models.
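As a rough sketch of what saving and reusing a model might look like, the example below stores the preprocessing parameters together with the model so that new data is transformed exactly as the training data was. Everything here is assumed for illustration: the function names `fit_pipeline` and `predict`, the toy rows and labels, the nearest-centroid classifier, and the use of a JSON file as the storage format.

```python
import json
import os
import tempfile


def fit_pipeline(rows, labels):
    """Learn scaling parameters and a nearest-centroid model from training data."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    grouped = {}
    for r, y in zip(scaled, labels):
        grouped.setdefault(y, []).append(r)
    centroids = {y: [sum(c) / len(rs) for c in zip(*rs)]
                 for y, rs in grouped.items()}
    return {"means": means, "stds": stds, "centroids": centroids}


def predict(pipeline, rows):
    """Apply the stored preprocessing to new rows, then classify them."""
    means, stds = pipeline["means"], pipeline["stds"]
    centroids = pipeline["centroids"]
    predictions = []
    for r in rows:
        if len(r) != len(means):
            raise ValueError("new data must have the same columns as the training data")
        scaled = [(v - m) / s for v, m, s in zip(r, means, stds)]
        label = min(
            centroids,
            key=lambda y: sum((p - q) ** 2 for p, q in zip(scaled, centroids[y])),
        )
        predictions.append(label)
    return predictions


# In the lab: fit on (hypothetical) training data.
train_rows = [[1.0, 200.0], [2.0, 220.0], [8.0, 900.0], [9.0, 950.0]]
train_labels = ["low", "low", "high", "high"]
pipeline = fit_pipeline(train_rows, train_labels)

# Save the preprocessing parameters and the model together.
path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
with open(path, "w") as f:
    json.dump(pipeline, f)

# In production: reload and predict on new data with identical preprocessing.
with open(path) as f:
    loaded = json.load(f)
new_predictions = predict(loaded, [[1.5, 210.0], [8.5, 920.0]])
print(new_predictions)
```

Keeping the scaling parameters in the same artifact as the model is one way to guarantee that the two never drift apart; the column-count check is a simple guard against new data that does not match the training layout.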

ML pipelines describe the entire process from raw data acquisition to postprocessing of the prediction results on unseen data, so that the results are available for some kind of action by the business. A pipeline may be depicted at a generalized level or described at a very granular level. This section focuses on a generic pipeline that may be applied to any ML project. Figure 1.8 shows the various components of the ML project pipeline, otherwise known as the cross-industry standard process for data mining (CRISP-DM).
