
ML project pipeline

Most of the content available on ML projects, whether in books, blogs, or tutorials, explains the mechanics of machine learning in terms of splitting the available dataset into training, validation, and test datasets. Models are built on the training dataset and improved iteratively through hyperparameter tuning against the validation dataset. Once a model reaches an acceptable level of performance, it is evaluated on unseen test data and the results are reported. Most publicly available content ends at this point.
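The three-way split described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not something prescribed by this section.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)

    test = shuffled[:n_test]                  # held back until the very end
    val = shuffled[n_test:n_test + n_val]     # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]         # used to fit the model
    return train, val, test

# Toy data: 100 row identifiers standing in for real records
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The test portion is carved out first and never consulted during tuning, which is what keeps the final evaluation an honest estimate on unseen data.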

In reality, ML projects in a business setting go beyond this step. If one stops at testing and reporting a model's performance, the model is of no real use for making predictions on data that arrives in the future. The point of building a model is to deploy it in production and generate predictions on new data so that the business can take appropriate action.

In a nutshell, the model needs to be saved and reused. This also means that any new data on which predictions are to be made must be preprocessed in the same way as the training data, which ensures that the new data has the same number and types of columns as the training data. This productionalization of models built in the lab is often ignored when ML is taught. This section covers an end-to-end pipeline, from data preprocessing, through building models in the lab, to productionalization of the models.
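One way to keep preprocessing identical between the lab and production is to save the preprocessing parameters together with the model and reload both at prediction time. The following is a minimal sketch of that idea using only the standard library. The helper names (`fit_preprocessor`, `transform`, `predict`), the mean-imputation step, the toy columns, and the fixed linear weights standing in for a trained model are all assumptions made for illustration.

```python
import pickle

def fit_preprocessor(rows):
    """Learn the column order and per-column fill values from training rows only."""
    columns = sorted(rows[0])
    means = {}
    for col in columns:
        values = [r[col] for r in rows if r.get(col) is not None]
        means[col] = sum(values) / len(values)
    return {"columns": columns, "means": means}

def transform(prep, rows):
    """Emit the training columns in training order; fill gaps with training means."""
    return [
        [float(r[col]) if r.get(col) is not None else prep["means"][col]
         for col in prep["columns"]]
        for r in rows
    ]

def predict(model, features):
    """Hypothetical stand-in for a trained model: a fixed linear function."""
    return [sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
            for row in features]

# Lab side: fit the preprocessing on training data and bundle it with the model
train_rows = [
    {"age": 30, "income": 50.0},
    {"age": 40, "income": None},
    {"age": 50, "income": 70.0},
]
prep = fit_preprocessor(train_rows)
model = {"weights": [0.1, 0.2], "bias": 1.0}   # assumed, not actually trained
blob = pickle.dumps({"preprocessor": prep, "model": model})

# Production side: reload the bundle and apply the same preprocessing
artifact = pickle.loads(blob)
new_rows = [{"income": None, "age": 35, "extra": 1}]   # missing value, extra column
features = transform(artifact["preprocessor"], new_rows)
predictions = predict(artifact["model"], features)
print(features, predictions)
```

Because the column list and fill values travel with the model, new data ends up with the same number and order of columns as the training data, even when it arrives with missing values or unexpected extra fields. In practice the serialized bundle would be written to a file or a model store rather than kept in memory.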

An ML pipeline describes the entire process, from raw data acquisition to postprocessing of the prediction results on unseen data so that they are available for some kind of action by the business. A pipeline may be depicted at a generalized level or described at a very granular level. This section focuses on a generic pipeline that may be applied to any ML project. Figure 1.8 shows the various components of the ML project pipeline, otherwise known as the cross-industry standard process for data mining (CRISP-DM).
