
Staging–target scenario

The scenario described in the preceding section works well for very simple tasks, but it's more common for data to come from multiple sources. Let's imagine that the data comes from an accounting system and a production-tracking system. In this case, data in the accounting system could correlate with production data (for example, more downtime in production hurts profit, as does a higher rate of defective products). It's often desirable to combine data from multiple sources to get a better picture of a certain subject, but this makes the database architecture more complex. The architecture is shown in the following diagram:

The preceding diagram demonstrates how data is obtained and transformed in two phases. In the first phase, data is taken from several data sources and saved into the staging database, where almost all transformations are completed. In the second phase, the prepared data is placed into the target database, which holds the final version of the data, ready for machine-learning training and possibly also for making predictions.
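The two-phase flow can be sketched in a few lines of Python. This is a minimal illustration only: the table names (`stg_sales`, `fact_sales`), the columns, and the sample rows are all hypothetical, and two in-memory SQLite databases stand in for the real staging and target databases:

```python
import sqlite3

# Pretend extract from a source system; the duplicate row simulates
# a retried export that phase 1 has to clean up.
source_rows = [
    ("2023-01-05", "widget-a", 120.0),
    ("2023-01-05", "widget-a", 120.0),   # duplicate from a retried export
    ("2023-01-06", "widget-b", -15.5),
]

# Phase 1: copy raw rows into the staging database and transform them there.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_sales (day TEXT, product TEXT, profit REAL)")
staging.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", source_rows)

# Here the only transformation is deduplication.
cleaned = staging.execute(
    "SELECT DISTINCT day, product, profit FROM stg_sales"
).fetchall()

# Phase 2: move the prepared rows into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (day TEXT, product TEXT, profit REAL)")
target.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", cleaned)

print(target.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # prints: 2
```

The point of the split is visible even at this scale: the target table only ever receives rows that have already been cleaned in staging.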

Let's look into the staging database. The following tasks are performed in this database:

  • Data is reliably copied from data sources
  • If needed, the staging database holds high watermarks or mapping tables used for incremental data loads from sources
  • Data is cleansed, deduplicated, and consolidated
  • Data is prepared (or almost prepared) for a simple movement into the target database
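The high-watermark technique from the second bullet can be sketched as follows. This is a simplified, hypothetical example: the field names (`id`, `modified`) and the helper `extract_since` are illustrative, and a real implementation would persist the watermark in a staging table rather than a variable:

```python
# Source rows, each stamped with a last-modified timestamp (ISO format,
# so plain string comparison orders them correctly).
source = [
    {"id": 1, "modified": "2023-01-01T10:00"},
    {"id": 2, "modified": "2023-01-02T09:30"},
    {"id": 3, "modified": "2023-01-03T14:15"},
]

# Highest timestamp copied in the previous load; stored in staging.
watermark = "2023-01-01T10:00"

def extract_since(rows, watermark):
    """Return only the rows changed after the last recorded watermark."""
    return [r for r in rows if r["modified"] > watermark]

new_rows = extract_since(source, watermark)

# Advance the watermark so the next run skips rows already copied.
watermark = max(r["modified"] for r in new_rows)

print(len(new_rows), watermark)  # prints: 2 2023-01-03T14:15
```

Only rows 2 and 3 are copied, and the saved watermark guarantees the next incremental load starts where this one ended instead of re-reading the whole source table.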

As the preceding bullet list shows, the staging database does a lot of heavy lifting. So, what is the target database for? Basically, the target database holds a clean, well-modelled schema and data, used strictly to train predictive models, save trained models, and serve trained models for the purpose of making predictions. Aside from this base role, we can also use the target database as a source of data for computing statistics or for reporting purposes. 
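Because the target database already holds clean, well-modelled data, training code can read features from it directly. A minimal sketch, assuming a hypothetical `fact_production` table whose name and columns are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the target database, pre-populated with
# cleaned rows that phase 2 would have delivered.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_production "
    "(day TEXT, downtime_min REAL, defects INTEGER, profit REAL)"
)
conn.executemany(
    "INSERT INTO fact_production VALUES (?, ?, ?, ?)",
    [("2023-01-05", 30.0, 2, 950.0), ("2023-01-06", 75.0, 9, 610.0)],
)

# Pull a feature matrix (X) and labels (y) straight from the target DB;
# no further cleansing is needed at this point.
rows = conn.execute(
    "SELECT downtime_min, defects, profit FROM fact_production ORDER BY day"
).fetchall()
X = [(downtime, defects) for downtime, defects, _ in rows]
y = [profit for *_, profit in rows]
```

The same query-based access pattern serves the secondary role mentioned above: a reporting tool can run aggregate queries against the very same tables.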

Using this architecture, we need to keep in mind several considerations:

  • Data sources should provide rather reliable data from a data quality perspective.
  • Data sources are of a similar type. In the best-case scenario, all data sources are relational databases.
  • The data schema of data coming from data sources does not change, or it at least changes very infrequently.
  • It's not easy to extract data continuously from data sources. An architecture with multiple cooperating databases is better suited to batch data extraction. This could become a limitation later, when predictions have to be made in real time.

The staging–target model provides the following valuable benefits:

  • Both databases (the staging database as well as the target database) have schemas designed by developers who know the requirements for later tasks in data science. Both databases will fully support final data science needs.
  • Both databases are isolated from the workload coming to data sources. This reduces conflict between the incoming OLTP workload and analytical needs.

The described scenario presumes that reliable, rigidly structured data comes from accessible data sources. However, there are also many schema-agnostic NoSQL data sources, as well as data sources with unstable, frequently changing schemas that are potentially not accessible at all times. This is why we need a more complicated scenario, which is covered in the next section.
