
Staging–target scenario

The scenario described in the preceding section works well for very simple tasks, but it's more common for data to come from multiple sources. Let's imagine that the data comes from an accounting system and a production-tracking system. In this case, data in the accounting system could correlate with production data (for example, more downtime in production hurts profit, as does a higher rate of defective products). It's often desirable to combine data from multiple sources to get a better picture of a certain subject, but this makes the database architecture more complex. The architecture is shown in the following diagram:

The preceding diagram demonstrates how data is obtained and transformed in two phases. In the first phase, data is taken from several data sources and saved into the staging database, where almost all transformations are completed. In the second phase, the prepared data is placed into the target database, which holds the final version of the data, ready for machine-learning training and possibly also for making predictions.
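The two-phase flow can be sketched in a few lines of Python. This is a minimal illustration only: the table names (`stg_sales`, `fact_sales`), the columns, and the sample rows are all hypothetical, and two in-memory SQLite databases stand in for the real staging and target databases:

```python
import sqlite3

# Pretend extract from a source system; the duplicate row simulates
# a retried export that phase 1 has to clean up.
source_rows = [
    ("2023-01-05", "widget-a", 120.0),
    ("2023-01-05", "widget-a", 120.0),   # duplicate from a retried export
    ("2023-01-06", "widget-b", -15.5),
]

# Phase 1: copy raw rows into the staging database and transform them there.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_sales (day TEXT, product TEXT, profit REAL)")
staging.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", source_rows)

# Here the only transformation is deduplication.
cleaned = staging.execute(
    "SELECT DISTINCT day, product, profit FROM stg_sales"
).fetchall()

# Phase 2: move the prepared rows into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (day TEXT, product TEXT, profit REAL)")
target.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", cleaned)

print(target.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # prints: 2
```

The point of the split is visible even at this scale: the target table only ever receives rows that have already been cleaned in staging.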

Let's look into the staging database. The following tasks are performed in this database:

  • Data is reliably copied from data sources
  • If needed, the staging database holds high watermarks or mapping tables used for incremental data loads from sources
  • Data is cleansed, deduplicated, and consolidated
  • Data is prepared (or almost prepared) for a simple movement into the target database
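The high-watermark technique from the second bullet can be sketched as follows. This is a simplified, hypothetical example: the field names (`id`, `modified`) and the helper `extract_since` are illustrative, and a real implementation would persist the watermark in a staging table rather than a variable:

```python
# Source rows, each stamped with a last-modified timestamp (ISO format,
# so plain string comparison orders them correctly).
source = [
    {"id": 1, "modified": "2023-01-01T10:00"},
    {"id": 2, "modified": "2023-01-02T09:30"},
    {"id": 3, "modified": "2023-01-03T14:15"},
]

# Highest timestamp copied in the previous load; stored in staging.
watermark = "2023-01-01T10:00"

def extract_since(rows, watermark):
    """Return only the rows changed after the last recorded watermark."""
    return [r for r in rows if r["modified"] > watermark]

new_rows = extract_since(source, watermark)

# Advance the watermark so the next run skips rows already copied.
watermark = max(r["modified"] for r in new_rows)

print(len(new_rows), watermark)  # prints: 2 2023-01-03T14:15
```

Only rows 2 and 3 are copied, and the saved watermark guarantees the next incremental load starts where this one ended instead of re-reading the whole source table.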

As the preceding bullet list shows, the staging database does a lot of heavy lifting. So, what is the target database for? Basically, the target database holds a clean, well-modelled schema and data, used strictly to train predictive models, save trained models, and serve trained models for the purpose of making predictions. Aside from this base role, we can also use the target database as a source of data for computing statistics or for reporting purposes. 
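Because the target database already holds clean, well-modelled data, training code can read features from it directly. A minimal sketch, assuming a hypothetical `fact_production` table whose name and columns are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the target database, pre-populated with
# cleaned rows that phase 2 would have delivered.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_production "
    "(day TEXT, downtime_min REAL, defects INTEGER, profit REAL)"
)
conn.executemany(
    "INSERT INTO fact_production VALUES (?, ?, ?, ?)",
    [("2023-01-05", 30.0, 2, 950.0), ("2023-01-06", 75.0, 9, 610.0)],
)

# Pull a feature matrix (X) and labels (y) straight from the target DB;
# no further cleansing is needed at this point.
rows = conn.execute(
    "SELECT downtime_min, defects, profit FROM fact_production ORDER BY day"
).fetchall()
X = [(downtime, defects) for downtime, defects, _ in rows]
y = [profit for *_, profit in rows]
```

The same query-based access pattern serves the secondary role mentioned above: a reporting tool can run aggregate queries against the very same tables.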

Using this architecture, we need to keep in mind several considerations:

  • Data sources should provide rather reliable data from a data quality perspective.
  • Data sources are of a similar type. In the best-case scenario, all data sources are relational databases.
  • The data schema of data coming from data sources does not change, or it at least changes very infrequently.
  • It's not easy to extract data continuously from data sources. An architecture with multiple cooperating databases is better suited to batch data extraction. This could become a limitation later, when predictions have to be made in real time.

The staging–target model provides the following valuable benefits:

  • Both databases (the staging database as well as the target database) have schemas designed by developers who know the requirements for later tasks in data science. Both databases will fully support final data science needs.
  • Both databases are isolated from the workload coming to data sources. This reduces conflict between the incoming OLTP workload and analytical needs.

The described scenario presumes that reliable, rigidly structured data comes from accessible data sources. However, there are also many schema-agnostic NoSQL data sources, as well as data sources with unstable, frequently changing schemas that are potentially not accessible at all times. This is why we need a more complicated scenario, which is covered in the next section.
