- Hands-On Data Science with SQL Server 2017
- Marek Chmel, Vladimír Mužný
Staging–target scenario
The scenario described in the preceding section is good for very simple tasks, but it's more common that data comes from multiple sources. Let's imagine that the data is coming from an accounting system and a production-tracking system. In this case, data in the accounting system could correlate with production data (for example, more downtime in production has an impact on profit, as do defective products). It's often desirable to have data from multiple sources to get a better picture of a certain subject. In this case, the database architecture has to be more complex. The architecture is shown in the following diagram:

The preceding diagram demonstrates the way in which data is obtained and transformed in two phases. In the first phase, data is taken from several data sources and saved into the staging database. In the staging database, almost all transformations are completed. In the second phase, the prepared data is placed into the target database, which holds the final version of the data that is eligible for machine-learning training and possibly also for making predictions.
Let's look into the staging database. In this database, the following tasks are done:
- Data is reliably copied from data sources
- If needed, the staging database holds high watermarks or mapping tables used for incremental data loads from sources
- Data is cleansed, deduplicated, and consolidated
- Data is prepared (or almost prepared) for a simple movement into the target database
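The high-watermark technique from the list above can be sketched with a minimal, self-contained example. The book targets SQL Server, but the sketch below uses in-memory SQLite databases as stand-ins for a source system and the staging database; the `orders` table, the `watermark` table, and all column names are illustrative assumptions, not schemas from the book.

```python
import sqlite3

# In-memory SQLite databases stand in for a source system and the
# staging database; all table and column names are illustrative.
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")
staging.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    -- One watermark row per source table: the highest key loaded so far.
    CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_id INTEGER);
    INSERT INTO watermark VALUES ('orders', 0);
""")

def incremental_load():
    """Copy only source rows newer than the stored high watermark."""
    last_id = staging.execute(
        "SELECT last_id FROM watermark WHERE table_name = 'orders'"
    ).fetchone()[0]
    rows = source.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ?",
        (last_id,),
    ).fetchall()
    if rows:
        staging.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        staging.execute(
            "UPDATE watermark SET last_id = ? WHERE table_name = 'orders'",
            (max(r[0] for r in rows),),
        )
        staging.commit()
    return len(rows)

first_run = incremental_load()   # initial load picks up all three rows
source.execute("INSERT INTO orders VALUES (4, 40.0)")
second_run = incremental_load()  # next load copies only the new row
```

Because the watermark is stored in the staging database itself, each load is repeatable: re-running the load after a failure simply picks up where the last committed watermark left off.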
As seen in the preceding bullet list, the staging database does a great deal of work. So, what is the target database for? Basically, the target database holds a clean and well-modelled schema and data, both strictly used to train predictive models, save trained models, and provide trained models for the purpose of making predictions. Aside from this base role that the target database plays, we can also use the database as a source of data for statistics computation or reporting purposes.
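The second phase, moving prepared data from staging into the target database, can be sketched the same way. Again, SQLite stands in for the SQL Server databases, and the `staging_downtime`/`target_downtime` tables with their machine-downtime columns are illustrative assumptions, not schemas from the book.

```python
import sqlite3

# A single in-memory connection stands in for both databases here;
# table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging: raw rows from several extracts, duplicates included.
    CREATE TABLE staging_downtime (
        machine_id INTEGER, day TEXT, minutes_down INTEGER
    );
    INSERT INTO staging_downtime VALUES
        (1, '2017-01-01', 30),
        (1, '2017-01-01', 30),   -- duplicate from a repeated extract
        (2, '2017-01-01', 15);

    -- Target: clean, well-modelled data ready for model training.
    CREATE TABLE target_downtime (
        machine_id INTEGER, day TEXT, minutes_down INTEGER,
        PRIMARY KEY (machine_id, day)
    );
""")

# Phase two: deduplicate the prepared rows and move them to the target.
conn.execute("""
    INSERT INTO target_downtime
    SELECT DISTINCT machine_id, day, minutes_down
    FROM staging_downtime
""")
conn.commit()
target_rows = conn.execute(
    "SELECT COUNT(*) FROM target_downtime"
).fetchone()[0]
```

Note that the primary key on the target table enforces the clean model: if staging still contained conflicting rows for the same machine and day, the load would fail rather than silently pollute the training data.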
Using this architecture, we need to keep in mind several considerations:
- Data sources should provide rather reliable data from a data quality perspective.
- Data sources are of a similar type. In the best-case scenario, all data sources are relational databases.
- The data schema of data coming from data sources does not change, or it at least changes very infrequently.
- Data cannot easily be extracted from the data sources continuously. An architecture with several cooperating databases is better suited to batch data extraction, which could become a limitation later if predictions have to be made in real time.
The staging–target model provides the following valuable benefits:
- Both databases (the staging database as well as the target database) have schemas designed by developers who know the requirements for later tasks in data science. Both databases will fully support final data science needs.
- Both databases are isolated from the workload coming to data sources. This reduces conflict between the incoming OLTP workload and analytical needs.
The described scenario presumes that reliable, rigidly designed data comes from accessible data sources. However, we also encounter schema-agnostic NoSQL data sources, or data sources with an unstable, frequently changing schema, which are potentially not always accessible. This is why we need to look at a more complicated scenario, which is covered in the next section.