The least complicated database architecture uses the source data directly as the data source for further analysis. The following screenshot shows this scenario:
The only database in the preceding screenshot serves both data manipulation by source applications and data reads by machine-learning models. This architecture is suitable for limited scenarios only, and we have to consider its limitations. These include the following:
First of all, we must not block incoming work by reading data into our data science model. Source databases are usually designed either for DML operations or as a data warehouse. When the database is an online transactional processing (OLTP) database, such as those used by libraries, airlines, or banks, incoming transactions have priority over the range reads generated by machine-learning training. When the source database is a data warehouse, the situation is less complicated because data warehouses are designed for range reads.
We have a very limited capability to adjust the database schema for our purposes (one or two datasets). In this case, almost the only way to transform data into a desired dataset is to create database views. More complex transformations require creating new tables, and then the database is no longer a direct source.
Furthermore, we have a very limited capability for checking data quality, so we usually have to trust the quality of the original data. This limitation is quite similar to the previous two points; the only type of database object that is actually suitable for such checks is the database view.
This architecture also assumes that we don't need to combine other data sources with the incoming data. Combining data from multiple sources in this direct model is difficult and inefficient, because it requires distributed queries, which are likely to hurt performance and accessibility.
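To make the role of database views in the limitations above more concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for the source database; the orders table, its columns, and both view names are hypothetical and chosen only for illustration. One view shapes a training dataset without touching the source schema, and a second view exposes rows that fail simple quality rules.

```python
import sqlite3

# In-memory SQLite stands in for the source database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table owned by the source application.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL,
        ordered_at  TEXT
    );
    INSERT INTO orders VALUES
        (1, 10, 120.0, '2023-01-05'),
        (2, 10, NULL,  '2023-01-09'),
        (3, 11, 80.0,  '2023-02-01'),
        (4, 11, -5.0,  '2023-02-03');

    -- View that shapes the training dataset without altering the source schema.
    CREATE VIEW v_training_orders AS
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM orders
    WHERE amount IS NOT NULL AND amount >= 0
    GROUP BY customer_id;

    -- View that exposes rows failing basic quality rules.
    CREATE VIEW v_suspect_orders AS
    SELECT order_id, amount
    FROM orders
    WHERE amount IS NULL OR amount < 0;
""")

# The model reads the shaped dataset; the quality view can be inspected separately.
training_rows = conn.execute(
    "SELECT * FROM v_training_orders ORDER BY customer_id"
).fetchall()
suspect_rows = conn.execute(
    "SELECT * FROM v_suspect_orders ORDER BY order_id"
).fetchall()
```

Both views are evaluated against the source table on every read, so they add query load to the source database; this is the same trade-off described in the first limitation.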
Aside from the previously described limitations, this approach also has the following benefits:
Data for making predictions is accessible as soon as it comes to the source database. Because of this, our machine-learning model can access incoming data directly without the extra effort that is required to transform data.
Data for training is also always accessible. As long as the source system is running, we can read the data we need.
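The first benefit can be sketched as follows, again assuming an in-memory SQLite database in place of the source database. The readings table, the predict_overheat function (a stand-in for a trained model), and the threshold value are all hypothetical. The point is only that a row is available for scoring as soon as the source application commits it, with no intermediate transformation step.

```python
import sqlite3

# In-memory SQLite stands in for the source database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, temperature REAL)")


def predict_overheat(temperature, threshold=75.0):
    """Hypothetical stand-in for a trained model's predict call."""
    return temperature > threshold


def score_new_rows(connection, last_seen_id):
    """Score rows with id greater than last_seen_id.

    Returns a dict of predictions keyed by row id and the new last seen id.
    """
    rows = connection.execute(
        "SELECT id, temperature FROM readings WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    predictions = {row_id: predict_overheat(temp) for row_id, temp in rows}
    new_last_id = rows[-1][0] if rows else last_seen_id
    return predictions, new_last_id


# The source application writes a row, and it can be scored right away.
conn.execute("INSERT INTO readings (temperature) VALUES (80.5)")
conn.commit()
first_batch, last_id = score_new_rows(conn, 0)

# A later row is picked up by the next call; earlier rows are not re-read.
conn.execute("INSERT INTO readings (temperature) VALUES (60.0)")
conn.commit()
second_batch, last_id = score_new_rows(conn, last_id)
```

Filtering on the last seen id keeps each read small, which also limits how long the scoring query competes with incoming transactions.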