Direct source for data analysis

The simplest database architecture uses the source data directly as the data source for further analysis. The following screenshot shows this scenario:

The only database in the preceding screenshot serves both the data manipulation coming from source applications and the reads issued by machine-learning models. This architecture is suitable only for a limited set of scenarios, and we have to consider its limitations. These include the following:

  • First of all, reading data into our data science model must not block incoming work. Source databases are usually designed either for DML operations or as a data warehouse. When the database is an online transactional processing (OLTP) database, such as those used by libraries, airlines, or banks, incoming transactions have priority over the range reads generated by machine-learning training. When the source database is a data warehouse, the situation is less complicated because data warehouses are designed for range reads.
  • We have a very limited capability to adjust the database schema for our purposes, which typically amount to only one or two datasets. In this case, almost the only way to transform data into a desired dataset is to create database views. More complex transformations require creating new tables, at which point the database is no longer being used as a direct source.
  • Furthermore, we have a very limited capability for checking data quality, so we usually have to trust the quality of the original data. This limitation is closely related to the previous one: the only type of database object that is actually suitable for such checks is the database view.
  • This architecture also assumes that we don't need to combine other data sources with the incoming data. Combining data from several data sources is very difficult and inefficient in this direct model because it requires distributed queries, which are likely to hurt performance and accessibility.
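The role of the database view as the only transformation layer can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the source database. The orders table, its columns, and the v_training_orders view are hypothetical names chosen for illustration only; a real source system would already have its own schema.

```python
import sqlite3


def build_source_db() -> sqlite3.Connection:
    """Create a toy in-memory 'source' database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            order_id   INTEGER PRIMARY KEY,
            customer   TEXT,
            quantity   INTEGER,
            unit_price REAL
        );
        INSERT INTO orders (customer, quantity, unit_price) VALUES
            ('alice', 2,    10.0),
            ('bob',   NULL, 5.0),
            ('carol', 1,    99.5);

        -- The view is the only object added on top of the source schema.
        -- It drops rows with missing values (a basic quality filter)
        -- and derives one feature without creating any new table.
        CREATE VIEW v_training_orders AS
        SELECT order_id,
               customer,
               quantity,
               quantity * unit_price AS total_price
        FROM orders
        WHERE quantity IS NOT NULL
          AND unit_price IS NOT NULL;
        """
    )
    return conn


def load_training_rows(conn: sqlite3.Connection) -> list:
    """Read the dataset for model training through the view only."""
    cursor = conn.execute(
        "SELECT order_id, customer, quantity, total_price "
        "FROM v_training_orders ORDER BY order_id"
    )
    return cursor.fetchall()


conn = build_source_db()
rows = load_training_rows(conn)
print(rows)
```

The row with the missing quantity never reaches the training code, and the source table itself stays untouched. Anything beyond this kind of filtering and simple derivation would need extra tables, which is exactly where the direct model stops being direct.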

Despite the limitations described previously, this approach also has the following benefits:

  • Data for making predictions is accessible as soon as it arrives in the source database. Because of this, our machine-learning model can access incoming data directly, without the extra effort that is required to transform data.
  • Data for training is also always accessible: as long as the source system is running, the training data can be read.
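The first benefit can be sketched in a few lines, again with Python's sqlite3 module and an in-memory database standing in for the source database. The readings table and the score function are hypothetical; score is only a placeholder for a call to a trained model.

```python
import sqlite3

# Toy stand-in for the source database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (reading_id INTEGER PRIMARY KEY, temperature REAL)"
)


def score(temperature: float) -> str:
    """Hypothetical placeholder for a trained model's prediction."""
    return "alert" if temperature > 30.0 else "ok"


def score_new_rows(conn: sqlite3.Connection, last_seen_id: int) -> list:
    """Score every row written after last_seen_id, read straight from the source table."""
    rows = conn.execute(
        "SELECT reading_id, temperature FROM readings "
        "WHERE reading_id > ? ORDER BY reading_id",
        (last_seen_id,),
    ).fetchall()
    return [(reading_id, score(temperature)) for reading_id, temperature in rows]


# The source application writes a row; it is readable for prediction right away.
conn.execute("INSERT INTO readings (temperature) VALUES (21.5)")
conn.commit()
first = score_new_rows(conn, 0)

# A second write is picked up by the next read, with no copy step in between.
conn.execute("INSERT INTO readings (temperature) VALUES (35.0)")
conn.commit()
second = score_new_rows(conn, 1)

print(first, second)
```

No load or copy step sits between the write and the prediction, which is the point of the direct model. The price is that every such read competes with the source application's own workload, as described in the limitations.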