- Hands-On Big Data Modeling
- James Lee, Tao Wei, Suresh Kumar Mukhiya
Data integration
In a typical scenario, data comes from many different sources. Data integration is the technique of combining data from these disparate sources and providing end users with a unified view of that data. This unified view abstracts away the details of the individual sources.
Mathematically, a data integration system is formally defined as a triple <G, S, M>, where:
- G is the global schema
- S is the heterogeneous set of source schemas
- M is the mapping that relates queries over the source schemas to queries over the global schema
Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S.
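The mapping M is easiest to see with a small, concrete example. The following is a minimal sketch of a global-as-view style mapping in Python; the global relation customer, the two source relations, and all of the column names are illustrative assumptions rather than part of any particular system.

```python
# A minimal sketch of a <G, S, M> data integration system.
# Assumption: a hypothetical global relation customer(id, name, country)
# and two source relations with different layouts.

# S: heterogeneous source schemas, represented here as lists of dicts
source_crm = [
    {"cust_id": 1, "full_name": "Ada Lovelace", "country_code": "UK"},
]
source_erp = [
    {"id": 2, "name": "Alan Turing", "nation": "United Kingdom"},
]

# M: global-as-view mappings -- each global relation is expressed as a
# query (here, a simple transformation) over the source relations
def map_crm(row):
    return {"id": row["cust_id"], "name": row["full_name"], "country": row["country_code"]}

def map_erp(row):
    return {"id": row["id"], "name": row["name"], "country": row["nation"]}

# G: the global schema, populated by applying the mappings; end users
# query this unified view instead of the individual sources
def global_customer():
    return [map_crm(r) for r in source_crm] + [map_erp(r) for r in source_erp]

if __name__ == "__main__":
    for row in global_customer():
        print(row)
```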
There are a few other big data management capabilities; they can be explained as follows:
- Data migration: This is the process of transferring data from one environment to another. Most migration occurs between computers and storage devices (for example, transferring data from in-house data centers to the cloud).
- Data preparation: Data that is used for analysis is often messy, inconsistent, and not standardized. This data must be collected and cleaned into one file or data table before an actual analysis can take place. This step is referred to as data preparation. It involves handling messy data, combining data from multiple sources, and reporting on data sources that were entered manually (a simple preparation step appears in the first sketch after this list).
- Data enrichment: This step involves enhancing an existing set of data by refining it, in order to improve its quality. It can be done in several ways; common approaches include adding new datasets, correcting minor errors, and extrapolating new information from the raw data.
- Data analytics: This is the process of drawing insights from datasets by analyzing them with a variety of algorithms. Most steps are automated by using various tools.
- Data quality: This is the act of confirming that the data is accurate and reliable. There are several ways in which data quality is controlled, such as validation rules and duplicate detection.
- Master data management (MDM): This is a method that is used to define and manage the critical data of an enterprise, in order to link that data to one master set. The master set works as a single source of truth for the organization (see the record-matching sketch after this list).
- Data governance: This is a data management concept that deals with the ability of a company to ensure high data quality throughout the analytical process. This includes guaranteeing the availability, usability, integrity, and accuracy of its data.
- Extract, transform, load (ETL): As the name implies, this is the process of extracting data from existing repositories, transforming it into the required format, and loading it into a different database or a new data warehouse (a minimal pipeline sketch follows this list).
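The preparation, quality, and ETL capabilities above often show up together in one pipeline. The following is a minimal extract-transform-load sketch in Python; the input file sales_export.csv, the column names, and the SQLite target are illustrative assumptions, not a prescribed setup.

```python
# A minimal ETL sketch. Assumptions: a hypothetical CSV export with
# id, name, and amount columns, loaded into a local SQLite "warehouse".
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the existing repository (a CSV file here)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: a small data-preparation step that standardizes fields and
    # applies a basic data-quality check (drop rows missing id or amount)
    clean = []
    for row in rows:
        if not row.get("id") or not row.get("amount"):
            continue  # quality check: skip incomplete records
        clean.append({
            "id": int(row["id"]),
            "name": (row.get("name") or "").strip().title(),
            "amount": float(row["amount"]),
        })
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: write the prepared rows into the target database
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:id, :name, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))
```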
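Master data management, in turn, hinges on matching records that refer to the same real-world entity and collapsing them into one master record. The sketch below shows one very simple matching rule, keying customer records on a normalized email address; the rule and the field names are assumptions made for illustration.

```python
# A minimal master-data-management sketch. Assumption: two systems export
# customer records, and records with the same normalized email address
# describe the same customer.
def normalize_email(email):
    return email.strip().lower()

def build_master(records):
    # Collapse duplicates into one "golden" record per customer; the
    # resulting master set acts as the single source of truth
    master = {}
    for rec in records:
        key = normalize_email(rec["email"])
        golden = master.setdefault(key, {"email": key})
        for field in ("name", "phone"):
            # Keep the first non-empty value seen for each attribute
            if not golden.get(field) and rec.get(field):
                golden[field] = rec[field]
    return list(master.values())

if __name__ == "__main__":
    records = [
        {"name": "Ada Lovelace", "email": "ADA@example.com", "phone": ""},
        {"name": "", "email": "ada@example.com ", "phone": "555-0100"},
    ]
    print(build_master(records))
```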