- Hands-On Big Data Modeling
- James Lee, Tao Wei, Suresh Kumar Mukhiya
Data integration
In a typical scenario, data comes from many different sources. Data integration is the technique of combining data from these disparate sources and providing end users with a unified view of that data. This unified view abstracts away the details of the individual sources.
Mathematically, a data integration system is formally defined as a triple <G, S, M>, where:
- G is the global schema
- S is the heterogeneous set of source schemas
- M is the mapping that relates queries over the source schemas to queries over the global schema
Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S.
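The mapping M is easiest to see with a small, concrete example. The following is a minimal sketch of a global-as-view style mapping in Python; the global relation customer, the two source relations, and all of the column names are illustrative assumptions rather than part of any particular system.

```python
# A minimal sketch of a <G, S, M> data integration system.
# Assumption: a hypothetical global relation customer(id, name, country)
# and two source relations with different layouts.

# S: heterogeneous source schemas, represented here as lists of dicts
source_crm = [
    {"cust_id": 1, "full_name": "Ada Lovelace", "country_code": "UK"},
]
source_erp = [
    {"id": 2, "name": "Alan Turing", "nation": "United Kingdom"},
]

# M: global-as-view mappings -- each global relation is expressed as a
# query (here, a simple transformation) over the source relations
def map_crm(row):
    return {"id": row["cust_id"], "name": row["full_name"], "country": row["country_code"]}

def map_erp(row):
    return {"id": row["id"], "name": row["name"], "country": row["nation"]}

# G: the global schema, populated by applying the mappings; end users
# query this unified view instead of the individual sources
def global_customer():
    return [map_crm(r) for r in source_crm] + [map_erp(r) for r in source_erp]

if __name__ == "__main__":
    for row in global_customer():
        print(row)
```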
There are a few other big data management capabilities; they can be explained as follows:
- Data migration: This is the process of transferring data from one environment to another. Most migration occurs between computers and storage devices (for example, transferring data from in-house data centers to the cloud).
- Data preparation: Data that is used for analysis is often messy, inconsistent, and not standardized. This data must be collected and cleaned into one file or data table before an actual analysis can take place. This step is referred to as data preparation. It involves handling messy data, combining data from multiple sources, and reporting on data sources that were entered manually (a simple preparation step appears in the first sketch after this list).
- Data enrichment: This step involves enhancing an existing set of data by refining it, in order to improve its quality. It can be done in several ways; common approaches include adding new datasets, correcting minor errors, and extrapolating new information from the raw data.
- Data analytics: This is the process of drawing insights from datasets by analyzing them with a variety of algorithms. Most steps are automated by using various tools.
- Data quality: This is the act of confirming that the data is accurate and reliable. There are several ways in which data quality is controlled, such as validation rules and duplicate detection.
- Master data management (MDM): This is a method that is used to define and manage the critical data of an enterprise, in order to link that data to one master set. The master set works as a single source of truth for the organization (see the record-matching sketch after this list).
- Data governance: This is a data management concept that deals with the ability of a company to ensure high data quality throughout the analytical process. This includes guaranteeing the availability, usability, integrity, and accuracy of its data.
- Extract, transform, load (ETL): As the name implies, this is the process of extracting data from existing repositories, transforming it into the required format, and loading it into a different database or a new data warehouse (a minimal pipeline sketch follows this list).
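The preparation, quality, and ETL capabilities above often show up together in one pipeline. The following is a minimal extract-transform-load sketch in Python; the input file sales_export.csv, the column names, and the SQLite target are illustrative assumptions, not a prescribed setup.

```python
# A minimal ETL sketch. Assumptions: a hypothetical CSV export with
# id, name, and amount columns, loaded into a local SQLite "warehouse".
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the existing repository (a CSV file here)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: a small data-preparation step that standardizes fields and
    # applies a basic data-quality check (drop rows missing id or amount)
    clean = []
    for row in rows:
        if not row.get("id") or not row.get("amount"):
            continue  # quality check: skip incomplete records
        clean.append({
            "id": int(row["id"]),
            "name": (row.get("name") or "").strip().title(),
            "amount": float(row["amount"]),
        })
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: write the prepared rows into the target database
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:id, :name, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))
```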
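Master data management, in turn, hinges on matching records that refer to the same real-world entity and collapsing them into one master record. The sketch below shows one very simple matching rule, keying customer records on a normalized email address; the rule and the field names are assumptions made for illustration.

```python
# A minimal master-data-management sketch. Assumption: two systems export
# customer records, and records with the same normalized email address
# describe the same customer.
def normalize_email(email):
    return email.strip().lower()

def build_master(records):
    # Collapse duplicates into one "golden" record per customer; the
    # resulting master set acts as the single source of truth
    master = {}
    for rec in records:
        key = normalize_email(rec["email"])
        golden = master.setdefault(key, {"email": key})
        for field in ("name", "phone"):
            # Keep the first non-empty value seen for each attribute
            if not golden.get(field) and rec.get(field):
                golden[field] = rec[field]
    return list(master.values())

if __name__ == "__main__":
    records = [
        {"name": "Ada Lovelace", "email": "ADA@example.com", "phone": ""},
        {"name": "", "email": "ada@example.com ", "phone": "555-0100"},
    ]
    print(build_master(records))
```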