DATA INTEGRATION

Data integration is the process of collecting and combining data from various sources to give every database user a standardized view of the data. Commercial and scientific organizations alike can benefit from it, as both the volume of data and the need to share it keep growing.

The problems of combining sources with different backgrounds arose as early as the 1960s, when databases were first adopted and the need to share existing repositories rapidly emerged. Data integration is of paramount importance in the software industry, and true data integration occurs only when two systems look and act the same and produce and consume the same data.

There are several solutions to this problem at various levels of the database architecture; the most popular one is data warehousing.

Data from numerous sources is extracted, transformed and loaded (ETL) into a single repository, where it resides at query time. When a query is submitted, the data warehouse locates the data, retrieves it and presents it to the end user as an integrated whole. The designers of a data integration system aim to keep the required work to a minimum, and thus they focus on applications and various warehouse techniques. This approach has many advantages for the user, but it is not ideal: any change in the original data sources calls for re-execution of the ETL process, otherwise the warehouse keeps serving outdated, unreliable data.
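The extract, transform and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (CRM_ROWS, SHOP_ROWS), their field names and the customer table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records; the field names are illustrative only.
CRM_ROWS = [{"id": 1, "name": " Alice ", "country": "us"}]
SHOP_ROWS = [{"customer_id": 2, "full_name": "Bob", "country_code": "DE"}]


def extract():
    """Yield (source_name, row) pairs from every source."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in SHOP_ROWS:
        yield "shop", row


def transform(source, row):
    """Map a source-specific row onto the warehouse layout (id, name, country)."""
    if source == "crm":
        return (row["id"], row["name"].strip(), row["country"].upper())
    return (row["customer_id"], row["full_name"].strip(), row["country_code"].upper())


def load(conn, rows):
    """Rebuild the warehouse table from scratch with the transformed rows."""
    conn.execute("DROP TABLE IF EXISTS customer")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run one full extract-transform-load pass."""
    load(conn, [transform(source, row) for source, row in extract()])


warehouse = sqlite3.connect(":memory:")  # stand-in for the central repository
run_etl(warehouse)

# Queries are answered from the warehouse copy, not from the sources.
result = warehouse.execute("SELECT id, name, country FROM customer ORDER BY id").fetchall()
```

If a row is later added to CRM_ROWS, the warehouse does not see it until run_etl is called again, which is the staleness problem mentioned above.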

When information changes frequently, one has to rely on a different kind of data integration system, one that does not use a centralized database and instead pulls data directly from the individual sources.
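A minimal sketch of this second kind of system is shown below, assuming two hypothetical sources (store_a, store_b) that already expose their rows in a shared sku/price layout. The query function is an illustrative stand-in for a mediator: it keeps no copy of the data and asks every source each time it is called.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]

# Hypothetical live sources; in practice these would be remote databases or services.
store_a: List[Row] = [{"sku": "A1", "price": 10.0}]
store_b: List[Row] = [{"sku": "B7", "price": 4.5}]

# Each entry is a function that returns the current rows of one source.
SOURCES: List[Callable[[], Iterable[Row]]] = [lambda: store_a, lambda: store_b]


def query(predicate: Callable[[Row], bool]) -> List[Row]:
    """Answer a query by asking every source now; nothing is cached or copied."""
    return [row for fetch in SOURCES for row in fetch() if predicate(row)]


cheap_before = query(lambda r: r["price"] < 5)
store_a[0]["price"] = 3.0                      # a source changes...
cheap_after = query(lambda r: r["price"] < 5)  # ...and the next query sees it
```

No reload step is needed after the change, at the cost of contacting every source on every query.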

The current trend in networked data integration recognises two major approaches to this process - LAV and GAV. In the LAV (Local As View - referring to the local sources) approach, each data source is described as a view over the mediated schema (the organizational arrangement of the integrated database), whereas in the GAV (Global As View - referring to the global schema) approach the mediated schema is defined as a view over the sources. Due to its simplicity, the 'Global' approach is more attractive than the 'Local': a query is answered simply by expanding the view definitions. Changing the data sources is problematic in the 'Global' approach, however, since the definitions of the mediated schema that depend on them have to be revised. In the 'Local' approach a new source can be described without touching the others, but answering queries is more complicated.
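The 'Global' approach can be illustrated with an ordinary SQL view. The sketch below assumes two hypothetical source tables (src_crm, src_shop) held in one in-memory SQLite database for convenience; the mediated relation customer is defined as a query over them, and the database engine expands that definition whenever customer is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources with different layouts.
    CREATE TABLE src_crm  (id INTEGER, name TEXT);
    CREATE TABLE src_shop (customer_id INTEGER, full_name TEXT);
    INSERT INTO src_crm  VALUES (1, 'Alice');
    INSERT INTO src_shop VALUES (2, 'Bob');

    -- GAV: the mediated relation is written as a query over the sources.
    CREATE VIEW customer AS
        SELECT id, name FROM src_crm
        UNION ALL
        SELECT customer_id, full_name FROM src_shop;
""")

# A query against the mediated schema; the engine expands the view definition.
names = [row[0] for row in conn.execute("SELECT name FROM customer ORDER BY id")]
```

Adding a third source in this sketch means rewriting the CREATE VIEW statement, since the mediated relation is defined directly in terms of the source tables.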

As the private sector is deeply concerned with data integration, the Enterprise Information Integration (EII) industry has emerged, offering simplicity and speed rather than correctness and tractability.

There are several issues to be addressed by EII:
  • simplicity of understanding
  • simplicity of deployment
  • handling higher-order information

According to practitioners, only by addressing these issues will the EII industry mature and perform to its full potential. Although data integration may seem like a simple idea, it is in fact a fairly complicated discipline in which there is no universal approach and the techniques are still evolving.
