EXTRACTION

Extraction is one of the three elements of the ETL process. ETL stands for extraction , transformation and loading ,its one of the most important area of business intelligence because allows to create warehouse in a very fast and reliable way. In this article we will look closer at the extraction process , because the appropriate extraction is a key to success in creating data warehouses and data marts.

The logical data map

The first and the most important step in designing the extract system is creating a logical data map.

It is a document which shows the relationship between the primary source and the final point and it is usually presented in a form of tables or a spreadsheet format .The logical data map(called sometimes the lineage report) includes:

  • Components of the target: target table name , column name , table type(fact , dimension or subdimension), data type(f.ex a number) and SCD type
  • Components of the source: the source database ,table name , column name, data type
  • Transformation

Lets look closer at these components…

Target components

The first four components of the target are obvious while the fifth requires explanation .SCD is a datawarehousing term for a dimension that change in time. The SCD type is an indicator and its translated as slowly changing dimension.

    There are three types of SCD:
  • SCD 1 - overwrites old data with new data
  • SCD 2 - tracks historical data by creating multiply records in the dimensional tables and its unlimited
  • SCD 3 - tracks changes using separate columns and its limited by a number of columns we design

The SCD type may be different for each column in the dimension.

Sources components

The source database is a name of the instance of the database where the data is located and it is a connect string which is necessary to connect to the database. The second and third components are required to populate the target in the target data warehouse( usually it is a list of tables and columns ).

Transformation

The transformation is a kind of conversion data to the expected format in the target point .It can be expressed in SQL or pseudo-code. Sometimes the transformation in a logical data map is blank that means the transformation is not required and the data is loaded directly to the target.

    Advantages of using the logical data map:
  • Retains all information in one place to improve the extraction process.
  • Facilitates the information/functional analysis .
  • Shows the most important points of the developers work. It is a line along which they should move in the ETL process. However there are some traps .They should be very focused while operating with the data types because sometimes the conversion from the source type to the target type is connected with large decrease of characters for example from 225 to 70 and its implied. In this situation they will probably loose the data. To facilitate the work the developers may use appropriate tools which can capture data and detect implied data conversion.

And how to build the logical data map?

Before you start building it first you have to be sure that you now how the target needs to look like (know all needs of the target) .

  • 1.Identifying the data sources. The data modeling session may be helpful but we have to remember that it usually indicates only the major source systems. It depends on the team if it dives deeper and find every source that can be used. However identifying a data source may be sometimes quite complex because of the legacy systems( database appears with various names).The most reliable solution is to use then a central repository for arrangement all of it.
  • Collecting and documenting source systems.
  • Creating the source systems tracking report .It shows information about who is responsible for each source. It contains many characteristics of the source like: subject area, interface name , business name, business owner ,technical owner ,DBMS , production server ,DB size , DB complexity , transaction per day , priority , daily use count , department/ business use , platform and some comments.
  • Identyfing the SOR(system-of-record). It is an information storage systems (data sources for every element or peace of information).It is necessary because in most enterprises the data is re-processed in order to a new business use. It is copied ,manipulated , moved , transformed so we have many versions of the same data. But remember that the system-of-records is only for specific use and it should no be used to load data because it will be not appropriate and nearly impossible.
  • Creating and analyzing the ER(entity relation) diagrams of SOR. It shows how two or more entities are related to one another. The relationships are showed as diamonds connected by lines.
      The ER diagram contains:
    • Unique identifiers :status_ id , status_code and status_description
    • Data types
    • Relationship between tables. It is very important characteristic because it influences on appropriate data extracting.
    • Discrete relationship .It is necessary for mapping many of the dimensions. It is a single look-up table storing all data from all of the tables .
    • Cardinality of relationships and columns. It is number of elements in diamonds that are related. There are three cardinality types:
      -One- to- one. For example: the relationship is between the primary key from tables.
      -One- to- many .It is the most common for foreign key columns in tables.
      -Many-to-many .It usually involves three tables with two one-to-many cardinality types

After preparing a logical data map…

    When you already have a logical data map you need to:
  • Analyze the data content which allows to make a better choice. But there are some traps within it like f.ex datas in nondate fields or null values. If you will connect tables with null values it may cause the data lost. If such situation appears you must outer join tables.
  • Complete business rules for ETL process.
  • Integrate data sources to a single source. This process ties those sources which support the same interest and implement the conformed dimensions into one source. If the dimensions are entirely disparate you have failed!!!!

The logical data map is a specification for developers to create the physical plan of work. It shows the data sources components and the targets components to focused them on the most important points. It saves much time and ensure very good results. It may be used also as an information source and be presented to end users.


Extracting changed data

Capturing changes of data is crucial in creating database. There are many methods of detecting changes:

  • Using audit columns(capture changes, show time when record was added or modified)
  • Database log scraping or sniffing (take a snapshot of detecting changes)
  • Timed extracts (time-based data selection, select yesterdays records , possibility to loose data , not recommended )
  • Process of elimination(store one copy of each previous extraction in the staging area and compare row-by-row ,only different data are send to the wharehous , its very effective)
  • Initial or incremental load(create two tables: previous load and current load and compare and results are loaded to the database)

Sometimes the data may be deleted or overwritten at the source without any notification. We may then negotiate with the source owners or periodically check the history to find the error. When its detected you may use methods described above.

Summarizing the extraction process is quite complex. To ensure appropriate extraction we need to make a logical data map , handle with a big variety of sources and be alert during the whole process. It requires huge effort but the results will be for sure very satisfying.

  • Further reading on the most popular Data extraction sources