Multiple data extraction sources

The first ETL step, extraction, is aimed at pulling data from external or internal sources. These primary sources usually have very different characteristics and are managed by incompatible systems running on different hardware.

The data sources in the ETL process usually differ greatly. The differences depend mainly on:

  • Type of platform (Mainframe, Windows…)
  • Type of DBMS (Database Management System).
  • Type of language required to connect with the source. The standard languages are: COBOL, FOCUS, EasyTrieve, PL/SQL, Transact-SQL or RPG.

Sometimes the extraction is quite difficult because of the variety of source systems. In this part of the article we show how multiple sources influence the extraction and how to resolve problems with connecting to different sources by discussing the following points…

ODBC

ODBC stands for Open Database Connectivity.

It is an interface that allows a connection to virtually any Database Management System (DBMS) and even to flat files (simple linear sets of data).

    The ODBC interface consists of:
  • ODBC manager - connects the ETL application with the ODBC driver (it recognizes SQL from the ETL application and forwards it to the driver).
  • ODBC driver - translates the ODBC SQL into the language of the database source.
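
As a minimal sketch of this chain (the DSN name, table and credentials are made up for illustration), an ETL job written in Python can reach any source that has an ODBC driver through a generic library such as pyodbc; the ODBC manager routes the SQL to the driver, which translates it for the particular DBMS:

    import pyodbc  # generic ODBC interface; the driver handles the DBMS dialect

    # "SalesDSN" is a hypothetical data source name configured in the ODBC manager.
    connection = pyodbc.connect("DSN=SalesDSN;UID=etl_user;PWD=secret")
    cursor = connection.cursor()

    # The same extraction code works whether the DSN points at Oracle,
    # SQL Server, DB2 or even a flat-file driver.
    cursor.execute("SELECT customer_id, amount FROM sales WHERE sale_date >= ?", "2024-01-01")
    for customer_id, amount in cursor.fetchall():
        print(customer_id, amount)

    connection.close()

This generic layer may add some overhead compared with a native driver, but it keeps the extraction code independent of the source DBMS.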

Mainframe computer

Mainframe computers are used by big corporations for critical applications (financial, statistical) and offer high data-processing performance (many inputs and outputs are handled by additional components). Extracting data from such sources is unfortunately not so simple and requires a lot of knowledge from the ETL team.

The most characteristic elements of the mainframe computer that influence the ETL process are:

  • COBOL copybooks - The data are stored as strings of numbers and text and require transformation before being loaded into a table in the warehouse.
  • Numeric data - Numbers are stored in a variety of ways, e.g. the value 3,500.01 may be stored as the digit string 0035001, or the value 35,000.1 may end up being read as 35,000.01! This is a very big problem when handling quantities such as dollars or counts and may cause a catastrophe. The PIC clause in COBOL copybooks describes each value with characters that indicate the item's characteristics and size, so it must be consulted to interpret the numbers correctly.
  • Character set - The data is transferred in the form of bytes and bits, and different platforms use different character sets to encode it. Mainframe computers use the EBCDIC interchange code while, for example, Windows systems use ASCII, so data has to be translated from one to the other. Luckily this happens automatically when using FTP (File Transfer Protocol) or specific ETL tools. But here is another problem: on mainframes many numeric values are stored in special formats such as COMP-3 (packed decimal), and translating that data requires special, additional techniques (a decoding sketch follows this list).
  • REDEFINES clause and multiple mainframe record types - Mainframe programs very often redefine an area in a record to save space, so one position may have two or more possible uses (redefines). This causes a problem with EBCDIC-ASCII translation; the best solution is to process the redefined areas separately.
    Going further, a broad set of data cannot be stored in only one record layout, so multiple record types are used, and records of the different types may be put in one file. This also saves much space.
  • OCCURS clause - This clause is used to define repeating groups within a data record: it indicates the number of iterations and stores the items as an array. Such data needs to be normalized, i.e. the repeating groups eliminated.
  • Variable-length records - A record that has at least one variable-length field. The length of the whole record varies depending on what data is placed in that field; each additional occurrence may change the length.
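
As the decoding sketch mentioned above (a minimal illustration in Python; the record layout, field positions and scales are assumed, not taken from a real copybook), the bytes of a mainframe record can be translated from EBCDIC (code page cp037) to text, a zoned number with an implied decimal point can be rescaled, and a COMP-3 (packed decimal) field can be unpacked nibble by nibble:

    from decimal import Decimal

    def unpack_comp3(raw: bytes, scale: int = 2) -> Decimal:
        """Unpack a COMP-3 (packed decimal) field: two digits per byte,
        the low nibble of the last byte holds the sign (0xD = negative)."""
        digits = ""
        for byte in raw[:-1]:
            digits += f"{byte >> 4}{byte & 0x0F}"
        last = raw[-1]
        digits += str(last >> 4)
        sign = -1 if (last & 0x0F) == 0x0D else 1
        return sign * Decimal(digits) / (10 ** scale)

    def decode_record(record: bytes) -> dict:
        # Assumed layout: bytes 0-9 name (EBCDIC text),
        # bytes 10-16 amount as zoned PIC 9(5)V99 (implied decimal point),
        # bytes 17-20 balance as COMP-3 with two decimal places.
        name = record[0:10].decode("cp037").rstrip()
        amount = Decimal(record[10:17].decode("cp037")) / 100  # "0035001" -> 3500.01
        balance = unpack_comp3(record[17:21], scale=2)
        return {"name": name, "amount": amount, "balance": balance}

Because an ASCII-mode FTP transfer would corrupt packed fields such as COMP-3, files containing them are usually transferred in binary mode and decoded field by field as above.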

Flat files (text files)

The flat file is a basic data set with no relationships defined in it. It makes it possible to extract data from databases that cannot be accessed directly. Why are flat files so important in the extraction process?
Basically, text files are important because:

  • Most data are delivered in flat files.
  • Using flat files is much faster than working through a DBMS.
  • They allow bulk loading.
    There are two types of flat files:
  • Fixed length, known also as positional files - created by programmers working on the source system. A file layout has to be supplied with the flat file before any ETL processing. The layout takes the form of a table containing: field name, length, start position, end position and data type (a parsing sketch follows this list).
  • Delimited - fields are separated by delimiters, such as a comma. ETL tools usually have delimited-file wizards which specify the layout for this file type.
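
As the parsing sketch mentioned above (the layout table, field names and delimiter are invented for illustration), a positional file can be sliced according to its layout, while a delimited file can be read with Python's standard csv module:

    import csv

    # Hypothetical layout table for a fixed-length (positional) file:
    # field name, start position, end position (1-based, inclusive), data type.
    LAYOUT = [
        ("customer_id", 1, 8, int),
        ("customer_name", 9, 38, str),
        ("balance", 39, 48, float),
    ]

    def parse_fixed_width(line: str) -> dict:
        """Slice one fixed-length record according to the layout table."""
        record = {}
        for name, start, end, dtype in LAYOUT:
            raw = line[start - 1:end].strip()
            record[name] = dtype(raw) if raw else None
        return record

    def read_delimited(path: str, delimiter: str = ","):
        """Read a delimited flat file; the first row is assumed to hold the field names."""
        with open(path, newline="") as f:
            yield from csv.DictReader(f, delimiter=delimiter)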

ERP system sources (Enterprise Resource Planning systems)

Enterprise Resource Planning (ERP) systems were created to integrate heterogeneous data into one platform or database. These systems are very complex and often contain thousands of tables, which makes extracting data from an ERP source difficult. In this situation the best solution is to use special adapters to communicate with those systems. Unfortunately the adapters/connectors are usually not free and may cost quite a lot.

The most popular ERP systems are: SAP, Oracle, Baan, PeopleSoft and IFS.

XML

XML stands for Extensible Markup Language. It is a platform-independent data-exchange language. It is a rather odd entity because, as was said, it is a language but it also stores data. It was designed to describe data, their content and structure.

    General characteristics of an XML file:
  • Uses a character set.
  • Uses tags (they increase complexity and make up about 90% of the full XML file size).
  • Has a complex structure (never try to create the XML by hand).
  • Uses a DTD (Document Type Definition) or an XML Schema to determine the grammar and structure of documents.

Independent groups of people may agree on a DTD for exchanging data between themselves. It may also be used to check the validity of a document. The structure of a DTD is sometimes very complicated (it is usually hard for people to read) and covers: base data types, element structures, mixed content, nillable elements, cardinality and allowed values.

The second standard used to define the XML document structure is XML Schema. It is much more powerful than a DTD and allows you to:
- Define document elements and their attributes
- Define the hierarchy and sequence of elements
- Define the value of an element (empty or containing text)
- Define extensibility for future additions
- Support namespaces (a namespace tells us where the definition of an element or attribute can be found)

The XML language is growing very fast, so before starting the ETL process make sure your ETL tools are able to process XML.
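
As a small sketch (the document, element and attribute names are invented for illustration), an XML extract can be read with Python's standard xml.etree.ElementTree module and flattened into rows ready for the transformation step:

    import xml.etree.ElementTree as ET

    # A tiny, hypothetical customer extract; real feeds are normally generated
    # by tools and validated against a DTD or an XML Schema.
    DOC = """
    <customers>
      <customer id="1001">
        <name>Alice</name>
        <balance>3500.01</balance>
      </customer>
      <customer id="1002">
        <name>Bob</name>
        <balance>120.00</balance>
      </customer>
    </customers>
    """

    root = ET.fromstring(DOC)
    rows = []
    for customer in root.findall("customer"):
        rows.append({
            "id": customer.get("id"),
            "name": customer.findtext("name"),
            "balance": float(customer.findtext("balance")),
        })
    print(rows)  # ready to be transformed and loaded into a staging table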

Web log and clickstream data warehousing

Nowadays every corporation has a Web site, and it can also be used as a data source when creating databases. The Web log records every user that enters the site, while a clickstream is the recording and storing of what the user clicks. It is a very good source of information about users' (customers') needs.

Web logs are standardized and described through many elements arranged in columns, such as: date, time, user IP address, server IP address, cs-method (POST or GET - only those are stored in a clickstream data mart) and many other, more specific ones. The ETL process is quite different than for many other sources. Most of the fields in a Web log may be extracted without any transformation, with unfortunately two exceptions: cs-uri-query and cs(cookie).

Those two are called the golden nuggets of the Web log because they are very valuable: they contain customized name-value pairs.
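
As a small sketch (the field order follows a typical W3C extended log line, but the actual order must be checked against the log's #Fields header), the name-value pairs hidden in cs-uri-query and cs(cookie) can be split out in Python like this:

    from urllib.parse import parse_qs

    # Hypothetical log line: date, time, client IP, method, uri-stem,
    # uri-query, status, cookie (order assumed from a #Fields header).
    line = ('2024-01-15 10:42:07 192.0.2.10 GET /product.asp '
            'id=1234&ref=newsletter 200 session=abc123;campaign=spring')

    date, time_, c_ip, method, uri_stem, uri_query, status, cookie = line.split()

    # cs-uri-query is a standard URL query string -> dict of name/value pairs.
    query_pairs = {name: values[0] for name, values in parse_qs(uri_query).items()}

    # cs(cookie) holds name=value pairs separated by semicolons.
    cookie_pairs = dict(part.split("=", 1) for part in cookie.split(";") if "=" in part)

    print(query_pairs)   # {'id': '1234', 'ref': 'newsletter'}
    print(cookie_pairs)  # {'session': 'abc123', 'campaign': 'spring'}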