DATA WAREHOUSE ETL

A data warehouse with ETL is, in other words, a storage area for data together with a set of procedures known as extract-transform-load (ETL). Their purpose is to support analysis and simplify reporting. The data warehouse holds all sorts of data, providing an organized, standardized, clean, and consistent source of information for further processing.

The most important requirement for a data warehouse with ETL processes in place is that it provide access to the data through a presentation layer rather than through low-level database queries.

The process starts with data extraction. Extraction may be performed from many sources: relational databases, flat-file systems, a system such as Information Management System (IMS), or even data structures such as the Indexed Sequential Access Method (ISAM) or the Virtual Storage Access Method (VSAM).
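As a minimal sketch of multi-source extraction, the snippet below pulls rows from a simulated flat file (CSV text) and a simulated relational source (an in-memory SQLite database) into one uniform list of records; the table name, columns, and sample data are hypothetical.

```python
import csv
import io
import sqlite3

def extract_from_flat_file(text):
    """Extract rows from a CSV flat file (here simulated with a string)."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_from_database(conn):
    """Extract rows from a relational source via a plain SQL query."""
    cur = conn.execute("SELECT id, name FROM customers")
    return [{"id": str(r[0]), "name": r[1]} for r in cur.fetchall()]

# Simulated heterogeneous sources (hypothetical sample data).
flat = "id,name\n1,Alice\n2,Bob\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(3, "Carol"), (4, "Dave")])

# Both sources end up as the same record shape, ready for transformation.
rows = extract_from_flat_file(flat) + extract_from_database(conn)
print(len(rows))  # 4
```

Unifying the record shape at extraction time is what lets the later transformation steps treat all sources the same way.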

For wider-ranging operations, extraction can even be performed from external systems by web spidering (gathering data from the World Wide Web using programs called web crawlers) or screen scraping (gathering data from a terminal's on-screen output). The extracted information is then analyzed, and records that do not meet the requirements are rejected.
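The analyze-and-reject gate can be sketched as a simple validation filter; the rules here (a non-empty id and a plausible e-mail field) are assumptions chosen for illustration.

```python
def validate(row):
    """Hypothetical acceptance rules: non-empty id, well-formed email."""
    return bool(row.get("id")) and row.get("email", "").count("@") == 1

rows = [
    {"id": "1", "email": "a@example.com"},
    {"id": "",  "email": "b@example.com"},  # missing key field -> rejected
    {"id": "3", "email": "not-an-email"},   # malformed value -> rejected
]
accepted = [r for r in rows if validate(r)]
rejected = [r for r in rows if not validate(r)]
print(len(accepted), len(rejected))  # 1 2
```

Keeping the rejected records (rather than silently dropping them) makes it possible to audit why data was turned away.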

Data that passes must be identified and then copied for the next step: transformation. Transformation comprises numerous procedures that prepare the data to be loaded. The first is to clear the data of unwanted information, also known as dirty data. Data cleansing includes such basic operations as correcting misspellings and resolving miscellaneous conflicts among data records. Data coming from different sources also has to be combined, and any duplicates eliminated. Sometimes data is incomplete or missing parts; that too is dealt with in this step. Some data needs no manipulation whatsoever.
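The cleansing operations described above can be sketched as a single pass over the records; the spelling-fix lookup table, the duplicate key, and the default value are all hypothetical examples.

```python
# Hypothetical spelling-fix lookup for a "city" column.
CITY_FIXES = {"Warsav": "Warsaw", "Lodnon": "London"}

def cleanse(rows):
    """Fix misspellings, drop duplicates, and fill incomplete records."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:              # eliminate duplicate records
            continue
        seen.add(row["id"])
        row["city"] = CITY_FIXES.get(row["city"], row["city"])  # misspellings
        row.setdefault("country", "unknown")  # fill in missing parts
        out.append(row)
    return out

dirty = [
    {"id": 1, "city": "Warsav", "country": "PL"},
    {"id": 1, "city": "Warsav", "country": "PL"},  # duplicate
    {"id": 2, "city": "Lodnon"},                   # incomplete record
]
clean = cleanse(dirty)
print(clean)
```

Real cleansing logic is usually table-driven like the lookup here, so that business users can maintain the correction rules without touching code.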

Preparing the data for the next steps is dominated almost entirely by simple sorting and processing. Often, instead of relational operations, a data warehouse ETL pipeline relies only on massive flat-file processing. In other cases the data is transformed into a predefined normalized structure called an enterprise data warehouse. Normalized structures lengthen transformation time, because data that already arrives in some defined form has to be broken apart and reassembled to fit the target schema. The key is to find a balance between the staging area, where the data is prepared, and the presentation area, where the most business-critical decisions are made.
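Breaking an already-defined record apart and reassembling it into a normalized structure can be sketched as follows; the denormalized order/customer schema is an assumption for illustration.

```python
# Denormalized staging records: customer data repeated on every order.
staging = [
    {"order_id": 10, "customer_id": 1, "customer_name": "Alice", "amount": 99.0},
    {"order_id": 11, "customer_id": 1, "customer_name": "Alice", "amount": 15.5},
]

# Break the records apart into two normalized tables.
customers = {r["customer_id"]: {"id": r["customer_id"],
                                "name": r["customer_name"]}
             for r in staging}
orders = [{"id": r["order_id"], "customer_id": r["customer_id"],
           "amount": r["amount"]} for r in staging]

print(len(customers), len(orders))  # 1 2
```

The repeated customer fields collapse into a single row, while the orders keep only a key referencing it, which is exactly the extra restructuring work that lengthens transformation time.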

After transformation comes the next step: loading the data. Loading the transformed data into the target facilities goes hand in hand with indexing it for subsequent queries. This step is final: the data is ready to be published to the user community, and new assumptions and any changes that occurred are revealed.
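A minimal load sketch, assuming a SQLite target and a hypothetical fact table: the rows are bulk-inserted first and the index is built afterwards, so that subsequent user queries are fast.

```python
import sqlite3

rows = [(1, "Alice", 99.0), (2, "Bob", 15.5)]  # transformed data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
# Index after the bulk load: building it once is cheaper than
# maintaining it row by row during the insert.
conn.execute("CREATE INDEX idx_sales_customer ON fact_sales (customer)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # 2
```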

Any flaw in this complicated process translates into massive delays, which is why pre-extraction data profiling is vital. A data warehouse ETL pipeline also provides a homogeneous environment for heterogeneous data. Designers should estimate the quantity of data to be processed by the pipeline; sometimes it is several terabytes (TB) per hour, sometimes less. Growth of the database can be exponential, so a powerful central processing unit, vast disk space, and high memory bandwidth should be ensured.
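Pre-extraction data profiling can start as simply as counting nulls and distinct values per column, which is enough to flag a source that would stall the pipeline later; the sample records below are hypothetical.

```python
def profile(rows):
    """Minimal column profile: null count and distinct count per column."""
    stats = {}
    for col in rows[0]:
        values = [r.get(col) for r in rows]
        stats[col] = {
            "nulls": sum(v is None or v == "" for v in values),
            "distinct": len(set(values)),
        }
    return stats

sample = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "a@x.com"},
]
print(profile(sample))
```

A high null count or an unexpectedly low distinct count in a supposed key column is exactly the kind of flaw worth catching before extraction begins.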