Extract Transform Load
The term ETL which stands for extract, transform, and load is a three-stage process in database usage and data warehousing. It enables integration and analysis of the data stored in different databases and heterogeneous formats. After it is collected from multiple sources (extraction), the data is reformatted and cleansed for operational needs (transformation). Finally, it is loaded into a target database,data warehouse or a data mart to be analyzed. Most of numerous extraction and transformation tools also enable loading of the data into the end target. Except for data warehousing and business intelligence, ETL Tools can also be used to move data from one operational system to another.
The purpose of the extraction process is to reach to the source systems and collect the data needed for the datawarehouse.
Usually data is consolidated from different source systems that may use a different data organization or format so the extraction must convert the data into a format suitable for transformation processing. The complexity of the extraction process may vary and it depends on the type of source data. The extraction process also includes selection of the data as the source usually contains redundant data or data of little interest.
For the ETL extraction to be successful, it requires an understanding of the data layout. A good ETL tool additionally enables a storage of an intermediate version of data being extracted. This is called "staging area" and makes reloading raw data possible in case of further loading problem, without re-extraction. The raw data should also be backed up and archived.
The transform stage of an ETL process involves an application of a series of rules or functions to the extracted data. It includes validation of records and their rejection if they are not acceptable as well as integration part. The amount of manipulation needed for transformation process depends on the data. Good data sources will require little transformation, whereas others may require one or more transformation techniques to to meet the business and technical requirements of the target database or the data warehouse. The most common processes used for transformation are conversion, clearing the duplicates, standardizing, filtering, sorting, translating and looking up or verifying if the data sources are inconsistent. A good ETL tool must enable building up of complex processes and extending a tool library so custom user's functions can be added.
The loading is the last stage of ETL process and it loads extracted and transformed data into a target repository. There are various ways in which ETL load the data. Some of them physically insert each record as a new row into the table of the target warehouse involving SQL insert statement build-in, whereas others link the extraction, transformation, and loading processes for each record from the source. The loading part is usually a bottleneck of the whole process. To increase efficiency with larger volumes of data we may need to skip SQL and data recovery or apply external high-performance sort that additionally improves performance.