What we call data transformation activities in the ETL process, is a set of technical and business rules that have been extracted from the source systems and software. Business Intelligence projects present the best opportunities to remove dead and useless data to bring new light to business people information requirements. Data transformation activities should be properly implemented to produce clean, condensed, new, complete and standardized data, respectively.
- Areas that are covered by Data transformation include:
- cleansing - it is by definition transformation process in which data that violates business rules is changed to conform these rules. It is usually done by ETL programs that determine or derive correct data values and then write them into the BI target databases.
- summarization - Values are summarized to obtain total figures which are subsequently calculated and stored at multiple levels as business fact in multidimensional fact tables.
- derivation - new data is created from existing (detailed) source data during this process by calculations, program logic or table lookups. Some examples of derivation may be: calculating profit from income and expense items or calculating customer's age based on their date of birth and the current year.
- aggregation - data elements for customers may be aggregated from multiple source files and databases (e.g. Customer Master File, Sales File, Prospect File). Sometimes (in multidimensional database terminology) this term also refers to roll-ups of data values.
- integration - the expected result of this part is to have each and unique data element known by one standard name with one standard definition and approverd name. Data integration forces the need to reconcile different data names and values for the same data element. Also, each element of the data should be associated with its source databases as well as with its BI target databases. Data standardization should always be a business objective.
Source data transformation could be an endless process and that is why the time required for the ETL procedures is very often underestimated. It happens because of not taking into account the time required for huge amounts of transformations needed to enforbe business data domain nad integrity rules.
Data transformation specifications that are given to the ETL developer, should never be limited strictly to data conversion rules. The ratio of particular effort of data transformation should be as high as 80 percent of enforcing business data domain rules and business data integrity rules and just 20 percent of technical rules of conversion of the data. Therefore, primary estimated ETL transformation effort should be multiplied by four and one may expect missing deadlines due to the amount of dirty data.
Full-time involvement from the business representative should also be insisted on, since the actual getting the right business representative is essential to ensure that the whole ETL process will run smoothly and follow strict business rules. When a data quality initiative is driven by business people, they are likely to assist to the ETL transformation process as, unlike the IT specialists, they understand what the data really means.