In the previous tutorial on data warehousing, we learnt about star schema and snow flaking.
In this tutorial, we would discuss the basic components of data warehouse. The components constituting the data warehouse structure were mentioned briefly in data warehouse tutorial-1. These components have flow of data as:
Source system-> Staging Area-> ODS-> ETL Process-> Presentation Area
Staging Area or data staging area is a place where data can be stored. It constitutes set of processes called ETL (Extract, transform, load). It is an interface between operational source system and presentation area. The staging area can be understood by considering it a kitchen of a restaurant.
As in the kitchen, raw foods are taken and processed to prepare the meal, in a similar way staging area can be used to keep the non-integrated raw data for it to be processed and transferred to data warehouse. The data staging area is used by skilled professionals and not the business users. It does not even provide querying and presentation services for the business users similar to the way customers are kept away from the kitchen.
ODS stands for Operational Data Store. It is a place where the integrated copies of operational data are stored. ODSs are updated on a frequent basis. It can be used to prepare operational reports and to feed the operational data to the data warehouse. It is used for real time processing. Data from data warehouse can also be used in ODS.
ODS thus can be considered an interface between operational system and data warehouse or can be used as a partition of data warehouse itself. It can also be defined as a place where granular atomic data can be stored.
ETL stands for extraction, transformation and loading. These processes are required for transferring the operational data to data warehouse or presentation area.
Extraction: It is a process which involves reading and understanding the source data. It also involves copying the source data which can be put into the staging area for further manipulation.
Transformation: During the ETL process, a number of processes occur under transformation:
1) Cleansing data – Correcting spelling, checking missing data, checking & resolving the domain conflicts.
2) Combination of data from multiple sources.
3) Deduplication of data.
4) Assignment of warehouse keys.
Loading: The integrated data is loaded into the presentation area of data warehouse. Before loading the transformed data, if the data is normalized, it is termed as enterprise data warehouse.
Data staging area also involves sorting and sequential processing of data. It may consist of flat files. It consumes lots of time and is more costly. It is better to improve the presentation area than taking time in normalizing the data before loading to the presentation area.
Loading is basically the process of loading the data in the data warehouse to each of the data marts. Indexing should be there in the data mart before arrival of data for better query performance. Thus the loaded data is indexed and supplied for publishing.
Data Mart and Presentation Area
The presentation area actually constitutes data warehouse and data marts. The organized and integrated data from the data warehouse or presentation area is available to use for querying by users, report writers and other analytical applications.
Presentation area is nothing but a series of integrated data marts. Data mart can be defined as a wedge of whole presentation area pie. In other words, it is a data from a single business process. E.g. sales data mart. Data from the data warehouse is divided into different data marts depending on the functions of business.
Continue reading the next tutorial for data modelling and normalization.