Real Time Data Warehouse – Integration with OLTP System

Following information protected by USA and international law. Any using of this publication or its part can be allowed per request. Consultation available - leave your email address as a comment.

                                                                     Appreciation
First of all, let me express my great thanks to authors of modern data architecture concepts based on which I provided analysis posted below:
            Bill Inmon – Data Warehouse (DW) [1],
            Ralph Kimball – Data Mart (DM) [2],
            Dan Linstedt – Data Vault (DV) [3 – 7],
            Lars Rönnbäck – Anchor Modeling (AM) [8],
and all people who interesting and help to build concepts in Data Warehousing, and provide its implementation.
My special thanks to Justin Langseth who observed real-time DW challenges and solutions [9].

                                                                          Analysis
Do we need Real Time DW (RTDW)? The necessity in RTDW comes from business: business management team always interested to know the status of business and all events happened (definitely, not all of them simultaneously, but any of them at any time). The later reflection of the status and events makes things worse.
What should we expect from a RTDW? Let’s consider the answer step-by-step.
First of all, it should be DW, that means that data collected and integrated into one conglomerate in efficient normalized form and the data completely identical to the business processes they reflect (more about consistency data and business processes see in the post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html on my blog). All data should reflect business from historical prospective that means they should reflect all business events happened during predefined time. The data should be available for business analysis from any prospective. But usually, DW has data a day later that does not allow business management to react immediately on changes in the business process.
The second requirement is Real Time (RT) that means RTDW data always is a perfect (without distortion or delay) mirror of the business processes – real precision of reflection defined by ability IT system reflect changes in business, and requirements of business management (for example, some monthly reporting system may not need to have immediate reflection of changes in business). Such RT system is OLTP one, which immediately reflects the business process and always normalized and has current business data. But OLTP does not contain historical data.
The regular way to transport data from OLTP to the DW is a usage ETL or other mechanism with the similar function. But the usage of such mediator takes time and DW becomes “almost” real time that it’s not a perfect way from business management prospective.

                                                                          Approach
Let’s try to use another approach – merging OLTP and DW to get RTDW. To do this, let’s consider OLTP and DW as one data processing system. Why not? Powerful DW servers usually have enough capability to save OLTP data extra to its existing stuff.
The extra requirements can come as performance requirements, because DW itself requires powerful processor, much more powerful than the regular batch processing systems. The reason of such requirement is that DW uses hash joins as performance efficient execution plan for joining numerous queries. Operating in memory, hash joins increasingly shifting requirements for processor to the level of scientific calculations which process in memory zillion of iterations. I detected such patterns analyzing performance of terabytes level DW. That finding should be used to properly procure and configure DW servers. With such increased power of processor, adding OLTP system should not be a burden for DW.
Let’s consider the data model of RTDW which is an OLTP and DW simultaneously. Requirements to such RTDW come from both OLTP and DW. OLTP require normalization using at least 3-d normal form. DW requires maintaining historical data and having on each row date/time stamp and indicator of a place where data appeared. As result, we get such data model (some variations are possible, but main features shown on the Figure 1):



Each row in RTDW has surrogate and business keys, normalized business data, and row label (date/time stamps of row-create event; last row-update event; business data effective date and expiration dates; place where the row appeared;  ID of user who last inserted/updated/deleted row).
The picture shows transformation the status of identified by surrogate and business keys business row in the table of RTDW (any table, because all tables of RTDW processed in the same manner):
- Initially entered row becomes a “current” one and reflects the latest status of the business. Status “current” means that the row can be considered as OLTP row. It has effective date 01/01/0001 (for not ancient data in the DW it’s enough) and expiration date 12/31/9999 (the latest future we can express in our calendar).
- All changes of “current row” created two rows: terminated “current row” and new “current row”.
o Terminated “current row” – it’s the “current row” with business data expiration date changed to the date/time of moment when this business data become not effective anymore (usually, if other is not predefined for particular RTDW, it’s the date/time when change happened) and such data becomes the latest changed data. Such row becomes historical that means it can be considered as RTDW row. When it happened first time, the row becomes initially changed data, all next changes of the “current row” become the next changed rows, and the latest of them becomes the latest changed data.
o New “current row” – it’s the “current row” with new business data (result of provided changes), new effective date which is always the same as expiration date of the terminated “current row” (otherwise, the history of data in the RTDW would be corrupted). Expiration date is open the latest possible date in the calendar - 12/31/9999. Definitely, place and user ID can be the new ones because they reflect place and user where the row appeared and who provided changes.
o Deleting “current row” also created terminated “current row” as described above.
- After first change of “current row”, the terminated “current row” becomes initially changed row; the latest terminated “current row” becomes the latest one.
All terminated “current rows” together as one package is the “data history” (history of the table – traditional for DW, but without row which contains current business data). Each “current row” reflects current business data (traditional for OLTP), just with not usual for traditional OLTP row label described above (such label makes structure of the “current row” absolutely identical with the “data history”). Together, “data history” and “current row” is the RTDW – completely integrated history of all data appeared in the business process. As it represented in completely normalized form, it becomes the “single version of truth”.
Such data processing system – RTDW – has two functional layers: OLTP (“current data”) and DW (“current data” and “data history” together). As OLTP real time data processing also generates “data history” (which is a major part and feature of DW) in real time, RTDW is a real time “single version of truth”, and time can be shifted back, into the past, looking at the “single version of truth” at any time in the past (to do it, just necessary to require business data at any moment and get result based on business data effective date and expiration dates in the any table(s).
Logically, “current row” and “data history” should be in the same table, but physically, for improvement of OLTP, it’s possible to split data for two tables with the same structure: one for OLTP, other – for “data history” (last one can have suffix or prefix identified that it’s a “data history”). In this case, both tables should be logically united (for example, using SQL UNION operator) into one view – RTDW view containing real time “single version of truth” at any time covered by RTDW.

                                                                        Extensions
RTDW can be used as one whole formation (as described above) and also it can be extended into different directions. Let’s consider first coming ones.
Initial imagination of RTDW makes impression that this is one formation which will work locating in one physical location (database/server), and which has the same table structure for the same business entity on different layers (OLTP and “data history”). But it’s not a mandatory requirement, RTDW can be distributed. Distribution can happen either because of hardware requirements, or because of business specific.
Hardware requirements (lack of space or power of processors) causes splitting data horizontally, on timing bases (for example, split “current data” and “data history”; or split close “data history” (one or two years) and very old “data history” (for archiving). Also, hardware requirements may split data vertically, separating different groups of tables (for example, operational tables and accounting tables). In case of hardware requirements, data may be integrated using database links the similar way as for one database.
Splitting data because of business requirements also can be implemented horizontally, aggregating data on different management levels (for example, department’s level and corporation level) and having different structure of tables on different levels of business management (data aggregation can be provided for higher level).
Splitting data because of business requirements also can be provided vertically that can be required by lack or absence of integration of business (for example, when corporation has not related to each other business – machinery building and pharmacy). In this case, data split provided following business split, and each part of RTDW used for appropriate part of business.
For both reasons of splitting data, horizontal and vertical splitting can be combined if necessary. In splitting data vertically, nomenclature of different parts of RTDW will be different because different parts related to the different parts of business, even may be to different businesses.
It’s necessary to keep in mind that more splitting cause more complicated and less efficient structure of RTDW. Such structure requires more efforts both for development and maintenance of RTDW (you can read some approaches on production support in the post http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html).
ETL in RTDW is not so much necessary as in the regular DW because OLTP, as a source of data, can be integrated with “data history” in the same database without Staging Area (SA). ETL may become necessary in a distributed DW, but in this case, it’s a big chance to get “near RTDW” because ETL runs taking time. The most natural way of data transferring in RTDW is the using database links to remote parts of RTDW. At the same time, ETL has an advantage over RTDW internal means (database links) because it independent from RTDW processes. Communication through Web “clouds” also can be considered.
Replication of OLTP data changes into “data history” can be provided using different approaches, first of all, triggers and user-interface programs. It would be very useful for RTDW to have Changed Data Capture (CDC) mechanism (one more method of replication data in database) oriented on RTDW. It means that CDC should provide replication of changed data into “data history”, and keep tracking columns (date/time, place, and user). Such RTDW oriented CDC will make RTDW data processing smoother, more natural and efficient for design, implementation and daily using.
In case of transferring data in RTDW, the most critical factor is the real time reaction on any business event – reflection the data change in the “data history “should follow immediately after OLTP (“current row”) change. This is the most valuable feature of RTDW for business management.

                                                                  Transformations
RTDW does not need internal transformations because the structure of OLTP and “data history” are the same. But for processing “data history” (with or without “current data”) for reports and queries using normalized data can meet performance issue. To resolve the issue, normalized data should be transformed to the SELECT-oriented structure of data – DM (star schema), which becomes output-oriented layer of RTDW.
Traditional way (create dimensions, then fact table step-by-step) can be used, but data processing will take time and RTDW will lose “real time” feature. Definitely, reports usually do not need to present result of real-time data, but ad hoc queries mostly running just to get real-time data to see current status of business. Let’s see whether it’s possible to transform RTDW to the Real Time DM (RTDM).
All rows in RTDW have surrogate and business keys and business data effective and expiration dates. These columns can be used to build Slow Changed Dimensions (SCDs) type 2 (probably, the most popular type). We don’t consider other types of dimensions because RTDW contains all necessary data to build them and process will be the similar. Using tables which don’t have surrogate key columns from other tables (Hubs in DV terminology), as well as tables which linked to the first ones (Hubs) and don’t have business data (Links in DV), we can create SCD not changing data, just joining them. The business data columns from other (Satellites in DV) tables should be added to these joins. As result, joining existing RTDW “current data” and “data history”, we get ready to use SCD. As RTDW contains “single version of truth” that means that it exists only as one instance, each dimension built form RTDW will be the same for all DMs built from RTDW, that means it’s a Conformed Dimension (CD) in DM terminology.
Creating CDs should take time (for large RTDW, pretty long), that will negatively impact the Real Time feature. To resolve the issue, it’s possible to build CD as materialized views. Such way, CD will be refreshed with latest data changes immediately after used RTDW data changed. The other way of creating CDs can be using triggers and transfer RTDW changes into CD that will give more control over transferring process.
As result, we get Real Time CD (RTCD) which can be used for all DMs built on RTDW.
The other necessary component of DM is a fact table, and it also can be built the similar way using triggers on the RTDW tables containing fact data (materialized views also can be used, but linking the fact table to RTCD through SQL may become more complicated). Definitely, initial loading the fact table will be necessary if trigger or whole DM will be built later than RTDW started to work.
In case of using AM or any other method of data normalization in RTDW, the creating of RTDM will be the similar because source (RTDW) have the same data, just structured (linked keys and  business data) differently, and result (RTDM) will be the same, just materialized views will be different reflecting specific of RTDM data structure. Hence, transformation RTDW into RTDM will work even methods of normalization RTDW will be different in different RTDWs.
Only incremental data changes will be loaded into DM in real time. Complete recreation of DM from RTDW also available, but it will take long time and can be used only if DM backup copy is not available. In some cases, these two ways can be combined: DM restored from the last backup copy, and the data changes after last backup should be loaded in incremental manner.
To save space for CDs, they can be built as views, not normalized view. It will save space, but performance of SELECT statement will be notably lower. Nevertheless, such approach can be used for small amount of data, or if RTDM created temporarily.

                                                                      Advantages
RTDW has advantages vs. other structures of DW:
1. It contains information about status of business at any moment of business history (“always single version of truth”), and presents it to the business management that increases efficiency of management decisions.
2. Its structure is homogenous that simplifies software for its support (for example, RTDW requires only one type of program module instead of three types operating with separately existing OLTP, SA, and DW, because structure of OLTP layer and “data history” is the same, and SA is not necessary for RTDW).
3. Its three layers (input-oriented “current data”; analysis-oriented “data history”; and output-oriented “star-schema”) organically merged and become one multifunctional data conglomerate.
4. It can be easy segmented to be completely consistent to organization of business process.
5. It can be built based on different methods of data normalization for “current data” and “data history”, that makes it flexible for using in different data structure requirements and styles, and allows to aggregate it with preexisting data processing systems (for example, DMs).
6. Using different data modeling approaches in RTDW allows to combine advantages of all of them and get new combined effect (for example, using DV can help not only to nail major entities represented by hubs and links and efficiently organize DW data, but also define conformed dimensions which will be used for DM’s creation).
7. It can be built from scratch (probably, the clearest way to build RTDW), or from an existing layer (for example, already existing OLTP, or existing DM).
8. Building RTDW requires fewer efforts than summarized efforts for creating OLTP, SA, DW, and DM, that increases efficiency of RTDW.
9. As RTDW covers all functions of OLTP, DW and DM, it can be used instead of all of them that increase its efficiency, and reduce expenses for its support.
10. Existence of row label in each row allows to indicate its information for user that increases the value of information presented to user (for example, it’s possible to show on user’s screen when and where showed data created, and who created it).

                                                                        Conclusion
RTDW combines existing data models: OLTP, Normalized DW (for example, DV), DM with Conformed Dimensions (from DV point of view – Hubs), AM (if you split rows up to the atomic level – columns), and probably much more what we did not consider yet in DW analysis). Combining advantages of each approach, it’s possible to build formation which efficiently supports Management’s requirements and gives to Management an “Always Single Version of Truth”. Using different forms of business data as described above can be expressed in formula:
                                                  OLTP+DW=RTDW=RTDM

                                                                         References
1. Bill Inmon (2004-09-09). "The Single Version Of The Truth". Business Intelligence Network (Powell Media LLC).
2. Kimball, Ralph; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2nd ed.). Wiley. ISBN 0-471-20024-7.
3. Linstedt, Dan. "Data Vault Series 1 – Data Vault Overview". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
4. Linstedt, Dan. "Data Vault Series 2 – Data Vault Components". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
5. Linstedt, Dan. "Data Vault Series 3 – End Dates and Basic Joins". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
6. Linstedt, Dan. "Data Vault Series 4 – Link Tables". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
7. Linstedt, Dan. "Data Vault Series 5 – Loading Practices". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
8. L. Rönnbäck and O. Regardt and M. Bergholtz and P. Johannesson and P. Wohed (2010). "Anchor modeling - Agile information modeling in evolving data environments". Data & Knowledge Engineering 69: 1229–1253. ISSN 0169-023X.
9. Langseth, J., "Real-Time Data Warehousing: Challenges and Solutions", DSSResources.COM, 02/08/2004

No comments:

Post a Comment