Sunday, August 19, 2012

Real Time Data Warehouse – Integration with OLTP System

The following information protected by USA and international law. Any usage of this publication or its part can be allowed per request. Consultation available - leave your email address as a comment.

                                                                 Appreciation

First of all, let me express my great thanks to authors of modern data architecture concepts based on which I provided analysis posted below:
            Bill Inmon – Data Warehouse (DW) [1],
            Ralph Kimball – Data Mart (DM) [2],
            Dan Linstedt – Data Vault (DV) [3 – 7],
            Lars Rönnbäck – Anchor Modeling (AM) [8],
and all people who are interesting and help to build concepts in Data Warehousing, and provide its implementation.
My special thanks to Justin Langseth who observed real-time DW challenges and solutions [9].

                                                                       Analysis

Do we need a Real Time DW (RTDW)? The necessity in a RTDW comes from business: a business management team always interested to know about the status of their business and about all business events (definitely, not all of them simultaneously, but some of them at the same time). The later reflection of the status and events makes business information worse.

What should we expect from a RTDW? Let’s consider the answer step-by-step.

First of all, it should be a DW, that means that data collected and integrated into one conglomerate in efficient normalized form, and the data completely identical to the business processes they reflect (more about consistency data and business processes see in the post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html ). All data should reflect the business from historical prospective that means they should reflect all business events happened during predefined time. The data should be available for business analysis from any prospective. But usually, a DW has data a day later that does not allow business management to react immediately on changes in the business process.

The second requirement is a Real Time (RT) that means a RTDW data always is a perfect (without any distortion or delay) mirror of the business processes – real reflection accuracy is determine by the ability of IT system to reflect changes in the business, as well as requirements of doing business (for example, some monthly reporting system may not need to be an immediate reflection of changes in the business). Such a RT system is an OLTP one, which immediately reflects the business process and usually normalized and has current business data. But an OLTP does not contain historical data.
The regular way to data transfer from an OLTP to a DW is to use an ETL or other mechanism with the similar function. But the usage of such an intermediary is time consuming and a DW becomes “almost or near” real time, which is not the ideal way to control from a business perspective.

                                                                     Approach

Let’s try to use another approach – merging an OLTP and a DW to get a RTDW. To do this, let’s consider an OLTP and a DW as one data processing system. Why not? Powerful DW servers usually have enough capability to save an OLTP data extra to its existing stuff.

Extra requirements to a server performance can come, because a DW itself requires a powerful processor, much more powerful than regular batch processing systems. The reason of such requirement is that a DW uses hash joins as a performance efficient execution plan for joining numerous queries. Hash joins operate in the memory, increasingly shifting requirements for processor to the level of scientific calculations which process in memory zillion of iterations. I detected such patterns when analyzed performance of a DW with a few terabytes of data. That finding should be used to properly procure and configure DW servers. With such increased power of the processor, addition an OLTP system should not be a burden for a DW.

Let’s consider the data model of a RTDW which is an OLTP and a DW simultaneously. Requirements to such RTDW come from both an OLTP and a DW. An OLTP requires normalization using, at least, the 3-rd normal form. A DW requires maintaining historical data and having on each row date/time stamp and indicator of a place where data have appeared. As the result, we get such data model (some variations are possible, but main features shown on the Figure 1):




Each row in a RTDW has surrogate and business keys, normalized business data, and row label (date/time stamps of row-create event; last row-update event; business data effective date and expiration dates; place where the row appeared;  ID of the user who last inserted/updated/deleted row).

The picture shows transformation of the status of identified by surrogate and business keys business row in the table of RTDW (any table, because all tables of RTDW processed in the same manner):

- The initially entered row becomes a “current” one and reflects the latest status of the business. The status “current” means that the row can be considered as the OLTP row. It has effective date 01/01/0001 (for not ancient data in the DW it’s enough) and expiration date 12/31/9999 (the latest future we can express in our calendar).

- All changes of the “current row” created two rows: the terminated “current row” and the new “current row”.
  • The terminated “current row” – it’s the “current row” with business data expiration date changed to the date/time of moment when this business data become not effective anymore (usually, if other is not predefined for particular RTDW, it’s the date/time when change happened), and such data become the latest changed data. Such row becomes a historical one that means it can be considered as RTDW row. When it happened first time, the row becomes initially changed data, all next changes of the “current row” become the next changed rows, and the latest of them becomes the latest changed data.
  • New “current row” – it’s the “current row” with new business data (result of provided changes), new effective date which is always the same as expiration date of the terminated “current row” (otherwise, the history of data in the RTDW would be corrupted). The expiration date is opened the latest possible date in the calendar - 12/31/9999. Definitely, the place and the user ID can be the new ones because they reflect the place and the user where the row appeared and who provided changes.
  •  Deleting the “current row” also creates the terminated “current row” as described above.
- After the first change of the “current row”, the terminated “current row” becomes initially changed row; the latest terminated “current row” becomes the latest one.

All terminated “current rows” together as one package is the “data history” (history of the table – traditional for DW, but without the row which contains current business data). Each “current row” reflects current business data (traditional for OLTP), just with not usual for the traditional OLTP row label described above (such label makes structure of the “current row” absolutely identical with the “data history”). Together, the “data history” and the “current row” are the RTDW – completely integrated history of all data appeared in the business process. As it represented in the completely normalized form, it becomes the “single version of truth”.

Such data processing system – a RTDW – has two functional layers: a OLTP (“current data”) and a DW (“current data” and “data history” together). As the OLTP real time data processing also generates the “data history” (which is a major part and a feature of DW) in real time, a RTDW is a real time “single version of truth”, and the time can be shifted back, into the past, looking at the “single version of truth” at any time in the past (to do it, just necessary to require business data at any moment and get result based on the business data effective date and expiration dates in any table(s).

Logically, the “current row” and the “data history” should be in the same table, but physically, for improvement of an OLTP, it’s possible to split data for two tables with the same structure: one for an OLTP, another – for a “data history” (last one can have the suffix or prefix identified that it’s a “data history”). In this case, both tables should be logically united (for example, using the SQL UNION operator) into one view – the RTDW view containing real time “single version of truth” at any time covered by RTDW.

                                                                    Extensions

A RTDW can be used as one whole formation (as described above) and also it can be extended into different directions. Let’s consider the first coming ones.

Initial imagination of a RTDW makes impression that this is one formation which will work locating in one physical location (database/server), and which has the same table structure for the same business entity on different layers (OLTP and “data history”). But it’s not a mandatory requirement, a RTDW can be distributed. Distribution can happen either because of the hardware requirements, or because of the business specific.

The hardware requirements (lack of space or power of processors) causes splitting data horizontally, on timing bases (for example, split “current data” and “data history”); or split a close “data history”  (one or two years) and a very old “data history” (for archiving). Also, the hardware requirements may split data vertically, separating different groups of tables (for example, operational tables and accounting tables). In case of the hardware requirements, the data may be integrated using database links the similar way as for one database.

Splitting the data because of the business requirements also can be implemented horizontally, aggregating data on different management levels (for example, a department’s level and a corporation level), and having a different structure of tables on different levels of the business management (the data aggregation can be provided for higher level).

Splitting the data because of the business requirements also can be provided vertically that can be required by lack or absence of the business integration (for example, when a corporation has not related to each other businesses – machinery building and pharmacy). In this case, the data split can be provided following the business split, and each part of the RTDW used for appropriate part of business.

For both reasons of the data splitting, horizontal and vertical, the split can be combined if necessary. In splitting data vertically, nomenclature of different parts of the RTDW will be different because different parts related to the different parts of the business, even may be to different businesses.
It’s necessary to keep in mind that more splitting cause more complicated and less efficient structure of RTDW. Such structure requires more efforts both for development and maintenance of a RTDW (you can read some approaches on production support in the post http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html).

An ETL in a RTDW is not so much necessary as in a regular DW because an OLTP, as a source of data, can be integrated with “data history” in the same database without Staging Area (SA). An ETL may become necessary in a distributed DW, but in this case, it’s a big chance to get “near RTDW” because an ETL runs taking time. The most natural way of data transferring in a RTDW is the usage database links to remote parts of RTDW. At the same time, an ETL has the advantage over RTDW internal means (database links) because it independent from RTDW processes. The communication through Web “clouds” also can be considered.

Replication of OLTP data changes into a “data history” can be provided using different approaches, first of all, triggers and user-interface programs. It would be very useful for a RTDW to have Changed Data Capture (CDC) mechanism (one more method of the replication of data in a database) oriented on a RTDW. It means that the CDC should provide replication of changed data into “data history”, and keep tracking columns (date/time, place, and user). Such RTDW oriented CDC will make a RTDW data processing smoother, more natural and efficient for design, implementation and the daily usage.

In case of transferring data in a RTDW, the most critical factor is the real time reaction on any business event – reflection the data change in a “data history“ should follow immediately after OLTP (“current row”) change. This is the most valuable feature of a RTDW for the business management.

                                                              Transformations

A RTDW does not need internal transformations because the structures of an OLTP and a “data history” are the same. But for the processing of a “data history” (with or without “current data”) for reports and queries by using normalized data may have a performance issue. To resolve the issue, normalized data should be transformed to the SELECT-oriented structure of data – a DM (a star schema), which becomes an output-oriented layer of a RTDW.

The traditional way (create dimensions, then fact table step-by-step) can be used, but its a data processing will take time, and RTDW will lose the “real time” feature. Definitely, reports usually do not need to present a result of real-time data, but ad hoc queries are mostly running just to get real-time data to see the current status of business. Let’s see whether it’s possible to transform a RTDW to a Real Time DM (RTDM).

All rows in a RTDW have surrogate and business keys and business data effective and expiration dates. These columns can be used to build Slow Changed Dimensions (SCDs) type 2 (probably, the most popular type). We don’t consider other types of dimensions because a RTDW contains all necessary data to build them, and the process will be similar. Using tables, which don’t have surrogate key columns from other tables (Hubs in DV terminology), as well as tables which linked to the first ones (Hubs) and don’t have business data (Links in DV), we can create SCD not changing data, just joining them. The business data columns from other tables (Satellites in DV) should be added to these joins. As the result, joining existing RTDW “current data” and “data history”, we get ready to use SCD. As a RTDW contains “a single version of truth” that means that it exists only as one instance, each dimension, built from a RTDW, will be the same for all DMs, built from a RTDW, that means it’s a Conformed Dimension (CD) in the DM terminology.

Creation of CDs should take time (for a large RTDW - pretty long time), that will negatively impact the Real Time feature. To resolve the issue, it’s possible to build CDs as materialized views. Such way, CD will be refreshed with a latest data change immediately after used RTDW data changed. The other way of creating CDs can be a usage triggers and transfer RTDW changes into CDs that will give more control over transferring process.
As the result, we get a Real Time CD (RTCD) which can be used for all DMs built on a RTDW.

The other necessary component of a DM is a fact table, and it also can be built the similar way - using triggers on the RTDW tables which contain fact data (materialized views also can be used, but linking a fact table to a RTCD through SQL may become more complicated). Definitely, an initial loading of a fact table will be necessary if a trigger or whole DM will be built later than a RTDW started to work.

In case of usage AM or any other method of data normalization in RTDW, the creation of a RTDM will be the similar because source (a RTDW) have the same data (linked keys and  business data), just structured differently, and result a (RTDM) will be the same, just materialized views will be different, reflecting the specific of a RTDM data structure. Hence, transformation a RTDW into a RTDM will work, even methods of normalization a RTDW will be different in different RTDWs.
Only incremental data changes will be loaded into DM in the real time. Complete recreation of a DM from a RTDW also available, but it will take long time, and can be used only if a DM backup copy is not available. In some cases, these two ways can be combined: DM restored from the last backup copy, and the data changes after last backup should be loaded in the incremental manner.

To save space for CDs, they can be built as views, not normalized view. It will save space, but the performance of SELECT statements will be notably lower. Nevertheless, such approach can be used for a small amount of data, or if a RTDM created temporarily.

                                                                  Advantages

A RTDW has advantages vs. other structures of a DW:

1. It contains the information about status of business at any moment of business history (“always single version of truth”), and presents it to the Business Management, that increases an efficiency of management decisions.

2. Its structure is homogenous that simplifies the software for its support (for example, a RTDW requires only one type of a program module instead of three types operating with separately existing an OLTP, a SA, and a DW, because the structure of an OLTP layer and “a data history” is the same, and a SA is not necessary for a RTDW).

3. Its three layers (an input-oriented “current data”; an analysis-oriented “data history”; and an output-oriented “star-schema”) organically merged and become one multifunctional data conglomerate.

4. It can be easy segmented to be completely consistent to organization of the business process it related to.

5. It can be built based on different methods of data normalization for the “current data” and the “data history”, that makes it flexible for usage in different data structure requirements and styles, and allows to aggregate it with preexisting data processing systems (for example, DMs).

6. Usage different data modeling approaches in a RTDW allows to combine advantages of all of them, and get the new, combined effect (for example, using DV can help not only to nail major entities represented by hubs and links entities and efficiently organize a DW data, but also define conformed dimensions, which will be used for the DM’s creation).

7. It can be built from scratch (probably, the clearest way to build a RTDW), or from an existing layer (for example, already existing an OLTP, or an existing DM).

8. The construction of a RTDW requires fewer efforts than summarized efforts for creating an OLTP, a SA, a DW, and a DM, that increases efficiency of a RTDW.

9. As a RTDW covers all functions of an OLTP, a DW and a DM, it can be used instead of all of them that increase its efficiency, and reduce expenses for its support.

10. The existence of a row label in each row allows to indicate its information for user that increases the value of information presented to users (for example, it’s possible to show on user’s screen when and where showed data created, and who created it).

                                                               Conclusion

A RTDW combines existing data models: a OLTP, a Normalized DW (for example, DV), a DM with Conformed Dimensions (from the DV point of view – Hubs), an AM (if you split rows up to the atomic level – columns), and probably, much more what we did not consider yet in the DW analysis). Combining advantages of each approach, it’s possible to build the formation which efficiently supports Business Management’s requirements and gives to the Business Management an “Always Single Version of Truth”. Usage the different forms of business data, as described above, can be expressed in the formula:
                                                  OLTP+DW=RTDW=RTDM

                                                                References

1. Bill Inmon (2004-09-09). "The Single Version Of The Truth". Business Intelligence Network (Powell Media LLC).
2. Kimball, Ralph; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2nd ed.). Wiley. ISBN 0-471-20024-7.
3. Linstedt, Dan. "Data Vault Series 1 – Data Vault Overview". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
4. Linstedt, Dan. "Data Vault Series 2 – Data Vault Components". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
5. Linstedt, Dan. "Data Vault Series 3 – End Dates and Basic Joins". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
6. Linstedt, Dan. "Data Vault Series 4 – Link Tables". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
7. Linstedt, Dan. "Data Vault Series 5 – Loading Practices". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
8. L. Rönnbäck and O. Regardt and M. Bergholtz and P. Johannesson and P. Wohed (2010). "Anchor modeling - Agile information modeling in evolving data environments". Data & Knowledge Engineering 69: 1229–1253. ISSN 0169-023X.
9. Langseth, J., "Real-Time Data Warehousing: Challenges and Solutions", DSSResources.COM, 02/08/2004