InformationTechnologyOrganization: Real Time Data Warehouse

The following information protected by USA and international law. Any usage of this publication or its part can be allowed per request. Consultation available - leave your email address as a comment.

                                                                 Appreciation

First of all, let me express my great thanks to authors of modern data architecture concepts based on which I provided analysis posted below:
            Bill Inmon – Data Warehouse (DW) [1],
            Ralph Kimball – Data Mart (DM) [2],
            Dan Linstedt – Data Vault (DV) [3 – 7],
            Lars Rönnbäck – Anchor Modeling (AM) [8],
and all people who are interesting and help to build concepts in Data Warehousing, and provide its implementation.
My special thanks to Justin Langseth who observed real-time DW challenges and solutions [9].

                                                                       Analysis

Do we need a Real Time DW (RTDW)? The necessity in a RTDW comes from business: a business management team always interested to know about the status of their business and about all business events (definitely, not all of them simultaneously, but some of them at the same time). The later reflection of the status and events makes business information worse.

What should we expect from a RTDW? Let’s consider the answer step-by-step.

First of all, it should be a DW, that means that data collected and integrated into one conglomerate in efficient normalized form, and the data completely identical to the business processes they reflect (more about consistency data and business processes see in the post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html ). All data should reflect the business from historical prospective that means they should reflect all business events happened during predefined time. The data should be available for business analysis from any prospective. But usually, a DW has data a day later that does not allow business management to react immediately on changes in the business process.

The second requirement is a Real Time (RT) that means a RTDW data always is a perfect (without any distortion or delay) mirror of the business processes – real reflection accuracy is determine by the ability of IT system to reflect changes in the business, as well as requirements of doing business (for example, some monthly reporting system may not need to be an immediate reflection of changes in the business). Such a RT system is an OLTP one, which immediately reflects the business process and usually normalized and has current business data. But an OLTP does not contain historical data.
The regular way to data transfer from an OLTP to a DW is to use an ETL or other mechanism with the similar function. But the usage of such an intermediary is time consuming and a DW becomes “almost or near” real time, which is not the ideal way to control from a business perspective.

                                                                     Approach

Let’s try to use another approach – merging an OLTP and a DW to get a RTDW. To do this, let’s consider an OLTP and a DW as one data processing system. Why not? Powerful DW servers usually have enough capability to save an OLTP data extra to its existing stuff.

Extra requirements to a server performance can come, because a DW itself requires a powerful processor, much more powerful than regular batch processing systems. The reason of such requirement is that a DW uses hash joins as a performance efficient execution plan for joining numerous queries. Hash joins operate in the memory, increasingly shifting requirements for processor to the level of scientific calculations which process in memory zillion of iterations. I detected such patterns when analyzed performance of a DW with a few terabytes of data. That finding should be used to properly procure and configure DW servers. With such increased power of the processor, addition an OLTP system should not be a burden for a DW.

Let’s consider the data model of a RTDW which is an OLTP and a DW simultaneously. Requirements to such RTDW come from both an OLTP and a DW. An OLTP requires normalization using, at least, the 3-rd normal form. A DW requires maintaining historical data and having on each row date/time stamp and indicator of a place where data have appeared. As the result, we get such data model (some variations are possible, but main features shown on the Figure 1):

Each row in a RTDW has surrogate and business keys, normalized business data, and row label (date/time stamps of row-create event; last row-update event; business data effective date and expiration dates; place where the row appeared; ID of the user who last inserted/updated/deleted row).

The picture shows transformation of the status of identified by surrogate and business keys business row in the table of RTDW (any table, because all tables of RTDW processed in the same manner):

- The initially entered row becomes a “current” one and reflects the latest status of the business. The status “current” means that the row can be considered as the OLTP row. It has effective date 01/01/0001 (for not ancient data in the DW it’s enough) and expiration date 12/31/9999 (the latest future we can express in our calendar).

- All changes of the “current row” created two rows: the terminated “current row” and the new “current row”.

The terminated “current row” – it’s the “current row” with business data expiration date changed to the date/time of moment when this business data become not effective anymore (usually, if other is not predefined for particular RTDW, it’s the date/time when change happened), and such data become the latest changed data. Such row becomes a historical one that means it can be considered as RTDW row. When it happened first time, the row becomes initially changed data, all next changes of the “current row” become the next changed rows, and the latest of them becomes the latest changed data.
New “current row” – it’s the “current row” with new business data (result of provided changes), new effective date which is always the same as expiration date of the terminated “current row” (otherwise, the history of data in the RTDW would be corrupted). The expiration date is opened the latest possible date in the calendar - 12/31/9999. Definitely, the place and the user ID can be the new ones because they reflect the place and the user where the row appeared and who provided changes.
Deleting the “current row” also creates the terminated “current row” as described above.

- After the first change of the “current row”, the terminated “current row” becomes initially changed row; the latest terminated “current row” becomes the latest one.

All terminated “current rows” together as one package is the “data history” (history of the table – traditional for DW, but without the row which contains current business data). Each “current row” reflects current business data (traditional for OLTP), just with not usual for the traditional OLTP row label described above (such label makes structure of the “current row” absolutely identical with the “data history”). Together, the “data history” and the “current row” are the RTDW – completely integrated history of all data appeared in the business process. As it represented in the completely normalized form, it becomes the “single version of truth”.

Such data processing system – a RTDW – has two functional layers: a OLTP (“current data”) and a DW (“current data” and “data history” together). As the OLTP real time data processing also generates the “data history” (which is a major part and a feature of DW) in real time, a RTDW is a real time “single version of truth”, and the time can be shifted back, into the past, looking at the “single version of truth” at any time in the past (to do it, just necessary to require business data at any moment and get result based on the business data effective date and expiration dates in any table(s).

Logically, the “current row” and the “data history” should be in the same table, but physically, for improvement of an OLTP, it’s possible to split data for two tables with the same structure: one for an OLTP, another – for a “data history” (last one can have the suffix or prefix identified that it’s a “data history”). In this case, both tables should be logically united (for example, using the SQL UNION operator) into one view – the RTDW view containing real time “single version of truth” at any time covered by RTDW.

                                                                    Extensions

A RTDW can be used as one whole formation (as described above) and also it can be extended into different directions. Let’s consider the first coming ones.

Initial imagination of a RTDW makes impression that this is one formation which will work locating in one physical location (database/server), and which has the same table structure for the same business entity on different layers (OLTP and “data history”). But it’s not a mandatory requirement, a RTDW can be distributed. Distribution can happen either because of the hardware requirements, or because of the business specific.

The hardware requirements (lack of space or power of processors) causes splitting data horizontally, on timing bases (for example, split “current data” and “data history”); or split a close “data history” (one or two years) and a very old “data history” (for archiving). Also, the hardware requirements may split data vertically, separating different groups of tables (for example, operational tables and accounting tables). In case of the hardware requirements, the data may be integrated using database links the similar way as for one database.

Splitting the data because of the business requirements also can be implemented horizontally, aggregating data on different management levels (for example, a department’s level and a corporation level), and having a different structure of tables on different levels of the business management (the data aggregation can be provided for higher level).

Splitting the data because of the business requirements also can be provided vertically that can be required by lack or absence of the business integration (for example, when a corporation has not related to each other businesses – machinery building and pharmacy). In this case, the data split can be provided following the business split, and each part of the RTDW used for appropriate part of business.

For both reasons of the data splitting, horizontal and vertical, the split can be combined if necessary. In splitting data vertically, nomenclature of different parts of the RTDW will be different because different parts related to the different parts of the business, even may be to different businesses.
It’s necessary to keep in mind that more splitting cause more complicated and less efficient structure of RTDW. Such structure requires more efforts both for development and maintenance of a RTDW (you can read some approaches on production support in the post http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html).

An ETL in a RTDW is not so much necessary as in a regular DW because an OLTP, as a source of data, can be integrated with “data history” in the same database without Staging Area (SA). An ETL may become necessary in a distributed DW, but in this case, it’s a big chance to get “near RTDW” because an ETL runs taking time. The most natural way of data transferring in a RTDW is the usage database links to remote parts of RTDW. At the same time, an ETL has the advantage over RTDW internal means (database links) because it independent from RTDW processes. The communication through Web “clouds” also can be considered.

Replication of OLTP data changes into a “data history” can be provided using different approaches, first of all, triggers and user-interface programs. It would be very useful for a RTDW to have Changed Data Capture (CDC) mechanism (one more method of the replication of data in a database) oriented on a RTDW. It means that the CDC should provide replication of changed data into “data history”, and keep tracking columns (date/time, place, and user). Such RTDW oriented CDC will make a RTDW data processing smoother, more natural and efficient for design, implementation and the daily usage.

In case of transferring data in a RTDW, the most critical factor is the real time reaction on any business event – reflection the data change in a “data history“ should follow immediately after OLTP (“current row”) change. This is the most valuable feature of a RTDW for the business management.

                                                              Transformations

A RTDW does not need internal transformations because the structures of an OLTP and a “data history” are the same. But for the processing of a “data history” (with or without “current data”) for reports and queries by using normalized data may have a performance issue. To resolve the issue, normalized data should be transformed to the SELECT-oriented structure of data – a DM (a star schema), which becomes an output-oriented layer of a RTDW.

The traditional way (create dimensions, then fact table step-by-step) can be used, but its a data processing will take time, and RTDW will lose the “real time” feature. Definitely, reports usually do not need to present a result of real-time data, but ad hoc queries are mostly running just to get real-time data to see the current status of business. Let’s see whether it’s possible to transform a RTDW to a Real Time DM (RTDM).

All rows in a RTDW have surrogate and business keys and business data effective and expiration dates. These columns can be used to build Slow Changed Dimensions (SCDs) type 2 (probably, the most popular type). We don’t consider other types of dimensions because a RTDW contains all necessary data to build them, and the process will be similar. Using tables, which don’t have surrogate key columns from other tables (Hubs in DV terminology), as well as tables which linked to the first ones (Hubs) and don’t have business data (Links in DV), we can create SCD not changing data, just joining them. The business data columns from other tables (Satellites in DV) should be added to these joins. As the result, joining existing RTDW “current data” and “data history”, we get ready to use SCD. As a RTDW contains “a single version of truth” that means that it exists only as one instance, each dimension, built from a RTDW, will be the same for all DMs, built from a RTDW, that means it’s a Conformed Dimension (CD) in the DM terminology.

Creation of CDs should take time (for a large RTDW - pretty long time), that will negatively impact the Real Time feature. To resolve the issue, it’s possible to build CDs as materialized views. Such way, CD will be refreshed with a latest data change immediately after used RTDW data changed. The other way of creating CDs can be a usage triggers and transfer RTDW changes into CDs that will give more control over transferring process.
As the result, we get a Real Time CD (RTCD) which can be used for all DMs built on a RTDW.

The other necessary component of a DM is a fact table, and it also can be built the similar way - using triggers on the RTDW tables which contain fact data (materialized views also can be used, but linking a fact table to a RTCD through SQL may become more complicated). Definitely, an initial loading of a fact table will be necessary if a trigger or whole DM will be built later than a RTDW started to work.

In case of usage AM or any other method of data normalization in RTDW, the creation of a RTDM will be the similar because source (a RTDW) have the same data (linked keys and business data), just structured differently, and result a (RTDM) will be the same, just materialized views will be different, reflecting the specific of a RTDM data structure. Hence, transformation a RTDW into a RTDM will work, even methods of normalization a RTDW will be different in different RTDWs.
Only incremental data changes will be loaded into DM in the real time. Complete recreation of a DM from a RTDW also available, but it will take long time, and can be used only if a DM backup copy is not available. In some cases, these two ways can be combined: DM restored from the last backup copy, and the data changes after last backup should be loaded in the incremental manner.

To save space for CDs, they can be built as views, not normalized view. It will save space, but the performance of SELECT statements will be notably lower. Nevertheless, such approach can be used for a small amount of data, or if a RTDM created temporarily.

                                                                  Advantages

A RTDW has advantages vs. other structures of a DW:

1. It contains the information about status of business at any moment of business history (“always single version of truth”), and presents it to the Business Management, that increases an efficiency of management decisions.

2. Its structure is homogenous that simplifies the software for its support (for example, a RTDW requires only one type of a program module instead of three types operating with separately existing an OLTP, a SA, and a DW, because the structure of an OLTP layer and “a data history” is the same, and a SA is not necessary for a RTDW).

3. Its three layers (an input-oriented “current data”; an analysis-oriented “data history”; and an output-oriented “star-schema”) organically merged and become one multifunctional data conglomerate.

4. It can be easy segmented to be completely consistent to organization of the business process it related to.

5. It can be built based on different methods of data normalization for the “current data” and the “data history”, that makes it flexible for usage in different data structure requirements and styles, and allows to aggregate it with preexisting data processing systems (for example, DMs).

6. Usage different data modeling approaches in a RTDW allows to combine advantages of all of them, and get the new, combined effect (for example, using DV can help not only to nail major entities represented by hubs and links entities and efficiently organize a DW data, but also define conformed dimensions, which will be used for the DM’s creation).

7. It can be built from scratch (probably, the clearest way to build a RTDW), or from an existing layer (for example, already existing an OLTP, or an existing DM).

8. The construction of a RTDW requires fewer efforts than summarized efforts for creating an OLTP, a SA, a DW, and a DM, that increases efficiency of a RTDW.

9. As a RTDW covers all functions of an OLTP, a DW and a DM, it can be used instead of all of them that increase its efficiency, and reduce expenses for its support.

10. The existence of a row label in each row allows to indicate its information for user that increases the value of information presented to users (for example, it’s possible to show on user’s screen when and where showed data created, and who created it).

                                                               Conclusion

A RTDW combines existing data models: a OLTP, a Normalized DW (for example, DV), a DM with Conformed Dimensions (from the DV point of view – Hubs), an AM (if you split rows up to the atomic level – columns), and probably, much more what we did not consider yet in the DW analysis). Combining advantages of each approach, it’s possible to build the formation which efficiently supports Business Management’s requirements and gives to the Business Management an “Always Single Version of Truth”. Usage the different forms of business data, as described above, can be expressed in the formula:
                                                  OLTP+DW=RTDW=RTDM

                                                                References

1. Bill Inmon (2004-09-09). "The Single Version Of The Truth". Business Intelligence Network (Powell Media LLC).
2. Kimball, Ralph; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2nd ed.). Wiley. ISBN 0-471-20024-7.
3. Linstedt, Dan. "Data Vault Series 1 – Data Vault Overview". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
4. Linstedt, Dan. "Data Vault Series 2 – Data Vault Components". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
5. Linstedt, Dan. "Data Vault Series 3 – End Dates and Basic Joins". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
6. Linstedt, Dan. "Data Vault Series 4 – Link Tables". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
7. Linstedt, Dan. "Data Vault Series 5 – Loading Practices". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
8. L. Rönnbäck and O. Regardt and M. Bergholtz and P. Johannesson and P. Wohed (2010). "Anchor modeling - Agile information modeling in evolving data environments". Data & Knowledge Engineering 69: 1229–1253. ISSN 0169-023X.
9. Langseth, J., "Real-Time Data Warehousing: Challenges and Solutions", DSSResources.COM, 02/08/2004

4 comments:

AnonymousSeptember 3, 2012 at 7:11 AM
Hi Alex, I agree the concept and theory is probably sound. Technically I think it would be possible but goes beyond my capabilities these days.

I think the issues will be business based. Although the business benefits, the cost, risk and management overhead would prevent many organisations investing. There are certain types of business which will benefit more than others and various business departments which would also benefit. I suspect the best industries for RTDW would be the ones with what we now call 'big data' and on-line transactional businesses or customer contact centres.

For the organisations with no existing DW this could be an option if they had the funds to pay for it. The risks would be that an issue in the RTDW process could impact BAU, which could cost money and customers - so infrastructure costs would be higher, especially as the types of business it would suit expect to be operational 24/7. Also the impact to report users would have to be considered at the same time as the RTDW was processing x number of transactions.

I have debated the need for RTDW for many years and although on paper it is good and technically feasible, the costs and risks have outweighed the business benefit. The banking sector could benefit but they already have many bespoke solutions doing something similar.

I think if it could be designed for a specific business sector working with a standard operational transactional system then I think it may adopt buy-in. If it was initially targeted at master data then it would be a good entry RTDW system. I think it would work best in telecoms, on-line retail, logistics and banking sectors.

A good topic of discussion and one I have had several times.
Jason
Alex Treyster.September 3, 2012 at 7:30 AM
Hi Jason,
Thank you very much for your detailed analysis! It will help me a lot on developing RTDW concept. I would be glad to share with you result I get.
Regards,
Alex
Bala SeetharamanSeptember 17, 2012 at 9:30 PM
Hi Alex

The concept looks promosing and the applicablity scope is not big except retail, telco, stock markets.

The applicability of RTDW is not a mandated one for most of our BI/DW requirements. again the implementation has lot of data silos which needs impact assessment. now a days most of the ETL tools has its own data lineage and quality checks for better tracking and the maintenance of these silos and will not be an issue.

I can think of its requirement for cloud based CRM or SCM system to inhouse DW which needs a reconcilation or synchronization report or data integration itself.

Can you please include the limitations also so that people don't apply this concept just like that.

You may suggest a technical solutions with few tools will be awesome!!!

BR
Bala.S

Alex Treyster.September 26, 2012 at 7:21 PM
Hi Bala,

Thank you for your comments! They will help me to improve RTDW approach.
Some immediate responses:
- yes, applicability scope is not big so far, but the most important that it grows;
- I would add to applicability such a large branch of industry as machinery building;
- traditional DW comes from the past time when DW just appeared and became very popular, and the more RTDW instances appeared, the more popular it will become - like a chain reaction;
- RTDW built based on described approach does not need ETL because OLTP integrated with DW and reflects changes immediately, that makes it much less expensive both for design and maintenance.

CRM and SCM consideration would be very interesting and I would appreciate your thoughts on this.

As for limitation, I don't see any of them for RTDW because it can work efficiently for any business. It's just better than the regular (not real time) DW like more efficient and less expensive product better than less efficient and more expensive one.

I am going to consider different ways of creating RTDW based on different combination of existing data processing systems, and believe it will help to discuss how to re-architecture existing OLTP and DW to build RTDW. I hope to see your comments on this.

Sincerely,
Alex

InformationTechnologyOrganization

Sunday, August 19, 2012

Real Time Data Warehouse – Integration with OLTP System

4 comments:

Business Intelligence (BI) Group on LinkedIn

Busness Intelligence Group on LinkedIn