Sunday, August 19, 2012

Real Time Data Warehouse – Integration with OLTP System

The following information is protected by USA and international law. Any usage of this publication or its parts may be permitted upon request. Consultation is available; leave your email address as a comment.

                                                                 Acknowledgments

First of all, let me express my great thanks to the authors of the modern data architecture concepts on which the analysis posted below is based:
            Bill Inmon – Data Warehouse (DW) [1],
            Ralph Kimball – Data Mart (DM) [2],
            Dan Linstedt – Data Vault (DV) [3 – 7],
            Lars Rönnbäck – Anchor Modeling (AM) [8],
and to all the people who take an interest in Data Warehousing, help build its concepts, and implement them.
My special thanks to Justin Langseth, who surveyed real-time DW challenges and solutions [9].

                                                                       Analysis

Do we need a Real-Time DW (RTDW)? The need for a RTDW comes from business: a business management team is always interested in knowing the status of their business and all business events (certainly not all of them simultaneously, but some of them at the same time). The later the status and events are reflected, the less valuable the business information becomes.

What should we expect from a RTDW? Let’s consider the answer step-by-step.

First of all, it should be a DW, which means that data are collected and integrated into one conglomerate in an efficient normalized form, and that the data are completely consistent with the business processes they reflect (for more about the consistency of data and business processes, see the post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html ). All data should reflect the business from a historical perspective, which means they should reflect all business events that happened during a predefined time. The data should be available for business analysis from any perspective. But usually a DW gets its data a day later, which does not allow business management to react immediately to changes in the business process.

The second requirement is Real Time (RT), which means that RTDW data is always a perfect (without any distortion or delay) mirror of the business processes; the actual reflection accuracy is determined by the ability of the IT system to reflect changes in the business, as well as by the requirements of doing business (for example, some monthly reporting system may not need to be an immediate reflection of changes in the business). Such an RT system is an OLTP one, which immediately reflects the business process, is usually normalized, and holds current business data. But an OLTP does not contain historical data.
The regular way to transfer data from an OLTP to a DW is to use an ETL or another mechanism with a similar function. But the use of such an intermediary takes time, and the DW becomes “almost” or “near” real time, which is not ideal from a business control perspective.

                                                                     Approach

Let’s try another approach: merging an OLTP and a DW to get a RTDW. To do this, let’s consider an OLTP and a DW as one data processing system. Why not? Powerful DW servers usually have enough capacity to store OLTP data in addition to their existing content.

Extra requirements on server performance may arise because a DW itself requires a powerful processor, much more powerful than in regular batch processing systems. The reason for this requirement is that a DW uses hash joins as a performance-efficient execution plan for the joins in numerous queries. Hash joins operate in memory, increasingly shifting processor requirements toward the level of scientific calculations, which process zillions of iterations in memory. I detected such patterns when analyzing the performance of a DW with a few terabytes of data. That finding should be used to procure and configure DW servers properly. With such increased processor power, adding an OLTP system should not be a burden for a DW.

Let’s consider the data model of a RTDW, which is an OLTP and a DW simultaneously. Requirements for such a RTDW come from both an OLTP and a DW. An OLTP requires normalization, using at least the third normal form. A DW requires maintaining historical data and having on each row a date/time stamp and an indicator of the place where the data appeared. As a result, we get the following data model (some variations are possible, but the main features are shown in Figure 1):

[Figure 1. The RTDW data model: each table row carries surrogate and business keys, normalized business data, and a row label.]

Each row in a RTDW has surrogate and business keys, normalized business data, and a row label (the date/time stamps of the row-create event and the last row-update event; the business data effective and expiration dates; the place where the row appeared; and the ID of the user who last inserted/updated/deleted the row).

The picture shows the transformation of the status of a business row, identified by surrogate and business keys, in a RTDW table (any table, because all tables of a RTDW are processed in the same manner):

- The initially entered row becomes a “current” one and reflects the latest status of the business. The status “current” means that the row can be considered an OLTP row. It has effective date 01/01/0001 (sufficient for any non-ancient data in the DW) and expiration date 12/31/9999 (the latest future date we can express in our calendar).

- Each change of the “current row” creates two rows: the terminated “current row” and the new “current row”.
  • The terminated “current row” is the “current row” with its business data expiration date changed to the date/time when the business data stopped being effective (usually, unless otherwise predefined for the particular RTDW, the date/time when the change happened); this data becomes the latest changed data. Such a row becomes a historical one, which means it can be considered a RTDW row. When this happens for the first time, the row becomes the initially changed data; all subsequent changes of the “current row” become the next changed rows, and the latest of them becomes the latest changed data.
  • The new “current row” is the “current row” with new business data (the result of the applied changes), a new effective date that is always the same as the expiration date of the terminated “current row” (otherwise the history of data in the RTDW would be corrupted), and an open expiration date set to the latest possible date in the calendar: 12/31/9999. Naturally, the place and the user ID can be new ones, because they reflect where the row appeared and who made the changes.
  • Deleting the “current row” also creates a terminated “current row”, as described above.
- After the first change of the “current row”, the terminated “current row” becomes the initially changed row; the latest terminated “current row” becomes the latest one (a SQL sketch of this pattern follows the list).
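
As an illustration, here is a minimal SQL sketch of this terminate-and-insert pattern. All names (the customer table, its columns, the customer_seq sequence) are hypothetical, and a real implementation would capture one timestamp and use it in both statements:

    -- Terminate the current row: close its expiration date.
    UPDATE customer
       SET exp_date   = SYSTIMESTAMP,   -- business data is no longer effective
           updated_at = SYSTIMESTAMP,
           updated_by = USER
     WHERE customer_no = :cust_no
       AND exp_date = TIMESTAMP '9999-12-31 00:00:00';  -- marks the “current” row

    -- Insert the new current row: its effective date equals the expiration
    -- date of the terminated row, and its own expiration date stays open.
    INSERT INTO customer
      (sk, customer_no, name, eff_date, exp_date, created_at, place, updated_by)
    VALUES
      (customer_seq.NEXTVAL, :cust_no, :new_name,
       SYSTIMESTAMP,                     -- = exp_date of the terminated row
       TIMESTAMP '9999-12-31 00:00:00',  -- open expiration date
       SYSTIMESTAMP, :place, USER);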

All terminated “current rows” together, as one package, form the “data history” (the history of the table, traditional for a DW, but without the row that contains the current business data). Each “current row” reflects current business data (traditional for an OLTP), just with the row label described above, which is unusual for a traditional OLTP row (such a label makes the structure of the “current row” absolutely identical to the “data history”). Together, the “data history” and the “current rows” are the RTDW: a completely integrated history of all data that appeared in the business process. As it is represented in a completely normalized form, it becomes the “single version of truth”.

Such a data processing system, a RTDW, has two functional layers: an OLTP one (“current data”) and a DW one (“current data” and “data history” together). As the OLTP real-time data processing also generates the “data history” (which is a major part and feature of a DW) in real time, a RTDW is a real-time “single version of truth”, and time can be shifted back into the past, looking at the “single version of truth” at any moment in the past (to do this, it is only necessary to query the business data as of that moment and get the result based on the business data effective and expiration dates in any table(s)).
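
For example, looking at the “single version of truth” as of an arbitrary past moment needs nothing more than the effective and expiration dates (table and column names are the same hypothetical ones as above):

    -- Business status as of a given moment in the past.
    SELECT *
      FROM customer
     WHERE eff_date <= :as_of_time
       AND exp_date >  :as_of_time;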

Logically, the “current rows” and the “data history” should be in the same table, but physically, to improve OLTP performance, it is possible to split the data into two tables with the same structure: one for the OLTP, another for the “data history” (the latter can have a suffix or prefix identifying it as the “data history”). In this case, both tables should be logically united (for example, using the SQL UNION operator) into one view: the RTDW view containing the real-time “single version of truth” at any time covered by the RTDW.
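
A minimal sketch of such a split, assuming the hypothetical names customer_oltp and customer_hist (UNION ALL is used here because the two tables never contain the same row):

    -- Both tables have an identical structure; the view restores the
    -- logical unity of “current data” and “data history”.
    CREATE OR REPLACE VIEW customer AS
    SELECT * FROM customer_oltp     -- current rows (OLTP layer)
    UNION ALL
    SELECT * FROM customer_hist;    -- terminated rows (data history)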

                                                                    Extensions

A RTDW can be used as one whole formation (as described above), and it can also be extended in different directions. Let’s consider the most immediate ones.

The initial picture of a RTDW gives the impression of one formation located in one physical place (database/server), with the same table structure for the same business entity on the different layers (OLTP and “data history”). But this is not a mandatory requirement: a RTDW can be distributed. Distribution can happen either because of hardware requirements or because of business specifics.

Hardware requirements (lack of space or processing power) can cause the data to be split horizontally, on a time basis (for example, splitting “current data” and “data history”, or splitting the recent “data history” of one or two years from the very old “data history” for archiving). Hardware requirements may also split the data vertically, separating different groups of tables (for example, operational tables and accounting tables). In the case of hardware requirements, the data may be integrated using database links, in a way similar to a single database.

Splitting the data because of business requirements can also be implemented horizontally, aggregating data at different management levels (for example, a department level and a corporation level) and having a different table structure at different levels of business management (data aggregation can be provided for the higher level).

Splitting the data because of business requirements can also be provided vertically, which may be required by a lack or absence of business integration (for example, when a corporation runs unrelated businesses, such as machinery building and pharmaceuticals). In this case, the data split can follow the business split, with each part of the RTDW used for the appropriate part of the business.

For both reasons for data splitting, horizontal and vertical, the splits can be combined if necessary. When splitting data vertically, the nomenclature of the different parts of the RTDW will differ, because the different parts relate to different parts of the business, possibly even to different businesses.
It is necessary to keep in mind that more splitting causes a more complicated and less efficient RTDW structure. Such a structure requires more effort both for development and for maintenance of a RTDW (you can read about some approaches to production support in the post http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html).

An ETL in a RTDW is not as necessary as in a regular DW, because an OLTP, as a source of data, can be integrated with the “data history” in the same database without a Staging Area (SA). An ETL may become necessary in a distributed RTDW, but in that case there is a big chance of getting a “near RTDW”, because an ETL run takes time. The most natural way of transferring data in a RTDW is to use database links to the remote parts of the RTDW. At the same time, an ETL has an advantage over the RTDW-internal means (database links) because it is independent of the RTDW processes. Communication through Web “clouds” can also be considered.

Replication of OLTP data changes into the “data history” can be provided using different approaches, first of all triggers and user-interface programs. It would be very useful for a RTDW to have a Change Data Capture (CDC) mechanism (one more method of data replication in a database) oriented toward a RTDW. This means that the CDC should replicate changed data into the “data history” and maintain the tracking columns (date/time, place, and user). Such a RTDW-oriented CDC would make RTDW data processing smoother, more natural, and more efficient in design, implementation, and daily usage.
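
As an illustration of the trigger approach, applied to the split variant above, the replication might look like this sketch (Oracle PL/SQL; all names are hypothetical):

    -- Before a current row is updated, copy its old version into the
    -- history table with a closed expiration date.
    CREATE OR REPLACE TRIGGER trg_customer_hist
    BEFORE UPDATE ON customer_oltp
    FOR EACH ROW
    BEGIN
      INSERT INTO customer_hist
        (sk, customer_no, name, eff_date, exp_date, place, updated_by)
      VALUES
        (:OLD.sk, :OLD.customer_no, :OLD.name, :OLD.eff_date,
         SYSTIMESTAMP,                 -- terminate the old version
         :OLD.place, :OLD.updated_by);
      :NEW.eff_date := SYSTIMESTAMP;   -- the new version starts where the old one ends
    END;
    /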

When transferring data in a RTDW, the most critical factor is the real-time reaction to any business event: the reflection of a data change in the “data history” should follow immediately after the OLTP (“current row”) change. This is the most valuable feature of a RTDW for business management.

                                                              Transformations

A RTDW does not need internal transformations, because the structures of the OLTP and the “data history” are the same. But processing the “data history” (with or without the “current data”) for reports and queries over normalized data may run into performance issues. To resolve them, the normalized data should be transformed into a SELECT-oriented data structure: a DM (a star schema), which becomes the output-oriented layer of a RTDW.

The traditional way (creating the dimensions, then the fact table, step by step) can be used, but its data processing takes time, and the RTDW would lose its “real time” feature. Certainly, reports usually do not need to present real-time data, but ad hoc queries mostly run precisely to get real-time data and see the current status of the business. Let’s see whether it is possible to transform a RTDW into a Real-Time DM (RTDM).

All rows in a RTDW have surrogate and business keys and business data effective and expiration dates. These columns can be used to build Slowly Changing Dimensions (SCDs) of type 2 (probably the most popular type). We don’t consider other types of dimensions here, because a RTDW contains all the data necessary to build them, and the process will be similar. Using tables that don’t have surrogate key columns from other tables (Hubs in DV terminology), as well as tables that are linked to the first ones (Hubs) and don’t have business data (Links in DV), we can create an SCD without changing the data, just by joining them. The business data columns from other tables (Satellites in DV) should be added to these joins. As a result, by joining the existing RTDW “current data” and “data history”, we get a ready-to-use SCD. Since a RTDW contains “a single version of truth”, meaning it exists as only one instance, each dimension built from a RTDW will be the same for all DMs built from that RTDW; in other words, it is a Conformed Dimension (CD) in DM terminology.
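
A sketch of such a type 2 dimension, assuming hypothetical DV-style tables (hub_customer holding the keys, sat_customer holding the business data with its effective/expiration dates):

    -- The SCD is built by joining, not by transforming, the data:
    -- each satellite version becomes one dimension row.
    CREATE OR REPLACE VIEW dim_customer AS
    SELECT h.sk          AS customer_sk,
           h.customer_no AS customer_no,   -- business key
           s.name,
           s.eff_date,                     -- SCD type 2 validity interval
           s.exp_date
      FROM hub_customer h
      JOIN sat_customer s
        ON s.sk = h.sk;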

Creation of CDs takes time (for a large RTDW, a pretty long time), which negatively impacts the real-time feature. To resolve the issue, it is possible to build the CDs as materialized views. That way, a CD is refreshed with the latest data change immediately after the underlying RTDW data changes. Another way of creating CDs is to use triggers to transfer RTDW changes into the CDs, which gives more control over the transfer process.
As a result, we get a Real-Time CD (RTCD) that can be used for all DMs built on a RTDW.
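
For instance, the materialized-view variant might look like the following sketch (an Oracle materialized join view with fast refresh on commit; the usual fast-refresh restrictions apply, and all names are hypothetical):

    -- Materialized view logs record the changes needed for fast refresh.
    CREATE MATERIALIZED VIEW LOG ON hub_customer WITH ROWID;
    CREATE MATERIALIZED VIEW LOG ON sat_customer WITH ROWID;

    -- The conformed dimension is refreshed as part of each committing
    -- transaction, keeping it in step with the RTDW.
    CREATE MATERIALIZED VIEW dim_customer_mv
      REFRESH FAST ON COMMIT
    AS
    SELECT h.ROWID AS h_rid, s.ROWID AS s_rid,  -- rowids required for fast refresh
           h.sk, h.customer_no, s.name, s.eff_date, s.exp_date
      FROM hub_customer h, sat_customer s
     WHERE s.sk = h.sk;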

The other necessary component of a DM is a fact table, and it can also be built in a similar way: using triggers on the RTDW tables that contain fact data (materialized views can also be used, but linking a fact table to a RTCD through SQL may become more complicated). Naturally, an initial load of the fact table will be necessary if the trigger, or the whole DM, is built after the RTDW has started working.

If AM or any other data normalization method is used in the RTDW, the creation of a RTDM will be similar, because the source (the RTDW) has the same data (linked keys and business data), just structured differently, and the result (the RTDM) will be the same; only the materialized views will differ, reflecting the specifics of the RTDM data structure. Hence, the transformation of a RTDW into a RTDM will work even if the normalization methods differ between RTDWs.
Only incremental data changes are loaded into the DM in real time. Complete recreation of a DM from a RTDW is also available, but it takes a long time and should be used only if a DM backup copy is not available. In some cases, these two ways can be combined: the DM is restored from the last backup copy, and the data changes made after the last backup are loaded incrementally.

To save space, the CDs can be built as plain (non-materialized) views. This saves space, but the performance of SELECT statements will be notably lower. Nevertheless, such an approach can be used for a small amount of data, or if a RTDM is created temporarily.

                                                                  Advantages

A RTDW has advantages over other DW structures:

1. It contains information about the status of the business at any moment of business history (an “always single version of truth”) and presents it to Business Management, which increases the efficiency of management decisions.

2. Its structure is homogeneous, which simplifies the supporting software (for example, a RTDW requires only one type of program module instead of the three types operating with a separately existing OLTP, SA, and DW, because the structure of the OLTP layer and the “data history” is the same, and an SA is not necessary for a RTDW).

3. Its three layers (the input-oriented “current data”, the analysis-oriented “data history”, and the output-oriented star schema) merge organically and become one multifunctional data conglomerate.

4. It can be easily segmented to be completely consistent with the organization of the business process it relates to.

5. It can be built based on different data normalization methods for the “current data” and the “data history”, which makes it flexible for different data structure requirements and styles, and allows it to be combined with preexisting data processing systems (for example, DMs).

6. Using different data modeling approaches in a RTDW allows combining the advantages of all of them and getting a new, combined effect (for example, using DV can help not only to pin down the major entities represented by hub and link entities and efficiently organize the DW data, but also to define the conformed dimensions that will be used for DM creation).

7. It can be built from scratch (probably the cleanest way to build a RTDW) or from an existing layer (for example, an already existing OLTP or an existing DM).

8. The construction of a RTDW requires less effort than the combined efforts of creating an OLTP, an SA, a DW, and a DM, which increases the efficiency of a RTDW.

9. As a RTDW covers all the functions of an OLTP, a DW, and a DM, it can be used instead of all of them, which increases its efficiency and reduces the expenses for its support.

10. The existence of a row label in each row allows its information to be shown to the user, which increases the value of the information presented (for example, it is possible to show on the user’s screen when and where the displayed data was created, and who created it).

                                                               Conclusion

A RTDW combines existing data models: an OLTP, a Normalized DW (for example, a DV), a DM with Conformed Dimensions (from the DV point of view, Hubs), an AM (if you split rows down to the atomic level: columns), and probably much more that we have not yet considered in this DW analysis. Combining the advantages of each approach, it is possible to build a formation that efficiently supports Business Management’s requirements and gives Business Management an “Always Single Version of Truth”. The use of the different forms of business data, as described above, can be expressed in the formula:
                                                  OLTP+DW=RTDW=RTDM

                                                                References

1. Bill Inmon (2004-09-09). "The Single Version Of The Truth". Business Intelligence Network (Powell Media LLC).
2. Kimball, Ralph; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2nd ed.). Wiley. ISBN 0-471-20024-7.
3. Linstedt, Dan. "Data Vault Series 1 – Data Vault Overview". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
4. Linstedt, Dan. "Data Vault Series 2 – Data Vault Components". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
5. Linstedt, Dan. "Data Vault Series 3 – End Dates and Basic Joins". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
6. Linstedt, Dan. "Data Vault Series 4 – Link Tables". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
7. Linstedt, Dan. "Data Vault Series 5 – Loading Practices". Data Vault Series. The Data Administration Newsletter. Retrieved 12 September 2011.
8. L. Rönnbäck and O. Regardt and M. Bergholtz and P. Johannesson and P. Wohed (2010). "Anchor modeling - Agile information modeling in evolving data environments". Data & Knowledge Engineering 69: 1229–1253. ISSN 0169-023X.
9. Langseth, J., "Real-Time Data Warehousing: Challenges and Solutions", DSSResources.COM, 02/08/2004

Friday, January 13, 2012

Production Support – Story About Smooth Transition



The following information is protected by USA and international law. Any usage of this publication or its parts may be permitted upon request. Consultation is available; leave your email address as a comment.
Foreword

Production support is the last, biggest, and one of the most complicated and difficult parts of the Software Development Life Cycle (SDLC). Unlike development, it meets a real, not imagined, environment. It is like the main dish of the dinner, and all participants in the IT process are involved in production support. That's why these notes are addressed not only to the production support team, but also, and maybe first of all, to managers, developers, and DBAs.
Production support usually begins with a release. A release can be provided either for the first time (initial release, IR), when a new project moves to regular running for users, or as a release following the IR (following release, FR), provided on an already working IT system.
Initial Release
An IR usually means a step-by-step process of moving from an existing (legacy) system, or from scratch, to a new one, and replacing the developers with a production support (PS) team (PST). It is probably the most exciting moment, because users begin to get the data they expected, and the PST begins to take care of this newborn system. The main issue here is the process of transferring knowledge from the developers to the users and the PST. Although everybody in IT knows that it is necessary to have clear, complete, and well-structured documentation, reality is different. Usually, in real life, the transition from developers to the PST is provided in a restricted form: short-term training accompanied by short, fragmentary notes on some key nodes of the initial design.
Probably the best way to provide a smooth IR is to include the most knowledgeable and experienced members of the PST in the development team.
These PST agents will get the necessary knowledge from the developers, beginning with system-level testing, and will write the documentation for the PST. Obviously, as PST members, these PST agents will do their best to prepare the best documentation they can for their team. Then they transfer their knowledge to the rest of the PST members, using and polishing the documentation they prepared. Usually this should happen a short time before the IR, enough to transfer knowledge and mentor the other members of the PST. Depending on how big the new IT process is, this transfer can take from a few days to a few months.
At the same time, the developers would gain a lot from such a transfer of knowledge to the PST through the PST agents, because they would not spend a lot of time writing documentation (just periodic discussions with the PST agents) and would not spend a lot of time training the whole PST, only the PST agents.
As a result, the IR would run faster and more smoothly, with fewer issues.

Following Release
An FR means adding some functions, or just a regular fix, to an already released IT system. The main issue here is not to damage the existing IT system, which sometimes runs around the clock. Such a release should be provided not abruptly, as one shot transforming the running IT system into the modified one, but step by step, as smoothly as possible. This is especially relevant to data modification, because creating a new set of processed data takes much more time than switching to new programs. To illustrate some methods of doing this, let me give a couple of examples from my experience supporting a 7 TB Oracle Data Warehouse.
The first method of such a transition is to run the modified data processing in parallel with the existing one.
- To exclude unnecessary data from processing in a table of tens of GB, two new tables were built: one with the necessary data to be used in the modified system, and another containing the unused data. The tables were created and loaded almost completely prior to the release. The last portion (the current partition of data) was loaded on the first day of the release. The existing old table was still used as a source for reports.
- Then, by running a script of a few statements (it took a small fraction of a second), the existing table was renamed with the suffix ‘_old’, and the two new tables were merged (using the UNION operator) into one view with the same name the processed table had. This way, the existing table was divided into two tables: one with the necessary data and one with the rest. The switch was seamless to users, and a rollback script that renamed everything back was prepared and kept ready to run (see the sketch after this list).
- After the data split, the data loading procedure was replaced with an intermediate one that loaded data both into the old table (in case of rollback) and into the new pair of tables used through the view created in the previous step.
- A couple of days after the data split, once all applications using this data had been working stably, the unused data was cut off: the view created on the first day of the release was dropped, and the new table with only the used data was renamed to the same name as the initial (now ‘_old’) table (again, it took a small fraction of a second).
- Then the intermediate procedure was replaced with a new one that loaded only the used data into the new table, which already had the same name as before the release. The release was complete.
- After that, the pre-release data and the unused data were compressed, kept temporarily for a couple of months, and dropped later.
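
A sketch of the switch script from the second step, using hypothetical names (sales as the processed table, split into sales_used and sales_rest; UNION ALL is used since the two tables are disjoint):

    -- The whole switch is DDL only, so it takes a fraction of a second
    -- and is just as quickly reversible.
    ALTER TABLE sales RENAME TO sales_old;

    CREATE OR REPLACE VIEW sales AS
    SELECT * FROM sales_used     -- data the modified system needs
    UNION ALL
    SELECT * FROM sales_rest;    -- excluded data, kept for safety

    -- Rollback script, kept ready to run:
    --   DROP VIEW sales;
    --   ALTER TABLE sales_old RENAME TO sales;
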
Such a release method has some advantages:
- It was seamless to users, eliminating shutdowns and access disruptions.
- There was no risk of losing data.
- At any moment there was a chance to step back immediately, operating only with DDL statements, not DML.
- It allowed operating with tens of GB of data thanks to the preloading of the old data.
This method is similar to an alpinist climbing to a mountain peak while always touching the rock with two hands and a foot, or with two feet and a hand, minimizing the risk of falling. It can also be called a “sliding step”, because each step of the release is provided like putting one foot close to the other.
Extra precautions here are similar to the extra pop-up box (sometimes even more than one) that appears when you click “delete” on some file; even then, the deleted file is saved in the Recycle Bin.
The second example illustrates extending an existing data processing system by smoothly transforming it into the new one. In Oracle 9, there was a permanent problem of adding extra disk space for permanently growing data. The traditional DBA approach was to physically relocate data files to the newly added disk space. Such an operation took time and impacted users' access to data. To eliminate shutdowns and access disruption, I moved in the other direction: not relocating the existing files, but creating new ones on the new disk space and assigning those files to the tablespaces that needed to grow. As a result, the process of increasing space for tables became seamless to users, eliminating access disruption. Later, a similar approach was implemented in the next versions of Oracle.
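
The core of this approach is Oracle's standard statement for adding a data file to a tablespace; the file lands on the new disk while the existing files stay in place (the path and sizes below are illustrative):

    -- Grow the tablespace on the newly added disk without touching the
    -- existing data files; users keep working throughout.
    ALTER TABLESPACE users
      ADD DATAFILE '/u02/oradata/prod/users02.dbf'
      SIZE 10G
      AUTOEXTEND ON NEXT 1G MAXSIZE 32G;
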
This approach can be applied to other issues of growing an existing data processing system. It can be formulated as striving not to restructure the existing data processing system, but to add an external addendum that extends the existing system to a new, more powerful level. In the new extended system, the old one will exist as a subset that does not contradict the new functionality but continues to work as a part of it.
Final conclusion: to make the release process more efficient and successful, it is necessary to strive to make it smooth, gradually moving from the existing system to the new one, always being ready to step back. Such a release looks longer than one big jump, but in real life it does not impact users, which is most important, and eventually the transition takes less time and is provided with less effort and expense.
Rules of releases:
1.      Proceed step by step.
2.      Maximize parallel processing (running both the old and the new ways of data processing).
3.      Make the transition smooth.
4.      Spend extra time on reliability: it reduces losses from possible crashes.
5.      Be ready to roll back; the rollback should be included in the plan and tested.
6.      Save the old programs and especially the data, because data is not reproducible.

Epilogue
Production support is the longest and most expensive step of the SDLC, but it has not been described much in IT theory. That's why I am posting in this blog a first article on this subject, expecting to get more examples of best practice and notes about small and big issues. This way, we can help each other and enrich our knowledge in this area.