Sunday, September 22, 2013

Integrated Real Time Data Warehouse (RTDW + OLTP)

(continuation: questions and answers)


The following information protected by USA and international law. Any usage of this publication or its part can be allowed per request. Consultation available - leave your email address as a comment.

Introduction
The questions and opinions came in comments and emails after the previous post (http://alex-t-ito.blogspot.com/2012/08/real-time-data-warehouse-integration.html). Current post contains the questions and answers (Q/A) on them, and also thoughts appeared from time of the previous post.
Some of questions and answers you can find in the previous post, but current answers grouped, systematized, and presented in the logical sequence that made these answers logically more readable.
The post is open for new questions and requests for consultations.
The structure of Q/A in the post: question/objection (QO), short answer (SA), and detailing answer (DA).
Questions and Answers
1.      QO: Why a RTDW?
SA: It’s more efficient than Not Real Time Data Warehouse (NRTDW).
DA: The regular Data Warehouse (DW) usually consists of such parts (layers):
-         Source System (SS) – usually OLTP – we need it as a source data to collect them in DW;
-         ETL – the traditional approach which assumes that without it, data can’t be Extracted, Transformed and Loaded into DW (it does look obvious so far – doesn’t it?);
-         Staging Area (SA) or Operational Data Storage (ODS) – the very convenient for data processing place where source data chewed  and baked to make them prepared for loading to the so desirable and expected dish – DW;
-         DW (finally) – normalized or/and denormalized collection of historical data (unstructured DW will be considered later).
Definitely, it’s possible to have other types of NRTDW, but it will not change the approach: source data loaded into DW through other layers and applications. It means that we have in NRTDW not only useful SS and DW, but also other layers (ETL and SA) which take resources (money and time) to create and support them.
 
RTDW makes the construction of DW more efficient because it has only two layers: SS and DW integrated into a single whole. Technically, it’s possible to have them not only as one logical, but even as one physical layer when both OLTP and DW data saved in the same table. These details discussed in the previous post (http://alex-t-ito.blogspot.com/2012/08/real-time-data-warehouse-integration.html).

It’s necessary to emphasize what was described in the previous post RTDW has important feature: OLTP and DW data structure are absolutely identical and even can be integrated into one database object (virtual or even physical). It is one whole, completely integrated conglomerate of data which always reflects the current status of business process it describes and the history of previous statuses. That’s why such RTDW does not need any layer between OLTP and DW.

Such RTDW organically presents the business process for management. More details about this described in the earlier post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html .


2.      QO: What is the advantage of a RTDW described in the previous post (http://alex-t-ito.blogspot.com/2012/08/real-time-data-warehouse-integration.html)?
SA: It’s more efficient than other ways of creation RTDW.
DA: Offered in the previous post way of creation RTDW - is really a RTDW by design where SS and DW are organically integrated into one piece (two database tables with the same structure which can be merged in one database view, or even in a one database table).
There is no any other mechanism of data processing (for example, CDC, ETL, export-import, etc.). Also, the source transaction and changes in DW are processed with one COMMIT that means absolutely simultaneously. More details are below in the answer about RTDW implementation.
Obvious, that absence of extra components costs less, and one COMMIT takes less time than more than one COMMIT.
Such way, a RTDW by nature, by design reflects a business event simultaneously in the SS and DW.


3.      QO: Who does need a RTDW?
SA: All (disagree? – please, just continue to read).
DA: Isn’t it true that Business Management needs  its information as early as possible? - To verify this postulate, try to give your management the daily report in a year after reporting day. But how early the Management needs the information? – It depends on how early IT can give the information to the Management. If your Business Management used to get the information next day (the regular cycle in many data processing systems – is a process source data over night by the next morning), and used to have it gotten the next day during decade(s), definitely, the Management will believe that there is no other way to get the data faster.

But let’s assume that you or someone else offered your Business Management to get the necessary information today, not tomorrow, and not just today, but immediately (and spend less money for this!) – do you think your Business Management will reject such suggestion or will not open presented report till tomorrow? I sure you answered this question.
 

4.      QO: How to convince Business Users to accept a RTDW (and finance it)?
SA: Ask your Business Users whether they want to get the information (which they need) faster and with fewer expenses.
DA: First of all, you need to be sure yourself in what you want to convince others (general rule). Then, try to find such examples which will be simple and will show to your Business Users the advantages of getting the Business Information faster and with fewer expenses. Avoid any technical details, just saying, ‘faster and with fewer expenses’.
For example, some businesses may be not interested to get the information faster, because in this case, they will need to pay their customers money earlier. Definitely, they may not be interested to speed the data processing up (even with fewer expenses). In this case, you can explain them that their competitors can offer the similar business but with faster paying money to customers, and customers will switch to the business which returns money faster.

One more argument which is pretty obvious, but mentioned not so often: RTDW always contains information current, actual, and available for using. It’s like a news TV channel always gives you the latest, fresh news. It means that Business Management can work on making decision and make it faster and more efficient because the information will be the most current. Usage RTDW makes available to do screening the object of management. Isn’t it a dream of any really good manager?
If your Business Users don’t want to make decision faster, just ask them how they want to make their decision more efficiently. This topic also will be touched later during a consideration of metadata and their role in making efficient management.

Nevertheless, some managers may say that their business does not require real time information. It will mean that they run the business slowly and eventually their competitors will win the market. You can recommend them invite the market and management analyst to find the ways for management improvement. As for RTDW creation, just tell them that NRTDW will cost them notably more than RTDW. I believe it should convince them.


5.      QO: How to convince yourself to accept RTDW?
SA: Read everything above.
DA: Ask yourself why not. If you find why you don’t accept RTDW, write comment to this post and we will discuss your point together. But before writing, please read the next questions – may be you will find the answer for your question (for example, if you don’t have an access to the SS and get source data as files).


6.      QO: How to build a RTDW?
SA: Leave in a data processing only the necessary components.
DA: There are two different situations: one - you need to build RTDW from scratch and another - you need to transform your existing NRTDW into RTDW.
-          To build RTDW from scratch - design it as a single whole: initially, design the data model based on the business model, creating the tables to reflect the business events – the single version of facts (more details are in the post http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html).
Then, add to the tables the “row label” (details are in the post http://alex-t-ito.blogspot.com/2012/08/real-time-data-warehouse-integration.html). These tables will be your normalized RTDW.
You can extend it to the denormalized RTDW (see the same post), or you can use some other way to build the RTDW tables – but in any case: if you load data simultaneously into SS (OLTP or other) and DW, these two data layers together will be RTDW because data in both of them always will be consistent and reflect the single version of fact, just different historical snapshots: SS – current (only one), DW – historical (as many as the history is long).

-          To transform your existing NRTDW into RTDW, you need to find the less expensive way to transform it to the built from scratch RTDW as described above. Each NRTDW is different and requires the individual consideration and approach.
Definitely, each particular data processing system with its own nuances requires individual approach in a searching the most efficient way. The re-architecture of the existing data processing system and transformation it into RTDW should be provided especially carefully if existing data processing system runs in the 24x7 mode (one of such approaches described in http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html).


7.      QO: How to implement the simultaneous transactions for the an OLTP and a DW layers in a RTDW?
SA: Provide both transactions in one COMMIT.
DA: Technically it means that you allocate in the program SQL statement (INSERT / UPDATE / DELETE) for an OLTP table, and the next after it there should be the similar SQL statement for a DW table. Then, after both of these statements, there should be COMMIT. Such way, you will avoid problem in case if program/server failed just between these SQL statements: one COMMIT will either commit both of them or will rollback both of them.

Attempt to use a trigger on OLTP database to provide an appropriate transaction on a DW is not correct because:
-         it distributes the related to each other SQL statements in different functional layers (in an OLTP program – for an OLTP table, and in an OLTP database – for a DW table);
-         if the algorithm of OLTP program becomes complicated, the implementation it in the database trigger can create the situation of “mutating table”, which will abort the trigger and the program.
 

8.      QO: Which data model should be used for a RTDW?
SA: Data Vault.
DA:  A RTDW has layers with normalized data (an SS and a normalized DW) and denormalized data (star-schema Data Marts, Cubes). Data Vault is the data modeling approach which has normalized tables with columns which allow smoothly transform them into a star-schema data model.
The Data Vault used in OLTP, and then repeated in DW as a normalized model. Then, a normalized DW can be organically transformed into a denormalized Data Marts using a row label. 
Also, allocation of data in many Data Vault satellite tables allows building compact tables that makes operational process faster on program level, as well as makes program code shorter and more readable for developers and makes a process of program implementation (writing, debugging, and functional testing) faster.

Data Vault lets to use an efficient Agile organization of project when not all parts of an Enterprise Data Model clarified simultaneously. In this case, using a Conceptual Data Model, it’s possible to build Hub and Link tables, and then, later, hang on them, step-by-step, Satellite tables when data for them becomes clarified. Such way allows preparation of the data testing and programming while the Physical Data Model continues to be developed.

Definitely, not only a Data Vault model can be used for structuring an OLTP and a DW, but also any other methodology (for example, an Anchor Modeling), if it allows simultaneously change a current and a historical versions of fact.


9.      QO: Is it possible to implement a Data Mart in a RTDW?
SA: Yes.
DA: Yes. More details are in the previous post (http://alex-t-ito.blogspot.com/2012/08/real-time-data-warehouse-integration.html) in the part “Transformations”.

--
10.  QO: What to do with files coming from an SS with source data for a RTDW?
SA: Use them as an SS.
DA: It’s obvious that the transactions provided in OLTP already reflected in the files created from this OLTP. From this point of view, these files are an SS (transactional data), and it’s only necessary to reflect them in a DW that will make a DW integrated with an OLTP.

To make these files a complete reflection of source data, it’s necessary to add to the each record the “row label” (see question 6.). Extra to the “row label”, “record creation date” field should be added to show how old the record is (the “age” of record will be the difference between “a record creation date” and “a last row-update event”). After loading data from files into DW tables, this extra field will be loaded into a DW table into an appropriate extra column. The loading data from files into DW tables makes the structured DW synchronized with these files.

Such way, the files with labels become already integrated with DW at moment they delivered because DW can use the data from these files, but this integration will be in non-structural form. After loading the files into a DW database, the data from files become integrated with the DW in a structural form. Definitely, it will be not a RTDW from the physical prospective, only from the logical one: the earlier data with the label become available for DW, the closer to the RTDW this DW becomes. When it’s possible, the process should be replaced with the usage record-by-record transferring and loading data from the SS into the EDW using some program interface (service). It will make the process faster than using the file.

The approach of using files can be used for any unstructured data (for example, emails).


11.  QO: What to do if source data come to a RTDW from many OLTP systems?
SA: Unify their structure.
DA: The best ways of usage source data coming from many OLTP systems is unification of these data structure, and then - reflection them in one historical table. To differentiate data coming from different sources, it’s necessary to have an extra column with an OLTP system ID.
If unification and integration of many OLTP systems with historical data is difficult in creating a particular RTDW, the usage of files with unified structure can help, though the data in a RTDW will be available not at the same time as in the source OLTP’s, and usage the files will require extra expenses.


12.  QO: Can an ETL help in creating a RTDW?
SA: Yes.
DA: ETL does not implement the real synchronization of data in an SS and a DW like two SQL statements with one COMMIT, but it can make it pretty close to a RTDW. In any case, it’s better than traditional nightly batch data processing.
The usage of ETL can make the loading process faster if it combined with using a CDC which is the really real time process.
 
--
13.  QO: Will transactions in a RTDW run slower than in an OLTP or a DW separately?
SA: Yes and No.
DA: Yes – because doubled transaction (into the OLTP and the DW), definitely, will run slower than only one either in the OLTP or the DW.
No – because this time will be comparable or even faster than creation, for example, in a regular CDC mechanism, two CDC rows (“old” and “new” ones) generated and then load them into a staging area for further loading into the DW.


14.  QO: How to support RTDW?
SA: Just do it.J
DA: There are no some unusual requirements in supporting a RTDW which were not in the production support of an OLTP or a DW. It’s even much easier to do it in a RTDW because there is neither an ETL, nor an ODS. It is necessary to pay attention on keeping the data integrity of current (an OLTP) and historical (a DW) parts data. The attempt to provide manual data changes may corrupt a RTDW, especially if you have in the RTDW more layers (for example, Data Marts).

It’s important to emphasize that any data correction in RTDW should be provided with identification of such correction. It can be the extra column in the row label that will allow keep it separately from business data. The data type and the column format will be defined by the level of granularity of the types of correction. The simplest way is just use 0 for original business data, and 1 – for correction.

More information about smooth transition over providing enhancement, and development over production support can be found in http://alex-t-ito.blogspot.com/2012/01/production-support-story-about-smooth.html .


15.  QO: Can be created metadata (MD) in a RTDW?
SA: Yes.
DA: Structural MD in a RTDW should be the same as in a NRTDW.
Descriptive MD in a RTDW should be created differently than in a NRTDW, if we consider the descriptive MD as a function of data it describes.

In a NRTDW, the function for ‘n’ number of data events (moments) can be presented as:
Fn(d1,…dt,…dn), 
where d1,…dt,…dn - are data related to the particular events (moments),
and the function for ‘n+1’ number of data events (moments) can be presented as:
Fn+1(d1,…dt,…dn+1).

In a RTDW, the function for ‘n’ number of data events (moments) can be presented as:
Fn(d1,…dt,…dn), 
but the function for ‘n+1’ number of data events (moments) can be presented as:
Fn+1(Fn,…dn+1).

It means that in a RTDW descriptive MD calculated for each new data based on the already calculated MD for previous data and new data. The previously calculated MD used such way, and not recalculated MD for all amount of data events (moments).

Such approach used in math calculations and named “iterated functions”. In data processing, such functions also can be named “incremental functions” (IF) because such name reflects the major feature of such functions and calculations – calculation for incrementally coming data. In a RTDW, it allows to save time: instead of massive recalculation of MD for all previous data there will be MD calculated for only new data.
For example, an average value calculated using previous an average value, a number of previous data and a new data value:
Avgn+1 = (Avgn * n + dn+1) / (n+1).

Such approach used mostly in math, not in business calculations, but a RTDW as a OLTP integrated with a DW is a pretty new approach, and definitely, may require some new technique which, eventually, will become a traditional one. 
The definition of business iteration functions requires more analysis and practical efforts but a RTDW, as the result, will award these efforts.

The very important advantage of the IF in data processing calculation is that IF allow completely eliminate so named Big Data Problem (see more in the appropriate question below).


16.  QO: Should be created metadata (MD) in a RTDW?
SA: Yes.
DA: Structural MD in a RTDW always will be in a DW (both a RTDW and a NRTDW).
As for calculation of a Descriptive MD, a very first impression may be that creation of a Descriptive MD in a RTDW does not make sense because if you calculate some data for a quarter or year, an immediate calculation or a calculation on the next day or even a month insignificant comparing with age (quarter or year) of the processed data. But such impression is not correct.

Let’s consider the difference between immediate calculation of a Descriptive MD and calculation in a NRTDW.
The calculation in a NRTDW which is providing after calculated period of time (quarter or year) is over. Then the calculation will usually take at least a day or more. See the chart below:
|---------------------------------------------------------------|+++|

T0                                                                               Tn    Tc

Where:
      T0         - start time of period of time;
      Tn         - end time of period of time;
      Tc         - end time of calculation. 
It means that the average time of waiting for calculation will be
      Tw = (Tn - T0) / 2 + (Tc - Tn)
For example, for a quarter period (Tn - T0) with calculation time (Tc - Tn) a week, an average time of waiting (Tw) will be half a quarter + a week, or more than 7 weeks.
Accordingly, for a year, an average time of waiting will be more than 6.25 month.

But for RTDW, the waiting time is always 0 (zero!). Isn’t it a good argument for explanation to Business Users the advantage of a RTDW over a NRTDW? Just say them whether they prefer to wait more than 6 months the necessary information for starting the preparation of some strategic decision, or always have such information available for less cost (see the answer for question 1.).
If the Business Users still hesitate, just remind them that their competitors may use more efficient way – a RTDW.


17.  QO: How much do you win from a RTDW?
SA: A lot.
DA: First of all, using a RTDW allows win time. Traditional midnight data processing, if it is not eliminated, allows at least reduce the regular maintenance (for example, backup copies). This is the win in IT.
But the major winner is a Business Management. It can start to work on any strategic decision at any moment and permanently reconsider it based on changed in real time picture of Business Process. The amount of such gain of time can reach many weeks or even months.


18.  QO: What is the specific of using so named Big Data (BD) in a RTDW?
SA: No specific issues relevant to RTDW.
DA: The full name of the BD should be the Data which is too Big for Capability of Available Resources (DBCAR).
As data processing covers more and more layers of data (new subject areas and longer history), each new step meets new, higher, level of data amount. On each step, a new amount of data initially considered as BD, but really it’s just a DBCAR. As soon as the new challenge appeared, new ways began to come up: first of all, new hardware; also, new ways of data processing (for example, distributed data processing).
It is important that all such new (temporary!) challenges and approaches are absolutely the same both for a RTDW and a NRTDW. None of them is only for a RTDW or for a NRTDW. It means that the ways of resolving DBCAR new issue will be identical for both a RTDW and a NRTDW.
From DBCAR point of view, a RTDW even has advantage over a NRTDW because a  RTDW has fewer layers than a NRTDW (see the answer for question 1.).

The described in the previous questions the incremental functions (IF) allow absolutely, completely eliminate the DBCAR problem. It managed to do because IF never provide calculation on all data, only on the one latest coming data and using result of IF calculated for all previously coming data. Such way, the using a RTDW where IF is a necessary component, allows absolutely eliminate so named the Big Data problem. Absolutely, disregard how big the amount of processed data. RTDW uses the IF more efficiently because a RTDW always processed only on one transaction. This is one more advantage of using RTDW.
Conclusion: RTDW DOES NOT HAVE “BIG DATA PROBLEM”. It’s one more advantage of RTDW.


19.  QO: Can a size of organization affect a RTDW?
SA: Yes.
DA: Yes, the size of organization should notably effect on its RTDW because all data replication processes run simultaneously with events they were created by, and parallel to each other. But the size of organization just increases complexity of its RTDW, but absolutely does not prevent from it. Vice versa, a RTDW helps much more to manage the large organization multiplying the efficiency of each department in their work together on preparing and making strategic decision.
Bigger organization has more layers of management, but it should not change the RTDW approach because each next layer considers the previous one as a business process which it manages.


20.  QO: Does a RTDW built based on a Business Data Model or an Information Model?
SA: Both.
DA: A Business Data Model used in the OLTP layer and historical data.
The Information Model used in the layers next by level of integration – as a metadata. It oriented on the management process.
More information on this topic is in http://alex-t-ito.blogspot.com/2011/04/business-model-as-base-for-data-model_03.html .
 

21.  QO: Not all questions were answered.
SA: Yes.
DA: Yes, it’s true because  RTDW as  DW integrated with N OLTP is a young approach and new questions appear more and more. Definitely, new questions will appear over reading the current post.

That’s why any new question, especially if it based on new practical results of creation and usage a RTDW, is extremely valuable. Please share such sort of information in your questions by writing comments or by sending emails.


Summary
A RTDW is more efficient both for Business and IT, it requires less resources and allows a Business Team to prepare both operational and strategic management solutions at any time. A RTDW creates the new way of absolute elimination of the  Big Data problem. A RTDW can be created both from scratch and using re-architecture of existing data processing system.

The post is open for new questions and requests for consultations.