Managing Time In Relational Databases

Articles in the Time and Time Again Series, by Tom Johnston and Randall Weis:

Part 1. DM Review magazine, May 2007.
Part 2: Preliminaries. DM Direct, May 2007.
Part 3: Version Patterns 1 and 2. DM Direct, May 2007.
Part 4: Preliminaries to Version Pattern 3. DM Direct, June 2007.
Part 5: Version Pattern 3. DM Direct, June 2007.
Part 6: Version Pattern 4. DM Direct, June 2007.
Part 7: Completing Version Pattern 4. DM Direct, July 2007.
Part 8: Version Pattern 5. DM Direct, August 2007.
Part 9: Completing Version Pattern 5. DM Direct, August 2007.
Part 10: Foreign Keys in Version Tables. DM Direct, September 2007.
Part 11: Foreign Keys in Versioned Tables, Continued. DM Direct, October 2007.
Part 12: Foreign Keys in Version Tables. DM Direct, October 2007.
Part 13: A Roadmap of the Rest of This Series. DM Direct, November 2007.
Part 14: Versioning with Surrogate Keys. DM Direct, November 2007.
Part 15: For and Against Surrogate Keys. DM Direct, December 2007.
Part 16: Physical Implementation of Version Pattern 1. DM Direct, December 21, 2007.
Part 17: Physical Implementation of Version Pattern 2. DM Direct, January 4, 2008.
Part 18: Physical Implementation of Version Pattern 3. DM Direct, January 18, 2008.
Part 19: RI and Temporal RI. DM Direct, February 1, 2008.
Part 20: RI and Temporal RI (Continued). DM Direct, February 15, 2008.
Part 21: Enforcement Time Options for Temporal RI. DM Direct, March 7, 2008.
Part 22: Original and Temporal Deletes. DM Direct, March 21, 2008.
Part 23: Original and Temporal Updates. DM Direct, April 4, 2008.
Part 24: Original and Temporal Updates (Concluded). DM Direct, April 18, 2008.
Part 25: Original and Temporal Inserts. DM Direct, May 2, 2008.
Part 26: Points in Time and Periods of Time. DM Direct, May 16, 2008.


Time and Time Again: Managing Time in Relational Databases, Part 1
Tom Johnston, Randall Weis
DM Review Magazine, May 2007

Relational databases usually keep data about the current state of things - about how things are now. Updates do occur, of course; but when they do, they overwrite the data already there. In doing so, history is lost: we can't tell how things used to be. To tell how things used to be, as well as how they are now, requires that we introduce time into our databases.

This series of articles on a very technical topic is motivated by the fact that, increasingly, businesses need real-time access to historical data. No longer can all historical data be pushed off onto archive files or onto less immediately accessible historical databases. The objective of this series of articles is to show how to provide real-time access to history in particular, and to any kind of noncurrent data in general, and to show how to do so using the relational database management systems (DBMSs) and SQL that are available today.

First, a note on terminology. It will often be convenient to talk about history, because that is the kind of noncurrent data that businesses are primarily concerned with. But the real topic is the representation of time in relational databases, whether past, present or future time, whether real-world or database time, whether actual or hypothetical time, whether linear or branching time.


Managing time in today's relational databases is difficult. With more than 50 years of combined experience in data and database management, we have encountered no database model more complex for its size than the model we will propose, and no queries more fraught with possibilities for misstatement and misinterpretation than queries against databases that attempt to manage time. But we know of no simpler way to meet business requirements for real-time access to what we will call "versioned history."

However, very little support for time management has yet been provided by DBMS vendors or specified in the current SQL standards. The complexity of time management cannot yet be hidden, as it ultimately must be, inside the DBMS itself, with declarative interfaces provided as extensions to the SQL DDL and DML languages. But businesses need robust and reliable time management in their databases now, not eventually. So we take a do-it-yourself approach.

And the first thing we must do ourselves is find a way to encapsulate both the time management data structures and the additional complexity in the SQL statements required for time management data access. As for updating temporal data structures, all updates should be mediated by application program interfaces (APIs).

If an operational data store (ODS) or data warehouse is the database being updated, the API function will be provided by the extract, transform and load (ETL) process, whether hand coded or implemented with a commercial ETL tool.

For retrievals against temporal data structures, all queries should access a set of view tables from which as much as possible of the technical detail of managing time has been removed. These view tables will satisfy the following two requirements:

When point-in-time data is wanted, these view tables will look, column for column, like tables that are not temporally managed, i.e., like "typical" or "normal" tables in a database.

When the desired point in time is now, these view tables will guarantee that existing retrievals remain completely unaffected as time management is added to the tables they access.

We should think of these temporal tables as no more accessible to the end user, DBA or developer than are the dynamic balanced-tree access method structures that provide the file storage subsystem for many commercial DBMSs. The only database users able to directly manipulate these temporal tables should be the developers who translate source system or end-user-supplied update transactions into updates to these temporal tables, and the developers who write the CREATE VIEW statements that encapsulate these tables for retrieval transactions.

Figure 1 illustrates the encapsulation necessary to hide the complexities of temporal database structures from both those who update those structures and those who retrieve data from them.
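As a concrete, if simplified, illustration of the retrieval side of this encapsulation, here is a minimal sketch of one such view. The table and column names (Policy_Version, ver_beg_dt, ver_end_dt, and so on) are hypothetical stand-ins, not the design this series will develop; the point is only that a query against the view sees what looks like an ordinary, nontemporal Policy table containing the rows in effect right now.

-- A sketch only: expose the currently effective version of each policy,
-- so that existing retrievals written against a "normal" Policy table
-- keep working after time management is added underneath.
CREATE VIEW Policy_Current AS
SELECT policy_nbr,
       client_nbr,
       policy_type
FROM   Policy_Version
WHERE  CURRENT_DATE BETWEEN ver_beg_dt AND ver_end_dt;

A point-in-time variant would accept a supplied as-of date in place of CURRENT_DATE.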

Figure 1: Encapsulating Temporal Tables

When we IT professionals talk about history in databases, we may have several different things in mind. A taxonomy of the types of history that IT data management professionals talk about is presented in Figure 2.


Figure 2: A Taxonomy of History as Managed in RDBMSs

Database history is historical data kept in relational database tables. The two major categories of historical data are reconstructable history and queryable history.

Reconstructable history is data about the past state of something, obtained by restoring a backup file and then applying update transactions captured in a DBMS logfile, from that point in time forward to the point in time of interest. This requires the intervention of IT personnel and is therefore not real time.

Queryable history is data about the past state of something, obtained with a SQL query, without the need for restoring backups and reapplying logged updates. Because the query may be directly executed, this type of history is possibly real time. It is actually real time if its performance meets the parameters which define real time for the class of query it belongs to. Queryable history may be kept as either a record of events or as a record of states.

Event history is data about events that have altered the states of things. This data is captured as transactions in tables that are best queried with links to the leaf-level entries of a set of dimensional hierarchies. Thus, these tables are often queried as the fact tables in star schema data marts, usually by means of OLAP reporting tools.

State history is historical data, captured as copies of the current state of an object, either on a periodic basis or in response to a specific update event. State history may be kept as either snapshots or versions.

Snapshot history is a coordinated set of copies in a relational database. Snapshots are often taken at the entire database level, but sometimes at the level of a semantically related subset of tables. Snapshots occur at regularly scheduled intervals. Consequently, snapshots will miss any updates that are overlaid by later updates to the same data that happen before the next snapshot is taken. Another drawback to snapshots is that they are an inefficient way to record history, because they create copies of rows whether or not those rows have changed.

Version history is individual row-level logical updates, implemented without overlaying data, by "retiring" the current version of the object and replacing it with a new version that contains the updated data. Versions are created on an as-needed basis, i.e., every time an update important enough to be versioned occurs. Consequently, versions will not miss any updates. They are also an efficient way to record history because they create new rows only when a versionable change has occurred.

Thus, for example, suppose that the current version of customer 123's name is Mary Smith. When she changes her name to Mary Jones, the Mary Smith version of that customer loses its status as the current version. The Mary Jones version is inserted and marked as the new current version. Effectivity dates indicate when each version was the truth about Mary, as the business knew it.
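As a sketch of how that logical update might look physically (the table, columns and dates here are illustrative only; the series develops its own version patterns beginning in Part 3), nothing is overwritten: the old version is retired and a new current version is inserted.

-- Hypothetical Customer_Version table; a null eff_end_dt marks the current version.
-- Retire the Mary Smith version by closing out its effectivity period.
UPDATE Customer_Version
   SET eff_end_dt = DATE '2007-09-14'
 WHERE cust_nbr = 123
   AND eff_end_dt IS NULL;

-- Insert the Mary Jones version as the new current version.
INSERT INTO Customer_Version (cust_nbr, cust_name, eff_beg_dt, eff_end_dt)
VALUES (123, 'Mary Jones', DATE '2007-09-15', NULL);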

Over the past two decades, many suggestions have been made as to how vendors should implement time management in relational databases, and how SQL DML and DDL should be modified to provide declarative access to those new capabilities. The best access to pre-2000 computer science research is provided by the principal computer scientist working in the field of temporal databases.1 A well-known commentator on relational databases and a member of the SQL standards committee have also written on this topic.2 But our focus is not on recommendations to vendors or to the SQL standards committees. Our focus is on the here and now. Given today's relational DBMSs, and the SQL common to them, how can we provide the temporal support that businesses increasingly need, and that they need now?

This series of articles will continue in the DM Direct newsletter and will be published every other issue, beginning May 4. The discussion will be at the implementation level of detail, although, to emphasize its relevance, we will present this very technical material in terms of a real-world business application. In these articles, academic research will be brought to bear on the problem of providing temporal data management in today's relational databases. However, our emphasis is more on real-world implementation than on theory. These articles will address the problems and issues faced by business IT data architects, data modelers, DBAs, the developers who write the ETL or other interface logic to update temporal tables, and the developers who write the views which insulate the rest of the IT and business community from the complexities of temporal data. It's about time.

References:
1. R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. San Francisco: Morgan Kaufmann, 2000.
2. C. J. Date, Hugh Darwen and Nikos Lorentzos. Temporal Data and the Relational Model. San Francisco: Morgan Kaufmann, 2002.


Time and Time Again: Managing Time in Relational Databases, Part 2 - Preliminaries
Tom Johnston, Randall Weis
DM Direct, May 2007

The needs which businesses have for versioned history are many and varied. However, those needs and the data patterns which satisfy them seem to fall quite nicely along a linear progression from simple to complex. We will call the patterns that satisfy each set of business needs "version patterns" because, in the terms organized by the taxonomy in Part 1 of this series, we are concerned with queryable history of the states of persistent objects, kept as versions of those objects.

However, our concern will not be exclusively with past and present versions. The more general topic is maintaining and querying tables which contain noncurrent as well as current versions of their persistent objects. And there are other kinds of noncurrent versions than historical ones. We will also consider versions for error correction, versions for future states, versions which overlap in time, coordinated (pending/sandbox) updates and even versions for hypothetical states.

Version Patterns: Manner of Presentation

In this series of articles, we will discuss a dozen version patterns. Each pattern will be introduced by describing the business requirements it supports.


Next, we will show how these version patterns work by presenting several scenarios and timelines for each pattern. Each scenario will show the results of doing an insert, update or delete against a table constructed according to the pattern. Each timeline will show what would be returned by a query against the past, present or future state of the object whose versions are being used as the example. Initially, these transactions will be simple ones that physically affect just one row in one table. But gradually, more semantically expressive scenarios will be considered, ones in which what is seen as one transaction by the issuing code or person will result in several rows being affected in possibly several tables.

Following the scenarios for each history pattern, we will present and comment on a list of semantic constraints that must be followed for the pattern to work. Most of these constraints can be enforced only by triggers, stored procedures or application code. It is an indication of how "special" and complex temporal state management is that the normal DBMS entity and referential integrity mechanisms can do very little to enforce the semantics of these patterns.


Finally, each pattern will conclude with a list of objections to it that we have encountered. Some of these objections are valid in the sense that, if the different business requirements for the pattern were weighted a little differently, a different pattern, or at least a different physical implementation than that suggested by the pattern, would be appropriate. Other objections are not valid, either because the pattern was not fully understood by those criticizing it, or because the flaws in the alternative pattern that the critics preferred were not fully understood. In this series of articles, it will not be possible to discuss all the objections to these patterns that we are aware of. But we can, and will, at least note the existence of the objections we are aware of.

Following is an annotated list of the version patterns which we will discuss in these articles.

Figure 1: Annotated List of Version Patterns

Scenarios


Figure 2 shows the template we will use to present scenarios.

Figure 2: The Scenario Template

The numbered boxes in the template will contain the following information:

1. Name of the scenario.
2. Table used for this scenario.
3. Definition of the table.
4. Description of the scenario.
5. Now (whatever date that may be).
6. Column headings for the table.
7. One or more rows of the table.

Basically, boxes 1 through 5 contain metadata that is necessary to understand the scenario. Box 6 contains the column headings for the table under discussion. Boxes 7 are one or more rows, each of which represents a row in a relational table.

Timelines

Except for the first two patterns, we will also use a timeline diagram to explain what is going on. Figure 3 illustrates this diagram. It will be used beginning with Version Pattern 3, which is the first pattern that goes beyond a simple update-in-place approach to history.

Figure 3: The Timeline Diagram

Health Care Insurance: A Business Example for Version Patterns

Health care insurance is not the only industry which has a need for real-time history. But the insurance industry in general is certainly one in which real-time history is very important. Two paradigmatic insurance situations make the point: processing a claim and answering a customer telephone inquiry.

As for the first paradigmatic situation, a claim is filed for an event which happened at some point in the past. The claimant has a policy with the insurance company, to which she sends the claim.


But the policy which is relevant to the claim is not her policy as it is when the claim is received. Rather, it is her policy as it was when the claimable event occurred. In property insurance, the claimable event is damage to property. In health care insurance, the claimable event is a service event, an event in which a doctor, hospital or other health care professional or organization provided medical services to the claimant. Processing a claim involves comparing the claim to a past state of the policy against which the claim is being made. That past state may happen to be identical to the current state of the policy. But the point is that it equally well may not. And, in point of fact, it often is different. So processing a claim requires real-time access to historical data.

There is a customer-facing aspect to real-time claims data, also. And because it is customer-facing, it represents not just a need which a processing system has, but a product differentiator for the insurance company. This is our second paradigmatic situation. These events occur when a customer calls the insurance company and asks for an explanation of benefits. Usually, what the customer is concerned with are benefits not provided, i.e., claims not paid or claims that were reimbursed for less than the policyholder thought they should be. The customer service representative (CSR) needs online, real-time history to answer the customer's questions during that phone call, there and then. He needs complete information on the caller's policy as it was on the date of the service event behind the claim in question. The need for real-time access to policy history exists because it is important to answer the caller's questions while the caller is on the phone. Companies which cannot do that are at a serious competitive disadvantage.

Other industries will have their own paradigmatic situations illustrating their own needs for real-time access to history. But claims processing is the operational heart of insurance, and customer satisfaction is the operational soul of any business. So: historical data is already being managed in insurance company databases. Our intent, in this series of articles, is to present what we believe is a better way of satisfying this need than the various ad hoc solutions we have seen over the past few decades. And better has the bottom-line meaning it always has had: better in terms of revenue enhancement, cost containment and customer service.
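In query terms, "the policy as it was when the claimable event occurred" means selecting by the service date rather than by the current date. The following is a sketch only, using hypothetical names (Policy_Version, ver_beg_dt, ver_end_dt) rather than the version patterns developed later in this series.

-- Retrieve the version of the claimant's policy that was in effect on the
-- date of service, which may or may not be the version in effect today.
SELECT pv.*
FROM   Policy_Version pv
WHERE  pv.policy_nbr = :claim_policy_nbr
  AND  :service_date BETWEEN pv.ver_beg_dt AND pv.ver_end_dt;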


Two additional points will serve to complete our introduction to this topic. The first point illustrates the increasing importance of versioned history by considering the breakdown of a traditional alignment of current versus historical with operational versus analytical. The second point is that dates and timestamps (of varying granularity) are just different ways of representing the tick of a logical clock that provides reference times for all versioned history.

Operational Data and Analytical Data

Another way of realizing the significance of this topic is to understand it in terms of the breakdown of a traditional alignment. Traditionally, the distinction between operational data and analytical data has lined up with the distinction between current data and historical data. In talking about the increasing need for real-time access to historical data, we are emphasizing that this alignment is breaking down.

On the one hand, data marts have traditionally been considered part of the analytical environment, not part of the operational environment. Data marts are and always have been a means of discovering trending-over-time patterns. But near real-time data is increasingly being required in data marts. The reason is that businesses are finding that they can make better analytical decisions if the relevant data reaches right up to the present moment.

This is a particularly acute need for large-scale inventory management systems for consumer goods, especially during the holiday season. The analytics needed from such systems do not directly concern on-hand, on-order, sales and returns volumes. Rather, the single critical concern is with velocities. The critical question isn't, "How much did we sell last week?" or even "How much do we have on-hand and on-order right now?" The critical question is, "Is our on-hand and our pipeline adequate to keep product in stock, given the rate at which it is being sold?" Since sales volume for much consumer merchandise peaks during the holiday season, the velocity that decision-makers are concerned with, at that time of year, isn't one based on rates of change reaching back six months. It's one based on rates of change reaching back six days! So this is a situation in which current data is needed for analytical decision-making.

If you are thinking, at this point, that we have just described a need for operational data, not for analytical data, then you are illustrating the confusion caused when an alignment (operational = current, analytical = historical) breaks down. What we have just described is a need for near real-time data. If we put the currency of data aside, however, what distinguishes operational from analytical data? One answer might be that analytical data is used by upper-level management, while operational data is used by lower-level management. While this happens to be true, it isn't very informative. What we need to know is what upper-level management is doing with its data that is different from what lower-level management is doing with its data. The answer is that analytical data attempts to predict the future, while operational data attempts to optimize the present. Analytical data uses the past to establish curves, one of whose axes is time. The more accurate those curves are, the better an indicator they will be of future trends. For the best basis on which to guess what the future will be like is a sufficiently long and accurate timeline of what has happened in the past. Just project the curve into the future.

So, on the one hand, as our inventory management example illustrated, real-time data is needed for analytical decision-making. Conversely, operational data, whether in OLTP systems or in operational data stores (ODSs), increasingly includes relevant historical data, and operational decision-making increasingly requires access to that data that is just as real-time as is access to current data. Again, in the insurance industry, the past state of insurance policies is the relevant example. The claim is current data. The relevant state of the policy is historical data. So this is a situation in which historical data is needed for operational decision-making. The old "operational = current, analytical = historical" alignment is breaking down. Historical data is no longer "second-class" data.

Every Tick of the Clock

Throughout these articles, we will be talking about dates, not timestamps. This is for the sake of simplicity; nothing in these patterns would be disturbed if we were to talk about timestamps instead.
However, this simplification means that the granularity at which time will appear in these examples is the granularity of a single day.


For example, we will talk about a version ending one day prior to the start of the next version. More generally, a version ends one tick of the clock before the start of the next version (assuming, as we do for now, that there can be no gaps between versions of the same thing being versioned). That is one tick of whatever "clock" is being used. With dates, every clock tick is one day. However, our discussions apply equally to situations in which timestamps are used instead of dates. With timestamps, the clock ticks once for every change in timestamp value, at the granularity at which the timestamps are recorded. Thus, for example, timestamps can record a clock that ticks in hours, minutes, seconds, milliseconds, etc.

Using a clock for which each tick is one day isn't at all unrealistic. It supports a business requirement, for example, to load a data warehouse or ODS once a day, with whatever the source data is like at that point in time. For many warehousing situations, a clock that ticks once a day is fine-grained enough to meet all business requirements for the warehouse.

Note: We will not include the century in the dates used in these examples. Instead, we will specify them as "mm/dd/yy." This is to keep the columns as narrow as possible, to fit entire rows across a page. However, there is one "special date" value that will require a century. It's the value "12/31/9999." Obviously, "12/31/99" cannot be reserved for a special meaning, since it is a date that is likely to appear as real business data. But "12/31/9999" is available to carry a special meaning, on the assumption that no business data will occur that needs to specify that exact day nearly eight thousand years from now. That special meaning has some of the semantics of "end date unknown," and also some of the semantics of "end of time." But we think that the most accurately descriptive name for this value is "until next changed or deleted." These semantics will be examined more thoroughly in a later installment in this series. The differences are subtle; and sometimes, in the real world of business IT, they are even relevant.

With this special meaning, "12/31/9999" is not a date; that's why we refer to it as a date value, i.e., a value that appears in a date data type but which does not carry the semantics of a date. (If this appears awkward to you, it is. We need the SQL standards committees, and the vendors, to provide something better for us. But until they do, this is what we have to work with.)

Part 3 begins the discussion of specific version patterns.


Time and Time Again: Managing Time in Relational Databases, Part 3 - Version Patterns 1 and 2
Tom Johnston, Randall Weis
DM Direct, May 2007

Version Pattern 1: Updating in Place

Updates in place destroy history by overwriting the previous version of the rows they update. But in most cases, this is the way tables are in fact updated. In these cases, the business does not require that the earlier states of the updated rows be available as queryable history. For this kind of data and these business requirements, the need for history is so infrequent, and so seldom real-time critical, that reconstructable history is good enough.

However, even with tables that lack queryable history, it is quite common for developers to include both a create date and a last update date; these two dates provide a minimal level of support for history. Let's include them as business requirements for Version Pattern 1 and note them as follows:

{BR 1}. Retain the date each row was created, and also the date of its last update.
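A minimal sketch of a table satisfying {BR 1} might look like the following. The column names anticipate the Policy table used in the scenarios below; the data types are assumptions of ours, not something specified in the article.

-- Version Pattern 1: an ordinary, non-versioned table. Its only temporal
-- metadata is a create date and a last update date.
CREATE TABLE Policy (
    policy_nbr    CHAR(4)  NOT NULL,   -- primary key (PK)
    client_nbr    CHAR(4)  NOT NULL,   -- foreign key (FK) to a Client table
    policy_type   CHAR(3)  NOT NULL,   -- e.g., PPO, HMO
    crt_dt        DATE     NOT NULL,   -- date the row was inserted
    last_updt_dt  DATE,                -- null until the row is first updated
    PRIMARY KEY (policy_nbr)
);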


The scenarios below illustrate Version Pattern 1. Our example will be an insurance policy table.

Scenario 1.1: An Insert to the Policy Table

On 1/12/04, a policy for client Mary Jaspers is inserted into the Policy table for the Acme Healthcare Insurance Company (AHIC). For now, we will assume that this is also the action that makes the policy effective. In other words, a policy with AHIC becomes effective on the date that a row representing that policy is inserted into AHIC's Policy table.

Note: although 1/12/04 is in the past as of the date of publication of this article, the paragraph above said that on "1/12/04, a policy for client Mary Jaspers is inserted into the Policy table" (italics added). It did not say "... a policy ... was inserted ...". This "is" should be considered a tenseless "is," not an indication of the present tense. The reason to speak tenselessly whenever possible is that otherwise the tenses involved in an English-language description of these situations simply become too complex. With a tenseless natural language description, we will not be distracted from the focus of these articles, which is the management of time in relational databases, not the management of time and tenses in natural languages.


In later history patterns, we will drop this assumption, and distinguish system activity dates, such as the dates rows were inserted, updated or deleted, from business effectivity dates, such as the begin and end dates of business effectivity, or the effectivity dates for a correction to an earlier entry. This distinction between system activity and business effectivity dates is referred to, in the literature, as the "bi-temporal" distinction.

Note: those in the health care insurance industry will recognize that this example about policies is not a typical situation. Typically, health care insurance policies are group policies, which are contracts between a client company and the insurance company, not contracts between the insured individuals and the insurance company. In the property and casualty industry, on the other hand, policies which cover an individual or family are usually contracts between the insurance company and that individual or family. But health care insurance companies do issue individual policies, and we believe that this will provide a simpler set of examples with which to illustrate our proposal for managing versioned history.

Scenario 1.1 shows the result of entering Mary's policy into the Policy table. (See the previous article, Part 2, for an explanation of the scenario template used here and throughout the remaining articles in this series.)

Note: the conventions for distinguishing different types of columns in these illustrative tables are as follows:

Primary key columns are listed left-most, and their column headings are bold-faced. They contain the suffix (PK).

Foreign key columns are listed next, and their column headings are underlined. They contain the suffix (FK). (These articles do not contain examples of foreign keys which are also primary keys. One of us (Johnston) has written several articles explaining why he thinks that foreign keys should never occur in primary keys. See References for links to these articles.)

Non-primary key and non-foreign key business data columns are listed next.

Metadata columns, such as crt-dt and last-updt-dt, are listed last.

The client-nbr foreign key can be ignored for now. It will be used in later scenarios. The same is true for [now], whose template cell contains "n/a" (for "not applicable"). [Now] (the brackets indicate that we are talking about a metadata item, and not merely referring to the present moment) will not become relevant until we move on to more complex scenarios.


The policy type is PPO (preferred provider option). The first metadata column (crt-dt) tells us that the row was inserted on 1/12/04. The {null} in the second metadata column (last-updt-dt) tells us that, so far, this row has not been updated. Presumably there are many other rows and columns in the Policy table. But we won't show any of them until later on, when they are needed to illustrate discussions of more complex patterns.
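In SQL terms, and against the sketch table shown earlier, Scenario 1.1 amounts to something like this (the client number is a made-up placeholder, and dates are written here as yyyy-mm-dd rather than in the figures' mm/dd/yy form):

-- Scenario 1.1: the policy becomes effective on the date its row is inserted.
INSERT INTO Policy (policy_nbr, client_nbr, policy_type, crt_dt, last_updt_dt)
VALUES ('P138', 'C882', 'PPO', DATE '2004-01-12', NULL);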

Scenario 1.2: An Update to the Policy Table

On 3/14/05, Mary changes her policy type from PPO to HMO (health maintenance organization). After that change is recorded in the Policy table, Mary's Policy row looks like this:
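In SQL terms, again a sketch against the assumed table above, the update in place that produces that row is simply the following; the old policy type is overwritten, and only last_updt_dt records that anything changed.

-- Scenario 1.2: update in place.
UPDATE Policy
   SET policy_type  = 'HMO',
       last_updt_dt = DATE '2005-03-14'
 WHERE policy_nbr = 'P138';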

The use of a create date and a last update date is common in business IT. And with those two dates, we are not entirely lacking in temporal metadata about this row. Crt-dt tells us that the row was inserted on 1/12/04; last-updt-dt tells us that it was last changed on 3/14/05. But there is, nonetheless, a lot that we do not know. We do not know how many times this row was changed prior to 3/14/05 (if any), nor what column or columns were changed by any earlier updates. We don't even know what column or columns were changed on 3/14/05. And even if we did know that it was the policy type column that was changed, we still wouldn't know what the previous value was.

So, information has been lost. More precisely, using the taxonomy introduced in Part 1, information in the form of queryable history has been lost. Reconstructable history is always available, of course. In this case, to get policy P138's reconstructable history, we would scan the logfile of the initial inserts and all subsequent updates to the Policy table, starting on 1/12/04, and arrange the logfile entries for P138 in chronological order.

Scenario 1.3: A Physical Delete to the Policy Table

On 6/1/06, Mary's policy is ended (by contract expiration, explicit termination, revocation, etc.; it doesn't matter which). This is recorded, using Version Pattern 1, by physically deleting her policy's row in the Policy table. After that change is recorded, the Policy table looks like this:
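In SQL terms (again a sketch), the physical delete that produces that result is a single statement; afterwards the table simply no longer contains policy P138.

-- Scenario 1.3: physical delete. All queryable history of P138 is lost.
DELETE FROM Policy
 WHERE policy_nbr = 'P138';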


After the update shown in Scenario 1.2, as we said above, we had both the current state of Mary's policy and also a small amount of metadata. Now we have neither. We can't even tell that Mary ever had a policy with AHIC.

Sometimes losing this much information to queryable access doesn't matter. Sometimes all we need to know in real time is what things are like right now. But frequently, businesses do require a minimal amount of real-time available history, such as, in this case, information about policies no longer in effect. This requirement can be met by replacing a physical delete with a logical delete. This too is a technique well known to business IT professionals.

Eventually, logically deleted rows will be archived off the online table. This is usually done based on their age. For example, the business rule might be "at each month-end, archive all logically deleted rows that were deleted more than forty-nine months ago." In terms of the taxonomy introduced in Part 1, this is a rule which states the conditions under which a row changes from queryable to reconstructable. But for now, we can ignore the archive process. What we want to do is see how a logical delete is typically implemented. This is shown in Version Pattern 2.

Version Pattern 2: Updating in Place with Logical Deletion

Because physical deletes destroy even the little historical information that Version Pattern 1 provides, businesses frequently ask their IT departments for some way to preserve that minimal level of historical information. Usually, the way this is done is to do a logical delete instead of a physical delete.

{BR 2}. Support {BR 1}, and also retain deleted rows as queryable history.

There are variations on this theme, of course, as there are on almost any data pattern. We will explain below why we chose the variation we did.

Scenario 2.1: A Logical Delete to the Policy Table

To support logical deletes, we need an additional column - a delete indicator. Often this is implemented as a delete flag, set to "Y" if the row is logically deleted, and to "N" otherwise. The date of the deletion is then put in last-updt-dt. Another option is to have a separate delete date column. With this option, a delete flag isn't needed, because logically deleted rows are all and only those rows whose delete dates are not null. We prefer this second option because it preserves rather than overwrites the metadata information about the last update date (and also because we have a further use for this new column, which we will discuss later on).

On 6/12/06, Mary Jaspers' policy ends. Starting on that date, it is no longer in effect. This time, we record that event by logically deleting her policy's row in the Policy table. After that change is recorded, the table looks like this:
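In SQL terms (a sketch against the assumed table used in the earlier sketches), the schema change and the logical delete look like this:

-- One-time schema change for Version Pattern 2: a delete-date column.
-- A null del_dt means the row is not logically deleted.
ALTER TABLE Policy ADD COLUMN del_dt DATE;

-- Scenario 2.1: the logical delete. crt_dt and last_updt_dt are preserved.
UPDATE Policy
   SET del_dt = DATE '2006-06-12'
 WHERE policy_nbr = 'P138';

Retrievals of policies still in effect must then add a del_dt IS NULL predicate.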


Note: to make it possible to show an entire row on one line, we will start doubling up. The two metadata dates, crt-dt and last-updt-dt, will be stacked from this point forward. Later, other such doubling-ups may be required. Keep in mind that, in spite of the graphical layout, crt-dt and last-updt-dt are still two distinct columns in this table.

The Semantics of Version Pattern 2

For Version Pattern 2 to satisfy its business requirements, the following constraints must be enforced when updating the table (a sketch of how two of them might be enforced follows the list):

{SC 2-1}. Always, del-dt must be null if the row is not logically deleted, and must otherwise contain a valid date.

{SC 2-2}. When entered, del-dt must be the current date, or else a future date. It cannot be a past date.

{SC 2-3}. When entered, foreign key semantics must be enforced.

{SC 2-4}. Once entered, del-dt cannot be changed. If del-dt is not null, the date in it must be the date originally put in it.
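As noted in Part 2, constraints like these cannot simply be declared to today's DBMSs; they have to be enforced by triggers, stored procedures or application code. Purely as a sketch, in PostgreSQL-flavored trigger syntax and with the same assumed table and column names as before, {SC 2-2} and {SC 2-4} might be enforced like this:

-- A sketch only; trigger syntax varies by DBMS. Rejects updates that would
-- set del_dt to a past date (SC 2-2) or change a del_dt already entered (SC 2-4).
CREATE FUNCTION check_policy_del_dt() RETURNS trigger AS $$
BEGIN
    IF OLD.del_dt IS NULL AND NEW.del_dt IS NOT NULL
       AND NEW.del_dt < CURRENT_DATE THEN
        RAISE EXCEPTION 'del_dt may not be a past date (SC 2-2)';
    END IF;
    IF OLD.del_dt IS NOT NULL AND NEW.del_dt IS DISTINCT FROM OLD.del_dt THEN
        RAISE EXCEPTION 'del_dt may not be changed once entered (SC 2-4)';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER policy_del_dt_check
    BEFORE UPDATE ON Policy
    FOR EACH ROW EXECUTE FUNCTION check_policy_del_dt();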


{SC 2-2} is particularly interesting. What is the reasoning behind it? Why can't we use a past date when logically deleting a row? After all, if we were supposed to delete a row on a particular date, and simply forgot to do it, why can't we just apply the delete retroactively? Wouldn't that produce the same result as if we had done the delete on its originally scheduled date?

The answer is "no." If a past date is used, we are changing the past, and doing so without leaving any indication that we did. In other words, if a past date is used, we were either not telling the truth before the logical delete, or we are not telling the truth after the logical delete. Here's why. Imagine we were supposed to delete the row seven days ago and that we logically delete the row today, by setting its del-dt to the date seven days ago. Whether we did this seven days ago or do it today, the result will be the same: from today forward, the database will show a logical delete on 6/12/06. However, during the seven days prior to this action, the row was not deleted, and so during that period of time our database said that the policy was active. But as soon as we put a past date into del-dt, our database says that the row was logically deleted during those seven days, and therefore that the policy was not active during that period of time. Well, either it was active over the past seven days, or it wasn't. We can't have it both ways.

With this retroactive delete, we have changed the past, and left no indication that we have done so. A query counting active policies in the period 6/6/06 - 6/12/06, run sometime during the past week, would include Mary's policy in that count. The identical query, run anytime after that event, would not include her policy.

Even so, the assumption is that we were supposed to delete that row some time in the past. The reason is that, just as a policy becomes effective on the date a row for that policy is inserted into the Policy table, a policy ceases to be effective on the date a row for that policy is deleted from the Policy table. We missed the date to delete the policy, and now we apparently have no way to correct the mistake and show the date on which Mary's policy became no longer in effect.

Our dilemma is this: as long as the date of a physical insert into a table is also a business effectivity start date, and the date of a physical delete is also a business effectivity end date, we had better be sure we can insert and delete on time! For if we miss either an insert or a delete date, the only way we can correct the mistake is to lie about the past!

The dilemma is resolved, not by lying about the past, but rather by eliminating the homonyms that are the source of the dilemma. Crt-dt is a homonym because it means both "the date the row is inserted" and also "the date that the object represented by the row becomes effective." Del-dt is a homonym for analogous reasons. As we said at the beginning of this article, distinguishing database activity dates from business effectivity dates creates what computer scientists call a "bi-temporal" pattern. We will have more to say about bi-temporality in later installments, starting with the discussion of Version Pattern 5.

As for {SC 2-3}, it simply says that, semantically, a logical deletion must follow the same referential integrity rules as a physical deletion. So, for example, if there are any foreign key dependencies on P138, then either the logical delete of P138 must be blocked, or else it must trigger a cascade (logical) delete to all rows that refer to it.

As for {SC 2-4}, the reason we cannot change del-dt once it has become non-null is the same. Doing so would change the past. Our database would give different answers, depending on when it was queried. At least one of those answers would have to be false.

This concludes our discussion of Version Patterns 1 and 2. The next installment will discuss Version Pattern 3.

References:
1. Tom Johnston. "Primary Key Reengineering Projects: The Problem." DM Review, February 2000.
2. Tom Johnston. "Primary Key Reengineering Projects: The Solution." DMReview.com, March 1, 2000.
3. Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part 1." Datawarehouse.com, December 6, 2002.
4. Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part 2." Datawarehouse.com, January 13, 2003.


Time and Time Again: Managing Time in Relational Databases, Part 4 - Preliminaries to Version Pattern 3
Tom Johnston, Randall Weis
DM Direct, June 2007

The first two version patterns, described in Part 3, support a minimal form of queryable history in business databases. But sometimes that isn't enough. For even if we use Version Pattern 2 - updating in place with logical deletion - there are still two important ways in which history can be lost. First, as soon as the first update takes place, the row as originally entered is lost, because the update overwrites the row. So we lose the information that was true of the policy on the date it became effective. Next, considering the entire row as a single unit of data, all updates but the most recent one are also lost, including even the knowledge of whether or not there were any previous updates. Sometimes this is acceptable; but sometimes it is not.

Note: this may suggest that we should manage versions at the level of individual row/columns, not at the level of entire rows. Although that would certainly be possible, we believe that the same semantics can be supported by versioning entire rows, and at a lower overall system cost. We will not consider row/column-level versioning in the remainder of these articles.


Now we will consider a more stringent business requirement for history.

{BR 3}. Support {BR 2}, and also keep a complete queryable history of policies, i.e., of all inserts, updates and deletes to all policies.

As our history taxonomy indicates, our first choice is to keep this history as a chronological record of either events or states. We could keep an event-based history of changes to policies by adding a row to the Policy table each time a new policy was created, and after that keeping a transaction table in which each transaction was an update or delete to the policy.

Event-based history is most appropriately used to manage changes to the metric values of relationships among persistent objects, values such as counts, quantities and amounts. These relationships that contain metrics are often called balances. Each change to a relationship metric is recorded as a transaction, and in the traditional star schema design, each transaction contains a date, the change to each metric, and a foreign key to each object that is affected by the transaction, objects such as customers, products, suppliers and so on.
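As a sketch of the shape such a transaction table takes in a star schema (the names here are generic and illustrative, not part of the articles' policy example):

-- Event-based history: each row records one transaction against a balance,
-- with a date, the change to each metric, and foreign keys to the objects involved.
CREATE TABLE Sales_Transaction (
    transaction_dt  DATE           NOT NULL,
    customer_nbr    CHAR(8)        NOT NULL,   -- FK to a customer dimension
    product_nbr     CHAR(8)        NOT NULL,   -- FK to a product dimension
    quantity        INTEGER        NOT NULL,   -- change to the quantity metric
    amount          DECIMAL(11,2)  NOT NULL    -- change to the amount metric
);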


However, event-based history is not appropriate for managing changes to non-metric values of persistent objects. For them, state-based history is the preferred option. Because we have chosen to keep a state-based history of policies, we must now decide whether to do so using snapshots or versions. Snapshots are the preferred method for recording the coordinated history of many rows, usually across many tables, as of a single point in time. However, if the same column in the same row is updated more than once between snapshots, only the last update will be captured. Therefore, since the business requirement is to track all changes to policies, we must do so using versions.

What Version Tables Version

Before we dive into the details of managing versioned history, we must first understand what is being logically inserted, updated and deleted in versioned tables. It is not versions. It is the objects they are versions of. The physical insertion of a version logically inserts, updates or deletes an object, and does so in the following way. The physical insert of the first version for a policy is a logical insert of that object, i.e., of that policy. The physical insert of each subsequent version for that policy which is not designated as a logical delete of the policy is a logical update of the policy. The physical insert of a version marked as a logical delete logically deletes the policy.

Note that "object" is being used here in the sense of "anything that persists over time," not in its narrower object-oriented sense. Customers, products, suppliers, contracts, employees, facilities, inventories, packages, invoices, purchase orders - all these are objects. For the most part, we are using insurance policies as illustrative objects in these articles.

A Policy Version Table

Each row in the original Policy table represents a policy (of course). But no row in the Policy Version table represents a policy! Instead, each row represents what we know about what a policy looked like during a designated period of time. Multiple rows in the Policy Version table may have the same policy-nbr, of course. If this were a Policy table, that would be an error. But in a Policy Version table, it is not an error. In a Policy Version table, those rows, and only those rows, with a given policy number are versions of that policy. Those rows, and only those rows, contain information about that policy during a period of time. So a version is a timeslice of an object. It is not the object itself. A complete sequenced set of versions is a full timeline for an object.

Since none of these rows in the Policy Version table represent policies, where are the rows that do? If this table is not a table of policies, where is that table? The answer is that, in our implementation of versioned history, there is no Policy table. The reason is that we don't need one. For any point in time during which a policy was a policy, we can retrieve a row from the Policy Version table that represents the policy as of that point in time. The first row in the Policy Version table for a policy shows what the policy was like when it became effective.


Subsequent rows show what the policy was like after each of the changes that happened to it, and until the next one happened. If the policy terminated, there is a version for that event also.

A Simplifying Assumption: One and Only One Episode per Object

Before proceeding, we will introduce two simplifying assumptions into our initial discussion of versions, assumptions which we will drop later on. The first one is this: objects can't recur. For example, if Mary Jaspers' policy P138 terminated on a given date, but the next day Mary called and said she wanted to continue her policy, then, given this assumption, AHIC would have to issue her a new policy. If Mary wanted no changes to her policy, the new policy would be identical to the old one except for its policy ID number, but it would in fact be a new, distinct policy.

If an object could recur, what would that look like, as far as data is concerned? In the case of Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock ticks, by another version for P138 which is not a logical delete. Normally, persistent objects reappear after some period of time, i.e., after some possibly large number of clock ticks. So there can be gaps between episodes of the same object, although if there is only one clock tick between a logical delete and the next version of an object, those episodes would not have a gap between them. This is illustrated in Figure 1.

Figure 1: Episodes of an Object

Since that next version is an initial version of the policy after a prior logical delete of that policy, it constitutes a reappearance of the policy. Let us call the first version for P138 (the Jan 1, 2005 version) the original initial version, and each other version which immediately follows a logical delete of P138 (the Jul 15, 2005 version) a successor initial version. The count of original and successor initial versions for an object is a count of what we will call episodes of that object. Gaps may or may not exist between successive episodes. In this case, the gap extends from one clock tick after the first episode is terminated to the clock tick on which the second episode begins.

If the number of logical deletes of an object is the same as the number of episodes, then, for as long as that is true, the object does not currently exist on our database. That is because every episode, including the most recent one, has been terminated by a logical delete. The only other case that can arise is that the number of episodes is one greater than the number of logical deletes. That is because, in a series of versions for the same object, what splits them into episodes are the logical delete versions. In this case, the object does currently exist on our database, because there is no logical delete version for the most recent (or only) episode.

So our simplifying assumption can also be expressed like this: (for now) each persistent object has one and only one episode. That episode, of course, can consist of any number of versions. This single-episode constraint means that AHIC (our fictional insurance company) does not permit policies to recur. So if a policy lapsed for even a single day, and the policy holder wanted her policy reinstated immediately, she would be assigned a new policy. The initial version of that new policy would become effective one clock tick after the deletion of her previous policy.
This is not a realistic way of doing business, of course, and later on we will consider scenarios in which a lapsed policy may be reinstated.
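
In SQL terms, the test for whether an object currently exists reduces to whether its most recent version is a logical delete. Here is a minimal sketch, not taken from the articles: it assumes a Policy Version table of the kind developed later in this series, keyed by policy_nbr and ver_beg_dt, in which a logical delete version is one whose obj_end_dt is not null.

SELECT pv.policy_nbr
FROM   Policy_Version pv
WHERE  pv.ver_beg_dt =
       (SELECT MAX(pv2.ver_beg_dt)          -- the most recent version of each policy
        FROM   Policy_Version pv2
        WHERE  pv2.policy_nbr = pv.policy_nbr)
  AND  pv.obj_end_dt IS NULL;               -- and it is not a logical delete version

This lists the policies that currently exist on the database; a policy whose latest version is a logical delete, i.e., whose number of episodes equals its number of logical deletes, is excluded.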

Another Simplifying Assumption: No Gaps Within Episodes

Our second assumption is that there can be no gaps, within a given episode of an object, between successive versions. One version must follow right after the other. That is, all noninitial versions of an episode of an object must begin on the next tick of the clock after the version they supersede ends. So in Figure 1, each arrow which is labeled "version {n} ends; version {n+1} starts" actually points to two adjacent clock ticks. A corollary is that each nonterminal version of an episode of a persistent object must end on the tick of the clock immediately before the tick on which the version it precedes begins. (See Part 2 for a discussion of clock ticks.)

Let us call a gap within an episode an intra-episodic gap, using the term "episode," as indicated above, to refer to the time span for an object that reaches from the begin date of its initial version to the end date of the first logical delete reached by following that object's versions in chronological sequence. And let us call the first assumption, that objects have only a single episode, the requirement that there be no inter-episodic gaps.

Note: Bear in mind that while we refer to an event's commencement and conclusion as its begin date and end date, these temporal markers would frequently be a more granular date/timestamp. As always, the key concept here is a "tick of the clock," at whatever level of granularity is appropriate for your specific application.

Semantic Constraints

Our two simplifying assumptions come to this: until further notice, these articles are discussing single-episode objects whose versions have no gaps. The semantic constraints imposed by these assumptions are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object, and is not itself a logical delete version, the end date will be assumed to be unspecified. (This is not always the case, of course; and in later patterns, we will drop this simplifying assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted as of that delete date.

{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object.

There are several issues raised by these semantic constraints, all of which we will discuss later on. They include:

The semantics of "unspecified" end dates. For some versions which do have an end date, that end date may be unknown at the time the version is entered. For others, which we can think of as "until further notice" versions, there is no end date, known or not, at the time they are entered. In the literature, an unspecified end date is sometimes called "forever."

But forever implies that we know the version will never terminate, and there aren't many objects which have versions like that. And one may hope that if an end date for a version should be known, it is known at the time the version is entered. If we can assume that, it leaves the "until further notice" kind of version as the only kind of version with an unknown end date. But its semantics is not the same as the semantics of forever. This issue will be discussed in later installments in this series.

The existence of versions with a known end date. We must be able to manage versions of objects with a known end date, and also versions of objects whose end date is not known. As we will see later, this can prove quite challenging.

The fact, as already noted, that businesses often need to recognize the re-appearance of the same object after a gap in time. This is the need to permit multiple episodes of an object on the database. It is a common business requirement; customers come and go, products are offered for sale, later retired, and later still reintroduced, etc.

The fact that businesses often need to keep two or more versions with overlapping or identical effectivity periods.

This last point is especially important. In normal cases, versions cannot temporally overlap one another. If they did, there would be a period of time during which both were effective. But if they were both effective, how would we know which one was the truth? And what would the other one represent, if not the truth? These are topics for later articles in this series.

Corrections, Gaps and Overlaps

To avoid intra-episodic gaps, we must write code to ensure that every new noninitial version begins exactly one tick of the clock after the version it is about to supersede. However, while necessary, this is not sufficient to guarantee that there will be no intra-episodic gaps. A second way in which such a gap can be introduced is with a correction. Consider a non-initial version entered with an effective begin date that is earlier than it should be. In that case, we need to move its begin date forward. But doing so would introduce a gap between it and its predecessor version. By the same token, moving an effective end date back would introduce a gap between it and its successor version.

To prevent such corrections from violating {SC 3-1}, we must always correct two versions when an effective begin date on a non-initial version is moved forward or when an effective end date on a non-terminal version is moved backward. One is the version itself that is being corrected. The other is the adjacent version, which must now be corrected to "fill in the gap" created by the first correction. We should not think of these two physical updates as two transactions. There is one transaction; it is the movement of the transition point in time between two adjacent versions. This single semantic update requires two physical rows to be updated, as sketched below.

Semantic constraint {SC 3-1} also requires that updating an effectivity date range must not create an overlap. One caveat, of course, is that a correction which does not alter a version's effectivity date range must have exactly the same date range as the version it corrects.
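
To make the two-row nature of such a correction concrete, here is a minimal sketch in SQL. It is illustrative only: it assumes a Policy Version table of the kind developed in Part 5, extended with an explicit ver_end_dt column of the sort later patterns use (the patterns discussed so far infer end dates rather than store them), a clock that ticks once per day, placeholder dates, and transaction syntax that varies somewhat by DBMS.

-- Moving the transition point between two adjacent versions of policy P138:
-- one semantic update, two physical updates, inside a single transaction,
-- so that {SC 3-1} still holds when the transaction commits.
START TRANSACTION;

-- The version being corrected: its effective begin date moves forward.
UPDATE Policy_Version
SET    ver_beg_dt = DATE '2005-03-20'
WHERE  policy_nbr = 'P138'
  AND  ver_beg_dt = DATE '2005-03-14';

-- Its predecessor: its end date moves forward to fill the gap, ending
-- exactly one clock tick before the corrected begin date.
UPDATE Policy_Version
SET    ver_end_dt = DATE '2005-03-19'
WHERE  policy_nbr = 'P138'
  AND  ver_beg_dt = DATE '2004-01-12';

COMMIT;
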
Version Patterns 1 and 2 were both "update in place" patterns. In this installment, we have laid the groundwork for discussions of patterns which do not update in place, i.e., which do not overlay data already on the database. First, we introduced two simplifying assumptions, which we will eventually drop as we discuss increasingly complex patterns.

Next, we introduced some concepts which will prove useful in those later discussions. The concepts were:

Timeslice
Timeline
Original initial version
Successor initial version
Episode
Gaps
Intra-episodic gaps
Inter-episodic gaps

In Part 5, we will begin our discussion of Version Pattern 3.

Time and Time Again: Managing Time in Relational Databases, Part 5: Version Pattern 3
Tom Johnston, Randall Weis
DM Direct, June 2007

The business requirement for Version Pattern 3 is to be able to show what an object was like at any point in its lifetime. To do this, we must retain the before-update state of an object, not just the after-update state. We must retain a versioned history of changes to the state of an object. This means that in place of the Policy table we have been using thus far, we will need a Policy Version table.

{BR-3}. Support {BR-2}, and also retain a queryable history of changes to objects.

Tables and Keys

The primary key of our original Policy table is policy-nbr. So the primary key of our Policy Version table must be policy-nbr plus something else. The purpose of that something else is to distinguish each version of a policy from all other versions of that same policy.

Pretty much anything will do to distinguish policy versions. In fact, any value guaranteed not to recur for the same policy will do. The first thing that comes to mind is a sequence number. The initial version is 1, and the subsequent n versions are 2, 3, ..., n. This isn't a bad idea, just as the logical delete flag, discussed in Version Pattern 2, was not a bad idea. But just as we preferred a delete date to a delete flag, we prefer to use a date as the version-discriminating part of the primary key for policy versions, and thus the combination of policy-nbr and this date to uniquely identify each row in the Policy Version table.

Business Dates and System Dates

Version Pattern 3 does not distinguish business effectivity dates from physical database activity dates. We must have a business effectivity date, of course. But for the purposes of this analysis, we can consider this date as either the business effectivity start date for the version, or that date and also the date the version was physically inserted. In the former case, we simply fail to record the date of physical database activity. In the latter case, the two dates are identical, with complications that we will examine later.

The former case is simpler, but it is also almost never seen in production databases. Physical insertion dates (and physical last update dates, in the case of updates in place) are almost always included. So for Version Pattern 3, we will assume that version begin date is both the date the inserted row became business effective and also the date the row was physically inserted.

Let's now recreate the series of events described in Scenario 1. We begin with the event that created a new policy for Mary Jaspers. This event takes place on 1/12/04.

Scenario 3.1: A Policy Insert in the Policy Version Table

As we have already pointed out, the physical action of inserting the initial version for a policy is also the logical action of inserting a new policy. Here is the result of inserting Mary's policy effective 1/12/04:

Figure 1: Result of Inserting Mary Jaspers's Policy Effective 1/12/04

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version begin date. All versions for the same policy have the same policy number and are distinguished from one another by their begin date. Because Version Pattern 3 is based on the assumption that objects cannot recur (as described in Part 4), there will be no temporal gaps among versions for the same object. Consequently, we also know the end date for all versions but the current one. It is one day (one clock tick) prior to the begin date for the chronologically next version.

If the above row were inserted on any date later than 1/12/04, it would be a retroactive insert. If it were inserted on any date earlier than 1/12/04, it would be a proactive insert. But as long as rows must be inserted on their business effectivity dates, neither retroactive nor proactive inserts are possible. In Version Pattern 4, we will distinguish these two kinds of dates and consider retroactive and proactive database activity.

Using this primary key presupposes that we can never have two versions of the same policy that begin on the same date, i.e., on the same tick of whatever clock we are using. For now, we'll stick with this assumption. Later on, we will examine an exception to it.
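
In SQL terms, a minimal sketch of such a Policy Version table, and of the insert for Scenario 3.1, might look like the following. The key columns follow the article's policy-nbr, ver-beg-dt and obj-end-dt conventions (obj-end-dt is discussed in Scenario 3.3 below); the data types, the client column, and the omission of any other business columns are illustrative assumptions only.

CREATE TABLE Policy_Version (
    policy_nbr   CHAR(4)     NOT NULL,  -- identifies the policy (the object)
    ver_beg_dt   DATE        NOT NULL,  -- in Pattern 3, business effective date and physical insert date
    client       VARCHAR(40),           -- assumed business column
    policy_type  CHAR(3),               -- e.g., 'PPO' or 'HMO'
    obj_end_dt   DATE,                  -- null except on a logical delete version (Scenario 3.3)
    PRIMARY KEY (policy_nbr, ver_beg_dt)
);

-- Scenario 3.1: creating policy P138 is implemented by inserting its initial version.
INSERT INTO Policy_Version (policy_nbr, ver_beg_dt, client, policy_type, obj_end_dt)
VALUES ('P138', DATE '2004-01-12', 'Mary Jaspers', 'PPO', NULL);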

Figure 2: Timeline Diagram for Scenario 3.1

By inserting this initial version, we have created policy P138, effective 1/12/2004. Now let's see how the first update to this policy is represented in the Policy Version table.

Scenario 3.2: A Policy Update in the Policy Version Table

On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization). After that change is recorded in the Policy Version table, the table looks like this:

Figure 3: Policy Version Table After Policy Type Change
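
As a sketch, and continuing with the schema assumed above, this semantic update of the policy is implemented as a physical insert of a new version; nothing already in the table is overwritten.

-- Scenario 3.2: updating policy P138 is implemented by inserting a new version.
INSERT INTO Policy_Version (policy_nbr, ver_beg_dt, client, policy_type, obj_end_dt)
VALUES ('P138', DATE '2005-03-14', 'Mary Jaspers', 'HMO', NULL);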

Because of our no-gap assumption, we know from this data that this policy was a PPO policy from 1/12/04 through 3/13/05 - the latter date being one clock tick prior to the begin date of the next version.

Note: One way of representing date ranges is by a pair of dates (one of them inferred, in this example). Another is to specify a begin date and a duration, i.e., a number of clock ticks. These two ways of representing date ranges are semantically equivalent; each can handle all (and only) the business requirements that the other can. They differ only in details of implementation. Throughout these articles, we have chosen to represent date ranges with a pair of dates. When a pair of dates is used, we must specify whether neither, either or both are included in the range they specify. Again, the options are semantically equivalent. In these articles, we have chosen to include begin dates in the range, but exclude end dates.

The timeline for policy P138, after this update, is shown below.

Figure 4: Timeline Diagram for Scenario 3.2

Note: graphics for timeline diagrams permit us to draw the lines indicating the start of a version only in alignment with the start of a vertical bar. Vertical bars represent months. So our convention will be to draw these version start lines at the start of the month of their ver-beg-dt. Thus, a ver-beg-dt of 3/14/05 is drawn aligned with the start of the vertical bar representing March, 2005.

After this change, we do have an end date for our initial version, although it is inferred rather than expressed as an explicit data value. However, we still do not have an end date for Mary's policy itself, as is graphically illustrated by the ellipsis on the above timeline.

Scenario 3.3: A Policy Deletion in the Policy Version Table

On 6/12/06, Mary's policy is terminated. But in the Policy Version table, as shown in Scenario 3.1, we have no way to distinguish a version representing a policy deletion from one representing a policy insert. To distinguish versions which logically delete a policy, we could add a delete flag. But a delete date carries more information than a delete flag. So while we do not need a second date to manage policy inserts and updates, we do need one to manage policy deletions.

Notice that the version date in the primary key is called version begin date, not version create date or version insert date. By the same token, we will call this second date object end date, not object delete date. And, of course, we do not call it version end date. It is the end date of the object, i.e., the policy, not of a version of the object.

After we terminate Mary's policy, the Policy Version table looks like this:

Figure 5: Policy Version Table After Termination of Policy
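
As a sketch under the same assumed schema, the logical delete of the policy is implemented by physically updating its current version, supplying the object end date; and with the full lifetime recorded, a point-in-time query can then be answered. The dates below follow this scenario; the value written to obj_end_dt is the date of the termination.

-- Scenario 3.3: terminating policy P138 is implemented by updating its current version.
UPDATE Policy_Version
SET    obj_end_dt = DATE '2006-06-12'    -- the policy's last day in effect is 6/11/06
WHERE  policy_nbr = 'P138'
  AND  ver_beg_dt = DATE '2005-03-14';

-- What did policy P138 look like on 1/1/05?
SELECT *
FROM   Policy_Version pv
WHERE  pv.policy_nbr = 'P138'
  AND  pv.ver_beg_dt = (SELECT MAX(pv2.ver_beg_dt)      -- latest version beginning on or before that date
                        FROM   Policy_Version pv2
                        WHERE  pv2.policy_nbr = pv.policy_nbr
                          AND  pv2.ver_beg_dt <= DATE '2005-01-01')
  AND (pv.obj_end_dt IS NULL OR pv.obj_end_dt > DATE '2005-01-01');  -- not yet terminated on that date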

We can identify the first row as the initial version for P138 because there is no other row for P138 with an earlier ver-beg-dt. We can identify the second row as the terminal version, i.e., the logical delete point, for policy P138 because it is the only version for P138 whose obj-end-dt is not null. The lifetime of policy P138 is from 1/12/04 to 6/11/06, inclusive. A query asking for what the policy looked like, at any given point during its lifetime, can now be satisfied.

Notice that these two rows are a physically condensed way of expressing the same semantics that would be expressed by having nearly 900 rows in the table, one for each day in the life of this policy. These two rows are semantically equivalent to a set of nearly 900 consecutive daily snapshots. This illustrates how, from a data storage perspective, versions are much more efficient than snapshots.

Notice also that this is the first instance in which we have physically updated a row in the Policy Version table. Prior to 6/12/06, the row with a 3/14/05 ver-beg-dt had a null obj-end-dt. From 6/12/06 forward, it does not. So we have overwritten that state of the row. In doing so, have we lost information? In fact, we have not.

We can infer from the second row above that its obj-end-dt would have been shown as null prior to 6/12/06, and as not null from that date forward. So although we did physically overwrite that row, we lost no information about the policy, or about what we knew about the policy at any point in its lifetime.

This shows why we think Dr. Snodgrass's metaphor of the relationship between nontemporal and temporal tables of the same "things" is a poor one. He encourages us to think of a temporal table as derived from a nontemporal one by adding various dates to it. So he would encourage us to think of the Policy Version table as the Policy table with time management added to it. Indeed, in his examples, table names do not change as the tables change from nontemporal to temporal. (This metaphor is first encountered in {Snodgrass 2000, p.20}, where a Lot table is still a Lot table after adding two pairs of dates. It is again encountered on p.113 ("Adding History"), on p.117 ("...adding valid-time support to a table...") and elsewhere.)

Throughout our discussion of Version Pattern 3, we have seen the profound difference between objects and versions of objects. In temporal tables, we physically manage rows which represent versions of objects. In the process, we logically/semantically manage those objects themselves. But in that process, the table whose rows represent those objects disappears! We have not kept a Policy table and added a Policy Version table. We have replaced the Policy table with a Policy Version table; and that version table should not be thought of as the Policy table with time management added to it.

The complete lifetime of policy P138 is shown on the timeline below.

Figure 6: Timeline Diagram for Scenario 3.3

It is only now, when the policy has been logically deleted, that we have full knowledge of its timeline. It is only now that the last version on the timeline is not followed by an ellipsis.

The Semantics of Version Pattern 3

In Part 4, we defined four semantic constraints for Version Pattern 3. These constraints spell out the implications of the single-episode, no-gaps assumptions that are in effect for this version pattern. To repeat, those constraints are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the (implied) end date of the former version must be one tick of the clock prior to the begin date for the latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object and is not itself a logical delete version, the end date will be assumed to be unspecified. (This is not always the case, of course; and in later patterns, we will drop this simplifying assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted as of that delete date.

{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object.

In this installment, we have defined two dates which are needed to implement Version Pattern 3: a version begin date and an object end date. Additional semantic constraints specify the rules for populating these dates.

The first group of additional constraints specifies the rules for populating these two dates during the insert of the first version for an object. Semantically, this is the action of inserting a new policy.

{SC 3-5}. Ver-beg-dt cannot be in the past or in the future, because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, or in the future. [Now] is all we have.

{SC 3-6}. Obj-end-dt must be null. We can't end something before it begins. We can't even end something when it begins. We can only end something after it begins.

The second group of additional constraints specifies the rules for populating these two dates during the insert of a noninitial, nonlogical delete version for an object. Semantically, this is the action of updating an existing policy.

{SC 3-7}. For every noninitial version of a policy, its ver-beg-dt must be at least one clock tick past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two versions will overlap for at least one clock tick. But that is invalid because there cannot be two versions of the truth at any given point in time. However, for Version Pattern 3, this condition should never arise. Since a ver-beg-dt is also the system insert date, only an incorrect system clock could permit a version to have an earlier begin date than the version which preceded it. It shouldn't be necessary to write code to enforce {SC 3-7} because if the system clock is incorrect, we have a much bigger problem on our hands and will probably have to roll back and reapply updates anyway.

{SC 3-8}. Obj-end-dt must be null. If it is not, the policy is being deleted; and, by assumption, this is not a delete transaction.

The third group of additional constraints specifies the rules for populating object end date during a logical delete. Semantically, this is the action of terminating a policy.

{SC 3-9}. To terminate a given policy, set the object end date of the current version (which may or may not be the initial version) to [now]. The object end date cannot be in the past or in the future, because it is also the date on which the delete is physically recorded.

What Comes Next

Version Pattern 3 is a significant achievement. It retains the knowledge of the state of an object at any point during the lifetime of that object. Indeed, what more could there be? What more could businesses want that versioned tables can provide?

The answer is: quite a lot. To begin with, we might want to see what would happen with Version Pattern 3 if we distinguished business dates from system activity dates. This would mean that version begin date and object end date are business dates only. For system activity dates, we would then want to keep an insert date and a version update date.

That latter date will be {null} except when the version is updated. And because with versions we are no longer considering updates in place, this update can only be one thing - the date the object end date was specified.

In the academic literature, versioning which has both a business set of dates and a system activity set of dates is said to implement a "bitemporal" pattern. In providing an annotated list of version patterns, in Part 2, we did not distinguish a bitemporal and a nonbitemporal variation of Version Pattern 3. (In addition, the description of Version Pattern 3 was incorrect, and should have stated the opposite of what it did state.) So following is a revised annotated list of the version patterns which we are discussing in these articles. This list provides a correct description for Pattern 3. It "bumps down" the other patterns, and specifies a Pattern 4 which is Pattern 3 with business and system activity dates distinguished. Finally, it reverses the sequence of the original list's ninth and tenth patterns in order to put the two patterns which describe snapshots next to one another.

Figure 7: Annotated List of Version Patterns - Revision 1

Time and Time Again: Managing Time in Relational Databases, Part 6: Version Pattern 4
Tom Johnston, Randall Weis
DM Direct, June 2007

Version Pattern 3, discussed in Part 5, keeps a record of the state of an object at every point during its lifetime. This results in an unbroken chronological series of versions of the object and guarantees that no versionable updates will be lost. With Pattern 3 and Pattern 4, each version of an object is distinguished from all other versions of the same object by means of a version begin date. The start of an object's lifetime is marked by the chronologically earliest version of that object. The end of an object's lifetime is marked by a chronologically latest version which contains an object end date. The current version of an object whose lifetime is not yet ended is marked by a chronologically latest version which contains {null} in its object end date.

Notice that these versions have no version end date. This is because Patterns 3 and 4 assume that every non-initial version follows its predecessor immediately, i.e., that when a new version is added, the previous version ends its business effectivity exactly one clock tick prior to the start of business effectivity of the newly added version. In other words, with Patterns 3 and 4, inserting a new version into a table determines both the begin date for that version, and also the (implicit) end date for what had been the current version until that insert took place.

A second assumption that characterizes Patterns 3 and 4 is that after an object has been deleted, it cannot reappear in the database. More technically, and using our Policy table example, given a version with a non-null object end date, no version with a later version begin date can have the same policy number. Equivalently, any version with a later begin date must be a version of a different policy. In Part 4, we introduced the concept of an episode, and characterized this second assumption as the assumption that no object can have more than one episode.

These two assumptions are semantic constraints on Patterns 3 and 4. If your business requirements for versioned history rule out either or both of these assumptions, then neither of these patterns will meet your needs.

Patterns 3 and 4

A version pattern is most precisely defined by the combination of a) its business requirement (noted as {BR-x}, where x is any number), which the pattern must satisfy, and b) its semantic constraints (noted as {SC x-y}, where x is the number for the pattern and y is the number for the constraint). In these terms, Pattern 4, which is the pattern we will examine in this article, satisfies the same business requirement and operates under the same semantic constraints as Pattern 3, with one exception. That exception is semantic constraint {SC 3-5}:

{SC 3-5}. Ver-beg-dt cannot be in the past nor in the future because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, nor in the future. [Now] is all we have.

With Pattern 4, this constraint is changed. Semantic constraints 1 - 4 and 6 - 7 are identical to their corresponding Pattern 3 constraints. But Pattern 4 has no constraint corresponding to {SC 3-5} because it uses distinct dates to indicate when the version becomes effective (ver-beg-dt) and when the version is inserted into its table (crt-dt).

Bi-Temporality

With Pattern 4, we introduce a second set of dates. The first set is business effectivity dates. For Patterns 3 and 4, they are:

Version begin date (ver-beg-dt): the clock tick on which that version of the object becomes the currently in-effect version.
Object end date (obj-end-dt): the last clock tick on which the last version of the object is in effect.

The second set, included in Pattern 4 but not Pattern 3, also consists of a pair of dates. They are:

Create date (crt-dt): the clock tick on which that row is physically inserted into its table.
Last update date (last-updt-dt): the clock tick on which that row is physically updated.

Normally, versions aren't updated. And indeed, versions are never logically updated. The information they contain is never lost. But as noted in Part 5, there is one case in which a version may be physically updated. That happens when a transaction is received which directs the system to logically delete the object. This logical delete of the object is implemented by physically updating the current version, changing its object end date from {null} to the date of the logical delete. When this physical update takes place, the last update date records the clock tick on which it does. In all other cases, last update date is null.

These two sets of dates are orthogonal to one another. Changes to one set do not necessarily entail changes to the other set. Computer scientists have called the use of these two sets of dates the "bi-temporal" data pattern. The difference between Patterns 3 and 4, then, is solely that Pattern 3 is not bi-temporal while Pattern 4 is.
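
In SQL terms, a minimal sketch of a Pattern 4 Policy Version table follows. The four date columns are the two business effectivity dates and the two physical activity dates just described; as in the earlier sketches, the data types, the client column, and the omission of other business columns are illustrative assumptions only.

CREATE TABLE Policy_Version (
    policy_nbr   CHAR(4)     NOT NULL,
    ver_beg_dt   DATE        NOT NULL,  -- clock tick on which this version becomes the in-effect version
    obj_end_dt   DATE,                  -- set only when the object is logically deleted
    crt_dt       DATE        NOT NULL,  -- clock tick on which this row was physically inserted
    last_updt_dt DATE,                  -- clock tick of a physical update; null except when obj_end_dt is set
    client       VARCHAR(40),
    policy_type  CHAR(3),
    PRIMARY KEY (policy_nbr, ver_beg_dt)
);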

To work out the implications of this difference between the two patterns, let's begin by recreating the series of events described in Scenario 1.

Scenario 4.1: A Proactive Insert

With Pattern 4, it is now possible to physically insert a row on a different date than the date it becomes effective. So let's assume that Mary's policy, entered into the database on 1/12/04, does not become effective until 6/1/04.

(Correction: In scenario templates, the fourth cell down on the left-hand side specifies the date or dates on which the version could validly have been inserted. In Part 5, Scenario 3.1, that cell contains "1/12/04 or any later date." It should contain "N/A," because for that Version Pattern there is no choice about when the insert takes place; it must take place on the ver-beg-dt.)
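
As a sketch, under the bi-temporal schema assumed above, the proactive insert of Scenario 4.1 would look like this:

-- The row is physically created on 1/12/04 (crt_dt) but does not become
-- business effective until 6/1/04 (ver_beg_dt).
INSERT INTO Policy_Version
       (policy_nbr, ver_beg_dt, obj_end_dt, crt_dt, last_updt_dt, client, policy_type)
VALUES ('P138', DATE '2004-06-01', NULL, DATE '2004-01-12', NULL, 'Mary Jaspers', 'PPO');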

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version begin date. All versions for the same policy have the same policy number, and are distinguished from one another by their version begin date. In addition, for all versions but the current one (if one is currently in effect), we also know the dates on which they end their effectivity. If there is a terminal version, it is the clock tick in its object end date. Otherwise, it is one clock tick prior to the begin date of the chronologically next version.

The insert of the row shown above is a proactive insert. It takes place some four and a half months prior to the clock tick on which the inserted row becomes the in-effect row for that object. This insert is shown on the timeline diagram below.

With proactive inserts, there is a period of time between when the row was physically inserted and when the version it represents becomes effective. In the timeline diagram for Scenario 4.1, this period of time is indicated by the shaded block at the front of the version.

One situation in which proactive inserts are valuable is when there are multiple versions that become effective on the same clock tick. If our clock ticks once per day, then we have 24 hours to insert all those versions. Usually, that will be enough time to handle all but the largest volume situations. But suppose our clock ticks once per second. With Pattern 3, we cannot insert versions rapidly enough to give more than a small number of them the correct ver-beg-dt. This immediately suggests that we should assign the correct ver-beg-dt to versions, rather than reading that date from the system clock once for each version.

Indeed, the more granular our clock ticks are, the less reasonable it is to design a system in which each version's ver-beg-dt is populated from a read of the system clock. So suppose we redesign our system to assign version begin dates rather than base them on a read of the system clock. Since we also need to record the clock tick on which rows are physically inserted - for all the usual reasons, such as backing out an entire batch run - we need to add an additional pair of dates, which we will call create date and last update date. Doing this gets us from Pattern 3 to Pattern 4.

Retroactive Inserts?

By inserting this initial version, on 1/12/04, we have created policy P138, effective 6/1/2004. But with increased expressive power - here the power to physically insert versions on dates other than their effectivity dates - comes an increased possibility for making a mistake. One such mistake is a retroactive insert. Suppose that instead of inserting Mary's policy on 1/12/04, we instead inserted it on 8/1/04.

Before we explain why this is an error, let's note that there is nothing the database management system (DBMS) can do to prevent it. Put another way, SQL DDL does not let us tell the DBMS that retroactive inserts are invalid. Until DBMSs are upgraded with robust temporal management capabilities, we are stuck with a do-it-yourself approach. In this case, we must write code that prevents an insert whenever the transaction's create date is later than its version begin date. Because this is an integrity rule which applies to all versions in the database, it should not be left to individual applications to enforce. It should be enforced by the database itself, by means of a pre-insert trigger, as sketched below.
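
A minimal sketch of such a trigger follows. It is illustrative only: trigger syntax varies by DBMS (the form below is DB2-like), and the table and column names are the illustrative ones assumed in the earlier sketches. Since this particular rule compares two columns of the same row, a simple CHECK (crt_dt <= ver_beg_dt) constraint would enforce it just as well; the trigger form generalizes to integrity rules that must examine other rows.

-- Reject any insert whose create date is later than its version begin date.
CREATE TRIGGER no_retroactive_insert
    NO CASCADE BEFORE INSERT ON Policy_Version
    REFERENCING NEW AS n
    FOR EACH ROW
    WHEN (n.crt_dt > n.ver_beg_dt)
        SIGNAL SQLSTATE '75001'
            SET MESSAGE_TEXT = 'Retroactive insert: crt_dt is later than ver_beg_dt';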

Why Not?

The reason that retroactive inserts are not valid is easy to see. They change the past. Prior to the retroactive insert, the past looked one way; afterwards, it looks different. Consider policy P138's version that has a 6/1/04 effective date but that we didn't insert until 8/1/04. Prior to the retroactive insert, the past from 6/1/04 up to 8/1/04 looked like there was no policy P138 in effect at any point during that period of time. Afterwards, it looks like there was a policy P138 in effect during that period of time. But both things can't be true of the period 6/1/04 - 8/1/04; either policy P138 was in effect, or it wasn't!

It's certainly true that during any period of time, including the 6/1/04 - 8/1/04 period, either P138 was in effect or it wasn't. But suppose that it was. In that case, for those two months, the database was incorrect. Surely there must be a way to correct mistakes?

Well, with mistakes, we can either correct them or we can ignore them. If we correct them, we can either eradicate all traces of the error or we can preserve the evidence. The choice depends on the cost/benefit ratio of each alternative. Ignoring errors, actually doing nothing about them, is usually not an option when we are dealing with production databases, so we won't consider that option any further.

When we are dealing with transactionally updated databases, accountants know that the way to correct an error is to a) create a transaction which offsets the bad transaction, and then b) add the correct transaction. Although our Policy table is not transactionally updated, it is versioned, so something similar should take place. The error should not be erased or overwritten. It should be "sequestered" in some way, so that it is invisible to "normal" queries but nonetheless retrievable with specially written queries. Then the correct version should be inserted in its place. This is the "preserve the evidence" approach to error correction.

But it is not always the best choice. Consider that correcting mistakes while at the same time not removing all traces of them from queryable history (see Part 1 for an explanation of the term "queryable history") requires complex logic for both updating the database and subsequently querying the database after the correction. The cost of developing this logic is not trivial. Moreover, the potential for mistakes in queries against a database containing such corrections - especially mistakes in end-user-written, non-IT-managed queries - is very real and can be as costly as you care to imagine.

If there are no legal requirements to retain a queryable record of the mistake, for example if the only legal requirement is to be able to pull archive tapes when auditors ask for them, then the benefit of making both the correction and the error immediately queryable is not very great. In a case like this, the best approach is to snapshot the database (so the error can later be preserved in archival storage), and then proceed with the retroactive insert (or the update in place of an existing version, if that is what is required).

However, suppose our queryable database is a database of record for the data in question (policies, in our case). Suppose, further, that there is either a legal or a customer relations downside to telling one story about whether or not policy P138 was in effect throughout 6/1/04 - 8/1/04, when asked prior to 8/1/04, and subsequently telling the opposite story when asked.

A legal downside is easy to imagine. If a report on policies in effect was run originally on 7/31/04, and sent to the appropriate regulatory agency, but a week later rerun and sent again to that agency, the two reports are not going to agree about P138, or about any counts or other statistics it might contribute to. A customer relations downside is equally easy to imagine. If our customer service representative (CSR) tells a customer one story about her policy on one call and a contradictory story on a later call, the customer is not likely to be happy and won't feel increased confidence in the reliability of our company.

Downsides that cannot be contained within the enterprise, such as regulatory, customer relationship and even vendor relationship downsides, are usually taken far more seriously by executive management than are downsides that affect only internal operations. With externally visible downsides, the cost/benefit ratio quickly swings to the "preserve the evidence" approach to error correction. But neither Version Pattern 3 nor 4 supports correction with error retention. So if your business requirements for versioned history do not require error retention in queryable history, and do permit you to manage nonrecurring, continuously existing objects, then you may still be able to use one or both of these patterns.
Otherwise, neither Pattern 3 nor 4 will meet your needs. We will have to wait for a later Version Pattern to get error correction with error retention.

In Part 7, we will finish our discussion of Version Pattern 4.

Time and Time Again: Managing Time in Relational Databases, Part 7: Completing Version Pattern 4
Tom Johnston, Randall Weis
DM Direct, July 2007

As we saw in Part 6, what distinguishes Pattern 4 from Pattern 3 is bi-temporality. With Pattern 4, the business effectivity begin date of a version can be different than the date the version was physically inserted into its table. This was not possible with Pattern 3. Consequently, bi-temporality makes it possible to physically insert a row representing a new policy proactively, prior to the date the policy becomes effective. As long as queries are careful to filter out rows whose effectivity lies in the future (unless those queries are expressly written to retrieve such rows), the ability to proactively insert new objects is often very useful. With this ability, we can "insert and forget," meaning that we can insert a new object as soon as we know when it will become effective, and then not have to do anything later on to "activate" it. The mere passage of time will ensure that the object becomes a "now in effect" object on the indicated effectivity date.

By the same token, bi-temporality also makes it possible to physically insert a row representing a new policy retroactively, after the date the policy becomes effective. We concluded Part 6 by discussing why retroactive inserts are semantically invalid, and why they must therefore be prevented. A retroactive insert of a new object, n days after its business effective begin date, is semantically invalid because it changes the past. Prior to the insert, a query about any point in time within n days prior to the time of insert would show that the object did not then exist. After the insert, the same query would show that the object did exist at that point in time. One query, contradictory results.
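
As a sketch of the kind of filtering just mentioned, and continuing with the illustrative bi-temporal schema assumed in Part 6, a query for the currently in-effect version of each policy would exclude future-effective rows like this:

SELECT pv.*
FROM   Policy_Version pv
WHERE  pv.ver_beg_dt <= CURRENT_DATE                       -- ignore versions not yet in effect
  AND (pv.obj_end_dt IS NULL OR pv.obj_end_dt > CURRENT_DATE)  -- ignore terminated policies
  AND  pv.ver_beg_dt = (SELECT MAX(pv2.ver_beg_dt)         -- latest version already in effect
                        FROM   Policy_Version pv2
                        WHERE  pv2.policy_nbr = pv.policy_nbr
                          AND  pv2.ver_beg_dt <= CURRENT_DATE);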

Let's turn now to the two other scenarios we have been using to illustrate all of these version patterns - updates and deletes.

Scenario 4.2a: A Type 1 Proactive Update

On 1/12/04, policy P138 is created for Mary Jaspers. As we have already described, the creation of a new policy is implemented by inserting an initial version for that policy into the Policy Version table. However, policy P138 does not become immediately effective. It becomes effective on 6/1/04. It is, thus, a proactive insert of a policy.

Since inserting a new policy and updating an existing policy are both implemented by physically inserting a row in the Policy Version table, proactive updating of a policy should follow the same pattern as proactive insertion of a policy. Proactive deletion should be similarly straightforward. But as we shall see, things are a little more complicated than that.

While there is only one type of proactive insertion of a new policy, there are two types of proactive updates. With the first type, Mary requests a change to her policy to become effective after the policy itself becomes effective. With the second type, Mary requests a change to her policy to become effective after the change is entered into the database, but prior to the policy's own effective date.

To illustrate a Type 1 proactive update, let's assume that on 6/18/04, Mary requests a change in policy type from PPO to HMO, to become effective on 9/1/04, and that this change is physically recorded in the Policy Version table on 7/1/04. This is indeed a proactive update, since the create date is two months prior to the effectivity date. After inserting the new version (a physical insert which semantically updates the policy), the Policy Version table looks like this:
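
As a sketch, under the bi-temporal schema assumed in Part 6, the physical insert implementing this Type 1 proactive update would be:

-- A new version is physically inserted on 7/1/04, to become business effective on 9/1/04.
INSERT INTO Policy_Version
       (policy_nbr, ver_beg_dt, obj_end_dt, crt_dt, last_updt_dt, client, policy_type)
VALUES ('P138', DATE '2004-09-01', NULL, DATE '2004-07-01', NULL, 'Mary Jaspers', 'HMO');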

What these two rows tell us is semantically straightforward. Policy P138 goes into effect on 6/1/04, as a PPO policy. Three months to the day later, it remains in effect but becomes an HMO policy.

Scenario 4.2b: A Type 2 Proactive Update

But there is a second kind of proactive update to consider. What will the Policy Version table look like if, instead of the update just illustrated, we enter an update on 3/10/04, which also changes P138 from PPO to HMO, but this time becoming effective on 3/15/04? Clearly this is also a proactive update, since it is entered prior to its effectivity. So shouldn't things be just as straightforward as with the Type 1 update? Let's see. After inserting the new version (a physical insert which semantically updates the policy), the Policy Version table looks like this:

What these two rows tell us may be semantically straightforward, but it is also wrong! In other words, the result is not what we intended. These two rows tell us that policy P138 will go into effect on 3/15/04, as an HMO policy, and that two and a half months later, it will remain in effect but will become a PPO policy.

Type 1 and Type 2 Proactive Updates: What's the Difference?

The difference between Type 1 and Type 2 proactive updates does not lie in the update transactions themselves. We may assume that the schema for the transactions is the same in both cases. Neither does the difference lie in the code processing the two transactions. We may assume that it is the same code. What, then, is the difference? Why do Type 1 proactive updates "make sense," while Type 2 proactive updates do not?

The relevant difference between these two types of proactive update seems to be that Type 1 is an update to a policy that is already in effect, i.e., that has a version in effect at the time the new version is physically inserted. Type 2, however, is an update to a policy that has not yet gone into effect. But how does this difference result in the difference between a semantically correct database state and a semantically incorrect one?

One way to look at it is this. Version Pattern 4 does not permit us to enter versions with a known effective end date. All versions entered are assumed to be valid from their effective begin date "until further notice." So when we enter a version of P138 that becomes effective prior to a version already in the table, we enter it with the understanding that it is effective until further notice, i.e., effective until we take some other action regarding policy P138. The semantics of a Type 2 proactive insert, one whose effective begin date is prior to the effective begin date of an existing version, violates this understanding. It violates the semantics of proactive activity carried out in the absence of effective end dates because, without any further action being taken with respect to policy P138, the version which was entered first will change its status from being the insertion of a new policy to being an update of an existing policy.

From Mary's point of view, she understood the semantics of her two requests to be this:

In the case of the Type 1 update, Mary's second request is to change an existing policy. She requests that her policy, in effect since 6/1/04, be changed to an HMO policy three months later. And this is what she gets.

In the case of the Type 2 update, Mary's second request is not to change an existing policy. Instead, with this second request, Mary changes her mind about a policy which is not yet in effect. Instead of a new PPO policy becoming effective on 6/1/04 and remaining in effect until further notice, she now wants an HMO policy becoming effective on 3/15/04 and remaining in effect until further notice. But this is not what she gets.

Again, the exact same code is used to make both changes. In the first case, Mary got what she wanted. A database update correctly implements her change request. But in the second case, that same code produces a disaster in the making. For two and a half months after the Type 2 update, everything works as it should. Then, without any other database change taking place, on 6/1/04, Mary's policy suddenly becomes a PPO policy.

In the first case, Mary wants a change to something already in effect. In the second case, she wants a replacement of a request that has not yet been implemented. We can easily imagine that the business analyst who gathered the requirements for a versioned history of policies failed to distinguish between changes and replacements. After all, there are only three things you can do to rows in a table. You can insert them, update them or delete them. What the user calls a change and what she calls a replacement (or perhaps calls "changing her mind") are, to the business analyst, both physical changes to something already in the database. In other words, to the business analyst, Mary's changing a policy and Mary's changing her mind amount to a distinction without a difference. As Paul Newman said in Cool Hand Luke, "What we've got here is a failure to communicate."

Proactive Deletion

We need to discuss proactive deletions only briefly. A Type 1 proactive deletion terminates an existing policy. A Type 2 proactive deletion terminates a policy that does not yet exist. But because you can't do anything to something that doesn't exist, we need to say what we mean more clearly than this.

In a non-versioned table, one row represents one object. In a Policy table, each row would represent a policy. But in the Policy Version table, no row represents a policy; each row represents one time slice of one policy. Consequently, actions on the policies themselves do not line up, one for one, with actions on rows in the Policy Version table. With non-versioned tables, whose rows are one for one with objects, semantics are generally one for one with physical activity. For example, physically inserting a row is semantically inserting an object; physically updating a row is semantically updating an object; and so on. But with versioned tables, none of whose rows represent objects, semantics are not similarly isomorphic with physical activity. For example, physically inserting a row may be semantically inserting an object; but it may not be. Instead, it may be semantically updating an object. And the only physical updates to versioned tables, at least with the version patterns considered so far, are done to implement semantic deletions.

So to speak more clearly about deletions, we should say this: a Type 1 proactive deletion terminates an existing policy by supplying a termination date for the version of the policy that was current at the time of the physical Type 1 action. A Type 2 proactive deletion terminates the version of a policy which has the latest effective begin date, but does so prior to that date. Thus, unless some other action takes place, a Type 2 proactive deletion ensures that its corresponding policy will never become effective.

What Comes Next

We will not itemize the semantic constraints which define Version Pattern 4. Enough has already been said that the reader should be able to modify the semantic constraints for Pattern 3 and create a set of constraints that correctly define Pattern 4.

Next time, we will begin our discussion of Version Pattern 5. It is simpler than Pattern 4 in that it is not a bi-temporal pattern. But it is more complex than Pattern 4 in that it drops the non-recurring assumption. Once we have learned to handle objects that appear for a while, vanish for a while, and then show up again, and to do so without bi-temporality, we will then re-introduce bi-temporality. That will take us to Version Pattern 6, which is the penultimate versioning pattern.

Time and Time Again: Managing Time in Relational Databases, Part 8: Version Pattern 5
Tom Johnston, Randall Weis
DM Direct, August 2007

Version Pattern 5 is the first pattern to manage objects with multiple episodes, a concept introduced in Part 4. To avoid taking on too much at once, however, Pattern 5 will not include bi-temporality. That will come with Pattern 6, which will be, consequently, a pattern supporting both multiple-episode objects and bi-temporality.

Up to this point, the version patterns we have examined have incorporated the assumption that once an object is logically deleted, it will never recur. In Part 4, we called this the "single episode" assumption. Technically, this assumption means a) that there can be at most one logical delete version for an object; and b) that no other version for the same object can have an effectivity begin date later than the logical delete date.

Single Episode Objects

In the case of single episode versioning, it is reasonable to ask, "What kind of objects never recur?" For if there aren't any such objects, then the patterns considered thus far have no real-world applicability.

The answer to this question depends on what we mean by "recur." So, at this point, we could embark on a general discussion of the concept of recurrence. That discussion would include an examination of the ontology of types vs. instances, and of enduring objects (persisting through time) vs. events (processes in and through which objects change over time).

The types vs. instances distinction is, in relational databases, the distinction between a table and its rows, or an entity and its instances. In object-oriented theory, it is the distinction between object classes and their objects, and there the ontological question looms large, as the question of whether or not classes are "first-class objects." These questions are the current manifestation of what the medieval Scholastics (especially Ockham and Duns Scotus) knew as the realism/nominalism debate. For realists, types really existed, just as truly as their instances did. Realists would have treated classes as themselves objects. For nominalists, instances were what really existed; a type was just a name given to the common features we observed across many different objects. Nominalists would not have treated classes as objects, able to exert influence on other objects, for example.


What about the second philosophical distinction, that between objects and events? Some proponents and followers of the Kimball methodology believe that everything of importance can be represented in dimensional data structures. From the perspective of the history of Philosophy, we would see that this position reflects a commitment to the ontological priority of events over objects. We would see that the approach to change which we have taken in this series of articles, which involves preserving both the before and after states of objects undergoing change, reflects a commitment to the ontological priority, or at least equivalence, of objects to events. This, we believe, would be an important discussion. The history of Philosophy has much to tell us about problems which IT professionals and computer scientists alike are still struggling with. A discussion like this would illustrate how such business IT-specific issues as we have just mentioned are but the current manifestation of two ways of thinking about the world, whose roots go back at least to the pre-Socratic philosophers. Although object-orientation (in the sense of the ontological priority of "things" and "stuff") has dominated Western metaphysics for two-and-a-half millennia, there has been a strain of process-orientation (in the sense of the ontological priority of events and occasions). This latter strain originates, perhaps, in the reflections of Heraclitus, and his famous dictum "panta rei" - "all things change". It reached perhaps its fullest expression some eighty years ago, in the work Process and Reality, by Alfred North Whitehead. Don't Know and Don't Care The problem with the single-episode assumption is that things exist over time, and so always contain within themselves the possibility of recurrence - of appearing to us, going away, and then returning. So, by their very natures, objects are fraught with multi-episode possibilities. The reason that version patterns incorporating the single-episode assumption may still have some real-world use, is that there are objects we may be interested in whose recurrence is not of interest to us. In other words, there may be objects which, when they do recur, we treat as new objects. Either we don't notice that they are the same objects we encountered before, or we do notice but don't care. A simple example is a policy holder. Suppose that Joe Smith is a policy holder with our company from 2002 to 2004, at the end of which year he lets his policy lapse. A year later, Joe takes out another policy. If our company makes no effort to identify Joe as the same person who held a policy from 2002 to 2004, then when he returns, in 2006, Joe is effectively a new policy holder for us - not just a holder of a new policy, but a new holder of a policy. Twenty years ago, to take another example, it was common for banks to manage "accounts". Some effort was made to identify cases in which one person held several accounts concurrently, but almost no effort was made to identify cases in which one person held several accounts successively. So the answer to the question "What kind of objects are single-episode objects?" is clear. Singleepisode objects are those whose recurrence we either don't know about, or don't care about. But with Version Pattern 5, we begin to deal with multi-episode objects. These, then, are objects whose recurrence we do make an effort to identify and track. Across industries that deal with people as customers, the same change has taken or is currently taking place. 
Insurance companies who managed a relationship with policy holders now want to manage a relationship with persons who are, or who may become, policy holders. Banks which used to manage the account relationship now want to manage the customer relationship. In some


industries, companies are even striving to recognize the appearance and re-appearance of households. Multi-Episode Objects If an object could recur, what would that look like, as far as data is concerned? In the case of Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock ticks, by another version for P138 which is not a logical delete. Normally, persistent objects re-appear after some period of time, i.e. after some possibly large number of clock ticks. So there can be gaps between episodes of the same object, although if there is only one clock tick between a logical delete and the next version of an object, those episodes would not have a gap between them. This is illustrated in Figure 1.
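By way of illustration only (the rows and the column spellings ver_beg_dt and obj_end_dt below are our own rendering, not data from the series), two episodes of policy P138 might sit in a Policy_Version table as sketched here, and a query using a standard SQL window function can flag where each episode starts.

-- Hypothetical rows:
--   P138  ver_beg_dt 2002-03-01  obj_end_dt (null)       episode 1 begins
--   P138  ver_beg_dt 2003-06-15  obj_end_dt 2004-12-31   episode 1 ends (logical delete)
--   P138  ver_beg_dt 2006-01-15  obj_end_dt (null)       episode 2 begins, after a gap
--
-- A version starts a new episode if it is the object's first version,
-- or if the version immediately before it was a logical delete.
SELECT policy_nbr,
       ver_beg_dt,
       CASE
         WHEN LAG(ver_beg_dt) OVER w IS NULL
           OR LAG(obj_end_dt) OVER w IS NOT NULL
         THEN 'starts an episode'
         ELSE 'continues an episode'
       END AS episode_role
FROM   Policy_Version
WINDOW w AS (PARTITION BY policy_nbr ORDER BY ver_beg_dt);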

Figure 1: Episodes of an Object If the number of logical deletes of an object is the same as the number of episodes, then, for as long as that is true, the object does not currently exist on our database. That is because every episode, including the most recent one, has been terminated by a logical delete. The only other case that can arise is that the number of episodes is one greater than the number of logical deletes. That is because, in a series of versions for the same object, what splits them into episodes are the logical delete versions. In this case, the object does currently exist on our database because there is no logical delete version for the most recent (or only) episode. Semantic Constraints for Pattern 5 Pattern 5 is identical to Pattern 3, but with the single-episode constraint dropped. So instead of presenting our usual insert/update/delete scenarios, we will develop Pattern 5 by beginning with the semantic constraints (called "assumptions" when that is the more convenient terminology) for Pattern 3, and seeing what changes need to be made to them to accommodate the multi-episode functionality. We start with the first constraint. Semantic Constraint {SC 3-1}. {SC 3-1} states that "if a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." This constraint says that versions must have no gaps in time between them. But


multiple episodes of an object nearly always will have gaps in time between them. For example, a policy expires because of a late payment. But a week later that policy is re-instated because the company granted a grace period in this case, in order to retain a good customer. So to accommodate multiple episodes, {SC 3-1} must be modified to apply only to versions which do not follow a terminal (logical delete) version. This gives us the corresponding constraint for Pattern 5.

{SC 5-1}. If a non-terminal version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version.

Semantic Constraints {SC 3-2 and 3-3}. {SC 3-2} states that "if a version for an object is not followed by another version for that same object, and is not itself a logical delete version, the end date will be assumed to be unspecified". And {SC 3-3} states that "if a version for an object is a delete version, the object will be considered deleted as of that delete date". Both of these semantic constraints apply to Pattern 5 as well.

Semantic Constraint {SC 3-4}. {SC 3-4} states that "it is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object." This is the single-episode constraint. So it is precisely the constraint that is dropped for Pattern 5.

Semantic Constraints {SC 3-5 and 3-6}. The next two semantic constraints for Pattern 3 explicitly apply to a logical insert, i.e., to the first version of an object. {SC 3-5} states the constraint on the effectivity begin date. It says, of the first version of an object, that "ver-beg-dt cannot be in the past or in the future, because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, or in the future. [Now] is all we have." {SC 3-6} states the constraint on the logical delete date. It says, of the first version of an object, that "obj-end-dt must be null. We can't end something before it begins. We can't even end something when it begins. We can only end something after it begins."

What's different about Pattern 5 is that there can be multiple first versions of an object, one for each episode. So the question is whether these two constraints apply to initial versions of any episode, or just to the initial version of the first episode. The answer is that these two constraints apply to initial versions of any episode of an object. No matter what episode we are dealing with, it is obviously true that (without bi-temporality) we can't do anything in the past, or in the future. And it is equally obvious that (without bi-temporality) we can't end something before it begins, or even when it begins.

Taking Stock of Where We Are

We will complete Version Pattern 5 in Part 9. But here, at a point about a third of the way through


this series of articles, we want to remind you why we are considering several versioned history patterns, and also why the discussion has been so painstakingly detailed.

There are two reasons for considering several version patterns. The first is that each pattern can be used to satisfy any set of business requirements which conform to the semantic constraints stated for the pattern. Earlier patterns are less complex than later ones, and some developers may prefer to use simpler versioned history patterns when their system requirements permit it. The second reason for considering several version patterns is to gradually build up to what is the most powerful versioned history pattern that we know of. As we said in Part 1, the versioned history pattern we have evolved is the most complex data pattern for its size that we know of. It is best, pedagogically, to approach it one step at a time.

Another pedagogical feature of these articles is their discursive style. Often, data patterns or data models are presented as completed objects. Little effort is spent on explaining the thought processes that led up to them. We believe, however, that the thought processes are, if anything, even more important than the completed pattern. If we can teach senior data modelers how to think about time, by explaining how we thought through the issues and why we chose the solutions we did, then the final versioning pattern we present should seem almost inevitable. If, at the end of this series of articles, that pattern does seem inevitable or at least completely plausible to you, then we will have given you not just a tool for managing time in databases, but also the skills to build other such tools.


Time and Time Again: Managing Time in Relational Databases, Part 9: Completing Version Pattern 5 Tom Johnston, Randall Weis DM Direct, August 2007 We noted in Part 8 that Version Pattern 5 is equivalent to Version Pattern 3, but with the single episode constraint {SC 3-4} removed. However, in that same installment, we also pointed out a difference in the first semantic constraint for each pattern. Constraint {SC 3-1} states that "if a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." But for multiepisode objects, clearly, there may be any number of clock ticks between the logical delete version for one episode, and the initial version of the next episode. We accommodated this change by specifying that constraint {SC 5-1} has the same wording as {SC 3-1}, except for replacing "if a version for an object ..." with "if a non-terminal version for an object ..." On examining constraints {SC 3-5} and {SC 3-6}, we found no differences. These constraints apply, without modification, to Pattern 5. This brings us to the last two constraints for Pattern 3. Do they also apply to Pattern 5? Except for the minor rewording of {SC 3-1} and dropping {SC 3-4}, are the semantic constraints for Pattern 5 exactly the same as those for Pattern 3? Chart of Version Patterns Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns were like.



Figure 1: Chart of Version Patterns The Importance of Semantic Constraints Before we proceed, let's remind ourselves of the importance of explicitly listing the semantic constraints for each pattern. There are two reasons for doing so. The first is that, together with the schemas for each pattern - the columns used to implement the pattern - these constraints are the formal and fully explicit definition of each pattern. The scenarios show how each pattern works. The diagrams provide useful visual representations of what is going on. But the constraints and the schemas are the patterns. The second reason for explicitly listing semantic constraints is that we will have fully and correctly implemented a version pattern if and only if a) we include the versioning columns defined for the pattern in all tables we wish to version; b) we enforce each of these pattern-specific constraints; and c) for all version patterns, we implement what I will call "temporal referential integrity." Entity integrity constraints - the uniqueness of primary keys - are still enforced by the DBMS. But,


as we shall see later on, we cannot use the DBMS to enforce referential integrity. When standards committees stop squabbling and produce a temporal data management standard, and then when vendors implement that standard, it is to be hoped that "temporal referential integrity" will be enforced by the DBMS. But because we are still in the "do it yourself" period, as far as the evolution of temporal database management is concerned, that's what we will have to do. We will have to implement temporal referential integrity ourselves, and also the specific constraints that define each version pattern. That means, in most cases, that we will have to write code. In general, that code will be written as before-insert triggers. But this is getting ahead of ourselves. For now, let's wrap up Version Pattern 5 by looking at its last two semantic constraints.

Semantic Constraints {SC 3-7 and 3-8}

Constraints 5 and 6 apply to initial versions of objects. For Pattern 3, these are the versions that implement the insert of a new object. For Pattern 5, these are the versions that implement the insert of a new episode of an object. Inserting a new object is thus a special case of inserting a new episode of an object. It is the special case of inserting an initial episode of an object. Constraint 5 applies to the effectivity begin date, and constraint 6 to the logical delete date. Constraints 7 and 8 apply to the same pair of dates, but implement versioned updates rather than versioned inserts.

{SC 3-7} states the constraint on the effectivity begin date. It says, of any non-initial, non-logical-delete version of an object, that "its ver-beg-dt must be at least one clock tick past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two versions will overlap for at least one clock tick. But that is invalid because there cannot be two versions of the truth at any given point in time." This constraint also applies to Pattern 5, and for exactly the same reason. However, in reviewing Pattern 3, we have discovered that constraint 7 is a corollary of constraint 1. Constraint 1 states that "If a (non-terminal) version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." This is the same thing as saying that "... the begin date of the latter version must be one tick of the clock later than the end date for the former version." If it must be one tick of the clock later, then it follows that it "must be at least one tick of the clock" later. So constraint 7, first of all, should have said "exactly" instead of "at least." But in either form, it is not a distinct constraint, because it follows from constraint 1. If code implemented constraint 1, there would be no need for additional code to implement constraint 7.

{SC 3-8} states the constraint on the logical delete date. It says, of any non-initial, non-logical-delete version of an object, that "obj-end-dt must be null. If it is not, the policy is being deleted; and, by assumption, this is not a delete transaction." Version Pattern 5 has the same semantic constraint. It says that the object end date for a version has a value if and only if its corresponding object is being logically deleted. Note that obj-end-dt is a business date, not a date reflecting activity against the DBMS. (In general, when we characterize a column as a "business" or "logical" or "semantic" column, we mean the same thing. More precisely (since these terms emphasize different things, and thus are not truly synonyms), every column described in these articles is either all three of these things, or none of them.)
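Since that code will typically live in before-insert triggers, here is one heavily hedged sketch of what such a trigger could look like. It is written in PostgreSQL syntax only because some concrete dialect is needed; the series does not prescribe a DBMS, the table and column names are our illustrative stand-ins, and a real implementation would also have to cover updates and the remaining constraints of the pattern.

-- Sketch: reject a new version whose effectivity begin date does not fall
-- strictly after every existing version of the same policy (a simplified
-- reading of {SC 3-7} / {SC 5-1}; it does not check the one-clock-tick rule).
CREATE OR REPLACE FUNCTION policy_version_no_overlap() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1
               FROM   Policy_Version pv
               WHERE  pv.policy_nbr = NEW.policy_nbr
               AND    pv.ver_beg_dt >= NEW.ver_beg_dt) THEN
        RAISE EXCEPTION 'Overlapping or out-of-order version for policy %', NEW.policy_nbr;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER policy_version_before_insert
BEFORE INSERT ON Policy_Version
FOR EACH ROW EXECUTE FUNCTION policy_version_no_overlap();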


Infrastructure is a Hard Sell More than one company has turned down the opportunity to implement versioned history as an enterprise resource, as a data management layer on top of all production databases, and on which a view layer can be built to hide its complexities. The attitude within IT seems to be that they need to get on with the job of building or re-building new OLTP systems, warehouses, data marts or ODSs (Operational Data Stores). If versioned history is required, they will just add a date to the primary key and be done with it. It works like this. Versioned history is usually introduced during a project to build some kind of data warehouse, some kind of database for which there is a requirement to keep some amount of versioned history (as defined in Part 1). The next time a database is being created or modified and versioned history again becomes a requirement, it is implemented for that database. If a data mart is being built or modified, one based on the dimensional model, and there is a requirement to provide not only dimensions as they are currently, but also dimensions as they were at some point in the past, then work is done and, eventually, the result is achieved. This approach works, more or less. As time becomes increasingly important in databases, it will be used more and more frequently. But it is the wrong approach, and results in a set of databases with (a) hampered semantic interoperability; (b) little or no flexibility to accommodate additional semantic temporal requirements; (c) high maintenance and enhancement costs; (d) a higher proportion of serious (but often difficult to detect) semantic errors in the data; and (e) a database whose time management functions must be understood and manipulated by both end user and IT-developer query writers. Let's consider each of these flaws in the ad hoc approach, one at a time. 1. Hampered semantic interoperability. An ad hoc, one-database-at-a-time approach might be represented by a set of databases, each of which uses one of these version patterns. Obviously, federated queries that range across two or more such databases will have to reconcile the different implementations of versioned history. As we have seen, just by considering these first five version patterns, that reconciliation is likely to be difficult and error-prone. 2. Restricted flexibility. Suppose one of these version patterns has been implemented in a database, and that later on, the business comes back and asks for additional versioned history functionality. This is equivalent to asking for a new Version Pattern. If this involves nothing more than rewriting insert triggers, the changes will be relatively easy to implement. If it involves adding non-primary key columns to a versioned table, the changes will still be relatively easy to implement. But the change from Pattern 2 to Pattern 3, and another change that we will consider later on, both involve changes to the primary key of the versioned tables. These changes are still easy to implement, compared to changes in primary keys for non-versioned tables. The reason, as we will see, is that referential integrity for versioned tables is not implemented by storing primary key values in foreign key columns. This means that changes to the primary keys of versioned tables affect those tables only and do not "ripple out" to affect all foreign key-related tables. Nonetheless, the costs of incrementally adding temporal functionality are not negligible. 
Whenever the schemas are changed, for example, all code that touches those schemas must be altered. Although, as described in Part 1, only view-creating and physical update code will directly access versioned tables, this can still be a significant burden when changes must be made. Far better to


implement an "ultimate" versioned history solution, and then "turn on" specific functionality as and when it is requested.

3. High maintenance and enhancement costs. High maintenance costs result from the need to maintain different version-implementing codebases. High enhancement costs are just a corollary of the previous point, restricted flexibility. The main source of high enhancement costs is the cost of adding a date column to the primary key of each table being converted to a version table, together with the (erroneous) belief that foreign keys to the table being converted must also be changed. With primary key expansion, and the belief that foreign keys must be maintained when a table is converted to a versioned table, foreign keys in all directly related objects must also be expanded. Thus, the cost of change is no longer simply the cost of adding a date to a primary key, in a single table. Furthermore, when foreign keys to the original table are themselves part of the primary keys in dependent tables, the need to modify foreign keys ripples out to the tables dependent on those tables. This ripple effect can continue indefinitely, and require changes to any number of tables. Another problem is that complex "trickle" processes must be developed to replicate changes in versions and to replicate terminating events to dependent objects.

4. Risk of errors. This is a special case of the previous point, high maintenance and enhancement costs.

5. Risk of misinterpretation of query results. In addition, there is the specific risk of errors in interpretation. Because of the complexity of versioned history, it is quite easy to write queries that return apparently valid results, but that are actually answers to a slightly different question than the one the query author had in mind. This risk of misinterpreting query results exists for all queries, of course. But we have found that it is an especially high risk when temporal queries are involved.

The Biggest Risk: Internal Machinery Exposed to Query Writers with the Ad Hoc Approach

Temporal management functions are infrastructure. It's just that standards committees can't agree on how to implement them, and so DBMS vendors hold back. This does not mean that temporal management functions should be developed on the fly, one database at a time. It means that they are infrastructure we must currently develop ourselves. Although not necessarily so, every ad hoc implementation of versioned history that we have seen exposes the temporal management infrastructure to both developers and business users of the databases. It also asks modelers and developers to design and build perhaps the most complex semantic patterns that databases are asked to manage. And the ability to do this well, let alone the ability to do it correctly, is found only with the most advanced modelers and developers an IT department has. For those IT managers who think that modeling and database development are commodity skills, the management of temporal data is the strongest disproof possible.

Because temporal data management is infrastructure, it should be encapsulated within a data access layer that hides its complexity from modelers, developers and business users alike, as explained in Part 1. Just as designers and developers don't have to implement balanced-tree access methods, but make use of them every time they design and build a relational database, so too should they be able to use temporal data management without the internal machinery being exposed to them.
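As a small sketch of what that encapsulation could look like (the view, its name, and the business columns shown are our own illustration; the series has not yet presented its view layer), a "current state" view can let most query writers treat a versioned table as if it were an ordinary table:

-- Hypothetical view: expose only the currently effective, not logically
-- deleted version of each policy, hiding the versioning columns.
CREATE VIEW Policy_Current AS
SELECT pv.policy_nbr,
       pv.policy_type                     -- business columns only; illustrative
FROM   Policy_Version pv
WHERE  pv.obj_end_dt IS NULL              -- not logically deleted
AND    pv.ver_beg_dt =
       (SELECT MAX(pv2.ver_beg_dt)        -- the object's latest version
        FROM   Policy_Version pv2
        WHERE  pv2.policy_nbr = pv.policy_nbr);

Queries written against such a view need know nothing about ver_beg_dt or obj_end_dt; the versioning machinery stays behind the view.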
Infrastructure projects are always a hard sell. By definition, they don't satisfy any particular need


of a specific business unit. They are paradigmatically enterprise work, work whose value is realized indirectly by the specific applications and databases that serve specific business users and purposes. The management of time with today's relational DBMS products is a "do it yourself" effort. For the reasons just articulated, it is work whose considerable cost should be borne by the enterprise.

One final note. We have noticed that when infrastructure work is proposed, and objections are raised to it, at some point someone will say that what we are proposing is a "build it and they will come" approach. And, of course, everyone knows that that is a bad idea. But with infrastructure work, the only alternative to "build it and they will come" is "get in, patch it, and get out as quickly and cheaply as possible." We believe that when an enterprise solution to temporal database management is built, they will come. It is not the business that will come. But the business will make demands on developers that require temporal data management, and those developers will come.

Earlier, in the numbered paragraph entitled "High Maintenance and Enhancement Costs," we suggested that foreign keys don't work with versioned tables the same way they work with "normal" unversioned ones. It is important to be very clear about this. A common misunderstanding of versioning is that when a new version of an object is created, all foreign keys that pointed to the original version must be cascade-updated to point to the new one. Indeed, were it true that foreign keys work the same way in versioned and un-versioned tables, the problems would get even more complicated.

So next time, in Part 10, we will stop and examine the use of foreign keys when versioned tables are involved. We will see that DBMS-enforced referential integrity cannot be used. A new way of using foreign keys and enforcing referential integrity is required. And once again, because temporal database management is on hold as far as DBMS vendors are concerned, we will have to take a "do it yourself" approach.


Time and Time Again: Managing Time in Relational Databases, Part 10 - Foreign Keys in Version Tables
Tom Johnston, Randall Weis
DM Direct, September 2007

In trying to explain versioning to other data modelers and DBAs, we have found that many of them are worried about the cascade update implications for versioned tables. In general, the concern is that if a new version of an object is inserted into a version table, then all foreign key references to the version it replaced must be updated to point to the new version. Furthermore, if those updates take place in tables which are themselves versioned tables, then those updates create additional new versions, and all foreign keys that point to them must also be updated. And so a single versioned update can cascade out to affect many other versioned tables, ending only when a cascade chain reaches either a) a non-versioned table or b) a versioned table which does not itself contain any foreign keys.

These concerned modelers and DBAs correctly point out that this cascade effect can make versioned updates very expensive. Furthermore, if there are any circular references, the cascade will become an infinite loop. For example, consider this pair of tables: a Client Version table and our Policy Version table. The Policy Version table contains a foreign key to the Client Version table, and each row in the Client Version table contains a count of the total number of versioned updates across all policies for that client. The infinite loop happens like this:

1. A new version of a policy is inserted. Call the old version P12 and the new version P13.
2. This triggers code that updates the policy version count in the related client, resulting in a new version of that client. Call the old version C25 and the new version C26.
3. At this point, we must cascade update the policy version. P13's foreign key still points to client version C25, and it must be updated to point to C26. This creates a new version of that policy, P14.
4. But now the trigger executes once again, updating the policy version count in the related client. This loops us back to step 2, where we update that count, creating client version C27.
5. But once again, we must cascade update the policy version. P14's client foreign key still points to C26. This update creates a new version of that policy, whose client foreign key points to C27. This new version of the policy is P15.
6. And so on, and so on, to either the last syllable of recorded time, or to a frustrated machine operator who cancels the process.

Also, notice that so far, we have talked only about cascade updates. There are also, of course, cascade deletes. As we will see later on, the semantic constraints surrounding cascade deletes in versioned tables are not identical to those for cascade deletes of non-versioned tables.
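To make the looping example easier to follow, here is an illustrative rendering of the two tables as just described, with the count column on the client that closes the loop. The DDL is a hedged sketch: the series has not published table definitions, and the names (Client_Version, Policy_Version, policy_version_cnt, and so on) are ours.

-- Client Version: one row per version of a client, plus a count of versioned
-- updates across that client's policies (the column that closes the loop).
CREATE TABLE Client_Version (
    client_nbr          CHAR(4) NOT NULL,
    ver_beg_dt          DATE    NOT NULL,
    policy_version_cnt  INTEGER NOT NULL,
    PRIMARY KEY (client_nbr, ver_beg_dt)
);

-- Policy Version: the foreign key names a specific client version, which is
-- why every new client version appears to demand a cascade update here.
CREATE TABLE Policy_Version (
    policy_nbr          CHAR(4) NOT NULL,
    ver_beg_dt          DATE    NOT NULL,
    obj_end_dt          DATE,              -- logical delete date, null if none
    client_nbr          CHAR(4) NOT NULL,
    client_ver_beg_dt   DATE    NOT NULL,
    PRIMARY KEY (policy_nbr, ver_beg_dt),
    FOREIGN KEY (client_nbr, client_ver_beg_dt)
        REFERENCES Client_Version (client_nbr, ver_beg_dt)
);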



But for now, we will concentrate on the cascade update problem. Chart of Version Patterns Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are.


Figure 1: Chart of Version Patterns Chart of This Series Before continuing, we insert here a chart listing the installments to date in this series and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series What to Do About the Cascade Update Problem for Versioned Tables There are only three possible responses to the versioned table cascade update problem (or to any problem). They are: a) ignore the problem, b) avoid it or c) fix it. Ignore it. The first response is to ignore the problem. If we take this approach, it means that in many if not most cases, a single versioned update will trigger any number of additional versioned updates. Perhaps the attitude about multiple updates is: "DASD is cheap; CPU cycles are cheap. So just let it happen." But then what about the possibility of an infinite loop? Perhaps the attitude about infinite loops is: "Design your databases so there aren't any loops." In some specific cases, this may be an acceptable response. But if our objective is to develop an enterprise solution (which, indeed, is our objective), then this response just won't do. We need a response that works in all possible cases, not just in special individual cases. We can't just ignore the issue of cascade updates in versioned tables. Avoid it. The second response is to avoid the problem. This means to turn off referential integrity checking in the DBMS (which would stop the loop in the client to policy direction). If we can accept cascade updating, and are only concerned to avoid the infinite loop, an alternative is to turn off the


developer-written code which updates counts (which would stop the loop in the policy to client direction). Avoiding the problem is not as cavalier as it sounds. Standards committees are at loggerheads over how to introduce temporal management functions into relational DBMSs, and as a result, DBMS vendors are stymied. Consequently, there is no adequate support for temporal database management in current relational DBMSs. So we shouldn't be surprised that there is also no DBMS-implemented referential integrity that works for temporal (versioned) tables. Until there is such support, avoiding the problem by turning off referential integrity may be a very reasonable response. But isn't the proper conclusion, given DBMS shortcomings, to implement referential integrity with our own code? Isn't the proper conclusion to fix it, not to avoid it? That depends, of course, on a comparison of the costs and benefits of the two responses. Specifically: What costs do we incur if we fix the problem? We may assume that these are all IT costs, not business costs. The answer is that we incur the development and maintenance costs that are part of any IT software artifact. What benefits do we garner if we fix the problem? We may assume that these are all business benefits, not IT benefits. The answer is that a) we can use DBMS-supplied referential integrity, and thereby gain the assurance that RI errors will not occur; and b) by updating foreign keys whenever referenced rows are replaced by newer versions, the relationships involving versioned tables will remain current. What costs do we incur if we avoid the problem? We may assume that these are all business costs, not IT costs. The answer is that a) since we cannot use DBMS-supplied referential integrity, we must therefore write our own code to guarantee that integrity; and b) since relationships among versioned tables are not updated if we avoid the problem, such relationships gradually become more and more "out of date" as additional versions of a foreign-key referenced object are created. We may assume that there are no benefits to avoiding the problem (other than cost avoidance). Fix it. The third response is to fix the problem. To fix any problem, we need to start with a correct diagnosis. We need to identify the root cause, the thing which, if fixed, will eliminate the problem. We suggest that the root cause of the problem is the instability of foreign-key references when those foreign keys point to versioned tables. With "normal" (non-versioned) tables, cascade updates are rare because key changes are rare. If we were dealing with a Client and a Policy table - instead of with a Client Version and Policy Version table - then nothing we have considered so far would require a cascade update. But because we are dealing with foreign key-related version tables, every new version in the referenced table (Client Version) requires a cascade update in the referencing table (Policy Version). Because we have chosen to create a new version in the referenced table every time a new version is added in the referencing table (by incrementing a count), we loop back and create an infinite cascade update problem. We can fix the infinite loop problem quite easily. In the case of the example we are working with, the infinite loop will be broken if a versioned update in the referencing Policy Version table does not cause a new version to be created in the referenced Client Version table. 
That means that we either don't keep a count of policy versions in the client, or else that updating that count is an


overwrite type of update, not one that requires a new version to be created. Additional discussion of the infinite loop problem, and the fix for it, might provide additional insight into the root cause of the overall cascade update problem. But we prefer to go after that additional insight by turning to the other part of the problem, the cascade update from "parent" to "child," from the referenced table to the referencing table. A cascade update happens when a primary key changes, and there exist one or more foreign keys which reference it. So to fix a cascade update problem (not ignore it and not avoid it), we must either a) eliminate primary keys, b) eliminate foreign keys or c) stabilize the link between them. We can rule out the first two options immediately. They amount to conceding that the relational model cannot handle versions and, more generally, cannot handle time. While it might be interesting to pursue this concession, we have neither the time nor the inclination to do so here. But what does the third fix mean? What does it mean to "stabilize" the link between a primary and a foreign key? Let's start by looking at an example of these keys.
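Alongside the figure, here is one concrete rendering of such keys, using the illustrative DDL sketched at the beginning of this installment; the values are hypothetical and simply echo those used later in this discussion.

-- One client version and one policy version that points to it.
-- The foreign key is the pair (client_nbr, client_ver_beg_dt): it identifies
-- a specific version of client C882, not just client C882.
INSERT INTO Client_Version (client_nbr, ver_beg_dt, policy_version_cnt)
VALUES ('C882', DATE '2004-05-23', 0);

INSERT INTO Policy_Version (policy_nbr, ver_beg_dt, obj_end_dt, client_nbr, client_ver_beg_dt)
VALUES ('P138', DATE '2004-06-01', NULL, 'C882', DATE '2004-05-23');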

Figure 3: The Client Version and Policy Version Tables In both tables, the primary key consists of two columns. The first column is an identifier for the individual object, either a client or a policy. The second column, a date, distinguishes and sequences the versions of each object. No two versions of the same object can have the same begin date. Note: in all examples up to this point, the foreign key in the Policy Version table consists of only a single column - client number. Now that we are considering an issue involving foreign keys to versioned tables, we have to drop that simplification and show the full foreign key. What Does It Mean? We said earlier that to understand how avoiding the problem might work, we need to answer to the question, "What do foreign keys relate when they relate versioned tables?" The same thing is true for the "fix it" option. To understand how to fix the problem, we need to answer the same question. Next time, in Part 11, we will see that the "avoid it" and "fix it" options are based on two different answers to the question, "What do foreign keys relate when they relate versioned tables?" Since different answers are being given, the different responses based on those answers will very likely entail different semantics supported by the answer. It is a cost/benefit analysis of these different semantics which will determine whether we should continue using an "avoid it" or a "fix it" response.


Editor's note: Watch for Part 11 in the October 5, 2007 issue of DM Direct.


Time and Time Again: Managing Time in Relational Databases, Part 11 - Foreign Keys in Version Tables Continued
Tom Johnston, Randall Weis
DM Direct, October 2007

In Part 10, we presented a concern that many modelers and database administrators (DBAs) have expressed about what happens to foreign keys when a new version of an object is created. We said that "in general, the concern is that if a new version of an object is inserted into a version table, then all foreign key references to the version it replaced must be updated to point to the new version. Furthermore, if those updates take place in tables which are themselves versioned tables, then those updates create additional new versions and all foreign keys that point to them must also be updated. And so a single versioned update can cascade out to affect many other versioned tables, ending only when a cascade chain reaches either a) a non-versioned table, or b) a versioned table which does not itself contain any foreign keys." We also noted that there is a potential cascade problem when an object is deleted.

Note two key phrases here: a) "when a version is replaced," and b) "when an object is deleted." Versions are not updated; versions are replaced. It is objects which are updated in the process of one version replacing a previous one. Also, versions are not deleted; objects are deleted by placing a deletion indicator on the current version of the object, and also on any future versions on the database at the time of the deletion.
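As a hedged sketch of that last point (the statement, the host variable :delete_dt, and the use of obj_end_dt as the deletion indicator are our illustration; the articles describe the effect, not the SQL), a logical delete of policy P138 stamps the deletion date on the version current at the time of the delete and on any future-dated versions, and physically removes nothing:

UPDATE Policy_Version
SET    obj_end_dt = :delete_dt
WHERE  policy_nbr = 'P138'
AND    obj_end_dt IS NULL
AND   (ver_beg_dt > :delete_dt                      -- future versions on file
       OR ver_beg_dt = (SELECT MAX(v.ver_beg_dt)    -- the currently effective version
                        FROM   Policy_Version v
                        WHERE  v.policy_nbr = 'P138'
                        AND    v.ver_beg_dt <= :delete_dt));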

Chart of Version Patterns Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are.


Figure 1: Chart of Version Patterns Chart of This Series Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series What Do Foreign Keys Relate? The Issue In non-versioned tables, foreign keys relate objects. Each row in a Client table represents a client. Each row in a Policy table represents a policy. What a foreign key from a Policy table back to a Client table does can be expressed in two different ways. First, we can say that it relates the row representing that policy, which contains the foreign key, to the row representing the client for that policy. Secondly, we can equally well say that it relates that policy to that client. The first way of saying what the foreign key does refers to the data; the second way refers to the things which that data represents. The first way, from the point of view of data management, is physical; the second way is logical or, more precisely, semantic. In non-versioned tables, there is a one-to-one pairing of things we are interested in (clients, policies) with data that represents them (rows in tables). Because the mapping between what is represented and its representation is one-to-one, we don't need to pay much attention to the distinction. We can talk about what foreign keys relate in terms of either objects or rows. Seldom dealing with versioned tables, modelers and DBAs are usually not very concerned with the difference between data and semantics, between tables, rows and columns, and the types, instances and properties that they represent. But when we are dealing with versioned tables, the one-to-one mapping breaks down. It is only when there is exactly one row, representing the initial version of an object, that the mapping appears to be one to one. After that, it is one object to many versions of that object.


However, a more careful reading of the situation would make it clear that, even in this case of a single version of an object, there is not a one-to-one mapping between object and row. For a row which represents a version of an object, even when there is only one version of that object being represented, does not represent the object. It represents a delimited section of the timeline of that object. Each version explicitly says, "I represent this object as it was, beginning on this specific date." Then (in the "ultimate" version pattern we are leading up to), the version also says either "And I no longer represent this object after this specific end date", or else it says "I will continue to represent this object 'until further notice'" (which, as we shall explain later, is not the same thing as saying "I represent this object until the end of time"). Because of this breakdown of a one-to-one mapping between an object in the world and a row in a table, talk about relating objects and talk about relating rows are not, as they are in the case of non-versioned tables, effectively equivalent. Let's consider a more typical situation, say, one client and five versions for that object in the Client Version table, and one policy related to that client, and three versions for that object, in the Policy Version table. Here, it is clear that the one-to-one mapping has broken down, both for clients and for policies. And here we will have to make a choice between the two ways of speaking. Do foreign keys relate policies to clients, or do they relate policy versions to client versions? What Do Foreign Keys Relate? One Answer Those modelers and DBAs who are concerned about the foreign key problem in versioned tables assume that foreign keys relate rows which represent versions. A foreign key from a Policy Version table back to a Client Version table relates one row to another, a policy version to its associated client version. If we want a justification for the assumption that foreign keys in versioned tables relate versions to versions, it might go like this: we understand that the business is interested in clients and policies, and which go with which. But we modelers and DBAs serve the business by managing data. If the business analyst has gotten the requirements right, and if we have expressed them correctly, then foreign keys relating rows to rows will accurately reflect the real-world relationships between objects and objects. An even shorter justification goes like this: the DBMS doesn't know anything about what's "out there," about what the data it manages represents. So there is no way we can ask the DBMS to relate objects; all it knows how to do is to relate rows to rows, and foreign keys are the method it uses to do that. Versioned tables or non-versioned tables, it's all the same to the DBMS. Foreign keys relate rows to rows. But if we proceed on the assumption that foreign keys in versioned tables relate rows to rows, and thus versions to versions, we can easily find ourselves dealing with the cascade update problem described in Part 10. Specifically, whenever a new version is created for a client, then all the policies whose foreign keys pointed back to the original client version must be updated to point to the new client version. If this update is a versionable update, it will create a new version of the policy. Then, any versions anywhere else across the database that pointed to the policy which just received a new version will have to be updated to point to that new version. 
And so on, and so on. In Part 10, we concluded that we could not ignore this problem. The reason we can't ignore the problem is that we are looking for an enterprise solution, one that will apply to all sets of version tables. It may be that in databases that contain only a few versioned tables, and/or databases whose versioned tables change very infrequently, the performance penalty of cascading updates


might be manageable. But with an enterprise solution, we cannot count on that. With an enterprise solution, some databases may be crippled by the performance demands of frequently updated version tables with lengthy cascade chains. So, on the assumption that we are relating versions to versions, we must somehow either fix the problem or avoid it. How can we do that? As we said in Part 10, we could avoid the problem by turning off RI. But DBAs just don't like to turn off RI. That opens the door to foreign keys that don't point to anything. So, is there a way to fix the problem rather than avoid it? There are two ways to fix the problem. One way is to make foreign key updates non-versionable updates, i.e., updates in place, updates which overwrite the foreign key but which do not produce new versions in the process. The second way is to not update foreign keys among versions. In either case, RI among versions can remain on, but cascade updates are prevented. Let's consider each solution in turn. Solution 1. Make Foreign Key Updates Non-Versionable. Figure 1 below shows the example we introduced in Part 10. Both the Client and Policy tables are versioned tables. However, the inclusion of policy-type as a column of the Client Version table, back in Part 10, was a mistake. But it turns out to be one we will need as this analysis continues. So to distinguish it from the policy-type column on the Policy Version table, let's assume that policy-type on client versions means something like "the type of policy which the client's sales representative most recently tried to sell to the client." So let's rename it "upsell-policy-type."

Figure 1: The Client Version and Policy Version Tables Let's assume that a new version of client C882 has just been created. This is shown in Figure 2 below. We must now update all foreign keys that pointed to [C882, 5/23/04], and make them point to [C882, 10/01/04]. And if we make this a non-versionable change, then there is no cascade problem. Instead, the result is shown in Figure 3.


Figure 2: Client C882 is Versioned There is no cascade problem because, as Figure 3 shows, the foreign key change in the Policy Version table was made as a physical update to C882's current policy version. No new version of that policy was created.
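In SQL terms, the fix just described amounts to a plain overwrite of the foreign key columns in the existing policy version, along the lines of this sketch (the articles show the resulting rows, not the statement, and the names follow our earlier illustrative DDL):

-- Repoint P138's current version from client version [C882, 5/23/04] to the
-- new client version [C882, 10/01/04] in place: no new policy version is
-- created, so nothing cascades.
UPDATE Policy_Version
SET    client_ver_beg_dt = DATE '2004-10-01'
WHERE  policy_nbr        = 'P138'
AND    client_nbr        = 'C882'
AND    client_ver_beg_dt = DATE '2004-05-23';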

Figure 3: Client C882's Policies are Updated This is one way that we can keep foreign keys pointing to the current versions of the objects they reference, and at the same time avoid the cascade update problem. But notice the drawback to this solution: it loses information. The specific information that is lost is that, from 6/01/04 to 10/01/04, policy version [P138, 6/01/04] pointed to client version [C882, 5/23/04]. During that period of time, a join from that policy version to its related client version would have shown an upsell policy type of PPO. This is the information that is lost. Why is it lost? Because we physically updated the foreign key in the policy version. We did that in order to avoid the cascade update problem. But why does the foreign key need to be updated?


The answer we get to this question, from the modelers and DBAs we have spoken to, is that joins should always join to the most current version of the object referenced. After all, with non-versioned tables, foreign keys point to the most current version – which also happens to be the only version.

In Part 12, we will continue this line of questioning, which will lead us to consider the second solution to the cascade update problem, which is: don't update foreign keys that point to versions. We will then conclude our discussion of foreign keys and versioned tables by considering the other answer to the question, "What do foreign keys relate?", that answer being "Objects, not versions."


Time and Time Again: Managing Time in Relational Databases, Part 12: Foreign Keys in Version Tables
Tom Johnston, Randall Weis
DM Direct, October 2007

In this article, and in Part 10 and Part 11, we have been discussing the issue of foreign keys and referential integrity in versioned tables. The issue arises because most modelers and database administrators (DBAs) seem to believe that if a change is made in a versioned table, and there are foreign keys pointing to that table, those foreign keys must also be changed. If they themselves are in a versioned table, and foreign keys in yet other versioned tables point to the rows they are in, versioned updates will ripple out to those other tables, creating a cascade update problem.


In Part 10, we decided that our options are the options we always have for any problem. We can ignore the problem, avoid the problem or fix the problem. We also determined that we cannot ignore the problem because we are looking for an enterprise solution, one applicable to all databases. For databases containing small and/or infrequently updated versioned tables, we might be able to ignore the problem. But an enterprise solution must apply to all databases, regardless of size or frequency of update.

In Part 11, we suggested that the most perspicuous line of investigation was to begin by asking "What do foreign keys to versioned tables relate?" One answer is that they relate rows to rows. The "related-to" rows are the rows in versioned tables. The "related-from" rows may be either versions or non-versions. On the assumption that foreign keys relate rows to rows, whether or not any of those rows are in versioned tables, we developed one solution to the versioned foreign key problem.

Here in Part 12, we develop a second solution to the versioned foreign key problem, also based on the assumption that those foreign keys still relate rows to rows. We will then develop a solution based on the assumption that foreign keys that point into versioned tables do not relate rows to rows, i.e., versions to versions, but rather relate objects to objects. We will conclude our discussion of referential integrity involving versioned tables by distinguishing between the machinery of versioning and its semantics. The machinery is complex, and should be as invisible as possible to developers and users. But the semantics are the semantics of temporal databases, and once such databases are supported, those who write queries must understand those additional semantics in order to write unambiguous queries against those databases.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are. They also give a hint of version patterns not yet discussed.

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series

Solution 2. Don't Update Foreign Keys to Versions

As explained in Part 10, this is the second solution to the cascade problem, given the assumption that foreign keys to versioned tables relate versions to versions, if the foreign-key containing table is also a versioned table, or else relate objects to versions, if the foreign-key containing table is not a versioned table. The first solution is to update foreign keys in versioned tables, but to do so as updates in place, updates which overwrite the foreign key but which do not produce new versions in the process. The second way, discussed now, is to not update foreign keys that point to versions.

Figure 3 shows the example we introduced in Parts 10 and 11. Both the Client and Policy tables are versioned tables.

Figure 3: The Client Version and Policy Version Tables


Let's assume that a new version of client C882 has just been created. This is shown in Figure 4. With this “don’t update” solution, no further physical activity against the database is required. In particular, we do not update the foreign key that points to [C882, 5/23/04].

Figure 4: Client C882 is Versioned

Now let's insert a new policy and see what happens. The result of this insert is shown in Figure 5.

Figure 5: Client C882 Gets Another Policy

At first sight, Figure 5 is puzzling. Policy P138 points to the first version of client C882, but policy P704 points to the second version. Why is there this difference? The difference is in the appearance of things only. It is, in fact, the result of applying the following rule.

Figure 6: Rule for Managing Foreign Keys in Versioned Tables

The version of the client that was current when policy P138 was created is the first version. The version current when policy P704 was created is the second version. So with this rule, every foreign key pointing to a versioned table points to the version of the related object that was current when the foreign-key containing row was created. This rule applies both when the foreign key is a column in a versioned table, and also when it is a column in a non-versioned table.

Joins using such versioned foreign keys show what things look like, across joins, at the moment in time that the row from which we are joining was created. Therefore, using this rule, we can always follow a chain of foreign keys out from a particular row of interest, and bring together versions of the related objects as they were when that original row was created. We can, in other words, bring back data which is equivalent to a snapshot of that data taken at the moment the original row was created.

But sometimes that isn't what we want. Sometimes, we want to join to the most current version of the related object. For example, consider a query which takes policy version [P138, 6/1/04], and joins it to the client who owns that policy. If we use the foreign key in that version, it will join to client version [C882, 5/23/04]. If this query was executed anytime between 5/23/04 and 9/30/04, it returned an upsell policy type of PPO. And during that period of time, this is the correct upsell policy type. But suppose this same query was executed on or after 10/1/04. In that case, the query returned an incorrect result, because from 10/1/04 forward, the current upsell policy type was HMO.

What are we to make of this? First of all, it is clear that this way of managing foreign keys in versioned tables will always produce correct results provided that, implicit in every join, is the assumption that joined-to versions will be the version current when the joined-from row was created. But if this assumption is not valid, and in particular when what is desired is joins that return the then current versions of all joined-to objects, then using the foreign keys in versioned tables to make those joins is not guaranteed to return correct results.

How, then, can we write these other queries, the ones that want the current version of all joined-to versioned objects? The way we can do this is to write a query which specifies the joined-to object identifier – client number in this case – together with a clause that specifies the greatest version begin date not greater than the current date, and containing a {null} object end date. The "not greater than the current date" qualifier is needed to filter out future versions, versions that are not yet in effect at the time the query is executed. The "containing a {null} object end date" qualifier is needed to filter out an otherwise qualifying version that is for an object that has been deleted.

Something else that we must be aware of is that even though foreign keys to versioned objects point to rows current at the time the containing row is created, this does not guarantee that using these foreign keys will always return versions current when the joined-from row was first created. Here's how this apparent anomaly can happen. Suppose that we also have an Employer table, which is also versioned. This table contains the employers of our clients. Now suppose we have a query which selects a policy, and then joins first to the client version current when it was created, and from there to the employer version current when that client version was created.
This query may return employer data that was not current when that policy version was created. Figure 7 illustrates this situation.


Figure 7: “Current When” Joins Do Not Always Retrieve Contemporary Versions

This join will return [E721, 11/15/01], even though the version of employer E721 that was current on 6/1/04 was [E721, 5/15/04]. Supposing that the third column is employer data retrieved in the query, that query will show “abc” as the "current" employer value, whereas “xyz” was the value current when the policy version was created. These situations can arise because “current when” is not a transitive relationship. This is illustrated in Figure 8.

Figure 8: “Current When” is not a Transitive Relationship

Assumption 2. Foreign Keys in Versioned Tables Relate Objects to Objects

The first answer to the question, "What do foreign keys in versioned tables relate?" was that they relate rows to rows. On that assumption, we found that there were two solutions to the cascade update problem. The second answer to that question is, "Foreign keys in versioned tables relate objects, not versions." But we have to understand what this means since, in a physical database sense, it is obvious that foreign keys relate rows to rows. And in the case of our examples, since both the Policy and Client tables are versioned tables, foreign keys relating these two tables must relate rows to rows – in other words, versions to versions.

But in a semantic sense, our ability to relate versions to versions, given current RDBMS technology, is limited. If we don't update foreign keys when the versions they point to are superseded, then the relationships implemented by those foreign keys slowly “age.” On the other hand, if we do update (overwrite) foreign keys when the versions they point to are superseded, then the relationships established when those rows were first created are no longer stored in the database. Our only other option is to create a new version each time a foreign key changes. But that option creates the cascade update problem we are trying to avoid.

On assumption 2, we may as well store only the object identifier component of a foreign key to a versioned table. In our example, we may as well store only client number, and not bother with version begin date. If we want to join a policy to the client version current when the policy was created, the WHERE clause will ask for the version of that client whose begin date is the greatest date less than or equal to the begin date of the policy version, and whose object end date is {null}. On the other hand, if we want to join a policy version to the most current version of the related client, the WHERE clause will ask for the version of that client whose begin date is the greatest date less than or equal to [now], and whose object end date is {null}.

One would hope, of course, that a versioned client would not be version-deleted (have a non-null object end date) if it still had at least one non-deleted versioned object related to it. If we were dealing with non-versioned objects, the DBMS would guarantee this by either carrying out a cascade delete, or by denying the delete until all dependent objects were first deleted. But with versioned objects, the DBMS cannot guarantee this because such temporal semantics are not part of any current RDBMS. So since we must guarantee it ourselves, there is always a chance of error. To guard against that error, the WHERE clause we are discussing must also specify that the client version whose begin date is the greatest date less than or equal to [now] also has an object end date of {null}.

Hiding the Complex Semantics of Time

We have already mentioned that managing time in relational databases is perhaps the most difficult challenge we have ever encountered as analysts, modelers and DBAs. This discussion of the semantics of versioned foreign keys certainly emphasizes this point. It is clear that the mechanisms which manage time in relational databases must be hidden from developers and users alike. One reason is the complexity of those mechanisms. Another reason is that the management of time is not an applications- or subject area-specific requirement any more than the management of dynamic balanced trees is an application- or subject area-specific requirement.

The optimal solution will be one provided by vendors and specified by standards committees. The mechanics may differ from one vendor to another, as long as the syntax and semantics of both the SQL DML and SQL DDL are supported. But a SQL with temporal data management capabilities must first be proposed and then accepted by the standards committees. The proposal was made, almost a decade ago, but the acceptance seems no closer today than it was back then. So in the meantime, we must provide our own encapsulation of this complexity, our own enterprise-level solution to temporal database management. That is what we are doing in this series of articles.
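
Before turning to the semantics that cannot be hidden, here is a rough SQL sketch of the two joins described above under assumption 2. It uses the illustrative table and column names assumed earlier in these examples (Policy_Version, Client_Version, ver_beg_dt, obj_end_dt, upsell_policy_type); it is a sketch of the technique, not the authors' own code.

   -- Join a policy version to the client version current when the policy
   -- version began: the greatest client begin date less than or equal to the
   -- policy version's begin date, for a client that has not been deleted.
   SELECT p.policy_nbr, p.ver_beg_dt, c.upsell_policy_type
     FROM Policy_Version p
     JOIN Client_Version c
       ON c.client_nbr = p.client_nbr
    WHERE p.policy_nbr = 'P138'
      AND c.obj_end_dt IS NULL
      AND c.ver_beg_dt =
          (SELECT MAX(c2.ver_beg_dt)
             FROM Client_Version c2
            WHERE c2.client_nbr = c.client_nbr
              AND c2.ver_beg_dt <= p.ver_beg_dt);

   -- Join the same policy version to the most current client version:
   -- the greatest client begin date less than or equal to [now].
   SELECT p.policy_nbr, p.ver_beg_dt, c.upsell_policy_type
     FROM Policy_Version p
     JOIN Client_Version c
       ON c.client_nbr = p.client_nbr
    WHERE p.policy_nbr = 'P138'
      AND c.obj_end_dt IS NULL
      AND c.ver_beg_dt =
          (SELECT MAX(c2.ver_beg_dt)
             FROM Client_Version c2
            WHERE c2.client_nbr = c.client_nbr
              AND c2.ver_beg_dt <= CURRENT_DATE);

Note that neither query relies on a stored version begin date in the foreign key; this is exactly the point of assumption 2, under which only the object identifier (client number) is needed to make the join.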


Being Aware of the Complex Semantics of Time

There is one thing about temporal database management that cannot be hidden, however. It is the fact that with such databases, new temporal semantics are available. This means that anyone who writes queries must understand these semantics, and code his queries accordingly. For example, a query asking for a join of a policy version to a client version raises the question, “Which client version?” A query asking for a join of a policy to a client raises no such questions. If the query writer does not understand the possibilities, and therefore does not code for them, then either the DBMS will tell him that his query is ambiguous, or it will execute the query using whatever default behavior has been defined. In the former case, the query writer needs to understand what the error message means, and then write a new query that makes his temporal assumptions explicit. In the latter case, the query writer had better understand what the issues are, and what the default behavior is, or else he may ask one question with his query, but receive an answer to a different question in the result set returned by the DBMS.

The basic point is this. The mechanics of temporal database management can and should be as invisible to developers and other query authors as possible. But it is still necessary for any query author to understand temporal semantics, because he cannot unambiguously say what he means unless he does understand those semantics. No matter how elegant and expressive temporal SQL will prove to be, it cannot relieve query authors from the obligation to understand the options involved, and to be careful to code queries which say exactly what they mean.

The Versioned Table Foreign Key Position, Going Forward

In Part 13, we will take up Version Pattern 6, the bi-temporal versioning of recurring objects. When we do, we will handle foreign keys to versioned tables by capturing the full foreign key of the version current as of the time the row in question is inserted, both object identifier and version begin date. This will be done whether the row being inserted is being inserted into a versioned or a non-versioned table. The issue is the table pointed to, and what to do when that table is versioned. Thus, in terms of our example, we will continue to record both client number (the object identifier) and version begin date as a foreign key in the Policy table.
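
As a rough illustration of that going-forward rule, an insert into the Policy table could capture the full foreign key of the client version current at insert time along the following lines. The names are the same illustrative ones assumed earlier (Client_Version, Policy_Version, obj_end_dt, and so on), not the authors' actual implementation, which is presented in later installments.

   -- Insert a new policy row for client C882, capturing both the object
   -- identifier and the begin date of the client version current right now.
   INSERT INTO Policy_Version
          (policy_nbr, ver_beg_dt, client_nbr, client_ver_beg_dt)
   SELECT 'P704', CURRENT_DATE, c.client_nbr, c.ver_beg_dt
     FROM Client_Version c
    WHERE c.client_nbr = 'C882'
      AND c.obj_end_dt IS NULL
      AND c.ver_beg_dt =
          (SELECT MAX(c2.ver_beg_dt)
             FROM Client_Version c2
            WHERE c2.client_nbr = c.client_nbr
              AND c2.ver_beg_dt <= CURRENT_DATE);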


Time and Time Again: Managing Time in Relational Databases, Part 13 - A Roadmap of the Rest of This Series

Tom Johnston, Randall Weis
DM Direct, November 2007

We are nearly ready to discuss the penultimate pattern, Version Pattern 6. This pattern expresses the combined semantics of all the version patterns we have discussed so far. The last versioning requirement we will examine adds the requirement to version error corrections, and it is Version Pattern 7 that will accommodate that requirement, as well as all the patterns previous to it. Version Pattern 7 is therefore the “ultimate” versioning pattern to be discussed in these articles. It is clear that if Pattern 6 can express the semantics of patterns 4 and 5, it can express the semantics of all prior patterns. And indeed it can.


Like Pattern 4, it distinguishes business effectivity dates from database activity dates, making it possible, for example, to create new versions in advance of the date on which they will become effective. Like Pattern 5, it manages versions which may have temporal gaps between them, making it possible, for example, to record the reappearance of an object after its absence for some period of time.

But before we discuss Pattern 6, there are two preliminary matters to address. The first is to provide a roadmap to the rest of the articles in this series. The second is to introduce the use of a special set of surrogate keys for versioned tables, and explain why they are needed and how they will be used. The rest of this article lays out that roadmap. Then, Parts 14 and 15 make the case for using surrogate keys, and describe the special considerations needed to make surrogate keys work with versioned tables.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are. They also give a hint of version patterns not yet discussed.


Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series

Roadmap for the Remainder of the Series

The high-level view of the roadmap to the rest of this series is as follows. Items are listed in the sequence in which they will be discussed.

1. Versioning with surrogate keys.
2. The physical implementation of Version Patterns 1 – 5.
3. Version Pattern 6. Semantics and implementation.
4. Managing recurring objects.
5. Version Pattern 7. Semantics and implementation.
6. Star Schema versioning.
7. Snapshots.
8. Sandbox versioning.
9. Hierarchical versioning.
10. Performance considerations.
11. Migrating to an enterprise versioning solution.
12. Notes towards a standards-based and vendor-implemented solution.

Let's briefly consider each of these items.

Versioning Using Surrogate Keys

Surrogate keys are primary keys whose values are system-created and free of any business content. Although Dr. E. F. Codd's Information Principle recommends the use of business data rather than surrogate keys as primary keys, there are both theoretical and practical reasons for using surrogates as primary keys. The distinction we have already made, between objects and their versions, is best represented, when surrogate keys are used, by a pair of such keys. One of them we will call an "object identifier" ("object id," abbreviated as OID), and the other a "version identifier" ("version id," abbreviated as VID). These matters will be the focus of the Roadmap Item #1 discussion.

The Physical Implementation of Version Patterns 1 – 5

In earlier articles, we noted that each of the first five version patterns might be an appropriate solution in a specific versioning situation. Those articles discussed the semantics which each pattern supports, and described each pattern in what we will call logical data model (LDM) terms. With Roadmap Item #2, we will provide the SQL DDL and DML that define the corresponding physical database schemas, and that provide access to them. We will call this the physical data model (PDM) representation of these patterns. This clears the deck by completing our discussion of the first five patterns. It's time, then, to move on to more advanced patterns.

Version Pattern 6: Semantics and Implementation

In semantic terms, this pattern is the logical union of the semantics of all preceding patterns. After demonstrating that this is so, we go on to a PDM discussion, which demonstrates how to actually build tables that conform to this pattern. Also, with this pattern, we switch from business keys to surrogate keys. This will be the focus of the Roadmap Item #3 discussion.

Managing Recurring Objects

A recurring object is anything that persists over time, is recognized by us, goes away, and later comes back. A policy that has lapsed, but then is renewed, is a recurring object. But the most important kind of recurring objects, for businesses, are customers – people or organizations who purchase goods or services from us. The principal reason that recognizing returning customers is so important is human psychology. Returning customers want to be greeted like old friends. They don't want the standard "We're happy to welcome you as a customer of ours" greeting. They want a "Welcome! It's great to have you back!" greeting.


Another reason it is important to recognize returning customers is so we can use the valuable knowledge we have gained about them from their previous history with us, particularly the knowledge of what kinds of products or services they tend to purchase from us. If we're a company really on its toes, we may even have captured the information about why they left us in the first place. All this information is far too valuable to lose just because the IT Department tells upper management that "the match problem is difficult"! From the point of view of temporal database management, however, it doesn't matter whether the returning/recurring object is a person or a policy, a customer or a competitor. The point is that when a new row is inserted into a versioned table, it is inserted as the next version of an object for which we already have prior versions, even though there is a gap in time between when the object was terminated and when this new version is inserted. These gaps mean that we can recognize logical groupings of versions, one group for each continuous set of versions, with gaps distinguishing the groups. These groups are what we called "episodes" back in Part 5. We already know how to record the reappearance of a returning object in a versioned table. But there are two related issues, one of which we will discuss, and one of which we will not. The issue we will not discuss is how to recognize a returning object as an object we have seen (and recorded in our databases) previously. In a non-temporal form, this is identical to the issue of how to recognize that data from two different input streams, very often with different and incomplete identifying data, are indeed for the same object. The issue, in either its temporal or its non-temporal form, is "match logic,” and it differs so much from business to business, and object type to object type, that nothing useful could be said of the general match logic problem in much less than a book-length treatment. The issue we will discuss is what to do when we handle reappearance incorrectly. The more general issue is what to do when our match logic makes the wrong choice. The first wrong choice is to say that we have a recurring object (a match to an object already in our databases) when in fact we do not. The second wrong choice is to say that we do not have a recurring object when in fact we do. The third wrong choice is to say (correctly) that we have a match, but to identify the wrong object to match it with. In the first case, once we identify our mistake, we have to split the incorrect match, separating data identified as being for one object into two sets of data, one for each of the objects we now recognize as distinct. In the second case, once we identify our mistake, we have to merge the data for the two objects into data for a single object, either letting one of the two objects "survive" the merge, or else creating a new third object to merge the first two into. In the third case, we have to do both. This is not a versioning problem, but it is a problem that inevitably arises when we are required to (try to) correctly recognize recurring objects. When that requirement is combined with the requirement to version the objects in question, we get a situation in which error correction becomes a very complex problem. So the relevance of this issue to versioning is that if both recurring objects and error correction are requirements, the problem of correcting match logic mistakes becomes in part a versioning problem. 
This complex problem is the subject matter of Roadmap Item #4. But most of the analysis of this problem is relevant whether or not the target tables being affected are versioned. So because the requirement to recognize recurring objects is being imposed on more and more business application systems, and because the work required to correct the inevitable match mistakes is so complex, we think it is warranted for us to branch off from our core topic of versioning, and consider the related topic of correcting object matching mistakes.

Version Pattern 7: Semantics and Implementation

With Roadmap Item #5, we present Version Pattern 7, our “ultimate” version pattern. Because of its importance, we will provide a more thorough and detailed discussion of it than we have of the prior version patterns. In particular, we will provide over a dozen usage scenarios for this pattern, where by “usage scenario” we mean something like a use case, except that a workflow of separate processing steps is not part of a usage scenario. Our usage scenarios focus on those events in which one or more versioned tables are either updated from source transactions, or queried with SQL. For each usage scenario, we will discuss both the updates required to get data into the database, and the queries required to get that data out. An LDM level discussion will be followed by the physical schemas, transformation code and SQL which constitute the PDM discussion. Specifically, we will cover the following points:

- Sample tables, in their logical form, as already used with the earlier patterns. In addition, sample physical tables, and the DDL used to generate them. All sample tables will contain instance data sufficient to illustrate the scenario being presented.
- Timeline diagrams for each scenario. These are the same diagrams that have been used to explain earlier version patterns.
- The semantics of each update scenario. These semantics will be clearly stated. In addition, the transforms from source transaction to versioned table updates will be documented, either as code or as “logically tight English.” These transforms provide one aspect of the encapsulation of versioning shown in Figure 2 of Part 1.
- The semantics of view creation. This will include SQL view creation statements, and their instance-level results. These views provide the complementary aspect of the encapsulation of versioning shown in Figure 2 of Part 1.
- The semantics of temporal queries against these views. The objective here is to maximize the simplicity of the query, while keeping all the temporal semantics available for the query author.

Star Schema Versioning

Star schemas are the most common data model for fact/dimension databases, often referred to as "data marts.” But versioning doesn't apply to fact tables because fact tables are about events, and events don't change. Events happen, and then they are over with. Versioning tracks change over time, and only objects which persist over time can change over time. They change by participating in events. In the Roadmap #6 articles, we will discuss the semantics – indeed the ontology – of events and persistent objects, in order to understand this important point more clearly.

However, the dimension tables of fact/dimension databases do represent persistent objects. Indeed, they represent the very persistent objects whose properties or relationships are altered by their participation in the events recorded in their fact tables. Because dimensions represent objects that change over time, versioning is applicable to them.


The Roadmap #6 articles will provide both a logical and a physical description of the use of versioning in fact/dimension databases.

Snapshots

In Part 1, we provided a complete taxonomy of ways in which historical data is maintained in business databases. In particular, we distinguished snapshots from versioning. Although this series has focused on versioning, snapshots also have their place in the management of temporal data. Roadmap Item #7 discusses snapshots and their uses.

Sandbox Versioning

Often, application systems need to accumulate updates over a period of time, and across several physical units of work (UOW) that represent a single logical business transaction. Not until all updates are applied, and the results of the complete set of updates validated, are those results made visible to database end users. In other words, the updates are accumulated from several different UOWs, but are then assembled into a single logically atomic transaction, i.e., one in which the updates either all succeed or all fail, and in which the state of the database while the updates are being applied is not visible to end users.

One common way of implementing the requirement to create a single, multi-transaction atomic update out of updates from several distinct UOWs is to copy the database schema and store the results of the updates in this copy. Once all updates have been applied, and all results stored in the copy database, those results are then validated and either approved, or re-applied until the results are all correct. At that point, all the data is moved from the tables in the database copy to the corresponding tables in the live database.

This way of implementing a multiple UOW to single atomic update requirement is often costly. Redundant objects and processes may be needed to validate the data stored in the copy database. In addition, it may require significant amounts of system resources to marshal this data into and out of the copy database.

Sandbox versioning will be discussed when we reach Roadmap Item #8. We will explain how the requirement we have described can be met without the need for a copy database, and also without the need for intra-database copies of the updated tables for these pending updates. Instead, everything is done within the original tables themselves. When the results have been validated and approved, the results become visible to the database user almost instantly, regardless of how many rows across how many tables have participated in this "mass update." Sandbox versioning functionality also provides the basis for the “check-out” and “check-in” capabilities of hierarchical versioning, which is described next.

Hierarchical Versioning

The versioning we have been considering, up to this point, we may call “linear versioning.” Except for error corrections, versions of the same object follow one after the other, with or without intervening gaps, but never overlapping. They thus constitute a linear sequence of versions, each the truth, and the only truth, about the object during the version's effectivity period of time.

But businesses also have a need for hierarchical versioning. As the name suggests, this is versioning in which one or more versions are both a) “child” versions of the same “parent version,” and b) “valid,” in some sense, in concurrent and/or overlapping periods of time. IT professionals deal with this kind of versioning when they use software that allows them to “check-out” and “check-in” copies of a base object. Roadmap Item #9 discusses hierarchical versioning.

Performance Considerations

Other things being equal, a versioned table of an object (i.e., an entity type) will not perform as well as a non-versioned table of the same object. Several factors account for this, including the following ones:

- Versioning combines current and non-current versions of an object in the same physical table, and in general, the non-current versions will account for most of the rows and most of the size of the table.
- As explained in Parts 10 – 12, temporal joins are not directly supported in current RDBMSs. Necessarily, dates are used to carry much of the semantics of temporal distinctions among versions. In general, this means that to retrieve a version current as of a specific date, the versioned table must be searched first for the object, and then for the version possessing the unique pair of effectivity dates that frame the specific date.

Roadmap Item #10 presents some real-world production database metrics for updating and querying versioned tables. They also include some “tips and tricks” for performance-tuning, based on several years' experience we have had in implementing and supporting relational databases which contain versioned tables.

Migrating to an Enterprise Versioning Solution

Starting from scratch is easy. It's like bringing a ship into dry dock for major repairs. But in business IT, we seldom have that luxury. Usually, we are required to carry out major repairs while the ship is still at sea. The articles in this series, up to Roadmap Item #10, will serve you well if you are able to design one or more versioned databases from scratch, before they begin to be populated with production data. Otherwise, you will have to face the challenge of introducing versioning into production databases.

From a theoretical point of view, this is a special case of the “schema change” problem, one in which the schema changes are from non-versioned to versioned schemata. (See Bibliography.) From a practical point of view, the challenge is even more difficult than that; and the reason is that the schema changes cannot be done all at once, against a database that has been taken off-line, schema-changed, instance-updated, and then brought back on-line. That would be the dry dock approach to schema change. Our challenge, in the world of production databases, is to carry out schema changes while still “at sea.” The Roadmap Item #11 articles will present our recommendations for implementing schema changes in already productionalized databases.

Notes Toward a Standards-Based and Vendor-Implemented Solution

We lack a vendor implementation of anything more than elementary functionality for temporal database management.


The reason is that the SQL Standards Committee, and the British committee in particular, rejected the original temporal database proposal made several years ago by Dr. Richard Snodgrass. We are not aware that any progress has been made towards resolution of this issue by the standards committees. For those interested in what the differences are, each side has written a book explaining their position. These are the two books referenced in Part 1, one by Dr. Richard Snodgrass and the other by Chris Date, Hugh Darwen and Nikos Lorentzos. With Roadmap Item #12, we will attempt to find common ground between the two positions, and also describe implementation issues which we believe should be taken into consideration when designing this standard.

Roadmap Wrap-Up

In summary, we anticipate that a sufficiently detailed coverage of these topics will require thirty to thirty-five more articles. At two articles a month, that means that this series will continue for approximately another fifteen to eighteen months. As always, the authors welcome questions and comments from readers. This input can help us identify issues we have not discussed clearly enough, or objections we have not considered at all, and we can then clarify the issue, or respond to the objection. As always, we can be reached via the indicated email addresses, and also at our respective websites.

Note: like many authors in this field, we have begun to use the word "semantics" quite frequently. And, like most of them, we haven't bothered to define the term before using it. Of course, one way to learn a new term is "by osmosis," as it were: to simply see it used in enough sentences that one gets a "feel" for how to use it. This, in fact, is how we all learn nearly all the words we use. Googling is a way to accelerate this kind of learning.

But some words are important enough and specialized enough that we should supplement this bootstrap approach to learning by studying a definition and the discussion that accompanies it. To that end, one of us (Johnston) has recently added just such a definition with discussion to his Web site, MindfulData.com.

Reference:

1. E. Rahm and P. A. Bernstein. (December 2006). An Online Bibliography on Schema Evolution. SIGMOD Record 35(4), 30–31.


Time and Time Again: Managing Time in Relational Databases, Part 14 - Versioning with Surrogate Keys

Randall Weis, Tom Johnston
DM Direct, November 2007

Beginning with Version Pattern 6, we will use surrogate version keys in all versioned tables. But note that the phrase “surrogate version keys” should be read as “surrogate keys for versions,” and not as “keys for surrogate versions.” We are recommending the use of surrogate keys for versioned tables. We are not introducing a concept of “surrogate versions.”

Neither the chart of version patterns nor the chart of this series is included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.

We begin by defining our terms.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as a unique identifier for that table, and as the column or columns whose values will be duplicated in foreign keys in order to create relationships.

Comment: The second half of the definition, the part following the "and," is necessary. The reason is that a DBMS is able to recognize and accommodate more than one unique identifier for a table. Any column or columns over which a unique index is defined is as much a unique identifier for a table as is the primary key. Once one of those columns or groups of columns is designated the primary key, the DBMS knows that their values will appear in foreign keys whenever a relationship to their rows is created.

Foreign Key. The column or columns of a table in a relational database that the DBMS uses to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to the primary key value of a row in the referenced table.

Surrogate Key. A primary key whose values are system-generated and are either hidden from business users or exposed with the understanding that no meaning other than their function as unique identifiers should be assumed true of them.


Comment: The second half of the definition, the part following the "and," is necessary. The reason is that system-generated primary key values may be exposed to business users. A typical example is system-generated document numbers for invoices, purchase orders, receipts, etc. The exposure does not mean that those primary keys are no longer surrogate keys. But, if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them. Because this has turned out to be an excellent example with which to illustrate the concepts of meaning and semantics, the next section will be devoted to elaborating on it. This is an aside to our main theme of introducing the use of surrogate keys with versioned tables, but we think it is an important one.

Business Key. A unique identifier for a table, all of whose columns express business meaning.

Natural Key. A primary key which is a business key.

Semantics and Meaning, Illustrated With a Surrogate Key Example

Speaking of surrogate keys, we said above that "… if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them." The word "reliably" is essential here, because sometimes additional meaning is imputed to surrogate keys by business users, whether we IT folks like it or not.

For example, perhaps many users of Acme's Policy table know that, following the "P" in its values, the three-digit number is a number that was sequentially assigned by the system. Knowing that, these business users may then use that knowledge to conclude, for example, that of two policies, the one with the higher numeric part in the policy-nbr was added to the database later than the other one. In this way, such business users are imputing meaning, over and above the meaning of the key as a unique identifier, to the values that key takes on. The additional meaning, or semantics, is that the surrogate key also chronologically sequences the events in which the clients or policies were added to Acme's tables. Those key values "tell us" the sequence in which rows of that table were created, and it is in their "telling us that" that their imputed semantics consists.

While IT cannot stop users from reading these semantics into these keys, we can and must explicitly state that such imputed semantics are not reliable. We must tell the users that if they continue to do this, they do so at their own risk. Let's see, now, what kind of trouble such users can get themselves into.

Acme's Policy Management System initially did start out by creating primary key values using a sequence number for the last three characters in the key. Although many business users knew this, what they did not know is that policy numbers, starting about a year ago, were being reused. Whenever an Acme policy was deleted, a six-month waiting period was observed (to allow all references to the deleted policy to be "flushed out" of the on-line database), after which the policy number was used as the unique identifier of a different policy.

Obviously, the reuse of policy numbers makes them useless for telling which of two policies was created before the other one. This illustrates the risks of relying on undocumented semantics. As semantically risky as such reuse of identifiers is, most of us who have been around for awhile have seen this kind of thing happen.
In our case, Acme apparently thought that it would never have to manage more than a thousand policies. A very modest way to start out, but also a little shortsighted. And this shortsightedness is the cause of this semantically risky reuse of policy numbers. Acme's IT department is desperately trying to put off the evil day when they will have to change from four-byte to five-byte policy numbers.

It doesn't take much reflection to realize that the changeover will be very "evil," i.e., very costly and complicated. Existing policy numbers are everywhere – in primary keys, in foreign keys, across several Acme databases, and in the databases of customers and partners. Those numbers are "in flight" on transactions created but yet to be applied to Acme's databases. This can't be cleaned up in one fell swoop, on one night between midnight and 6 a.m. while all the data is unavailable for access. It can't be cleaned up, in other words, in one multi-transaction atomic unit of work. While the changeover from a four-byte to a five-byte policy number is taking place, semantic disintegrity will be exposed to business users. Joins will return only partial result sets. Code and SQL will have to be rewritten to use "link tables" and, after the changeover is complete, rewritten again to not use those link tables.

It reminds us of what Everett Dirksen, the senior senator from Illinois who helped write the Civil Rights Act of 1964, said about how the government spends money: "A billion here, a billion there, and pretty soon you're talking about real money." Well, when it comes to changing the syntax of the primary keys of critical enterprise tables, it may not be a billion here and a billion there. But it can often be "a million here and a million there."

The Client Version and Policy Version Tables – Without Surrogate Keys

Figure 1 shows our Client Version and Policy Version tables, as we have seen them through the first fourteen articles in this series. In these tables, the primary key is a natural key, not a surrogate key, because the same columns that play the role of primary key also play the role of business key.

Client-nbr and Policy-nbr. Although we have identified client-nbr and policy-nbr as surrogate keys, each is really a hybrid. We said that the numeric part was surrogate because, even if it had the semantics of chronological sequence, those semantics were an accident of how the numbers were created and could not be relied on. But we tacitly accepted the semantics of the alphabetic portion, in which "C" designated a client key and "P" designated a policy key. So because one part of those keys has business meaning, let's think of them as natural keys. In other words, let's think of them as not being surrogate keys. (This notion of a primary key column having "parts" brings up the old issue of "intelligent keys." But intelligent keys have no relevance to the issues being discussed here. However, for an extensive discussion of intelligent keys, see Johnston's two articles on "Primary Key Re-Engineering Projects" (Part 1, Part 2).)

Client-ver-beg-dt and Policy-ver-beg-dt. These primary key columns have business meaning, because they designate business effective begin dates in both cases. Therefore, [client-nbr + client-ver-beg-dt] and [policy-nbr + policy-ver-beg-dt] are business keys for versions of clients and policies, respectively.


Figure 1: The Client Version and Policy Version Tables with Natural Keys

The Client Version and Policy Version Tables – With Surrogate Keys

Figure 2 shows our examples of a Client Version and Policy Version table, with surrogates used as primary keys. The first thing to notice is that these tables still contain their original business keys. Business keys are what provide semantic entity integrity. Primary keys provide physical entity integrity. The distinction between semantic and physical entity integrity arises, of course, only when we substitute surrogate for natural keys, and thus separate what the DBMS recognizes as the primary key from what the business recognizes as the combination of data elements that properly play the role of distinguishing each instance of the entity from every other instance. This is a complex topic, and we will not say anything more about it in these articles. However, at the DAMA 2008 Conference, Johnston's presentation will discuss this issue in greater detail.

Client-nbr, client-ver-beg-dt, policy-nbr and policy-ver-beg-dt have the same semantics they have had all along, except that their role as primary keys has been removed. In the primary key roles, we have provided two new columns for each table, an object identifier ("oid") and a version identifier ("vid"). We have also "stacked" a few more columns, again so the picture of these tables could fit comfortably on a page. So although the picture of each table suggests that there are six columns apiece, in reality the Client Version table has nine columns and the Policy Version table has ten columns. Or, if you prefer, the graphic columns are six and six, respectively, while the database columns are nine and ten, respectively. The Client Version and Policy Version tables, with surrogate keys, are shown in Figure 2.


Figure 2: The Client Version and Policy Version Tables with Surrogate Keys
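
Since Figure 2 is presented as an image, a rough DDL sketch of what such tables might look like may help. The column names and data types are assumptions made for illustration (and only a subset of the nine and ten columns just described is shown); this is not the authors' exact schema.

   -- Illustrative sketch only: a surrogate oid/vid pair as the primary key,
   -- with the original business key retained as ordinary columns.
   CREATE TABLE Client_Version (
       client_oid          INTEGER  NOT NULL,  -- object identifier (surrogate)
       client_vid          INTEGER  NOT NULL,  -- version identifier (surrogate)
       client_nbr          CHAR(4)  NOT NULL,  -- business key, retained
       client_ver_beg_dt   DATE     NOT NULL,  -- business key, retained
       obj_end_dt          DATE,
       upsell_policy_type  CHAR(3),
       PRIMARY KEY (client_oid, client_vid),
       UNIQUE (client_nbr, client_ver_beg_dt)  -- semantic entity integrity
   );

   CREATE TABLE Policy_Version (
       policy_oid          INTEGER  NOT NULL,
       policy_vid          INTEGER  NOT NULL,
       policy_nbr          CHAR(4)  NOT NULL,
       policy_ver_beg_dt   DATE     NOT NULL,
       obj_end_dt          DATE,
       client_oid          INTEGER  NOT NULL,  -- relates the policy to its client
       PRIMARY KEY (policy_oid, policy_vid),
       UNIQUE (policy_nbr, policy_ver_beg_dt)
   );

Note that no declarative foreign key is shown from Policy_Version.client_oid to Client_Version: client_oid alone is not unique in a versioned table, and, as noted in Part 12, such temporal semantics are not part of any current RDBMS, so this kind of referential integrity must be enforced by the versioning code rather than by the DBMS.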

For both oids and vids, we have used five-digit numbers. But the important thing is that a) these numbers carry no business meaning whatsoever, and b) they play their roles as unique identifiers. Their semantics as unique identifiers are, at a minimum:

- No two policies may have the same policy-oid.
- All versions of the same policy must have the same policy-oid.
- No two versions of the same policy may have the same policy-vid.

And the same for clients. For reasons described in Johnston's "Primary Key Re-Engineering Project" articles, we would prefer additional semantics for oids and vids, namely:

- No two objects, of whatever type (policies and clients, here), may have the same oid.
- No two versions, of the same or of different objects, may have the same vid.

One final point before we conclude Part 14. Surrogate keys are usually single-column primary keys. But our versioned tables have two-column primary keys. It isn't necessary to use two columns – an oid and a vid – as the primary key of versioned tables. But it is very useful to have a system-generated key for an object, with an additional distinct system-generated key for a version of an object. What isn't necessary is that these two system-generated columns be used as the primary key of a versioned table. Instead, we could have a typical surrogate key, i.e., one consisting of a single column.

Why do we need a system-generated oid and also a system-generated vid? It is not for the sake of semantics. As far as semantics is concerned, we could use a single-column surrogate key for versioned tables, while retaining the client-nbr (or policy-nbr) and associated version begin date. No oid, no vid. Recall that client-nbr and policy-nbr were our somewhat simplified examples of business keys. So it could just as easily have been that policies had a four-column business key while clients had a three-column business key. Quite possibly, no two or more of those columns would have the same data type and length.


Why is that significant? It is significant because, all along, we have had a single, enterprise-wide solution to versioning as our objective. Not one solution for one database and a slightly different solution for another database. While we may be able to retrieve data from versioned tables with SQL alone, it is not possible, as we will see, to update data in versioned tables with SQL alone. For updating versioned tables, therefore, there is a codebase. Our enterprise objective, if we are to take it seriously, requires that one and the same codebase be deployed for all versioning solutions. If an enterprise versioning solution is used, let us say, in six databases across the enterprise, then there may be six physical copies of that codebase. An enterprise solution does not require federated access to a single physical codebase instance. But it does require that all instances of that codebase be line-for-line, byte-for-byte identical. The only alternative is to develop a custom versioning solution for each database that requires versioned tables. While that may be the most common way of implementing versioning in an enterprise, it is not our objective.

The Versioned Table Surrogate Key Position, Going Forward

The concept of surrogate keys is a contentious one. Publications arguing both for and against their use go back at least twenty years. Arguments based on both theoretical and practical concerns can be made for using surrogate keys and also for not using surrogate keys. One of us (Johnston) has been an advocate of using surrogate keys for all kinds of tables, in all kinds of databases. This means for dependent and associative tables, not just for the “kernel” tables which those tables are dependent on, or relate. This means for tables in OLTP and data mart databases, not just in ODS and data warehouse databases. We will refer to this as the “strong surrogate key” position to distinguish it from positions which recommend surrogate keys for some kinds of databases but not for others, or for some kinds of tables but not for others.

In Part 15, we will provide an outline of the major arguments for and against the strong surrogate key position. This will not be an impartial review of arguments on both sides. It will be a distillation of the arguments Johnston has made in half a dozen articles going back to 2000, and also a preview of the presentation he will be making at the DAMA International Conference next March in San Diego.


Time and Time Again: Managing Time in Relational Databases, Part 15 - For and Against Surrogate Keys
Tom Johnston, Randall Weis
DM Direct, December 2007

In Part 14, we introduced the concept of surrogate keys and showed how our example database of clients and policies would look if we used surrogate keys with that database. We also introduced some terminology that will help us examine the arguments for and against their use, terms such as "business key," "natural key" and "mixed key." We then went on to define two sets of semantic constraints that give surrogate keys their meaning. The minimal set consisted of three rules for the Client table and a corresponding three rules for the Policy table. The full set consisted of an additional two rules, both of which applied to both tables. We also designated the approach to surrogate keys that we will use in the remainder of these articles as the "strong approach." The approach we will contrast it to we will designate the "standard approach," for such it is. We begin, then, by describing the standard approach and our reasons for dismissing it.


Neither the chart of version patterns nor the chart of this series is included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as the unique identifier column or columns for that table.

Foreign Key. One or more columns of a table in a relational database used by the DBMS to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to a primary key value in the referenced table.

Surrogate Key. A primary key whose values are system-generated, and which therefore carry no business meaning. Surrogate keys are usually single-column keys.

Mixed Key. A multi-column primary key, some of whose columns contain system-generated values and others of whose columns contain business-meaningful values.

Business Key. A unique identifier for a table, all of whose values carry business meaning.

Natural Key. A primary key that is a business key.


Surrogate Keys: The Standard Approach

When IT practitioners speak of a database that uses surrogate keys, they are usually referring to a database some of whose tables have single-column surrogate primary keys, but others of whose tables have multi-column primary keys whose values may or may not be system assigned. Thus, they might use a single-column surrogate key for a Purchase Order table but create a mixed primary key for a Purchase Order Line Item table, one consisting of a foreign key to the Purchase Order table together with a line number. For an associative table matching up purchase orders to receipts, they would typically follow the general practice for associative tables and create a primary key by concatenating the primary keys of the Purchase Order and Receipt tables. Let's call this the standard approach to using surrogate keys.

What might justify the standard approach as the best way to assign primary keys to the tables in a database? It would have to be something like this. Whether or not surrogate keys are preferable to business keys is something that should be determined for each table. Now consider a database all of whose tables have natural keys. Among those tables, there is a subset whose primary keys do not contain any foreign keys, i.e., which do not use "identifying relationships." Call them "kernel tables." Assign a single-column surrogate key to these kernel tables. That surrogate key will then become part of an identifying relationship in all of the other tables to which each kernel table is related. There are two types of such other tables: dependent tables and associative tables. Dependent tables have primary keys which contain one surrogate foreign key. Associative tables have primary keys that contain two surrogate foreign keys.

Our objection to this standard approach is that it makes the decision to use or not to use surrogate keys at the level of individual tables. The result is a database in which some tables use surrogate keys, other tables use mixed keys and yet other tables may use natural keys. We have seen various rules of thumb offered for deciding whether to use a surrogate, mixed or natural key for a particular table. For example, some database modelers/designers will always use a natural key for reference/code tables. Their argument is that besides the value being referenced, all that such tables contain are things like a short and a long description. Since for most purposes it's just the code that is wanted in a result set, using that code as the natural key of the reference table means that a join to that table is not required in order to get the code value. ICD-9 codes on a medical claim are a good example. If the ICD-9 code itself is the foreign key, the claim with its list of ICD-9 codes can be materialized without needing to join back to the ICD-9 reference table.

We prefer to make the decision to use or not use surrogate keys at the database level. For all the tables in the database, it's all or nothing. Either all of them use surrogate keys, or none of them do. If single-column surrogate keys are being used, then the primary keys of all tables contain exactly one column. This is true of associative and dependent tables, of core tables and code tables. All use surrogate keys. Moreover, on our strong approach to surrogate keys, no surrogate key value will appear in more than one primary key instance anywhere in the database. A primary key is unique, in other words, across all tables in the database, not merely unique within one table.
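To make the contrast concrete, here is a minimal sketch of the two approaches. The purchase_order and po_line_item tables, and all of their column names, are illustrative assumptions of ours and do not appear elsewhere in this series.

-- Standard approach: a surrogate key on the kernel table, a mixed key on the dependent table.
Create table purchase_order (
    po_id   int       not null,   -- single-column surrogate key
    po_nbr  char(10)  not null,   -- business key
    constraint pk_purchase_order primary key (po_id)
) ;

Create table po_line_item (
    po_id      int          not null,   -- surrogate foreign key to purchase_order
    line_nbr   int          not null,   -- business-meaningful line number
    item_desc  varchar(50)  null,
    constraint pk_po_line_item primary key (po_id, line_nbr)   -- mixed key
) ;

-- Strong approach: every table, dependent or not, gets its own single-column surrogate key;
-- the business columns remain, but none of them is part of the primary key.
Create table po_line_item_strong (
    po_line_id  int          not null,   -- single-column surrogate key
    po_id       int          not null,   -- surrogate foreign key, no longer part of the primary key
    line_nbr    int          not null,
    item_desc   varchar(50)  null,
    constraint pk_po_line_item_strong primary key (po_line_id)
) ;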
Argument Against the Standard Approach to Surrogate Keys

Our argument against the standard approach to using surrogate keys is simplicity itself. The standard approach leaves us with a mixture of primary and foreign keys to manage that share no
consistent syntax and across which duplicate values are possible.

The argument for using surrogate keys rather than natural keys is simple: natural keys can change, and that change can be expensive. As for change, the values in natural keys describe something about what the row they are part of represents; that follows from the definition of "natural key." Suppose one natural key is a two-byte column whose values represent geographical sales regions and are: NE, NW, SE and SW. Now suppose the VP of Sales, after extensive analysis and reflection, of course, decides to change sales regions as follows: California is too big to be lumped in with the rest of the SW, so it will become its own sales region, with "CA" as its code. The NE and SE regions are also getting too big, but we don't want to carve out one or more states to fix their problem. Instead, we will split them each into two regions, with designations as follows: NE1, NE2, SE1 and SE2. We invite you to reflect on the cost of such changes. Next, think about the cost of these changes if IT had, instead of using natural keys, used a surrogate key for this Sales Region table.

Argument 1 Against Surrogate Keys: Users Don't Understand Them

Taken in one sense, this argument means that users don't know what surrogate key values stand for. But of course, that's the whole point. The system generates values for surrogate keys, and those values have no business meaning. No wonder users don't understand them. Taken in another sense, this argument means that users don't understand the need for surrogate keys, that they are uncomfortable when they see surrogate keys in their result sets and so on. The need for surrogate keys can be explained to business users in two steps. First of all, primary keys are needed to uniquely designate each row in each table. Secondly, the business may occasionally need to require IT to change the values and/or the data types and lengths of primary keys in order to make the data being used as a unique identifier more accurately describe what it is about. And when those changes are required, the costs of implementing them can be enormous. We saw this above, in the argument against the standard approach to surrogate keys.

Argument 2 Against Surrogate Keys: Users Don't Use Them

This argument is that, because surrogate keys do not contain business data, users will never select rows from tables by supplying surrogate key values. This is certainly true, but it doesn't support the conclusion that users don't use them. On the contrary, surrogate keys, together with the foreign keys that replicate their values, are used to join rows to other rows. If a user wants information about a particular Policy, for example, we may assume that he knows what the business key of that policy is. His query then begins with a Select operation on the Policy table, with policy-nbr as the target, and a specific policy number as the search argument. Since business keys should be unique, this process should select exactly one row (provided the supplied policy number is valid). Next, logically although not temporally speaking, the query proceeds to the Join operation in which one or more other rows, foreign key-related to the original row, are also selected. Finally, again speaking logically and not temporally, the query proceeds to the Project operation in
which some columns are designated to be included in the result set, while the other ones drop out. The entire query, at this point, asks a question of the database. It says, "Do there exist one or more instances of data that satisfy this specification?" If so, the resulting instances are returned, each one as a row in a result set. There is generally no good reason to include surrogate keys or their associated foreign keys in the result set. There is generally no good reason for users to ever see a surrogate key value. For even if the user writes his own SQL and uses surrogate keys and foreign keys in his joins, he does not supply a surrogate key value anywhere in the query, and isn't interested in a surrogate key result. Primary and surrogate key join pairs, like two adjacent links in a chain, are purely DBMS machinery for linking data together.

Argument 3 Against Surrogate Keys: They Take Up Extra Space

This argument points out that if we used a table's business key as its primary key, we wouldn't need an extra surrogate key column. Say that surrogate key is eight bytes long, and it occurs in a million-row Client table. That's eight million unnecessary bytes, approximately 7.5MB extra for the entire table. While direct access storage device (DASD) space is relatively cheap, the argument will continue, response time to a user's query is not. And the eight extra bytes on each row also contribute to poorer response time, because fewer rows will fit into a single I/O block; fewer rows at a time can be transferred between secondary storage and main memory.

In response, we need to distinguish between the impact of a surrogate key on the table it is the primary key for and its impact across the entire database. In its native table, the Client table in this case, adding a surrogate key will indeed add extra space to the storage requirements for the table and will also add some amount of extra time when doing sequential retrievals, i.e., retrievals that require moving I/O block after I/O block from DASD to main storage. But let's consider all the tables in the database and nonsequential retrievals. Across all the tables in a database, the use of surrogate keys may well reduce the total requirement for storage space. For consider our eight-byte, single-column surrogate key, and compare it to a natural key of one or more columns, which may well be longer than eight bytes. Next, suppose that the Client table is an important table in the specific sense that many other tables have foreign keys to the Client table, and for some of them those foreign keys are part of their own primary keys, i.e., those foreign keys implement identifying relationships. Finally, suppose that for some of these other tables with identifying relationships to the Client table, there are further tables that have foreign keys to them. Throughout this chain of direct and also indirect foreign keys to the Client table, and depending, of course, on the number of other rows which instantiate these foreign keys, we may have dozens or even hundreds of occurrences of Client table primary key values. Let us assume that, at some point in time, there are 5,000 such Client foreign keys. And let us assume, further, that the business key for the Client table would have been two columns, totaling twelve bytes. The arithmetic here is simple. With the twelve-byte business key, those foreign keys total 60KB; with the eight-byte surrogate key, they total 40KB.
Now consider a database in which important tables like our Client table contain millions of rows. And consider then the trade-off in space consumed between such a database in which surrogate keys were used for all tables versus a database in which business keys were used for all tables. There is indeed a significant difference in storage requirements between using and not using surrogate keys. Surrogate keys are much more space-efficient.

Now let's briefly consider the argument that retrievals will take longer with surrogate keys because fewer rows will fit into an I/O block. Although there are several things to note, we will take the time to mention only one of them. This argument applies only when sequential retrievals are being considered. When a business key value is in hand, and about to be used to select a client, then after a lookup in the unique index defined on the business key, only a single I/O block will be retrieved. As for joins, once a primary key value is in hand, and about to be used in a join, then after the lookup in the index, only a single I/O block will be retrieved. Any negative impact on retrieval time caused by using surrogate keys is negligible to the point of being undetectable.

Argument 4 Against Surrogate Keys: They Require Extra Joins

This argument points out that whenever a surrogate primary key is used, all relationships to a row with such a key use foreign keys whose values are identical to that row's surrogate key value. So whenever a query has specified a row which has such a foreign key, and whenever that query wants to include all or part of the business key of the related row, a join is required to bring that business key into the result set. This argument is correct. However, it is possible to avoid joins that the use of surrogate keys would otherwise make necessary. The solution is to denormalize and put duplicate columns on the referencing table, columns that duplicate the business key of the row that the surrogate foreign key is pointing to. (A brief sketch of both the surrogate-key join and this denormalized alternative appears at the end of this article.)

The Argument For Surrogate Keys

One of us (Johnston) will present a more complete case for using surrogate keys on all tables in a database, and will do so at the DAMA 2008 Conference in San Diego in March 2008. But in brief, the reasons we will proceed to discuss versioning in terms of sample tables that use surrogate keys are these:

1. We want an enterprise solution to versioning, one applicable to any table and any database without modifying the code that implements it. That requires that all versioned tables have the same syntax for their primary keys, i.e., the same columns, each with the same data types, lengths and domains.

2. We want to avoid the usually very high costs of changing primary keys by adopting keys that there will never be a business reason to change. Such keys are precisely those that carry no business meaning and whose only semantics are to link related rows together.

A Physical Description of Earlier Version Patterns

In Part 16, we begin to revisit our earlier version patterns and provide a physical description of them. This includes physical schemas and SQL, and shows how these version patterns work in a real database. For a longer-term look at where this series is going, consult Part 13.
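Here is the sketch referred to above. It assumes, purely for illustration, a client table keyed on a surrogate client_oid, a policy table carrying a client_oid foreign key and, in the denormalized variant, a copy of the client_nbr business key; none of this appears elsewhere in the series.

-- Argument 2: the user supplies a business key; the surrogate key pair does the joining.
Select c.client_nbr, c.client_name, p.policy_nbr, p.policy_typ
from   client c
inner join policy p on p.client_oid = c.client_oid
where  c.client_nbr = 882;

-- Argument 4: if the policy table also carries a denormalized copy of client_nbr,
-- the same information is available without joining back to the client table.
Select p.policy_nbr, p.policy_typ, p.client_nbr
from   policy p
where  p.client_nbr = 882;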


Time and Time Again: Managing Time in Relational Databases Part 16 - Physical Implementation of Version Pattern 1
Tom Johnston, Randall Weis
DM Direct, December 21, 2007

Starting with this installment in our series, we will begin to describe the physical implementation of these versioning patterns. To this point, patterns have been presented from a logical perspective. Data modelers would recognize these discussions as being about logical data models, rather than physical data models. Almost equivalently, we could say that previous articles in this series have discussed the semantics of versioning rather than its implementation. We now turn to the corresponding physical data models and to their implementation in a real database. Throughout the rest of this series, we will refer to articles that focus on implementation as "PDM (physical data model) segments" of this series, where each segment is a separate article. We will reference each pattern by version number, and we will use the illustrations and the scenarios of the version pattern within the context of each of these PDM segments.

These PDM segments will show the physical Data Definition Language (DDL) used to create the tables in the schema; the Structured Query Language (SQL) used to populate and update those tables; and the SQL used to query the tables. We will also discuss performance and data integrity in the physical database. In addition, we will describe and discuss known issues, database management system (DBMS) limitations and alternatives to the implementations that are depicted here. We encourage you to create, populate and query these objects in your own database. By doing so, you will deepen your appreciation of the temporal functionality provided by the early version patterns, and of the more complete functionality provided by the later patterns. In addition, you may learn to develop some of your own variations on these patterns, variations that might better apply in your own DBMS and application environment.

Preliminaries

We will be using Microsoft SQL Server as the DBMS. However, with some minor adjustments, the SQL and the designs suggested in these PDM segments could be used in other DBMSs, such as IBM's DB2, Oracle or MySQL. Various database objects will be created, including tables, views, indexes, primary keys, foreign keys, triggers and stored procedures. However, we will assume that the reader will create other needed objects, such as databases, containers, file groups and tablespaces, and will have the appropriate permissions to create and use these objects.


In SQL Server the schema owner is often defined as "dbo." We will not prefix the objects with "dbo," but that will be the implicit schema name. However, in many DBMSs the default schema name is the same as the current SQL ID or login ID of the person that is connected to the database. This should suffice for most of these examples. Some DBMSs will implicitly create certain database objects based on the creation of other objects. For example, some DBMSs will automatically create indexes based on the creation of primary keys and foreign keys, but others will not. In many of our examples, we will include the DDL to create objects that might be implicitly created by other DBMSs. Thus, you may encounter some minor syntax or duplicate object errors when declaring these objects to your DBMS. In many cases, you will be able to ignore these errors. We suggest that you confirm that the error does not compromise the model, and that the fundamental purpose of the object, function or statement is fulfilled in some other way.

The SQL we present will often use literals for the variables rather than parameter markers (host variables). This is for illustrative purposes only. Normally, for systems that require high performance, applications should use either statically bound SQL or prepared statements that use parameter markers in dynamic SQL, such as JDBC, rather than dynamic literals. This strategy saves a significant amount of overhead, avoiding constant reoptimization of the SQL's access path. So, while we will use literals in our examples, this is not the recommended approach and is only used to clarify the examples. Note also that some companies have standards against using a "select *" clause. We will use it in our examples simply to save textual space.

The Application

The insurance industry has extensive date/time requirements for their application databases. These include:

The date range an insurance policy is effective,
The date an event occurred,
The date a service was performed,
The date a claim was adjudicated, and
The date a claim was paid.

Often the dates on which a policy or a policy-related event is recorded in a database do not match the effectivity range of the policy or the date on which the related event actually occurred. This distinction is one of the reasons we chose to use an insurance application in our examples. However, many other industries have complex date-driven rules and business processes. In fact, most large companies need to adhere to best practices or to various regulations, such as Sarbanes-Oxley, that require appropriate accounting and tracking of data changes. Therefore, we believe that while your application may not appear to currently need advanced version patterns, it is likely that, in time, it will. So, as you read these physical model segments and try out the examples, you should be able to easily translate these PDM entities, attributes, objects and examples into your own industry's terms.
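To illustrate the earlier point about literals versus parameter markers, here is a minimal sketch of the same query in both forms. The "?" marker follows the JDBC/ODBC convention, and the p1_policy table is the Pattern 1 Policy table defined later in this article.

-- Literal form, as used throughout these examples for clarity:
Select * from p1_policy where policy_nbr = 138;

-- Parameter-marker form, as it might appear in a prepared statement;
-- the value is supplied at execution time, so the optimized access path can be reused:
Select * from p1_policy where policy_nbr = ?;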


Keys

In non-versioned data models, and also in some of the earlier patterns in this series, the primary key of the table is also considered the unique identifier of the object. For example, either a business key or a surrogate key of the Policy object might be called policy_nbr, and it would be the primary key in the physical database. However, as we progress through the more advanced versioning patterns, we will discover that the object identifier alone will not uniquely identify a single occurrence (row) in a table, because multiple versions of an object may (and usually will) exist in a versioned table. This restriction in current relational databases – that the distinction between an object and its versions must be made by the developer and is not built in to either standard or vendor-augmented DDL – forces us to create a new surrogate key that represents the composite of an object id and either a version date/time or a version id, which now becomes the unique primary key. Therefore, as you review these patterns, consider the fact that the object id will not always be considered the primary key, regardless of whether it is a business key or a surrogate. So, to help you distinguish the different columns that would normally represent a single object and the primary key in non-versioned tables, these columns (keys) will be suffixed with "OID," meaning the object id, and surrogate columns that distinguish versions of an object will be suffixed with "VID," meaning version id.

PDM Structure

These PDM segment articles will be structured as follows:

Reference to the version pattern as defined in the earlier articles.
A brief description of the pattern.
A table facsimile with sample rows.
The DDL to create the base table(s).
Supporting DDL, such as constraint, index or view creation.
Special design notes and considerations.
A copy of the scenario figure used in the original version pattern.
The Data Manipulation Language (DML) to update the database for each event.
SQL to query the results at various points in time.
The results of the queries in tabular format.

Figures 1 and 2: Templates that Will be Used (All figures are oversized and linked at the end of the article) (See pdf).

Setup

The initial patterns will concentrate on the Policy table. We will retain pattern versions in separate tables so we can compare results. Therefore, each Policy table will be prefixed with a "P#" to identify each pattern number. We will also, as needed, reference a Client table. The following DDL and SQL can be executed to create the base Client table.

Figure 3: Client Table Setup (see pdf)

Version Pattern 1

Now let's see how Version Pattern #1 will look in a physical database.


Figure 4: Version Pattern 1 (see pdf)

Version 1 Scenario 1.1
Figure 5: Scenario 1.1 (see pdf)

Version 1 Scenario 1.2
Figure 6: Scenario 1.2 (see pdf)

Version 1 Scenario 1.3
Figure 7: Scenario 1.3 (see pdf)

These scenarios show the physical data model for the first version pattern, which has no history or future dating requirements whatsoever, as well as insert, update, delete and retrieve SQL against that physical model. This is the easiest version pattern to update and query, but by the same token is the least expressive of temporal requirements. The date a row was created, and the date it was last updated, provide the only persisted temporal information about the data. Relational databases and SQL standards have not yet evolved to easily support various temporal states. So, this series will continue to show you how to implement feature-rich version patterns into your database that could also become part of your enterprise architecture. These physical design segments will take this concept from theory to actual implementation. It's about time!


Figure 1: Object/Version PDM Pattern Template
Version: Pattern Number and Name
Description: Brief Description of Pattern

Table Facsimile
my_oid (pk)  parent_oid (fk)  attr_n_val  crt_dt     last_updt_dt
138          882              PPO         1/12/2004  1/12/2004

Table Creation DDL
Create table my_table (
    my_oid        int       not null,
    parent_oid    int       null,
    attr_n_val    char(10)  not null,
    crt_dt        datetime  default getdate() not null,
    last_updt_dt  datetime  default getdate() not null) ;

Primary Key
Alter table my_table add constraint pk_my_table primary key (my_oid);

Foreign Keys
Alter table my_table add constraint fk_parent_child
    foreign key (parent_oid) references parent (parent_oid) on delete restrict;

Index DDL
Create unique index pk_my_table on my_table (my_oid);
Create clustered index fk_parent_child on my_table (parent_oid);

Notes: Special design notes and options

Figure 2: Scenario Template
Scenario: Version Pattern Number, Scenario Number and Name
Description: Brief Description of Scenario

Scenario Figure

Insert SQL
Insert into my_table (my_oid, parent_oid, attr_n_val, crt_dt)
VALUES (138, 882, 'PPO', '2004-01-12');

Query SQL
Select * from my_table where my_oid = 138;

Notes: Explanation of how the SQL works and any special notes regarding SQL and criteria. Hints, tips, options.

Query Results
my_oid  parent_oid  attr_n_val  crt_dt     last_updt_dt
138     882         PPO         1/12/2004  1/12/2004

Figure 3: Client Table Setup

Create table client (
    client_nbr    int          not null,
    client_name   varchar(50)  not null,
    crt_dt        datetime     default getdate() not null,
    last_updt_dt  datetime     default getdate() not null) ;

Alter table client add constraint pk_client primary key (client_nbr);

Insert into client (client_nbr, client_name, crt_dt)
values (882, 'Mary Client', '2004-01-19') ;

Figure 4: Version Pattern 1 (Object/Version PDM Pattern)
Version: Version Pattern 1: Updating in Place
Description: No versioning, with updates occurring in place; the end of an object's episode is represented by the deletion of the object.

Table Facsimile
policy_nbr (pk)  client_nbr (fk)  policy_typ  crt_dt      last_updt_dt
138              882              PPO         2004-01-12  2004-01-12

Table Creation DDL
Create table p1_policy (
    policy_nbr    int       not null,
    client_nbr    int       null,
    policy_typ    char(10)  not null,
    crt_dt        datetime  default getdate() not null,
    last_updt_dt  datetime  default getdate() not null) ;

Primary Key
Alter table p1_policy add constraint pk_p1_policy primary key (policy_nbr);

Foreign Keys
Alter table p1_policy add constraint fk_p1_policy_client
    foreign key (client_nbr) references client (client_nbr)
    on delete no action on update no action ;

Index DDL
Create unique nonclustered index ix_pk_p1_policy on p1_policy (policy_nbr);
Create clustered index ix_fk_p1_policy_client on p1_policy (client_nbr);

Notes: We clustered this table by the foreign key because access is normally by a foreign key. The delete/update rule is set to "no action," which is similar to "restrict"; however, cascade could also be used. Remember, however, there is no versioning.

Figure 5: Scenario 1.1 of Version Pattern 1: an Insert to the Policy Table
Description: Shows the result of entering Mary's policy into the Policy table.

Insert SQL
Insert into p1_policy (policy_nbr, client_nbr, policy_typ, crt_dt, last_updt_dt)
VALUES (138, 882, 'PPO', '2004-01-12', '2004-01-12');

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Add and look up Mary's Policy.

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt      last_updt_dt
138         882         PPO         2004-01-12  2004-01-12

Figure 6: Scenario 1.2 of Version Pattern 1: an Update on the Policy Table
Description: On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization). After that change is recorded in the Policy table, Mary's Policy row looks like this:

Update SQL
Update p1_policy
Set policy_typ = 'HMO', last_updt_dt = '2005-03-14'
where policy_nbr = 138;

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Update Mary's Policy in place. We've lost all record that Mary originally had PPO.

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt      last_updt_dt
138         882         HMO         2004-01-12  2005-03-14

Figure 7: Scenario 1.3 of Version Pattern 1: a Physical Delete to the Policy Table
Description: On 6/1/06, Mary's policy is ended (by contract expiration, explicit termination, revocation, etc. - it doesn't matter). This is recorded, using Version Pattern 1, by physically deleting her policy's row in the Policy table.

Delete SQL
Delete from p1_policy where policy_nbr = 138;

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Mary's Policy is deleted. We've lost all record that Mary originally had PPO, HMO, or had any policy whatsoever.

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt  last_updt_dt
[no row]


Time and Time Again: Managing Time in Relational Databases, Part 17 - Physical Implementation of Version Pattern 2
Tom Johnston, Randall Weis
DM Direct, January 4, 2008

In this PDM segment, we will describe the physical implementation of Version Pattern 2. This pattern was described, from a logical point of view, in Part 3 of this series. This pattern expands on Pattern 1 by adding a logical delete capability. This pattern is more semantically expressive than Version Pattern 1, just as each pattern described in this series is more expressive than the patterns prior to it. This pattern allows us to keep a record of deleted rows, and thus information about when they were first inserted, last updated and, finally, deleted. A delete request which is carried out as a physical delete loses all this information, removing all evidence, in the online table, that the row ever existed.

To support logical deletes, we need an additional column, either some kind of a delete indicator or else a delete date. A delete indicator would be set to "N" when the row is first inserted, and set to "Y" if and when the row is logically deleted. A delete date would be set to {null} when the row is first inserted, and set to the date of the logical deletion if and when the row is logically deleted. Remember, in this series when we reference a column as a "date," we normally would consider a "date/timestamp" to be more accurate and to provide more flexibility as to when we can change the state of a row. We use dates in our examples to help keep them simple.

We prefer to use a delete date because it provides more information than a delete indicator. If a delete indicator is used, we must decide either to leave the last update date alone, or to put the delete date in that column. If we choose the first option, we have lost the delete date. If we choose the second option, we have overwritten the date of the last update. But if we use a delete date instead of a delete indicator, we can preserve both dates. In addition, as we will show later, it becomes easier and more efficient to query when looking for rows that were active at a given point in the past. Also discussed later in this article is the option of using a special date value rather than {null}.

A variant of Version Pattern 2 would be to add a "delete reason" column. Together with a delete date, this would tell us both when and why rows were logically deleted. For example, a row in a Customer table might have been logically deleted because of no recent orders, bad credit, etc. And, of course, the combination of insert date and delete date tells us how long the row was logically present in the table. (This may not correspond to how long the customer was active,
however. It will correspond only if the business chooses to treat row insertion and deletion dates as also representing customer effectivity dates.)

With rows logically rather than physically deleted, it is also possible to maintain referential integrity for rows in dependent tables. Otherwise, a physical deletion would force us either to set foreign keys to {null}, or to cascade delete those dependent rows. In certain types of queries, data from a logically deleted parent row might frequently be usefully joined to nondeleted child rows. This would provide data about the parent row related to those child rows, a relationship that was in effect between the later of the insert dates for the parent and the child rows and the delete date of the parent row.

Foreign Key Constraints

Most popular database management systems (DBMSs) allow a "Set Null" delete option. With this option, if there are one or more rows in one or more dependent tables that have a foreign key to a row being deleted, then those foreign keys are set to {null}. As just mentioned, while this approach retains the dependent rows, it loses the link to the parent row. So a logical delete allows us to retain more information than a physical delete which uses the Set Null option.

These same DBMSs also provide for a "Cascade" delete option. With this option, if there are one or more rows in one or more dependent tables that have a foreign key to a row being deleted, then those dependent rows are also deleted. If other rows have foreign keys to these dependent rows, then those other rows will also be deleted. Cascade delete options are usually avoided in financial and other mission-critical applications. A Cascade delete retains less information than a Set Null delete against the same set of rows because a Cascade delete removes the dependent rows, while a Set Null delete retains the rows and merely sets their foreign keys to {null}. So a logical delete also allows us to retain more information than a physical delete with the Cascade option.

The third common delete option is called "No Action" or "Restrict." No Action and Restrict are nearly identical, and in DBMSs that support both, the only difference is when the constraint is enforced – before or after other constraints. This constraint is used as a safety valve to prevent the inadvertent deletion of dependents when a parent row is physically deleted, and the inadvertent deletion of a parent row that has dependents. Therefore, in this physical implementation of Version Pattern 2, we will use the "Delete No Action" option, because no rows should ever be physically deleted; but if someone tries to physically delete a row, this option will prevent the deletion if that row has dependent rows. This will not, however, prevent the physical deletion of rows that do not have dependents. So, if we really want to prevent the physical deletion of rows in a table, the safest approach would be to revoke the delete permission on the table from the users, or not grant delete authority in the first place. This will force the user to update the logical delete indicator (or logical delete date) in order to perform a business deletion.

A logical delete date or indicator column can help maintain referential integrity if we want to inactivate a parent table row while retaining the dependents' relationship. For example, we may want to inactivate a customer but retain both the orders they placed and the foreign key relationship. This is helpful for several reasons, including the ability to find the customer name and address of an order that was shipped last year, even though the customer is no longer active. This approach also gives us the ability to reactivate the customer if conditions change – without
having to create a new customer entity occurrence – thus retaining the customer's previous history. But if we do this, we will lose the information that this customer was logically deleted and inactive for a period of time before its re-activation. To handle re-activation without loss of information requires a more advanced version pattern.

Figure 1 shows how Version Pattern 2 will look in a physical database.

Figure 1: Version Pattern 2 (see PDF below)

Figure 2: Version 2 Setup (see PDF below)

Starting in this PDM segment, we will include a setup section that will show the preparation of the data and the chronology of SQL events leading up to the scenarios.

Figure 3: Version 2 Scenario 2.1 (see PDF below)

Figure 4: Version 2 Follow-Up (see PDF below)

We will also perform some follow-up SQL that shows various ways the data can be queried.

Two Approaches to a Delete Date

In the physical implementation of our remaining versioning patterns, we will not use {null} in delete dates. Instead, we will use the special value "12/31/9999." Two reasons to use this special value are to simplify queries and to improve performance. For example, if a delete date is inclusive and nullable, the SQL that looks for active rows at a given point in time may be written as:

"… AND (crt_dt <= '01/01/2005' AND (del_dt >= '01/01/2005' OR del_dt is null))"

…using the table facsimile in Figure 5.

Figure 5: Version 2 Follow-Up (see PDF below)

This query will return policies 138 and 139 because they were both active on 01/01/2005. However, if we used "12/31/9999" for the delete date, and it is inclusive, the SQL may be written simply as:

"… AND (:my-date between crt_dt AND del_dt)"

or another way:

"… AND (crt_dt <= :my-date AND del_dt >= :my-date)"

where :my-date is '01/01/2005', using the table facsimile in Figure 6.

Figure 6: (see PDF below)

This query will also return policies 138 and 139 because they were both active on 01/01/2005. Using "12/31/9999" instead of {null}, the SQL is clearly simpler to write. In addition, performance
will be better because we have removed the "OR" from the optimization criteria. Many DBMSs do not optimize ORs very well, and often will alter the access path to something less efficient compared to similar SQL without the OR. Using "12/31/9999," we may also decide to put an index on the delete date, usually combined with some other columns such as client_nbr.

Wrap-Up

This article has shown how to implement Version Pattern 2, which builds on Version Pattern 1 by providing a logical delete to use in place of a physical delete. However, updates using this version pattern still lose a great deal of information. This pattern does not retain previous versions of an object, nor can it distinguish between re-activated objects and those which were never deleted. Also, we've only discussed one dimension of time, the system dates of the row's creation and deletion. Thus, Version Pattern 2 does not manage bi-temporality. Its dates are dates of database activity. Business meaningful dates – specifically the dates on which a version of an object became effective and ceased being in effect – are not explicitly represented in Version Pattern 2. In the next several articles, we will proceed to demonstrate how to physically model the more advanced semantics of the higher version patterns. We will also present the SQL needed to maintain and query these tables when these more advanced patterns are used. Time is on our side!
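Since the Pattern 2 figures are available only in the linked PDF, here is a minimal sketch of what a Pattern 2 Policy table and its logical delete might look like. The p2_policy table and its column names are our own assumptions, modeled on the p1_policy table from Part 16, and are not a reproduction of the published figures.

Create table p2_policy (
    policy_nbr    int       not null,
    client_nbr    int       null,
    policy_typ    char(10)  not null,
    crt_dt        datetime  default getdate() not null,
    last_updt_dt  datetime  default getdate() not null,
    del_dt        datetime  default '9999-12-31' not null,   -- "12/31/9999" means not yet logically deleted
    constraint pk_p2_policy primary key (policy_nbr)
) ;

-- A business deletion is a logical delete: an update, not a physical delete.
Update p2_policy
set    del_dt = '2006-06-01', last_updt_dt = '2006-06-01'
where  policy_nbr = 138;

-- Rows active at a point in time, using the simpler predicate discussed above.
Select * from p2_policy where '2005-01-01' between crt_dt and del_dt;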


Time and Time Again: Managing Time in Relational Databases, Part 18 - Physical Implementation of Version Pattern 3
Tom Johnston, Randall Weis
DM Direct, January 18, 2008

In this article, we will describe the physical implementation of Version Pattern 3. This pattern expands on Patterns 1 and 2 by retaining a history of updates. This pattern was logically described in Part 4 and Part 5 of this series. Pattern 2 provided the ability to query objects that were previously deleted. However, it failed to save the history created by updates. In Pattern 3, we will start saving this update history by using versioning, and will lay the groundwork for the more robust history patterns that follow.

Version Tables

One way to keep track of updates is to use history tables that are nearly clones of the object table, and are populated by using update triggers. There are pros and cons with using this approach, which we will discuss later. Pattern 3 takes an approach where all of the versions (current and past) are stored in the same table. A Policy was the object in the Policy table in Patterns 1 and 2. Now, the table will contain multiple versions of the same object. Therefore, to distinguish these, the table that contains the versions of a policy will be called the "Policy Version" table. While you might assume that having a policy version table would also require a policy (object) table, this is not so. We will only populate and query the Policy Version table for our current and previous versions of an object. Some organizations might demand both tables for various reasons, but it is not required to support the semantics of this pattern. There are benefits to having only one (version) table to update, manage and query, so we will describe this pattern with this design decision.

Version Primary Key

Having versions of the same object in the same table requires a change to the primary key of the table. Policy number (policy_nbr) was the primary key in Patterns 1 and 2. However, policy number alone is no longer a valid primary key now that we are storing versions in the same table. We need something else as part of the primary key to distinguish multiple versions of the same policy. An alternative might be to replace policy number with a surrogate key. But since surrogate keys were not introduced as part of these earlier patterns, we will use a composite key made up of the policy_nbr and something else to define the version.


That "something else" could be a sequence number (version number), but for our examples we will use the version beginning (system activity) date (ver_beg_dt). Remember, we use dates in our examples to simplify them, but we really would use a date/timestamp to allow for multiple events per day. The primary key will be policy_nbr and ver_beg_dt. The current version is identified by being the row for the policy that has the highest (most recent) version begin date. So, to find the most current version you would need to find the highest (MAX) date for the policy, then use that date to qualify the ver_beg_dt to get the single current row. This can be accomplished by using a correlated subselect, as shown in Figure 1.

Figure 1: Correlated Subselect for Max (see PDF)

Either way, this approach has negative performance implications in applications that read these rows. We have seen this performance problem in implementations of this pattern, and it is often not discovered until there is a larger quantity of history and a load is placed on the application. This issue will be addressed and overcome in future patterns in this series, so stay tuned.

Returning a version of an object "as of" a certain point in time, while possible, is a bit more complex. You need to find the specific row that began (beg_dt) on or prior to the input date, and the input date must also be less than the begin date of the next, later version. It is not a simple "between" predicate. This kind of query is shown in Figure 2.

Figure 2: "As Of" (see PDF)

The SQL in the scenarios that follow the physical data model (PDM) design will help demonstrate how to retrieve the various versions using an input parameter representing the as-of date. Remember, this pattern only works with a single dimension of time. It is the time that the event is recorded in the computer system. We assume for this pattern that this date is both the system activity and business effectivity date. The later patterns in this series will show us how to work with both dimensions of time. But, for now, here is Pattern 3 – Update History.

Version Pattern 3
Figure 3: Policy – Version 3 – PDM Pattern (see PDF)

Version 3 Scenario 3.1
Figure 4: Scenario 3.1 (see PDF)

Version 3 Scenario 3.2
Figure 5: Scenario 3.2 (see PDF)

Version 3 Scenario 3.3
Figure 6: Version 3 Scenario 3.3 (see PDF)
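The correlated-subselect and "as of" figures referenced above are available only in the linked PDF. As a rough sketch of what such queries might look like, assume a Pattern 3 table named p3_policy, keyed on policy_nbr and ver_beg_dt; the table name and column list are our own illustration, not the published figure.

Create table p3_policy (
    policy_nbr  int       not null,
    ver_beg_dt  datetime  not null,   -- version begin (system activity) date
    client_nbr  int       null,
    policy_typ  char(10)  not null,
    crt_dt      datetime  default getdate() not null,
    constraint pk_p3_policy primary key (policy_nbr, ver_beg_dt)
) ;

-- Current version: the row with the highest version begin date, found with a correlated subselect.
Select *
from   p3_policy p
where  p.policy_nbr = 138
and    p.ver_beg_dt = (select max(p2.ver_beg_dt)
                       from   p3_policy p2
                       where  p2.policy_nbr = p.policy_nbr);

-- "As of" a supplied date: the latest version that began on or before that date.
Select *
from   p3_policy p
where  p.policy_nbr = 138
and    p.ver_beg_dt = (select max(p2.ver_beg_dt)
                       from   p3_policy p2
                       where  p2.policy_nbr = p.policy_nbr
                       and    p2.ver_beg_dt <= '2005-01-01');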


Version 3 Follow-up

As explained in prior PDM articles, we will expand upon the original scenarios with a follow-up section that demonstrates additional capabilities of the pattern. This follow-up for Pattern 3 shows one of the benefits of storing the current and past versions in the same table: the same SQL and application logic can be used to retrieve and process both the current and previous versions, just by supplying a different input date.

Figure 7: Version 3 Follow-up 3.4 (see PDF)

Figure 8: Version 3 Follow-up 3.5 (see PDF)

One or Two Tables?

Pattern 3 combines both the current version and the past versions into a single table. There are pros and cons to this, both physically and logically. One of the semantic constraints of Pattern 3 is that its date represents both the business effectivity date and the system activity date. That is, the date on which an event appears in the computing system is the date on which that event is effective for the business. This also implies that we cannot accurately allow retroactive or future date processing; that is, an event cannot be dated prior to the current version, nor can it be dated in the future. Consequently, as far as Pattern 3 is concerned, it would not significantly impact the application whether the various versions were stored in one table or two. However, in future patterns we will show why we combine all versions in a single table: to better support bi-temporal patterns and retroactive and future date processing. Following is a list showing some of the pros and cons of combining or separating versions into tables.

Figure 9: Versioned Tables versus Current Plus Historical Tables (see PDF)

If you are trying to retrofit history into an existing database and application that has no history, then you might consider creating a separate history table, populated by triggers on the original base object table. Otherwise, for a new application we lean towards combining the versions into the same table, possibly using partitioning to separate the non-current versions.

Wrap-Up

This article has shown how to implement Version Pattern 3. This pattern allows you to easily query any version, current or past, based on the supplied as-of date input parameter. This reduces the number of database objects, application objects and SQL required for the application. More importantly, we start to retain, better and more accurately, the history that is required by governing bodies and regulations, such as Sarbanes-Oxley (SOX), the National Association of Insurance Commissioners (NAIC) and the Securities and Exchange Commission (SEC). It also helps support auditing requirements and best practices desired by most major companies today. However, it does not yet support accurately post-dating or pre-dating database events. In upcoming articles, we will continue to demonstrate how to physically model these more advanced temporal patterns, and will also present the DDL to create these tables and their related database objects, plus the SQL needed to maintain and query the data. Stay with us as we travel through time!


Time and Time Again: Managing Time in Relational Databases, Part 19 - RI and Temporal RI
Tom Johnston, Randall Weis
DM Direct, February 1, 2008

This installment should have been the continuation of our presentation of physical implementations of the version patterns discussed thus far. It should have presented a physical implementation of Version Patterns 4 and 5. However, because editors are just plain tough when it comes to deadlines, and because of some personal exigencies with the authors, we are unable to stick to this logical sequence. Instead, in this article, we will consolidate and extend some of the things we have learned about the semantic constraints that apply to versions, and especially to what referential integrity means when applied to versions. In terms of the roadmap presented in Part 13, this means that we are putting Step 2 aside, for the moment, and moving on to Step 3, the semantics and implementation of Version Pattern 6. The first part of those semantics which we wish to discuss is what we will call "temporal referential integrity" (temporal RI).

A glossary is included at the end of this article. As we have been writing this series, we have developed a set of terms which have been defined more by means of introducing and using them than in any more formal way. Of course, the later Wittgenstein (someone whose name C. J. Date likes to invoke) did indeed say "Meaning is use," but it is well-known (by philosophers, at least) that the later Wittgenstein is notoriously difficult to interpret and, consequently, correspondingly inappropriate as a source of quotations taken out of context. This glossary, like dictionaries when they are periodically undergoing revisions, takes usage to be prescriptive of meaning, but not identical to it. And in so doing, it and we part company with the later Wittgenstein, for whom most philosophical problems need to be dissolved, not solved.

RI and Temporal RI

The referential integrity (RI) that we all know and love is RI that the database management system (DBMS) can enforce, based on declarations we make to it when we submit our data definition language (DDL). It is RI against nonversion tables. For clarity's sake, we have been calling non-version tables "object tables," not in the object-oriented sense, but rather in the sense of things that persist through time, things that versions are versions of. Thus, a Customer table is a table of customers, which are instances of a certain kind of object, and which persist through time. But a Customer Version table is not a table of customers; it is a table of chronologically sequenced, nonoverlapping time slices of customers.


As we continue, we will use the term "object RI" when we think the adjective is needed for clarity, and simply "RI" when context makes it clear that we are talking about object RI. Finally, we will always use the term "temporal RI" when talking about RI as it applies to versions.

Object RI

Here is the definition of object RI: If there is an object RI dependency from table Y to table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not foreign key-linked to a row in table X. No inserts, updates or deletes are valid that would transform the database into such a state.

The RI machinery of the DBMS ensures this by carrying out the following checks. We assume throughout this discussion, unless otherwise noted, that all foreign keys are defined as non-nullable.

1. When inserting a row into object table Y, that row must reference a row in object table X that already exists in the database (or that is being inserted as part of the same atomic transaction).

2. When updating the foreign key to a row in object table X, from a row in object table Y, that new foreign key must reference a row in X that already exists in the database (or that is being inserted as part of the same atomic transaction).

3. When deleting a row in object table X, no rows may remain in object table Y, after that transaction is complete, that reference that row.

With insert, update and delete actions, and two tables, we have a combination of six table/action combinations. Notice that three of them are not mentioned as constraints that object RI must enforce. Inserting or updating a parent row cannot affect RI, nor can deleting a child row or updating it in any way that leaves its foreign key to its parent unchanged. Note that for the time being, we are not concerned with the different ways that the DBMS can enforce constraint number 3 above – those ways being a) block, b) set null and c) cascade delete.
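Object RI, in other words, is declared once and is then enforced by the DBMS itself. A minimal illustration, assuming the non-versioned Client and Policy object tables used earlier in this series:

-- One declaration; the DBMS then enforces checks 1 through 3 above on every insert, update and delete.
Alter table policy
    add constraint fk_policy_client
    foreign key (client_nbr) references client (client_nbr);

Temporal RI, as defined below, has no such declarative support; it must be enforced by code we write ourselves.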

Figure 1: Version Table Schema for Version Pattern 6

This schema will be refined later on, as we begin our discussion of Version Pattern 6.


It is presented here simply to provide a graphic image of the various time-implementing columns that are discussed below.

So here is the definition of temporal RI, stated initially as a constraint involving two version tables. As we will see later, we must also consider temporal RI constraints, and their enforcement, between two tables one of which is a version table and the other of which is an object table:

If there is a temporal RI dependency from version table Y to version table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not object RI-linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No inserts, updates or deletes are valid that would transform a database into such a state.

Now for the enforcement of temporal RI. No existing DBMS provides the machinery to carry out this enforcement. Instead, it will be up to us, the developers, to support temporal RI by writing our own code, possibly using triggers, to enforce the following constraints on activity against the database:

1. When inserting a row into version table Y, that row must reference a row in version table X that a) already exists in the database (or that is being inserted as part of the same atomic transaction), and b) whose effectivity time period wholly contains the effectivity time period of the row being inserted into Y.

2. When updating the foreign key of a row in version table Y so that it points to a row in version table X, that new foreign key must reference a row in X that a) already exists in the database (or that is being inserted as part of the same atomic transaction), and b) whose effectivity time period wholly contains the effectivity time period of the row being updated in Y.

3. When deleting a row in version table X, no rows may remain in version table Y, after that transaction is complete, that reference that row.

4. No update to a row in version table X (where a change in primary key is not considered an update) has any effect on RI constraints, other than a change to that row's effectivity time period. In that case, when updating the effectivity time period of an object X (which is done by creating a new current version of X that specifies the new time period), that effectivity time period may not be shortened as long as one or more object RI-linked rows in version table Y have an effectivity end date beyond the shortened end date for object X.

Recall, from our discussions thus far, that deletions in version tables are always logical deletions. That is, they are always done by marking the current version of an object deleted as of an indicated date, not by physically removing a row from a table.

Informally, what is going on is this: temporal RI is object RI plus the constraint that the time period of a child row must always be wholly contained within the time period of its parent row. If we keep this in mind as we begin to look more closely at the details of implementing temporal RI enforcement, those details will be much less likely to confuse us. All of those details exist to support this one-sentence definition of temporal RI.
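
To make the containment constraint concrete, here is a minimal SQL sketch of the kind of check our own code (or a trigger) would have to run before inserting a child version. The table and column names (Policy_Version, Client_Version, client_nbr, eff_beg_dt, eff_end_dt) and the host variables are assumptions made for illustration; in particular, the sketch assumes a Policy Version table that carries the client OFK directly.

    -- A candidate Policy version may be inserted only if some Client version
    -- for the same client wholly contains its effectivity time period.
    -- :client_nbr, :eff_beg_dt and :eff_end_dt are the values on the row
    -- about to be inserted.
    SELECT COUNT(*)
    FROM   Client_Version cv
    WHERE  cv.client_nbr  =  :client_nbr
    AND    cv.eff_beg_dt  <= :eff_beg_dt
    AND    cv.eff_end_dt  >= :eff_end_dt;
    -- A count of zero means the insert would violate temporal RI and must be rejected.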


Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is identical to the primary key value of that related row in X.
1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

Effectivity time period. The period of time during which a version is regarded as the truth about an object.
1. In our version patterns, an effectivity time period is defined by an effectivity begin date and an effectivity end date, where "date" may be a calendar date or any other "tick of the clock" (as described in Part 2).
2. Our convention is that the time period begins on the begin date, but ends one clock tick prior to the end date.

Object referential integrity, object RI. Referential integrity for objects, as opposed to object versions.
1. This is what the term "referential integrity" refers to, outside of these articles. It is the integrity constraint, enforced by all current RDBMSs, that every non-null foreign key must contain a value identical to the value in the primary key of a row in the RI-related table.

Object table. A table whose rows represent persistent objects. Sometimes called a "non-versioned" table.
1. Persistent objects are things that exist over time and can change over time, such as vendors, customers, employees, regulatory agencies, products, services, bills of material, invoices, purchase orders, claims, certifications, etc.
2. From an OLAP, star-schema point of view, dimension tables are tables of persistent objects. For a more in-depth discussion of different types of tables, see the articles "An Ontology of Tables" at MindfulData.com.
3. From an OLTP database point of view, assuming that there are no version tables in the database, object tables are all the tables which are not transaction tables.
4. Roughly speaking, object tables are the tables which are the concern of master data management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose value is identical to the primary key value of that row in X.


1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of object RI plus the constraint that the effectivity time periods of child rows are wholly contained within the effectivity time periods of their object RI-related parent rows.

Version table. A table whose rows represent versions of persistent objects.
1. A version of a persistent object is a time-slice of that object. A row in a version table represents a version, and describes that object as it was, is, will be and/or might be during its effectivity time period.
2. A row in a version table is what the custodians of the table believe is the truth about the object it represents, during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1 if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the effectivity end date of time period 1.
1. In order to implement this constraint in today's production databases, we rely on SQL data types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS implementation of comparison operators for those data types.
2. Also for ease of current implementation, we use "12/31/9999" in a way whose intricacies will be discussed in Part 20. For now, we note that when we use "12/31/9999," we let the DBMS interpret that string as a valid date. However, semantically, it almost never would be a valid date, because no business activities that we know of are concerned with an effectivity time period that extends up to, but not including or beyond, that date nearly eight thousand years in the future.

Wrap-Up

In Parts 10 through 12 of this series, we discussed RI as it applies to version tables. There we concluded that when managing rows in version tables, the only RI we needed to enforce among those rows was object RI, RI expressed by means of object identifiers (OIDs). Clearly, if this is all that were done to check the validity of insert, update and delete transactions against version tables, it would be possible to introduce violations of temporal RI.

As we will explain next time, our concern was to avoid the performance penalty of enforcing temporal RI on inserts, updates and deletes to the database. This concern, indeed, is the principal reason that IT professionals tend to avoid versioning as much as possible. In Part 20, we will examine our reasons for taking that position, and will explain why it is still a reasonable position to take. However, we will look more closely at the performance implications of doing temporal RI checking on insert, update and delete transactions. Those implications may not be as forbidding as we had originally thought them to be.


We will also begin a discussion of the value "12/31/9999," which may appear in effectivity end dates. In date or timestamp data types, this string is understood by the DBMS to represent an actual date, nearly eight thousand years in the future. But from a semantic point of view, that string does not represent a date. It functions as a temporal {null}. As such, we shall argue, its most mathematically secure management would require the use of a temporal three-valued logic. But fortunately, our focus is on implementation with today's technology. So just as {null}s in today's SQL fudge a robust three-truth-valued logic, we will show how these "temporal {null}s" can also limp along without the support of a secure foundation in a three-time-valued (modal) logic.

Beyond these issues, we must also discuss temporal RI as it applies to future versions, to hypothetical versions, and also to versions which correct errors in other versions. This will take us several installments further into this series, at which point we will resume our presentation of physical implementations of Version Patterns 4 and 5. After that, we will begin our discussion of Version Pattern 6.


Time and Time Again: Managing Time in Relational Databases, Part 20 - RI and Temporal RI (Continued)
Tom Johnston, Randall Weis
DM Direct, February 15, 2008

In this installment, we will begin a discussion of how temporal referential integrity (RI) may be enforced with today's relational database management systems (RDBMSs). In addition, we will expand the glossary which we began last time, in Part 19. Indeed, the glossary work has proven to be the major portion of the work done here in Part 20. This seems to be the right point in our series to begin formalizing the definitions of the terms we have introduced.

Object RI and Temporal RI: A Clarification

In Part 19, we said that by "object RI" we meant the normal DBMS-enforceable referential integrity, implemented by populating a column designated as a foreign key to table X with one of the primary key values in that table. That was wrong, and is indicated as such in the glossary by shading that entry. A new corrected definition is also included.

On that basis, we went on to give the following definition of "temporal RI":

If there is a temporal RI dependency from version table Y to version table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not object RI-linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No inserts, updates or deletes are valid that would transform a database into such a state.

As this discussion should make clear, our corrected definition of the term "object RI" does not require us to amend the above definition of "temporal RI."

Notice that in this definition, the dependent-on table (X) is itself a version table. Consequently, it contains no rows for objects, only rows for versions of objects. To illustrate, consider this schematic of a version table, also copied from Part 19. We also add a schematic of a dependent table which, in our ongoing example, is the Policy table.


Figure 1: Version Table Schema for Version Pattern 6

In the Policy table shown here, the notation (OFK) stands for object foreign key. The object mentioned in this definition is, in this example, a specific client. The client is represented by the OFK column client-nbr. If the referenced table were itself an object table, and not a version table, then client-nbr would be a "normal" foreign key, not an OFK. It would contain a value that was unique in the designated primary key column of the referenced table, and its validity could be checked by the standard RI mechanism of the DBMS (see Glossary). But the definition specifically states that it is a definition that involves two version tables. Consequently, client-nbr in the referenced table, by itself, is not a complete primary key. So it is not something the DBMS can enforce.

So what does "object RI" mean when there is no unique object to refer to, when, whatever it is, it is not enforceable by an RDBMS? By the same token, what kind of foreign key is an OFK? An OFK column contains the identifier of a unique object even though a table with that value as its primary key does not exist. It is as if, instead of Figure 1, versioning were implemented as shown in Figure 2.
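
Since the image of Figure 2 is not reproduced in this text, here is a minimal DDL sketch of that as-if arrangement. The table and column names (eff_beg_dt, eff_end_dt, and the integer identifiers) are assumptions made for illustration only, not the schema used in this series.

    -- Client and Policy as object tables, each with a separate version table.
    -- In this as-if arrangement every foreign key points at an object table,
    -- so the DBMS itself could enforce all of the dependencies listed below.
    CREATE TABLE Client (
        client_nbr  INTEGER NOT NULL PRIMARY KEY
    );

    CREATE TABLE Policy (
        policy_nbr  INTEGER NOT NULL PRIMARY KEY,
        client_nbr  INTEGER NOT NULL REFERENCES Client (client_nbr)
    );

    CREATE TABLE Client_Version (
        client_nbr  INTEGER NOT NULL REFERENCES Client (client_nbr),
        eff_beg_dt  DATE    NOT NULL,
        eff_end_dt  DATE    NOT NULL,
        PRIMARY KEY (client_nbr, eff_beg_dt)
    );

    CREATE TABLE Policy_Version (
        policy_nbr  INTEGER NOT NULL REFERENCES Policy (policy_nbr),
        eff_beg_dt  DATE    NOT NULL,
        eff_end_dt  DATE    NOT NULL,
        PRIMARY KEY (policy_nbr, eff_beg_dt)
    );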


In the as-if situation depicted in Figure 2, these are the foreign key dependencies:

Policy Version table to Policy table.
Client Version table to Client table.
Policy table to Client table.

These RI dependencies are all object RI dependencies because in all cases, the table referenced by a foreign key is an object (a non-version) table. In addition, all these object RI dependencies can be enforced by the RDBMS.

So to keep this distinction clear, we will use the term "standard RI" (or "RI" by itself) to mean RI among non-versioned tables. Standard RI is RI that is enforceable by the DBMS. This leaves us free to redefine the term "object RI." We will now use it to mean a constraint that applies between a pair of tables, where the referenced (parent) table is a version table. In that referenced version table, one part of its primary key will be an identifier for the object whose version it is. In the referencing table, there is a column which is constrained to contain a value matching the object identifier part of a version in the referenced table which is current at the time the referencing (child) row was inserted. That column is what we are calling an OFK.

In addition, if the referencing row is not in a version table, then an update in place can, of course, be applied to it. If that happens, the object RI constraint requires that the new value put in the OFK column must match the object identifier in a version current at the time of the update.

The Glossary below contains the indicated corrections, and also includes new expressions not defined previously.
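
To see why an OFK is beyond the reach of declarative RI, here is a minimal sketch of the Figure 1 arrangement, in which there is no Client object table at all. Again, table and column names are illustrative assumptions rather than the schema used in this series.

    -- With versioning implemented as in Figure 1, there is no Client table
    -- whose primary key is client_nbr alone; client_nbr is only part of the
    -- Client_Version primary key.
    CREATE TABLE Client_Version (
        client_nbr  INTEGER NOT NULL,
        eff_beg_dt  DATE    NOT NULL,
        eff_end_dt  DATE    NOT NULL,
        PRIMARY KEY (client_nbr, eff_beg_dt)
    );

    CREATE TABLE Policy (
        policy_nbr  INTEGER NOT NULL PRIMARY KEY,
        -- client_nbr is an OFK. No REFERENCES clause is possible, because
        -- client_nbr alone is not unique in Client_Version; our own code
        -- (or a trigger) must check that a suitable Client version exists.
        client_nbr  INTEGER NOT NULL
    );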


Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is identical to the primary key value of that related row in X.
1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.
2. Also sometimes referred to as "dependent tables" or "dependent rows," or various cognates (e.g., "RI-dependent tables," "RI-dependent rows").

Current version, most recent version. A version of an object whose effectivity begin date is chronologically the latest across all non-future versions of that object, and whose effectivity end date is either unknown (represented by "12/31/9999") or in the future.
1. Until we begin discussing Version Pattern 7, we will assume that no two versions of the same object can have the same effectivity begin date. After all, between that shared begin date and the earlier of their effectivity end dates, we would have two different statements of what was true about the object during that period of time. But there can only be one version of the truth.
2. With Version Pattern 7, we will discuss cases in which two or more versions of the same object might have the same effectivity begin date. One source of such cases is when there is a business requirement to correct an error in a version table, but also to retain, as queryable history, the version discovered to be in error. Another source of such cases would be where one of a small number of things might be true of an object in a given time period, but we don't know which one.

Effectivity time period. The period of time during which a version is regarded as the truth about an object.
1. In our version patterns, an effectivity time period is defined by an effectivity begin date and an effectivity end date, where "date" may be a calendar date or any other "tick of the clock" (as described in Part 2).
2. Our convention is that the time period begins on the begin date, but ends one clock tick prior to the end date.

Episode. Each version Y of an object which does not supercede a version X current at the time Y is created begins an episode of that object.
1. See "supercession."
2. The first version of an object creates the initial episode of that object. That remains the


current episode until a logical delete version for that object is inserted into the table.
3. If, when a version Y is inserted into the table, the most recent version of that object already in the table is a logical delete version, then Y creates a noninitial episode of that object. That remains the current episode until a logical delete version for that object is inserted into the table.
4. If, when a version Y is inserted into the table, the most recent version of that object already in the table is a current version, then Y supercedes that version but remains within the same episode as the version it supercedes.
5. Also:
a. Current episode. An episode of an object whose most recent version is not in the future and is not a logical delete.
b. Past terminated episode, terminated episode. An episode of an object whose most recent version is not in the future and is a logical delete.
c. Future episode. An episode of an object whose most recent version is in the future and is not a logical delete.
d. Future terminated episode. An episode of an object whose most recent version is in the future and is a logical delete.

Logical delete. A delete which marks the row as deleted, but leaves it in the table.
1. Also known as a "business delete," or referred to as a "semantic delete" or a cognate.
2. See "physical delete."

Object foreign key (OFK). A column in a versioned table which contains the object identifier used by one or more rows in a (not necessarily distinct) version table.
1. For example, in Figure 2 of Part 20, client-nbr in the Policy table is an OFK. It is constrained to have the value of a client-nbr in one or more rows in the Client table, such that at least one of those rows was current as of the time the row containing the OFK was created.

Object referential integrity, object RI (revised). The constraint that when a row containing an OFK is inserted, the value in that OFK matches the value of the object identifier portion of a primary key in a (not necessarily distinct) version table.
1. Object RI cannot be enforced by today's RDBMSs. This follows from the fact that OFKs reference an object only indirectly, by means of the one or more versions that implement it. In those versions, the referenced object identifier is only part of the primary key of its table, and is thus not necessarily (or even usually) unique.
2. Object RI requires that the referenced (parent) table be a versioned table. But the referencing (child) table, the one that contains the OFK, may itself be either a versioned or an object (non-versioned) table.

Object table. A table whose rows represent persistent objects. Sometimes called a "non-versioned" table.
1. Persistent objects are things that exist over time and can change over time, such as vendors, customers, employees, regulatory agencies, products, services, bills of material, invoices, purchase orders, claims, certifications, etc.
2. From an online analytical processing, star-schema point of view, dimension tables are tables of persistent objects. For a more in-depth discussion of different types of tables, see the articles "An Ontology of Tables" at MindfulData.com.
3. From an online transaction processing database point of view, assuming that there are no version tables in the database, object tables are all the tables that are not transaction tables.
4. Roughly speaking, object tables are the tables which are the concern of master data management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency


from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose value is identical to the primary key value of that row in X.
1. Parent/child relationships typically have a maximum cardinality of one/many and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.
2. Also sometimes referred to as "dependent-on tables" or "dependent-on rows," or various cognates.

Physical delete. A delete which physically removes the row from its table.
1. See "logical delete."

Queryable history. Data about an object that was valid at some time in the past, which is no longer currently valid, but which is as easily and rapidly accessible as current data.
1. "As easily and rapidly accessible as current data" means what it says. Our way of providing this access is to use version tables. In such tables, production queries against current data (the most common kind of query) can be used to retrieve historical data simply by adding a date to a BETWEEN clause of the SQL statement.
2. As we have mentioned before (and will again), providing queryable history in this manner can significantly lower the development and operations cost of accessing historical data, and significantly improve the availability of the historical data retrieved.

Standard referential integrity, standard RI. The referential integrity constraint among non-versioned tables that today's RDBMSs can enforce. Because we are calling these non-versioned tables "object tables," we erroneously equated object RI with standard RI in Part 19.

Supercede, supercession. In these articles, we use these terms to refer to the replacement of a current version with a new current version.
1. Supercession is a logical function. Physically, supercession is done by inserting a new row in a version table.
2. Deletion in a version table is done via supercession.
3. Versionable updates in a version table are also done via supercession.
4. However, creating the first version of an object does not involve supercession, as there is no current version of that object to supercede.
5. A more subtle point: creating the first version of a noninitial episode of an object also does not involve supercession. Even if other versions of an object exist, every episode is concluded with a logical delete version. Thus, when a new episode is started, the most recent prior version of that object is a logical delete version, and so there is no current version of the object to supercede when that new episode begins.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of object RI plus the constraint that the effectivity time periods of child rows are wholly contained within the effectivity time periods of their parent rows.

Version, version table. A table whose rows represent versions of persistent objects.


1. A version of a persistent object is a time-slice of that object. A row in a version table represents a version, and describes that object as it was, is, will be and/or might be during its effectivity time period.
2. A row in a version table is what the custodians of the table believe is the truth about the object it represents during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1 if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the effectivity end date of time period 1.
1. In order to implement this constraint in today's production databases, we rely on SQL data types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS implementation of comparison operators for those data types.
2. When we use "12/31/9999," we let the DBMS interpret that string as a valid date. However, semantically, it almost never would be a valid date, because no business activities that we know of are concerned with an effectivity time period that extends up to, but not including or beyond, that date nearly eight thousand years in the future.

The ground-clearing needed before we discuss options for enforcing temporal RI is, we hope, now complete. If it is, Part 21 will begin our discussion of those options. Beyond these issues, we must also discuss temporal RI as it applies to future versions, to hypothetical versions and also to versions that correct errors in other versions. This will take us several installments further into this series, at which point we will resume our presentation of physical implementations of Version Patterns 4 and 5. After that, we will begin our discussion of Version Patterns 6 and 7.


Time and Time Again: Managing Time in Relational Databases, Part 21 - Enforcement Time Options for Temporal RI
Tom Johnston, Randall Weis
DM Direct, March 7, 2008

At the conclusion of our previous article, we said that there are two options for enforcing temporal integrity constraints on a database. One is to enforce the constraints at the time of update, so as to prevent violations from being persisted in the database. The other is to let violations occur, but to filter them out at retrieval time. This would be done by means of views against which all normal queries would be written.

It's easy to disparage the latter option by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction to Part 10 for a more complete description of this "cascade update" fear.)
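
As a rough illustration of the retrieval-time option, here is a minimal sketch of a view that exposes only those child versions whose effectivity time periods are wholly contained in some parent version. The table and column names are assumptions made for the sake of the example, and the sketch assumes a Policy Version table that carries the client OFK directly; it is not the schema or the code used in this series.

    -- Policy versions are exposed only when a Client version for the same
    -- client wholly contains their effectivity time period; any temporal
    -- "garbage" remains in the base table but is hidden from queries
    -- written against this view.
    CREATE VIEW Policy_Version_Clean AS
    SELECT pv.*
    FROM   Policy_Version pv
    WHERE  EXISTS (
             SELECT 1
             FROM   Client_Version cv
             WHERE  cv.client_nbr  =  pv.client_nbr
             AND    cv.eff_beg_dt  <= pv.eff_beg_dt
             AND    cv.eff_end_dt  >= pv.eff_end_dt
           );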


We believe it is time to re-examine this fear, and see if the performance hit for enforcing temporal RI on update transactions is as great as we have feared. We assume here that, for as long as data storage continues to involve spinning disks, cylinders and the mechanical movement of read/write heads to get to required data, we can safely simplify our performance issue and treat it as a matter of how many physical I/Os (block-level transfers between a storage device, or its cache, and main memory) are required to complete a temporal RI validation. With this in mind, let us now examine two things: the logic of transactions against versioned objects, and the I/O that is required for those transactions.

Temporal RI Reformulated

To more explicitly allow for the query-time enforcement of temporal integrity constraints, we will slightly amend the definition of temporal RI given in Part 20, as follows:

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object RI-linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.


There are three changes to the original definition:

1. "Version table Y" changed to "table Y." As explained last time, we allow both versioned and non-versioned tables to be object-RI dependent on a versioned table.
2. "State of the database" changed to "exposed state of the database." This allows for the option of enforcing temporal RI on queries rather than on updates.
3. "No inserts, updates or deletes are valid that would transform a database into such a state" replaced by "No queries are valid that would expose such a state." The replaced sentence was appropriate for update-enforced temporal RI. The replacing sentence is appropriate for either update- or query-enforced temporal RI.

We will assume, in the following discussion, that we are dealing with update-enforced temporal RI. In later articles, we will discuss query-time enforcement.

Note: before proceeding, please note that the glossary introduced a couple of installments ago is not included in this article, nor will it be included in future articles. The reason is that the glossary is already over 3,000 words long and likely to grow to several times that size. One reason for the glossary being this extensive is that we are making the definitions of its entries extensive. Most of these concepts are complex, and we don't think it is possible to do them justice with one- or two-sentence definitions. Another reason for the glossary being this extensive is that we are attempting to construct it as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in a definition must itself be a defined glossary entry. This is the kind of rigor that must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like predicate logic and thus made available for manipulation by software-realized inferencing engines. It is also the kind of rigor that must be applied (to any set of technical definitions) before we can say with confidence that we know what we mean. Of course, all this just makes the glossary that much more important than if its entries were brief and informal. So we have decided to put the glossary on each of our Web sites, to keep it updated there, and to include a link on our respective homepages to that glossary. The URLs are MindfulData.com and InbaseInc.com.

In this review of updates to versioned tables, our approach will be to first describe the logic of those updates, including the enforcement of temporal constraints. After that, we will go on to consider the physical I/O costs of enforcing those constraints on those updates. The specific types of update that we will consider are:

Deletes;
Inserts;
(Row-level) updates; and
Upserts.

We begin with some terminology. An "original" transaction is a transaction created or authorized by the business. A "temporal" transaction is the result of translating an original transaction into a transaction against a versioned object. As we will see, while each original transaction appears to its author to be an update against a single row of an object table, the translation may result in a set of updates to multiple rows in possibly multiple tables.


(The original/temporal distinction, and other key terms of our temporal data management approach, are defined in the glossary.)

Original Deletes: Their Semantics

To the business, when they issue a delete against an object, they don't care whether the object is versioned or not, and indeed may not know. They are directing us to remove the object from the database. Thus, an original delete transaction doesn't need to specify a version, just the object being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Let's organize this discussion around three questions.

1. First, when an object is versioned, and a business user issues or authorizes a delete, what does he think he is deleting, an object or a version?
2. Second, when an object is versioned, and a business user issues or authorizes a delete, what is actually deleted?
3. And finally, how is the actual delete transaction carried out?

As for the first question, the author of the delete transaction intends to delete an object, and believes he is doing so. For example, he may issue a delete of a specific client, against a database in which clients are versioned, but he does not issue a directive to delete a specific version of that client. The reason that original update transactions are specified against objects, and not against versions, is that the burden of managing versions should not be placed on the business user, in this case on the author of the delete transaction. Versions are simply an internal mechanism for satisfying the business demand for real-time access to noncurrent data that is just as fast and easy as access to current data. As far as possible, versions and their complexities should be hidden from both end users and developers.

A delete transaction against a versioned object is valid if and only if there is a current version of that object on the database at the time of the transaction. Necessarily, this version will always be the one current version of the one current episode of that object, and applying the delete will delete the object by terminating its current episode. But what about past versions, future-dated versions or hypothetical versions?

As for past versions, it is reasonable to expose semantics to the user that prohibit them from updating the past. Just as with good accounting practice, if what we recorded about the past was wrong, we do not update the original entry. Instead, we enter adjusting transactions.

As for future or hypothetical versions, it could be said that if we allow users to create them, as we in fact do, then we expose the mechanism of versioning to those users. But this isn't so. Consistent with our principle of hiding the mechanism of versioning as much as possible, we do not expose to the user or developer the concept of deleting something that doesn't yet exist.


Instead, all that we have exposed to the user is functionality: the ability to enter transactions in advance of when they are to become effective, or even transactions that may never become effective. The mechanisms by which we do this are not themselves exposed. From the user's point of view, it appears as if we have simply accepted transactions in advance of when they should be applied and stored them somewhere in a transaction file, with the promise that we will apply them precisely on the date they are to become effective.

Temporal Deletes: Their Targets

So, given that there must be a current version of the object of an original delete, and that there may be, in addition, past, future and hypothetical versions as well, what is actually deleted? This is determined by the code that transforms an original delete into a temporal delete. We may think of this code as the "transform" portion of an extract, transform and load (ETL) process against a batch file of original transactions. We may also think of this code as a stored procedure invoked against individual online transactions. In either case, the important thing is that business users, developers and database administrators (DBAs) all work with original transactions and not with temporal transactions. The code that translates original to temporal transactions encapsulates the complexity of versions and their manipulations, supporting their functional semantics without exposing their implementation complexities.

This question about the targets of temporal transactions is the second of the three questions in the previous section. The answer is: the current version, and any future and hypothetical versions as well. To the user, it will seem that we have done three things:

Deleted the one row representing the object in question;
Flushed any remaining transactions for that object from the transaction file just mentioned; and
If there were hypothetical versions, removed that object from one or more "hypothetical situation" databases.

But how is this done? What physically happens when the object of an original delete is a versioned object? The answer is that this is done by inserting a new version that supercedes the current version of the object being deleted, as well as a new version for each of the future or hypothetical versions of that object.

The process begins by finding the current version of the object specified on the original transaction. If it is not found, the transaction is invalid. Otherwise, a superceding delete version is created which has the same object identifier and business effectivity begin date as the version it is superceding. The effectivity end date on the superceding delete version is set to the delete date on the transaction. This is because an original delete is a business transaction. Its intent is to remove the object from the database as of the indicated date. For example, if a client is being deleted, the intent is that after the transaction is complete, that client no longer exists, i.e., that person or organization no longer exists as a client of our company. (We may assume that if a delete date is not supplied on the original transaction, the effectivity end date defaults to the date current when the transaction is physically applied to the database.)
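
Here is a minimal sketch of that translation, written in an embedded-SQL or stored-procedure style. The table and column names (Client_Version, eff_beg_dt, eff_end_dt, delete_flg), the host variables, and the omission of the other columns of the superceding version (including whatever completes the primary key and the business data carried forward) are all assumptions made for illustration; this is not the authors' implementation.

    -- :client_nbr and :delete_dt come from the original delete transaction.
    -- Step 1: find the current version of the client.
    SELECT eff_beg_dt
    INTO   :cur_beg_dt
    FROM   Client_Version
    WHERE  client_nbr =  :client_nbr
    AND    eff_beg_dt <= CURRENT_DATE
    AND    eff_end_dt >  CURRENT_DATE;   -- the "12/31/9999" sentinel satisfies this test

    -- Step 2: insert a superceding delete version. It keeps the same object
    -- identifier and effectivity begin date, its end date is the delete date,
    -- and its delete flag is set to "yes." No physical row is removed, and
    -- the remaining columns of the superceded version (omitted here) would be
    -- carried forward.
    INSERT INTO Client_Version (client_nbr, eff_beg_dt, eff_end_dt, delete_flg)
    VALUES (:client_nbr, :cur_beg_dt, :delete_dt, 'Y');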


On the versions created by an original delete transaction, a delete flag is set to "yes." For all but these delete-created versions, this flag is set to, and remains, "no." The purpose of this delete flag is to distinguish deletes that happen in response to an original delete transaction from deletes that happen because a non-12/31/9999 end date comes to pass.

Note that nowhere in this process have we discussed a physical deletion of a physical row in a version table. Physically removing rows from versioned tables is a process we will call "archiving" those rows, and it is outside the scope of what this series of articles is about. See the taxonomy in Part 1 for additional discussion of the difference between archiving, creating snapshots, and versioning.

Wrap-Up

Next time, we will finish our discussion of original and temporal delete transactions by examining the foreign key implications of deleting a versioned object. We will then be in a position to assess the risk and the extent of potential "cascade" effects of such deletions. Since what we are proposing is an enterprise solution, we must either evaluate performance implications against worst-case scenarios, or else design our original transactions with a flag that can selectively turn on or off transaction-time temporal validation checks.


Time and Time Again: Managing Time in Relational Databases - Part 22: Original and Temporal Deletes
Tom Johnston, Randall Weis
DM Direct, March 21, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com. This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like predicate logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines. Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can say with confidence that we know what we mean, and that we mean what we say, no more and no less.

In our previous article, we introduced the distinction between original and temporal transactions, and began by discussing original and temporal deletes. This article completes our discussion of delete transactions. We will then go on to discuss insert, update and upsert transactions. These discussions describe the transformations required to map original transactions onto temporal transactions. This mapping is the "database update side" of the encapsulation of the complexities of temporal data management which, in Part 1, we said was an essential part of our approach.

These same discussions also provide the insight necessary to decide whether we should attempt to enforce temporal referential integrity (RI) on the updates themselves, or instead defer that integrity checking until data is retrieved. With respect to the deferral option, we will repeat what we said about it in the previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and


supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction to Part 10 for a more complete description of this "cascade update" fear.)

From the Previous Article

For the sake of continuity, we repeat the following material from the previous article.

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object RI-linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.

An "original" transaction is a transaction created or authorized by the business. A "temporal" transaction is the result of translating an original transaction into a transaction against a versioned object.

To the business, when they issue a delete against an object, they don't care whether the object is versioned or not, and indeed may not know. They are directing us to remove the object from the database. Thus, an original delete transaction doesn't need to specify a version, just the object being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Deleting a Versioned Object: The Existence Constraint

The original transaction we are considering is a delete. Just as with a delete to a non-versioned object, the first thing that must be done is to find the target row(s). One target row must be the most recent row of the current episode of the object. The original transaction provides the electronic identifier (EID) for the object. The most recent version of the current episode of the object is the one (nonhypothetical) version of that object with an effectivity begin date in the past and an effectivity end date either in the future or containing the "temporal null" value "12/31/9999."

First of all, note the assumption that there cannot be more than one current version of an object in a versioned table. We recommend that you attempt to state, clearly and precisely, what guarantees that this assumption is valid. (Hint: consider the logic of inserting into a versioned table. In the next article, this logic will be described in terms of the mapping from an original insert to a temporal insert, but in both the glossary and in previous articles, the material already exists for you to answer this question.)

If a current version of the indicated object is found, the delete can proceed. Otherwise, it is rejected as invalid. (We consider deletes which are explicitly against future versions in a later article. For now, we assume that all original transactions are against current versions.)
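
As a rough sketch of this existence check, the lookup might take the following shape. The table and column names and the host variable are assumptions for illustration, and the sketch ignores the additional predicate that would be needed to exclude hypothetical versions.

    -- :eid is the electronic identifier supplied by the original delete.
    SELECT *
    FROM   Client_Version
    WHERE  eid        =  :eid
    AND    eff_beg_dt <= CURRENT_DATE
    AND    eff_end_dt >  CURRENT_DATE;   -- an end date of "12/31/9999"
                                         -- also satisfies this test
    -- Zero rows returned: there is no current version, so the original
    -- delete is rejected as invalid.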


As described last time, an original delete of a versioned object requires us to supercede the most recent version of the current episode of the object with a delete version, and also to do the same for any active future (or hypothetical) episodes of that object. An active future episode is a future episode whose most recent version is not a delete version.

Deleting a Versioned Object: What is a Dependent Row?

If the version is found, the next thing is to look for any rows, in either object tables or version tables, that are dependent on it. This dependency is precisely that of temporal RI. Recall as well that the database management system (DBMS) cannot help us here. If we are to enforce temporal RI on delete transactions, we must do the work ourselves. And since this is work to enforce a database constraint, it does not belong anywhere in application-specific code. It must be implemented in the codebase that maps original transactions to temporal transactions. (This codebase, of course, may be expressed in part, or even in its entirety, as DBMS triggers.)

If the child table is an object table, a dependent row is one with the to-be-deleted row's EID as an object foreign key (OFK). Performance considerations would thus suggest that OFK columns, in all non-versioned tables, have a non-unique index defined on them.

If the child table is a versioned table, a dependent row in a current episode of its object is one which has an OFK that contains the value of the to-be-deleted row's EID, and which has an effectivity end date either in the future or containing the value "12/31/9999". A dependent row in an active future episode of its object is the most recent version of that episode. (Until further notice, we will not refer to hypothetical objects, versions or episodes any longer. The approach to temporal data management that we are describing does accommodate such hypotheticals, but they require special treatment. They will be discussed in later articles.)

In short, dependent rows for a temporal delete are all and only those rows which are OFK-related to the to-be-deleted row, and which are the most recent versions in the active episodes for their object. Here, performance could be improved by using a non-unique index on the OFK plus the effectivity begin date (plus one more date, as we shall see when we discuss Version Pattern 7) in the dependent table. However, at this point we are getting into a detailed discussion of implementation options. We intend to discuss such issues in depth toward the end of this series, and so we will say no more about them here.

Deleting a Versioned Object: Block, Set Null and Cascade Options

The search for dependent rows is conducted under one of three directives. They are the familiar relational delete options of:

Block,
Set null, or
Cascade.

From a performance point of view, which, as we stated last time, means from a physical I/O point of view, the block option is often less costly than the other options. This is because if there are multiple rows dependent on the to-be-deleted current version, we can stop checking as soon as the first one is found.
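
A minimal sketch of that block-option check follows. The table and column names, the host variables, and the assumption that the child version table carries the client EID as its OFK are all illustrative; the FETCH FIRST clause is simply one way of telling the DBMS it may stop at the first hit.

    -- Block option: we only need to know whether at least one dependent row
    -- exists, so the search can stop as soon as one is found.
    SELECT 1
    FROM   Policy_Version pv
    WHERE  pv.client_eid =  :eid              -- the OFK back to the to-be-deleted object
    AND    pv.eff_end_dt >  CURRENT_DATE      -- still active; "12/31/9999" qualifies
    FETCH FIRST 1 ROW ONLY;
    -- Any row returned means the temporal delete must be blocked.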


But if either the set null or the cascade option is chosen, then our search for temporally RI-dependent rows in versioned child tables must be exhaustive. It must retrieve and examine all OFK-related rows. Notice that even if the child table index used for the search is as described above, the rows themselves must be retrieved, because the check must look at the effectivity end date of the child rows.

In very large tables, possibly with multiple OFKs, the cost of such indexes can be high. When an important to-be-deleted object has many dependent children, the I/O costs of a) searching both the index and the base table for possible episodes to terminate (i.e., possible most recent versions to supercede with temporal delete versions), and b) doing the appropriate writes to the database to terminate those episodes, may be prohibitively high. But these are simply the trade-offs that any database administrator (DBA) must consider when physicalizing any data model. It is the cumulative weight of such costs that can lead us to decide to postpone temporal RI checking until retrieval time. In the case of temporal deletes, that would mean that we would just terminate all active episodes of the designated object, and not check for any dependencies.

Deleting a Versioned Object: The Foreign Key Ripple Effect

We know that normal, DBMS-enforced RI doesn't just link child tables to parent tables. When using the set null or cascade options, we must also consider child tables of those child tables, and continue down the RI-dependency chain as far as it extends. Thus, in a situation where we have clients, policies and claims, a cascade or set null delete of a client may require one or more policies to be set null or deleted. And if any are, the DBMS must go on to set null or delete the claims dependent on those policies. Let's call this the "foreign key ripple effect." There are three things to note about this ripple effect as it applies to versioned objects.

1. It does apply to versioned objects.
2. The DBMS can't help us. We must write our own code to enforce temporal RI on temporal delete transactions.
3. The performance cost will usually be much higher than for normal, DBMS-enforced RI.

So if we are enforcing temporal RI as the database is updated, the index and base-table lookups that we described above must be performed. In addition, as we just pointed out, they must be performed on all tables in the temporal RI chain of dependencies, thus not just on children of the to-be-deleted parent row, but also on grandchildren, great-grandchildren, etc.

What of the claim that performance costs will usually be much higher than for normal RI? A detailed defense of the claim, of course, cannot be provided unless we describe a specific implementation, and even a defense based on one implementation could not rule out the possibility that some better implementation would keep the incremental performance costs down. However, regardless of the specific implementation, all implementations share the following characteristic: there are going to be a lot more rows in a versioned table than there would be in the corresponding object table. The reason is that in versioned tables, it is not just inserts that create new rows. Versionable updates also create new rows, and so do versioned deletes.

Deleting a Versioned Object: Temporal Delete Transactions

Let's suppose that we have identified both the most recent version of the current episode of the


object specified on the original delete, the most recent version of any active future episodes of that object, and also the most recent version of all active episodes in all tables along the temporal RI-dependency chain. We then add a temporal delete transaction to each of these episodes, superceding the most recent version in each of them with a temporal delete version.

Note that these temporal deletes do not flag and rewrite these versions. That would be to do an update-in-place logical delete. As we explained in an earlier article, update-in-place logical deletes lose information, and that information may be considered versionable information by the business. Temporal deletes insert a temporal delete version for each active episode.

The effectivity end date on the delete version is set to the delete date on the transaction. This is because an original delete is a business transaction. Its intent is to remove the object from the database as of the indicated date. For example, if a client is being deleted, the intent is that after the transaction is complete, that client no longer exists, i.e., that person or organization no longer exists as a client of our company. (We may assume that if a delete date is not supplied on the original transaction, the effectivity end date defaults to the date current when the transaction is physically applied to the database.)

Note also that nowhere in this process have we discussed a physical deletion of a physical row in a versioned table. Physically removing rows from versioned tables is a process we will call "archiving" those rows, and it is outside the scope of what this series of articles is about. See the taxonomy in Part 1 for additional discussion of the difference between archiving, creating snapshots, and versioning.

So just like original inserts and original (versionable) updates, original deletes to versioned objects become physical inserts of one or more rows.

Wrap-Up

There is one unfinished piece of business. Having argued that it is sometimes too costly, in performance terms, to enforce temporal RI on original delete transactions against versioned objects, we should now show how that same RI is enforced if it is, instead, enforced at retrieval time. However, with deadlines approaching on our bill-paying projects, we are forced to postpone that discussion until a later time.

Next time, we will provide the same treatment for original insert transactions against versioned objects.


Time and Time Again: Managing Time in Relational Databases, Part 23 - Original and Temporal Updates
Tom Johnston, Randall Weis
DM Direct, April 4, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com. This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is needed to make definitions available for manipulation by software-realized inferencing engines. Another reason for being this careful with definitions is that this rigor must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

In our previous three articles, we introduced the distinction between original and temporal transactions. We then reviewed temporal referential integrity (RI) rules as they apply to temporal deletes. This article continues the discussion by reviewing temporal RI rules as they apply to temporal updates. These same discussions also provide the insight necessary to decide whether we should attempt to enforce temporal RI on the updates themselves, or instead defer that integrity checking until data is retrieved.

With respect to the deferral option, we will repeat what we said about it in the previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction
to Part 10 for a more complete description of this "cascade update" fear.)

For the sake of continuity, we repeat the following definitions, which are especially important for understanding the material in this article. These, and other technical terms, are all defined in the glossary.

Temporal Referential Integrity: Definition

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.

Original and Temporal Transactions: Definitions

An "original" transaction is a transaction against a versioned object. A "temporal" transaction is the result of translating an original transaction into one or more transactions against one or more episodes of that versioned object.

Updating a Versioned Object: the Original Update Transaction

To the business, when they issue an update against an object, they don't care whether the object is versioned, and, indeed, they may not know. They are directing us to apply the specified change to the object. Thus, an original update transaction doesn't need to specify a version, just the object being updated and the date the update is to take effect. This is illustrated in Figure 1.

As we have pointed out before, we are using dates throughout these articles only for convenience. The approach to versioning that we are presenting here applies no matter what the granularity of the clock ticks that measure out time. The granularity used here would be correct only if updates take place in batch mode and are applied at most once per day. As soon as online transactional updates are involved, we would need a finer granularity, such as a full timestamp.
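As a rough, hedged illustration of what such a transaction might carry (the staging table and column names below are hypothetical, not part of the patterns themselves), an original update transaction might be staged like this:

-- Hypothetical staging row for an original update: the business names the object
-- (here by its EID), the date the update is to take effect, and the new column
-- values; it does not name a version.
INSERT INTO Original_Update_Txn (client_eid, eff_begin_dt, client_status)
VALUES (1234, DATE '2008-04-04', 'Preferred');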

Figure 1: An Original Update Transaction

Update transactions can be submitted either as complete images of the updated row, or else by specifying first the row and then the column/value pairs for each column to be updated in that row. If these transactions are to become effective exactly when they are applied, the effectivity begin date shown in Figure 1 is not required. In those cases, an original update transaction against versioned data will be identical to the same update submitted against non-versioned data.

Updating a Versioned Object: the Existence Constraint

The original transaction we are considering is an update. Just as with an update to a non-versioned object, the first thing that must be done is to find the target row(s). Given that the target table is a versioned table, one target row must be the most recent version of the current
episode of the object. The original transaction provides the EID for the object. The most recent version of the current episode of the object is the one version of that object with an effectivity begin date in the past and an effectivity end date that is either in the future or contains the "temporal null" value "12/31/9999."

First of all, note the assumption that there cannot be more than one current version of an object in a versioned table. This assumption is guaranteed by the logic of inserts and deletes against versioned objects. It reflects our commonsense belief that at any point in time, there cannot be two different things that are true about an object.

When a user submits an update transaction against a non-versioned table, she is asserting that the indicated object already exists in that table. If the object doesn't already exist, the transaction is invalid. When a corresponding original transaction is submitted, i.e., when the update will be applied to a versioned table, the update logic must also ensure that the object already exists in that table. Specifically, the update logic must ensure that the object exists right now. Just as with non-versioned tables, it's not enough that the object may have existed in the table at some point in the past or will exist at some point in the future. It must exist right now. There must also be only one row in the table representing that object. In a versioned table, that means there must be only one current version of the object, where a current version is one whose effectivity begin date is in the past and whose effectivity end date is not. If a current version of the indicated object is found, the update can proceed. Otherwise, it is rejected as invalid. (We consider updates which are explicitly against future versions in a later article. For now, we assume that all original transactions are against current versions.)

An original update of a versioned object requires us to supersede the most recent version of the current episode of the object (which is, by definition, the current version of the object) with an updated version, and also to do the same for any active future episodes of that object. An active future episode is a future episode whose most recent version is not a delete version.

Updating a Versioned Object: Temporal Update Transactions

Let's suppose that we have identified the most recent version of the current episode of the object specified on the original update, and also the most recent versions of any active future episodes of that object. We then add a temporal update transaction to each of these episodes, superseding the most recent version in each of them with a new most recent version. Note that these temporal updates do not flag and rewrite these versions. That would be an update in place, which defeats the whole purpose of versioning by overwriting the past. Temporal updates insert a new most recent version for each active episode. The effectivity end date on the pre-update current version is set to one clock tick prior to the effectivity begin date of the version that is superseding it. This is because an original update is a business transaction. Its intent is to place the updated information in the database as of the indicated date. For example, if a client is being updated, the intent is that after the transaction is complete, the current description of that client is different from what it was before the update, reflecting new information we have come across about that client.
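Here is a minimal sketch, in SQL, of the existence check and the supersession just described. The table and column names (Client_Version, client_eid, eff_begin_dt, eff_end_dt, client_status) are hypothetical, the clock tick is assumed to be one day, and "12/31/9999" is used as the temporal null; a real version table built from these patterns would carry more columns.

-- Existence check: there must be exactly one current version of client 1234,
-- i.e., one row whose effectivity begin date is in the past and whose end date is not.
SELECT *
FROM   Client_Version
WHERE  client_eid    = 1234
  AND  eff_begin_dt <= CURRENT_DATE
  AND  eff_end_dt    > CURRENT_DATE;      -- includes the '9999-12-31' temporal null

-- Supersession, for an update effective 2008-04-04:
-- 1) set the end date of the current version to one clock tick (here, one day)
--    before the begin date of the version that supersedes it;
UPDATE Client_Version
SET    eff_end_dt    = DATE '2008-04-03'
WHERE  client_eid    = 1234
  AND  eff_begin_dt <= CURRENT_DATE
  AND  eff_end_dt    > CURRENT_DATE;

-- 2) insert the new most recent version carrying the updated column values.
INSERT INTO Client_Version (client_eid, eff_begin_dt, eff_end_dt, client_status)
VALUES (1234, DATE '2008-04-04', DATE '9999-12-31', 'Preferred');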
But what about temporal RI? If we supersede a version of a client, what about the policies for that
client that are temporally RI dependent on the client? And what about the claims for those policies that are temporally RI dependent on them?

Updating a Versioned Object: the Concern

As we've mentioned before, a pervasive fear among business IT professionals is that temporal RI enforcement on updates raises the likelihood of an extensive cascade effect, one we have called the "foreign key ripple effect" in other articles. We know that normal, database management system (DBMS)-enforced RI doesn't just link child tables to parent tables. When using the set null or cascade options, we must also consider the child tables of those child tables, and continue down the RI-dependency chain as far as it extends. Thus, in a situation where we have clients, policies and claims, a cascade or set null delete of a client may require one or more policies to be set null or deleted. And if any are, the DBMS must go on to set null or delete the claims dependent on those policies.

If these RI checks must be done on normal tables whenever an object is deleted, don't temporal RI checks need to be done on versioned tables whenever a version of an object is superseded? Contrary to a widespread belief among data management professionals, the answer is that they do not. To understand why this fear is largely unfounded, note first of all that we implement temporal RI with object foreign keys (OFKs), not with normal DBMS-enforced FKs. OFKs in a temporally dependent table do not point to a specific parent row in the versioned table they are dependent on. They point only to the object they are dependent on. Thus, in a Policy table, each row has an OFK to the client whose policy it is. But the table the Policy table is temporally RI dependent on is a versioned table, and so an OFK does not pick out a specific row in that table.

Initially, we might want to conclude that temporal updates have no effect on temporal RI at all, and that, just as with normal RI, a temporal update does not require any RI checking at all. However, this is not so. Most temporal updates have no effect on temporal RI, and thus incur no RI-related performance penalties. But there is one kind of temporal update that does require checking of dependent tables, and that may involve a lengthy chain of dependencies that must be checked so that the temporal update can either be blocked, or trigger a cascade of temporal updates, or trigger a cascade of temporal set nulls to the OFKs in the chain. We will examine this temporal RI-triggering original update in Part 24. In the meantime, we recommend that our readers think about temporal updates and try to figure out for themselves what kinds of temporal updates can trigger a temporal RI cascade and what kinds cannot.


Time and Time Again - Managing Time in Relational Databases, Part 24: Original and Temporal Updates (Concluded)
Tom Johnston, Randall Weis
DM Direct, April 18, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com. This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.

One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines. Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. We begin by updating our chart of installments, which last appeared in Part 13. (See the PDF at the end of this article for a chart of installments to date in this series.) With Part 19, we interrupted our presentation of the physical implementation of the versioning patterns that we developed in earlier articles in the series and began a discussion of temporal referential integrity (TRI). On completing Part 23, we had introduced and defined TRI and related concepts, had discussed TRI as it applies to deletes, and had begun to discuss how it applies to updates. In this installment, we conclude the discussion of TRI as it applies to updates.

Updating a Versioned Object: When TRI Checking Does Not Come Into Play

Last time, we described a temporal update in which a current version of a client was superseded by a new current version in a client version table. We also alluded to a TRI dependent policy version table, in which every policy was owned by exactly one client. So if the client version that
was superseded owned one or more policies, doesn't that mean that we must either block the supersession, set null the references in the policies, or else cascade update them to point to the new version of the client?

To raise the question is to begin to see the outline of the answer to it. Those policies, versioned or not, do not point to the superseded client version, because versioned tables are not referenced by foreign keys; they are referenced only by object foreign keys (OFKs), which point to the referenced object but not to any specific version of it.

Let's think about the situation in terms of what is going on in the real world, not simply in terms of what is going on in the database. In the real world, policies are owned by clients. As those policies change over time, or as the client that owns them changes over time, those policies are still owned by those clients. This last sentence, translated into the language of database activity against versioned tables, reads like this: even as versions of those policies are superseded by newer versions, or as versions of the client that owns them are superseded by newer versions, those policies are still owned by those clients. Versioning of either or both objects has no effect on the ownership relationship between them.

But because we are dealing with a temporally qualified relationship, there is a proviso. Versioning of either or both objects has no effect on the ownership relationship between them, provided that the superseding version of a TRI child (parent) object has an effectivity period that is wholly contained within (wholly contains) the effectivity period of all objects it is a TRI child of (a TRI parent of). Note that this proviso does not distinguish between TRI-mandatory and TRI-optional relationships. It clearly applies to mandatory relationships, but at this time we are not sure whether or not it should also apply to optional ones.

It's time now to see what the "provided that" clause entails.

Updating a Versioned Object: When TRI-Checking Does Come Into Play

Suppose we have a policy whose effectivity period is 1/1/2007 through 12/31/2007, and an owning client whose effectivity period is 1/1/2004 through 12/31/9999. Although "12/31/9999" means "until further notice," it is treated by the DBMS as a real date. Thus, the client's effectivity period wholly contains that of the policy.
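A minimal sketch of the kind of schema this example assumes may help. The table and column names (Client_Version, Policy_Version, client_eid, policy_eid, eff_begin_dt, eff_end_dt) are hypothetical, and a real implementation of these patterns would carry more columns (and, as discussed in earlier installments, might use surrogate keys). The point to notice is that the OFK in the policy table carries only the client's EID; it cannot reference any particular client version row.

-- Hypothetical version tables; the OFK is just the parent object's EID.
CREATE TABLE Client_Version (
    client_eid    INTEGER     NOT NULL,
    eff_begin_dt  DATE        NOT NULL,
    eff_end_dt    DATE        NOT NULL,          -- '9999-12-31' = until further notice
    client_name   VARCHAR(80),
    PRIMARY KEY (client_eid, eff_begin_dt)
);

CREATE TABLE Policy_Version (
    policy_eid    INTEGER     NOT NULL,
    eff_begin_dt  DATE        NOT NULL,
    eff_end_dt    DATE        NOT NULL,
    client_eid    INTEGER     NOT NULL,          -- OFK: identifies the client, not any client version
    policy_type   VARCHAR(20),
    PRIMARY KEY (policy_eid, eff_begin_dt)
    -- note: no DBMS-declared FOREIGN KEY to Client_Version; TRI is enforced by our own code
);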

This containment is shown in Figure 1.

Figure 1: A Client and Her Policy

The graphical conventions used in this illustration are:

1. The vertical bars constitute a timeline, with each bar representing one month. The timeline runs from Jan 1, 2004 through Dec 31, 2009.
2. A client episode and a policy episode are shown, each starting on its effectivity begin date.

3. The client episode contains four versions, the policy episode three. All versions begin with left-most vertical bars. Adjacent versions share a vertical bar. Versions with a known effectivity end date have a right-most vertical bar indicating that end date. Thus, the most recent version of the policy episode has a 12/31/2007 effectivity end date. Versions whose effectivity end date is not known have an ellipsis in place of the right-most vertical bar and always have an end date value of "12/31/9999." We will call the former kind of episode "closed," and the latter kind "open."

4. {Now}, relative to the example, is indicated by the black box arrow, pointing to March 1, 2007. So, for purposes of discussing the example, we should assume that it is now March 1, 2007.

5. The "OFK"-labeled crow's foot notation indicates a TRI relationship from clients to policies. Note in particular what this is not a relationship between. It is not a relationship between:
   - a specific client and a specific policy;
   - a specific version of the client and a specific version of the policy;
   - a specific episode of the client and a specific episode of the policy.
   Just as a relationship type is defined on a pair of conventional tables, a temporal relationship type is defined on a pair of tables, at least one of which is temporal.

As long as these effectivity periods do not change, and the policy continues to have an OFK to its owning client, TRI constraints between these two rows remain satisfied. Thus, no updates that leave this OFK and these pairs of dates alone can violate TRI, and therefore no versioning caused by such updates needs to be accompanied by TRI checks. Consequently, there is no TRI performance penalty in such cases.

Suppose that it is now some time during 2007, say March 1, 2007. And suppose that the client's effectivity end date is moved back to 5/1/2007. If no other changes are made to that client, she will be temporally deleted on that date, which will then terminate her current episode. In addition, all versions of all future episodes will be temporally deleted. (Although we are postponing discussion of how future episodes are managed, recall that, as we explained earlier, original transaction authors will think of future episodes as future transactions that are queued up in a virtual "transaction batch file," with the promise that they will be applied on their effectivity date. From this point of view, temporally deleting all future episodes appears to be flushing those not-yet-applied transactions from that virtual batch file.)

Because this new version of the client "shrinks" the client's effectivity time period, TRI checks are required. In our example, there is a policy that is TRI dependent on that client, and that policy's effectivity end date is 12/31/2007. Since 5/1/2007 is earlier than 12/31/2007, this violates the TRI constraint that the effectivity period of a child version be wholly contained in the effectivity period of the corresponding parent version. From the point of view of the real world, it means that if we let the update proceed, the result would be a database which showed a policy in effect for eight months after its owning client was deleted.
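Here is a minimal sketch, in SQL, of the check this shrinkage would trigger, reusing the hypothetical Client_Version and Policy_Version tables sketched above and assuming the client in question has EID 1234. Any policy version the query returns would be temporally stranded, so the temporal update must be blocked, must cascade, or must set the offending OFKs to null.

-- Policy versions of client 1234 that would no longer be wholly contained within
-- the client's effectivity period if the client's end date shrank to 2007-05-01.
SELECT p.policy_eid, p.eff_begin_dt, p.eff_end_dt
FROM   Policy_Version p
WHERE  p.client_eid = 1234
  AND  p.eff_end_dt > DATE '2007-05-01';
-- In the example, the policy version ending 12/31/2007 is returned, so the
-- proposed temporal update violates TRI.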


This situation, a violation of TRI, is shown in Figure 2. The period of time during which TRI is violated is indicated by a red block.

Figure 2: A TRI Violation

This is an example of a change to the effectivity period of a parent object. If that effectivity period "shrinks," TRI checking is triggered on all its dependents to see that they are not "temporally stranded" by the shrinkage. There is obviously a correlative process when an update changes the effectivity period of a child object. In that case, if the effectivity period "expands," TRI checking is triggered against all of its parents, to see that the child does not "temporally strand" itself by extending its effectivity range beyond that of one of the objects it is dependent on.

On original updates to versioned objects, then, TRI checks must be done whenever there are TRI dependencies and effectivity periods are changed. Otherwise, original updates to versioned objects do not trigger TRI checks, and thus do not incur any TRI performance penalty.

There is one more situation in which a temporal update may trigger TRI checking. That is when a {null} OFK in the updated object takes on a real value, or when an existing OFK value is changed. In either case, TRI checking is required to ensure that the effectivity timespan of a version of the newly referenced object wholly encloses the effectivity timespan of the object whose OFK was updated.

With conventional RI, constraints exist to ensure that after any database update, there are no foreign keys with "dangling references," values for which there is no corresponding primary key in the referenced table. With TRI, constraints exist to ensure that after any database update, there are no object foreign keys with "dangling references," values for which there is no row in the referenced table for that object whose effectivity timespan wholly encloses the effectivity timespan of the referencing object.

New Terms Added to the Glossary

Adjacent versions. Closed episode. Dangling reference. Expanding a time period. Flushing the virtual batch file. Open episode. Shrinking a time period. Temporally stranded objects. TRI (as an acronym for "temporal referential integrity"). TRI mandatory relationship. TRI optional relationship. Virtual batch (transaction) file.

Our discussion of TRI, except for a consideration of how future versions are affected by it, is nearly done. Next time, we will discuss TRI for the temporal transactions resulting from original inserts and original upserts. Then we will go on to extend our discussion to consider the effect of TRI
constraints on future versions of objects. As we shall see, those considerations are not insignificant.


Time and Time Again: Managing Time in Relational Databases, Part 25 - Original and Temporal Inserts
Tom Johnston, Randall Weis
DM Direct, May 2, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com. This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.

One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines. Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. Here is what we've done so far. (See PDF below for Figure 1: Chart of Installments to Date in this Series.)

Beginning with Part 19, we interrupted our presentation of the physical implementation of the versioning patterns, which we described earlier in the series, and began a discussion of temporal integrity constraints. On completing Part 24, we had introduced and defined temporal RI and related concepts, and discussed these constraints as they apply to delete and to update transactions. In this and the following article, we turn to temporal integrity constraints as they apply to insert and to upsert transactions.

The "ultimate" versioning pattern we are working toward is what we, following the computer science community, call a "bi-temporal" pattern. In Parts 19 through 25, we have been discussing both temporal entity integrity and temporal referential integrity constraints. But it is important to note that these discussions have taken place in the context of a "uni-temporal" pattern. The only time period considered has been an effective time period, and the only dates an effective begin
and effective end date. As we have seen, integrity constraints in a uni-temporal versioning context are far from simple. As we will see, integrity constraints in a bi-temporal versioning context are considerably more complex than that.

Inserting a Versioned Object: the Original Insert Transaction

To the business, when they issue an insert of an object, they don't care whether the object is versioned, and indeed may not know. They are directing us to insert a row representing an object which is not already represented by a row in the target table. Thus, an original insert transaction doesn't need to specify a version, just the object being inserted, its business key, whether or not match logic should be applied, and the date the insert is to take effect. This is illustrated in Figure 2.

As we have pointed out before, we are using dates throughout these articles only for convenience. The approach to versioning which we are presenting here applies no matter what the granularity of the clock ticks that measure out time. The granularity used here would be correct only for transactions taking place in batch mode and applied at most once per day. As soon as online, individually applied transactions are involved, we would need a finer granularity, such as a full timestamp.
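As with the original update transaction, a rough, hedged illustration may help; the staging table and column names below are hypothetical, not part of the patterns themselves.

-- Hypothetical staging row for an original insert: the object's business key, a
-- flag saying whether match logic should be applied, the date the insert is to
-- take effect, and the business data; no version and no EID are supplied.
INSERT INTO Original_Insert_Txn (client_business_key, match_flag, eff_begin_dt, client_name)
VALUES ('ACME-00417', 'Y', DATE '2008-05-02', 'Acme Holdings');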

Figure 2: An Original Insert Transaction

Before we begin to examine an original insert transaction in detail, let's look specifically at the "Match? (Y/N)" flag in Figure 2.

Inserting a Versioned Object: Haven't We Met Before?

Objects of interest to an enterprise frequently appear, go away, and then reappear later on. One response to the fact of recurrence is to ignore it. When the object reappears, it is treated as a new object, and no attempt is made to match it up with the history we have for it.

But enterprises generally prefer to recognize a recurring object rather than treat it as a new one. For example, if the object is a customer, and we don't attempt to determine whether she used to be a customer of ours, we won't be able to greet her and make her feel special, like a long-lost friend. We won't be able to say, "Welcome back. Glad to have you as a customer once again." We will also be unable to use the history of transactions we have with this customer to manage the reestablished relationship more intelligently. We know the kinds of things she likes to purchase from us. We may even know the kinds of advertising she is most responsive to. But unless we can recognize her as a returning customer, we will lose the competitive advantage that this past history can give us.

So the preferred response, if our target table is a versioned table, would be to look for a match of the object being inserted with a past version. Assuming that the primary keys of rows in the target table have not been back-propagated to their populating source systems, we must look for a match using the source system business keys of the objects. If we find a match, we will use the target table EID we found as the EID for the insert. This means, for a versioned target table, that it will be used as the EID for the first version of a new episode of the object. With this EID firmly
attached to all the versions of all the episodes of the object, we retain the information we can glean from past episodes. We can say "Hi, and welcome back" to our returning customer. We will already know what she likes, and what she doesn't like.

But notice the [Match? (Y/N)] flag on the original transaction. Why do we need it? If we keep the business key on our versioned tables, and a business key is provided on the transaction, why can't we automatically check for a match? In a perfect world, with objects that are assigned unique business keys, we would not need this match flag. But in the imperfect world of often dirty data that we IT professionals must manage, sometimes we can't rely on the business key. Sometimes we must even allow duplicate business keys, or keys with default values or {null}s in one or more columns, into our target tables.

For example, two or more independent source systems might provide insert, update and delete transactions to the same target table. The same business key might be used in two of those systems, and the two rows thus designated might or might not represent the same object. One way we could find ourselves in this situation is if the two source systems were, up to now, managed rather "loosely," as systems for quasi-independent organizations. Mergers and acquisitions often start off with this "loosely coupled" approach because it is much quicker and much easier to bring online than full integration. But when the customer bases of the acquiring and the acquired companies overlap, as they often will, it is inefficient to manage the customers in common as different customers. It is often an inefficiency visible to those customers as well, whose impressions of our company will certainly not be enhanced by this annoying inability of ours to recognize them as one and the same customer.

Another example would be overlapping product lines across the manufacturing plants of an acquiring and an acquired company. Here, too, there is a loosely coupled and a tightly coupled approach. Here, too, moving from the former to the latter will likely uncover latent business key collisions and make them fully explicit.

These M&A scenarios are often exacerbated by the not infrequent practice of using a system-generated sequence number as the last column of a business key. When this happens, the odds of identical cross-system business keys representing different objects quickly escalate. For example, if each of two systems has three business keys, each set identical except for the sequence number portion, and both sets identical in the non-sequence number portion, and if we assume that there are indeed exactly three customers involved, the odds of matching them correctly across the two systems are nine to one against!

Another way business keys can be compromised is when a source system uses a surrogate key, one which, for example, recycles every couple of years. By the time it recycles, all reference to the object it originally represented will have been archived out of that system. But if our target is a versioned table with a longer online life cycle than the source table, then when the reused surrogate appears, it may still exist in the target table as an identifier for the previous object it designated.
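Here is a minimal sketch of the match lookup described above, using the same hypothetical names (Client_Version, client_eid) plus a hypothetical client_business_key column; real matching heuristics would, of course, be richer than a single equality test.

-- Apply match logic only when the original transaction says to (Match? = 'Y').
-- If a past version carries the same source system business key, reuse its EID
-- as the EID for the first version of the new episode; otherwise assign a new EID.
SELECT DISTINCT client_eid
FROM   Client_Version
WHERE  client_business_key = 'ACME-00417';
-- no rows   -> treat as a brand-new object and generate a new EID
-- one EID   -> reuse it for the new episode
-- many EIDs -> possible synonyms already present; route to data stewardship rather than guessing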
In these ways, or in any of the other ways in which dirty data can make its way into the very business keys that identify the objects we are interested in, if we automatically search for a match on original insert transactions, we may inadvertently create homonyms, single rows (or single EIDs) that represent more than one object. For this reason, original inserts must indicate whether or not match logic should be applied to them. But because the same semantics often need to be provided even when the target table is not versioned, requiring the match flag does not "tip our hand" and
indicate that the target is a versioned table. The distinction between versioned and conventional tables remains hidden from the users and IT personnel who provide the source transactions.

Unreliable Business Keys: Why Assume a No-Match?

If we always assume a no-match when dealing with unreliable business keys, the effect is to turn every source update transaction into an insert. The result quickly becomes a situation in which there are a large number of rows for each object that is frequently updated. In business IT, we look for ways to keep these numbers down, but it's not our purpose to describe those heuristics here.

But one could ask why the assumption must be that a match of unreliable business keys is a no-match. Perhaps application-specific circumstances mean that the odds are nine out of ten that such matching keys do in fact represent the same object. In that case, why not assume that they do? The reason, as with all reasons in business IT, is the impact on the bottom line. In many cases, the cost of creating one mistaken homonym is higher than the cost of creating nine, or even a hundred, synonyms. Here, "homonym" refers to one row representing multiple objects, and "synonym" refers to multiple rows representing the same object. In these terms, the question, "Why assume a no-match?" can be rephrased as, "Why assume that homonyms are more costly than synonyms?" And the answer is that in general, they are. Here's why.

In the case of synonyms, when we discover one, we have to do what is called a "file merge." If two Client table rows are discovered to represent the same client, for example, we must replace them with a single row. As for source transactions, we must change the match logic so that transactions hitherto directed to the two different targets are directed to the one new target which replaced them. As for RI dependencies, we must take all the children RI-dependent on either of the two original rows and change their foreign keys to point back to the one new row which replaced them. But notice that all these changes can be done with code. There is no need for human judgment in the revised match logic, or in the new RI dependencies.

But such is not the case when we are dealing with homonyms. When we discover a homonym, we must do what is called a "file split." Say we split one Client table row into two rows. The next time update transactions appear that had both been applied to the original row, how do we determine which of the two new clients to direct each one to? By the very fact that the homonym existed to begin with, we know that the source systems have thus far failed to make the distinction. So the match logic must be updated to make a discrimination it previously did not, or could not, make. As for RI dependencies, with a single parent client row replaced by two (or more) client rows, how are we to know which new parent to redirect each RI child to? Occasionally, the means to make these new discriminations will exist in the data, and so the new logic can be fully automated. But far more frequently, those means do not exist in the data. In that case, file splits can only be done if we can devote human resources to the one-time task of fixing foreign keys, and to the recurring task of augmenting match logic to make discriminations that cannot be made on the basis of the data alone.

In short, doing less discrimination work is easy; doing more is hard. That is why homonyms cost more than synonyms.
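Since a file merge can be done entirely with code, a minimal sketch is easy to give. The names are hypothetical, we assume the row with EID 1234 survives and EID 5678 is the discovered synonym, and a real procedure would also repoint the source-to-target match logic and deal with the duplicate's own versions.

-- Repoint the RI-dependent children of the synonym EID to the surviving EID.
-- (If a brand-new row/EID replaced both originals, the same would be done for
-- the children of the other original row as well.)
UPDATE Policy_Version
SET    client_eid = 1234
WHERE  client_eid = 5678;
-- The duplicate client's own versions would then be superseded or archived, and
-- the match logic updated so future transactions are directed to EID 1234.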
And that, finally, is why in the absence of cost/benefit data to the contrary,
we should handle unreliable source system business keys by permitting duplicate rows into the target tables. Sometimes, of course, a cost/benefit analysis in a specific situation will indicate that it is better to create the occasional homonym than it is to create a flurry of synonyms. But the default assumption always has to be that homonyms are to be avoided, and synonyms reluctantly tolerated.

We have spent this much time on the business key match problem because it is so important. In the past, the problem was often ignored, and each doubtful situation was treated as an insert, as a new-object situation. The reason was always the difficulty of producing reliable matches in the presence of match criteria data of poor quality, data which often came from outside our enterprise and so was beyond our ability to clean up, or in the absence of common match criteria across different source systems. But throwing up our hands and populating target tables with an unknown and possibly significant number of duplicate rows, rows which represent the same object, is increasingly unacceptable. Increasingly, businesses believe that the high cost of solving the problem, or at least of reducing its impact, is less than the cost of living with it.

We have several times used the example of merging customer files in the presence of unreliable source system keys. We chose this example because the high cost of living with un-merged customer data is most apparent in customer-facing systems. We will have more to say about a general approach to keeping the cost of duplicate-containing tables to a minimum in a later set of articles. We think there is something of value to describe that will apply to nearly all situations in which unreliable business keys must be matched. While this is not a problem exclusive to versioned tables, neither is it a problem from which versioned tables are exempt.

In the meantime, we will turn our attention back to inserts and upserts to temporal tables in our next article.


Time and Time Again - Managing Time in Relational Databases, Part 26: Points in Time and Periods of Time
Tom Johnston, Randall Weis
DM Direct, May 16, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com. This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.

One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines. Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. Figure 1 reviews what we've done so far.

Figure 1: Chart of Installments to Date in the Series (See PDF link at the end of the article.)

Last time, we began looking at original and temporal insert transactions but immediately digressed into a discussion of the problems that can be encountered when matching transactions to target tables. We planned to discuss both temporal entity integrity and temporal referential integrity as they apply to insert transactions in this installment. But we discovered that those discussions require another digression, this time into how we use pairs of dates to represent periods of time. We also discovered that we have vacillated in this series between two methods of using pairs of dates to represent periods of time, methods which are called the "closed-closed" and "closed-open" approaches. Two other methods are possible: "open-closed" and "open-open." But we won't enter into an extensive discussion of all four methods, because that has already been covered by Snodgrass and by Darwen and Lorentzos [1, 2].


There is both a practical and a theoretical side to the question of how to represent time periods with dates. On the theoretical side, dates (or datetimes, or timestamps) are discrete points along a continuous timeline. And as Zeno first discovered, and Leibniz and Newton later formalized, the relationship between the continuous and the discrete is problematic, equally so for time as for space.

Points in Time and Periods of Time: Practical Issues

Consider the following two versions, V1 and V2, of the same object. Let us suppose that there is no gap in time between them, i.e., no gap along the "effectivity timeline." How are we to represent this? Two of the possibilities, the two specific ones we have vacillated between, are shown in Figures 2 and 3.
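The two figures amount to two ways of writing the same pair of periods. Using the dates given in the next paragraph, and assuming (as stated there) a clock tick granularity of one day, they come to the following date pairs; the closed-open end dates are our inference from the discussion that follows.

        begin date    end date (closed-closed)    end date (closed-open)
V1      2/19/06       5/22/07                     5/23/07
V2      5/23/07       10/14/09                    10/15/09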

We also assume, in these examples, that the clock tick granularity is one day. In both cases, the first effective time period starts on 2/19/06 and ends on 5/22/07, and the second starts on 5/23/07 and ends on 10/14/09.

So it might seem that the closed-closed representation is the correct one, and that the closed-open representation is simply wrong. We could try to illustrate the wrongness of the closed-open representation by querying for the time period that contains the date 5/23/07. On the closed-open representation, can we tell which version's time period is the desired one? Yes, we can. We just need the following WHERE clause in our SQL query:

WHERE EFFECTIVE_BEGIN_DATE <= '5/23/07'
  AND '5/23/07' < EFFECTIVE_END_DATE

With this clause, the query will correctly pick out V2. So why might we have thought that the closed-open representation is wrong? Probably because we had the mental image of our desired date being between the begin and end dates of the versions. But "between" as we ordinarily understand it is not "BETWEEN" as SQL understands it. We would not have gotten confused, of course, if we had used the closed-closed approach. In that case, we could have written:
WHERE '5/23/07' BETWEEN EFFECTIVE_BEGIN_DATE AND EFFECTIVE_END_DATE

But because both clauses return the correct result, provided each is used with its corresponding method of representing periods of time, both methods are equally correct. So in our own vacillation between these two methods, it was this consideration which led us to (sometimes) describe the closed-closed method as the one we preferred.

What, then, is the advantage of using the closed-open method? Well, look again at Figures 2 and 3. In both cases, V1 and V2 are contiguous. We know this because we have set up the example on the assumption that there is no gap in time between them. With the closed-open representation, however, we can instantly see that there is no such gap. But with the closed-closed representation, we would also need the information that the clock tick granularity being used is a granularity of one day.

The real problem with the closed-closed representation, however, is not what is or is not immediately clear to us human beings. It is the relative complexity of the code which will often be called upon to determine, of two consecutive versions, whether or not they are contiguous, i.e., whether or not there is a clock tick between them. With a closed-closed representation, that code will also have to know what the clock tick granularity is. So in our own vacillation between these two methods, it was this consideration which led us to (sometimes) describe the closed-open method as the one we preferred.

Once the SQL standards groups can agree on temporal extensions to the standard, part of that agreement will certainly be a way of representing time periods directly, without relying on the confusing circumlocutions of various ways to use pairs of dates to represent time periods. But our concern, in these articles, is with today's SQL, and so we must choose a date-pair method of representation. Therefore, from this point forward, unless we discover other considerations which would dramatically tip the scale the other way, we will use the closed-open representation of time periods.

Points in Time and Periods of Time: Theoretical Issues

During our "closed-closed" phase, some time ago, one of our readers wrote us to say that closed-closed was simply wrong for what we will call "theoretical" reasons. He wrote:

A time period most definitely does not end one clock tick prior to the end date. It ends at precisely the time specified as the end date... If the period ends at 12:10:15 then when the clock strikes 12:10:15 the period is over THEN and the next period begins precisely THEN. At that instant... If you don't do it that way then you are saying that date/time fields that are used for the beginning times are interpreted as meaning the beginning of the stated time period and date/time fields used for ending times are interpreted a DIFFERENT way, namely as indicating the end of the stated time period. One can't do things that way, i.e., changing the definition of a data type based on the context.

This reader also has an argument that is practical in nature, namely that with the closed-open method, date arithmetic will always involve a "minus 1" clause that would not be required on the closed-closed approach. But as we have seen, there are practical concerns no matter how we
choose to use date-pairs to represent time periods, and these concerns are therefore not conclusive.

Our first response to our reader's theoretical argument, briefly, is this. The closed-open approach does not "chang(e) the definition of a data type." If we changed the definition of a data type depending on which columns we applied it to, then the DBMS or programming language that accessed those columns would not be able to handle them correctly. So we are not changing the definition of the data type. The issue, instead, is what we take each date to mean. And as long as we write our SQL carefully (see the WHERE clauses above), we will get the correct results no matter which date-pair interpretation we choose.

Our second response is that decades of computer science research have shown that we can and should represent the flow of time as a succession of atomic clock ticks, ticks which may be of any actual length of time, such as a day, a nanosecond, etc. But no one ever has, and no one ever will, invent a clock so precise that there is no time at all between its ticks. Only if we had such a clock could we use clock ticks to represent time periods in a manner that conforms to the intuitions of this reader, namely that "If the period ends at 12:10:15 then when the clock strikes 12:10:15 the period is over THEN and the next period begins precisely THEN. At that instant." This interpretation creates an ambiguity in the use of date-pairs to delimit time periods that can never be resolved. For on this interpretation, two adjacent versions overlap at exactly that tick of the clock. On that tick of the clock, this interpretation requires us to say that both V1 and V2 are in effect.

Without going into any detail about the theoretical underpinnings of these arguments, we can make the following points. First, at the Planck scale, both space and time are, as far as we can measure them, discrete, with a unit of Planck time (5.3906 x 10^-44 seconds) being the time it takes for a photon of light to travel the Planck distance (1.6160 x 10^-35 meters). But this is too theoretically remote to be relevant to a discussion of time as far as IT is concerned. However, it does cast considerable doubt, obviously, on the hope that we will ever have a clock whose ticks do not take up a finite period of time.

Second, quantum physics considerations aside, both space and time are continuous, but neither points (in space) nor instants (points in time) are. The ensuing conceptual difficulties were first noted by Zeno, who described them in his paradox of Achilles and the tortoise. The problem of correctly representing something continuous by means of something discrete was ultimately solved, for space at least, by Leibniz and Newton, and their concept of a limit applies equally well to time. But this is still too remote for its relevance to managing time in relational databases to be apparent. However, for a good basic discussion of these issues, we recommend Sowa [3].

Finally, though, the concepts necessary to correctly manage the relationships among time periods represented as pairs of dates were originally formulated by James F. Allen in the mid-1980s. This reference, and an excellent exposition of this material, is contained in Snodgrass [2] and in Darwen and Lorentzos [1]. Both contain references to the earliest of the three articles in which Allen explained his concepts.

With our digressions completed, we will proceed to a discussion of original and temporal inserts in
our next article, and will explain how the temporal correlates of entity integrity and referential integrity apply to them.

References:

1. C. J. Date, Hugh Darwen, Nikos Lorentzos. Temporal Data and the Relational Model. Morgan-Kaufmann, 2002, pp. 90-94.
2. Snodgrass, R. T. Developing Time-Oriented Database Applications in SQL. Morgan-Kaufmann, 2000.
3. Sowa, John F. Knowledge Representation. Brooks-Cole, Pacific Grove, CA, 2000, pp. 103-124.
