Extensions to Slowly Changing Dimension Types

One of the bedrock concepts in dimensional modelling is that of “Slowly Changing Dimensions”, which are structures and loading techniques that allow the system to account for time-variant relationships between attributes in a hierarchy. For example, tracking changes in the address of a customer as his wife’s employer forces them to move from Colorado Springs to Arlington VA. Ahem.

There are three commonly recognised types:

  • Type 1: Dishonestly pretend that he always lived in Arlington by overwriting the “Colorado Springs” attribute in his customer record.
  • Type 2: Recognise that this is now the third time that said customer has had to move by preserving records of his previous residences in London, Dayton OH, Colorado Springs and Arlington. A different customer line is created in the dimension table for each residence.
  • Type 3: Hide part of the truth by acknowleding his previous residence in Colorado Springs as a different attribute from the current address, but not acknowledging previous inconveniences.

This characteristic is not applicable solely at the dimension level. Individual attributes within a dimension can have different SCD types, so you might choose to make “Marital Status” a Type 1 attribute and “City of Residence” a Type 3.

While SCD Type 2 is the one that preserves the highest level of historical accuracy it has one failing unique among the three types — retrieval of complete set of historical facts history for customers currently resident in Arlington requires a complex query structure and generally leads to poor performance.

This is a situation handled by a hybrid type of SCD that I have been using for the past eight years, and which I was very happy to find documented by Margy Ross and Ralph Kimball in a typically useful article at Intelligent Enterprise Magazine. This hybrid technique, which you will find in that article under the heading Mini Dimension with “Current” Overwrite, breaks out rapidly changing attributes into a new dimension in which we identify the set of attribute values in effect at the time that the fact event took place (sales transaction date, for example).

So in the case of our inconvenienced customer we would store the current “City of Residence” in the customer dimension table and create another “Customer City of Residence” dimension joined as a new attribute directly to the fact table. The latter dimension would indicate the city of residence for the customer at the time that the event (eg. a sales transaction) took place.

I’ve used this in the context of preserving the history of who is responsible for managing items (and for various other changeable attributes of the 200,000-ish items catalogued), and the “Current Manager” attribute of the item dimension and the independent “Historical Manager” dimension are aimed at solving different business problems.

  • In one case an item manager can view the sales (or inventory, or backorder etc) history of the items that they currently manage. This is an important tool for forecasting future trends by extrapolating from historical values, and also tends to support tactical analysis.

  • In the other case the sales and inventory valuation can be historically assigned to cost centers to which the item managers belong — more of a strategic organisation-level view.

The latter tends to be a more performant report, interestingly enough, because the lower cardinality of the foreign key column in the fact table (“item manager”, rather than “item”) makes precalculated result sets (materialized views, if you like that sort of thing) more compact and thus more likely to be implemented, to be honest.

Advertisements

8 thoughts on “Extensions to Slowly Changing Dimension Types

  1. An interesting post but I’m not sure I agree…

    You say SCD2 has one failing yet quoted two – more complex query and generally poor performance. As far as more complex query goes I think that’s debatable – adding a predicate to identify the record from the SCD2 that you want is as simple as:

    WHERE TO_DATE(’01-DEC-2006′,’DD-MON-YYYY’) BETWEEN from_date AND to_date

    …isn’t it ? I wouldn’t say that was complex…well, if it is, then I’m revoking SQL access from all my users with immediate effect!

    As for poor performance, well, yes, possibly but what are you comparing the performance to in order to say it’s poor? It obviously is poorer than SCD1 or 3 because it has, in all liklihood, more rows/data to read through…but the degree of that will depend on the volatility of the dimension in question. If you mean poor because the BETWEEN gives the CBO grief in determining the right plan and you get a suboptimal plan as a result then that’s also possible but more of a tuning issue…get the plan right and then the only difference is the volume of rows read.

    Also, I’m assuming that you’d need to put the historical changes for any attribute that is changing into a mini dimension otherwise when you get a brand new fact from a new source and want those historical values, you’d have lost them right?

    How did you get the value of the Customer City of Residence that was in place at the time (transaction_date) of the Fact – you looked it up using the BETWEEN predicate from above right? So where’s the difference between that and the standard SCD2?

    I’m rambling here…it just doesn’t seem to offer anything over the standard SCD2 from where I’m sitting…which means one of a number of things:

    1. I’ve not understood your post / the approach being outlined
    2. I’m right and there is no benefit from the approach being outlined over an SCD2
    3. 1 or 2 may both be right depending on the scenario, business requirements, volatility etc…

    That’s it…my brain has finally shutdown and started the weekend so I better shut up now!

    ;-)

  2. Jeff,

    The failing of the SCD2 is in answering questions of the form, “What is the sales history for the customers currently resident in Arlington” — ie. ignore their place of residence at the time of the sale in favour of their place of residence now. In a pure SCD2 this requires a query of the form …

    where dim.customer_natural_key_id in
    (select customer_natural_key
    from dim_customer
    where city = ‘Arlington’
    and is_current_record = ‘Y’)

    There’s a lot of variations on that of course, in particular in the method chosen for finding the current record for the customers. They’re all more complex than queries against SCD1 or SCD3 types, of course, and tend to make the cardinality estimation less reliable (although list-partitioning the dim table on “is_current_record” helps in solving that problem, even if it might cause others).

    You might choose to keep the city as an SCD2 but also add another SCD1 column (or SCD3 columns maybe) to flag every historical customer record with its current city — that would be another neat-ish solution to keeping the query simple and would give you excellent performance for queries based on comparison of current and historical values.

    Whether the historical value is to be maintained as SCD2 or pushed to the fact table also depends, I suggest, on it’s importance as a filtering condition. If the user community is going to be performing a lot of “Historical City = Colorado Springs”-type queries then there’s a lot to be said for avoiding the query path through the customer dimension. With the historical city stored directly in the fact table, either as a natural or as a synthetic key (the ablity to use a natural key depending heavily on whether fact table rows can be compressed on insert or not), the table can be partitioned on that attribute in a way that is impractical/flaky when attempted on a parent attribute of the customer. It is also a darned sight easier to maintain a summary table on it, as I mentioned before.

    Finding the value in force at the time of a transaction is generally best done through a factless fact table built on the customer – at least, that’s how it was done in the example of the item manager I quoted (the correct historical value for a transaction is often based on that value in force several months before, as a single transaction often generates followup transactions in the source system several months later and the follow-ups have the same original transaction date as the master transaction that they’re based on).

    Anyhoo, in the absence of maintaining an SCD2 on the customer city this catalogung is the only way of retrieving accurate history, but there’s also a lot to be said for maintaining a customer catalogue in the staging area (star or even 3NF, dare I say) for handling these issues anyway. In the case of the items it was maintained in the production area as a regular star, along with all of the different price components and inumerable other attributes.

    10:45am here — I’m guessing 6:45pm where you are. I can’t believe I have to wait an extra eight hours ’till the weekend.

  3. Firstly, sorry for tardiness of response…no internet connectivity at home due to the incompetence of British Telecom but that’s another story!

    OK – I accept that the SCD2 query is more complex than SCD1 or SCD3…but I wouldn’t say it was complex per se – but then I do write SQL all day long so my perspective may be coloured there. Given the more complex nature of the query, it can have effects on the optimizer, particularly as you say, with cardinality, although as you also say, there are ways to address this and other issues via partitioning etc.

    The thing is, for each dimension, either one needs the history or one doesn’t…yes, maybe an SCD2 requires a more complex SQL structure and can have performance issues…but if one needs the history so that one can do reporting “as of” a specific time then one either needs a “pure” SCD2 or one needs enough pivoted history attributes on the SCD3 to answer the question posed – I’ve never liked SCD3 personally as it’s a compromise based on “knowing” what one would ever need from the data – personally I go with either SCD1 or SCD2 as one day the limitations of one’s particular SCD3 choice will eventually come back to haunt one.

    The DW I’ve been involved with lately uses a mixture of SCD1 and SCD2 but for the SCD2s it tracks the history of the whole record – not just a mini dimension of the “volatile” attributes. The data architecture approach has been to create a Historical Data Store with “SCD2 esque” archive tables tracking changes on all the incoming sources together with a Business Data Model in 3NF which integrates all these sources in one temporal business model. An Analytical Modelling Layer sits on top of the BDM to provide Star schemae and other pivots of derived measures as well as other performance enhancing objects such as MVs.

    Now that I’ve read (again, briefly) the Kimball article, I can see that in our BDM we mainly use a similar approach to the mini dimension with current overwrite except our Primary Dimension holds only the warehouse SID to legacy key transformation and no other “SCD1” attributes, whilst our equivalent to the Profile Mini Dimension holds all of the attributes, not just the ones which are deemed to be “volatile” and in need of SCD2ing. This leads to some reasonably large dimensions in our BDM but means we have every incarnation of every record and can answer questions as of any time – unlike SCD1/3 approaches. In cases where the users haven’t thought of all the questions they want to ask yet, I think this is the safest route – if a little heavier on performance and space. So far it’s proving it’s worth.

  4. Well I’m certainly sympathetic to your views there Jeff. You’re of course right that the complexioty of accessing all SCD2 records by current attributes is low in absolute terms — I wouldn’t want to hire anyone who couldn’t do it, for example — but in terms of encoding the relationship into a BI tool in a robust manner, any additional complexity (and complexity of the subquery type in particular) is always unwelcome.

    +1 to your comments on SCD3 — I’ve never used it, and it has always seemed like something a little esoteric to me. Whether you save the previous version of an attribute (the essential difference between SCD1 and SCD3) seems like something that can be applied to SCD2 as well — saving contemporary value and previous value in the same record. If previous values are usful in an SCD1 environment then why not use them in SCD2 as well? What would that be called … SCD4?

    So I don’t like the classification system as it stands because it fails to take into account a lot of the possibilities … preserving history, preserving current values, preserving previous values, preserving next values etc.. A lot of these concepts are “orthogonal” to each other and can be applied independently to different situations.

  5. Hi David,
    Just came across your site — lots of great reading. I have a question for you: why bother with a sub-select to find the current city of residence in an SCD2, as in your example:

    where dim.customer_natural_key_id in
    (select customer_natural_key
    from dim_customer
    where city = ‘Arlington’
    and is_current_record = ‘Y’)

    Assuming a dimensional (star) data model with fully bitmapped dimension tables, I would simply do:

    select
    time.c2, time.c2,
    cust.c1, cust.c2,
    f.c1, f.c2, etc.
    from
    sales_fact fact,
    dim_calendar time,
    dim_customer cust
    where fact.time_id = time.time_id
    and fact.customer_id = cust.customer_id
    and time.period_date between {startdate} and {enddate}
    and cust.city = ‘Arlington’
    and cust.is_current_record = ‘Y’

    Of course, the city and is_current_record columns would have bitmap indexes on them, Oracle would do a bitmap index scan on both followed by a Bitmap Merge, and complete the rest of its star transformation. The Oracle 9.2.0.4 Optimizer on HP-UX does this all the time in our data warehouse when combining multiple attributes for a dimension.

    Just wanted to pose that to you and see if I could learn something in the process. Many thanks and best regards,
    Terry

  6. The difference there, Terry, is that we want to find all records for customers whose current city is Arlington, regardless of what their city was at the time of the sales event. There would be multiple records for a customer, with different customer_id’s and each one potentially representing a different city of residence.

  7. Ah ha, of course David. Thanks — I missed the obvious there! I’ll have to run a test or two with a sub-select like to see if, in our environment, I see any change in the execution plan. I’ll get back to you on it.

    Best of luck at XM — hoping you are enjoying it there. I start a new job at Raytheon after Memorial Day, after 8 years at EDS. I’m really looking forward to it, just hoping that it will turn out as great as I am hoping. Ciao.

  8. Ah, Raytheon, eh? I applied for a senior data warehouse architect job with them in Alexandria, VA earlier this year, but never got anywhere with it — one of those applications that drops into the void and you never hear anything back. C’est la vie. Now if only I could find a way of unsubscribing from their job site …

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s