Holding On to the Past
One of the most compelling reasons that organizations have for warehousing their data - for hanging on to their data at all - is the ability to use it to predict what might happen in the future. How will customer behavior change if prices increase by 5 percent? What time of year will we likely see a dip or a surge in sales? What are the telltale signs that a customer is preparing to cancel a subscription?
But, of course, in order to have any idea about the future we must first look to the past. That’s the concept behind holding on to data in the first place, but there’s a second level that matters here: keeping the earlier versions of the records that describe our customers, products, and locations. This is commonly referred to as historical dimensional data - or Type 2 data.
Holding on to Type 2 data is critical for a host of applications because it allows us to distinguish between the present state of a particular person, place, product, or other dimension and past versions of that same entity. For example, while I may live in Ohio now, there was a time when I lived in New Jersey. This information - what my address was in New Jersey, and over what dates it was correct - lets us analyze the past properly, which in turn lets us make more accurate predictions for the future.
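To make the idea concrete, here is a minimal sketch of what a pair of Type 2 records for that address example might look like, along with an "as of" lookup. The field names, the person ID, and the specific dates are all my own assumptions for illustration; a real warehouse would define its own columns and conventions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AddressVersion:
    person_id: int                 # the base key, shared by every version
    state: str
    effective_from: date
    effective_to: Optional[date]   # None marks the current version


# Hypothetical history: the dates are invented for illustration.
history = [
    AddressVersion(1, "New Jersey", date(2005, 1, 1), date(2012, 6, 30)),
    AddressVersion(1, "Ohio", date(2012, 7, 1), None),
]


def state_as_of(versions, person_id, on):
    """Return the state that was in effect for person_id on the given date."""
    for v in versions:
        if v.person_id != person_id:
            continue
        still_open = v.effective_to is None or on <= v.effective_to
        if v.effective_from <= on and still_open:
            return v.state
    return None  # no version covers that date


past_state = state_as_of(history, 1, date(2010, 3, 15))
current_state = state_as_of(history, 1, date(2020, 3, 15))
print(past_state, "then,", current_state, "now")
```

The point of the sketch is only that both versions share the same base key, so any fact dated in 2010 can be matched to the New Jersey record rather than to wherever I happen to live today.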
Most of my clients, even if they’re not already familiar with the distinction between Type 1 data (where a change simply overwrites the old value) and Type 2 data, easily grasp the idea of hanging on to historical data and supplying effective dates for each record. What tends to take a bit more effort to convey is the range of potential use cases for the data once it’s been collected that way.
I tend to respond with the skim milk example.
If you’re like me, you grew up seeing an item called skim milk on the shelves of the supermarket’s refrigerated dairy section. There was whole milk, then 2% milk, then 1% milk, and then skim milk. In recent years, however, there’s been a trend to rename skim milk as fat free milk. Same product, different name. From a Type 2 data perspective, it’s the same product - the same base key (maybe a UPC code) - with two different Type 2 records: one in effect up through, say, 2016, with the product name skim milk, and another in effect from 2017 on, with the product name fat free milk.
Now, because we have both records, and they’re both tied to the same UPC over different time periods, we’re able to do some analysis that is independent of the name: for example, the trend of overall purchases, year over year, or the popularity of the product relative to 1% milk over time.
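A name-independent analysis like this only needs the base key. The sketch below totals units per year for one UPC and compares it against a second UPC. The UPC values and the sales figures are invented purely for illustration, and the function names are my own.

```python
from collections import defaultdict
from datetime import date

# Hypothetical UPCs and made-up sales rows: (upc, sale_date, units).
SKIM_UPC = "000000000001"
ONE_PCT_UPC = "000000000002"

sales = [
    (SKIM_UPC, date(2015, 5, 1), 100),
    (SKIM_UPC, date(2016, 5, 1), 90),
    (SKIM_UPC, date(2017, 5, 1), 110),
    (ONE_PCT_UPC, date(2015, 5, 1), 200),
    (ONE_PCT_UPC, date(2016, 5, 1), 200),
    (ONE_PCT_UPC, date(2017, 5, 1), 220),
]


def units_by_year(rows, upc):
    """Total units sold per calendar year for one UPC, whatever it was called."""
    totals = defaultdict(int)
    for row_upc, sale_date, units in rows:
        if row_upc == upc:
            totals[sale_date.year] += units
    return dict(totals)


def relative_popularity(rows, upc, baseline_upc):
    """Units of upc divided by units of baseline_upc, year by year."""
    ours = units_by_year(rows, upc)
    base = units_by_year(rows, baseline_upc)
    # Skip years where the baseline has no sales to avoid dividing by zero.
    return {year: ours[year] / base[year] for year in sorted(ours) if base.get(year)}


skim_trend = units_by_year(sales, SKIM_UPC)
skim_vs_one_pct = relative_popularity(sales, SKIM_UPC, ONE_PCT_UPC)
print(skim_trend)
print(skim_vs_one_pct)
```

Notice that the product name never appears in the calculation. Because the UPC stays constant across the rename, the trend line runs straight through the year the label changed.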
But we’re also able to compare how well the product sold when it was skim milk versus when it was fat free milk. We can see whether customers who bought skim milk continued to buy fat free milk, or whether the new name attracted a different contingent of buyers.
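Here is one way the name-based comparison might be sketched: each sale is matched to whichever Type 2 record was in effect on the sale date, and the results are grouped by the name found there. The UPC, customer IDs, and quantities are invented for illustration, and I use `date.min` as a stand-in for an unknown start date.

```python
from collections import defaultdict
from datetime import date

UPC = "000000000001"  # hypothetical base key

# Type 2 records: (upc, name, effective_from, effective_to); None = current.
product_versions = [
    (UPC, "skim milk", date.min, date(2016, 12, 31)),
    (UPC, "fat free milk", date(2017, 1, 1), None),
]

# Made-up sales rows: (customer_id, upc, sale_date, units).
sales = [
    ("alice", UPC, date(2016, 3, 1), 2),
    ("bob", UPC, date(2016, 8, 1), 1),
    ("alice", UPC, date(2017, 3, 1), 3),
    ("carol", UPC, date(2017, 9, 1), 1),
]


def name_on(upc, on):
    """Return the product name that was in effect for upc on the given date."""
    for v_upc, name, start, end in product_versions:
        if v_upc == upc and start <= on and (end is None or on <= end):
            return name
    return None


units_by_name = defaultdict(int)
buyers_by_name = defaultdict(set)
for customer, upc, sale_date, units in sales:
    name = name_on(upc, sale_date)
    units_by_name[name] += units
    buyers_by_name[name].add(customer)

# Customers who bought under both names, and those who appeared only after the rename.
stayed = buyers_by_name["skim milk"] & buyers_by_name["fat free milk"]
new_buyers = buyers_by_name["fat free milk"] - buyers_by_name["skim milk"]
print(dict(units_by_name), stayed, new_buyers)
```

In a real warehouse this matching would more likely be a join on the key and the effective date range, but the logic is the same: the sale date decides which version of the product the sale belongs to.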
And if we change the name again? We create another record and the same possibilities hold true. Holding on to the past lets us predict the future with that much more accuracy.
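As a rough sketch of what creating another record could involve, the function below closes the currently open version and appends a new one. The dictionary layout, the function name, the new product name, and the change date are all hypothetical; treat this as one possible convention rather than the way any particular warehouse does it.

```python
from datetime import date, timedelta

UPC = "000000000001"  # hypothetical base key

versions = [
    {"upc": UPC, "name": "skim milk",
     "effective_from": date.min, "effective_to": date(2016, 12, 31)},
    {"upc": UPC, "name": "fat free milk",
     "effective_from": date(2017, 1, 1), "effective_to": None},
]


def rename_product(rows, upc, new_name, change_date):
    """Close the open version for upc and append a new one starting on change_date."""
    for row in rows:
        if row["upc"] == upc and row["effective_to"] is None:
            if change_date <= row["effective_from"]:
                raise ValueError("change_date must fall after the current version starts")
            # The old version ends the day before the new one begins.
            row["effective_to"] = change_date - timedelta(days=1)
    rows.append({"upc": upc, "name": new_name,
                 "effective_from": change_date, "effective_to": None})


# An invented third name and date, purely for illustration.
rename_product(versions, UPC, "nonfat milk", date(2030, 1, 1))
open_versions = [v for v in versions if v["effective_to"] is None]
print(len(versions), open_versions[0]["name"])
```

Nothing is overwritten along the way. The earlier records stay exactly as they were, apart from the end date on the one that was current, which is what keeps the older analyses reproducible.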