Today we are building on last weeks article that described what dimensional and facts data are and what the characteristics for each are.
Today we are focusing on how to handle dimensional data.
Dimensional data is descriptive. We use it to give context to the business event that has occurred. We describe the sales event by enriching with Item description, customer segment and geographic location.
The masterdata that makes up our dimensional data is often created long before the business event occurs. It is quite seldom the base information around a store is created after that store starts selling items. This means that the dimensional data is not very sensitive to data latency. It is much more important that it can reflect a certain point in time than that it is updated seconds after the update is done in the ERP system.
So, dimensional data works well with batches or microbatches. A single dimension is often made up of several different data sources, item can contain basic information but also category tree and origin information. It is important to make sure that all sources are aligned on IDs and valid dates. For any given time in history you may want to know what category a certain item belonged to.
When presenting the descriptive dimensional data together with the facts data there are two concepts to keep in mind:
Conformed dimensions – this means that we try to keep the dimensions similar between different use cases. An item dimension for instance is reused wherever possible so we don’t end up with several different versions of that same data.
Slowly Changing Dimensions – This means that there are different ways of handling historical changes in the data model.
The three most common ones are:
SCD 1 – Type 1 means that with every update in master data we update the dimension without keeping the history. This results that history is rewritten according to the latest truth.
SCD 2 – Type 2 means that we keep every update of relevant data. Each update in master data results in a new row in the dimension. That row has a valid timespan, valid_from to valid_to, that describes when the data is valid.
SCD 3 – Type 3 means that we keep two versions in columns. Current value (latest value) and Original value.
Then there are hybrids of SCD 1/2/3 where some columns are historised and some are updated.
Two other ways of modelling data, Prejoined tables and Nested tables, can be compared to SCD2 since they describe the data at that particular time. (with some caveats that depends on how the data is loaded).
For anyone interested the concepts come from Ralph Kimballs (among others) books about Dimensional Modelling.
The books are old and most of the technical aspects can be disregarded but the business facing concepts still hold up to this day and can be easily modernized.
Next week, we are diving into Facts data (and why streaming facts is a good thing)