Today we will discuss the handling of Facts data.
Facts, as we have discussed before, are business events: sales, stock movements, orders, clicks, etc.
We will use these facts in several ways, including (but not limited to):
- Historical reporting & analysis
- Building predictive or prescriptive analytics based on historical behaviour
- Triggering an action based on a single event or a series of events
As we can see from the scenarios above, there are very different requirements on the data when it comes to:
- Latency (time from when the event occurred until the data is available or an action is taken)
- Completeness (have we processed all of the data?)
- Correctness & accuracy (have we performed adequate data quality checks?)
To support all use cases we have to meet the low-latency requirements while also giving a complete picture of the historical facts.
There are several ways of handling this:
- Separating the different use cases – If there is little overlap between the use cases it may make sense to simply separate them: handle events quickly and efficiently, and batch the same data into the historical data store.
- Stream everything – This is a compelling pattern because you only need to maintain one code base; the tricky part is handling data completeness and data quality issues, which will in all likelihood require periodic jobs that clean and aggregate the data.
- Separate after ingestion – Start by streaming all events, store the raw events in a data store, continue with the streaming use cases and finally pick up the historical data from the raw data store into the historical tables in a batch job.
I personally favour the separate-after-ingestion pattern, mainly because it does not hamper the streaming case while still letting you handle the tricky business logic and data logic in a more manageable way. My second choice would be to stream everything.
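To make the separate-after-ingestion pattern a bit more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the raw store and historical table are plain lists standing in for an append-only raw event store and a curated fact table, and the function names are made up for the example.

```python
import json
from datetime import datetime, timezone

raw_store = []         # stands in for an append-only raw event store (e.g. files in object storage)
historical_table = []  # stands in for the curated historical fact table

def handle_streaming_use_case(event: dict) -> None:
    """Low-latency path: trigger an action or update a real-time metric."""
    if event.get("type") == "order" and event.get("amount", 0) > 1000:
        print(f"Large order alert: {event['order_id']}")

def ingest(event: dict) -> None:
    """Land the raw event untouched, then continue the streaming use case immediately."""
    event = {**event, "ingested_at": datetime.now(timezone.utc).isoformat()}
    raw_store.append(json.dumps(event))
    handle_streaming_use_case(event)

def batch_load_historical() -> None:
    """Periodic batch job: pick up raw events, apply quality checks, load the history."""
    for line in raw_store:
        event = json.loads(line)
        if event.get("order_id") is None:   # a simple data quality gate
            continue
        historical_table.append(event)

ingest({"type": "order", "order_id": "o-1", "amount": 1500})
batch_load_historical()
```

The point of the sketch is the ordering: the raw event is persisted before anything else happens, the streaming path never waits for the batch logic, and the batch job can apply the heavier quality checks at its own pace.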
What about the dimensions?
So, when the event data has reached a certain point in your data platform, you want to enrich it with the dimensional data. The traditional way of handling this is to use something called surrogate keys, which give you a direct way of joining Facts to Dimensions: the original ID is swapped for the surrogate key. This can, however, be both compute-intensive and complex, so depending on the use case it can be good to just keep the original IDs and handle the joins with those keys instead. For very fast querying, though, it makes sense to use simple key-to-key joins.
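As a small sketch of the surrogate key swap, assuming a simple lookup from natural key to surrogate key (the names dim_product, product_id and product_sk are purely illustrative); in a real platform this would typically be a join against the dimension table:

```python
# Dimension lookup: natural key -> surrogate key, usually loaded from the dimension table.
dim_product = {"P-100": 1, "P-200": 2}

fact_rows = [
    {"product_id": "P-100", "quantity": 3},
    {"product_id": "P-200", "quantity": 1},
]

def assign_surrogate_keys(rows, lookup):
    """Replace the natural product_id with the dimension's surrogate key."""
    enriched = []
    for row in rows:
        out = dict(row)
        out["product_sk"] = lookup[out.pop("product_id")]
        enriched.append(out)
    return enriched

print(assign_surrogate_keys(fact_rows, dim_product))
# [{'quantity': 3, 'product_sk': 1}, {'quantity': 1, 'product_sk': 2}]
```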
Early arriving facts is the term for events that occur before the corresponding dimensional data has arrived; for example, an item has been sold before the item description has reached the data platform. In that case it is easiest to create a dummy dimension row for that ID and update it once the dimensional data has arrived. Do not replace the IDs in the fact table with “unknown” or similar, as that will destroy the data.
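A minimal sketch of the dummy-row approach, with purely illustrative names: the fact keeps its original ID, a placeholder dimension row is created under that same key, and the placeholder is updated in place once the real dimensional data arrives.

```python
dim_product = {}  # natural key -> dimension attributes

def get_or_create_dimension(product_id: str) -> dict:
    """Return the dimension row, creating a dummy placeholder if it is missing."""
    if product_id not in dim_product:
        dim_product[product_id] = {"product_id": product_id,
                                   "description": "Not yet received"}
    return dim_product[product_id]

def update_dimension(product_id: str, description: str) -> None:
    """When the real dimensional data arrives, update the placeholder in place."""
    row = get_or_create_dimension(product_id)
    row["description"] = description

# A sale arrives before the product description...
get_or_create_dimension("P-300")
# ...and the placeholder is filled in once the dimension data catches up.
update_dimension("P-300", "Blue ceramic mug")
print(dim_product["P-300"])
```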
Aggregates
Aggregates can be important, especially for lowering query time and compute cost. They should be used whenever you see that the granular data is often summed up to day/week/year level or along some other dimension.
They should, however, be created when needed rather than for every eventuality, since if the underlying data changes you will need to reload the aggregates.
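As a small illustration (the column names are made up), here is a sketch of a daily aggregate built from granular fact rows. Note that the aggregate is rebuilt from the underlying facts, which is exactly the reload you take on if that data changes.

```python
from collections import defaultdict

fact_rows = [
    {"event_date": "2024-01-01", "amount": 10.0},
    {"event_date": "2024-01-01", "amount": 5.0},
    {"event_date": "2024-01-02", "amount": 7.5},
]

def build_daily_aggregate(rows):
    """Sum amounts per day; rerun in full when the underlying facts change."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["event_date"]] += row["amount"]
    return dict(totals)

print(build_daily_aggregate(fact_rows))
# {'2024-01-01': 15.0, '2024-01-02': 7.5}
```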
That is it for today. In a later post we will look at how to drive your data pipelines based on the data received rather than the processing date (this is important).