Today I want to look at a topic that every team running a data platform faces sooner or later:
How do we best handle re-runs?
First of all, let us look at the purpose of the data platform: generally, it is to provide the right data at the right time to the right people. In a perfect world nothing would go wrong on the way from the source into the data platform.
The usual scenarios are:
- We do not receive all the data from the source, and the source does not send only the missing records. For instance, we may get a full day of financial transactions every night.
- There is an error in the data somewhere in the chain, so we have to reload data historically.
We can definitely handle both cases manually: it is easy to clean out the wrong data and run the pipeline again with the correct data. But of course we do not want to handle anything manually.
First off, we have to separate the data platform into two layers: the Ingestion layer and the detail/dimensional model/use case layer (here labeled the Data layer).
In the Ingestion layer we want to keep the data that we have received, even if that means storing duplicates. Getting data into the data platform should be robust, and we should never have to doubt whether the data is in the platform.
In the Data layer we want to store the data in a proper data model. This data needs to comply with our data quality rules. This is the layer where we must make sure that we have the correct data, and only the correct data.
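To make the two layers concrete, here is a minimal sketch using Python and SQLite. The names (ingestion_transactions, data_transactions, transaction_id, business_date) are illustrative assumptions on my part, not a prescription; in practice these tables would live in your warehouse of choice.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Ingestion layer: append-only. Keep everything we have ever received,
# duplicates included, together with a load timestamp. Getting data in
# should be robust, and we should never doubt that it is there.
con.execute("""
    CREATE TABLE ingestion_transactions (
        transaction_id TEXT NOT NULL,
        business_date  TEXT NOT NULL,
        amount         REAL,
        loaded_at      TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

# Data layer: the modelled, quality-checked view. The natural key is
# unique here -- the correct data and only the correct data.
con.execute("""
    CREATE TABLE data_transactions (
        transaction_id TEXT PRIMARY KEY,
        business_date  TEXT NOT NULL,
        amount         REAL NOT NULL
    )
""")
```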
So, on to the strategy:
- Always run your data pipelines based on a designated business date and a business key. With that business key you can identify unique transactions or chunks of transactions. Say the 5th of May is the date of the events you have to rerun: if you drive the pipeline on the business_date (the 5th), you can delete the 5th from the Data layer and reinsert the correct data. If you drive the pipeline on the ingestion_date instead, you are dependent on everything having run correctly on the designated day.
- Generally, use delete-insert based on the natural key, as sketched after this list. This way you both get any missing data in and replace any wrong rows that already exist. A proper merge requires you to check for existing rows and compare the values in each row, which is a costly procedure.
- If you have rerun the Data layer, remember to rerun all aggregations built on top of it as well, so you end up in a consistent state (see the second sketch below). Luckily, plenty of tools exist that handle this for you.
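Here is a minimal sketch of what a business_date-driven delete-insert could look like, continuing the hypothetical schema from above (the function name and columns are my own illustration). The rerun takes the business date as its parameter, deletes that slice from the Data layer, and reinserts the latest version of every natural key from the Ingestion layer.

```python
import sqlite3


def rerun_business_date(con: sqlite3.Connection, business_date: str) -> None:
    """Delete-insert one business date in the Data layer.

    Driving the rerun off business_date (not ingestion_date) means we can
    reload, say, the 5th of May no matter when the rows actually arrived.
    """
    with con:  # one transaction: the delete and the insert commit together
        con.execute(
            "DELETE FROM data_transactions WHERE business_date = ?",
            (business_date,),
        )
        con.execute(
            """
            INSERT INTO data_transactions (transaction_id, business_date, amount)
            SELECT transaction_id, business_date, amount
            FROM (
                SELECT transaction_id, business_date, amount,
                       ROW_NUMBER() OVER (
                           PARTITION BY transaction_id
                           ORDER BY loaded_at DESC
                       ) AS rn
                FROM ingestion_transactions
                WHERE business_date = ?
            ) AS ranked
            WHERE rn = 1  -- latest version of each natural key wins
            """,
            (business_date,),
        )
```

Once the missing or corrected rows have landed in the Ingestion layer, the orchestrator simply calls rerun_business_date for the affected date. A nice property of this pattern is that it is idempotent: running it twice for the same date gives the same result.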
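And for the last point, a sketch of rebuilding a dependent aggregate for the same date, again with an illustrative table name (daily_transaction_totals). It is the same delete-insert idea one level up; in practice your transformation or orchestration tool can often track these dependencies and rerun the affected models for you.

```python
import sqlite3


def rerun_daily_totals(con: sqlite3.Connection, business_date: str) -> None:
    """Rebuild the daily aggregate for one business date so it stays in
    step with the corrected rows in data_transactions."""
    con.execute("""
        CREATE TABLE IF NOT EXISTS daily_transaction_totals (
            business_date  TEXT PRIMARY KEY,
            total_amount   REAL NOT NULL,
            n_transactions INTEGER NOT NULL
        )
    """)
    with con:
        # Wipe the affected date, then recompute it from the detail rows.
        con.execute(
            "DELETE FROM daily_transaction_totals WHERE business_date = ?",
            (business_date,),
        )
        con.execute(
            """
            INSERT INTO daily_transaction_totals
                (business_date, total_amount, n_transactions)
            SELECT business_date, SUM(amount), COUNT(*)
            FROM data_transactions
            WHERE business_date = ?
            GROUP BY business_date
            """,
            (business_date,),
        )
```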
To summarize:
- Expect to get wrong/duplicate data at some point
- Use the business date and natural key to identify the chunks of data to be corrected
- Delete-insert whenever possible
Happy deleting