In this article we are looking at Data Retention and Data Deletion inside the data platform.
There are two typical scenarios that we need to cater for when it comes to deletion:
- Data lifecycle / Data retention window. – There is normally a time period for which we are allowed or obliged to keep the data. A typical example of a time period can be 3 years. This is essentially removing the whole chunk of data after the expiration date is passed.
- Deletion of specific parts of the data. – This scenario happens occurs when a rule says that some part of the data needs to be deleted. A typical example of this is the Right to be forgotten – This GDPR mechanism allows the customer to have his/her data removed from the company. Another example is when a customer has stopped being a customer, after a given time that customer needs to be removed. This scenario is removing parts of the data regardless of the time window.
In order to be fully compliant we need to delete the data from all parts of our data platform. That includes the archive, the normalised storage and the use case storage.
All data object should have a data retention rule set, in some cases such as dimensional non-sensitive data that length of time can be endless. It should be a conscious decision nevertheless.
How do we delete for scenario 1?
As a side not, if we are managing personal data but still want to keep the data we can anonymise the data. This is quite complicated and easiest is to strip it from personal data and then aggregate that data.
When it comes to actual deletion the easiest way of performing deletion of large fact tables (which is generally where you need to delete the data) is to have the data partitioned by time and then drop the oldest partitions as you go along. That means all rows belonging to that partition that is 37 months old (if our retention period is 36 months) are dropped.
When you are in the archive it is best to save the archived files with a time stamp that contains the validity time of the data (not necessarily when the data was written). If you have that time stamp the data can easily be deleted.
How do we delete for scenario 2?
This one is trickier.
This actually requires you to go in and delete single or multiple rows in your data during the whole historic time period.
When it comes to personal data, you may anonymise the data by deleting any customer descriptive information, also looking at implicit identifiers.
If anonymisation is not enough there are two possible ways of achieving deletion:
- Perform a database delete on the table (if it is a small table the delete is fine). If the table is large you may issue deletes per partition and roll through your history.
- Rewrite the table without the data to be deleted. This is normally a three-step process: 1. Copy the data from the table to a clone table minus the data to be deleted. 2. Drop the data in the original table. 3. write the data from the clone table into the original table.
The archived data is much harder to remove certain rows from. In some cases you can take the risk of not deleting the data by encrypting the archive and limiting access to it.
Happy deleting.