First, we need to talk about what data quality is and why it is important.
Data quality is a set of measurements that describe how well your data conforms to different dimensions. In other words, data quality is not one measurement but many. The purpose of these measurements is to describe whether the data is complete and correct: the right data at the right time.
Why do you need to worry about Data Quality?
You need a level of data quality that fits the purpose the data is used for.
If you are directing robots in a factory, the data needs to be extremely accurate. Otherwise, the product may be faulty or an accident may occur.
If you are doing sentiment analysis over millions of customers, each individual customer interaction may not need to be 100% accurate, as long as the majority of the interactions are.
So, in summary: the definition of good data quality depends on the usage.
What are common Data Quality measurements?
- Accuracy: Does the data correctly reflect the real-world situation it’s supposed to represent? Does the data adhere to defined standards?
- Completeness: Is all the required data present? For each transaction, check whether there are missing values, blank fields, or placeholders. Also check whether all of the transactions have been captured.
- Consistency: Is the data represented in the same format and free of contradictions throughout a dataset or across different systems? For instance, do we have the same item name in the product hub, in the store, and in the analytics platform?
- Timeliness: How current is the data? Does it reflect information in a timely manner for its intended use? How long after the event occurred is it available for consumption?
- Validity: Does the data conform to the specific format, type, or range that is expected?
- Uniqueness: Are there any duplicate records within the dataset?
- (Referential) Integrity: Do we have the data for linked datasets? For instance, if a sales transaction refers to a customer, an item, and a store, does that store exist in the store table? Does the item data exist?
- Conformity: Does the data adhere to industry standards, regulations, and any predefined formats or structures?
- Precision: Is the data stored with the right precision? This is very important for measurement data. Missing decimals can also lead to aggregation errors when millions of records are summarized.
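To make a few of these dimensions more concrete, here is a minimal sketch in plain Python. It is not a definitive implementation: the sample records, the field names (`id`, `store`, `amount`, `sold_at`, `loaded_at`), the function names, and the rules themselves (for example, "an amount must not be negative") are all assumptions made up for illustration. It shows one simple check each for completeness, uniqueness, validity, referential integrity, timeliness, and precision.

```python
from collections import Counter
from datetime import datetime, timedelta
from decimal import Decimal, ROUND_DOWN

# Hypothetical sample data; every name and value is invented for illustration.
STORES = {"S1", "S2"}
SALES = [
    {"id": 1, "store": "S1", "amount": 19.99,
     "sold_at": datetime(2024, 1, 1, 10, 0), "loaded_at": datetime(2024, 1, 1, 10, 5)},
    {"id": 2, "store": "S3", "amount": -5.0,
     "sold_at": datetime(2024, 1, 1, 11, 0), "loaded_at": datetime(2024, 1, 1, 11, 5)},
    {"id": 2, "store": "S2", "amount": None,
     "sold_at": datetime(2024, 1, 1, 12, 0), "loaded_at": datetime(2024, 1, 4, 12, 0)},
]


def completeness(rows, field):
    """Completeness: share of rows where the field is present and not blank."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def duplicate_keys(rows, key):
    """Uniqueness: key values that occur more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def invalid_amounts(rows):
    """Validity: positions of rows whose amount is negative (assumed rule)."""
    return [i for i, r in enumerate(rows)
            if r["amount"] is not None and r["amount"] < 0]


def orphans(rows, field, known_values):
    """Referential integrity: positions of rows pointing at an unknown record."""
    return [i for i, r in enumerate(rows) if r[field] not in known_values]


def late_rows(rows, max_delay):
    """Timeliness: positions of rows that became available later than max_delay."""
    return [i for i, r in enumerate(rows)
            if r["loaded_at"] - r["sold_at"] > max_delay]


def truncation_error(value, quantum, count):
    """Precision: total error when `count` equal values are truncated to `quantum`."""
    truncated = value.quantize(quantum, rounding=ROUND_DOWN)
    return (value - truncated) * count


if __name__ == "__main__":
    print("amount completeness:", completeness(SALES, "amount"))
    print("duplicate ids:", duplicate_keys(SALES, "id"))
    print("invalid amounts at rows:", invalid_amounts(SALES))
    print("unknown stores at rows:", orphans(SALES, "store", STORES))
    print("late rows:", late_rows(SALES, timedelta(hours=1)))
    print("precision loss:", truncation_error(Decimal("0.125"), Decimal("0.01"), 1_000_000))
```

Each function returns a number or a list of row positions rather than a pass/fail verdict, because the acceptable threshold depends on how the data is used. The precision check illustrates the aggregation problem: dropping the third decimal of 0.125 loses only 0.005 per record, but across one million records that adds up to 5000.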
In the upcoming articles, we will discuss where to do data quality measurements, both in the source systems and in the data platform, as well as strategies to address data quality issues.