
Data Tips #18 – Data Quality 3. Data Quality in the Data Platform

In the two previous articles we discussed the different data quality dimensions that can be measured and the holistic view to take. We also looked at where it makes most sense to handle data quality issues (as early as possible).

In a perfect world, data quality is handled in the source or close to the source. We can also handle data quality and data governance mechanically in the data contracts before the data reaches the data platform. Unfortunately this is not always the case, and we need to accept that the platform must handle data of varying quality.

In today's article we will look at where to handle data quality in the different steps of the data platform.

For simplicity I will divide the data platform into three layers, or steps. There are many different ways of working with a data platform, and today's article does not take a stand on which way is best.

The three layers we are looking at in this case are:

  • Ingestion – In this layer we are fetching or receiving data and reading it into the data platform. In most cases we are not remodelling the data in any way.
  • Reusable data model storage – In this layer we store the data in a modelled manner. The data is modelled for reuse and stored in a data model that should closely represent the information model.
  • Use case layer – In this layer we have our use cases. It may be the result of an ML model, a star schema, a prepared table for dashboarding, or any other type of use case. The data may be reused many times between use cases, in full or in subsets, depending on the use case. A minimal sketch of this flow follows after the list.
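To make the three layers concrete, here is a minimal sketch (in Python, using pandas) of how data could move through them. The function names, column names, and transformations are illustrative assumptions, not a recommendation of any specific platform or modelling style.

    import pandas as pd

    def ingest(raw_source: pd.DataFrame) -> pd.DataFrame:
        # Ingestion: read the data into the platform as-is, no remodelling.
        return raw_source.copy()

    def to_reusable_model(ingested: pd.DataFrame) -> pd.DataFrame:
        # Reusable data model storage: conform the data to the information model
        # (assumed source columns 'cust' and 'ts' are renamed to business terms).
        return ingested.rename(columns={"cust": "customer_id", "ts": "order_ts"})

    def to_use_case(modelled: pd.DataFrame) -> pd.DataFrame:
        # Use case layer: prepare the data for one specific use case,
        # here an assumed dashboard table of daily order totals.
        return (modelled
                .assign(order_date=modelled["order_ts"].dt.date)
                .groupby("order_date", as_index=False)["amount"].sum())

The same modelled table could feed many such use case tables, each taking the full dataset or a subset as needed.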

These are the full checks I would perform in the different layers when building a new solution, assuming no special requirements on the data; where such requirements exist you should of course add checks as needed.

  • Ingestion – When ingesting the data into the platform I would check Validity, to make sure that each row conforms to the expected formats; Completeness, to make sure that the required data elements are filled and not missing; Conformity, to make sure the data conforms to industry standards (this check may also be done in the next layer); Precision, to make sure that we keep the right precision on the data and do not lose any precision along the way; and Accuracy, to make sure the data represents the real-world entity it is supposed to represent.
  • Reusable data model storage – When modelling the data, and when you have access to the full dataset and referenced datasets, I would check Completeness, to make sure we have received and read all the data we should have; Consistency, to make sure that the data is consistent between datasets; Timeliness, to check the latency from event occurrence to when the data is available in the platform; Uniqueness, to check for duplicates; and (Referential) Integrity, to make sure that the data referenced actually exists.
  • Use case layer – When modelling the data for the use case being built I would check Completeness, to make sure we have processed all the data we should have; Timeliness, to check the latency; Uniqueness, to make sure we have not created duplicates in processing; Integrity, to check the integrity of all data referenced; and Accuracy, to make sure the data matches what the use case requires. A sketch of some of these checks in code follows after the list.
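As an illustration, here is a minimal sketch of what some of these checks could look like, again in Python with pandas. The column names ('customer_id', 'order_id', 'country_code', 'amount', 'event_ts', 'loaded_ts'), the reference list of country codes, and the thresholds are assumptions for the example only; Accuracy and Consistency are left out because they normally need an external reference or a second dataset to compare against.

    import pandas as pd

    ISO_COUNTRIES = {"SE", "NO", "DK", "FI"}  # assumed reference codes (Conformity)

    def ingestion_checks(df: pd.DataFrame) -> dict:
        amounts = pd.to_numeric(df["amount"], errors="coerce")
        return {
            # Validity: rows conform to the expected formats.
            "valid_amount_numeric": amounts.notna().all(),
            # Completeness: required data elements are filled and not missing.
            "no_missing_customer_id": df["customer_id"].notna().all(),
            # Conformity: data follows the agreed standard codes.
            "country_codes_conform": df["country_code"].isin(ISO_COUNTRIES).all(),
            # Precision: no precision lost, e.g. amounts keep at most two decimals.
            "amount_precision_ok": (amounts.round(2) == amounts).all(),
        }

    def reusable_model_checks(df: pd.DataFrame, customers: pd.DataFrame,
                              expected_rows: int) -> dict:
        return {
            # Completeness: we received and read everything we should have.
            "row_count_complete": len(df) == expected_rows,
            # Uniqueness: no duplicate business keys.
            "keys_unique": not df["order_id"].duplicated().any(),
            # (Referential) Integrity: referenced customers exist.
            "customers_exist": df["customer_id"].isin(customers["customer_id"]).all(),
            # Timeliness: latency from event occurrence to availability in the platform.
            "timely": (df["loaded_ts"] - df["event_ts"]).max() <= pd.Timedelta(hours=24),
        }

In practice these results would be written to a data quality log or monitoring table rather than returned as a plain dict, but the dimensions being checked are the same.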

If I were to build a new solution with the bare minimum of checks, I would check the following (sketched in code after the list):

  • Ingestion – Check data formats and missing values in the data (Validity & Completeness).
  • Reusable data model storage – Check that we have read all data, check for duplicates, and check referential integrity (Completeness, Uniqueness, Integrity).
  • Use case layer – Check that we have processed all data and that it conforms to what we want to achieve (Completeness, Accuracy).
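As a rough sketch, under the same assumed column names as above, the bare minimum could be as small as a handful of assertions, one small group per layer:

    import pandas as pd

    def minimum_checks(ingested: pd.DataFrame, modelled: pd.DataFrame,
                       customers: pd.DataFrame, expected_rows: int,
                       use_case: pd.DataFrame) -> None:
        # Ingestion: Validity & Completeness.
        assert pd.to_numeric(ingested["amount"], errors="coerce").notna().all(), "invalid amounts"
        assert ingested["customer_id"].notna().all(), "missing customer ids"

        # Reusable data model storage: Completeness, Uniqueness, Integrity.
        assert len(modelled) == expected_rows, "not all data was read"
        assert not modelled["order_id"].duplicated().any(), "duplicates created"
        assert modelled["customer_id"].isin(customers["customer_id"]).all(), "unknown customers referenced"

        # Use case layer: Completeness, Accuracy.
        assert len(use_case) > 0, "use case table is empty"
        assert (use_case["amount"] >= 0).all(), "totals do not conform to what we want to achieve"

In a real platform these would typically be expressed in whatever testing or data quality framework the platform already uses, rather than as raw assertions.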

In conclusion, some data quality checks are mandatory in order to have a known level of reliability for the data in the data platform. Once the mandatory checks are in place, any further checks should be added according to need.
