AI Agent Crew

Data Tips #08 – Data Models for consuming data

Today I am going back to a very basic but central topic. How do we present data to our analysts and scientists? The data may also in the next step be the basis for a dashboard or other deployed front-end.

The underlying data model for any dashboard or analysis should have (at least) the following characteristics:

  • Easy to use and understand
  • Good performance
  • Stability over time

We will will leave the data characteristics for a later topic, such as making sure it is the right data with the right data quality and so on, today we are focusing on the models.

There are many different ways of modelling data, everything from very denormalized data down to very normalized models such as 3rd normal form, data vault or anchor modelling. I think that when presenting data to analysts, scientists or business users the data should be more on the de-normalized side because then it is easier to use and understand.

Here are my three preferred patterns:

Star Schema – The star schema is made up of a central fact table with measures and connections to dimension tables that holds the descriptions, hierarchies, etc. with a key. The star schema offers flexibility in the analysis with easy joins, filterings and groupings. It also allows for changes in hierarchies, descriptions and other dimensional data over time. Drawbacks are that the joins on run-time may be a performance issue and that it requires good modelling techniques and higher effort when developing.

Star Schema

Pre-joined table – The prejoined table takes all of the facts and dimensions puts them on the same row thus it offers all data needed in a single select without the need to join other tables. This is a simple way of modelling and can be very effective, especially during an early phase where rapid prototyping is being done. The drawbacks to such a table is that it requires quite large space and that each row is fixed in time, meaning if a hierarchy changes the changes are not reflected in the data.

Prejoined table

Nested type tables – This is a type of table where some of the columns use complex data types such as an array or even a json element. I have added a simplified example below where you can see that under the store column there is an array that contains store id, store name, etc. The upsides of this type of table is that it can greatly improve data retrieval when you select a few rows from a table with many rows because you do not have to join a large dataset with another large dataset. It also allows for excellent flexibility of adding elements to the dimensional data without having to change the structure of the table. Drawbacks are that it can be complex to write queries against, sometimes having an extra layer that does the extraction can be beneficial. It can also be heavy on the compute resources for unnesting the data.

Nested type table

In summary I would use data models in the following scenarios:

  1. Star Schema – if you have a relatively stable data model, especially if many facts (transactions such as sales, orders, etc) uses the same dimensions (store, item, date, customer, etc)
  2. Prejoined table – if you are in a prototyping mode where you often add data and change the table. May also be beneficial when exposing data to non-dashboard tools.
  3. Nested type table – If you have large volumes and need quick response times. Also very useful if the dimensional data changes frequently. Does require extra effort and competence on the analysts side or an extra layer (such as views) for un-nesting the tables.

Share on social media: