Data Quality: What is It?

Stefan Urbanek
20 Jan 2011

Whether one is a journalist using data for an investigation or a government publishing its budget, it's important that we can assess that data's quality. Likewise, if I'm a user of a data catalogue, it's very useful to know something about a dataset before I try to download it: not just its quality, but also its characteristics, size, etc.

Resource Quality

We have to have the data before we can measure them. A data catalogue gives us references (URLs) to various datasets and data sources, but to what extent can we use them? Here are the properties of data resources that bear on their quality:

Availability - can it be downloaded by a machine? Or does the server reply with "404 Not Found"? (This is item 1 of the 5 stars of openness.)
Processability - is it in a convenient format, one that can be machine-processed into structured form? Or is it in a closed proprietary format? (This is item 2 of the 5 stars of openness.)
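
The two checks above can be sketched in Python. This is only an illustration, not how any particular catalogue does it: the function names and the STRUCTURED_TYPES list are assumptions made for the example, and a real check would likely need to look at the file itself rather than trust the Content-Type header.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumed list of media types treated as machine-processable (hypothetical).
STRUCTURED_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}


def classify(status, content_type):
    """Return (available, processable) for an HTTP status and Content-Type."""
    available = status is not None and 200 <= status < 300
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    processable = available and media_type in STRUCTURED_TYPES
    return available, processable


def check_resource(url, timeout=10):
    """Send a HEAD request to the resource URL and classify the reply."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return classify(resp.status, resp.headers.get("Content-Type"))
    except HTTPError as err:  # the server answered, e.g. with 404 Not Found
        return classify(err.code, None)
    except OSError:  # no answer at all: DNS failure, timeout, refused connection
        return classify(None, None)
```

Keeping classify separate from the network call makes the rule easy to try out by hand: classify(404, None) reports the resource as neither available nor processable, while a PDF served with status 200 is available but not processable.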

Data Quality and Quality Dimensions

Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose. What are the main dimensions of data quality?

Completeness - the extent to which the expected attributes of the data are provided. Data do not have to be 100% complete; the dimension is measured as the degree to which the data match the user's expectations and data availability. Can be measured in an automated way.
Accuracy - the extent to which the data reflect the real-world state. For example: a company name is the real company name, and a company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings. (NB: data can be complete but not accurate.)
Credibility - the extent to which the data are regarded as true and credible. It can vary from source to source, and even one source can contain both automated and manually entered data. This is not really measurable in an automated way.
Timeliness (age of data) - the extent to which the data are sufficiently up to date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates within the data source.
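
The three automatable dimensions above can be sketched in Python as follows. This is a minimal illustration under assumptions: the record layout, the field names and the sample register are made up for the example, and each dimension is reduced to a single simple ratio or day count.

```python
from datetime import date


def completeness(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def accuracy(records, field, known_values):
    """Share of non-empty values of `field` that appear in a reference list."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    return sum(1 for v in values if v in known_values) / len(values)


def lag_days(published, record_dates):
    """Days between the publishing date and the newest date in the data."""
    return (published - max(record_dates)).days


# Made-up sample: four contracts and a two-entry company register.
contracts = [
    {"supplier": "Acme Ltd", "company_id": "123", "signed": date(2010, 10, 1)},
    {"supplier": "", "company_id": "999", "signed": date(2010, 10, 20)},
    {"supplier": "Bolt Inc", "company_id": "456", "signed": date(2010, 9, 15)},
    {"supplier": "Cog GmbH", "company_id": "", "signed": date(2010, 10, 5)},
]
register = {"123", "456"}

supplier_completeness = completeness(contracts, "supplier")
id_accuracy = accuracy(contracts, "company_id", register)
age = lag_days(date(2011, 1, 20), [c["signed"] for c in contracts])
```

In this sample the supplier name is filled in for three of four contracts, two of the three identifiers given exist in the register, and the newest contract is about three months older than the publishing date. Whether those numbers are good enough depends on the task at hand.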

Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:

Consistency - do the facts in multiple datasets match? (partly measurable)
Integrity - can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way)
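
Both cross-dataset checks can be sketched in Python. The dataset layout, the key names and the function names below are assumptions made for illustration; real datasets usually need their keys cleaned and matched before such a comparison is meaningful.

```python
def dangling_references(rows, key, valid_keys):
    """Integrity: rows whose `key` does not point at any known record."""
    valid = set(valid_keys)
    return [r for r in rows if r.get(key) not in valid]


def inconsistent_facts(left, right, key, field):
    """Consistency: keys found in both datasets whose `field` values differ."""
    right_by_key = {r[key]: r for r in right}
    return [
        r[key]
        for r in left
        if r[key] in right_by_key and r[field] != right_by_key[r[key]][field]
    ]


# Made-up sample: contracts referring to suppliers, plus a second
# source (a hypothetical official register) describing the same contracts.
contracts = [
    {"id": 1, "supplier_id": "A", "amount": 100},
    {"id": 2, "supplier_id": "Z", "amount": 250},
]
suppliers = [{"supplier_id": "A"}]
official = [{"id": 1, "amount": 100}, {"id": 2, "amount": 205}]

broken = dangling_references(
    contracts, "supplier_id", [s["supplier_id"] for s in suppliers]
)
mismatched = inconsistent_facts(contracts, official, "id", "amount")
```

Here contract 2 refers to a supplier that does not exist in the supplier list, so the two datasets cannot be joined cleanly, and its amount also differs between the two sources.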

Next time we will talk about "What is acceptable data quality?"
