Data Quality: What is It?
January 20, 2011 in Data, Data Quality
Whether one is a journalist using data for an investigation or a government publishing its budget, it’s important that we can assess that data’s quality.
It’s also true that if I’m a user of a data catalogue, it’s very useful to know something about a dataset before I try to download it: not just its quality, but also its characteristics, size, etc.
Resource Quality
We have to have the data before we can measure them. A data catalogue gives us references (URLs) to various datasets and data sources, but to what extent can we use them? Here are the properties of data resources that bear on their quality:
Availability – can it be machine downloaded? Does the server reply with “404 Not Found”? (This is 5 stars of openness item 1)
Processability – Is it in a convenient format, one that can be machine processed into structured form? Is it in a closed proprietary format? (This is 5 stars of openness item 2)
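The two resource checks above can be sketched roughly as follows. This is only an illustration, not a method from this post: the function names and the three format lists are assumptions, and guessing the format from the file extension is a crude stand-in for actually inspecting the file.

```python
import os
import urllib.error
import urllib.request
from urllib.parse import urlparse

# Assumed, illustrative format lists; a real catalogue would maintain its own.
OPEN_STRUCTURED = {"csv", "tsv", "json", "xml", "rdf"}
PROPRIETARY_STRUCTURED = {"xls", "mdb"}
UNSTRUCTURED = {"pdf", "doc", "jpg", "png"}


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url.

    Connection failures (urllib.error.URLError) are left to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx replies are raised as exceptions; keep only the code.
        return err.code


def is_available(status_code):
    """Availability: the server answered with a 2xx status."""
    return 200 <= status_code < 300


def processability(url):
    """Processability: classify the resource by its file extension."""
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension in OPEN_STRUCTURED:
        return "structured-open"
    if extension in PROPRIETARY_STRUCTURED:
        return "structured-proprietary"
    if extension in UNSTRUCTURED:
        return "unstructured"
    return "unknown"
```

A catalogue could run something like is_available(fetch_status(url)) and processability(url) for each resource URL and store the results next to the dataset entry, so that users see them before downloading.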
Data Quality and Quality Dimensions
Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose.
What are the main dimensions of data quality?
Completeness – extent to which the expected attributes of data are provided. Data do not have to be 100% complete; the dimension is measured by the degree to which the data match the user’s expectations and data availability. Can be measured in an automated way.
Accuracy – extent to which the data reflect the real-world state. For example: a company name is the real company name, a company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings. (NB: data can be complete but not accurate)
Credibility – extent to which the data is regarded as true and credible. It can vary from source to source, or even one source can contain both automated and manually entered data. This is not quite measurable in an automated way.
Timeliness (age of data) – extent to which the data is sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates within the data source.
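As a rough sketch of how the automatically measurable dimensions above might be computed, assuming records are plain dictionaries: the function names, the field names and the idea of passing the register as a set are all assumptions made for this example.

```python
from datetime import date


def completeness(records, expected_fields):
    """Share of expected field values that are present and non-empty."""
    total = len(records) * len(expected_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in expected_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def accuracy(records, field, register):
    """Share of records whose value for field appears in a reference list."""
    if not records:
        return 0.0
    matches = sum(1 for record in records if record.get(field) in register)
    return matches / len(records)


def lag_days(published, newest_record):
    """Timeliness: days between publication and the newest date in the data."""
    return (published - newest_record).days


# Hypothetical sample: two contract records and a tiny company register.
contracts = [
    {"company": "Acme", "company_id": "001", "amount": 100},
    {"company": "", "company_id": "999", "amount": 50},
]
register_ids = {"001", "002"}

print(completeness(contracts, ["company", "company_id", "amount"]))
print(accuracy(contracts, "company_id", register_ids))
print(lag_days(date(2011, 1, 20), date(2010, 10, 20)))
```

Each function returns a plain number, which leaves the question of what score is acceptable to the user of the data. Credibility is deliberately left out, since it is not quite measurable in an automated way.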
Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:
Consistency – do the facts in multiple datasets match? (some measurable)
Integrity – can be multiple datasets correctly joined together? Are all references valid? (measurable in automated way)
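A minimal sketch of the two multi-dataset checks, under the assumption that each dataset has already been loaded into Python dictionaries and lists and that the datasets share an identifier. The function names and the sample data are hypothetical.

```python
def inconsistent_keys(dataset_a, dataset_b, field):
    """Consistency: keys found in both datasets whose field values differ."""
    shared = dataset_a.keys() & dataset_b.keys()
    return sorted(
        key
        for key in shared
        if dataset_a[key].get(field) != dataset_b[key].get(field)
    )


def dangling_references(rows, key_field, valid_keys):
    """Integrity: rows whose reference does not point at any known key."""
    valid = set(valid_keys)
    return [row for row in rows if row.get(key_field) not in valid]


# Hypothetical sample: two company lists and a list of contracts.
register = {"001": {"name": "Acme"}, "002": {"name": "Globex"}}
supplier_list = {"001": {"name": "Acme"}, "002": {"name": "Globex Ltd"}}
contracts = [
    {"id": 1, "company_id": "001"},
    {"id": 2, "company_id": "999"},
]

print(inconsistent_keys(register, supplier_list, "name"))
print(dangling_references(contracts, "company_id", register))
```

The consistency check compares values literally, so "Globex" and "Globex Ltd" count as a mismatch even though a person would read them as the same company. That is one reason this dimension is only partly measurable.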
Next time we will talk about “What is acceptable data quality?”
David James said on January 20, 2011
Stefan, thanks for calling attention to data quality. I think more discussions of this are useful.
We had an interesting discussion last year on the Sunlight blog about data set quality rating systems (or you might prefer the term “data set annotation”):
I like that you’ve included “consistency” as a measure of data quality. That said, from a logical perspective, if you use that definition, consistency becomes a measurement of a binary relation. Put directly, you can talk about the consistency of data sets A and B, but you cannot talk meaningfully about the consistency of data set A in isolation.
The way you used “integrity” above is confusing to me and likely to confuse others: “Integrity – can be multiple datasets correctly joined together? Are all references valid? (measurable in automated way)” Integrity, when it comes to data systems (such as databases) means that “there is a close correspondence between the facts stored in the database and the real world it models” (from Wikipedia). What you are describing might be better described as “fusability,” “joinability,” or “interoperability”. You are getting at the idea that data sets with good metadata (e.g. column descriptions) are relatively easier to blend with other data sets.
I would argue against “timeliness” as a dimension of data quality. If I am looking for employment data for 1997, I want it to be complete and accurate for that period of time (1997). I don’t need it to be “timely” in the sense that it has information from the present day.
I really like that you are approaching this differently than I would have thought about it. Generally speaking, you seem willing to bring in very subjective and complex notions of quality.
For example, your notion of credibility is also interesting, but tricky. I think it is reasonable to expect a data set to have meta data that describes its provenance (origin). But the notion of “credibility” is much more complicated and subjective. Practically speaking, I would suggest that data sets and data catalogs around the world start with making provenance very clear — and then later work to find ways to expose the complex social question of credibility based on that provenance.
Chris Wallace said on January 26, 2011
I linked this blog to a question I asked on getthedata. I think it’s an important question although it hasn’t attracted much interest there.
For me, integrity of datasets is about internal consistency, both within the data (the redundancy caused by keys, indexes and functional relationships) and between the data and the schema (the data is in the right format). Integrity can be checked mechanically against a set of integrity rules.
Of greater importance I think is the question of what I would term veracity. This is about the relationship between the data and the real world which it represents. Your Completeness and Accuracy are about this relationship. You claim that Completeness “Can be measured in an automated way”, but if it’s about the model-to-real-world relationship, that’s not the case: we can never know that every member of staff is in the phone book without making another model to check it against. Veracity is critical but can only be estimated or judged by knowledge of the process by which the data was acquired and the measures taken to check the veracity, which is why we have to use surrogates like the credibility of the source.