What CKAN Made Possible: A Sixteen-Year Open Data Story
From NYCBigApps to federal metadata standards: a co-founder's story of why CKAN's real value was never the portal
Processability - is the data in a convenient format, one that can be machine-processed into structured form? Or is it in a closed proprietary format? (This is item 2 of the 5 stars of openness.)
Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose. What are the main dimensions of data quality?
Accuracy - the data reflect the real-world state. For example: a company name is the real company name, and a company identifier exists in the official register of companies. This can be measured in an automated way using various lists and mappings. (NB: data can be complete but not accurate.)
Credibility - the extent to which the data is regarded as true and credible. It can vary from source to source, and even a single source can contain both automated and manually entered data. This is not really measurable in an automated way.
Timeliness (age of data) - the extent to which the data is sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago is not timely. This can be measured by comparing the publishing date (or scraping date) with the dates within the data source.
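The two automated measurements mentioned above, accuracy against a reference list and timeliness as a date difference, can be sketched roughly as follows. This is only an illustration under assumptions: the `OFFICIAL_REGISTER` mapping, the record field names, and the helper functions are hypothetical and stand in for whatever register and schema you actually have.

```python
from datetime import date

# Hypothetical stand-in for an official register of companies
# (identifier -> registered name). A real check would load this from
# the register itself.
OFFICIAL_REGISTER = {
    "12345678": "Acme Ltd",
    "87654321": "Globex s.r.o.",
}


def accuracy_ratio(records, register):
    """Share of records whose identifier exists in the register
    and whose name matches the registered name."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if register.get(r["company_id"]) == r["company_name"]
    )
    return ok / len(records)


def timeliness_lag_days(published, dates_in_data):
    """Days between the publishing (or scraping) date and the newest
    date found inside the data source."""
    return (published - max(dates_in_data)).days


records = [
    {"company_id": "12345678", "company_name": "Acme Ltd"},
    {"company_id": "87654321", "company_name": "Globex"},      # name mismatch
    {"company_id": "00000000", "company_name": "Ghost Corp"},  # not in register
]
accuracy = accuracy_ratio(records, OFFICIAL_REGISTER)

# Published on 1 June, newest contract dated 1 March.
lag = timeliness_lag_days(date(2024, 6, 1), [date(2024, 2, 20), date(2024, 3, 1)])
```

Note that all three records above are complete (every field is filled in), yet only one of them is accurate, which is the distinction the NB on accuracy points at.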
Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:
Integrity - can multiple datasets be correctly joined together? Are all references valid? (Measurable in an automated way.)
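A reference-validity check of this kind might look like the sketch below. The dataset contents, the field names, and the `dangling_references` helper are assumptions made up for illustration; the idea is simply to list the rows in one dataset that point at keys missing from the other.

```python
def dangling_references(child_rows, parent_rows, foreign_key, primary_key):
    """Return the child rows whose foreign key has no match in the parent dataset."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


# Hypothetical datasets: contracts refer to companies by identifier.
companies = [{"id": "C1"}, {"id": "C2"}]
contracts = [
    {"contract": "A", "supplier_id": "C1"},
    {"contract": "B", "supplier_id": "C9"},  # C9 does not exist in companies
]

broken = dangling_references(contracts, companies, "supplier_id", "id")
integrity_ratio = 1 - len(broken) / len(contracts)
```

An empty `broken` list would mean every reference resolves and the two datasets can be joined without losing rows.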
Next time we will talk about "What is acceptable data quality?"