

CKAN extensions Archiver and QA upgraded

davidread - January 27, 2016 in Data Quality, Extensions

Popular CKAN extensions ‘Archiver’ and ‘QA’ have recently been significantly upgraded. It is now relatively simple to add automatic broken link checking and 5 stars of openness grading to any CKAN site. At a time when many open data portals suffer from quality problems, adding these reports makes it easy to identify the problems and get credit when they are resolved.

Although these extensions have been around for a few years, most of the development has been on forks, while the core has been languishing. In the past couple of months there has been a big push to merge all the efforts from the US (data.gov), Finland, Greece, Slovakia and the Netherlands, and particularly those from the UK (data.gov.uk), into core. It’s been a big leap forward in functionality. Installers no longer need to customize templates: you get details of broken links and 5 stars shown on every dataset simply by installing and configuring the extensions. And now that we’re all on the same page, we can work together better from now on.

ckanext-qa ckanext-archiver

The Archiver Extension regularly tries out all datasets’ data links to see if they are still working. File URLs that do work are downloaded and the user is offered the ‘cached’ copy. Otherwise, URLs that are broken are marked in red and listed in a report. See more: ckanext-archiver repo, docs and demo images

The QA Extension analyses the data files that Archiver has downloaded to reliably determine their format (CSV, XLS, PDF, etc.), rather than trusting the format that the publisher has declared. This information is combined with the data licence and whether the data is currently accessible to give a rating out of 5 according to Tim Berners-Lee’s 5 Stars of Openness. A file that has no open licence, or is not available, gets 0 stars. If it passes those tests but is only a PDF then it gets 1 star. A machine-readable but proprietary format like XLS gets 2 stars, and an open format like CSV gets 3 stars. 4 and 5 star data is that which uses standard schemas and references other datasets, which tends to mean RDF. See ckanext-qa repo, docs and demo images
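The grading rules described above can be sketched roughly as follows. This is an illustration only, not the code in ckanext-qa: the function name, the format sets and the choice to give unrecognised formats 1 star are all assumptions, and RDF is capped at 4 stars because the sketch does not check whether the data links to other datasets.

```python
# Hypothetical format groupings, for illustration only.
TWO_STAR_FORMATS = {"XLS"}      # machine-readable but proprietary
THREE_STAR_FORMATS = {"CSV"}    # machine-readable and open
LINKED_FORMATS = {"RDF"}        # uses standard schemas


def openness_score(detected_format, licence_is_open, is_available):
    """Return a 0-4 star score following the rules described in the post."""
    # No open licence, or a broken link, means 0 stars regardless of format.
    if not licence_is_open or not is_available:
        return 0
    fmt = detected_format.strip().upper()
    if fmt in LINKED_FORMATS:
        # A 5th star would need evidence of links to other datasets,
        # which this sketch does not inspect.
        return 4
    if fmt in THREE_STAR_FORMATS:
        return 3
    if fmt in TWO_STAR_FORMATS:
        return 2
    # Assumption: anything else (e.g. PDF) is open and available
    # but not machine-readable, so it earns 1 star.
    return 1
```

For example, `openness_score("CSV", True, True)` gives 3, while the same file behind a broken link gives 0.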

UK spending: is the data all there?

Mark Wainwright - September 13, 2012 in Data, Data Quality, News

Data tells stories, and some of the most interesting (and newsworthy) stories are about spending. There are buckets and buckets of spending data at data.gov.uk, the UK government’s CKAN portal. So why don’t journalists write more stories about it?

There are various reasons, of course. It is quite a lot of work to trawl through and process spreadsheets of data in the hope of finding the story you want. One key reason is that it is sometimes hard to find all the data and be sure it is complete and up-to-date. The data.gov.uk folk, in collaboration with the Open Knowledge Foundation’s OpenSpending, have just made that a little easier.

Their nifty little tool shows which departments have released full spending data in line with Treasury guidelines, as well as providing direct links to the data. Read about it on the OpenSpending blog here.

[IMG: data.gov.uk departmental spending report]

QA extension added to the Data Hub

John Glover - January 9, 2012 in Data Quality

The CKAN Quality Assurance extension is now live on the Data Hub. The extension calculates a five star rating for each dataset resource based on Tim Berners-Lee’s five stars of openness. It also provides a dashboard that lists broken links and openness scores, which can be viewed now at http://thedatahub.org/qa. Screenshots of two of the dashboard pages are shown below.

QA Dashboard

5 Star Ratings

The main change from previous versions of the extension is that we can now calculate the five star ratings for dataset resources as soon as they are added or updated. This is because the QA extension makes use of the new CKAN 1.5.1 ability to schedule tasks/processes to run in the background, which allows us to update information and perform potentially slow tasks much more frequently in response to user actions. This is hopefully a good first step towards making quick feedback available to users and in turn helping the community to improve the quality of our data.

In future we hope to improve the actual work done by the QA system in order to provide a more in-depth analysis of CKAN resources, and there are also plans to integrate the QA data with our web user interface. A list of planned improvements can be found on the CKAN Trac, but if you have any other suggestions then please comment here or on the CKAN Discuss mailing list.

The CKAN wiki has been updated with information about the QA process and changes to our domain model. If you are interested in adding the qa extension to your own CKAN instance, or in helping with development, you can get it here: http://github.com/okfn/ckanext-qa.

Moderated Edits

Lucy Chambers - June 20, 2011 in Data Quality, datapkg, Extensions

Moderated Edits are now live on test.ckan.net.

Some users need strict control over what information gets published.  However, the aim of CKAN is to make editing the metadata as accessible, easy and open for as many people as possible.  This feature is an attempt to find this middle ground between ownership and encouraging participation from everyone.

You can see an example package that shows some of the Moderated Edits functionality here.

CKAN has always had a “versioned data model”. This means that it is possible to look at the history of the packages it contains, much as you can view the history of a page in Wikipedia. In Wikipedia, however, this historical information is also used to show editors what has changed and to inform forthcoming edits. Moderated Edits works in exactly the same way, with the exception that what finally gets published needs to be approved by a moderator. Most of the time, we hope, all the moderator will have to do is click the approve button…


Moderated Edits Design Detail:

With the Moderated Edits feature, we are trying to achieve these goals:

  • To get people involved with making edits to CKAN metadata.
  • To have an ownership model as to who can moderate and validate these changes.
  • To not put too huge a burden on these owners.
  • This feature allows anyone to edit a package and create a new revision, but requires an owner/moderator to approve a revision before it is made “official”.

To achieve this we needed to:

  • Allow a group of changes to be stored as a new revision.
  • Allow a linear stack of “community” revisions.
  • Provide a way for the editor and moderator to compare previous revisions to the current one.
  • When a moderator approves a change, create a new revision flagged “moderated”.
  • Provide a way for the editor and moderator to comment on revisions if necessary.
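The requirements above can be pictured with a small data model. This is a minimal sketch and not CKAN's actual revision model: the class names, fields and methods (`Revision`, `PackageHistory`, `propose`, `approve`, `published`, `diff`) are hypothetical and chosen only to illustrate a linear stack of community revisions with moderator approval.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Revision:
    metadata: Dict[str, str]   # the group of changes stored together
    author: str
    comment: str = ""
    moderated: bool = False    # True only for moderator-approved revisions


@dataclass
class PackageHistory:
    revisions: List[Revision] = field(default_factory=list)

    def propose(self, metadata, author, comment=""):
        """Anyone may push a new community revision onto the linear stack."""
        revision = Revision(dict(metadata), author, comment)
        self.revisions.append(revision)
        return revision

    def approve(self, revision, moderator):
        """Approval creates a new revision flagged as moderated."""
        approved = Revision(dict(revision.metadata), moderator,
                            revision.comment, moderated=True)
        self.revisions.append(approved)
        return approved

    def published(self) -> Optional[Revision]:
        """The 'official' state is the most recent moderated revision."""
        for revision in reversed(self.revisions):
            if revision.moderated:
                return revision
        return None


def diff(old, new):
    """Map each changed field to its (old, new) values for comparison."""
    keys = sorted(set(old.metadata) | set(new.metadata))
    return {k: (old.metadata.get(k), new.metadata.get(k))
            for k in keys if old.metadata.get(k) != new.metadata.get(k)}
```

In this sketch an unapproved edit stays on the stack and is visible for comparison, but `published()` keeps returning the last moderated revision until a moderator approves something newer.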

Visual design:

  • Keep it as simple and as out of the way as possible.
  • We decided to go with a list of revisions on the right. Clicking on a revision lets you compare the current revision to it.
  • Shadow boxes underneath each field display the value from the compared revision.
  • Color coding of revisions to match the colors of the shadow boxes.
  • Extra approve button for moderators.

Any feedback on this feature will be much appreciated. It is an interesting and new area because, unlike Wikipedia, CKAN’s data is structured, i.e. it’s not just free-flowing text. We are very happy with the result but any suggestions will be very welcome.

 

Data Quality: What is It?

Stefan Urbanek - January 20, 2011 in Data, Data Quality

Whether one is a journalist using data for an investigation or a government publishing its budget, it’s important that we can assess that data’s quality.

It’s also true that if I’m a user of a data catalogue it’s very useful for me to know something about the dataset before I try to download it: not just its quality, but its characteristics, size, etc.

Resource Quality

We have to have the data first to be able to measure it. A data catalogue gives us references (URLs) to various datasets and data sources, but to what extent can we use them? Here are the properties of data resources that relate to their quality:

  • Availability – can it be machine downloaded? Does the server reply with “404 Not Found”? (This is 5 stars of openness item 1)

  • Processability – Is it in a convenient format – one that can be machine processed into structured form? Is it in closed proprietary format? (This is 5 stars of openness item 2)
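An availability check of the kind described above might look like the following. It is a minimal sketch using only the Python standard library, not the method used by any CKAN extension: the function name `is_available` and the injectable `opener` parameter (there so the check can be exercised without a network) are assumptions, and some servers reject HEAD requests, so a real checker would probably fall back to GET.

```python
import urllib.error
import urllib.request


def is_available(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the server answers a HEAD request with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTPError (e.g. "404 Not Found") is a subclass of URLError;
        # ValueError covers strings that are not valid URLs at all.
        return False
```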

Data Quality and Quality Dimensions

Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose.

What are the main dimensions of data quality?

  • Completeness – extent to which the expected attributes of data are provided. Data do not have to be 100% complete; the dimension is measured by the degree to which the data match the user’s expectations and data availability. Can be measured in an automated way.

  • Accuracy – data reflect the real-world state. For example: a company name is the real company name, a company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings.  (NB: data can be complete but not accurate)

  • Credibility – extent to which the data is regarded as true and credible. It can vary from source to source, and even one source can contain both automated and manually entered data. This is not quite measurable in an automated way.

  • Timeliness (age of data) – extent to which the data is sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago is not timely. This can be measured by comparing the publishing date (or scraping date) with the dates within the data source.
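Two of these dimensions, completeness and timeliness, lend themselves to simple automated measures. The sketch below is one possible reading, with hypothetical function names and a deliberately naive definition of "filled" (a field counts as provided if it is neither missing nor an empty string).

```python
from datetime import date


def completeness(records, expected_fields):
    """Share of expected attribute values that are actually provided (0.0-1.0)."""
    total = len(records) * len(expected_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for name in expected_fields
        if record.get(name) not in (None, "")
    )
    return filled / total


def age_in_days(published_on, latest_date_in_data):
    """Gap between the publishing (or scraping) date and the newest date in the data."""
    return (published_on - latest_date_in_data).days
```

With two records and two expected fields, one of them empty, `completeness` returns 0.75. A document published on 20 January 2011 whose newest contract is dated 20 October 2010 has an age of 92 days.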

Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:

  • Consistency – do the facts in multiple datasets match? (partly measurable)

  • Integrity – can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way)
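As an illustration of the integrity dimension, a reference check across two datasets can be as simple as looking for identifiers that do not resolve. The function names and the example fields below (`supplier_id` and the register of company identifiers) are hypothetical.

```python
def dangling_references(rows, key, valid_ids):
    """Return the rows whose reference does not exist in the other dataset."""
    valid = set(valid_ids)
    return [row for row in rows if row.get(key) not in valid]


def integrity_ratio(rows, key, valid_ids):
    """Share of rows that can be joined to the other dataset (0.0-1.0)."""
    if not rows:
        return 1.0
    broken = len(dangling_references(rows, key, valid_ids))
    return (len(rows) - broken) / len(rows)


# Example: contracts that refer to companies in an official register.
contracts = [
    {"contract": "A-1", "supplier_id": "100"},
    {"contract": "A-2", "supplier_id": "999"},  # not in the register
]
register_ids = ["100", "200"]
broken = dangling_references(contracts, "supplier_id", register_ids)
```

Here `broken` holds only contract A-2, and `integrity_ratio` for the same inputs is 0.5.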

Next time we will talk about “What is acceptable data quality?”