This guest blog post is from Fadi Maali of the Digital Enterprise Research Institute (DERI), who, together with Richard Cyganiak (also based at DERI), has written an extension to CKAN to allow integration with Google Refine.
CKAN has established itself as a rich hub of information and one of the first destinations people use to search for and share open data. With the release of the CKAN Storage API, CKAN evolves from a registry of datasets that holds only their metadata and access information into a repository that can hold the actual data. This significantly lowers the barrier to sharing data, since publishers no longer need access to separate web hosting.
Real-world data (almost) always contains errors, duplication, typos and inconsistencies. Even with carefully curated data, there is always room for improvement, enrichment or restructuring. For many people, any mention of data cleaning immediately brings to mind Google Refine. Google Refine is a powerful workbench whose features jointly help users understand their data, clean it, transform it and eventually export it in a required format. Faceted browsing, clustering and mass-editing are a few examples of the cool features available in Google Refine.
So with the rich data hub (a.k.a. CKAN) and Mr. Clean for data (a.k.a. Google Refine), here is my charming workflow to enhance the quality of datasets listed on CKAN:
- get the data from CKAN, import it into Google Refine and perform any required tidying-up
- export the data from Google Refine in CSV (and optionally in RDF using the Google Refine RDF Extension)
- extract and save the Google Refine operation history
- upload the files to CKAN Storage and keep track of the file URLs
- update the corresponding package on CKAN.net or register a new one with the uploaded files associated as resources.
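The last step of the workflow can be sketched in Python. This is only an illustration under stated assumptions, not the extension's actual code: the helper names `build_resources` and `attach_resources` are hypothetical, the URLs in the usage lines are placeholders, and the `url`, `format` and `description` keys are assumed to match the resource fields of your CKAN instance. The sketch makes no network calls; it only prepares the package dictionary that would then be sent to CKAN.

```python
def build_resources(csv_url, history_url, rdf_url=None):
    """Describe the uploaded files as CKAN-style resource dicts.

    Hypothetical helper: the key names are an assumption about the
    resource fields the target CKAN instance accepts.
    """
    resources = [
        {
            "url": csv_url,
            "format": "CSV",
            "description": "Cleaned data exported from Google Refine",
        },
        {
            "url": history_url,
            "format": "JSON",
            "description": "Google Refine operation history",
        },
    ]
    if rdf_url is not None:
        resources.append(
            {
                "url": rdf_url,
                "format": "RDF",
                "description": "RDF export of the cleaned data",
            }
        )
    return resources


def attach_resources(package, new_resources):
    """Return a copy of `package` with `new_resources` appended.

    Resources whose URL is already listed on the package are skipped,
    and the input dictionary is left unmodified.
    """
    existing = list(package.get("resources", []))
    known_urls = {resource["url"] for resource in existing}
    additions = [r for r in new_resources if r["url"] not in known_urls]
    updated = dict(package)
    updated["resources"] = existing + additions
    return updated


# Usage with placeholder URLs (not real CKAN Storage locations).
package = {"name": "my-dataset", "resources": []}
uploaded = build_resources(
    "http://example.org/storage/data.csv",
    "http://example.org/storage/history.json",
)
package = attach_resources(package, uploaded)
```

Keeping the payload construction separate from the upload makes it easy to inspect what will be registered before anything is changed on CKAN.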
Notice the inclusion of the Google Refine operation history, which is a JSON representation of all operations applied to the data in Google Refine since the data was imported. The history makes it possible to examine the operations and re-apply them (or just a subset of them) to the raw data or to a similar dataset. It is a record of the provenance of the data editing operations that were applied.
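To make the idea of re-applying "just a subset" concrete, here is a minimal Python sketch that picks selected operations out of a saved history. It assumes the history is a JSON array of objects that each carry an `op` identifier; the sample entries below are invented for illustration, and the exact fields in a real export may differ. The function name `select_operations` is hypothetical.

```python
import json

# Illustrative history only: the operation names and fields here are
# assumptions about the export format, not taken from a real project.
SAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column city",
    "columnName": "city",
    "expression": "value.trim()",
    "repeat": false
  },
  {
    "op": "core/column-rename",
    "description": "Rename column city to City",
    "oldColumnName": "city",
    "newColumnName": "City"
  }
]
"""


def select_operations(history_json, wanted_ops):
    """Return the operations whose "op" identifier is in `wanted_ops`.

    The original order is preserved, which matters because later
    operations can depend on earlier ones.
    """
    operations = json.loads(history_json)
    return [op for op in operations if op.get("op") in wanted_ops]


# Keep only the text transforms and serialize them again as JSON.
subset = select_operations(SAMPLE_HISTORY, {"core/text-transform"})
subset_json = json.dumps(subset, indent=2)
```

The resulting JSON could then be pasted back into Google Refine to apply those operations to another project, provided the referenced columns exist there.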
The Google Refine CKAN Extension streamlines this workflow and makes it possible from within Google Refine. The figure below illustrates the flow. With a few clicks, cleaned data can be uploaded to CKAN Storage and linked to a package on CKAN. Optionally, the operation history can also be uploaded.
The Extension is available and easy to install. Polish the data, share the brilliant results with the CKAN community and let us know your feedback!