publicdata.eu: Data, Apps and 800,000 Triples
The post below was written by Friedrich Lindenberg and is cross-posted from the LOD-2 blog.
Last week, we officially launched publicdata.eu, the Open Knowledge Foundation’s European-level data registry. After releasing an experimental data catalogue federation and scraping frontend earlier this year, this is the first iteration based on CKAN, our data management system. While the basic functionality is still that of a read-only dataset search, a lot has changed behind the scenes.
The site now uses CKAN's new harvesting capabilities, originally developed for the UK’s location programme. Using this framework, we were able to pull a large number of data catalogues into this joint index – including all instances of CKAN (such as data.gov.uk), France’s Data Publica, Sweden's OpenGov.se, CSI Piemonte’s Dati Piemonte and several municipal catalogues, including those of London, Paris and Vienna. In the near future, we hope to also include some geodata directories, such as the EU’s national INSPIRE registries.
Another major story in the current development was RDF support. While CKAN has had batch export to RDF for a while, and the semantic.ckan.net subdomain offers those exports for download, publicdata.eu is stepping up support. We now offer a live RDF API for DCat export, a SPARQL endpoint backed by a triple store that is updated whenever data changes, and some support for DCat RDF imports in our harvesters. This means CKAN now potentially has round-trip support for DCat, and that we can go ahead and implement the proposed standard for DCat data catalogue federation.
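To give a rough idea of what querying such a SPARQL endpoint could look like, here is a minimal Python sketch that builds a request URL for listing dataset titles. The endpoint address is a placeholder (this post does not give the real one), and the use of the W3C DCat namespace with `dcat:Dataset` and `dct:title` is an assumption about how the exported data is modelled.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Assumption: datasets are typed dcat:Dataset and carry a dct:title.
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""


def build_request_url(endpoint, query):
    """Build a GET URL that passes the query in the 'query' parameter,
    as described by the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL can be fetched with any HTTP client; sending an `Accept: application/sparql-results+json` header asks for JSON results on endpoints that support that format.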
As we started to gather increasing numbers of data packages, we decided to try out a few normalization techniques on the data we had gathered. Starting in the messiest place, the first aspect to tackle was file formats. While there is no hope for datasets with “paper” as the MIME type, “shapefiles” and “commasheets” can be easily translated into their proper types via a simple script.
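A script of this kind can be little more than a lookup table. The following Python sketch is not the script we ran; the alias table and the function name are made up for illustration, and the shapefile type is an unregistered placeholder rather than an official MIME type.

```python
# Hypothetical alias table: messy catalogue labels -> normalized types.
FORMAT_ALIASES = {
    "commasheets": "text/csv",
    "csv": "text/csv",
    # Placeholder value: assumed here, not a registered MIME type.
    "shapefiles": "application/x-shapefile",
    "shapefile": "application/x-shapefile",
    "shp": "application/x-shapefile",
}


def normalize_format(raw):
    """Return the normalized type for a raw format label,
    or None if the label cannot be translated (e.g. "paper")."""
    if raw is None:
        return None
    key = raw.strip().lower()
    return FORMAT_ALIASES.get(key)
```

Labels that have no sensible translation simply come back as `None`, so they can be reviewed by hand instead of being silently guessed.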
Another piece of information that we were easily able to generate was the member state (and in some cases NUTS classification) of the affected region. This allowed us to create a map-based overview of data availability throughout Europe. Besides being a nice way to facet the data, this also helps to show which countries are leading in their effort to open up government information.
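As an illustration of how such an overview could be computed, the sketch below assumes that the member state is derived from the catalogue a dataset was harvested from, and then counts datasets per country. The `source` field, the mapping table and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from source catalogue to country code.
CATALOGUE_COUNTRY = {
    "data.gov.uk": "UK",
    "Data Publica": "FR",
    "OpenGov.se": "SE",
}


def count_by_member_state(datasets):
    """Count datasets per member state, skipping unknown sources."""
    counts = Counter()
    for dataset in datasets:
        country = CATALOGUE_COUNTRY.get(dataset.get("source"))
        if country is not None:
            counts[country] += 1
    return counts


sample = [
    {"name": "a", "source": "data.gov.uk"},
    {"name": "b", "source": "data.gov.uk"},
    {"name": "c", "source": "Data Publica"},
    {"name": "d", "source": "unknown catalogue"},
]
counts = count_by_member_state(sample)
```

The per-country counts are the kind of figure a map view or a search facet could be built on.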
We then did the same thing with categorizations: several of the catalogues we harvested contain their own small taxonomies. Looking at the similarities, it was easy to extract a set of 14 common categories – most of which roughly align with first-level Eurovoc items. Still, the larger number of source categorizations remains untranslated, which highlights the need for proper taxonomy management to be integrated with the catalogue in LOD2.
Finally comes the most visible aspect: CKAN received both a face lift and an integrated apps catalogue. Realizing the need to give some of the fabulous contestants for the Open Data Challenge a permanent home, we decided to integrate a gallery of the shortlisted entries right into the core of publicdata.eu.