
Link Digital’s Enterprise CKAN Stack for AWS is Now Available on GitHub

Steven De Costa - October 13, 2016 in Deployments, Featured, Partners

As part of the commitment made at the White House Open Data Roundtable, Datashades, also trading as Link Digital, has recently released the preview of an Enterprise CKAN Stack for AWS.

The stack embodies Link Digital’s best practice, with independently scalable layers that are easily adapted to CI workflows and automated system maintenance. It is now freely available to use from our Datashades GitHub repository.

This OpsWorks stack has been in active use at Link Digital and forms the basis on which we build and support our Government Open Data platforms, so the release is a genuine case of eating our own dog food.

Although a number of improvements are still in progress, we believe that the newly published alpha version of the project will add value to the Public Data community.

To build an OpsWorks stack you will need the CloudFormation templates from the ckan-aws-templates repository.
When entering parameters for the CloudFormation template, you will need to supply the URL of the opswx-ckan-cookbook repository as the cookbook URL for the OpsWorks stack.
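
As a rough sketch of wiring these together: the CloudFormation parameter format (`ParameterKey`/`ParameterValue`) and the `aws cloudformation create-stack` command are standard AWS, but the parameter names, bucket and template URL below are placeholders, not the actual keys defined in the DataShades templates.

```python
# Sketch: assembling a CloudFormation create-stack call for the OpsWorks CKAN
# stack. Parameter names (CookbookURL) and the template URL are illustrative;
# check the real templates in DataShades/ckan-aws-templates for the actual keys.
import json
import shlex

def build_create_stack_command(stack_name, template_url, parameters):
    """Return an `aws cloudformation create-stack` command line."""
    params = [
        {"ParameterKey": k, "ParameterValue": v} for k, v in parameters.items()
    ]
    return (
        "aws cloudformation create-stack"
        f" --stack-name {shlex.quote(stack_name)}"
        f" --template-url {shlex.quote(template_url)}"
        f" --parameters {shlex.quote(json.dumps(params))}"
        " --capabilities CAPABILITY_IAM"
    )

cmd = build_create_stack_command(
    "ckan-enterprise",
    "https://s3.amazonaws.com/my-bucket/ckan-stack.template",  # placeholder
    {"CookbookURL": "https://github.com/DataShades/opswx-ckan-cookbook"},
)
print(cmd)
```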

Steven De Costa at the IODC CKAN Booth

A longer monologue from a dev list discussion:

Attaching our high level architecture using RDS on AWS — for UAT and PROD: appendix_8_updated_aws-hosting-environment-2.

CloudFormation scripts for building out CKAN in a HA config can be found at https://github.com/DataShades/ckan-aws-templates

The OpsWorks version is here: https://github.com/DataShades/opswx-ckan-cookbook

Happy to collaborate on this and make it shine brighter :)

There are a few other relevant scripts under our datashades set of repos, such as the ASG one here: https://github.com/DataShades/updateasg

And, the general cloud storage one here: https://github.com/DataShades/ckanext-cloudstorage

And the S3 related one here: https://github.com/DataShades/ckanext-s3filestore

We’ve also improved the SSO approach with Saml2: https://github.com/DataShades/ckanext-saml2

And, begun some work for manipulating ACLs, which is important for private dataset resources you’d want to switch to ‘public’ when published: https://github.com/DataShades/ckanext-acl

Although not formally part of the CKAN roadmap, I have a working model of where I’d like CKAN to head when it comes to enterprise file/data storage and access. If you are familiar with the concept of resource views, then the idea I’m keen to pursue is similar. It is a concept of resource containers (not para-virtualization containers, but storage or access-point containers). The idea is to make CKAN extendable via a type of extension that allows it to do more orchestration around how data is stored and made usable below the metadata discovery layer.

The story would be something like:
As a platform operator, I need to be able to configure a variety of storage and access endpoint possibilities, so that custodians can select where data is placed based on type of data or business need.

Resource container extensions would then be built to accommodate things like:

  1. Big data, transactional data feeds
  2. Semantic lakes
  3. Large file storage blobs
  4. Self declarative structured data (likely using data packaging/frictionless data)
  5. For cost auditing and accountability – storage into specified paid cloud accounts (different AWS, Azure, etc. accounts based on organisation)

I would imagine that resource view and resource container extensions would be paired in many cases, so that the view can provide greater access to and control of the data, including the ability to query it and extract insights from it.
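
To make the idea concrete, here is a purely hypothetical sketch of what a resource container interface might look like. No such interface exists in CKAN today; the class and method names are invented for illustration (real CKAN plugins derive from interfaces in `ckan.plugins` instead).

```python
# Hypothetical sketch only: a "resource container" as a pluggable storage /
# access backend selected per resource. Not an existing CKAN interface.
from abc import ABC, abstractmethod

class IResourceContainer(ABC):
    """A storage or access-point backend a custodian could select."""

    @abstractmethod
    def supports(self, resource: dict) -> bool:
        """Can this container hold the given resource?"""

    @abstractmethod
    def store(self, resource: dict, data: bytes) -> str:
        """Persist the data and return an access URL."""

class BlobContainer(IResourceContainer):
    """Toy large-file container keeping data in memory."""
    def __init__(self):
        self._blobs = {}

    def supports(self, resource):
        return resource.get("format", "").lower() in {"zip", "tiff", "bam"}

    def store(self, resource, data):
        self._blobs[resource["id"]] = data
        return f"blob://{resource['id']}"

def pick_container(resource, containers):
    """Dispatch a resource to the first registered container that accepts it."""
    return next(c for c in containers if c.supports(resource))
```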

The European Data Portal has around 650k datasets. It is true that once a CKAN portal gets to such a size, it can be a chore to do anything over the entire set of data quickly. However, with the entire catalog readable via the API, there is a place for other tools to come into the picture and provide meta-analysis or broader views over all the data in a portal.
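
For example, such a tool could page through a whole catalog using the standard `package_search` action (its `rows`/`start` paging is part of CKAN’s action API). In this sketch the fetch function is injectable so the paging logic can run without a live portal:

```python
# Sketch: walking an entire CKAN catalog through the action API.
import json
from urllib.request import urlopen

def http_fetch(base_url, start, rows):
    """Fetch one page of package_search results from a live CKAN portal."""
    url = f"{base_url}/api/3/action/package_search?start={start}&rows={rows}"
    with urlopen(url) as resp:
        return json.load(resp)["result"]

def iter_datasets(base_url, rows=100, fetch=http_fetch):
    """Yield every dataset dict in the catalog, page by page."""
    start = 0
    while True:
        result = fetch(base_url, start, rows)
        for dataset in result["results"]:
            yield dataset
        start += rows
        if start >= result["count"]:
            break

# A crude meta-analysis over the whole portal could then be, e.g.:
# formats = collections.Counter(r["format"] for d in iter_datasets(url)
#                               for r in d["resources"])
```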

CKAN’s structure allows for data ownership and custodianship to remain flexible as the governing entities change over time. If we keep those functions lightweight and build the more intensive data processing tasks within a resource container layer, then I think that is the big win :) I see datastore and filestore as examples of resource containers. Datapusher is an example of an ETL tool that works with datastore, but similar tools and concepts can be worked into the model, and the open source goodness can grow organically to meet lots of different organisational needs.

Where CKAN differs from other portal software, in my experience, is that it can be used for open Government data, research data, private sector data and ‘data as knowledge’ in virtually any situation. Other portal software appears to be built around capturing a particular market opportunity to generate data as knowledge for a particular customer segment – civic hackers, jurisdictional bureaucrats, open data policy implementations, etc.

CKAN’s harvesting is good, but certainly not perfect. The approach for pushing from CKAN to elsewhere is likely to be used more in our future work, or as we refactor the architecture of current implementations. See: https://github.com/DataShades/ckanext-syndicate

By using multiple CKAN environments it is pretty easy to have catalogs of ‘working data’ that then push to the ‘published data’ catalog. We use this approach for Government open data where, from the bottom up, agency data is collected into CKAN-based information asset registers. Sometimes the data doesn’t even exist yet, but the data management plan can at least be registered first, prior to populating the dataset with resources. Once the data is ready it can be published and syndicated upward to a higher level jurisdictional portal – such as a council, city, state or province. Such datasets can then be syndicated upward again into a national or regional portal – perhaps with further ETL functions in place to combine the similarly structured data from multiple agencies into a master dataset that presents a larger view of the entire data collection effort.
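
The upward push could be hand-rolled against the standard `package_create` action, roughly as below. ckanext-syndicate does this properly (with hooks and retries); the URLs and API key here are placeholders.

```python
# Illustration only: a "syndication" push using the plain CKAN action API.
import json
from urllib.request import Request, urlopen

def push_dataset(dataset, target_url, api_key, post=None):
    """Create a dataset on the upstream ('published') portal."""
    payload = {k: v for k, v in dataset.items() if k != "id"}  # new id upstream
    if post is not None:
        return post(payload)  # injectable for testing / dry runs
    req = Request(
        f"{target_url}/api/3/action/package_create",
        data=json.dumps(payload).encode(),
        headers={"Authorization": api_key,
                 "Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["result"]

# A working-catalog dataset, once ready, would be published with e.g.:
# push_dataset(ds, "https://data.example.gov", "xxxx-api-key")
```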

If the domain of data collection differs, such as in a field of research, then the same architecture can still apply. Multiple research schools of chemistry, for example, could publish working data locally then syndicate upward into a global repository that allows for meta analysis of all research outcomes over the entire domain’s efforts. We’re working on a project in just this manner that is referenced here: http://linkdigital.com.au/news/2016/09/building-mdbox-an-open-access-simulation-data-repository-on-ckan-and-aws

Lastly, published open data is the result of effort put into a process of data collection and, usually, some analysis and clean-up. The tools used to prepare, collect or visualise data are all part of the value a dataset represents. To bridge data and code we’ve released a very simple resource view for GitHub repositories that can be found here: https://github.com/DataShades/ckanext-githubrepopreview

Open Government initiatives are formed around principles of transparency, participation and collaboration. There is a desire to enable public-private collaboration over the long term and there is a role for Government to act as impresario to stimulate new markets and economic activity from publishing open data (ref: https://www.nesta.org.uk/sites/default/files/government_as_impresario.pdf). The reason we built the GitHub resource view is to encourage open source projects to emerge in connection to public datasets, via linking the opportunity for discovery of helpful code with the discovery of helpful datasets.

Sorry for the long monologue! I could have more succinctly just said CKAN rocks, check out all the open source goodness surrounding it and jump in :)

CKAN based source code of the European Data Portal now available

wendyc - March 20, 2016 in Deployments

The European Data Portal version 1 was released on 18 February 2016. Today, the source code of the CKAN based European Data Portal and its extensions are released on GitLab. So, what extensions were made to CKAN?

What can I find on the European Data Portal?

Let’s start with some background information to understand the function of the portal and its components. The primary purpose of the European Data Portal is to harvest the metadata of national, regional and local Open Data portals and act as a single point of access to all Open Data available across Europe. The number of harvested datasets increased from 240,000 in November 2015 to over 415,000 datasets today. All metadata is made available in 6 languages: English, French, German, Italian, Polish and Spanish.

Next to the harvesting process, the portal offers more information around providing data, to support public bodies in releasing more data. The Goldbook provides an overview of everything you need to know as a data holder who wants to start publishing data. You can also find an explanation of how a portal can be harvested by the European Data Portal. The ‘Using Data’ section underlines the benefits of re-using Open Data, together with a checklist illustrating the key steps you need to go through before using data. Share your story and take part in our survey to raise further awareness at public sector level for the release of more data. Finally, the Library contains a wealth of additional training material, use cases, reports and so on.

In a nutshell, the Portal contains metadata, training material, reports, use cases and more. But how does the CKAN platform integrated into the Portal really work?

The CKAN extensions of the European Data Portal

In order to integrate the required functionalities into CKAN, an extension was specifically developed. This extension implements several aspects of the overall concept. Most notable are the support for multilingual metadata, the implementation of the DCAT Application Profile for data portals in Europe (DCAT-AP) and the synchronisation with a Linked Data triplestore.

The multilingual feature which enables the current search functionality in six different languages was realised by adding an additional metadata field, which holds the translations for arbitrary languages for each respective metadata field. Fields of datasets and resources are taken into account. When rendering the views, the current language setting is considered for serving the appropriate translation. This feature is used to integrate CKAN into an external service, where all relevant metadata is automatically translated into multiple languages by a machine translation service.
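
A minimal sketch of that lookup logic, assuming a fluent-style `*_translated` field as used by ckanext-fluent (the EDP extension’s actual field names may differ):

```python
# Sketch of the translated-field lookup described above. The "_translated"
# suffix follows ckanext-fluent's convention; treat it as an assumption here.
def translated(dataset, field, lang, default_lang="en"):
    """Serve `field` in the viewer's language, with sensible fallbacks."""
    translations = dataset.get(f"{field}_translated", {})
    return (translations.get(lang)
            or translations.get(default_lang)
            or dataset.get(field, ""))

dataset = {
    "title": "Bevölkerung nach Gemeinden",
    "title_translated": {
        "de": "Bevölkerung nach Gemeinden",
        "en": "Population by municipality",
        "fr": "Population par commune",
    },
}
print(translated(dataset, "title", "fr"))  # → Population par commune
```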

The CKAN core schema was considerably extended to fully support DCAT-AP, which provides an RDF vocabulary for specifying public datasets. Therefore, a mapping from the vocabulary to the JSON-style schema of CKAN was designed and implemented. In addition, a mechanism was added to automatically replicate the metadata into a Virtuoso triplestore. By doing so, all datasets are directly available via a SPARQL endpoint. Besides being available as JSON and HTML, every dataset can be served as RDF.
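
A toy version of the mapping direction described, emitting a few N-Triples for a dataset dict. The property URIs are standard DCAT/DCTERMS, but the real extension covers the full DCAT-AP vocabulary, and pointing `dcat:distribution` straight at the resource URL (rather than a `dcat:Distribution` node) is a deliberate simplification.

```python
# Toy CKAN -> DCAT mapping: a dataset dict becomes a handful of N-Triples.
DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"

def dataset_to_ntriples(dataset, base_uri):
    s = f"<{base_uri}/dataset/{dataset['name']}>"
    triples = [
        f"{s} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{DCAT}Dataset> .",
        f'{s} <{DCT}title> "{dataset["title"]}" .',
    ]
    for res in dataset.get("resources", []):
        # Simplification: distributions should really be blank/URI nodes
        # typed dcat:Distribution with their own properties.
        triples.append(f"{s} <{DCAT}distribution> <{res['url']}> .")
    return "\n".join(triples)
```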

The activities for extending CKAN in the context of the European Data Portal are in line with similar efforts of the CKAN community to enhance the software further towards Linked Data and multilingual support (see extensions ckanext-fluent and ckanext-dcat). The examples given are just some initiatives that will be further developed in the near future.

Access the Source Code on GitLab!

Authors: Wendy Carrara, Eva van Steenbergen and Fabian Kirstein on behalf of the European Data Portal

ESRC Consumer Data Research Centre

alexsingleton - October 28, 2015 in Deployments

Integral to our activities as part of the ESRC Consumer Data Research Centre, we spent the summer working on a project that would create a searchable catalogue of the various data holdings that we are assembling, including retailer data that we have negotiated access to, but also a wealth of value added open data products. The site is available here: data.cdrc.ac.uk

One aspect that we were especially pleased with is the introduction of data stores for each local authority in the UK. These all have a separate URL for their own datastore; so, Liverpool could be found here for example: data.cdrc.ac.uk/lad/liverpool

We do not believe in simply replicating data sources available elsewhere, and we have added value to each open data deposit by re-engineering these into new formats that are optimised for simple analysis, which we hope will lower barriers to entry. As of 5/10/15 we had created 8,738k separate data items covering a very wide variety of topics.

Not every local authority has the resources to create its own datastore, and for those which do, we hope that what we have created will be complementary. We have also linked many of the outputs through to our mapping interface, which is available here: maps.cdrc.ac.uk

Some Technical Bits….

Given the venue of this blog post, it will come as no surprise that we used the CKAN platform for development: it is open source and widely used in the other data stores we were familiar with. We have, however, made considerable customisations to the off-the-shelf software.

Our development infrastructure included Docker images for all the services that CKAN relies upon, along with a service management and configuration system. We were also dealing with multiple uploads that had been created using R, Python and PostGIS, so we scripted a bulk dataset uploading tool.
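
A bulk uploader in that spirit might look like the following sketch; the metadata defaults are placeholders, and the `create` callable stands in for a real `package_create` API call so the logic runs anywhere.

```python
# Sketch: one dataset per data file in a drop directory, submitted through an
# injectable `create` callable (in production, a package_create API call).
from pathlib import Path

def bulk_upload(drop_dir, owner_org, create):
    """Build and submit a package_create-style payload per CSV file."""
    created = []
    for path in sorted(Path(drop_dir).glob("*.csv")):
        payload = {
            "name": path.stem.lower().replace("_", "-"),
            "title": path.stem.replace("_", " ").title(),
            "owner_org": owner_org,
            "resources": [{"url": f"upload://{path.name}", "format": "CSV"}],
        }
        created.append(create(payload))
    return created
```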

Some specific customisation:

For the products/topics/LADs/National/Regional search tabs:

  • Added support for filters based on products/topics/LADs
  • Added groups/labels for Open/Safeguarded/Secure datasets
  • Added an interactive map on the front page (based on our maps.cdrc.ac.uk platform)
  • Added a Twitter feed
  • Added a blog proxy to a WordPress blog aggregator
  • Added download tracking
  • Improved the efficiency of the CKAN code associated with group listings
  • Added a GeoJSON preview on the dataset pages
  • Prevented non-logged-in users from downloading – unfortunately we need this functionality to provide usage data to our funding body (sorry!)
  • Added system-wide notification messages
  • Added Google Analytics tracking code

Data blog

Customised the WordPress theme Cerauno to fit in with CKAN

Other additions

  • Developed a plugin to improve the user registration form (https://github.com/esrc-cdrc/ckan-ckanext-userextra)
  • Added checkboxes for newsletter options and a dropdown menu for sectors
  • Customised the metadata with the third-party plugin ckanext-scheming
  • Added a commenting system with the third-party plugin ckanext-ytpcomments
  • Improved the user experience of the commenting system by changing the look and feel and allowing in-place commenting and editing

We also made some major bugfixes and improvements to CKAN, including:

  • Fixed the tracking system (broken by latest releases of CKAN)
  • Fixed the type system for groups

Besides these, there were various other small bugfixes and improvements to CKAN and third-party plugins.

We hope that these and our continuing contributions have been of use, and that you enjoy our data store: data.cdrc.ac.uk

Thanks in particular go to the hard work of Data Scientists Wen Li, Hai Nguyen and Michail Pavlis, who have spent much of the summer working on this project.

Beauty behind the scenes

Tryggvi Björgvinsson - August 5, 2015 in Deployments, Extensions, Featured

Good things can often go unnoticed, especially if they’re not immediately visible. Last month the government of Sweden, through Vinnova, released a revamped version of their open data portal, Öppnadata.se. The portal still runs on CKAN, the open data management system. It even has the same visual feeling but the principles behind the portal are completely different. The main idea behind the new version of Öppnadata.se is automation. Open Knowledge teamed up with the Swedish company Metasolutions to build and deliver an automated open data portal.

Responsive design

In modern web development, one aspect of website automation called responsive design has become very popular. With this technique the website automatically adjusts the presentation depending on the screen size. That is, it knows how best to present the content given different screen sizes. Öppnadata.se got a slight facelift in terms of tweaks to its appearance, but the big news on that front is that it now has a responsive design. The portal looks different if you access it on mobile phones or if you visit it on desktops, but the content is still the same.

These changes were contributed to CKAN. They are now a part of the CKAN core web application as of version 2.3. This means everyone can now have responsive data portals as long as they use a recent version of CKAN.

New Öppnadata.se

New Öppnadata.se

Old Öppnadata.se

Old Öppnadata.se

Data catalogs

Perhaps the biggest innovation of Öppnadata.se is how the automation process works for adding new datasets to the catalog. Normally with CKAN, data publishers log in and create or update their datasets on the CKAN site. CKAN has also long supported something called harvesting, where a CKAN instance goes out, fetches new datasets and makes them available. That’s a form of automation, but it depends on specific software being used, or on special harvesters for each source. So harvesting from one CKAN instance to another is simple. Harvesting from a specific geospatial data source is simple. Automatically harvesting from something you don’t know about, and which may not even exist yet, is hard.

That’s the reality Öppnadata.se faces. Only a minority of public organisations and municipalities in Sweden publish open data at the moment, so most public entities have not yet decided what software or solution they will use to publish open data.

To tackle this problem, Öppnadata.se relies on an open standard from the World Wide Web Consortium called DCAT (Data Catalog Vocabulary). The open standard describes how to publish a list of datasets and it allows Swedish public bodies to pick whatever solution they like to publish datasets, as long as one of its outputs conforms with DCAT.

Öppnadata.se actually uses a DCAT application profile which was specially created for Sweden by Metasolutions and defines in more detail what to expect, for example that Öppnadata.se expects to find dataset classifications according to the Eurovoc classification system.

Thanks to this effort significant improvements have been made to CKAN’s support for RDF and DCAT. They include application profiles (like the Swedish one) for harvesting and exposing DCAT metadata in different formats. So a CKAN instance can now automatically harvest datasets from a range of DCAT sources, which is exactly what Öppnadata.se does. For Öppnadata.se, the CKAN support also makes it easy for Swedish public bodies who use CKAN to automatically expose their datasets correctly so that they can be automatically harvested by Öppnadata.se. For more information have a look at the CKAN DCAT extension documentation.

Dead or alive

The Web is decentralised and always changing. A link to a webpage that worked yesterday might not work today because the page was moved. When automatically adding external links, for example links to resources for a dataset, you run the risk of adding links to resources that no longer exist.

To counter that, Öppnadata.se uses a CKAN extension called Dead or alive. It may not be the best name, but that’s what it does: it checks whether a link is dead or alive. The checking itself is performed by an external service called deadoralive. The extension just serves the set of links for the external service to check. In this way dead links are automatically marked as broken, and system administrators of Öppnadata.se can find problematic public bodies and notify them that they need to update their DCAT catalog (this is not automatic because nobody likes spam).
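
The core of such a check is small. A minimal dead-or-alive test might look like this sketch (the real deadoralive service batches links and reports results back to CKAN; the `head` callable is injectable so the logic runs without network access):

```python
# Minimal link-liveness check: issue a HEAD request, classify by status code.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def is_alive(url, head=None):
    """True if the link answers with a non-error (< 400) status."""
    if head is None:
        def head(u):
            req = Request(u, method="HEAD")
            with urlopen(req, timeout=10) as resp:
                return resp.status
    try:
        return head(url) < 400
    except (HTTPError, URLError, OSError):
        return False  # unreachable, DNS failure, timeout, 4xx/5xx raised
```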

These are only the automation highlights of the new Öppnadata.se. Other changes were made that have little to do with automation but are still not immediately visible, so a lot of Öppnadata.se’s beauty happens behind the scenes. That’s also the case for other open data portals. You might just visit your open data portal to get some open data, but you might not realise the amount of effort and coordination it takes to get that data to you.

Image of Swedish flag by Allie_Caulfield on Flickr (cc-by)

CKAN Census 2014

Adrià Mercader - May 12, 2014 in Community, Deployments, Featured

Census!

 

Update: We have now also published a Deployment Survey to learn more about how CKAN is deployed and installed:

/deployment-survey/


CKAN is used by hundreds of organizations across the world to publish their Open Data on-line. More and more CKAN sites are going live and CKAN is being used in many new ways, integrating with other tools and being extended with new features.

We want to get a clearer picture of the current CKAN ecosystem to better understand how CKAN is being used and help scope the project roadmap. And if at the same time we can bring the ckan.org instances page up to date all the better!

So if you are developing or managing a CKAN site or know someone who does, can you spend 5 minutes filling out this quick survey?

/census

Any help in spreading the link will be much appreciated as well. The results will make a big difference, and of course, they will be made available to the community later on.

This is the first in a series of surveys that will focus on different aspects of maintaining and using CKAN. In the following weeks we’ll send another one around focused on Deployment and Installation.

Let’s map CKAN usage across the world!

 

Photo by suziesparkle

 

CKAN for research data management: a round-up from St Andrews

Mark Wainwright - November 28, 2013 in Deployments, News

A new blog post from Birgit Plietzsch at St Andrews University provides an interesting survey of projects using CKAN for research data management. St Andrews themselves have a pilot project in this area, and Dr Plietzsch had solicited input from other projects on the ‘ckan4rdm’ mailing list. The post summarises the responses she received.

It’s noticeable that there are now quite a few RDM projects using CKAN in production environments or in pilots, most of them newcomers since the CKAN4RDM workshop earlier this year. Another project in the area that didn’t make it into the St Andrews round-up is EDaWaX, subject of a recent post on this blog.

If you’re interested in using CKAN in a research data management setting, it is worth joining the ckan4rdm list (it is a low-traffic list), and maybe sending it a note introducing yourself and your plans in the area.

Partner profile: Liip, Switzerland

Mark Wainwright - October 9, 2013 in Deployments, Partners

The Open Knowledge Foundation’s CKAN Professional Partnership Programme means that governments and other users all over the world can get paid support from a certified local provider, with access to the core development team if necessary. This post is the first of a series on current CKAN partners.


Liip AG is a web development company based in Switzerland, which does large-scale, high-quality projects in a range of areas, including e-commerce, online learning, mobile – and, of course, Open Data. Their first big project as a CKAN Partner is opendata.admin.ch, the federal Open Data portal for Switzerland. The site, which Liip developed together with five government agencies and the Open Government Data consultancy itopia, was officially launched on 16 September at OKCon in Geneva.


Switzerland’s new Open Data portal, opendata.admin.ch

The current site is a pilot, produced for the Federal Archive, and experience from using it will guide the future development of Open Data in Switzerland. It is hoped that it will foster economic growth as well as government transparency and efficiency. A study commissioned by the Federal Archive concluded that open government data in Switzerland had the potential to be worth over a billion Euros a year in economic growth.

At present the site has over 1600 datasets, including regional boundaries, demographics, election data, weather data, and more. Much of the data is harvested from a range of government bodies, such as the Federal Statistical Office and the Meteorological Office. To this end Liip wrote a number of custom harvesters to extract the datasets from different existing information systems, using CKAN’s harvesting infrastructure. To make the system easily and robustly scalable when other data providers – such as cities and cantons – join in future, they designed an architecture with a central CKAN installation harvesting from two satellite installations, which themselves harvest from the other systems.

As well as the custom harvesters, they also wrote a number of other custom extensions to adjust the look and feel of the site. Like CKAN itself, all their extensions are openly licensed under the GNU Affero General Public License (AGPL), and they have contributed to the core code, particularly in the area of CKAN’s multilingual capability – essential in a country like Switzerland with four national languages.

At the moment Liip is integrating datasets and webservices of two offices of the canton of Zurich into the federal pilot portal, and helping the city of Zurich to migrate their current open government data portal to a state-of-the-art solution using CKAN.

opendata.admin.ch marks a significant milestone in Open Data in Switzerland. Only a week before its launch, the National Council (the lower house of Switzerland’s parliament) voted by a large majority in favour of an ‘Open Government Data masterplan’. Hopefully we will be hearing much more of Swiss open data in the future.

EDaWaX: Choosing CKAN for managing research data

Hendrik Bunke - September 25, 2013 in Deployments, Extensions, Featured

This is a guest post by Hendrik Bunke of the EDaWaX project, cross posted from the project blog. EDaWaX is a German project which aims to greatly increase the amount of research data in Economics that is made open.



One aim of EDaWaX is to develop and implement a web-platform prototype for a publication-related research data archive. We’ve chosen CKAN – an open source data portal platform – as the basis for this prototype.

This post describes the reasons for this decision and tries to give some insights into CKAN, its features and technology. We’ll also discuss these features both in regard to our special use case and to suitability for research data management in general.

Before you proceed, it might be useful to have at least a short look at an article that covers a similar topic far more extensively than this blog post. It is written by Joss Winn and titled Open Data and the Academy: An Evaluation of CKAN for Research Data Management. The paper was made available on Google Drive so others could comment on or even add to the article. I’m also mentioning Joss’s paper to show that there is already an ongoing discussion – for example in this mailing list thread – about how to adapt CKAN, which is at the moment mainly used for government data, for research data management.

This post focuses on our special EDaWaX perspective and does also provide some more technical introduction (installation, writing extensions, using the API etc.). In addition we describe our own CKAN extensions that add basic theme customisations and custom metadata fields.

We hope this will be useful for those who are looking for a decent solution for a research data repository and have heard only a little or (most probably) nothing at all about CKAN yet.

EDaWaX criteria for research data archive software

We won’t go into detail about the EDaWaX project here. In short, EDaWaX is looking for ways to publish and curate research data in economics. Our focus is on publication-related data, meaning especially the data that authors of journal papers have used for their articles. One objective of the project is the development of a data archive for journals using an integral approach.

Our projected web application should demonstrate some features that the EDaWaX studies revealed to be important for replication purposes. We evaluated several software packages dealing with data publishing and had only a few, very general but fundamental, criteria for the software:

  • Open Source: This is a fundamental principle for us, but there are also practical reasons for this. We want to be able to modify and extend the software, and we would like to share our extensions.

  • API (reading and writing): This is quite important for a modular and flexible infrastructure. We also want to provide integration packages for other systems (CMS or special e-journal software). We think that research data must not just be stored in, perhaps closed, ‘data-silos’, but should be accessible and reusable as much as possible. An API opens up a lot more possibilities for this purpose.

  • Simple User Interface: We are mainly targeting authors and editorial offices who don’t have the time, resources and know-how to learn and use complicated UIs and workflows. This is also important for lowering the general barriers for publishing research data.

  • RDF metadata representation: We are aware that this might be a somewhat ‘avant-garde’ criterion. But we predict that it will be more and more important in the near future to have a general, linkable and machine-readable metadata interface, so our research data can be used and adopted as widely as possible.

The main ‘opponents’ of CKAN in this small ‘contest’ were Dataverse and Nesstar. But while both are well-established platforms dedicated especially to research data (which CKAN is not), neither met most of our criteria. Nesstar is proprietary rather than open software, you have to pay for the server component, and the only way to upload data is via the so-called ‘Publisher’, a Windows-only client. That’s a no-go for us. Dataverse’s main problem compared to CKAN (besides the fact that it is an unfriendly Java-beast of software ;-)) is the lack of a decent API. There is now at least a reading API (since March 2012), but you cannot use it to upload data. So, in the end there was no question which software we would choose: CKAN. Let’s see in detail why.

What is CKAN?

CKAN — an abbreviation for “Comprehensive Knowledge Archive Network”, which does not exactly describe its actual use-cases today — is an open-source (check) web platform for publishing and sharing data. Written in Python, it offers a simple, nice-looking, and very friendly user interface (check) and provides by default an RDF metadata representation for each dataset (check). The feature that, in our view, makes CKAN really outstanding is its API, which allows access to nearly every function of the system, including writing and deleting datasets (check; more on that later).
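
For instance, reading a public dataset is a single GET against the action API (writing works the same way via POST, plus an API key in the Authorization header). The opener is injectable here so the sketch can be exercised without a live portal:

```python
# Sketch: reading one dataset through CKAN's action API (package_show).
import json
from urllib.request import urlopen

def package_show(base_url, dataset_id, opener=urlopen):
    """Fetch a dataset dict from a CKAN portal's action API."""
    url = f"{base_url}/api/3/action/package_show?id={dataset_id}"
    with opener(url) as resp:
        body = json.load(resp)
    if not body.get("success"):
        raise RuntimeError(body.get("error"))
    return body["result"]

# e.g. against a public portal:
# package_show("https://datahub.io", "some-dataset")["resources"]
```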

CKAN has many, many more features that could be listed here, including harvesting, data visualisation and preview, full-text and fuzzy search, and faceting. It is widely used, mainly in the field of open government data, where it has become a de facto standard software package. CKAN powers the open data portals of the UK, the USA, Australia and the EU, to name only a few well-known examples. The CKAN website has an impressive list of all known production instances.

The very active development of CKAN is led and organised by the Open Knowledge Foundation (OKF). There are around ten people at OKF who are mainly working on CKAN, most of them developers, and in addition at least 30-40 developers are contributing actively.

If you want to contribute to CKAN or develop an extension you should subscribe to the developer mailing list. There’s also a general discussion list, and if you want to use CKAN for research data management (which, in the end, is what this article is all about) please immediately subscribe to the recently established ckan4rdm list mentioned above.

CKAN’s source code can be found on GitHub. It is written very cleanly (enforced by clear coding standards). As a reasonably experienced Python developer you won’t have major difficulties understanding the code.

If you just want to test the front end, i.e. the user interface of CKAN, you don’t have to install it yourself. OKF is running the public open data portal datahub.io, where you can register and upload data. The portal is not only for testing purposes. For example, all the RDF data of the Linked Data cloud is registered there. There’s also a ‘pure’ demo site that gives you a first impression.

However, if you’re really considering CKAN to power your open data portal you should of course install it yourself.

Installation

Before you start to install and use CKAN, please have a look at its extensive and excellent documentation. We will only give a few hints here as a starting point.

If you have decided to give CKAN a try, you have two install options. If your machine runs 64-bit Ubuntu 12.04, you can install all needed packages via apt-get. The package installation does all the necessary basic configuration for you, so it might be more convenient; however, it is also less flexible, so we would not recommend it. Moreover, if you want to develop your own CKAN extension (more on that later) or use another OS platform, you must use the second method and install CKAN from source. I didn’t find that to be too difficult and, again, if you have some experience as a developer it will be no real problem. In this scenario CKAN is installed via git and pip in a virtualenv, which you will most probably already be familiar with. The application is then run with paster. Under the hood, CKAN uses the Python framework Pylons (which has since merged with another framework to become Pyramid; but that’s another story).

In addition to the core package of CKAN, you will have to install and configure some packages that CKAN requires. Nothing fancy here, though. CKAN uses PostgreSQL as its database, and for searching and indexing it relies on Solr, which involves the installation of a Java JDK and a Java application server like Jetty or Tomcat. It’s worth mentioning that you can run CKAN without Solr, but you’ll lose a lot of advanced search functionality like faceting for instance. The same goes for the database. Besides the almighty PostgreSQL you can also use the lightweight SQLite. This is quite handy for testing purposes or for development, but not recommended or supported for production installations.

API

As mentioned before, we think it’s the API that makes CKAN really outstanding. “All of a CKAN website’s core functionality (everything you can do with the web interface and more) can be used by external code that calls the CKAN API”, as the documentation states. And that’s true. You can

  • get all sorts of lists for packages, groups or tags;

  • get a full metadata representation of any dataset or resource (which is the actual data or file);

  • do all the kinds of searches you can do with the web interface;

  • create, update and delete datasets, resources and other objects. I’m emphasizing this because it’s really a killer feature, which enables you to develop your very own application based on API calls to an external CKAN installation. It makes, for example, mobile apps possible. Or you can write plugins for your local CMS, journal system or whatever.

From a programmer’s perspective this is just great, great, great. And even with our focus, open research data management, it enables a lot more usage scenarios than a simple web portal with a closed, proprietary database would do.

And in fact, it is quite easy to use the API. There are client libraries for all common web programming languages (Python, Java, Perl, PHP, JavaScript, Ruby), so you don’t need to write the basic functions on your own. A very simple Python script like the one below is sufficient to upload a file to a CKAN instance:

import ckanclient

# Connect to the CKAN instance; fill in your own API key and base URL.
ckan = ckanclient.CkanClient(
    api_key='<your_CKAN_api_key>',
    base_location='<url_of_CKAN_instance>/api')

# Upload a local file and show the server's response.
upmsg = ckan.upload_file('<your_local_filename>')
print(upmsg)  # this is not necessary ;-)
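If you prefer not to depend on a client library, the same API can be reached with plain HTTP. Here is a minimal sketch, using only the Python standard library, of running a dataset search; package_search is CKAN’s documented action name, while the demo URL is a placeholder you would replace with a real instance:

```python
# Sketch: querying CKAN's Action API over plain HTTP, no client library.
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query, rows=10):
    """Build the URL for a CKAN package_search call."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return "%s/api/3/action/package_search?%s" % (base_url.rstrip("/"), params)


def parse_search_response(raw_json):
    """Extract the hit count and dataset names from a package_search response."""
    body = json.loads(raw_json)
    if not body.get("success"):
        raise RuntimeError("CKAN API call failed")
    result = body["result"]
    return result["count"], [pkg["name"] for pkg in result["results"]]


# To actually run the query (network access required):
# raw = urllib.request.urlopen(
#     build_search_url("https://demo.ckan.org", "economics")).read()
# count, names = parse_search_response(raw)
```

Every action call returns the same JSON envelope with a “success” flag and a “result” object, which keeps error handling uniform across the whole API.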

For demonstration and testing purposes we’ve developed a small sample application. It was built with the Pyramid framework and can completely manage the datasets of a certain group at an external CKAN instance. The demo pics show the list of the packages and a form to create a new dataset. Since this instance is for developing and evaluation purposes only it’s not public, but hopefully the pics will give you a first impression of what’s possible.

Custom application using the API: list of packages


Pyramid app: add_form

It’s worth mentioning that, of course, writing your own application around the CKAN API also allows you to simply add features CKAN might not have. So, for instance, the little red X mark at the right side of all packages (screenshot #1) enables a direct deletion of the package. That’s something CKAN’s UI does not offer by default.

OK, you’re saying, I got it, the API is great. But I don’t want to program an external application. I just want to stick with the original platform, but I need a different look, and even more special functionalities. So, is CKAN extensible?

Short answer: Yes.

Long Answer: Writing Extensions

Adding a custom theme or functionality is done with so-called extensions. CKAN extensions are ordinary Python packages containing one or more plugin classes. You can create them with paster in your virtual environment.

paster create -t ckanext ckanext-mycustomextension

Note that you must use the prefix ckanext-; otherwise CKAN won’t detect and load your package. You then have to install it as a development package in the Python path of your virtual environment. That’s done the usual way with

cd <path_to_your_extension>

python setup.py develop

or even

pip install -e <path_to_your_extension>

Please refer to the docs for a detailed description of writing extensions. Basically you use the so-called plugins toolkit and a whole bunch of interface classes to hook your own code into CKAN’s core functionality. You will most probably also need to override some Jinja templates, especially if you want to create a new look for your portal.
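To give a first idea of the mechanism, a minimal plugin class might look roughly like the sketch below; it uses the IConfigurer interface to register an extension’s own template directory, and it is of course only runnable inside a CKAN installation:

```python
# Minimal sketch of a CKAN plugin class (runs only inside a CKAN install).
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


class MyCustomExtensionPlugin(plugins.SingletonPlugin):
    # Hook into CKAN's configuration phase.
    plugins.implements(plugins.IConfigurer)

    def update_config(self, config):
        # Let templates in our extension's 'templates' dir override core ones.
        toolkit.add_template_directory(config, 'templates')
```

The class then has to be registered as an entry point in the extension’s setup.py and enabled via the ckan.plugins option in the CKAN config file.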

CKAN provides some basic example extensions that will quickly give you a rough understanding of how the plugin mechanism works. In addition there are many (many!) CKAN extensions already available. You can browse them on GitHub.

So, what are the extensions we are developing for EDaWaX?

EDaWaX extensions and implementations

Basically we are working on two extensions for EDaWaX at the moment. The first one, called ckanext-edawax, is mainly for the UI. It tweaks some templates and UI elements (logo, fonts, colors etc.). In addition it removes elements we do not need at the moment, like ‘groups’ or facet fields, and it renames the default ‘Organizations’ to ‘Journals’, since this is our only type of organization and we’d like to reflect this focus. We will also add new elements, like a proposed citation in the dataset view. You can get an idea of the prototype from these screenshots.


EDaWaX custom frontpage


EDaWaX datasets view


EDaWaX single dataset view

Our second package, ckanext-dara, relates to metadata. CKAN offers only a general and limited set of metadata fields for datasets (like title, description, author) that does not reflect any common schema. You can, nevertheless, add arbitrary fields via the web interface for each dataset, but that’s not schema-based. CKAN’s approach here is to avoid extensive metadata forms that might restrict the usability of the portal, and not to specialise in certain types or categories of data, like, you name it, research data. Dedicated research data applications like Dataverse have an advantage here: Dataverse’s metadata forms are based on the well-known and very extensive DDI schema. CKAN is not originally a research data management tool, and the lack of decent metadata schema support is one point where this hurts. However, this more general approach, together with the plugin infrastructure (a feature that Dataverse does not offer, AFAIK), enables us to customise the dataset forms, add specific (meta-)data, and guarantee compatibility with a given schema. For EDaWaX this will be the da|ra schema, which itself is partially based on the well-known, but less complex, DataCite schema. The Germany-based da|ra is basically a DOI registration service for social science and economic data. Since we will automatically register DOIs for our datasets in the CKAN portal with da|ra, it makes perfect sense to use their schema (which we must do anyway when submitting our metadata).

ckanext-dara is the CKAN extension where all the metadata functionality as well as the DOI registration will be added. It is also planned to publish this package as open source on GitHub. So far development has concentrated on extending the standard CKAN dataset forms with da|ra-specific metadata. The problem here is the conflict between usability and the aspiration to collect as much metadata as possible. You know, we are working in a library. For librarians metadata is important. Very important. You could say that librarians think in metadata. We want every single detail of an object to be described in metadata, if possible in very cryptic metadata. Since metadata schemas are often (if not always) created by librarians, they tend to be rather exhaustive. The current da|ra schema defines more than 100 elements, which is still a small set compared to DDI, which defines up to 846 elements. Now please imagine you’re a scientist or the editor of a scientific journal who is asked to upload research data to our CKAN-based platform. Would you like to see a form that asks for ~100 metadata elements? You certainly wouldn’t. Chances are good that you would immediately (and perhaps cursing) leave the site and forget about this open data thing for the next two decades or so.

To deal with this conflict we are following a twofold approach. First, we are dividing the da|ra metadata schema into three levels reflecting how necessary the elements are. Level 1 contains very basic metadata elements that are mandatory, like title, main author, or publication year. These level-1 fields correspond to the mandatory fields of the DataCite metadata schema, so for this level we only need two or three new fields in addition to the ones already implemented in CKAN. Level 2 contains elements that are necessary (or perhaps better: desirable) for the searchability of the dataset or describe special properties of it. Finally, level 3 comprises the special metadata we need to ensure future reuse by integrating authority files. With these authority files it should be possible to link persons to their affiliations, to articles, to research data, to keywords, or to special fields of research.
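The three-level split can be sketched as a simple data structure with a validator for the mandatory level. Note that the field names below are illustrative placeholders, not the actual da|ra schema element names:

```python
# Illustrative sketch of the three metadata levels described above.
# Field names are examples only, not the real da|ra element names.
DARA_LEVELS = {
    1: ["title", "author", "publication_year"],             # mandatory (DataCite core)
    2: ["keywords", "geographic_coverage", "time_period"],  # aids searchability
    3: ["author_gnd_id", "related_articles", "jel_codes"],  # authority-file links
}


def missing_mandatory(metadata):
    """Return the level-1 fields a dataset's metadata still lacks."""
    return [field for field in DARA_LEVELS[1] if not metadata.get(field)]
```

Only an empty list from the level-1 check would allow the dataset form to be submitted; levels 2 and 3 stay optional and collapsed.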

Second, we will try to integrate these different levels of metadata as seamlessly as possible into CKAN’s UI. The general idea is to give users the choice of which metadata functionalities they would like to equip their data with. To achieve this we collapse the subforms for levels 2 and 3 in the dataset form with a little help from jQuery. The following screenshots give you an idea. Please note that this is an early stage of development and nothing is finalised yet. We have not implemented level 3 yet. However, you can see that the form for level 2 (and later for level 3) is collapsed by default, so the “quick’n’dirty” user won’t have to deal with it if she does not want to. We are still thinking, however, about the motivation/information text and its presentation.


ckanext-dara addform


ckanext-dara addform extended

Journal CMS add-on

In addition to building the open data portal itself we are planning to develop an add-on for an existing e-journal, using the CKAN API. This will be done for the CMS Plone, which is the basis of the e-journal ‘Economics’. It should mainly be a test case for the usability of the CKAN API for editorial offices. The editors of ‘Economics’ have some experience with Dataverse (and are not always happy with it), so we have a very good setting here. Generally we consider integration into third-party systems to be very important for the acceptance of CKAN as a repository for publication-related research data. Users should not be bothered with having to use two (or even more) different systems for data and text, and this approach also gives the maximum integration of data and articles. Dataverse, for example, will develop such functionality for OJS (Open Journal Systems), hopefully within the next two years. CKAN has a kind of head start here thanks to its great API, but I think we need to popularise CKAN in this respect, so this package will be developed as a proof of concept.
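The core of such an add-on is just an authenticated POST to the API. As a hedged sketch (standard library only; the action name package_create and the Authorization header are CKAN’s documented API, while the URL and key are placeholders), a CMS could push a dataset like this:

```python
# Sketch: how a CMS add-on could create a dataset on a remote CKAN instance.
import json
import urllib.request


def build_create_request(base_url, api_key, dataset):
    """Build (but do not send) a package_create POST request."""
    url = "%s/api/3/action/package_create" % base_url.rstrip("/")
    body = json.dumps(dataset).encode("utf-8")
    req = urllib.request.Request(url, data=body)
    # CKAN authenticates write operations via the user's API key.
    req.add_header("Authorization", api_key)
    req.add_header("Content-Type", "application/json")
    return req


# Sending it (network access and a valid API key required):
# req = build_create_request("https://ckan.example.org", "<api_key>",
#                            {"name": "my-dataset", "title": "My dataset"})
# response = json.load(urllib.request.urlopen(req))
```

Because the request is plain HTTP plus JSON, the same logic ports directly to whatever language the host CMS is written in.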

Conclusion

At least for our use case in the EDaWaX context, CKAN has so far proven to be the best available solution for an open research data portal. Due to its current focus on open government data, it still leaves something to be desired for research data use. CKAN is focused on data publishing, not data curation, and this shows up clearly in its very basic support for metadata. But as we’ve shown, CKAN has two fantastic features – the API and the plugin mechanism – that facilitate the development of extensions and third-party apps for use cases in the field of research data. Development in this direction has already started (not only in our project) and it’s foreseeable that those efforts will be ramped up soon.

So, if publishing open research data is on your agenda, please consider CKAN and give it a serious try. It’s worth it. If you have any questions, comments or criticism please leave a note in the comments section. Please also feel free to write an email to h.bunke <at> zbw.eu in case you have a more specific question.

Open Data in Bermuda: local developers create Bermuda.io

Mark Wainwright - September 25, 2013 in Deployments, News

Two developers based in Bermuda have launched a new online CKAN-based repository of Bermuda public open data, Bermuda.io.

Louis Galipeau and Andrew Simons (below) took key public documents which were not previously available online, and put them on the site, which they set up for the purpose. Previously, members of the public could consult the documents and data only by looking at hard copies at the Bermuda National Library and Bermuda Archives. With the launch of Bermuda.io, users can now freely view or download them anywhere. Where necessary, documents have been scanned to get an electronic copy.

[Photo: Galipeau and Simons]

Galipeau and Simons plan to publish a wide range of public data in both human- and machine-readable formats. They are currently compiling two decades’ worth of financial statements from all government-controlled organizations and public funds. The following have already been published, with records going back 20 years:

  • The annual report of the Auditor General
  • The “Budget Book” (Estimates of Revenue and Expenditure for the Year)
  • The audited financial statements of the Consolidated Fund of The Government of Bermuda
  • The Bermuda Digest of Statistics
  • The Census Report

The developers have started with these documents because they go to the heart of the operations of Government. The Budget Book details planned revenue and spending, while the audited financials of the Consolidated Fund show the actual figures. The Auditor General’s report provides an independent opinion on the government’s financial management. Finally, the Digest of Statistics and the Census report contain economic and demographic data which provide important context.

Andrew Simons says the documents will enable people to have more informed discussions and debates. “They answer questions like ‘What’s the revenue from lobster licences each season?’, ‘What school renovations are planned for the following year?’ and ‘How much does a firefighter earn?’” He hopes the site will foster wider civic engagement on the island – and adds that it would not have been possible without CKAN.

The developers welcome feedback and suggestions on the site. Anyone interested can follow the project blog or Twitter account.

Welcome to the new-look Datahub

Mark Wainwright - July 1, 2013 in Deployments, News

Yesterday the Datahub was updated to run CKAN 2.0.1, the latest version of CKAN. This means that Datahub users can take advantage of the shiny new features in CKAN 2.0, such as the ability to follow datasets and groups, activity streams, dashboards, improved UI, and more. The Datahub has also had a makeover with a custom theme. We think it looks rather swish.

What is it?

The Datahub is a free, public platform for cataloguing and publishing data (of course, using CKAN). Its ‘groups’ can be used to bring together collections of data in a particular area, or even for lightweight data publishing by small organisations. As with a wiki, the ability of any user to edit metadata (e.g. to fix broken links) helps keep it relevant and up-to-date.

New features

Using CKAN 2.0 brings major improvements to the Datahub, such as:

  • Improved, clearer interface for adding new datasets.
  • Activity streams, letting users follow datasets and groups of interest. For instance, you could make a Datahub Group for data you publish on another website. People could use it to search your data and get notifications when you publish new or updated data.

For a full list of what’s new in CKAN 2.0, see this post.

Note: at present, CKAN 2.0’s “Organizations” are not available on the Datahub. Organizations are about controlling authorisation, whereas the Datahub is about collaborative editing by anyone. (If you want control, you probably want your own CKAN!) However, we’re looking at ways to make Organizations available to selected users in the future.

Using the Datahub

If you haven’t used the Datahub before, it’s easy to sign up and start registering data, which can be published anywhere on the Internet. (Data can’t be stored directly on the Datahub at the moment.) You may find the CKAN User Guide handy, and you can practice adding datasets, groups etc on the CKAN Demo.

Backup and more help

The upgrade shouldn’t cause any problems, but just in case, the old version of the site is still available, in read-only mode, at http://old.datahub.io. It will stay there for a month in case of any problems. (If, for some reason, you need the old site to be up for longer than a month, please get in touch.)

If there are features or extensions you’d like to see enabled on the Datahub, why not discuss them on the mailing lists?