
120+ CKAN Portals in the Palm of Your Hand, via the Open Data Companion (ODC)

Osahon Okungbowa - September 24, 2015 in Community, Data, News, Visualization

CKAN is a powerful open-source data portal platform which provides out-of-the-box tools that allow data producers to make data easily accessible and reusable by everyone. Making CKAN Free and Open Source Software (FOSS) has been a key factor in growing the availability and accessibility of open data across the Internet.

The emergence of mobile devices and the mobile platform has led to a shift in the way people access and consume information. Popular consensus and reports show that mobile device usage, and time spent on mobile devices, are rapidly increasing. This means that mobile devices are now one of the fastest and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.

Export Datasets from CKAN to Excel

Sean Hammond - November 9, 2014 in Data, Presentations, Tutorial

ckanapi-exporter is a new API script that we’ve developed for exporting dataset metadata from CKAN to Excel-compatible CSV files. Check out the short presentation below, and visit ckanapi-exporter for more details.
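
The tool does the heavy lifting for you, but the core idea can be approximated with the standard CKAN action API. A minimal sketch, assuming any reachable CKAN 2.x site; the demo.ckan.org URL and the chosen columns are illustrative, not ckanapi-exporter’s actual behaviour:

    # Sketch: export dataset metadata from a CKAN site to CSV via the action API.
    # The site URL and column choice are illustrative assumptions.
    import csv
    import requests

    SITE = "https://demo.ckan.org"

    # package_search returns public datasets, up to `rows` per call.
    resp = requests.get(SITE + "/api/3/action/package_search", params={"rows": 100})
    datasets = resp.json()["result"]["results"]

    with open("datasets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "title", "license_id", "notes"])
        for ds in datasets:
            writer.writerow([ds.get("name"), ds.get("title"),
                             ds.get("license_id"), ds.get("notes")])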

CKAN at OKFestival: raw data now!

Mark Wainwright - September 27, 2012 in Data, Events

Last week’s OKFest is finally over, after a hectic week of talks, workshops, films, hackathons, and more. You can read about highlights such as Hans Rosling’s brilliant talk over on the OKFN blog. The biggest challenge for me was being in two places at once on Wednesday afternoon: the CKAN workshop and a panel discussion in the Open Science stream on ‘Immediate access to raw data from experiments’, where I was on the panel, ran in nearby buildings at overlapping times. (Happily, I more or less pulled it off.)

The CKAN workshop was a surprise hit, with over 30 people crowding round to hear about CKAN’s latest features and future directions. Some went on to ask questions of CKAN developers about installing, using the API, writing extensions, and more, while others joined a discussion with Antonio Acuña, head of data.gov.uk, about starting a users’ group and about data.gov.uk’s experiences and recommendations.

The experimental data session led to a lively and interesting discussion, chaired by Panton Fellow Sophie Kershaw. From the panel, I spoke about the advantages of publishing data as soon as possible. Researchers are the biggest re-users of their own data and stand to benefit most from publishing it – provided the publication platform chosen is simple to use and provides added value.

[IMG: Panel discussion]

L to R: Sophie Kershaw, Mark Wainwright, Joss Winn, Mark Hahnel

Next, Joss Winn of the Orbital project spoke about the platform they are developing (based on CKAN) to enable immediate access to various kinds of experimental data. He stressed that immediate access need not mean immediate publication – it may not be possible to publish the data right away for various reasons. However, a good management system should help the researcher rather than be an extra burden, and make it trivial to publish later at the flick of a switch. Finally, Mark Hahnel of Figshare pointed out that funders will increasingly look at all outputs from research they fund, not just publications – and increasingly this may mean that researchers are required to publish data.

Researchers often have reasons for not publishing data – some good and some bad. But this week Ben Goldacre’s new book provides a timely reminder that data left unpublished can lead to research that does not forward the cause of knowledge, and even actively retards it. Surely almost all scientists starting out in their career hope to expand the frontiers of science, and there couldn’t be a clearer demonstration that one simple step will help: publish the data!

UK spending: is the data all there?

Mark Wainwright - September 13, 2012 in Data, Data Quality, News

Data tells stories, and some of the most interesting (and newsworthy) stories are about spending. There are buckets and buckets of spending data at data.gov.uk, the UK government’s CKAN portal. So why don’t journalists write more stories about it?

There are various reasons, of course. Trawling through and processing spreadsheets of data in the hope of finding the story you want is a lot of work. One key reason is that it is sometimes hard to find all the data and be sure it is complete and up-to-date. The data.gov.uk folks, in collaboration with the Open Knowledge Foundation’s OpenSpending project, have just made that a little easier.

Their nifty little tool shows which departments have released full spending data in line with Treasury guidelines, as well as providing direct links to the data. Read about it on the OpenSpending blog here.

[IMG: data.gov.uk departmental spending report]

CKAN at PyCon UK

James Gardner - September 23, 2011 in Data, Events

The CKAN team are organising a talk, a workshop and a sprint at the forthcoming PyCon UK conference in Coventry this weekend.

We plan to divide the CKAN slot into two parts:

  • Talk – a 30-minute description of what open data is all about, how CKAN can help, and the technical background on how we use Python and support extensions
  • Workshop – 20 minutes helping people get up and running with their own CKAN instance (perhaps adding some data and changing the theme too, if there is time)
The sprint will run from 14:20 on Saturday until 17:00 on Monday, and we’ll be concentrating on:
  1. Geospatial features
  2. Our “webstore” for hosting data
If you are coming to the conference, feel free to drop in.

Under the hood of CKAN

Open Knowledge International - May 3, 2011 in Data, News

The following is a guest post from Christophe Guéret, a member of the working group on EU Open Data.

CKAN.net is a community-based effort to create an open catalogue of public datasets. Using the web site, everyone is free to register datasets, thereby recording their existence and the possible links between them, along with extra metadata (licence, author, …). One of the nice features of CKAN is that the data about the datasets is stored in a structured and consistent way. This allows a direct export of this information as RDF data. It also won over Richard Cyganiak and Anja Jentzsch who, last year, decided to drop the wiki pages (1,2) they used to draw the LOD Cloud in favour of CKAN.

In addition to storing structured data, CKAN also offers a convenient API for accessing it. It’s a REST API which comes with bindings for Python, PHP, Perl, …, and makes it easy to get a list of packages matching some criteria. This API can, for instance, be employed to get a list of packages tagged as ‘lod-cloud’ and render them using Protovis, as shown by Ed Summers and Richard Cyganiak on this site. Another interesting use is to get the same data and reformat it into something suitable for consumption by network analysis software (c.f. this blog post). Besides offering a wide range of visualisations of the network, such software can also compute several metrics, highlighting aspects of the graph that cannot be observed by looking at individual nodes.
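
As a rough illustration of that kind of query, here is a minimal sketch using today’s action API (which postdates this 2011 post; the ckan.net host is historical and used only as a placeholder):

    # Sketch: list the packages tagged 'lod-cloud', e.g. as input for
    # network-analysis tools. Host and API version are assumptions.
    import requests

    resp = requests.get("https://ckan.net/api/3/action/package_search",
                        params={"fq": "tags:lod-cloud", "rows": 500})
    for pkg in resp.json()["result"]["results"]:
        print(pkg["name"])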

Here are two examples of rendering, the first realised by Ed Summers and Richard Cyganiak:

[IMG: Animation of the LOD Cloud, rendered with Protovis: Linked Open Data Graph]

The second realised by Rinke Hoekstra:

[IMG: Visualisation of the clusters in the LOD Cloud, rendered with Gephi: LOD Cloud Analysis]

If you haven’t tried it yet, go check the API and its documentation and start re-formatting the data from CKAN.net to make something new out of it ;-)

Data Quality: What is It?

Stefan Urbanek - January 20, 2011 in Data, Data Quality

Whether one is a journalist using data for an investigation or a government publishing its budget, it’s important that we can assess that data’s quality.

It’s also true that if I’m a user of a data catalogue, it’s very useful for me to know something about a dataset before I try to download it: not just its quality, but its characteristics, size, etc.

Resource Quality

We first have to have the data before we can measure it. A data catalogue gives us references (URLs) to various datasets and data sources, but to what extent can we use them? Here are the properties of data resources that bear on their quality:

  • Availability – can it be downloaded by a machine? Does the server reply with "404 Not Found"? (This is item 1 of the 5 stars of openness.)

  • Processability – is it in a convenient format, one that can be machine-processed into structured form, or is it in a closed, proprietary format? (This is item 2 of the 5 stars of openness.)
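
A minimal sketch of how these two checks might be automated, assuming a plain HTTP HEAD request suffices and using an illustrative (not authoritative) list of machine-readable formats:

    # Sketch: check availability (does the URL resolve?) and processability
    # (is the declared format machine-readable?) for one resource URL.
    import requests

    MACHINE_READABLE = {"text/csv", "application/json", "application/xml"}  # assumption

    def check_resource(url):
        try:
            r = requests.head(url, allow_redirects=True, timeout=10)
        except requests.RequestException:
            return {"available": False, "processable": False}
        content_type = r.headers.get("Content-Type", "").split(";")[0].strip()
        return {"available": r.status_code == 200,
                "processable": content_type in MACHINE_READABLE}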

Data Quality and Quality Dimensions

Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose.

What are the main dimensions of data quality?

  • Completeness – the extent to which the expected attributes of the data are provided. Data do not have to be 100% complete; the dimension is measured by the degree to which the data match the user’s expectations and data availability. Can be measured in an automated way (see the sketch below).

  • Accuracy – the extent to which the data reflect the real-world state. For example: a company name is a real company name, a company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings. (NB: data can be complete but not accurate.)

  • Credibility – the extent to which the data are regarded as true and credible. It can vary from source to source, and even a single source can contain both automated and manually entered data. This is not really measurable in an automated way.

  • Timeliness (age of data) – the extent to which the data are sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates inside the data source.

Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:

  • Consistency – do the facts in multiple datasets match? (somewhat measurable)

  • Integrity – can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way)
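
As promised above, here is a minimal sketch of measuring the completeness dimension for a CSV file, taking completeness as the fraction of non-empty cells per column (one reasonable operationalisation, not the only one):

    # Sketch: per-column completeness of a CSV file as the fraction of
    # non-empty cells. A crude but automatable measure.
    import csv

    def completeness(path):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        if not rows:
            return {}
        return {field: sum(1 for r in rows if (r.get(field) or "").strip()) / len(rows)
                for field in rows[0]}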

Next time we will talk about “What is acceptable data quality?”

Raw Data in CKAN Resources and Data Proxy

Stefan Urbanek - January 11, 2011 in Data

A couple of weeks ago, James Gardner (thejimmyg) created a preliminary proof-of-concept implementation of a small WSGI application, jsondataproxy, for transforming URL resources into a common structured data format: JSON/JSON-P.

The main purpose of the data proxy is to provide the data contained in CKAN resources for data previews and for a raw data API. The raw data API should return data in a normalised, common structured format: regardless of the resource type and format, the requester receives JSON (or any other output format the data proxy supports in the future, if explicitly requested).

The data proxy runs as a separate web service. It does not take up CKAN server resources and can be reused by multiple CKAN instances or by other, similar data services. The raw data API concept works as follows:

  1. User or third party application requests raw data from CKAN:
    GET ckan.net/api/data/RESOURCE_ID
    The request is handled by CKAN and routed to the ckanext dataapi module.
  2. The dataapi module constructs the data proxy request URL and returns it to the requester as an HTTP 302 redirect reply, with the Location: header set to the data proxy URL with the appropriate parameters.
  3. The requesting application should follow the 302 redirect and request the data from the data proxy.
  4. The data proxy streams the data from the resource through a data transformer (see below) and replies with the data in a common structured form: JSON.
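
From the client’s side, the whole flow collapses into a single call if the HTTP library follows redirects, as most do. A minimal sketch in Python, assuming the endpoint shown in step 1 and using RESOURCE_ID as a placeholder:

    # Sketch: a client following the flow above. `requests` follows the
    # 302 redirect to the data proxy automatically; RESOURCE_ID is a placeholder.
    import requests

    resp = requests.get("http://ckan.net/api/data/RESOURCE_ID")  # step 1
    rows = resp.json()  # steps 2-4 happen transparently; we get rows as JSON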

Proposal for CKAN Data API:

It might be helpful if package resources had names or identifiers as well:

api/data/PACKAGE/RESOURCE_REFERENCE

Possible built-in resource references might be:

  • default – reserved keyword for ‘the only resource’ when there is just one, for the first resource when there are several, or for the one flagged default
  • latest – to access the ‘latest’ resource within a package (or should it be ‘actual’ or ‘last’?)
  • an alphanumeric identifier (not starting with a number)
  • a number – the index of the resource as a human visitor sees it on the page

Data Proxy Internals

The service handles requests with one obligatory parameter, url, which specifies the resource to be transformed. After receiving a request, the resource type is determined from the file extension or from the provided type parameter(*). Based on the resource type, an appropriate data-source streaming object is selected. Data are read from the source and transformed into a list of rows, which are returned in the JSON reply.

The data proxy handles resource redirects. That is, if the resource server replies with a 302, the data proxy follows it and handles the request correctly.

(*) This will change in the future to use the HTTP Content-Type header.

Data transformations are handled in a modular fashion through registered data transformers. Adding a new data type is a matter of providing the transformer metadata: the type name, the accepted MIME and file types, and the name of the class handling the transformation.
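
A registry along those lines might look like the following sketch; the names and structure are assumptions for illustration, not the actual dataproxy code:

    # Sketch of a modular transformer registry; all names are hypothetical.
    import csv

    TRANSFORMERS = {}

    def register_transformer(type_name, mime_types, extensions, cls):
        # Register a transformer under its metadata.
        TRANSFORMERS[type_name] = {"mime_types": mime_types,
                                   "extensions": extensions,
                                   "class": cls}

    class CSVTransformer:
        def transform(self, stream, **params):
            # Turn a text stream into a list of rows.
            return list(csv.reader(stream))

    register_transformer("csv", ["text/csv"], [".csv"], CSVTransformer)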

Accepted dataproxy parameters:

  • type – the resource file type; you have to supply this when there is no file extension. Currently implemented resource types are XLS and CSV.

  • max-results – the maximum number of rows to return from the resource. Can be used for data previews.

  • format – output format: JSON or JSONP

For XLS:

  • worksheet – worksheet number
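
Putting the parameters together, a preview request to the proxy might look like this sketch (the proxy host is a placeholder):

    # Sketch: build a data proxy preview URL; the host is a placeholder.
    from urllib.parse import urlencode

    params = {"url": "http://example.org/spending.xls",  # resource to transform
              "type": "xls", "worksheet": 0,
              "max-results": 10, "format": "jsonp"}
    print("http://dataproxy.example.org/?" + urlencode(params))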

Data Proxy plans:

  • make use of the server-provided Content-Type header
  • handle zipped files
  • handle more types of resources, such as Google Docs
  • provide more interesting metadata for resource preview (for example basic data quality audit information, such as field completeness)

The backend for handling data transformations will use Data Brewery data streams. Advantages:

  • stream from many different types of data resources
  • data auditing
  • provide more metadata where available (fields, types,…)

When we decide to do offline resource processing and archiving, code from the data proxy and Brewery can be reused. From the user’s perspective it might look like a document upload on Scribd or SlideShare: the document is queued, and "magic" happens in the background which will not only archive the resource but also provide information such as:

  • data preview
  • field list (can be made searchable in CKAN later)
  • data quality information (such as % filled cells in a column)