Raw Data in CKAN Resources and Data Proxy

Stefan Urbanek
20 Apr 2021

A couple of weeks ago, James Gardner (thejimmyg) created a proof-of-concept implementation of a small WSGI application, jsondataproxy, which transforms URL resources into a common structured data format: JSON or JSON-P. The main purpose of the data proxy is to provide the data contained in CKAN resources for data previews and for a raw data API. The raw data API should return data in a normalized, common structured format:

Regardless of the resource type and format, the requester receives JSON (or, in the future, any other output format supported by the data proxy and explicitly requested).
The data proxy runs as a separate web service. It does not consume CKAN server resources and can be shared by several CKAN instances or by other similar data services.

The concept of the raw data API is depicted in figure 1:

A user or third-party application requests raw data from CKAN: GET ckan.net/api/data/RESOURCE_ID. The request is handled by CKAN and routed to the ckanext dataapi module.
The dataapi module constructs the data proxy request URL and returns it to the requester as an HTTP 302 redirect reply. The reply has its Location: header set to the data proxy URL with the appropriate parameters.
The requesting application should follow the 302 redirect and ask the data proxy for the data.
The data proxy streams the data from the resource through a data transformer (see below) and replies with the data in a common structured form, JSON.
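
The redirect step above can be sketched roughly as follows. This is only an illustration, not the dataapi module's actual code: the proxy address dataproxy.example.org and the function name build_redirect are assumptions made for the example, while the url, type and max-results parameter names come from the parameter list later in this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical data proxy location; the post does not name a real endpoint.
DATAPROXY_URL = "http://dataproxy.example.org/"


def build_redirect(resource_url, resource_type=None, max_results=None):
    """Return (status, headers) for a 302 reply pointing at the data proxy."""
    # 'url' is the one obligatory data proxy parameter.
    params = {"url": resource_url}
    if resource_type is not None:
        params["type"] = resource_type
    if max_results is not None:
        params["max-results"] = str(max_results)
    # urlencode percent-escapes the resource URL so it survives as a query value.
    location = DATAPROXY_URL + "?" + urlencode(params)
    return "302 Found", [("Location", location)]


status, headers = build_redirect("http://example.org/data/budget.csv", max_results=10)
print(status)
print(headers[0][1])
```

A client that follows redirects automatically needs no special handling; others have to read the Location: header and issue the second request themselves.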

Proposal for CKAN Data API

It might be helpful if package resources had names or identifiers as well: api/data/PACKAGE/RESOURCE_REFERENCE. Possible built-in resource references might be:

'default' - reserved keyword for the only resource if there is just one; otherwise the first resource, or the one flagged as 'default'
'latest' - to access the latest resource within a package (or 'actual' or 'last'?)
alphanumeric identifier (not starting with a number)
number - index of the resource as a human visitor sees it on the page

Data Proxy Internals

The service handles requests with one obligatory parameter, url, which specifies the resource to be transformed. After a request is received, the resource type is determined from the file extension or from the provided type parameter(*). Based on the resource type, an appropriate data source streaming object is selected. Data are read from the source and transformed into a list of rows, which is returned in the JSON reply. The data proxy also handles resource redirects: if the resource server replies with a 302, the data proxy follows the redirect and handles the request correctly.

(*) This will change in the future to use the HTTP Content-Type.
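
The type-detection step described above might look something like the sketch below. It is not taken from jsondataproxy: the function name resolve_type is hypothetical, and giving the explicit type parameter precedence over the file extension is an assumption, since the post does not say which one wins.

```python
import posixpath
from urllib.parse import urlsplit

# The two resource types the post lists as currently implemented.
SUPPORTED_TYPES = {"csv", "xls"}


def resolve_type(url, type_param=None):
    """Pick the resource type from the 'type' parameter or the URL's file extension."""
    if type_param:
        # Assumption: an explicitly supplied type overrides the extension.
        candidate = type_param
    else:
        # Look only at the path, so a query string does not hide the extension.
        path = urlsplit(url).path
        candidate = posixpath.splitext(path)[1].lstrip(".")
    candidate = candidate.lower()
    if candidate not in SUPPORTED_TYPES:
        raise ValueError("unknown or missing resource type: %r" % candidate)
    return candidate


print(resolve_type("http://example.org/files/report.CSV?rev=2"))
print(resolve_type("http://example.org/download", type_param="xls"))
```

A URL with no extension and no type parameter raises ValueError here; a real service would presumably turn that into an error reply.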

Data transformations are handled in a modular fashion through registered data transformers. Adding a new data type is a matter of providing transformer metadata (type name, accepted MIME and file types, name of the class handling the transformation).

Accepted dataproxy parameters:

type - resource file type; required when the resource URL has no file extension. Currently implemented resource types are XLS and CSV.
max-results - maximum number of rows returned from the resource. Useful for data previews.
format - output format: JSON or JSONP

For XLS:

worksheet - worksheet number
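
To make the transformer registry and the parameters above more concrete, here is a minimal sketch for the CSV case. Everything named here is hypothetical and chosen for the example: TRANSFORMERS, register_transformer, CSVTransformer, the "data" key in the reply and the callback argument are assumptions, not the actual jsondataproxy structures. The sketch also counts the header line as a row for the purposes of max-results.

```python
import csv
import io
import json

# Hypothetical registry: type name -> transformer metadata.
TRANSFORMERS = {}


def register_transformer(name, mime_types, extensions, cls):
    """Record the metadata the post mentions: name, MIME types, file types, class."""
    TRANSFORMERS[name] = {"mime_types": mime_types, "extensions": extensions, "class": cls}


class CSVTransformer:
    """Reads a text stream of CSV data into a list of rows."""

    def __init__(self, stream, max_results=None):
        self.stream = stream
        self.max_results = max_results

    def rows(self):
        result = []
        for index, row in enumerate(csv.reader(self.stream)):
            # Stop once max-results rows have been collected.
            if self.max_results is not None and index >= self.max_results:
                break
            result.append(row)
        return result


register_transformer("csv", ["text/csv"], ["csv"], CSVTransformer)


def transform(stream, type_name, max_results=None, output_format="json", callback="callback"):
    """Look up the transformer by type and serialize its rows as JSON or JSONP."""
    cls = TRANSFORMERS[type_name]["class"]
    body = json.dumps({"data": cls(stream, max_results).rows()})
    if output_format == "jsonp":
        # JSONP wraps the JSON body in a call to a client-supplied function.
        return "%s(%s);" % (callback, body)
    return body


sample = io.StringIO("year,amount\n2010,100\n2011,120\n")
print(transform(sample, "csv", max_results=2))
```

Supporting a new type in this sketch would mean writing another class with a rows() method and registering it; an XLS transformer would additionally take the worksheet parameter.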

Data Proxy plans:

make use of the server-provided Content-Type header
handle zipped files
handle more types of resources, such as Google Docs
provide more interesting metadata for resource preview (for example basic data quality audit information, such as field completeness)

The backend for handling data transformations will use Data Brewery data streams. Advantages:

stream from more types of data resources
data auditing
provide more metadata where available (fields, types, ...)

When we decide to do off-line resource processing and archiving, code from the data proxy and Brewery can be reused. From the user's perspective it might look like a document upload on Scribd or SlideShare: the document is queued and "magic" happens in the background, which not only archives the resource but also provides information such as:

data preview
field list (can be made searchable in CKAN later)
data quality information (such as the percentage of filled cells in a column)
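
As an illustration of the data quality item, the share of filled cells per column could be computed along these lines. This is a sketch under assumptions, not Data Brewery's auditing code: the function name field_completeness is hypothetical, the first row is taken to be the header, and a cell counts as filled when it is not empty or whitespace.

```python
def field_completeness(rows):
    """Return {field name: percentage of non-empty cells}; the first row is the header."""
    header, records = rows[0], rows[1:]
    result = {}
    for index, field in enumerate(header):
        # Short rows count as having an empty cell in the missing columns.
        filled = sum(
            1
            for record in records
            if index < len(record) and str(record[index]).strip() != ""
        )
        result[field] = 100.0 * filled / len(records) if records else 0.0
    return result


rows = [["name", "city"], ["Ann", "Leeds"], ["Bob", ""], ["Cy", "York"], ["Di", ""]]
print(field_completeness(rows))  # {'name': 100.0, 'city': 50.0}
```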
