A couple of weeks ago, James Gardner (thejimmyg) created a proof-of-concept implementation of a small
WSGI application, jsondataproxy, which transforms URL resources into a common structured data format: JSON or JSON-P.
The main purpose of the data proxy is to serve the data contained in CKAN resources for data previews and for a raw data API. The raw data API should return data in a normalized, common structured format: regardless of the resource type and format, the requester receives JSON (or, in the future, any other output format that the data proxy supports and the requester explicitly asks for).
The data proxy runs as a separate web service. It does not consume CKAN server resources and can be reused by several CKAN instances or by other similar data services. The concept of the raw data API is depicted in figure 1:
- A user or third-party application requests raw data from CKAN:
GET ckan.net/api/data/RESOURCE_ID
The request is handled by CKAN and routed to the ckanext dataapi module.
- The dataapi module constructs the data proxy request URL and returns it to the requester as an HTTP 302 redirect reply. The reply has its Location: header set to the dataproxy URL with the appropriate parameters.
- The requesting application should follow the 302 redirect and ask the data proxy for the data.
- The data proxy streams the data from the resource through a data transformer (see below) and replies with the data in a common structured form, JSON.
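The redirect step above can be sketched roughly as follows. This is only an illustration, not the actual dataapi code: the function name build_redirect and the base address in DATAPROXY_URL are hypothetical, and only the url and type parameters come from the description of the proxy further below.

```python
from urllib.parse import urlencode

# Hypothetical address of the data proxy service (an assumption for this sketch).
DATAPROXY_URL = "http://dataproxy.example.org/"


def build_redirect(resource_url, resource_type=None):
    """Return (status, headers) for a 302 reply that points at the data proxy."""
    params = {"url": resource_url}
    if resource_type:
        # Optional hint about the resource type, e.g. "csv" or "xls".
        params["type"] = resource_type
    location = DATAPROXY_URL + "?" + urlencode(params)
    return "302 Found", [("Location", location)]


status, headers = build_redirect("http://example.org/data.csv", "csv")
print(status)
print(headers[0][1])
```

A client that follows redirects automatically would then fetch the JSON directly from the data proxy without any further involvement of CKAN.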
Proposal for CKAN Data API:
It might be helpful if package resources had names or identifiers as well:
api/data/PACKAGE/RESOURCE_REFERENCE
Possible built-in resource references might be:
- 'default' - reserved keyword for the only resource if there is just one or, if there are more, the first resource or the one flagged as 'default'
- 'latest' - to be able to access the latest resource within a package (or 'actual' or 'last'?)
- alphanumeric identifier (not starting with a number)
- number - index of the resource as a human visitor sees it on the page
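One possible way to resolve such references is sketched below. Everything in it is an assumption made for illustration: resources are plain dictionaries with hypothetical keys name, default and created, a flagged resource takes precedence over the first one for 'default', 'latest' means the most recent created value, and numeric references are counted from 1 as a visitor would count them on the page.

```python
def resolve_resource(resources, reference="default"):
    """Pick one resource from a package's resource list (in display order)."""
    if not resources:
        raise LookupError("package has no resources")

    if reference == "default":
        # Assumed precedence: an explicitly flagged resource, else the first one.
        flagged = [r for r in resources if r.get("default")]
        return flagged[0] if flagged else resources[0]

    if reference == "latest":
        # Assumed meaning of 'latest': the most recently created resource.
        return max(resources, key=lambda r: r["created"])

    if reference.isdigit():
        # Numeric reference: 1-based position as seen on the page (assumption).
        index = int(reference)
        if not 1 <= index <= len(resources):
            raise LookupError("no resource at position %d" % index)
        return resources[index - 1]

    # Otherwise treat the reference as an alphanumeric identifier.
    for resource in resources:
        if resource.get("name") == reference:
            return resource
    raise LookupError("unknown resource reference: %r" % reference)
```

Because identifiers may not start with a number, a purely numeric reference can never be confused with a name.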
Data Proxy Internals
The service handles requests with one required parameter, url, which specifies the resource to be transformed. After a request is received, the resource type is determined from the file extension or from the provided type parameter(*). Based on the resource type, the appropriate data source streaming object is selected. Data is read from the source and transformed into a list of rows, which is returned in the JSON reply.
The data proxy handles resource redirects. That is, if the resource server replies with a 302 redirect, dataproxy follows it and still returns the data.
(*) In the future, the HTTP Content-Type header will be used instead.
Data transformations are handled in a modular fashion through registered data transformers. Adding a new data type is a matter of providing transformer metadata (type name, accepted MIME and file types, and the name of the class handling the transformation).
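As a rough sketch of how such a registry could work (this is not the actual jsondataproxy code): the names TRANSFORMERS, register_transformer, CSVTransformer, find_transformer and transform are all hypothetical, and the shape of the JSON reply is an assumption.

```python
import csv
import io
import json
import os
from urllib.parse import urlparse

# Registry of transformer metadata, keyed by type name.
TRANSFORMERS = {}


def register_transformer(name, extensions, mime_types, cls):
    """Register metadata describing one transformer."""
    TRANSFORMERS[name] = {
        "name": name,
        "extensions": extensions,
        "mime_types": mime_types,
        "class": cls,
    }


class CSVTransformer:
    """Turns a text stream of CSV data into a list of rows."""

    def rows(self, stream):
        return [row for row in csv.reader(stream)]


register_transformer("csv", ["csv"], ["text/csv"], CSVTransformer)


def find_transformer(url, type_name=None):
    """Select a transformer from the type parameter or the file extension."""
    if type_name is None:
        extension = os.path.splitext(urlparse(url).path)[1]
        type_name = extension.lstrip(".").lower()
    for entry in TRANSFORMERS.values():
        if type_name == entry["name"] or type_name in entry["extensions"]:
            return entry["class"]()
    raise ValueError("unsupported resource type: %r" % type_name)


def transform(url, stream, type_name=None):
    """Return a JSON reply holding the rows (reply layout is an assumption)."""
    rows = find_transformer(url, type_name).rows(stream)
    return json.dumps({"url": url, "data": rows})


print(transform("http://example.org/a.csv", io.StringIO("a,b\n1,2\n")))
```

With this layout, supporting another format only requires a new class with a rows method and one register_transformer call.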
Accepted dataproxy parameters:
For XLS:
- worksheet - worksheet number
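Parsing these parameters from the query string of a request might look roughly like this. The function name parse_params is hypothetical, and treating worksheet as a plain integer is an assumption, since the numbering scheme is not specified here.

```python
from urllib.parse import parse_qs


def parse_params(query_string):
    """Extract the dataproxy parameters from a raw query string."""
    query = parse_qs(query_string)
    if "url" not in query:
        # url is the only required parameter.
        raise ValueError("missing required parameter: url")
    params = {
        "url": query["url"][0],
        "type": query.get("type", [None])[0],
    }
    if "worksheet" in query:
        # Only meaningful for XLS resources.
        params["worksheet"] = int(query["worksheet"][0])
    return params


print(parse_params("url=http%3A%2F%2Fexample.org%2Fbook.xls&worksheet=2"))
```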
Data Proxy plans:
- make use of the server-provided Content-Type header
- handle zipped files
- handle more types of resources, such as Google Docs
- provide more interesting metadata for resource preview (for example basic data quality audit information, such as field completeness)
The backend for handling data transformations will use Data Brewery data streams. Advantages:
- streaming from more types of data resources
- data auditing
- provide more metadata where available (fields, types,...)
When we decide to do off-line resource processing and archiving, code from the data proxy and Brewery can be reused. From the user's perspective it might look like a document upload on Scribd or SlideShare: the document is queued and "magic" happens in the background, which will not only archive the resource but also provide information such as:
- data preview
- field list (can be made searchable in CKAN later)
- data quality information (such as the percentage of filled cells in a column)
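To make the last item concrete, a field completeness figure could be computed along these lines. This is a minimal sketch, not Data Brewery's auditing code: the function name field_completeness is hypothetical, and a cell is assumed to count as filled when it is neither None nor blank.

```python
def field_completeness(fields, rows):
    """Return the percentage of filled cells for each field."""
    filled = dict.fromkeys(fields, 0)
    total = 0
    for row in rows:
        total += 1
        for field, value in zip(fields, row):
            # Assumption: None and whitespace-only strings count as empty.
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: 100.0 * filled[field] / total for field in fields}


rows = [["1", "a"], ["2", ""], ["3", None], ["4", "d"]]
print(field_completeness(["id", "name"], rows))
```

Since the rows are consumed one at a time, the same loop would also work on a stream of rows rather than a list held in memory.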