
Pyramids, Pipelines and a Can-of-Sweave – CKAN Asia-Pacific Meetup

September 18, 2015 in Community, Featured, Presentations

Florian Mayer from the Western Australian Department of Parks and Wildlife presents various methods he is using to create Wisdom.

Data + Code = Information;
Information + Context = Wisdom

So, can this be done with workbooks, applications and active documents?

As Florian might say, “Yes it CKAN”!

Grab the code and materials related to the work from here:
http://catalogue.alpha.data.wa.gov.au/dataset/data-wa-gov-au

This presentation was given at the first Asia-Pacific CKAN meetup on the 17th of September, hosted at Link Digital, as an initiative of the CKAN Community and Communications team. You can join the meetup and come along to these fortnightly sessions via video conference.

If you have some interesting content to present then please get in touch with @starl3n to schedule a session.

1 response to Pyramids, Pipelines and a Can-of-Sweave – CKAN Asia-Pacific Meetup

  1. Just wanted to bring this quick Q&A from the dev list into the comments here for others to see :)

    From Denis:

    Great job Florian! Making all those pieces work end-to-end must have been no small feat.

    Have you thought about auto-refreshing plots and other output when new data comes in, pushing those back into CKAN as well?

    It’s something I’ve been curious about for a while – for example, letting users upload a sandboxed script, or specify an IPython notebook that takes a dataset resource as the data input and generates a ResourceView-compatible output. Then every time the resource is updated, the script would be re-run automatically. Seeing all you’ve done here, you must have thought about something similar, so I’m curious to hear your thoughts.

    From Florian:

    Thanks Denis!

    I thought long and hard about the automation of generating products from data resources. I like the idea of sandboxed ipython notebooks! With datacats providing a dockerised CKAN, auto-provisioning Jupyter docker containers on demand to develop / run iPy notebooks wouldn’t be that far away, right?

    Actually we have a semi-automated solution in place to generate products from data. I’ll add that to the CKAN-o-Sweave example soon.
    Our users each have access to their own account on an RStudio Server instance. Each of them has a local, version-controlled copy of our reporting repo (CKAN o’ Sweave’s precursor) in their RStudio Server account.
    For each dataset and resulting figure(s), we have one R script that reads the data (CSV) directly from CKAN, generates one or several figures (as PDF) from the data (or runs any kind of analysis we need) and finally uploads itself (as TXT) and the figures back to CKAN (overwriting the previous versions of code TXT and figure PDF(s)). The script sources a secret (gitignored) file with the respective user’s CKAN API key (required for upload to CKAN using ckanr::resource_update()).
    These scripts are under version control inside our reporting code repo (I’ll add an example to CKAN-o-Sweave in a similar way).
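
    To make that concrete, here is a minimal sketch of what one of these per-dataset scripts can look like. It is an illustration only: the resource IDs, file names and the secrets file name are placeholders rather than the ones from our repo, and the exact ckanr argument names should be checked against the ckanr documentation.

        ## make_figure.R – hypothetical per-dataset script (all IDs and file names are placeholders)
        library(ckanr)

        ## secrets.R is gitignored and assumed to define ckan_url and ckan_api_key
        source("secrets.R")
        ckanr_setup(url = ckan_url, key = ckan_api_key)

        data_resource_id   <- "placeholder-data-resource-id"
        figure_resource_id <- "placeholder-figure-resource-id"
        script_resource_id <- "placeholder-script-resource-id"

        ## Read the CSV data resource straight from CKAN into a data.frame
        d <- read.csv(resource_show(data_resource_id)$url, stringsAsFactors = FALSE)

        ## Generate one figure as PDF (this is where the actual analysis code goes)
        pdf("figure.pdf", width = 7, height = 5)
        plot(d[[1]], d[[2]], type = "l", xlab = names(d)[1], ylab = names(d)[2])
        dev.off()

        ## Upload the figure and this script back to CKAN, overwriting the previous versions
        resource_update(figure_resource_id, "figure.pdf")
        resource_update(script_resource_id, "make_figure.R")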

    That gives us the following setup:
    – creating these scripts is very easy – we simply take working R code (producing a figure from a data.frame), prepend it with the “load CKAN resource into R data.frame” code, and append the “upload code and products back to CKAN” code.
    – to produce the products from data once data are refreshed, we simply “source” the R script. That’s one user-friendly click of a button in RStudio. Doesn’t get any easier than that!
    – my researchers have full access to the actual script. I found that every layer of abstraction introduces an order of magnitude of possible bugs and confusion for end users.
    – we went with R instead of Python as our end users’ skill base is more around R than Python. YMMV.
    – scripts can be run on demand, or as part of another “refresh_all_the_figures.R” script (a sketch of such a driver follows after this list). This allows any automation we require later on, but keeps things simple and on-demand at the same time.
    – every user is authenticated with their own CKAN API key, so their actions are logged, and we can kick the right behinds in case of mishaps.
    – the scripts of many authors sit in the same code repo as the reports, so they are within reach when needed, and serve as a simple library of existing solutions to steal from.
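
    For completeness, a hypothetical sketch of such a “refresh_all_the_figures.R” driver: it simply sources every figure script, here assumed to live in a scripts/ directory inside the reporting repo.

        ## refresh_all_the_figures.R – hypothetical driver; the scripts/ directory is an assumption
        script_files <- list.files("scripts", pattern = "\\.R$", full.names = TRUE)
        for (f in script_files) {
          message("Refreshing products from ", f)
          source(f)
        }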

    We deliberately separated data processing from report compilation for the following reasons:
    – performance: the products are produced once, and read often. It makes sense to compile them when the data comes in, not when they are accessed / read.
    – separation of concerns: the products are used not only for the reports, so we don’t want to tie these two processes too tightly together.
    – scalability: requirements for both the analyses and the reports evolve constantly, so any level of abstraction and any tight integration creates a maintenance footprint. Keeping things light-weight allows us to “move fast and break things”.
    – qa: it is hard to automate having a human brain between data coming in (updating the CKAN data resource is a manual process anyway) and checking that the product still makes sense. My users like feeling in charge of the QA step and being able to fiddle with the product where and when necessary – especially when their name (as the dataset maintainer) and the “last updated on” date sign off on their update to the CKAN dataset.

    So yeah, simple user-sourced R scripts are a pretty low-key solution, but for us, in the context of a large reporting scenario, it’s the sweet spot between automation, human common sense between components for review and QA, scalability, and flexibility.

    In contrast to this large use-case (a dozen reports of 1500+ pages using 600+ CKAN resources), compiling products at the same time as reports would make a ton of sense for smaller reporting projects, e.g. research papers, technical appendices, R package vignettes and the like.

    Finally, where there are too many possible permutations of input settings to produce the desired output (and we can’t pre-compile one product to fit all needs), it’s not too hard to wrap an R script into an RShiny app, such as my much hawked-out timeseries explorer http://rshiny.yes-we-ckan.org/shiny-timeseries/
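
    For the record, the skeleton of such a wrapper is tiny. The sketch below uses made-up example data rather than a real CKAN resource; in practice you would read the data.frame from CKAN as in the script above.

        ## app.R – hypothetical minimal Shiny wrapper around a plotting script
        library(shiny)

        ## In practice: d <- read.csv(resource_show("placeholder-resource-id")$url)
        d <- data.frame(time = 1:100, value_a = cumsum(rnorm(100)), value_b = cumsum(rnorm(100)))

        ui <- fluidPage(
          titlePanel("Timeseries explorer (sketch)"),
          selectInput("column", "Variable", choices = names(d)[-1]),
          plotOutput("timeseries")
        )

        server <- function(input, output) {
          output$timeseries <- renderPlot({
            plot(d$time, d[[input$column]], type = "l", xlab = "time", ylab = input$column)
          })
        }

        shinyApp(ui, server)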

    So yeah, an iPython notebook server would be a useful addition to our data.wa.gov.au stack. (Similarly, an “open this spatial resource in QGIS server” button would tickle many fancies I guess.)
    I’ll have to look into adding https://github.com/jupyter/jupyterhub to http://govhack2015.readthedocs.org/en/latest/3_Workbench/ to spawn ipython notebook containers. Any help or lessons learned in that space would be appreciated!

    Cheers,
    Florian