Link Digital’s Enterprise CKAN Stack for AWS is Now Available on GitHub
As part of the commitment made at the White House Open Data Roundtable, Datashades, also trading as Link Digital, has recently released the preview of an Enterprise CKAN Stack for AWS.
The stack presents Link Digital’s best practice, with independently scalable layers, easily adapted to CI workflows and automated system maintenance. It is now freely available to use on our Datashades GitHub repository.
This OpsWorks stack has been in active use by Link Digital and is the basis on which Link Digital builds and supports its Government Open Data platforms. In that sense, the project is a genuine case of “eating your own dog food”.
Although a number of improvements are still in progress, we believe that the newly published alpha version of the project will add value to the Public Data community.
A longer monologue from a dev list discussion:
Attaching our high level architecture using RDS on AWS — for UAT and PROD: appendix_8_updated_aws-hosting-environment-2.
CloudFormation scripts for building out CKAN in a HA config can be found at https://github.com/DataShades/ckan-aws-templates
OpsWorks version is here: https://github.com/DataShades/opswx-ckan-cookbook
Happy to collaborate on this and make it shine brighter :)
There are a few other relevant scripts under our datashades set of repos, such as the ASG one here: https://github.com/DataShades/updateasg
And, the general cloud storage one here: https://github.com/DataShades/ckanext-cloudstorage
And the S3 related one here: https://github.com/DataShades/ckanext-s3filestore
We’ve also improved the SSO approach with Saml2: https://github.com/DataShades/ckanext-saml2
And, begun some work for manipulating ACLs, which is important for private dataset resources you’d want to switch to ‘public’ when published: https://github.com/DataShades/ckanext-acl
Although not formally part of the CKAN roadmap I have a working model of where I’d like CKAN to head when it comes to enterprise file/data storage and access. If you are familiar with the concept of resource views then the idea I’m keen to pursue is similar. It is a concept of resource containers (not para-virtualization containers but storage or access point containers). The idea is to make CKAN extendable via extensions of a type that allow it to do more orchestration around how data is stored and made usable below the discovery layer of the metadata.
The story would be something like:
As a platform operator, I need to be able to configure a variety of storage and access endpoint possibilities, so that custodians can select where data is placed based on type of data or business need.
Resource container extensions would then be built to accommodate things like:
- Big data, transnational data feeds
- Semantic lakes
- Large file storage blobs
- Self declarative structured data (likely using data packaging/frictionless data)
- For cost auditing and accountability – storage into specified paid cloud accounts (different AWS, Azure, etc. accounts based on organisation)
I would imagine that resource view and resource container extensions would be paired in many cases, so that the view gives greater access to and control of the data, along with an ability to query and extract insights from it.
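To make the resource container idea a little more concrete, here is a rough Python sketch of how a platform operator might register storage targets and have one chosen per resource. Everything in it is hypothetical: CKAN has no resource container plugin type today, and the names ResourceContainer, ContainerRegistry, example_registry and the endpoint strings are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ResourceContainer:
    """Hypothetical storage/access target; not an existing CKAN interface."""
    name: str
    endpoint: str                      # illustrative label for where data would live
    accepts: Callable[[dict], bool]    # rule deciding whether a resource fits here


class ContainerRegistry:
    """Holds the containers an operator has configured, in priority order."""

    def __init__(self) -> None:
        self._containers: List[ResourceContainer] = []

    def register(self, container: ResourceContainer) -> None:
        self._containers.append(container)

    def select(self, resource: dict) -> Optional[ResourceContainer]:
        """Return the first registered container whose rule accepts the resource."""
        for container in self._containers:
            if container.accepts(resource):
                return container
        return None


def example_registry() -> ContainerRegistry:
    """Assumed example configuration: large blobs, tabular data, then a fallback."""
    registry = ContainerRegistry()
    registry.register(ResourceContainer(
        name="large-blobs",
        endpoint="s3://example-large-files",
        accepts=lambda r: (r.get("size") or 0) > 1_000_000_000,
    ))
    registry.register(ResourceContainer(
        name="tabular",
        endpoint="datastore",
        accepts=lambda r: (r.get("format") or "").upper() in {"CSV", "XLSX"},
    ))
    registry.register(ResourceContainer(
        name="default",
        endpoint="filestore",
        accepts=lambda r: True,
    ))
    return registry
```

Calling example_registry().select({"format": "csv", "size": 2048}) would pick the "tabular" container under these assumed rules. A rule keyed on the owning organisation could route storage to a specific paid cloud account in the same way.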
The European Data Portal has around 650k datasets. It is true that once a CKAN portal gets to such a size then it can be a chore to do anything over the entire set of data in quick time. However, with the entire catalog readable via API there is a place for other tools to come into the picture to provide meta analysis or broader views over all data in a portal.
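As an illustration of an outside tool working over a whole catalog, the sketch below pages through CKAN's package_search action and tallies datasets per organisation. The helper names (fetch_page, iter_datasets, count_by_organisation) and the page size are my own choices, and it assumes the portal exposes the standard Action API at /api/3/action/. Treat it as a starting point rather than a tuned harvester.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def fetch_page(base_url, start, rows):
    """Call CKAN's package_search action and return its 'result' object."""
    query = urllib.parse.urlencode({"q": "*:*", "start": start, "rows": rows})
    url = f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"package_search failed: {payload.get('error')}")
    return payload["result"]


def iter_datasets(base_url, rows=100, fetch=fetch_page):
    """Yield every dataset dict in the catalog, one page at a time.

    'fetch' is injectable so the paging logic can be exercised offline.
    """
    start = 0
    while True:
        result = fetch(base_url, start, rows)
        page = result["results"]
        if not page:
            break
        yield from page
        start += len(page)
        if start >= result["count"]:
            break


def count_by_organisation(datasets):
    """Tally datasets per owning organisation name."""
    counts = Counter()
    for dataset in datasets:
        org = dataset.get("organization") or {}
        counts[org.get("name", "(none)")] += 1
    return counts
```

Something like count_by_organisation(iter_datasets("https://demo.ckan.org")) would then give a quick meta view of who publishes what. On a portal with hundreds of thousands of datasets this makes many requests, so a real tool would want caching and some politeness between calls.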
CKAN’s structure allows for data ownership and custodianship to remain flexible as the governing entities change over time. If we keep those functions lightweight and build the more intensive data processing tasks within a resource container layer then I think that is the big win :) I see datastore and filestore as examples of resource containers. Datapusher is an example of an ETL that works with datastore, but similar tools and concepts can be worked into the model and the open source goodness can grow organically to meet lots of different organisational needs.
Where CKAN differs from other portal software, in my experience, is that it can be used for open Government data, research data, private sector data and ‘data as knowledge’ in virtually any situation. Other portal software appears to be built around capturing a particular market opportunity to generate data as knowledge for a particular customer segment – civic hackers, jurisdictional bureaucrats, open data policy implementations, etc.
CKAN’s harvesting is good, but certainly not perfect. The approach for pushing from CKAN to elsewhere is likely to be used more in our future work, or as we refactor the architecture of current implementations. See: https://github.com/DataShades/ckanext-syndicate
By using multiple CKAN environments it is pretty easy to have catalogs of ‘working data’ that then push to the ‘published data’ catalog. We use this approach for Government open data when from the bottom up you have agency data collected into CKAN based information asset registers. Sometimes the data doesn’t even exist, but the data management plan can at least first be registered prior to populating the dataset with resources. Once the data is ready it can then be published and syndicated upward to a higher level jurisdictional portal – such as a council, city, state or province. Similarly such datasets can then be syndicated upward again into a national or regional portal – perhaps with further ETL functions put in place to combine the similarly structured data from multiple agencies into a master dataset that presents a larger view of the entire data collection effort.
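The push from a ‘working data’ catalog to a ‘published data’ catalog can be sketched roughly as below. This is not how ckanext-syndicate is implemented; it is a simplified illustration that assumes the upstream portal accepts the standard package_create action with an API token in the Authorization header. The function names and the list of fields stripped before pushing are my own assumptions.

```python
import json
import urllib.request

# Fields assumed to be specific to the source instance and not worth copying.
LOCAL_ONLY_FIELDS = {
    "id", "revision_id", "metadata_created", "metadata_modified",
    "owner_org", "organization",
}


def prepare_for_syndication(dataset, target_org):
    """Build a package_create payload for the upstream portal.

    Refuses private datasets, so only published data moves upward.
    The input dict is left unmodified.
    """
    if dataset.get("private"):
        raise ValueError(f"dataset {dataset.get('name')!r} is still private")
    payload = {k: v for k, v in dataset.items() if k not in LOCAL_ONLY_FIELDS}
    payload["owner_org"] = target_org
    # Resource ids belong to the source catalog, so drop them as well.
    payload["resources"] = [
        {k: v for k, v in res.items() if k not in ("id", "package_id")}
        for res in dataset.get("resources", [])
    ]
    return payload


def push_dataset(target_url, api_token, payload):
    """POST the payload to the upstream portal's package_create action."""
    request = urllib.request.Request(
        f"{target_url.rstrip('/')}/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_token},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["result"]
```

An agency register would call prepare_for_syndication once a dataset is ready, then push_dataset against the council, state or national portal. Handling updates to an already syndicated dataset (package_update rather than package_create) is left out to keep the sketch short.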
If the domain of data collection differs, such as in a field of research, then the same architecture can still apply. Multiple research schools of chemistry, for example, could publish working data locally then syndicate upward into a global repository that allows for meta analysis of all research outcomes over the entire domain’s efforts. We’re working on a project in just this manner that is referenced here: http://linkdigital.com.au/news/2016/09/building-mdbox-an-open-access-simulation-data-repository-on-ckan-and-aws
Lastly, published open data is the result of effort which is put into a process of data collection and, usually, some analysis and clean up. The tools used to prepare, collect, process or visualise data are all part of the value a dataset represents. To bridge data and code we’ve released a very simple resource view for GitHub repositories that can be found here: https://github.com/DataShades/ckanext-githubrepopreview
Open Government initiatives are formed around principles of transparency, participation and collaboration. There is a desire to enable public-private collaboration over the long term and there is a role for Government to act as impresario to stimulate new markets and economic activity from publishing open data (ref: https://www.nesta.org.uk/sites/default/files/government_as_impresario.pdf). The reason we built the GitHub resource view is to encourage open source projects to emerge in connection to public datasets, via linking the opportunity for discovery of helpful code with the discovery of helpful datasets.
Sorry for the long monologue! I could have more succinctly just said CKAN rocks, check out all the open source goodness surrounding it and jump in :)