Today we're launching Express Loader, an enhanced way to load data into CKAN's DataStore. It loads data about 10 times as fast as the existing DataPusher, is more robust and has a simpler architecture.
More and more CKAN sites are offering their CSV data using DataStore, which provides users with an excellent data API. However, experience has shown that getting the CSV into DataStore is not always smooth.
A key improvement over DataPusher is the speed of loading large data files. In our tests, a 500MB CSV took 39 minutes to load with DataPusher, but Express Loader cuts that to 3 minutes 26 seconds. Express Loader streams the data in with one SQL command, instead of 4000 separate INSERTs, and uses optimizations such as deferred indexing.
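To give a rough idea of what loading with "one SQL command" can look like, here is a minimal Python sketch. It assumes the command is PostgreSQL's COPY ... FROM STDIN, sent through a psycopg2 cursor's copy_expert method; this post does not name the command, so treat that as an assumption. The helper names (quote_ident, build_copy_sql, stream_csv) are hypothetical and are not taken from the Express Loader code, and deferred indexing is not shown.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table, columns):
    """Build a single COPY statement that reads CSV rows from the client."""
    column_list = ", ".join(quote_ident(c) for c in columns)
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        quote_ident(table), column_list
    )


def stream_csv(cursor, table, columns, csv_file):
    """Send the whole file to the database with one command.

    Assumption: `cursor` is a psycopg2 cursor, whose copy_expert(sql, file)
    method streams the open file object to the server.
    """
    sql = build_copy_sql(table, columns)
    cursor.copy_expert(sql, csv_file)
    return sql
```

With a real connection, this would be called as stream_csv(cursor, table, columns, open_file) inside a transaction. The point of the design is that the database parses the rows itself, rather than the client issuing thousands of INSERT statements.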
Express Loader is also more robust, coping better with problematic CSV files. DataPusher detects column types based on the first few rows, but where a column starts off numeric and later contains text, for instance "N/A", it causes an error. As we all know, "perfect" data rarely exists in the real world, so Express Loader takes a more pragmatic approach. It loads all columns safely as text and allows them to be converted to the proper data type later with the new type-casting features of the Data Dictionary.
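The difference between the two approaches can be illustrated with a small, self-contained Python sketch. It is not the code of DataPusher or Express Loader: the sample data, the function names and the choice to map "N/A" to None when casting are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample: the "reading" column is numeric until the third row.
SAMPLE = "station,reading\nA,10\nB,12\nC,N/A\nD,15\n"


def is_int(value):
    """Return True if the text parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def guess_types(rows, sample_size=2):
    """Guess each column's type from the first few rows only."""
    sample = rows[:sample_size]
    return [int if all(is_int(v) for v in column) else str
            for column in zip(*sample)]


def load_with_guessed_types(rows):
    """Convert every row with the guessed types; raises ValueError on a misfit."""
    types = guess_types(rows)
    return [[t(v) for t, v in zip(types, row)] for row in rows]


def load_as_text(rows):
    """Keep every value as text, so no row can fail during loading."""
    return [list(row) for row in rows]


def cast_column(rows, index, to_type, null_values=("", "N/A")):
    """Cast one column after loading, mapping known placeholders to None."""
    result = []
    for row in rows:
        new_row = list(row)
        value = new_row[index]
        new_row[index] = None if value in null_values else to_type(value)
        result.append(new_row)
    return result


reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
rows = list(reader)
```

Here load_with_guessed_types(rows) raises a ValueError when it reaches the "N/A" row, whereas load_as_text(rows) always succeeds and cast_column can be applied afterwards, once someone has decided how placeholders should be treated.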
The repository and install instructions are here: https://github.com/davidread/ckanext-xloader. It supports CKAN versions from the latest master back to v2.3. We encourage users to give it a try and send feedback to help us guide future development. If we can get enough support, it could replace DataPusher as the default loader for CKAN DataStore.
OpenGov Inc. has sponsored this development work, with the aim of benefiting open data infrastructure worldwide. OpenGov is a gold member of the CKAN Association.