The newly released CKAN 2.7 adds new functionality and makes DataStore search up to
17x faster at the same time.
This article covers:
- new total row calculation
- new result generation
- new record formats
TL;DR:
- Upgrade to 2.7 then resource views and other code that uses datastore_search is faster with no other changes required
- Update your datastore_search client code use one of the new records_format=csv and/or include_total=false options to make it much, much faster
Total Row Calculation
The first opportunity to make DataStore search faster came when users observed search slowing down as their datasets grew. The same slow-down was not happening when using SQL search, so this problem was specific to our datastore_search implementation.
Profiling datastore_search showed that a window function used to calculate the total number of matching rows was not being cached by PostgreSQL. Instead is was recalculated for every row returned, meaning we walked the whole table once for every row we returned. Pulling the total calculation out of the original query more than doubled the speed for larger queries, with a slight cost for small queries (we now make two round trips to the database instead of one)
If you store tens or hundreds of millions of rows in your DataStore tables you might be interested in improving the total calculation even further by caching the total counts in Redis at the API layer or in PostgreSQL with a trigger. However CKAN 2.7 fixes the worst issue with total calculation for most large tables.
Result Generation
Profiling datastore_search then showed most of our time spent handling data in Python: SQLAlchemy marshalling result rows into Python objects, DataStore walking over those objects and converting them into Python dictionaries and simplejson encoding those dictionaries to JSON to send as an API response.
The best optimisation is to do less work; can we skip all this data handling somehow?
It happens that PostgreSQL 9.3 has a rich set of JSON functions and is fully capable of building the exact response records we need. Embedding those records in our JSON API response is now supported by simplejson 3.10's new RawJSON object. This change makes our PostgreSQL query more complicated but it cuts out almost all data handling in Python, doubling our speed again for large queries.
But, can we do even less work?
There's no need to calculate the total records for cases like dumping DataStore tables to a file or updating a page of data in a resource view after the first page. With the new result generation optimisation making that part of our test query so much faster, the time simply generating a total record count now accounts for almost half the total. Let's use the new include_total=False parameter to skip that too. We're now producing results seven times as fast as we could on CKAN 2.6
Acknowledgements
This work was sponsored by OpenGov and incorporated into CKAN 2.7 with the help of CKAN core developers and contributors.