CKAN is widely used to publish datasets containing millions of records. While smaller exports performed well, very large Datastore dumps exposed a scalability limitation.
Consider a dataset with 13 million records — not unusual for census data, public health records, or economic indicators. In previous CKAN versions, downloading this data through the Datastore could take 30 minutes or more, assuming the request did not time out first.
The technical cause was offset pagination. To fetch page 1,000, the database skips 999,000 rows. To fetch page 2,000, it skips 1,999,000 rows. Each page requires more work than the last.
In simple terms: it's like reaching page 1,000 of a book by flipping through the previous 999 pages, then reaching page 1,001 by flipping through 1,000. The further you go, the slower it gets.
At large scale, this approach does not perform reliably.
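The offset cost described above can be seen in a minimal sketch. This uses SQLite and an invented records table purely for illustration, not CKAN's actual Datastore schema or code:

```python
import sqlite3

# Illustrative table, not CKAN's real Datastore schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (_id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO records (value) VALUES (?)",
                 [(f"row-{i}",) for i in range(10_000)])

PAGE_SIZE = 100

def fetch_page_offset(page_number):
    # OFFSET forces the database to walk past and discard every row
    # before the requested page, so each page costs more than the last.
    offset = page_number * PAGE_SIZE
    return conn.execute(
        "SELECT _id, value FROM records ORDER BY _id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()

# Fetching page 50 quietly scans 5,000 rows just to throw them away.
page = fetch_page_offset(50)
```

On a 13-million-row table the same query shape skips millions of rows per request, which is where the slowdown comes from.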
The solution
CKAN 2.12 replaces offset pagination with keyset pagination.
Instead of counting rows, the system uses the indexed ID field as a bookmark. The database jumps directly to the correct position every time, regardless of how far into the dataset you are. It's like using a book's index instead of flipping pages.
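A sketch of the keyset idea, again using SQLite and an illustrative table rather than CKAN's actual implementation: the last ID seen becomes the bookmark for the next request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (_id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO records (value) VALUES (?)",
                 [(f"row-{i}",) for i in range(10_000)])

PAGE_SIZE = 100

def fetch_after(last_id):
    # The index on _id lets the database seek straight to the row
    # after the bookmark, no matter how deep into the dataset we are.
    return conn.execute(
        "SELECT _id, value FROM records WHERE _id > ? ORDER BY _id LIMIT ?",
        (last_id, PAGE_SIZE),
    ).fetchall()

# Stream the whole table page by page at constant per-page cost.
last_id, total = 0, 0
while True:
    page = fetch_after(last_id)
    if not page:
        break
    total += len(page)
    last_id = page[-1][0]  # bookmark for the next request
```

Every page here is a cheap indexed lookup, which is why performance stays flat as the dataset grows.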
Results:
- That 13-million-record dataset now downloads in 2 minutes instead of 30
- 15x performance improvement
- Downloads complete reliably — timeouts eliminated
- Performance remains consistent as dataset size increases
Advanced filtering
Improved download performance is only part of the change. CKAN 2.12 also introduces a way to filter data before export.
In earlier versions, users typically had to download entire datasets and apply filters locally, even when only a subset of the data was required. For large datasets, this approach was slow and often impractical.
With CKAN 2.12, users can apply structured filters directly at the Datastore level. For example, a request can specify a defined time range, age range, and a limited set of regions:

year BETWEEN 2020 AND 2023 AND age BETWEEN 18 AND 65 AND region IN ('North', 'Central', 'East')
Only matching records are returned, reducing the amount of data transferred and processed.
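To illustrate filtering at the source rather than after download, here is a minimal SQLite sketch applying the example filter above. The table and its contents are invented; the actual request syntax is defined by the Advanced Query Filter specification, not shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (year INTEGER, age INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [
        (2019, 30, "North"),    # excluded: year out of range
        (2021, 40, "Central"),  # included
        (2022, 17, "East"),     # excluded: age out of range
        (2023, 65, "North"),    # included
        (2021, 50, "South"),    # excluded: region not in the set
    ],
)

# The filter runs inside the database; only matching records
# ever leave it, instead of the full table being exported.
rows = conn.execute(
    """SELECT * FROM survey
       WHERE year BETWEEN 2020 AND 2023
         AND age BETWEEN 18 AND 65
         AND region IN ('North', 'Central', 'East')"""
).fetchall()
```

Two of the five rows match, so only those two are transferred.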
This functionality is provided by the Advanced Query Filter specification developed by Adrià Mercader from the CKAN Tech Core Team.
The specification introduces a consistent and scalable filtering model, including:
- Range queries: population > 100000 or year BETWEEN 2020 AND 2026
- Complex logic: nested AND/OR conditions like (age > 25 AND income < 50000) OR education = 'graduate'
- Unified syntax: a consistent query language across all CKAN instances
The filtering system does two things. First, it enables the new pagination approach. Second, it allows users to filter data before download rather than retrieving everything and filtering locally.
These changes were integrated and prepared for release by Ian Ward and will be included in CKAN 2.12.
What you get
If you publish data:
- Large dataset exports now complete reliably without manual intervention
- Users can download complete datasets without contacting support
- Infrastructure costs decrease as database load becomes predictable
If you use data:
- Census datasets with 10+ million records download in minutes, not hours
- No more timeout errors halfway through large exports
- Filter datasets before download to extract only relevant records
- Build data pipelines that don't break when dataset size increases
If you manage CKAN:
- Reduced support burden for failed downloads
- Consistent, predictable database performance under load
- Advanced filtering capabilities without custom development
Availability
These improvements will be included in the upcoming CKAN 2.12 release.