CKAN Turns 20: Two Decades of Open Data Infrastructure
CKAN turns 20. Explore how an open-source experiment became global data infrastructure, powering governments, research, and public-interest data worldwide.
CKAN is widely used to publish datasets containing millions of records. While smaller exports performed well, very large Datastore dumps exposed a scalability limitation.
Consider a dataset with 13 million records — not unusual for census data, public health records, or economic indicators. In previous CKAN versions, downloading this data through the Datastore could take 30 minutes or more, assuming the request did not time out first.
The technical cause was offset pagination. With pages of 1,000 rows, fetching page 1,000 means the database skips 999,000 rows, and fetching page 2,000 means it skips 1,999,000 rows. Each page requires more work than the last.
In simple terms, it's like finding page 1,000 in a book by flipping through the first 999 pages, then finding page 1,001 by starting over and flipping through 1,000 pages. The further you go, the slower it gets.
At large scale, this approach does not perform reliably.
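To make the growth concrete, here is a rough back-of-the-envelope model in Python. It assumes a page size of 1,000 rows (implied by the figures above, but not stated as CKAN's actual setting) and treats every skipped row as one unit of work, which simplifies what a real database does. The function name is hypothetical.

```python
def rows_skipped_with_offset(total_rows: int, page_size: int) -> int:
    """Total rows skipped across a full export that uses OFFSET paging.

    Simplified model: page k (0-based) is fetched with OFFSET k * page_size,
    and every skipped row counts as one unit of work.
    """
    pages = -(-total_rows // page_size)  # ceiling division
    return sum(k * page_size for k in range(pages))


if __name__ == "__main__":
    total = 13_000_000  # dataset size used as the example in this article
    page = 1_000        # assumed page size, not a confirmed CKAN default
    skipped = rows_skipped_with_offset(total, page)
    print(f"rows returned: {total:,}")    # rows returned: 13,000,000
    print(f"rows skipped:  {skipped:,}")  # rows skipped:  84,493,500,000
    print(f"skipped per returned row: {skipped / total:,.1f}")  # 6,499.5
```

Under this model, the wasted work grows with the square of the number of pages, so doubling the dataset roughly quadruples the cost.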
CKAN 2.12 replaces offset pagination with keyset pagination.
Instead of counting rows, the system uses the indexed ID field as a bookmark. The database jumps directly to the correct position every time, regardless of how far into the dataset you are. It's like using a book's index instead of flipping pages.
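The difference between the two strategies can be sketched with a small, self-contained Python example. This is an illustration only, not CKAN's implementation: it uses an in-memory SQLite database instead of the Datastore's PostgreSQL backend, and the table name `records`, the `value` column, and all function names are hypothetical. The `_id` column stands in for the indexed ID field.

```python
import sqlite3


def make_demo_db(row_count: int) -> sqlite3.Connection:
    """Create an in-memory table with an indexed integer id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (_id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO records (_id, value) VALUES (?, ?)",
        ((i, f"row-{i}") for i in range(1, row_count + 1)),
    )
    conn.commit()
    return conn


def dump_with_offset(conn: sqlite3.Connection, page_size: int) -> list:
    """Offset pagination: each query skips every row already exported."""
    rows, offset = [], 0
    while True:
        page = conn.execute(
            "SELECT _id, value FROM records ORDER BY _id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return rows
        rows.extend(page)
        offset += page_size


def dump_with_keyset(conn: sqlite3.Connection, page_size: int) -> list:
    """Keyset pagination: each query resumes after the last id seen."""
    rows, last_id = [], 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT _id, value FROM records WHERE _id > ? ORDER BY _id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return rows
        rows.extend(page)
        last_id = page[-1][0]  # the "bookmark" for the next page


if __name__ == "__main__":
    conn = make_demo_db(2_500)
    assert dump_with_offset(conn, 1_000) == dump_with_keyset(conn, 1_000)
```

Both functions return the same rows in the same order. The keyset version lets the database use the index on `_id` to seek straight to the next page, so the cost of a page does not depend on how many pages came before it.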
Results:
This keyset pagination solution was developed by Yan Rudenko.
Improved download performance is only part of the change. CKAN 2.12 also introduces a way to filter data before export.
In earlier versions, users typically had to download entire datasets and apply filters locally, even when only a subset of the data was required. For large datasets, this approach was slow and often impractical.
With CKAN 2.12, users can apply structured filters directly at the Datastore level. For example, a condition that combines a time range, an age range, and a limited set of regions:
year BETWEEN 2020 AND 2023 AND age BETWEEN 18 AND 65 AND region IN ('North', 'Central', 'East')
can now be expressed in search or delete queries as:
"filters": {
    "year": {"gte": 2020, "lte": 2023},
    "age": {"gte": 18, "lte": 65},
    "region": ["North", "Central", "East"]
}
Only matching records are returned, reducing the amount of data transferred and processed.
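The intended semantics of the filter object above can be illustrated with a short Python sketch. This is not CKAN code: the `matches` function and the sample records are hypothetical, and it covers only the forms shown in the example (`gte`/`lte` ranges and a list of allowed values), plus plain equality for a single value as an assumption. In CKAN the filtering happens inside the database, not in Python.

```python
from typing import Any, Mapping

# The filter object from the example above.
FILTERS = {
    "year": {"gte": 2020, "lte": 2023},
    "age": {"gte": 18, "lte": 65},
    "region": ["North", "Central", "East"],
}


def matches(record: Mapping[str, Any], filters: Mapping[str, Any]) -> bool:
    """Return True if the record satisfies every filter (conditions are ANDed)."""
    for field, condition in filters.items():
        value = record.get(field)
        if isinstance(condition, dict):
            # Range form: only the operators shown in the article are handled.
            unknown = set(condition) - {"gte", "lte"}
            if unknown:
                raise ValueError(f"unsupported operators: {sorted(unknown)}")
            if value is None:
                return False
            if "gte" in condition and value < condition["gte"]:
                return False
            if "lte" in condition and value > condition["lte"]:
                return False
        elif isinstance(condition, list):
            # List form: the value must be one of the listed options.
            if value not in condition:
                return False
        elif value != condition:
            # Assumed: a single value means plain equality.
            return False
    return True


if __name__ == "__main__":
    sample = [
        {"year": 2021, "age": 30, "region": "North"},    # matches
        {"year": 2019, "age": 30, "region": "North"},    # year too early
        {"year": 2022, "age": 70, "region": "East"},     # age too high
        {"year": 2023, "age": 65, "region": "South"},    # region not listed
        {"year": 2020, "age": 18, "region": "Central"},  # matches (bounds inclusive)
    ]
    print([r for r in sample if matches(r, FILTERS)])
```

Only the first and last sample records pass, which mirrors the SQL condition: both range bounds are inclusive, as with `BETWEEN`.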
This functionality is provided by the Advanced Query Filter specification developed by Adrià Mercader from the CKAN Tech Core Team.
The specification introduces a consistent and scalable filtering model, including:
population > 100000 or year BETWEEN 2020 AND 2026
(age > 25 AND income < 50000) OR education = 'graduate'
The filtering system does two things. First, it enables the new pagination approach. Second, it allows users to filter data before download rather than retrieving everything and filtering locally.
These changes were integrated and prepared for release by Ian Ward and will be included in CKAN 2.12.
If you publish data:
If you use data:
If you manage CKAN:
These improvements will be included in the upcoming CKAN 2.12 release.