📖 Useful resources and links:
- CKAN 3.0 Taskforce Board: github.com/orgs/ckan/projects/3/views/1
- GitHub discussion: github.com/ckan/ckan/issues/7489
Context: Last time we announced that the new CKAN 3.0 taskforce had been formed, thanks to the addition of two skilled backend/CKAN developers, Dragan and Svetozar. They will work closely with technical lead Anuar Ustayev and product owner Alex Gostev. To form a strategy for the next big release, the team facilitates and participates in discovery activities to confirm or refute hypotheses about problems and solutions.
The research is now underway, and we are excited to share the journey with you through a series of articles. "Graph Databases" is the first series, intended to add transparency to the discovery work of all contributing developers. The goal of this communication is to shed light on our ongoing research while inviting the community to participate. We believe that engaging with our community during this research will provide valuable insights, evidence, and feedback that will help us achieve our objectives.
If you have opinions, experience, or issues related to the research, please share your feedback in the GitHub discussion linked above.
Today, we kick off this series with our first article, delving into the initial phases of the research. Stay tuned as we keep you updated on the team's progress and insights, and read on to learn more.
CKAN has long been a reliable solution for numerous organizations seeking to manage and share data, although the needs of its users vary widely. CKAN's SQL databases have proven to be a reliable solution for working with DCAT, but there is potential to simplify the current workflows where RDF triples are involved. To investigate this area, we're committed to exploring potential approaches for working with graph databases.
Below, we've captured the essential highlights from the discussion. For a complete overview, have a look at the GitHub conversation: https://github.com/ckan/ckan/issues/7489.
The limitations of the SQL databases currently used in CKAN restrict both the range of applications and the support for specific metadata standards such as DCAT. Because SQL databases do not fully support all the required metadata standards, the team is examining whether lossless harvesting into a graph database can be supported in CKAN.
During the problem discovery phase, the team has been gathering evidence from Swiss and EU customers who use graph databases and examining past experiences, such as that of a German client who faced issues with the DCAT-AP metadata standard. The main problem identified by product owner Alex Gostev is the loss of connections between catalog entities and relations when external catalogs are harvested in DCAT format.
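To make the problem concrete, here is a minimal sketch of how relations can get lost when a graph-shaped catalog is flattened into per-dataset records. It is not CKAN's harvester code: the `ex:` identifiers, the sample data, and the `flatten_dataset` helper are hypothetical, and only the vocabulary URIs (Dublin Core terms, FOAF) are real.

```python
# Real vocabulary terms; everything else in this sketch is hypothetical.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_PUBLISHER = "http://purl.org/dc/terms/publisher"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

# A tiny external catalog as (subject, predicate, object) triples.
# Both datasets point at the same publisher node.
TRIPLES = [
    ("ex:dataset/1", DCT_TITLE, "Air quality"),
    ("ex:dataset/1", DCT_PUBLISHER, "ex:org/env-agency"),
    ("ex:dataset/2", DCT_TITLE, "Water quality"),
    ("ex:dataset/2", DCT_PUBLISHER, "ex:org/env-agency"),
    ("ex:org/env-agency", FOAF_NAME, "Environment Agency"),
    ("ex:org/env-agency", FOAF_HOMEPAGE, "https://env.example.org"),
]


def flatten_dataset(triples, dataset_uri):
    """Map one dataset to a flat dict, as a simplified harvester might."""
    props = {p: o for s, p, o in triples if s == dataset_uri}
    publisher_uri = props.get(DCT_PUBLISHER)
    publisher_props = {p: o for s, p, o in triples if s == publisher_uri}
    # Only literal values are copied; the publisher node itself is dropped.
    return {
        "title": props.get(DCT_TITLE),
        "publisher_name": publisher_props.get(FOAF_NAME),
    }


flat = [flatten_dataset(TRIPLES, uri) for uri in ("ex:dataset/1", "ex:dataset/2")]

# Triples whose object value no longer appears in any flat record.
kept_values = {value for record in flat for value in record.values()}
lost = [t for t in TRIPLES if t[2] not in kept_values]

for triple in lost:
    print("lost:", triple)
```

In this toy mapping the titles and the publisher's name survive, but the links from each dataset to the shared publisher node, and the publisher's homepage, cannot be recovered from the flat records.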
Two solutions are being researched to address the problem:
- Lossless harvesting storage with 2 databases (graph and Postgres) (#7514)
- A generic solution to store triples (#7515)
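For readers unfamiliar with the idea of storing triples generically, the sketch below shows one common way to model it: a single subject/predicate/object table with wildcard pattern matching. This is an illustration only, not the design proposed in either issue. It uses SQLite from the Python standard library purely for convenience, and the table name, the `create_store`, `add_triples`, and `match` helpers, and the `ex:` identifiers are all hypothetical.

```python
import sqlite3

DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_PUBLISHER = "http://purl.org/dc/terms/publisher"


def create_store(conn):
    """Create a single table holding (subject, predicate, object) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triple ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " PRIMARY KEY (subject, predicate, object))"
    )


def add_triples(conn, triples):
    """Insert triples; exact duplicates are silently skipped."""
    conn.executemany("INSERT OR IGNORE INTO triple VALUES (?, ?, ?)", triples)


def match(conn, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if value is not None:
            clauses.append(column + " = ?")
            params.append(value)
    sql = "SELECT subject, predicate, object FROM triple"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY subject, predicate, object"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
create_store(conn)
add_triples(conn, [
    ("ex:dataset/1", DCT_TITLE, "Air quality"),
    ("ex:dataset/1", DCT_PUBLISHER, "ex:org/env-agency"),
    ("ex:dataset/2", DCT_TITLE, "Water quality"),
    ("ex:dataset/2", DCT_PUBLISHER, "ex:org/env-agency"),
    ("ex:dataset/2", DCT_PUBLISHER, "ex:org/env-agency"),  # duplicate, ignored
])

# Which datasets share this publisher? The relation is kept as data.
by_publisher = match(conn, predicate=DCT_PUBLISHER, obj="ex:org/env-agency")
published = [subject for subject, _, _ in by_publisher]
print(published)
```

Because every relation is stored as a row rather than mapped to a fixed column, new predicates need no schema change. The trade-off is that queries spanning several relations need self-joins, which is one reason dedicated graph databases and SPARQL endpoints are being considered.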
The discussion also underscores the importance of understanding why SQL databases may not be sufficient for CKAN's needs and the necessity of exploring graph databases. Several critical questions have been raised to evaluate the proposed solutions, such as their potential impact on current installations, backward compatibility, ease of migration, database schema changes, API changes, and deprecation of interfaces.
Governments and organizations have widely adopted metadata standards like DCAT, Dublin Core, and Frictionless Data, with various extensions. Although CKAN has an extension to support DCAT, it may not be sufficient to cover all DCAT features, which is why the team is considering graph database support.
Interesting note: a Taiwanese project has used triplestores with CKAN. The project converted original data to triples and imported them into CKAN and OpenLink Virtuoso to provide a SPARQL endpoint. It has been suggested that CKAN 3.0 could incorporate a separate triple store with a SPARQL endpoint, and the exciting thing is that a contributor from the Taiwanese project is willing to help the team!
In the upcoming articles, we will explore two potential solutions in detail: 1) a generic approach for storing triples, and 2) a lossless harvesting storage method utilizing two databases (Graph and PostgreSQL). So stay tuned!