I'm developing a simple triplestore from scratch. Specifically, I'm programming a simple graph data structure with the ability to add and remove triples, extract triples, run simple queries and apply simple inferences. I'm doing this in Emacs Lisp, and for learning purposes only (I know that one should not reinvent the wheel).
Where should I look for guidance on how to write a triplestore? I know about Part I of "Programming the Semantic Web", where you write your own simple triplestore in Python, but I want to know all the alternatives. For example, which RDF libraries have readable source code, so I can find out how all this works internally?
asked 09 Jan, 13:51
If you want to look at existing implementations, Jena and Sesame are both open source, and contain both in-memory and disk-based databases. Both are in Java, so if you're comfortable with Java, that's a good place to start.
Since you mentioned Python, you can always look at the source of rdflib -- admittedly I'm not very familiar with it, but they probably have a triplestore implementation that you can reference.
If you like C#, then there's dotNetRDF, which you can use as a reference. I know it has a native SPARQL engine and probably some sort of persistence, even if only in-memory.
There is plenty of academic literature that you could reference -- the academics on here can make specific suggestions, and you can search Google Scholar. Beyond that, information on how traditional relational databases work is worth reading. It won't help you come up with a good way to add and remove triples (you'll probably want a data structures reference for that; read up on things like B+ trees), but understanding what a join is and how joins can be implemented is useful regardless of what the data is (relational vs. triples). This was the most helpful reference for me in learning how an RDF database ought to work, because RDF stores are still just databases, and a lot of the traditional wisdom is relevant.
Pairing the knowledge gained from this reading with code reviews of existing implementations should give you a pretty good idea of how to build your own to learn from.
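To make the point about joins concrete, here is a minimal sketch in Python (chosen only because the question mentions a Python-based book; it ports readily to Emacs Lisp). It assumes triples are plain tuples of strings and that terms starting with "?" are variables. The names pattern_bindings and hash_join are made up for this illustration and do not come from any library.

```python
def pattern_bindings(triples, pattern):
    """Match one triple pattern against a list of (s, p, o) tuples.

    Terms starting with '?' are treated as variables (an assumption of
    this sketch). Returns a list of dicts mapping variable -> value.
    """
    results = []
    for triple in triples:
        binding = {}
        matched = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if binding.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            results.append(binding)
    return results


def hash_join(left, right):
    """Join two binding lists on the variables they have in common."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Build phase: bucket the left side by its values for the shared variables.
    table = {}
    for b in left:
        table.setdefault(tuple(b[v] for v in shared), []).append(b)
    # Probe phase: look up each right-side binding in the buckets.
    joined = []
    for b in right:
        for a in table.get(tuple(b[v] for v in shared), []):
            joined.append({**a, **b})
    return joined


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "age", "42"),
]
# "Who knows someone whose age is recorded?"
left = pattern_bindings(triples, ("?x", "knows", "?y"))
right = pattern_bindings(triples, ("?y", "age", "?z"))
print(hash_join(left, right))
```

A nested-loop join (compare every left binding with every right binding) gives the same answer with less code. The hash table only becomes worthwhile once the binding lists get large.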
Andy makes a good point: there are a lot of design choices you'll face when deciding how an in-memory vs. an on-disk RDF database is implemented. Thus, I'd recommend you stick with just an in-memory implementation to learn from.
You can create a working index (SPO) with just a list where scans over that index are simply a filtered view of the list. Is that slow? Yes. Will that scale? No. But will it work? Yes.
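A rough sketch of that list-backed approach, in Python for illustration only. The class name ListStore and its method names are invented here, and None is assumed to mean "wildcard" in a pattern.

```python
class ListStore:
    """A triplestore backed by one plain list; every query is a full scan."""

    def __init__(self):
        self.triples = []

    def add(self, s, p, o):
        triple = (s, p, o)
        if triple not in self.triples:  # linear scan to avoid duplicates
            self.triples.append(triple)

    def remove(self, s, p, o):
        triple = (s, p, o)
        if triple in self.triples:
            self.triples.remove(triple)

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


store = ListStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "likes", "tea")
print(store.match(s="alice"))
print(store.match(p="knows", o="carol"))
```

Every operation here is O(n) in the number of triples, which is exactly the "slow but working" baseline described above.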
From there, you have lots of places to explore: how does having more than one index, say adding POS and OPS, change things? What happens if your indexes are based on tree structures or hash tables rather than a plain list? You can look at approaches that trade write performance for read performance. There will be plenty of things to play around with and to learn from.
Lastly, if you're going to implement a SPARQL engine, you'll want to get really familiar with the SPARQL spec.
Since you mention Lisp, you really ought to familiarize yourself with AllegroGraph. AllegroGraph, developed by the nice folks at Franz Inc., has a free version. It isn't open source, but its APIs (Lisp, Java, etc.) would be worth a look. Incidentally, Franz also develops Allegro CL, one of the de facto Lisp implementations.
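As one possible next step, here is a sketch of the same store with three hash-based indexes (SPO, POS and OPS) built from nested dictionaries. Again this is Python for illustration; IndexedStore and its methods are hypothetical names, None stands for a wildcard, and a real implementation would also prune empty entries on removal.

```python
from collections import defaultdict


def _nested():
    # Two-level hash table whose leaves are sets.
    return defaultdict(lambda: defaultdict(set))


class IndexedStore:
    def __init__(self):
        self.spo = _nested()  # subject -> predicate -> {objects}
        self.pos = _nested()  # predicate -> object -> {subjects}
        self.ops = _nested()  # object -> predicate -> {subjects}

    def add(self, s, p, o):
        # Every write touches all three indexes: slower writes, faster reads.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.ops[o][p].add(s)

    def remove(self, s, p, o):
        self.spo[s][p].discard(o)
        self.pos[p][o].discard(s)
        self.ops[o][p].discard(s)

    def match(self, s=None, p=None, o=None):
        """Yield matching triples, picking an index from the bound terms."""
        if s is not None:
            by_p = self.spo.get(s, {})
            for p2 in ([p] if p is not None else list(by_p)):
                for o2 in by_p.get(p2, ()):
                    if o is None or o2 == o:
                        yield (s, p2, o2)
        elif p is not None:
            by_o = self.pos.get(p, {})
            for o2 in ([o] if o is not None else list(by_o)):
                for s2 in by_o.get(o2, ()):
                    yield (s2, p, o2)
        elif o is not None:
            for p2, subjects in self.ops.get(o, {}).items():
                for s2 in subjects:
                    yield (s2, p2, o)
        else:
            for s2, by_p in self.spo.items():
                for p2, objects in by_p.items():
                    for o2 in objects:
                        yield (s2, p2, o2)


store = IndexedStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
print(sorted(store.match(p="knows")))
print(sorted(store.match(o="carol")))
```

Note that with this particular choice of indexes, a pattern with subject and object bound but predicate free still has to iterate over the subject's predicates. Experimenting with which permutations to keep is part of the exercise.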
I've always wanted to try creating a triple store using Ehcache for the persistence-backed indexes, but that would be Java. If you did something like that, you could do it by extending Jena, make fairly quick work of it, and learn a lot in the process.