I've had people asking for dereferenceable versions of several data sets, including one that is about 670M triples.
I've resisted this so far because the conventional way of doing it, with a general-purpose triple store, simply costs too much. If I did this on AWS, I'd expect to pay at least $8000 a year in hosting costs, and I don't think the performance would be all that good. Since I can't really make people pay to use it or put ads on it, I can't see a way to find revenue to offset the cost.
I went on an optimizing binge for some other web sites and cut their hosting costs by 75%, which made me revisit the question of the Linked Data endpoint.
Here's the idea: I could split up my data set into N-Triples, Turtle or RDF/XML files that I either put on a static web server or stuff into Amazon S3. This ought to be very cheap to run (less than 1/10 the cost of the conventional system) and also "wooden round" reliable. Because the scalability is so much better, I could also set a much higher limit on the number of triples you get back than most LD endpoints do. In those few cases where I do need to apply a cutoff, my Map/Reduce framework could sort triples in order of importance, as TBL has suggested.
The devil here is in the details. A system like this won't be so smart about content negotiation. Ideally I'd like to store gzip-compressed files on disk, send them over the wire in gzipped form, and never worry about compatibility.
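A minimal sketch of the splitting step, assuming the input is N-Triples already sorted by subject (as the output of a Map/Reduce job could be) and that one gzipped file per subject is wanted. The function names and the SHA-1 file-naming scheme are hypothetical choices for illustration, not part of any existing tool.

```python
import gzip
import hashlib
from itertools import groupby
from pathlib import Path


def subject_of(line):
    """Return the subject term of an N-Triples line (text before the first space)."""
    return line.split(None, 1)[0]


def file_name_for(subject):
    """Map a subject term to a flat, filesystem-safe name (hypothetical scheme)."""
    return hashlib.sha1(subject.encode("utf-8")).hexdigest() + ".nt.gz"


def split_by_subject(lines, out_dir):
    """Write one gzipped N-Triples file per subject and return the file count.

    Assumes `lines` is sorted by subject, so each subject forms one
    contiguous run and the whole data set never has to fit in memory.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stripped = (line.strip() for line in lines)
    triples = (t for t in stripped if t and not t.startswith("#"))
    count = 0
    for subject, group in groupby(triples, key=subject_of):
        with gzip.open(out / file_name_for(subject), "wt", encoding="utf-8") as f:
            for triple in group:
                f.write(triple + "\n")
        count += 1
    return count
```

Files written this way can be uploaded as they are; whether every client copes with a gzip-only response is exactly the compatibility question raised above.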
So the question is: is anybody publishing Linked Data using static files? What practices should I use to make such an endpoint maximally compatible and useful to end users while keeping reliability and operating costs low?
asked 18 Nov '12, 11:15
I don't know how scalable it would be, but I described a demo of doing content negotiation with static files at http://www.snee.com/bobdc.blog/2011/05/quick-and-dirty-linked-data-co.html . And, as it mentions, NetKernel might be something to look into.
Content negotiation could be handled by a small Sesame or Jena servlet, which would also do the translation and compression work.
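The negotiation logic itself is small. Here is a rough sketch in Python rather than as a servlet, assuming one pre-built static file per format; the `VARIANTS` table, the file suffixes and the function names are made up for illustration, and the Accept parsing is deliberately simplified (it ignores media-type parameters other than `q`).

```python
# Media types offered, in order of preference, mapped to a file suffix.
# Both the set of formats and the suffixes are assumptions for this sketch.
VARIANTS = {
    "text/turtle": ".ttl",
    "application/n-triples": ".nt",
    "application/rdf+xml": ".rdf",
}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip().lower() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((fields[0], q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: -pref[1])


def choose_variant(accept_header, gzip_ok=False):
    """Pick (media_type, file_suffix) for a request, or None if nothing fits."""
    prefs = parse_accept(accept_header or "*/*")
    refused = {t for t, q in prefs if q <= 0}
    for wanted, q in prefs:
        if q <= 0:
            continue
        for media_type, suffix in VARIANTS.items():
            if media_type in refused:
                continue
            wildcard = media_type.split("/")[0] + "/*"
            if wanted in (media_type, wildcard, "*/*"):
                return media_type, suffix + (".gz" if gzip_ok else "")
    return None
```

For example, `choose_variant("application/rdf+xml;q=0.9, text/turtle")` picks the Turtle file, and a request that only accepts `text/html` gets `None`, which the caller could turn into a 406 response.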
My advice is to use bzip if you pay per byte, gzip if you pay per CPU cycle.
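The trade-off is easy to measure on a sample of your own data before committing. A minimal sketch, taking "bzip" to mean bzip2 (an assumption) and using Python's standard `gzip` and `bz2` modules; the `measure` and `compare` helpers are hypothetical names, and the numbers will depend entirely on the data and the machine.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Compress `data` once and report the output size and wall-clock time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "bytes": len(packed), "seconds": elapsed}


def compare(data):
    """Run both codecs at their default settings over the same bytes."""
    return [
        measure("gzip", gzip.compress, data),
        measure("bz2", bz2.compress, data),
    ]


if __name__ == "__main__":
    # Placeholder input: replace with a real slice of the data set.
    sample = b'<http://example.org/s> <http://example.org/p> "value" .\n' * 10000
    for result in compare(sample):
        print(result)
```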
The non-SPARQL uniprot.org website has about 100 million files accessible as RDF/XML via HTTP. We loaded each record into a key-value store (Berkeley DB JE), where the key is the file name and the value is a record (which is translated to RDF/XML on the fly if requested). In your case S3 would be as good, if not better.
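To illustrate the shape of that lookup (this is not the UniProt code), here is a small sketch that uses SQLite from Python's standard library as a stand-in key-value store. The table layout and function names are assumptions, and instead of translating records on the fly it stores gzipped N-Triples and only decompresses when the client cannot take gzip.

```python
import gzip
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a one-table store: file name -> gzipped record."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "name TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return db


def put(db, name, ntriples):
    """Store one record's N-Triples text, gzip-compressed, under `name`."""
    body = gzip.compress(ntriples.encode("utf-8"))
    db.execute(
        "INSERT OR REPLACE INTO records (name, body) VALUES (?, ?)",
        (name, body),
    )
    db.commit()


def get(db, name, gzip_ok=False):
    """Return the record as bytes (still gzipped if gzip_ok), or None if absent."""
    row = db.execute(
        "SELECT body FROM records WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    return row[0] if gzip_ok else gzip.decompress(row[0])
```

With S3 the same idea applies: the object key plays the role of `name` and the stored object is the value.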
answered 19 Nov '12, 03:26
For just dereferenceability, this doesn't sound like an RDF-specific problem, by which I mean perhaps there's a happy medium between a static directory and full SPARQL? ... and I guess there should be off-the-shelf solutions?
For example, a key-value store should be better geared up for such a task than a (standard) filesystem (assuming good support for "long" values).
Another option is to go for a custom file system like Pomegranate:
If not Pomegranate, then perhaps having a glance through these blog posts might be of interest.