|
I've had people asking for dereferenceable versions of several data sets, including one of about 670M triples. I've resisted so far because the conventional approach, a general-purpose triple store, costs too much. In AWS I'd expect to pay at least $8000 a year in hosting, and I don't think the performance would be all that good. Since I can't really make people pay to use it or put ads on it, I can't see a way to find revenue to offset the cost. I went on an optimizing binge for some other web sites and cut their hosting costs by 75%, which made me revisit the question of the Linked Data endpoint. Here's the idea: I could split my data set into N-Triples, Turtle or RDF/XML files and either put them on a static web server or stuff them into Amazon S3. This ought to be very cheap to run (less than 1/10 the cost of the conventional system) and also very reliable. Because the scalability is so much better, I could also set a much higher limit on the number of triples you get back than most LD endpoints do. In the few cases where I do need to apply a cutoff, my Map/Reduce framework could sort triples in order of importance, as TBL has suggested. The devil is in the details. A system like this won't be so smart about content negotiation. Ideally I'd like to store gzip-compressed files on disk, send them over the wire in gzip form, and never worry about compatibility. So the question is: is anybody publishing Linked Data using static files? What practices should I use to make such an endpoint maximally compatible and useful to end users while keeping reliability high and operating costs low? |
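To make the splitting idea concrete, here is a minimal Python sketch of grouping N-Triples lines by subject and writing one gzip-compressed file per subject. It is only an illustration under assumptions: the function names, the SHA-1 based directory layout and the `.nt.gz` suffix are all hypothetical, and the in-memory grouping would only work for a small sample. For hundreds of millions of triples the grouping step would presumably live in the Map/Reduce job instead.

```python
import gzip
import hashlib
from collections import defaultdict
from pathlib import Path


def subject_of(line):
    """Return the subject term (IRI or blank node) of one N-Triples line."""
    return line.split(None, 1)[0]


def path_for(subject, root):
    """Map a subject to a file path (hypothetical layout).

    Hashing the subject and using the first two hex digits as a
    directory name keeps any single directory from growing too large.
    """
    digest = hashlib.sha1(subject.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / (digest + ".nt.gz")


def split_by_subject(lines, root):
    """Group triples by subject and write one gzipped file per subject.

    Returns a dict mapping each subject to the path that was written.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        groups[subject_of(line)].append(line)

    written = {}
    for subject, triples in groups.items():
        target = path_for(subject, root)
        target.parent.mkdir(parents=True, exist_ok=True)
        with gzip.open(target, "wt", encoding="utf-8") as out:
            out.write("\n".join(triples) + "\n")
        written[subject] = target
    return written
```

If the files end up in S3, one option for sending them in gzip form is to store each object with `Content-Encoding: gzip` and an RDF media type such as `application/n-triples` as object metadata. Clients that cannot handle gzip would not be served by that setup, which is part of the compatibility question above.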
|
This might be interesting: http://kwijibo.github.com/trilby/ This is really interesting. What is the scalability hangup with this project? I note that it also covers some services over and above dereferencing, such as full-text search. |
|
I don't know how scalable it would be, but I described a demo of doing content negotiation with static files at http://www.snee.com/bobdc.blog/2011/05/quick-and-dirty-linked-data-co.html . And, as it mentions, NetKernel might be something to look into. Unfortunately, Bob, this link isn't working. The semanticweb.com feature that turns the URL into a link included the period at the end of the sentence as part of the link. I just added a space before the period (a habit I've been getting into more lately with URLs at the end of sentences; perhaps my experience writing Turtle and SPARQL queries also encourages me to put a space before each period) and it seems to work fine now. |
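As a rough illustration of what content negotiation over static files involves, here is a small Python sketch that maps an `Accept` header to the extension of a pre-generated file. This is not the method from the linked post; the extension table, the Turtle default and the function names are assumptions made for the example. It handles `q` values and `*/*` but ignores partial wildcards such as `text/*`, and it falls back to the default rather than answering 406 when nothing matches.

```python
# Hypothetical mapping from media type to the extension of a static file.
EXTENSIONS = {
    "text/turtle": ".ttl",
    "application/n-triples": ".nt",
    "application/rdf+xml": ".rdf",
}
DEFAULT_TYPE = "text/turtle"  # assumed fallback, not a standard


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    ranked = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client does not give one
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        ranked.append((media_type, q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def choose_extension(header):
    """Pick the static file extension to serve for an Accept header."""
    for media_type, q in parse_accept(header or ""):
        if q <= 0:
            continue
        if media_type in EXTENSIONS:
            return EXTENSIONS[media_type]
        if media_type == "*/*":
            return EXTENSIONS[DEFAULT_TYPE]
    # Nothing matched: serve the default instead of failing.
    return EXTENSIONS[DEFAULT_TYPE]
```

For example, `choose_extension("text/turtle;q=0.5, application/rdf+xml")` selects `.rdf`, because the RDF/XML entry carries the implicit `q=1`. On a purely static host this logic has to live somewhere in front of the files, such as server rewrite rules or a thin proxy.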
|
For just dereferenceability, this doesn't sound like an RDF-specific problem, by which I mean perhaps there's a happy medium between a static directory and full SPARQL? ... and I guess there should be off-the-shelf solutions? For example, a key-value store should be better geared up for such a task than a (standard) filesystem (assuming good support for "long" values). Another option is to go for a custom file system like Pomegranate:
If not Pomegranate, then perhaps having a glance through these blog posts might be of interest. |

