I've had people asking for dereferenceable versions of several data sets, including one that is about 670M triples.

I've resisted this so far because the cost of the conventional way of doing this, with a general-purpose triple store, is just too high. If I did this in AWS, I'd expect to pay at least $8000 a year in hosting costs, and I don't think the performance would be all that good. Since I can't really make people pay to use it or put ads on it, I just can't see a way to find revenue to offset the cost.

I went on an optimizing binge for some other web sites and cut the hosting costs by 75% and that made me revisit the question of the Linked Data endpoint.

Here's the idea: I could split my data set into N-Triples, Turtle, or RDF/XML files that I either put on a static web server or stuff into Amazon S3. This ought to be very cheap to run (less than 1/10 the cost of the conventional system) and also rock-solid reliable. Because the scalability is so much better, I could also set a much higher limit on the number of triples you get back than most LD endpoints do. In the few cases where I do need to apply a cutoff, my Map/Reduce framework could sort triples in order of importance, as TBL has suggested.
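For example, a first cut at the split could group triples by subject, one static file per described resource. This is a minimal sketch assuming N-Triples input; the SHA-1 file-naming scheme is just an illustration of how to get stable, S3-friendly keys:

```python
import hashlib
from collections import defaultdict

def partition_ntriples(lines):
    """Group N-Triples lines by subject, so each subject's
    description can become one static file."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # In N-Triples the first space-delimited term is the subject.
        subject = line.split(' ', 1)[0]
        buckets[subject].append(line)
    return buckets

def file_name_for(subject):
    # Hash the subject IRI into a fixed-length name usable as an S3 key.
    return hashlib.sha1(subject.encode('utf-8')).hexdigest() + '.nt'

triples = [
    '<http://example.org/a> <http://example.org/p> "x" .',
    '<http://example.org/a> <http://example.org/p> "y" .',
    '<http://example.org/b> <http://example.org/p> "z" .',
]
buckets = partition_ntriples(triples)
```

A Map/Reduce job could do exactly this grouping at scale, with the importance-ordered cutoff applied per bucket.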

The devil here is in the details. A system like this won't be as smart about content negotiation. Ideally I'd like to store gzip-compressed files on disk and send them over the wire in gzip form, and never worry about compatibility.
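The compatibility worry boils down to one check: a pre-gzipped file can go out as-is only when the client advertises gzip support. A hedged sketch of that decision (the helper name is hypothetical; a real server would also handle q-values):

```python
def choose_encoding(accept_encoding):
    """Decide whether a pre-compressed .gz file can be sent as-is.
    Returns 'gzip' if the client's Accept-Encoding header allows it,
    or None if the server would have to decompress first."""
    encodings = [e.split(';')[0].strip().lower()
                 for e in (accept_encoding or '').split(',')]
    return 'gzip' if 'gzip' in encodings or '*' in encodings else None
```

When this returns 'gzip', the server streams the stored bytes untouched and sets `Content-Encoding: gzip`; otherwise it must inflate on the fly for the rare non-gzip client.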

So the question is: is anybody publishing Linked Data using static files? What practices should I use to make such an endpoint maximally compatible and useful to end users while keeping reliability and operating costs low?

asked 18 Nov '12, 11:15 by database_animal ♦


You should check out Linked Data Fragments. It's a way of exposing Linked Data so that it can be queried with SPARQL, without the need for an actual SPARQL endpoint. More specifically, it comes with a client that answers a SPARQL query by dereferencing multiple URLs instead of sending a single /query? request. Here's an example: http://client.linkeddatafragments.org/
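The requests the client sends are just template URLs, one per triple pattern. A sketch of building such a request (the `subject`/`predicate`/`object` parameter names follow the common Triple Pattern Fragments convention, but a real client takes them from the server's hypermedia controls):

```python
from urllib.parse import urlencode

def fragment_url(base, s=None, p=None, o=None):
    """Build a Triple Pattern Fragments request URL.
    Unbound terms are simply omitted from the query string."""
    params = {name: value
              for name, value in (('subject', s), ('predicate', p), ('object', o))
              if value}
    return base + ('?' + urlencode(params) if params else '')
```

The client decomposes a SPARQL query into such single-pattern requests and joins the results locally, which is what makes a dumb static server sufficient.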

Along those lines, I've taken the idea one step further and created https://github.com/lmatteis/ldstatic, which parses RDF and creates Turtle files for each of the possible triple patterns. That way you can host the resulting files on a static web server (such as GitHub) and query them using a basic Linked Data Fragments client.
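The core idea can be sketched as follows: each triple matches up to eight patterns, one per combination of bound and unbound terms, and each pattern can name a static fragment file. This is illustrative only, not ldstatic's actual file layout:

```python
from itertools import product

def pattern_keys(s, p, o):
    """Enumerate the 8 triple-pattern keys a triple matches,
    one per bound/unbound combination of its three terms."""
    keys = []
    for bound_s, bound_p, bound_o in product((True, False), repeat=3):
        keys.append((s if bound_s else '?s',
                     p if bound_p else '?p',
                     o if bound_o else '?o'))
    return keys

keys = pattern_keys('<http://ex.org/a>', '<http://ex.org/p>', '"x"')
```

Writing every triple into all eight of its pattern files trades storage for the ability to answer any single-pattern query with one static fetch.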

Pretty cool stuff I think.


answered 13 May, 13:20 by Luca Matteis

impressive !!!

(13 May, 19:50) Maatary

This might be interesting: http://kwijibo.github.com/trilby/


answered 18 Nov '12, 14:41 by AndyS ♦

This is really interesting. What is the scalability hangup with this project?

I note that this project also covers some services over and above just the dereferencing such as full text search.

(20 Nov '12, 17:16) database_animal ♦

I am the author of Trilby. I wrote it to cater to small datasets, focusing on features and ease of setup over scalability. It won't scale too well because it uses flat-file storage and PHP (which has a shared-nothing architecture). I did it this way so that Trilby would have no dependencies on database servers etc. It loads indexes into memory with each request, and the indexes tell it which descriptions to retrieve from file. In practice, Turtle files over about 30MB would get annoyingly slow to load, but query time was still OK.

(27 Jul '13, 05:37) keithalexander

I don't know how scalable it would be, but I described a demo of doing content negotiation with static files at http://www.snee.com/bobdc.blog/2011/05/quick-and-dirty-linked-data-co.html . And, as it mentions, NetKernel might be something to look into.


answered 18 Nov '12, 17:43 by bobdc

Unfortunately, Bob, this link isn't working.

(20 Nov '12, 17:17) database_animal ♦

The semanticweb.com thing that turns the URL into a link included the period at the end of the sentence as part of the link. I just added a space before the period (a habit I've been getting into more lately with URLs at the end of sentences; perhaps my experience writing Turtle and SPARQL queries also encourages me to put a space before each period) and it seems to work fine now.

(20 Nov '12, 17:36) bobdc

The content negotiation could be handled by a little Sesame or Jena servlet, which would do the content negotiation and the translation/compression work.

My advice is to use bzip2 if you pay per byte, gzip if you pay per CPU cycle.

The non-SPARQL uniprot.org website has about 100 million files accessible as RDF/XML via HTTP. We loaded each record into a key-value store (BerkeleyDB JE), where the key is the file name and the value is a record (which is translated to RDF/XML on the fly if requested). In your case, S3 would be as good if not better.
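That design can be sketched in a few lines: a key-value store keyed by file name, with the value only serialized to RDF/XML when a client asks for it. Here a plain dict stands in for BerkeleyDB JE, and the serializer is hand-rolled for illustration (a real system would use an RDF library, and the `ex:` namespace is an assumption):

```python
# Dict standing in for the BerkeleyDB JE key-value store.
store = {}

def put_record(name, triples):
    """Store the compact internal record under its file name."""
    store[name] = triples

def get_as_rdfxml(name, ns='http://example.org/ns#'):
    """Translate a stored record to RDF/XML on the fly;
    return None if no record exists under that name."""
    triples = store.get(name)
    if triples is None:
        return None
    descriptions = '\n'.join(
        '  <rdf:Description rdf:about="%s"><ex:%s>%s</ex:%s></rdf:Description>'
        % (s, p, o, p) for s, p, o in triples)
    return ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
            '         xmlns:ex="%s">\n%s\n</rdf:RDF>' % (ns, descriptions))
```

Keeping records compact in the store and serializing per request is what makes 100 million files feasible without 100 million actual files on disk.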


answered 19 Nov '12, 03:26 by Jerven ♦

I know I killed a Linux machine by putting 700,000 files in a directory 12 years ago, but I think newer filesystems are more scalable. Perhaps JDBM or S3 as a backend to a transcoder is a good idea.

(20 Nov '12, 17:20) database_animal ♦
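The classic workaround for the huge-directory problem is to shard file names into nested subdirectories by hash prefix, so no single directory holds millions of entries. A small sketch (the depth and width parameters are arbitrary choices):

```python
import hashlib
import os

def sharded_path(root, name, depth=2, width=2):
    """Spread files across nested subdirectories keyed by a hash
    prefix of the file name, e.g. root/ab/cd/name."""
    digest = hashlib.md5(name.encode('utf-8')).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, name)
```

With depth 2 and width 2 this yields 65,536 leaf directories, keeping each one small even for hundreds of millions of files.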

For just dereferenceability, this doesn't sound like an RDF-specific problem; by which I mean, perhaps there's a happy medium between a static directory and full SPARQL, and I'd guess there should be off-the-shelf solutions?

For example, a key-value store should be better geared up for such a task than a (standard) filesystem (assuming good support for "long" values).

Another option is to go for a custom file system like Pomegranate:

Emerging popular Web applications, e.g. online photo and social network services, exhibit the requirement for accessing lots of small files. Small files are concurrently created and read in file systems. Particularly, NVersion and 1-N Mapping issues of distributed file systems, especially for small files, have not been addressed yet. In this paper, we propose to evolve the directory model, interface and storage structures to address NVersion and 1-N Mapping issues and support semantic data access. We implement a novel distributed file system to manage billions of small files, and focus on decreasing metadata service latency and improving small file I/O to promote the efficiency of handling small files. Our evaluation demonstrates that the new directory model, interface and storage structures can not only save storage space, but also increase small file performance. With 32 servers, we deliver the leading metadata and small file I/O performance: 180,000 file creates per seconds and 1GB/s aggregated write bandwidth.

Project Features

  • Pomegranate is a distributed file system over distributed tabular storage (xTable) implemented in C;
  • Implement a flexible memory caching layer, which is snapshot-able and fault tolerant;
  • A reliable storage layer that support massive tiny file reads and writes;
  • A key/value interface as a supplementary for POSIX I/O interface;

If not Pomegranate, then perhaps having a glance through these blog posts might be of interest.


answered 20 Nov '12, 18:36 by Signified ♦
