I've heard that it's often difficult to scale triple stores beyond 5 billion triples (please correct me if that's wrong). In that case, people look for other options such as HBase or Cassandra for triple storage. If so, I have the following questions:

a) What are the factors that limit the scaling of triple stores?
b) If we move triple storage to HBase or another NoSQL database, is there a SPARQL (and inferencing) engine that works on top of NoSQL storage? I think this is the advantage that triple stores provide, but I would still like to know how querying (and inferencing) is done when triples are stored in NoSQL databases.

Thanks.

asked 27 Feb '12, 14:05 by metaweb87 (edited 27 Feb '12, 16:09)

Could anyone point to active projects in the context of point (b)?

(28 Feb '12, 10:36) metaweb87

I don't think that's true. Stardog, OWLIM, AllegroGraph, BigData, and 4Store probably all scale to that and beyond. AllegroGraph claims that with enough hardware they can do 1 trillion. So I don't think it's the case that existing triple stores can't scale to that size. Why do you think they can't?

To specifically answer your questions: the answer to (a) most likely differs from triple store to triple store, and I'd wager it's query answering at that scale, rather than loading that many triples, that is the more troublesome part. Of course, as the SP2Bench benchmark shows, it's trivial to come up with queries that bring triple stores to their knees even at very low scales. But loading 5 billion triples isn't all that hard.

As for (b), I think there are people working on that; googling around brings up some interesting results. I don't know of any specific work, though someone will probably shout and plug their project. But if you can do SPARQL query answering over an existing NoSQL storage solution, you can probably sort out a way to apply some forward chaining and get your inferences from that.
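
For a sense of what that forward-chaining step could look like, here is a minimal sketch in Python. It assumes only that the backing store (HBase, Cassandra, whatever) can hand you triples as (s, p, o) tuples; the rule set here is just the RDFS subclass rules, and all the names are made up for illustration, not any particular product's API.

```python
# Minimal forward-chaining sketch: materialize rdfs:subClassOf inferences
# over triples fetched from any backing store. The storage layer is
# abstracted as a plain iterable of (s, p, o) tuples; names illustrative.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialize_subclass(triples):
    """Apply rdfs9/rdfs11-style rules until a fixpoint is reached."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for (s, p, o) in closure if p == SUBCLASS]
        # rdfs11: subClassOf is transitive
        for (a, b) in subclass:
            for (c, d) in subclass:
                if b == c and (a, SUBCLASS, d) not in closure:
                    closure.add((a, SUBCLASS, d))
                    changed = True
        # rdfs9: instances of a subclass are instances of the superclass
        for (s, p, o) in list(closure):
            if p == RDF_TYPE:
                for (sub, sup) in subclass:
                    if o == sub and (s, RDF_TYPE, sup) not in closure:
                        closure.add((s, RDF_TYPE, sup))
                        changed = True
    return closure

triples = {
    ("ex:Fido", RDF_TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}
for t in sorted(materialize_subclass(triples)):
    print(t)  # Fido ends up typed as Dog, Mammal, and Animal
```

A naive fixpoint loop like this won't fly at 5 billion triples, of course; the point is only that materialization needs nothing from the store beyond iterating and inserting triples, which any NoSQL backend can do.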

You might want to take an existing triple store, or a couple of them, for a trial run before heading down the path of a NoSQL solution. My two cents.


answered 27 Feb '12, 14:39 by mhgrove

I think there's a difference in what counts as "enough" hardware to scale a triple store versus a NoSQL store to over 5 billion triples. I'd like to know what sort of hardware setup is required/sufficient to store 5 billion triples in each case.

Besides claims and benchmarks, I'd really appreciate it if someone could share their own experience of scaling on commodity hardware.

I also agree that, at large scale, query answering is relatively more expensive than storage.

(27 Feb '12, 15:43) metaweb87

Storing is not the problem; the difficulty is answering queries with good performance. Most NoSQL solutions that scale well limit the types of queries one can ask.

A 5-billion-triple store can easily run on a machine with 32 or 64 GB of RAM, and depending on the queries and the store, this may give performance that meets your needs.

Remember that some triple stores with graph indexes can also be used as key-value stores, with similar value-retrieval performance.
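
To illustrate that last point, here's a rough sketch of how a triple store's graph indexes boil down to sorted key-value entries, which is why the two kinds of store can overlap. The class, the encoding, and the data are invented for illustration; exact prefix-boundary handling is elided.

```python
# Sketch: a triple store's graph indexes are essentially sorted key-value
# entries. Storing each triple under three key permutations (SPO, POS, OSP)
# lets any triple pattern with a bound prefix be answered by a prefix scan --
# the same access pattern a wide-column store such as HBase is built for.
# A sorted in-memory list stands in for the backing store.

from bisect import bisect_left, insort

class TinyTripleIndex:
    def __init__(self):
        self.keys = []  # sorted encoded keys, one per (permutation, triple)

    def add(self, s, p, o):
        for perm in (("spo", s, p, o), ("pos", p, o, s), ("osp", o, s, p)):
            key = "\x00".join(perm)
            if key not in self.keys:
                insort(self.keys, key)

    def scan(self, order, *bound):
        """Prefix scan: scan('spo', 'ex:Fido') yields all triples about Fido."""
        prefix = "\x00".join((order,) + bound)
        i = bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield tuple(self.keys[i].split("\x00")[1:])
            i += 1

idx = TinyTripleIndex()
idx.add("ex:Fido", "rdf:type", "ex:Dog")
idx.add("ex:Fido", "ex:name", '"Fido"')
print(list(idx.scan("spo", "ex:Fido")))   # everything about ex:Fido
print(list(idx.scan("pos", "rdf:type")))  # all rdf:type triples, as (p, o, s)
```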

(29 Feb '12, 12:19) Jerven ♦

Well, a few organizations, such as DERI and Sindice, have triple stores in the 50-80 billion triple range.

People who are doing this are running parallel clusters with quite a bit of hardware, so even though we know triple stores "scale" to this range, the cost and hassle might be more than a lot of people want to take on.

Now, if you could compress a triple down to 20 bytes, you could store 100 billion triples on a consumer 2 TB hard drive. The trouble is that you could only access them via a full scan, which means that even very simple queries can take a long time to run. The special thing that SPARQL engines bring to the table is an indexing structure that lets you write crazy queries with lots of joins and generally get pretty good performance. That index structure is expensive, so a triple store will typically take more time to load than, say, MySQL or Berkeley DB.
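
The arithmetic behind that paragraph, as a quick sketch (the disk-throughput figure is an assumption for a consumer drive of that era, not a measurement):

```python
# Back-of-envelope numbers for the 20-bytes-per-triple scenario above.

BYTES_PER_TRIPLE = 20
TRIPLES = 100_000_000_000        # 100 billion
DISK_THROUGHPUT = 150 * 10**6    # bytes/second sequential read, assumed

total_bytes = BYTES_PER_TRIPLE * TRIPLES
print(f"storage needed: {total_bytes / 10**12:.1f} TB")   # 2.0 TB

# Without an index, even a trivial lookup is a full scan:
scan_seconds = total_bytes / DISK_THROUGHPUT
print(f"one full scan: {scan_seconds / 3600:.1f} hours")  # ~3.7 hours
```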

You can definitely build something that loads data more quickly or that answers a certain kind of query more quickly, but with triple stores you pay a definite performance price for being able to write highly flexible SPARQL queries.

Remember that RDF is not intended to compete in online transaction processing (OLTP), which is where many of the fashionable NoSQL systems shine. RDF is more for data-warehousing applications, where an ETL stage may happen before data is loaded into the database; that ETL stage may use algorithms that parallelize well and run on Map-Reduce clusters.
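
As a toy illustration of why that ETL stage parallelizes well: each record can be cleaned independently of every other, which is exactly the shape Map-Reduce wants. The file names and the keep-these-predicates rule below are invented for the example.

```python
# Toy per-line ETL step over N-Triples. Because every line is processed
# independently, the same function can run as the map phase of a
# Map-Reduce job over a sharded input, long before anything touches
# the triple store. Filter rule and paths are illustrative only.

KEEP_PREDICATES = {
    "<http://xmlns.com/foaf/0.1/name>",
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
}

def clean_line(line):
    """Map step: normalize one N-Triples line, or drop it."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 2)  # subject, predicate, rest-of-line
    if len(parts) < 3 or parts[1] not in KEEP_PREDICATES:
        return None
    return " ".join(parts)

with open("raw.nt") as src, open("curated.nt", "w") as dst:
    for raw in src:
        cleaned = clean_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")
```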

All in all, "load it all into a triple store" is the fast track to semantic project failure. What you should do instead is gather and clean data outside your triple store and produce a curated collection of data to load into it that will deliver the most impact per triple.


answered 27 Feb '12, 16:13 by database_animal ♦


Bingo.

"a) What are the factors that limit the scaling of triple stores?"

The correct answer is the impatience of humans.

(29 Feb '12, 12:55) Signified ♦

There is a summary of this topic; see "Large-Scale Linked Data Processing: Cloud Computing to the Rescue?" ;)


answered 02 Mar '12, 13:46 by zazi
