I've heard that it's often difficult to scale triple stores beyond 5 billion triples. [Please correct me if that's wrong.] In such cases, people look at other options like HBase and Cassandra for triple storage. If so, I have the following questions:
a) What are the factors that limit the scaling of triple stores?
b) Is there any work on SPARQL query answering and inference over NoSQL stores like HBase or Cassandra?
I don't think that's true. Stardog, OWLIM, AllegroGraph, BigData, and 4Store can probably all scale to that and beyond; AllegroGraph claims that with enough hardware it can handle 1 trillion triples. So I don't think it's the case that existing triple stores can't scale to that size. Why do you think they can't?
To answer your question specifically, the answer to a) most likely differs from triple store to triple store, and I'd wager it's query answering at that scale that is more troublesome than loading that many triples. Of course, as the SP2B benchmark shows, it's trivial to come up with queries that bring triple stores to their knees even at very low scales. But loading 5 billion triples isn't all that hard.
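To make the "bring stores to their knees" point concrete, here's a minimal sketch of why join-heavy queries get expensive: a naive nested-loop evaluation of triple patterns multiplies intermediate bindings with every pattern you add. This is plain illustrative Python, not how any real SPARQL engine is implemented.

```python
# Naive triple-pattern matching over an in-memory set of triples.
# Variables are strings starting with "?"; everything else is a constant.
# Illustrative sketch only, not a real query engine.

def match(pattern, triple, binding):
    """Extend `binding` so `pattern` matches `triple`, or return None."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if b.get(p, t) != t:   # conflicting earlier binding
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def join(triples, patterns):
    """Nested-loop join: each pattern scans every triple for every binding."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings
                       for t in triples
                       if (b2 := match(pat, t, b)) is not None]
    return bindings

triples = {("alice", "knows", "bob"),
           ("bob", "knows", "carol"),
           ("carol", "knows", "alice")}

# Two joined patterns: who knows someone who knows someone?
results = join(triples, [("?x", "knows", "?y"), ("?y", "knows", "?z")])
print(len(results))  # prints 3
```

With k patterns over n triples this naive approach does O(n^k) work; index structures and join ordering are exactly what stores use to avoid it, and benchmark queries are designed to defeat those tricks.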
As for b), I think there are people working on that; googling around brings up some interesting results. I don't know of any specific work, though someone will probably shout and pimp their project. But if you can do SPARQL query answering over an existing NoSQL storage solution, you can probably sort out a way to apply some forward chaining and get your inferences from that.
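The forward-chaining part is conceptually just a fixpoint loop over your triples, whatever the backend. Here's a hypothetical sketch for two RDFS-style rules (type propagation along subClassOf, and subClassOf transitivity); a real materialization engine would batch and index this rather than rescan everything.

```python
# Forward chaining to a fixpoint for two RDFS-style rules:
#   (C subClassOf D) and (x type C)        => (x type D)
#   (C subClassOf D) and (D subClassOf E)  => (C subClassOf E)
# Illustrative sketch; vocabulary terms are abbreviated.

def materialize(triples):
    facts = set(triples)
    while True:
        new = set()
        sub = [(s, o) for s, p, o in facts if p == "subClassOf"]
        for c, d in sub:
            for s, p, o in facts:
                if p == "type" and o == c:
                    new.add((s, "type", d))
                if p == "subClassOf" and s == d:
                    new.add((c, "subClassOf", o))
        if new <= facts:          # fixpoint reached: nothing novel derived
            return facts
        facts |= new

kb = {("Dog", "subClassOf", "Mammal"),
      ("Mammal", "subClassOf", "Animal"),
      ("rex", "type", "Dog")}

closed = materialize(kb)
print(("rex", "type", "Animal") in closed)  # prints True
```

The point is that the inference layer only needs scan-and-insert from the storage layer, which is exactly what HBase or Cassandra can give you.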
My two cents: you might want to take an existing triple store, or a couple of them, for a trial run before heading down the path of a NoSQL solution.
answered 27 Feb '12, 14:39
Well, a few organizations, such as DERI and Sindice, have triple stores in the 50-80 billion triple range.
People who are doing this are running parallel clusters with quite a bit of hardware, so even though we know triple stores "scale" to this range, the cost and hassle may be more than a lot of people want to take on.
Now, if you could compress a triple down to 20 bytes, you could store 100 billion triples on a consumer 2TB hard drive. The trouble is that you could only access them via a full scan, which means that even very simple queries can take a long time to run. The special thing SPARQL engines bring to the table is an indexing structure that lets you write crazy queries with lots of joins and generally get pretty good performance. That index structure is expensive to build, so a triple store is typically going to take more time to load than, say, MySQL or Berkeley DB.
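The back-of-the-envelope arithmetic works out like this (the 20-byte figure is the scenario's assumption, and the index multiplier below is purely illustrative):

```python
# Raw storage for the compressed-triple scenario above.
bytes_per_triple = 20
n_triples = 100_000_000_000          # 100 billion

raw_bytes = bytes_per_triple * n_triples
print(raw_bytes / 1e12)              # 2.0 TB: fits on the 2TB drive

# A SPARQL engine typically keeps several triple permutation indexes
# (SPO, POS, OSP, ...); assume a hypothetical 6x overhead:
index_copies = 6
print(raw_bytes * index_copies / 1e12)   # 12.0 TB with indexes
```

That multiplier is why loading is slower and disks fill faster than the raw triple count suggests, and it's the price of fast arbitrary joins.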
You can definitely build something that loads data more quickly or that answers a certain kind of query more quickly, but with triple stores you pay a definite performance price for being able to write highly flexible SPARQL queries.
Remember that RDF is not intended to compete in online transaction processing (OLTP), which is where many of the fashionable NoSQL systems shine. RDF is more for data warehousing applications, where an ETL stage may happen before data is loaded into the database; that ETL stage may use algorithms that parallelize well and run on Map-Reduce clusters.
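As a hypothetical illustration of why that ETL stage parallelizes well: each raw record maps to triples independently of every other record, so the map step can be farmed out to as many workers as you like. Records and field names here are made up.

```python
# Sketch of the map stage of an ETL job: raw records -> cleaned triples.
# Each record is processed independently, so this loop splits trivially
# across Map-Reduce workers; the field names are hypothetical.

def to_triples(record):
    """Map one raw record to RDF-style triples, dropping empty fields."""
    rid, name, city = record
    out = []
    if name.strip():
        out.append((rid, "name", name.strip()))
    if city.strip():
        out.append((rid, "locatedIn", city.strip()))
    return out

records = [("p1", " Alice ", "Berlin"),
           ("p2", "Bob", ""),          # missing city is cleaned out
           ("p3", "", "Paris")]

# "Reduce" step: flatten and deduplicate before loading the store.
triples = sorted({t for r in records for t in to_triples(r)})
print(len(triples))  # prints 4
```

The output of a job like this is exactly the "curated collection" worth loading, rather than the raw rows.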
All in all, "load it all into a triple store" is the fast track to semantic project failure. What you should do instead is gather and clean data outside your triple store and produce a curated collection of data to load into it, one that will deliver the most impact per triple.
answered 27 Feb '12, 16:13
There is a summary of that topic; see "Large-Scale Linked Data Processing: Cloud Computing to the Rescue?" ;)
answered 02 Mar '12, 13:46