Well, a few organizations, such as DERI and Sindice, have triple stores in the 50-80 billion triple range.
The people doing this run parallel clusters with quite a bit of hardware -- so even though we know triple stores "scale" to this range, the cost and hassle might be more than a lot of people want to take on.
Now, if you could compress a triple down to 20 bytes, you could store 100 billion triples on a consumer 2TB hard drive. The trouble is that you can only access data stored this way via a "full scan", which means that even very simple queries can take a long time to run. The special thing that SPARQL engines bring to the table is an indexing structure which lets you write crazy queries with lots of joins and generally get pretty good performance. This index structure is expensive to build, so a triple store is typically going to take more time to load than, say, MySQL or Berkeley DB.
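To make the arithmetic and the scan-versus-index trade-off concrete, here is a minimal sketch in Python. The 20-byte figure comes from the paragraph above; the toy triples, the function names, and the two dictionary indexes are assumptions made up for illustration and do not reflect how any particular triple store lays out its indexes.

```python
from collections import defaultdict

BYTES_PER_TRIPLE = 20          # compressed size assumed in the text above
TRIPLES = 100 * 10**9          # 100 billion triples


def storage_bytes(n_triples, bytes_per_triple=BYTES_PER_TRIPLE):
    """Raw storage needed with no index at all."""
    return n_triples * bytes_per_triple


# Hypothetical toy data set: (subject, predicate, object)
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
]


def full_scan(data, s=None, p=None, o=None):
    """Answer a triple pattern by touching every triple: O(n) per query."""
    return [
        t for t in data
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def build_indexes(data):
    """Build two illustrative indexes. Each one stores every triple again,
    which is where the extra load time and space go."""
    by_subject = defaultdict(list)
    by_pred_obj = defaultdict(list)
    for t in data:
        by_subject[t[0]].append(t)
        by_pred_obj[(t[1], t[2])].append(t)
    return by_subject, by_pred_obj


by_subject, by_pred_obj = build_indexes(triples)

print(storage_bytes(TRIPLES) / 10**12, "TB")       # decimal terabytes
print(full_scan(triples, p="worksAt", o="acme"))   # scans all triples
print(by_pred_obj[("worksAt", "acme")])            # direct lookup
```

Both queries return the same triples; the difference is that the scan reads everything while the lookup goes straight to the matching entries, at the cost of building and storing the indexes first.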
You can definitely build something that loads data more quickly or that answers a certain kind of query more quickly, but with triple stores you pay a definite performance price for being able to write highly flexible SPARQL queries.
Remember that RDF is not intended to compete in online transaction processing (OLTP), which is where many of the fashionable NoSQL systems shine. RDF is more for data warehousing applications, where an ETL stage may happen before data is loaded into a database -- that ETL stage may use algorithms that parallelize well and work on Map-Reduce clusters.
In all, "load it all into a triple store" is the fast track to semantic project failure. What you should do is gather and clean data outside your triple store and produce a curated collection of data to load into it that will deliver the most impact per triple.
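As a rough illustration of curating outside the store, the sketch below filters a stream of N-Triples-style lines before anything is loaded. It assumes one well-formed triple per line and uses a deliberately naive whitespace split rather than a real N-Triples parser; the function name `curate`, the predicate whitelist, and the sample URIs are all hypothetical.

```python
def curate(lines, keep_predicates):
    """Yield only the triples worth loading.

    Drops blank lines, comments, malformed lines, predicates that are
    not on the whitelist, and exact duplicates.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)          # subject, predicate, rest
        if len(parts) != 3 or not parts[2].endswith("."):
            continue                         # malformed: skip it
        if parts[1] not in keep_predicates:
            continue                         # not useful for this project
        if line in seen:
            continue                         # exact duplicate
        seen.add(line)
        yield line


# Hypothetical raw dump with noise in it
raw = [
    '<http://example.org/a> <http://example.org/name> "Alice" .',
    '<http://example.org/a> <http://example.org/name> "Alice" .',
    '<http://example.org/a> <http://example.org/pageViews> "17" .',
    "# a comment line",
    "this line is not a triple",
    '<http://example.org/b> <http://example.org/name> "Bob" .',
]

keep = {"<http://example.org/name>"}
curated = list(curate(raw, keep))
for triple in curated:
    print(triple)
```

Because the filter works line by line and only keeps a set of lines already seen, the same idea can be split across many input files and run in parallel, with a final deduplication pass over the merged output.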
answered
27 Feb '12, 16:13
database_animal ♦
Could anyone point to active projects related to point (b)?