Edit: I'm using dotNetRDF.
I've got this tiny program which tries to load a file from DBpedia. However, I'm running out of memory. Is there any way around this, short of buying a better machine?
Tools like Jena and dotNetRDF let you create in-memory triple stores.
These are convenient (you can use them the way Perl programmers use hashes), but as you've discovered, it's very easy to fill memory if you work with too many triples.
You've got two options for working with larger data sets.
One of them is to graduate to using a disk- or cluster-based triple store such as OpenLink Virtuoso, OWLIM-SE, BigData, Allegrograph or Stardog. These are designed for large scale, and you'll be able to handle larger data sets with them than with your RDF toolkit on the same hardware. Your toolkit should have an API for accessing them.
Triple stores do consume RAM, however, so it is a wise decision to stuff the slots in your workstation with the most RAM you can.
The most radical way to reduce RAM requirements is to process the triples in a streaming mode.
For instance, Unix tools like awk, grep, sort, sed and uniq can be used to process N-Triples files. You can gather statistics on which URIs appear and then select some subset to deliver to a triple store. (On Windows you can get these tools through Cygwin, or do the same kind of thing with Windows PowerShell.)
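As a minimal sketch of that idea (the file and its contents are hypothetical stand-ins for a real DBpedia dump): because N-Triples is line-oriented and subjects and predicates are URIs that never contain spaces, the second whitespace-separated field of each line is always the predicate, so an awk/sort/uniq pipeline gives you a frequency table.

```shell
# Hypothetical sample data standing in for a DBpedia N-Triples dump
cat > sample.nt <<'EOF'
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" .
<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Germany> .
<http://dbpedia.org/resource/Paris> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/France> .
EOF

# Field 2 of every N-Triples line is the predicate; count how often
# each one appears, most frequent first
awk '{ print $2 }' sample.nt | sort | uniq -c | sort -rn
```

The same trick with `$1` gives subject frequencies; memory use stays constant no matter how large the dump is, since nothing but the sort's working set is held in RAM.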
If you run the triple file through the Unix sort command, all the triples with a given subject will sort together. It's then easy to scan the file and load all the triples for each individual into an in-memory triple store (this is the "reduce" in Map/Reduce). You can then write SPARQL queries against each individual to ask whatever questions you want.
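A sketch of the sort-then-scan step, again on hypothetical data: after sorting, a one-pass awk script can detect where one subject's group of triples ends and the next begins; in a real pipeline each group would be handed to an in-memory store for querying rather than printed.

```shell
# Hypothetical unsorted input
cat > people.nt <<'EOF'
<http://example.org/b> <http://example.org/name> "B" .
<http://example.org/a> <http://example.org/name> "A" .
<http://example.org/a> <http://example.org/age> "30" .
EOF

# Sorting brings all triples sharing a subject together
sort people.nt > sorted.nt

# Single pass over the sorted file: print each triple, and mark the
# boundary whenever the subject (field 1) changes -- at each boundary a
# real pipeline would flush the group into an in-memory triple store
awk '$1 != prev { if (NR > 1) print "---- end of " prev; prev = $1 }
     { print }
     END { print "---- end of " prev }' sorted.nt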
Now the above scheme is limited to facts that are directly attached to subjects and can't do more advanced graph traversals. However, thinking this way, you can do other things. Suppose you want to traverse a "hasSpouse" property: you could scan the file, extract those facts, store them in a triple store, and then write SPARQL queries against that. You can also do joins to bring in additional facts from other places to construct richer "individual graphs".
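Extracting one property's facts is just a fixed-string grep over the dump. The sketch below uses DBpedia's actual spouse property URI (dbo:spouse) in place of the "hasSpouse" placeholder above; the sample file is hypothetical.

```shell
# Hypothetical miniature dump
cat > dump.nt <<'EOF'
<http://dbpedia.org/resource/Pierre_Curie> <http://dbpedia.org/ontology/spouse> <http://dbpedia.org/resource/Marie_Curie> .
<http://dbpedia.org/resource/Paris> <http://www.w3.org/2000/01/rdf-schema#label> "Paris" .
EOF

# Keep only the spouse facts; the result is usually small enough to
# load into an in-memory store and traverse with SPARQL
grep -F '<http://dbpedia.org/ontology/spouse>' dump.nt > spouse.nt
```

Because the predicate URI is matched as a fixed string (`-F`), this stays fast even on multi-gigabyte dumps, and the surviving subset is what you join against other extracted files.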
Any SPARQL query can be broken up into relational operators that can be implemented by a Map/Reduce framework, and there are tools like Pig and Hive that parallelize these operations on Hadoop. Unfortunately these tools don't implement the RDF data model, so there is some "impedance mismatch" when you process RDF with them. Perhaps someday there will be better tools for this, but I just made an open source release of Infovore, a tool that supports the kind of processing described above.
Infovore is written in Java and uses parallel processing to get a serious speedup on 4- to 8-core servers and workstations. It doesn't have the maturity of Hadoop, but it has many optimizations that make it efficient for datasets at the DBpedia and Freebase scale.
+1 to what database_animal says. dotNetRDF is only designed to handle at best a couple of million triples in memory on a typical machine. Memory usage is roughly 1 KB per triple (so a couple of million triples already means around 2 GB), though it may be more or less depending on the data.
There is a Storage API specifically designed for working with external triple stores. For large data I would strongly recommend loading the data into your store of choice using its native bulk-load API rather than dotNetRDF (otherwise you will likely hit the same OOM problem). Once the data is in your store you can use the Storage API to query it.
Even if you choose a store that isn't explicitly supported by dotNetRDF, you can still query it if it has a SPARQL endpoint; see SparqlConnector.
answered 26 Nov '12, 12:43
Rob Vesse ♦