|
What are the best practices for processing RDF data using MapReduce (via Hadoop)? I am looking for people to share their suggestions and lessons learned, such as:
|
|
From my experience within the Sindice project, you should definitely use N-Triples or N-Quads, for many reasons:
Thanks for sharing your comments. I am leaving the answer open for a little while to see if there are other people who want to share additional suggestions. |
|
I've not done MapReduce specifically, but I have done other parallel RDF work (on clusters and supercomputers), so I'm not sure whether this will apply to you directly (hopefully it does). I can say yes, definitely use N-Triples or N-Quads. Because each line is a complete statement, parsing and serializing become trivial, and you can split and concatenate files without worrying about namespace declarations or about cutting through the middle of a complex structure (as can happen with RDF/XML or Turtle). |
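To make the "split anywhere on a line boundary" point concrete, here is a minimal Python sketch. It is not a full N-Triples parser: the regular expression is a simplified assumption, the `http://example.org/` URIs are made up for illustration, and `split_on_lines` and `parse_chunk` are hypothetical helper names. It only shows that chunks cut on line boundaries can be parsed independently and give the same triples as parsing the whole file.

```python
import re

# Simplified pattern for one N-Triples statement (an illustrative assumption,
# not the full N-Triples grammar): subject, predicate, object, final dot.
NT_LINE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|".*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s*\.\s*$'
)

# Hypothetical sample data; the example.org URIs are placeholders.
DATA = """\
<http://example.org/ball1> <http://example.org/label> "ball #1" .
<http://example.org/ball1> <http://example.org/weight> "12"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example.org/ball2> <http://example.org/label> "ball #2"@en .
"""


def statement_lines(text):
    """Return the non-empty, non-comment lines of an N-Triples document."""
    return [l for l in text.splitlines()
            if l.strip() and not l.lstrip().startswith("#")]


def split_on_lines(text, n_chunks):
    """Cut the document into at most n_chunks pieces, only at line boundaries."""
    lines = statement_lines(text)
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parse_chunk(lines):
    """Parse each line on its own; no state is shared between lines."""
    triples = []
    for line in lines:
        match = NT_LINE.match(line)
        if match is None:
            raise ValueError("not a valid statement under the simplified pattern: " + line)
        triples.append(match.groups())
    return triples


chunks = split_on_lines(DATA, 2)
merged = [t for chunk in chunks for t in parse_chunk(chunk)]
whole = parse_chunk(statement_lines(DATA))
print(merged == whole)
```

The same property is what lets a line-oriented input format hand each mapper an arbitrary block of the file, whereas an RDF/XML or Turtle document would need its prefixes and nesting context carried along with every split.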
|
Using N-Triples or N-Quads will work, and there are a lot of benefits, as the original answer listed. The problem, though, is that the meaning of the data is not really at the granularity of a single statement. For example, let's say that we want to state that ball #1 has a color with an RGB value of 255,0,0:
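The statements for this example are not shown in the thread, so the following is only a hypothetical reconstruction in Python: the `http://example.org/` URIs, the property names, and the blank node label `_:c` are all assumptions. The sketch models the color as a blank node with three component values and shows that a naive statement-level partitioning (round-robin here) scatters the blank node's statements across partitions.

```python
# Hypothetical rendering of "ball #1 has a color with RGB value 255,0,0".
# Every URI and the blank node label _:c are assumptions for illustration.
TRIPLES = [
    ("<http://example.org/ball1>", "<http://example.org/hasColor>", "_:c"),
    ("_:c", "<http://example.org/red>", '"255"'),
    ("_:c", "<http://example.org/green>", '"0"'),
    ("_:c", "<http://example.org/blue>", '"0"'),
]


def round_robin(triples, n):
    """Distribute statements one by one over n partitions."""
    parts = [[] for _ in range(n)]
    for i, triple in enumerate(triples):
        parts[i % n].append(triple)
    return parts


def blank_nodes(triple):
    """Blank node labels used as subject or object of a triple."""
    subject, _, obj = triple
    return {term for term in (subject, obj) if term.startswith("_:")}


def split_blank_nodes(parts):
    """Return blank node labels that occur in more than one partition."""
    seen = {}
    for index, part in enumerate(parts):
        for triple in part:
            for label in blank_nodes(triple):
                seen.setdefault(label, set()).add(index)
    return {label for label, indexes in seen.items() if len(indexes) > 1}


print(split_blank_nodes(round_robin(TRIPLES, 2)))  # {'_:c'}
```

Because a blank node label is only meaningful within the graph it came from, a worker that sees `_:c` with just the green value has no reliable way to tell which ball, or even which color, it belongs to.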
How would you meaningfully distribute the statements above while staying at statement-level granularity? I'd recommend using the concept of RDF Molecules. An RDF molecule is the minimal RDF graph you can create such that no information is lost when it is distributed (my summary, not a formal definition). Some links on RDF molecules:

ftp://ksl.stanford.edu/pub/KSL_Reports/KSL-05-06.pdf
http://www.w3c.rl.ac.uk/SWAD/papers/RDFMolecules_final.doc
http://www.itee.uq.edu.au/~eresearch/presentations/Hunter_BioMANTA_UK_eScience.pdf
http://jrdf.sourceforge.net/status.html
http://morenews.blogspot.com/2008/07/yads-and-rdf-molecules.html

Hi Bill, thanks for your answer. Do you have any suggestions on how to use RDF Molecules with MapReduce, or how to create RDF Molecules with one or more MapReduce jobs? See also: http://www.semanticoverflow.com/questions/729/how-would-you-implement-the-symmetric-concise-bounded-description-scbd-using-ma

You're very welcome. As for how to use RDF Molecules with Hadoop or another distributed store, you might want to check out these additional links:

http://www.hpl.hp.com/techreports/2009/HPL-2009-346.pdf
http://delivery.acm.org/10.1145/1780000/1779604/a5-ravindra.pdf?key1=1779604&key2=6400663721&coll=GUIDE&dl=GUIDE&CFID=89954265&CFTOKEN=86038772
http://rdf-proj.blogspot.com/2008/10/eventually-we-built-official-heart.html
http://www.larkc.eu/

I haven't used the above; instead, I've used a different approach to represent state changes in RDF Molecules and pass those around: a SPARUL command.

Thanks again for the pointers. However, none of the links above provides a MapReduce algorithm to construct RDF Molecules or similar RDF entities/aggregates. The paper titled "Towards scalable RDF graph analytics on MapReduce" groups triples by subject; how to deal with blank nodes is not mentioned.

No, you're right. I have never seen a COTS or open-source MapReduce system for RDF that takes molecules into account. It's something someone would need to implement. You could start with what I've used, treating each molecule as a SPARUL statement. That tells you how to send the data around. You'd still need to organize the molecules in some way, perhaps grouping them by subject or, better yet, by the class of the subject (or perhaps by predicate). Then you'd need a way of describing the jobs. It's an interesting problem, and one I'd work on as open source if I could, but I don't have the time. |
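As a rough starting point for the idea above, here is a single-process Python sketch, not a MapReduce job and not the approach described in any of the linked papers. It makes several assumptions: a "molecule" is approximated as the set of triples connected through shared blank nodes (ground triples each form their own group), the SPARUL command is assumed to take the `INSERT DATA { ... }` form, and the `http://example.org/` URIs and the function names `molecules` and `to_sparul` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input; the example.org URIs and blank node label are assumptions.
TRIPLES = [
    ("<http://example.org/ball1>", "<http://example.org/hasColor>", "_:c"),
    ("_:c", "<http://example.org/red>", '"255"'),
    ("_:c", "<http://example.org/green>", '"0"'),
    ("_:c", "<http://example.org/blue>", '"0"'),
    ("<http://example.org/ball2>", "<http://example.org/label>", '"ball #2"'),
]


def is_blank(term):
    return term.startswith("_:")


def molecules(triples):
    """Group triples so that triples sharing a blank node stay together.

    This approximates molecules as blank-node-connected components; it is
    not the formal definition from the RDF Molecules literature.
    """
    parent = {}

    def find(label):
        # Union-find lookup with path halving.
        parent.setdefault(label, label)
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    # Link the two blank nodes of any triple that has one at each end.
    for subject, _, obj in triples:
        blanks = [t for t in (subject, obj) if is_blank(t)]
        if len(blanks) == 2:
            root_a, root_b = find(blanks[0]), find(blanks[1])
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for index, triple in enumerate(triples):
        blanks = [t for t in (triple[0], triple[2]) if is_blank(t)]
        # Ground triples get a unique key so each one is its own group.
        key = ("bnode", find(blanks[0])) if blanks else ("ground", index)
        groups[key].append(triple)
    return list(groups.values())


def to_sparul(group):
    """Serialize one group as an update command (INSERT DATA form assumed)."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in group)
    return "INSERT DATA {\n" + body + "\n}"


for group in molecules(TRIPLES):
    print(to_sparul(group))
```

In an actual Hadoop setting the connected-components step would probably need more than one job, since a mapper looking at a single statement cannot see which other statements share its blank nodes; the sketch only shows the grouping you would be aiming for and one way to ship each group around as a unit.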


