
What are the best practices for processing RDF data using MapReduce (via Hadoop)?

I am looking for people sharing their suggestions and lessons learned, such as:

  • Use N-Triples and/or N-Quads, don't hurt yourself with RDF/XML or Turtle
  • If you really want to use RDF/XML and/or Turtle, process one file at a time; don't split them.
  • ...

asked 13 Apr '10, 14:45 by castagna

edited 13 Apr '10, 14:55 by Bastian Spanneberg

From my experience within the Sindice project, you should definitely use N-Triples or N-Quads, for several reasons:

  1. Much easier to manipulate: you are dealing with a set of lines (triples or quads) that Hadoop can distribute at the quad/statement level rather than at the document level. You move the data granularity from documents to statements; documents can be big, so some Hadoop tasks would need more resources and time to finish, whereas with statements the data is spread more evenly over the nodes.
  2. You avoid the overhead of parsing XML or Turtle just to manipulate the data (parsing a triple or quad line is simple and therefore very efficient).
  3. Even though the N-Triples/N-Quads formats have a size overhead compared to RDF/XML or Turtle, it hardly matters, since Hadoop/HDFS can compress the data for you transparently (so the size overhead becomes very small).
  4. You can easily sort, group, and extract the triples/quads using map and reduce functions. With just a few lines of map/reduce code, you can create a Hadoop job that sorts your data by context and subject and extracts single entity descriptions (the set of quads sharing the same context and subject); see the sketch after this list. This is the basic mechanism behind the Sindice/SIREn indexing pipeline.
  5. Easier to debug (it is, IMHO, the only format that is easily readable by humans).
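
For point 4, a rough sketch of such a job (plain Hadoop with naive string handling, not the actual Sindice/SIREn code): the mapper keys each N-Quads line on its context and subject, and the reducer then receives one entity description per key.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntityGrouping {

    // Mapper: key every N-Quads line on "context subject" so that all the
    // statements describing one entity end up in the same reduce group.
    public static class EntityMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String quad = line.toString().trim();
            if (quad.isEmpty() || quad.startsWith("#")) {
                return; // skip blank lines and comments
            }
            // Naive tokenization: good enough for a sketch, but a real job
            // should use an N-Quads parser (literals may contain spaces).
            String[] tokens = quad.split("\\s+");
            if (tokens.length < 5) {
                return; // not a well-formed quad line
            }
            String subject = tokens[0];
            String graph = tokens[tokens.length - 2]; // last term before the '.'
            context.write(new Text(graph + " " + subject), new Text(quad));
        }
    }

    // Reducer: everything it receives for one key is a single entity
    // description (same context and subject), emitted here as one record.
    public static class EntityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> quads, Context context)
                throws IOException, InterruptedException {
            StringBuilder description = new StringBuilder();
            for (Text quad : quads) {
                description.append(quad.toString()).append('\n');
            }
            context.write(key, new Text(description.toString()));
        }
    }
}
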
answered 15 Apr '10, 18:58 by Renaud Delbru

Thanks for sharing your comments. I am leaving the question open for a little while to see if there are other people who want to share additional suggestions.

(15 Apr '10, 19:26) castagna

Using N-Triples or N-Quads will work, and there are a lot of benefits, as the original answer listed. The problem, though, is that the meaning of the data is not really at the granularity of a single statement. For example, let's say that we want to state that ball #1 has a color with an RGB value of 255,0,0:

@prefix rel: <urn:myont:rel:> .
@prefix class: <urn:myont:class:> .
@prefix ball: <urn:myont:class:Ball:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ball:1
      a class:Ball;
      rel:hasColor [
           a class:Color;
           rel:hasRedComponent "255"^^xsd:integer;
           rel:hasGreenComponent "0"^^xsd:integer;
           rel:hasBlueComponent "0"^^xsd:integer
           ].

How would you meaningfully distribute the statements above while keeping the granularity at the statement level?

I'd recommend using the concept of RDF Molecules. An RDF molecule is the minimal RDF graph you can create such that you do not lose information when distributing it (my summary, not a formal definition).
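
To make that concrete, here is a rough Jena-based sketch (current Jena package names, my own illustrative class, not code from the papers below): starting from a named subject, collect its statements and recursively pull in the statements of any blank nodes it points to, so the blank-node color description above always travels together with ball:1.

import java.util.HashSet;
import java.util.Set;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class MoleculeExtractor {

    // Collect the statements about 'subject' plus, recursively, the statements
    // about any blank nodes reachable from it, into 'molecule'.
    private static void collect(Model source, Resource subject, Model molecule,
                                Set<RDFNode> seen) {
        if (!seen.add(subject)) {
            return; // already visited; avoid cycles through blank nodes
        }
        StmtIterator it = source.listStatements(subject, null, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            molecule.add(stmt);
            RDFNode object = stmt.getObject();
            if (object.isAnon()) {
                collect(source, object.asResource(), molecule, seen);
            }
        }
    }

    // The molecule for ball:1 would contain the rel:hasColor statement plus
    // the four statements describing the anonymous color node.
    public static Model moleculeFor(Model source, Resource subject) {
        Model molecule = ModelFactory.createDefaultModel();
        collect(source, subject, molecule, new HashSet<RDFNode>());
        return molecule;
    }
}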

Some links on RDF molecules:

ftp://ksl.stanford.edu/pub/KSL_Reports/KSL-05-06.pdf

http://www.w3c.rl.ac.uk/SWAD/papers/RDFMolecules_final.doc

http://www.itee.uq.edu.au/~eresearch/presentations/Hunter_BioMANTA_UK_eScience.pdf

http://jrdf.sourceforge.net/status.html

http://morenews.blogspot.com/2008/07/yads-and-rdf-molecules.html

answered 11 May '10, 19:57 by Bill Barnhill

Hi Bill, thanks for your answer. Do you have any suggestions on how to use RDF Molecules with MapReduce, or how to create RDF Molecules with one or more MapReduce jobs?

See also: http://www.semanticoverflow.com/questions/729/how-would-you-implement-the-symmetric-concise-bounded-description-scbd-using-ma

(12 May '10, 06:19) castagna

You're very welcome.

As for how to use RDF Molecules with Hadoop or another distributed store, you might want to check out these additional links:

http://www.hpl.hp.com/techreports/2009/HPL-2009-346.pdf

http://delivery.acm.org/10.1145/1780000/1779604/a5-ravindra.pdf?key1=1779604&key2=6400663721&coll=GUIDE&dl=GUIDE&CFID=89954265&CFTOKEN=86038772

http://rdf-proj.blogspot.com/2008/10/eventually-we-built-official-heart.html

http://www.larkc.eu/

I haven't used the above; instead, I've used a different approach to represent state changes in RDF Molecules and pass them around: a SPARUL command.
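
Roughly along these lines (an illustrative sketch using SPARQL 1.1 Update syntax and a made-up graph URI, not my actual code): a molecule already serialized as N-Triples is wrapped into a single SPARUL command that a receiving node can execute against its own store.

import java.util.List;

// Wrap a molecule (already serialized as N-Triples lines) into a single
// SPARUL/SPARQL Update command that can be shipped to another node.
public class MoleculeToSparul {

    public static String asInsert(List<String> nTripleLines, String graphUri) {
        StringBuilder update = new StringBuilder();
        update.append("INSERT DATA { GRAPH <").append(graphUri).append("> {\n");
        for (String triple : nTripleLines) {
            update.append("  ").append(triple).append('\n');
        }
        update.append("} }");
        return update.toString();
    }
}
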

(12 May '10, 10:34) Bill Barnhill

Thanks again for the pointers. However, none of the links above provides a MapReduce algorithm to construct RDF Molecules or similar RDF entities/aggregates. The paper titled "Towards scalable RDF graph analytics on MapReduce" groups triples by subject, but how to deal with blank nodes is not mentioned.

(12 May '10, 20:16) castagna

No, you're right. I have never seen a COTS or open-source MapReduce system for RDF that takes molecules into account. It's something someone would need to implement. You could start with what I've used, treating each molecule as a SPARUL statement; that solves how to send the data around. You'd still need to organize the molecules in some way, perhaps grouping them by subject, or better yet by the class of the subject (or perhaps by predicate). Then you'd need a way of describing the jobs. It's an interesting problem, and one I'd work on as open source if I could, but I don't have the time.

(13 May '10, 14:04) Bill Barnhill

I developed a Map/Reduce framework that runs on a single computer to do the data processing that creates :BaseKB from the Freebase quad dump. I get nearly linear speed up with as many as 8 CPU cores.

The system works in several modes. Mappers tend to be written as Java functions that operate on Jena Triple objects. For instance, if I want to filter out triples that have a certain predicate, or rewrite identifiers, I just write Java to do it.
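
For instance, a predicate filter over Jena Triple objects can be as small as this (an illustrative sketch, not the actual framework code):

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;

// Illustrative mapper-style filter: drop triples whose predicate matches
// an unwanted property URI.
public class PredicateFilter {

    private final Node unwanted;

    public PredicateFilter(String predicateUri) {
        this.unwanted = NodeFactory.createURI(predicateUri);
    }

    // Return true if the triple should be kept.
    public boolean accept(Triple triple) {
        return !triple.getPredicate().equals(unwanted);
    }
}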

The usual design for a reducer is that it groups statements by subject or object and loads them into a Jena model. The reducer then fills another model with statements, primarily using SPARQL CONSTRUCT queries. The new model then gets flushed down the stream to the next processor. This is a comfortable style of programming, although SPIN might be even nicer.
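
Schematically, such a reducer body looks something like this (a sketch using Jena/ARQ with a made-up CONSTRUCT query, not the actual framework code):

import java.util.List;

import org.apache.jena.graph.Triple;
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Illustrative reducer body: load one subject's triples into a small model,
// rewrite them with a CONSTRUCT query, and hand the result to the next stage.
public class ConstructReducer {

    private static final String QUERY =
        "PREFIX ex: <http://example.org/> " +
        "CONSTRUCT { ?s ex:label ?name } WHERE { ?s ex:hasName ?name }";

    public Model reduce(List<Triple> triplesForOneSubject) {
        Model input = ModelFactory.createDefaultModel();
        for (Triple t : triplesForOneSubject) {
            input.getGraph().add(t); // add the raw triples to the model's graph
        }
        Query query = QueryFactory.create(QUERY);
        QueryExecution qexec = QueryExecutionFactory.create(query, input);
        try {
            return qexec.execConstruct(); // the new model flows downstream
        } finally {
            qexec.close();
        }
    }
}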

This model can be supplemented in many ways. For instance, a rulebox can be inserted into the model, or you could enhance the model by inserting triples from another source. Freebase, for instance, uses compound value types (CVTs) to create something like an "RDF Molecule", and if the CVTs were shoved into a key-value store in compressed form, you could get something with better efficiency and scalability than a conventional triple store. Reducers could reconstruct the "CVT halo" around a subject and thus be able to work with an enhanced graph.

The above system is able to use SPARQL on single subjects or objects, but can't do general SPARQL queries.

That's one of the reasons I've been using Hadoop lately. Pig and Hive let you write computation pipelines by using relational operators and then automatically compile them to parallel Java code.

I've been using Pig extensively, and the only complaint I have has to do with data structures. I'd really like to see something that uses RDF-native data types. Right now I need to parse the RDF, convert it to standard data types, and then remember what everything is supposed to be when I'm writing RDF back out.

Another fundamental issue is identifier compression. If I'm working with Freebase or DBpedia, I can just whack off the prefix of the URI to lower the memory/disk/network footprint of the data. It's even better if you can map the identifiers to integers (easy in either of those cases).

In general, it would be nice to have a system that compresses URIs when they go into the system and decompresses them when they come out, just like most triple stores do. You can do this with a K/V store, or with algorithms based on sorting and joining. The second option might be a little less flexible (in particular, it may be hard to look up identifiers that are specified in your code), but it ought to be scalable.
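
As a toy illustration of the dictionary idea (purely in-memory; a real system would back the mapping with a K/V store or build it by sorting and joining):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory dictionary encoder: maps each URI/term to a small integer id
// on the way in, and back to the original string on the way out.
public class TermDictionary {

    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();

    public int encode(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = terms.size();
            ids.put(term, id);
            terms.add(term);
        }
        return id;
    }

    public String decode(int id) {
        return terms.get(id);
    }
}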

Working with Pig, you'll probably use both s-p-o triples and rows that resemble SPARQL result sets.

If you don't want to use Pig, you can write your own mappers and reducers that take Triples as input. The exact same approach used in my old framework should work well.

answered 17 Sep '12, 12:27 by database_animal (edited 17 Sep '12, 12:28)

I've not done MapReduce specifically, but I have done other parallel RDF work (on clusters and supercomputers), so I'm not sure whether this will apply to you directly (but hopefully it does). I can say: yes, definitely use N-Triples or N-Quads. It takes all the complexity out of thinking about parsing and serializing, and you can split and concatenate the files without worrying about namespace declarations or about splitting right in the middle of a complex structure (as you might in RDF/XML or Turtle).

answered 13 Apr '10, 18:19 by kasei

Your answer does not add information to what was already there in my question.

(13 Apr '10, 19:44) castagna

From my experience developing Anduin (RDF processing on Hadoop in Scala):

  • The N-Quads format is indeed handy;
  • Real-world data sets like the Billion Triple Challenge 2009 may contain a lot of large literals, so you have to configure data compression (e.g. LZO) in Hadoop to reduce network overhead and wasted HDFS space while the MapReduce tasks execute (see the configuration sketch after this list);
  • Graph-like operations may not parallelize very efficiently on real-world data, due to the skewed node-degree distribution of the RDF graph.
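
For the compression point, a minimal configuration sketch (property names vary between Hadoop versions and the LZO codec ships separately in the hadoop-lzo package, so treat the details as assumptions to verify against your cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: turn on compression for intermediate map output to cut network
// traffic between the map and reduce phases. The codec class comes from the
// separate hadoop-lzo project; property names differ across Hadoop versions.
public class CompressionConfig {

    public static Job newCompressedJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        return Job.getInstance(conf, "rdf-processing");
    }
}
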
answered 05 Jul '13, 12:41 by Nikita Zhiltsov

Here is another MapReduce approach for Linked Data: mrlin - MapReduce processing of Linked Data

answered 03 Nov '12, 11:22 by zazi
