Hi, I'm looking at a problem right now and scratching my head.

I have several N-Triples files sorted on subject and I'd like to merge them into one stream of triples, still sorted on subject.

Because RDF I/O in both Jena and Sesame is push-based (the I/O system calls my code), merging streams seems pretty hard. It looks like I'd need multiple threads to convert push into pull, probably with a PriorityBlockingQueue in there somewhere. Once I start thinking about the corner cases, getting it completely correct could be a nightmare.

Am I missing something, and is there an easy way to do this? Or should I just stop worrying, do the merge to disk, and figure that disk space is cheap in 2012?

asked 19 Mar '12, 17:14

database_animal ♦


A completely different approach is to simply concatenate the files and then run sort(1) on them. This assumes that you don't care about comments and blank lines and that there is no leading whitespace before subjects; if the N-Triples were produced by a program, that's likely true.

Big advantage: you only need to write a two-line shell script, not write and debug Java code.

Bonus prize: strip out leading whitespace on subjects with sed(1).

answered 19 Mar '12, 18:40

AndyS ♦

What I'm doing is similar to this, but I'm using Java for it because (1) the files are already sorted and (2) I already have something that uses a PriorityQueue to do the merging... I just need to make my abstractions a little leaky (get at the lines, not the triples) and it works nicely.

The PriorityQueue works nicely for this, because it's efficient even if you want to merge 1000 streams. (In this case, though, N=3.) I could probably get this to output the merge to a Stream of some sort and avoid the need to write it to a file, but the files are a little more testable.
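For readers who want to see the shape of such a line-level merge, here is a minimal sketch. The class name `SubjectLineMerger` and its methods are hypothetical, not taken from the code described above. It assumes each input yields N-Triples lines already sorted by subject in Java's `String.compareTo` order, with no leading whitespace, comments, or blank lines.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: merges line iterators that are each sorted by subject. */
final class SubjectLineMerger {

    /** One pending line, its subject key, and the iterator it came from. */
    private static final class Head {
        final String subject;
        final String line;
        final Iterator<String> source;

        Head(String line, Iterator<String> source) {
            this.subject = subjectOf(line);
            this.line = line;
            this.source = source;
        }
    }

    /** Assumption: the subject is the first space- or tab-delimited token. */
    static String subjectOf(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ' || c == '\t') {
                return line.substring(0, i);
            }
        }
        return line;
    }

    /** Returns an iterator over every line of every source, ordered by subject. */
    static Iterator<String> merge(List<Iterator<String>> sources) {
        Comparator<Head> bySubject = Comparator.comparing(h -> h.subject);
        PriorityQueue<Head> heap = new PriorityQueue<>(bySubject);
        // Seed the heap with the first line of each non-empty source.
        for (Iterator<String> source : sources) {
            if (source.hasNext()) {
                heap.add(new Head(source.next(), source));
            }
        }
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public String next() {
                // remove() throws NoSuchElementException when the heap is empty.
                Head smallest = heap.remove();
                // Refill from the source that just gave up its head.
                if (smallest.source.hasNext()) {
                    heap.add(new Head(smallest.source.next(), smallest.source));
                }
                return smallest.line;
            }
        };
    }
}
```

The heap never holds more than one entry per source, so each emitted line costs O(log k) for k inputs. If the files were sorted with sort(1) under a non-C locale, the comparator would need to match that collation instead.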

(19 Mar '12, 20:05) database_animal ♦

I did an adapter for RIO using Guava's AbstractService that turned their result listener interface into an iterator using a blocking queue; it was a pretty straightforward little piece of code to work out. You could do the same with Jena, and then a sorted iterator over the two would be trivial. I had the AbstractService handle starting the parsing and collecting the results. The only trick I recall was using an object to poison the queue to signify the end of parsing, so you don't end up waiting forever on a take.
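A rough sketch of that push-to-pull idea, using only the JDK rather than Guava or RIO, might look like the following. The class `PushToPullIterator` is hypothetical, and the `Consumer<Consumer<T>>` parameter is a stand-in for "start the parser and hand it a listener"; a real adapter would wrap the parser's own handler interface. It assumes the producer never emits `null`.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical adapter: runs a push-style producer on its own thread and exposes an Iterator. */
final class PushToPullIterator<T> implements Iterator<T> {
    /** Poison pill: marks the end of the stream so take() never blocks forever. */
    private static final Object POISON = new Object();

    private final BlockingQueue<Object> queue;
    private Object lookahead;   // null means "nothing fetched yet"
    private boolean finished;

    PushToPullIterator(Consumer<Consumer<T>> producer, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                // The producer pushes each item into the bounded queue.
                producer.accept(item -> put(item));
            } finally {
                // Always enqueue the pill, even if the producer throws.
                put(POISON);
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void put(Object item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (finished) {
            return false;
        }
        if (lookahead == null) {
            try {
                lookahead = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (lookahead == POISON) {
                finished = true;
                lookahead = null;
            }
        }
        return !finished;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) lookahead;
        lookahead = null;
        return item;
    }
}
```

The bounded queue keeps a fast parser from running far ahead of the consumer. Note that this sketch swallows producer failures (the consumer just sees the stream end early); a production version would probably pass the exception through the queue as well.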

answered 19 Mar '12, 18:11

mhgrove

Interesting question. As both files are already sorted, it's not so bad.

Take two queues, one for each file. Take the head of each queue and compare the two heads. Write whichever head should go first, then take a new head from that same queue. Compare again, write again, and so on until no heads/triples remain.

In Sesame you can have two Rio readers, each in its own thread, each pushing statements into its own BlockingQueue. A third thread does the comparison and the writing; it needs to pull the heads off the two queues.

Make sense?
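The compare-write-refill loop can be sketched independently of the threading. The class `TwoWayMerge` below is hypothetical and works on plain iterators; in the threaded setup each iterator would be backed by one of the BlockingQueues. It assumes neither input contains `null`, since `null` is used internally to mean "this side is exhausted".

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.function.Consumer;

/** Hypothetical helper: merges two already-sorted inputs into one sorted output. */
final class TwoWayMerge {

    /** Returns the next element, or null when the iterator is exhausted. */
    private static <T> T nextOrNull(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    static <T> void merge(Iterator<T> left, Iterator<T> right,
                          Comparator<? super T> order, Consumer<? super T> out) {
        T l = nextOrNull(left);
        T r = nextOrNull(right);
        // While both sides have a head, write the smaller one and refill that side.
        while (l != null && r != null) {
            if (order.compare(l, r) <= 0) {
                out.accept(l);
                l = nextOrNull(left);
            } else {
                out.accept(r);
                r = nextOrNull(right);
            }
        }
        // One side is exhausted; drain whatever remains on the other.
        while (l != null) {
            out.accept(l);
            l = nextOrNull(left);
        }
        while (r != null) {
            out.accept(r);
            r = nextOrNull(right);
        }
    }
}
```

Ties go to the left input, which keeps the merge stable. For triples, `order` would compare subjects; anything that implements `Comparator` over the statement type will do.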

answered 19 Mar '12, 18:14

Jerven ♦

The last time I tried something in this system that involved interacting threads, performance went down by an order of magnitude. Perhaps this had to do with the particulars and maybe I'm being superstitious...

So long as the threads don't interact, this thing works like a dream on an 8 CPU machine...

(19 Mar '12, 19:59) database_animal ♦

Jena does have a parser with a pull API for N-Triples and N-Quads [*]:

   public final class LangNTriples implements Iterator<Triple> { ... }
   // To use:
   LangNTriples parser = RiotReader.createParserNTriples(inputStream, null);


You may also want to look at org.openjena.atlas.data.SortedDataBag.SpillSortIterator which provides an implementation of a sort-merge iterator based on a PriorityQueue.

* For other RDF formats, you could try RiotTripleParsePuller, which was just added in SVN recently. This uses a thread and a BlockingQueue to create a pull API. Note this class is new and subject to change.

answered 26 Mar '12, 21:12

sallen


question asked: 19 Mar '12, 17:14

question was seen: 1,631 times

last updated: 26 Mar '12, 21:12