Hi, I'm looking at a problem right now and scratching my head.
I have several N-Triples files sorted on subject and I'd like to merge them into one stream of triples, still sorted on subject.
Because RDF I/O in both Jena and Sesame is done on a push basis (the I/O system calls my code), it seems pretty hard to merge streams. It seems that I'd need to use multiple threads to convert push into pull, probably with a PriorityBlockingQueue in there somewhere. Once I start thinking of the corner cases, it might be a nightmare to get it working completely correctly.
Am I missing something and is there any easy way to do this? Or should I just stop worrying, do the merge to disk, and figure that disk space is cheap in 2012?
asked 19 Mar '12, 17:14
A completely different approach is to simply concatenate the files and then run sort(1) on them. This assumes you don't care about comments and blank lines and that there is no leading whitespace before subjects, but if the N-Triples was produced by a program, that's likely true.
Big advantage - you only need to write a two-line shell script, not write and debug Java code.
Bonus prize - strip out leading whitespace on subjects with sed(1).
answered 19 Mar '12, 18:40
I did an adapter for Rio using Guava's AbstractService that turned their result listener interface into an iterator using a blocking queue; it was a pretty straightforward little piece of code to work out. You could do the same with Jena, and then a sorted iterator over the two would be trivial. I had the AbstractService handle starting the parsing and collecting the results. The only trick I recall was using an object to poison the queue to signify the end of parsing, so you don't end up waiting forever on a take.
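The adapter described above might look roughly like the sketch below. It is not the original code and does not use Guava or Rio: the producer argument is a hypothetical stand-in for "start the parser and let it call my handler", and the class name PushToPullIterator is made up for illustration. A plain thread replaces the AbstractService, and the poison object marks the end of parsing.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Turns a push-style producer into a pull-style Iterator via a bounded blocking queue. */
final class PushToPullIterator<T> implements Iterator<T> {
    // Unique sentinel ("poison pill") that marks the end of the stream.
    private static final Object POISON = new Object();

    private final BlockingQueue<Object> queue;
    private Object next;                 // look-ahead element, or null if none fetched yet
    private volatile Throwable failure;  // set if the producer threw

    /**
     * @param producer hypothetical stand-in for a parser: it calls the supplied
     *                 callback once per element and returns when it is finished
     * @param capacity maximum number of elements buffered ahead of the consumer
     */
    PushToPullIterator(Consumer<Consumer<T>> producer, int capacity) {
        queue = new LinkedBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                producer.accept(item -> put(item));
            } catch (Throwable t) {
                failure = t;
            } finally {
                put(POISON);             // always signal the end, even after a failure
            }
        }, "push-to-pull");
        worker.setDaemon(true);          // do not keep the JVM alive if the consumer stops early
        worker.start();
    }

    // Blocks while the queue is full, which throttles the producer.
    private void put(Object item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take();     // blocks until the producer pushes something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (next == POISON) {            // stays POISON, so later calls return false too
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = (T) next;
        next = null;
        return result;
    }
}
```

With a real parser, the producer lambda would register a handler that forwards each statement to the callback and then run the parse. The bounded queue keeps a fast parser from buffering a whole file in memory.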
answered 19 Mar '12, 18:11
Interesting question. As both files are already sorted, it's not so bad.
Take two queues, one for each file. Take the head of each queue. Compare the heads and write the one which should go first. Take a new head from the queue you just wrote from, compare the heads again, write the one which should go first, and so on until no heads/triples remain.
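The head-comparison loop can be sketched as below, assuming each file is already available as a pull-style Iterator. This version generalizes from two inputs to any number by keeping the current heads in a PriorityQueue. The names SortedMerge and NTriplesLines are made up for illustration, and subjectOf assumes the subject is the first whitespace-delimited token on the line. The comparator has to match whatever ordering the files were sorted with, which plain String comparison may not do if they were sorted under a locale.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Lazily merges several already-sorted iterators into one sorted iterator. */
final class SortedMerge<T> implements Iterator<T> {
    /** The current head of one source, plus the iterator it came from. */
    private static final class Head<E> {
        final E value;
        final Iterator<E> source;

        Head(E value, Iterator<E> source) {
            this.value = value;
            this.source = source;
        }
    }

    private final PriorityQueue<Head<T>> heads;

    SortedMerge(List<Iterator<T>> sources, Comparator<? super T> order) {
        Comparator<Head<T>> byValue = (a, b) -> order.compare(a.value, b.value);
        heads = new PriorityQueue<>(Math.max(1, sources.size()), byValue);
        // Prime the queue with the first element of every non-empty source.
        for (Iterator<T> source : sources) {
            if (source.hasNext()) {
                heads.add(new Head<>(source.next(), source));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heads.isEmpty();
    }

    @Override
    public T next() {
        Head<T> smallest = heads.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        // Refill from the source we just consumed from, if it has more.
        if (smallest.source.hasNext()) {
            heads.add(new Head<>(smallest.source.next(), smallest.source));
        }
        return smallest.value;
    }
}

/** Hypothetical helper for treating raw N-Triples lines as merge elements. */
final class NTriplesLines {
    private NTriplesLines() {}

    /** Assumption: the subject is the first whitespace-delimited token of the line. */
    static String subjectOf(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }
}
```

Only one element per input is held in memory at a time, so the merge itself needs no temporary disk space.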
In Sesame you can have two Rio readers, each in its own thread and each pushing statements into its own BlockingQueue. A third thread does the comparison and the writing; that thread needs to pull the heads of the queues.
answered 19 Mar '12, 18:14
Jena does have a parser with a pull API for N-Triples and N-Quads [*]:
* For other RDF formats, you could try RiotTripleParsePuller, which was just added in SVN recently. This uses a thread and a BlockingQueue to create a pull API. Note this class is new and subject to change.
answered 26 Mar '12, 21:12