|
Hi, I'm looking at a problem right now and scratching my head. I have several N-Triples files sorted on subject and I'd like to merge them into one stream of triples, still sorted on subject. Because RDF I/O in both Jena and Sesame is done on a push basis (the I/O system calls my code), it seems pretty hard to merge streams. It seems that I'd need to use multiple threads to convert push into pull, probably with a PriorityBlockingQueue in there somewhere. Once I start thinking of the corner cases, it looks like it might be a nightmare to get it working completely correctly. Am I missing something, and is there an easy way to do this? Or should I just stop worrying, do the merge to disk, and figure that disk space is cheap in 2012? |
|
I did an adapter for Rio using Guava's AbstractService that turned their result listener interface into an iterator using a blocking queue; it was a pretty straightforward little piece of code to work out. You could do the same w/ Jena, and then a sorted iterator over the two would be trivial. I had the AbstractService handle starting the parsing and collecting the results. The only trick I recall was using an object to poison the queue to signify the end of parsing, so the consumer doesn't end up blocked forever on a take. |
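For anyone who wants to see the shape of that adapter, here is a minimal sketch in plain Java. It is not the Guava/Rio code described above: `PushToPull` is a hypothetical class name, the push-style parser is stood in for by a generic `Consumer<Consumer<T>>` (a callback that is handed a sink to push items into), and the bounded queue capacity is an arbitrary choice. It only illustrates the blocking queue plus poison object idea.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical adapter: runs a push-style producer on its own thread and exposes it as an Iterator. */
final class PushToPull<T> implements Iterator<T> {
    // Poison object marking end of stream; only ever compared by identity.
    private static final Object POISON = new Object();

    private final BlockingQueue<Object> queue;
    private volatile Throwable failure; // set if the producer thread dies with an exception
    private Object lookahead;           // next element taken from the queue, or null if none fetched yet

    PushToPull(Consumer<Consumer<T>> producer, int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                // The producer pushes each item into our sink; put() blocks when the queue is full.
                producer.accept(item -> put(item));
            } catch (Throwable e) {
                failure = e;
            } finally {
                put(POISON); // always signal the end so the consumer never waits forever
            }
        }, "push-to-pull");
        worker.setDaemon(true);
        worker.start();
    }

    private void put(Object o) {
        try {
            queue.put(o);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = queue.take(); // blocks until an item or the poison object arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (lookahead == POISON) {
            if (failure != null) throw new IllegalStateException("producer failed", failure);
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T item = (T) lookahead;
        lookahead = null;
        return item;
    }
}
```

With a real parser, the producer lambda would start the parse and the listener/handler callback would forward each statement to the sink. One corner case the sketch does not handle: if the consumer stops reading early, the producer thread stays blocked on `put` (it is a daemon thread here, so it will not keep the JVM alive).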
|
Jena does have a parser with a pull API for N-Triples and N-Quads [*]:
* For other RDF formats, you could try RiotTripleParsePuller, which was recently added in SVN. This uses a thread and a BlockingQueue to create a pull API. Note that this class is new and subject to change. |
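Once each file is available as a pull-style iterator (by either route above), the merge itself is a standard k-way merge over a priority queue of stream heads. The sketch below is plain Java and does not use any Jena or Sesame API; `MergingIterator` and `subjectOf` are hypothetical names, and it treats each triple as one N-Triples line whose subject is the first whitespace-delimited token. It also assumes the comparator you pass in matches the order the input files were actually sorted in.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical k-way merge: combines already-sorted iterators into one sorted iterator. */
final class MergingIterator<T> implements Iterator<T> {
    /** The current front element of one source, paired with the source it came from. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> source;
        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    private final PriorityQueue<Head<T>> heads;

    MergingIterator(List<? extends Iterator<T>> sources, Comparator<? super T> order) {
        // The queue orders sources by their current front element.
        heads = new PriorityQueue<Head<T>>((a, b) -> order.compare(a.value, b.value));
        for (Iterator<T> s : sources) {
            if (s.hasNext()) heads.add(new Head<>(s.next(), s));
        }
    }

    @Override
    public boolean hasNext() {
        return !heads.isEmpty();
    }

    @Override
    public T next() {
        Head<T> smallest = heads.poll();
        if (smallest == null) throw new NoSuchElementException();
        // Refill from the source we just consumed, so every live source keeps one head in the queue.
        if (smallest.source.hasNext()) {
            heads.add(new Head<>(smallest.source.next(), smallest.source));
        }
        return smallest.value;
    }

    /** Assumption: the line has no leading whitespace, so the subject is its first token. */
    static String subjectOf(String line) {
        int end = 0;
        while (end < line.length() && !Character.isWhitespace(line.charAt(end))) end++;
        return line.substring(0, end);
    }
}
```

Only one element per input is held in memory at a time, so this works for files much larger than RAM. Triples with equal subjects from different files come out adjacent, but their relative order is not specified.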

