|
I am using Hadoop for SPARQL query processing over N-Triples RDF data. To store the data, it needs to be split into smaller files using a predicate split (PS) and a predicate-object split (POS). I have not found any tool that splits N-Triples RDF this way. Please help me out and suggest some solutions. |
|
Well, one way to do this is to load the file into a triplestore and then run SPARQL queries on it to retrieve the subsets you want. However, I'm guessing your data is too large for this to be practical, and you want a streaming solution that does not involve a database. One streaming solution in Java is to use Sesame's Rio parser toolkit (disclosure: I'm a Sesame developer). You can create your own custom RDFHandler, which performs the split logic and passes statements on to separate RDFWriters. The writers can be created on the fly or in advance; it depends a bit on the exact split logic you have in mind. Something like this (rough sketch; you'll have to work out the details, such as how to initialize each file writer and how to decide which triple goes where):
For more info on how to work with Sesame's Rio parser, see this weblog article or consult the Rio API Javadoc. |
|
Apache Jena's RIOT has an example of filtering a stream of triples (it uses Apache Jena 2.10.0); see http://jena.apache.org/documentation/io/index.html. N-Triples is very regular, so another approach is plain text processing, e.g. Linux csplit (or perl, or awk, or ...). |
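To illustrate the text-processing route, here is a minimal Java sketch that uses only the standard library. It is not taken from Sesame or Jena: the class `NTriplesSplitter` and all of its method names are made up for this example. It assumes one triple per line, as N-Triples requires, and it assumes that POS means "additionally split rdf:type triples by their object", which may not match the exact scheme you have in mind. Output file names are derived from the predicate IRI in a naive way, so two different IRIs could in principle collide.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of any RDF library.
public class NTriplesSplitter {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Returns {predicate IRI, object term} for one N-Triples line, or null
    // for blank lines, comments, and lines that do not fit the simple pattern.
    static String[] predicateAndObject(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#") || !s.endsWith(".")) return null;
        int i = 0;
        // The subject (IRI or blank node) contains no whitespace, so skip to the first gap.
        while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
        if (i >= s.length() || s.charAt(i) != '<') return null;
        int end = s.indexOf('>', i);
        if (end < 0) return null;
        String predicate = s.substring(i + 1, end);
        // Everything between the predicate and the final '.' is the object.
        String object = s.substring(end + 1, s.length() - 1).trim();
        return new String[] { predicate, object };
    }

    // PS key: the predicate. Assumed POS key: predicate plus object, for rdf:type only.
    static String splitKey(String line, boolean typeByObject) {
        String[] po = predicateAndObject(line);
        if (po == null) return null;
        return (typeByObject && RDF_TYPE.equals(po[0])) ? po[0] + " " + po[1] : po[0];
    }

    // Naive mapping from a split key to a file name (collisions are possible).
    static String fileNameFor(String key) {
        return key.replaceAll("[^A-Za-z0-9]+", "_") + ".nt";
    }

    // Streams the input once and appends each triple to the file for its key.
    // One writer stays open per key, so a very large number of distinct keys
    // could hit the operating system's open-file limit.
    static void split(Path input, Path outDir, boolean typeByObject) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String key = splitKey(line, typeByObject);
                if (key == null) continue;
                BufferedWriter w = writers.get(key);
                if (w == null) {
                    w = Files.newBufferedWriter(outDir.resolve(fileNameFor(key)), StandardCharsets.UTF_8);
                    writers.put(key, w);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) w.close();
        }
    }

    // Usage: java NTriplesSplitter input.nt outDir [pos]
    public static void main(String[] args) throws IOException {
        split(Paths.get(args[0]), Paths.get(args[1]), args.length > 2 && args[2].equals("pos"));
    }
}
```

Running it with `input.nt out` would give one file per predicate; adding `pos` as a third argument would further split the rdf:type triples by object. On Hadoop the same key function could serve as the map output key, but that part is left out here.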

