I am using Hadoop for SPARQL query processing over RDF data in N-Triples format. To store the data, it has to be split into smaller files using predicate split (PS) and predicate object split (POS). I have not found any tool that splits N-Triples RDF in this way. Please help me out and suggest some solutions.

asked 23 Feb '13, 11:49

Arif


Well, one way to do this is just to load the file into a triplestore and then perform SPARQL queries on it to retrieve the subsets you want. However, I'm guessing your data size is too large for this to be practical, and you want a streaming solution that does not involve a database.

One streaming solution in Java is to use Sesame's Rio parser toolkit (disclosure: I'm a Sesame developer). You can create your own custom RDFHandler, which performs the split logic and passes statements on to separate RDFWriters. The writers can be created on the fly or in advance, depending on the exact split logic you have in mind. Something like this (a rough sketch; you'll have to work out the details, such as how to initialize each file writer and how to decide which triple goes where):

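 import org.openrdf.model.Statement;
 import org.openrdf.model.URI;
 import org.openrdf.model.Value;
 import org.openrdf.rio.RDFFormat;
 import org.openrdf.rio.RDFHandler;
 import org.openrdf.rio.RDFHandlerException;
 import org.openrdf.rio.RDFParser;
 import org.openrdf.rio.RDFWriter;
 import org.openrdf.rio.Rio;
 import org.openrdf.rio.helpers.RDFHandlerBase;

 // create a parser and a handler for the parsed data
 RDFParser parser = Rio.createParser(RDFFormat.NTRIPLES);
 RDFHandler splitter = new PredicateObjectSplitter();

 // link the parser to the splitter
 parser.setRDFHandler(splitter);

 // start the process; parse() also expects a base URI string
 // (inputStream and baseURI are yours to supply)
 parser.parse(inputStream, baseURI);

 ...
 class PredicateObjectSplitter extends RDFHandlerBase {

     @Override
     public void handleStatement(Statement st) throws RDFHandlerException {
         URI predicate = st.getPredicate();
         Value object = st.getObject();

         // decide on the basis of predicate and object where the statement
         // should go and then pass it on; getWriterForPOCombo is a placeholder
         // that you have to implement yourself
         RDFWriter writer = getWriterForPOCombo(predicate, object);
         writer.handleStatement(st);
     }
 }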

For more info on how to work with Sesame's Rio parser, see this weblog article or consult the Rio API Javadoc.
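To make the "which triple goes where" decision a bit more concrete, here is a minimal sketch of one way to turn a predicate (and optionally an object) into an output file name. It uses only the Java standard library and is not part of Sesame: the class name `SplitFileNames`, the methods `sanitize` and `fileNameFor`, and the naming scheme itself are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: derives an output file name for a triple.
class SplitFileNames {
    // Any run of characters that is not a letter or digit
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9]+");

    // Replace unsafe runs with "_" and trim underscores at both ends.
    // Note: two different terms can map to the same name; a real
    // implementation would need to guard against such collisions.
    static String sanitize(String term) {
        String s = UNSAFE.matcher(term).replaceAll("_");
        return s.replaceAll("^_+|_+$", "");
    }

    // splitOnObject = false: one file per predicate (predicate split)
    // splitOnObject = true: one file per predicate/object pair
    static String fileNameFor(String predicate, String object, boolean splitOnObject) {
        String name = sanitize(predicate);
        if (splitOnObject) {
            name += "-" + sanitize(object);
        }
        return name + ".nt";
    }
}
```

A helper along the lines of getWriterForPOCombo could use such a name as the key in a map of already opened writers, creating a new writer only the first time a name is seen.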

answered 23 Feb '13, 14:33

Jeen Broekstra ♦

Jena RIOT example of filtering a stream of triples (this uses Apache Jena 2.10.0).

http://jena.apache.org/documentation/io/index.html

N-Triples is a very regular, line-based format (one triple per line), so another approach is plain text processing: e.g. Linux csplit (or perl, or awk, or ...).
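As a rough illustration of the text-processing route, the sketch below streams an N-Triples file line by line and appends each line to a file chosen by its predicate. It uses only the Java standard library, not Jena. The class name `NTriplesPredicateSplit`, the method `split`, and the file naming scheme are assumptions for illustration. It also assumes one well-formed triple per line and a number of distinct predicates small enough to keep one open writer each.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical line-based predicate splitter for N-Triples.
class NTriplesPredicateSplit {
    // Subject (IRI or blank node), whitespace, then the predicate IRI (group 1)
    private static final Pattern LINE =
            Pattern.compile("^\\s*(?:<[^>]*>|_:\\S+)\\s+<([^>]*)>\\s");

    // Writes one file per predicate into outDir; returns triple counts per predicate.
    static Map<String, Integer> split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (!m.find()) {
                    continue; // blank lines, comments, anything not triple-shaped
                }
                String predicate = m.group(1);
                BufferedWriter w = writers.get(predicate);
                if (w == null) {
                    // Caveat: distinct predicates could collapse to the same file name
                    String fileName = predicate.replaceAll("[^A-Za-z0-9]+", "_") + ".nt";
                    w = Files.newBufferedWriter(outDir.resolve(fileName), StandardCharsets.UTF_8);
                    writers.put(predicate, w);
                }
                w.write(line);
                w.newLine();
                counts.merge(predicate, 1, Integer::sum);
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        return counts;
    }
}
```

Splitting on predicate and object would follow the same pattern with a key built from both terms, although extracting literal objects reliably with a regular expression is harder than extracting the predicate.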

answered 23 Feb '13, 15:06

AndyS ♦

edited 23 Feb '13, 15:12
