Let's say I have a folder with several files containing RDF triples. Those files use heterogeneous serialization formats (RDF/XML, N3, N-Triples, ...), but there is no easy way to tell them apart (e.g. the file extension does not help).

I want to load them in a triple store via Java code (possibly using Sesame or Jena), so I need to 'guess' the serialization format of each file.

I'm currently working around this by catching parsing exceptions, but I was wondering if there is a better way to do this. Thanks!

asked 19 May '11, 09:27

iricelino

I've ended up with the same solution in my work: the exception catching is fast enough, but I agree that there should be a more elegant way.

(19 May '11, 09:50) Ryan Kohl

Looking for format-specific patterns at the head of the source file (as the Unix 'file' command does) could work, provided you are fine with subset formats like N-Triples being reported as their largest superset (i.e. N3). If not, I'm afraid the whole source has to be scanned...
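
To make the head-sniffing idea concrete, here is a minimal sketch using only the JDK. The class name `RdfHeadSniffer`, the `Guess` enum and the patterns themselves are assumptions made for illustration; they are not taken from Sesame, Jena or any23. As described above, N-Triples, Turtle and N3 are all reported as the N3 family, and the checks are deliberately crude.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: guesses an RDF syntax family from the first bytes of a file.
final class RdfHeadSniffer {
    enum Guess { RDF_XML, N3_FAMILY, UNKNOWN }

    /** Classifies the decoded head of a document using crude textual patterns. */
    static Guess guess(String head) {
        // Drop a leading byte order mark, if any.
        String text = head.startsWith("\uFEFF") ? head.substring(1) : head;
        String start = text.trim();
        // XML declaration, comment/doctype, or an rdf:RDF root element.
        if (start.startsWith("<?xml") || start.startsWith("<!") || text.contains("<rdf:RDF")) {
            return Guess.RDF_XML;
        }
        for (String line : text.split("\\R")) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue; // skip blank lines and N3-style comments
            }
            // Directive, IRI in angle brackets, or blank node label.
            if (s.startsWith("@prefix") || s.startsWith("@base")
                    || s.startsWith("<") || s.startsWith("_:")) {
                return Guess.N3_FAMILY;
            }
            return Guess.UNKNOWN; // first significant line matches nothing we know
        }
        return Guess.UNKNOWN;
    }

    /** Reads at most headBytes bytes from the file and classifies them. */
    static Guess guess(Path file, int headBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[headBytes];
            int n = in.readNBytes(buf, 0, buf.length);
            return guess(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```

A caller would still hand the file to a real parser afterwards; the guess only decides which parser to try first, so a wrong guess costs one failed parse rather than a wrong result.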

(19 May '11, 11:01) AB

The any23 library contains automatic media type detection for all major RDF syntaxes, bundled straight with parsers. It has been developed as part of the Sindice project and has been fine-tuned on literally hundreds of millions of real web documents. It's in Java and uses Sesame parsers.

answered 20 May '11, 14:28

cygri ♦

Sesame/Rio does have some support for automatically guessing the correct format through the class RDFFormat, which has a forFileName method that guesses the type based on the filename (extension), and a forMIMEType method which bases it on a supplied MIME-type identifier. But if you know neither, I guess that doesn't help you much.

The Aperture framework contains a MIME type identification tool that does a magic number check on the contents of the file. This is what you would need, but unfortunately I don't think it supports RDF serializations (it's document-oriented, so it handles Word/PDF/HTML/OpenOffice etc.).

answered 19 May '11, 17:35

Jeen Broekstra ♦

edited 19 May '11, 17:59

@Jeen Broekstra: yes, I saw the RDFFormat class, but as you said it doesn't fit my need. As for Aperture, I'll give MagicMimeTypeIdentifier a try, but if you say it doesn't have support for RDF...

Thanks for your suggestions!

(20 May '11, 04:24) iricelino

If speed is the goal, look at the comment by @AB. Trying the formats that can be identified from the head of the content first, and then moving on to those that require more of the stream, would do the trick. But how do we know that Jena or Sesame aren't already doing that? (Maybe someone knows, or you can examine the sources.) If they are, then maybe you already have your fast mechanism (and easy :-)

I would suggest a two-stage approach: look at extensions first, then move on to parsing content by the faster of the aforementioned mechanisms.
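
A rough sketch of that two-stage idea, using only the JDK: the class name `TwoStageFormatGuesser`, the extension table and the `contentSniffer` parameter are all hypothetical, and the second stage is passed in as a function so that any content-based check (head sniffing, or simply trying parsers in turn) can be plugged in.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: cheap extension lookup first, content inspection as fallback.
final class TwoStageFormatGuesser {
    // Conventional extensions only; extend as needed.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "rdf", "RDF/XML",
            "nt", "N-Triples",
            "ttl", "Turtle",
            "n3", "N3");

    /** Stage 1: look the file extension up in the table, ignoring case. */
    static Optional<String> byExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(ext));
    }

    /** Stage 2 runs only when stage 1 has no answer. */
    static Optional<String> guess(Path file, Function<Path, Optional<String>> contentSniffer) {
        Path name = file.getFileName();
        Optional<String> cheap = name == null
                ? Optional.<String>empty()
                : byExtension(name.toString());
        return cheap.isPresent() ? cheap : contentSniffer.apply(file);
    }
}
```

When the extensions are unreliable, as in the question, stage 1 will mostly fall through, but it costs next to nothing to try.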

answered 19 May '11, 13:52

harschware ♦

edited 19 May '11, 13:54

Jena has a class com.hp.hpl.jena.util.FileUtils that has some methods called 'guessLang'. Per their documentation, it's all based on the prefix (for JDBC URIs) and the suffix (e.g. '.nt' for N-Triples). As you say, a multi-staged approach from fastest to slowest makes a lot of sense.

(19 May '11, 14:42) Ryan Kohl

@harschware & @AB: This is exactly the approach I had in mind. I was wondering if a solution of this kind was already part of Sesame or Jena (I don't want to reinvent the wheel).

@Ryan Kohl: I saw that Jena method, but it checks the file extension... (see http://openjena.org/javadoc/com/hp/hpl/jena/util/FileUtils.html#guessLang%28java.lang.String%29)

Thanks for your suggestions!

(20 May '11, 04:17) iricelino

As an additional (fairly obvious) remark, a generalized implementation should wrap non-rewindable input streams in a buffer; in this case your guess is optimistic: if Turtle constructs turn up beyond the buffered head of a purported N-Triples stream you are out of luck...
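
For illustration, the buffering could look roughly like this with the JDK's `BufferedInputStream` and its `mark`/`reset` methods. The class `HeadPeeker` and its method names are made up for this sketch. The head is read for sniffing and the stream is then rewound, so whichever parser is chosen still sees the input from the first byte.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: peek at the head of a stream without consuming it.
final class HeadPeeker {
    /** Wraps a stream so that mark/reset is available; reuses an existing buffer. */
    static BufferedInputStream buffered(InputStream in) {
        return in instanceof BufferedInputStream
                ? (BufferedInputStream) in
                : new BufferedInputStream(in);
    }

    /** Returns up to limit leading bytes and rewinds the stream to where it was. */
    static byte[] peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit); // remember the current position for up to limit bytes
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            total += n;
        }
        in.reset(); // rewind so the parser starts at the first byte
        return Arrays.copyOf(buf, total);
    }
}
```

The limitation mentioned above still applies: anything beyond `limit` bytes is never inspected, so the guess stays optimistic.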

(20 May '11, 06:34) AB

I think Raptor allows you to check syntax for specific formats, so you might want to take a look at that. I've never used it, and I don't know if you can integrate it with Java, but it should be simple enough to create a script that outputs each file's name and its corresponding syntax in some easy-to-read format for you to process in Java. Hope this helps.

answered 19 May '11, 09:42

janoma

A clean way to do this is to use a Service Provider Interface (SPI) mechanism like the one in the Java ImageIO API (see ImageReader). You have a registry of format readers that harvests the SPIs and provides convenience methods to find the appropriate parser for a given file. The format reader interface declares a method that checks whether the file can be parsed by the SPI. The method needs to open the file and read its beginning to see if the format is valid. I have successfully implemented this mechanism in our product knowledgeSmarts.
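
A stripped-down sketch of such a registry, assuming nothing beyond the JDK: the names `RdfReaderSpi`, `RdfReaderRegistry`, `RdfXmlSpi` and `N3FamilySpi` are invented for this example and are not the knowledgeSmarts implementation, and the `canDecode` checks are intentionally simplistic. Discovery goes through `java.util.ServiceLoader`, which only finds providers that are listed in a `META-INF/services` file on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical provider interface, loosely modelled on ImageIO's reader SPIs.
interface RdfReaderSpi {
    String formatName();

    /** Returns true if the leading bytes look like this provider's format. */
    boolean canDecode(byte[] head);
}

final class RdfXmlSpi implements RdfReaderSpi {
    public String formatName() { return "RDF/XML"; }

    public boolean canDecode(byte[] head) {
        String s = new String(head, StandardCharsets.UTF_8).trim();
        return s.startsWith("<?xml") || s.contains("<rdf:RDF");
    }
}

final class N3FamilySpi implements RdfReaderSpi {
    public String formatName() { return "N3"; }

    public boolean canDecode(byte[] head) {
        String s = new String(head, StandardCharsets.UTF_8).trim();
        // Crude: a directive or an http(s) IRI at the very start.
        return s.startsWith("@prefix") || s.startsWith("@base") || s.startsWith("<http");
    }
}

final class RdfReaderRegistry {
    private final List<RdfReaderSpi> providers;

    RdfReaderRegistry(List<RdfReaderSpi> providers) {
        this.providers = new ArrayList<>(providers);
    }

    /** Collects every provider registered under META-INF/services. */
    static RdfReaderRegistry fromServiceLoader() {
        List<RdfReaderSpi> found = new ArrayList<>();
        ServiceLoader.load(RdfReaderSpi.class).forEach(found::add);
        return new RdfReaderRegistry(found);
    }

    /** Returns the first provider that claims the given head, in registration order. */
    Optional<RdfReaderSpi> find(byte[] head) {
        return providers.stream().filter(p -> p.canDecode(head)).findFirst();
    }
}
```

The appeal of this design is that supporting another serialization means adding one provider class and one registration line, with no change to the code that loads the files.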

answered 19 May '11, 14:28

fellahst

question asked: 19 May '11, 09:27

question was seen: 2,326 times

last updated: 20 May '11, 14:28