This question needs a bit of introduction, bear with me please.

In RDF, resources are identified using URI references. The notion of URI reference was added in anticipation of the standardization of IRIs. The SPARQL spec builds on this and in fact adopts the IRI standard as part of its spec: RDF terms in SPARQL queries are identified using IRIs.

Unfortunately, there is an incompatibility: RDF URI references may contain "<", ">", '"' (double quote), space, "{", "}", "|", "\", "^", and "`", but these are not allowed in IRIs (see SPARQL's IRI syntax). The upshot of this is that while it is perfectly legal to have an RDF triple of the following form:

<http://example.org/a b> a ex:Foo .

you cannot directly query such a resource using SPARQL, e.g.:

SELECT * WHERE { <http://example.org/a b> ?P ?Y . }

is not a syntactically valid SPARQL query.

All of this is pretty well known of course. The reason I am introducing it is that I would like to ask some "best practice" type questions related to this issue:

  1. have you ever encountered this problem in practice, that is, have you ever had to work with a dataset that contained such non-compatible URIrefs?
  2. how do you deal with this incompatibility? Do you query around it? Does your triplestore/parser toolkit of choice offer you some kind of workaround for this problem? Or do you simply convert the offending data?

I'm not so much looking for theoretical solutions, I'm more interested in what has been done out there in practice, already.

This question was inspired by a recent discussion on the Sesame mailinglist by the way, just in case you thought it looked familiar :)

asked Apr 06 '11 at 23:22

Jeen Broekstra

It's gratifying that at least 6 people think it's a good question, strange though that no-one has an answer. Is this really a non-existent problem from a practical point of view?

(Apr 11 '11 at 02:09) Jeen Broekstra

Some more specific references help. Finding contradictions in standards that refer to other standards is not trivial. The discussion you refer to can be found at http://sesame-general.435816.n3.nabble.com/Inconsistency-between-URIs-in-RDF-and-SPARQL-td2761175.html

(Apr 11 '11 at 03:25) Jakob

I thought I had been pretty thorough with my linking to the relevant parts of the spec(s), not sure what the mailinglist discussion adds to that, but nevertheless thanks for adding the link.

(Apr 11 '11 at 21:24) Jeen Broekstra

I would mark this as a known issue and an inconsistency between the specs. I guess the new RDF WG will propose a solution for this issue (see http://www.w3.org/2011/rdf-wg/track/issues/8).

(Apr 12 '11 at 04:52) zazi

There is one simple, practical and reasonable solution: just don't use URI references that contain nasty characters. Bad idea. No-go. If you happen to discover such URI references, you probably have more serious problems: they are a clear sign of ill-designed data and lack of interest in interoperability. You can be sure that the dataset is broken in other aspects as well.

There are similar instances of possible-but-not-recommended in many other areas. For example, clone the following on a Linux command line: git clone git://github.com/nichtich/badnames.git It gives you a perfectly legal git repository with 57 files. Now try to look at the repository on GitHub. You can get some files, such as https://github.com/nichtich/badnames/blob/master/badnames.pl but not others. You cannot even list the repository at https://github.com/nichtich/badnames because I put some nasty characters (control bytes etc.) in file names. Yes, this is allowed in git. No, this is not a good idea.

The practical solutions to nasty characters in any area are: either you replace them (for instance underscore instead of a space) or you escape them (for instance with percent-encoding) or you just disallow them (throw away any triple that contains a bad URI ref).
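As a rough illustration, the three options could be sketched in Python as follows. This is only a sketch under assumptions: the function names are hypothetical, and the character set is simply the list quoted in the question. Keep in mind that a replaced or escaped URI reference is a different identifier than the original, so every occurrence in the dataset has to be rewritten consistently.

```python
# Characters that RDF URI references may contain but SPARQL's IRI syntax
# rejects (the list quoted in the question; all of them are ASCII).
DISALLOWED = set('<>"{}|\\^` ')


def is_clean(uriref: str) -> bool:
    """Option 3 (disallow): True if the URI reference has no bad characters."""
    return not any(ch in DISALLOWED for ch in uriref)


def replace_chars(uriref: str, substitute: str = "_") -> str:
    """Option 1 (replace): swap every bad character for a substitute."""
    return "".join(substitute if ch in DISALLOWED else ch for ch in uriref)


def percent_encode(uriref: str) -> str:
    """Option 2 (escape): write every bad character as %XX."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in DISALLOWED else ch
        for ch in uriref
    )


if __name__ == "__main__":
    bad = "http://example.org/a b"
    print(is_clean(bad))        # False
    print(replace_chars(bad))   # http://example.org/a_b
    print(percent_encode(bad))  # http://example.org/a%20b
```

Existing "%" characters are deliberately left untouched here, so a URI reference that already contains percent-escapes is not escaped twice.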

answered Apr 11 '11 at 03:58

Jakob

I think this is in fact quite a reasonable way of looking at it, although saying that any dataset that contains such URI references is almost certainly broken in other respects is overstating it a bit, I think.

Also: I suspect that many such datasets are produced by people not quite informed about this issue: they simply create the dataset and validate it (using some RDF parser/validator). Since spaces are valid, they get no errors, and they therefore assume that they've done it right.

(Apr 11 '11 at 21:28) Jeen Broekstra

I think "everything that does not produce an error will be considered valid" is a general rule. I would also bet that the difference between URIs and URI references is little known. Probably most people think that "%20" and " " are the same.

(Apr 13 '11 at 09:07) Jakob

is this a good answer in a linked data world? good data sets often have mistakes in their execution... let's say somebody else has done the wrong thing -- how do i deal with that?

(Oct 23 '11 at 09:58) database_animal

@database_animal I'd say you inform the author of the mistake and ask them to fix it. How you deal with the data in the meantime is up to you, but I think the principle of "don't use it if it's broken" is solid enough, if perhaps a bit uncompromising. I'd still be happy to hear about other pragmatic ways of dealing with the issue :)

(Oct 24 '11 at 16:49) Jeen Broekstra

a note: U+0020 SPACE is not a valid IRI char, it must be percent encoded.

for reference:

               ----------------------------------------
              |  U+0009 \t
              |  U+000A \n
              |  U+000B \v
% encoded --> |  U+000C \f
              |  U+000D \r
              |  U+0020 SPACE
              |  U+0085 NEL (NEXT LINE)
               ----------------------------------------
              |  U+00A0 NBSP (NO-BREAK SPACE)
              |  U+1680 OGHAM SPACE MARK
              |  U+180E MONGOLIAN VOWEL SEPARATOR
              |  U+2000 EN QUAD
              |  U+2001 EM QUAD
              |  U+2002 EN SPACE
   allowed -->|  U+2003 EM SPACE
              |  U+2004 THREE-PER-EM SPACE
              |  U+2005 FOUR-PER-EM SPACE
              |  U+2006 SIX-PER-EM SPACE
              |  U+2007 FIGURE SPACE
              |  U+2008 PUNCTUATION SPACE
              |  U+2009 THIN SPACE
              |  U+200A HAIR SPACE
              |  U+2028 LINE SEPARATOR
              |  U+2029 PARAGRAPH SEPARATOR
              |  U+202F NARROW NO-BREAK SPACE
              |  U+205F MEDIUM MATHEMATICAL SPACE
              |  U+3000 IDEOGRAPHIC SPACE
               ----------------------------------------
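
The split in the table can be reproduced with a small check. This is a sketch, assuming the "allowed" set is the `ucschar` production of RFC 3987 (the IRI spec); the function name `is_ucschar` is made up for illustration.

```python
# Code point ranges of the `ucschar` production in RFC 3987.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 to 13: everything except the last two code points of each plane.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    # Plane 14 starts at U+E1000 in the grammar.
    + [(0xE1000, 0xEFFFD)]
)


def is_ucschar(code_point: int) -> bool:
    """True if the code point may appear unescaped as a non-ASCII IRI character."""
    return any(low <= code_point <= high for low, high in UCSCHAR_RANGES)


if __name__ == "__main__":
    # A few whitespace characters from the table above.
    for cp in (0x0009, 0x0020, 0x0085, 0x00A0, 0x2028, 0x3000):
        verdict = "allowed" if is_ucschar(cp) else "% encoded"
        print("U+{:04X} {}".format(cp, verdict))
```

Note that this only covers the non-ASCII part of the grammar: ASCII characters are governed by other productions, so a `False` result for an ASCII letter does not mean it is forbidden.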

answered Apr 15 '11 at 11:57

Nathan

Percent-encoding is not a property of IRIs, URIs, or their references, but only of mappings and transformations that are applied to these identifiers. Space is not a valid character in either a URI or an IRI, and %20 is a valid character sequence in most parts of a URI or IRI. The problem is that the RDF specification is not based on URI or IRI but on its own concept of a "URI reference" that allows spaces.

(Apr 22 '11 at 09:42) Jakob

Asked: Apr 06 '11 at 23:22

Seen: 992 times

Last updated: Oct 24 '11 at 16:49