|
This question needs a bit of introduction, bear with me please. In RDF, resources are identified using URI references. The notion of URI reference was added in anticipation of the standardization of IRIs. The SPARQL spec builds on this and in fact adopts the IRI standard as part of its spec: RDF terms in SPARQL queries are identified using IRIs. Unfortunately, there is an incompatibility: RDF URI references may contain "<", ">", '"' (double quote), space, "{", "}", "|", "\", "^", and "`", but these are not allowed in IRIs (see SPARQL's IRI syntax). The upshot of this is that while it is perfectly legal to have an RDF triple of the following form:
You cannot directly query such a resource using SPARQL, e.g.:
is not a syntactically valid SPARQL query. All of this is pretty well known of course. The reason I am introducing it is that I would like to ask some "best practice" type questions related to this issue:
I'm not so much looking for theoretical solutions; I'm more interested in what has already been done out there in practice. This question was inspired by a recent discussion on the Sesame mailing list, by the way, just in case you thought it looked familiar :) |
|
There is one simple, practical and reasonable solution: just don't use URI references that contain nasty characters. Bad idea. No-go. If you happen to discover such URI references, you probably have more serious problems: they are a clear sign of ill-designed data and a lack of interest in interoperability. You can be sure that the dataset is broken in other respects as well. There are similar instances of possible-but-not-recommended in many other areas. For example, clone the following on a Linux command line: The practical solutions to nasty characters in any area are: either you replace them (for instance, an underscore instead of a space), or you escape them (for instance, with percent-encoding), or you just disallow them (throw away any triple that contains a bad URI ref).
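The three options above (replace, escape, disallow) can be sketched in a few lines of Python. This is only a minimal illustration, not something taken from an existing RDF toolkit: the function names, the `example.org` URIs, and the representation of a triple as a tuple of three URI-reference strings are all assumptions made for the example. The character set is the one listed in the question, plus the control characters up to U+0020 that SPARQL's `IRI_REF` production also excludes.

```python
from urllib.parse import quote

# Characters that are legal in RDF URI references but not inside a SPARQL <...> IRI.
FORBIDDEN = set('<>"{}|\\^`')


def _is_bad(ch: str) -> bool:
    # Space and control characters (U+0000 to U+0020) are excluded as well.
    return ch in FORBIDDEN or ord(ch) <= 0x20


def is_sparql_safe(uri_ref: str) -> bool:
    """Return True if uri_ref contains none of the problematic characters."""
    return not any(_is_bad(ch) for ch in uri_ref)


def replace_bad(uri_ref: str, substitute: str = "_") -> str:
    """Option 1: replace every problematic character with a substitute."""
    return "".join(substitute if _is_bad(ch) else ch for ch in uri_ref)


def escape_bad(uri_ref: str) -> str:
    """Option 2: percent-encode only the problematic characters."""
    return "".join(quote(ch, safe="") if _is_bad(ch) else ch for ch in uri_ref)


def drop_bad(triples):
    """Option 3: discard any triple that mentions an unsafe URI reference.

    Hypothetical simplification: each triple is a tuple of three URI-reference
    strings (no literals, no blank nodes).
    """
    return [t for t in triples if all(is_sparql_safe(term) for term in t)]
```

Note that options 1 and 2 both produce a different URI reference from the one in the source data, so whichever one you pick has to be applied consistently to the data and to the queries.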
I think this is in fact quite a reasonable way of looking at it, although saying that any dataset that contains such URI references is almost certainly broken in other respects is overstating it a bit. Also: I suspect that many such datasets are produced by people not quite informed about this issue: they simply create the dataset and validate it (using some RDF parser/validator). Since spaces are valid, they get no errors, and they therefore assume that they've done it right.
I think "everything that does not produce an error will be considered as valid" is a general rule. I would also bet that the difference between URIs and URI references is little known. Probably most people think that "%20" and " " are the same.
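To make the "%20" versus " " point concrete: RDF compares URI references character by character, so an escaped and an unescaped form name two different resources. A small sketch, using a made-up `example.org` URI and a hypothetical `same_resource` helper:

```python
from urllib.parse import quote

raw = "http://example.org/foo bar"
# Percent-encode the space while leaving ":" and "/" untouched.
escaped = quote(raw, safe=":/")


def same_resource(a: str, b: str) -> bool:
    """RDF-style equality: plain character-by-character string comparison."""
    return a == b
```

Here `escaped` ends in `foo%20bar`, and `same_resource(raw, escaped)` is false, which is why escaping only on the query side will not match unescaped data.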
Is this a good answer in a linked data world? Good data sets often have mistakes in their execution... let's say somebody else has done the wrong thing -- how do I deal with that?
@database_animal I'd say you inform the author of the mistake and ask them to fix it. How you deal with the data in the meantime is up to you, but I think the principle of "don't use it if it's broken" is solid enough - if perhaps a bit uncompromising. I'd still be happy to hear about other pragmatic ways of dealing with the issue :) |



It's gratifying that at least 6 people think it's a good question, strange though that no-one has an answer. Is this really a non-existent problem from a practical point of view?
Some more specific references would help. Finding contradictions in standards that refer to other standards is not trivial. The discussion you refer to can be found at http://sesame-general.435816.n3.nabble.com/Inconsistency-between-URIs-in-RDF-and-SPARQL-td2761175.html
I thought I had been pretty thorough with my linking to the relevant parts of the spec(s); I'm not sure what the mailing list discussion adds to that, but nevertheless thanks for adding the link.
I would mark this as a known issue and an inconsistency between the specs. I guess the new RDF WG will propose a solution for this issue (see http://www.w3.org/2011/rdf-wg/track/issues/8).