|
My head hurts from trying to find out how to encode URIs (e.g., from RDF/XML) for output as N-Triples. The issues include spaces, Unicode, URI references in RDF 1, IRIs in RDF 1.1, and so forth.

At the moment, I'm just escaping URIs per the standard escaping "map" for N-Triples given here (escaping tabs, quotes, newlines, Unicode, etc.). However, I'm a little confused about the purpose of the URI reference encoding described here, which states that Unicode URIs should be converted to UTF-8 (?) and that the octets should thereafter be percent-encoded. I don't know why this is needed (other than as some strange URI reference side effect).

Do I need to do anything other than standard N-Triples escaping? Should I percent-encode octets first? Also, do I need to do something with spaces?

EDIT: concrete example. If parsing RDF/XML from here:
into N-Triples. The raw IRI:
Should: Option 1: the IRI be directly escaped for N-Triples as:
...and not converted to a URI by escaping special characters; or Option 2: the IRI be converted to a URI first by %-encoding and then escaped for N-Triples.
(I note that rapper chooses Option 1.) |
|
The N-Triples document describes escaping, that is, putting the real character in the string but writing it in a funny way. The character must be legal in its raw form. "\t" is a string of length one that really does contain 0x09 (TAB). The section in N-Triples is called "strings"; it covers all the escapes, and some of them do not apply to URIs.

Percent-encoding is a little different. "%09" is a string of three characters: %, 0, 9. It is not a funny way to write a TAB, because the other end will not reverse the process (it has the right to decode "unreserved" characters if it is an end consumer of the URI). A raw TAB is not legal in a URI, and the real URI as seen by software contains the three characters %, 0, 9.

You should %-encode characters into the URI. Then you do not need to apply string encoding to URIs.

Ignore "RDF URI References"; they are a historical blip. The RDF WG of the time was following the IRI process, and "RDF URI References" were supposed to be a placeholder for IRIs. For example, at one time IRIs were going to allow spaces. The RDF WG published its specs. The IRI draft then changed to not allow spaces. Oops. RDF 1.1 will use IRIs.

RFC 3986 does not allow for URIs with real special characters in them. It requires the producer of the URI to %-encode them when the URI is created. Section 2.1:

""" A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. """

'represent' != 'escape', i.e. it is not really putting the raw character into the URI.

Section 2.4, on when to encode and decode, can be hard to apply in practice because it is a statement of principle, but it is usually read as follows: RDF systems do not fall under the "dereference" text but are instead intermediaries; they should not change the URI, and they should expect URIs to be %-encoded. (And "no", this isn't always what you want!)

(EDIT [Signified]: copied comment into answer. With respect to the options presented in the edit:) Option 2. If given an already-illegal URI, rapper attempts to make it safe-ish by using \u, but that is still putting illegal characters into the URI. But as an IRI: http://ru.dbpedia.org/resource/Эдж_(остров) is legal (as an IRI). See RFC 3987 for IRI->URI, but the short story is %-encode.
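To make the difference between the two options concrete, here is a minimal Python sketch. The function names `iri_to_uri` and `ntriples_u_escape` are hypothetical, chosen for illustration; they are not taken from rapper or any RDF library. The sketch only handles non-ASCII characters: it assumes the input contains no spaces or other illegal ASCII characters, and it percent-encodes the host part the same way as the path, which is a simplification of the full RFC 3987 mapping.

```python
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Option 2 (sketch): percent-encode the UTF-8 octets of non-ASCII characters.

    ASCII characters, including any existing '%', are left untouched so that
    sequences that are already percent-encoded are not encoded twice.
    """
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in iri)


def ntriples_u_escape(iri: str) -> str:
    """Option 1 (sketch): write non-ASCII characters as N-Triples \\u / \\U escapes.

    A parser turns these escapes back into the raw characters, so the
    identifier is still the IRI, only spelled in ASCII.
    """
    out = []
    for ch in iri:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\U{cp:08X}")
    return "".join(out)


iri = "http://ru.dbpedia.org/resource/Эдж_(остров)"
print(iri_to_uri(iri))         # percent-encoded URI (Option 2)
print(ntriples_u_escape(iri))  # escaped IRI (Option 1)

# Escaping vs. percent-encoding for a TAB:
print(len("\t"))   # the escape denotes a single character
print(len("%09"))  # the percent-encoded form is three characters
```

The two outputs are different identifiers once parsed: the \u form decodes back to the original IRI, while the %-encoded form stays as written.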
I think the question is more intended as: how do you write down URIs that DO contain special characters (i.e. that haven't been created by %-encoding)? Because, just as you say, a URI that contains a %-encoded character is NOT the same as the similar URI that uses the non-encoded character.
Edited response, as a comment isn't long enough.
Thanks Andy. So in summary, you say to go for Option 1 in the edit of the question?
Option 2. If given an already-illegal URI, rapper attempts to make it safe-ish by using \u, but that is still putting illegal characters into the URI. But as an IRI: http://ru.dbpedia.org/resource/Эдж_(остров) is legal (as an IRI). See RFC 3987 for IRI->URI, but the short story is %-encode. |


Related discussion on RDF 1.1, found via a Google search:
http://lists.w3.org/Archives/Public/public-rdf-comments/2012Jul/0038.html
Also related: http://answers.semanticweb.com/questions/13076/uris-vs-iris-in-semantic-web-standards-and-tools