My head hurts from trying to find out how to encode URIs (e.g., from RDF/XML) to output as N-Triples. The issues include spaces, Unicode, URI refs in RDF 1, IRIs in RDF 1.1, and so forth.

At the moment, I'm just escaping URIs per the standard escaping "map" for N-Triples given here (escaping tabs, quotes, newlines, unicode, etc.).

However, I'm a little confused about the purpose of the URI reference encoding described here, which states that Unicode URIs should be converted to UTF-8 (?) and thereafter octets should be percent encoded. I don't know why this is needed (other than some strange URI ref. side-effect).

Do I need to do something else other than standard N-Triples escaping? Should I percent encode octets first? Also, do I need to do something with spaces?


EDIT: concrete example.

If parsing RDF/XML from here:

http://dbpedia.org/resource/Edge%C3%B8ya.xml

into N-Triples. The raw IRI:

http://ru.dbpedia.org/resource/Эдж_(остров)

Should:

Option 1

The IRI be directly escaped for N-Triples as:

http://ru.dbpedia.org/resource/\u042D\u0434\u0436_(\u043E\u0441\u0442\u0440\u043E\u0432)

...and not converted to a URI by escaping special characters.

or

Option 2

The IRI should be converted to a URI first by %-encoding and then escaped for N-Triples.

http://ru.dbpedia.org/resource/%D0%AD%D0%B4%D0%B6_(%D0%BE%D1%81%D1%82%D1%80%D0%BE%D0%B2)

(I note that rapper chooses Option 1.)
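
For comparison, Option 1 can be sketched in a few lines of Python. This is only an illustration: `ntriples_uescape` is a made-up helper name, and it handles non-ASCII characters only, not the other N-Triples escapes (tabs, quotes, newlines and so on).

```python
def ntriples_uescape(text: str) -> str:
    # Hypothetical helper: write each non-ASCII character as a
    # backslash-u (4 hex digits) or backslash-U (8 hex digits) escape.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII is copied through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")     # Basic Multilingual Plane
        else:
            out.append(f"\\U{cp:08X}")     # supplementary planes
    return "".join(out)


iri = "http://ru.dbpedia.org/resource/Эдж_(остров)"
print(ntriples_uescape(iri))
# http://ru.dbpedia.org/resource/\u042D\u0434\u0436_(\u043E\u0441\u0442\u0440\u043E\u0432)
```

A parser that reads this back gets the original Cyrillic characters, so the identifier is still the IRI, only written differently.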

asked 28 Aug '12, 16:12

Signified ♦

edited 29 Aug '12, 11:54


The N-Triples doc describes escaping, that is, putting the real character in the string but written in a funny way. The character must be legal in its raw form.

"\t" is a string which is of length one and really does have 0x09 (TAB) in it.

The section in the N-Triples doc is called "Strings". It lists all the escapes, and some of them do not apply to URIs.

Percent encoding is a little different.

"%09" is a string of three characters %-0-9. It is not a funny way to write a TAB, because the other end will not reverse the process (it has the right to decode "unreserved" characters if it is an end consumer of the URI). A raw TAB is not legal in a URI, and the real URI as seen by software contains the 3 characters %-0-9.
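
The difference can be seen in a tiny Python sketch (Python is used here purely for illustration; `escaped_tab` and `encoded_tab` are illustrative names):

```python
from urllib.parse import quote

escaped_tab = "\t"    # an escape: the string holds one real TAB character
encoded_tab = "%09"   # percent-encoding: three ordinary characters

print(len(escaped_tab))            # 1
print(len(encoded_tab))            # 3
print(quote("\t") == encoded_tab)  # True: encoding a TAB yields the text %09
print(escaped_tab == encoded_tab)  # False: they are different strings
```

The escape disappears when the string is parsed, while the percent-encoded form stays in the URI as written.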

You should %-encode characters into the URI. Then you do not need to apply string encoding to URIs.

Ignore "RDF URI References" - they are a historical blip. The then RDF WG was following the IRI process and "RDF URI References" were supposed to be a placeholder for IRIs. E.g., at one time, IRIs were going to allow spaces. The RDF WG published the specs. The IRI draft then changed to not allow spaces. Oops.

RDF 1.1 will use IRIs.

RFC 3986 does not allow for URIs with real special characters in them. It requires the producer of the URI to %-encode them when the URI is created:

Section 2.1: """ A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. """

'represent' != 'escape' i.e. it is not really putting the raw character into the URI.

Section 2.4, on when to encode and decode, can be hard to apply in practice because it is a statement of principle. It is usually read as saying that RDF systems do not fall under the "dereference" text but are instead intermediaries: they should not change the URI, and they should expect URIs to be %-encoded already.

(and "no" this isn't always what you want!)


(EDIT [Signified]: copied comment into answer. With respect to options presented in edit:)

Option 2.

If given an already-illegal URI, rapper attempts to make it safe-ish by using \u, but that is still putting illegal characters into the URI.

But as an IRI:

http://ru.dbpedia.org/resource/Эдж_(остров)

is legal (as an IRI). See RFC 3987 for IRI->URI but the short story is %-encode.
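
A minimal sketch of that "short story" in Python, assuming the input is already a syntactically valid IRI. `iri_to_uri` is a hypothetical helper name, and this version does nothing special for internationalized host names, which RFC 3987 treats separately.

```python
from urllib.parse import quote

# URI reserved characters, plus "%" so that existing percent-escapes
# are not encoded a second time. Letters, digits and "-._~" are never
# quoted by urllib.parse.quote.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    # Hypothetical helper: encode every other character as UTF-8 and
    # percent-encode each resulting octet.
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


iri = "http://ru.dbpedia.org/resource/Эдж_(остров)"
print(iri_to_uri(iri))
# http://ru.dbpedia.org/resource/%D0%AD%D0%B4%D0%B6_(%D0%BE%D1%81%D1%82%D1%80%D0%BE%D0%B2)
```

The result is pure ASCII, so it can be written between angle brackets in N-Triples without any further \u escaping.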

answered 29 Aug '12, 05:28

AndyS ♦

edited 31 Aug '12, 11:07

Signified ♦

I think the question is more intended as: how do you write down URIs that DO contain special characters (i.e. that haven't been created by %-encoding)? Because, just as you say, a URI that contains a %-encoded character is NOT the same as the similar URI that uses the non-encoded character.

(29 Aug '12, 05:48) Gerrit V

Edited my response, as a comment isn't long enough.

(29 Aug '12, 05:56) AndyS ♦

Thanks Andy. So in summary, you say to go for Option 1 in the edit of the question?

(29 Aug '12, 11:54) Signified ♦

Option 2.

If given an already-illegal URI, rapper attempts to make it safe-ish by using \u, but that is still putting illegal characters into the URI.

But as an IRI:

http://ru.dbpedia.org/resource/Эдж_(остров)

is legal (as an IRI). See RFC 3987 for IRI->URI but the short story is %-encode.

(30 Aug '12, 17:28) AndyS ♦
Asked: 28 Aug '12, 16:12

Seen: 2,256 times

Last updated: 06 Sep '12, 12:30