If you look at the DBpedia dumps, you'll find URIs written like

http://dbpedia.org/resource/Gong_%28Band%29

but if you look at many other data sources, you'll get

http://dbpedia.org/resource/Gong_(Band)

Most RDF tools don't seem to treat these as equivalent, so joins across the two forms won't bind. Sometimes this problem shows up in RDF, and sometimes it shows up in non-RDF systems such as the JSON output of AlchemyAPI and in Freebase.

So the question is... how to deal with this b.s. as accurately as possible and with as little work as possible?

asked 11 Nov '11, 10:32


database_animal ♦


...btw, I was really hoping "%2B%40%23%24%25*%26%5E%25" would decode to something interesting.

(18 Nov '11, 11:59) Signified ♦

You should have picked an example such as http://dbpedia.org/resource/%C3%84, where you have to take care of UTF-8 octets under the percent-encoding. And then there is Unicode normalization (NFC vs. NFD)!

(19 Nov '11, 17:32) Jakob
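
To illustrate the point about UTF-8 octets and normalization, here is a rough Python sketch. The second URI (the decomposed spelling) and the helper name `to_comparable` are made up for illustration and are not taken from any DBpedia dump. Note that this decodes every escape, including ones like %2F that must stay encoded, so the result is only a comparison key, not something to write back as an IRI.

```python
import unicodedata
from urllib.parse import unquote

# U+00C4 (precomposed) written as percent-encoded UTF-8 octets
composed = "http://dbpedia.org/resource/%C3%84"
# Hypothetical variant: "A" followed by U+0308 (combining diaeresis)
decomposed = "http://dbpedia.org/resource/A%CC%88"


def to_comparable(uri):
    """Decode all %-escapes as UTF-8, then normalize to NFC.

    The result is a key for matching only; it is not guaranteed
    to be a syntactically valid IRI.
    """
    return unicodedata.normalize("NFC", unquote(uri, errors="strict"))


# Decoding alone is not enough: the two strings still differ.
print(unquote(composed) == unquote(decomposed))
# After NFC normalization they compare equal.
print(to_comparable(composed) == to_comparable(decomposed))
```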

While I think the best way to handle this kind of thing in the long run is to move the mountain (er, I mean, the DBpedia maintainers) to fix it, a pragmatic workaround is to perform input filtering when processing the DBpedia dump.

For example, using Sesame's Rio parser toolkit it's quite easy to fit in a custom ValueFactory that does a simple check/conversion when the createURI method is invoked. Steps:

  1. create a custom ValueFactory implementation that converts wrongly %-encoded characters back to their unencoded form
  2. create a new Rio parser object and supply it with the custom ValueFactory
  3. use the parser to read your DBpedia data dump. The parser will report RDF statements with "cleaned up" URIs created by the custom ValueFactory.

This is all streaming so should be quick with minimal footprint, and if you want you can even immediately pump the parser output back into a Rio writer to create a new 'cleaned up' dump file.
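
For anyone not on Sesame, the same idea can be sketched as a line-oriented filter over an N-Triples dump. This is Python, not the Rio/ValueFactory approach described above, and everything in it is an assumption on my part: the function names are hypothetical, and the set of characters considered safe to decode (`SAFE_ASCII`) is a judgment call you should adjust to your data. Escapes for characters such as `/`, `%` and space are deliberately left encoded.

```python
import re

# Assumption: ASCII characters we are willing to leave unencoded in a path.
# RFC 3986 unreserved characters plus a few that show up in article titles.
SAFE_ASCII = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~()',!*:@"
)

_ESCAPES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")
# Matches either a quoted literal (left alone) or an <IRI> term.
_TERM = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>"\s]*)>')


def _decode_run(match):
    """Decode one run of %XX escapes, re-encoding anything not safe."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: leave the run untouched
    out = []
    for ch in text:
        if ch in SAFE_ASCII or ord(ch) >= 0xA0:
            out.append(ch)
        else:
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
    return "".join(out)


def clean_iri(iri):
    return _ESCAPES.sub(_decode_run, iri)


def clean_line(line):
    def repl(m):
        if m.group(1) is None:  # a quoted literal, not an IRI
            return m.group(0)
        return "<" + clean_iri(m.group(1)) + ">"
    return _TERM.sub(repl, line)


def clean_lines(lines):
    """Lazily rewrite an iterable of N-Triples lines, one at a time."""
    for line in lines:
        yield clean_line(line)
```

Feeding it an open file object and writing each yielded line to another file gives a cleaned-up dump without holding the data in memory.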


answered 13 Nov '11, 23:35


Jeen Broekstra ♦

FWIW, the draft for RDF Concepts 1.1 states:

“Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of [IRI]. Non-normalized forms that should be avoided include: […] Percent-encoding of characters where it is not required by IRI syntax.”

DBpedia URIs are %-encoded because Wikipedia URIs used to be %-encoded. Wikipedia changed this at some point, but DBpedia hasn't followed that move.


answered 12 Nov '11, 06:31


cygri ♦

this is factually right, but it doesn't tell me what I can do to rectify this situation so I can get ahead with my project now

(12 Nov '11, 17:06) database_animal ♦

@database_animal: pester the DBpedia maintainers to fix their URIrefs! There should not be any %-encoding in their RDF. Seriously, I think this is the only way to get this corrected. I have previously sent a mail about this very problem to the DBpedia mailing list, but the discussion fell a bit flat. If enough people point out the problem, however...

(12 Nov '11, 22:46) Jeen Broekstra ♦

Sometimes you have to move the mountain to Mohammad; if you've got a schedule, waiting for other people to change is risky.

Corruption from DBpedia will persist in external systems for years after they clean it up. It's been at least a year and a half since Freebase switched to mids and most vendors still have guids in their databases... which makes you wonder if their products are maintained at all.

(13 Nov '11, 13:20) database_animal ♦

I think the http://dbpedia.org/resource/Gong_%28Band%29 form is craziness. I don't believe there is anything that says RDF must be URL unencoded before being loaded, so to even go that route is introducing errors into data. The temptation to go that route is probably that RDF/XML does not support URIs that have () in them ( see what are the most severe limitations of RDF/XML ).

I once got a DBpedia dataset from the nice folks who make OWLIM that was meant to load into OWLIM. They had stripped out the bad RDF just so it could load. DBpedia does not contain 100% correct RDF, unfortunately. Some triple stores handle it with warnings or errors. OWLIM is a bit strict and will bail out on the whole file if one triple is bad, hence the preprocessing required to get a DBpedia that loads ( see Loading DBpedia in a RDF database (e.g. OWLIM)).

How to properly deal with it? (your core question)... I'd like to know too.


answered 11 Nov '11, 20:14


harschware ♦

there's the core of an answer there; Ontotext processes DBpedia to clean it up, and packages a "private label" version which is corrected.

Somebody who's not happy with data quality in DBpedia or some other source can process the dump. Perhaps such processed files can be distributed for free or be made available for commercial sale, or you can make them yourself.

(13 Nov '11, 08:12) database_animal ♦

from what I understand Ontotext just throws out triples that aren't pure RDF. I like your idea of a service for cleaning up and providing data. I'm wondering if there aren't some practices in use today, like maybe make a cleaned up IRI and then create an owl:sameAs ref to it, or some such. Even to get from here to there you've got a data reconciliation problem (Google Refine? script some custom clean up routines?)

(13 Nov '11, 22:22) harschware ♦
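
The owl:sameAs idea could look roughly like the Python sketch below: keep the original URIs and emit one linking triple per URI whose cleaned form differs. The helper names are hypothetical, and `decode_parens` is a deliberately narrow stand-in cleaner that only handles the two escapes from the question; a real cleanup routine would be passed in instead.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def decode_parens(uri):
    # Hypothetical, minimal cleaner: only undoes %28 and %29.
    return uri.replace("%28", "(").replace("%29", ")")


def same_as_triples(uris, clean=decode_parens):
    """Yield an N-Triples line for each URI whose cleaned form differs."""
    for uri in uris:
        cleaned = clean(uri)
        if cleaned != uri:
            yield "<%s> <%s> <%s> .\n" % (uri, OWL_SAME_AS, cleaned)


for triple in same_as_triples([
    "http://dbpedia.org/resource/Gong_%28Band%29",
    "http://dbpedia.org/resource/Gong",
]):
    print(triple, end="")
```

Whether a given store then actually treats the two forms as the same resource depends on its owl:sameAs support, so this shifts the reconciliation work rather than removing it.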

question asked: 11 Nov '11, 10:32

question was seen: 1,106 times

last updated: 19 Nov '11, 17:32