If you look at the DBpedia dumps, you'll find URIs written in percent-encoded form, like
http://dbpedia.org/resource/Gong_%28Band%29
but if you look at many other data sources, you'll get the unencoded form:
http://dbpedia.org/resource/Gong_(Band)
Most RDF tools don't treat these as equivalent, so the two forms won't bind to each other. Sometimes this problem shows up within RDF itself, and sometimes in non-RDF systems such as AlchemyAPI's JSON output or Freebase.
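To make the mismatch concrete: most stores compare IRIs as opaque strings, so the two spellings never unify even though they percent-decode to the same characters. A quick Python check using the Gong URI discussed in this thread:

```python
from urllib.parse import unquote

encoded = "http://dbpedia.org/resource/Gong_%28Band%29"
plain = "http://dbpedia.org/resource/Gong_(Band)"

print(encoded == plain)            # False: compared as opaque strings
print(unquote(encoded) == plain)   # True: they decode to the same characters
```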
So the question is... how to deal with this b.s. as accurately as possible and with as little work as possible?
asked 11 Nov '11, 10:32
While I think the best way to handle this kind of thing is eventually to move the mountain (er, I mean, the DBpedia maintainers) to fix it at the source, a pragmatic workaround is to perform input filtering when processing the DBpedia dump.
For example, using Sesame's Rio parser toolkit it's quite easy to fit in a custom handler between the parser and your store that normalizes each URI as it streams past.
This is all streaming so should be quick with minimal footprint, and if you want you can even immediately pump the parser output back into a Rio writer to create a new 'cleaned up' dump file.
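Sesame's Rio API is Java, but the shape of such a streaming filter is easy to sketch in a few lines of Python over an N-Triples dump. All names and the escape whitelist below are illustrative, not part of any library API:

```python
import re

# A conservative whitelist of escapes to decode; a full normalizer would
# follow RFC 3987 instead. %28/%29 are the parentheses DBpedia emits.
SAFE_ESCAPES = {"%28": "(", "%29": ")", "%2C": ","}

def clean_iri(iri: str) -> str:
    for esc, ch in SAFE_ESCAPES.items():
        iri = iri.replace(esc, ch)
    return iri

def clean_ntriples_line(line: str) -> str:
    # Rewrite every <...> IRI term on the line; literals stay untouched.
    return re.sub(r"<([^>]*)>", lambda m: "<" + clean_iri(m.group(1)) + ">", line)

line = ('<http://dbpedia.org/resource/Gong_%28Band%29> '
        '<http://www.w3.org/2000/01/rdf-schema#label> "Gong" .')
print(clean_ntriples_line(line))
```

Running each line of the dump through such a filter, then writing it straight back out, gives the 'cleaned up' dump described above without ever holding the dataset in memory.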
answered 13 Nov '11, 23:35
Jeen Broekstra ♦
FWIW, the draft for RDF Concepts 1.1 states:
“Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of [IRI]. Non-normalized forms that should be avoided include: […] Percent-encoding of characters where it is not required by IRI syntax.”
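It's worth noting what Section 5 normalization actually does: only percent-escapes of unreserved characters get decoded. Parentheses are sub-delims, so a strictly normalizing tool would leave DBpedia's %28/%29 in place — which is exactly why the spec asks minters not to introduce the encoding in the first place. A minimal sketch (the function name is hypothetical, not a library call):

```python
import re

# Characters RFC 3986 calls "unreserved": decoding their percent-escapes
# is the only decoding that strict normalization performs.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def decode_unreserved(iri: str) -> str:
    def repl(m):
        ch = chr(int(m.group(1), 16))
        # Decode only unreserved chars; uppercase the hex digits of the rest.
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, iri)

# DBpedia's parentheses survive: "(" and ")" are sub-delims, not unreserved.
print(decode_unreserved("http://dbpedia.org/resource/Gong_%28Band%29"))
```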
DBpedia URIs are %-encoded because Wikipedia URIs used to be %-encoded. Wikipedia changed this at some point, but DBpedia hasn't followed that move.
answered 12 Nov '11, 06:31
I think the http://dbpedia.org/resource/Gong_%28Band%29 form is craziness. I don't believe anything in the specs says URIs must be percent-decoded before being loaded, so even going that route introduces errors into the data. The temptation to go that route probably comes from the fact that RDF/XML does not support URIs that have () in them (see "what are the most severe limitations of RDF/XML").
I once got a DBpedia dataset from the nice folks who make OWLIM, prepared so it would load into OWLIM: they had stripped the bad RDF found within just so it could load. Unfortunately, DBpedia does not contain 100% correct RDF. Some triple stores handle bad triples with warnings or errors; OWLIM is a bit strict and will bail out on the whole file if one triple is bad, hence the preprocessing required to get a DBpedia that loads (see "Loading DBpedia in a RDF database (e.g. OWLIM)").
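That kind of preprocessing can be approximated with a line filter that drops anything that doesn't look like a valid N-Triples statement. This is only a rough sketch with illustrative names — a real cleaner should use an actual parser rather than a regex:

```python
import re

# Very rough shape check for an N-Triples line: subject, predicate, object,
# final dot. Lines failing it are dropped before they ever reach the store.
NT_LINE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+<[^>]*>\s+(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"\S*)\s*\.\s*$'
)

def keep(line: str) -> bool:
    return line.strip() == "" or line.startswith("#") or bool(NT_LINE.match(line))

lines = [
    '<http://example.org/a> <http://example.org/p> "ok" .',
    '<http://example.org/a> <http://example.org/p> broken',  # no valid object/dot
]
print([keep(l) for l in lines])  # → [True, False]
```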
How to properly deal with it (your core question)? I'd like to know too.
answered 11 Nov '11, 20:14