|
If you look at the DBpedia dumps, you'll find URIs written like http://dbpedia.org/resource/Gong_%28Band%29, but if you look at many other data sources, you'll get http://dbpedia.org/resource/Gong_(Band). Most RDF tools don't seem to treat these as equivalent, so the two forms won't bind to each other. Sometimes this problem shows up in RDF, and sometimes it shows up in non-RDF systems such as the JSON AlchemyAPI and Freebase. So the question is: how do I deal with this b.s. as accurately as possible and with as little work as possible? |
|
While I think the best way to handle this kind of thing eventually is by moving the mountain (er, I mean, the DBpedia maintainers) to fix this, a pragmatic workaround is to perform input filtering when processing the DBpedia dump. For example, using Sesame's Rio parser toolkit it's quite easy to fit in a custom
This is all streaming, so it should be quick with a minimal footprint, and if you want you can even immediately pump the parser output back into a Rio writer to create a new 'cleaned up' dump file. |
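The same input-filtering idea can be sketched outside of Rio. What follows is a minimal Python sketch, not Sesame code, and it assumes the dump is line-oriented N-Triples in UTF-8. The names `clean_iri`, `clean_line` and `clean_stream` are made up for illustration. It decodes a percent-encoded octet only when the resulting character is allowed unescaped in an IRI path segment, so `%28` becomes `(` while `%25` and `%2F` stay as they are. The non-ASCII check is a rough approximation of the IRI grammar, and the regular expression will also rewrite a matching `<...>` that happens to sit inside a literal.

```python
import re

# ASCII characters that RFC 3987 allows unescaped in a path segment:
# unreserved characters, sub-delims, ":" and "@".
SAFE_ASCII = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    "-._~!$&'()*+,;=:@"
)
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")  # one or more %XX octets in a row
# Assumption: only IRIs in the DBpedia resource namespace need cleaning.
IRI_REF = re.compile(r"<(http://dbpedia\.org/resource/[^<>\s]*)>")


def _is_safe(ch):
    """True if ch may appear literally in the cleaned IRI."""
    if ord(ch) < 0x80:
        return ch in SAFE_ASCII
    # Assumption: treat every character from U+00A0 upwards as acceptable.
    return ord(ch) >= 0xA0


def _decode_run(match):
    """Decode one run of %XX octets, re-encoding characters that must stay escaped."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: leave the run untouched
    return "".join(
        ch if _is_safe(ch)
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in text
    )


def clean_iri(iri):
    """Remove percent-encoding that IRI syntax does not require."""
    return PCT_RUN.sub(_decode_run, iri)


def clean_line(line):
    """Rewrite every DBpedia resource IRI found in one N-Triples line."""
    return IRI_REF.sub(lambda m: "<" + clean_iri(m.group(1)) + ">", line)


def clean_stream(lines):
    """Lazily clean an iterable of lines, so memory use stays flat."""
    for line in lines:
        yield clean_line(line)
```

To produce a cleaned dump, pass an open input file to `clean_stream` and hand the result to the output file's `writelines`; nothing is held in memory beyond the current line.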
|
FWIW, the draft for RDF Concepts 1.1 states: “Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of [IRI]. Non-normalized forms that should be avoided include: […] Percent-encoding of characters where it is not required by IRI syntax.” DBpedia URIs are %-encoded because Wikipedia URIs used to be %-encoded. Wikipedia changed this at some point, but DBpedia hasn't followed that move.
This is factually right, but it doesn't tell me what I can do to rectify this situation so I can get ahead with my project now.
@database_animal: pester the DBpedia maintainers to fix their URIrefs! There should not be any %-encoding in their RDF. Seriously, I think this is the only way to get this corrected. I have previously sent a mail about this very problem to the DBpedia mailing list, but the discussion fell a bit flat. If enough people point out the problem, however...
Sometimes you have to move the mountain to Mohammad; if you've got a schedule, waiting for other people to change is risky. Corruption from DBpedia will persist in external systems for years after they clean it up. It's been at least a year and a half since Freebase switched to mids, and most vendors still have guids in their databases... which makes you wonder if their products are maintained at all. |
|
I think the http://dbpedia.org/resource/Gong_%28Band%29 form is craziness. I don't believe there is anything that says RDF must be URL-decoded before being loaded, so even going that route introduces errors into the data. The temptation to go that route is probably that RDF/XML does not support URIs that have () in them (see "what are the most severe limitations of RDF/XML"). I once got a DBpedia dataset from the nice folks who make OWLIM that was meant to load into OWLIM. It had the bad RDF stripped out just so it could load. DBpedia does not contain 100% correct RDF, unfortunately. Some triple stores handle it with warnings or errors. OWLIM is a bit strict and will bail out on the whole file if one triple is bad, hence the preprocessing required to get a DBpedia that loads (see "Loading DBpedia in a RDF database (e.g. OWLIM)"). How to properly deal with it (your core question)? I'd like to know too.
There's the core of an answer there: Ontotext processes DBpedia to clean it up and packages a "private label" version which is corrected. Somebody who's not happy with data quality in DBpedia or some other source can process the dump. Perhaps such processed files can be distributed for free or made available for commercial sale, or you can make them yourself.
From what I understand, Ontotext just throws out triples that aren't pure RDF. I like your idea of a service for cleaning up and providing data. I'm wondering if there aren't some practices in use today, like making a cleaned-up IRI and then creating an owl:sameAs reference to it, or some such. Even to get from here to there you've got a data reconciliation problem (Google Refine? Scripting some custom cleanup routines?) |
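The owl:sameAs idea from the last comment could look roughly like this. It is only a sketch: `same_as_triples` is a hypothetical helper, the (original, cleaned) pairs are assumed to come from whatever cleanup routine you already have, and the links only help if your store applies owl:sameAs reasoning or your queries follow the link explicitly.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(pairs):
    """Yield one N-Triples line for each (original, cleaned) pair that differs."""
    for original, cleaned in pairs:
        if original != cleaned:
            # Link the cleaned IRI back to the form found in the dump.
            yield "<{}> <{}> <{}> .\n".format(cleaned, OWL_SAME_AS, original)


# Hand-written example pairs; the second IRI needs no cleanup, so no link is emitted.
pairs = [
    ("http://dbpedia.org/resource/Gong_%28Band%29",
     "http://dbpedia.org/resource/Gong_(Band)"),
    ("http://dbpedia.org/resource/Berlin",
     "http://dbpedia.org/resource/Berlin"),
]
links = list(same_as_triples(pairs))
```

Keeping the links in a separate file leaves the original dump untouched, so you can drop them again if DBpedia ever fixes the URIs upstream.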


...btw, I was really hoping "%2B%40%23%24%25*%26%5E%25" would decode to something interesting.
You should have picked an example such as http://dbpedia.org/resource/%C3%84, where you have to take care of both UTF-8 and percent-encoded octets. And then there is Unicode normalization: NFC or NFD!
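To make that concrete, here is a small Python illustration. `%C3%84` is the example from the comment; `A%CC%88` is an assumed decomposed (NFD) spelling of the same letter, added only for comparison. `urllib.parse.unquote` is used for brevity; it decodes every percent-escape, including `%25` and `%20`, so it is not a safe general-purpose cleaner on its own.

```python
import unicodedata
from urllib.parse import unquote

# The two UTF-8 octets C3 84 decode to the single code point U+00C4.
nfc_form = unquote("%C3%84")

# Assumed alternative spelling: "A" followed by combining diaeresis U+0308.
nfd_form = unquote("A%CC%88")

# Compared code point by code point, the two spellings are different strings...
same_raw = nfc_form == nfd_form

# ...but they compare equal once both are normalized to NFC.
same_nfc = (
    unicodedata.normalize("NFC", nfc_form)
    == unicodedata.normalize("NFC", nfd_form)
)
```

Whichever normalization form you pick, apply it to every source you want to join, otherwise you have the same "won't bind" problem one level down.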