I am a fan of the work done on DBpedia Spotlight, and I would be very interested in creating this service for other languages. But before heading off on that adventure, I would like to know what challenges it entails and any recommendations, for example creating dictionaries, training models, etc. Thanks!

asked 06 Apr '11, 08:09

Ale
accept rate: 20%


Hi Ale, Pablo from DBpedia Spotlight here. Thanks for your interest! You will be pleased to know that there is active work in our group on internationalizing DBpedia Spotlight. We are working on Portuguese and Spanish as examples, and we hope that other languages will follow suit. I recommend that interested parties join our mailing list to keep in touch and avoid duplication of effort.

Tim's comments are in the right direction (thanks Tim!), but let's see if I can detail the process further:

  1. Get the latest DBpedia in your language. For English we used the following datasets: labels_en.nt.bz2, redirects_en.nt.bz2, disambiguations_en.nt.bz2, instance_types_en.nt.bz2, mappingbased_properties_en.nt.bz2. If they are not available, you will need to look into how to create them. For higher quality extraction, consider adding infobox mappings and DBpedia Ontology labels for your language here.
  2. Get a Wikipedia XML dump in your language.
  3. Get Lucene tokenizers, stemmers, stopwords, etc. in your language. This will require some language-specific knowledge, so I can't help much. But see, for example, hunspell.
  4. Extract DBpedia resource occurrences, lexicalization dataset and extended dictionary
  5. Run DBpedia Spotlight occurrence indexing and spot dictionary creation
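To make steps 1 and 2 more concrete, here is a minimal sketch that assembles the download locations for the per-language DBpedia datasets and the Wikipedia XML dump. The URL patterns are assumptions based on the DBpedia 3.6 and Wikimedia dump layouts at the time; check the current download pages before relying on them.

```python
# Sketch of steps 1 and 2: locate the per-language inputs.
# The URL patterns below are assumptions (DBpedia 3.6-era layout).

DBPEDIA_DATASETS = [
    "labels", "redirects", "disambiguations",
    "instance_types", "mappingbased_properties",
]

def dbpedia_dataset_urls(lang, version="3.6"):
    """Build the per-language DBpedia dataset URLs (hypothetical layout)."""
    base = "http://downloads.dbpedia.org/%s/%s" % (version, lang)
    return ["%s/%s_%s.nt.bz2" % (base, name, lang) for name in DBPEDIA_DATASETS]

def wikipedia_dump_url(lang):
    """Build the Wikipedia XML dump URL for a language (step 2)."""
    return ("http://dumps.wikimedia.org/%swiki/latest/"
            "%swiki-latest-pages-articles.xml.bz2" % (lang, lang))

if __name__ == "__main__":
    for url in dbpedia_dataset_urls("es"):
        print(url)
    print(wikipedia_dump_url("es"))
```

For Spanish, for instance, this would produce filenames like `labels_es.nt.bz2`, mirroring the English dataset names listed in step 1.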

Steps 4 and 5 are self-documented in code. :) But we are working on creating a higher-level description of the steps to save you some time.
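To give a rough idea of what the lexicalization extraction in step 4 involves, here is a simplified sketch that pairs surface forms with DBpedia resources from a labels dataset in N-Triples format. This is not the Spotlight indexer itself (that lives in the codebase), and the regex deliberately ignores escaping edge cases and resource-valued objects.

```python
import re
from collections import defaultdict

# Matches one simplified N-Triples line:
#   <subject> <predicate> "literal"@lang .
# This ignores resource-valued objects and literal-escaping edge cases.
TRIPLE_RE = re.compile(r'<([^>]+)> <([^>]+)> "([^"]*)"(?:@(\w+))? \.')

def extract_lexicalizations(lines):
    """Map surface forms (rdfs:label literals) to the resources they name."""
    surface_forms = defaultdict(set)
    for line in lines:
        m = TRIPLE_RE.match(line.strip())
        if not m:
            continue
        subject, predicate, literal, _lang = m.groups()
        if predicate.endswith("rdf-schema#label"):
            surface_forms[literal].add(subject)
    return surface_forms

sample = [
    '<http://dbpedia.org/resource/Berlin> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
]
```

The real dataset also draws surface forms from redirects and disambiguation pages, which is what makes the "extended dictionary" extended.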

Finally, if you plan to change the source code, please consider committing it back to our repository so that other people can also benefit from it - and you can be acknowledged as an awesome contributor! :)

Update: We have created a step-by-step guide.


answered 07 Apr '11, 03:31


Pablo Mendes
accept rate: 50%

edited 18 Sep '12, 11:02


(+1) thanks for the authoritative answer

(07 Apr '11, 10:28) harschware ♦

I suspect you won't run into a need for creating dictionaries or training models as you say. The source code and instructions for installing a local copy are available, as are the datasets you will need, described on the main page. If it were me, I would download the source and datasets and familiarize myself with them. I would then look into why the data does not include the foreign-language data (is that even true?). If you confirm it is missing, you can match it up against the DBpedia 3.6 dumps; it will probably be in the same format. If so, ingest the foreign-language data into the server you are dedicating to this cause (make sure the server has enough horsepower to handle it). If the foreign-language data is missing, I suspect you may also need to create the disambiguation index as Lucene indexes, or merge them with those the project page provides.
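The "same format" check suggested above can be automated with a quick sanity pass over a dump before ingesting it. A hedged sketch, assuming the dumps are line-based N-Triples (the function names and the 90% threshold are my own choices, not anything from the project):

```python
import re

# Loose N-Triples line pattern: subject and predicate are URIs; the
# object is a URI or a (possibly language-tagged or typed) literal.
NTRIPLE_RE = re.compile(
    r'^<[^>]+> <[^>]+> '
    r'(?:<[^>]+>|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]+>)?) \.$'
)

def looks_like_ntriples(lines, sample_size=100):
    """Heuristically check that a dump's first lines parse as N-Triples."""
    checked = matched = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        checked += 1
        if NTRIPLE_RE.match(line):
            matched += 1
        if checked >= sample_size:
            break
    return checked > 0 and matched / float(checked) > 0.9
```

Running this over the first hundred lines of each `.nt.bz2` file (e.g. via `bz2.BZ2File`) before loading it into your server catches format mismatches early, instead of halfway through an ingest.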


answered 06 Apr '11, 12:35


harschware ♦
accept rate: 20%

