I'm a bit confused by the whole semantic web / RDF graph system, and even though I've read a lot about it, I still haven't been able to answer some very technical questions I have. The whole idea is that there's data on the internet, and if we structure this data in a standard format, such as RDF, we can have computers query it and let us better search for things.

1) RDF is used to build vocabularies, but it's also used to assign attributes to actual data. So if I have some tabular data about students, say their "full name" and their "age", I can have an RDF document that defines what "full name" and "age" mean, and then also have another RDF document where this tabular data is mapped to those vocabulary terms? My question is: in the RDF world, is there a difference between how we define vocabularies and how we represent actual data?

2) The web is full of these RDF resources, and we can use SPARQL to query them. But again, how does SPARQL know the difference between the vocabularies and the actual data?

3) I understand SPARQL can query multiple RDF resources. These RDF resources can exist on simple HTTP servers. So I imagine that a SPARQL endpoint, in order to execute your query, makes HTTP requests to download the RDF document(s). What if those documents are really large? What if my student data contains millions of records? How can a single SPARQL endpoint scale to those dimensions? After all, data on the internet is quite large.

Thanks for any clarification!

asked 09 Jan '13, 08:55 by Luca Matteis


> 1) RDF is used to build vocabularies, but it's also used to assign attributes to actual data. So if I have some tabular data about students, say their "full name" and their "age", I can have an RDF document that defines what "full name" and "age" mean, and then also have another RDF document where this tabular data is mapped to those vocabulary terms? My question is: in the RDF world, is there a difference between how we define vocabularies and how we represent actual data?

There are many different ways to tackle that question, and I'm not sure which would be the most useful for you.

Perhaps think of it in terms of natural language. You can make claims in natural language. To do this you need vocabulary: words. Words identify something, be it a specific thing, an idea, a relationship, what have you. The meaning (semantics) of words is (intensionally) defined in dictionaries, typically in terms of how their meaning relates to other words. You make claims in natural language by arranging vocabulary into structured sentences.

Unfortunately, machines find natural language too difficult to process: sentence structure is too complex and vocabulary is too rife with ambiguity. Hence we need something like RDF that machines can read.

In RDF, the vocabulary is just the set of terms used. Instead of words, which are prone to ambiguity ("orange" ... fruit or colour?), RDF primarily uses URIs as its words (e.g., http://dbpedia.org/resource/Orange_(fruit)). To make claims machine-readable, RDF only allows simple sentences in the form of triples: a subject, a predicate and an object. Now URIs by themselves don't have any real meaning to a computer (much like the word "orange" wouldn't mean much to someone who didn't speak English unless they had a dictionary/translation).
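To make that concrete, here's a minimal sketch using the Python rdflib library (the http://example.org/ URIs are invented purely for illustration) of a single triple whose "words" are URIs:

```python
# Minimal sketch using rdflib (assumed installed via `pip install rdflib`).
# The http://example.org/ URIs are invented purely for illustration.
from rdflib import Graph, URIRef

g = Graph()

# One "sentence" (a triple): subject, predicate, object -- each identified by a URI.
g.add((
    URIRef("http://example.org/people/alice"),              # subject
    URIRef("http://example.org/vocab/likes"),               # predicate (the "word" relating them)
    URIRef("http://dbpedia.org/resource/Orange_(fruit)"),   # object
))

print(g.serialize(format="turtle"))
```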

Now here's the tricky part ... how would you describe an orange to someone (in this case a machine) who has never encountered, or has no hope of ever encountering, an orange in the same way we do? It's like a large mandarin? It's like a large, sweet, orange-coloured lime? ... in fact, you can't explain it in a way that we would say the person/machine understands. But we don't really care that the machine doesn't understand ... if we tell the machine that limes and oranges and lemons are all citrus fruits, and that all citrus fruits are fruits and contain vitamin C, then even though the machine doesn't understand, it can start to emulate a form of understanding that's useful for applications. You could start to ask the machine questions like "show me recipes in the database that are okay for a citrus allergy", and the machine could automatically rule out anything with oranges or limes or lemons, or ingredients containing those.

Now RDF as a data model, and well-defined languages like RDFS and OWL (which allow for direct definitions of vocabulary in a generic intensional sense, like a dictionary), allow humans to define things in this manner, which the machine can then leverage to seem "smart" and do more with the data automatically.
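As a rough illustration of what those dictionary-style definitions look like (again a sketch with rdflib; the example.org vocabulary terms are invented, while rdfs:subClassOf and rdfs:label are standard RDFS terms), the citrus claims above might be written down like this:

```python
# Sketch of the citrus claims as RDFS vocabulary definitions.
# The example.org/vocab terms are invented; the RDFS terms are standard.
from rdflib import Graph, Literal, Namespace, RDFS

EX = Namespace("http://example.org/vocab/")
g = Graph()

# "Dictionary-style" (intensional) definitions: terms defined in relation to other terms.
g.add((EX.Orange, RDFS.subClassOf, EX.CitrusFruit))
g.add((EX.Lime,   RDFS.subClassOf, EX.CitrusFruit))
g.add((EX.Lemon,  RDFS.subClassOf, EX.CitrusFruit))
g.add((EX.CitrusFruit, RDFS.subClassOf, EX.Fruit))
g.add((EX.CitrusFruit, RDFS.label, Literal("citrus fruit")))

# An RDFS-aware engine can now infer, e.g., that anything typed as an EX.Orange
# is also an EX.Fruit, without being told so directly.
```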

Here then, the vocabulary is just the terms used, and the data are the claims made in the form of RDF triples.
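Mapping that back to your student example, here's a sketch (again with rdflib) where the vocabulary is a couple of existing FOAF terms (foaf:name, foaf:age) and the data is the triples about one student; the student URI under example.org is invented:

```python
# Sketch of the student example: vocabulary = FOAF terms, data = claims about a student.
from rdflib import Graph, Literal, Namespace, XSD
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/students/")
g = Graph()

# Data: claims about a particular (invented) student, using vocabulary terms as predicates.
g.add((EX.s001, FOAF.name, Literal("Ada Lovelace")))
g.add((EX.s001, FOAF.age,  Literal(36, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```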

All RDF data uses some vocabulary. In a sense, any RDF triple can be seen as "defining" the vocabulary it uses (in a generic extensional sense ... you probably didn't learn what the word "dog" in your native language means from a dictionary).

> I understand SPARQL can query multiple RDF resources. These RDF resources can exist on simple HTTP servers. So I imagine that a SPARQL endpoint, in order to execute your query, makes HTTP requests to download the RDF document(s). What if those documents are really large? What if my student data contains millions of records? How can a single SPARQL endpoint scale to those dimensions? After all, data on the internet is quite large.

Though some SPARQL techniques fetch RDF live over HTTP, it's more common for (the bulk of) the data to be centralised in a local database prior to querying. This allows much greater efficiency than running queries live over the Web.
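A sketch of that pattern with rdflib (the URL is just a placeholder): fetch the document once over HTTP into a local graph, then run your queries against that local copy rather than the live Web:

```python
# Sketch: pull an RDF document over HTTP once and keep it in a local graph
# (in practice, often a persistent triple store). The URL is a placeholder.
from rdflib import Graph

g = Graph()
g.parse("http://example.org/data/students.ttl", format="turtle")  # one-off HTTP fetch
print(len(g), "triples now held locally")
```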

In terms of scalability, people like SPARQL because it's expressive and allows you to write complex queries. However, the issue with having a complex query language is that SPARQL evaluation is PSPACE-complete: the academic way of saying that, in the worst case, a SPARQL query might not terminate before the heat death of the universe. Not all SPARQL queries can be run at large scale. This is a simple fact of life. That is not to say that SPARQL is somehow rendered useless ... far from it. There are still very many useful queries that can be executed using a SPARQL engine over scales in excess of tens of billions of triples (or, in fact, over any arbitrary scale assuming sufficient hardware).
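For example, a simple pattern-plus-filter query like the following sketch (run here with rdflib over the invented student data from above; the file name is a placeholder) is exactly the kind of query that engines handle comfortably even over very large datasets:

```python
# Sketch: a basic graph pattern plus a FILTER, run over local student data.
from rdflib import Graph

g = Graph()
g.parse("students.ttl", format="turtle")  # placeholder local file

results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?age
    WHERE {
        ?student foaf:name ?name ;
                 foaf:age  ?age .
        FILTER(?age >= 18)
    }
""")
for row in results:
    print(row.name, row.age)
```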


answered 09 Jan '13, 20:34 by Signified ♦

Very cool. Thanks for this explanation. So say I need a system that allows users to make "natural language" queries against RDF, without making them study the complexity of SPARQL's syntax. Is there an interface or something that is capable of generating SPARQL code based on a few basic filtering options in a minimalist form?

(10 Jan '13, 04:56) Luca Matteis

There are, I guess. One option is to look into Question Answering, which takes natural language questions and tries to convert them into (e.g.) SPARQL queries. Some links here:

http://answers.semanticweb.com/questions/20130/generate-triple-from-natural-language-question?page=1&focusedAnswerId=20253#20253

This is of course a difficult task for machines.

There are a number of works looking at making the creation of SPARQL queries "easier" for users through the use of form fields, etc.
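As a very rough sketch of that form-field idea (the field names and the FOAF terms here are just assumptions, not any standard interface), an application might assemble the SPARQL itself from a couple of simple inputs:

```python
# Very rough sketch: build a SPARQL query string from simple form inputs.
# Field names and FOAF terms are assumptions; a real system would also need to
# escape/validate inputs, since this naive string-building is not injection-safe.
def build_student_query(name_contains=None, min_age=None):
    filters = []
    if name_contains is not None:
        filters.append(f'FILTER(CONTAINS(LCASE(?name), "{name_contains.lower()}"))')
    if min_age is not None:
        filters.append(f"FILTER(?age >= {int(min_age)})")
    return f"""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name ?age
        WHERE {{
            ?student foaf:name ?name ;
                     foaf:age  ?age .
            {" ".join(filters)}
        }}
    """

print(build_student_query(name_contains="ada", min_age=18))
```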

If you're interested in more detail, I'd recommend making a new question along these lines.

(10 Jan '13, 13:49) Signified ♦