I'm a bit confused by the whole semantic web / RDF graph system, and even though I've read a lot about it, I still haven't been able to answer some very technical questions I have. The whole idea is that there's data on the internet, and if we structure this data in a standard format, such as RDF, computers can query it and let us search for things more effectively.
1) RDF is used to build vocabularies, but it's also used to assign attributes to actual data. So if I have some tabular data about students, say their "full name" and their "age", I can have an RDF vocabulary that defines what "full name" and "age" mean, and then another RDF document where this tabular data is mapped to those vocabulary terms? My question is: in the RDF world, is there a difference between how we define vocabularies and how we express actual data?
2) The web is full of these RDF resources, and we can use SPARQL to query them. But again, how does SPARQL tell the difference between the vocabularies and the actual data?
3) I understand SPARQL can query multiple RDF resources. These RDF resources can live on simple HTTP servers, so I imagine that a SPARQL endpoint, in order to execute a query, makes HTTP requests to download the RDF document(s). What if those documents are really large? What if my student data contains millions of records? How can a single SPARQL endpoint scale to those dimensions? After all, data on the internet is quite large.
Thanks for any clarification!
There are many different ways to tackle that question and I'm not sure what would be the most useful for you.
Perhaps think of it in terms of natural language. You can make claims in natural language. To do this you need vocabulary: words. Words identify something, be it a specific thing, an idea, a relationship, what have you. The meaning (semantics) of a word is (intensionally) defined in a dictionary, typically in terms of how it relates to other words. You can make claims in natural language by arranging vocabulary into structured sentences.
Unfortunately, machines find natural language too difficult: sentence structure is too complex, and vocabulary is too rife with ambiguity, to process easily. Hence we need something like RDF that machines can read.
In RDF, the vocabulary is just the set of terms used. Instead of words, which are prone to ambiguity ("orange" ... fruit or colour?), RDF primarily uses URIs as words (e.g., http://dbpedia.org/resource/Orange_(fruit)). To make claims machine-readable, RDF only allows simple sentences in the form of triples: a subject, a predicate, and an object. Now URIs by themselves don't have any real meaning to a computer (much like the word "orange" wouldn't mean much to someone who doesn't speak English, unless they had a dictionary/translation).
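To illustrate, here's a single triple written in Turtle syntax. The subject and predicate are real URIs (rdf:type is RDF's built-in "is a" predicate); the class URI in the object position is just assumed here for the sake of the example:

```turtle
# One triple: subject, predicate, object ... all three are URIs here
<http://dbpedia.org/resource/Orange_(fruit)>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://dbpedia.org/ontology/Plant> .
```

That's the entire grammar: every RDF claim, however sophisticated the dataset, is a sentence of exactly this three-part shape.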
Now here's the tricky part ... how would you describe an orange to someone (in this case a machine) who has never encountered, and has no hope of ever encountering, an orange in the way we do? It's like a large mandarin? It's like a large, sweet, orange-coloured lime? ... in fact, you can't explain it in a way that we would say the person/machine understands. But we don't really care that the machine doesn't understand ... if we tell the machine that, say, an orange is a type of fruit, then when someone asks for fruits, the machine can return the orange without ever needing to "understand" what an orange is.
Now RDF as a data model, together with well-defined languages like RDFS and OWL (which allow for direct definitions of vocabulary in a generic intensional sense, like a dictionary), allows humans to define things in this manner, which the machine can leverage to seem "smart" and do more with the data automatically.
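As a hedged sketch of what that looks like (the ex: terms are invented for illustration): RDFS lets you state that one class is a subclass of another, and an RDFS reasoner can then draw simple conclusions that the data never states explicitly.

```turtle
@prefix ex:   <http://example.org/schema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Vocabulary (intensional, dictionary-like): every orange is a fruit
ex:Orange rdfs:subClassOf ex:Fruit .

# Data: a claim about one particular thing
ex:orange1 a ex:Orange .

# Under RDFS entailment, a reasoner can additionally infer:
#   ex:orange1 a ex:Fruit .
```

The machine still has no idea what an orange tastes like, but it can now correctly answer "what fruits do you know about?", which is all we asked of it.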
Here then, the vocabulary is just the terms used, and the data are the claims made in the form of RDF triples.
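To connect this back to your student example, here's a sketch in Turtle (every name under ex:, and the student URI, is invented): the vocabulary declarations and the student data sit side by side as triples in exactly the same format.

```turtle
@prefix ex:   <http://example.org/schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# "Vocabulary": declaring the terms themselves
ex:Student  a rdfs:Class ;   rdfs:label "Student" .
ex:fullName a rdf:Property ; rdfs:label "full name" .
ex:age      a rdf:Property ; rdfs:label "age" .

# "Data": a claim about one student, using those terms
<http://example.org/student/42> a ex:Student ;
    ex:fullName "Alice Example" ;
    ex:age      "21"^^xsd:integer .
```

So the answer to your first question is: no, there is no separate mechanism. Defining vocabulary and describing data are both done by asserting triples; the distinction lies only in which terms the triples talk about.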
All RDF data uses some vocabulary. In a sense, any RDF triple can be seen as "defining" the vocabulary it uses (in a generic extensional sense ... you probably didn't learn what the word "dog" in your native language means from a dictionary).
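This also answers how SPARQL tells vocabulary from data: it doesn't. A SPARQL query just matches triple patterns against the graph, and whether the matched triples are "vocabulary" or "data" depends only on which triples you ask for. A sketch, again assuming invented ex: terms for a student dataset:

```sparql
PREFIX ex:  <http://example.org/schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Matching "data": the full names of all students
SELECT ?name WHERE {
  ?s rdf:type ex:Student ;
     ex:fullName ?name .
}

# The very same machinery matches "vocabulary" triples, e.g.:
#   SELECT ?p WHERE { ?p rdf:type rdf:Property . }
```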
Though some SPARQL techniques fetch RDF live over HTTP, it's more common for the (bulk of) data to be centralised in a local database prior to querying. This allows greater efficiency than running queries live over the Web.
In terms of scalability, people like SPARQL because it's expressive and allows you to write complex queries. However, the flip side of having an expressive query language is that SPARQL evaluation is PSPACE-complete: the academic way of saying that, in the worst case, a SPARQL query might not terminate before the heat death of the universe. Not all SPARQL queries can be run at large scale; this is a simple fact of life. That is not to say that SPARQL is somehow rendered useless ... far from it. There are still very many useful queries that can be executed by a SPARQL engine at scales in excess of tens of billions of triples (or, in fact, at any scale, assuming sufficient hardware).