What is skolemization? And, what is an example of skolemizing something?

asked 08 Apr '11, 17:35

harschware's gravatar image

harschware ♦
accept rate: 19%

edited 08 Apr '11, 18:06

Signified's gravatar image

Signified ♦

...a not so interesting intro to blank-nodes and Skolemisation

Skolemisation (in the simplest case) refers to replacing existential variables with unique constants. With regards RDF, this particularly applies to blank-nodes.

Under the RDF semantics, blank-nodes are existential variables... there exists something, but that thing isn't specifically named. Thus, given:

_:signified rdf:type sioc:UserAccount .    (1)

We know that there exists something which is a member of sioc:UserAccount. In fact, according to simple entailment, the above triple also entails the following:

_:signified rdf:type _:someclass .         (2)
_:blah rdf:type _:blah2 .                  (3)

...the first triple reads as "there exists something which is a member of some class" and the second triple reads precisely the same—under RDF semantics, the above two triples represent the exact same information (RDF graphs which don't contain such "redundancy" are called lean graphs).

Skolemisation can be used to state that the existential variable _:signified actually denotes something in particular (i.e., that it's a constant). Essentially, you're replacing the blank-nodes with a unique constant not appearing elsewhere. Although this Skolemisation falls outside of the standard RDF semantics, many implementations interpret blank-nodes as "Skolem constants".. Sometimes, people wrap angle-brackets around the blank-nodes <_:signified> to indicate that they're Skolems, or replace them with some IRI like <genid:signified>... others just leave the blank-nodes in their natural syntax and interpret them directly as constants.

Under such interpretations of blank-nodes, triples (2,3) would not be entailed from (1) since _:someclass may refer to something concrete other than sioc:UserAccount since _:someclass is now a (Skolem) constant. Much like a URI, it now refers to something particular.

...a not so brief history of blank-nodes and Skolemisation

...here taking a little poetic licence.

In the beginning, there was RDF, and it was decided that URIs should name things. However, if you (the RDF publisher) were feeling lazy (or if you just needed a little syntactic sugar for lists, etc.), you could omit the URI in the RDF/XML markup and a blank-node would be born...

...and so it soon became time to bestow upon RDF(S) a semantics... a meaning. And so, it was asked, what do these blank-nodes mean? The brightest logicians of the day conferred and decided that blank-nodes should be interpreted as existentially-quantified variables... these nodes exist, but they don't have names. You can look, but you can't touch.

....this also captured the notion of "locality" for blank-nodes such that they could be scoped to the graphs containing them: when merging two RDF graphs, you could keep the blank node labels (labels ...not identifiers or names) unique such that blank-nodes with the same labels in different graphs were different blank-nodes (called an "RDF merge"). This meant that I could use blank-nodes to locally encode, say, a "well-formed" RDF list—with a single value for rdf:first, and a single sub-list value for rdf:rest, and the list being terminated with rdf:nil—and no matter what external graphs would do in the Open World, that local RDF list would always be well-formed and "closed" (assuming the data graphs were ultimately RDF merged together).

...along with this existential/local interpretation of blank-nodes came simple entailment (as above).

...and on top of that was built RDF and RDFS entailment.

...then along came OWL, which was on a slightly different (but closely related) tangent, and folks needed a way of writing OWL down as RDF. In fact, the locality of blank-nodes made them useful for encoding RDF lists (used for OWL union/intersection/enumeration classes), and other n-ary OWL constructs in RDF (for restriction classes like the owl:someValuesFrom construct below)... if the RDF representation was "well-formed" in the local document, it would be "well-formed" in the Open World... for example, an external graph couldn't just stick a new member into your local union-class if you used a blank node.

...and again, the brightest logicians of the day decided that OWL (RDF-based semantics) would also interpret blank-nodes as existential variables, giving rise to the notion of "anonymous individuals", an obvious example of which would be Beatrice's mother. Let's say that we know (1) Beatrice is human (:Beatrice rdf:type :Human), and that (2) all humans have mothers, who are themselves human (:Human owl:equivalentClass [ owl:someValuesFrom :Human , owl:onProperty :mother ]). Given this information, an OWL reasoner can figure out Beatrice has a human mother; even though it can't name that mother, it can still represent the entailment (if needs be) using an existential blank-node: :Beatrice :mother _:anon . _:anon rdf:type :Human ..

...in fact, you could model something like SPARQL ASK queries (in the days before SPARQL) using blank-nodes as variables. Just ask the (simple|RDF|RDFS|OWL RDF-Based) reasoner if the triple with the blank-node is entailed—you could ask "is there a human with a human mother" by asking a reasoner if the RDF claim _:exvar1 rdf:type :Human ; :mother _:exvar2 . _:exvar2 rdf:type :Human . was true.

...but there was some practical problems associated with making reasoners this smart, and allowing them to ponder upon existential matters. In practice, simple entailment became not-so-simple-NP-complete entailment... figuring out if one bunch of data with a lot of blank nodes was entailed by another bunch of data with a lot of blank nodes could get tricky. To get an idea of the problem, imagine a directed graph where each node is unlabelled (a blank-node), and these nodes are connected by directed links... now imagine an arbitrary second graph of the same type. Now, try figuring out if you can "overlay" the second graph somewhere into the first graph, preserving the link structure of both graphs. Roughly speaking, you're looking at the graph homomorphism problem (a.k.a. ~simple entailment). If you can overlay the second graph on top of the first graph (you don't need to cover the first graph), then you have entailment... the second graph is true if the first graph is true: you can "map" the blank-nodes in the second graph such that it becomes a sub-graph of the first graph. [It should be noted that there are fairly efficient algorithms for handling most cases of this problem, but still, it sucks to implement in the general case.]

...furthermore, you may already have spotted that Beatrice's mother is human, and so she must have a mother who is human, who herself must have... ad infinitum. People started dumbing down their reasoners, curtailing or forgetting about existential variables... the logician's hack of choice to overcome this problem was Skolemisation: for example, one could replace existential variables with unique constants and only do reasoning over named individuals (now including the Skolemised blank-nodes). More and more reasoners began to forget about Beatrice's mother and her extended, existential, human maternal lineage (esp. those reasoners that wanted to write entailments down) ...Michael's answer covers this in more depth and with more accuracy.

...all the while, publishers on the Web—mostly FOAF hackers in those days—were struggling with assigning URIs to things that weren't web-pages or email addresses. That's another sub-story (minting etiquette/centralised naming/dereferenceability/http-range-14 [hash vs. slash]), but the upshot was avoidance of URIs and a proliferation of blank-nodes on the Web, mostly representing people and users of blogging platforms. It's safe to say that these publishers couldn't give an existential crap about existentially quantified variables... they were just lazy/confused about minting URIs, and in RDF/XML, you can just omit the rdf:about/rdf:ID attribute and let someone else worry about the blank-node problem.

...and worry other people were forced to. If I parsed an RDF/XML FOAF profile with one parser, and you parsed the same document with another parser, we would need to do a simple-entailment check just to figure out if we were on the same page or not. Also, given the nature of RDF/XML where blank nodes are often assigned a label based on their ordering in a document, if one were unfortunate enough to have to try and monitor the changes of a particular RDF/XML document over time, sticking an extra blank-node into the start of the document would make it look like the whole document changed.

...then Linked Data entered the picture, and URIs became Cool, and you had a choice of 303s or hash URIs, and people started using some URIs some of the time. And blank-nodes were largely villified, with some minor exceptions: they weren't dereferenceable, and they certainly weren't cool. But still, blank-nodes persisted on the Web, be they due to "laziness", "legitimate purposes", "syntactic sugar", or a relic of older data... blank nodes were not going away.

...and then SPARQL came along, and it too sort-of-echoed the existential-variable-blank-node-business, where blank-nodes in queries were interpreted as "non-distinguished variables" (~existential variables). Things were also complicated by the scoping of blank-nodes in named-graphs, queries and results. Now, if I asked a query like "give me all friends of Beatrice", and the result set included a blank node _:bnode14, I couldn't find out more information about this particular friend since _:bnode14 is not a constant, but a variable, and since the _:bnode14 in the query is scoped differently from the one in the named graph representing Beatrice's FOAF file, and since—in any case—the SPARQL engine is allowed to make up any blank-node label as it sees fit.

...and so SPARQL implementors quickly created their own engine-specific solutions to create "hooks" for querying blank nodes, all revolving around Skolemising blank-nodes. Engines began rewriting blank nodes to <genid:bnode14> or <_:bnode14> or just interpreting _:bnode14 as a constant. This allowed for querying blank-nodes in a manner analogous for URIs. Different vendor-specific means of uniquely Skolemising and scoping blank-nodes (in the data, results and queries) began to emerge. Data exchange/federated-querying/etc. involving blank-nodes became a quagmire.

...a not so pretty prognosis for blank-nodes and Skolemisation

Blank-nodes have (arguably) long been an open wound festering in the Semantic Web standards—how severe the issue is is a matter of opinion.

Blank nodes are poorly understood, poorly used, difficult to handle—they have added a lot of complexity to the nuts-and-bolts of the standards built upon them, and the implementations consuming them. What started as a simple "shortcut" for publishers has gotten a little out of hand. Blank-nodes have their uses, but people are beginning to question whether they are worth the effort and what alternatives might exist. People are discussing amputating blank nodes from RDF, but this is not so straightforward.

First off, the standard interpretation of blank-nodes as existential variables is deeply ingrained in the standards: this interpretation was originally used as the only meaningful way the logicians of the day saw to formalise the locality of blank nodes (and to allow for denoting the existence of something you couldn't quite name). Blank nodes were then adopted/supported/mistreated by later standards (sometimes with significant cost) to maintain some form of backwards-compatibility or to meet user needs. Revising these standards now would come at a great cost (esp. wrt. implementations).

With respect to publishing, blank nodes are also used as convenient syntactic shortcuts in Turtle for containers, and for representing OWL in RDF. Blank-nodes are also useful for representing transient information as RDF, or for describing various resources which should not be externally (de)referenceable. Also, the original use-case for blank-nodes still exists: blank-nodes allow publishers to postpone/forego minting URIs for their data.

Given the complexity of the issue, proposed next-steps for dealing with blank-nodes are widely divergent:

  • leave blank-nodes as they are and continue to interpret them as existential variables;
  • leave blank-nodes as they are, but interpret them directly as Skolem constants;
  • leave blank-nodes as they are, but propose Skolemisation schemes for converting blank-nodes to (globally unique) URIs;
  • deprecate blank-nodes and propose Skolemisation schemes for converting legacy blank-nodes to (globally unique) URIs;
  • remove blank-nodes and get publishers to directly use URIs.

...there are various strong opinions out there, but no clear-cut answers.


answered 08 Apr '11, 17:57

Signified's gravatar image

Signified ♦
accept rate: 38%

edited 03 Jan '13, 12:57


And what do you get out of that? What are the pros or the reasons to do that? And what other (negative? potentially harmful?) consequences are there? What can you do with that that you couldn't before and what can't you do anymore when you go that way?

I mean, there seems to be a lot of discussion around that topic but most of it whent straight over my head but it seems like something that people constantly argue about.

(09 Apr '11, 16:05) Simon Reinhardt Simon%20Reinhardt's gravatar image

I assume you're referring to the discussion on the semweb mailing list that I've been wisely avoiding until now? ;) http://lists.w3.org/Archives/Public/semantic-web/2011Mar/

I tried to give a list of pros and cons, but that's a bit too simplistic... above is the best I could do to give a reasonably impartial picture of what they're talking about (or at least, the part of the discussion that my non-logician brain could parse).

(09 Apr '11, 23:13) Signified ♦ Signified's gravatar image

I think there is a general misunderstanding about the purpose(s) of bnodes. They do not only act as anonymous individuals, but they also provide a notion of /local nodes/ within graphs. To me, the locality aspect is the more fundamental one and also the more relevant one. What the RDF semantics does by treating bnodes as existential variables is simply a way of appropriately capturing these two aspects of bnodes. I don't see any way to do it differently, without giving up on at least one of the two aspects - in particular locality. Try playing with local Skolem constants and see what happens!

(10 Apr '11, 05:13) Michael Schn... ♦ Michael%20Schneider's gravatar image

Another thing: When logicians build their reasoners using Skolemization, then it's not because they don't know how to do reasoning with existential variables but, instead, they use Skolemization as a technique to properly and efficiently handle existential variables. It's not a matter of replacing existential semantics by Skolems, it's rather a matter of how to use Skolems right. :-)

(10 Apr '11, 06:08) Michael Schn... ♦ Michael%20Schneider's gravatar image

@Signified: Great answer, thank you very much for extending it!

@Michael: Could you expand on the notion of local nodes a bit? :-)

(10 Apr '11, 07:04) Simon Reinhardt Simon%20Reinhardt's gravatar image

@Simon: Compare it with programming languages, where one uses local entities when they have no relevance outside the given scope but could lead to uncontrollable name clashes there. Now consider an OWL class expression, e.g. "UnionOf(SomeValuesFrom(:p, :C), AllValuesFrom(:q, :D))", for which the corresponding RDF triples are "glued" by tons of nodes. It'll be ok to give names to these glue nodes, as they need to be distinguished somehow - but only /locally within the class expression/! They should be "unreachable" outside the ontology and, consequently, are represented by local(!) blank nodes.

(10 Apr '11, 08:16) Michael Schn... ♦ Michael%20Schneider's gravatar image

Still on "locality": To avoid the wrong impression that my comment has only to do with OWL: It is clear that with a data model consisting only of triples, then one will generally have to represent more complex assertions by gluing together collections of triples using special shared nodes. Sometimes, such shared nodes may represent "real" things (e.g. in the case of FOAF Person descriptions), but in many situations they will just be "glue nodes" without specific meaning. Keeping the relevance of such glue nodes local to the graph in which they are used will generally be a good idea.

(10 Apr '11, 08:39) Michael Schn... ♦ Michael%20Schneider's gravatar image

And just for the case that someone asks why one does not simply avoid giving semantics to "pure glue nodes"... That's because one would somehow need be able to distinguish them from blank nodes standing for "real things". The RDF semantics provides a generic and coherent semantics to /arbitrary/ RDF graphs, without having special usage of graphs in mind, and not just restricted to special kinds of RDF graphs. In fact, there /are/ special-usage-and-special-kind-of-graph semantics around, which are different from the RDF semantics: take for example the semantics underlying OWL DL!

(10 Apr '11, 08:45) Michael Schn... ♦ Michael%20Schneider's gravatar image

Howdy. :)

@Michael, I originally had a lot more discussion about locality of blank-nodes in before, but I had to edit the answer down and I removed them. I've editted back in a bit more discussion on that...

As for local Skolems, the problem is assuring the global uniqueness of naming? This is not such a problem if you have named graphs... for "unnamed graphs", yep...

Now soliciting comments on what to remove from the answer... :)

(10 Apr '11, 11:56) Signified ♦ Signified's gravatar image
showing 5 of 9 show 4 more comments

...an "interesting-enough" intro to blank-nodes and Skolemisation

... for those like me with an internet age attention span. cliff-notes version of @Signified's awesome but lengthy answer:

Essentially, you're replacing the blank-nodes with a unique constant not appearing elsewhere.

_:signified rdf:type sioc:UserAccount .


<_:signified> rdf:type sioc:UserAccount .

answered 10 Apr '11, 22:15

harschware's gravatar image

harschware ♦
accept rate: 19%

BTW, I blame @Simon and the upvoters of his comment for the length of my answer (which was originally fairly sane). :P

(10 Apr '11, 22:36) Signified ♦ Signified's gravatar image

he he, couldn't resist :-P

(10 Apr '11, 22:50) harschware ♦ harschware's gravatar image

Gosh! That's not fair! ;-)

(11 Apr '11, 10:00) Michael Schn... ♦ Michael%20Schneider's gravatar image

Tough. :-P

(12 Apr '11, 13:43) Simon Reinhardt Simon%20Reinhardt's gravatar image

Well at least we got some karma points from it. :P

I'm still clinging to a literalist interpretation of "karma points" such that they will be redeemable in the after life when negotiating my next reincarnation.

FWIW, I will not be requesting reincarnation as a "semantic web researcher". ;)

(12 Apr '11, 14:09) Signified ♦ Signified's gravatar image

In addition to Signified's long answer on what Skolem's are and how they entered the RDF world, here is a likewise long answer on the technical properties of Skolems. I will say under which conditions using Skolems is harmless, when it is mostly harmless, and which usage pitfalls exist. I will also talk about the more general case of Skolem functions in contrast to Skolem constants.

Equisatisfiability, or: When Skolems are the Right Choice

Dealing with existential variables in reasoners can become a challenge, but when the task is to check satisfiability (semantic consistency) of a logical formula (e.g. of an OWL ontology), then Skolemization is the right way to proceed, as Skolemization retains "equisatisfiability". This means that if the original formula is satisfiable (unsatisfiable), then the Skolemized formula is also satisfiable (unsatisfiable). That's weaker than logical equivalence and, in fact, a Skolemized formula is generally not equivalent to the original formula, but equisatisfiability is fully sufficient for satisfiability checking, and Skolemization significantly eases this task, as there are no existential variables anymore afterwards.

So far for satisfiability checking, but what about entailment checking? Well, provided a sufficiently expressive logic language, one can translate every entailment checking task into a corresponding satisfiability checking task, where the entailment holds if and only if the translated formula is un-satisfiable. So, in these cases of expressive languages, Skolemization can be used for entailment checking as well, although only indirectly after successful transformation of the original entailment.

How does the transformation look like? If you start from an entailment query "A |= B", where "A" and "B" are formulas of the language being considered, then the translated formula is "A and not B" ("A&~B"): A implies B if and only if "A&~B" is wrong under all possible interpretations, hence unsatisfiable (or semantically inconsistent). And if there are existential variables in the target(!) formula "A&~B", then it is safe to replace them by Skolems, since this will not change it's state of (un)satisfiability.

Under which conditions is such a translation possible? As one can see from the translation above, one needs conjunction ("&") and negation ("~"). Negation is the relevant language feature here, as it has to apply to all possible formulas of the language - or at least to all possible formulas that one is interested in to see in the conclusion part of an entailment query. For standard first-order logic, this is clearly the case. For some description logics it also is. But unfortunately not for RDF, as there is no negation of graphs and triples. (Well, it could become possible with a proper embedding of RDF into a more expressive language, but I'll better don't touch this topic here. :-))

Skolemization Lemma, or: the Mostly Harmless Use of Skolemization

So, we now know that heavy-weight logic-based reasoning can use Skolemization to properly and efficiently deal with existential variables. But the translation into a satisfiability checking task that is possible for these languages does not work for RDF entailment. Can Skolemization still be used without tears? Yes, pretty well, when Skolemization is restricted to the premise of an entailment check. In this case, while some semantic change is happening, this change can well be characterized formally and can be considered "mostly harmless". This is the content of the "Skolemization lemma".

This lemma may look a bit complicated to non-logicians but what it essentially says is that you do not learn much news from an RDF graph after being Skolemized. Of course, you get all the statements of the graph including Skolems as entailments as well. Also, Skolems, being constants, are formally more specific and therefore more "expressive" than existential variables. But as the Skolems have to be "fresh constants", i.e. they are neither used in the graph nor in the background theory, these entailments do not provide much additional information.

Pitfall 1: Skolemization versus Culture

Still, one should be careful with the Skolemization lemma in practice, as it may produce results beyond intuition. For example, imagine the following RDF graph:

G := { _:x owl:sameAs _:y . }

Under RDF semantics (plus equality, as provided by OWL), this just says that there exist two things that are the same, which is certainly true! However, the following happens to be a valid Skolemization of G:

sk(G) := { 

Big news apparently! But there's nothing wrong here: neither TimBL's nor Danbri's FOAF URI is part of G or the RDF background theory and, hence, they are proper Skolem constants.

One has to keep in mind: Skolemization is a purely logical/technical method, which doesn't care about what people have in mind when they see certain names being used as Skolem constants. One could, of course, restrict the set of names being applicable as Skolem constants.

Pitfall 2: non-fresh constants

While the above Skolemization by TimBL's and DanBri's URIs was technically ok (although slightly questionable from a cultural point of view :-)), what one really has to take care of is to not reuse names: a Skolem constant has to be "fresh"!

The freshness is not a vague concept but, rather, it is clearly specified in terms of the vocabulary being used for creating formulas in the language, including the background theory, if such exists. In a formal context, one always has to provide this vocabulary, and then it is easy to find out what names may (technically) act as a Skolem constant: every name not occurring in this vocabulary.

In RDF, the vocabulary is the set of terms used to build the graph, the vocabulary terms of RDF(S), such as "rdf:type", and all possible literals.

But why at all this "freshness" requirement? Because the Skolem constant must be a legitimate replacement for the original existential variable. There must not be any constraints on the existential variable other than those constraints that already exist on it. If you would reuse some URI that is used in the graph to replace some existential variable, there would at least be one RDF triple using this URI in the original RDF graph, and this triple would act as an additional constraint on the Skolem constant that has not originally existed for the existential variable.

Here is an example:

G := {
    :timbl owl:differentFrom :danbri .
    _:x owl:sameAs :danbri .

sk(G) := {
    :timbl owl:differentFrom :danbri .
    :timbl owl:sameAs :danbri . # *invalid* Skolemization 

Here, I have incorrectly re-used the URI ":timbl", which is already used elsewhere in the graph G, to replace the blank node "_:x". Apart from the obvious strangeness of doing so, one can even see that this leads to a technical problem: while G was satisfiable in OWL (Full), sk(G) is unsatisfiable. As we have already seen that (proper) Skolemization retains satisfiability, this example cannot represent proper Skolemization!

Pitfall 3: Skolems in the Conclusion

While the Skolemization lemma essentially gives go to applying Skolemization to the premise of an entailment query, it does not talk about the conclusion part of the entailment, and this for good reason, as Skolemization of the conclusion will in most cases destroy an entailment. If there is an existential variable in the conclusion, then Skolemization will replace it by some fresh constant that does not occur elsewhere, in particular not in the premise ontology. This Skolem constant can then be interpreted pretty arbitrarily, and this semantic freedom will generally be deadly for an entailment.

For example, the following is a simple entailment, having an existentially interpreted blank node in its conclusion:

{ :danbri foaf:knows :timbl } 
|= { _:x foaf:knows :timbl }

This is a valid (simple) entailment, since the conclusion just says that there is someone (or something) who knows :timbl, and we already know from the premise that :danbri is an existing example of someone knowing :timbl. Now, after Skolemizing the conclusion, we receive:

{ :danbri foaf:knows :timbl } 
|= { sk:42 foaf:knows :timbl }

Here, "sk:42" is the chosen fresh Skolem constant being used to replace the blank node "_:x". It is easy to see that the conclusion does not logically follow from the premise, as "sk:42" can be freely interpreted, even by someone who does not know :timbl (e.g. my mother :-)).

Skolem Functions

For completeness, I wanted to note that for languages more expressive than RDF, Skolemization often leads to Skolem functions instead of Skolem constants, where function terms of the form "f(?x,?y,?z)" are used to replace existential variables in formulas. Actually, Skolem constants can be seen as null-ary Skolem function terms.

Skolem function terms are being used when formulas can have both existentially and universally quantified variables. For example, the statement "every human has a mother" can be formulated using explicit quantifiers as follows:

    Human(?x) => hasMother(?x,?y)

Now, if we naively Skolemize the existential variable "?y" by a Skolem constant "sk:y", we receive

// invalid Skolemization via a Skolem constant
    Human(?x) => hasMother(?x, sk:y)

This essentially reads that there is something denoted by sk:y, such that for every human ?x, sk:y is his/her mother. This is very different from the original assertion: originally, there was for every person some dedicated mother. Now, there is suddenly a single mother for everyone!

The problem is that the Skolem constant is independent of the still existing (universal) variable "?x". The remedy is to formally introduce such a dependency by building a function "sk:y(?x)":

// correct Skolemization via a Skolem function
    Human(?x) => hasMother(?x, sk:y(?x))

Now, for any instantiation ?x of a Human, the "hasMother" predicate relates ?x to the mother of ?x, who is denoted by the function term "sk:y(?x)".


answered 10 Apr '11, 11:59

Michael%20Schneider's gravatar image

Michael Schn... ♦
accept rate: 34%

edited 10 Apr '11, 12:04

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported



Asked: 08 Apr '11, 17:35

Seen: 8,290 times

Last updated: 03 Jan '13, 12:57