In a new project we are going to design a domain ontology and a triple store, and some advice would be welcome.
The data will be created, updated, and controlled entirely within the project, with some links to external data sets such as GeoNames. The data will almost certainly not be published as linked data.
Given that the RDF data in this project lives in a kind of closed world, some features of RDFS or OWL that are certainly useful in many contexts may not be useful here, or may even make things worse.
The first question is about property range/domain declarations and inverse properties. Given that the T-Box, the A-Box, and the SPARQL queries are all "under control", and given the other important fact that the triple store will have to handle a lot of triples and user queries (so performance is a key point of the solution), I would tend to say that we should avoid range/domain declarations and inverse properties, which would only lead to an increasing number of non-useful triples. Am I wrong? Even in this context, are there good reasons to use those features?
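To make the trade-off concrete, here is a minimal sketch (the `ex:` namespace and all its terms are hypothetical) of what an inverse-property declaration costs under forward-chaining materialization:

```turtle
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:partOf owl:inverseOf ex:hasPart .

# One asserted triple ...
ex:sleeve ex:partOf ex:shirt .

# ... and a materializing reasoner adds a second one:
#   ex:shirt ex:hasPart ex:sleeve .
# With N such links, the store holds 2N triples; in a closed system
# a SPARQL 1.1 query can instead use the inverse path ^ex:partOf
# and get the same answers with no extra triples at all.
```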
The second main question is about the classical advice that we should reuse existing ontologies as much as possible. In our case, Dublin Core could only express a small part of the model (we would use just 2 or 3 Dublin Core metadata terms, such as Title, Creator, etc.), so we will have to create a domain ontology anyway. In this context, I would ask "why use Dublin Core?" instead of just creating exactly what we need, tailored to the reasoner we will use and the rules we will carefully select.
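For what it is worth, mixing a handful of Dublin Core terms into a custom ontology costs little more than a prefix declaration; a hypothetical sketch (the `ex:` terms are invented):

```turtle
@prefix ex:      <http://example.org/ontology#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

ex:Certificate a rdfs:Class .

# Reused dcterms properties sit next to domain-specific ones:
ex:cert42 a ex:Certificate ;
    dcterms:title   "GMO-free certificate" ;
    dcterms:creator ex:certificationBody1 ;
    ex:covers       ex:rawMaterialBatch7 .
```

The benefit is mainly for humans and for generic tools that already understand dcterms:title; a reasoner gains nothing from it, so the decision can be driven by convenience rather than principle.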
And if some day the data are published, it seems it is never too late to generate a new "linked-data-friendly" data set, no?
[Edit after Jerven's remark] Just to clarify the point about "correctness first", which of course is key: the idea of not declaring domain/range for properties comes from the fact that subjects and objects are all well identified, so the rdf:type can be asserted directly instead of giving more work to the reasoner, which could then be used for other, more important and complex tasks. But again, my question is not a general one (where domain/range of course make sense) but one specific to our project.
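The two options can be sketched side by side (again with hypothetical `ex:` terms):

```turtle
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Option A: declare a domain and let the reasoner derive the type.
ex:producedIn rdfs:domain ex:Material .
ex:cotton7    ex:producedIn ex:farm3 .
# inferred at reasoning time: ex:cotton7 a ex:Material .

# Option B (closed world, data fully under control): assert the type
# directly at load time and skip the domain declaration entirely.
ex:cotton7 a ex:Material ;
    ex:producedIn ex:farm3 .
```

Note one difference: in RDFS semantics a domain declaration is an inference axiom, not a constraint, so Option A never *checks* anything; Option B simply states up front what Option A would have derived.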
About reusing existing ontologies: I am totally convinced that reusing existing ontologies is key to the semantic web, but it is a very hard requirement to achieve. So far I have found it a bit of a mess to track down the corresponding ontology, mix ontologies together, then see strange reasoning behavior, etc. But the point here is slightly different: when modeling a domain for which there is no existing ontology, why create an ontology and also use only 2 or 3 dcterms or foaf terms? (The inverse would of course make more sense: an existing ontology covers 90% of the domain, and we just extend it.)
[Edit to answer Richard's remark - why would a reasoner be needed? (text too long for a comment)] We are still in the early stages of a project whose goals include seeing whether semantic technologies are of any help (compared to RDBs and NoSQL). There is no heavy logic in the data that would require a reasoner. But so far we think the main use of a reasoner would be to simplify queries over data with a long property-chaining path: two resources (res1 and res2) are linked by following properties from res1 through many other resources and finally to res2. A concrete example: I am in a shop and want to buy a shirt; I scan a code and get the origin of the material of my shirt, the certificates (organic, GMO-free, etc.), and the place of production of the raw material.

Such queries are possible but ugly, and maybe slow once there are millions of triples and hundreds of users. For queries that must serve many users efficiently, the information needed for the answer will therefore be computed at data-load time. Then, instead of using transitive properties that would create loads of unnecessary triples all the way from res1 to res2, custom rules will be used to create only the needed triples. For queries that are issued more rarely, following property paths will not be a problem (the information is there, and fast response time is not an issue). Does that make sense?
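A sketch of that load-time idea, assuming hypothetical properties `ex:madeOf`, `ex:producedBy`, `ex:locatedIn` and a derived shortcut property `ex:materialOrigin`:

```sparql
PREFIX ex: <http://example.org/>

# Load-time rule (SPARQL Update) that materializes only the shortcut
# triples actually needed by the frequent query:
INSERT {
  ?shirt ex:materialOrigin ?place .
}
WHERE {
  ?shirt    ex:madeOf     ?material .
  ?material ex:producedBy ?producer .
  ?producer ex:locatedIn  ?place .
}

# The frequent user query then collapses from a three-hop chain
# to a single triple pattern:
SELECT ?place WHERE {
  ?shirt ex:materialOrigin ?place .
}
```

Unlike declaring an owl:TransitiveProperty, this adds exactly one triple per shirt rather than a triple for every intermediate link along the chain.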
Thanks a lot for any advice.
Fabian
Think about correctness first and performance later. It is easy to increase performance when needed (buy a bigger machine); it is much harder to go back and correct wrong data (spend lots of man-hours fixing it).
Secondly, on the reuse of existing ontologies.
answered 05 May '11, 11:41