I noticed from the thread "What's missing from RDF databases?" that one thing people were quite keen on was full-text search in SPARQL.
This is because it mitigates a classic alleged performance problem in SPARQL when people write queries like so:
It should be self-evident that writing a query like this is a bad idea: you get a huge swathe of potential matches and then have to apply a regular expression to every one. Full-text search extensions to SPARQL let you write queries that achieve the same thing with huge performance advantages, for example the LARQ-style syntax:
However, one problem with this is that it isn't standardised and every vendor seems to have a different syntax for it. I'm aware of at least 3 different implementations, none of which are interoperable:
So I wondered several things:
I'm particularly interested in point 3. Maybe I'm being unimaginative, but I can't think of any use cases beyond the glaringly obvious text search uses, i.e. using it to find RDF graphs/triples containing certain text. If you have some other interesting or useful use case, please enlighten me!
We use the IndexingSail from uSeekM, documented at https://dev.opensahara.com/projects/useekm/wiki/IndexingSail and at http://opensahara.com/blog/geospatial-search-rdf-data. It provides a layer on top of triplestores that use the Sesame API, so it can be used with a large number of triplestores (Sesame Native, Bigdata, 4Store, Owlim, Mulgara, ...). We use uSeekM to avoid vendor lock-in: the same search syntax works for all these stores. The open source version of uSeekM uses a Postgres database with GIN inverted indexes to build the text index. uSeekM rewrites SPARQL and SeRQL queries and federates them between the text index and the triplestore.
The syntax for search:text filters is described here: https://dev.opensahara.com/projects/useekm/wiki/IndexingSail#Full-Text-Search
Disclaimer: I work for TalkingTrends, and uSeekM is developed by us for this purpose. It also does geospatial indexing of geometries for efficient GeoSPARQL querying.
I haven't used it personally, but Ontotext's BigOWLIM also lists support for integrated full-text search as one of its features.
Then there is the Nepomuk LuceneSail, which I have used in the past. Example query: search for any resource with an rdfs:label value that contains the string "person":
The main use case for any such solution typically has to do with document management. For example, we have built an RDF-based system for experimental researchers to document their research, where lots of semi-structured data is added (samples x, y, z, experiments a, b and c, and so on), but where users can also add (Office, PDF, ...) documents (for example as input literature, or as a publication resulting from a particular bit of research).
We have used a full-text search solution to support keyword searches that consider both the structured RDF (tags, entity types, all other sorts of annotations) and the full text of the documents. A non-indexed triplestore will be dead slow when evaluating string searches on literals that basically contain the entire text of a multi-page Word file, so we use an indexer.
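To illustrate why an indexer helps here, the following is a minimal Python sketch, not how any of the stores mentioned in this thread are implemented. It contrasts an unindexed scan, which tests a regular expression against every literal, with a lookup in an inverted index that is built once. The sample data, the tokenizer, and all function names are assumptions made up for the illustration, and the lookup only handles single-word queries.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (subject, literal) pairs standing in for triples
# whose objects are long text literals.
LITERALS = [
    ("doc:1", "Sample preparation protocol for experiment A"),
    ("doc:2", "Publication draft describing the protocol results"),
    ("doc:3", "Meeting notes, nothing about lab work"),
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def regex_scan(literals, word):
    """Unindexed approach: test a regular expression against every literal."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return {subject for subject, text in literals if pattern.search(text)}


def build_index(literals):
    """Build an inverted index once: token -> set of subjects containing it."""
    index = defaultdict(set)
    for subject, text in literals:
        for token in tokenize(text):
            index[token].add(subject)
    return index


def index_lookup(index, word):
    """Indexed approach: one dictionary lookup, no scan over the literals."""
    return set(index.get(word.lower(), set()))


index = build_index(LITERALS)
print(sorted(regex_scan(LITERALS, "protocol")))
print(sorted(index_lookup(index, "protocol")))
```

Both calls return the same subjects, but the scan does work proportional to the total size of all literals on every query, while the index pays that cost once at load time.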
answered 15 Apr '11, 22:15
Jeen Broekstra ♦
AllegroGraph also has the ability to do free text searches. In SPARQL you use the fti:match predicate.
I worked on a project where user assets were tracked by a resource id, each with a label, comments, and other properties. Text searching on any literal provided a loose matching mechanism for user assets that let users quickly narrow results. Since it searched all literals, the semantic meaning of the text became unimportant, but as a filter it let users quickly narrow results to those containing the target text.
Also, FWIW, the use case that always drives me to use free text search is:
Anzo supports it in a way similar to, but not interoperable with, the systems you mention. There's some incomplete documentation of it at http://www.openanzo.org/projects/openanzo/wiki/SPARQLExtensions (search for "textmatch").
We use it for the "obvious" reason -- it gives users an easy way to start narrowing down large sets of data to the records they're interested in.
answered 15 Apr '11, 09:27
So far, I think the other answers miss the point that text that is relevant to a node can be dispersed to other nodes. For instance, suppose you're looking for a company and you do a search for "Larry Page". Well,
Back in 2006 I was involved in something called the Global Performing Arts Database. GloPAD was based on PostgreSQL, but the data model was similar to an RDF system in that we had a very detailed ontology that distributed text into many nodes. I built a system that queried our data dictionary and could construct a graph of traversal paths across tables and rows. The system was capable of static analysis of this graph as well as dynamic traversal in real time.
A key thing we developed was traversal rules that would determine which links would yield relevant text and which wouldn't. The most important rule in the box was that the system would not traverse links that went to an object of the same type as the starting object. In the example above, for instance, there might be a
but the system would not follow it because
In GloPAD we did the traversal whenever a new record was added or was changed, creating a set of synthetic documents that were indexed in a conventional full-text system. This had the advantage of predictable query time. Performance was fine and our customers were happy with the search results.
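The index-time approach described above can be sketched roughly as follows. This is a hedged illustration in Python, not GloPAD's actual code: the node data, the undirected links, the `max_depth` limit, and the function names are all assumptions introduced for the example. The only rule taken from the description is that traversal never follows a link to an object of the same type as the starting object.

```python
from collections import deque

# Hypothetical graph: node id -> (type, own text). Illustrative data only.
NODES = {
    "google": ("Company", "Google"),
    "larry": ("Person", "Larry Page"),
    "youtube": ("Company", "YouTube"),
    "steve": ("Person", "Steve Chen"),
}
# Links are treated as undirected in this sketch (an assumption).
LINKS = [("google", "larry"), ("google", "youtube"), ("youtube", "steve")]


def neighbours(node):
    """Yield every node directly linked to the given node."""
    for a, b in LINKS:
        if a == node:
            yield b
        elif b == node:
            yield a


def synthetic_document(start, max_depth=2):
    """Collect text reachable from start, skipping nodes of the start's type."""
    start_type = NODES[start][0]
    seen = {start}
    texts = [NODES[start][1]]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # assumed depth limit to keep documents small
        for nxt in neighbours(node):
            # Rule from the text: do not traverse to an object of the same
            # type as the starting object.
            if nxt in seen or NODES[nxt][0] == start_type:
                continue
            seen.add(nxt)
            texts.append(NODES[nxt][1])
            queue.append((nxt, depth + 1))
    return " ".join(texts)


# Rebuild one synthetic document per record; a conventional full-text
# engine would then index these strings.
documents = {node: synthetic_document(node) for node in NODES}
print(documents["google"])
```

In this toy data the company-to-company link is never followed from "google", so text attached to the other company does not leak into its synthetic document, while a search for "Larry Page" still finds the company.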
It would also be possible to find nodes with matching text and traverse the graph at query time to determine what other nodes are related and how to rank them. Here it would be easy to tweak the ranking function to depend on relationship types and relationship distance, but I'd be concerned that queries could be slow if you hit a highly connected part of the graph.
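A rough Python sketch of that query-time alternative is given below, under stated assumptions: the edges, the per-relationship weights, the multiplicative scoring, and the `max_hops` cap are all hypothetical choices for illustration, not something the GloPAD system did. Each hop multiplies the score by the weight of the relationship type, so both relationship type and distance affect the ranking, and the hop cap is one simple way to bound the work in a highly connected part of the graph.

```python
from collections import deque

# Hypothetical typed edges: (source, relationship, target).
EDGES = [
    ("larry", "founderOf", "google"),
    ("google", "owns", "youtube"),
    ("larry", "mentionedIn", "article1"),
]
# Hypothetical text attached to each node.
TEXT = {
    "larry": "Larry Page",
    "google": "Google",
    "youtube": "YouTube",
    "article1": "News article",
}
# Assumed weights in (0, 1]: how much relevance survives each relationship.
REL_WEIGHT = {"founderOf": 0.9, "owns": 0.5, "mentionedIn": 0.2}


def adjacent(node):
    """Yield (relationship, neighbour) pairs, ignoring edge direction."""
    for source, rel, target in EDGES:
        if source == node:
            yield rel, target
        elif target == node:
            yield rel, source


def rank_related(query, max_hops=2):
    """Score nodes by text match, decayed per hop by relationship weight."""
    needle = query.lower()
    seeds = [n for n, text in TEXT.items() if needle in text.lower()]
    best = {n: 1.0 for n in seeds}
    queue = deque((n, 1.0, 0) for n in seeds)
    while queue:
        node, score, hops = queue.popleft()
        if hops == max_hops:
            continue  # cap the traversal to bound query time
        for rel, nxt in adjacent(node):
            new_score = score * REL_WEIGHT[rel]
            # Only continue through a node when this path improves its score.
            if new_score > best.get(nxt, 0.0):
                best[nxt] = new_score
                queue.append((nxt, new_score, hops + 1))
    # Highest score first; ties broken by node id for stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


for node, score in rank_related("larry page"):
    print(node, round(score, 2))
```

Tuning the ranking then amounts to editing `REL_WEIGHT`, while `max_hops` trades recall for predictable query time.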
answered 11 Dec '11, 08:11
The main use case of "full text search" in SPARQL is to delude customers by giving them a buzzword. With "full text search" managers get a checkbox to judge a SPARQL-based product or service without having to understand and think about which tools to use for solving which problems. Instead you should distinguish between:
People most likely think about the third, but they only get the first.
answered 11 Dec '11, 05:13
The Virtuoso Faceted Browser Service is a powerful demonstration of its Full Text Indexing in action for querying over large scale Linked Data.
Try the following steps for a live demonstration of its use: