|
I noticed from the thread "What's missing from RDF databases?" that one thing people were quite keen on was full-text search in SPARQL. This is because it mitigates a classic alleged performance problem in SPARQL, which arises when people write queries like so:
It should be self-evident that writing a query like this is a bad idea: you get a huge swathe of potential matches and then have to apply a regular expression to every one. Full-text search extensions to SPARQL let you write queries that achieve the same thing with huge performance advantages, for example using the LARQ-style syntax:
However, one problem with this is that it isn't standardised and every vendor seems to have a different syntax for it. I'm aware of at least 3 different implementations, none of which are interoperable:
So I wondered several things:
I'm particularly interested in point 3, maybe I'm being unimaginative but I can't think of any use cases beyond the glaringly obvious text search uses i.e. using it to find RDF graphs/triples containing certain text. If you have some other interesting/useful use case please enlighten me! |
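To make the performance argument in the question concrete, here is a minimal Python sketch (not SPARQL, and not any vendor's implementation) contrasting a regex scan over every literal with a lookup in a pre-built inverted index. The data and the names `LITERALS`, `regex_scan`, `build_index` and `index_lookup` are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy data: (subject, literal) pairs standing in for
# triples whose object is a text literal.
LITERALS = [
    ("ex:doc1", "Full text search in SPARQL"),
    ("ex:doc2", "Regular expressions are slow on large stores"),
    ("ex:doc3", "Indexing text literals with an inverted index"),
]

def regex_scan(pattern, literals):
    """Mimic a regex filter: every literal is tested, so the work
    grows with the total number of literals in the store."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {subject for subject, text in literals if rx.search(text)}

def build_index(literals):
    """Build a token -> subjects inverted index once, ahead of query time."""
    index = defaultdict(set)
    for subject, text in literals:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(subject)
    return index

def index_lookup(term, index):
    """Answer a single-term query with one dictionary lookup.
    Note: this matches whole tokens, not arbitrary substrings."""
    return set(index.get(term.lower(), set()))

INDEX = build_index(LITERALS)
```

For the term `text` both `regex_scan("text", LITERALS)` and `index_lookup("text", INDEX)` find the same two subjects, but only the first has to touch every literal. Real full-text extensions add tokenisation rules, stemming and scoring on top of this basic idea, and a token index is not a drop-in replacement for arbitrary regular expressions.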
|
We use the IndexingSail from uSeekM, as documented here: https://dev.opensahara.com/projects/useekm/wiki/IndexingSail, and here: http://opensahara.com/blog/geospatial-search-rdf-data. It provides a layer on top of triplestores that use the Sesame API, and can therefore be used with a large number of triplestores (Sesame Native, Bigdata, 4Store, Owlim, Mulgara, ...). We use uSeekM because it avoids vendor lock-in: the same search syntax can be used for all these stores. The open source version of uSeekM uses a Postgres database with GIN inverted indexes for building the text index. SPARQL and SeRQL queries are rewritten by uSeekM and federated between the text index and the triplestore. Example query:
The syntax for search:text filters is described here: https://dev.opensahara.com/projects/useekm/wiki/IndexingSail#Full-Text-Search Disclaimer: I work for TalkingTrends, and uSeekM is developed by us for this purpose. It also does geospatial indexing of geometries for efficient GeoSPARQL querying. |
|
I haven't used it personally, but Ontotext's BigOWLIM also lists support for integrated full-text search as one of its features. Then there is the Nepomuk LuceneSail, which I have used in the past. Example query: search for any resource with an rdfs:label value that contains the string "person":
The main use case for any such solution typically has to do with document management. For example, we have built an RDF-based system for experimental researchers to document their research, where lots of semi-structured data is added (samples x, y, z, experiments a, b and c, and so on), but where users can also add (Office, PDF, ...) documents (for example as input literature, or as a publication resulting from a particular bit of research). We have used a full-text search solution to support keyword searches that consider both the structured RDF (tags, entity types, all other sorts of annotations) and the full text of the documents. A non-indexed triplestore will be dead-slow when evaluating string searches on literals that basically contain the entire text of a multipage Word file, so we use an indexer. |
|
AllegroGraph also has the ability to do free text searches. In SPARQL you use the fti:match predicate. Overview: http://www.franz.com/agraph/support/documentation/v4/text-index.html Example SPARQL: http://www.franz.com/agraph/support/learning/SPARQL-fti-match.lhtml I worked on a project where user assets were tracked by a resource id, which had a label, comments, and other properties. Text searching on any literal provided a loose matching mechanism for user assets that allowed users to quickly narrow results. Since it searched all literals, the semantic meaning of the text became unimportant, but as a filter mechanism it allowed one to quickly narrow results by their target text. Also, FWIW, the use case that always drives me to use free text search is:
|
|
Anzo supports it in a way similar to, but not interoperable with, the systems you mention. There's some incomplete documentation of it at http://www.openanzo.org/projects/openanzo/wiki/SPARQLExtensions (search for "textmatch"). We use it for the "obvious" reason -- it gives users an easy way to start narrowing down large sets of data to the records they're interested in. |
|
So far, I think the other answers miss the point that text that is relevant to a node can be dispersed to other nodes. For instance, suppose you're looking for a company and you do a search for "Larry Page". Well, Back in 2006 I was involved in something called the Global Performing Arts Database. GloPAD was based on PostgreSQL, but the data model was similar to an RDF system in that we had a very detailed ontology that distributed text into many nodes. I built a system that queried our data dictionary and could construct a graph of traversal paths across tables and rows. The system was capable of static analysis of this graph as well as dynamic traversal in real time. A key thing we developed was traversal rules that would determine which links would yield relevant text and which wouldn't. The most important rule in the box was that the system would not traverse links that went to an object of the same type as the starting object. In the example above, for instance, there might be a
but the system would not follow it because In GloPAD we did the traversal whenever a new record was added or changed, creating a set of synthetic documents that were indexed in a conventional full-text system. This had the advantage of predictable query time. Performance was fine and our customers were happy with the search results. It would also be possible to find nodes with matching text and traverse the graph at query time to determine what other nodes are related and how to rank them. Here it would be easy to tweak the ranking function to depend on relationship types and relationship distance, but I'd be concerned that queries could be slow if you hit a highly connected part of the graph.
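The index-time traversal described in this answer can be sketched roughly as follows. This is not GloPAD's code: the toy graph, the `synthetic_document` helper and the `max_depth` cut-off are all assumptions made for illustration. The only rule taken from the answer is that links to an object of the same type as the starting object are not followed.

```python
from collections import deque

# Hypothetical toy graph: node id -> type, own text, and outgoing links.
NODES = {
    "google": {"type": "Company", "text": "Google", "links": ["larry", "youtube"]},
    "larry": {"type": "Person", "text": "Larry Page", "links": ["google"]},
    "youtube": {"type": "Company", "text": "YouTube", "links": ["google"]},
}

def synthetic_document(start, nodes, max_depth=2):
    """Collect text from nodes reachable from `start` into one string.

    Nodes of the same type as the starting node are never visited.
    `max_depth` is an assumed limit that keeps the traversal bounded.
    """
    start_type = nodes[start]["type"]
    seen = {start}
    parts = [nodes[start]["text"]]
    queue = deque([(start, 0)])
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for target in nodes[node_id]["links"]:
            if target in seen or nodes[target]["type"] == start_type:
                continue  # skip repeats and same-type objects
            seen.add(target)
            parts.append(nodes[target]["text"])
            queue.append((target, depth + 1))
    return " ".join(parts)
```

With this toy data, the document built for the company node contains the person's name, so a conventional full-text index over these documents would return the company for a "Larry Page" query, while text from the other company is left out. Rebuilding the document whenever a record changes moves the traversal cost to write time, which is what keeps query time predictable.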
I don't think that's the task of FTS itself; SPARQL can do that. It's well within the domain of SPARQL to combine the provided FTS filters and relate them to other resources (nodes).
RDF search engines like Falcons and (to a lesser degree) Sindice use a similar approach where large RDF graphs are broken down into "virtual" documents. Each document contains a certain graph neighbourhood around one given resource. Each document is indexed under terms according to the "text" that ends up in it. Unlike in your approach, the rules for breaking down large graphs are not domain-specific in those systems, but are based more on structural issues (blank nodes, literals vs IRIs etc). But then again the original question was about SPARQL, not about other ways of implementing RDF search. |
|
The main use case of "full text search" in SPARQL is to delude customers by giving them a buzzword. With "full text search" managers get a checkbox to judge a SPARQL-based product or service without having to understand and think about which tools to use for solving which problems. Instead you should distinguish between:
People most likely think about the third but they only get the first.
Practically all the FTS implementations mentioned in answers here use vector-based models, and some even use Lucene under the hood, so I don't really understand what you are trying to say here. I also don't understand what you mean by "literal pattern matching which is not supported by SPARQL." Just because capture groups aren't supported doesn't mean you can't do some fairly sophisticated regex patterns on literals. (Although I, too, wish for capture groups.) |
|
The Virtuoso Faceted Browser Service is a powerful demonstration of its Full Text Indexing in action for querying over large scale Linked Data. Try the following steps for a live demonstration of its use:
Erm, why was this downvoted so heavily? If you downvote, please leave a comment.
I think primarily because it looks rather like a blatant product plug |


Related reading? "Benchmarking Fulltext Search Performance of RDF Stores", Enrico Minack, Wolf Siberski, Wolfgang Nejdl, ESWC 2009. http://data.semanticweb.org/conference/eswc/2009/paper/225/html