I noticed from the thread "What's missing from RDF databases?" that one thing people were quite keen on was full-text search in SPARQL.

This is because it mitigates a classic alleged performance problem in SPARQL when people write queries like so:

SELECT ?s
WHERE
{
  ?s ?p ?o 
  FILTER(REGEX(?o, "substring"))
}

It should be self-evident that writing a query like this is a bad idea: the engine gets a huge swathe of potential matches and then has to apply a regular expression to every one. Full-text search extensions to SPARQL let you write queries which achieve the same thing with huge performance advantages, for example the LARQ-style syntax:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT ?s
{
    ?lit pf:textMatch 'substring' .
    ?s ?p ?lit
}
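
To make the performance argument concrete, here is a minimal Python sketch of the two strategies. It is not how LARQ or any particular store is implemented; the triples, the function names and the simple word tokenizer are all assumptions made for illustration. The regex version has to touch every literal, while the indexed version is a single dictionary lookup.

```python
import re
from collections import defaultdict

# Toy (subject, predicate, literal) triples; all of this data is made up.
TRIPLES = [
    ("ex:doc1", "rdfs:label", "Full text search in SPARQL"),
    ("ex:doc2", "rdfs:comment", "Regular expressions scan every literal"),
    ("ex:doc3", "rdfs:label", "Text indexes answer keyword lookups quickly"),
]

def regex_scan(pattern):
    """FILTER(REGEX(...)) style: test the pattern against every literal."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted({s for s, _, o in TRIPLES if rx.search(o)})

def build_index(triples):
    """Map each lower-cased word to the subjects whose literals contain it."""
    index = defaultdict(set)
    for s, _, o in triples:
        for token in re.findall(r"\w+", o.lower()):
            index[token].add(s)
    return index

# Built once, ahead of query time (hypothetical in-memory index).
INDEX = build_index(TRIPLES)

def index_lookup(term):
    """Text-index style: one dictionary lookup, no scan over the literals."""
    return sorted(INDEX.get(term.lower(), set()))

print(regex_scan("text"))    # scans all three literals
print(index_lookup("text"))  # looks up one key
```

Note that the two are not strictly equivalent: this toy index matches whole words, whereas the regex matches any substring, so `index_lookup("tex")` finds nothing here while `regex_scan("tex")` still matches.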

However, one problem with this is that it isn't standardised and every vendor seems to have a different syntax for it. I'm aware of at least 3 different implementations, none of which are interoperable:

So I wondered several things:

  1. Are you using full text search with SPARQL or would you use it if it were available in your RDF database/SPARQL engine?
  2. Whose implementation do you use? Please add links to documentation and an example if it's not one I've listed or someone else has already suggested
  3. What's your use case for it?

I'm particularly interested in point 3. Maybe I'm being unimaginative, but I can't think of any use cases beyond the glaringly obvious text search uses, i.e. using it to find RDF graphs/triples containing certain text. If you have some other interesting/useful use case, please enlighten me!

asked 15 Apr '11, 05:16

Rob Vesse ♦

edited 31 Jan '13, 05:59

Related reading? "Benchmarking Fulltext Search Performance of RDF Stores", Enrico Minack, Wolf Siberski, Wolfgang Nejdl, ESWC 2009. http://data.semanticweb.org/conference/eswc/2009/paper/225/html

(15 Apr '11, 21:08) Signified ♦

We use the IndexingSail from uSeekM as documented here: https://dev.opensahara.com/projects/useekm/wiki/IndexingSail, and here: http://opensahara.com/blog/geospatial-search-rdf-data. It provides a layer on top of triplestores that use the Sesame API, and so can be used with a large number of triplestores (Sesame Native, Bigdata, 4Store, Owlim, Mulgara, ...). The reason for using uSeekM is that we avoid vendor lock-in: the same search syntax can be used for all these stores. The open source version of uSeekM uses a Postgres database with GIN inverted indexes for building the text index. SPARQL and SerQL queries are rewritten by uSeekM and federated between the text index and the triplestore.

Example query:

PREFIX search: <http://rdf.opensahara.com/search#>
SELECT ?record
WHERE {
  ?record a <http://purl.org/ontology/mo/Record> .
  ?record ?predicate ?match .
  FILTER(search:text(?match, "Florence & Machine"))
} ORDER BY ?record

The syntax for search:text filters is described here: https://dev.opensahara.com/projects/useekm/wiki/IndexingSail#Full-Text-Search

Disclaimer: I work for TalkingTrends, and uSeekM is developed by us for this purpose. It also does geospatial indexing of geometries for efficient GeoSPARQL querying.

answered 15 Apr '11, 06:22

Gerrit V

edited 16 Jun '12, 08:41

I haven't used it personally, but Ontotext's BigOWLIM also lists support for integrated full-text search as one of its features.

Then there is the Nepomuk LuceneSail, which I have used in the past. Example query: search for any resource with an rdfs:label value that contains the string "person":

PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX search:   <http://www.openrdf.org/contrib/lucenesail#>
SELECT ?x ?score ?snippet WHERE {
?x search:matches ?match.
?match search:query "person";
  search:property rdfs:label;
  search:score ?score;
  search:snippet ?snippet. 
}

The main use case for any such solution typically has to do with document management. For example, we have built an RDF-based system for experimental researchers to document their research, where lots of semi-structured data is added (samples x, y, z, experiments a, b and c, and so on), but where users can also add (Office, PDF, ...) documents (for example as input literature, or as a publication resulting from a particular bit of research).

We have used a full-text search solution to support keyword searches that consider both the structured RDF (tags, entity types, all other sorts of annotations) and the full text of the documents. A non-indexed triplestore will be dead slow when evaluating string searches on literals that basically contain the entire text of a multi-page Word file, so we use an indexer.

answered 15 Apr '11, 22:15

Jeen Broekstra ♦

AllegroGraph also has the ability to do free text searches. In SPARQL you use the fti:match predicate.

Overview: http://www.franz.com/agraph/support/documentation/v4/text-index.html

Example SPARQL: http://www.franz.com/agraph/support/learning/SPARQL-fti-match.lhtml

I worked on a project where user assets were tracked by a resource id, which had a label, comments, and other properties. Text searching on any literal provided a loose matching mechanism that let users quickly narrow down their assets. Since it searched all literals, the semantic meaning of the text became unimportant, but as a filter mechanism it allowed one to quickly narrow results by target text.

Also, FWIW, the use cases that always drive me to use free text search are:

  • A regex doesn't do the trick. Either a regex on a literal would involve scanning too many literals and slow the query to a crawl, or the search expression isn't something I can easily write as a regex, e.g. searching for 5 keywords in any order, or searching for near neighbors.
  • The search is something that you present to a user and inject into your SPARQL. Far more users are comfortable with Google-like search expressions than with a regex pattern, so the choice to use free text is driven by your expectation of user need.
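
To illustrate the first point, here is a small Python sketch of an order-independent keyword match. The function names and the sample literal are made up, and a real text index would do this with posting-list intersection rather than per-literal tokenizing. With tokens it is a subset test; as a regex it needs one lookahead per keyword, which is exactly the kind of pattern most people don't want to write or ask users to write.

```python
import re

def tokens(text):
    """Lower-cased set of words in a literal."""
    return set(re.findall(r"\w+", text.lower()))

def matches_all_keywords(literal, keywords):
    """True if every keyword occurs as a whole word, in any order."""
    return {k.lower() for k in keywords} <= tokens(literal)

def lookahead_regex(keywords):
    """The regex equivalent: one lookahead assertion per keyword."""
    return "".join(r"(?=.*\b%s\b)" % re.escape(k) for k in keywords)

literal = "Quarterly budget report for the Berlin office"
wanted = ["berlin", "report", "budget"]

print(matches_all_keywords(literal, wanted))
print(lookahead_regex(wanted))
```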

answered 15 Apr '11, 10:19

harschware ♦

edited 15 Apr '11, 12:10

Anzo supports it in a way similar to, but not interoperable with, the systems you mention. There's some incomplete documentation of it at http://www.openanzo.org/projects/openanzo/wiki/SPARQLExtensions (search for "textmatch").

We use it for the "obvious" reason -- it gives users an easy way to start narrowing down large sets of data to the records they're interested in.

answered 15 Apr '11, 09:27

lee

So far, I think the other answers miss the point that text that is relevant to a node can be dispersed to other nodes. For instance, suppose you're looking for a company and you do a search for "Larry Page". Well, :Larry_Page may exist as a :Person, but you want :Google to show up. Somehow the system needs to follow the link from :Larry_Page to :Google.

Back in 2006 I was involved in something called the Global Performing Arts Database. GloPAD was based on PostgreSQL, but the data model was similar to an RDF system in that we had a very detailed ontology that distributed text into many nodes. I built a system that queried our data dictionary and could construct a graph of traversal paths across tables and rows. The system was capable of static analysis of this graph as well as dynamic traversal in real time.

A key thing we developed was traversal rules that would determine which links would yield relevant text and which wouldn't. The most important rule in the box was that the system would not traverse links that went to an object of the same type as the starting object. In the example above, for instance, there might be a

:Microsoft :isCompetitorof :Google .

but the system would not follow it because :Microsoft is also a :Company. In GloPAD this was easy because the types were mutually exclusive.

In GloPAD we did the traversal whenever a new record was added or was changed, creating a set of synthetic documents that were indexed in a conventional full-text system. This had the advantage of predictable query time. Performance was fine and our customers were happy with the search results.
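
A rough Python sketch of the idea, for readers who want something concrete. This is not the GloPAD code: the toy graph, the `:founderOf` link, the labels and the `max_depth` cut-off are all assumptions for illustration. It builds a synthetic document for a starting node by walking links in both directions and refusing to enter any node of the same type as the start.

```python
from collections import deque

# Hypothetical toy graph; node names, types and labels are illustrative only.
TYPES = {":Google": ":Company", ":Microsoft": ":Company", ":Larry_Page": ":Person"}
LABELS = {":Google": "Google", ":Microsoft": "Microsoft", ":Larry_Page": "Larry Page"}
EDGES = [
    (":Larry_Page", ":founderOf", ":Google"),      # assumed link, not from the post
    (":Microsoft", ":isCompetitorof", ":Google"),
]

def neighbours(node):
    """Yield nodes linked to `node` in either direction."""
    for s, _, o in EDGES:
        if s == node:
            yield o
        elif o == node:
            yield s

def synthetic_document(start, max_depth=2):
    """Collect labels reachable from `start`, never entering a node of the start's type."""
    seen = {start}
    queue = deque([(start, 0)])
    text = [LABELS[start]]
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in neighbours(node):
            if nxt in seen or TYPES[nxt] == TYPES[start]:
                continue  # the "same type as the starting object" rule
            seen.add(nxt)
            text.append(LABELS[nxt])
            queue.append((nxt, depth + 1))
    return " ".join(text)

print(synthetic_document(":Google"))
```

The resulting string would then be handed to a conventional full-text indexer under the key `:Google`, so a search for "Larry Page" returns the company without pulling in its competitors.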

It would also be possible to find nodes with matching text and traverse the graph at query time to determine which other nodes are related and how to rank them. Here it would be easy to tweak the ranking function to depend on relationship types and relationship distance, but I'd be concerned that queries could be slow if you hit a highly connected part of the graph.

answered 11 Dec '11, 08:11

database_animal ♦

I don't think that's the task of FTS itself, SPARQL itself can do that. It's well within the domain of SPARQL to combine provided FTS filters and relate those to other resources (nodes).

(11 Dec '11, 08:25) Gerrit V

RDF search engines like Falcons and (to a lesser degree) Sindice use a similar approach where large RDF graphs are broken down into “virtual” documents. Each document contains a certain graph neighbourhood around one given resource. Each document is indexed under terms according to the “text” that ends up in it. Unlike in your approach, the rules for breaking down large graphs are not domain-specific in those systems, but are based more on structural issues (blank nodes, literals vs IRIs etc). But then again, the original question was about SPARQL, not about other ways of implementing RDF search.

(12 Dec '11, 05:34) cygri ♦

The main use case of "full text search" in SPARQL is to delude customers by giving them a buzzword. With "full text search" managers get a checkbox to judge a SPARQL-based product or service without having to understand and think about which tools to use for solving which problems. Instead you should distinguish between:

  • string searching as simple detection of substrings in literals
  • literal pattern matching, which is not supported by SPARQL
  • text retrieval based on ranking algorithms (basically vector space model), such as implemented in Lucene

People most likely think of the third, but they only get the first.
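
The difference between the first and the third item can be sketched in a few lines of Python. This is a toy TF-IDF and cosine ranking over made-up literals, not what Lucene or any SPARQL engine actually runs, and the smoothed IDF formula is just one common choice. Substring detection returns an unordered yes/no set; the vector space version returns hits ordered by relevance.

```python
import math
import re
from collections import Counter

# Made-up literals keyed by a hypothetical subject IRI.
LITERALS = {
    "ex:a": "a note that mentions search once among many other words",
    "ex:b": "search search search engines",
    "ex:c": "nothing relevant here",
}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def substring_hits(term):
    """String searching: which literals contain the substring at all?"""
    return sorted(k for k, v in LITERALS.items() if term in v.lower())

def ranked_hits(query):
    """Text retrieval: rank literals by TF-IDF cosine similarity to the query."""
    docs = {k: Counter(tokenize(v)) for k, v in LITERALS.items()}
    n = len(docs)

    def idf(term):
        df = sum(1 for counts in docs.values() if term in counts)
        return math.log((1 + n) / (1 + df)) + 1  # smoothed IDF (an assumption)

    def vector(counts):
        return {t: tf * idf(t) for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(Counter(tokenize(query)))
    scored = [(cosine(q, vector(counts)), key) for key, counts in docs.items()]
    return [key for score, key in sorted(scored, reverse=True) if score > 0]

print(substring_hits("search"))  # membership only, alphabetical order
print(ranked_hits("search"))     # most relevant literal first
```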

answered 11 Dec '11, 05:13

Jakob

Practically all the FTS implementations mentioned in answers here use vector based models, and some even use Lucene under the hood. So I am not really understanding what you are trying to say here.

(11 Dec '11, 08:22) Gerrit V

I also don't understand what you mean by "literal pattern matching which is not supported by SPARQL." Just because capture groups aren't supported doesn't mean you can't do some fairly sophisticated regex patterns on literals. (Although I, too, wish for capture groups.)

(12 Dec '11, 00:20) harschware ♦

The Virtuoso Faceted Browser Service is a powerful demonstration of its Full Text Indexing in action for querying over large scale Linked Data.

Try the following steps for a live demonstration of its use:

  1. Go to: http://lod.openlinksw.com -- remember, you have 21 Billion+ triples (I state this number with what follows in mind)
  2. Enter pattern: School
  3. When you get the first results page, click on the "Type" link in the Navigation section
  4. You'll see: http://lod.openlinksw.com/fct/facet.vsp?cmd=load&fsq_id=36
  5. A few Retry button clicks later (depending on the state of the working set cache) you'll see education.data.gov.uk in the list
  6. Then you'll have: http://lod.openlinksw.com/fct/facet.vsp?cmd=load&fsq_id=37
  7. Optionally, click on "Places" link from the Navigation section and you get: http://lod.openlinksw.com/fct/facet.vsp?cmd=load&fsq_id=38
  8. Click on "Entity1" link at page top and you get a list
  9. Use "Save" for permalink and sharing
  10. etc..

answered 15 Apr '11, 22:13

hwilliams

edited 15 Apr '11, 22:15

Erm, why was this downvoted so heavily? If you downvote, please leave a comment.

(16 Jun '12, 09:33) Signified ♦

I think primarily because it looks rather like a blatant product plug

(18 Jun '12, 12:37) Rob Vesse ♦