I want to query a database of gene signatures, where each signature basically consists of a list of gene symbols (think "Ache", "Bcan", "Gjb1" ...).

I then want to use a list of such gene symbols (for example "Ache", "Bcan", "Gjb1") as a query list, and get the signatures that have at least a given minimum number of matches against my query list.

Can I implement this in SPARQL, and if so, what is the easiest way?

The case of at least one match is easy. For the query list above ("Ache", "Bcan", "Gjb1") it would look like this:

SELECT ?signature WHERE { 
    ?signature [some gene-symbol property] ?symbol .
    FILTER (?symbol = STR('Ache') || ?symbol = STR('Bcan') || ?symbol = STR('Gjb1'))
}

... but how do I handle an arbitrary "minimum matches" count without the query becoming too convoluted?

asked 03 Jan '13, 15:55

Samuel Lampa


You can do this using the aggregate support in SPARQL 1.1:

SELECT ?signature WHERE { 
  ?signature [some gene-symbol property] ?symbol .
  FILTER (?symbol IN ("Ache", "Bcan", "Gjb1"))
}
GROUP BY ?signature
HAVING (COUNT(?symbol) >= N)

(untested; substitute your minimum number of matches for N)
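
If the query list changes often, the GROUP BY / HAVING pattern above can be generated from a script. Below is a minimal sketch in Python, not a tested recipe for any particular store. The function name build_min_match_query and the predicate IRI http://example.org/geneSymbol are placeholders invented for illustration, and the gene symbols are assumed to be stored as plain literals containing no quotes or backslashes. It uses COUNT(DISTINCT ...) so that a symbol returned more than once for the same signature is only counted once.

```python
def build_min_match_query(predicate_iri, genes, min_matches):
    """Build a SPARQL 1.1 query for signatures matching at least min_matches genes."""
    # Assumption: symbols contain no quotes or backslashes, so no escaping is done.
    gene_list = ", ".join('"%s"' % g for g in genes)
    return (
        "SELECT ?signature WHERE {\n"
        "  ?signature <%s> ?symbol .\n"
        "  FILTER (?symbol IN (%s))\n"
        "}\n"
        "GROUP BY ?signature\n"
        "HAVING (COUNT(DISTINCT ?symbol) >= %d)"
    ) % (predicate_iri, gene_list, min_matches)


# Hypothetical predicate IRI; replace it with the real gene-symbol property.
query = build_min_match_query(
    "http://example.org/geneSymbol", ["Ache", "Bcan", "Gjb1"], 2
)
print(query)
```

The resulting string can be sent to any endpoint that supports SPARQL 1.1 aggregates.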

Lee

answered 03 Jan '13, 16:11

lee

Ah, looks nice! Unfortunately my triplestore (4store) install does not seem to have SPARQL 1.1 support yet ... :/

(03 Jan '13, 16:22) Samuel Lampa

I can't think of any realistic way to do it if limited to SPARQL 1.0 constructs, unfortunately. Switch triple stores? :-)

(03 Jan '13, 16:24) lee

Well, I'm trying it now with 4 query genes and the limit (N) set to 2, by manually (this could later be scripted) creating queries from pairs of match statements, glued together with SPARQL UNIONs. 4store seems to do quite a good job, at least for these: 0.048 s, with 7 million triples in the store. I'll have to see how it goes with larger numbers though :)

(05 Jan '13, 09:45) Samuel Lampa

Produced all the permutations in Python. I managed to query a 7-gene list with min matches set to 4 (takes about 4 seconds, including some Python processing). A bigger min_matches causes an "argument too long" error (I'm executing the SPARQL query on the command line, with 4s-query), so I guess I'd better give up this idea :)

(05 Jan '13, 11:23) Samuel Lampa
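
For readers who want to try the UNION approach described in these comments, here is a rough sketch of how such a SPARQL 1.0 query could be generated. It is not the script used above: build_union_query and the predicate IRI are hypothetical, "Gfap" is added only to get four query genes, and symbols are assumed to be stored as plain literals. It uses combinations rather than permutations, since the order of genes within a branch does not change which signatures match.

```python
from itertools import combinations
from math import comb


def build_union_query(predicate_iri, genes, min_matches):
    """Build a SPARQL 1.0 query with one UNION branch per gene combination."""
    branches = []
    for combo in combinations(genes, min_matches):
        # A signature matches a branch only if it carries every gene in the combination.
        patterns = " ".join('?s <%s> "%s" .' % (predicate_iri, g) for g in combo)
        branches.append("{ %s }" % patterns)
    return "SELECT DISTINCT ?s WHERE { %s }" % " UNION ".join(branches)


# Hypothetical predicate IRI and an illustrative four-gene query list.
query = build_union_query(
    "http://example.org/geneSymbol", ["Ache", "Bcan", "Gjb1", "Gfap"], 2
)
print(query.count("UNION") + 1)  # number of branches: 6

# Branch count for 7 genes with a minimum of 4 matches.
print(comb(7, 4))  # 35
```

The number of branches is the binomial coefficient C(n, k), so the query text grows quickly with longer gene lists (and faster still with permutations, 840 for 7 genes taken 4 at a time). That is in line with the "argument too long" error reported above.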

OK, I had to turn to a combination of SPARQL and Python, with Python generating the query.

Went for something like this:

def format_subclause(predicate, gene):
    return " ?s tm:%s ?o . FILTER ( ?o = STR('%s') ) " % (predicate, gene)

query = "PREFIX tm: <%s> " % prefix
query += "SELECT ?s ?o WHERE { { "
query += "} UNION {".join([format_subclause(predicate, g) for g in query_genes])
query += "} }"

This returns all items that match any of the query genes, so to keep only the ones with at least "min_matches" matches, I wrote some more Python code, like this:

import re

facts = {}  # Maps each item to the list of query genes it matched
# query_4store (not shown) is assumed to return the result rows as lines of text
output = query_4store(query)
for item in output:
    parts = re.split(r"\s+", item.strip())
    key = parts[0]
    value = parts[1].strip('"')
    # Add only unique values to the list
    if key in facts:
        if value not in facts[key]:
            facts[key].append(value)
    else:
        # Create a new list
        facts[key] = [value]

# DEBUG CODE START
print("OUTPUT:")
print("-" * 80)
for fact in facts:
    gene_count = len(facts[fact])
    if gene_count >= min_matches:
        print(fact, facts[fact], gene_count)
# DEBUG CODE END

Running the script with "min_matches" set to 51 (the highest number of genes in any item) takes around 2.5 seconds (The triplestore contains around 7 million triples). Seems to be good enough for my use case.

Wanted to document this here although it is not a pure SPARQL solution, in case it can help someone else.

Also, feel free to share your suggestions on how to write the Python code in a cleaner way!
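
One possible way to tidy up the post-processing step is to collect the matched genes in sets, which removes the manual uniqueness check. This is only a sketch under assumptions: the function name signatures_with_min_matches is made up, and the input is assumed to have already been parsed into (signature, gene) pairs; the sample rows are invented for illustration.

```python
from collections import defaultdict


def signatures_with_min_matches(rows, min_matches):
    """Group (signature, gene) pairs and keep signatures with enough distinct genes."""
    genes_by_signature = defaultdict(set)
    for signature, gene in rows:
        # A set silently ignores a gene that was already recorded for this signature.
        genes_by_signature[signature].add(gene)
    return {
        signature: sorted(genes)
        for signature, genes in genes_by_signature.items()
        if len(genes) >= min_matches
    }


# Invented sample rows; the repeated ("sig1", "Ache") pair is counted once.
rows = [("sig1", "Ache"), ("sig1", "Bcan"), ("sig2", "Ache"), ("sig1", "Ache")]
result = signatures_with_min_matches(rows, 2)
print(result)  # {'sig1': ['Ache', 'Bcan']}
```

Keeping the grouping in a function that takes plain pairs also makes it testable without a running triple store.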

answered 05 Jan '13, 20:23

Samuel Lampa

You mean you had to give up doing everything in SPARQL and implement the filtering in code instead, right? How would that work with a relational database? I ask because it sometimes happens to me that SPARQL looks nice and I write a query that might not even be possible with an RDBMS, but then it takes so long to run that, as you did, I end up either writing code or materializing triples to answer the query in an acceptable time. I know this is vague, but your example seems clear.

(07 Jan '13, 02:42) Fabian Cretton

+Fabian Cretton: That's correct! Note though that this is for SPARQL 1.0, which is what my triple store implements. The first answer in this thread gives a much shorter way to do it in newer versions of SPARQL.

(09 Jan '13, 14:02) Samuel Lampa