I want to query a database of gene signatures, where each signature is essentially a list of gene symbols (think "Ache", "Bcan", "Gjb1", ...).
I then want to use a list of such gene symbols (for example "Ache", "Bcan", "Gjb1") as a query list, and get the signatures that have at least a given minimum number of matches against my query list.
Can I implement this in SPARQL and, if so, what is the easiest way?
The case of at least one match is easy. For the query list "Ache", "Bcan", "Gjb1" mentioned above, it would look like so:
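A minimal sketch of such a one-match query, generated from Python so that the gene list can vary. The `ex:` namespace and the `ex:hasGeneSymbol` predicate are hypothetical placeholders for whatever the actual schema uses, and gene symbols are assumed to be stored as plain string literals:

```python
def quote_literal(value):
    """Quote a Python string as a SPARQL string literal."""
    # Escape backslashes first, then double quotes.
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def build_any_match_query(genes):
    """Build a query for signatures containing at least one of `genes`.

    The ex: prefix and ex:hasGeneSymbol are hypothetical; adapt to your data.
    """
    literals = " ".join(quote_literal(g) for g in genes)
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT DISTINCT ?signature WHERE {\n"
        "  VALUES ?gene { " + literals + " }\n"
        "  ?signature ex:hasGeneSymbol ?gene .\n"
        "}"
    )


print(build_any_match_query(["Ache", "Bcan", "Gjb1"]))
```

`VALUES` is a SPARQL 1.1 feature; on an older endpoint a `FILTER` over the alternatives or a `UNION` of patterns would express the same thing.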
... but how do I handle an arbitrary "minimum matches" count without the query getting too convoluted?
asked 03 Jan, 15:55
You can do this using the aggregate support in SPARQL 1.1:
(untested; substitute in your value of N as you see fit)
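As a rough sketch of what an aggregate-based query could look like, here it is assembled in Python so that N and the gene list are parameters. The `ex:` namespace and the `ex:hasGeneSymbol` predicate are assumptions made for illustration, not names from the actual dataset, and the sketch has not been run against a real endpoint:

```python
def quote_literal(value):
    """Quote a Python string as a SPARQL string literal."""
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def build_min_match_query(genes, min_matches):
    """Build a SPARQL 1.1 query for signatures with >= min_matches query genes.

    The ex: prefix and ex:hasGeneSymbol are hypothetical; adapt to your data.
    """
    literals = " ".join(quote_literal(g) for g in genes)
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?signature (COUNT(DISTINCT ?gene) AS ?matches) WHERE {\n"
        "  VALUES ?gene { " + literals + " }\n"
        "  ?signature ex:hasGeneSymbol ?gene .\n"
        "}\n"
        "GROUP BY ?signature\n"
        # int() ensures only a number can end up in the query text.
        "HAVING (COUNT(DISTINCT ?gene) >= %d)\n"
        "ORDER BY DESC(?matches)" % int(min_matches)
    )


print(build_min_match_query(["Ache", "Bcan", "Gjb1"], 2))
```

Grouping by signature and counting distinct matched genes keeps the query the same size no matter what the threshold is; only the number in the `HAVING` clause changes.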
answered 03 Jan, 16:11
OK, I had to turn to a combination of SPARQL and Python to generate the query.
I went for something like this:
This returns all items that match any of the query genes, so in order to keep only the ones with at least "min_matches" matches, I wrote some more Python code, like this:
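The filtering step could look roughly like the sketch below. It assumes the query results have already been fetched and flattened into `(item, gene)` pairs; the function name and the sample rows are made up for illustration and are not taken from the original script:

```python
from collections import Counter


def filter_by_min_matches(rows, min_matches):
    """Return {item: match_count} for items with at least min_matches genes.

    `rows` is an iterable of (item, gene) pairs, one per matched gene.
    """
    # De-duplicate pairs so that a repeated result row does not inflate a count.
    counts = Counter(item for item, _gene in set(rows))
    return {item: n for item, n in counts.items() if n >= min_matches}


# Hypothetical result rows standing in for real query output.
rows = [
    ("sig1", "Ache"), ("sig1", "Bcan"), ("sig1", "Gjb1"),
    ("sig2", "Ache"),
    ("sig3", "Bcan"), ("sig3", "Gjb1"), ("sig3", "Gjb1"),
]
print(filter_by_min_matches(rows, 2))
```

Counting on the client keeps the SPARQL side simple, at the cost of transferring every matching row before the threshold is applied.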
(The "&lt;" and "&gt;" in the code examples are actually "<" and ">".)
Running the script with "min_matches" set to 51 (the highest number of genes in any item) takes around 2.5 seconds (the triplestore contains around 7 million triples). That seems good enough for my use case.
I wanted to document this here, although it is not a pure SPARQL solution, in case it can help someone else.
Also, feel free to share suggestions on how to write the Python code more cleanly!
answered 05 Jan, 20:23