OK, I had to turn to a combination of SPARQL and Python to generate the query.
I went for something like this:
def format_subclause(predicate, gene):
    return " ?s tm:%s ?o . FILTER ( ?o = STR('%s') ) " % (predicate, gene)

query = "PREFIX tm: <%s> " % prefix
query += "SELECT ?s ?o WHERE { { "
query += "} UNION {".join([format_subclause(predicate, g) for g in query_genes])
query += "} }"
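To make the shape of the generated query easier to see, here is a minimal, self-contained sketch of the same string building wrapped in a function. The `build_query` name, the prefix URI, the predicate name and the gene names are all hypothetical values picked for illustration; they are not taken from my actual data.

```python
def format_subclause(predicate, gene):
    # One graph pattern matching subjects whose predicate value equals the gene
    return " ?s tm:%s ?o . FILTER ( ?o = STR('%s') ) " % (predicate, gene)


def build_query(prefix, predicate, query_genes):
    # Hypothetical wrapper: one subclause per gene, joined with UNION
    query = "PREFIX tm: <%s> " % prefix
    query += "SELECT ?s ?o WHERE { { "
    query += "} UNION {".join(format_subclause(predicate, g) for g in query_genes)
    query += "} }"
    return query


# Hypothetical prefix, predicate and genes, for illustration only
example = build_query("http://example.org/tm#", "hasGene", ["BRCA1", "TP53"])
print(example)
```

With two genes this gives a single SELECT with two UNION branches, one FILTER per gene. Note that the gene names are interpolated directly into the query string, so this sketch assumes they come from a trusted source and contain no quote characters.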
This returns all items that match any of the query genes. To keep only the items that match at least "min_matches" genes, I wrote some more Python code, like this:
import re

facts = {}  # Maps each item to the list of unique genes it matched
output = query_4store(query)
for item in output:
    parts = re.split(r"\s+", item.strip())
    key = parts[0]
    value = parts[1].strip('"')
    # Add only unique values to the list
    if key in facts:
        if value not in facts[key]:
            facts[key].append(value)
    else:
        # Create a new list
        facts[key] = [value]

# DEBUG CODE START
print("OUTPUT:")
print("-" * 80)
for fact in facts:
    gene_count = len(facts[fact])
    if gene_count >= min_matches:
        print(fact, facts[fact], gene_count)
# DEBUG CODE END
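The grouping and filtering step could probably be written more compactly with `collections.defaultdict` and sets, which handle the "create the list if missing" and "only unique values" parts automatically. This is only a sketch: the function names are hypothetical, and the sample rows are made-up stand-ins for what `query_4store(query)` returns, assuming each row is a subject followed by a quoted value separated by whitespace.

```python
import re
from collections import defaultdict


def group_matches(output_lines):
    # Map each subject to the set of distinct gene values it matched
    facts = defaultdict(set)
    for line in output_lines:
        parts = re.split(r"\s+", line.strip())
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        facts[parts[0]].add(parts[1].strip('"'))
    return facts


def filter_by_min_matches(facts, min_matches):
    # Keep only subjects that matched at least min_matches distinct genes
    return {s: genes for s, genes in facts.items() if len(genes) >= min_matches}


# Hypothetical rows standing in for the output of query_4store(query)
sample_output = [
    '<http://example.org/item1> "BRCA1"',
    '<http://example.org/item1> "TP53"',
    '<http://example.org/item1> "BRCA1"',
    '<http://example.org/item2> "TP53"',
]
hits = filter_by_min_matches(group_matches(sample_output), min_matches=2)
for subject, genes in hits.items():
    print(subject, sorted(genes), len(genes))
```

Using a set instead of a list also makes the uniqueness check constant-time per value, which may matter if an item matches many genes.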
Running the script with "min_matches" set to 51 (the highest number of genes in any item) takes around 2.5 seconds against a triplestore containing around 7 million triples. That seems good enough for my use case.
I wanted to document this here, although it is not a pure SPARQL solution, in case it can help someone else.
Also, feel free to share your suggestions on how to write the Python code in a cleaner way!
answered
05 Jan, 20:23
Samuel Lampa