|
How can I express the following query in SPARQL: "Given a 'patternSpectrum' holding a set of peaks with specified numerical heights, SELECT all spectra that hold a set of peaks where each of their peaks corresponds to exactly one peak in patternSpectrum, within a numerical limit of 0.3 units." In other words, a search for spectra whose peak heights all nearly match the peaks of a given "pattern spectrum" or "search spectrum". I was able to express this search in Prolog, as documented in http://saml.rilspace.com/content/backtracking-key-difference-between-sparql-and-prolog (the third code snippet), but so far NOT in SPARQL :| One of the problems seems to be that arithmetic operations can only be done in FILTER constructs, and not in the SELECT clause. Also, since this is a comparison of one set of peaks that has to match one-to-one against another set, it seems like the backtracking feature of Prolog becomes indispensable ... but maybe I'm missing something? I fear that it may have to involve some OWL definitions (though I had some difficulty finding out how to do that as well :|), and of course I'd hope for a solution solely in SPARQL, since this application really is a search problem and not a classification one. (Hints about how to do this in OWL are highly welcome too, though!) EDIT: An extract from the sample data I'm using, describing a spectrum that has peaks, each of which has a numerical shift value:
What I want to retrieve is a list of spectra that satisfy the criterion (that there is one corresponding peak shift value for each of a list of numerical values that the user provides). |
|
With the help of advice from Brandon Ibach on the [pellet-users] mailing list, we found a query that works (but unfortunately has serious scaling problems, as explained below). First, the sample data I'm using now:
(Yes, only a single spectrum for now. That is due to the scaling problems.) The file contains a spectrum with peaks (nmr:hasPeak), each with one shift (nmr:hasShift), with the following values (for convenience, if anyone wants to try): [17.6, 18.3, 22.6, 26.5, 31.7, 33.5, 33.5, 41.8, 42.0, 42.2, 78.34, 140.99, 158.3, 193.4, 203.0, 0] Searching for just a few of these peaks (so as not to run into performance problems) against the one-spectrum file results in this SPARQL query, which works fine and finishes in 3 seconds:
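A query of this general shape can also be assembled from a list of shift values. The sketch below is not the exact query used in this answer; it is a minimal Python illustration under several assumptions: that nmr:hasShift points directly at a numeric literal, that the nmr: prefix URI is a placeholder to be replaced with the real namespace, and that `build_query` is a hypothetical helper name. Only nmr:hasPeak, nmr:hasShift and the 0.3 tolerance come from the text.

```python
def build_query(shifts, tol=0.3):
    """Build a SPARQL query with one peak/shift variable pair per search value.

    Assumption: nmr:hasShift links a peak directly to a numeric literal.
    """
    patterns = []
    filters = []
    for i, value in enumerate(shifts, start=1):
        # One triple pattern pair per searched shift value.
        patterns.append(
            f"  ?spectrum nmr:hasPeak ?p{i} .\n"
            f"  ?p{i} nmr:hasShift ?s{i} ."
        )
        # The arithmetic is done here, so the FILTER only compares literals.
        filters.append(
            f"  FILTER (?s{i} > {value - tol:.2f} && ?s{i} < {value + tol:.2f})"
        )
    body = "\n".join(patterns + filters)
    return (
        "PREFIX nmr: <http://example.org/nmr#>\n"  # placeholder namespace
        "SELECT DISTINCT ?spectrum\n"
        "WHERE {\n" + body + "\n}"
    )


print(build_query([17.6, 18.3, 22.6]))
```

Note that this sketch does not force the matched peaks to be distinct; if a strict one-to-one match is needed, pairwise filters such as `FILTER (?p1 != ?p2)` would have to be added.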
The reason I didn't realize this simple solution was that I was unaware of the processing model of SPARQL, which "involves generating every possible combination of bindings of values in the data to variables in the query, where the bindings match the query pattern" ... and "Each set of bindings then being evaluated against each FILTER expression and those sets that don't evaluate to true are discarded," according to a clarifying explanation by Brandon on the mailing list. So far so good. Unfortunately, though, this query scales exponentially with the number of variables / shift values tested for, as documented in this blog post. (Using six similar shift values in the search takes some 488 seconds, on a dataset with 1 spectrum!) This is of course problematic, as one might want to search for spectra with as many as 50 peaks in the worst case, and of course in datasets bigger than 1 spectrum, so any hints about alternative strategies with more linear scaling behavior are still highly welcome. Admittedly, this query doesn't fully do what the title of this question asks for, since the values of the "imagined" pattern spectrum are stated explicitly in the query, but it is at least on the way to the solution, and could be useful as is if, for example, the SPARQL query is generated programmatically (which is easily done in Bioclipse's JS scripting environment). |
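To make the processing model quoted above concrete, here is a small Python sketch that mimics it naively: it enumerates every assignment of a peak shift to each searched value and only then applies the 0.3 filter. This is a conceptual model only, not a description of how Pellet or any other engine actually evaluates queries (real engines are free to optimize), and `naive_match` is a hypothetical name. The shift list is the one given in the answer.

```python
from itertools import product

# Shift values listed in the answer for the single sample spectrum.
SHIFTS = [17.6, 18.3, 22.6, 26.5, 31.7, 33.5, 33.5, 41.8, 42.0, 42.2,
          78.34, 140.99, 158.3, 193.4, 203.0, 0]


def naive_match(shifts, targets, tol=0.3):
    """Enumerate every binding of a peak to each target, then filter."""
    tried = 0
    kept = []
    for combo in product(shifts, repeat=len(targets)):
        tried += 1
        # Equivalent of the FILTER step: discard bindings outside the tolerance.
        if all(abs(s - t) < tol for s, t in zip(combo, targets)):
            kept.append(combo)
    return tried, kept


tried, kept = naive_match(SHIFTS, [17.6, 18.3, 22.6])
print(tried)             # 16 ** 3 = 4096 candidate bindings
print(len(SHIFTS) ** 6)  # 16777216 candidate bindings for six targets
```

With 16 peaks, each additional searched value multiplies the number of candidate bindings by 16, which is consistent with the exponential growth in run time described above.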



Could you provide a sample dataset and expected output? I'm having a hard time completely grasping the problem involved.
Thanks for your reply, and sorry for the late answer (got busy with exams). I have updated the question with a sample data extract.