
If I wanted to do semantic web application development in [some obscure language] for which no triple store is currently available, where would I go to find out how triple stores work and how to build a production-quality one in that language? What API standards are out there that I ought to adhere to?

Specifically, how do I go about indexing triples efficiently in terms of space and search time? Are B-tree derivative data structures what I should look at, or is something else better? What optimisations are known for compaction and for optimising, say, data retrieval to support reasoners and SPARQL queries?

asked 03 Nov '09, 21:57

Andrew ♦♦


Thomas Neumann and Gerhard Weikum have put a lot of thought into your question relative to development of their RDF-3X engine. The following document includes performance testing against multiple data sets, info on index optimizations, triples compression, etc.:

http://www.vldb.org/pvldb/1/1453927.pdf

This answer is marked "community wiki".

answered 23 Nov '09, 13:03

Steve Houser

Some stores index everything; others assume that the predicate is always present in a triple pattern. With that assumption, only two indexes are needed (POS and PSO) rather than three (e.g. SPO, POS, OSP). The saving is greater for quads.

Once indexes are created, if they are not used, they just sit on disk out of the way. They don't take cache space so don't affect query execution speed.
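To make the index trade-off concrete, here is a minimal in-memory sketch in Python. It is not how any particular store does it: the class name `TripleIndex`, the nested-dictionary layout, and the use of `None` as a wildcard are all assumptions made for illustration. It keeps the three rotations SPO, POS and OSP and picks whichever one has the bound terms of a pattern as its leading positions.

```python
from collections import defaultdict


class TripleIndex:
    """Illustrative in-memory store with three permuted indexes (SPO, POS, OSP)."""

    # Each name gives the order in which a triple's terms are nested in that index.
    ORDERS = ("spo", "pos", "osp")

    def __init__(self):
        self.indexes = {
            order: defaultdict(lambda: defaultdict(set)) for order in self.ORDERS
        }

    def add(self, s, p, o):
        """Insert the triple into every index, reordered to match that index."""
        terms = {"s": s, "p": p, "o": o}
        for order, index in self.indexes.items():
            a, b, c = (terms[k] for k in order)
            index[a][b].add(c)

    def _choose(self, bound):
        """Pick the index whose leading positions are exactly the bound terms."""
        for order in self.ORDERS:
            if set(order[:len(bound)]) == bound:
                return order
        raise AssertionError("unreachable: the three rotations cover every pattern")

    def match(self, s=None, p=None, o=None):
        """Yield (s, p, o) triples matching the pattern; None acts as a wildcard."""
        terms = {"s": s, "p": p, "o": o}
        bound = {k for k, v in terms.items() if v is not None}
        k1, k2, k3 = order = self._choose(bound)
        index = self.indexes[order]
        firsts = [terms[k1]] if terms[k1] is not None else list(index)
        for a in firsts:
            level2 = index.get(a, {})
            seconds = [terms[k2]] if terms[k2] is not None else list(level2)
            for b in seconds:
                for c in level2.get(b, ()):
                    if terms[k3] is None or terms[k3] == c:
                        found = {k1: a, k2: b, k3: c}
                        yield found["s"], found["p"], found["o"]
```

If every pattern is guaranteed to have a bound predicate, `ORDERS` could be reduced to `("pos", "pso")`, which is the two-index variant described above. A disk-based store would use sorted structures such as B-trees instead of dictionaries, but the choice of index per pattern works the same way.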

answered 05 Nov '09, 08:36

Andy Seaborne

Hi Andy, Is there, yet, a state-of-the-art design that is most often used? Perhaps there's a definitive paper or review paper that would give me guidelines on data structures, algorithms, and trade-offs to consider?

(10 Nov '09, 21:51) Andrew ♦♦

I recently wrote an article about the way we implemented a triple store at our company, Procurios. I go into the low-level details of implementing it in a relational database:

Semantic web marvels in a relational database - part I: Case Study

answered 14 Nov '09, 20:30

Patrick van ...

Hi Patrick, Thanks for the links. Although I am specifically interested in the design of a dedicated triple store, I'm also very interested in how your approach compares to such a solution in terms of space and time complexity for a standard query benchmark. Do you have such comparison figures yet? Also, just out of interest, what was your implementation language? Have you tried to use this store with an inference engine? How did it perform for that sort of use?

(15 Nov '09, 21:48) Andrew ♦♦

Hi Andrew. PHP is our implementation language. Since we don't use a semantic web query language, it's hard to compare performance with other stores. The article is just to give you some ideas. My advice would be to set up several test situations and create your own performance stats.

(17 Nov '09, 21:13) Patrick van ...

Most triple stores and frameworks are open source under liberal licenses (e.g. virtuoso, sesame, 4store, jena, openanzo, redland, semweb.net, mulgara). I think the best place to learn about building a triple store is by looking at those and spending time understanding how they work and what design decisions were made.

answered 04 Nov '09, 15:35

Ian Davis

Hi Ian, I'm motivated not only by a desire to get questions answered on this site but also to get a discussion going about the relative merits of different approaches. For example, as Josh mentioned, there seems to be no easy way to get around the need for at least 3 indexes. That's potentially a big bloat that might be off-putting for some. How can you minimize such inflation? Are there accepted encoding schemes for URLs in a triple store? I have also read elsewhere about partitioning schemes that can speed up searches...

(04 Nov '09, 21:51) Andrew ♦♦

A search on Google brought up the paper "Design and implementation of an RDF Triple Store".

If you're more the source-code-reading type, you may want to take a look at the sources of TDB, which is the native storage engine in Jena, or at the sources of Sesame.

Update: Another place where you could learn something about the topic is the BigData project. They write a lot about technical details in their blog, and it's open source too, so you can take a look at the sources.

answered 04 Nov '09, 19:06

Bastian Span...

edited 18 Nov '09, 12:10

I read the paper. Not very impressive. It does score points in the (some obscure language) stakes though. );^}>

(18 Nov '09, 10:48) Andrew ♦♦

You might have a look at a recent paper on dipLODocus.

Here's the introduction:

Despite many recent efforts, the lack of efficient infrastructures to manage RDF data is often cited as one of the key problems hindering the development of the Semantic Web. Last year at ISWC, for instance, the two industrial keynote speakers (from the New York Times and Facebook) pointed out that the lack of an open-source, efficient and scalable alternative to MySql for RDF data was the number one problem of the Semantic Web.

answered 07 Nov '12, 04:35

Fabian Cretton

edited 07 Nov '12, 14:11

Rob Vesse ♦

+1 Interesting paper, def worth a read

(07 Nov '12, 16:52) Rob Vesse ♦

http://www.openlinksw.com/weblog/oerling I've found it hard to follow at times, but you get the idea that Orri has thought a lot about it.

There have been some academic(-ish) papers here and there. I recently read a good one about a distributed triple store. I think it was about 4store(.org) but I can't remember where I found it. Anyone else know?

Otherwise, you probably have to ping the people that have built them for ideas. For instance, in the SemWeb.NET [1] triple store that I built, I found a simple MySQL structure [2] worked well enough to scale to 1B triples, though it was very space-hungry with many indexes.

[1] http://razor.occams.info/code/semweb/ [2] http://razor.occams.info/code/repo/?/semweb/src/SQLStore.cs

answered 04 Nov '09, 13:48

Joshua Tauberer

Intellidimension uses a rather intuitive solution (maybe others, too) and they said it has a major impact on performance. They maintain two triple tables:

  • Table 1 stores S, P and O as strings in three columns (or four in the case of quads).
  • Table 2 stores the same triples (also in 3 or 4 columns), but each of S, P and O is represented not as a string but by its MD5 hash value, stored as a GUID.

The big idea is that the second table is not only truly fixed-width (the content of TEXT cells is stored outside the table, which has a rather bad effect on performance) but also very narrow. They claim that the width of a SQL table strongly influences performance. Before executing a query, they calculate the MD5 values of the strings in the query and run the query with the hashed values against Table 2. This way they find the matching rows quickly, and then retrieve the real values from Table 1.
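A rough sketch of that two-table layout, using Python's built-in `sqlite3` and `hashlib` modules, might look like the following. This is only an illustration of the idea, not Intellidimension's schema: the table and column names, the shared integer `id` that links the two tables, the 16-byte BLOB standing in for a GUID, and the choice of SQLite are all assumptions.

```python
import hashlib
import sqlite3


def md5_key(term: str) -> bytes:
    """Fixed-width 16-byte key for a term (MD5 as an identifier, not for security)."""
    return hashlib.md5(term.encode("utf-8")).digest()


def create_store() -> sqlite3.Connection:
    """Create the wide text table and the narrow hash table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE triples_text (id INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT);
        CREATE TABLE triples_hash (id INTEGER PRIMARY KEY, s BLOB, p BLOB, o BLOB);
        CREATE INDEX hash_spo ON triples_hash (s, p, o);
        CREATE INDEX hash_pos ON triples_hash (p, o, s);
        CREATE INDEX hash_osp ON triples_hash (o, s, p);
    """)
    return conn


def add_triple(conn, s, p, o):
    """Store the strings in Table 1 and their hashes, under the same id, in Table 2."""
    cur = conn.execute(
        "INSERT INTO triples_text (s, p, o) VALUES (?, ?, ?)", (s, p, o)
    )
    conn.execute(
        "INSERT INTO triples_hash (id, s, p, o) VALUES (?, ?, ?, ?)",
        (cur.lastrowid, md5_key(s), md5_key(p), md5_key(o)),
    )


def match(conn, s=None, p=None, o=None):
    """Filter on the hash table, then fetch the real strings from the text table."""
    clauses, params = [], []
    for column, term in (("s", s), ("p", p), ("o", o)):
        if term is not None:
            clauses.append(f"h.{column} = ?")
            params.append(md5_key(term))
    where = " AND ".join(clauses) or "1"
    sql = (
        "SELECT t.s, t.p, t.o FROM triples_hash AS h "
        "JOIN triples_text AS t ON t.id = h.id WHERE " + where
    )
    return conn.execute(sql, params).fetchall()
```

The query terms are hashed before the lookup, so all comparisons run against fixed-width keys, and the text table is only touched for rows that already matched. A real implementation would also need a policy for hash collisions, which this sketch ignores.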

answered 23 Mar '10, 12:48

ROWLEX Admin

Get 'Programming the Semantic Web' by Toby Segaran et al.

http://shop.oreilly.com/product/9780596153823.do

The authors build a simple triplestore using Python and explain how it works. Then they use RDFLib and SQLite before moving on to Sesame, Jena etc.

answered 06 Nov '12, 13:14

laserblue

I really wonder whether, once you have built an application over your triplestore and know the queries you're running, it would be possible to "trim" from the indexes the "things" that your queries will never use. That would remove the unnecessary bloat; think of it as a "database packing" operation.

If trimming is not doable, what about re-indexing the data while skipping what won't be used, with the no-skip list generated from a collection of your SPARQL queries?
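One way to picture the "no-skip list" is as the smallest set of index orderings that still serves every triple pattern your queries use. The Python sketch below is a simplification under stated assumptions: the patterns are given as plain tuples rather than parsed out of SPARQL, a leading `?` marks a variable, and the function names are made up for illustration.

```python
from itertools import combinations, permutations

# All six possible orderings: spo, sop, pso, pos, osp, ops.
ALL_ORDERS = ["".join(p) for p in permutations("spo")]


def bound_positions(pattern):
    """Return the positions ('s', 'p', 'o') that hold constants in a pattern."""
    return frozenset(
        pos for pos, term in zip("spo", pattern) if not term.startswith("?")
    )


def covers(order, bound):
    """An index serves a pattern when the bound positions form a prefix of its order."""
    return set(order[:len(bound)]) == set(bound)


def minimal_indexes(patterns):
    """Smallest set of index orders serving every pattern (brute force over subsets)."""
    needed = {bound_positions(p) for p in patterns}
    for size in range(1, len(ALL_ORDERS) + 1):
        for subset in combinations(ALL_ORDERS, size):
            if all(any(covers(o, b) for o in subset) for b in needed):
                return list(subset)
```

For an application whose queries only ever contain patterns like `("?s", "foaf:knows", "?o")` and `("ex:alice", "foaf:knows", "?o")`, this reports that a single PSO index is enough, so the other orderings could be skipped when re-indexing. The price is that any new query shape outside the collected set may need a full scan or a rebuild.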

answered 04 Nov '09, 23:24

Laurian Grid...

Thanks, Laurian. For this question I'm more interested in how to construct a general purpose triple store, so I am also more interested in generally applicable tuning techniques as well...

(10 Nov '09, 21:48) Andrew ♦♦
question asked: 03 Nov '09, 21:57

question was seen: 12,017 times

last updated: 07 Nov '12, 16:52