I'm creating a system that I want to be able to scale up to trillions of triples, and I have serious doubts about how to achieve this using existing triple stores. What should I avoid to ensure I never cripple myself in ways that are not apparent until the data has grown? I would prefer to bias my designs (now) in favour of fast query time (i.e. to better serve responsive web pages).

Is SPIN regarded as the fastest form of rules implementation? Should I partition my graphs one way or another? Should I be creating several SERVICEs to allow parallel querying, or will the existing implementations (of highly scalable triple stores) take care of all that? Should I stick to a specific OWL subset to ensure good inference and query performance? What have your experiences been in refactoring existing large ontologies and datasets to rectify performance issues?


The question initially mentioned AllegroGraph on EC2, but really, it doesn't matter what platform I use if I can get acceptable performance at the scale I aspire to.

Further discussion on this topic can also be found here.

asked 27 May '12, 21:53

Andrew's gravatar image

Andrew ♦♦
accept rate: 30%

edited 01 Aug '12, 19:59

There are 2 dimensions to performance, capacity and scalability which are both interelated.

  • Firstly you want to ensure a level of capacity and scalability with respects to the amount of data you can store and query.

  • Secondly you want to ensure a level of performance, capacity and scalability with respects to the number of users who can concurrently query that data.

Fortunately, the features of Semantic Web support both and irrespective of the native features of allegrograph, the diagram below illustrates this:

alt text

The basic premise behind this solution is the Shard and Federate design pattern. Here you have a load balanced federation cluster that federates and delegates incoming SPARQL queries to the other shards of the complete dataset. Its concern is only to federate an provide the final solutions to the client. As an additional feature you might want the federation cluster to know what shards (SPARQL endpoints) hold what datasets (graphs), thereby reducing the need for the HTTP client to know what shards (SPARQL endpoints) they need to query for their datasets. This is especially important if the number of datasets and sparql endpoints in going to expand and collapse quite frequently.

Individual shards are load balanced with the option to persist data on a storage area network (a bit old school I know).

As a paradigm for application:

  • Load balancers in the diagram = EC2 load balancer
  • SAN in diagram = S3 bucket
  • Federation cluster = Federation cloud formation
  • Shard x cluster = Shard cloud formation

Alternatively, look at using Chef to manage cluster/clouds. My general approach to configuration management is to avoid a system that relies on configuration (a well thought out convention works better).

Finally you also touch upon updating data and providing reasoning (the 2 again are slightly interelated). Depending on whether you use a SAN (shared S3 bucket or not) will determine how this happens. If you do use a shared S3 bucket, then you might follow the Milton and Keynes paradigm (master slave, updates go to the slave, slave becomes master). If you aren't then you might want to follow a star topology or chain topology, with nodes dropping out of the cluster and updating and then re-entering the cluster. If you insist on supporting inference, then this becomes a whole lot more complicated if you deside to forward chain (infact you might need to refactor the architecture quite considerably), but shouldn't be affected by backward chaining. Additionally, take a look at the native features of allegrograph and see how they align.

permanent link

answered 28 May '12, 13:53

William%20Greenly's gravatar image

William Greenly
accept rate: 13%

edited 28 May '12, 15:51

Thanks for the detailed answer.

Sadly, I do need inference - it's the whole point of my project - how should inference change my design of the architecture?

@database_animal also has a very good point - until we start generating revenue, we can't have arbitrary scalability. Are there formalisms I should avoid (at least initially) so that I can reason more efficiently?

(28 May '12, 19:04) Andrew ♦♦ Andrew's gravatar image

Ok - so the issue with reasoning using the above architecture is that sharding makes it difficult to accomplish this as often you want to reason over the complete dataset and not individual shards. You might be able to work around this by query rewriting or by ensuring that any shard is complete WRT reasoning.

(29 May '12, 06:59) William Greenly William%20Greenly's gravatar image

I'm going to add a 3rd dimension to what @William says.

Before you build a system out you need some sense of what it is going to cost per unit value it delivers.

To take a simple example, Facebook has an ARPU (average revenue per user) of about $5 a year. Salesforce.com has an ARPU that's closer to $1000 per year. Whatever else Facebook does, it can't be spending more than $5 a user in hardware costs. This impacts the scaling of the business because the cost increases as you add users, unlike, say, software development, which can be amortized against a growing user base.

When you fetishize 'scaling' at the expense of efficiency, you're just building a system that will make you go broke faster on a bigger scale. Rather than just burning through your own bank account, you can burn through the bank accounts of some venture capitalists too.

I'm speccing out a system which needs very fast U.I. responsiveness and I can say the production system is not going to use a triple store for certain queries, if it uses one at all, because I need to know these come back in less than a second.

It's important to know what specific scale you have in mind and compare that to what others are doing. Sindice, using a hybrid system, handles 50 billion heterogenous triples. The people at Franz have gotten 1 trillion triples into Allegrograph but that took a machine with 240 cores, 1.2 TB of RAM, two weeks of load time, and certainly close attention from people who understand the product very well.

permanent link

answered 28 May '12, 16:11

database_animal's gravatar image

database_animal ♦
accept rate: 15%

edited 28 May '12, 19:03

I understand that issue - I've spent the last few years fetishizing performance in order that I need not scale out so far. In this case, my business model depends on a rich and suitably abstract 'upper' ontology giving me the scope to make complex inferences about users.

Larger sites in the market I'm targeting have hundreds of millions of users (no, it's not social networking!) And I expect to gather 1000s of facts about each user. I may be able to depend on simpler and more efficient entailment regimes, but if there are architectural solutions that allow richer models...

(28 May '12, 18:56) Andrew ♦♦ Andrew's gravatar image

A big question is: how often do you look at more than one user at a time? If the answer is "not very", then scalability in terms of # of users isn't a problem.

It's very possible you could crunch user-user data to produce an optimized knowledge base that lets you apply learnings from other users to an individual.

(29 May '12, 14:36) database_animal ♦ database_animal's gravatar image

[1/2] I will be comparing users against each other, and running potentially complex aggregate functions across the whole population. Ideally, I'd like to provide a multifaceted view of each user - providing an assessment of them in context - which [I suppose] demands that I perform my inferences on demand.

(30 Jul '12, 23:57) Andrew ♦♦ Andrew's gravatar image

[2/2] Otherwise I could pull out the raw data, shove it into some almighty Hadoop cluster and materialize the inferences that it can draw for me. Which would also be a much better way to keep costs down, since I can decommission a cluster after the run. Alas, I don't think that stale, canned, data would be a very compelling proposition in this case.

(30 Jul '12, 23:57) Andrew ♦♦ Andrew's gravatar image
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:


question asked: 27 May '12, 21:53

question was seen: 2,342 times

last updated: 01 Aug '12, 19:59