I'm creating a system that I want to be able to scale up to trillions of triples, and I have serious doubts about how to achieve this using existing triple stores. What should I avoid to ensure I never cripple myself in ways that are not apparent until the data has grown? I would prefer to bias my designs (now) in favour of fast query time (i.e. to better serve responsive web pages).
Is SPIN regarded as the fastest form of rules implementation? Should I partition my graphs one way or another? Should I be creating several SERVICEs to allow parallel querying, or will the existing implementations (of highly scalable triple stores) take care of all that? Should I stick to a specific OWL subset to ensure good inference and query performance? What have your experiences been in refactoring existing large ontologies and datasets to rectify performance issues?
The question initially mentioned AllegroGraph on EC2, but really, it doesn't matter what platform I use if I can get acceptable performance at the scale I aspire to.
Further discussion on this topic can also be found here.
There are two dimensions to performance, capacity and scalability, and the two are interrelated.
Fortunately, the features of the Semantic Web support both. Irrespective of the native features of AllegroGraph, the diagram below illustrates this:
The basic premise behind this solution is the shard-and-federate design pattern. Here you have a load-balanced federation cluster that federates and delegates incoming SPARQL queries to the shards of the complete dataset. Its only concern is to federate and provide the final solutions to the client. As an additional feature, you might want the federation cluster to know which shards (SPARQL endpoints) hold which datasets (graphs), so that the HTTP clients don't need to know which shards to query for their datasets. This is especially important if the number of datasets and SPARQL endpoints is going to expand and contract quite frequently.
Individual shards are load balanced, with the option to persist data on a storage area network (a bit old school, I know).
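The graph-aware routing described above can be sketched roughly as follows. Everything here is illustrative: the endpoint URLs, graph IRIs, and the `run_query` callable (which in practice would POST the SPARQL query over HTTP) are made up, and the point is only the shape of the shard registry and the parallel fan-out.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard registry: which SPARQL endpoints hold which named graphs.
SHARDS = {
    "http://shard1.example.org/sparql": {"http://example.org/graphs/products"},
    "http://shard2.example.org/sparql": {"http://example.org/graphs/products",
                                         "http://example.org/graphs/orders"},
}

def endpoints_for(graph):
    """Return only the shards that actually hold the requested graph."""
    return sorted(ep for ep, graphs in SHARDS.items() if graph in graphs)

def federate(graph, run_query):
    """Fan a query out to the relevant shards in parallel and merge solutions.

    `run_query(endpoint)` is supplied by the caller; keeping it abstract
    means the routing logic can be exercised without a live cluster.
    """
    targets = endpoints_for(graph)
    with ThreadPoolExecutor(max_workers=len(targets) or 1) as pool:
        results = pool.map(run_query, targets)
    merged = []
    for solutions in results:
        merged.extend(solutions)
    return merged
```

The design choice worth noting is that the federation layer holds only the graph-to-shard mapping, not the data, so shards can join and leave by updating the registry.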
As a paradigm for application management, look at using Chef to manage the cluster/cloud. My general approach to configuration management is to avoid a system that relies on configuration; a well-thought-out convention works better.
Finally, you also touch upon updating data and providing reasoning (again, the two are slightly interrelated). Whether or not you use a SAN (or a shared S3 bucket) will determine how this happens. If you do use a shared S3 bucket, then you might follow the Milton and Keynes paradigm (master/slave: updates go to the slave, the slave becomes master). If you don't, then you might want to follow a star or chain topology, with nodes dropping out of the cluster, updating, and then re-entering the cluster. If you insist on supporting inference, this becomes a whole lot more complicated if you decide to forward chain (in fact, you might need to refactor the architecture quite considerably), but it shouldn't be affected by backward chaining. Additionally, take a look at the native features of AllegroGraph and see how they align.
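The drop-out/update/re-enter cycle for a chain topology can be sketched as below. This is a minimal sketch under assumed names: `cluster` stands in for the load-balanced pool and `apply_update(node)` for the actual data load on one node; real cluster membership would of course go through the load balancer's API, not a Python list.

```python
def rolling_update(cluster, apply_update):
    """Apply an update to each node in turn while the rest keep serving.

    `cluster` is a mutable list of node names; `apply_update(node)` performs
    the data load on one node. Both names are illustrative.
    """
    for node in list(cluster):       # iterate over a snapshot of membership
        cluster.remove(node)         # node leaves the load-balanced pool
        apply_update(node)           # update happens outside the serving path
        cluster.append(node)         # node re-enters with fresh data
    return cluster
```

At every step all but one node remain in the pool, which is why this approach avoids downtime but is awkward to combine with forward chaining: materialised inferences would have to be rebuilt on each node as it cycles through.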
I'm going to add a 3rd dimension to what @William says.
Before you build a system out you need some sense of what it is going to cost per unit value it delivers.
To take a simple example, Facebook has an ARPU (average revenue per user) of about $5 a year. Salesforce.com has an ARPU that's closer to $1000 per year. Whatever else Facebook does, it can't be spending more than $5 a user in hardware costs. This impacts the scaling of the business because the cost increases as you add users, unlike, say, software development, which can be amortized against a growing user base.
When you fetishize 'scaling' at the expense of efficiency, you're just building a system that will make you go broke faster on a bigger scale. Rather than just burning through your own bank account, you can burn through the bank accounts of some venture capitalists too.
I'm speccing out a system which needs very fast UI responsiveness, and I can say the production system is not going to use a triple store for certain queries, if it uses one at all, because I need to know they come back in less than a second.
It's important to know what specific scale you have in mind and compare that to what others are doing. Sindice, using a hybrid system, handles 50 billion heterogeneous triples. The people at Franz have gotten 1 trillion triples into AllegroGraph, but that took a machine with 240 cores, 1.2 TB of RAM, two weeks of load time, and certainly close attention from people who understand the product very well.