There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you though, most of the high-traffic websites on the web (google, digg, facebook) rely on far cheaper and better strategies: the core of which is called sharding.
What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.
Sharding is the separation of your data across multiple servers. How you separate your data is up to you, but generally it’s done on some fundamental identifier. For example, if we were building a hosted bug tracking site, our data model would likely look something like:
Every Client is pretty much isolated from all other Clients. So if we put all of Client 1’s data on Server 1 and Client 2’s data on Server 2, our system will run just fine. This scales out horizontally infinitely well (there’s little to no overhead). Our first 500 clients can all go on our first server, at which point we can introduce a second database server and place our next 500 clients. Servers need only be added when actually needed, and there’s no need for management servers, load balancers or anything else – just straight database connections.
One of the disadvantages of sharding is that it does impact your code. You need to figure out which database to connect to. For our simple scenario above, this isn’t too difficult:
This is a simplified example, but should be pretty easy to expand on. Another approach is to use a modulus to figure out which connection string to use, something like:
using (SqlConnection connection = GetConnection(clientId))
private static SqlConnection GetConnection(int clientId)
if (clientId <= 500)
connectionString = _connectionStrings;
connectionString = _connectionStrings;
return new SqlConnection(connectionString);
return new SqlConnection(connectingString[clientId % _connectingString.Length]);
This brings up another problem with sharding (a big one) – repartitioning your data. If we pick the above modulus algorithm with 2 servers and 2 clients then:
Client 2 will be associated to ConnectionString (2 % 2 == 0)
Client 1 will be associated to ConnectionString (1 % 2 == 1)
If we now add a bunch of clients along with a 3rd server, then our code expects to find Client 2 on a different server (2 % 3 == 2). Essentially what this means is that you’ll need a repartitioning strategy – whether that’s an advanced connection manager configuration approach, or bulk copy scripts. The good news is that all of this should be deep inside your data layer and completely hidden from your calling code. There are many ways to handle this, pick whatever seems simplest.
The last hurdle to overcome is actually sharding your data. Our bug hosted example was pretty straightforward, but even it has limitations. When a client creates a new account they are asked to submit their subdomain of choice. We need to check whether that subdomain is available or not – which isn’t trivial since our data is spread all around. Similarly, when a user logs in, we don’t yet know which client they belong to, therefore we can’t figure out which database server to hit for authentication. In such cases, rather than sharding data on a key, you shard on purpose. Essentially, this means you have a database dedicated to your Users table, as well as a ClientHost table which does nothing more than provide a single place to look up whether a host is available or not. Again, this is something that your data access layer must be aware of.
Despite these issues, sharding is my preferred database scaling choice by far. All the issues can be fixed with a bit of code deep within your data layer. The performance advantage AND cost advantage make it a no-brainer. The only reason to consider clustering is for high availability scenarios, or in cases where your bottleneck is data that cannot be easily split. Also, keep in mind that sharding typically plays nice with replication or clustering, so these aren’t necessarily exclusive strategies.