Sponsored By Aspose - File Format APIs for .NET

Aspose are the market leader of .NET APIs for file business formats – natively work with DOCX, XLSX, PPT, PDF, MSG, MPP, images formats and many more!

Scale Cheaply – Sharding

There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you though, most of the high-traffic websites on the web (google, digg, facebook) rely on far cheaper and better strategies: the core of which is called sharding.


What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.


Sharding is the separation of your data across multiple servers. How you separate your data is up to you, but generally it’s done on some fundamental identifier. For example, if we were building a hosted bug tracking site, our data model would likely look something like:



Every Client is pretty much isolated from all other Clients. So if we put all of Client 1’s data on Server 1 and Client 2’s data on Server 2, our system will run just fine. This scales out horizontally infinitely well (there’s little to no overhead). Our first 500 clients can all go on our first server, at which point we can introduce a second database server and place our next 500 clients. Servers need only be added when actually needed, and there’s no need for management servers, load balancers or anything else – just straight database connections.


One of the disadvantages of sharding is that it does impact your code. You need to figure out which database to connect to. For our simple scenario above, this isn’t too difficult:


using (SqlConnection connection = GetConnection(clientId))
{

}
private static SqlConnection GetConnection(int clientId)
{
string connectionString;
if (clientId <= 500)
{
connectionString = _connectionStrings[0];
}
else
{
connectionString = _connectionStrings[1];
}
return new SqlConnection(connectionString);
}
This is a simplified example, but should be pretty easy to expand on. Another approach is to use a modulus to figure out which connection string to use, something like:

return new SqlConnection(connectingString[clientId % _connectingString.Length]);

This brings up another problem with sharding (a big one) – repartitioning your data. If we pick the above modulus algorithm with 2 servers and 2 clients then:
     Client 2 will be associated to ConnectionString[0] (2 % 2 == 0)
     Client 1 will be associated to ConnectionString[1] (1 % 2 == 1)


If we now add a bunch of clients along with a 3rd server, then our code expects to find Client 2 on a different server (2 % 3 == 2). Essentially what this means is that you’ll need a repartitioning strategy – whether that’s an advanced connection manager configuration approach, or bulk copy scripts. The good news is that all of this should be deep inside your data layer and completely hidden from your calling code. There are many ways to handle this, pick whatever seems simplest.


The last hurdle to overcome is actually sharding your data. Our bug hosted example was pretty straightforward, but even it has limitations. When a client creates a new account they are asked to submit their subdomain of choice. We need to check whether that subdomain is available or not – which isn’t trivial since our data is spread all around. Similarly, when a user logs in, we don’t yet know which client they belong to, therefore we can’t figure out which database server to hit for authentication. In such cases, rather than sharding data on a key, you shard on purpose. Essentially, this means you have a database dedicated to your Users table, as well as a ClientHost table which does nothing more than provide a single place to look up whether a host is available or not. Again, this is something that your data access layer must be aware of.


Despite these issues, sharding is my preferred database scaling choice by far. All the issues can be fixed with a bit of code deep within your data layer. The performance advantage AND cost advantage make it a no-brainer. The only reason to consider clustering is for high availability scenarios, or in cases where your bottleneck is data that cannot be easily split. Also, keep in mind that sharding typically plays nice with replication or clustering, so these aren’t necessarily exclusive strategies.

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

22 Responses to Scale Cheaply – Sharding

  1. Colin Jack says:

    Great write up, was wondering it you’d tried the sharing support in NHibernate and whether you’d recommend any general sharding guides/best practices.

  2. LockeVN says:

    IMO, I think this technique good at small and rather not compicated database schema. It will be better in some case, and it should be used with other cluster, replication technique …. Use it like a tweak, not like a methodology for your app, and, consider your app complicated issue.

    Some problems you said it not huge, not big, is really BIG. I think it’s because of your point of view, a developer.

    — some common normal SQL queries will not work. Can’t join, search, make report ….. Eg: in a blog, you shard the user, you need to get post of your friend, your friend of friend ….

    — Rolling database, update changes, maintain, backup, … cost a lot of time, and you must do by yourself (by your script). And, the important issue is: “the DBA must know what you do, how data is, where it located,what their relations …” instead of “which DB to maintain, maintain it”

    — I don’t think our boss, our company allow us to change the code each time their app grow. They will prefer change, or reconfig the DB, the load balancer, … not the application. With you, the process of: change code, recompile (or interprete), re-test, re-deploy … is simple, but IMO, I think it is a big pain in software process (I’m not the boss lol)

    — repartitioning is really a big pain. Not easy as you said. How can we assume that our repartitioning script run without error (because of different config of several DB server, because of ACID, because of quality of our data). And, we couldn’t undo if we failed.

    — This technique move the “select data from” processing load from (3rd) DB cluster layer to YOUR application layer, from DB cluster machine to applcation machine.
    — There are some other issues I can’t put here, but you can google “federated database” . This technique is similar to Microsoft idea put in SQL Server (from MSSQL2000 and prior, I don’t know about 2005 and 2008 version)

    Sorry for straight word, and my poor English 😀

  3. excellent article. real eye opener. I am right now in a position where I for security reasons need to partition a database. So there can actually be other reasons for sharding a database other than pure performance.

  4. simon says:

    写的非常好~~

  5. Well, I think MySQL people think about sharding more because MySQL is so damn slow on large databases, even if you read, memorize and use all possible performance optimizations.

    This is just my 2 cents. While I am considering sharding for my future requirements, I am not ever returning to MySQL (DB2 Express-C may be a good free choice).

  6. I would strongly suggest that you look at our database. We have a version, ScimoreDB Distributed, which does all the “sharding” for you. You code against a single db. When you want to distribute over multiple machines, just specify partitioning key, and ScimoreDB will distributed the data for you.
    And all this, while your normal SQL queries still work. Not a single line in your app have to be changed.
    Read more on:
    http://www.scimore.com/products/distributed.aspx
    Give it a try.

    Another thing. Its free.

    /thomas (hilbert atthething scimore.com)

  7. Scott says:

    I think the reason the MS SQL community hasn’t completely embraced it but the MYSQL community has is the price difference.

    If you shard MYSQL, you are out the hardware and maintenance costs. If you shard MSSQL, you are out hardware, maintenance AND really expensive and confusing licensing costs.

    Ayende really opened my eyes at ALT.NET Seattle to the sharding possibilities of NHibernate.

  8. karl says:

    @D:
    I always did want to try a shard of SQL Server Express databases :)

  9. D says:

    SQL Server licenses are too expensive for such partitioning

  10. karl says:

    @foobar:
    Never said it was new, but no doubt it will be to some people. Thanks though. Darn, was hoping I’d win a turing award for this.

    @Dave:
    I talked about that a bit in the blog post. If it’s a one off query, just go through all of them. For reporting, we use to run nightly scripts that would re-merge all the data onto a single db. Of course, your millage with that will vary based on how critical/intense your reports are. For frequent queries, sharding might not be right. Maybe it isn’t right for that particular set of data, or maybe it isn’t right for all of your data.

  11. Karthik says:

    Well, the approach is not “the silver bullet”, works well in certain situations. We have been using the approach since quite some time, and it works for us since we dont have to share data between clients.

  12. Gunt says:

    Dave, obviously the author is sharding himself over that question at the moment, maybe he will get back to you soon.

  13. Da ve says:

    Sharding is great, but how do you search across, say 5 sharded databases where you have 5 Client tables ?

  14. Rick says:

    Thanks, Karl, this was a very nice article.

    I would offer a disadvantage– if you ever need to do analysis on your data (or decide to do some ELT type logic), joins aren’t going to work across db instances.

    Otherwise, I like the idea of sharding, too.

    Rick

  15. foobar says:

    This is hardly new or innovative. I have textbooks from the early 1990s describing this technique!

  16. karl says:

    Good point Bryan, shouldn’t have missed this one…changes need to be deployed to all servers, which again isn’t a huge deal (you can script something to take care of it), it’s yet another extra step.

  17. Rolling out DB changes is an issue.

  18. karl says:

    Joshua: I’ve dealt with clusters with both SQL Server and MySQL, and if your data supports it, I’m of the opinion that this solution scales better regardless of cost. I also think it takes less developer time. I know I made it sound like there’s a lot of accommodation in-code that need to happen, but clusters don’t set themselves up or maintain themselves either. Sure it might be more transparent from your code’s point of view, but that’s just because someone’s spending time making it that way for you.

    I think I know what you’re getting at. The current trend is that hardware is cheap and developers aren’t. But “just throwing more hardware at the problem” is a little too high level for my taste. You have to have some type of architecture in place for it (be that shards, clusters, replication or a mix). Anyways, in my experience, software costs tend to dwarf both hardware and developer costs.

  19. Darren Kopp says:

    Our company uses this approach as well. We have a shared code base that every customers website uses. To track what db / server they are on we just have a virtual directory in which we store a file that has the information. works out well for us.

    We have the information stored in a global database as to which server / db the user is on as well, because we have our web site deployments scripted, but we use the file (which is created via the script as well) so that if for whatever reason the global database isn’t available, none of our clients go down. the world of world of N + 1 redundancy is what influenced that decision.

  20. Is this assuming that developer time (through the life of the product) and increased complexity of code is cheaper than hardware and db licenses?
    Or is this stating that this solution is better (more scalable), regardless of cost?

  21. SQLDenis says:

    If you implement distributed partitioned views you would not have to make any code changes. However I don’t think that this is available with all RDBMS vendors

  22. Jorge Diaz Tambley says:

    Good post, up to now what I’ve seen are only hardware upgrades. This technique is cheaper.

    Thanks.