CodeBetter.Com

Karl Seguin

.NET From Ottawa, Ontario

June 2008 - Posts

  • Scale Cheaply - Sharding

    There are a lot of expensive ways to scale your database – all of which are highly touted by the big three database vendors because, well, they want to sell you all types of really expensive stuff. Despite what an “engagement consultant” might tell you, though, most of the high-traffic websites on the web (Google, Digg, Facebook) rely on far cheaper and better strategies, the core of which is called sharding.

    What’s really astounding is that sharding is database agnostic – yet only the MySQL crowd seem to really be leveraging it. The sales staff at Microsoft, IBM and Oracle are doing a good job selling us expensive solutions.

    Sharding is the separation of your data across multiple servers. How you separate your data is up to you, but generally it’s done on some fundamental identifier. For example, if we were building a hosted bug tracking site, our data model would likely look something like:

    Every Client is pretty much isolated from all other Clients. So if we put all of Client 1’s data on Server 1 and Client 2’s data on Server 2, our system will run just fine. This scales out horizontally extremely well (there’s little to no overhead). Our first 500 clients can all go on our first server, at which point we can introduce a second database server and place our next 500 clients there. Servers need only be added when actually needed, and there’s no need for management servers, load balancers or anything else – just straight database connections.

    One of the disadvantages of sharding is that it does impact your code. You need to figure out which database to connect to. For our simple scenario above, this isn’t too difficult:

     
    // Calling code: the data layer picks the right server for this client.
    using (SqlConnection connection = GetConnection(clientId))
    {
     ...
    }

    // Clients 1 through 500 live on the first server, everyone else on the second.
    private static SqlConnection GetConnection(int clientId)
    {
       string connectionString;
       if (clientId <= 500)
       {
          connectionString = _connectionStrings[0];
       }
       else
       {
          connectionString = _connectionStrings[1];
       }
       return new SqlConnection(connectionString);
    }
    
    This is a simplified example, but should be pretty easy to expand on. Another approach is to use a modulus to figure out which connection string to use, something like:
     
    return new SqlConnection(_connectionStrings[clientId % _connectionStrings.Length]);
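
    Both lookups generalize to any number of servers. What follows is only a rough sketch: ShardRouter, ClientsPerServer and the method names are made up for illustration, and it returns the connection string instead of opening a SqlConnection so that it stays self-contained.

```csharp
using System;

// Hypothetical helper: maps a client id to one of the configured connection strings.
public static class ShardRouter
{
    // Assumed capacity per server, matching the "500 clients per server" example.
    public const int ClientsPerServer = 500;

    // Range-based: clients 1-500 -> index 0, 501-1000 -> index 1, and so on.
    public static string ByRange(int clientId, string[] connectionStrings)
    {
        if (clientId < 1) throw new ArgumentOutOfRangeException("clientId");
        int index = (clientId - 1) / ClientsPerServer;
        if (index >= connectionStrings.Length)
        {
            throw new InvalidOperationException("No server configured for client " + clientId);
        }
        return connectionStrings[index];
    }

    // Modulus-based: spreads clients evenly across however many servers exist.
    public static string ByModulus(int clientId, string[] connectionStrings)
    {
        if (clientId < 0) throw new ArgumentOutOfRangeException("clientId");
        return connectionStrings[clientId % connectionStrings.Length];
    }
}
```

    The range version only ever touches a new server once the previous one is full, so existing clients never move. The modulus version balances load better, at the price discussed next.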
    

    This brings up another problem with sharding (a big one) – repartitioning your data. If we pick the above modulus algorithm with 2 servers and 2 clients then:
         Client 1 will be associated to _connectionStrings[1] (1 % 2 == 1)
         Client 2 will be associated to _connectionStrings[0] (2 % 2 == 0)

    If we now add a bunch of clients along with a 3rd server, then our code expects to find Client 2 on a different server (2 % 3 == 2). Essentially what this means is that you’ll need a repartitioning strategy – whether that’s an advanced connection manager configuration approach, or bulk copy scripts. The good news is that all of this should be deep inside your data layer and completely hidden from your calling code. There are many ways to handle this; pick whatever seems simplest.
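
    To get a feel for how much data a modulus scheme forces you to move, the following sketch counts the clients whose server index changes when the number of servers changes. RepartitionEstimator and CountMoved are hypothetical names, and the sketch assumes client ids run from 1 to clientCount without gaps.

```csharp
using System;

// Hypothetical helper for estimating repartitioning cost under modulus sharding.
public static class RepartitionEstimator
{
    // Counts clients (ids 1..clientCount) whose shard index differs
    // between the old and the new number of servers.
    public static int CountMoved(int clientCount, int oldServers, int newServers)
    {
        int moved = 0;
        for (int clientId = 1; clientId <= clientCount; clientId++)
        {
            if (clientId % oldServers != clientId % newServers)
            {
                moved++;
            }
        }
        return moved;
    }
}
```

    With 6 clients, going from 2 to 3 servers moves 4 of them (only clients 1 and 6 stay put), so roughly two thirds of the data has to be copied in that case.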

    The last hurdle to overcome is actually sharding your data. Our hosted bug tracking example was pretty straightforward, but even it has limitations. When a client creates a new account they are asked to submit their subdomain of choice. We need to check whether that subdomain is available or not – which isn’t trivial since our data is spread all around. Similarly, when a user logs in, we don’t yet know which client they belong to, therefore we can’t figure out which database server to hit for authentication. In such cases, rather than sharding data on a key, you shard on purpose. Essentially, this means you have a database dedicated to your Users table, as well as a ClientHost table which does nothing more than provide a single place to look up whether a host is available or not. Again, this is something that your data access layer must be aware of.
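
    One way to picture sharding on purpose alongside sharding on a key is a small connection manager in the data layer. This is a sketch under assumptions: ConnectionManager and its method names are hypothetical, and it hands back connection strings rather than open connections.

```csharp
using System;

// Hypothetical connection manager that shards by purpose as well as by key.
public class ConnectionManager
{
    private readonly string _usersDb;        // dedicated Users database
    private readonly string _clientHostDb;   // dedicated ClientHost lookup database
    private readonly string[] _clientShards; // per-client data, sharded on client id

    public ConnectionManager(string usersDb, string clientHostDb, string[] clientShards)
    {
        _usersDb = usersDb;
        _clientHostDb = clientHostDb;
        _clientShards = clientShards;
    }

    // Authentication happens before the client is known, so it always goes to one place.
    public string ForUsers()
    {
        return _usersDb;
    }

    // Subdomain availability checks also need a single place to look.
    public string ForClientHosts()
    {
        return _clientHostDb;
    }

    // Everything else is routed on the client id.
    public string ForClient(int clientId)
    {
        return _clientShards[clientId % _clientShards.Length];
    }
}
```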

    Despite these issues, sharding is my preferred database scaling choice by far. All the issues can be fixed with a bit of code deep within your data layer. The performance advantage AND cost advantage make it a no-brainer. The only reason to consider clustering is for high availability scenarios, or in cases where your bottleneck is data that cannot be easily split. Also, keep in mind that sharding typically plays nice with replication or clustering, so these aren't necessarily exclusive strategies.

  • Foundations of Programming Ebook

    I'm excited to finally release the official, and completely free, Foundations of Programming EBook. This essentially contains all 9 Foundation parts, including a conclusion and some typical book fluff (table of contents, acknowledgements and so on). A number of spelling errors were corrected, along with some small technical changes and clarifications - largely based on feedback, so thanks to everyone who provided it! Otherwise it's exactly the same as what's been posted here over the past several months.

    Download it from http://codebetter.com/files/folders/codebetter_downloads/entry179694.aspx

    Download the Learning Application from: http://codebetter.com/blogs/karlseguin/archive/2008/07/18/foundations-of-programming-learning-application.aspx

    If the above link fails, you can also get it from http://www.openmymind.net/FoundationsOfProgramming.pdf

    Posted Jun 24 2008, 09:53 PM by karl with 84 comment(s)
  • Foundations of Programming - pt 9 - Proxy This and Proxy That

    Few keywords are as simple yet amazingly powerful as virtual in C# (Overridable in VB.NET). When you mark a method as virtual you allow an inheriting class to override the behavior. Without this functionality, inheritance and polymorphism wouldn't be of much use. A simple example, slightly modified from Programming Ruby (ISBN: 978-0-9745140-5-5), in which a KaraokeSong overrides a Song's to_s (ToString) function, looks like:
    class Song
       def to_s
          return sprintf("Song: %s, %s (%d)", @name, @artist, @duration)
       end
    end
    
    class KaraokeSong < Song
       def to_s
          return super + " - " + @lyrics
       end
    end
    

    The above code shows how the KaraokeSong is able to build on top of the behavior of its base class. Specialization isn't just about data, it's also about behavior!

    Even if your Ruby is a little rusty, you might have picked up that the base to_s method isn't marked as virtual. That's because many languages, including Java, make methods virtual by default. This represents a fundamental difference of opinion between the Java language designers and the C#/VB.NET language designers. In C# methods are final by default and developers must explicitly allow overriding (via the virtual keyword). In Java, methods are virtual by default and developers must explicitly disallow overriding (via the final keyword).
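
    To make the C# side of that difference concrete, here is a small sketch. The Describe and Title members are invented for the example; the point is only to contrast a non-virtual method, which a subclass can hide with the new modifier, with a virtual one, which it can override.

```csharp
using System;

public class Song
{
    // Not virtual: subclasses can only hide this method, not override it.
    public string Describe()
    {
        return "Song";
    }

    // Virtual: subclasses may override, and calls dispatch on the runtime type.
    public virtual string Title()
    {
        return "Song title";
    }
}

public class KaraokeSong : Song
{
    // 'new' hides Song.Describe; callers holding a Song reference still get Song's version.
    public new string Describe()
    {
        return "KaraokeSong";
    }

    // A true override that builds on the base behavior, like the Ruby example.
    public override string Title()
    {
        return base.Title() + " - lyrics";
    }
}
```

    Given Song song = new KaraokeSong(), calling song.Title() runs the KaraokeSong override, while song.Describe() still runs Song's version because the call is bound to the declared type.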

    Typically virtual methods are discussed with respect to inheritance of domain models. That is, a KaraokeSong which inherits from a Song, or a Dog which inherits from a Pet. That's a very important concept, but it's already well documented and well understood. Therefore, we'll examine virtual methods for a more technical purpose: proxies.

    Proxy Domain Pattern

    A proxy is something acting as something else. In legal terms, a proxy is someone given authority to vote or act on behalf of someone else. Such a proxy has the same rights and behaves pretty much like the person being proxied. In the hardware world, a proxy server sits between you and a server you're accessing. The proxy server transparently behaves just like the actual server, but with additional functionality - be it caching, logging or filtering. In software, the proxy design pattern is a class that behaves like another class. For example, if we were building a task tracking system, we might decide to use a proxy to transparently apply authorization on top of a task object:

    public class Task
    {  
       public static Task FindById(int id)
       {
          return TaskRepository.Create().FindById(id);
       }   
    
       public virtual void Delete()
       {
          TaskRepository.Create().Delete(this);
       }
    }
    public class TaskProxy : Task
    {
       public override void Delete()
       {
          if (User.Current.CanDeleteTask())
          {
             base.Delete();
          }
          else
          {
             throw new PermissionException(...);
          }
       }
    }
    

    Thanks to polymorphism, FindById can return either a Task or a TaskProxy. The calling client doesn't have to know which was returned - it doesn't even have to know that a TaskProxy exists. It just programs against the Task's public API.
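
    A self-contained sketch of that idea follows. It swaps the repository and User.Current for stand-ins so it can run on its own: TaskFinder, the canDelete delegate and the Deleted flag are hypothetical, and UnauthorizedAccessException stands in for the PermissionException used above.

```csharp
using System;

public class Task
{
    public int Id { get; set; }
    public bool Deleted { get; private set; }

    // Virtual so that a proxy can wrap the behavior.
    public virtual void Delete()
    {
        Deleted = true; // stand-in for the repository call
    }
}

// Adds an authorization check without adding anything to Task's public API.
public class TaskProxy : Task
{
    private readonly Func<bool> _canDelete;

    public TaskProxy(Func<bool> canDelete)
    {
        _canDelete = canDelete;
    }

    public override void Delete()
    {
        if (!_canDelete())
        {
            throw new UnauthorizedAccessException("Current user cannot delete tasks.");
        }
        base.Delete();
    }
}

// Hypothetical finder: callers only ever see the Task type.
public static class TaskFinder
{
    public static Task FindById(int id, Func<bool> canDelete)
    {
        return new TaskProxy(canDelete) { Id = id };
    }
}
```

    Code that calls TaskFinder.FindById(...).Delete() is written entirely against Task, yet the permission check still runs.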

    Since a proxy is just a subclass that implements additional behavior, you might be wondering if a Dog is a proxy to a Pet. Proxies tend to implement more technical system functions (logging, caching, authorization, remoting, etc) in a transparent way. In other words, you wouldn't declare a variable as TaskProxy - but you'd likely declare a Dog variable. Because of this, a proxy wouldn't add members (since you aren't programming against its API), whereas a Dog might add a Bark method.

    Interception

    The reason we're exploring a more technical side of inheritance is because two of the tools we've looked at so far, RhinoMocks and NHibernate, make extensive use of proxies - even though you might not have noticed. RhinoMocks uses proxies to support its core record/playback functionality. NHibernate relies on proxies for its optional lazy-loading capabilities. We'll only look at NHibernate, since it's easier to understand what's going on behind the covers, but the same high level pattern applies to RhinoMocks.

    (A side note about NHibernate. It's considered a frictionless or transparent O/R mapper because it doesn't require you to modify your domain classes in order to work. However, if you want to enable lazy loading, all members must be virtual. This is still considered frictionless/transparent since you aren't adding NHibernate specific elements to your classes - such as inheriting from an NHibernate base class or sprinkling NHibernate attributes everywhere.)

    Using NHibernate there are two distinct opportunities to leverage lazy loading. The first, and most obvious, is when loading child collections. For example, you may not want to load all of a Model's Upgrades until they are actually needed. Here's what your mapping file might look like:

    <class name="Model" table="Models">
       <id name="Id" column="Id" type="int">
          <generator class="native" />
       </id>
       ...
       <bag name="Upgrades" table="Upgrades" lazy="true" >
          <key column="ModelId" />
          <one-to-many class="Upgrade" />
       </bag>      
    </class>
    

    By setting the lazy attribute to true on our bag element, we are telling NHibernate to lazily load the Upgrades collection. NHibernate can easily do this since it returns its own collection types (which all implement standard interfaces, such as IList, so to you, it's transparent).

    The second, and far more interesting, usage of lazy loading is for individual domain objects. The general idea is that sometimes you'll want whole objects to be lazily initialized. Why? Well, say that a sale has just been made. Sales are associated with both a sales person and a car model:

    Sale sale = new Sale();
    sale.SalesPerson = session.Get<SalesPerson>(1);
    sale.Model = session.Get<Model>(2);
    sale.Price = 25000;
    session.Save(sale);
    

    Unfortunately, we've had to go to the database twice to load the appropriate SalesPerson and Model - even though we aren't really using them. The truth is all we need is their ID (since that's what gets inserted into our database), which we already have.

    By creating a proxy, NHibernate lets us fully lazy-load an object for just this type of circumstance. The first thing to do is change our mapping and enable lazy loading of both Models and SalesPeople:

    <class name="Model" table="Models" lazy="true" proxy="Model">...</class>
    
    <class name="SalesPerson" table="SalesPeople" 
          lazy="true" proxy="SalesPerson">...</class>
    

    The proxy attribute tells NHibernate what type should be proxied. This will either be the actual class you are mapping to, or an interface implemented by the class. Since we are using the actual class as our proxy interface, we need to make sure all members are virtual - if we miss any, NHibernate will throw a helpful exception with a list of non-virtual methods. Now we're good to go:

    Sale sale = new Sale();
    sale.SalesPerson = session.Load<SalesPerson>(1);
    sale.Model = session.Load<Model>(2);
    sale.Price = 25000;
    session.Save(sale);
    

    Notice that we're using Load instead of Get. The difference between the two is that if you're retrieving a class that supports lazy loading, Load will get the proxy, while Get will get the actual object. With this code in place we're no longer hitting the database just to load IDs. Instead, calling session.Load<Model>(2) returns a proxy - dynamically generated by NHibernate. The proxy will have an id of 2, since we supplied it the value, and all other properties will be uninitialized. Any call to another member of our proxy, such as sale.Model.Name, will be transparently intercepted and the object will be just-in-time loaded from the database.
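
    If it helps to see the mechanics, here is a hand-written stand-in for the kind of proxy described above. It is not how NHibernate is implemented (NHibernate generates its proxies at runtime); ModelProxy, IsLoaded and the loader delegate are assumptions made for the sketch, with the delegate playing the role of the database query.

```csharp
using System;

public class Model
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
}

// Hand-written illustration of a lazy-loading proxy.
public class ModelProxy : Model
{
    private readonly Func<int, Model> _loader; // stands in for the database query
    private Model _target;                     // the real object, loaded on first use

    public ModelProxy(int id, Func<int, Model> loader)
    {
        Id = id; // known up front, so reading it never needs a load
        _loader = loader;
    }

    public bool IsLoaded
    {
        get { return _target != null; }
    }

    // Intercepted member: the first access triggers the load.
    public override string Name
    {
        get { return Target.Name; }
        set { Target.Name = value; }
    }

    private Model Target
    {
        get
        {
            if (_target == null)
            {
                _target = _loader(Id);
            }
            return _target;
        }
    }
}
```

    Reading Id on the proxy never calls the loader; the first read of Name does, and later reads reuse the loaded object. This only works because Name is virtual, which is exactly why the mapping above requires virtual members.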

    Just a note: NHibernate's lazy-load behavior can be hard to spot when debugging code in Visual Studio. That's because VS.NET's watch/local/tooltip windows actually inspect the object, causing the load to happen right away. The best way to examine what's going on is to add a couple of breakpoints around your code and check out the database activity either through NHibernate's log or SQL Profiler.

    Hopefully you can imagine how proxies are used by RhinoMocks for recording, replaying and verifying interactions. When you create a partial you're really creating a proxy to your actual object. This proxy intercepts all calls and, depending on which state you are in, does its own thing. Of course, for this to work, you must either mock an interface or the virtual members of a class.
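
    As a rough illustration of the recording half of that idea, here is a hand-written recording proxy. It makes no claim about how RhinoMocks works internally; Mailer, RecordingMailer and WasCalled are invented names, and a mocking framework would generate the equivalent subclass at runtime.

```csharp
using System;
using System.Collections.Generic;

public class Mailer
{
    // Must be virtual, otherwise the proxy below could not intercept it.
    public virtual void Send(string to)
    {
        // A real implementation would send mail here.
    }
}

// Hand-written recording proxy: records calls instead of doing real work.
public class RecordingMailer : Mailer
{
    private readonly List<string> _calls = new List<string>();

    public override void Send(string to)
    {
        _calls.Add("Send(" + to + ")");
    }

    // Verification step: was the expected interaction recorded?
    public bool WasCalled(string call)
    {
        return _calls.Contains(call);
    }
}
```

    Code under test receives a Mailer and calls Send as usual; the test then asks the RecordingMailer whether the expected call happened.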

    In This Chapter

    In chapter 6 we briefly covered NHibernate's lazy loading capabilities. In this chapter we expanded on that discussion by looking more deeply at the actual implementation. The use of proxies is common enough that you'll not only frequently run into them, but will also likely have good reason to implement some yourself. I still find myself impressed at the rich functionality provided by RhinoMocks and NHibernate thanks to the proxy design pattern. Of course, everything hinges on you allowing them to override or insert their behavior over your classes. Hopefully this chapter will also make you think about which of your methods should and which shouldn't be virtual. I strongly recommend that you take a look at the following articles/posts to better understand the virtual by default vs final by default points of view:

    Posted Jun 18 2008, 08:32 AM by karl with 5 comment(s)
  • Resharper 4

    It's hard to imagine that almost a year has gone by since my jab at Resharper 3.0's lack of support for .NET 3.5. Yesterday I finally got around to installing the newly released Resharper 4 and I'm more than blown away by some of the new features. Not only does it fully support the new syntax (lambdas, LINQ, anonymous types and so on), but it also adds some nice touches of its own.

    The first thing I noticed was that the "Reformat" feature - which I use a lot - has been renamed to "Cleanup Code" and not only does more, but also supports profiles - so different code cleanup profiles can do different things. One thing I haven't figured out yet is how to edit the 2 default profiles.

    The next thing that surprised me was that Resharper suggested I use object initialization. So given:

    Task t = new Task();
    t.Name = "Test";

    and hitting alt-enter, resulted in:

    Task t = new Task {Name = "Test"};

    Similarly, Resharper suggests using implicitly typed variables. David already blogged about this - and like him, I also disabled this suggestion. However, if you're with JP on this, you'll certainly appreciate the helpful tip.

    One feature I'm on the fence about is their JetBrains.Annotations assembly. With it, you can decorate your members using JetBrains-specific attributes to provide even better integration. For example, given a method that behaves like string.Format, I can add a StringFormatMethod attribute:

    [StringFormatMethod("key")]
    public void Put(string key, params object[] args) { ... }

    This then allows Resharper to provide additional information, so if I do:

    Put("testing {0}, {1}, {2}", 1, 2);

    Resharper will tell me that {2} doesn't have a matching argument. It's a neat feature, but there's something strange about adding a JetBrains DLL to my project.
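
    To see why that warning is worth having, consider what happens at runtime without it. The sketch below assumes Put simply forwards to string.Format (the original method body isn't shown), wraps it in a hypothetical Messages class, and leaves the attribute off so it compiles without the JetBrains assembly.

```csharp
using System;

public static class Messages
{
    // Behaves like string.Format: 'key' is the format string, 'args' fill the placeholders.
    public static string Put(string key, params object[] args)
    {
        return string.Format(key, args);
    }
}
```

    Calling Messages.Put("testing {0}, {1}, {2}", 1, 2) compiles fine but throws a FormatException when it runs, because {2} has no matching argument. The annotation lets the tool flag that while you type instead.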


    Generally, I think Resharper's a must-have. If you have an older version and aren't working on 3.5 code, then save your money. However, if you're doing even a little bit of 3.5 programming, then this thing is totally worth it. I have three complaints/concerns.

    First, I wish more of the windows docked. For example, I wish "Recent Edits" was dockable. While we're on the topic of recent edits, Resharper should look at what e-TextEditor does and provide THAT amazing functionality.

    Secondly, each version of Resharper gets progressively more complex. There are more shortcuts (like the new ctrl-shift-enter) and more configuration. The barrier to entry is starting to get a little high, although the couple of hours you might spend configuring it are quickly made up.

    Finally, price. I can't help but feel that, despite the amazing value, upgrading from 3.0 to 4.0 should be less than $100. Maybe I feel that way 'cuz 3.0 was a bit of a let-down for me (I know it wasn't for everyone, especially VB.NET developers), and also because I think everyone should use it.

    What are you waiting for? Get your free 30 day trial now.

    P.S. I downloaded that sucker at 8meg/sec from their Rackspace server - that's insane (Rackspace is in Texas, I'm all the way north in Ottawa). And while I still love Rackspace, I'm a far bigger fan of SoftLayer. Same thing but $600/month cheaper, $0.20/GB instead of like $2.00, and amazingly useful iSCSI.

     
