NoSQL For The Rest Of Us

No one would blame you for strictly associating NoSQL with performance. Most of the back and forth about NoSQL – an umbrella term for non-relational storage mechanisms – has squarely put the focus on performance, sites with massive traffic, and server farms. It’s an interesting conversation, but one that risks alienating NoSQL from the majority of developers.

The Problem

Does NoSQL provide us simple developers with any tangible benefit? As a matter of fact, it can – one as significant for us as performance is for Facebook. First, though, you need to understand that all of those tools you’ve been using to access your data, such as DataSets, Linq2Sql, Hibernate, NHibernate, EntityFramework, ActiveRecord, SubSonic and SQLAlchemy, are meant to help you deal with the well-known object-relational impedance mismatch. The short description is that data stored in code (typically using an Object Oriented approach) and data stored in a relational DB require coercion to move back and forth. The amount of coercion varies greatly from system to system, as does the visibility (or leakiness) of said coercion (the tool you use has a significant impact on this as well).
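
To make the mismatch concrete, here’s a minimal sketch (the class and table names are mine, purely for illustration): an object that’s trivial to express in code needs a second table and a join the moment it hits a relational store.

    using System.Collections.Generic;

    // A natural object model: a user owns a list of e-mail addresses.
    public class User
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public IList<string> Emails { get; set; } // nested collection
    }

    // Relationally, that nested list has to be coerced into a child table and
    // stitched back together with a join on the way out:
    //
    //   CREATE TABLE Users      (Id INT PRIMARY KEY, Name VARCHAR(100));
    //   CREATE TABLE UserEmails (UserId INT REFERENCES Users(Id), Email VARCHAR(200));
    //
    // The ORMs listed above exist to hide (or at least manage) this coercion.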

A lot of developers don’t feel that the object-relational impedance mismatch is really a problem, or even that it exists. That’s only because it’s been the only solution on offer: you’ve been dealing with it for so long (possibly your entire programming career, like me) that you don’t think about it. You’ve been trained and desensitized to the problem, accepting it as part of programming the same way you’ve accepted if statements.

A Solution?

Because NoSQL solutions change how data is stored, the object-relational mismatch no longer applies to them. Of course, just because the object-relational mismatch is gone doesn’t mean something else hasn’t taken its place. There are four primary storage techniques used by NoSQL solutions. I’m going to look at the two I’m most familiar with: Document (via MongoDB) and ColumnFamily (via Cassandra).

Now I don’t have any NoSQL systems in production. My experience with MongoDB has been as a main contributor to the C# MongoDB driver (Norm) – largely focusing on the underlying communication protocol. I’ve also been writing a sample application as a demo for the driver and am prototyping something here at work (with plans to go into production). My experience with Cassandra is even more limited – having spent this past weekend looking at writing a C# driver for it (Apollo).

What I’ve noticed with MongoDB specifically (and probably document-oriented databases in general) is that your data layer practically vanishes. This makes sense given the support for arrays and nested documents, as well as the ability to serialize and deserialize to and from a non-ambiguous, simply typed protocol like JSON (or BSON). This is a huge productivity win for developers – shorter development time, less code and therefore fewer bugs.
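
As a rough sketch of why that layer shrinks (reusing the User class from the earlier sketch; the session interface below is a made-up stand-in, not Norm’s actual API), the whole object graph round-trips as a single document:

    using System.Collections.Generic;

    // Hypothetical, simplified document-session interface - illustration only,
    // not Norm's actual API.
    public interface IDocumentSession
    {
        void Save<T>(string collection, T document);
        T FindOne<T>(string collection, object selector);
    }

    public static class DocumentExample
    {
        public static User RoundTrip(IDocumentSession session)
        {
            var user = new User
            {
                Id = 1,
                Name = "karl",
                Emails = new List<string> { "a@example.com", "b@example.com" }
            };

            // Stored as a single document, roughly:
            //   { "_id": 1, "Name": "karl", "Emails": ["a@example.com", "b@example.com"] }
            session.Save("users", user);

            // One fetch, one deserialization - no joins, no mapping files.
            return session.FindOne<User>("users", new { Id = 1 });
        }
    }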

On the flip side, my initial reaction to the ColumnFamily storage approach (and, I’d assume, to Key-Value engines) is that it’s even further away from OO than the relational model – the mismatch is even greater. You end up dealing with individual values (or arrays of individual values), and the language of Cassandra bleeds deeply into your application (much like the language of the RDBMS bleeds into your code when you use DataSets). Again, not a huge surprise, since Cassandra *is* heavily tuned for performance.
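
For contrast, here’s a minimal sketch of the ColumnFamily shape (the client interface below is made up and far simpler than Cassandra’s real Thrift API, but the flavour is similar): you address individual column values by keyspace, column family, row key and column name.

    using System.Text;

    // Hypothetical, heavily simplified ColumnFamily client - illustration only;
    // Cassandra's real Thrift API is more involved, but the shape is similar.
    public interface IColumnClient
    {
        void Insert(string keyspace, string columnFamily, string rowKey,
                    string columnName, byte[] value);
        byte[] Get(string keyspace, string columnFamily, string rowKey,
                   string columnName);
    }

    public static class ColumnExample
    {
        public static void SaveUser(IColumnClient client)
        {
            // The same user gets spelled out value by value; keyspaces, column
            // families and raw bytes all bleed into application code, much the
            // way tables and columns do with DataSets.
            client.Insert("App", "Users", "1", "Name", Encoding.UTF8.GetBytes("karl"));
            client.Insert("App", "Users", "1", "Email", Encoding.UTF8.GetBytes("a@example.com"));
        }
    }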

The Drivers

Ultimately, the drivers and tools you use to communicate with the storage engine are going to have a significant impact. For example, before Norm, the main way to communicate with MongoDB from C# was essentially through the use of glorified dictionaries. Before NHibernate we were using DataReaders or DataSets. However, the greater the difference between the OO and storage models, the greater the complexity and the leaks. Also, NoSQL drivers are young and have tons of room to grow, whereas the last hot thing to happen on the RDBMS front was Rails’ ActiveRecord.
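
To put the “glorified dictionaries” point in code (again using the hypothetical User class and session interface from the sketches above, rather than any driver’s real API):

    using System.Collections.Generic;

    public static class DriverStyles
    {
        // Dictionary style: the storage model leaks straight into your code,
        // with string keys and casts everywhere.
        public static string DictionaryStyle(IDictionary<string, object> loaded)
        {
            var doc = new Dictionary<string, object>
            {
                { "Name", "karl" },
                { "Emails", new[] { "a@example.com" } }
            };
            // collection.Insert(doc);  // hypothetical save call
            return (string)loaded["Name"];
        }

        // Typed style: you work with your own classes and let the driver do the mapping.
        public static string TypedStyle(IDocumentSession session)
        {
            session.Save("users", new User { Id = 1, Name = "karl" });
            return session.FindOne<User>("users", new { Name = "karl" }).Name;
        }
    }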

Conclusion

Keep in mind that I am biased. Not only is my knowledge of Cassandra limited, but I’ve also had a hand in shaping the MongoDB drivers – obviously MongoDB fits well with my vision of what data access should look like. Maybe as my involvement in the ColumnFamily approach grows, so too will my opinion of the technology. However, it seems pretty clear to me at an implementation level that document-oriented databases (as well as object-oriented databases like db4o, I’d assume) are relatively close to the OO model used in code, and as such provide the greatest value to programmers.

At this very moment, the risks of new technology and the inconvenience of having to learn and grow aside, I’d say there is a compelling reason for developers to move away from relational storage engines (at least to prototype and play with the alternatives). Reduced complexity within the application layer, not performance and scalability, is NoSQL’s greatest strength.


12 Responses to NoSQL For The Rest Of Us

  1. karl says:

    Our client isn’t usable right now, but it is available at:
    http://github.com/karlseguin/apollo

    Our plan is to rewrite all of the autogenerated code and focus on a single protocol and transport. Once that’s done, we plan on trying to make it more familiar for .NET developers.

    You’ll note that our codebase includes all of the Thrift/Apache stuff. Right now the project is a mess with our stuff side-by-side with it, but rest assured that insert and get are working on completely custom-written code.

  2. Venu says:

    Interesting article, Karl.

    Quick question(s)

    1) How does your client differ from HectorSharp
    http://github.com/mattvv/hectorsharp

    2) Is your client publicly available?

    I ask as my understanding is that a lot of the code is generated from Thrift (looking at the HectorSharp codebase). Whatever is not generated looks very hacky. Thoughts?

  3. karl says:

    Daniel:
    Mongo gives you a full JS shell, so you can do something like:

    db.users.find().forEach(function(user) {
      var company = db.companies.findOne({"_id": user["companyId"]});
      user["important"] = company["important"];
      db.users.save(user);
    });
    db.companies.update({}, {$unset: {important: 1}}, false, true);

    (there may be more efficient ways to do this)

  4. I’ve come across a lot of buzz around NoSQL in the past couple of days. Even Kent has a hard time renaming this umbrella word – NoSQL.

    NoSQL > No to SQL
    NOSQL > Not Only SQL (makes more sense – found it today)

    Anyways, nice post.

  5. Daniel K. says:

    Ok, there is no schema, just data.
    Say I have the entities ‘Persons’ n:1 ‘Company’ and the field ‘Company.Important’. Now I decide I want to move the ‘Important’ indicator to Person for whatever reason.

    In SQL I would create the new column on Person, copy the value from the related Company and drop the original column.

    If I did this in code, I guess I would need the ‘old class’ and the ‘new class’ and somehow convert the data from the former to the latter, if there is no language to manipulate/move fields directly.

  6. karl says:

    @Daniel
    A lot of NoSQL stores (all?) are schema-free, so there’s no need to create or rename columns, or to change any schema. And you can always run an update to change every blog.Status = ‘old’ value to blog.Status = 1 (and yes, it’ll just be stored as an int instead of a string, without any schema).

    If you add a piece of information but retrieve an old record that doesn’t have it, that property will be the C# default (null, 0, false… you could use a nullable type for value types). If you’ve removed information, it just won’t get loaded.

    So in theory, you just change your domain model and go. However, we need to do more in the drivers to make this a reality, and I won’t know exactly what until someone runs into the problem.

  7. Daniel K. says:

    I’ve got a general question on the topic, but slightly off topic :) I have no experience with “NoSQL”, but I’d like to know how you update a database without SQL.

    Given a product with a versioned database that changes over time, and scripts for each version step: a script is executed whenever the program finds the database to be out of date. Then some SQL runs, creates new columns, renames stuff, does some schema changes, enters default data, etc.

    How would I do that with NoSQL? Does every db have its own “not SQL but something else” language for that, or do I have to hope that old data still works with the new model (which would amount to “I can’t change the data model, ever”)?

  8. Alex Popescu says:

    Very interesting feedback based on your experiments with these projects. The one point I kind of disagree with is:

    “What I’ve noticed with MongoDB specifically (and probably document-oriented database in general) is that your data layer practically vanishes. This makes sense given the support for arrays and nested documents, as well as the ability to serialize and deserialize from a non-ambiguous and simple-type protocol like JSON (or BSON). This is huge productivity win for developers – shorter development time, less code and therefore less bugs.”

    Data modeling, and some implied impedance mismatch coming from the need to separate objects and data, are still very important aspects of the NoSQL space, and there are many pitfalls that we will need to learn to avoid. I’ve posted more about this subject at http://nosql.mypopescu.com/post/457102094/look-ma-ive-just-got-an-n-1-with-nosql-flavor.

    bests,

  9. Rick says:

    I would have read your post, but I found the devilicous tag on the right obscured the text to an obnoxious degree.

    Looks like an interesting article though.

  10. Andrew says:

    A good read.

    I mentioned over on Ayende’s blog today that I think too many people are getting tied up in the scalability questions with NoSQL rather than looking at how much easier something like db4o or MongoDB can make development.

  11. Kyle Banker says:

    Great article. This point needs to be made more often.

  12. Ben says:

    I have a few random thoughts…. and I don’t have a blog so I’m just going to clog yours up ;)

    First, NoSQL is a terrible, but catchy, bucket. What happens when a subset of SQL that doesn’t support, say, joins, comes out for MongoDB? On the one hand, it’s a great way to break the mental paradigm, but it’s a bit of a disservice to both sides.

    Further, within the NoSQL space, as you’ve pointed out, there are many players with many different motivations. It’ll be interesting to see what software Darwinism does to many of them over time. (I won’t even show my ignorance by speculating on which ones will make it, but I have an idea)

    Next, the main mental shift for most of us is from something that’s, in general, a 3rd normal form data model in an OLTP system to something that’s not (serialized object graphs or whatnot). In reality, there are lots of data models in place today that look nothing like what 80% of us use every day. If people start looking at the idea rather than the implementation, I think there’d be less flak from the ‘old guard’.

    Last, I can think of a few decent-sized systems I’ve been a part of where a NoSQL solution would have been a perfect complementary piece. I think many people will come to view, say, MongoDB in the same light.

    In one scenario, we had a website that read from a data mart. The data mart was populated nightly from a data warehouse. We never bothered, but denormalizing the data significantly would have greatly improved our performance. In the same vein, replacing that nightly feed with a job that updated a NoSQL store would have given us as good or better read performance, without the overhead of a scarce resource (database cluster, DBA support, etc.), and we could have scaled horizontally with ease. In that case, we’d have had, essentially, a perfect read-only durable caching solution.

    Another scenario that comes to mind was an e-commerce site I was one of the architects on (let’s just say it wasn’t Amazon, but it was consistently in the top 8). It was backed by a big active-active read / master-write 32-CPU (on each box) Oracle 9i cluster. Still, to maintain performance, we judiciously used caching on the app servers, with some elegant approaches (created by a very smart man) to how we handled cache invalidation in various scenarios (the most common being pricing errors, followed by limited-quantity items). Again, if we had treated our Oracle cluster as the master for products and the shopping cart (orders and inventory were handled by another, equally large, system) and replicated out, we’d have had, arguably, an easier-to-scale, faster approach.

    anyway, just a couple thoughts