BSON Serialization

BSON is a binary-encoded serialization of JSON-like documents, which essentially means its an efficient way of transfering information. Part of my work on the MongoDB NoRM drivers, discussed in more details by Rob Conery, is to write an efficient and maintainable BSON serializer and deserializer. The goal of the serializer is that you give it a .NET object and you get a byte array out of it which represents valid BSON. The deserializer does the opposite – give it a byte array and out pops your object. Of course, there are limits to what they can do – they are mostly meant to be used against POCO/domain entities.

Grammar
The first thing to understand when building serializers is how to read grammar. In programming languages, grammar is a way to express the valid keywords and values a parser might run into. Both the JSON and BSON grammars are great to learn, given how simplistic yet powerful they are. The JSON grammar, available on the homepage of json.org gives a nice representation of what valid JSON should look like. The BSON grammar, available at bsonspec.org under the specification button, follows a more traditional dialect. Essentially, you have symbols on the left and expressions on the right. The expressions can, and often will be, made up of additional symbols and or actual values. Eventually though, you’ll end up with a symbol which is only made up of values – which means you can stop going down the rabbit hole. Its also very common for a child symbol to reference a parent symbol – but eventually something breaks this cycle.

An Example
So, say we wanted to serialize the following json:

{"valid": true}

Everything in BSON starts with a document. From the bson specification, we can see that a document is made up of a 32bit integer (representing the total size of the document, including the integer itself), another symbol called an e_list, and finally a termination character. As a start, we’d have something like:

Now, an e_list itself is made up of a symbol called an element followed by another e_list or an blank string. An element is made up of a single byte type (with \x08 representing a boolean), a symbol called e_name and a byte value for true or false. So now we have:

The only thing missing now is our e_name (which represents the word “valid” in the original JSON). An e_name is really just a cstring which is our value UTF8 encoded into an array of bytes with a trailing byte of \x00:

Our final byte array looks something like:

Serializing a single bool value might be the simplest of cases, but once you understand that, you’re well on your way to being able to serialize anything. Sure, serializing an array might be a bit trickier, since each element within the array is its own document – but the challenge is mostly implementation versus conceptual.

What’s the Length?
It may surprise you at first, but the most difficult part to implement is actually determining the length of a document (or various other symbols which have a length). The problem is that we don’t know the length until after we’ve serialized it. Some implementations will essentially serialize the object graph twice, first to calculate lengths, then to write out the array. In NoRM we do things more efficiently. We keep a linked list of documents, and a pointer to the current document. A document is a very simple object – it keeps track of where it started, who its parent is (null in the case of the root document) and how much data was written. When a new document is needed – say when we start serialization, or when the grammar dictates that we need a new document (arrays or nested objects), we mark where we are and write out a placeholder length. Then, when the document ends, we seek to our placeholder, and write out the length. The relevant code looks like:

  private void NewDocument()
  {
      var old = _current;
      //we start the Written at 4 because of the length itself
      _current = new Document { Parent = old, Start = (int)_writer.BaseStream.Position, Written  = 4};      
      _writer.Write(0); //length placeholder
  }
  private void EndDocument(bool includeEeo)
  {
      var old = _current;
      if (includeEeo)
      {
          Written(1);
          _writer.Write((byte)0);
      }            
      _writer.Seek(_current.Start, SeekOrigin.Begin);
      _writer.Write(_current.Written); //override the document length placeholder
      _writer.Seek(0, SeekOrigin.End); //back to the end
      _current = _current.Parent;
      if (_current != null)
      {
          Written(old.Written);
      }

  }
  private void Written(int length)
  {
      _current.Written += length;
  }

EndDocument is pretty interesting. Since the length of a nested document contributes to the length of the parent document, we need to make sure to update the parent (now current) document with the length of the nested one.

Conclusion
Everything else is pretty straightforward in terms of serialization – largely reliant on reflection and reflection helpers. We use Jon Skeets reflection to delegate approach to make things even faster (something I truthfully don’t fully understand). Currently our implementation has some coupling to other NoRM components. Hopefully one day the BSON stuff will be stand-alone. If you can’t wait, you can either use another library like JSON.NET (which more mature anyways), or spend a few minutes (it shouldn’t take more than that) pulling out our serializer/deserializer.

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

2 Responses to BSON Serialization

  1. Mike Dirolf says:

    Good writeup Karl. If anybody reading this has anything to contribute to the bsonspec site please do so! The site is open and source is here:

    http://github.com/mdirolf/bsonspec.org

  2. Cool stuff, Karl.

    -Charles