Fast Serialization

We use a LOT of serialization in the current system I work with; serializing/deserializing 100,000,000 objects in a day is pretty common. For a long time we knew that the BinaryFormatter was fat and slow, but we never rationalized writing something custom because we were always fast enough. Unfortunately our data throughput has risen 400% in the last year (when you start with gigs and gigs of messages, that is a huge increase), and our little three- or four-year-old dual Xeon 2.2 has turned into the little engine that could during peaks lately, so we finally bit the bullet and threw something together quickly.

A toast … to the little server that could!

This solution is for a fairly niche condition and is heavily optimized so please read the explanations below to see if it will be good for your scenario before using it.

 

This is the first of a series of posts dealing with this … Let's start by introducing a new interface to our system:

    public interface ICustomBinarySerializable
    {
        void WriteDataTo(BinaryWriter _Writer);
        void SetDataFrom(BinaryReader _Reader);
    }

You would then implement this interface in your object like this; only write out exactly what you need, and write it in the simplest way possible.

    class TestObject : ICustomBinarySerializable
    {
        public int Integer;
        public TestObject(){}

        public TestObject(int _Integer)
        {
            Integer = _Integer;
        }

        public virtual void WriteDataTo(BinaryWriter _Writer)
        {
            _Writer.Write((int) Integer);
        }

        public virtual void SetDataFrom(BinaryReader _Reader)
        {
            Integer = _Reader.ReadInt32();
        }
    }

Then I wrote a custom formatter that operates on objects implementing ICustomBinarySerializable. You may note that I use an integer for the index that represents the type. A short would probably be more appropriate than an integer, and we could save a couple of bytes here.

    public class CustomBinaryFormatter : IFormatter
    {
        private SerializationBinder m_Binder;
        private StreamingContext m_StreamingContext;
        private ISurrogateSelector m_SurrogateSelector;
        private readonly MemoryStream m_WriteStream;
        private readonly MemoryStream m_ReadStream;
        private readonly BinaryWriter m_Writer;
        private readonly BinaryReader m_Reader;
        private readonly Dictionary<Type, int> m_ByType = new Dictionary<Type, int>();
        private readonly Dictionary<int, Type> m_ById = new Dictionary<int, Type>();
        private readonly byte[] m_LengthBuffer = new byte[4];
        private readonly byte[] m_CopyBuffer;

        public CustomBinaryFormatter()
        {
            m_CopyBuffer = new byte[20000];
            m_WriteStream = new MemoryStream(10000);
            m_ReadStream = new MemoryStream(10000);
            m_Writer = new BinaryWriter(m_WriteStream);
            m_Reader = new BinaryReader(m_ReadStream);
        }

        public void Register<T>(int _TypeId) where T : ICustomBinarySerializable
        {
            m_ById.Add(_TypeId, typeof(T));
            m_ByType.Add(typeof (T), _TypeId);
        }

        public object Deserialize(Stream serializationStream)
        {
            if(serializationStream.Read(m_LengthBuffer, 0, 4) != 4)
                throw new SerializationException("Could not read length from the stream.");
            IntToBytes length = new IntToBytes(m_LengthBuffer[0], m_LengthBuffer[1], m_LengthBuffer[2], m_LengthBuffer[3]);
            //TODO make this support partial reads from stream
            if(serializationStream.Read(m_CopyBuffer, 0, length.i32) != length.i32) 
                throw new SerializationException("Could not read " + length.i32 + " bytes from the stream.");
            m_ReadStream.Seek(0L, SeekOrigin.Begin);
            m_ReadStream.Write(m_CopyBuffer, 0, length.i32);
            m_ReadStream.Seek(0L, SeekOrigin.Begin);
            int typeid = m_Reader.ReadInt32();
            Type t;
            if(!m_ById.TryGetValue(typeid, out t))
                throw new SerializationException("TypeId " + typeid + " is not a registered type id");
            object obj = FormatterServices.GetUninitializedObject(t);
            ICustomBinarySerializable deserialize = (ICustomBinarySerializable) obj;
            deserialize.SetDataFrom(m_Reader);
            if(m_ReadStream.Position != length.i32) 
                throw new SerializationException("object of type " + t + " did not read its entire buffer during deserialization. This is most likely an imbalance between the writes and the reads of the object.");
            return deserialize;
        }

        public void Serialize(Stream serializationStream, object graph)
        {
            int key;
            if (!m_ByType.TryGetValue(graph.GetType(), out key))
                throw new SerializationException(graph.GetType() + " has not been registered with the serializer");
            ICustomBinarySerializable c = (ICustomBinarySerializable) graph; //this will always work due to generic constraint on the Register
            m_WriteStream.Seek(0L, SeekOrigin.Begin);
            m_Writer.Write((int) key);
            c.WriteDataTo(m_Writer);
            IntToBytes length = new IntToBytes((int) m_WriteStream.Position);
            serializationStream.WriteByte(length.b0);
            serializationStream.WriteByte(length.b1);
            serializationStream.WriteByte(length.b2);
            serializationStream.WriteByte(length.b3);
            serializationStream.Write(m_WriteStream.GetBuffer(), 0, (int) m_WriteStream.Position);
        }

        public ISurrogateSelector SurrogateSelector
        {
            get { return m_SurrogateSelector; }
            set { m_SurrogateSelector = value; }
        }

        public SerializationBinder Binder
        {
            get { return m_Binder; }
            set { m_Binder = value; }
        }

        public StreamingContext Context
        {
            get { return m_StreamingContext; }
            set { m_StreamingContext = value; }
        }
    }

So that it is clear how this works … when you instantiate a custom formatter, you associate types back to integer ids. Example from my tests:

formatter.Register<TestObject>(1);

This says that when you get a type id of 1 it should be a TestObject and, vice versa, when you write a TestObject give it a type id of 1.

 

When writing an object the format is

<4 bytes length><4 bytes type id><object data>

 

When we read the data we first read the 4-byte length (n), then read n bytes off the stream into our copy buffer (see notes below). We then write that into an internal memory stream, seek to its beginning, and tell the object to read its state using the binary reader we provide to it.
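The framing can be sketched in isolation (a hypothetical demo, no formatter involved), showing the `<4 bytes length><4 bytes type id><object data>` layout written and read back:

```csharp
using System;
using System.IO;

public class FramingDemo
{
    // Frames a payload as <4-byte length><4-byte type id><payload>, then reads it back.
    public static string RoundTrip(byte[] payload, int typeId)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        writer.Write(4 + payload.Length);   // length covers the type id plus the data
        writer.Write(typeId);
        writer.Write(payload);

        stream.Seek(0, SeekOrigin.Begin);
        var reader = new BinaryReader(stream);
        int length = reader.ReadInt32();
        int readTypeId = reader.ReadInt32();
        byte[] data = reader.ReadBytes(length - 4);
        return length + " " + readTypeId + " " + data.Length;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip(new byte[] { 1, 2, 3, 4, 5 }, 1)); // → 9 1 5
    }
}
```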

 

 

Performance

Before we look at all of the bad and evil things this is doing, let's try some basic performance tests. To run the tests I used the following simple object (what this library was designed to be really fast with). I grabbed this object off someone's blog who was also playing with serialization and added the interface, but I can't seem to find the link to give credit for saving me a good minute's worth of typing :).

 

[Serializable]
public class Customer : ICustomBinarySerializable {
     private String _lastname;
     private String _firstname;
     private String _address;
     private int _age;
     private int _code;

    public Customer()
    {
        
    }
    public Customer(String lastName, String firstName, String address, int age, int code)
    {
        _lastname = lastName;
        _firstname = firstName;
        _address = address;
        _age = age;
        _code = code;
    }

    public String LastName {
           get {return _lastname;}
           set {_lastname = value;}
     }
     public String FirstName
     {
           get {return _firstname;}
           set {_firstname = value;}
     }
     public String Address
     {
           get {return _address;}
           set {_address = value;}
     }

     public int Age
     {
           get {return _age;}
           set {_age = value;}
     }

     public int Code
     {
           get {return _code;}
           set {_code = value;}
     }

    public void WriteDataTo(BinaryWriter _Writer)
    {
        _Writer.Write((string)_lastname);
        _Writer.Write((string)_firstname);
        _Writer.Write((string)_address);
        _Writer.Write((Int32)_age);
        _Writer.Write((Int32)_code);
    }

    public void SetDataFrom(BinaryReader _Reader)
    {
        _lastname = _Reader.ReadString();
        _firstname = _Reader.ReadString();
        _address = _Reader.ReadString();
        _age = _Reader.ReadInt32();
        _code = _Reader.ReadInt32();
    }
}

 

Speed

To test the speed of the serializer I chose to serialize / deserialize one of these objects 10,000,000 times to/from a MemoryStream.
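The shape of the test loop can be sketched roughly like this (not the actual harness; it times repeated writes of Customer-shaped data into a reused MemoryStream, with sample values I made up):

```csharp
using System;
using System.Diagnostics;
using System.IO;

public class BenchmarkSketch
{
    // Times repeated writes of Customer-shaped data into a reused MemoryStream.
    public static TimeSpan TimeWrites(int iterations)
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            stream.Seek(0, SeekOrigin.Begin);   // reuse the stream, as the formatter does
            writer.Write("Doe");
            writer.Write("John");
            writer.Write("123 Main St");
            writer.Write(42);
            writer.Write(7);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        // Scaled down from the article's 10,000,000 iterations.
        Console.WriteLine(TimeWrites(1000000));
    }
}
```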

    Test                   Time (lower is better)
    Serialize (Binary)     01:48.54
    Serialize (Custom)     00:06.73
    Deserialize (Binary)   02:01.29
    Deserialize (Custom)   00:08.55

So on serialization the custom formatter is roughly 16× faster, and on deserialization roughly 14× faster. That's not too bad, as both are more than an order of magnitude.

 

Size

The other area I really wanted to optimize, as it is common for us to have 40+ GB transaction files in a day (disk IO is expensive), is the size of each message. Because we are not writing the same kind of schema information that the BinaryFormatter does, we can also be quite a bit smaller than its output. For the object given, the BinaryFormatter produces 232 bytes of output while the custom formatter produces 41. This message has quite a few strings, which add to the amount of serialized data (on our messages (about 40 of them) we average about a 1/10 ratio between the two). Even so, it's still more than a 5× reduction in the storage space required. Don't let this fool you though, there are some ….

 

Problems

There are a number of problems with this type of strategy. It is imperative that you know about the tradeoffs involved with this code before using it. This was written for a niche situation and it may really hurt you if you aren’t careful!

 

Versioning

There is no versioning information provided by default in the data. One could easily provide this in their custom serialization implementation but the formatter does not provide it by default for you.
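A hedged sketch of what carrying a version yourself could look like (hypothetical `VersionedObject` type; the formatter knows nothing about this):

```csharp
using System;
using System.IO;

// Hypothetical type: writes a version byte first, so newer readers can
// cope with older data that lacks the Note field.
public class VersionedObject
{
    private const byte CurrentVersion = 2;
    public int Value;
    public string Note; // added in version 2

    public void WriteDataTo(BinaryWriter _Writer)
    {
        _Writer.Write(CurrentVersion); // version always goes first
        _Writer.Write(Value);
        _Writer.Write(Note ?? "");
    }

    public void SetDataFrom(BinaryReader _Reader)
    {
        byte version = _Reader.ReadByte();
        Value = _Reader.ReadInt32();
        Note = version >= 2 ? _Reader.ReadString() : ""; // field only exists in newer data
    }

    public static string RoundTrip()
    {
        var stream = new MemoryStream();
        var obj = new VersionedObject { Value = 5, Note = "hi" };
        obj.WriteDataTo(new BinaryWriter(stream));
        stream.Seek(0, SeekOrigin.Begin);
        var copy = new VersionedObject();
        copy.SetDataFrom(new BinaryReader(stream));
        return copy.Value + " " + copy.Note;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip()); // → 5 hi
    }
}
```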

 

Endianess

One of the interesting things here is dealing with the length. I have done this using a quite unsafe (but faster) solution.

    [StructLayout(LayoutKind.Explicit)]
    public struct IntToBytes
    {
        public IntToBytes(Int32 _value) { b0 = b1 = b2 = b3 = 0; i32 = _value; }
        public IntToBytes(byte _b0, byte _b1, byte _b2, byte _b3) {
            i32 = 0;
            b0 = _b0;
            b1 = _b1;
            b2 = _b2;
            b3 = _b3;
        }
        [FieldOffset(0)]
        public Int32 i32;
        [FieldOffset(0)]
        public byte b0;
        [FieldOffset(1)]
        public byte b1;
        [FieldOffset(2)]
        public byte b2;
        [FieldOffset(3)]
        public byte b3;
    }

This has endian problems if you use it across machines with different endianness, say Mono on a PPC vs. the CLR on x86. One could get around this by fixing a byte order explicitly (or doing some binary arithmetic if you miss having real reasons for doing so :)). For us, however, most of these objects are serialized between processes on the same machine, so it's not an issue.
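A hedged sketch of the "binary arithmetic" route (hypothetical helper names): shift arithmetic pins the wire format to little-endian regardless of the host CPU, unlike the FieldOffset union, which inherits the host's byte order:

```csharp
using System;

public class LittleEndianLength
{
    // Always little-endian on the wire, whatever the host endianness.
    public static byte[] ToBytes(int value)
    {
        return new byte[]
        {
            (byte)(value & 0xFF),
            (byte)((value >> 8) & 0xFF),
            (byte)((value >> 16) & 0xFF),
            (byte)((value >> 24) & 0xFF),
        };
    }

    public static int FromBytes(byte[] b)
    {
        return b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24);
    }

    public static void Main()
    {
        byte[] bytes = ToBytes(0x12345678);
        Console.WriteLine(bytes[0] == 0x78 && FromBytes(bytes) == 0x12345678); // → True
    }
}
```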

 

Copying of Data

Another problem (read: decision) has to do with how the formatter deals with the stream internally. It copies data off the stream into an internal memory buffer so that it can reuse the same BinaryReader/BinaryWriter every time. This makes it non-reentrant and forces the copy, but in testing with many very small messages, copying the data turned out to be faster than creating a new reader/writer over the original stream on every iteration. This may turn out differently for you; I will leave changing it as an exercise for the reader (I promise it won't take more than five minutes).
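If you do want to try the reader-per-call alternative, a minimal sketch (assuming .NET 4.5+ for the `leaveOpen` overload, which did not exist when this was written):

```csharp
using System;
using System.IO;
using System.Text;

public class ReaderPerCall
{
    // Alternative to the internal copy buffer: wrap the caller's stream directly.
    // Assumes .NET 4.5+ for the leaveOpen overload; older frameworks would need
    // a stream wrapper whose Dispose is a no-op.
    public static int ReadTypeId(Stream serializationStream)
    {
        using (var reader = new BinaryReader(serializationStream, Encoding.UTF8, true))
        {
            return reader.ReadInt32(); // the stream stays open for the object's own reads
        }
    }

    public static void Main()
    {
        var stream = new MemoryStream();
        new BinaryWriter(stream).Write(7);
        stream.Seek(0, SeekOrigin.Begin);
        Console.WriteLine(ReadTypeId(stream)); // → 7
    }
}
```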

 

Typing

It's a lot of typing in your objects (we can work around this with some IL generation), but that's a whole other post, isn't it?

 

Anyways I hope people enjoy this and can find a niche place of their own to use such a strategy.


28 Responses to Fast Serialization


  3. Marc Gravell says:

    For info, you might want to compare/contrast to protobuf-net; this works a lot more simply re POCO – it is largely attribute based (meaning: less code to get wrong), and has the advantages of being a portable (i.e. agreed) wire format, with extensibility etc baked in. It is of course optimized etc, and works on all framework versions.
    http://code.google.com/p/protobuf-net/

  4. unruledboy says:

    I agree, but it’s easier to implement within the serializer :)

  5. Greg says:

    unruledboy:

    Why is it a responsibility of the formatter to handle the indexing? I think this would be better served with an abstraction outside of the formatter (and it could as such be used with any formatter)

  6. unruledboy says:

    I added the index feature, anyone who like to have a look, check out the following link:

    http://files.cnblogs.com/unruledboy/Serializer.zip

  7. Ah, cool. Works in Silverlight and you can even do this:

    using System.Windows;
    using System.Windows.Controls;
    using System.Windows.Documents;
    using System.Windows.Ink;
    using System.Windows.Input;
    using System.Windows.Media;
    using System.Windows.Media.Animation;
    using System.Windows.Shapes;
    using System.Collections.Generic;
    using System.IO;
    using CustomBinarySerializer;

    namespace SerializerTest
    {
    [Serializable]
    public class CustomerList: ICustomBinarySerializable
    {
    public List<Customer> Customers;

    public CustomerList()
    {
    Customers = new List<Customer>();
    }

    public void WriteDataTo(BinaryWriter _Writer)
    {
    foreach(Customer c in this.Customers )
    {
    _Writer.Write(c.LastName);
    _Writer.Write(c.FirstName);
    _Writer.Write(c.Address);
    _Writer.Write(c.Age);
    _Writer.Write(c.Code);
    }
    }

    public void SetDataFrom(BinaryReader _Reader)
    {
    while (_Reader.BaseStream.Position < _Reader.BaseStream.Length )
    {
    Customer c = new Customer();

    c.LastName = _Reader.ReadString();
    c.FirstName = _Reader.ReadString();
    c.Address = _Reader.ReadString();
    c.Age = _Reader.ReadInt32();
    c.Code = _Reader.ReadInt32();
    this.Customers.Add(c);
    }
    }

    }
    }

  8. Greg says:

    unruledboy:

    You would want to do your own indexing in such a scenario (i.e. with a file). It is pretty easily done outside of the formatter.

    Cheers,

    Greg

  9. Greg says:

    change of TODO:

    int read = 1;
    int total = 0;
    m_ReadStream.Seek(0L, SeekOrigin.Begin);
    while (total < length.i32 && read > 0)
    {
        int left = length.i32 - total;
        int toRead = left > m_CopyBuffer.Length ? m_CopyBuffer.Length : left;
        read = serializationStream.Read(m_CopyBuffer, 0, toRead);
        total += read;
        m_ReadStream.Write(m_CopyBuffer, 0, read);
    }

  10. unruledboy says:

    @Greg

    what I mean is: if I serialize a lot of biz objects, lets say 100,000, then if we specifically need the number 89,002 object, we have to go through all the objects until we reach it, right?

  11. unruledboy says:

    silverlight version, but extremely slow, only a dozen times per second, strange, eh?

    using System;
    using System.IO;
    using System.Runtime.Serialization;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace Serializer
    {
        public class CustomBinaryFormatter
        {
            private readonly MemoryStream m_WriteStream;
            private readonly MemoryStream m_ReadStream;
            private readonly BinaryWriter m_Writer;
            private readonly BinaryReader m_Reader;
            private readonly Dictionary<Type, int> m_ByType = new Dictionary<Type, int>();
            private readonly Dictionary<int, Type> m_ById = new Dictionary<int, Type>();
            private readonly byte[] m_LengthBuffer = new byte[4];
            private readonly byte[] m_CopyBuffer;

            public CustomBinaryFormatter()
            {
                m_CopyBuffer = new byte[20000];
                m_WriteStream = new MemoryStream(10000);
                m_ReadStream = new MemoryStream(10000);
                m_Writer = new BinaryWriter(m_WriteStream);
                m_Reader = new BinaryReader(m_ReadStream);
            }

            public void Register<T>(int _TypeId) where T : ICustomBinarySerializable
            {
                m_ById.Add(_TypeId, typeof(T));
                m_ByType.Add(typeof(T), _TypeId);
            }

            public object Deserialize(Stream serializationStream)
            {
                if(serializationStream.Read(m_LengthBuffer, 0, 4) != 4)
                    throw new SerializationException("Could not read length from the stream.");
                IntToBytes length = new IntToBytes(m_LengthBuffer[0], m_LengthBuffer[1], m_LengthBuffer[2], m_LengthBuffer[3]);
                //TODO make this support partial reads from stream
                if(serializationStream.Read(m_CopyBuffer, 0, length.i32) != length.i32)
                    throw new SerializationException("Could not read " + length.i32 + " bytes from the stream.");
                m_ReadStream.Seek(0L, SeekOrigin.Begin);
                m_ReadStream.Write(m_CopyBuffer, 0, length.i32);
                m_ReadStream.Seek(0L, SeekOrigin.Begin);
                int typeid = m_Reader.ReadInt32();
                Type t;
                if(!m_ById.TryGetValue(typeid, out t))
                    throw new SerializationException("TypeId " + typeid + " is not a registered type id");
                //object obj = FormatterServices.GetUninitializedObject(t);
                object obj = Activator.CreateInstance(t);
                ICustomBinarySerializable deserialize = (ICustomBinarySerializable) obj;
                deserialize.SetDataFrom(m_Reader);
                if(m_ReadStream.Position != length.i32)
                    throw new SerializationException("object of type " + t + " did not read its entire buffer during deserialization. This is most likely an imbalance between the writes and the reads of the object.");
                return deserialize;
            }

            public void Serialize(Stream serializationStream, object graph)
            {
                int key;
                if (!m_ByType.TryGetValue(graph.GetType(), out key))
                    throw new SerializationException(graph.GetType() + " has not been registered with the serializer");
                ICustomBinarySerializable c = (ICustomBinarySerializable) graph; //this will always work due to generic constraint on the Register
                m_WriteStream.Seek(0L, SeekOrigin.Begin);
                m_Writer.Write((int) key);
                c.WriteDataTo(m_Writer);
                IntToBytes length = new IntToBytes((int) m_WriteStream.Position);
                serializationStream.WriteByte(length.b0);
                serializationStream.WriteByte(length.b1);
                serializationStream.WriteByte(length.b2);
                serializationStream.WriteByte(length.b3);
                serializationStream.Write(m_WriteStream.GetBuffer(), 0, (int) m_WriteStream.Position);
            }
        }
    }

  12. Greg says:

    @unruledboy I don't understand what you asked.

    @gael
    1) See "Versioning" under the problems section. This was for a very specific scenario where it's not much of an issue.

    2) I was going to use Reflection.Emit instead as an example (I alluded to it at the end), but hmm, PostSharp could also be interesting.

    Greg

  13. Two remarks.

    - A problem with that is versioning: you need exactly the same object definition on both sides. If not, you need a protocol negotiation phase, which will result in serializer/deserializer methods being generated using System.Reflection.Emit.

    - If you don't want to support protocol negotiation (which I would understand), why not automatically generate the WriteDataTo and SetDataFrom methods? PostSharp could help, although you would have to use PostSharp Core directly!

    Good article!

    -gael

  14. unruledboy says:

    since it does not have an index, it has to iterate the whole file for a specific item, right?


  16. Josh Heyse says:

    Greg,

    Check this out. The Xbox Live team uses reflection to accomplish a similar approach: they dynamically generate delegates to do the serialization.

    http://msdn.microsoft.com/en-us/magazine/cc163960.aspx

  17. stephenpatten says:

    Michael DiBernardo,

    I believe Jon Skeet has already tackled that task, or is quite close to calling his C# port baked.

    Stephen

  18. Stephen Patten says:

    Greg,

    Viewing the "html source" and copying to the clipboard yields the same results. I know the problem is between the keyboard and chair, so would you be kind enough to send the source to me?

    stephenpatten……remove…..@…….remove…..gmail.com

  19. Google’s recent release of protocol buffers might be useful in this context as well… If they haven’t already been ported to C#, I wouldn’t expect it to be long until they are.

    http://code.google.com/p/protobuf/

  20. Ramon Smits says:

    Nice short article that proves custom binary serialization is worth the effort.

  21. FransBouma says:

    For people who need a more generic approach which uses the same underlying idea:
    http://www.codeproject.com/KB/dotnet/FastSerializer.asp
    http://www.codeproject.com/KB/dotnet/OptimizingSerialization2.aspx

    It definitely is worth the effort to implement custom binary serialization if you use it a lot. In line of business applications, where graphs of entities are passed back and forth between service and client, it can often be night and day between normal serialization using the binary formatter and a fast serialization approach.

    Often another nice benefit of custom serialization is that the resulting block of data is much smaller than the block of data produced by the normal binary formatter.

    Also, if you don't want to go very deep with a custom serializer, do at least one thing: string caching. Strings make things fat. Using string caching (a simple table which you place at the start of the block, with the output stream carrying only an index) you can make the block of data very compact without a lot of effort, which also reduces the overall time a message takes to send, because the block of data is smaller.
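    The string-caching idea can be sketched roughly like this (hypothetical helper; this variant writes an inline marker rather than a table at the start of the block):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class StringCacheSketch
{
    // First occurrence writes a marker plus the text; repeats write only
    // a 4-byte index into the table built so far.
    public static void WriteCached(BinaryWriter writer, Dictionary<string, int> table, string value)
    {
        int index;
        if (table.TryGetValue(value, out index))
        {
            writer.Write(index);        // seen before: index only
        }
        else
        {
            table[value] = table.Count;
            writer.Write(-1);           // marker: a new string follows
            writer.Write(value);
        }
    }

    public static long DemoLength()
    {
        var table = new Dictionary<string, int>();
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream);
        WriteCached(writer, table, "Smith");
        WriteCached(writer, table, "Smith"); // second write is 4 bytes, not 10
        return stream.Length;
    }

    public static void Main()
    {
        Console.WriteLine(DemoLength()); // → 14
    }
}
```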

  22. Greg says:

    it's the generics; the blog tends to not like them when I copy/paste … if you view the source you can see them. Sorry about that.

  23. Stephen Patten says:

    Greg,

    Was the sample code supposed to compile?

  24. Greg says:

    Anthony that is one use.

    We also do alot of interprocess message exchange through either a transaction file or shared memory.

    Cheers,

    Greg

  25. Nice one. Out of curiosity – is this being used to serialize/deserialize objects sent over the wire using your fast TCP server?

  26. Stephen Patten says:

    Thank you, Greg. Will post my own results later tomorrow.

  27. Nice, Greg. Glad you got around to it!
