
Postgresql – Day 2

My migration from SQL Server to PostgreSQL for an upcoming project continues. On day 1 I managed to get the application up and running quickly, thanks to NHibernate. Ayende was kind enough to post some advice to minimize the already-few problems I ran into (I’ve since implemented the ISQLExceptionConverter and am really happy about it).

Today I had a bigger challenge – making the OLTP to OLAP transformation script work. I knew this would be more difficult because I’m using straight-up SQL (instead of HQL), as well as a couple of instances of SqlBulkCopy. So far, most of the transformation is done in SQL – there’s very little data being pulled down and processed in code. Luckily, I have a huge and thorough integration test (sketched below), which populates the database with a couple of days’ worth of data, executes the transformation script in steps (as it would be in real life), and then validates the results. Essentially, my test figures out the expected results in C# (by looping over the entities and aggregating them every which way) and compares them to what the transformation script does.
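
Roughly, that test has the following shape – every type and helper name below is a placeholder for illustration, not the real code:

[Test]
public void Transformation_Matches_InMemory_Aggregation()
{
    // populate the OLTP tables with a couple of days of sample data
    var sales = SampleData.TwoDaysOfSales();
    Repository.Save(sales);

    // run the transformation in the same steps it runs in real life
    foreach (var step in TransformationScript.Steps)
    {
        step.Execute();
    }

    // brute-force the expected rollups in C#...
    var expected = sales
        .GroupBy(s => s.Day)
        .ToDictionary(g => g.Key, g => g.Sum(s => s.Amount));

    // ...and compare them to what the script wrote to the OLAP tables
    var actual = OlapTables.DailyTotals();
    CollectionAssert.AreEquivalent(expected, actual);
}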

Aside from a couple of syntactical differences – such as having used [ ] to escape the count column, one column ambiguity that PostgreSQL couldn’t resolve (weird), and having used insert {table} instead of insert into {table} – I only ran into 2 major problems.
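
For reference, the escaping and insert differences look roughly like this (table and column names made up):

-- SQL Server: brackets escape the identifier, and INTO is optional
SELECT [count] FROM DailyStats;
INSERT DailyStats ([count]) VALUES (1);

-- PostgreSQL: double quotes escape the identifier, and INTO is required
SELECT "count" FROM DailyStats;
INSERT INTO DailyStats ("count") VALUES (1);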

The first was my usage of SqlBulkCopy, which is tied to SQL Server’s bcp utility. I was only using it to bulk-load the data for my integration test, but it did make a huge difference, and I wanted to keep the same performance if possible. It turns out that PostgreSQL supports a COPY FROM STDIN command, which the .NET Npgsql library exposes as an NpgsqlCopyIn class. Although it didn’t perform quite as well (it still rocked), I much prefer the implementation. You essentially write delimited rows, CSV-style, to a stream – much nicer than having to transform my objects into DataTables. Here’s the reusable code that I used:

public static void BulkCopy<T>(IEnumerable<T> items, Func<T, object[]> callback, string tableName, string columns)
{
    // columns is optional – pass null to copy into every column, or "(Col1, Col2)" to target specific ones
    var sql = string.Format("copy {0} {1} from stdin with delimiter '|'", tableName, columns ?? string.Empty);
    using (var connection = new NpgsqlConnection(_connectionString))
    {
        connection.Open();
        using (var command = new NpgsqlCommand(sql, connection))
        {
            var copy = new NpgsqlCopyIn(command, connection);
            try
            {
                copy.Start();
                foreach (var item in items)
                {
                    // each entity becomes one pipe-delimited line pushed onto the COPY stream
                    var data = SerializeData(callback(item));
                    var raw = Encoding.UTF8.GetBytes(string.Concat(data, "\n"));
                    copy.CopyStream.Write(raw, 0, raw.Length);
                }
            }
            catch (Exception)
            {
                copy.Cancel("Undo copy");
                throw;
            }
            finally
            {
                if (copy.CopyStream != null)
                {
                    copy.CopyStream.Close();
                }
                copy.End();
            }
        }
    }
}
private static string SerializeData(object[] data)
{
    var sb = new StringBuilder();
    foreach (var d in data)
    {
        if (d == null)
        {
            // \N is how COPY represents NULL
            sb.Append("\\N");
        }
        else if (d is DateTime)
        {
            sb.Append(((DateTime) d).ToString("yyyy-MM-dd HH:mm:ss"));
        }
        else if (d is Enum)
        {
            // store enums as their numeric value
            sb.Append(((Enum) d).ToString("d"));
        }
        else
        {
            sb.Append(d.ToString());
        }
        sb.Append("|");
    }
    // drop the trailing delimiter
    return sb.Remove(sb.Length - 1, 1).ToString();
}

You use it like so:

var users = GetAllUsers();
Func<User, object[]> callback = u => new object[] { u.Id, u.Email, u.UserName, u.Password };
BulkCopy(users, callback, "Users", null);

If you want to specify the columns, say because the Id is an auto-incrementing field, simply do:

var users = GetAllUsers();
Func<User, object[]> callback = u => new object[] { u.Email, u.UserName, u.Password };
BulkCopy(users, callback, "Users", "(Email, UserName, Password)");

The code essentially turns each entity in your collection into a serialized string, which gets written to the stream.
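
For example, with the callback above, each user becomes a single pipe-delimited line (made-up values – a null column would be written out as \N):

9001|leto@arrakis.gov|leto2|8f3a2b917c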

The other problem I had was that one of my most complex queries was really slow. The query had several joins, including a subselect in the where clause (illustrated below). SQL Server didn’t have a problem with it; PostgreSQL did. I wasn’t too surprised, because past experience with MySQL had taught me that such queries could be problematic. I saw a few statements hinting that PostgreSQL 8.4 (I’m running 8.3) might handle it better. I spent some time trying to speed it up, had some success, and then simply decided to go about it a different way. While I think it’s very likely that I wasn’t doing it properly, there’s no denying that SQL Server handled this better.
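
For illustration only – this is neither the actual query nor the actual fix, just the general shape, with made-up tables – here is a subselect in the where clause next to the same filter pulled up into a derived-table join:

-- joins plus a subselect in the where clause
SELECT u.Id, SUM(h.Amount)
FROM Users u
JOIN Hits h ON h.UserId = u.Id
WHERE u.Id IN (SELECT UserId FROM Purchases WHERE PurchasedOn >= '2010-01-01')
GROUP BY u.Id;

-- the same filter expressed as a derived-table join
SELECT u.Id, SUM(h.Amount)
FROM Users u
JOIN Hits h ON h.UserId = u.Id
JOIN (SELECT DISTINCT UserId FROM Purchases WHERE PurchasedOn >= '2010-01-01') p ON p.UserId = u.Id
GROUP BY u.Id;

Whether a rewrite along those lines actually helps depends on the data and the planner.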

I did time the execution of my integration test, using the same code. This was run on identically pathetic machines – an Atom 330 with 4 gigs of RAM and a 5400 RPM hard drive. PostgreSQL 8.3 on Ubuntu Server 9.04, and SQL Server 2008 on Windows Server 2008 R2. Without using any bulk copy, PostgreSQL was slightly faster. With their respective bulk copy mechanisms, SQL Server was slightly faster. I’m not giving numbers because the results were really close – seconds apart when everything took minutes. Also, I do think the test was useful; however, the code was single-threaded, which means it didn’t provide any insight into how the databases behave under a more typical kind of load. My goal was really just to make sure I wasn’t making a huge mistake, and it doesn’t look, or feel, like I am.


2 Responses to Postgresql – Day 2

  1. Tom says:

    I think it's supposed to be "columns ?? string.Empty"

    Since I was porting from SqlBulkCopy, I already had the DataTable on hand, so I stuck with it. Here’s a version for DataTable:

    public static void BulkCopy(DataTable items, string tableName)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("(");
        bool first = true;
        foreach (DataColumn column in items.Columns)
        {
            if (first)
                first = false;
            else
                sb.Append(",");
            sb.Append(column.ColumnName);
        }
        sb.Append(")");

        var sql = string.Format("copy {0} {1} from stdin with delimiter '|'", tableName, sb.ToString());
        string conn = Procedure.Instance.AnalysisDatabases[Procedure.Instance.DefaultAnalysisDatabase].getConnectionString();
        using (var connection = new NpgsqlConnection(conn))
        {
            connection.Open();
            using (var command = new NpgsqlCommand(sql, connection))
            {
                var copy = new NpgsqlCopyIn(command, connection);
                try
                {
                    copy.Start();
                    foreach (DataRow item in items.Rows)
                    {
                        var data = SerializeData(item.ItemArray);
                        var raw = Encoding.UTF8.GetBytes(string.Concat(data, "\n"));
                        copy.CopyStream.Write(raw, 0, raw.Length);
                    }
                }
                catch (Exception)
                {
                    copy.Cancel("Undo copy");
                    throw;
                }
                finally
                {
                    if (copy.CopyStream != null)
                    {
                        copy.CopyStream.Close();
                    }
                    copy.End();
                }
            }
        }
    }

    I also had to change SerializeData to say:

    if (d == null || d is System.DBNull)

    Thanks a ton for this code. Saved me a lot of time.

  2. Rob Rowe says:

    Thanks for showing us this migration process through your blog. We are currently contemplating a similar move and the more I read the more apt I am to take the leap also. Keep it coming!