Azure Table Storage Performance Considerations

This is my first post on CodeBetter, so I’ll briefly introduce myself. My name is Dario Solera, and I’m co-founder and CEO at Threeplicate. Luckily, I spend most of my time writing code. We develop two web applications: Amanuens, a cloud-based software localization platform, and ScrewTurn Wiki. Both are based on ASP.NET.

I started working with Windows Azure in late 2009, before it even hit RTM. It took a good deal of effort to understand the Azure storage usage paradigms and how to get the best performance out of them.

Besides SQL Azure, there are three types of data storage in Azure:

  1. Table Storage, which lets you store table-like entities (it’s essentially a NoSQL, non-relational store)
  2. Blob Storage, which lets you store binary blobs of data (plain files)
  3. Queue Storage, which lets you queue messages for later processing.

It’s likely that a complex application will need all three types of storage service. Blob Storage is little more than a file system accessible via a REST API, and queues are a familiar concept. Table Storage is the most peculiar of the three.

The point is, if you want a data store that can be searched with queries, Table Storage is the natural choice. In this case, all NoSQL concepts apply:

  • no relations between tables (or entity sets)
  • entities are simply a set of key-value pairs
  • tables are schema-less (i.e. each entity can have a different schema, even in the same table)
  • there is limited support for keys and indexes within tables.

Ignoring all the other limitations of Azure’s Table Storage, the most important is that each entity must have a unique pair of Partition Key and Row Key. There are a number of good practices to follow when designing entities for best performance, but all of them are fine-tuning compared to rule #1: you should never, ever query tables on properties other than the Partition and Row Keys. The idea is that when you query on the Partition and Row Keys, the storage uses its (binary and distributed) index to find results really fast; when you query on other entity properties, the storage has to scan everything sequentially, which hurts performance significantly. So, querying on the Partition Key is good, and querying on both Partition and Row Key is good too, but querying on the Row Key alone is not good (the storage falls back to a sequential scan).
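To make the rule concrete, here is a small sketch (a hypothetical helper, not code from the post) of the OData-style `$filter` expressions the Table service accepts over REST; the shape of the filter decides whether the index can be used:

```python
def eq(prop, value):
    """Build an OData equality clause for the Table service $filter syntax."""
    return f"{prop} eq '{value}'"

# Good: the index locates the partition (and the row) directly.
fast_filter = " and ".join([eq("PartitionKey", "vendor-guid"),
                            eq("RowKey", "product-guid")])

# Bad: RowKey alone (or any plain property) forces a sequential scan.
slow_filter = eq("RowKey", "product-guid")
```

Both filters are valid and both return the same entities; the difference is entirely in how the storage has to find them.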

Amanuens is based on Azure and makes use of Blobs, Queues and, most importantly, Tables. Basically, all the data that would normally go into SQL is stored in the Table Storage.

The main problem is that there are times when you simply cannot avoid querying on entity properties other than the Partition and Row Keys. What normally happens is that you design an entity so that query performance is best in its most frequent querying scenario, but there are usually other scenarios too. Unless you keep different pre-computed views of the same data, each one optimized for a specific scenario, you must resort to querying on entity properties.
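As a sketch of the pre-computed-views idea (hypothetical names, with plain dictionaries standing in for Azure tables), the same record is written twice, with the keys arranged so that each query scenario becomes a key lookup:

```python
# Two in-memory "tables", each keyed for a different query scenario
# (hypothetical names; real code would write to two Azure tables).
items_by_vendor = {}    # (VendorId, ProductCode) -> entity
items_by_category = {}  # (Category, ProductCode) -> entity

def insert_item(vendor_id, product_code, category, name):
    entity = {"VendorId": vendor_id, "ProductCode": product_code,
              "Category": category, "Name": name}
    # The same data is written twice, so both scenarios are key lookups
    # and neither ever needs to filter on a plain property.
    items_by_vendor[(vendor_id, product_code)] = entity
    items_by_category[(category, product_code)] = entity

insert_item("v1", "p1", "books", "Azure in Action")
```

The cost, of course, is doubled writes and the burden of keeping the views in sync.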

First of all, such non-key queries are still executed server-side, so you don’t have to read all the entities and filter them in your code; that alone is a huge performance win. That said, in Amanuens we found that some non-key queries were surprisingly fast, so I wanted to dig deeper into the matter and come up with some kind of rule of thumb for when it’s safe to query on entity properties and when it’s not.

I put together some test code, which you can find on Bitbucket. The code simply creates a given number of entities and then runs some typical queries against them. The entities model a simple e-commerce store item:

  PartitionKey // Vendor ID (GUID as string),
  RowKey // Product Code (GUID as string),
  VendorId // = PartitionKey
  ProductCode // = RowKey
  Name // Item name,
  Description // Item description,
  AvailableStock // Number of items in stock (int),
  Price // double, as decimal is not supported in Azure Table

The purpose of the test is to measure the performance difference between the same query run against the Partition and Row Keys, and run against VendorId and ProductCode, which contain the same data as the keys.

When creating entities, the Vendor ID is picked from a pre-defined list of 20 GUIDs, round-robin style. Product Codes are GUIDs created on the fly.
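The entity-generation step can be sketched like this (a Python approximation of the test code; the pre-defined vendor list is generated on the spot here rather than hard-coded as in the real test):

```python
import itertools
import uuid

# Pre-defined pool of 20 vendor IDs (GUIDs as strings).
vendor_ids = [str(uuid.uuid4()) for _ in range(20)]
vendor_cycle = itertools.cycle(vendor_ids)  # round-robin picker

def make_entity():
    vendor_id = next(vendor_cycle)
    product_code = str(uuid.uuid4())  # created on the fly
    return {
        "PartitionKey": vendor_id,
        "RowKey": product_code,
        "VendorId": vendor_id,        # same data duplicated into
        "ProductCode": product_code,  # plain, non-indexed properties
    }

entities = [make_entity() for _ in range(100)]
```

With 100 entities and 20 vendors picked round-robin, each vendor ends up with exactly 5 entities, so the partitions are evenly loaded.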

There are 4 test queries:

  1. search for all entities whose PartitionKey (VendorId) is “dummy”
  2. search for all entities whose PartitionKey (VendorId) is the first GUID in the pre-defined list
  3. search for all entities whose RowKey (ProductCode) is “dummy”
  4. search for all entities whose PartitionKey (VendorId) is the first GUID in the pre-defined list, and whose RowKey (ProductCode) is “dummy”.

I did not test the case with both PartitionKey and RowKey given, because that is the perfect query scenario and would certainly give maximum performance.

I ran the tests in the West Europe location, with HTTPS disabled for storage connections (for best performance), on a Medium instance (the smallest instance size with “High” I/O performance).

The code runs each query 50 times and takes the average execution time, then sums the averages into a total. Each test run therefore produces two results:

  • queries_on_keys_result = avg_query1_time + avg_query2_time + avg_query3_time + avg_query4_time (queries run against PartitionKey/RowKey)
  • queries_on_properties_result = avg_query1_time + avg_query2_time + avg_query3_time + avg_query4_time (the same queries run against VendorId/ProductCode).
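The aggregation can be sketched as follows (the timing values are made up; in the real test each average comes from 50 executions against the storage):

```python
def average(times_ms):
    """Average execution time of one query over its repeated runs."""
    return sum(times_ms) / len(times_ms)

# Each inner list stands for the 50 measured times of one query
# (made-up values, for illustration only).
key_query_runs  = [[12.0] * 50, [15.0] * 50, [11.0] * 50, [14.0] * 50]
prop_query_runs = [[30.0] * 50, [42.0] * 50, [33.0] * 50, [39.0] * 50]

queries_on_keys_result = sum(average(runs) for runs in key_query_runs)
queries_on_properties_result = sum(average(runs) for runs in prop_query_runs)
```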

Except for the 100k-entity case, I repeated each test 10 times (in 2 series, each on a new table). So each query was executed 500 times.

I compiled the results in a spreadsheet, so feel free (or even encouraged) to have a look at it and challenge my results. You know, 4, 6, 8, 10 eyes are better than 2.

[Chart: Azure Performance]

As you can see, performance scales almost linearly. Most importantly, the performance difference between queries executed on the Partition/Row Keys and those executed on entity properties seems to top out at around +180%.

[Chart: Azure Performance]

There is an obvious problem with the tests executed on 100k entities, as it doesn’t seem right that the performance difference decreases. I think it’s safe to assume that beyond 10k entities the performance difference stops growing and stays roughly constant at around +170%.

It’s interesting, however, that for 100 entities or fewer, queries on entity properties are as fast as, or even faster than, those executed on the keys. I guess this is because, with so few entities, the storage is not able to execute queries in parallel on multiple nodes.

So, to answer my initial question: with few entities there is no performance difference at all. Between 100 and 1000 entities the difference is still somewhat limited (+40%), and beyond 1000 entities it spikes up to +180%.
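For reference, the percentages translate into multipliers like this (simple arithmetic on hypothetical timings, not measured data):

```python
def overhead_percent(key_ms, prop_ms):
    """Percentage increase of a property-based query over the key-based one."""
    return (prop_ms - key_ms) / key_ms * 100

# A key-based query averaging 10 ms with a +180% difference means the same
# query on plain properties averages 28 ms, i.e. 2.8x as long.
assert overhead_percent(10, 28) == 180.0
# Likewise, +40% means 14 ms instead of 10 ms.
assert overhead_percent(10, 14) == 40.0
```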

That said, it all depends on what you have to do. If you’re serving an interactive user request, your first priority is to keep execution time as low as possible, so you should limit your non-key queries. If you’re executing a background task, then I think that, to keep your application logic simple, it’s fair to run non-optimized queries that take more time to execute, because after all, “much time” might not be too much.

About Dario Solera

Mix and match as you like: software engineer, cloud believer, (former) entrepreneur, F1 addict. You can also find me on Twitter and Google+.
This entry was posted in Software Development, Windows Azure.
  • applejuice

    We are looking at using Azure storage now, and we were considering the performance implications of full table scans.
    This info has been very helpful! Thanks

  • zzzxtreme

    I’m trying to think of real-world scenarios where Azure Table Storage is needed and has improved overall performance for large websites. Could you name some?

  • Dario Solera

    Yes of course I followed continuation tokens – I have a helper class here that does exactly that:

    The first GUID is the first GUID created.

  • Kelly Leahy

    It seems you may have hit a threshold between 50000 and 100000 that caused your partitions to be split to multiple nodes. In that case, I suppose it’s possible that the query was then able to execute with more parallelism.

    Another question, however: were you following continuations when executing your queries? If not, that could also explain the drop-off in time. If you were executing on more than one partition and didn’t actually check that you got back the result you expected, you might have been given a continuation and a zero-length result; Table Storage expected you to go to the next partition with your next query, and you didn’t. In that case, you were essentially only querying part of your table.

    By “first guid in the predefined list” do you mean first in creation order or first in alphabetical order? If you mean first in alphabetical order, that could also skew your results (if you’re ignoring continuations), as you may always hit that guid in your queries so you think you got back the full result (but actually you were given a continuation that you didn’t follow).