This bug was fixed in the Windows Azure SDK v1.5 or v1.6, so go download the newest version!

Edit: July 6th

This is a bug and it has been filed with the team that is developing the Windows Azure Client Library. More information here.

Edit: July 1st - new version of this blog post:

I think my old version (see below) was not totally clear on how to reproduce what I was trying to say. So I have made a new code snippet that shows that the ContentMD5 blob property is not populated by the Storage Client library even if the MD5 value is in the REST result. Below is the code that shows it. To try the code, just copy/paste it into a console application and change the account, storageKey and containerName to match your settings.

I should also add that I'm not doing this to show off or talk down on Microsoft's work with Windows Azure. I think all the Azure stuff is awesome!! It is just that, if this is a bug, and if it gets fixed, working with MD5 hash values on blobs will become a lot easier and faster. If I'm wrong I will take it all back :)

So I also did some more testing and I found out that the MD5 is calculated and set by the blob storage when you upload the blob. When I did a simple test of creating a new container, uploading a simple 'Hello World' txt file and doing a ListBlobs through the REST API, I got the XML back below. Note that I did not populate the ContentMD5 when I uploaded the file! If I try doing a ListBlobs method call on a CloudBlobContainer instance, the ContentMD5 is NOT populated (read more about this in my other blog post). So my best guess is that this is an error in the API of some sort.

It would be very nice if this was fixed, so we did not have to compute the MD5 hash by hand and add it to the blob's metadata. It would make code like my local folder to/from blob storage synchronization much simpler (read it here).

It turns out that this is a bug in the Windows Azure Client Library. Read more here.

Original version:

For some strange reason the ContentMD5 blob property is only populated if you call FetchAttributes on the blob. This means that you cannot obtain the value of the ContentMD5 blob property by calling ListBlobs, with any parameter settings. Having to call FetchAttributes for every blob is not feasible on a large set of blobs, because the request time of all those FetchAttributes calls adds up quickly. This means that the property cannot be used in an optimal way when doing something like the fast folder synchronization I wrote about in this blog post.

The ContentMD5 blob property is never set by the blob storage itself. And if you try to put any value in it that is not a correct 128-bit MD5 hash of the file, you will get an exception. So it has limited usage.

I have made some code to illustrate the missing population of the ContentMD5 blob property when using the ListBlobs method.

Here is the output of running the code:

GetBlobReference without FetchAttributes: Not populated
GetBlobReference with FetchAttributes: zhFORQHS9OLc6j4XtUbzOQ==
ListBlobs with BlobListingDetails.None: Not populated
ListBlobs with BlobListingDetails.Metadata: Not populated
ListBlobs with BlobListingDetails.All: Not populated

Introduction

In this blog post I will describe how to synchronize a local file system folder against a Windows Azure blob container/folder. There are many ways to do this, some faster than others. My way of doing this is especially fast if few files have been added/updated/deleted. If many are added/updated/deleted it is still fast, but uploading/downloading files to/from the blob storage will be the main time factor. The algorithm I’m going to describe was developed by me when I was implementing a non-live-editing Windows Azure deployment model for Composite C1. You can read more about the setup here. I will do a more technical blog post about this non-live-editing setup later.

Breakdown of the problem

The algorithm should only do one-way synchronization. Meaning that it either updates the local file system folder to match what’s stored in the blob container, or updates the blob container to match what’s stored in the local folder. So I will split it up into two: one for synchronizing to the blob storage and one from the blob storage.

Because the blob storage is located on another computer, we can’t compare the time stamps of the blobs against time stamps on local files. The reason for this is that the clocks of the two computers (the blob storage and our local machine) will never be 100% in sync. What we can do is use file hashes like MD5. The only problem with file hashes is that they are expensive to calculate, so we have to do this as little as possible. We can accomplish this by saving the MD5 hash in the blob's metadata and caching the hash for the local file in memory. Even if we convert the hash value to a base-64 string, holding the hash in memory for 10.000 files will cost less than 0.3 megabytes. So this scales fairly okay.
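A minimal sketch of the hashing involved (the helper name is mine, not from the original code):

```csharp
using System;
using System.Security.Cryptography;

public static class HashHelper
{
    // Computes the MD5 hash of a byte buffer and returns it as a base-64
    // string, the same format the blob storage uses for Content-MD5.
    public static string ComputeMd5Base64(byte[] data)
    {
        using (var md5 = MD5.Create())
        {
            // An MD5 hash is always 16 bytes, so the base-64 form is always
            // 24 characters -- 10.000 of them fit in well under 0.3 megabytes.
            return Convert.ToBase64String(md5.ComputeHash(data));
        }
    }
}
```

For a file you would hash a FileStream instead of a byte array; `MD5.ComputeHash` accepts a Stream overload as well.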

When working with the Windows Azure blob storage we have to take care not to do lots of requests. Especially we should take care not to do a request for every file/blob we process. Each request is likely to take more than 50 ms, and if we have 10.000 files to process, this will cost more than 8 minutes! So we should never use GetBlobReference/FetchAttributes to see if a blob exists and/or get its MD5 hash. But this is no problem, because we can use the ListBlobs method with the right options.
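As a back-of-the-envelope check of that estimate (the 50 ms per request figure is the assumption from above):

```csharp
using System;

public static class RequestCostEstimate
{
    // Total wall-clock time for one round trip per blob, done sequentially.
    public static TimeSpan TotalTime(int blobCount, double msPerRequest)
    {
        return TimeSpan.FromMilliseconds(blobCount * msPerRequest);
    }
}
```

`TotalTime(10000, 50)` gives 500 seconds, i.e. roughly 8.3 minutes, which matches the "more than 8 minutes" above.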

Semi Pseudo Algorithms

Let's start with some semi pseudo code. I have left out some methods and properties, but they should be self-explanatory enough to get an overall understanding of the algorithms. I did this so it would be easier to read and understand. Further down I’ll show the full C# code for these algorithms.

You might wonder why I store the MD5 hash value in the blob's metadata and not in the ContentMD5 property of the blob. The reason is that ContentMD5 is only populated with a value if FetchAttributes is called on the blob, which would make the algorithm perform really badly. I'll cover the odd behavior of the ContentMD5 blob property in a later blog post. Edit: Read it here.

The rest of the code

In this section I will go through the missing methods and properties from the semi pseudo algorithms above. Most of them are pretty simple and self-explanatory, but a few of them are more complex and need more attention.

LastSyncTime and Container

These are just get/set properties. Container should be initialized with the blob container that you wish to synchronize to/from. LastSyncTime is initialized with DateTime.MinValue. LocalFolder points to the local directory to synchronize to/from.

There are two versions of this method. This one is used when synchronizing to the blob storage. It uses the LastWriteTime of the file and the last time we did a sync to skip calculating the file hash of files that have not been changed. This saves a lot of time, so it's worth the complexity.
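A sketch of how this version could look (a hypothetical reconstruction, not the post's exact code -- the cache is a plain in-memory dictionary keyed by file path):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public class FileHashCache
{
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();

    public DateTime LastSyncTime { get; set; }

    public FileHashCache()
    {
        LastSyncTime = DateTime.MinValue;
    }

    public string GetFileHashFromCache(string path)
    {
        string hash;
        // If the file has not been written since the last sync, the cached
        // hash is still valid and we can skip the expensive recomputation.
        if (File.GetLastWriteTimeUtc(path) <= LastSyncTime && _cache.TryGetValue(path, out hash))
        {
            return hash;
        }

        // Otherwise (re)compute the MD5 hash of the file and cache it.
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            hash = Convert.ToBase64String(md5.ComputeHash(stream));
        }

        _cache[path] = hash;
        return hash;
    }
}
```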

This is the other version of the GetFileHashFromCache method. This one is used when synchronizing from the blob storage. The UpdateFileHash method is used for updating the file hash cache when a new hash is obtained from a blob.
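This direction can be sketched as follows (again a hypothetical reconstruction: when synchronizing from the blob storage, the authoritative hash comes from the blob's metadata, so the cache is only a plain dictionary that is updated whenever a blob is downloaded):

```csharp
using System.Collections.Generic;

public class BlobSideHashCache
{
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();

    // Returns the cached hash for the local file, or null if we have never seen it.
    public string GetFileHashFromCache(string path)
    {
        string hash;
        return _cache.TryGetValue(path, out hash) ? hash : null;
    }

    // Called when a new hash is obtained from a blob's metadata.
    public void UpdateFileHash(string path, string hash)
    {
        _cache[path] = hash;
    }
}
```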

This method returns all files in the start folder given by the LocalFolder property. It lower-cases all file paths. This is done because blob names are case sensitive, so when we compare paths returned from the method we want to compare all-lower-cased paths. When comparing paths we also use the GetLocalPath method, which translates a blob path to a local path and also lower-cases the result.
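The path translation part could look like this (a sketch with a hypothetical helper name; blob paths use '/' as separator, local paths use the platform separator):

```csharp
using System;
using System.IO;

public static class PathMapper
{
    // Translates a blob path like "folder/file.txt" into a lower-cased
    // local path under the synchronization root folder.
    public static string GetLocalPath(string localFolder, string blobPath)
    {
        string relative = blobPath.Replace('/', Path.DirectorySeparatorChar);
        return Path.Combine(localFolder, relative).ToLowerInvariant();
    }
}
```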

Both the properties (some of them) and the metadata collection of a blob can be used to store metadata for a given blob, but there are small differences between them. When working with the blob storage, the number of HTTP REST requests plays a significant role when it comes to performance, and the number of requests becomes very important if the blob storage contains a lot of small files.

There are at least three properties found in the CloudBlob.Properties property that can be used freely: ContentType, ContentEncoding and ContentLanguage. These can hold very large strings! I have tried testing with a string containing 100.000 characters and it worked. They could possibly hold a lot more, but hey, 100.000 is a lot! So all three of them can be used to hold metadata.

So, what is the difference between using these properties and using the metadata collection? The difference lies in when they get populated.

The difference shows up when using ListBlobs on a container or blob directory, depending on the values of the BlobRequestOptions object. It might not seem like a big difference, but imagine that there are 10.000 blobs, all with a metadata string value 100 characters long. That sums to 1.000.000 extra characters to send when listing the blobs. So if the metadata is not used every time you do a ListBlobs call, you might consider moving it to the Metadata collection. I will investigate the performance of these ways of storing metadata for a blob in a later blog post.