Data storage in Windows Azure

Hello again, it's time for another deep dive into Windows Azure. Today's menu features Windows Azure storage. We will look at the options for storing data across the whole platform, what is different from a standard Windows server and how to make use of the new storage types that Windows Azure offers.

Let's start by answering a question: "Where can I store my data in Windows Azure?" There are a lot of alternatives to choose from. The first is SQL Azure. As I have mentioned in one of my previous posts, SQL Azure is a cloud database very similar to Microsoft SQL Server. One of my future blog posts will focus entirely on SQL Azure, so let's move on to the next topic.
The second option is to use one of the storages of Windows Azure AppFabric. These storages are not all-purpose; they are specific to certain tasks. The Service Bus feature offers queues and topics, where you can store data used for communication between systems connected via the Service Bus. The second type of storage is the Windows Azure AppFabric cache. As its name suggests, this storage is meant for cached data. It is the cloud version of the Windows Server AppFabric product. You can use this storage for two purposes:

Cache storage – with Kentico CMS 6 you are able to override the CacheHelper class, which is used for storing and accessing cached data, so that you can write a provider for the Windows Azure AppFabric cache. You may be wondering why we didn't write it ourselves. The reason is that our caching functionality relies on cache dependencies, but the AppFabric API doesn't work with the CacheDependency object. So we decided to focus on other features rather than write a provider which right now cannot deliver the whole functionality.

Session state storage – I talked about this in my previous post: you can save session state into Windows Azure AppFabric. It's very easy to set up; you just need to add a few sections to the web.config and add references to your project.

The last option is to store data somewhere inside Windows Azure. First of all, you can store data in the local storage of an instance (a Windows Azure virtual machine). The file system is NTFS, so you can use the standard System.IO namespace to save files there. If you want to use it, you have to register the local storage in the service definition file (a LocalStorage element with a name and a size, declared under the role).

Once it is registered, you can store whatever you want under the path of that local storage. Local storage has a few significant disadvantages:

Data stored in it isn't persistent, so if your role is shut down, you lose all the data saved there.

Data is stored on only one instance, but in Windows Azure your application typically runs in a multi-machine environment. If you store data in local storage, you need to take care of synchronizing it between instances yourself.

For these reasons I recommend using this storage for temporary data or as a cache. Kentico CMS uses it for these two purposes.
On the other hand, there are three new types of storage in Windows Azure. These are more general-purpose and they are all built to be scalable.

Windows Azure blob storage

Blob is an acronym for Binary Large Object, which means any file with any structure can be saved into this storage. There are two types of objects, blobs and containers. Blobs represent files, and every blob is placed in a container. A container is very similar to a folder in a standard file system. One big difference is that the blob storage has a flat structure.

As you can see on the schema above, it has three levels – root level, container level and blob level. A container cannot contain another container. However, a blob name consists of two parts – a prefix and a name. Prefixes can be used to emulate a tree structure by storing the path to the file in them. Levels of the tree structure are defined by delimiters – the default delimiter is "/" and I personally recommend leaving it like that. The Blob storage API also offers a virtual directory mechanism which uses blob prefixes, so you can for example list the blobs in a certain virtual directory (with a certain blob prefix). For better understanding, here is an example of this feature. Let's say that we have these two blobs:

/path1/blob1.txt
/path1/path2/blob2.txt

The paths are blob prefixes and blob1.txt and blob2.txt are blob names. If you list the /path1/path2/ virtual directory you get only the blob2.txt file. If you list /path1/, the result is blob1.txt - there will be no /path2/blob2.txt in the result listing (but you can change this by a special parameter). Virtual directories are defined by the prefix; there is no way to create or delete a virtual directory and every virtual directory contains at least one blob.
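To make the listing rules above concrete, here is a minimal sketch in Python. It is an in-memory model, not the storage API: the function name `list_blobs` and its `flat` switch are my own assumptions, used only to mimic the behavior described for the two example blobs.

```python
def list_blobs(blob_names, prefix, delimiter="/", flat=False):
    """Model of a virtual directory listing over a flat set of blob names.

    Hypothetical helper: only blobs that sit directly under `prefix` are
    returned unless `flat` is True, in which case deeper blobs are included.
    """
    matches = []
    for name in sorted(blob_names):
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # A delimiter in the remainder means the blob lives in a deeper
        # virtual directory, so a non-flat listing skips it.
        if flat or delimiter not in remainder:
            matches.append(name)
    return matches


blobs = ["/path1/blob1.txt", "/path1/path2/blob2.txt"]
print(list_blobs(blobs, "/path1/path2/"))       # only blob2.txt
print(list_blobs(blobs, "/path1/"))             # only blob1.txt
print(list_blobs(blobs, "/path1/", flat=True))  # both blobs
```

Note that nothing in the model creates or deletes a directory: a virtual directory exists only as long as at least one blob name starts with its prefix.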
Both blobs and containers can have several attributes specified. You can also define your custom attributes. The most important container attributes are access permissions. You have three choices:

Full access – in this case, anyone can list and access blobs in the container.

Blob only access – anyone can read blob content but cannot list the contents of the container.

No public access – access to blobs and containers is restricted to authorized requests.

This property is not available for blobs, but you can use the shared access signature mechanism to set up access rights at the blob level.
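The three container access levels can be summarized as a small decision function. This is only a sketch of the rules listed above, in Python; the enum `ContainerAccess`, the operation names and `is_allowed` are hypothetical and do not come from the storage client library.

```python
from enum import Enum


class ContainerAccess(Enum):
    FULL = "full"            # anyone can list and read blobs
    BLOB_ONLY = "blob"       # anyone can read blobs, but not list them
    NO_PUBLIC = "private"    # every request must be authorized


def is_allowed(access, operation, authorized=False):
    """Return True if the request would be accepted under `access`.

    `operation` is one of the hypothetical names "read_blob",
    "list_blobs" or "write_blob".
    """
    if authorized:
        return True
    if operation == "read_blob":
        return access in (ContainerAccess.FULL, ContainerAccess.BLOB_ONLY)
    if operation == "list_blobs":
        return access is ContainerAccess.FULL
    # Anything that modifies data always needs authorization.
    return False
```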
The most used blob properties are the last write time (which is in UTC, since all Windows Azure datacenters use this time) and the ETag. The ETag identifies the current version of the blob: it changes every time the blob is modified. If you want to cache files on single instances (this is what Kentico CMS does), this property is very helpful, because comparing ETags tells you whether a cached copy is still up to date.
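Here is a rough sketch of how a per-instance file cache could use the ETag. It is written in Python against two callables that stand in for the storage calls, so `LocalBlobCache`, `fetch_etag` and `download` are all assumptions for illustration and not how Kentico CMS is implemented.

```python
class LocalBlobCache:
    """Keeps blob content on the instance and re-downloads only on ETag change."""

    def __init__(self, fetch_etag, download):
        self._fetch_etag = fetch_etag  # callable: blob name -> current ETag
        self._download = download      # callable: blob name -> blob content
        self._entries = {}             # blob name -> (etag, content)
        self.downloads = 0             # counter, only here to show the effect

    def get(self, name):
        current_etag = self._fetch_etag(name)
        cached = self._entries.get(name)
        if cached is not None and cached[0] == current_etag:
            return cached[1]           # cached copy is still current
        content = self._download(name)
        self.downloads += 1
        self._entries[name] = (current_etag, content)
        return content


# In-memory stand-in for the blob storage: name -> (etag, content).
storage = {"logo.png": ("etag-1", b"first version")}
cache = LocalBlobCache(lambda n: storage[n][0], lambda n: storage[n][1])

cache.get("logo.png")
cache.get("logo.png")                  # served from the local copy
storage["logo.png"] = ("etag-2", b"second version")
cache.get("logo.png")                  # ETag changed, so it is downloaded again
print(cache.downloads)
```

The check still costs one small request per read, but the blob content itself travels only when it has actually changed.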
One thing I personally miss the most is a property/method which returns whether a blob exists or not. There are two possible workarounds – listing the contents of a container and comparing them with the blob name or a faster way – using FetchAttributes. The code can look like this:
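A minimal sketch of the pattern in Python, using an in-memory stand-in instead of the real storage client. `FakeContainer`, `BlobNotFoundError` and `blob_exists` are hypothetical names; only the idea of calling a fetch-attributes operation and catching the failure comes from the text.

```python
class BlobNotFoundError(Exception):
    """Raised by the stand-in when the requested blob is missing."""


class FakeContainer:
    """In-memory stand-in for a blob container (assumption, not the real API)."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def fetch_attributes(self, name):
        # Mirrors the behavior described in the text: the call fails
        # when the blob does not exist.
        if name not in self._blobs:
            raise BlobNotFoundError(name)
        return {"length": len(self._blobs[name])}


def blob_exists(container, name):
    """Return True if the blob exists, using a single attribute request."""
    try:
        container.fetch_attributes(name)
    except BlobNotFoundError:
        return False
    return True


container = FakeContainer({"images/logo.png": b"\x89PNG"})
print(blob_exists(container, "images/logo.png"))
print(blob_exists(container, "images/missing.png"))
```

The helper catches only the "not found" error on purpose, so that other failures (for example a network problem) are not silently reported as a missing blob.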

This code tries to do the FetchAttributes operation. If the blob does not exist, an exception is thrown. This is the best practice to find out whether a blob exists or not.

Windows Azure drive

There are two ways to use the blob storage – use the API directly (like in the example above), or use the Windows Azure drive. Let's talk about the second choice first. This feature was designed to make it easier to move your applications to Windows Azure. You just mount a part of the blob storage as a disk in your Azure virtual machine and store your files there. This disk behaves the same way as any other NTFS disk and, of course, you can access it with the System.IO namespace. Internally, data is stored in a VHD file (this file is represented as a blob in the storage), so you can download this file and mount it locally. The guys from Microsoft modified a standard disk access driver to work with blob storage.
If you want to create and mount a Windows Azure drive, you just call this code:

In this example I used the storage client library. First I connected to the Windows Azure storage. Then I initialized a local storage for caching (you must also add the local storage registration to the service definition file). After that I created a container and a blob for the Azure drive. The next step was to create the Azure drive. At the end I mounted the created drive and stored the drive letter in a variable. From this moment you can use the Azure drive the same way as any other NTFS disk.
This sounds really great, but the Windows Azure drive has one big disadvantage. You can mount a disk for read/write access on only one instance. Since applications on Windows Azure typically run on multiple instances, this is not what you want. There are ways to work around it, for example:

One instance has write access while the other instances use snapshots of the disk in read-only mode. This solution is suitable only if the instances can work with slightly delayed data, because it takes some time to create new snapshots. Also, you need to implement a mechanism to send data from the "read-only instances" to the "write access instance". And what if the instance with write access fails? It takes 3 to 5 minutes to start a new instance (depending on the size of the application), so no data can be written during that interval.

Every instance has its own Windows Azure drive. But then the data is cloned multiple times (depending on the number of your instances). You also need to synchronize file changes between instances. Another challenge is to ensure the right behavior when a new role instance starts. You definitely need to clone an Azure drive; the question is from where. The instance you choose can be slightly out of sync at that moment, and the cloning process can take some time as well. During this time your new instance won't be able to synchronize data or even read it.

In my opinion you can use the Windows Azure drive as a temporary solution when moving your application to Windows Azure, but nothing else. I personally don't know of any Windows Azure project that scales out on Windows Azure and uses the Windows Azure drive for storing its files at the same time.

Blob storage API

The basic storage API is REST based. Microsoft chose this to keep it independent of any programming language. You can send two types of requests – unauthorized and authorized. Unauthorized requests succeed only if the given blob/container allows public access, and they can only perform read operations; you cannot modify any data without authorization. For example, a call which requests a list of containers looks like this:
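As a sketch, this Python snippet only builds the URL of such a List Containers request and does not send anything. The account name `myaccount` and the helper `list_containers_url` are placeholders of mine; the Authorization header that a signed request carries is left out.

```python
from urllib.parse import urlencode


def list_containers_url(account, prefix=None, max_results=None):
    """Build the URL of a List Containers call for the given storage account."""
    params = {"comp": "list"}
    if prefix:
        params["prefix"] = prefix            # only containers starting with prefix
    if max_results:
        params["maxresults"] = str(max_results)
    return f"https://{account}.blob.core.windows.net/?{urlencode(params)}"


# The request itself is a plain HTTP GET on this URL.
print("GET " + list_containers_url("myaccount"))
```

The response comes back as an XML document, which is exactly the part you have to parse yourself when you work against the REST API directly.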

I'm pretty sure that most of you don't want to program directly against the REST API, because you have to create and parse HTTP requests in your code, which is not comfortable. Fortunately there is a client library for some languages, including C#. The C# client library is a wrapper around the REST API; it is written and supported by Microsoft and it provides you with objects for working with the storage. The same listing using the client library looks like this:

As you can see, the result is a collection of objects. Very nice, isn't it?

Other storages

Windows Azure offers two other storage types – queues and tables. Queue storage is basically a set of queues where you can send messages. It contains two types of objects – queue and message.

The ideal behavior is FIFO (first in, first out), but the order of delivery is not guaranteed (as shown in the diagram). A single message can be 8 KB at most. You may ask, "What is the purpose of this storage?" You can use it for communication between two different roles. The most basic scenario, which is usually demonstrated, is a web front-end running as a web role where users upload images, and a worker role which creates thumbnails from them. Every time a user uploads an image, a message with the image information is sent from the web role to the worker role.
The last type of storage is table storage. In this type, data is organized in tables and rows like you're used to from SQL servers. But there are significant differences. First of all, table storage in Windows Azure is not relational, which means constraints like foreign keys cannot be specified. Second, tables have no schema; every row can contain a different number and different types of columns. The table storage is similar to a hash table – you have a unique key which accesses a certain row (value).
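The thumbnail scenario can be sketched in a few lines of Python. This is an in-memory model under my own assumptions (`ThumbnailQueue`, the message fields and the blob names are hypothetical), meant to show two things from the text: the message carries only information about the image, and a message is rejected when it exceeds the size limit.

```python
import json
from collections import deque

MAX_MESSAGE_BYTES = 8 * 1024  # size limit stated in the text


class ThumbnailQueue:
    """In-memory stand-in for a queue between a web role and a worker role."""

    def __init__(self):
        self._messages = deque()

    def send(self, payload):
        body = json.dumps(payload).encode("utf-8")
        if len(body) > MAX_MESSAGE_BYTES:
            raise ValueError("message exceeds the queue size limit")
        self._messages.append(body)

    def receive(self):
        """Return the next message, or None if the queue is empty."""
        if not self._messages:
            return None
        return json.loads(self._messages.popleft().decode("utf-8"))


queue = ThumbnailQueue()

# Web role: the image itself goes to blob storage, the queue only gets a reference.
queue.send({"container": "uploads", "blob": "photos/cat.jpg"})

# Worker role: pick up the reference and create the thumbnail from the blob.
message = queue.receive()
print(message["container"], message["blob"])
```

Unlike this model, which is strictly first in, first out, the real queue does not guarantee the order of delivery, so the worker should not depend on it.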
A key consists of two parts – a partition key and a row key. The partition key can help you with the performance of your table storage. Data from one partition is physically stored in one place. Partitions are replicated internally across different parts of the datacenter, and partitions which are used more frequently are replicated to a larger number of locations, so access to them is faster. Because of this, it's not recommended to use the same partition key for your entire table. Instead, you should divide your table into partitions by usage frequency.
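To illustrate the key structure, here is a small in-memory model in Python. The class `Table`, its methods and the sample data are assumptions made for this sketch; the only things taken from the text are that a row is addressed by a partition key plus a row key and that rows in one table do not share a schema.

```python
from collections import defaultdict


class Table:
    """Schemaless table model: (partition key, row key) -> properties."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # partition key -> {row key: row}

    def insert(self, partition_key, row_key, **properties):
        self._partitions[partition_key][row_key] = properties

    def get(self, partition_key, row_key):
        """Look up a single row by its full key, or return None."""
        return self._partitions.get(partition_key, {}).get(row_key)

    def query_partition(self, partition_key):
        """Return all rows of one partition, which are stored together."""
        return list(self._partitions.get(partition_key, {}).values())


documents = Table()
# Rows in the same table can have different columns.
documents.insert("news", "2011-10-01", title="Release", views=1200)
documents.insert("news", "2011-10-05", title="Webinar")
documents.insert("archive", "2009-01-15", title="Old post", author="admin")

print(documents.get("news", "2011-10-05"))
print(len(documents.query_partition("news")))
```

In this example the frequently read "news" rows and the rarely read "archive" rows end up in different partitions, which is the kind of split by usage frequency described above.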

Congratulations to those of you who are reading this sentence! You have reached the end of my post. This was the first part of my talk about storages; in my next blog post I'm going to describe which storages are used by Kentico CMS and why we decided to use them.

I'm a fan of cloud computing (primarily Windows Azure) and I really like to dig into web application security. My blog is focused on everything related to Kentico, .NET Framework, Cloud platforms and web application security.
