The Windows Azure Platform has a rich diagnostics infrastructure to enable logging and performance monitoring in the cloud.

When you run a cloud app locally in the DevFabric, you can view diagnostic trace output in the console window that is part of the DevFabric UI. For example, say I have the line

System.Diagnostics.Trace.TraceInformation("Old Trace called at {0}.", DateTime.Now);

That line gives a result like this in the DevFabric UI:

If you can’t see the DevFabric UI, you can enable it here after starting your cloud app from Visual Studio:

Using the DevFabric UI is the most basic way of viewing Trace output. Looking at each individual console window doesn’t scale well if you have many instances and, furthermore, the console window of an instance is not available in the cloud. For this, there is a special TraceListener-derived class: Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitorTraceListener. It sends the trace output to a Windows Azure diagnostics monitor process that can store the messages in Windows Azure Storage. Here is a look at a trace message using the Azure Storage Explorer:
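The listener is wired up through configuration; in a cloud service project the Visual Studio template adds a registration along these lines to web.config (the version number and public key token shown here are from the 1.x SDK and may differ for your SDK version):

```xml
<system.diagnostics>
  <trace>
    <listeners>
      <!-- Forwards Trace output to the Windows Azure diagnostics monitor process. -->
      <add name="AzureDiagnostics"
           type="Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitorTraceListener, Microsoft.WindowsAzure.Diagnostics, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" />
    </listeners>
  </trace>
</system.diagnostics>
```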

However, note that a similar message, “TraceSource called at …”, didn’t show up in the DevFabric UI. You might wonder what is going on, and also why I want to use a TraceSource instead of Trace in the first place. Because this MSDN article states:

One of the new features in the .NET Framework version 2.0 is an enhanced tracing system. The basic premise is unchanged: tracing messages are sent through switches to listeners, which report the data to an associated output medium. A primary difference for version 2.0 is that traces can be initiated through instances of the TraceSource class. TraceSource is intended to function as an enhanced tracing system and can be used in place of the static methods of the older Trace and Debug tracing classes. The familiar Trace and Debug classes still exist, but the recommended practice is to use the TraceSource class for tracing.
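In code, moving from the static Trace methods to a TraceSource looks like this (the source name "MyWebRole" is my own choice for illustration; it would normally match a &lt;source&gt; entry in web.config):

```csharp
using System;
using System.Diagnostics;

class Program
{
    // A named trace source; "MyWebRole" is just an example name.
    private static readonly TraceSource Source =
        new TraceSource("MyWebRole", SourceLevels.All);

    static void Main()
    {
        // Old-fashioned static Trace call.
        Trace.TraceInformation("Old Trace called at {0}.", DateTime.Now);

        // The TraceSource equivalent: an event type plus a numeric event id.
        Source.TraceEvent(TraceEventType.Information, 0,
            "TraceSource called at {0}.", DateTime.Now);
    }
}
```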

The reason you don’t see the message for the TraceSource in the DevFabric UI is that the DevFabric magically adds a special TraceListener for the “old fashioned” Trace class, but not for your TraceSource instance. I put together a cloud app solution (Visual Studio 2010) that shows this through a simple web role. This web role has the configuration you see above in its web.config file. If you run this simple web role in the DevFabric you’ll see:

Note that Trace has a Microsoft.ServiceHosting.Tools.DevelopmentFabric.Runtime.DevelopmentFabricTraceListener instance registered, while the TraceSource doesn’t. To remedy this, I’ve created a small class that adds a DevFabricTraceListener to a TraceSource if one is registered for Trace:
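The class itself isn’t reproduced in this post, but a minimal sketch of the idea could look like this (class and method names are mine, not necessarily those in the sample solution):

```csharp
using System.Diagnostics;

/// <summary>
/// Copies the DevFabric trace listener that the DevFabric registers on the
/// static Trace class over to a TraceSource, so that TraceSource output also
/// shows up in the DevFabric UI. Outside the DevFabric it does nothing.
/// </summary>
internal static class DevFabricTraceListenerFixer
{
    private const string DevFabricListenerTypeName =
        "Microsoft.ServiceHosting.Tools.DevelopmentFabric.Runtime.DevelopmentFabricTraceListener";

    public static void CopyDevFabricListenerTo(TraceSource traceSource)
    {
        foreach (TraceListener listener in Trace.Listeners)
        {
            // Match by type name so we don't need a reference to the DevFabric assembly.
            if (listener.GetType().FullName == DevFabricListenerTypeName
                && !traceSource.Listeners.Contains(listener))
            {
                traceSource.Listeners.Add(listener);
            }
        }
    }
}
```

Calling this once at role startup, for each TraceSource you use, makes the TraceSource messages visible in the DevFabric console as well.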

PowerPivot: a managed self-service analysis solution that empowers end users to access, analyze and share data across the enterprise in an IT managed environment using Excel 2010 and SharePoint Server 2010.

Master Data Services: helps IT organizations centrally manage critical data assets companywide and across diverse systems, enables more people to securely manage master data directly, and ensures the integrity of information over time.

Report Builder 3.0: report authoring component with support for geospatial visualization. This new release provides capabilities to further increase end user productivity with enhanced wizards, more powerful visualizations, and intuitive authoring.

StreamInsight: a low latency complex event processing platform to help IT monitor, analyze and act on the data in motion to make more informed business decisions in near real-time.

The Microsoft Enterprise Library has always been one of the most popular things to come out of the patterns & practices team. Yesterday p&p reached a major milestone by releasing version 5.0 of EntLib.

The improvements are too numerous to sum up here, but let me mention one: this release has full .NET 3.5 SP1 and .NET 4 compatibility and works great from both Visual Studio 2008 SP1 and Visual Studio 2010 RTM.

Long before I joined the company, Brad Abrams was one of the first people who put a human face on Microsoft for me.

I felt sad when I read his blog post announcing that Brad is leaving Microsoft. But at the same time I feel happy about all that he has accomplished for the company. Check his blog post for everything he has been involved in. Not on the list, but no less important in my mind, is the book Framework Design Guidelines that he co-authored.

As you will probably know, Visual Studio 2010 and .NET Framework 4 will RTM on April 12, 2010 and will be available for download on MSDN Subscriptions Downloads the same day.

The Windows Azure team is committed to making .NET Framework 4 available in Windows Azure within 90 days of the RTM date.

A lesser known fact is that the latest available Windows Azure build already has a .NET 4 version installed, namely the RC bits. Although this version cannot be used to run applications on (because .NET 4 is not yet exposed in the Windows Azure dev tools), you can use this build to test if the presence of .NET 4 has impact on existing .NET 3.5 apps running on Windows Azure.

[Updated 2010-04-25: The lightweight view for MSDN has been fixed, so the content of this post is no longer relevant.] [Updated 2010-04-07: Updated one screenshot, because the lightweight view has changed.]

MSDN has recently switched the default online view mode to “lightweight view”. Although this dramatically improves loading times for the online docs, this currently breaks parts of the Windows Azure Documentation. For example, when you view http://msdn.microsoft.com/en-us/library/ee393295(v=MSDN.10).aspx in lightweight mode, the namespaces are not hyperlinks and the menu is broken:

To fix this, you have to click on the “Preferences” link in the upper right corner of the page:

Choose “Classic” on the page that appears:

Click OK, and you’ll get back to a view that actually works:

We are working on getting the “lightweight view” fixed with this part of the Azure documentation.

Last week we published a new major version of the Visual Studio Performance Testing Quick Reference Guide. The effort of creating this document was led by Geoff Gray. Geoff is a senior consultant in the Microsoft Services Labs in the US who specializes in performance testing using Visual Studio. I was part of the Rangers team that contributed articles to this guide.

From time to time I do performance testing engagements. Last week, for example, I worked for a customer and load tested their new Internet-facing website before it goes live later this month. Performance testing uncovers issues in the software and the infrastructure that might otherwise go unnoticed until the public hits the site. And that is a bad time to discover performance issues.

Creating and executing real-world performance tests can be tricky, and this quick reference guide gives lots of tips and tricks from consultants in Microsoft Services and the Visual Studio product team on how to tackle challenges you might encounter.

Visual Studio has a rich extensibility model for performance testing and that’s why it is interesting for development minded people like me. You can create plug-ins in C# to do advanced parameterization of web tests and load tests (example on MSDN). The QRG has lots of examples of when and how this is useful.
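As a taste of that extensibility model, a minimal request plug-in might look like this (the class name and context parameter name are my own; compiling it requires a reference to the Microsoft.VisualStudio.QualityTools.WebTestFramework assembly):

```csharp
using Microsoft.VisualStudio.TestTools.WebTesting;

// Before each request, stores a unique value in the web test context.
// Recorded requests can then reference it as {{UniqueValue}} to
// parameterize form fields or query strings with fresh data per iteration.
public class UniqueValuePlugin : WebTestRequestPlugin
{
    public override void PreRequest(object sender, PreRequestEventArgs e)
    {
        e.WebTest.Context["UniqueValue"] =
            System.DateTime.UtcNow.Ticks.ToString();
    }
}
```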

Load testing is part of Visual Studio 2008 Team Suite or Test Edition and of Visual Studio 2010 Ultimate. With VS2010, if you need to simulate more than 1000 users, you can purchase “virtual user packs”. This will enable you to simulate higher loads using a test rig consisting of one test controller and several test agents. Previously with VS2008 you had to license each agent. You can’t be sure in advance how many users you can simulate with one agent, because it depends on many factors, like hardware and the complexity of the application and the test itself.

Many organizations forgo the effort of performance testing and take a “wait and see” stance. This reactive approach can lead to costly repairs and bad publicity. My advice is to take a proactive stance and be confident about the performance of your application before it goes live. In many cases there are quick wins to improve the performance by enabling better caching (blob caching for SharePoint). Even simple things like enabling HTTP compression are still often forgotten.

My first blog post with the title UPC sucks is the #1 hit on Bing and the #5 hit on Google. Unfortunately it is time to add another such post.

UPC is a cable company that has a monopoly on cable tv in large parts of the Netherlands. I have two UPC mediaboxes for watching digital tv and using the UPC On Demand service. When they work, the picture quality is great. If they don’t work, they cause you major headaches and an endless stream of calls to the UPC helpdesk. You have to dial a paid phone number in order to speak to UPC and the waiting time is between 5 and 10 minutes in my experience.

Since February 2009, I have had two mediaboxes: one is a Philips HD-DVR and the other is a Thompson SD without DVR. On at least three occasions I have had “access denied” problems on all digital tv channels. Resolving these problems took multiple calls to the helpdesk each time. The usual chain of recommended troubleshooting is:

Recheck the cabling, unless you insist that you can see the “access denied” error message on the screen, which proves there is a working connection between your mediabox and the tv.

Take the power cable out for 30 seconds and then reinsert it. The reboot takes at least a couple of minutes, which you pay for of course, because you don’t want to hang up the phone and end up at the back of the queue when you need to call again because the problem is not resolved.

Reset the box to factory settings. You lose all customizations (like favorite channels). This takes about five minutes.

Press some button on the box while reinserting the power cable (variant of step 2) to force a software update. This takes over five minutes before the box is usable again.

I am sure these steps resolve 90% of the problems for most customers, because the mediabox is an unstable piece of crap. I try these steps before calling the helpdesk to save some money on the call. But for the remaining 10% of the cases it doesn’t work. And actually if the helpdesk had more intelligent scripts or more intelligent people, they could tell in advance what types of problems are not resolved by these steps. The “access denied” case is such a class of problem. Some helpdesk employees realize this, if you insist you know the problem is on their end. Others just want to send a mechanic to check your cabling for which you have to stay at home for at least half a day.

What sometimes happens is that the mediabox, for unknown reasons, loses its authorization to view all channels. The resolution is that the UPC server should push the authorization again to the box. Some employees claim they can’t do this and others that this takes up to 24 hours to take effect, but that is rubbish. If you happen to get a decently skilled employee on the phone, they should be able to solve the problem instantly while you wait.

In May 2009 the On Demand service stopped working on both of my mediaboxes. After selecting a video and pressing OK, error code CU103 or VD103 appeared with the message that I should call the helpdesk. In most cases this symptom can be resolved by rebooting the box (step 2 above). After being online for a couple of days, the box has the tendency to lose its IP address and is not smart enough to reacquire it. You can check this by pressing the red button on channel 999. This performs a connectivity check. In my case the result was “congratulations! you have a connection with the UPC network” and the next screen showed all green statuses with IP addresses and all. When I called the helpdesk, they insisted that it must be a coax cabling problem on my end because steps 2 to 4 didn’t fix the problem. I didn’t believe this, because I have good cabling from Hirschmann. Nevertheless, as I was out of options, I agreed to stay at home for a mechanic. The guy (from Centric, hired by UPC) was no On Demand expert and repeated steps 2 to 4. He measured the signal strength, which was fine as I expected. He did have one extra trick up his sleeve:

After the reset to factory settings enter the wrong region code so you connect with a server in another region. Try to start On Demand. Switch back to original region. Retry.

This didn’t work. The mediaboxes otherwise seemed to work just fine and the chance of two in one home breaking down on the same day is pretty slim, so they were not replaced. The mechanic left without being able to fix the problem. He concluded that the problem had to be at the UPC server end. I didn’t hear back from UPC for a couple of days and called them again. A new case would be opened and I heard something about “frequency fine-tuning on the server end”. A couple of days later On Demand magically started working, but I never heard if and what was fixed by UPC.

Fast forward to August 7th, 2009. On Demand stopped working again with the same symptoms as in May. So I called the helpdesk. They always apologize for the inconvenience and claim they will solve your problem. Their script and training includes friendliness. But it turns out that for complex On Demand problems they can only send e-mail messages to a special UPC unit that I dub the “black hole”. There is only one-way communication from the helpdesk to this unit possible. They promise that the unit will call you back, but they never do. Apparently the workflow is that the customer has to call the help desk again if the issue is not resolved after a couple of days and his/her patience runs out after not hearing anything from UPC. The help desk goes through exactly the same cycle again and just opens a new case for the “black hole” unit. Some employees have the nerve to claim that the unit will look into the issue the same evening. Others say they cannot state any reasonable time for resolution or feedback: “could be longer than a week”. There is only one case where the helpdesk called me back (not the black hole unit). The only thing the employee could tell me was that the e-mail had been sent and he would call again after the weekend to give another status update. That call never came.

Randomly in this endless sequence of calls to the helpdesk an employee will claim again that they need to send a mechanic. After telling them the whole history again, I could oftentimes convince them that the problem is very likely not on my end and I don’t want to stay at home for a day for a mechanic who can’t fix the problem. My mediaboxes have a proper return signal. UPC can measure this remotely. UPC Interactive works, so the TCP/IP connection is obviously not the problem.

Tonight I got into this whole discussion with the help desk again. I was ready to give in, so I asked “When can you send a mechanic?” The first available option was August 26th, more than a week from now. And that for a problem that started on August 7th. And no compensation whatsoever if the mechanic isn’t able to fix the problem, because it isn’t at my end of the cable. UPC has a cash back policy if they can’t fix a problem in 24 hours. But On Demand is not included in this policy, because they claim it is a “free” service. Of course it isn’t free. You can’t get it unless you pay money for a digital television pack and mediabox. On Demand is a large part of their marketing and an important motivator for me to pay extra for digital television.

Tomorrow I am going to call UPC again to file a formal complaint. This blog post serves as a public statement that I can refer to, to add some extra weight to my complaint.

If you managed to read this far, thanks for bearing with me.

If you consider becoming a UPC customer, be prepared to buy extra aspirin.

Yesterday I worked on a new version of my FlickrMetadataSynchr tool and published the 1.3.0.0 version on CodePlex. I wasn’t really planning on creating a new version, but I was annoyed by the old version in a new usage scenario. When you have an itch you have to scratch it! And it is always good to catch up on some programming if recent assignments at work don’t include any coding. So what caused this itch?

About two weeks ago I got back from my holiday in China with about 1,500 pictures on my 16 GB memory card. I always make a first selection immediately after taking a picture, so initially there were lots more. After selection and stitching panoramic photos, I managed to get this down to about 1,200 pictures. Still a lot of pictures. But storage is cheap, tagging makes search easy, so why throw away any more? One of the perks of a Pro account on Flickr is that I have unlimited storage, so I uploaded 1,173 pictures (5.54 GB). This took over 12 hours because Flickr has limited uploading bandwidth.

Adding metadata doesn’t stop at tagging pictures. You can add a title, description and geolocation to a picture. Sometimes this is easier to do on your local pictures, and sometimes I prefer to do it on Flickr. The FlickrMetadataSynchr tool that I wrote is a solution to keeping this metadata in sync. You should always try to stay in control of your data, so I keep backups of my e-mail stored in the “cloud” and I store all metadata in the original picture files on my hard drive. Of course I backup those files too. Even offsite by storing an external hard drive outside my house.

Back to the problem. Syncing the metadata for 1,173 pictures took an annoyingly long time. The Flickr API has some batch operations, but for my tool I have to fetch metadata and update metadata for pictures one by one. So each fetch and each update uses one HTTP call. Each operation is not unreasonably slow, but when you add latency to the mix it adds up to slow performance if you do everything sequentially.

Imperative programming languages like C# promote a sequential way of doing things. It is really hard to exploit multiple processor cores by splitting up work so that it can run in parallel. You run into things like data concurrency for shared memory, coordinating results and exceptions, making operations cancellable, etc. Even with a single processor core, my app would benefit from exploiting parallelism because the processor spends most of its time waiting on the result of the HTTP call. This time can be utilized by creating additional calls or processing results of other calls. Microsoft has realized that this is hard work for a programmer and great new additions are coming in .NET Framework 4.0 and Visual Studio 2010. Things like the Task Parallel Library and making debugging parallel applications easier.

However, these improvements are still in beta and not usable yet for production software like my tool. I am not the only user of my application and “xcopy deployability” remains a very important goal for me. For example, the tool does not use .NET 3.5 features and only depends on .NET 3.0. This is because Windows Vista comes with .NET 3.0 out of the box, while .NET 3.5 requires an additional hefty install. I might make the transition to .NET 3.5 SP1 soon, because it is now pushed out to all users of .NET 2.0 and higher through Windows Update.

So I added parallelism the old-fashioned way: by manually spinning up threads, locking shared data structures appropriately, propagating exception information through callbacks, making asynchronous processes cancellable, waiting on all worker threads to finish using WaitHandles, etc. I don’t use the standard .NET thread pool for queuing work because it is tuned for CPU-bound operations. I want fine-grained control over the number of HTTP connections that I open to Flickr. A reasonable number is a maximum of 10 concurrent connections. This gives me almost ten times the original speed for the Flickr fetch and update steps in the sync process. Going any higher puts me at risk of being seen as launching a denial-of-service attack against the Flickr web services.
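Stripped of the Flickr specifics, the throttling pattern boils down to something like this sketch (the class and names are mine, and the doubling of an integer stands in for the actual HTTP call):

```csharp
using System.Collections.Generic;
using System.Threading;

// Processes work items on manually created threads, but never runs more than
// MaxConcurrency items at once. A semaphore plays the role of the connection
// throttle; a ManualResetEvent signals when all workers are done.
internal class ThrottledWorker
{
    private const int MaxConcurrency = 10;
    private readonly Semaphore _throttle = new Semaphore(MaxConcurrency, MaxConcurrency);
    private readonly object _resultsLock = new object();
    private readonly List<int> _results = new List<int>();

    public IList<int> ProcessAll(IList<int> items)
    {
        int pending = items.Count;
        using (ManualResetEvent allDone = new ManualResetEvent(pending == 0))
        {
            foreach (int item in items)
            {
                _throttle.WaitOne();          // block until a "connection" is free
                int captured = item;          // capture the loop variable per thread
                Thread worker = new Thread(delegate()
                {
                    try
                    {
                        int result = captured * 2;   // stand-in for the HTTP call
                        lock (_resultsLock) { _results.Add(result); }
                    }
                    finally
                    {
                        _throttle.Release();
                        if (Interlocked.Decrement(ref pending) == 0) { allDone.Set(); }
                    }
                });
                worker.IsBackground = true;
                worker.Start();
            }
            allDone.WaitOne();                // wait for all workers to finish
        }
        return _results;
    }
}
```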

If you want to take a look at my source code, you can find it at CodePlex. The app was already nicely factored, so I didn’t have to rearchitect it to add parallelism. The sync process was already done on a background thread (albeit sequentially) in a helper class, because you should never block the UI thread in WinForms or WPF applications. The app already contained quite a bit of thread synchronization code. The new machinery is contained in the abstract generic class AsyncFlickrWorker<TIn, TOut>. Its signature is

/// <summary>
/// Abstract class that implements the machinery to asynchronously process metadata on Flickr. This can either be fetching metadata
/// or updating metadata.
/// </summary>
/// <typeparam name="TIn">The type of metadata that is processed.</typeparam>
/// <typeparam name="TOut">The type of metadata that is the result of the processing.</typeparam>
internal abstract class AsyncFlickrWorker<TIn, TOut>

It has the following public method

/// <summary>
/// Starts the async process. This method should not be called when the asynchronous process is already in progress.
/// </summary>
/// <param name="metadataList">The list with <typeparamref name="TIn"/> instances of metadata that should
/// be processed on Flickr.</param>
/// <param name="resultCallback">A callback that receives the result. Is not allowed to be null.</param>
/// <returns>Returns a <see cref="WaitHandle"/> that can be used for synchronization purposes. It will be signaled when
/// the async process is done.</returns>
public WaitHandle BeginWork(IList<TIn> metadataList, EventHandler<AsyncFlickrWorkerEventArgs<TOut>> resultCallback)

These async workers are used by the FlickrHelper class (BTW: this class has grown a bit too big, so it is a likely candidate for future refactoring). Its method that calls async workers is generic and has this signature:

This method contains an anonymous delegate that acts as the result callback for the async workers. Generics and anonymous delegates make multithreaded life bearable in C# 2.0. Anonymous delegates allow you to use local variables and fields of the containing method and class in the callback method, and thus easily access and change those to store the result of the worker thread. Of course, make sure you lock access to shared data appropriately, because multiple threads might call back simultaneously to report their results.
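Here is a condensed, self-contained illustration of that callback pattern (FakeWorker and CallbackDemo are hypothetical stand-ins, not the actual classes from the tool):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Minimal stand-in for an async worker that reports its result via a callback.
internal class FakeWorker
{
    private readonly int _input;
    public FakeWorker(int input) { _input = input; }

    public void BeginWork(Action<int> resultCallback)
    {
        Thread t = new Thread(delegate() { resultCallback(_input * _input); });
        t.IsBackground = true;
        t.Start();
    }
}

internal static class CallbackDemo
{
    public static List<int> RunAll(IList<int> inputs)
    {
        List<int> results = new List<int>();
        object resultsLock = new object();
        int remaining = inputs.Count;
        using (ManualResetEvent done = new ManualResetEvent(remaining == 0))
        {
            foreach (int input in inputs)
            {
                // The anonymous delegate closes over results, resultsLock and
                // remaining, so each worker reports back into local state.
                new FakeWorker(input).BeginWork(delegate(int result)
                {
                    lock (resultsLock) { results.Add(result); }
                    if (Interlocked.Decrement(ref remaining) == 0) { done.Set(); }
                });
            }
            done.WaitOne();
        }
        return results;
    }
}
```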

And somewhere in 2010 when .NET 4.0 is released, I could potentially remove all this manual threading stuff and just exploit Parallel.For 😉
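For the curious, with the Task Parallel Library the throttled loop could shrink to something like this sketch (ProcessPhoto is a hypothetical stand-in for the per-picture Flickr call):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class ParallelSketch
{
    // Hypothetical stand-in for the per-picture Flickr HTTP call.
    static int ProcessPhoto(int id) { return id * 2; }

    static void Main()
    {
        int[] photoIds = Enumerable.Range(1, 20).ToArray();
        var results = new ConcurrentBag<int>();

        // MaxDegreeOfParallelism caps the concurrency, replacing the manual
        // semaphore throttling at 10 concurrent connections.
        Parallel.ForEach(
            photoIds,
            new ParallelOptions { MaxDegreeOfParallelism = 10 },
            id => results.Add(ProcessPhoto(id)));

        Console.WriteLine(results.Count);   // prints 20
    }
}
```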