I have been getting my hands dirty recently working on a little solution which is functionally pretty useless, but demonstrates basic usage of some very cool Azure features, including Azure Service Bus Relays and Azure Service Bus Queues.

Essentially it’s a combination of several applications (some of which are cloud hosted in Azure while others are on-prem components) working together, calling each other in a never-ending cycle, as depicted below.

Main components of the solution:

Console Application

The Console Application is running on-prem (my laptop at home) and is self-hosting a WCF service that that can be invoked via Azure Service Bus. It has a single Web Service method which, when invoked, will use the QueueClient class to add a new message to the Azure Service Bus Queue, which is hosted in Azure.

Azure Service Bus Queue

The Azure Service Bus Queue is a component that lives inside Azure. Items are being added to this queue from the abovementioned Console Application. The Queue has to be created explicitly, either through the Azure portal, or through code. In our case we are doing it programmatically.

Worker Role (Cloud Service)

The Worker Role will respond to the OnMessage() event as soon as a new message is detected in the Queue. It responds by invoking the Web Service method that is hosted by the on-prem Console Application.

Azure Service Bus Relay

The Azure Service Bus Relay is a component that lives in Azure. It doesn’t need to be created explicitly, as it’s created and destroyed automatically by Azure in the background when it’s (no longer) needed.

The most beautiful part of this solution, I find, is the fact that we have an Azure hosted component that can invoke a Web Service that is hosted on-prem, sitting behind a firewall, etc. There is not a huge amount of code involved in getting this setup, as most of the complexity is handled by the WCF binding that is used by WCF service and client.

Think about how powerful this can be in scenarios where a cloud hosted solution needs to access data or otherwise interact with applications that are hosted on-prem, behind a firewall. The key thing here is that the on-prem Web Service actually initiates the connection to the Azure Service Bus Relay and keeps it open. Any incoming Web Service calls will reuse this existing connection.

We recently had a very strange problem with one of our custom Event Receivers in our production SharePoint environment. The Event Receiver implemented the Asynchronous ItemUpdated event. It was used to generate and set a value on one of our fields on the document being updated. The code in the Event Receiver appeared to work most of the times, but would fail at seemingly random occasions and would leave the updated document without the generated field set.

We were struggling to isolate the combination of factors that made it fail. The weirdest thing was that there were no errors to be found in the ULS logs or the Event Log. We added lots of logging and try/catch blocks, but for some reason when the Event Receiver failed it would never enter the catch block, so there was no exception to log.

One key point that helped us with the troubleshooting was that we had noticed that the Event Receiver ALWAYS worked when the document was being updated through the SharePoint web UI. We also had a PowerShell script which was used for bulk updating of documents. This script was scheduled to run at regular intervals using Windows Task Scheduler. It appeared that the issue only occurred when the Updated event was triggered via this scheduled PowerShell script, but even then it still seemed intermittent as it would often work just fine.

We were unable to reproduce the issue at all when calling the ps1 file directly from the PowerShell console. So what was different when the script was run from the Task Scheduler vs directly from the PowerShell console? Well, the Task Scheduler actually calls a BATCH script which in turn invokes the PowerShell script which fires up a new PowerShell process. This process dies when it finishes execution of the ps1 file!

Remember, our Event Receiver is an Asynchronous one, so it would not block the execution of the PowerShell script. The Event Receiver is actually executed on a thread inside the PowerShell process since the ps1 script triggered the Updated event. So, when the PowerShell.exe process dies, it does not seem to wait for any background threads to complete, which in our case causes our Event Receiver to suffer from a sudden death. I was a bit surprised to see this to be honest!

Anyway, I guess one of the reasons why in our case the problem seemed to be appearing randomly is that only the last document in a batch would be affected, which sometimes meant 1 in a couple thousand documents. Only recently users had started feeding the script with “batches” consisting of just one document, which is what highlighted the problem to us and lead to this investigation. We were wondering what had changed recently (we had not touched this part of the code for a while!), since it was all working fine before (we thought), but in reality the bug had always been there but it had never occurred to us!

So everyone please beware when invoking PowerShell scripts from BATCH scripts when you have Asynchronous Event Receivers in your SharePoint environment!

What did we do to work around our problem? We just put a little Sleep at the end of our PowerShell script… 🙂

With the introduction of the new SharePoint 2013 App Model, many SharePoint Developers will have to get their hands dirty writing JavaScript. One operation that is going to be very common is to read query string values from JavaScript. C# Developers are used to things like HttpContext.Current.Request.QueryString[“param1”], but unfortunately it is slightly more complicated to get the same result in JavaScript.

Searching the web for an answer on how to do this in JavaScript yields many different results. I would say that the best suggestions are listed on this popular question on StackOverflow. The most upvoted answer to this StackOverflow question has been improved by James Padolsey which has lead to the following function:

Recently I have been playing a bit with the new SharePoint 2013 App model. I was using the new Napa tools to develop a little App that could be be hosted in an App Part, also known as ClientWebPart, which is essentially an iframe which loads your App.

My App consisted of an ASPX page with a bunch of hyperlinks with target=”_blank” on them. I noticed that when the ASPX page was loaded by the ClientWebPart in its iframe, the hyperlinks wouldn’t work at all in Chrome (Firefox and IE were fine!). I started inspecting the DOM and noticed a “sandbox” attribute on the iframe element.

The sandbox attribute rendered on the ClientWebPart iframe contains the following four options:

I would say that this is a bug in the ClientWebPart, as Chrome appears to be the only browser affected. Let’s hope this will be fixed soon! In the meantime, consider using target=”_top” as an alternative, as this does work.

Imagine you need to develop a piece of functionality to extend SharePoint, but the available APIs do not directly allow you to do this for any of the following reasons:

You are not allowed to deploy any server side code

The server side code you can deploy has limited access to the server object model (e.g. Sandbox Solutions in SharePoint 2010).

The API you need to access is private or internal

Consider the following scenario.

For our SharePoint Online site, we want to implement a Web Part that allows us to save the current site as a WSP in the solution gallery. As it’s SharePoint Online, we can’t deploy Farm Solutions, so we will have to deploy it as a Sandbox Solution. Unfortunately, we have limited access to the Object Model on the server, and there is nothing available in the Client Object Model which we can use to save the current site as a template.

Now, if there is an existing administration page that does what we want to do (in our case /_layouts/savetmpl.aspx does what we want to do), then technically all we need to do is a submit an HTTP POST request to that page with the right HTTP headers and form parameters and the server will happily process the request, as it has no way of telling whether the request was triggered by a user submitting the form, or by something else.

Welcome to the world of stateless protocols.

So we need to find out what the request should look like so that we can use jQuery to build it and issue it as an AJAX request from our own code.

Let’s open up Fiddler. With Fiddler open, in the browser we press the button that submits the form on /_layouts/savetmpl.aspx,. After that, we can inspect the form parameters of the HTTP POST request that the browser then makes. We are not really interested in the HTTP headers as the browser will take care of passing any headers related to the current browser session (including the authorization header) to the server when we issue an AJAX request.

We have three main categories of form parameters.

The first category is made up of parameters that directly map to user input fields.

ctl00$PlaceHolderMain$ctl00$ctl00$TxtSaveAsTemplateName

ctl00$PlaceHolderMain$ctl01$ctl00$TxtSaveAsTemplateTitle

ctl00$PlaceHolderMain$ctl01$ctl01$TxtSaveAsTemplateDescription

ctl00$PlaceHolderMain$ctl03$CbSaveData

The second category consists of a set of hidden fields which are used by ASP.NET & SharePoint to do its postback magic, including validation of the form post.

__REQUESTDIGEST

__VIEWSTATE

__EVENTVALIDATION

The third category is “the rest”. This is stuff we are not really interested in, but we still need to send it to the server.

MSOWebPartPage_PostbackSource

MSOTlPn_SelectedWpId

MSOTlPn_View

MSOTlPn_ShowSettings

MSOGallery_SelectedLibrary

MSOGallery_FilterString

MSOTlPn_Button

MSOSPWebPartManager_DisplayModeName

MSOSPWebPartManager_ExitingDesignMode

MSOWebPartPage_Shared

MSOLayout_LayoutChanges

MSOLayout_InDesignMode

MSOSPWebPartManager_OldDisplayModeName

MSOSPWebPartManager_StartWebPartEditingName

MSOSPWebPartManager_EndWebPartEditing

_maintainWorkspaceScrollPosition

__spText1

__spText2

It’s easy to determine what we want to submit as values for the first category. We either capture these values in our custom UI, or we have some logic in our code that determines these values for us.

The second category is a bit more difficult. Essentially, the server is expecting us to post back these values, which were provided by the server and rendered on the page as hidden input fields at the time of requesting the page which has the form on it. This means we need to make an initial GET request using jQuery so we can extract the values from the form, before we can submit them in our post.

The third category is easy, as we can copy the values from the request we captured with fiddler.

The script

Following the instruction above, we can write a bit of javascript like this to allow us to submit forms on “remote” pages.

When the page you are posting to is modified, for example by a SharePoint update, then it’s possible that your script breaks due to changes in form parameter names. This makes this technique a bit fragile. For this reason I recommend that you only consider using this technique once you have confirmed that it’s not possible to achieve what you want by using public APIs.

This article describes how the out-of-the-box XsltListViewWebPart, XLV in short, can cause performance problems, sometimes up to the point where pages being served to users are hitting the ASP.NET execution timeout (110 seconds by default).

The XLV is a Web Part that was introduced in SharePoint 2010 as the replacement for the ListViewWebPart. The most important feature of the XLV is the ability to use a custom XSLT style sheet to customise the rendering of the list data surfaced through the XLV. On the project I am currently working we are making extensive use of this technique. What turns out is that this has been the main cause of the – seemingly random – performance problems we have been experiencing on our production environment for the last couple of months. On one day, it completely got out of hand and for the majority of the working day, our 3000 users were getting request timeouts and were simply unable to load most the pages in our portal. As a result, with support from a few colleagues, I have spent about 2 weeks trying to get to the bottom of this problem, which involved extensive analysis of the ULS logs, IIS logs, Content Database, Performance Counters and a lot of Reflected SharePoint code. By the end, I had come up with a set of steps to reproduce the problem on a vanilla SharePoint environment. I had gotten a better understanding of the internal workings of the XLV; enough to explain why it’s doing what it does. Keep on reading if you are interested in finding out more. Warning, this article contains code.

So – what are the ingredients?

a SharePoint 2010 Publishing Site

One or more XsltListViewWebParts with the XslLink property set to point to an XSLT file

Multiple Web Front End servers

What are the symptoms?

A variable increase in response times for pages with XLVs using XslLink; sometimes hardly noticeable, sometimes so bad that the request will time out.

Why? What? How?

In short, the problem is related to this thing called XSLT compilation which happens on the Web Front Ends at runtime. There seems to be a bug in the XLV that’s causing it to discard the compiled XSLT from its cache when it should be keeping it.

Compilation of XSLT is quite an expensive operation, which is why the XLV is meant to be caching the result so that the XSLT doesn’t have to be recompiled for every subsequent request. The time it takes to compile an XSLT style sheet seems to be dependent on its contents. Compilation of style sheets with more content seem to take longer than compiling small ones. On our environment, we’ve got fairly large XSLT stylesheets and it’s not uncommon for a single compilation to take 2 seconds.

Why is it a problem if the Web Frond End servers are busy compiling XSLT style sheets all the time? Have a look at the following reflected code (obtained using ILSpy, an open-source alternative to .NET Reflector),

The developers amongst you will understand that the call to AppDomainHelper.DoSerialize() is wrapped in a lock, meaning that XSLT compilation is something that can only be done by one thread at a time.

This funnel becomes a massive bottleneck when compilation cannot keep up with the compilation requests coming in, as they will form a growing queue of threads blocked at Monitor.Enter(). As the queue of blocked threads keeps growing, the amount of time that each individual thread has to wait goes up. Imagine for a moment what this will do to page response times.

Not to mention that the compilation of XSLT is actually being performed in a separate AppDomain, whose lifetime is managed by SharePoint. For some reason, if this AppDomain has served a total of 100 compilation requests, SharePoint decides to unload the AppDomain and create a new one. This is also an expensive operation and gives the queue of blocked threads the opportunity to grow even more.

To back up this theory, have a look at the following SCOM chart, which presents the “Total appdomains unloaded” performance counter for each of our Web Front End servers over a period of 24 hours. The counter gets reset after an Application Pool recycle, hence the drops after midnight.

It shows that it’s quite likely that our Web Front End servers are busy compiling XSLT all the time, causing and increase in response times for all requests to pages containing an XLV with XslLink set.

Let’s go back to our scenario where we experienced those severe performance issues for a whole day. The queue of threads waiting for the lock to be released had grown so much that we had threads waiting long enough for ASP.NET to decide it was time for it throw its virtual towel in the ring for that request. When the ASP.NET execution timeout is hit, ASP.NET it will throw a ThreadAbortException on the thread that is handling the request.

The picture below is showing a filtered set of entries from a ULS log file which contains 12 minutes worth of logs at Verbose level from a single Web Front End, captured during this period when we were experiencing our severe performance problems. The definition of the filter that is applied is Where [Message] Contains “Monitor.Enter”.

476 timeouts in 12 minutes, that’s about 40 per minute.

The nice thing about these entries in the ULS logs is that we can tell which method the thread was executing when the ThreadAbortException was thrown by ASP.NET because they contain a Stack Trace, and as you can see in the screenshot above, these threads were all “executing” Monitor.Enter(), i.e. they were blocked, waiting for the lock to be released.

Why are the XLVs compiling the XSLT all the time? Isn’t it supposed to cache the result and reuse it? You’d think so. Let’s have a good look at the Stack Trace that we’ve got that leads to the call to Monitor.Enter() as I am pretty sure that somewhere in the call stack a decision was made that it should not use a cached version of the compiled XSLT.

Error while executing web part: System.Threading.ThreadAbortException: Thread was being aborted.
at System.Threading.Monitor.Enter(Object obj)
at Microsoft.SharePoint.WebPartPages.BaseXsltListWebPart.GenerateCustomizedXsl(BaseXsltInfo baseXsltInfo)
at Microsoft.SharePoint.WebPartPages.BaseXsltListWebPart.get_CustomizedXsl()
at Microsoft.SharePoint.WebPartPages.BaseXsltListWebPart.LoadXslCompiledTransform(WSSXmlUrlResolver someXmlResolver)
at Microsoft.SharePoint.WebPartPages.DataFormWebPart.GetXslCompiledTransform()
at Microsoft.SharePoint.WebPartPages.DataFormWebPart.PrepareAndPerformTransform(Boolean bDeferExecuteTransform)

Note that the XsltListViewWebPart derives from BaseXsltListWebPart which in turn derives from DataFormWebPart.

DataFormWebPart.GetXslCompiledTransform() contains logic which decides whether BaseXsltListWebPart.LoadXslCompiledTransform() should be called or not. It only calls this method if it was unable to retrieve the compiled XSLT from its cache. Note that the DataFormWebPart uses HttpContext.Current.Cache as its “first level” of caching for the compiled XSLT (I will explain later why I call it “first level”).

So by the looks of it, the XLV was consistently unable to retrieve any compiled XSLT from the HttpContext.Current.Cache? Did it fail to add it to the cache the last time it compiled the XSLT, or has it perhaps been removed from the cache since it’s been added?

Let ‘s have another look at the same ULS file we looked at earlier. We’ve got the following suspicious entries. The definition of the filter that is applied is Where [Message] Contains “InvalidateAll”.

Not a very descriptive message you’d say, but descriptive enough to suspect it might be clearing some kind of cache for whatever reason.

ILSpy agrees.

What is the MemoryPartCache class? And what is it removing from the cache exactly?

The Microsoft.SharePoint.WebPartPages.WebPart class exposes the protected methods PartCacheRead() and PartCacheWrite() which internally use an object that derives from the abstract Microsoft.SharePoint.WebPartPages.PartCache class. Depending on what’s configured in the web.config, either an instance of MemoryPartCache of DatabasePartCache will be used. With the default web.config for SharePoint, an instance of MemoryPartCache will be used.

Let’s drill down into the constructor of MemoryPartCache.

Two things are interesting here. First of all, the call to BuildDependsKeys() as it looks like inside it is building the value for this.dependStringKey, which is the key for the object that is being removed in the InvalidateAll() method which we looked at earlier. Let’s have a look at that first, before we have a look at what the EnsureDependsKeyValid() method does.

Assuming the Web Part for which this MemoryPartCache object is being created has a StorageKey (Web Part ID), it will build a string with two “dynamic” elements; the Storage Key and the id of the SPSite. This implies there is a one-to-one relationship between an instance of Microsoft.SharePoint.WebPartPages.WebPart and such a key. Let’s remember this and find out where else this key is being used.

Before we have a look at EnsureDependsKeyValid(), let’s look at the Add() method. The MemoryPartCache.Add() is called when PartCacheWrite() is called on Microsoft.SharePoint.WebPartPages.WebPart. Whatever object needs to be added to the cache will be passed into MemoryPartCache.Add() as a method parameter named “data”. On line 59, at the very bottom of the bit of code below, you can see that the “data” object is being added to the cache. But hold on, it is being added with a CacheDependency! The CacheDependency that is being passed in is based on a string array, which means that it establishes a dependency on other object(s) in the cache, with the array items representing the key(s) of the “master” item(s). In this case we are only dependent on a single “master” item. This “master” item itself is added to the cache on line 30, if it didn’t already exist (which is tested on line 24). The interesting bit here is that the “master” item’s key is in fact this.dependentStringKey, which means that whenever MemoryPartCache.InvalidateAll() is called for a Web Part, its “master” item will be removed from the cache, along with any other “child” items that have been added through MemoryPartCache.Add().

We had already proven that InvalidateAll() is getting called quite a bit, so it’s possible that other cached objects are being removed from the cache as a result of that.

Let’s have a look at MemoryPartCache.EnsureDependsKeyValid(), which is also being called from within the MemoryPartCache constructor.

We are getting closer. So in certain circumstances, InvalidateAll() will be called, which essentially does what pretty much what it says it does; remove the master item from the cache along with any other objects in the cache related to this particular Web Part. What exactly is it testing in the if statement on line 6? It is comparing the result of the call to this.GetDepencyKeyHash() to the value of the cached “master” item, which, at the time of adding, also held the return value of a call to this.GetDepencyKeyHash(), as can be seen in line 30 in MemoryPartCache.Add(), above.

Let’s have a look at what GetDepencyKeyHash() does.

Aha! It looks at the Web Part version.

In a nutshell this means that whenever a web part version changes, any cache stored against other (previous) versions of that web part, will be cleared.

The version of a Web Part is not exposed through a public API, so if you want to see it you need to either write code which uses reflection to read the internal Version property of Microsoft.SharePoint.WebPartPages.WebPart, or you can have a quick peek in the Content Database. Note that you should never touch the Content Database on a production environment, which is why you should back up the database and restore it elsewhere before you start running queries against it.

As I was exploring the Content Database, I spotted the tp_Version column on the AllWebParts table. When I selected all rows from the AllWebParts table, and ordered it by tp_Version in descending order, I found that some of our web parts had versions greater than 100.000, which is pretty suspicious I must say!

Let’s summarise what information we’ve got so far.

Web Part versions are incrementing A LOT for some unknown reason

Objects stored in Web Part Cache are being cleared every time a Web Part version changes

XLVs seem to be unable to retrieve compiled XSLT from the cache, with recompilation of the XSLT as a result, causing a variable increase in response times

Even though the compiled XSLT itself is not stored using Web Part Cache, another object of type CloneableHashtable is. This object is added and retrieved from the Web Part Cache to facilitate the retrieval of the compiled XSLT by taking part in process of building the cache key required to retrieve the compiled XSLT. This means that when InvalidateAll() is called and the Web Parts’s “master” object will be removed from the cache, so will this CloneableHashtable as it’s dependent on the Web Part’s “master” object. As a result, the XLV cannot retrieve the CloneableHashtable from the cache (as it does not exist anymore), and consequently it cannot build the cache key that is required to retrieve the compiled XSLT from the cache. This logic is defined in Microsoft.SharePoint.WebPartPages.DataFormWebPart.GetXslCompiledTransform().

Why are the Web Part versions being incremented automagically?

Remember I mentioned “first level” of caching earlier? Well, there is a second level of caching of the compiled XSLT, in the content database, in the tp_Cache column in the same AllWebParts table. How I understand how this is SUPPOSED to work, is that if you have multiple Web Front End servers, and the first server to serve a page with a new XLV with XslLink on it does the compilation, it will first add the compilation result to its own cache as described above, and then it will also save it to the database (line 32, below), so that all the other Web Front End servers in the farm can reuse it.

Pretty clever.

Too bad it doesn’t work.

The call to CacheWriteInternal on line 32 results in a call to a Stored Procedure called proc_UpdateWebPartCache. However, the value passed in to the VarBinary @Cache parameter of the Stored Procedure does not contain any data, as can be observed in the Developer Dashboard.

Also, If you run a trace with SQL Server Profiler whilst CacheWriteInternal() is being called you can see the call to proc_UpdateWebPartCache, and you can see that a NULL value was passed as the value for the @Cache parameter.

To confirm the database really didn’t receive the compilation result, we can run a select query on the AllWebParts table. It shows only NULL values in the tp_Cache column. Essentially, it’s not saving the compiled XSLT to the database properly.

What it did do, just before it called CacheWriteInternal(), was that it made a call to SaveChanges() on line 31. And guess what that does. It calls a Stored Procedure called proc_UpdateWebPart, and one of the many things this Stored Procedure does is incrementing the Web Part version!

What would have happened if the compiled XSLT would have been saved to the database successfully?

Lets have a look at the definition of PartCacheUpdate() which is being overridden in the Microsoft.SharePoint.WebPartPages.BaseXsltListWebPart class, the base class for the XLV.

Now the name of this method is a bit confusing, but what it really is, is a mechanism to allow to the WebPartManager to “inject” the value of the tp_Cache column from the AllWebParts table into the Web Part that owns the cache at the time of instantiating the Web Part as a Control on the page, somewhere fairly early on in the page lifecycle. So IF the tp_Cache column would have a held a value other than null, we would have captured it in the this._customizedXsl field, line 7.

Now let’s go back to a bit of code we looked at just a moment ago.

Notice the if statement on line 9, where it tests if this._customizedXsl is null. Assuming we had received it from the database via the WebPartManager, there would have been no call to this.GenerateCustomizedXsl() and we would have successfully reused the compilation result from another Web Front End server.

There is however a caveat. Even if we did receive the compiled XSLT from the database, there is a “second chance” to recompile when the if statement on line 18 is satisfied. The BaseXsltHashKey property on the XLV holds a string containing a list of filenames of .xsl files that live in the TEMPLATE\LAYOUTS\XSL folder and their last modified times. The if statement on line 18 tests if the value stored in the Web Part is the same as the one it just calculated by checking the files on the file system of Web Front End serving the current request. If they are different, SharePoint thinks you just deployed a new .xsl file and it’s time to recompile.

What this means is, if, for some reason, the list of .xsl files contained in the TEMPLATE\LAYOUTS\XSL and their last modified times are not exactly the same across all Web Front End servers, you will have a problem. Because, SharePoint will keep detecting a difference between the value stored in the Web Part property and the calculated values, as the Web Part property keeps getting updated with the calculated value from the Web Front End server that last detected the difference, and the never ending ping pong game has started.

We have seen scenarios where standard WSP deployment of .xsl files into the TEMPLATE\LAYOUTS\XSL folder sometimes caused files to be deployed / updated with timestamps that we not exactly the same across all Web Front End servers in the farm, i.e. a difference of a single second. This is enough SharePoint to start playing the never ending ping pong game. I must say that I have not seen this kind of odd behaviour with WSP deployments before, so I am not sure if this is something environmental or something that affects everyone.

Overview

For a quick overview of the overall flow which is described in this article please refer to the following a diagram.

The main question

Why does the call to CacheWriteInternal() result in a Stored Procedure call with a NULL value for the @Cache parameter? I would love to know.

What’s next?

I am planning to write a follow up blog post soon on this topic where I will describe a work around that has solved the problem for us. Some of our worst hit pages went from an average of 8 second response times to less than a second.

Also, I have been in touch with Microsoft, and they have been able to reproduce the issue on their end. I am waiting to hear back from them.

I followed the MSDN article Creating a Custom WCF Service in SharePoint Foundation to deploy my own custom WCF service into the ISAPI folder of SharePoint 2010. I was under the impression that browsing to the url where my .svc file was located would return some kind of WSDL. I was also expecting that I could use this url to add a Service Reference in Visual Studio 2010. Instead, I was getting a HTTP 400 Bad Request responses in both scenarios, making me think there was something wrong with the implementation of my service.