Tag Archives: perf

Does a referenced assembly get loaded if no types in the assembly are used? The term "used" is very subjective. For a developer it would mean that you probably never created an instance or called a method on a type in the assembly, but this does not cover the whole story. You can instead consider the reasons an assembly load occurs. Suzanne's blog on assembly loading failures gives a good understanding of failures, if that is what you are interested in. This post focuses on how to identify what exactly is causing an assembly to load.

We in the WCF team are very cautious about introducing assembly dependencies and about how our code paths can cause assembly loads, since this impacts the reference set of your process. Images loaded during a WCF call can become a cause of slow startup, since every assembly is a potential disk lookup, and the larger the number of assemblies, the higher the impact on startup. As guidance for quick app startup: if you refactor your types properly, you can eliminate many unnecessary assemblies from being loaded. Continue reading →
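One simple way to identify what is causing an assembly to load is to hook the AppDomain's AssemblyLoad event. The sketch below is a minimal illustration (not from the original post): it logs each load together with the managed stack that triggered it, and lists what is already in the process.

```csharp
using System;
using System.Reflection;

static class AssemblyLoadLogger
{
    public static void Hook()
    {
        // Fires on every assembly load from now on; the stack trace shows
        // which code path triggered the load.
        AppDomain.CurrentDomain.AssemblyLoad += (sender, args) =>
        {
            Console.WriteLine("Loaded: " + args.LoadedAssembly.FullName);
            Console.WriteLine(Environment.StackTrace);
        };

        // What is in the process already, before your code runs.
        foreach (Assembly a in AppDomain.CurrentDomain.GetAssemblies())
            Console.WriteLine("Already loaded: " + a.GetName().Name);
    }
}
```

Call `Hook()` as early as possible in `Main` so that later loads, and the code paths responsible for them, show up in the output.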

In certain load tests you want to make sure a bunch of threads reach a particular state before they proceed with the rest of the work. You cannot guarantee that all threads execute a given point simultaneously, since CPU scheduling determines this. However, you can move these threads to the Ready state. A ready thread is a thread that can be scheduled for execution on a particular core – http://msdn.microsoft.com/en-us/library/dd627187%28VS.85%29.aspx

“The WaitForMultipleObjects function determines whether the wait criteria have been met. If the criteria have not been met, the calling thread enters the wait state until the conditions of the wait criteria have been met or the time-out interval elapses.”

Here is a small example of how to start multiple threads and then let them proceed after all of them have reached a particular point in execution.
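The original sample isn't reproduced here, but the pattern can be sketched with one event per thread plus WaitHandle.WaitAll, the managed counterpart of WaitForMultipleObjects. The helper name and thread count below are illustrative:

```csharp
using System;
using System.Threading;

static class ThreadGate
{
    // Starts 'threadCount' threads, waits until every one of them has reached
    // the sync point (is "ready"), then releases them all at once.
    public static void RunAll(int threadCount, Action<int> work)
    {
        var ready = new ManualResetEvent[threadCount]; // one "I'm ready" signal per thread
        var go = new ManualResetEvent(false);          // gate that releases everyone
        var threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            ready[i] = new ManualResetEvent(false);
            int id = i;
            threads[i] = new Thread(() =>
            {
                // ...per-thread setup work would happen here...
                ready[id].Set();  // signal: reached the sync point
                go.WaitOne();     // block until all threads have reached it
                work(id);
            });
            threads[i].Start();
        }

        // Managed counterpart of WaitForMultipleObjects(bWaitAll = TRUE);
        // like the Win32 API, it is limited to 64 handles.
        WaitHandle.WaitAll(ready);
        go.Set();                 // every thread is now ready: release them together

        foreach (Thread t in threads) t.Join();
    }
}
```

Note that the threads released by `go.Set()` still only become ready; when each one actually runs is up to the scheduler, which is exactly the point made above.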

With xperf being adopted more and more, and with its rich stack-walking capabilities, it's only natural to use it for finding bottlenecks and the causes of thread switch-outs.

Finding the ready-thread information, i.e. what caused a thread to switch out and the stack that woke it up when it switched back in, is one way to determine the offending stack that caused other threads to switch out. This helps us identify potential hot locks, just really expensive locks, or issues due to false sharing of data.

You can run the following command to capture stack traces with ready thread information.
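A capture session along these lines should work (flag and kernel-group names are from the standard xperf documentation; the output file name is illustrative):

```shell
:: Start kernel tracing with context-switch and ready-thread stacks.
xperf -on Latency -stackwalk CSwitch+ReadyThread

:: ...run the load scenario you want to analyze...

:: Stop tracing and merge the result into an .etl file for analysis.
xperf -d readythread.etl
```

Open the resulting .etl file in the viewer and look at the CPU Usage (Precise) / ready-thread columns to see which stack readied each waiting thread.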

WCF enables throttling the execution of operations but not their completions. This becomes an issue when a large number of outstanding operations complete almost simultaneously, causing the callback on the client to be overwhelmed with completions. Generally we don't expect the client to issue an unbounded number of pending operations, but if you do end up with very high CPU usage and suspect all your operations are stuck in a callback method that takes a lock, then you need to throttle the callbacks yourself.

You could try setting MinThreads on the thread pool, but this affects the whole app domain. The issue is due to the large number of callbacks that arrive concurrently. The attached sample throttles the callbacks so that only one thread executes completions while 20 threads start the operations, all completing almost simultaneously. The idea is to wrap the AsyncResult of your operation and complete only the required number of results in parallel; this throttles the service operation's End calls automatically.

Sample Source: AsyncEndThrottling
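The attached sample isn't reproduced here, but the core idea can be sketched with a semaphore that bounds how many completion callbacks execute concurrently (the class and method names below are illustrative, not from the sample):

```csharp
using System;
using System.Threading;

static class CompletionThrottler
{
    // Allow only one completion callback to run at a time. Raise the count
    // if the callback work is cheap enough to overlap.
    private static readonly SemaphoreSlim CompletionGate = new SemaphoreSlim(1);

    // Wrap the real completion so that concurrent completions queue up instead
    // of flooding the callback (and contending on whatever lock it takes).
    public static void OnOperationCompleted(Action completeEnd)
    {
        CompletionGate.Wait();
        try
        {
            completeEnd(); // call the operation's End method / process the result
        }
        finally
        {
            CompletionGate.Release();
        }
    }
}
```

The threads starting operations are unaffected; only the completion side is serialized, which is exactly the asymmetry the post describes.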

Problem Statement

Some broker implementations require creating a copy of the message before forwarding it to the backend. The broker might also slightly modify the message, for example its addressing headers, for proper message routing within the DMZ. The problem is that we see a very high CPU cost in creating this copy, which also results in lower throughput. Note: streamed transfer mode is not in scope for this article.

Simulation

For all performance issues we need to measure and profile. To investigate this issue we initially try to simulate the broker's pattern by just copying the message and then forwarding it to a dummy backend service. We then take profiles of this to understand the actual cost of copying.

An 8% cost for copying seems acceptable, considering the value of making the copy and being able to do other things with it if required. But this was not what was being observed: in the profiles from the actual broker we noticed about a 40% cost for creating a copy. This means that almost half the time is spent creating a message copy, so effectively your throughput would nearly halve when the broker is configured to create a copy of the message. This excludes costs like logging.

Evidently our simulation was not accurate, so we needed to isolate this further. We pulled in more functionality from the broker until we hit this expensive path. One key observation was that the message is copied just before it is forwarded, which means a bunch of manipulations have already been performed on the message; in our simulation we didn't perform any manipulation. To get closer, we needed to change something on the message. To keep it simple we removed one header and added another, since most brokers modify headers before forwarding.

Now that we have identified the root cause we also need to identify the solution so that the broker can achieve the functionality without taking up so much CPU.

Solution

The solution is actually quite simple: "Modify your message after you create the buffered copy." I wanted to give the solution before the analysis, since most of you would probably not be interested in the analysis; but if you are, the rest should be interesting.

The BufferedMessage headers haven't been captured (this happens, for example, when the user inserts a header in the first position).

If headers have been modified, then CreateBufferedMessage takes an alternative path using the DefaultMessageBuffer. The reason is that a full copy of the message has to be created if any buffered header has been modified. An internal property, headers.ContainsOnlyBufferedMessageHeaders, is used to decide whether the faster BufferedMessageBuffer can be used to create the buffered copy. If there are any modified headers, we need to ensure the message is fully marshaled, and the buffer itself cannot simply be copied (e.g. the user can add a reference type to a header), so we fall back to a path that fully reparses the message and creates a fully deserialized copy of the modified message.

The main point is that a copy should always be a deep copy, and any kind of modification should not result in a message with shallow-copied message parts. When you copy and create a message from the original, your message object gets its own copy of the headers that it can modify without affecting the original incoming message. Message copy by itself is a fast operation, as you can see from the profile above, but copying a modified message can be very CPU intensive when using buffered transfer mode.
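The ordering rule can be sketched as follows (a minimal illustration; the header names, namespace, and values are made up for the example):

```csharp
using System.ServiceModel.Channels;

static class BrokerCopy
{
    // FAST PATH: buffer the message first, while its headers are untouched,
    // then modify the copy. Modifying headers before CreateBufferedCopy
    // forces the slow, fully-reparsed copy path instead.
    public static Message ForwardableCopy(Message incoming)
    {
        MessageBuffer buffer = incoming.CreateBufferedCopy(int.MaxValue);

        Message outgoing = buffer.CreateMessage();
        // Illustrative header manipulation, done on the copy only.
        outgoing.Headers.RemoveAll("OldHeader", "urn:example");
        outgoing.Headers.Add(
            MessageHeader.CreateHeader("RoutedBy", "urn:example", "broker-01"));
        return outgoing;
    }
}
```

The buffer can also hand out further copies via additional `CreateMessage()` calls, e.g. one for local logging and one to forward.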

For greater flexibility, our router can be something like a pass-through router. If we are just calling a backend service, then we can use a generic contract to receive and forward messages to the backend service as shown below.
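A generic (untyped) contract of this kind is sketched below; this follows the usual WCF wildcard-action pattern, and the binding and backend address are illustrative rather than from the original post:

```csharp
using System.ServiceModel;
using System.ServiceModel.Channels;

// Action="*" / ReplyAction="*" matches any incoming action, so one contract
// can receive and forward arbitrary messages without typed proxies.
[ServiceContract]
public interface IPassThroughRouter
{
    [OperationContract(Action = "*", ReplyAction = "*")]
    Message ProcessMessage(Message request);
}

public class PassThroughRouter : IPassThroughRouter
{
    // One factory to the backend, created once; binding and address are illustrative.
    private static readonly ChannelFactory<IPassThroughRouter> Factory =
        new ChannelFactory<IPassThroughRouter>(
            new BasicHttpBinding(),
            new EndpointAddress("http://backend.example.com/service"));

    public Message ProcessMessage(Message request)
    {
        IPassThroughRouter backend = Factory.CreateChannel();
        return backend.ProcessMessage(request); // forward as-is
    }
}
```

Because the contract works directly with `Message`, the router stays decoupled from the backend's data contracts, which is where the "loose coupling" advantage below comes from.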

Here we create a copy of the message to consume locally on the broker, in case we want to validate some parts of the message, log it, etc. Ideally the fastest approach would be to forward it directly, but applications sometimes require all incoming messages to be logged or validated at the entry point of the DMZ.

Pros

Loose coupling.

Can potentially avoid a lot of serialization and deserialization cost (encoders need to match for this).

Changes in the backend usually do not require changes in the Broker provided the broker uses generic client contracts.

Cons

The synchronous pattern has its own overhead and is not scalable.

The router has to have very high capacity to support high loads, even though the backend machines may block on IO.

Best Practices

Match your encoders – encoder mismatches can cause heavy serialization and deserialization on the router. This is because any difference between the backend and frontend encoders on the binding requires re-encoding the message, and hence a full read and write at the encoder layer.
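In code, "matching encoders" means the router's client-side binding and the backend's service binding use the same message-encoding binding element. A minimal sketch with a custom binding (the specific elements chosen are illustrative):

```csharp
using System.ServiceModel.Channels;

static class MatchedBindings
{
    // Use the same encoder element on the router's outgoing binding and the
    // backend's service binding. The transports may differ; the encoder must not,
    // or every message is fully re-encoded at the router.
    public static CustomBinding Create()
    {
        return new CustomBinding(
            new BinaryMessageEncodingBindingElement(), // identical on both sides
            new TcpTransportBindingElement());
    }
}
```

The same pairing can of course be expressed in configuration instead of code; the point is only that the encoding element is identical on both sides of the router.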

Avoid having to create a message copy. If the backend can create the copy and do the validation, you save the router's CPU for more messages per second.

Make sure you use fast message copy – buffered messages have a very optimized message-copy path that WCF takes, provided you haven't changed parts of the message. I will talk about how to hit this fast path next.