Hazim Shafi's Blog
https://blogs.msdn.microsoft.com/hshafi

Case Study: Parallelism and Memory Usage, VS2010 Tools to the Rescue!
Thu, 17 Jun 2010 11:07:00 +0000

I was recently asked to help a Microsoft customer improve his application's performance. He had a managed application that exhibited a fair amount of data-level parallelism in a long-running for_each loop. When that loop was parallelized using the Parallel.ForEach() construct in TPL (the Task Parallel Library) on a quad-core system, our customer was not happy with the resulting performance improvement. He expected near-linear speedup, since there wasn't much sharing or synchronization in the implementation.

Actually, the statement above only became true after my friend Stephen Toub helped the customer resolve a race condition on a data structure. The fix was considerably better than taking a lock: it used a concurrent data structure also provided by TPL (a ConcurrentBag). Our customer was already using the Concurrency Visualizer, but needed our help to figure out how to use it to identify the root cause of his scalability problem. So we took a look at his trace together, and within a couple of minutes it was clear that we were dealing with a classic workstation GC overhead pattern.

As you can see in the above screenshot, when the garbage collector kicks in, all threads but one are blocked, as shown in the highlighted area. This pattern repeats itself multiple times in the parallel loop, indicating that garbage collection occurs often. If you click on one of the red synchronization regions for the blocked threads, you will see a callstack (be sure to add the Microsoft Symbol Server to your symbols path under Tools->Options->Debugger->Symbols to get good symbols in the runtime) containing the following at the tail end (top of the callstack view):

clr.dll!CLREvent::Wait

clr.dll!WKS::gc_heap::wait_for_gc_done

clr.dll!WKS::gc_heap::try_allocate_more_space

clr.dll!WKS::gc_heap::allocate_more_space

clr.dll!WKS::GCHeap::Alloc

clr.dll!Alloc

This indicates that the thread is waiting for GC to complete. The only thread executing during that time will be running GC code (click on a green segment for a sample callstack). One option for improving GC performance is to use the server GC, an implementation that creates a heap per core and does not block all threads when a GC is triggered. This change resulted in a 40%+ improvement in performance, exhibited by significantly less blocking during execution (see the screenshot below, showing that most of the time many threads are executing on this quad-core, hyperthreaded system). But we still needed to get to the root cause of the excessive garbage collections. In case you're interested in how to switch to the server GC implementation in Visual Studio, here are the steps:

1. Create an application configuration file. This can be done by right-clicking on the project file and selecting Add->New Item->General->Application Configuration File.

2. Edit the config file to add the runtime option shown below. Beware, this is case-sensitive!

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>

Before explaining how we solved that mystery, let's consider an interesting reality about parallelism. If you start out with a loop that creates temporary objects (even though that's bad if it can be avoided), the rate at which your program creates those objects is limited by the speed of that single thread. Now imagine increasing the number of threads running that code at a given point in time. You've just increased the rate at which objects are created and discarded! So excessive object creation gets exacerbated and GCs happen more frequently, making it even more important to do a good job with memory usage. The more threads/cores thrown at the problem, the higher the rate of memory churn.

There's a nice profiling tool in VS2010 dedicated to analyzing CLR memory usage: the ".NET Memory Allocation (Sampling)" option that you see in the Performance Wizard. This tool samples object allocations in your application and reports many useful statistics. The most important one in our investigation was the list of most frequently allocated objects. Between that list and the source code, we were able to identify an object type that was frequently created inside the loop. It turned out to be a pretty simple structure containing two doubles. So we decided it was worth allocating two local doubles on the stack instead of creating temporary objects. And voila! A very happy customer emailed back a few hours later, reporting that his test scenario that had been running in 1.1 seconds was now taking 0.25 seconds on a quad-core (hyperthreaded) system. See the screenshot of a profile of this run below.

I thought that I’d share this with you not only because it’s really cool to see a customer using our technologies successfully, but because there are some important lessons here:

5.The Concurrency Visualizer was very effective at exposing the problem and helping us address it.

I hope you found this helpful.

-Hazim

Come Join Us at the International Supercomputing Conference (ISC '10) in Hamburg
Sun, 30 May 2010 03:24:33 +0000

Hi,

I hope that some of you can join us at the International Supercomputing Conference (ISC) in Hamburg this week. Keith Yedlin and I will be giving a tutorial today about the Parallel Computing Platform team's technologies in Visual Studio 2010 (link). As part of the tutorial, I will be giving a 75-minute talk and demo about the parallel debugging and profiling capabilities in the product, including, of course, the Concurrency Visualizer. We will have laptops for you to try out the product hands-on, as well as short lab exercises. I hope to see some of you there! We will also have a booth on the show floor where you can stop by, get some hands-on demos, and chat about your needs and experiences.

The April issue of MSDN magazine is out, and we were happy to see an article entitled “Better Coding with Visual Studio 2010” by Doug Turnure that also briefly covers the Concurrency Visualizer. I encourage you all to read it here because it contains other information about parallel programming and other cool features in VS2010. The April issue in general brings a lot of great information about features that have been developed by the Parallel Computing Platform team at Microsoft. It is always gratifying when other teams and individuals get excited about your technologies.

I'd like you all to be aware of an issue that can affect the quality or functionality of the Concurrency Visualizer. Our tool relies heavily on gathering timestamps in order to correlate events across cores and threads. When running on a virtualized processor, such as under Hyper-V, the fidelity of this timing information can be affected, depending on many variables. The Concurrency Visualizer can even fail when it detects inconsistent timing information. We are working on addressing this in future versions, but for now we recommend that you do your performance analysis on native operating systems. Some purists in the performance engineering community might even claim that doing performance analysis on Hyper-V is fundamentally wrong, just like doing performance analysis on a system with a lot of interference from external applications or sources. I'm not such a purist on the former, because VM environments will continue to increase in popularity and we need a good solution; for now, however, you should be warned against it.

We apologize for any inconvenience that this may cause you. If you have any experiences that you’d like to share in this regard, please don’t hesitate to do so here.

I’ve received many requests for a more in-depth article on the features of the Concurrency Visualizer in VS2010. Well, I’m happy to report that I’ve come through with an article entitled “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010” that appeared in the current (March 2010) issue of MSDN Magazine. The article and a screenshot of the tool even made it to the front cover. I highly recommend reading that article if you want to make the most out of this parallel performance analysis tool. The article may be viewed at http://msdn.microsoft.com/en-us/magazine/ee336027.aspx.

I’ve recently posted a short article on how you can use the Concurrency Visualizer to understand the performance of MPI (Message Passing Interface) applications. You can find it at our team’s blog. Also, this week I’m giving a talk at Gamefest 2010 entitled “Visualization Tools for Multicore Performance Analysis”. Stop by and chat if you’re attending.

Those of you who are used to doing performance analysis can appreciate the value of reducing interference between your application and other applications and services running on the system under study. So far, I’ve been using the Visual Studio IDE to show you how you can collect and analyze a profile. Since Visual Studio itself can be a resource intensive application, it is sometimes desirable to collect a profile without the IDE’s assistance. Further, it is sometimes desirable to collect a profile on a system that does not have Visual Studio installed. For these purposes, the Concurrency Visualizer comes with support in the Visual Studio profiler command-line tools. The command-line tools allow both launch and attach profile collection. Here’s how you can accomplish this:

Launch Scenario:

1. Open a Visual Studio Command Prompt window as an Administrator (remember, ETW-based collection requires high privileges). The tool is usually found at Start->All Programs->Microsoft Visual Studio 2010->Visual Studio Tools->Visual Studio Command Prompt (2010). The console will have the appropriate paths set up for the tools that we'll be using.

2. Start profiling and launch the application of interest using the following command (I usually do this from the directory containing the program of interest):

3. Now the application should be launched and you can perform your test. When you're finished and have terminated the application, run the following command to complete the profile collection:

vsperfcmd /shutdown

4. When the above command completes, you will find the profile file "profilefilename.vsp" in the current directory. All you need to do now is open this .vsp file in Visual Studio (Ultimate or Premium) using the File->Open->File menu option. Just so you know, there are two other files containing profile data: profilefilename.app.ctl and profilefilename.krn.ctl.

Attach Scenario:

1. Find the process ID (PID) of the application that you're interested in. You can use Task Manager to do that by enabling the PID column in the Processes tab using the View->Select Columns option.

3. When you’re done profiling the usage pattern that you’re interested in, run the following commands:

vsperfcmd /detach

vsperfcmd /shutdown

4. When the above command completes, you will find the profile file “profilefilename.vsp” in the current directory. All you need to do now is to open this .vsp file in Visual Studio (Ultimate or Premium) using the File->Open->File menu option.

If you’d like to collect on a system that does not have Visual Studio installed, you will need to install the Standalone Profiler tools. There’s a directory on the Visual Studio DVD containing the installer for these tools. You will need to run this as an admin because it installs a driver. In addition, the command-line tools require .NET 4.

That’s all you need to collect profiles without the overhead of Visual Studio. Now go give these a try!

In my PDC 2008 presentation, I showed how the Concurrency Visualizer in Visual Studio 2010 allows users the option of instrumenting their code in order to link the visualizations with application constructs or phases of execution. The Concurrency Visualizer does not require any instrumentation to function, but for some complex application scenarios, it is often difficult to identify the regions of execution that are of interest to us. This is because a common performance investigation is usually focused on a certain “problem” that manifests itself during a portion of an application’s execution.

For VS2010 Beta 2, we have released a simple API that can be used for this purpose. This API is called Scenario, and it is available as a free download from http://code.msdn.microsoft.com/Scenario. There are both native and managed implementations of this API, depending on the application that you are dealing with. The Scenario API includes many features that may be of interest to you, so I urge you to read the documentation to learn about it. For our purposes, the Scenario API encapsulates the work necessary to generate ETW events that are consumed by the Concurrency Visualizer. To use it, you instantiate a Scenario object, then mark the phases that are important to you by invoking the Begin and End APIs. When you do so, the Concurrency Visualizer will mark these regions with vertical markers in the CPU Utilization view, or rectangular regions in the Threads and Cores views.

Each Scenario has an associated string describing it, and these strings are shown in tooltips in the views. In the CPU Utilization view, the strings are shown when you hover over the vertical markers. In the other views, they show up when you hover over the horizontal connectors of the Scenario rectangles. If you're interested in analyzing work that happens in one of these regions, you can just zoom in on it, switch among the various views, examine reports, and interact with the UI to get your work done. You can also use the measurement tool in the Threads view toolbar to measure the time it takes to execute the various phases/scenarios. Here's a simple example that you can use to try out this functionality in VS2010 Beta 2 after downloading the appropriate bits from the above website.

// Mark end of work phase
myScenario->End(0, TEXT("Work Phase"));
exit(0);
}

When you profile this app, you’ll notice rectangular regions such as the ones depicted below that correspond to each Scenario Begin/End pair in the application. If you hover the mouse on the horizontal bars, you’ll get a tooltip containing the text string that you associated with the Scenario phases. You can now zoom in on a phase that interests you, make timing measurements with the measurement tool etc. The Cores view has a similar UI as shown, but the CPU Utilization view shows vertical bars for the begin and end markers instead of rectangular regions. Unfortunately, in Beta 2 the bars use a shade of grey that’s hard to see. You can also hover on those vertical markers to get the Scenario text.

In my previous post, I mentioned the “Demystify” feature of our tool that isn’t quite working in the VS2010 Beta 2 release (Premium and Ultimate versions). Our team has now placed a web-based preview of this feature on our Team Blog. Demystify is a great way of learning about our tool’s features and it will be in the final release. Give it a go and use it as a valuable resource while you’re trying out our tool. Please keep an eye on both blogs for more information about our tool.

I’m very excited about the release of Visual Studio 2010 Beta 2 that is going to be available to MSDN subscribers today and to the general public on 10/21. This release includes significant improvements in many areas that I’m sure you’ll love. But, as the Architect of the Concurrency Visualizer tool in the VS2010 profiler, I’m extremely thrilled to share with you the huge improvements in the user interface and usability of our tool. Our team has done an outstanding job in listening to feedback and making innovative enhancements that I’m sure will please you, our customers. Here is a brief overview of some of the improvements that we’ve made:

Before we start, I’ll remind you again that the tool that I’m describing here is the “visualize the behavior of a multithreaded application” option under the concurrency option in the Performance Wizard accessible through the Analyze Menu. I’ve described how the tool can be run in a previous post. Here’s a screenshot of the performance wizard with the proper selection to use our tool:

Ok, now let’s start going over the changes. First, we’ve slightly changed the names of the views for our tool. We now have “CPU Utilization”, “Threads”, and “Cores” views. These views can be accessed either through the profiler toolbar’s Current View pull-down menu, through bitmap buttons at the top of the summary page, or through links in our views as you’ll see in the top left of the next screenshot.

You'll notice that the user interface has gone through some refinement since Beta 1 (see my earlier posts for a comparison). Let's go over the features quickly:

1. We've added an active legend in the lower left. The active legend has multiple features. First, for every thread state category, you can click on the legend entry to get a callstack-based report in the Profile Report tab summarizing where blocking events occurred in your application. For the execution category, you get a sample profile that tells you what work your application performed. As usual, all of the reports are filtered by the time range that you're viewing and the threads that are enabled in the view. You can change this by zooming in or out and by disabling threads in the view to focus your attention on certain areas. The legend also provides a summary of where time was spent, as percentages shown next to the categories.

2. When you select an area in a thread’s state, the “Current Stack” tab shows where your thread’s execution stopped for blocking categories, or the nearest execution sample callstack within +/- 1ms of where you clicked for green segments.

3. When you select a blocking category, we also try to draw a link (dark line shown in the screenshot) to the thread that resulted in unblocking your thread whenever we are able to make that determination. In addition, the Unblocking Stack tab shows you what the unblocking thread was doing by displaying its callstack when it unblocked your thread. This is a great mechanism to understand thread-to-thread dependencies.

4. We've also improved the File Operations summary report that is accessible from the active legend, by also listing file operations performed by the System process. Some of those accesses are actually triggered on behalf of your application, so we list them but clearly mark them as System accesses. Others may not be related to your application at all.

5. The Per Thread Summary report is the same bar graph breakdown of where each thread's time was spent that used to show up by default in Beta 1, but it can now be accessed from the active legend. It helps you understand improvements and regressions from one run to another, and it focuses your attention on the threads and types of delay that matter most in your run. This is valuable for filtering threads/time and prioritizing your tuning effort.

6. The profile reports now have two additional features. By default, we now filter out the callstacks that contribute < 2% of blocking time (or samples for execution reports) to minimize noise. You can change the noise reduction percentage yourself. We also allow you to remove stack frames that are outside your application from the profile reports. This can be valuable in certain cases, but it is left off by default because blocking events usually do not occur in your code, so filtering that stuff out may not help you figure out what’s going on.

7. We added significant help content to the tool. You'll notice the new Hints tab, which includes instructions about features of the view as well as links to two important help items. One is a link to our Demystify feature, a graphical way to get contextual help; it is also accessible through the button on the top right-hand side of the view. Unfortunately, the link isn't working in Beta 2, but we are working on hosting an equivalent web-based version to assist you and get feedback before the release is finalized. I'll communicate this information in a subsequent post. The other link is to a repository of graphical signatures for common performance problems. This can be an awesome way of building a community of users and leveraging the experiences of other users and our team to help you identify potential problems.

8. The UI has been improved to preserve details when you zoom out by allowing multiple colors to reside within a thread execution region when the same pixel in the view corresponds to multiple thread states. This was the mechanism that we chose to always report the truth and give the users a hint that they need to zoom in to get more accurate information.

The next screenshot shows you a significantly overhauled “Cores” view:

The Cores view has the same purpose as before; namely, understanding how your application's threads were scheduled on the logical cores in your system. The view leverages a new compression scheme to avoid loss of data when the view is zoomed out. It has a legend that was missing in Beta 1. It also has clearer statistics for each thread: the total number of context switches, the number of context switches resulting in core migration, and the percentage of total context switches resulting in migration. This can be very valuable when tuning to reduce context switches or cache/NUMA memory latency effects. In addition, the visualization can easily illustrate thread serialization on cores that may result from inappropriate use of thread affinity.

This is just a short list of the improvements that we’ve made. I will be returning soon with another post about new Beta 2 features, so please visit again and don’t be shy to give me your feedback and ask any questions that you may have.