Intel® VTune™ Amplifier Release Notes and New Features

This page provides the current Release Notes for Intel® VTune™ Amplifier (Intel® VTune™ Amplifier XE for versions 2017 and older). The notes are categorized by major version, from newest to oldest, with individual releases listed within each version section.

Click a release to expand it into a summary of new features and changes in that version since the last release. The expanded summary also contains download buttons for the detailed release notes, which include important information, such as pre-requisites, software compatibility, installation instructions, and known issues.

Improved insight into parallelism inefficiencies for applications using Intel Threading Building Blocks (Intel TBB) with extended classification of high Overhead and Spin time.

Automated installation of the VTune Amplifier collectors on a remote Linux target system. This feature is helpful if you profile a target on a shared resource without VTune Amplifier installed or on an embedded platform where targets may be reset frequently.

Details:

HPC Performance Characterization Analysis improvements

The HPC Performance Characterization Analysis has received several improvements.

Increased detail and structure for the vector efficiency metrics based on FLOP counters in the FPU Utilization section help diagnose the reason for low utilization connected with poor vector code generation. Relevant metrics include:

Vector Capacity Usage

FP Instruction Mix

FP Arithmetic Instructions per Memory Read or Write

SP FLOPs per Cycle (may indicate memory bandwidth bound code)

For MPI applications, the MPI Imbalance metric shows CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. Issue detection for this metric is based on the minimal MPI Busy Wait time among the ranks. If that minimal time is not significant, the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
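As a rough illustration of the normalization described above, the sketch below averages per-rank busy-wait time over the ranks on one node and picks the rank with the least busy wait as the likely critical-path rank. The function names, the input dictionary, and the 5% significance threshold are assumptions made for illustration; they are not VTune Amplifier's actual formula or thresholds.

```python
def mpi_imbalance(busy_wait_by_rank):
    """Average MPI busy-wait time per rank on one node (seconds)."""
    return sum(busy_wait_by_rank.values()) / len(busy_wait_by_rank)


def critical_path_candidate(busy_wait_by_rank, elapsed, threshold=0.05):
    """Return the rank with minimal busy wait if that wait is insignificant.

    `threshold` is a hypothetical fraction of elapsed time below which
    the minimal busy wait is treated as "not significant".
    """
    rank = min(busy_wait_by_rank, key=busy_wait_by_rank.get)
    if busy_wait_by_rank[rank] / elapsed < threshold:
        return rank  # likely on the critical path; inspect its CPU utilization
    return None      # every rank waits noticeably; no single rank stands out


waits = {0: 4.0, 1: 0.2, 2: 3.0, 3: 2.8}  # made-up sample data, in seconds
print(mpi_imbalance(waits))
print(critical_path_candidate(waits, elapsed=10.0))
```

Here rank 1 barely waits, so it is the one whose compute work the other ranks are waiting for.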

The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating point operations, sorted by CPU time. The FPU Utilization column provides issue descriptions based on whether a loop/function is bandwidth bound, whether it is vectorized or scalar, and what instruction set it's using.

For Intel Xeon Phi processors (codenamed Knights Landing), the following FPU metrics are available instead of FLOP counters:

DRAM Bandwidth Bound metric

A new metric is available in the Memory Usage viewpoint for the Memory Access and HPC Performance Characterization analyses which indicates whether your system spent much time heavily utilizing the DRAM bandwidth. The calculation of this metric relies on accurate maximum system DRAM bandwidth measurement, and depends on the number of sockets on your system.
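One way to picture such a metric is as the share of samples in which observed DRAM bandwidth is close to the system maximum. The sketch below is only an assumption about how this could be computed: the 70% "high utilization" cut-off, the per-socket scaling, and all names are hypothetical and are not taken from VTune Amplifier.

```python
def dram_bandwidth_bound(samples_gbs, max_per_socket_gbs, sockets,
                         high_fraction=0.7):
    """Fraction of bandwidth samples that count as 'high utilization'.

    samples_gbs        -- observed DRAM bandwidth samples, GB/s
    max_per_socket_gbs -- measured maximum bandwidth of one socket, GB/s
    sockets            -- number of sockets on the system
    high_fraction      -- hypothetical cut-off relative to the system maximum
    """
    system_max = max_per_socket_gbs * sockets
    high = [s for s in samples_gbs if s >= high_fraction * system_max]
    return len(high) / len(samples_gbs)


# Made-up samples on a two-socket system with 50 GB/s per socket
print(dram_bandwidth_bound([10, 80, 90, 20], max_per_socket_gbs=50, sockets=2))
```

The example shows why an accurate maximum matters: the same samples would be classified differently if the measured peak or the socket count were wrong.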

GPU Hotspots Summary improvements

The GPU Hotspots viewpoint's Summary tab has been extended to display more information. The GPU Usage section can be used to identify whether the GPU was properly utilized. The Packet Queue Depth Histogram can be used to estimate the GPU software queue depth per GPU engine during the target run. Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for the zero queue depth.

For a high-level view of the DMA packet execution during the target run, review the Packet Duration Histogram. Select a required packet type from the drop-down menu and identify how effectively these packets were executed on the GPU. Having high packet count values for the minimal duration is optimal.

KVM Guest OS Profiling

If you are a system developer and interested in the performance analysis of a guest Linux* system, use Intel VTune Amplifier for performance analysis of this guest Linux* OS via Kernel-based Virtual Machine (KVM) from the host system. Depending on your analysis target, you may choose either of the following usage models for KVM guest OS profiling:

Locks & Waits analysis for Python

Locks and Waits analysis can now be used to tune threaded performance of mixed Python* and native code. View Sync Objects in the grid, see Python frames in the Call Stack, and define which sync objects are the Global Interpreter Lock (GIL), either by wait count or by call stack. Drill down to Python source to explore thread synchronization issues at code level. For more information on how to configure the analysis, see the Python* Code Analysis product help article.

Support for the Average Latency metric in the Memory Access analysis based on the driverless collection

Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions

Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory and FPU performance aspects including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use a new report-knob show-issues option.

Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots

PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and lets you see all detected GPU stalled/idle issues in the same view.

Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view

Overview:

Disk Input and Output analysis (Preview) that monitors utilization of the disk subsystem, CPU and PCIe buses, helps identify long latency of I/O requests and imbalance between I/O and compute operations.

Details:

Intel® Xeon Phi™ Processor Support

Decide how to use MCDRAM (the high bandwidth memory) effectively using Memory Access Analysis, analyze the scalability of MPI and OpenMP* with HPC Performance Characterization Analysis, and explore the microarchitecture efficiency with General Exploration Analysis.

See the analysis usage example in the Analyzing an OpenMP and MPI Application web-based tutorial, which provides a hands-on exercise to identify memory utilization inefficiencies and load imbalance for a sample hybrid application.

Memory Access Analysis

The Memory Access Analysis has been improved. In addition to support for the Intel Xeon Phi processors, it now supports custom memory allocators, and includes automatic detection of maximum system DRAM bandwidth characteristics and scaling of bandwidth data relative to that maximum. This allows users to easily see how they actually utilize the available DRAM bandwidth, rather than just raw GB/s values. The QPI bandwidth has been split into Total, Outgoing, and Incoming, instead of just the total. The workflow has been optimized for identifying the top memory objects with high bandwidth utilization per domain. Finally, no special drivers are required on Linux*; this analysis type can now use standard Linux* perf to collect data, eliminating the need for root to install other drivers.

Disk I/O Analysis (Preview)

The Disk Input and Output analysis for HDD, SATA, or NVMe SSD monitors utilization of the disk subsystem, CPU, and PCIe buses, and helps to identify long latency of I/O requests and imbalance between I/O and compute operations.

GPU analysis improvements

The GPU Analysis Summary provides a set of metrics to estimate the GPU utilization per engine, identify stalled or idle execution units, and explore the most typical problems with low occupancy or frequent sampler accesses. Navigate from the Hottest GPU computing tasks summary to the details provided in the graphics tab.

Usability Improvements

Remote usage and Command Line usage have been improved. Use the Arbitrary target GUI configuration to generate a command line for performance analysis on a system that is not accessible from the current host.

MPI analysis has been extended with the event-based sampling collection supported for multiple ranks per node with an arbitrary MPI launcher and natural syntax. Use the MPI launcher option in the arbitrary targets configuration to automatically generate a command line for MPI analysis from the GUI.

An option for enabling and disabling the OpenMP regions analysis has been added to selected analysis configurations.

Support has been added for the Attach To Process target type with event-based sampling for low-privilege Java* daemons on Linux*.

The event selection mechanism for custom hardware event based sampling has been extended with filtering options.

The UI for the grid views and for identifying performance issues has been improved.

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

HPC Characterization analysis (preview) that monitors utilization of the CPU, memory, and FPU for a compute-intensive or throughput application and helps identify floating point operation and memory optimization opportunities.

Metric-based navigation between call stack types replacing the former Data of Interest selection

Updated filter bar options, including the selection of a filtering metric used to calculate the contribution of the selected program unit (module, thread, and so on)

New option to measure the maximum local bandwidth and use this data to scale the DRAM Bandwidth overtime view and calculate the bandwidth histogram thresholds

Support for Fedora* 23 and Ubuntu* 15.10

Support for the Microsoft Windows* 10 November update

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

Tune OpenMP Scalability Faster

Using the enhanced OpenMP* analysis you can effectively identify common performance bottlenecks of your parallel implementation, such as:

Execution of serial portions (outside of any parallel region): When the master thread is executing a serial region and when the worker threads are in the OpenMP runtime library waiting for the next parallel region.

Load imbalance: When a thread finishes its part of workload in a parallel region, it waits at a barrier for the other threads to finish.

Not enough parallel work: The number of loop iterations is less than the number of working threads so several threads from the team are waiting at the barrier not doing useful work at all.

Synchronization on locks: When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource.
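Two of the bottlenecks listed above reduce to simple arithmetic. The sketch below is a minimal model, not VTune Amplifier's measurement method: the function names and the sample numbers are hypothetical.

```python
def barrier_wait_times(work_times):
    """Load imbalance: each thread waits until the slowest thread arrives."""
    slowest = max(work_times)
    return [slowest - t for t in work_times]


def idle_threads(iterations, team_size):
    """Not enough parallel work: threads that receive no loop iteration."""
    return max(0, team_size - iterations)


# Made-up per-thread work times (seconds) in one parallel region
print(barrier_wait_times([4, 2, 3]))   # wait of each thread at the barrier
print(idle_threads(iterations=3, team_size=8))
```

The first function shows that one slow thread makes every other thread wait; the second shows that a loop with fewer iterations than threads leaves part of the team idle regardless of scheduling.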

VTune Amplifier, together with the Intel OpenMP runtime library from Intel Composer XE 2015 Update 3 or higher, helps you understand how an application utilizes available CPUs and identifies causes of CPU under-utilization.

Easier MPI+OpenMP Multi-Rank Analysis

Identify ranks with low MPI communication spin time to have the highest impact when tuning OpenMP performance.

For MPI analysis results including more than one process with OpenMP regions, the Summary window shows a section with top processes lying on a critical path of MPI execution with Serial Time and OpenMP Potential Gain aggregated per process:

For a detailed description and result interpretation instructions please refer to the MPI Analysis product help topic.

Microsoft Windows* 10 Support

Built-in Hyper-V may be enabled by default on some systems with Windows 10 installed. Hyper-V does not allow other tools to perform PMU event-based sampling (EBS) analysis. If you need to perform hardware analysis, follow the troubleshooting instructions in the VTune Amplifier product help to disable Hyper-V.

OpenMP Enhancements

CPU time-based classification of Spin and Overhead time in OpenMP runtime does not reveal the elapsed-time impact of a parallel region inefficiency because it depends on the number of working threads. VTune Amplifier’s new per-OpenMP region metrics that are based on CPU time are now normalized by the number of threads in the region and represented as an expansion of the “potential gain” metric.

Besides reporting the potential gain metric in absolute elapsed time, the VTune Amplifier display breaks down the impact of various issues by percentage of total application elapsed time.
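The normalization described above can be sketched as follows. This is an illustrative model under stated assumptions: the category names, the function names, and the sample values are hypothetical, and the real metric breakdown in VTune Amplifier may differ.

```python
def potential_gain(inefficiency_cpu_time, threads):
    """Convert CPU time lost across the team into elapsed-time estimates.

    CPU time summed over all threads is divided by the number of threads
    in the region to approximate the impact on wall-clock time.
    """
    return {name: cpu / threads for name, cpu in inefficiency_cpu_time.items()}


def as_percent_of_elapsed(gain, elapsed):
    """Express each potential gain as a percentage of application elapsed time."""
    return {name: 100.0 * value / elapsed for name, value in gain.items()}


# Made-up CPU time (seconds) lost in one region that ran with 4 threads
lost = {"imbalance": 8.0, "lock contention": 4.0}
gain = potential_gain(lost, threads=4)
print(gain)
print(as_percent_of_elapsed(gain, elapsed=20.0))
```

Dividing by the team size is what makes regions with different thread counts comparable in terms of elapsed time.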

Precise trace-based imbalance calculation that is especially useful for profiling of small region instances

Imbalance of working threads on barriers is a major performance issue that prevents efficient CPU utilization by OpenMP applications. VTune Amplifier’s sampling method may miss certain situations of imbalance, for example when:

region instances are shorter than the sampling interval,

the number of parallel region instances is insufficient to get statistically correct results, or

threads enter a passive wait on a barrier and don’t consume CPU time on a busy wait (e.g., with KMP_BLOCKTIME=0).

To avoid these situations, the Intel OpenMP runtime library from Parallel Studio XE 2016 Beta reports to VTune Amplifier the precise imbalance time. This additional information from the OpenMP runtime does not add overhead since the reporting is done on a per-barrier basis. The precise imbalance metrics are displayed when the OpenMP Potential Gain metric is expanded.

When an OpenMP region contains multiple constructs with barriers (e.g., loops with implicit barriers, a ‘single’ construct, or a user barrier), it is useful to distribute inefficiency metrics by barrier-to-barrier segments. Below is an example of a region based on four barrier-to-barrier segments.

The Intel OpenMP runtime from Intel Composer XE 2015 Update 3 (or higher) instruments barriers for VTune Amplifier to enhance its inefficiency metrics. The barrier type is added to the segment name – loop, single, reduction, etc. The runtime also emits additional information for parallel loops with implicit barriers, such as loop scheduling and chunk size, that is useful in understanding imbalance or the nature of the scheduling overhead. Use the /Barrier-to-Barrier Segment grouping to view the statistical distribution by barrier-to-barrier segments.

Please note that the same lexical loop constructs with different schedule types or chunk sizes will be displayed separately in different rows. For example, if one instance had a chunk size of 1000 and another had a chunk size of 1563, there would be two entries for the construct with the same name but different sizes in the OpenMP Loop Chunk column.

Barrier-to-Barrier Segments are also available on the timeline.

Intel® MPI and OpenMP Multi-rank Analysis on a Compute Node

For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. VTune Amplifier recognizes samples in Intel MPI communication busy wait functions and shows metrics based on that information. For multi-rank OpenMP results, VTune Amplifier’s Summary view is enriched with a table of Top MPI ranks with OpenMP metrics sorted by MPI Communication Spin time from low to high values. The lower the Communication time the more the rank was executing (vs. spinning) and the more impact OpenMP tuning will have on the application elapsed time.

Process names are hyperlinked to the Bottom-up view with ‘/Process /OpenMP Region/ …’ groupings to get details of the OpenMP metrics aggregated per-process, with the ability to expand the results by Regions and Barrier-to-Barrier Segments.

where specifies the rank range to be included in the VTune Amplifier analysis. Separate ranks with a comma or use the “-” symbol for a set of contiguous ranks. Use the ‘all’ value to configure profiling on all the ranks. Exclusive launch mode helps prevent running more than one collection per node, which is a limitation of EBS profiling.
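The rank-set syntax described above (commas, “-” for contiguous ranges, and ‘all’) can be illustrated with a small parser. This is a sketch for clarity only: the function name and the `world_size` parameter are hypothetical, and it does not reflect how Intel MPI or VTune Amplifier parse the value.

```python
def parse_rank_set(spec, world_size):
    """Expand a rank set such as '0,2-4' or 'all' into a sorted list of ranks."""
    if spec.strip() == "all":
        return list(range(world_size))
    ranks = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Contiguous range, inclusive on both ends
            first, last = (int(x) for x in part.split("-"))
            ranks.update(range(first, last + 1))
        else:
            ranks.add(int(part))
    return sorted(ranks)


print(parse_rank_set("0,2-4", world_size=8))
print(parse_rank_set("all", world_size=3))
```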

Starting with Intel MPI version 5.0.3, the ‘node-wide’ clause can be used instead of ‘exclusive’ to make collection on all ranks of the nodes on which the resides, or for all nodes in the case of ‘all’ ranks. In this case, VTune Amplifier will create a result directory per node with host name suffix for the result directory name. This is particularly convenient for EBS collection, where there are limitations on simultaneous profiling by multiple VTune Amplifier command lines.

With Intel Trace Analyzer and Collector (ITAC) 9.0.2 and later, you can generate VTune Amplifier hotspot analysis command lines for ranks selected in the ITAC graphical user interface, either from the “Event Timeline” chart or from the ‘Function Profile/Load Balance’ grid view. You can also copy the generated command line for the most CPU-bound process from the ITAC summary page.

User Interface Enhancements

General Exploration analysis with confidence indication

Some of the metrics in VTune Amplifier views may now be marked as unreliable by greying out the values in the following views: Summary, Bottom-up, and Source. This can happen when the amount of collected event samples is too low to reliably calculate the metric.

Currently it is used for EBS metrics on General Exploration analysis but it may be extended to more metrics in the future if the feedback is favorable.
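The idea behind the confidence indication can be sketched as a simple sample-count check. The cut-off of 100 samples and the names below are hypothetical; the source does not state the threshold VTune Amplifier applies.

```python
MIN_SAMPLES = 100  # hypothetical cut-off, not VTune Amplifier's real value


def format_metric(value, samples, min_samples=MIN_SAMPLES):
    """Return (text, reliable) so that a view could grey out unreliable values."""
    reliable = samples >= min_samples
    return f"{value:.3f}", reliable


print(format_metric(1.23456, samples=500))  # enough samples: shown normally
print(format_metric(0.5, samples=3))        # too few samples: would be greyed out
```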

Timeline “Super Tiny” bird’s-eye view

Timeline analysis of core utilization on modern servers and many-core coprocessor cards with a large number of ranks/threads benefits from a bird’s-eye view that lets you recognize application phases and behavioral patterns before zooming and filtering the data. VTune Amplifier’s “Super Tiny” view shows all application threads at once, using pixel color intensity to reflect the Efficient, Spin & Overhead, and MPI Communication Time metrics. With “Super Tiny”, the timeline hierarchical grouping shows only the leaves, grouped according to the grouping hierarchy:

Access the new view in the timeline context menu:

New Filtering Mode for Command Line Reports

To display only particular columns providing metrics/event data, use the -column option and specify a full name of the required column(s) or its substring.

Examples:

Show grouping and data columns only for event columns with the *INST_RETIRED.* string in the title:

$ amplxe-cl -R hw-events -r r000ah --column=INST_RETIRED.

Show grouping and data columns only for columns with the Idle and Spin strings in the title:
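The substring matching used in both examples can be pictured with the following sketch, which keeps the grouping column plus any data column whose title contains one of the given strings. The function name, the `grouping` parameter, and the column titles are made up for illustration; this models the filtering behavior only and is not part of the amplxe-cl tool.

```python
def select_columns(columns, patterns, grouping=("Function",)):
    """Keep grouping columns plus data columns whose title contains a pattern."""
    return [
        c for c in columns
        if c in grouping or any(p in c for p in patterns)
    ]


# Made-up column titles for illustration
titles = ["Function", "INST_RETIRED.ANY", "CPU_CLK_UNHALTED.THREAD",
          "Idle Time", "Spin Time"]
print(select_columns(titles, ["INST_RETIRED."]))
print(select_columns(titles, ["Idle", "Spin"]))
```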

Due to PMU limitations, Advanced Hotspots cannot be collected inside an Intel® TSX transaction. Thus, the new “TSX Hotspots” analysis type has been added to help identify performance-critical program units inside transactions.

Total, Read, and Write Bandwidth timeline areas merged into a single area, making it easier to see all bandwidth activity

Grouping by package for the CPU Time timeline area

GPU Architecture Diagram

On Windows* systems with Intel HD Graphics you may find it easier to analyze your OpenCL application by exploring the GPU hardware metrics per GPU architecture blocks.

To do this, choose the Computing Task grouping level in the Graphics window, select an OpenCL kernel of interest and click the Architecture Diagram tab in the Timeline pane. VTune Amplifier updates the architecture diagram for your platform with performance data per GPU hardware metrics for the time range the selected kernel was executed.

GPU analysis on Linux

GPU analysis on Linux* targets is now available in VTune Amplifier XE, including:

To perform analysis of Intel® Media SDK tasks execution over time, make sure to configure your Linux kernel according to the “Intel® Media SDK Program Analysis Configuration” topic in the VTune Amplifier help.

The Global/local accesses hardware event set for GPU analysis has been renamed Compute basic (with global/local memory accesses) to better represent the collected data. See the description in the "GPU Metrics" topic of the product help for detailed metrics.

With enhanced OpenMP* region analysis, identify common performance bottlenecks, such as load imbalance, granularity issues or synchronization issues. See serial and parallel times for your application and potential tuning gains for parallel regions. For more details refer to the “OpenMP* Analysis” topic in the product help.

Easier data collection on Intel® Xeon Phi™ coprocessors

Collecting data on Intel® Xeon Phi™ coprocessors is easier than ever with an improved analysis workflow via the new target system configuration options. Call stack collection is also now supported for Intel Xeon Phi coprocessors. ITT API collection (including OpenMP* analysis) now works out of the box on the Intel Xeon Phi coprocessor, with no need to set any environment variables, for both native and offload applications. For more details, refer to the “Intel Xeon Phi Coprocessor Analysis Workflow” topic in the product help.

Easier to use General Exploration and Bandwidth Analysis

Stop worrying about which microarchitecture you’re profiling and use the new General Exploration and Bandwidth analysis types, enabling you to use the same command line on any supported system! For more details, please refer to the “About Performance Analysis with VTune Amplifier” topic in the product help.

The hardware event-based sampling analysis tree has been re-structured to introduce cross-CPU basic configurations and separate advanced CPU-specific analysis configurations. General Exploration and Bandwidth analysis types are shared between all supported CPUs. All tuning opportunities are covered by the General Exploration analysis type for newer processor families, e.g., Ivy Bridge and beyond. Review the Tuning Guides to take full advantage of the General Exploration analysis type. CPU specific analysis types, when available, are expanded automatically according to the detected processor type for older processor families (see note below).

NOTE: The Ivy Bridge family of processors no longer has separate advanced analysis types, only General Exploration and Bandwidth. The Sandy Bridge advanced analysis types that used to be available for Ivy Bridge did not work on Ivy Bridge processors because of hardware incompatibilities and the metrics of interest are now included in the General Exploration analysis type. Also, the Haswell processor family does not have separate advanced analysis types. Again, use the General Exploration metrics and the Haswell tuning guide.

Custom Groupings

Many new ways to group and order the performance data, including custom groupings in grid views and new groupings in the timeline pane.
To see how to create a custom grouping please refer to the "Grouping Data" and "Dialog Box: Custom Grouping" topics in the product help.

Use the Timeline grouping menu to group the data by program units. A grouping level depends on the analysis type. For more details, please refer to “Managing Timeline View” in the product help.

Enhanced navigation in the clickable Summary pane

Hyperlinks open the Bottom-up view sorted by the selected metric or directly to the selected function or OpenMP region.

Easier remote collection

Use the graphical interface running on a Windows* or Linux* host system to collect data on a remote Linux* system via SSH. Configure remote collection via the “remote Linux (SSH)” Target system configuration option in the Project Properties dialog:

NOTE:

ssh/scp or plink/pscp tools must be available in the PATH

When collecting data remotely, VTune Amplifier XE looks for the compatible collector on the remote system in the default install location: /opt/intel/vtune_amplifier_xe_. It also temporarily stores performance results on the target system in the /tmp directory. If you installed VTune Amplifier XE to a different location on the target or need to specify another temporary directory, use the appropriate configuration options in the Project Properties/Target tab in the GUI, or the collection knobs -target-install-dir and -target-tmp-dir in the command line.

If your target application requires a custom working directory or user-defined environment variables, you can specify them via a launching script and use the script as the application to launch.
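A launching script of this kind could look like the minimal sketch below, which runs the real application from a chosen working directory with extra environment variables. The working directory, the variable, and the function names are hypothetical placeholders; adapt them to your target.

```python
import os
import subprocess
import sys

WORK_DIR = "/tmp"                      # hypothetical working directory
EXTRA_ENV = {"OMP_NUM_THREADS": "4"}   # hypothetical user-defined variables


def build_env(extra, base=None):
    """Copy the base environment and overlay the user-defined variables."""
    env = dict(os.environ if base is None else base)
    env.update(extra)
    return env


def main(argv):
    # argv holds the real application and its arguments
    result = subprocess.run(argv, cwd=WORK_DIR, env=build_env(EXTRA_ENV))
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

You would then specify this script, followed by the real application and its arguments, as the application to launch.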

For more details please refer to the "Collecting Data Remotely from the VTune Amplifier GUI" topic in the product help.

Analyze Linux* or Windows* profiling data on your OS X* host

Use a Mac* computer as your main system? Now you can host the VTune Amplifier GUI on Mac computers running OS X to view remotely collected results, including the ability to configure and launch remote collection to supported Linux systems.

Once you have registered your Windows or Linux product, an OS X viewer is available for download without additional cost (see below). It will use your existing Windows or Linux license. Note: performance profiling on Mac computers is not available.

The VTune Amplifier XE viewer for OS X is available as a separate download in the Intel Software Development Products Registration Center, e.g.:

After clicking on the "Version 2015" in the right column, you will see the following. Click on the .dmg file to download it, or use the download manager.

After downloading the vtune_amplifier_xe_2015.dmg file, follow these steps to install the software:

Install instructions

Open up permissions to "/Users/Shared/Library/Application Support" to allow the installation of the license file.

Start the 'Finder' application on your OS X* system.

Find the file 'vtune_amplifier_xe_2015.dmg'

Open/Click on the .dmg file to mount the disk-image.

In the newly opened window, double-click the 'vtune_amplifier_xe_2015.mpkg' item to start the installation.

All GUI applications use the 'Applications' folder as their destination. As a result of a successful installation, 'VTune Amplifier XE 2015' should be created in 'Applications' folder.

You may start VTune Amplifier XE 2015 by double-clicking on it in the 'Applications' folder.

Un-install instructions

Ensure that the 'VTune Amplifier XE 2015' application is closed.

Open the 'Finder' application

Drag the 'VTune Amplifier XE 2015' application from the 'Applications' folder (or from the folder where it is installed) and drop it in the 'Trash' on the desktop.

Reduce overhead by limiting stack depth

Reduce collection overhead for custom event-based sampling analysis types using the new option to limit call stack depth (in system pages). Use the '-stack-depth' collector knob in the command line and the corresponding GUI control "Stack size" in the Custom Analysis dialog for the hardware-based sampling.

Import externally collected data

Extend your analysis by importing externally collected data into existing results. VTune Amplifier provides the ability to correlate interval or discrete data, provided by an external collector, with the regular data collected by the profiler. To learn more, refer to the “Adding External Data to the Intel® VTune™ Amplifier” topic in the product help.

You can extend standard VTune Amplifier performance analysis and launch a custom data collector directly from the VTune Amplifier. Your custom collector can be an application you analyze with the VTune Amplifier or a collector that can be launched with the VTune Amplifier. Learn more about configuring and launching a custom collector from the GUI and the command line in the “Using a Custom Collector” help topic.

VTune Amplifier can process and integrate performance statistics collected externally with a custom collector or with your target application in parallel with the native VTune Amplifier analysis. To achieve this, provide the collected custom data as a csv file with a predefined structure and save this file to the VTune Amplifier result directory.
VTune Amplifier can load and process the following data types:

Interval data with start time and end time

Samples with a set of counters

To make the VTune Amplifier interpret the custom statistics from the csv file, make sure the file format meets the requirements specified in “Creating a CSV File with External Data” help topic.

Use the TSX Exploration analysis for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel® TSX, see Web resources about Intel® Transactional Synchronization Extensions.

NOTE: The analysis is supported only for Intel processors with the Intel® TSX feature enabled. Because of recently published errata, some systems may have this feature disabled by default.

The tuning process consists of 2 steps:

Measuring transactional success
The first step is to measure the transactional success in an application.
Select 'TSX Exploration' analysis type and choose ‘1. Transactional success’ from the ‘Analysis Step’ combo box, as shown below:
Three metrics are collected:

Clockticks – total number of unhalted cycles collected

Transactional Cycles – number of cycles spent during transactions. If it is near zero then the application is either not using lock-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.

Abort Cycles - number of cycles spent during transactions which were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (but not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. In that case, identify the causes of the transactional aborts and reduce them, as described in the next step.
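The reasoning over these three metrics can be sketched as a small decision function. The numeric thresholds (1%, 10%, 90%) and all names are assumptions chosen for illustration; the source gives only qualitative guidance ("near zero", "small", "almost the same").

```python
def tsx_ratios(clockticks, transactional_cycles, abort_cycles):
    """Share of cycles inside transactions, and share of those that aborted."""
    in_tx = transactional_cycles / clockticks if clockticks else 0.0
    aborted = abort_cycles / transactional_cycles if transactional_cycles else 0.0
    return in_tx, aborted


def tsx_verdict(clockticks, transactional_cycles, abort_cycles,
                near_zero=0.01, low_abort=0.1, high_abort=0.9):
    """Classify transactional success using hypothetical thresholds."""
    in_tx, aborted = tsx_ratios(clockticks, transactional_cycles, abort_cycles)
    if in_tx < near_zero:
        return "no transactional execution"   # no lock elision in effect
    if aborted <= low_abort:
        return "high success rate"            # no further tuning needed
    if aborted >= high_abort:
        return "mostly aborting"              # sample the aborts next
    return "mixed"                            # sampling the aborts may still help


print(tsx_verdict(1000, 400, 20))
print(tsx_verdict(1000, 400, 390))
```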

Sampling transactional aborts
Select the 'TSX Exploration' analysis type and choose the ‘2. Aborts’ option from the ‘Analysis Step’ combo box.
This analysis shows where the transaction aborts are happening and for what reason. Possible reasons include:

Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort in the implementation.

Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.

Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.

OpenCL™ Software Technology Kernel Analysis

If your application uses OpenCL software technology and is doing substantial computational work on the GPU, capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics by enabling the 'Trace OpenCL kernels on Processor Graphics' option during analysis configuration. To view information about all OpenCL kernels running on the GPU, in the Graphics tab of the analysis results switch the grouping to 'Computing Task Purpose / Computing Task (GPU) / Instance'. VTune Amplifier identifies the following computing task purposes:

Compute (kernels)

Transfer (OpenCL routines responsible for transferring data from the host to a GPU)

Synchronization (for example, clEnqueueBarrierWithWaitList)

The corresponding columns show the overall time a kernel ran on the GPU and the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), working group sizes, as well as averaged GPU hardware metrics collected for a kernel. The cell is highlighted (pink) when there is a potential tuning opportunity. Hover over the cell to read the issue description.
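The relationship between the overall time and the average time per invocation can be sketched as a simple aggregation over kernel instances. This is only an illustration of the arithmetic; the function name and the input layout (pairs of kernel name and duration) are hypothetical and do not reflect how VTune Amplifier stores its data.

```python
from collections import defaultdict


def summarize_kernels(instances):
    """instances: iterable of (kernel_name, duration_seconds) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for name, duration in instances:
        totals[name][0] += duration  # overall time the kernel ran
        totals[name][1] += 1         # number of invocations
    return {name: {"total": t, "count": n, "average": t / n}
            for name, (t, n) in totals.items()}


summary = summarize_kernels([("blur", 0.5), ("blur", 1.5), ("copy", 0.25)])
print(summary["blur"])
```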

To view details on OpenCL kernel submission, in particular to distinguish the order of submission from the order of execution and to analyze the time spent in the queue, zoom in and explore the Computing Queue data in the Timeline pane. Click a kernel task to highlight its whole path through the queue up to its execution, which is displayed at the top layer.

Auto-driver rebuild

Did you update your Linux kernel and now the sampling driver won’t load? No worries! With the new auto-rebuild feature, the sampling driver detects the kernel update and automatically attempts to rebuild and load the driver.

Starting with this release, if the boot scripts have been installed so that the sampling drivers are automatically loaded during boot time, the boot scripts check for a change in the kernel and automatically rebuild the driver at boot time. If the rebuild succeeds, the new drivers are loaded so that samples can be collected with the updated kernel. For this feature to work, make sure to update the kernel sources whenever you update the running kernel.
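One way to check in advance whether kernel build files are present for the running kernel is sketched below. It assumes the common Linux convention that build files are reachable through /lib/modules/<release>/build; this is not a description of what the VTune Amplifier boot scripts do, and the function names are hypothetical.

```python
import platform
from pathlib import Path


def kernel_build_dir(release=None, modules_root="/lib/modules"):
    """Conventional location of build files for a kernel release (assumption)."""
    release = release or platform.release()
    return Path(modules_root) / release / "build"


def can_rebuild_driver(release=None, modules_root="/lib/modules"):
    """True if build files for the given (or running) kernel appear to exist."""
    return kernel_build_dir(release, modules_root).exists()


print(kernel_build_dir().as_posix(), can_rebuild_driver())
```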

Driver-less Event-Based Sampling collection

Can’t install the Intel event-based sampling driver on Linux because IT won’t let you have root access? Advanced analysis is available even if you can’t install the Intel event-based sampling driver.

Driver-less event-based sampling is supported for the Advanced Hotspots, General Exploration and Custom analysis types on Linux* operating systems based on kernel 2.6.32 or higher, which exports CPU PMU programming details through the /sys/bus/event_source/devices/cpu/format file system. This driver-less sampling collection mode is based on the Linux perf* functionality. VTune Amplifier automatically enables the driver-less collection if the Intel event-based sampling driver cannot be installed during product installation.
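The two prerequisites named above (kernel 2.6.32 or higher and the presence of the PMU format directory) can be checked with a short script. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the version parsing only looks at the leading numeric part of the release string.

```python
import os
import re

PMU_FORMAT_DIR = "/sys/bus/event_source/devices/cpu/format"


def kernel_at_least(release, minimum=(2, 6, 32)):
    """Compare the leading numeric part of a kernel release string."""
    numbers = [int(n) for n in re.findall(r"\d+", release.split("-")[0])[:3]]
    numbers += [0] * (3 - len(numbers))  # pad short versions such as "4.4"
    return tuple(numbers) >= minimum


def driverless_sampling_possible(release, format_dir=PMU_FORMAT_DIR):
    """Both conditions from the text: recent kernel and exported PMU details."""
    return kernel_at_least(release) and os.path.isdir(format_dir)


print(driverless_sampling_possible(os.uname().release))
```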

NOTE: The Intel event-based sampling driver provides additional features not available in perf, such as:

Stacks

Uncore events

Multiple precise events

New events for the latest processors, even on older OSes

NMI Watchdog timer automatically disabled during EBS data collection

The Non Maskable Interrupt (NMI) watchdog timer causes incorrect results in the PMU event-based sampling (EBS) analysis.
Previously, VTune Amplifier XE refused to perform EBS collection if the nmi_watchdog was ON, and the user had to disable it manually.
Now the nmi_watchdog timer is disabled automatically for the duration of EBS collection. No more hassles turning it on and off. Profiling just works!
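To see the current watchdog state yourself, you can read the kernel's nmi_watchdog setting. The sketch below assumes the standard Linux location /proc/sys/kernel/nmi_watchdog; the function name is hypothetical, and the code only reads the value without changing it.

```python
def nmi_watchdog_enabled(path="/proc/sys/kernel/nmi_watchdog"):
    """Return True or False for the watchdog state, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        return None


print(nmi_watchdog_enabled())
```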

Perf data visualization

Are you collecting event-based sampling data with the Linux ‘perf’ tool? Visualize it now in the VTune Amplifier GUI for enhanced analysis!

Improved OpenMP* region analysis

Two common sources of overhead in an OpenMP* program are serial time and load imbalance. OpenMP is a fork-join parallel model, which means that an OpenMP program starts with a single master thread executing serial code. Parallel regions cause the master thread to fork into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for synchronization. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code. In such a program, time is spent waiting in the OpenMP runtime in two cases:

Serial time: When the master thread is executing a serial region, the worker threads in the OpenMP runtime are waiting for the next parallel region.

Load imbalance: When a thread finishes a parallel region, it waits in a barrier for the other threads to finish.
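The load imbalance case can be made concrete with a little arithmetic: each thread waits at the barrier from the moment it finishes until the slowest thread finishes. The sketch below is a toy calculation with hypothetical function names and made-up finish times, not a model of the OpenMP runtime.

```python
def barrier_wait_times(finish_times):
    """Time each thread idles at the end-of-region barrier."""
    last = max(finish_times)
    return [last - t for t in finish_times]


def imbalance_fraction(finish_times):
    """Share of total thread time in the region spent waiting (times must be > 0)."""
    waits = barrier_wait_times(finish_times)
    return sum(waits) / (max(finish_times) * len(finish_times))


times = [4.0, 6.0, 8.0, 10.0]  # seconds each thread needed for its share
print(barrier_wait_times(times), imbalance_fraction(times))
```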

Intel® VTune™ Amplifier together with Intel Composer XE 2013 Update 2 or later helps you understand where an OpenMP program is serial and where it is imbalanced. It also provides a mechanism to correlate the time spent in the OpenMP runtime with the source code of the program. The OpenMP runtime library in the Intel Composer XE contains markers that can be used by the VTune Amplifier to break out the time in OpenMP by parallel region and serial code. The following paragraphs highlight the enhancements.

Summary pane: Use the OpenMP Region Duration histogram to analyze instances of each OpenMP region, explore the time distribution per instance, and identify Fast/Good/Slow region instances. You can then focus on performance outlier instances in the Grid/Timeline views. Initial distribution of region instances by Fast/Good/Slow categories is done as a ratio of 20/40/20 between min and max region time values.

Bottom-up pane: Select the OpenMP Region grouping level and analyze CPU, Spin and Overhead time spent in OpenMP regions. High Spin time values signal a parallel region imbalance. As a potential solution, you may set dynamic scheduling to reduce the imbalance. High Overhead time values can result from parallel work that is too fine-grained and has a high scheduling cost. In this case consider increasing the parallel work executed by a worker thread, for example, by defining the region for an outer loop.

Top-down Tree pane: Explore the logical program flow of OpenMP regions. Call stacks of worker threads are properly joined with the corresponding fork point (OMP parallel for or OMP parallel directives) in the master thread so you can see the full control flow for a hotspot in worker threads.

Timeline pane: Explore markers on the Timeline ruler area corresponding to OpenMP region instance duration. Hover over a marker to see the details on the region instance executed at this particular moment of time or click the marker to select the region on the timeline and filter data by region time.
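To illustrate why dynamic scheduling can reduce the imbalance mentioned for the Bottom-up pane, the following toy model compares two ways of distributing loop iterations of unequal cost across threads. It is a simplified simulation with hypothetical function names; it ignores scheduling overhead and is not how the OpenMP runtime is implemented.

```python
import heapq


def static_makespan(costs, threads):
    """Contiguous equal-sized blocks, one per thread (like a static schedule)."""
    block = -(-len(costs) // threads)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, threads):
    """Each idle thread takes the next iteration (like dynamic, chunk size 1)."""
    busy_until = [0] * threads
    heapq.heapify(busy_until)
    for cost in costs:
        start = heapq.heappop(busy_until)  # earliest thread to become free
        heapq.heappush(busy_until, start + cost)
    return max(busy_until)


costs = list(range(1, 17))  # iteration i costs i units: a triangular loop
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

In this toy model the static split finishes after 58 units and the dynamic one after 40, because the last static block holds all the expensive iterations. In a real program, the extra scheduling cost of dynamic distribution shows up as Overhead time, so the two metrics should be weighed together.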

With Intel® Core™ processors based on the Intel microarchitecture code name Haswell, use the special VTune Amplifier analysis type TSX Exploration for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel TSX, see Web Resources about Intel® Transactional Synchronization Extensions.

NOTE: Perform the analysis on Haswell processors without the "K" designator. For example, the Intel® Core™ i7-4770K does not support Intel TSX.

The tuning process consists of 2 steps:

Measuring transactional success
The first step is to measure the transactional success in an application. Select the TSX Exploration analysis type and choose 1. Transactional success from the Analysis Step combo box.
Three metrics are collected:

Clockticks – total number of unhalted cycles collected

Transactional Cycles – number of cycles spent during transactions. If it is near zero then the application is either not using Intel TSX-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.

Abort Cycles - number of cycles spent during transactions which were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (but not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. The next step would be to identify the causes for transactional aborts and reduce them – see next step.

Sampling transactional aborts
Select the TSX Exploration analysis type and choose the 2. Aborts option from the Analysis Step combo box.
This analysis shows where the transaction aborts are happening and for what reason. Possible reasons include:

Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort.

Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.

Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.
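The cache-line granularity of Data Conflict detection can be illustrated with address arithmetic: two unrelated variables conflict if their addresses fall in the same line. The sketch below assumes a 64-byte cache line, which is an assumption about the target processor rather than something stated in these notes, and the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: line size of the target processor


def cache_line(address, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that contains a byte address."""
    return address // line_bytes


def may_conflict(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if two byte addresses map to the same cache line."""
    return cache_line(addr_a, line_bytes) == cache_line(addr_b, line_bytes)


print(may_conflict(0x1000, 0x1008), may_conflict(0x1000, 0x1040))
```

Padding or aligning frequently written data so that independent items land in different lines is a common way to avoid such false conflicts.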

"Multiplexing reliability" metric for General Exploration

A new multiplexing (MUX) reliability metric is now available for the General Exploration analysis type. Use this metric to know whether the data for your collection was statistically valid. Values close to 90% (i.e., 0.900) are desirable. Please see the documentation for more information on multiplexing events.

Multiplexing can reduce precision when, for example, the collection is so short that a statistically relevant number of events is not counted during the collection period.

In this case, either check the "Allow multiple runs" option in the Project Properties or increase the collection time, for example by increasing the workload of your application so that it runs longer.

Extended Summary window with hyperlinks for Top Hotspots and performance metrics navigating to the Bottom-up grid view

The Summary Pane has been enriched with hyperlinks for Top Hotspots, performance metrics and General Exploration issues, which navigate a user to the Bottom-up grid view with the respective function item selected or column with the metric sorted.

Default settings for the Call Stack Mode drop-down menu on the filter bar have been changed to "User functions + 1".

With the previous default Call Stack Mode, "Only user functions", customers were often surprised not to see library code in the results even though they were sure the application used MKL, IPP, or some other library. VTune Amplifier usually treats such libraries as “system” code, and in this mode all system code is attributed back to the user function that called it. Attributing everything to user functions created some confusion.

The User functions + 1 mode filters out all system functions except those called directly from user functions, so you can see which top-level system function is hot and which user function calls it.
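The difference between the two modes can be sketched as a filter over a call stack. This is only an illustration of the rule described above, with hypothetical function names and a made-up stack; it is not VTune Amplifier's implementation. Time would be attributed to the last frame that survives the filter.

```python
def only_user_functions(stack):
    """stack: list of (name, is_system) frames, outermost caller first."""
    return [name for name, is_system in stack if not is_system]


def user_functions_plus_one(stack):
    """Keep user frames plus system frames called directly from user code."""
    kept = []
    caller_is_user = False  # assumption: system frames at the root are dropped
    for name, is_system in stack:
        if not is_system or caller_is_user:
            kept.append(name)
        caller_is_user = not is_system
    return kept


# Hypothetical stack: user code calling into a math library.
stack = [("main", False), ("solve", False),
         ("cblas_dgemm", True), ("dgemm_kernel", True)]
print(only_user_functions(stack)[-1], user_functions_plus_one(stack)[-1])
```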

NOTE: The change is visible only in newly created VTune Amplifier projects or in existing projects where you never changed the Call Stack mode. Otherwise, the Call Stack mode is inherited from the project properties.

Updated product toolbar

The product toolbar now provides quick access to the product documentation through the new Help button and to the Import dialog box (standalone only) through the Import Result button.

Added remote system configuration options

The Target tab of the Project Properties has been enhanced to specify a path to the VTune Amplifier installed on the remote machine and a path to a remote temporary directory used for storing performance results.

When collecting data remotely, the VTune Amplifier XE looks for the collectors on the target system in the default install location: /opt/intel/vtune_amplifier_xe_2013. It also temporarily stores performance results on the target system in the /tmp directory.

If you installed the VTune Amplifier XE to a different location on the target or need to specify another temporary directory, use the appropriate configuration options on the Target tab of the Project Properties in the GUI, or the command line collection knobs -target-install-dir and -target-tmp-dir.

You no longer need to disable the NMI Watchdog Timer on Linux* to use the VTune Amplifier hardware-based sampling support! Now, VTune Amplifier will automatically turn it off during collection. One more thing that you don't have to ask your admin to do!

Previous releases of the VTune Amplifier XE refused to perform hardware-based collection if the Non Maskable Interrupt (NMI) watchdog timer was enabled because it would cause incorrect results, so the user had to disable it manually.

Effective with the VTune Amplifier XE 2013 Update 17 release, the timer is automatically disabled only for the duration of hardware-based collection. It is automatically re-enabled after collection completes. A message to that effect is displayed in the collection log window.

You may use the VTune Amplifier XE graphical interface running on a Windows* or Linux* host system to collect data on a remote Linux* system via SSH. To configure remote collection:

Go to the Project Properties dialog Target tab

Select remote Linux (SSH) from the Target system drop-down menu

In the SSH details field, enter the username and hostname for your remote Linux system in username@hostname format

Select your profiling target from the Target type drop-down menu. You may select any type of profiling target: application, process, or system analysis

Configure other Project properties if required and click OK to save your settings and close the Project Properties dialog box

Start a New Analysis

NOTE:

ssh/scp or plink/pscp tools must be available in the PATH

When collecting data remotely, VTune Amplifier XE looks for the compatible collector on the remote system in the default install location: /opt/intel/vtune_amplifier_xe_. It also temporarily stores performance results on the target system in the /tmp directory. If you installed the VTune Amplifier XE on the remote system to a different location or need to specify another temporary directory, set the following environment variables on the host before starting amplxe-gui:
AMPLXE_TARGET_PRODUCT_DIR=
AMPLXE_TARGET_TMP_DIR=
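If you start the GUI from a script, the two variables can be added to the process environment before launch. The sketch below is one possible way to do that; the function name and the directory paths are hypothetical placeholders, and the actual launch call is left as a comment.

```python
import os


def remote_collection_env(product_dir, tmp_dir, base_env=None):
    """Copy an environment and add the two variables named in the note above."""
    env = dict(os.environ if base_env is None else base_env)
    env["AMPLXE_TARGET_PRODUCT_DIR"] = product_dir
    env["AMPLXE_TARGET_TMP_DIR"] = tmp_dir
    return env


# Hypothetical paths, for illustration only.
env = remote_collection_env("/home/user/vtune", "/home/user/vtune_tmp")
# subprocess.Popen(["amplxe-gui"], env=env) would then start the GUI with them.
print(env["AMPLXE_TARGET_PRODUCT_DIR"], env["AMPLXE_TARGET_TMP_DIR"])
```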

If your target application requires a custom working directory or user-defined environment variables, you can specify them in a launching script and use the script as the application to launch.

Update 16 simplifies setting up ITT API collection for native analysis on the Intel® Xeon Phi™ coprocessor. If you chose the default installation options, with the libittnotify library installed on the coprocessor (/usr/lib64/libittnotify.so exists on your card), set the KMP_FOR_TPROFILE=1 environment variable for the application you launch on the card, either in the ssh command or in your launch script.

To import a csv file with externally collected data into an existing VTune Amplifier result, use the Import from CSV option in the Analysis Target or Analysis Type tabs in the GUI, or the -import option in the command line interface. Importing a csv file does not affect symbol resolution in the existing result. For more details please refer to the “About Adding External Data to the Intel® VTune™ Amplifier” topic in the product help.

Support for importing a csv file that does not specify a hostname for the target system

You can import a csv file that does not specify a hostname for the target system but contains time stamps represented in the UTC format. In this case, the VTune Amplifier displays global data (not attributed to specific threads/processes) only. For more details please refer to the “Creating a CSV File with External Data” topic in the product help.

Search functionality for the grid views added to the toolbar

A Find button is now available on the grid toolbar. It invokes the search dialog in the same way as Ctrl+F. See “Searching for Data” for more details.

Hardware event-based analysis types now support collection data limit

Hardware event-based analysis types now support a collection data limit to prevent collecting large amounts of data, which may slow down data processing. For more details, please refer to “Limiting Data Collection Size”.

The hardware event-based sampling analysis tree was restructured to introduce cross-CPU basic configurations and separate advanced CPU-specific analysis configurations. General Exploration and Bandwidth analysis types are shared between all supported CPUs. For newer processor families, such as Ivy Bridge (IVB) and beyond, all tuning opportunities are covered by the General Exploration analysis type. Users should review the Tuning Guides to take full advantage of the General Exploration analysis type. For older processor families, CPU-specific analysis types, when available, are expanded automatically according to the detected system type (see note below). For more details, please refer to “About Performance Analysis with VTune Amplifier”.

NOTE: The Ivy Bridge family of processors no longer has separate advanced analysis types, only General Exploration and Bandwidth. The Sandy Bridge advanced analysis types that used to be available for Ivy Bridge did not work on Ivy Bridge processors because of hardware incompatibilities and the metrics of interest are now included in the General Exploration analysis type. The Haswell processor family does not have separate advanced analysis types, either.

Timeline grouping options

Use the Timeline grouping menu to group the data by program units. A grouping level depends on the analysis type. For more details, please refer to “Managing Timeline View” in the product help.

Auto-rebuild of sampling and power drivers at system boot time after Linux kernel update

A Linux kernel update can lead to incompatibility with VTune Amplifier XE drivers for event-based sampling (EBS) analysis and power analysis. If the system has installed VTune Amplifier XE boot scripts, the drivers will be automatically re-built by the boot scripts at system boot time. Note: kernel development sources that are needed on the system for driver rebuild must correspond to the Linux kernel update.

Ability to change the focus function from the Caller/Callee panes

Change the focus function from the Callers or Callees panes by double-clicking a function of interest. Alternatively, right-click a function and choose the Change Focus Function context menu option. For more details please refer to the "Window: Caller/Callee" topic in the product help.

Ability to collapse recursive functions in the Call Stack pane

To collapse all recursive functions into one entry in the Call Stack pane, select the Collapse Recursion option from the context menu.

VTune Amplifier updates the view and marks the entry with collapsed recursion as follows:

If there are several frame domains, use the Domain drop-down menu in the Summary pane to choose the frame domain to analyze with the frame rate histogram. If only one domain is available, the drop-down menu is grayed out. For more details please refer to the "Window: Summary" topic in the product help.

Automatic positioning of the hottest line in the Source/Assembly window after drilling down from the grid

The VTune Amplifier Source/Assembly window now automatically scrolls to the hottest line after you drill down from the grid.

Support for importing global discrete counters collected externally

You can import global discrete counters without specifying PID/TID. In that case the performance counter timestamp is not bound to a particular process or thread, and the data is visualized in the Timeline pane in a new area for global counters, with a separate row for each counter type. For more details please refer to the "Creating a CSV File with External Data" topic in the product help, paragraph "Format for Discrete Values" and examples.

If your application uses OpenCL software technology and is doing substantial computation work on the GPU, you may capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics by enabling the Trace OpenCL kernels on Processor Graphics option during analysis configuration. To view information about all OpenCL kernels running on the GPU, in the Graphics window switch Grouping to Computing Task Purpose / Computing Task (GPU) / Instance. VTune Amplifier identifies the following computing task purposes: Compute (kernels), Transfer (OpenCL routines responsible for transferring data from the host to a GPU), and Synchronization (for example, clEnqueueBarrierWithWaitList). The Data Transferred column shows the amount of data transferred together with the average bandwidth.

To view details on OpenCL kernel submission, in particular to distinguish the order of submission from the order of execution and to analyze the time spent in the queue, zoom in and explore the Computing Queue data in the Timeline pane. Click a kernel task to highlight its whole path through the queue up to its execution, which is displayed at the top layer.

Standalone interface improved to provide more workspace for the analysis results

In VTune Amplifier XE Update 14, the menu and toolbar layout of the standalone interface was improved to provide more vertical space for exploring analysis results. The menu is now invoked by the button in the top right corner. Use it to control result collection, define and view project properties, and set various options.

Ability to cache source files and explore collected performance statistics later even if the source file has been changed

Save your source files in the cache. You can go back to the cached sources at any time in the future and explore the performance data collected per code line at that moment of time. To enable the option, go to Menu > Options > Intel VTune Amplifier XE 2013 > Source/Assembly and select the Cache source files check box. VTune Amplifier then caches your sources in the result database when you open the Source window for the first time and displays a notification message.

When you open the Source window for this result for the second time and if the source file has been changed, the VTune Amplifier opens the source from the cached file with the proper notification. For more details please refer to the “Pane: Options - Source/Assembly” topic in the product help.

Event-based stack sampling analysis of system processes for kernels and drivers (Windows only)

You can use the VTune Amplifier to profile the Windows kernel-mode process and analyze all privileged resource operations (for example, memory management, paging) it is responsible for or to explore your multithreaded kernel-mode drivers running in the context of this process. If you are a driver developer, this option can help you profile asynchronous driver threads and identify system resource utilization issues (for example, issues caused by frequent page allocations). To analyze the system process, run the VTune Amplifier with administrative privileges and configure the analysis target to attach to PID 4. For more details please refer to the “Attaching to a Process” topic in the product help.

Ability to show kernel stacks as continuation of user stacks

To view kernel stacks in the user functions stacks select the User/system functions call stack mode on the filter toolbar:

To locate the call of the kernel function in the assembly code, double click the function in the Call Stack pane:

Support for Intel® microarchitecture code named Silvermont

With the VTune Amplifier XE 2013 Update 14 you may perform hardware event-based sampling analysis on Intel® microarchitecture code named Silvermont by using Advanced hotspots from the Algorithm Analysis tree and General Exploration from Intel Atom Processor Analysis, or by creating a new custom Hardware Event-Based Sampling Analysis.

Support for Intel® Xeon® E5-2600 v2 & E5-1600 v2 processors based on the Intel microarchitecture code name IvyBridge-EP

With the VTune Amplifier XE 2013 Update 14 you may perform hardware event-based sampling analysis on Intel® microarchitecture code named IvyBridge-EP by using Advanced hotspots from the Algorithm Analysis tree or General Exploration and Bandwidth from the Sandy Bridge/Ivy Bridge/Haswell Analysis tree.

Simplified syntax for searching binary and symbol files with the -search-dir and -source-search-dir command line options

When finalizing the collected data and generating reports, the Intel® VTune™ Amplifier searches for supporting user files to display analysis information in relation to your source code. To resolve symbol information properly, use the -search-dir action-option to specify directories that should be searched for binary files (executables and dynamic libraries) and symbol files (typically .pdb files). To enable the source code view in the command line report, use the -source-search-dir option to specify directories to search for source files.

Support for ITT pause/resume APIs on the Intel® Xeon Phi™ coprocessor

You can now use the pause/resume ITT API to control collection on the Intel® Xeon Phi™ coprocessor. Note that to profile applications with user APIs on the Intel Xeon Phi coprocessor, environment variables that control collection must be propagated from the host to the Intel Xeon Phi coprocessor card. See the “User API Collection on the Intel® Xeon Phi™ Coprocessor” help topic for more details.

SSH-based remote collection via amplxe-cl

Intel® VTune™ Amplifier enables you to collect data on a remote application from the host system (remote usage mode) via command line interface (amplxe-cl) and view the analysis result locally in the GUI. Remote data collection using the amplxe-cl command running on the host is very similar to the native collection on the target except that the -target ssh:user@target option is added to the command line.

As prerequisites, you need to install the collectors on the remote target and enable password-less SSH access to the target.
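When remote collection is driven from a script, the command line can be assembled programmatically. The sketch below follows the -target ssh:user@target form given above and uses the amplxe-cl -collect action. The user name, host name, application path, and the analysis type name "advanced-hotspots" are assumptions for illustration; check the exact option syntax against the amplxe-cl help for your version.

```python
import shlex


def remote_collect_command(user, host, analysis_type, app_args):
    """Build an amplxe-cl argument list for SSH-based remote collection."""
    return ["amplxe-cl", "-target", f"ssh:{user}@{host}",
            "-collect", analysis_type, "--", *app_args]


# Hypothetical user, host, and application.
cmd = remote_collect_command("alice", "target01", "advanced-hotspots",
                             ["/home/alice/myapp", "--size", "1024"])
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.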

Example: to run event-based stack sampling collection for the application:

Support for adding external collection data (in the CSV format with a predefined structure) to the VTune Amplifier analysis result collected in parallel with external statistics

VTune Amplifier provides an option to correlate interval or discrete data, provided by an external collector, with the regular data provided by the analyzer.

For example, you can see how the data captured from SoCs or peripheral devices (camera, touch screen, sensors, and so on) correlate with VTune Amplifier metrics collected for your analysis target.

You can extend the standard VTune Amplifier performance analysis and launch a custom data collector directly from the VTune Amplifier. Your custom collector can be the application you analyze with the VTune Amplifier or a separate collector launched by the VTune Amplifier. To learn more about configuring and launching a custom collector from the GUI and the command line, see the “Using a Custom Collector” help topic.

VTune Amplifier can process and integrate performance statistics collected externally with a custom collector in parallel with the native VTune Amplifier analysis. To achieve this, provide the collected custom data as a csv file with a predefined structure and save this file to the VTune Amplifier result directory.

VTune Amplifier can load and process the following data types:

Interval data with start time and end time

Samples with a set of counters

Data may optionally be bound to a process and thread ID. VTune Amplifier represents data not bound to a particular process and thread (there are no PID and TID values in the csv file) as frames. Data bound to a process and a thread (there are PID and TID values in the csv file) is represented as tasks. To learn more about the csv data format, see the “Creating a CSV File with External Data” help topic.
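The frame-versus-task rule above can be expressed as a small check over csv records. In this sketch the column names (name, start, end, pid, tid) and the sample rows are hypothetical; the real file layout is defined in the “Creating a CSV File with External Data” help topic and may differ.

```python
import csv
import io


def classify_rows(csv_text):
    """Label each record 'task' if it has both pid and tid, else 'frame'."""
    labelled = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        bound = bool(row.get("pid")) and bool(row.get("tid"))
        labelled.append((row["name"], "task" if bound else "frame"))
    return labelled


# Hypothetical data: one unbound interval and one bound to a process/thread.
sample = "name,start,end,pid,tid\nFrame 4,100,250,,\nDecode,120,180,4242,4243\n"
print(classify_rows(sample))
```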

Example: Integrating Interval Data Not Bound to a Particular Process

You have a csv file with the following data types:

VTune Amplifier processes this data as frames (there are no TID and PID values specified) and displays the result as follows:

With the VTune Amplifier, you can easily correlate the frame data in the Timeline pane and grid view. You see that frame 4 took longer to process than the subsequent frames 5 and 6 due to the poll_idle() call.

If your application uses OpenCL™ on Intel® Processor Graphics, you can analyze GPU computing efficiency with VTune Amplifier XE by tracing the execution of OpenCL™ kernels on the GPU. To see kernel execution times, monitor the performance of each kernel against GPU metrics, and identify hotspot kernels, select the Trace OpenCL kernels on Processor Graphics option while configuring a new analysis. When collection and post-processing are complete and the result is open, click the Graphics tab to see details of GPU activity correlated with CPU processes and threads. Use the grid groupings “Computing tasks (GPU)” or “Source Computing Task (GPU)” to see average values of GPU hardware metrics aggregated per kernel or per kernel instance. The Timeline shows kernel instances within the thread that submitted them. For more information please refer to the “GPU Analysis” and “Analyzing Applications Using Intel® HD Graphics” topics in the product help.

Graphical User Interface (GUI) install on Linux* (via special script)

You can now install the VTune Amplifier XE on Linux* through a graphical user interface by invoking the install_GUI.sh script. The flow is identical to the command line install, but makes the available install options easier to understand and configure.

Support for identifying function boundaries using static binary analysis methods for binaries without symbol information

To provide accurate performance data and enable source analysis, the Intel® VTune™ Amplifier requires debug information for the binary files it analyzes. Starting with Update 11, if the VTune Amplifier does not find debug information in the binaries, it statically identifies function boundaries and assigns hotspot addresses to generated pseudo names of the form func@address for such functions. For more information please refer to the “Using Debug Information” topic in the product help.
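The attribution of a hotspot address to a func@address pseudo name can be pictured as a lookup of the nearest preceding function start. The sketch below assumes the function start addresses are already known and formats the address in hexadecimal; both are assumptions for illustration, since these notes do not describe how VTune Amplifier detects boundaries or formats the name.

```python
import bisect


def pseudo_name(hotspot_address, function_starts):
    """Map an address to the nearest preceding function start as func@address."""
    starts = sorted(function_starts)
    index = bisect.bisect_right(starts, hotspot_address) - 1
    if index < 0:
        return None  # address lies before the first known function
    return f"func@{starts[index]:#x}"


# Hypothetical function start addresses.
print(pseudo_name(0x401234, [0x401000, 0x401200, 0x402000]))
```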

NOTE: If debug information is absent, the Call Stack pane may not unwind the call stack correctly for user-mode sampling and tracing analysis types. Additionally in some cases, it can take significantly more time to finalize the results for modules that do not have debug information.

General Exploration metrics summary for hardware event-based sampling analysis results in the command line reports

Command line reports now include a General Exploration metrics summary for hardware event-based sampling analysis results, which gives a high-level overview of performance problems. The General Exploration Metrics section appears in a Summary report if events were collected during analysis. The set of metrics displayed in the summary depends on the profiled CPU type and list of events. For more information please refer to the “Viewing a Summary Report” topic in the product help.

Use the Source Function Stack grouping level in the Top-down Tree pane for more accurate comparison of results for recompiled binary files, where the addresses of the same source function or loop differ, as in these cases:

You slightly changed the source and recompiled

You changed compilation options and recompiled

You compare results compiled and collected for different microarchitectures

By default, compared functions are grouped by the Function Stack granularity, which is based on function instances. VTune Amplifier treats the same functions with different addresses as separate instances and does not compare them:

When the data is aggregated by Source Function Stack, the VTune Amplifier ignores start addresses and compares functions by source file objects:

For more information please refer to the “About Viewing Comparison Data” topic in the product help.
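The difference between the two grouping levels can be sketched in a few lines of Python. The records, field layout, and key choices below are hypothetical and only illustrate why keying on the start address prevents a recompiled function from being matched, while keying on source file and function name does not.

```python
from collections import defaultdict

# Hypothetical records: (function, start address, source file, CPU time in seconds).
# "compute" moved to a new address after recompilation.
result_a = [("compute", 0x1000, "solver.c", 2.0), ("init", 0x2000, "main.c", 0.5)]
result_b = [("compute", 0x1400, "solver.c", 1.5), ("init", 0x2000, "main.c", 0.5)]


def aggregate(records, key):
    """Sum CPU time per grouping key."""
    totals = defaultdict(float)
    for record in records:
        totals[key(record)] += record[3]
    return dict(totals)


def compare(a, b, key):
    """Return time differences (b - a) for keys present in both results."""
    totals_a, totals_b = aggregate(a, key), aggregate(b, key)
    return {k: totals_b[k] - totals_a[k] for k in totals_a if k in totals_b}


# Instance-based key: the function name plus its start address.
by_instance = compare(result_a, result_b, key=lambda r: (r[0], r[1]))
# Source-based key: the source file plus the function name, ignoring addresses.
by_source = compare(result_a, result_b, key=lambda r: (r[2], r[0]))

print(by_instance)
print(by_source)
```

With the instance-based key, only "init" is matched between the two results. With the source-based key, "compute" is matched as well and its time difference becomes visible.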

Change Stack Layout option in the Top-down Tree and Bottom-up panes to switch between chain and tree types of stack layout

Use the Change Stack Layout option in the Top-down Tree and Bottom-up panes to manage stack data in the grids and switch between chain and tree types of stack layout. Click the Change Stack Layout button to switch between call stack layouts.

Chain layouts are typically more useful for the bottom-up view:

Tree layouts are more natural for the top-down view:

Support for scientific data representation in the grid

The Bottom-up and Top-down Tree panes now support displaying performance values in scientific notation through the Show Data As context menu. This format is typically recommended for analyzing values smaller than 0.001. For more information please refer to the “Choosing Data Format” topic in the product help.
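As a rough illustration of why scientific notation helps for small values, the following Python sketch formats a value in scientific notation when it is below 0.001 and in fixed-point notation otherwise. The helper name, the number of digits, and the automatic switch are assumptions for illustration; in the product the format is chosen manually through the context menu.

```python
def show(value: float, threshold: float = 0.001) -> str:
    """Format small values in scientific notation, others in fixed-point."""
    if abs(value) < threshold:
        return f"{value:.3e}"
    return f"{value:.3f}"


# Fixed-point with three decimals would display the first two values as 0.000.
for value in (0.00042, 0.0000073, 1.25):
    print(f"{value:.3f} -> {show(value)}")
```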

The former “Hotspots” and “Lightweight Hotspots” analysis types have been renamed in the GUI to “Basic Hotspots” and “Advanced Hotspots” respectively, and several collection levels have been introduced. “Basic Hotspots” provides a general user-level performance profile. “Advanced Hotspots” performs hardware event-based sampling analysis using PMU counters and lets you choose among collection levels that differ in detail and overhead:

For applications that use a Graphics Processing Unit (GPU) for rendering, video processing, and computations, VTune Amplifier can monitor, analyze, and correlate activities on both the CPU and GPU (Windows* only). To enable GPU analysis, set up your predefined or custom configuration to Analyze Processor Graphics and DirectX* pipeline events. GPU analysis for Intel Processor Graphics is based on hardware metrics such as Execution Units (EU) Array Active/EU Array Stalled/EU Array Idle, GPU Memory Bandwidth, GPU L3 Cache Misses, and others. It helps you estimate how effectively the Intel Integrated Graphics is used. Analysis of DirectX* pipeline events correlates CPU and GPU usage and helps identify whether an application is CPU or GPU bound. For more information please refer to the “GPU Analysis” and “GPU Metrics” topics in the product help.

GPU analysis based on DirectX* pipeline events, used to correlate CPU/GPU usage and identify whether an application is CPU or GPU bound (Windows* only)

Explore the Summary pane for the GPU Usage and DirectX frame rate histograms:

Switch to the “Graphics” tab to see the distribution of the GPU metrics over time.

Top-Down performance analysis methodology in General Exploration analysis type for the 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell

Update 9 introduces the Top-Down performance analysis methodology for the 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell, integrated into the General Exploration analysis type. The hierarchical data display corresponds to how the available execution slots in each core’s pipeline are utilized. Expand a column to see a breakdown of issues pertaining to its category of pipeline utilization: Retiring, Bad Speculation, Back-end Bound, or Front-end Bound Slots. For more details refer to the Haswell tuning guide.
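The idea of splitting pipeline slots into four top-level categories can be sketched as follows. The slot counts are hypothetical inputs, not values produced by VTune Amplifier, and the sketch does not reproduce the event formulas the tool uses to derive each category.

```python
# Hypothetical slot counts per top-level category for one profiled region.
slots = {
    "Retiring": 400,
    "Bad Speculation": 100,
    "Back-end Bound": 350,
    "Front-end Bound": 150,
}

total_slots = sum(slots.values())
fractions = {category: count / total_slots for category, count in slots.items()}

# Retiring slots do useful work, so look for the largest of the other categories.
non_retiring = {c: f for c, f in fractions.items() if c != "Retiring"}
bottleneck = max(non_retiring, key=non_retiring.get)

for category, fraction in fractions.items():
    print(f"{category}: {fraction:.1%}")
print("Largest non-retiring category:", bottleneck)
```

Every slot belongs to exactly one category, so the four fractions sum to 1. The largest non-retiring category indicates which column to expand first.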

Overhead and Spin time classification for GCC* and Microsoft* OpenMP* runtimes

VTune Amplifier can now classify Overhead and Spin time for GCC* and Microsoft* OpenMP* runtimes and show the metrics in the grid and Timeline pane. These metrics help identify inefficient use of the threading runtimes, either when a significant portion of time is spent inside the parallel runtime, wasting CPU time at high concurrency levels (overhead), or when a significant portion of CPU time is spent on spin (active) waits. For more information please refer to the “Overhead and Spin time” topic in the product help.

Overhead and Spin time for GCC* OpenMP*:

Overhead and Spin time for Microsoft* OpenMP*:

Source and assembly data available in the command line reports

Source and assembly data are available in all command line reports. Use the “-source-object” option to switch a report to source or assembly view mode, including associated performance data. Specify “-group-by address” to see the disassembly view. For more information please refer to the “Source-object” topic in the product help.

Total metric in the Source/Assembly panes

Analyze collected data in the Source/Assembly pane per code line using the Self and Total types of performance metrics. For example, for the Basic Hotspots analysis, the CPU Time: Self column shows the amount of processor time (in seconds) taken to execute a code line, while the CPU Time: Total column shows the processor time spent on executing the code line and on any calls made from this line.
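The relationship between Self and Total time can be sketched with a small Python example. It works at function granularity with hypothetical timings and a hypothetical call graph; the same principle applies per code line, where Total adds the time of the calls made from that line.

```python
# Hypothetical self times (seconds) and call relationships.
self_time = {"main": 0.25, "parse": 1.0, "solve": 3.0, "dot": 1.5}
calls = {"main": ["parse", "solve"], "parse": [], "solve": ["dot"], "dot": []}


def total_time(name: str) -> float:
    """Total time = self time plus the total time of everything called."""
    return self_time[name] + sum(total_time(callee) for callee in calls[name])


for name in self_time:
    print(f"{name}: self={self_time[name]:.2f}s total={total_time(name):.2f}s")
```

A function such as "main" has a small Self time but a large Total time, which points to its callees rather than its own code as the place to optimize. The recursion assumes an acyclic call graph.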

Support for the hardware event-based sampling analysis of Windows Store C# and JavaScript applications on Microsoft Windows 8* via the Attach to Process or Profile System modes

Windows Store C# and JavaScript applications can be profiled with the hardware event-based sampling analysis in the “Attach to Process” and “Profile System” modes. Before starting the analysis, make sure you have administrative privileges to run the data collection. Mapping to the source file is supported for JavaScript modules. For more information and support limitations please refer to the “Windows Store Applications Analysis” topic in the product help.

Assembly grouping by RVA, basic blocks, and function ranges

The Assembly view can be grouped by RVA, Basic Block, or Function Range. To change the hierarchy of the instructions, select the required granularity from the Assembly grouping drop-down menu on the Source/Assembly window toolbar. For more information on grouping capabilities please refer to the “Grouping Data” topic in the product help.

Support for applications generated by MinGW/Cygwin GCC*

VTune Amplifier XE now supports profiling of applications built with GCC* (MinGW and Cygwin) on Windows. The VTune Amplifier XE 2013 Update 7 release was qualified against Cygwin 1.7.17 with GCC* 4.5.3 and MinGW with GCC* 4.6.2. The pictures below show the analysis result view before and after the feature was introduced in Update 7:

Before Update 7:

Since Update 7:

Event summary for hardware event-based sampling analysis results in the command line reports

The command line summary report is extended with an “Event summary” section for hardware event-based sampling analysis results, which shows a summary of core and uncore PMU events.

Highlighting performance issues based on filtered-in data

Highlighting performance issues is now based on filtered-in data. See the example for CPI rate issues below.

Observe data

Filter in by selection

Results:

Before Update 7:

Since Update 7:

Stitching stacks for Intel® OpenMP* applications

Starting with Update 7, during user-mode sampling and tracing analysis of an OpenMP application that uses the Intel runtime libraries, the VTune Amplifier XE automatically enables the Stitch stacks option. This option restores a logical call tree by catching notifications from the runtime and attaching stacks to the point that introduces a parallel workload. To view the OpenMP objects hierarchy, explore the data provided in the Top-down Tree pane. To analyze a logically structured OpenMP call flow, make sure to compile and run your code with the Intel® Compiler 13.1 Update 3 or higher (part of the Intel Composer XE 2013 Update 3). For more information please refer to the “Stitching Stacks” topic in the product help.

Details:

The Caller/Callee window is available in all viewpoints that provide call stack data. Use this window to analyze parent and child functions of the selected focus function and identify the most time-critical call paths. You can double-click a function of interest to go to the source view and explore the function performance per source line. Use the Filter In by Selection grid context menu option on a function of interest to display the functions included in all sub-trees that contain the selected function at any level. For more information please refer to the “Window: Caller/Callee” topic in the product help.

The improved welcome page now provides quick access to recently used analysis configurations and analysis results.

Separate configuration tabs are provided for Binary/Symbol Search and Source Search. Use the tabs to configure the search directories for the binary/symbol and source files required to finalize collected data and work with the source/assembly view. For example, if the application to analyze and its source files were moved from the location where the application was compiled, specify the directories for the separate debug files and source files in these tabs to ensure proper symbol resolution and a working source/assembly view.

To get context help on a particular hardware PMU event or performance metric, select the What’s This Column? option from the grid context menu.

Overhead and Spin time metrics are provided in the grid and Timeline pane of the Hotspots by CPU Usage, Hotspots by Thread Concurrency, and Lightweight Hotspots viewpoints. The metrics help identify inefficient use of threading runtimes (for example, Intel® Threading Building Blocks, Intel® Cilk™, OpenMP*), either when a significant portion of time is spent inside the parallel runtime, wasting CPU time at high concurrency levels (overhead), or when a significant portion of CPU time is spent on spin (active) waits. For more information please refer to the “Overhead and Spin time” topic in the product help.

NOTE: VTune Amplifier ignores the Overhead and Spin time when calculating the CPU Usage metric.

To change the measurement units on the time scale, select the Show Time Scale As context menu option in the Timeline and choose one of the following values:

Elapsed Time (default)

OS Timestamp

CPU Timestamp

For all Timeline view control capabilities, refer to the “Managing Timeline View” topic in the product help.

On Fedora* 18, the pango packages should be installed, including pangox-compat.