Memory Access Analysis for Cache Misses and High Bandwidth Issues

Use the Intel® VTune™ Amplifier Memory Access analysis to identify memory-related issues, such as NUMA problems and bandwidth-limited accesses, and to attribute performance events to memory objects (data structures). This attribution is possible because the analysis instruments memory allocations and de-allocations and retrieves static and global variables from symbol information.

DRAM Bound metric that shows how often the CPU was stalled on the main memory (DRAM)

Remote / Local DRAM Ratio metric that is defined as the ratio of remote DRAM loads to local DRAM loads

Local DRAM metric that shows how often the CPU was stalled on loads from the local memory

Remote DRAM metric that shows how often the CPU was stalled on loads from the remote memory

Remote cache metric that shows how often the CPU was stalled on loads from the remote cache in other sockets

Average Latency metric that shows the average load latency in cycles

Note

The list of metrics may vary depending on your microarchitecture.
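As a rough illustration of how the Remote / Local DRAM Ratio is defined, the sketch below divides a count of remote DRAM loads by a count of local DRAM loads. The function name and the sample counts are hypothetical, and the handling of a zero local count is an assumption of this sketch, not documented VTune Amplifier behavior.

```python
def remote_local_dram_ratio(remote_loads: int, local_loads: int) -> float:
    """Return remote DRAM loads divided by local DRAM loads."""
    if remote_loads < 0 or local_loads < 0:
        raise ValueError("load counts must be non-negative")
    if local_loads == 0:
        # Assumption of this sketch: with no local loads, report infinity
        # when remote loads exist and 0.0 when there were no loads at all.
        return float("inf") if remote_loads > 0 else 0.0
    return remote_loads / local_loads


if __name__ == "__main__":
    # Hypothetical counts, not measured data.
    print(remote_local_dram_ratio(2_500_000, 10_000_000))
```

A ratio well below 1 means most DRAM loads are served by the local memory. A higher value suggests that a significant share of loads goes to remote memory, which is the case the analysis helps you find.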

Many of the events collected by the Memory Access analysis are precise, which makes the data access pattern easier to understand. Off-core traffic is divided into local DRAM and remote DRAM accesses. Typically, you should focus on minimizing remote DRAM accesses, which usually have a high cost.

Enable the instrumentation of dynamic memory allocation/de-allocation and map hardware events to such memory objects. This option may cause additional runtime overhead because it instruments all system memory allocation/de-allocation APIs.

Specify a minimal size of dynamic memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation.

The default value is 1024 bytes.
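The effect of the minimal-size setting can be pictured as a simple filter over allocations. This is a minimal sketch and does not show how VTune Amplifier implements the instrumentation. The Allocation type and the call-site labels are hypothetical, sizes are assumed to be in bytes, and treating an allocation exactly at the threshold as tracked is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    site: str  # hypothetical label for the allocation call site
    size: int  # allocation size, assumed to be in bytes


def tracked_allocations(allocations, min_size=1024):
    """Keep only the allocations large enough to be analyzed."""
    # Assumption: an allocation exactly at the threshold is still tracked.
    return [a for a in allocations if a.size >= min_size]


if __name__ == "__main__":
    sample = [
        Allocation("parser.c:88", 64),
        Allocation("matrix.c:42", 8_000_000),
        Allocation("cache.c:17", 1024),
    ]
    for alloc in tracked_allocations(sample):
        print(alloc.site, alloc.size)
```

Raising the threshold skips many small, short-lived allocations, which is where most of the instrumentation overhead tends to come from.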

Evaluate max DRAM bandwidth check box

Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.

The option is enabled by default.
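To show why a measured maximum is useful, the sketch below scales an observed bandwidth value against a maximum achievable value and buckets the result. The function names, the sample numbers, and the 30% and 70% cut-offs are hypothetical choices for this example. They are not the thresholds that VTune Amplifier calculates.

```python
def bandwidth_utilization(observed_gb_s: float, max_gb_s: float) -> float:
    """Return observed bandwidth as a fraction of the maximum achievable."""
    if max_gb_s <= 0:
        raise ValueError("maximum bandwidth must be positive")
    return observed_gb_s / max_gb_s


def classify(utilization: float, medium: float = 0.3, high: float = 0.7) -> str:
    """Bucket a utilization fraction using hypothetical cut-offs."""
    if utilization >= high:
        return "high"
    if utilization >= medium:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Hypothetical values: 30 GB/s observed against a 60 GB/s measured maximum.
    u = bandwidth_utilization(30.0, 60.0)
    print(u, classify(u))
```

Without a per-system maximum, the same observed bandwidth could be close to saturation on one machine and low on another, so an absolute value alone is hard to interpret.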

Analyze OpenMP regions check box

Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead from scheduling, reduction, and atomic operations.

The option is disabled by default.
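One way to think about imbalance in an OpenMP region is the time threads spend waiting at the closing barrier for the slowest thread. The sketch below estimates that as the difference between the longest and the average per-thread busy time. This formula and the sample timings are assumptions for illustration only. The metric definitions that VTune Amplifier uses may differ.

```python
def average_barrier_wait(busy_seconds):
    """Estimate the average per-thread wait at the end of a parallel region.

    Each thread is assumed to wait until the slowest thread finishes,
    so the average wait is max(busy) - mean(busy).
    """
    if not busy_seconds:
        raise ValueError("need at least one thread")
    longest = max(busy_seconds)
    mean = sum(busy_seconds) / len(busy_seconds)
    return longest - mean


if __name__ == "__main__":
    # Hypothetical busy times (seconds) for four threads in one region.
    print(average_barrier_wait([4.0, 2.0, 2.0, 4.0]))
```

A value near zero means the work was spread evenly across threads. A large value relative to the region duration suggests that a different schedule or work distribution is worth trying.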

Details button

Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, you need to create a custom configuration by copying an existing predefined configuration. VTune Amplifier creates an editable copy of this analysis type configuration.

The Bottom-up window displays performance data per metric for each hotspot object. If you enable the Analyze memory objects option for data collection, the Bottom-up window also displays memory allocation call stacks in the grid and in the Call Stack pane. Use the Memory Object grouping level, preceded by the Function level, to view memory objects identified by the source location of the allocation call.
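The two-level grouping described above can be pictured as a nested aggregation: first by function, then by memory object. This is a minimal sketch over hypothetical sample tuples. The function names, the source-location labels, and the load counts are invented for illustration and do not come from a real result.

```python
from collections import defaultdict


def group_by_function_then_object(samples):
    """Sum load counts per function, then per memory object within it.

    Each sample is a (function, memory_object, loads) tuple, where
    memory_object is labeled by the source location of its allocation.
    """
    grouped = defaultdict(lambda: defaultdict(int))
    for function, memory_object, loads in samples:
        grouped[function][memory_object] += loads
    # Convert to plain dicts so the result is easy to print and compare.
    return {fn: dict(objs) for fn, objs in grouped.items()}


if __name__ == "__main__":
    samples = [
        ("multiply", "matrix.c:42 (malloc)", 900),
        ("multiply", "matrix.c:43 (malloc)", 300),
        ("multiply", "matrix.c:42 (malloc)", 100),
        ("init", "matrix.c:42 (malloc)", 50),
    ]
    for fn, objs in group_by_function_then_object(samples).items():
        print(fn, objs)
```

Grouping this way shows which allocation sites each function touches most, which helps decide which data structures to restructure or place differently.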

Support Limitations

Memory Access analysis is supported on the following platforms:

2nd Generation Intel® Core™ processors

Intel® Xeon® processor families, or later

3rd Generation Intel Atom® processor family, or later

If you need to analyze older processors, you can create a custom analysis and choose events related to memory accesses. However, you will be limited to memory-related events available on those processors. For information about memory access events per processor, see the VTune Amplifier tuning guides.

For dynamic memory object analysis on Linux, the VTune Amplifier instruments the following Memory Allocation APIs: