Overview

Dynamic random access memory (DRAM) is a limited resource on all platforms and must be monitored and controlled to keep utilization in check.

Cisco NX-OS uses memory in the following three ways:

Page cache

When you access files from persistent storage (CompactFlash), the kernel reads the data into the page cache, which means that when you access the data in the future, you can avoid the slow access times that are associated with disk storage. Cached pages can be released by the kernel if the memory is needed by other processes.

Some file systems (tmpfs) exist purely in the page cache (for example, /dev/shm, /var/sysmgr, /var/tmp), which means that there is no persistent storage of this data and that when the data is removed from the page cache, it cannot be recovered. Pages that back tmpfs files are released only when the files are deleted.

Kernel

The kernel needs memory to store its own text, data, and Kernel Loadable Modules (KLMs). KLMs are pieces of code that are loaded into the kernel (as opposed to being a separate user process). An example of kernel memory usage is when an inband port driver allocates memory to receive packets.

User processes

This memory is used by Cisco NX-OS/Linux processes that are not integrated in the kernel. It includes each process's text, stack, heap, and so on.

When you are troubleshooting high memory utilization, you must first determine what type of utilization is high (process, page cache, or kernel). Once you have identified the type of utilization, you can use additional troubleshooting commands to help you figure out which component is causing this behavior.

General/High Level Assessment of Platform Memory Utilization

You can assess the overall level of memory utilization on the platform by using two basic CLI commands: show system resources and show processes memory.

Note:

From these command outputs, you might be able to tell that platform utilization is higher than normal/expected, but you will not be able to tell what type of memory usage is high.

The show system resources command displays platform memory statistics (not per VDC).

In Cisco NX-OS, the Linux kernel monitors the percentage of memory that is used (relative to the total RAM present) and the platform manager generates alerts as utilization passes default or configured thresholds. If an alert has occurred, it is useful to compare the logs captured by the platform manager with the current utilization. Additional information about this monitoring is included later in this article.

By reviewing the output of these commands, you can determine if the utilization is high as a result of the page cache, processes holding memory, or kernel.

When reviewing this output, the value of none in the Filesystem column means that it is a tmpfs type.

In this example, utilization is high because /var/sysmgr (or its subfolders) is using a lot of space. /var/sysmgr is a tmpfs mount, which means that the files exist in RAM only. Deleting the files will reduce utilization, but you should first determine what type of files are filling the partition (cores, debugs, and so on) and which process left them in tmpfs.
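The search for large files under a tmpfs mount can be sketched in Python. This is only an illustration: it assumes shell access with a Python interpreter, which may not be available on a given device or release, and the function name `largest_files` and its `top_n` parameter are hypothetical rather than part of Cisco NX-OS.

```python
import os

def largest_files(root, top_n=10):
    """Return up to top_n (size_in_bytes, path) tuples under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reports a symlink's own size instead of following it
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file may disappear between listing and stat; skip it
                continue
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

Calling `largest_files("/var/sysmgr")` would list the biggest files first, which is usually enough to tell whether cores or debug logs are filling the partition.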

In Cisco NX-OS release 4.2(4) and later releases, use the following commands to display and delete the problem files from the CLI:

The show system internal dir <full directory path> command lists all the files and their sizes for the specified path (hidden command).

Use caution when deleting files. You cannot recover a deleted file.

Note:

If you are running a Cisco NX-OS release prior to Cisco NX-OS release 4.2(4), you should contact your customer support representative.

You can also use the show hardware internal proc-info pcacheinfo command to determine how much space each file system is using in the page cache (Cached). The command output may help you determine which persistent file systems are using the page cache and how much memory they are using.

Kernel

Kernel issues are less common, but you can determine the problem by reviewing the slab utilization in the show system internal meminfo command output.
Generally, kernel troubleshooting requires Cisco customer support assistance to isolate why the utilization is increasing.

If slab memory usage grows over time, use the following commands to gather more information:

The show system internal kernel malloc-stats command displays all the currently loaded KLMs, malloc, and free counts.

By comparing several iterations of this command, you can determine if some KLMs are allocating a lot of memory but are not freeing/returning the memory back (the differential value will be very large compared to normal).
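The comparison between iterations can be sketched as follows. The command's output format is not reproduced here, so this sketch assumes the malloc and free counts have already been extracted into dictionaries; the function name and the module names `klm_a` and `klm_b` are hypothetical.

```python
def outstanding_growth(before, after):
    """Compare two snapshots of per-module (malloc_count, free_count) counters.

    Returns (module, growth) pairs sorted by growth, largest first, where
    growth is the change in outstanding allocations (mallocs minus frees).
    Only modules present in both snapshots are compared.
    """
    growth = []
    for module, (mallocs_after, frees_after) in after.items():
        if module not in before:
            continue
        mallocs_before, frees_before = before[module]
        diff_before = mallocs_before - frees_before
        diff_after = mallocs_after - frees_after
        growth.append((module, diff_after - diff_before))
    growth.sort(key=lambda item: item[1], reverse=True)
    return growth

# Hypothetical counters, entered by hand from two iterations of the command
first = {"klm_a": (1000, 990), "klm_b": (5000, 4990)}
second = {"klm_a": (1500, 1490), "klm_b": (9000, 6000)}
print(outstanding_growth(first, second))  # [('klm_b', 2990), ('klm_a', 0)]
```

A module whose differential stays flat while its counters rise (klm_a here) is allocating and freeing in balance; one whose differential keeps climbing (klm_b) is the candidate to report.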

The show system internal kernel skb-stats command displays the consumption of SKBs (buffers used by KLMs to send and receive packets).

Compare the output of several iterations of this command to see if the differential value is growing or very high.

The show hardware internal proc-info slabinfo command dumps all of the slab information (memory structure used for kernel management). The output can be large.

User Processes

If page cache and kernel issues have been ruled out, utilization might be high as a result of some user processes taking up too much memory or a high number of running processes (due to the number of VDCs/features enabled).

Note:

Cisco NX-OS defines memory limits for most processes (rlimit). If this rlimit is exceeded, sysmgr will crash the process and a core file is usually generated. Processes close to their rlimit may not have a large impact on platform utilization but could still become an issue if a crash occurs.

Figuring Out Which Process is Using a Lot of Memory

The following commands can help you identify if a specific process is using a lot of memory:

The show processes memory command displays the memory allocation per process for the current VDC (the output also contains non-VDC global processes).

The output of the show processes memory command might not provide a completely accurate picture of the current utilization (allocated does not mean in use).
This command is useful for determining if a process is approaching its rlimit.
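A check for processes that are close to their rlimit might look like the sketch below. It assumes the allocated and limit values have already been read from the command output, and it assumes that a limit of 0 means "no limit"; the function name, the 90 percent cutoff, and the process names are all hypothetical.

```python
def near_rlimit(processes, fraction=0.9):
    """Return (name, percent_of_limit) for processes at or above fraction of their limit.

    processes is an iterable of (name, allocated_bytes, limit_bytes).
    Entries with a limit of 0 or less are skipped (assumed to be unlimited).
    """
    flagged = []
    for name, allocated, limit in processes:
        if limit <= 0:
            continue
        ratio = allocated / limit
        if ratio >= fraction:
            flagged.append((name, round(100.0 * ratio, 1)))
    return flagged

# Hypothetical values for illustration only
sample_processes = [("proc_a", 95, 100), ("proc_b", 10, 100), ("proc_c", 5, 0)]
print(near_rlimit(sample_processes))  # [('proc_a', 95.0)]
```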

To determine how much memory the processes are really using, you should check the Resident Set Size (RSS). This value will give you a rough indication of the amount of memory (in KB) that is being consumed by the processes. You can gather this information by using the following command:

The show system internal processes memory command displays the process information in the memory alerts log (if the event occurred).
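On a generic Linux system, the RSS of a process is also exposed as the VmRSS field in /proc/<pid>/status. The sketch below ranks processes by that field. Whether these files are reachable on a given NX-OS device is an assumption, so the example works on text that has already been collected; the function names and sample values are hypothetical.

```python
def rss_kb_from_status(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None.

    Kernel threads have no VmRSS line, so None is a normal result for them.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None

def top_by_rss(processes, top_n=5):
    """Rank (name, status_text) pairs by RSS and return (rss_kb, name), largest first."""
    ranked = []
    for name, status_text in processes:
        rss = rss_kb_from_status(status_text)
        if rss is not None:
            ranked.append((rss, name))
    ranked.sort(reverse=True)
    return ranked[:top_n]

# Hypothetical status excerpts for illustration only
collected = [
    ("proc_a", "Name:\tproc_a\nVmRSS:\t   20480 kB\n"),
    ("kthread", "Name:\tkthread\n"),
    ("proc_b", "Name:\tproc_b\nVmRSS:\t  102400 kB\n"),
]
print(top_by_rss(collected))  # [(102400, 'proc_b'), (20480, 'proc_a')]
```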

Convert the UUID from the above output to decimal and use it in the next command.

Note:

If you are troubleshooting in a lab, you can convert between hexadecimal and decimal with the following hidden NX-OS commands:

hex <dec to convert>

dec <hex to convert>
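If the hidden commands are not available, the same conversion can be done off-box. The sketch below uses Python's built-in `int` and `hex`; the helper names and the value 0x11B are hypothetical and were chosen only for illustration.

```python
def uuid_hex_to_dec(hex_text):
    """Convert a hexadecimal string (with or without a 0x prefix) to an integer."""
    return int(hex_text, 16)

def uuid_dec_to_hex(value):
    """Convert an integer to a 0x-prefixed hexadecimal string."""
    return hex(value)

print(uuid_hex_to_dec("0x11B"))  # 283
print(uuid_dec_to_hex(283))      # 0x11b
```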

The show system internal kernel memory uuid <UUID in decimal> command displays the detailed process memory usage including its libraries for a specific UUID in the system (convert UUID from the sysmgr service output).

These outputs are usually requested by the Cisco customer support representative when investigating a potential memory leak in a process or its libraries.

Built-in Platform Memory Monitoring

Cisco NX-OS has built-in kernel monitoring of memory usage to help avoid system hangs, process crashes, and other undesirable behavior. The platform manager periodically checks the memory utilization (relative to the total RAM present) and automatically generates an alert event if the utilization passes the configured threshold values. When an alert level is reached, the kernel attempts to free memory by releasing pages that are no longer needed (for example, the page cache of persistent files that are no longer being accessed). If critical levels are reached, the kernel kills the process with the highest utilization. Other Cisco NX-OS components have introduced memory alert handling, such as BGP's graceful low memory handling, which allows processes to adjust their behavior to keep memory utilization under control.

Note:

While Cisco NX-OS implements VDCs, it is important to remember that a specific VDC's memory utilization is not limited. Platform memory issues will impact all configured VDCs.

Memory Thresholds

Prior to Release 4.2(4), the default memory alert thresholds were as follows:

70% MINOR

80% SEVERE

90% CRITICAL

In Release 4.2(4) and later releases, the default memory alert thresholds are as follows:

85% MINOR

90% SEVERE

95% CRITICAL

This change was introduced in part due to baseline memory requirements when many features/VDCs are deployed.
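The two sets of default thresholds can be expressed as a small lookup. This is a sketch rather than the platform manager's actual logic: it assumes a level is reached when utilization is greater than or equal to the threshold, and the function and constant names are hypothetical.

```python
# Default thresholds from the lists above, as (percent, level), highest first
THRESHOLDS_PRE_4_2_4 = [(90, "CRITICAL"), (80, "SEVERE"), (70, "MINOR")]
THRESHOLDS_4_2_4 = [(95, "CRITICAL"), (90, "SEVERE"), (85, "MINOR")]

def alert_level(used_kb, total_kb, thresholds=THRESHOLDS_4_2_4):
    """Map memory usage to an alert level name, or "OK" below all thresholds."""
    percent = 100.0 * used_kb / total_kb
    for limit, level in thresholds:
        # Assumption: reaching the threshold exactly counts as passing it
        if percent >= limit:
            return level
    return "OK"

# 87.5% utilization is MINOR with the newer defaults but SEVERE with the older ones
print(alert_level(3500000, 4000000))                        # MINOR
print(alert_level(3500000, 4000000, THRESHOLDS_PRE_4_2_4))  # SEVERE
```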

The show system internal memory-status command allows you to check the current memory alert status.

N7K# show system internal memory-status
MemStatus: OK

Memory Alerts

If a memory threshold has been passed (OK -> MINOR, MINOR -> SEVERE, SEVERE -> CRITICAL), the Cisco NX-OS platform manager captures a snapshot of memory utilization and logs an alert to SYSLOG (as of Release 4.2(4), default VDC only). The log is generated in the Linux root path (/) and a copy is moved to OBFL (/mnt/plog) if possible.
This log is very useful for determining whether memory utilization is high because of memory consumed by the page cache, the kernel, or Cisco NX-OS user processes.

The show system internal memory-alerts-log command displays the memory alerts log.

The memory alerts log consists of the following outputs:

Command                    Description
cat /proc/memory_events    Provides a log of timestamps when memory alerts occurred.
cat /proc/meminfo          Shows the overall memory statistics including the total RAM, memory consumed by the page cache, slabs (kernel heap), mapped memory, available free memory, and so on.
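To see which type of usage dominates, the meminfo portion of the log can be reduced to percentages of total RAM. The sketch below parses text in the standard Linux /proc/meminfo layout; the function names are hypothetical and the sample values are invented for illustration, not captured from a device.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def usage_breakdown(fields):
    """Return the share of total RAM (in percent) for a few meminfo fields."""
    total = fields["MemTotal"]
    return {
        name: round(100.0 * fields.get(name, 0) / total, 1)
        for name in ("MemFree", "Cached", "Slab", "Mapped")
    }

# Hypothetical values for illustration only
sample = """\
MemTotal:      4000000 kB
MemFree:        400000 kB
Buffers:         20000 kB
Cached:        2000000 kB
Slab:           200000 kB
Mapped:         600000 kB
"""
print(usage_breakdown(parse_meminfo(sample)))
# {'MemFree': 10.0, 'Cached': 50.0, 'Slab': 5.0, 'Mapped': 15.0}
```

In this made-up sample, Cached accounts for half of RAM, which would point toward the page cache (including tmpfs files) rather than the kernel slab.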