Introduction

Processor topology information is important for a number of processor-resource management practices, ranging from task/thread scheduling, licensing policy enforcement, affinity control/migration, etc. Topology information of the cache hierarchy can be important to optimizing software performance. This white paper covers topology enumeration algorithm for single-socket to multiple-socket platforms using Intel 64 and IA-32 processors. The topology enumeration algorithms (both processor and cache) using initial APIC ID has been extended to use x2APIC ID, the latter mechanism is required for future platforms supporting more than 256 logical processors in a coherent domain.

Hardware multithreading in microprocessors has proliferated in recent years. The majority of Intel® architecture processors shipping today provide one or more forms of hardware multi-threading support (multicore and/or simultaneous multithreading (SMT), the latter introduced as HyperThreading Technology in 2002). From a processor hardware perspective, the physical package of an Intel 64 processor can support SMT and multi-core. Consequently, a physical processor is effectively a hierarchically ordered collection of logical processors with some forms of shared system resources (for example, memory, bus/system links, caches) From a platform hardware perspective, hardware multithreading that exists in a multi-processor system may consist of two or more physical processors organized in either uniform or non-uniform configuration with respect to the memory subsystem.

Application programming using hardware multithreading features must follow the programming models and software constructs provided by the underlying operating system. For example, an OS scheduler generally assigns a software task from a queue using hardware resource at the granularity of a logical processor; an OS may define its own data structure and provide services to applications that allows them to customize the assignment between task and logical processor via an affinity construct for multithreaded applications The OS and the software stack underneath an application (the BIOS, the OS loader) also play significant roles in bringing up the hardware multi-threading features and configuring the software constructs defined by the OS.

The CPUID instruction in Intel 64 architecture defines a rich set of information to assist BIOS, OS, and applications to query processor topology that are needed for efficient operation by each respective member of the software stack. Generally, the BIOS needs to gather topology information of a physical processor, determine how many physical processors are present in the system; prepare the necessary software constructs related to system topology, and pass along the system topology information to the next layer of the software stack that takes over control of the system. The OS and the application layers have a wide range of uses for topology information. This document covers several common software usages by OS and applications for using CPUID to analyze processor topology in a singleprocessor or multi-processor system.

The primary software usage of processor topology enumeration deals with querying and identifying the hierarchical relationship of logical processor, processor cores, and physical packages in a singleprocessor or multi-processor system. We’ll refer to this usage as system topology enumeration. System topology enumeration may be needed by OS or certain applications to implement licensing policy based on physical processors. It is used by OS to implement efficient task-scheduling, minimize thread migration, configure application thread management interfaces, and configure memory allocation services appropriate to the processor/memory topology. Multithreaded applications need system topology information to determine optimal thread binding, manage memory allocation for optimal locality, and improve performance scaling in multi-processor systems.

Intel 64 processors fulfill system topology enumeration requirements:

Each logical processor in an Intel 64 or IA-32 platform supporting coherent memory is assigned a unique ID (APIC ID) within the coherent domain. A multi-node cluster installation may employ vendor-specific BIOS that preserve the APIC IDs assigned (during processor reset) within each coherent domain, extend with node IDs to form a superset of unique IDs within the clustered system. This document will only cover the CPUID interfaces providing unique IDs within a coherent domain.

The values of unique IDs assigned within a coherent Intel 64 or IA-32 platform conform to an algorithm based on bit-field decomposition of the APIC ID into three sub-fields . The three sets of sub-fields correspond to three hierarchical levels defined as “SMT”, “processor core” (or “core”), and “physical package” (or “package”). This allows each hierarchical level to be mapped to a subfield (a sub ID) within the APIC ID.

Conceptually, a topology enumeration algorithm is simply to extract the sub ID corresponding to a given hierarchical level from the APIC ID, based on deriving two parameters that defines the subset of bits within an APIC ID. The relevant parameters are: (a) the width of a mask that can be used to mask off unneeded bits in the APIC ID, (b) an offset relative to bit 0 of the APIC ID.

The “SMT” level corresponds to the innermost constituent of the processor topology. So it is located in the least significant portion of the APIC ID. If the corresponding width for “SMT” is 0, it implies there is only 1 logical processor within the next outer level of the hierarchy. For example, Intel® CoreTM2 Duo Processors generally produce a “SMT_Mask_Width” of 0. If the corresponding width is 1 bit wide, there could be two logical processors within the next outer level of the hierarchy.

If the corresponding width for “core” is 0, it implies there is only 1 processor core within a physical processor. If the corresponding width for “core” is 1 bit wide, there could be two processor cores within a physical processor.

Note, the values of APIC ID that are assigned across all logical processor s in the system need not be contiguous. But the subsets of bit fields corresponding to three hierarchical levels are contiguous at bit boundary. Due to this requirement, the bit offset of the mask to extract a given Sub ID can be derived from the “mask width” of the inner hierarchical levels.

Glossary

Physical Processor: The physical package of a microprocessor capable of executing one or more threads of software at the same time. Each physical package plugs into a physical socket. Each physical package may contain one or more processor cores, also referred to as a physical package.

Processor Core: The circuitry that provides dedicated functionalities to decode, execute instructions, and transfer data between certain sub-systems in a physical package. A processor core may contain one or more logical processors.

Logical Processor: The basic modularity of processor hardware resource that allow software executive (OS) to dispatch task or execute a thread context. Each logical processor can execute only one thread context at a time.

Hyper-Threading Technology: A feature within the IA-32 family of processors, where each processor core provides the functionality of more than one logical processor.

SMT: Abbreviated name for Simultaneous Multi-Threading. An efficient means in silicon to provide the functionalities of multiple logical processors within the same processor core by sharing execution resources and cache hierarchy between logical processors.

Multi-core Processor: A physical processor that contains more than one processor cores.

Multi-processor Platform: A computer system made of two or more physical sockets.

Hardware Multi-threading: Refers to any combination of hardware support to allow a system to run multi-threaded software. The forms of hardware support for multi-threading are: SMT, multi-core, and multi-processor.

Cache Hierarchy: Physical arrangement of cache levels that buffers data transport between a processor entity and the physical memory subsystem.

Cache Topology: Hierarchical relationships of a cache level relative to the logical processors in a physical processor.

Unique APIC ID in a Multi-Processor System

Although legacy IA-32 multi-processor systems assigns unique APIC IDs for each logical processors in the system, the programming interfaces have evolved several times in the past. For Intel Pentium Pro processors and Pentium III Xeon processors, APIC IDs are accessible only from local APIC registers (Local APIC registers use memory-mapped IO interfaces and are managed by OS). In the first generation of Intel Pentium 4 and Intel Xeon processor processors (2000, 2001), CPUID instruction provided information on the initial APIC ID that is assigned during processor reset. The CPUID instruction in the first generation of Intel Xeon MP processor and Intel Pentium 4 processor supporting Hyper-Threading Technology (2002) provided additional information that allows software to decompose initial APIC IDs into a two-level topology enumeration. With the introduction of dual-core Intel 64 processors in 2005, system topology enumeration using CPUID evolved into a three-level algorithm on the 8-bit wide initial APIC IDs. Future Intel 64 platforms may be capable of supporting a large number of logical processors that exceed the capacity of the 8-bit initial APIC ID field. The x2APIC extension in Intel 64 architecture defines a 32-bit x2APIC ID, the CPUID instruction in future Intel 64 processors will allow software to enumerate system topology using x2APIC IDs. The extended topology enumeration leaf of CPUID (leaf 11) is the preferred interface for system topology enumeration for future Intel 64 processor.

The CPUID instruction in future Intel 64 processors may support leaf 11 independent of x2APIC hardware. For many future Intel 64 platforms, system topology enumeration may be performed using either CPUID leaf 11 or legacy initial APIC ID (via CPUID leaf 1 and leaf 4). Figure 1-1 shows an example of how software can choose which CPUID leaf information to use for system topology enumeration.

If software observes that CPUID.0:EAX < 4 on a newer Intel 64 or IA-32 processor (newer than 2004), it should examine the MSR IA32_MISC_ENABLES[bit 22].

If IA32_MISC_ENABLES[bit 22] was set to 1 (by BIOS or other means), the user can restore CPUID leaf function full reporting by having IA32_MISC_ENABLES[bit 22] set to ‘0’ (Modify BIOS CMOS setting or use WRMSR).

For older IA-32 processors that support only two-level topology, the three-level system topology enumeration algorithm (using CPUID leaf 1 and leaf 4) is fully compatible with older processors supporting two-level topology (SMT and physical package). For processors that report CPUID.1:EBX[23:16] as reserved (i.e. 0), the processor supports only one level of topology.

Example A-1 shows a code example of dealing with CPUID leaf functions across three categories of processor hardware.

System Topology Enumeration Using CPUID Extended Topology Leaf

The algorithm of system topology enumeration can be summarized as three phase of operation:

Derive “mask width” constants that will be used to extract each Sub IDs.

Gather the unique APIC IDs of each logical processor in the system, and extract/decompose each APIC ID into three sets of Sub IDs.

Analyze the relationship of hierarchical Sub IDs to establish mapping tables between OS’s thread management services according to three hierarchical levels of processor topology.

Example A-2 shows an example of the basic structure of the three phases of system wide topology as applied to processor topology and cache topology.

Figure 1-2 outlines the procedures of querying CPUID leaf 11 for the x2APIC ID and extracting sub IDs corresponding to the “SMT”, “Core”, “physical package” levels of the hierarchy.

Figure 1-2. Procedures to Extract Sub IDs from the x2APIC ID of Each Logical Processor

Example A-3 lists a data structure that holds the APIC ID, various sub IDs, and a hierarchical set of ordinal numbering scheme to enumerate each entity in the processor topology and/or cache topology of a system.

System topology enumeration at the application level using CPUID involves executing CPUID instruction on each logical processor in the system. This implies context switching using services provided by an OS. On-demand context switching by user code generally relies on a thread affinity management API provided by the OS. The capability and limitation of thread affinity API by different OS vary. For example, in some OS, the thread affinity API has a limit of 32 or 64 logical processors. It is expected that enhancement to thread affinity API to manage larger number of logical processor will be available in future versions.

System Topology Enumeration Using CPUID Leaf 1 and Leaf 4

Figure 1-3 outlines the procedures of querying initial APIC ID via CPUID leaf 1 and extracting sub IDs corresponding to the SMT, Core, physical package levels of the hierarchy using CPUID leaf 1 and leaf 4. Extraction of sub ID from initial APIC ID is based on querying CPUID leaf 1 and 4 to derive the bit widths of three select masks (SMT mask, Core mask, Pkg mask) that make up the 8-bit initial APIC ID field. The select masks allow software to extract the sub IDs corresponding to “SMT”, “core”, “package” from the initial APIC ID of each logical processor.

Figure 1-3. Procedures to Extract Sub IDs from the Initial APIC ID of Each Logical Processor

Example A-4 shows an example of querying the APID ID for each logical processor in the system and parsing each APIC ID into respective sub IDs for later analysis of the topological makeup.

Example A-5 lists support routines to extract various sub IDs from each APIC ID of the logical processor that we have bound the current execution context.

Example A-6 shows OS-specific wrapper functions.

Cache Topology Enumeration

The physical package of an Intel 64 processor has a hierarchy of cache. A given level of the cache hierarchy may be shared by one or more logical processors. Some software may wish to optimize performance by taking advantage of the shared cache of a particular level of the cache hierarchy.

Performance tuning using cache topology can be accomplished by combining the system topology information with the addition of cache topology information. Figure 1-4 outlines the procedures of decomposition of sub IDs to enumerate logical processors sharing the target cache level and enumerating the target level caches visible in the system. The Cache_ID can be extracted from the x2APIC ID for processors that reports 32-bit x2APIC ID or from the initial APIC ID for processors that do not report x2APIC ID. The array of “Cache_ID” can be used to enumerate different caches in conjunction with other sub ID derived from the processor topology to implement code-tuning techniques.

Figure 1-4. Procedures to Extract Cache_ID of the Target Cache Level of Each Logical Processor

The three-level sub IDs, SMT_ID[k], Core_ID[k], Pkg_ID[k], k = 0, .., N-1 can be used by software in a number of application-specific ways. Some of the more common usages include:

Determine the number of physical processors to implement a per-package licensing policy. Each unique value in the Pkg_ID[] array represents a physical processor.

A thread-binding strategy may choose to favor binding each new task to a separate core in the system. This may require the software to know the relationships between the affinity mask of each logical processor relative to each distinct processor core.

An MP-scaling optimization strategy may wish to partition its data working set according to the size of the large last-level cache and allow multiple threads to process the data tile residing in each last level cache. This will require software to manage the affinity masks and thread binding relative to each Cache_ID and APIC ID in the system.

Data Processing of Sub IDs of the Topology

Each hierarchy of the sub IDs represents a subset of the APIC ID (either x2APIC ID or initial APIC ID). It allows software to address each distinct entity within the parent hierarchical level. For processor topology enumeration:

SMT_ID: each unique SMT_ID allows software to distinguish different logical processors within a processor core,

Core_ID: each unique Core_ID allows software to distinguish different processor cores within a physical package,

Pkg_ID: each unique Pkg_ID allows software to distinguish different physical packages in a multiprocessor system.

Cache_ID: each unique Cache_ID allows software to distinguish different target level cache in the system.

The extraction of sub IDs from the APIC ID makes use of constant parameters that are derived from CPUID instruction. From platform hardware perspective, Intel 64 and IA-32 multi-processor system require each physical processor support the same hardware multi-threading capabilities. Therefore, system topology enumeration can execute the relevant CPUID leaf functions on one logical processor to derive system-wide sub-ID extraction parameters. But the APIC IDs must be queried by executing CPUID instruction on each logical processor in the system.

Sub ID Extraction Parameters for x2APIC ID

Extraction of sub ID from x2APIC ID is based on querying the value of CPUID.(EAX=11, ECX=n):EAX[4:0] for a valid sub leaf index n to obtain the bit width parameter to derive an extraction mask while x2APIC ID is queried by CPUID.(EAX=1,ECX=0):EDX[31:0]. The extraction mask allows software to extract a subset of bits from the x2APIC ID as a sub ID for the respective level of the hierarchy. In order to enumerate the sub IDs, increase sub leaf index (ECX=n) by 1 until CPUID.(ECX=11,ECX=n).EBX[15:0] == 0

SMT_ID: CPUID.(EAX=11, ECX=0):EAX[4:0] provides the width parameter to derive a SMT select mask to extract the SMT_IDs of logical processors within the same processor core. The sub leaf index (ECX=0) is architecturally defined and associated with the “SMT” level type (CPUID.(EAX=11, ECX=0):ECX[15:8] == 1)

Core_ID: The sub leaf index (ECX=1) is associated with “processor core” level type. Then, CPUID.(EAX=11,ECX=1):EAX[4:0] provides the width parameter to derive a select mask of all logical processors within the same physical package. The “processor core” includes “SMT” in this case, and enumerating different cores in the package can be done by zeroing out the SMT portion of the inclusive mask derived from this.

Pkg_ID: Within a coherent domain of the three-level topology, the upper bits of the APIC_ID (except the lower “CorePlus_Mask_Width” bits) can enumerate different physical packages in the system. In a clustered installation, software may need to consult vendor specific documentation to distinguish the topology of how many physical packages are organized within a given node.

Pkg_Select_Mask = (-1) << CorePlus_Mask_Width

Pkg_ID = (x2APIC_ID & Pkg_Select_Mask) >> CorePlus_Mask_Width

An example of deriving the extraction parameters for x2APIC ID can be found in the support function “CPUTopologyLeafBConstants()” in the Appendix.

Considerations for Using Three Level Topology on a System with More Levels

System software can choose to assume three level hierarchy if it was developed to understand only three levels. However, software implementation needs to ensure it does not break if it runs on systems that have more levels in the hierarchy, even if it does not recognize them. If the software does not recognize or implement certain hierarchical levels, it should assume these unknown levels as an extension of the last known level. For example, if there were additional levels between core and package which the software wants to ignore, it can extract the bitmasks as follows:

Query the right-shift value for the SMT level of the topology using CPUID leaf 0BH with ECX =0H as input. The number of bits to shift-right on x2APIC ID (EAX[4:0]) can distinguish different higherlevel entities above SMT (e.g., processor cores) in the same physical package. This is also the width of the bit mask to extract the SMT_ID.

Enumerate until the desired level is found (e.g., processor cores). Determine if the next level is the expected level. If the next level is not known to the software, keep enumerating until the next known or the last level. Software should use the previous level before this to represent the last previously known level (e.g., processor cores).

Query CPUID leaf 0BH for the amount of bit shift to distinguish next higher-level entities (e.g., physical processor packages) in the system. The width of the extraction bit mask can be used to derive the cumulative extraction bitmask to extract the sub IDs of logical processors (including different processor cores) in the same physical package. The extraction bit mask to distinguish merely different processor cores can be derived by XOR'ing the SMT extraction bit mask from the cumulative extraction bit mask.

Derive the extraction bit masks corresponding to SMT_ID, CORE_ID, and PACKAGE_ID, starting from SMT_ID and apply each extraction bit mask to the 32-bit x2APIC ID to extract sub-field IDs.

Sub ID Extraction Parameters for Initial APIC ID

Topological sub ID extraction from an INITIAL_APIC_ID (CPUID.1:EBX[31:24]) uses parameters derived from CPUID.1:EBX[23:16] and CPUID.(EAX=04H, ECX=0):EAX[31:26]. CPUID.1:EBX[23:16] represents the maximum number of addressable IDs (initial APIC ID) that can be assigned to logical processors in a physical package. The value may not be the same as the number of logical processors that are present in the hardware of a physical package. The value of (1 + (CPUID.(EAX=4, ECX=0):EAX[31:26] )) represents the maximum number of addressable IDs (Core_ID) that can be used to enumerate different processor cores in a physical package. The value also can be different than the actual number of processor cores that are present in the hardware of a physical package.

SMT_ID: The equivalent “SMT_Mask_Width” can be derived from dividing maximum number of addressable initial APIC IDs by maximum number of addressable Core IDs

SMT_Mask_Width = Log21( RoundToNearestPof2(CPUID.1:EBX[23:16]) / ((CPUID.(EAX=4, ECX=0):EAX[31:26] ) + 1)), where Log2 is the logarithmic based on 2 and RoundToNearestPof2() operation is to round the input integer to the nearest power-of-two integer that is not less than the input value.

SMT_Select_Mask = ~((-1) << SMT_Mask_Width )

SMT_ID = INITIAL_APIC_ID & SMT_Select_Mask

Core_ID: The value of (1 + (CPUID.(EAX=04H, ECX=0):EAX[31:26] )) can also be use to derive an equivalent “CoreOnly_Mask_Width”

Cache ID Extraction Parameters

Cache IDs are specific to a target level cache of the cache hierarchy. Software must determine, a priori, the target cache level (the sub-level index n associated with CPUID leaf 4) it wishes to optimize with respect to the processor topology. After it has chosen the sub-leaf index ECX=n, then Log2(RoundToNearestPof2( (1 + CPUID.(EAX=4, ECX=n):EAX[25:14])) is the equivalent “Cache_Mask_Width” parameter. The “Cache_Mask_Width” parameter forms the basis to construct either a select mask to extract the sub IDs of logical processors sharing the target cache level, or a complementary mask to select the upper bits from APID to identify different cache entities of the specified target level in the system. To construct a mask to extract sub IDs of different logical processors sharing a cache, it is simply ~((-1) << Cache_Mask_Width ).

The derivation of bitmask extraction parameters for cache topology is analogous to those shown in Example A-8. Software may choose to focus on one specific cache level in the cache hierarchy. In the companion full source code package that is released separately with this paper, the reader can find code examples of derivation for the bitmask extraction parameters for each cache level, and corresponding cache topology sorting examples. For space consideration, the full source of the cache topology code is not listed in the Appendix.

1. Evaluating the Log2 value of a positive number that happens to be a power of 2 can be implemented using integer operation. For example, the BSR instruction can be used to obtain the index position of the most significant bit that is not zero.

Analyzing Topology Enumeration Result and Customization

How can software make use of topological information (in the form of hierarchical sub IDs)? This really depends on the needs and situations specific to each application. It may need adaptation due to differences of APIs provided by different OS. For the purpose of illustration, we consider some examples of using sub IDs to establish manage affinity masks hierarchically.

The knowledge of sub IDs of each topological hierarchy may be useful in several ways, For example:

Count the number of entities in a given hierarchical level across the system.

Affinity mask is a data structure that is defined within a specific OS, different OS may use the same concept but providing different means of application programming interface. For example, Microsoft Windows* provides affinity mask as a data type that can be directly manipulated via bit field by applications for affinity control. Linux implements a similar data structure internally but abstracts it so application can manipulate affinity through an iterative interface that assigned zero-based numbers to each logical processor.

The affinity mask or the equivalent numbering scheme provided by OS does not carry attributes that can store hierarchical attributes of the system topology. We will use the “affinity mask” terminology generically in this section (as the technique can be easily generalized to the numbered interface of affinity control).

In the reference code example, we use the sub IDs to create an ordinal numbering scheme (zerobased) for each hierarchical level. Different entities in the system topology (packages, cores) can be referenced by applications using a set of hierarchical ordinal numbers. Using the hierarchical ordinal number scheme and a look-up table to the corresponding affinity masks, software can easily control thread binding, optimize cache usage, etc.

Figure 1-5 depicts a basic example of data processing of Pkg_IDs and Core_IDs in the system to acquire information on the number of software visible physical packages, processor cores in the system. This basic technique can also be adapted to acquire affinity mappings, hierarchical breakdowns, and asymmetry information in the system.

Example A-10a, Example A-10b, and Example A-10c list an algorithm to analyze the sub IDs of all logical processor in the system and derive a triplet of zero-based numbering scheme to index unique entities within each topological level.

Example A-11 lists a data structure that organizes miscellaneous global variable, arrays, workspace items that are used throughout the rest of the code example. The full set of source code is provided in a separate package. The full source code can be compiled under 32-bit and 64-bit Windows and Linux operating systems. A limited set of OS and compiler tools have been tested.

Figure 1-5. Procedures to Extract Cache_ID of the Target Cache Level of Each Logical Processor

Dynamic Software Visibility of Topology Enumeration

When application software examines/uses topology information, it must keep in mind the dynamic nature of software visibility. The hardware capability present at the platform level may be presented differently through BIOS setting, through OS boot option, through OS-supported user interfaces. For example, Intel 64 and IA-32 multi-processor system require each physical processor support the same hardware multi-threading capabilities. This hardware symmetry that exists at the platform hardware level may be presented differently at application level. System topology enumeration can uncover dynamic software visible asymmetry irrespective of the cause of such asymmetry may be caused by BIOS setting, OS boot option, or UI configurations.

The appendix lists the bulk of the supporting functions that are used in enumerating processor topology of the system as visible to the current software process. A complete source code is provided separately for download. The reference code can be compiled in either 32-bit or 64-bit Windows* environment. In 64-bit environment, the cpuid64.asm file is needed to provide an enhanced intrinsic function for querying CPUID sub-leaves. An equivalent reference code implementation for 32-bit and 64-bit Linux environment will also be available.

if (!PkgMarked)
{ // mark up the pkg_ID and Core_ID of an unmarked package
pDetectedPkgIDs[maxPackageDetetcted] = packageID;
pDetectCoreIDsperPkg[maxPackageDetetcted* numMappings + 0] = coreID;
// keep track of respective hierarchical counts
glbl_ptr->perPkg_detectedCoresCount.data[maxPackageDetetcted] = 1;
glbl_ptr->perCore_detectedThreadsCount.data
[maxPackageDetetcted*MAX_CORES+0] = 1;
// build a set of zero-based numbering acheme so that
// each logical processor in the same core can be referenced by a
// zero-based index
// each core in the same package can be referenced by another
//zero-based index
// each package in the system can be referenced by a third
// zero-based index scheme.
// each system wide index i can be mapped to a triplet of
// zero-based hierarchical indices
glbl_ptr->PApicAffOrdMapping[i].packageORD = maxPackageDetetcted;
glbl_ptr->PApicAffOrdMapping[i].coreORD = 0;
glbl_ptr->PApicAffOrdMapping[i].threadORD = 0;
// this is an unmarked pkg, increment pkg count by 1
maxPackageDetetcted++;
// there is at least one core in a package
glbl_ptr->EnumeratedCoreCount++;
}
}
glbl_ptr->EnumeratedPkgCount = maxPackageDetetcted;
return 0;
}