Hyper-Threading Technology Architecture and
Microarchitecture
Deborah T. Marr, Desktop Products Group, Intel Corp.
Frank Binns, Desktop Products Group, Intel Corp.
David L. Hill, Desktop Products Group, Intel Corp.
Glenn Hinton, Desktop Products Group, Intel Corp.
David A. Koufaty, Desktop Products Group, Intel Corp.
J. Alan Miller, Desktop Products Group, Intel Corp.
Michael Upton, CPU Architecture, Desktop Products Group, Intel Corp.
Index words: architecture, microarchitecture, Hyper-Threading Technology, simultaneous multi-threading, multiprocessor

ABSTRACT
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

This paper describes the Hyper-Threading Technology architecture and discusses the microarchitecture details of Intel's first implementation on the Intel® Xeon™ processor family. Hyper-Threading Technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.

INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvements (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.

Processor Microarchitecture
Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in-flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts, and branch mispredictions, can be costly.

® Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
™ Xeon is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, with simple in-order execution, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute. One technique is out-of-order execution, where a large window of instructions is simultaneously evaluated and sent to execution units based on instruction dependencies rather than program order.

Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small. For this reason, processors often are designed with a cache hierarchy in which fast, small caches are located and operated at access latencies very close to that of the processor core, and progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss.

The vast majority of techniques used to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance, due to the number of processor cycles lost to branch mispredictions.

Figure 1: Single-stream performance vs. cost (relative power, die size, and SPECInt performance for the i486, Pentium, Pentium III, and Pentium 4 processors)

Figure 1 shows the relative increase in performance and the costs, such as die size and power, over the last ten years on Intel processors¹. In order to isolate the microarchitecture impact, this comparison assumes that the four generations of processors are on the same silicon process technology and that the speed-ups are normalized to the performance of an Intel486™ processor. Although we use Intel's processor history in this example, other high-performance processor manufacturers during this time period would have similar trends. Intel's processor performance, due to microarchitecture advances alone, has improved integer performance five- or six-fold¹. Most integer applications have limited ILP and the instruction flow can be hard to predict.

Over the same period, the relative die size has gone up fifteen-fold, a three-times-higher rate than the gains in integer performance. Fortunately, advances in silicon process technology allow more transistors to be packed into a given amount of die area so that the actual measured die size of each generation microarchitecture has not increased significantly.

The relative power increased almost eighteen-fold during this period¹. Fortunately, there exist a number of known techniques to significantly reduce power consumption on processors, and there is much on-going research in this area. However, current processor power dissipation is at the limit of what can be easily dealt with in desktop platforms, and we must put greater emphasis on improving performance in conjunction with new technology, specifically to control power.

¹ These data are approximate and are intended only to show trends, not actual performance.

™ Intel486 is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Thread-Level Parallelism
A look at today's software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.

In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.

In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die. The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, as you can have multiple CMP processors in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than the size of a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.

Another approach is to allow a single processor to execute multiple threads by switching between them. Time-slice multithreading is where the processor switches between software threads after a fixed time period. Time-slice multithreading can result in wasted execution slots but can effectively minimize the effects of long latencies to memory. Switch-on-event multi-threading would switch threads on long latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, both the time-slice and the switch-on-event multi-threading techniques do not achieve optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.

Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources. This approach makes the most effective use of processor resources: it maximizes the performance vs. transistor count and power consumption.

Hyper-Threading Technology brings the simultaneous multi-threading approach to the Intel architecture. In this paper we discuss the architecture and the first implementation of Hyper-Threading Technology on the Intel® Xeon™ processor family.

HYPER-THREADING TECHNOLOGY ARCHITECTURE
Hyper-Threading Technology makes a single physical processor appear as multiple logical processors [11, 12]. To do this, there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multi-processor system. From a microarchitecture perspective, this means that instructions from logical processors will persist and execute simultaneously on shared execution resources.
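From the software perspective just described, logical processors are discovered much as physical ones are. As a concrete illustration (a minimal sketch, assuming a GCC-style compiler and its <cpuid.h> helper; the printed labels are ours), the CPUID instruction's leaf 1 reports the Hyper-Threading Technology feature flag in EDX bit 28 and the number of logical processors per physical package in EBX bits 23:16:

    /* Minimal sketch: discovering Hyper-Threading Technology from
     * software with CPUID (GCC-style, using the <cpuid.h> helper). */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;                     /* CPUID leaf 1 unavailable */

        int htt     = (edx >> 28) & 1;    /* EDX[28]: HTT feature flag */
        int logical = (ebx >> 16) & 0xff; /* EBX[23:16]: logical
                                             processors per package */

        printf("Hyper-Threading supported: %s\n", htt ? "yes" : "no");
        if (htt)
            printf("Logical processors per package: %d\n", logical);
        return 0;
    }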

Figure 2: Processors without Hyper-Threading Technology (each physical processor has one architecture state and its own set of processor execution resources)

As an example, Figure 2 shows a multiprocessor system with two physical processors that are not Hyper-Threading Technology-capable. Figure 3 shows a multiprocessor system with two physical processors that are Hyper-Threading Technology-capable. With two copies of the architectural state on each physical processor, the system appears to have four logical processors.

Figure 3: Processors with Hyper-Threading Technology (each physical processor carries two architecture states that share one set of processor execution resources)

The first implementation of Hyper-Threading Technology is being made available on the Intel® Xeon™ processor family for dual and multiprocessor servers, with two logical processors per physical processor. By more efficiently using existing processor resources, the Intel Xeon processor family can significantly improve performance at virtually the same system cost. This implementation of Hyper-Threading Technology added less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.

Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.

Each logical processor has its own interrupt controller or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.

FIRST IMPLEMENTATION ON THE INTEL XEON PROCESSOR FAMILY
Several goals were at the heart of the microarchitecture design choices made for the Intel® Xeon™ processor MP implementation of Hyper-Threading Technology. One goal was to minimize the die area cost of implementing Hyper-Threading Technology. Since the logical processors share the vast majority of microarchitecture resources and only a few small structures were replicated, the die area cost of the first implementation was less than 5% of the total die area.

A second goal was to ensure that when one logical processor is stalled the other logical processor could continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads² were executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.

² Active software threads include the operating system idle loop because it runs a sequence of code that continuously checks the work queue(s). The operating system idle loop can consume considerable execution resources.

A third goal was to allow a processor running only one active software thread to run at the same speed on a processor with Hyper-Threading Technology as on a processor without this capability. This means that partitioned resources should be recombined when only one software thread is active. A high-level view of the microarchitecture pipeline is shown in Figure 4. As shown, buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block.
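This "partition or limit" policy lends itself to a small software model. The sketch below is illustrative only (the queue size and all names are ours, not the hardware's): no logical processor may hold more than half of a shared queue's entries, so a stalled thread can never consume the entries its sibling needs for forward progress.

    #include <stdbool.h>

    #define QUEUE_SIZE   16               /* illustrative size only */
    #define PER_LP_LIMIT (QUEUE_SIZE / 2) /* each logical processor gets
                                             at most half the entries  */

    struct shared_queue {
        int occupancy[2];  /* entries held by logical processors 0 and 1 */
    };

    /* Returns true if the entry is accepted. A logical processor at its
     * limit is refused and must stall, leaving the remaining entries for
     * the other logical processor; this is what preserves independent
     * forward progress when two software threads are active. */
    bool try_allocate(struct shared_queue *q, int lp)
    {
        if (q->occupancy[lp] >= PER_LP_LIMIT)
            return false;
        q->occupancy[lp]++;
        return true;
    }

    void release(struct shared_queue *q, int lp)
    {
        q->occupancy[lp]--;
    }

When only one software thread is active, the limit would be lifted so the partitioned entries recombine, matching the third design goal above.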

In the following sections we will walk through the pipeline, discuss the implementation of major functions, and detail several ways resources are shared or replicated.

Figure 4: Intel® Xeon™ processor pipeline. Buffering queues separate the major logic blocks: Fetch, TC / MS-ROM, Decode, Rename/Allocate, Out-of-order Schedule / Execute, and Retirement; the architecture state and APICs are duplicated, while the physical registers are shared.

FRONT END
The front end of the pipeline is responsible for delivering instructions to the later pipe stages. As shown in Figure 5a, instructions generally come from the Execution Trace Cache (TC), which is the primary or Level 1 (L1) instruction cache. Figure 5b shows that only when there is a TC miss does the machine fetch and decode instructions from the integrated Level 2 (L2) cache. Near the TC is the Microcode ROM, which stores decoded instructions for the longer and more complex IA-32 instructions.

Figure 5: Front-end detailed pipeline. (a) Trace Cache hit: the instruction pointers select uops from the Trace Cache into the uop queue. (b) Trace Cache miss: the ITLB, L2 cache access, decode, and Trace Cache fill stages deliver uops to the uop queue.
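Figure 5 can also be read as a simple control flow. The sketch below is a software caricature of that flow under our own naming (every type and helper here is invented for illustration, and the real pipeline overlaps these steps): a TC hit feeds the uop queue directly, while a TC miss goes through ITLB translation, an L2 cache fetch, decode, and a TC fill.

    #include <stdbool.h>

    typedef struct { int opcode; } uop_t;       /* stand-in for a uop */

    extern bool trace_cache_lookup(int lp, uop_t *out);   /* Fig. 5a  */
    extern unsigned long itlb_translate(int lp); /* next-IP -> physical */
    extern void l2_fetch(unsigned long paddr, unsigned char *bytes);
    extern uop_t decode(const unsigned char *bytes);
    extern void trace_cache_fill(int lp, uop_t u);
    extern void uop_queue_push(int lp, uop_t u);

    void front_end_step(int lp)         /* lp: logical processor 0 or 1 */
    {
        uop_t u;
        if (trace_cache_lookup(lp, &u)) {    /* common case: TC hit */
            uop_queue_push(lp, u);
            return;
        }
        /* TC miss (Figure 5b): translate, fetch from L2, decode, fill. */
        unsigned char bytes[64];             /* streaming-buffer sized */
        l2_fetch(itlb_translate(lp), bytes);
        u = decode(bytes);
        trace_cache_fill(lp, u);
        uop_queue_push(lp, u);
    }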

Execution Trace Cache (TC)
The TC stores decoded instructions, called micro-operations or "uops." Most instructions in a program are fetched and executed from the TC. Two sets of next-instruction-pointers independently track the progress of the two software threads executing. The two logical processors arbitrate access to the TC every clock cycle. If both logical processors want access to the TC at the same time, access is granted to one then the other in alternating clock cycles. For example, if one cycle is used to fetch a line for one logical processor, the next cycle would be used to fetch a line for the other logical processor, provided that both logical processors requested access to the trace cache. If one logical processor is stalled or is unable to use the TC, the other logical processor can use the full bandwidth of the trace cache, every cycle.

The TC entries are tagged with thread information and are dynamically allocated as needed. The TC is 8-way set associative, and entries are replaced based on a least-recently-used (LRU) algorithm that is based on the full 8 ways. The shared nature of the TC allows one logical processor to have more entries than the other if needed.

Microcode ROM
When a complex instruction is encountered, the TC sends a microcode-instruction pointer to the Microcode ROM. The Microcode ROM controller then fetches the uops needed and returns control to the TC. Two microcode instruction pointers are used to control the flows independently if both logical processors are executing complex IA-32 instructions.

Both logical processors share the Microcode ROM entries. Access to the Microcode ROM alternates between logical processors just as in the TC.

ITLB and Branch Prediction
If there is a TC miss, then instruction bytes need to be fetched from the L2 cache and decoded into uops to be placed in the TC. The Instruction Translation Lookaside Buffer (ITLB) receives the request from the TC to deliver new instructions, and it translates the next-instruction pointer address to a physical address. A request is sent to the L2 cache, and instruction bytes are returned. These bytes are placed into streaming buffers, which hold the bytes until they can be decoded.

The ITLBs are duplicated. Each logical processor has its own ITLB and its own set of instruction pointers to track the progress of instruction fetch for the two logical processors. The instruction fetch logic in charge of sending requests to the L2 cache arbitrates on a first-come first-served basis, while always reserving at least one request slot for each logical processor. In this way, both logical processors can have fetches pending simultaneously.

Each logical processor has its own set of two 64-byte streaming buffers to hold instruction bytes in preparation for the instruction decode stage. The ITLBs and the streaming buffers are small structures, so the die size cost of duplicating these structures is very low.

The branch prediction structures are either duplicated or shared. The return stack buffer, which predicts the target of return instructions, is duplicated because it is a very small structure and the call/return pairs are better predicted for software threads independently. The branch history buffer used to look up the global history array is also tracked independently for each logical processor. However, the large global history array is a shared structure with entries that are tagged with a logical processor ID.

IA-32 Instruction Decode
IA-32 instructions are cumbersome to decode because the instructions have a variable number of bytes and have many different options. A significant amount of logic and intermediate state is needed to decode these instructions. Fortunately, the TC provides most of the uops, and decoding is only needed for instructions that miss the TC.

The decode logic takes instruction bytes from the streaming buffers and decodes them into uops. When both threads are decoding instructions simultaneously, the streaming buffers alternate between threads so that both threads share the same decoder logic. The decode logic has to keep two copies of all the state needed to decode IA-32 instructions for the two logical processors even though it only decodes instructions for one logical processor at a time. In general, several instructions are decoded for one logical processor before switching to the other logical processor. The decision to do a coarser level of granularity in switching between logical processors was made in the interest of die size and to reduce complexity. Of course, if only one logical processor needs the decode logic, the full decode bandwidth is dedicated to that logical processor. The decoded instructions are written into the TC and forwarded to the uop queue.

Uop Queue
After uops are fetched from the trace cache or the Microcode ROM, or forwarded from the instruction decode logic, they are placed in a "uop queue." This queue decouples the Front End from the Out-of-order Execution Engine in the pipeline flow. The uop queue is partitioned such that each logical processor has half the entries. This partitioning allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., TC miss) or execution stalls.

OUT-OF-ORDER EXECUTION ENGINE
The out-of-order execution engine consists of the allocation, register renaming, scheduling, and execution functions, as shown in Figure 6. This part of the machine re-orders instructions and executes them as quickly as their inputs are ready, without regard to the original program order.

Allocator
The out-of-order execution engine has several buffers to perform its re-ordering, tracing, and sequencing operations. The allocator logic takes uops from the uop queue and allocates many of the key machine buffers needed to execute each uop, including the 126 re-order buffer entries, 128 integer and 128 floating-point physical registers, and 48 load and 24 store buffer entries. Some of these key buffers are partitioned such that each logical processor can use at most half the entries.

Figure 6: Out-of-order execution engine detailed pipeline (uop queue, register rename, queues, schedule, register read, execute with the L1 data cache, re-order buffer, and retire stages)
Specifically, each logical processor can use up to a maximum of 63 re-order buffer entries, 24 load buffers, and 12 store buffer entries.

If there are uops for both logical processors in the uop queue, the allocator will alternate selecting uops from the logical processors every clock cycle to assign resources. If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal "stall" for that logical processor and continue to assign resources for the other logical processor. In addition, if the uop queue only contains uops for one logical processor, the allocator will try to assign resources for that logical processor every cycle to optimize allocation bandwidth, though the resource limits would still be enforced.

By limiting the maximum resource usage of key buffers, the machine helps enforce fairness and prevents deadlocks.

Register Rename
The register rename logic renames the architectural IA-32 registers onto the machine's physical registers. This allows the 8 general-use IA-32 integer registers to be dynamically expanded to use the available 128 physical registers. The renaming logic uses a Register Alias Table (RAT) to track the latest version of each architectural register to tell the next instruction(s) where to get its input operands.

Since each logical processor must maintain and track its own complete architecture state, there are two RATs, one for each logical processor. The register renaming process is done in parallel to the allocator logic described above, so the register rename logic works on the same uops to which the allocator is assigning resources.

Once uops have completed the allocation and register rename processes, they are placed into two sets of queues, one for memory operations (loads and stores) and another for all other operations. The two sets of queues are called the memory instruction queue and the general instruction queue, respectively. The two sets of queues are also partitioned such that uops from each logical processor can use at most half the entries.
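The allocator policy described above can be modeled compactly. In the sketch below (our own simplification, with invented names), the allocator alternates between logical processors when both have uops queued, enforces the per-logical-processor limits quoted above, and serves a lone logical processor every cycle:

    #include <stdbool.h>

    /* Per-logical-processor limits from the text: half of the 126
     * re-order buffer entries, 48 load buffers, and 24 store buffers. */
    enum { ROB_LIMIT = 63, LOAD_LIMIT = 24, STORE_LIMIT = 12 };

    struct lp_state {
        int  rob_used, loads_used, stores_used;
        bool has_uops;               /* uops waiting in the uop queue */
    };

    static bool within_limits(const struct lp_state *lp)
    {
        return lp->rob_used    < ROB_LIMIT  &&
               lp->loads_used  < LOAD_LIMIT &&
               lp->stores_used < STORE_LIMIT;
    }

    /* Returns the logical processor allocated this cycle, or -1 if both
     * must stall. 'last' is the logical processor served previously, so
     * the other one is considered first (alternation). */
    int allocate_cycle(const struct lp_state lp[2], int last)
    {
        int order[2] = { 1 - last, last };
        for (int i = 0; i < 2; i++) {
            int c = order[i];
            if (lp[c].has_uops && within_limits(&lp[c]))
                return c;            /* "stall" is signaled to the other */
        }
        return -1;
    }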
Instruction Scheduling
The schedulers are at the heart of the out-of-order execution engine. Five uop schedulers are used to schedule different types of uops for the various execution units. Collectively, they can dispatch up to six uops each clock cycle. The schedulers determine when uops are ready to execute based on the readiness of their dependent input register operands and the availability of the execution unit resources.

The memory instruction queue and general instruction queues send uops to the five scheduler queues as fast as they can, alternating between uops for the two logical processors every clock cycle, as needed.

Each scheduler has its own scheduler queue of eight to twelve entries from which it selects uops to send to the execution units. The schedulers choose uops regardless of whether they belong to one logical processor or the other. The schedulers are effectively oblivious to logical processor distinctions. The uops are simply evaluated based on dependent inputs and availability of execution resources. For example, the schedulers could dispatch two uops from one logical processor and two uops from the other logical processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue. This limit is dependent on the size of the scheduler queue.

Execution Units
The execution core and memory hierarchy are also largely oblivious to logical processors. Since the source and destination registers were renamed earlier to physical registers in a shared physical register pool, uops merely access the physical register file to get their destinations, and they write results back to the physical register file. Comparing physical register numbers enables the forwarding logic to forward results to other executing uops without having to understand logical processors.

After execution, the uops are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor can use half the entries.

Retirement
Uop retirement logic commits the architecture state in program order. The retirement logic tracks when uops from the two logical processors are ready to be retired, then retires the uops in program order for each logical processor by alternating between the two logical processors. Retirement logic will retire uops for one logical processor, then the other, alternating back and forth. If one logical processor is not ready to retire any uops then all retirement bandwidth is dedicated to the other logical processor.

Once stores have retired, the store data needs to be written into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.

MEMORY SUBSYSTEM
The memory subsystem includes the DTLB, the low-latency Level 1 (L1) data cache, the Level 2 (L2) unified cache, and the Level 3 unified cache (the Level 3 cache is only available on the Intel® Xeon™ processor MP). Access to the memory subsystem is also largely oblivious to logical processors. The schedulers send load or store uops without regard to logical processors, and the memory subsystem handles them as they come.

DTLB
The DTLB translates addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4K or a 4MB page. Although the DTLB is a shared structure between the two logical processors, each entry includes a logical processor ID tag. Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.

L1 Data Cache, L2 Cache, L3 Cache
The L1 data cache is 4-way set associative with 64-byte lines. It is a write-through cache, meaning that writes are always copied to the L2 cache. The L1 data cache is virtually addressed and physically tagged.

The L2 and L3 caches are 8-way set associative with 128-byte lines. The L2 and L3 caches are physically addressed. Both logical processors, without regard to which logical processor's uops may have initially brought the data into the cache, can share all entries in all three levels of cache.
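The asymmetry between the DTLB and the caches is worth making concrete: a DTLB entry must match the logical processor ID tag as well as the page number, whereas a physically tagged cache line matches on the physical tag alone and is therefore visible to both logical processors. The sketch below captures that difference (the structure layouts are ours, purely illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    struct dtlb_entry {           /* shared structure, LP-tagged entries */
        bool     valid;
        int      lp_id;           /* logical processor ID tag */
        bool     large_page;      /* maps a 4K or a 4MB page  */
        uint64_t vpn, pfn;
    };

    /* A DTLB hit requires the logical processor ID to match too. */
    bool dtlb_hit(const struct dtlb_entry *e, int lp, uint64_t vpn)
    {
        return e->valid && e->lp_id == lp && e->vpn == vpn;
    }

    struct cache_line {           /* physically tagged: no LP ID at all */
        bool     valid;
        uint64_t ptag;
    };

    /* Either logical processor hits on a line the other brought in. */
    bool cache_hit(const struct cache_line *l, uint64_t ptag)
    {
        return l->valid && l->ptag == ptag;
    }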

Because logical processors can share data in the cache, there is the potential for cache conflicts, which can result in lower observed performance. However, there is also the possibility of sharing data in the cache. For example, one logical processor may prefetch instructions or data, needed by the other, into the cache; this is common in server application code. In a producer-consumer usage model, one logical processor may produce data that the other logical processor wants to use. In such cases, there is the potential for good performance benefits.

BUS
Logical processor memory requests not satisfied by the cache hierarchy are serviced by the bus logic. The bus logic includes the local APIC interrupt controller, as well as off-chip system memory and I/O space. Bus logic also deals with cacheable address coherency (snooping) of requests originated by other external bus agents, plus incoming interrupt request delivery via the local APICs.

From a service perspective, requests from the logical processors are treated on a first-come basis, with queue and buffering space appearing shared. Priority is not given to one logical processor above the other.

Distinctions between requests from the logical processors are reliably maintained in the bus queues nonetheless. Requests to the local APIC and interrupt delivery resources are unique and separate per logical processor. Bus logic also carries out portions of barrier fence and memory ordering operations, which are applied to the bus request queues on a per logical processor basis.

For debug purposes, and as an aid to forward progress mechanisms in clustered multiprocessor implementations, the logical processor ID is visibly sent onto the processor external bus in the request phase portion of a transaction. Other bus transactions, such as cache line eviction or prefetch transactions, inherit the logical processor ID of the request that generated the transaction.

SINGLE-TASK AND MULTI-TASK MODES
To optimize performance when there is one software thread to execute, there are two modes of operation referred to as single-task (ST) or multi-task (MT). In MT-mode, there are two active logical processors and some of the resources are partitioned as described earlier. There are two flavors of ST-mode: single-task logical processor 0 (ST0) and single-task logical processor 1 (ST1). In ST0- or ST1-mode, only one logical processor is active, and resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources. The IA-32 Intel Architecture has an instruction called HALT that stops processor execution and normally allows the processor to go into a lower-power mode. HALT is a privileged instruction, meaning that only the operating system or other ring-0 processes may execute this instruction. User-level applications cannot execute HALT.

On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALT. For example, if logical processor 0 executes HALT, only logical processor 1 would be active; the physical processor would be in ST1-mode and partitioned resources would be recombined, giving logical processor 1 full use of all processor resources. If the remaining active logical processor also executes HALT, the physical processor would then be able to go to a lower-power mode.

In ST0- or ST1-modes, an interrupt sent to the HALTed processor would cause a transition to MT-mode. The operating system is responsible for managing MT-mode transitions (described in the next section).

Figure 7: Resource allocation. (a) ST0-Mode and (c) ST1-Mode: one architecture state has full use of the processor execution resources. (b) MT-Mode: two architecture states share partitioned execution resources.
Figure 7 summarizes this discussion. On a processor with Hyper-Threading Technology, resources are allocated to a single logical processor if the processor is in ST0- or ST1-mode. In MT-mode, resources are shared between the two logical processors.
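The HALT transition just described is what a well-behaved idle loop exploits. The following is a conceptual, kernel-style sketch (the helper names are invented, and HALT is a ring-0 instruction, so code like this can live only inside an operating system): instead of spinning, the idle logical processor executes HALT, moving the package into ST0- or ST1-mode until an interrupt arrives.

    /* Conceptual kernel idle loop; helper names are invented. */
    extern int  work_available(void);
    extern void run_next_task(void);

    static inline void cpu_halt(void)
    {
        __asm__ volatile("hlt");   /* privileged: ring 0 only. On a
                                      Hyper-Threading-capable processor
                                      this recombines the partitioned
                                      resources for the sibling. */
    }

    void idle_loop(void)
    {
        for (;;) {
            if (work_available())
                run_next_task();
            else
                cpu_halt();        /* sleep instead of spinning */
        }
    }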
OPERATING SYSTEM AND APPLICATIONS
A system with processors that use Hyper-Threading Technology appears to the operating system and application software as having twice the number of processors it physically has. Operating systems manage logical processors as they do physical processors. However, for best performance, the operating system should implement two optimizations.

The first is to use the HALT instruction if one logical processor is active and the other is not. HALT will cause the processor to transition to either ST0- or ST1-mode. An operating system that does not use this optimization would execute on the idle logical processor a sequence of instructions that repeatedly checks for work to do. This so-called "idle loop" can consume significant execution resources that could otherwise be used to make faster progress on the other active logical processor.

The second optimization is in scheduling software threads to logical processors. In general, for best performance, the operating system should schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor. This optimization allows software threads to use different physical execution resources when possible.
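This second optimization maps directly onto a modern threading API. The sketch below uses Linux's pthread_setaffinity_np as one example; the mapping of logical CPU numbers to physical packages is an assumption made for illustration (real code should derive it from the APIC IDs or the operating system's topology information). With two packages of two logical processors each, the first two software threads land on different packages before any package receives a second thread.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Assumed topology, for illustration only: logical CPUs 0 and 1
     * share one physical package; CPUs 2 and 3 share the other. */
    static const int preferred_cpu[4] = { 0, 2, 1, 3 };

    /* Pin the n-th software thread so that distinct physical packages
     * are used before two threads share one package's resources. */
    int place_thread(pthread_t thread, int n)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(preferred_cpu[n % 4], &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }

A thread created with pthread_create would be passed to place_thread immediately after creation, with n counting threads in creation order.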
PERFORMANCE
The Intel® Xeon™ processor family delivers the highest server system performance of any IA-32 Intel architecture processor introduced to date. Initial benchmark tests show up to a 65% performance increase on high-end server applications when compared to the previous-generation Pentium® III Xeon™ processor on 4-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology.

Figure 8: Performance increases from Hyper-Threading Technology on an OLTP workload (normalized performance for 1-, 2-, and 4-processor systems, with and without Hyper-Threading)

Figure 8 shows the online transaction processing performance, scaling from a single-processor configuration through to a 4-processor system with Hyper-Threading Technology enabled. This graph is normalized to the performance of the single-processor system. It can be seen that there is a significant overall performance gain attributable to Hyper-Threading Technology, 21% in the cases of the single and dual-processor systems.

Figure 9: Web server benchmark performance (normalized performance on two Web server workloads and a server-side Java workload, with and without Hyper-Threading)

Figure 9 shows the benefit of Hyper-Threading Technology when executing other server-centric benchmarks. The workloads chosen were two different benchmarks that are designed to exercise data and Web server characteristics and a workload that focuses on exercising a server-side Java environment. In these cases the performance benefit ranged from 16 to 28%.

All the performance results quoted above are normalized to ensure that readers focus on the relative performance and not the absolute performance.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.

CONCLUSION
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. This is a significant new technology direction for Intel's future processors. It will become increasingly important going forward as it adds a new technique for obtaining additional performance for lower transistor and power costs.

The first implementation of Hyper-Threading Technology was done on the Intel® Xeon™ processor MP. In this implementation there are two logical processors on each physical processor. The logical processors have their own independent architecture state, but they share nearly all the physical execution and hardware resources of the processor. The goal was to implement the technology at minimum cost while ensuring forward progress on logical processors, even if the other is stalled, and to deliver full performance even when there is only one active logical processor. These goals were achieved through efficient logical processor selection algorithms and the creative partitioning and recombining algorithms of many key resources.

Measured performance on the Intel Xeon processor MP with Hyper-Threading Technology shows performance gains of up to 30% on common server application benchmarks for this technology.

The potential for Hyper-Threading Technology is tremendous; our current implementation has only just begun to tap into this potential. Hyper-Threading Technology is expected to be viable from mobile processors to servers; its introduction into market segments other than servers is only gated by the availability and prevalence of threaded applications and workloads in those markets.

ACKNOWLEDGMENTS
Making Hyper-Threading Technology a reality was the result of enormous dedication, planning, and sheer hard work from a large number of designers, validators, architects, and others. There was incredible teamwork from the operating system developers, BIOS writers, and software developers who helped with innovations and provided support for many decisions that were made during the definition process of Hyper-Threading Technology. Many dedicated engineers are continuing to work with our ISV partners to analyze application performance for this technology. Their contributions and hard work have already made and will continue to make a real difference to our customers.

REFERENCES
A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.

R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter, and B. Smith, "The TERA Computer System," in International Conference on Supercomputing, pages 1-6, June 1990.

L. A. Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," in Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282-293, June 2000.

M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee, "The M-Machine Multicomputer," in 28th Annual International Symposium on Microarchitecture, November 1995.

L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," Computer, 30(9), pages 79-85, September 1997.

D. J. C. Johnson, "HP's Mako Processor," Microprocessor Forum, October 2001, http://www.cpus.hp.com/technical_references/mpf_2001.pdf

B. J. Smith, "Architecture and Applications of the HEP Multiprocessor Computer System," in SPIE Real Time Signal Processing IV, pages 241-248, 1981.

J. M. Tendler, S. Dodson, and S. Fields, "POWER4 System Microarchitecture," Technical White Paper, IBM Server Group, October 2001.