
Abstract:

In a decode stage of a hardware processor pipeline, one particular
instruction of a plurality of instructions is decoded. It is determined
that the particular instruction requires a memory access. Responsive to
such determination, it is predicted whether the memory access will result
in a cache miss. The predicting in turn includes accessing one of a
plurality of entries in a pattern history table stored as a hardware
table in the decode stage. The accessing is based, at least in part, upon
at least a most recent entry in a global history buffer. The pattern
history table stores a plurality of predictions. The global history
buffer stores actual results of previous memory accesses as one of cache
hits and cache misses. Additional steps include scheduling at least one
additional one of the plurality of instructions in accordance with the
predicting; and updating the pattern history table and the global history
buffer subsequent to actual execution of the particular instruction in an
execution stage of the hardware processor pipeline, to reflect whether
the predicting was accurate.

Claims:

1. A method comprising: decoding, in a decode stage of a hardware processor
pipeline, one particular instruction of a plurality of instructions;
determining that said particular instruction requires a memory access;
responsive to determining that said particular instruction requires a
memory access, predicting whether said memory access will result in a
cache miss, said predicting in turn comprising accessing one of a
plurality of entries in a pattern history table stored as a hardware
table in said decode stage, said accessing being based, at least in part,
upon at least a most recent entry in a global history buffer, said
pattern history table storing a plurality of predictions, said global
history buffer storing actual results of previous memory accesses as one
of cache hits and cache misses; scheduling at least one additional one of
said plurality of instructions in accordance with said predicting; and
updating said pattern history table and said global history buffer
subsequent to actual execution of said particular instruction in an
execution stage of said hardware processor pipeline, to reflect whether
said predicting was accurate.

2. The method of claim 1, wherein: said global history buffer stores
combined results for all given ones of said plurality of instructions
requiring said memory access; said pattern history table comprises a miss
pattern history table and a hit pattern history table; and said
predicting comprises: accessing said most recent entry in said global
history buffer to determine whether said most recent entry in said global
history buffer comprises a hit or a miss; accessing said miss pattern
history table if said most recent entry in said global history buffer
comprises said miss, said accessing of said miss pattern history table
comprising accessing an entry therein corresponding to a number of
contiguous misses in said global history buffer, including said most
recent entry in said global history buffer; and accessing said hit
pattern history table if said most recent entry in said global history
buffer comprises said hit, said accessing of said hit pattern history
table comprising accessing an entry therein corresponding to a number of
contiguous hits in said global history buffer, including said most recent
entry in said global history buffer.

3. The method of claim 1, wherein: said global history buffer stores
combined results for all given ones of said plurality of instructions
requiring said memory access; said pattern history table has M entries
arranged in rows and N columns, and each of said rows has a stored tag
value associated therewith; said particular instruction has a program
counter associated therewith; and said predicting comprises: hashing said
program counter to obtain a hash having log M bits; extracting a tag from
said program counter; accessing said one of said plurality of entries in
said pattern history table based on said hash; determining whether that
one of said stored tag values corresponding to a given one of said rows
in which said one of said plurality of entries in said pattern history
table resides matches said extracted tag; if said one of said stored tag
values corresponding to said given one of said rows in which said one of
said plurality of entries in said pattern history table resides does not
match said extracted tag, making a default prediction; and if said one of
said stored tag values corresponding to said given one of said rows in
which said one of said plurality of entries in said pattern history table
resides does indeed match said extracted tag, accessing an entry in said
given one of said rows based on traversing a number of contiguous bits
starting with a most significant bit of said global history buffer, until
a first bit that is different from said most significant bit, said global
history buffer being an N-bit global history buffer.

4. The method of claim 1, wherein: said pattern history table has M
entries arranged in rows and N columns, each of said rows has a stored
tag value associated therewith, and each of said rows has an
instruction-specific portion of said global history buffer associated
therewith; said particular instruction has a program counter associated
therewith; and said predicting comprises: hashing said program counter to
obtain a hash having log M bits; extracting a tag from said program
counter; accessing said one of said plurality of entries in said pattern
history table based on said hash; determining whether that one of said
stored tag values corresponding to a given one of said rows in which said
one of said plurality of entries in said pattern history table resides
matches said extracted tag; if said one of said stored tag values
corresponding to said given one of said rows in which said one of said
plurality of entries in said pattern history table resides does not match
said extracted tag, making a default prediction; and if said one of said
stored tag values corresponding to said given one of said rows in which
said one of said plurality of entries in said pattern history table
resides does indeed match said extracted tag, accessing an entry in said
given one of said rows based on traversing a number of contiguous bits
starting with a most significant bit of a corresponding one of said
instruction-specific portions of said global history buffer, until a
first bit that is different from said most significant bit, said
instruction-specific portions of said global history buffer having
N-bits.

5. The method of claim 1, wherein: said global history buffer stores
combined results for all given ones of said plurality of instructions
requiring said memory access; said pattern history table has M entries
arranged in rows and N columns, and each of said rows has a stored tag
value associated therewith; said particular instruction has a program
counter associated therewith; and said predicting comprises: extracting a
tag from said program counter; accessing a corresponding one of said rows
for said one of said plurality of entries in said pattern history table
based on said extracted tag; and accessing an entry in said corresponding
one of said rows based on traversing a number of contiguous bits starting
with a most significant bit of said global history buffer, until a first
bit that is different from said most significant bit.

6. The method of claim 1, wherein: said pattern history table has M
entries arranged in rows and N columns, each of said rows has a stored
tag value associated therewith, and each of said rows has an
instruction-specific portion of said global history buffer associated
therewith; said particular instruction has a program counter associated
therewith; and said predicting comprises: extracting a tag from said
program counter; accessing a corresponding one of said rows for said one
of said plurality of entries in said pattern history table based on said
extracted tag; and accessing an entry in said corresponding one of said
rows based on traversing a number of contiguous bits starting with a most
significant bit of a corresponding one of said instruction-specific
portions of said global history buffer, until a first bit that is
different from said most significant bit, said instruction-specific
portions of said global history buffer having N-bits.

7. A hardware processor comprising: a decode pipeline stage which decodes
one particular instruction of a plurality of instructions; hard-wired
logic circuitry which determines that said particular instruction
requires a memory access; a pattern history hardware table within said
decode pipeline stage; a global history buffer within said decode
pipeline stage; hard-wired logic circuitry which, responsive to
determining that said particular instruction requires a memory access,
predicts whether said memory access will result in a cache miss, said
predicting in turn comprising accessing one of a plurality of entries in
said pattern history hardware table, said accessing being based, at least
in part, upon at least a most recent entry in said global history buffer,
said pattern history hardware table storing a plurality of predictions,
said global history buffer storing actual results of previous memory
accesses as one of cache hits and cache misses; an issue pipeline stage
which schedules at least one additional one of said plurality of
instructions in accordance with said predicting; an execution pipeline
stage which actually executes said particular instruction; and hard-wired
logic circuitry which updates said pattern history hardware table and
said global history buffer subsequent to said actual execution of said
particular instruction, to reflect whether said predicting was accurate.

8. The hardware processor of claim 7, wherein: said global history buffer
stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history hardware
table comprises a miss pattern history hardware table and a hit pattern
history hardware table; and said hard-wired logic circuitry which
predicts comprises: hard-wired logic circuitry which accesses said most
recent entry in said global history buffer to determine whether said most
recent entry in said global history buffer comprises a hit or a miss;
hard-wired logic circuitry which accesses said miss pattern history
hardware table if said most recent entry in said global history buffer
comprises said miss, said accessing of said miss pattern history hardware
table comprising accessing an entry therein corresponding to a number of
contiguous misses in said global history buffer, including said most
recent entry in said global history buffer; and hard-wired logic
circuitry which accesses said hit pattern history hardware table if said
most recent entry in said global history buffer comprises said hit, said
accessing of said hit pattern history hardware table comprising accessing
an entry therein corresponding to a number of contiguous hits in said
global history buffer, including said most recent entry in said global
history buffer.

9. The hardware processor of claim 7, wherein: said global history buffer
stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history hardware
table has M entries arranged in rows and N columns, and each of said rows
has a stored tag value associated therewith; said particular instruction
has a program counter associated therewith; and said hard-wired logic
circuitry which predicts comprises: hard-wired logic circuitry which
hashes said program counter to obtain a hash having log M bits;
hard-wired logic circuitry which extracts a tag from said program
counter; hard-wired logic circuitry which accesses said one of said
plurality of entries in said pattern history hardware table based on said
hash; hard-wired logic circuitry which determines whether that one of
said stored tag values corresponding to a given one of said rows in which
said one of said plurality of entries in said pattern history hardware
table resides matches said extracted tag; hard-wired logic circuitry
which, if said one of said stored tag values corresponding to said given
one of said rows in which said one of said plurality of entries in said
pattern history hardware table resides does not match said extracted tag,
makes a default prediction; and hard-wired logic circuitry which, if said
one of said stored tag values corresponding to said given one of said
rows in which said one of said plurality of entries in said pattern
history hardware table resides does indeed match said extracted tag,
accesses an entry in said given one of said rows based on traversing a
number of contiguous bits starting with a most significant bit of said
global history buffer, until a first bit that is different from said most
significant bit, said global history buffer being an N-bit global history
buffer.

10. The hardware processor of claim 7, wherein: said pattern history
hardware table has M entries arranged in rows and N columns, each of said
rows has a stored tag value associated therewith, and each of said rows
has an instruction-specific portion of said global history buffer
associated therewith; said particular instruction has a program counter
associated therewith; and said hard-wired logic circuitry which predicts
comprises: hard-wired logic circuitry which hashes said program counter
to obtain a hash having log M bits; hard-wired logic circuitry which
extracts a tag from said program counter; hard-wired logic circuitry
which accesses said one of said plurality of entries in said pattern
history hardware table based on said hash; hard-wired logic circuitry
which determines whether that one of said stored tag values corresponding
to a given one of said rows in which said one of said plurality of
entries in said pattern history hardware table resides matches said
extracted tag; hard-wired logic circuitry which, if said one of said
stored tag values corresponding to said given one of said rows in which
said one of said plurality of entries in said pattern history hardware
table resides does not match said extracted tag, makes a default
prediction; and hard-wired logic circuitry which, if said one of said
stored tag values corresponding to said given one of said rows in which
said one of said plurality of entries in said pattern history hardware
table resides does indeed match said extracted tag, accesses an entry in
said given one of said rows based on traversing a number of contiguous
bits starting with a most significant bit of a corresponding one of said
instruction-specific portions of said global history buffer, until a
first bit that is different from said most significant bit, said
instruction-specific portions of said global history buffer having
N-bits.

11. The hardware processor of claim 7, wherein: said global history
buffer stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history hardware
table has M entries arranged in rows and N columns, and each of said rows
has a stored tag value associated therewith; said particular instruction
has a program counter associated therewith; and said hard-wired logic
circuitry which predicts comprises: hard-wired logic circuitry which
extracts a tag from said program counter; hard-wired logic circuitry
which accesses a corresponding one of said rows for said one of said
plurality of entries in said pattern history hardware table based on said
extracted tag; and hard-wired logic circuitry which accesses an entry in
said corresponding one of said rows based on traversing a number of
contiguous bits starting with a most significant bit of said global
history buffer, until a first bit that is different from said most
significant bit, said global history buffer being an N-bit global history
buffer.

12. The hardware processor of claim 7, wherein: said pattern history
hardware table has M entries arranged in rows and N columns, each of said
rows has a stored tag value associated therewith, and each of said rows
has an instruction-specific portion of said global history buffer
associated therewith; said particular instruction has a program counter
associated therewith; and said hard-wired logic circuitry which predicts
comprises: hard-wired logic circuitry which extracts a tag from said
program counter; hard-wired logic circuitry which accesses a
corresponding one of said rows for said one of said plurality of entries
in said pattern history hardware table based on said extracted tag; and
hard-wired logic circuitry which accesses an entry in said corresponding
one of said rows based on traversing a number of contiguous bits starting
with a most significant bit of a corresponding one of said
instruction-specific portions of said global history buffer, until a
first bit that is different from said most significant bit, said
instruction-specific portions of said global history buffer having
N-bits.

13. A hardware processor comprising: means for decoding, in a decode
stage of a hardware processor pipeline of said hardware processor, one
particular instruction of a plurality of instructions; means for
determining that said particular instruction requires a memory access;
means for, responsive to determining that said particular instruction
requires a memory access, predicting whether said memory access will
result in a cache miss, said predicting in turn comprising accessing one
of a plurality of entries in a pattern history table stored as a hardware
table in said decode stage, said accessing being based, at least in part,
upon at least a most recent entry in a global history buffer, said
pattern history table storing a plurality of predictions, said global
history buffer storing actual results of previous memory accesses as one
of cache hits and cache misses; means for scheduling at least one
additional one of said plurality of instructions in accordance with said
predicting; and means for updating said pattern history table and said
global history buffer subsequent to actual execution of said particular
instruction in an execution stage of said hardware processor pipeline, to
reflect whether said predicting was accurate.

14. The hardware processor of claim 13, wherein: said global history
buffer stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history table
comprises a miss pattern history table and a hit pattern history table;
and said means for predicting comprise: means for accessing said most
recent entry in said global history buffer to determine whether said most
recent entry in said global history buffer comprises a hit or a miss;
means for accessing said miss pattern history table if said most recent
entry in said global history buffer comprises said miss, said accessing
of said miss pattern history table comprising accessing an entry therein
corresponding to a number of contiguous misses in said global history
buffer, including said most recent entry in said global history buffer;
and means for accessing said hit pattern history table if said most
recent entry in said global history buffer comprises said hit, said
accessing of said hit pattern history table comprising accessing an entry
therein corresponding to a number of contiguous hits in said global
history buffer, including said most recent entry in said global history
buffer.

15. The hardware processor of claim 13, wherein: said global history
buffer stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history table has
M entries arranged in rows and N columns, and each of said rows has a
stored tag value associated therewith; said particular instruction has a
program counter associated therewith; and said means for predicting
comprise: means for hashing said program counter to obtain a hash having
log M bits; means for extracting a tag from said program counter; means
for accessing said one of said plurality of entries in said pattern
history table based on said hash; means for determining whether that one
of said stored tag values corresponding to a given one of said rows in
which said one of said plurality of entries in said pattern history table
resides matches said extracted tag; means for, if said one of said stored
tag values corresponding to said given one of said rows in which said one
of said plurality of entries in said pattern history table resides does
not match said extracted tag, making a default prediction; and means for,
if said one of said stored tag values corresponding to said given one of
said rows in which said one of said plurality of entries in said pattern
history table resides does indeed match said extracted tag, accessing an
entry in said given one of said rows based on traversing a number of
contiguous bits starting with a most significant bit of said global
history buffer, until a first bit that is different from said most
significant bit, said global history buffer being an N-bit global history
buffer.

16. The hardware processor of claim 13, wherein: said pattern history
table has M entries arranged in rows and N columns, each of said rows has
a stored tag value associated therewith, and each of said rows has an
instruction-specific portion of said global history buffer associated
therewith; said particular instruction has a program counter associated
therewith; and said means for predicting comprise: means for hashing said
program counter to obtain a hash having log M bits; means for extracting
a tag from said program counter; means for accessing said one of said
plurality of entries in said pattern history table based on said hash;
means for determining whether that one of said stored tag values
corresponding to a given one of said rows in which said one of said
plurality of entries in said pattern history table resides matches said
extracted tag; means for, if said one of said stored tag values
corresponding to said given one of said rows in which said one of said
plurality of entries in said pattern history table resides does not match
said extracted tag, making a default prediction; and means for, if said
one of said stored tag values corresponding to said given one of said
rows in which said one of said plurality of entries in said pattern
history table resides does indeed match said extracted tag, accessing an
entry in said given one of said rows based on traversing a number of
contiguous bits starting with a most significant bit of a corresponding
one of said instruction-specific portions of said global history buffer,
until a first bit that is different from said most significant bit, said
instruction-specific portions of said global history buffer having
N-bits.

17. The hardware processor of claim 13, wherein: said global history
buffer stores combined results for all given ones of said plurality of
instructions requiring said memory access; said pattern history table has
M entries arranged in rows and N columns, and each of said rows has a
stored tag value associated therewith; said particular instruction has a
program counter associated therewith; and said means for predicting
comprise: means for extracting a tag from said program counter; means for
accessing a corresponding one of said rows for said one of said plurality
of entries in said pattern history table based on said extracted tag; and
means for accessing an entry in said corresponding one of said rows based
on traversing a number of contiguous bits starting with a most
significant bit of said global history buffer, until a first bit that is
different from said most significant bit, said global history buffer
being an N-bit global history buffer.

18. The hardware processor of claim 13, wherein: said pattern history
table has M entries arranged in rows and N columns, each of said rows has
a stored tag value associated therewith, and each of said rows has an
instruction-specific portion of said global history buffer associated
therewith; said particular instruction has a program counter associated
therewith; and said means for predicting comprise: means for extracting a
tag from said program counter; means for accessing a corresponding one of
said rows for said one of said plurality of entries in said pattern
history table based on said extracted tag; and means for accessing an
entry in said corresponding one of said rows based on traversing a number
of contiguous bits starting with a most significant bit of a
corresponding one of said instruction-specific portions of said global
history buffer, until a first bit that is different from said most
significant bit, said instruction-specific portions of said global
history buffer having N-bits.

Description:

FIELD OF THE INVENTION

[0001] The present invention relates to the electrical, electronic and
computer arts, and, more particularly, to computer architecture and the
like.

BACKGROUND OF THE INVENTION

[0002] The latency of first-level (L1) cache misses is a significant
factor with respect to pipeline stalls in processors. The latency of a
load instruction depends on whether or not the data exists in the L1
cache. However, at the time the load instruction is issued, it is not
possible to determine whether or not the data will be present in the L1
cache. This affects issuing those instructions that are dependent on the
load instruction, and leads to pipeline stalls and performance
degradation.

[0003] In most of today's processors, the scheduler assumes that the data
accessed by the load instruction will be available in the L1 cache.
Dependent instructions are issued based on this assumption. A
reject-and-replay mechanism is triggered when the load misses the cache.
When the load instruction incurs a cache miss, it leads to not only a
performance penalty, but also a power overhead.

[0004] The Compaq Alpha 21264 used the most significant bit of a 4-bit
saturating counter as the load instruction's hit or miss prediction. The
counter was incremented by one every time a load hit, and decremented by
two every time a load missed. Typically, only a small percentage of
static load instructions are responsible for a large fraction of the
cache misses. The above counter can be easily saturated with the load
instructions that usually have cache hits, and hence lead to
mis-predictions.
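The Alpha 21264 scheme described above can be sketched in software. The class below is an illustrative model only (the names and the 4-bit clamping range are assumptions, not part of any disclosure): the most significant bit of a 4-bit saturating counter is the prediction, incremented by one on a hit and decremented by two on a miss.

```python
class SaturatingHitMissPredictor:
    """Software model of a 4-bit saturating hit/miss counter (illustrative)."""

    def __init__(self):
        self.counter = 0  # 4-bit counter, clamped to the range 0..15

    def predict_hit(self):
        # The most significant bit of the 4-bit counter is the prediction:
        # 1 => predict the load will hit the cache.
        return (self.counter >> 3) & 1 == 1

    def update(self, was_hit):
        # Train after the load resolves: +1 on a hit, -2 on a miss.
        if was_hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)
```

The asymmetric update (+1/-2) biases the counter toward predicting a miss, but as the paragraph above notes, a long run of hitting loads still saturates it high and causes the rare missing loads to be mispredicted.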

[0006] Principles of the invention provide techniques for predicting cache
misses using data access behavior and instruction address. In one aspect,
an exemplary method includes the steps of decoding, in a decode stage of
a hardware processor pipeline, one particular instruction of a plurality of
instructions; determining that the particular instruction requires a
memory access; and, responsive to determining that the particular
instruction requires a memory access, predicting whether the memory
access will result in a cache miss. The predicting in turn includes
accessing one of a plurality of entries in a pattern history table stored
as a hardware table in the decode stage. The accessing is based, at least
in part, upon at least a most recent entry in a global history buffer.
The pattern history table stores a plurality of predictions. The global
history buffer stores actual results of previous memory accesses as one
of cache hits and cache misses. Additional steps include scheduling at
least one additional one of the plurality of instructions in accordance
with the predicting; and updating the pattern history table and the
global history buffer subsequent to actual execution of the particular
instruction in an execution stage of the hardware processor pipeline, to
reflect whether the predicting was accurate.
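The exemplary method can be modeled in software as a small global-history predictor. The sketch below is a hypothetical model (the 8-bit history length, the 2-bit counters, and all names are assumptions for illustration): a global history buffer records actual hit/miss outcomes, a pattern history table of saturating counters indexed by the recent history supplies the prediction, and both are updated after actual execution.

```python
GHB_BITS = 8  # assumed global-history length for this sketch

class CacheMissPredictor:
    """Illustrative model: PHT of 2-bit counters indexed by a global history buffer."""

    def __init__(self):
        self.ghb = 0                      # shift register of outcomes, bit 1 = miss
        self.pht = [1] * (1 << GHB_BITS)  # 2-bit counters, initialized weakly to "hit"

    def predict_miss(self):
        # At decode time, the PHT entry selected by the current global
        # history gives the prediction (counter >= 2 => predict miss).
        return self.pht[self.ghb] >= 2

    def update(self, was_miss):
        # After actual execution, train the selected PHT entry and shift
        # the actual outcome into the global history buffer.
        idx = self.ghb
        if was_miss:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        self.ghb = ((self.ghb << 1) | (1 if was_miss else 0)) & ((1 << GHB_BITS) - 1)
```

In hardware, the table lookup and the history shift are single-cycle operations in the decode stage, so the prediction is available in time to steer the scheduler.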

[0007] In another aspect, an exemplary hardware processor includes a
decode pipeline stage which decodes one particular instruction of a
plurality of instructions; hard-wired logic circuitry which determines
that the particular instruction requires a memory access; a pattern
history hardware table within the decode pipeline stage; and a global
history buffer within the decode pipeline stage. Also included is
hard-wired logic circuitry which, responsive to determining that the
particular instruction requires a memory access, predicts whether the
memory access will result in a cache miss. The predicting in turn
includes accessing one of a plurality of entries in the pattern history
hardware table. The accessing is based, at least in part, upon at least a
most recent entry in the global history buffer. The pattern history
hardware table stores a plurality of predictions. The global history
buffer stores actual results of previous memory accesses as one of cache
hits and cache misses. Also included are an issue pipeline stage which
schedules at least one additional one of the plurality of instructions in
accordance with the predicting; an execution pipeline stage which
actually executes the particular instruction; and hard-wired logic
circuitry which updates the pattern history hardware table and the global
history buffer subsequent to the actual execution of the particular
instruction, to reflect whether the predicting was accurate.

[0008] In still another aspect, an exemplary hardware processor includes
means for carrying out the method steps. The means include various
structures disclosed herein, such as hardware tables, buffers such as
shift registers, and hard-wired logic circuitry to implement the method
steps.

[0009] As used herein, "facilitating" an action includes performing the
action, making the action easier, helping to carry the action out, or
causing the action to be performed. Thus, by way of example and not
limitation, instructions executing on one processor might facilitate an
action carried out by instructions executing on a remote processor, by
sending appropriate data or commands to cause or aid the action to be
performed. For the avoidance of doubt, where an actor facilitates an
action by other than performing the action, the action is nevertheless
performed by some entity or combination of entities.

[0010] Techniques of the present invention can provide substantial
beneficial technical effects. For example, one or more embodiments may
provide one or more of the following advantages: [0011] If the outcome
of the cache access can be successfully predicted using the methods
described herein, the issue logic of the processor can re-schedule the
instructions such that the dependent instructions following the load will
not be issued, and the resources can be put to better use by issuing
independent instructions to improve the performance. [0012] In a
multi-threaded processor, cache hit/miss prediction information can be
used to launch a load instruction from a thread predicted to miss the
cache in advance so that the stall can be overlapped by executing
instructions from the other threads to increase the overall throughput.
[0013] One or more embodiments of the invention enable out-of-order
scheduling, by allowing a younger load instruction predicted to have a
cache miss, and with no data dependencies, to go ahead of an older load
predicted to have a cache hit, and with no data dependencies.

[0014] These and other features and advantages of the present invention
will become apparent from the following detailed description of
illustrative embodiments thereof, which is to be read in connection with
the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 shows memory and cache hierarchy of a modern multi-core
system, which represents a non-limiting example of a system wherein one
or more embodiments can be implemented;

[0016] FIG. 2 depicts a first exemplary embodiment for per-load hit/miss
prediction, according to an aspect of the invention;

[0017] FIG. 3 shows exemplary prediction accuracy for L1 Dcache, according
to an aspect of the invention;

[0018] FIG. 4 depicts a second exemplary embodiment for per-load hit/miss
prediction using global hit/miss behavior, according to an aspect of the
invention;

[0019] FIGS. 5 & 6 show exemplary prediction accuracy for L1 and L2
Dcache, according to an aspect of the invention;

[0020] FIG. 7 depicts a third exemplary embodiment for global application
hit/miss prediction, according to an aspect of the invention;

[0021] FIGS. 8-11 show additional exemplary prediction accuracy for L1 and
L2 Dcache, according to an aspect of the invention; and

[0025] One or more embodiments combine the knowledge of global cache
hit/miss behavior of the data accesses of an application, and the
individual load instruction's hit-miss behavior, to accurately predict
the cache hit-miss behavior of a load instruction. If the outcome of the
cache access can be successfully predicted, the issue-logic of the
processor can effectively re-schedule the instructions such that the
dependent instructions following the load will not be issued, and the
resources can be put to better use by issuing independent instructions.
In addition, in a multi-threaded processor, such prediction information
of the cache access outcome can be used to launch a load instruction from
a thread predicted to miss the cache in advance so that the stall can be
overlapped by executing instructions from the other threads to increase
the overall throughput.

[0026] Furthermore, one or more instances enable out-of-order scheduling,
by allowing a younger load instruction predicted to have a cache miss,
and with no data dependencies, to go ahead of an older load predicted to
have a cache hit, and with no data dependencies.

[0027] The bursty nature of cache misses of applications is well-known.
One or more embodiments exploit this behavior to design a predictor to
determine if the application is likely to incur a cache miss/hit.

[0028] In one or more embodiments, it is sufficient to track a change from
a "miss" dominated region to a "hit" dominated region. A contiguous
string of "hits" or "misses" will determine whether the next cache access
is a hit or a miss. This is different from the behavior of branches.

[0029] One or more embodiments employ the "generic" idea in branch
direction predictors. Typically, conditional branches have two possible
outcomes, "not taken" (NT) in which the instruction following the branch
instruction is the next instruction in the static program sequence, or
"taken" (T) in which the instruction following the branch instruction is
the target instruction which may not be the next contiguous instruction
in the static program sequence. Typically, branch predictors are employed
to predict the outcome of the branches (NT or T). Typically, these
predictors use a saturating counter which is incremented if the branch is
"taken" and decremented if the branch resolves as "not taken." If the
most significant bit of the saturating counter is set (to 1), the branch
is predicted taken; otherwise, the branch is predicted not taken. One or
more embodiments of the invention also use a predictor to predict the
cache access outcome as "hit" or "miss," and one or more embodiments of
an exemplary method use a saturation counter for the prediction. Thus,
one or more embodiments employ the "generic" idea in branch direction
predictors, but effectively exploit the fact that cache misses are
bursty, such that a small percentage of load or store instructions are
responsible for a large percentage of cache misses, and that, as just
noted, it is sufficient to track a change from a "miss" dominated region
to a "hit" dominated region. In one or more embodiments, predict the next
cache access outcome based on the length of the last seen contiguous
string of "hits" or "misses." A Global History Buffer (GHB) tracks the
last N cache access outcomes, and a Pattern History Table (PHT) predicts
the outcome of the next reference. The PHT is indexed using the length of
the last contiguous string of "hits" or "misses" from the GHB.
Significantly, in one or more embodiments, the number of entries in the
PHT is N (the length of the GHB), and not 2^N.
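
The run-length indexing described in this paragraph can be sketched in software as follows (an illustrative model only; the function name and the bit ordering, with the most recent outcome first, are assumptions of the sketch):

```python
def pht_index(ghb_bits):
    """Length of the contiguous run of identical outcomes at the
    most-recent end of the GHB; used directly as the PHT index."""
    run = 1
    for bit in ghb_bits[1:]:
        if bit != ghb_bits[0]:
            break
        run += 1
    return run

# With 1 = hit and 0 = miss, a GHB of miss, miss, hit, miss
# (most recent outcome first) yields an index of 2: two contiguous misses.
```

Because the index can never exceed the GHB length N, a PHT of N entries suffices, which is the point made above.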

[0030] Global Hit/Miss Prediction: Referring now to FIG. 7, in one aspect,
a two-level predictor is provided to predict whether the current
reference will be a hit or miss based on the sequence of hits and misses
seen in the past. One hardware structure includes GHB (global history
buffer) 702, having N bits. In particular, in one or more embodiments,
the history register 702 is an N-bit shift register which shifts in bits
representing the cache access outcome (hit or miss) of the most recent N
accesses.

[0031] Additional hardware structures include two PHTs (pattern history
tables), namely, Miss-PHT 704 and Hit-PHT 706. In one or more
embodiments, each PHT size is N entries (and not 2^N entries). The
PHT has only 2-bits per entry to predict hit or miss. If the most
significant bit (MSB) is 1, predict a miss, else predict a hit. One
method step useful in connection with the hardware embodiment of FIG. 7
includes determining which PHT to access. In some cases, if the most
significant bit (MSB) of the GHB is a "Miss," as seen at 708, access the
Miss-PHT 704, otherwise, access the Hit-PHT 706. Another method step
useful in connection with the hardware embodiment of FIG. 7 includes
determining the PHT index. In some cases, start traversing the GHB bits
starting from the MSB 708. Stop the traversal at the first bit that is
different from the MSB. The number of bits traversed is used as the index
of the PHT. The maximum number of bits traversed is equal to the length
of the GHB, and hence a PHT size of N entries is sufficient. In the
example of FIG. 7, bit 710 is the same as 708, while bit 712 is the first
bit that is different. (Bit 714 is labeled for completeness.) The
traversal would be stopped at bit 712. The number of bits traversed is
two. Still another method step useful in connection with the hardware
embodiment of FIG. 7 includes updating the PHT. In some cases, after the
outcome of the current prediction is known, before entering the outcome
in the GHB 702, again determine the PHT's index, and access the
appropriate PHT, as described just above. In particular, if the current
outcome is a miss, increment the chosen PHT entry, while if the current
outcome is a hit, decrement the chosen PHT entry.
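
The steps above (PHT selection, indexing, and update) can be sketched as an illustrative software model of the FIG. 7 hardware; the GHB length, the list-based encoding (most recent outcome first, True for a miss), and all names are assumptions of the sketch, not the hardware implementation:

```python
N = 8                 # GHB length (a design-time choice; value is illustrative)
ghb = []              # most recent outcome first; True = miss, False = hit
miss_pht = [0] * N    # 2-bit saturating counters, indexed by run length - 1
hit_pht = [0] * N

def run_length():
    """Bits traversed from the most recent outcome until the first
    differing bit; at most N, so N PHT entries suffice."""
    run = 1
    for b in ghb[1:]:
        if b != ghb[0]:
            break
        run += 1
    return run

def predict():
    """Return True for a predicted miss. Defaults to hit when empty."""
    if not ghb:
        return False
    pht = miss_pht if ghb[0] else hit_pht     # most recent outcome picks the PHT
    return (pht[run_length() - 1] >> 1) == 1  # MSB of the 2-bit counter set -> miss

def update(actual_miss):
    """Train the chosen PHT entry, then shift the outcome into the GHB."""
    if ghb:
        pht = miss_pht if ghb[0] else hit_pht
        i = run_length() - 1
        pht[i] = min(pht[i] + 1, 3) if actual_miss else max(pht[i] - 1, 0)
    ghb.insert(0, actual_miss)
    del ghb[N:]       # keep only the most recent N outcomes
```

For example, after a training stream in which every single hit is followed by a miss, predict() returns True (miss) whenever the most recent outcome is a lone hit.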

[0033] LOAD and STORE instructions are the memory instructions that access
data from a cache. Instructions come with an instruction address (the
program counter or PC). In one or more embodiments, consider that the
LOAD instruction has, as one of the operands of the instruction, fields
to help compute the memory address, such that the data at that address in
the cache memory can be accessed and read into a register. If there is a
miss in L1, go to L2, and so on; if all cache levels miss, go to the main
memory. The data is brought back to the register. In one or more
embodiments, do not await decoding of the instruction fields (operands),
address determination, or cache access, to predict whether the data is
present in the cache or not. Instead, make such a prediction well ahead
of time.

[0034] So, the LOAD instruction is fetched and decoded so as to determine
that it is in fact a LOAD instruction; however, as just noted, decoding
of the operands, address determination, and cache access have not yet
taken place. At this stage, predict whether the data to be accessed by
this instruction will be present in the cache. Instead of waiting to
determine the memory address, one or more embodiments employ the
instruction's address (instruction LOAD PC) to predict whether the data
accessed by the instruction will be in the cache. Thus, LOAD PC 208,
rather than the data address accessed by the LOAD instruction, is used to
make the prediction.

[0035] One or more embodiments are implemented entirely in hardware. In
one or more embodiments, suitable hardware elements include hardware
tables, hardware counters, and similar structures. One or more
embodiments use a few bits of the LOAD PC. For example, suppose there are
16 entries in the pattern history table (PHT) 204, which is the table
used for the prediction of whether the given load instruction will
trigger a cache hit or a cache miss for the data it is accessing. The PHT
is preferably implemented as a hardware table. In the non-limiting
example of a 16-entry PHT, each table entry can be uniquely addressed
with four bits (ranging from binary 0000 to binary 1111).

[0036] In the non-limiting example, use 4 bits of the LOAD PC having a
value ranging from decimal zero to decimal fifteen; such 4 bits uniquely
specify the appropriate location in the PHT. The Hash1 function 206 could
include, for example, taking the middle four bits of the LOAD PC 208 and
using same for the index. In the general case, if the PHT has M entries,
log M bits are needed for unique addressing. Other non-limiting exemplary
hash functions include taking the highest ordered four bits of the LOAD
PC and the lowest-ordered four bits of the LOAD PC, and applying an XOR
operation to obtain four new bits which are used to index the PHT. In
general, where there are M entries in the PHT, log M bits are used for
the index; those log M bits can be determined by some combination of the
LOAD PC bits. Given the teachings herein, the skilled artisan can select
an appropriate hash function. A good hash function should result in as
few collisions as possible. If two different LOAD PCs have the same value
for the four bits of the Hash1 function, they will go to the same entry
of the PHT, resulting in a collision wherein the update from one load
will be lost because the other LOAD will overwrite it. The hash is used
to avoid an expensive operation of searching through the entries and will
inevitably result in some amount of collisions.
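
Two candidate Hash1 functions of the kind described (middle bits of the LOAD PC, and an XOR fold of high and low bits) can be sketched as follows; the exact bit positions chosen are illustrative assumptions:

```python
PHT_ENTRIES = 16   # M = 16 entries, so log2(M) = 4 index bits

def hash_middle_bits(load_pc):
    """Take four 'middle' bits of the LOAD PC (bits 4..7 here,
    an arbitrary illustrative choice)."""
    return (load_pc >> 4) & 0xF

def hash_xor_fold(load_pc, pc_bits=32):
    """XOR the highest-ordered four bits of the LOAD PC with the
    lowest-ordered four bits to obtain a 4-bit PHT index."""
    return ((load_pc >> (pc_bits - 4)) & 0xF) ^ (load_pc & 0xF)
```

Either function maps any LOAD PC onto one of the M table entries; as noted above, two PCs that collide on the same index will overwrite each other's state.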

[0037] Another aspect to assist in making a good prediction is to employ a
significant portion of the LOAD PC 208 (by way of example and not
limitation, the uppermost 20 bits) as a tag that is stored in the PHT
(see tag1, tag2, tag3, and tag4 stored in the first column of the PHT
204). Thus, the Hash1 is used to address a particular entry of the PHT
and then a check is made whether the tag of the current load instruction
matches the tag stored in that row of the PHT 204. If they match, then it
is concluded that the latest update to the entry in the PHT was likely
made by the same LOAD. If there is no match, do not read the prediction
corresponding to the four bits from Hash1, as it is likely that a
different instruction hashed into the table at the location addressed by
the four bits of Hash1, thus avoiding mis-prediction. Thus, the tagging
feature increases the confidence that a given entry in the PHT is from
the last occurrence of the same LOAD and not from some different LOAD.

[0038] To review, the Hash1 addresses a unique entry in the PHT 204; and a
comparison is made between the tag corresponding to that entry in the PHT
and the tag (i.e., corresponding bits) for the current PC. In case of a
match, use the entry for the prediction; if not, do not read the entry
for the prediction. Thus, use Hash1 to access the PHT 204 and if there is
a match for the tag, next use the GHB (global history buffer) entry in
that particular row to determine the column of the PHT to be used to make
the prediction. In the non-limiting example, the GHB 202 has the most
recent N outcomes for the particular tag (in this case, for the
particular LOAD). Typically, the same LOAD instruction will be seen
several times when the program is executed (for example, it may be
repeated in a loop). For each occurrence of a particular load
instruction, whatever happened when it accessed the cache (i.e., that
(true, not predicted) outcome) is pushed into the global history buffer
of N bits. GHB is like a shift register--push the most recent outcome on
the right side and shift all other entries left one bit. Since the only
outcome is hit or miss, only a single bit is needed to record the result
(for example, encode a hit as a logical one and a miss as a logical
zero). The most recent outcome resides in the right-hand (most
significant) bit 218.

[0039] Now, depending on the number of columns to be used (which in turn
depends on the available amount of hardware space/area), examine that
many bits of the GHB, starting from the most recent bit. In the
non-limiting example of FIG. 2, four columns are used (the first two
columns of table 204 are the tags and GHBs while the last four columns
are the actual PHT entries, such that four columns are used in this
example). Note that other numbers of columns could be used in other
embodiments; say, eight columns in another non-limiting example. Due to
the bursty nature of cache misses, it is sufficient to track a change
from a "miss" dominated region to a "hit" dominated region. A contiguous
string of "hits" or "misses" will determine whether the next cache access
is a hit or a miss. In one exemplary embodiment, start traversing the GHB
bits starting from the MSB. Stop the traversal at the first bit that is
different from the MSB. The number of bits traversed is used as the index
of the PHT. The maximum number of bits traversed is equal to the length
of the GHB, and hence the maximum number of columns after the second
(GHB) column in the table 204, is N.

[0040] Here, the GHB has 4 bits, and correspondingly the PHT has 4
columns. Thus, if there are N bits in the GHB, then there are N columns
after the second (GHB) column in the table 204. In the example, the
rightmost bit 218 is a miss, the next 220 is a miss, and the third 222 is
a hit, and then a miss 224. So starting from the rightmost bit 218, the
total number of bits traversed before the outcome changed compared to bit
218 is two. So the second column out of the 4 columns of the PHT is
accessed.

[0041] This procedure is carried out for every LOAD instruction that is
seen. Because the tags are saved, which are portions of the load
instruction's PC, as soon as it is determined that a particular
instruction is a LOAD, use bits of the LOAD PC 208 to hash 206 into the
PHT 204, use the high order bits as a tag, and check for a match with the
tag stored in the first column of the table 204. If there is a match,
read out the prediction based on the GHB's number of contiguous bits with
the same value as the MSB, and read the corresponding column output as
hit or miss to obtain the prediction as to whether there will be a cache
miss or a cache hit associated with the instruction. The prediction is
used to benefit execution. If a cache miss is predicted, avoid pushing
other, dependent instructions down the pipeline and instead schedule
different instructions. If the prediction of a miss is correct, time is
saved by scheduling other instructions instead; if the prediction of a
miss is incorrect, some time is lost but accuracy is not impacted.
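
The per-load lookup and update just walked through can be modeled as below; the table size, GHB length, and the tag/index split of the PC are toy values for illustration (a real design would use a hash of the PC for the index and a wider tag, as described above):

```python
M, N = 4, 4   # table entries and per-entry GHB length (toy sizes)

# Each row of the PC-based table: a tag, a per-load GHB, and N 2-bit counters.
table = [{"tag": None, "ghb": [], "counters": [0] * N} for _ in range(M)]

def split_pc(load_pc):
    """Hypothetical split: low bits index the table, high bits form the tag."""
    return load_pc % M, load_pc // M

def run_length(ghb):
    run = 1
    for b in ghb[1:]:
        if b != ghb[0]:
            break
        run += 1
    return run

def predict(load_pc):
    """Return True (miss), False (hit), or None when the tag does not match."""
    idx, tag = split_pc(load_pc)
    entry = table[idx]
    if entry["tag"] != tag or not entry["ghb"]:
        return None   # likely a different LOAD updated this row: no prediction
    return (entry["counters"][run_length(entry["ghb"]) - 1] >> 1) == 1

def update(load_pc, actual_miss):
    """Train the counter chosen by the run length, then shift the outcome in."""
    idx, tag = split_pc(load_pc)
    entry = table[idx]
    if entry["tag"] != tag:    # collision: the new LOAD takes over the row
        table[idx] = {"tag": tag, "ghb": [], "counters": [0] * N}
        entry = table[idx]
    if entry["ghb"]:
        i = run_length(entry["ghb"]) - 1
        c = entry["counters"]
        c[i] = min(c[i] + 1, 3) if actual_miss else max(c[i] - 1, 0)
    entry["ghb"].insert(0, actual_miss)
    del entry["ghb"][N:]
```

A return of None corresponds to the no-tag-match case described above, in which no prediction is read out.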

[0042] The exemplary embodiment of FIG. 2 represents a relatively
hardware-intensive implementation as compared to other exemplary
implementations herein, as a number of hardware tables are needed and the
PHT has quite a few entries to reduce the likelihood of collisions. Other
non-limiting exemplary embodiments herein have less hardware complexity
but trade-off other factors. In the example of FIG. 2, prediction
accuracy is high because of the use of the tags. There is a theoretical
possibility of two LOADs that hash to the same table entry "ping-ponging"
back and forth and constantly overwriting entries without ever predicting
due to lack of a tag match. Other embodiments predict more aggressively
and have lower prediction accuracy by relaxing certain constraints, thus
lowering hardware requirements.

[0043] Note that hardware tables are preferably implemented as hardware
registers (storage elements such as latches) residing in the processor
106 itself.

[0044] It should be noted that in some embodiments, instead of using the
Hash1 function 206, let the LOAD be any of the entries in the PHT. That
is, match the higher order bits of the program counter with every single
tag in the PHT (typically hundreds). Whichever one matches is the entry
of interest. This is an example of CAM lookup, discussed further below.
This provides added flexibility because a load now has, say, 256
different spots to reside in and there is less chance of a collision. A
suitable technique is used to "kick out" an entry when the 257th
LOAD comes (in a non-limiting example, kick out the oldest entry).
However, there is an expense in terms of power in the CAM lookup
embodiment, since all (say 256) tag entries in the PHT must be examined.
The use of the hash function avoids this by restricting to only a single
entry in the table. Thus, as compared to the CAM, the hash technique
simplifies the lookup but at the expense of a greater risk of collision
because two different loads may hash to the same location. On the other
hand, the CAM, as compared to the hash, gives the freedom to go anywhere
in the table, but at the cost of a more expensive lookup.
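
The CAM-style lookup with oldest-entry eviction can be sketched with an ordered map standing in for the hardware CAM (the capacity and names are illustrative assumptions; a real CAM compares all tags in parallel, which is the power cost noted above):

```python
from collections import OrderedDict

CAM_ENTRIES = 4        # illustrative; the text suggests e.g. 256

cam = OrderedDict()    # tag -> payload; insertion order tracks entry age

def cam_insert(tag, payload):
    """Insert a tag; when the table is full, kick out the oldest entry."""
    if tag in cam:
        cam.move_to_end(tag)
    elif len(cam) >= CAM_ENTRIES:
        cam.popitem(last=False)   # evict the oldest entry
    cam[tag] = payload

def cam_lookup(tag):
    """The tag may reside anywhere in the table; hardware searches all
    entries at once, here modeled by a dictionary lookup."""
    return cam.get(tag)
```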

[0045] By way of review, one hardware structure includes a GHB (global
history buffer) 202 of N bits. In particular, in one or more embodiments,
the history register 202 is an N-bit shift register which shifts in bits
representing the cache access outcome (hit or miss) of the most recent N
accesses. An additional hardware structure includes a PC-based-table 204.
Table 204 is an M-entry table, with each entry holding a tag of the PC
(in the example of FIG. 2, there are four entries corresponding to tag1,
tag2, tag3, and tag4), and a hit or miss predictor of depth N (based on
the GHB size of N bits) (in the example of FIG. 2, N=4; thus, the
hit-or-miss predictor corresponding to tag1 includes entries M, M, M, and
H). Each of the N hit- or miss-predictor entries is only two bits; a miss
is predicted if the MSB is 1, and hit is predicted if the MSB is 0. One
method step useful in connection with the hardware embodiment of FIG. 2
includes determining the hash index into the PC-based-table by applying a
hash function 206 to the PC 208. An example hash function could be some
log M bits of the PC. Another example of a hash function could be the
result of an XOR operation of 2 separate log M bits of the PC resulting
in the result having log M bits. The log M bits of the result of the hash
function can access only one of the M entries of the Table 204.

[0046] The values of M and N are determined at the design time based on
the nature of the applications that will be run on the machine. For
example, N could be a few 10s (like 30 or so), and M could be a few 100s
(like 256 or so).

[0047] One or more embodiments do not require a content-addressable memory
(CAM) lookup of the table. If a hash function is used as described above,
the load PC tag can reside in only one of the M entries of the Table 204
(the entry determined by the value of the log M bits chosen in the hash
function). This could potentially lead to collisions as more than one
load instruction could map to the same entry of the Table 204. To reduce
the chances of collision, it is possible to allow a load PC tag to reside
in any of the M locations of the Table 204. As discussed above, this can
be achieved using a CAM, where to find a match in the table 204, all the
entries are searched with the incoming tag of the load PC, and the
matching entry is used for prediction. Similarly, a new entry can be
inserted in any of the M positions. If the tag in the hash index does not
match the tag of the PC, then stop prediction because the update in the
PHT is likely from another LOAD instruction and not the LOAD instruction
of interest. In such cases, a default prediction (typically HIT) can be
employed.

[0048] Another method step useful in connection with the hardware
embodiment of FIG. 2 includes determining the hit or miss predictor index
for a given entry of the PC-based-table 204. In some instances, start
traversing the GHB bits starting from the MSB 218. Stop the traversal at
the first bit (here, 222) that is different from the MSB. The number of
bits traversed is used as the index of the PHT. Here, two bits 218, 220
are traversed (bits 222 and 224 are numbered for completeness). The
maximum number of bits traversed is equal to the length of the GHB, and
hence a size of N predictors per entry of the PC-based-table is
sufficient. Still another method step useful in connection with the
hardware embodiment of FIG. 2 includes updating the PC-based table entry.
After the outcome of the current prediction is known, before entering the
outcome in the GHB 202, again determine the hash index using the PC, and
access the appropriate hit or miss predictor index as described just
above. If the current outcome is a miss, increment the chosen hit or miss
predictor entry. If the current outcome is a hit, decrement the chosen
hit or miss predictor entry.

[0049] One or more embodiments relate to predicting whether a cache access
(load or store) will result in a cache miss. Based on such a prediction,
the processor can take some action, such as re-ordering the instruction
schedule so that other instructions go ahead of the instruction that is
predicted to result in a cache miss. One or more embodiments make use of
the fact that cache misses (and hits) come in bunches. In other words,
misses are bursty.

[0050] In one or more embodiments, when in a "miss regime" (i.e., a regime
where most of the cache accesses were misses), then the next access is
predicted to be a miss.

[0051] Referring again to FIG. 7, in one aspect, a record of the results
of the past x accesses (i.e., either hits or misses) for the application
is maintained in the Global History Table (GHT) 702 (the same as the GHB),
and historical information about the results of any accesses that were
made after each number of consecutive hits or misses is maintained (in
the hit PHT 706 and miss PHT 704). If the last access is a miss, and
there have been m consecutive misses, then the corresponding PHT is
examined to get information about recent behavior after m consecutive
misses. If there were generally misses after m consecutive misses, then
the next access is assumed to be a miss. A corresponding action is taken
for hits.

[0053] In general, all of the illustrative embodiments are preferably
carried out early in the processor pipeline (early in the decode stage of
the processor), as soon as it is determined that a given instruction is a
memory-accessing instruction, without even knowing what memory address is
going to be accessed. The PHTs are preferably implemented as hardware
tables in the processor, preferably in the decode stage of the processor
pipeline. In one or more embodiments, the given instruction is executed
normally regardless of the prediction, but the prediction result is used
in the scheduling of dependent instructions.

[0054] With continued reference to FIG. 7, FIG. 7 can be thought of as the
simplest of the exemplary embodiments presented herein. In general, a
program, during execution, will have many LOAD instructions seeking to
access data from memory. In the example of FIG. 7, rather than being
concerned with every separate LOAD instruction in the program, have, for
the entire program, a single hardware-implemented global history buffer
(GHB) 702 in the processor pipeline. Here, simply monitor cache misses
and hits for the entire program, and not for any particular LOAD. Every
cache hit or miss is pushed into the GHB, with the most recent outcome
being stored in the most significant bit (MSB) 708 and every other result
being pushed to the left. For example, a miss can be encoded as a logical
zero and a hit can be encoded as a logical one. GHB 702 is preferably
implemented as a shift register. In all embodiments, the contents of GHB
202, 402, 702 reflect actual results and not predictions.

[0055] Here, looking from the most recent outcome (MSB) 708, determine
whether in a "HIT Regime" (a number of hits in a row) or "MISS Regime" (a
number of misses in a row). This can be used to predict whether a
memory-accessing instruction will result in a cache hit or a cache miss.
In the exemplary embodiment of FIG. 7, pick out the most recent outcome,
here, a miss, and based on that access the miss pattern history table
(Miss-PHT) 704. If the most recent outcome had instead been a hit, then
the hit pattern history table (Hit-PHT) 706 would have been accessed. All
that remains to be determined is what entry in the Miss-PHT or Hit-PHT
should be used to make the prediction. To make such determination,
determine the number of contiguous hits or misses, starting from the most
recent one (MSB 708). Here, since the most recent actual outcome was a
miss, count the number of misses (here, two) until a hit is encountered
at 712. Since there are two contiguous misses, access the second entry of
the Miss-PHT 704. If there are N bits in the GHB 702, then the maximum
number of contiguous hits or misses is N. Accordingly, the Miss-PHT and
Hit-PHT each require N entries. The value of the entry is the prediction.

[0056] At this point, the LOAD instruction is still in the early part of
the pipeline (i.e., the decoder). The cache has not yet been accessed.
Later, the cache will actually be accessed and the actual outcome will be
pushed into GHB 702 as the MSB 708. Furthermore, the Miss-PHT, which was
used to make the prediction, will also be updated based on whether the
prediction was right or wrong (and similarly for the Hit-PHT if the
Hit-PHT had been used to make the prediction).

[0057] One or more embodiments thus carry out the indicated techniques
before accessing the cache, determine that an instruction will require a
memory access, and then predict whether it will be a cache miss or a
cache hit.

[0058] Reference should now be had to FIG. 4. Items similar to those in
FIG. 2 have received the same reference character incremented by two
hundred. Recall that in FIG. 2, each entry in the PHT 204 maintained its
own GHB 202, i.e. GHB1, GHB2, and so on. The individual GHBs in FIG. 2
maintained the hit-miss outcomes for the particular tag, i.e. tag1, tag2,
and so on. Every load instruction in the embodiment of FIG. 2 thus had
its own GHB. This is expensive. In the embodiment of FIG. 4, the GHB 402
is truly global and not specific to a particular LOAD or STORE. Here, the
single GHB 402 monitors for the entire program, and not just for a
specific LOAD, the most recent N outcomes of cache access. In the
embodiment of FIG. 4, use the global GHB 402 and not the per-entry GHB of
FIG. 2 to determine which column of the PHT 404 should be accessed. The
operation of the embodiment of FIG. 4 is otherwise similar to that of
FIG. 2; namely, take the LOAD PC, use some bits thereof in a hash 406
(e.g., take the most significant 30 bits) to determine a tag identifier
407, and match the tag in the first column of the PHT 404 (analogous to
the CAM lookup of FIG. 2). The entry with the matching tag is the entry
of interest. Within that entry, it remains to determine what column to
access, using the global GHB 402. Here, use the same technique of
determining the number of contiguous hits or misses and use that to
determine what column to access in the row for the tag of interest. The
embodiment of FIG. 4 gives up some accuracy as compared to FIG. 2, but
saves hardware.
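
The FIG. 4 variant, with a single truly global GHB 402 shared across all LOADs and per-tag prediction rows, can be modeled as below (the sizes, the encoding, and the dict standing in for the tag-matched table 404 are assumptions of the sketch):

```python
N = 4           # global GHB length (illustrative)
global_ghb = [] # shared by every LOAD; True = miss, most recent outcome first
rows = {}       # tag -> list of N 2-bit counters (dict stands in for the CAM)

def run_length():
    run = 1
    for b in global_ghb[1:]:
        if b != global_ghb[0]:
            break
        run += 1
    return run

def predict(tag):
    """True = predicted miss; None when the tag is absent (no prediction)."""
    if tag not in rows or not global_ghb:
        return None
    return (rows[tag][run_length() - 1] >> 1) == 1

def update(tag, actual_miss):
    """Train the row for this tag using the shared GHB, then shift it."""
    counters = rows.setdefault(tag, [0] * N)
    if global_ghb:
        i = run_length() - 1
        counters[i] = (min(counters[i] + 1, 3) if actual_miss
                       else max(counters[i] - 1, 0))
    global_ghb.insert(0, actual_miss)
    del global_ghb[N:]
```

The only structural difference from the FIG. 2 sketch is that all tags share one GHB, which is the hardware saving described above.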

[0059] To reiterate, the tables for each illustrated embodiment are
preferably maintained as hardware tables in the decode stage of the
processor pipeline, and the illustrated steps are carried out early-on,
upon determination that a given instruction is a memory-accessing
instruction such as a LOAD.

[0060] FIG. 12 presents a non-limiting example of five stages in a generic
processor pipeline, such as might be employed by a core 106 or other
processor. Instructions are fetched in FETCH stage 1202. Instructions are
decoded in DECODE stage 1204. In one or more embodiments, the prediction
techniques set forth herein are executed at this stage, as soon as it is
determined that a particular instruction is one requiring memory access
(e.g., LOAD or STORE). The issue stage 1206 comes next and the prediction
can be used at this stage for scheduling. The next stage is the execute
stage 1208; following execution, the actual results can be updated as
discussed elsewhere herein. The final stage is the commit stage 1210.
Other types of staging can be employed in other embodiments; however, a
significant aspect of one or more embodiments is the prediction at the
decode or equivalent stage and taking the prediction into account before
execution or the like.

[0061] One or more embodiments work separately at each individual cache
level (L1, L2, and so on). However, in some cases, the techniques
described can be implemented for multiple levels at the same time. For
example, the GHB tracks all the true outcomes. The illustrative examples
have focused, by way of example and not limitation, on the L1 level of
cache. Every time there is an L1 actual hit or miss, the GHB is updated
by pushing the result into the MSB. A similar GHB can be maintained for
the L2 level of cache. This L2 GHB is updated whenever there is an L2
actual hit or miss. A separate GHB and PHT can be maintained for every
cache level. The operation is otherwise similar.

[0062] One or more embodiments thus provide techniques wherein the GHR
(similar to GHB) and/or branch target buffer (BTB) can be used to predict
cache misses. In one or more embodiments, a prediction table is on the
order of 2*N bits ("*" denotes multiplication, not exponentiation). In one
or more embodiments, replacements from the cache are handled by an
existing replacement policy, and techniques are provided to predict
whether a cache access will result in a hit or a miss. One or more
instances are applicable to data and/or instruction caches.

[0063] FIG. 3 presents non-limiting exemplary data for a 64 KB, 8-way
associative L1 Dcache and a 1.5 MB, 12-way associative L2 Dcache. The input
was a plurality of z-series PCATZ D-cache reference traces. The
horizontal axis represents a number of different well-known so-called
SPEC applications and IBM applications while the vertical axis shows the
percent accuracy in predicting the L1 cache hit or miss.

[0064] FIGS. 5 and 6 present non-limiting exemplary data for prediction
accuracy for L1 and L2 Dcache. The axes are analogous to those in FIG. 3.
As can be seen therein, high accuracy can be achieved with less hardware.
Furthermore, effectiveness is shown as to both L1 and L2 cache "miss"
prediction. The benefit of holding-off dependent instructions if data
will not be present in the L2 cache as well can also be seen.
Furthermore, as can be seen, there is effectiveness for both
single-thread (FIG. 5) and simultaneous multithreading (SMT) (FIG. 6),
and a benefit is attained by scheduling a different thread when a cache
miss is predicted for one thread. A small percentage of load and/or store
instructions are responsible for a large percentage of cache misses.

[0065] FIGS. 8-11 present additional non-limiting exemplary data for
prediction accuracy for L1 and L2 Dcache. As can be seen therein,
application global behavior is sufficient to predict misses, and one or
more exemplary techniques are effective in both L1 and L2 cache "miss"
prediction, and for single and multi-threaded applications. The axes are
analogous to those in FIG. 3.

[0067] Thus, in one or more instances, it is sufficient to predict the
next cache access outcome based on the length of the last seen contiguous
string of "hits" or "misses." One or more embodiments use only N entries
to predict the outcome instead of 2^N entries. In tracking per-load hit
or miss behavior, it is noted that in many instances, only a small
percentage of loads are responsible for a large fraction of cache misses.
Furthermore, in tracking a global "hit" and/or "miss" region of
applications, application behavior is typically a good guide to hit
and/or miss prediction, and non-destructive interference among loads is
observed. Thus, FIG. 2 has the most complex hardware, FIG. 7 the least,
and FIG. 4 has intermediate hardware complexity.
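
The run-length idea in paragraph [0067] can be made concrete with a short
sketch. This is an illustrative rendering only, not code from the
specification; the function name and the list-of-booleans encoding of the
global history buffer are assumptions.

```python
# Illustrative sketch: index a prediction table by the length of the most
# recent contiguous run of identical outcomes in the global history buffer
# (GHB), so only N table entries are needed rather than 2^N. True = hit,
# False = miss; the most recent outcome is at index 0. All names are
# assumptions, not taken from the specification.

def run_length(ghb):
    """Length of the contiguous run matching the most recent outcome."""
    length = 1
    for outcome in ghb[1:]:
        if outcome != ghb[0]:
            break
        length += 1
    return length

# Three contiguous hits (most recent first), followed by a miss:
print(run_length([True, True, True, False]))  # -> 3
```

A table of N predictions indexed by `run_length(ghb) - 1` can then stand in
for a conventional 2^N-entry pattern history table indexed by the full
history.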

[0068] Given the discussion thus far, it will be appreciated that, in
general terms, an exemplary method, according to an aspect of the
invention, includes the step of decoding, in a decode stage 1204 of a
hardware processor pipeline, one particular instruction of a plurality of
instructions (e.g., the instruction whose LOAD PC is 208 or 408). An
additional step includes determining that the particular instruction
requires a memory access (e.g., is a LOAD or STORE); and, responsive to
determining that the particular instruction requires a memory access,
predicting whether the memory access will result in a cache miss; for
example, using the techniques of FIG. 2, 4, or 7. The predicting in turn
includes accessing one of a plurality of entries in a pattern history
table (e.g., 204, 404, or 704 plus 706) stored as a hardware table in the
decode stage 1204. The accessing is based, at least in part, upon at
least a most recent entry in a global history buffer such as 202, 402, or
702. The pattern history table stores a plurality of predictions (e.g.,
the individual "M" or "H" values). The global history buffer stores
actual results of previous memory accesses as one of cache hits and cache
misses. Additional steps include scheduling at least one additional one
of the plurality of instructions in accordance with the predicting (as
described elsewhere herein); and updating the pattern history table and
the global history buffer subsequent to actual execution of the
particular instruction in an execution stage 1208 of the hardware
processor pipeline, to reflect whether the predicting was accurate.
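
The decode-stage predict and execute-stage update just described can be
sketched as follows. This is a minimal single-GHB illustration under assumed
conventions (one-bit predictions, update by overwriting with the actual
outcome); all identifiers are hypothetical and not taken from the
specification.

```python
# Hypothetical sketch of the method of paragraph [0068]: predict at decode,
# then update the pattern history table (PHT) and global history buffer
# (GHB) after actual execution. True = hit, False = miss; the GHB holds the
# most recent outcome at index 0. Names and the training rule are
# illustrative assumptions.

class MissPredictor:
    def __init__(self, n=4):
        self.n = n
        self.ghb = [True] * n   # actual outcomes, most recent first
        self.pht = [True] * n   # one prediction per possible run length

    def _run_length(self):
        length = 1
        for outcome in self.ghb[1:]:
            if outcome != self.ghb[0]:
                break
            length += 1
        return length

    def predict(self):
        """Decode stage: predict hit (True) or miss (False)."""
        return self.pht[self._run_length() - 1]

    def update(self, actual_hit):
        """After execution: train the PHT entry used, then shift the GHB."""
        self.pht[self._run_length() - 1] = actual_hit
        self.ghb = [actual_hit] + self.ghb[:-1]
```

The issue stage would consult `predict()` to decide whether to hold off
dependent instructions (or, under SMT, schedule a different thread) when a
miss is predicted.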

[0069] In some cases, such as depicted in FIG. 7, the global history
buffer 702 stores combined results for all given ones of the plurality of
instructions requiring memory access, and the pattern history table
includes a miss pattern history table 704 and a hit pattern history table
706. Furthermore, in such cases, the predicting includes accessing the
most recent entry 708 in the global history buffer to determine whether
the most recent entry in the global history buffer is a hit or a miss (in
the example of FIG. 7, a miss); accessing the miss pattern history table
704 if the most recent entry in the global history buffer is a miss; and
accessing the hit pattern history table 706 if the most recent entry in
the global history buffer is a hit. In each case, accessing the
respective table 704, 706 includes accessing an entry therein
corresponding to a number of contiguous misses or hits in the global
history buffer, as the case may be, including the most recent entry in
the global history buffer.
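
The FIG. 7 style lookup of paragraph [0069] can be sketched as follows: the
most recent GHB entry selects between the miss and hit pattern history
tables, which are then indexed by the run length. Function and parameter
names are illustrative assumptions, not from the specification.

```python
# Hypothetical sketch of the FIG. 7 variant: separate miss and hit pattern
# history tables, selected by the most recent GHB entry and indexed by the
# length of the contiguous run including that entry. True = hit, False =
# miss; most recent outcome at index 0. All names are assumptions.

def predict_fig7(ghb, miss_pht, hit_pht):
    run = 1
    for outcome in ghb[1:]:
        if outcome != ghb[0]:
            break
        run += 1
    table = hit_pht if ghb[0] else miss_pht  # most recent entry picks table
    return table[run - 1]

# Two contiguous misses, most recent first: consult the miss PHT, entry 2.
print(predict_fig7([False, False, True, True],
                   miss_pht=[True, False, True, True],
                   hit_pht=[True, True, True, True]))  # -> False
```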

[0070] Reference should now be had to FIG. 4. To avoid confusion, recall
that FIG. 4 can include a CAM aspect, where a hash is carried out only to
extract a tag, and a hash aspect, where there is a hash to obtain an
index to a table entry together with a check whether the extracted tag
matches the tag for that row. Thus, in some instances, such as the "hash"
approach of FIG. 4, the global history buffer 402 stores combined results
for all given ones of the plurality of instructions requiring memory
access; the pattern history table 404 has M entries (in the example, 16)
arranged in rows (here, 4) and N columns (here, N=4), and each of the
rows also has a stored tag value (tag1, tag2, tag3, or tag4 in the
example) associated therewith. The particular instruction has a program
counter 408 associated therewith. The predicting includes hashing the
program counter to obtain a hash 406 having log2 M bits; extracting a tag
407 from the program counter; accessing the one of the plurality of
entries in the pattern history table based on the hash; and, as described
above, determining whether that one of the stored tag values
corresponding to a given one of the rows in which the one of the
plurality of entries in the pattern history table resides matches the
extracted tag. If the one of the stored tag values corresponding to the
given one of the rows in which the one of the plurality of entries in the
pattern history table resides does not match the extracted tag, it is
assumed that a collision has occurred, and a default prediction
(typically, assuming a cache hit) is made. If the one of the stored tag
values corresponding to the given one of the rows in which the one of the
plurality of entries in the pattern history table resides does indeed
match the extracted tag, a further step includes accessing an entry in
the given one of the rows based on the number of contiguous bits
traversed starting with the MSB of the global history buffer, until the
first bit that is different from the MSB is reached.
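
The "hash" variant of FIG. 4 can be sketched as follows. The modulo and
integer-division stand-ins for the hash and tag extraction, and all names,
are illustrative assumptions rather than the specification's actual hash.

```python
# Hypothetical sketch of the FIG. 4 "hash" variant: hash the program counter
# (PC) to a row, compare the row's stored tag with the tag extracted from
# the PC, and on a mismatch assume a collision and default-predict a hit.
# Otherwise index the row by the run length of the shared GHB. M rows by N
# columns; True = hit, False = miss. All names and field widths are
# assumptions.

M_ROWS = 4

def predict_fig4_hash(pc, tags, pht, ghb):
    row = pc % M_ROWS    # stand-in for "hash of PC, log2 M bits"
    tag = pc // M_ROWS   # stand-in for the extracted tag bits
    if tags[row] != tag:
        return True      # collision assumed: default prediction is a hit
    run = 1
    for outcome in ghb[1:]:
        if outcome != ghb[0]:
            break
        run += 1
    return pht[row][run - 1]
```

On a tag match, the selected column is the run length of the most recent
contiguous hits or misses, exactly as in the single-table case.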

[0071] On the other hand, in the CAM approach of FIG. 4, the global
history buffer 402 stores combined results for all given ones of the
plurality of instructions requiring the memory access; the pattern
history table has M entries (here, 16) arranged in rows (here, 4) and N
columns (here, N=4), and each of the rows has a stored tag value (tag1,
tag2, tag3, tag4, in the example) associated therewith. The particular
instruction has a program counter 408 associated therewith. The
predicting includes extracting a tag 407 from the program counter;
accessing a corresponding one of the rows for the one of the plurality of
entries in the pattern history table 404 based on the extracted tag
(i.e., look for a match with tag1, tag2, tag3, or tag4 in CAM fashion);
and accessing an entry in the corresponding one of the rows based on the
number of contiguous bits traversed starting with the MSB of the GHB,
until the first bit that is different from the MSB. If for some reason
there is no matching tag, a default prediction can optionally be made as
above.
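
A corresponding sketch of the CAM approach replaces the hashed row index
with an associative tag match across all rows. Again, every identifier here
is an illustrative assumption.

```python
# Hypothetical sketch of the FIG. 4 CAM variant: the extracted tag is
# matched associatively against every row's stored tag; if no row matches,
# an optional default "hit" prediction is made. True = hit, False = miss;
# most recent GHB outcome at index 0. All names are assumptions.

def predict_fig4_cam(tag, tags, pht, ghb):
    if tag not in tags:
        return True               # no matching row: default-predict a hit
    row = tags.index(tag)         # CAM-style associative match
    run = 1
    for outcome in ghb[1:]:
        if outcome != ghb[0]:
            break
        run += 1
    return pht[row][run - 1]
```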

[0072] In some cases, such as the "hash" approach of FIG. 2, the pattern
history table 204 has M entries (here, 16) arranged in rows (here, 4) and
N columns (here, N=4), each of the rows has a stored tag value (in the
example, tag1, tag2, tag3, or tag4) associated therewith, and each of the
rows has an instruction-specific portion of the global history buffer 202
associated therewith (designated as GHB1, GHB2, GHB3, or GHB4, in the
example). Furthermore, the particular instruction has a program counter
208 associated therewith. In addition, the predicting includes hashing
the program counter to obtain a hash 206 having log2 M bits; extracting a
tag from the program counter; and accessing the one of the plurality of
entries in the pattern history table based on the hash. An additional
step includes determining whether that one of the stored tag values
corresponding to a given one of the rows in which the one of the
plurality of entries in the pattern history table resides matches the
extracted tag. If this is not so, assume a collision has occurred and
make a default prediction, typically, "hit." On the other hand, if there
is indeed a match, access an entry in the given one of the rows (i.e.,
access the appropriate column) based on the number of contiguous bits
traversed starting with the MSB of the GHB, until the first bit that is
different from the MSB is reached.
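
The FIG. 2 "hash" variant differs from FIG. 4 in that each row carries its
own instruction-specific slice of the global history (GHB1 through GHB4), so
the run length is computed from the matching row's history rather than from
a shared buffer. The sketch below uses the same illustrative stand-ins as
before; all names are assumptions.

```python
# Hypothetical sketch of the FIG. 2 "hash" variant: hash the PC to a row,
# check the stored tag, and on a match compute the run length from that
# row's instruction-specific history slice. True = hit, False = miss; most
# recent outcome at index 0. All names and widths are assumptions.

M_ROWS = 4

def predict_fig2_hash(pc, tags, row_ghbs, pht):
    row = pc % M_ROWS    # stand-in for "hash of PC, log2 M bits"
    tag = pc // M_ROWS   # stand-in for the extracted tag bits
    if tags[row] != tag:
        return True      # collision assumed: default-predict a hit
    ghb = row_ghbs[row]  # this row's instruction-specific history
    run = 1
    for outcome in ghb[1:]:
        if outcome != ghb[0]:
            break
        run += 1
    return pht[row][run - 1]
```

The CAM variant of FIG. 2 described next is identical except that the row is
found by associative tag match rather than by hashing the PC.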

[0073] On the other hand, in the CAM approach of FIG. 2, the pattern
history table 204 has M entries (here, 16) arranged in rows (here, 4) and
N columns (here, N=4), each of the rows has a stored tag value (in the
example, tag1, tag2, tag3, or tag4) associated therewith, and each of the
rows has an instruction-specific portion of the global history buffer 202
associated therewith (i.e., GHB1, GHB2, GHB3, or GHB4 in the example).
The particular instruction has a program counter 208 associated
therewith. The predicting includes extracting a tag from the program
counter; accessing a corresponding one of the rows for the one of the
plurality of entries in the pattern history table based on the extracted
tag (i.e., look for a match with tag1, tag2, tag3, or tag4 in CAM
fashion); and accessing an entry in the corresponding one of the rows
based on the number of contiguous bits traversed starting with the MSB of
the corresponding instruction-specific portions of the N bit global
history buffer, until the first bit that is different from the MSB. If
for some reason there is no matching tag, a default prediction can
optionally be made as above.

[0074] Furthermore, given the discussion thus far, it will be appreciated
that, in general terms, an exemplary hardware processor, according to an
aspect of the invention, includes a decode pipeline stage 1204 which
decodes one particular instruction of a plurality of instructions;
hard-wired logic circuitry which determines that the particular
instruction requires a memory access; a pattern history hardware table
(e.g., 204, 404, or 704 and 706) within the decode pipeline stage; and a
global history buffer 202, 402, or 702 within the decode pipeline stage.
Also included is hard-wired logic circuitry which, responsive to
determining that the particular instruction requires a memory access,
predicts whether the memory access will result in a cache miss. The
predicting in turn includes accessing one of a plurality of entries in
the pattern history hardware table. The accessing is based, at least in
part, upon at least a most recent entry in the global history buffer. The
pattern history hardware table stores a plurality of predictions. The
global history buffer stores actual results of previous memory accesses
as one of cache hits and cache misses. Also included are an issue
pipeline stage 1206 which schedules at least one additional one of the
plurality of instructions in accordance with the predicting; an execution
pipeline stage 1208 which actually executes the particular instruction;
and hard-wired logic circuitry which updates the pattern history hardware
table and the global history buffer subsequent to the actual execution of
the particular instruction, to reflect whether the predicting was
accurate.

[0075] The global history buffer and pattern history table can be
configured as described with respect to any of the methods herein, and
hard-wired logic circuitry can be provided to implement any of the method
steps described herein.

[0076] In still another aspect, an exemplary hardware processor includes
means for carrying out the method steps. The means include various
structures disclosed herein, such as hardware tables, buffers such as
shift registers, and hard-wired logic circuitry to implement the method
steps.

[0077] Integrated circuit chips embodying aspects of the invention can be
distributed by the fabricator in raw wafer form (that is, as a single
wafer that has multiple unpackaged chips), as a bare die, or in a
packaged form. In the latter case the chip is mounted in a single chip
package (such as a plastic carrier, with leads that are affixed to a
motherboard or other higher level carrier) or in a multichip package
(such as a ceramic carrier that has either or both surface
interconnections or buried interconnections). In any case the chip is
then integrated with other chips, discrete circuit elements, and/or other
signal processing devices as part of either (a) an intermediate product,
such as a motherboard, or (b) an end product. The end product can be any
product that includes integrated circuit chips, typically an advanced
computer product having one or more central processors and optionally a
display, a keyboard or other input device, and the like.

[0078] FIG. 13 shows a block diagram of an exemplary design flow 1900
used, for example, in semiconductor IC logic design, simulation, test,
layout, and manufacture. Design flow 1900 includes processes, machines and/or
mechanisms for processing design structures or devices to generate
logically or otherwise functionally equivalent representations of the
design structures and/or devices described above and shown in FIGS. 1, 2,
4, 7, and 12. The design structures processed and/or generated by design
flow 1900 may be encoded on machine-readable transmission or storage
media to include data and/or instructions that when executed or otherwise
processed on a data processing system generate a logically, structurally,
mechanically, or otherwise functionally equivalent representation of
hardware components, circuits, devices, or systems. Machines include, but
are not limited to, any machine used in an IC design process, such as
designing, manufacturing, or simulating a circuit, component, device, or
system. For example, machines may include: lithography machines, machines
and/or equipment for generating masks (e.g. e-beam writers), computers or
equipment for simulating design structures, any apparatus used in the
manufacturing or test process, or any machines for programming
functionally equivalent representations of the design structures into any
medium (e.g. a machine for programming a programmable gate array).

[0079] Design flow 1900 may vary depending on the type of representation
being designed. For example, a design flow 1900 for building an
application specific IC (ASIC) may differ from a design flow 1900 for
designing a standard component or from a design flow 1900 for
instantiating the design into a programmable array, for example a
programmable gate array (PGA) or a field programmable gate array (FPGA)
offered by Altera® Inc. or Xilinx® Inc.

[0080] FIG. 13 illustrates multiple such design structures including an
input design structure 1920 that is preferably processed by a design
process 1910. Design structure 1920 may be a logical simulation design
structure generated and processed by design process 1910 to produce a
logically equivalent functional representation of a hardware device.
Design structure 1920 may also or alternatively comprise data and/or
program instructions that when processed by design process 1910, generate
a functional representation of the physical structure of a hardware
device. Whether representing functional and/or structural design
features, design structure 1920 may be generated using electronic
computer-aided design (ECAD) such as implemented by a core
developer/designer. When encoded on a machine-readable data transmission,
gate array, or storage medium, design structure 1920 may be accessed and
processed by one or more hardware and/or software modules within design
process 1910 to simulate or otherwise functionally represent an
electronic component, circuit, electronic or logic module, apparatus,
device, or system such as those shown in FIGS. 1, 2, 4, 7, and 12. As
such, design structure 1920 may comprise files or other data structures
including human and/or machine-readable source code, compiled structures,
and computer-executable code structures that when processed by a design
or simulation data processing system, functionally simulate or otherwise
represent circuits or other levels of hardware logic design. Such data
structures may include hardware-description language (HDL) design
entities or other data structures conforming to and/or compatible with
lower-level HDL design languages such as Verilog and VHDL, and/or higher
level design languages such as C or C++.

[0081] Design process 1910 preferably employs and incorporates hardware
and/or software modules for synthesizing, translating, or otherwise
processing a design/simulation functional equivalent of the components,
circuits, devices, or logic structures shown in FIGS. 1, 2, 4, 7, and 12
to generate a Netlist 1980 which may contain design structures such as
design structure 1920. Netlist 1980 may comprise, for example, compiled
or otherwise processed data structures representing a list of wires,
discrete components, logic gates, control circuits, I/O devices, models,
etc. that describes the connections to other elements and circuits in an
integrated circuit design. Netlist 1980 may be synthesized using an
iterative process in which netlist 1980 is resynthesized one or more
times depending on design specifications and parameters for the device.
As with other design structure types described herein, netlist 1980 may
be recorded on a machine-readable data storage medium or programmed into
a programmable gate array. The medium may be a non-volatile storage
medium such as a magnetic or optical disk drive, a programmable gate
array, a compact flash, or other flash memory. Additionally, or in the
alternative, the medium may be a system or cache memory, buffer space, or
electrically or optically conductive devices and materials on which data
packets may be transmitted and intermediately stored via the Internet, or
other suitable networking means.

[0082] Design process 1910 may include hardware and software modules for
processing a variety of input data structure types including Netlist
1980. Such data structure types may reside, for example, within library
elements 1930 and include a set of commonly used elements, circuits, and
devices, including models, layouts, and symbolic representations, for a
given manufacturing technology (e.g., different technology nodes, 32 nm,
45 nm, 90 nm, etc.). The data structure types may further include design
specifications 1940, characterization data 1950, verification data 1960,
design rules 1970, and test data files 1985 which may include input test
patterns, output test results, and other testing information. Design
process 1910 may further include, for example, standard mechanical design
processes such as stress analysis, thermal analysis, mechanical event
simulation, process simulation for operations such as casting, molding,
and die press forming, etc. One of ordinary skill in the art of
mechanical design can appreciate the extent of possible mechanical design
tools and applications used in design process 1910 without deviating from
the scope and spirit of the invention. Design process 1910 may also
include modules for performing standard circuit design processes such as
timing analysis, verification, design rule checking, place and route
operations, etc.

[0083] Design process 1910 employs and incorporates logic and physical
design tools such as HDL compilers and simulation model build tools to
process design structure 1920 together with some or all of the depicted
supporting data structures along with any additional mechanical design or
data (if applicable), to generate a second design structure 1990. Design
structure 1990 resides on a storage medium or programmable gate array in
a data format used for the exchange of data of mechanical devices and
structures (e.g. information stored in an IGES, DXF, Parasolid XT, JT,
DRG, or any other suitable format for storing or rendering such
mechanical design structures). Similar to design structure 1920, design
structure 1990 preferably comprises one or more files, data structures,
or other computer-encoded data or instructions that reside on
transmission or data storage media and that when processed by an ECAD
system generate a logically or otherwise functionally equivalent form of
one or more of the embodiments of the invention shown in FIGS. 1, 2, 4,
7, and 12. In one embodiment, design structure 1990 may comprise a
compiled, executable HDL simulation model that functionally simulates the
devices shown in FIGS. 1, 2, 4, 7, and 12.

[0084] Design structure 1990 may also employ a data format used for the
exchange of layout data of integrated circuits and/or symbolic data
format (e.g. information stored in a GDSII (GDS2), GL1, OASIS, map files,
or any other suitable format for storing such design data structures).
Design structure 1990 may comprise information such as, for example,
symbolic data, map files, test data files, design content files,
manufacturing data, layout parameters, wires, levels of metal, vias,
shapes, data for routing through the manufacturing line, and any other
data required by a manufacturer or other designer/developer to produce a
device or structure as described above and shown in FIGS. 1, 2, 4, 7, and
12. Design structure 1990 may then proceed to a stage 1995 where, for
example, design structure 1990: proceeds to tape-out, is released to
manufacturing, is released to a mask house, is sent to another design
house, is sent back to the customer, etc.

[0085] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of the
invention. As used herein, the singular forms "a", "an" and "the" are
intended to include the plural forms as well, unless the context clearly
indicates otherwise. It will be further understood that the terms
"comprises" and/or "comprising," when used in this specification, specify
the presence of stated features, integers, steps, operations, elements,
and/or components, but do not preclude the presence or addition of one or
more other features, integers, steps, operations, elements, components,
and/or groups thereof.

[0086] The corresponding structures, materials, acts, and equivalents of
all means or step plus function elements in the claims below are intended
to include any structure, material, or act for performing the function in
combination with other claimed elements as specifically claimed. The
description of the present invention has been presented for purposes of
illustration and description, but is not intended to be exhaustive or
limited to the invention in the form disclosed. Many modifications and
variations will be apparent to those of ordinary skill in the art without
departing from the scope and spirit of the invention. The embodiment was
chosen and described in order to best explain the principles of the
invention and the practical application, and to enable others of ordinary
skill in the art to understand the invention for various embodiments with
various modifications as are suited to the particular use contemplated.