
Abstract:

A memory management and protection system that manages memory access
requests from a number of requestors. Memory accesses are allowed or
disallowed based on the privilege level of the requestor, based on a
Privilege Identifier that accompanies each memory access request. An
extended memory controller selects the appropriate set of segment
registers based on the Privilege Identifier to ensure that the request is
compared to and translated by the segment register associated with the
master originating the request. A set of mapping registers allows flexible
mapping of each Privilege Identifier to the appropriate access
permission.

Claims:

1. A memory management system comprising: a memory organized as a
plurality of memory segments addressable within the logical address space
of the memory, an extended memory controller operable to control the
access to the memory by requests originated by a plurality of requestors
based on the privilege identifier (PRIVID) accompanying each request for
access, a plurality of segment registers within the extended memory
controller operable to define the access permissions of the memory
segments based on the allowable ID (AID) bits within said segment
registers, a plurality of PRIVID to AID (PAMAP) mapping registers
operable to associate a PRIVID with an AID.

2. The extended memory controller of claim 1, wherein: said PAMAP mapping
registers are further operable to translate the 4-bit wide PRIVID field to
the 3-bit wide AID field.

3. The extended memory controller of claim 2, wherein: said PAMAP mapping
registers are readable by all requestors with all privilege and security
levels.

4. The extended memory controller of claim 3, wherein: said PAMAP mapping
registers are only writeable by secure supervisors on secure requestors,
or by supervisors or by emulation on non-secure requestors.

5. The extended memory controller of claim 4, wherein: said PAMAP mapping
registers contain a one bit wide field that specifies whether the
requestor associated with the PRIVID within the PAMAP is allowed to
initiate a secure transaction into the memory segment associated with the
PRIVID.

Description:

TECHNICAL FIELD OF THE INVENTION

[0001] This invention relates in general to the field of computing systems
and more particularly to memory management and protection.

BACKGROUND OF THE INVENTION

[0002] Current state of the art general purpose processors rely on memory
management units (MMU) to provide both memory protection and address
translation functions. The most typical memory management units provide a
page-oriented architecture, usually fixed around a small number of
different page sizes--4 kB only, 4 kB+64 kB, 4 kB+2 MB, etc. Memory
management units are very flexible, but they suffer from multiple
drawbacks: [0003] Complicated to design and verify; [0004] Variable
performance depending on how the memory management unit is implemented.
Most of the commonly used implementations can add significant uncertainty
to memory system performance; [0005] For best performance, the memory
management unit needs to be tightly integrated in the memory hierarchy,
preferably at or near the CPU/L1 cache boundary. Once tightly integrated,
it is difficult to remove, reducing the flexibility of the architecture
to scale to meet cost and performance targets; [0006] A potentially large
memory footprint is required for page table entries due to the small page
sizes in big systems. For example, it takes 1,048,576 4 kB pages to fully
map a 4 GB memory system.

[0007] An additional problem arises with conventional memory management
units used in large, multiprocessor systems using 32 bit processor
elements. While the logical address space for each processor or each task
will still fit into a 32 bit address space, the physical address space of
the memory system may exceed the 32 bit range. In this case an address
translation unit is required that is capable of producing wider physical
addresses than the logical addresses it started with, so that the total
physical address space can be larger than the logical address space.

SUMMARY OF THE INVENTION

[0008] This invention describes a distributed memory management and
protection system. In a complex multiprocessor system there are many
parallel threads of execution, and many sources making memory requests
at any given time. In such a system the memory protection system must
make decisions based on the privilege context associated with each
request. These requests may come from a variety of sources like DMA
controllers, but it is preferable to consider the CPU that originated the
request instead of the immediate source.

[0009] In the described system each request is accompanied by a "Privilege
Identifier" (PRIVID). Smart masters, such as CPUs, carry unique PRIVIDs.
"Deputy" masters such as DMA controllers inherit the PRIVID of the
originating CPU.

[0010] With multiple CPUs the PRIVID may be used to select different sets
of memory segments to match the memory map of the CPU. When the CPU
initiates a DMA request, the DMA parameters will be compared against and
translated by the appropriate CPU's set of segment registers.

[0011] A flexible mapping method is shown that allows arbitrary mapping of
the PRIVID to the Allowed ID (AID) of the memory segment.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] These and other aspects of this invention are illustrated in the
drawings, in which:

[0013] FIG. 1 illustrates the organization of a typical digital signal
processor to which this invention is applicable (prior art);

[0014] FIG. 2 illustrates details of a very long instruction word digital
signal processor core suitable for use in FIG. 1 (prior art);

[0023] FIG. 1 illustrates the organization of a typical digital signal
processor system 100 to which this invention is applicable (prior art).
Digital signal processor system 100 includes central processing unit core
110. Central processing unit core 110 includes the data processing
portion of digital signal processor system 100. Central processing unit
core 110 could be constructed as known in the art and would typically
include a register file, an integer arithmetic logic unit, an integer
multiplier and program flow control units. An example of an appropriate
central processing unit core is described below in conjunction with FIGS.
2 to 4.

[0024] Digital signal processor system 100 includes a number of cache
memories. FIG. 1 illustrates a pair of first level caches. Level one
instruction cache (L1I) 121 stores instructions used by central
processing unit core 110. Central processing unit core 110 first attempts
to access any instruction from level one instruction cache 121. Level one
data cache (L1D) 123 stores data used by central processing unit core
110. Central processing unit core 110 first attempts to access any
required data from level one data cache 123. The two level one caches are
backed by a level two unified cache (L2) 130. In the event of a cache
miss to level one instruction cache 121 or to level one data cache 123,
the requested instruction or data is sought from level two unified cache
130. If the requested instruction or data is stored in level two unified
cache 130, then it is supplied to the requesting level one cache for
supply to central processing unit core 110. As is known in the art, the
requested instruction or data may be simultaneously supplied to both the
requesting cache and central processing unit core 110 to speed use.

[0025] Level two unified cache 130 is further coupled to higher level
memory systems. Digital signal processor system 100 may be a part of a
multiprocessor system. The other processors of the multiprocessor system
are coupled to level two unified cache 130 via a transfer request bus 141
and a data transfer bus 143. A direct memory access unit 150 provides the
connection of digital signal processor system 100 to external memory 161
and external peripherals 169.

[0027] Central processing unit 1 has a 32-bit, byte addressable address
space. Internal memory on the same integrated circuit is preferably
organized in a data space including level one data cache 123 and a
program space including level one instruction cache 121. When off-chip
memory is used, preferably these two spaces are unified into a single
memory space via the external memory interface (EMIF) 4.

[0028] Level one data cache 123 may be internally accessed by central
processing unit 1 via two internal ports 3a and 3b. Each internal port 3a
and 3b preferably has 32 bits of data and a 32-bit byte address reach.
Level one instruction cache 121 may be internally accessed by central
processing unit 1 via a single port 2a. Port 2a of level one instruction
cache 121 preferably has an instruction-fetch width of 256 bits and a
30-bit word (four bytes per word) address, equivalent to a 32-bit byte
address.

[0030] Program fetch unit 10, instruction dispatch unit 11 and instruction
decode unit 12 recall instructions from level one instruction cache 121
and deliver up to eight 32-bit instructions to the functional units every
instruction cycle. Processing occurs simultaneously in each of the two
data paths 20 and 30. As previously described each data path has four
corresponding functional units (L, S, M and D) and a corresponding
register file containing 16 32-bit registers. Each functional unit is
controlled by a 32-bit instruction. The data paths are further described
below. A control register file 13 provides the means to configure and
control various processor operations.

[0031] FIG. 3 illustrates the pipeline stages 300 of digital signal
processor core 110 (prior art). These pipeline stages are divided into
three groups: fetch group 310; decode group 320; and execute group 330.
All instructions in the instruction set flow through the fetch, decode,
and execute stages of the pipeline. Fetch group 310 has four phases for
all instructions, and decode group 320 has two phases for all
instructions. Execute group 330 requires a varying number of phases
depending on the type of instruction.

[0032] The fetch phases of the fetch group 310 are: Program address
generate phase 311 (PG); Program address send phase 312 (PS); Program
access ready wait stage 313 (PW); and Program fetch packet receive stage
314 (PR). Digital signal processor core 110 uses a fetch packet (FP) of
eight instructions. All eight of the instructions proceed through fetch
group 310 together. During PG phase 311, the program address is generated
in program fetch unit 10. During PS phase 312, this program address is
sent to memory. During PW phase 313, the memory read occurs. Finally
during PR phase 314, the fetch packet is received at CPU 1.

[0033] The decode phases of decode group 320 are: Instruction dispatch
(DP) 321; and Instruction decode (DC) 322. During the DP phase 321, the
fetch packets are split into execute packets. Execute packets consist of
one or more instructions which are coded to execute in parallel. During
DC phase 322, the instructions in an execute packet are assigned to the
appropriate functional units. Also during DC phase 322, the source
registers, destination registers and associated paths are decoded for the
execution of the instructions in the respective functional units.

[0034] The execute phases of the execute group 330 are: Execute 1 (E1)
331; Execute 2 (E2) 332; Execute 3 (E3) 333; Execute 4 (E4) 334; and
Execute 5 (E5) 335. Different types of instructions require different
numbers of these phases to complete. These phases of the pipeline play an
important role in understanding the device state at CPU cycle boundaries.

[0035] During E1 phase 331, the conditions for the instructions are
evaluated and operands are read for all instruction types. For load and
store instructions, address generation is performed and address
modifications are written to a register file. For branch instructions,
branch fetch packet in PG phase 311 is affected. For all single-cycle
instructions, the results are written to a register file. All
single-cycle instructions complete during the E1 phase 331.

[0036] During the E2 phase 332, for load instructions, the address is sent
to memory. For store instructions, the address and data are sent to
memory. Single-cycle instructions that saturate results set the SAT bit
in the control status register (CSR) if saturation occurs. For single
cycle 16 by 16 multiply instructions, the results are written to a
register file. For M unit non-multiply instructions, the results are
written to a register file. All ordinary multiply unit instructions
complete during E2 phase 332.

[0038] During E4 phase 334, for load instructions, data is brought to the
CPU boundary. For multiply extension instructions, the results are
written to a register file. Multiply extension instructions complete
during the E4 phase 334.

[0040] FIG. 4 illustrates an example of the instruction coding of
instructions used by digital signal processor core 110 (prior art). Each
instruction consists of 32 bits and controls the operation of one of the
eight functional units. The bit fields are defined as follows. The creg
field (bits 29 to 31) is the conditional register field. These bits
identify whether the instruction is conditional and identify the
predicate register. The z bit (bit 28) indicates whether the predication
is based upon zero or not zero in the predicate register. If z=1, the
test is for equality with zero. If z=0, the test is for nonzero. The case
of creg=0 and z=0 is treated as always true to allow unconditional
instruction execution. The creg field is encoded in the instruction
opcode as shown in Table 1.
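The creg/z decoding described above can be sketched in C. This is an
illustrative sketch, not the patent's implementation; the helper names are
hypothetical, but the bit positions (creg in bits 29-31, z in bit 28) and
the creg=0, z=0 "always true" case follow the description.

```c
#include <stdint.h>
#include <stdbool.h>

/* Extract the creg field (bits 29-31) and the z bit (bit 28) from a
 * 32-bit instruction word, per the coding described above. */
static uint32_t creg_field(uint32_t insn) { return (insn >> 29) & 0x7u; }
static uint32_t z_bit(uint32_t insn)      { return (insn >> 28) & 0x1u; }

/* creg == 0 with z == 0 is treated as always true, i.e. the instruction
 * executes unconditionally. */
static bool is_unconditional(uint32_t insn)
{
    return creg_field(insn) == 0 && z_bit(insn) == 0;
}
```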

[0041] Note that "z" in the z bit column refers to the zero/not zero
comparison selection noted above and "x" is a don't care state. This
coding can only specify a subset of the 32 registers in each register
file as predicate registers. This selection was made to preserve bits in
the instruction coding.

[0042] The dst field (bits 23 to 27) specifies one of the 32 registers in
the corresponding register file as the destination of the instruction
results.

[0043] The src2 field (bits 18 to 22) specifies one of the 32 registers in
the corresponding register file as the second source operand.

[0044] The src1/cst field (bits 13 to 17) has several meanings depending
on the instruction opcode field (bits 3 to 12). The first meaning
specifies one of the 32 registers of the corresponding register file as
the first operand. The second meaning is a 5-bit immediate constant.
Depending on the instruction type, this is treated as an unsigned integer
and zero extended to 32 bits or is treated as a signed integer and sign
extended to 32 bits. Lastly, this field can specify one of the 32
registers in the opposite register file if the instruction invokes one of
the register file cross paths 27 or 37.
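The zero- and sign-extension of the 5-bit constant can be sketched as
below. The helper names are hypothetical; the field position (bits 13-17)
and the two extension behaviors follow the description. The sign-extension
trick assumes two's-complement arithmetic right shift, which is universal
in practice but implementation-defined in strict C.

```c
#include <stdint.h>

/* Extract the 5-bit src1/cst field (bits 13-17) and zero extend it. */
static uint32_t cst5_zero_extend(uint32_t insn)
{
    return (insn >> 13) & 0x1Fu;
}

/* Extract the same field and sign extend it to 32 bits by shifting the
 * field's top bit (bit 4) into the sign position and back. */
static int32_t cst5_sign_extend(uint32_t insn)
{
    uint32_t field = (insn >> 13) & 0x1Fu;
    return (int32_t)(field << 27) >> 27;
}
```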

[0045] The opcode field (bits 3 to 12) specifies the type of instruction
and designates appropriate instruction options. A detailed explanation of
this field is beyond the scope of this invention except for the
instruction options detailed below.

[0047] The p bit (bit 0) marks the execute packets. The p-bit determines
whether the instruction executes in parallel with the following
instruction. The p-bits are scanned from lower to higher address. If p=1
for the current instruction, then the next instruction executes in
parallel with the current instruction. If p=0 for the current
instruction, then the next instruction executes in the cycle after the
current instruction. All instructions executing in parallel constitute an
execute packet. An execute packet can contain up to eight instructions.
Each instruction in an execute packet must use a different functional
unit.
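The p-bit scan described above can be sketched as follows. This is an
illustrative sketch under the stated rules (p=1 chains the next
instruction into the current execute packet; p=0 ends the packet), and it
assumes the last instruction of the fetch packet has p=0 so that no
execute packet spans into the next fetch packet.

```c
#include <stdint.h>
#include <stddef.h>

/* Count the execute packets in a fetch packet of eight 32-bit
 * instructions.  A new execute packet ends after every instruction
 * whose p bit (bit 0) is 0. */
static int count_execute_packets(const uint32_t fp[8])
{
    int packets = 0;
    for (size_t i = 0; i < 8; i++) {
        if ((fp[i] & 1u) == 0)   /* p == 0: execute packet ends here */
            packets++;
    }
    return packets;
}
```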

[0053] To provide these benefits, the memory protection hardware must be
suitably powerful and flexible.

[0054] The described memory protection architecture provides these
benefits through a combination of CPU privilege levels and a memory
system permission structure. Device security is also supported as an
extension to the memory protection architecture, thereby allowing secure
devices to be built within this framework.

[0055] The privilege of an execution thread determines what level of
permissions that thread may have. This privilege is divided into two
concepts: privilege level and secure level.

[0056] Code running on the CPU executes in one of two privilege modes:
Supervisor Mode or User Mode. Supervisor code is considered `more
trusted` than User code. Examples of Supervisor threads include operating
system kernels and hardware device drivers. User threads are all end
applications.

[0057] Supervisor Mode is generally granted access to peripheral registers
and the memory protection configuration. User Mode is generally confined
to the memory spaces that the operating system (OS) specifically
designates for its use.

[0058] Requestors provide a single privilege bit alongside each memory
access to indicate the privilege level associated with that access. The
memory protection hardware uses this bit to determine what subset of the
memory protection fields to examine. This bit is encoded as shown below:

TABLE-US-00002
Privilege Level Encoding
User 0
Supervisor 1

[0059] CPU accesses as well as DMA and other accesses have a privilege
level associated with them. CPU privilege level is determined as
described above. DMA and other accesses initiated by the CPU inherit the
CPU's privilege level at the time they are initiated. Mastering
peripherals generally issue transfers with supervisor privileges,
although the specific details depend on the peripheral.

[0060] On secure devices there is an additional axis to the privilege
structure. A device which does not implement device security, or which
has device security disabled is referred to as a non-secure device.

[0061] Threads of execution may be considered either secure or non-secure.
Whereas the CPU's privilege level is purely an internal CPU mode, the
secure level is a function both of the memory system configuration and
the CPU mode. A thread of execution may be secure only if it is executing
from a page of memory that is marked secure. The CPU can only branch to
code in a secure page if it is already in secure mode. The CPU enters
secure mode via an exception or an interrupt.

[0062] Secure mode privileges are always a superset of non-secure mode.
Regions of memory that are marked as secure are not accessible to
non-secure threads. Secure threads may access memory regardless of
whether it is marked secure or not secure. This is in direct contrast to
the independent permissions offered for supervisor vs. user: nothing
requires or guarantees that Supervisor has more privilege than User
within a given page.

[0063] To the memory protection hardware, the secure mode looks like an
additional privilege bit that comes in parallel with each access. On
non-secure devices, this bit is hardwired so that all accesses look like
Secure accesses. The encoding of this bit is as follows:

TABLE-US-00003
Secure Level Encoding
Not Secure 0
Secure 1

[0064] As with privilege levels, all CPU, DMA and other accesses have a
secure level associated with them. DMA accesses issued by a CPU inherit
the CPU's secure level, and other accesses are always treated as
Non-Secure.

[0065] A non-secure device has two privilege levels, Supervisor and User.
Such devices transition into Supervisor mode in response to interrupts
and the Software Exception (SWE) instruction. All privilege transitions
are strictly controlled within the CPU.

[0066] In the generic OS model, pages of memory are usually configured
with minimal privileges. For instance, user data pages are configured as
readable and writeable by the User, but not necessarily readable or
writeable by the Supervisor. The same is true for code pages. Shared code
pages are marked as executable by both Supervisor and User. Other
executable pages are marked as executable by Supervisor or User
exclusively.

[0067] This model provides a high level of error checking robustness
against incorrect program code, errors due to corrupted pointers, and
potential attacks against a system.

[0068] It is worth noting that Non-Secure in this context refers only to
the lack of device level IP Security support. It makes no implication
about the security of the OS against remote attacks. It only implies that
the Intellectual Property (IP) stored on the device does not have
hardware locks that prevent developer access or a determined attacker
with physical access to the device from revealing the IP stored therein.

[0069] In such a model, operating system requests are made from User Mode
in the following manner: [0070] The User Mode task places any
information it is passing to the OS in a memory page that is at a minimum
readable to the Supervisor, and writeable by the User. (The page may or
may not be readable by the User, and may or may not be writeable by the
Supervisor); [0071] The User Mode task constructs a set of arguments in
registers and/or memory as noted above; [0072] The User Mode task
executes an SWE instruction.

[0073] In response to this sequence, the CPU transitions into Supervisor
mode and begins executing the OS Exception Handler. The Exception
Handler, upon recognizing a valid system call, performs the service call
requested. If the OS must return a block of data to the user, it must do
so via a page that is writeable to Supervisor, and readable by the User.

[0074] Such a model relies on the OS being somewhat trustworthy. While
memory protection pages are set up to prevent Supervisor access to User
pages whenever possible, this configuration is a matter of trust. The
Supervisor may change the configuration of these pages at any time. The
purpose of memory protection blocking Supervisor accesses in this case is
to ensure that such accesses are deliberate as opposed to accidental or
due to attempts (from hostile code) to subvert the supervisor. In the
DSP-BIOS model, DSP-BIOS configures memory protections such that
Supervisor permissions are always a superset of User permissions. The
goal of this configuration is performance: Most operating system service
calls avoid a costly switch into Supervisor mode, as a large portion of
DSP-BIOS' code and data structures exist in User-mode space.

[0075] DSP-BIOS treats Supervisor as "fully trusted" mode. As a result,
DSP-BIOS does not execute non-trusted User code in Supervisor mode.
DSP-BIOS does not rely on hardware assistance to enforce this
requirement, however. Rather, DSP-BIOS relies on the correctness of its
own programming to not invoke untrusted code while in Supervisor mode.
Note that this restriction applies to untrusted User mode code. DSP-BIOS may
invoke some of its own code from both Supervisor and User modes, since
the code represents common library routines used both by the kernel and
user-space.

[0076] In the embedded processor realm, it is common for hardware to
provide a minimum of functionality, and for software to provide the rest.
DSP-BIOS' memory protection model is designed for that paradigm.

[0077] On a Secure Device, the usage model differs somewhat. On these
devices, the CPU enters Secure Supervisor mode on taking an interrupt or
an exception. It leaves Secure mode by branching to a Non-secure page,
and it leaves Supervisor mode via "B IRP"/"B NRP" instruction. Thus,
privilege and secure-level transitions are a function of both CPU mode
switches and memory system attributes.

[0078] The purpose of a Secure Device is to protect IP and sensitive data
(e.g. credit card numbers, personal identification information) from
exposure to hostile attackers, including those who have development
equipment and physical access to the device.

[0079] In the Secure Device model, there exists a small Secure Kernel
which has ultimate control over all memory protection entries, interrupt
and exception handlers, and various Secure IP that is stored on the die.
A separate OS can coexist with this kernel, but that OS runs with
Non-secure Supervisor privileges.

[0080] Secure Devices have four privilege/security modes altogether. Table
2 illustrates these four modes and what they are typically used for under
this model.

[0081] In this model, the Real Time Operating System (RTOS) relinquishes
some responsibilities to the Secure Kernel, such as configuring the
Interrupt Service Table, a subset of the memory protection entries, and
configuring the cache hierarchy.

[0082] The memory protection architecture described uses a distributed
structure. The distributed model avoids memory protection bottlenecks in
a large system by placing the controls with the resources being
controlled: each protected resource implements the memory protection
hardware locally.

[0083] The architecture simply defines "resource" as "a peripheral or
memory accessible through addresses in the memory map." This allows the
same hardware within the resource to perform protection checks for all
requestors to that resource, without bottlenecking on accesses to the
protection hardware itself.

[0084] This approach has advantages and disadvantages. The advantages
include a consistent view of memory protection attributes for all
requestors in the system, and a lack of bottlenecks that might arise from
centralized memory protection hardware. Disadvantages include the
requirement that each peripheral (or hardware outside the peripheral)
implement protection hardware and provide a CPU accessible register
interface to that hardware.

[0085] Each endpoint can tailor its protection hardware to a small extent
in order to implement the desired level of protection.

[0086] To support this distributed protection architecture and to support
caches in such a system, additional sideband signaling must accompany
each memory access. Most notably, the presence of cacheable memory shifts
some of the burden of access control, because caches introduce a semantic
disconnect between program accesses and the memory accesses which reach
the target endpoint.

[0087] The memory protection architecture divides the memory map into
pages, with each page having an associated set of permissions.

[0088] Memories typically have power-of-2 page sizes ranging from 1 KB to
16 MB. The page size chosen for a given pool of memory depends on the
size of the memory and the complexity of the memory controller. These
power-of-2 sized pages typically start and end at power-of-2 boundaries
equal to their size, although this is not a requirement.
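With power-of-2, size-aligned pages, locating the page that an address
falls in reduces to a shift. The helper below is a hypothetical sketch of
that arithmetic, not hardware from the source; `page_shift` is log2 of the
page size (10 for 1 KB pages, 24 for 16 MB pages).

```c
#include <stdint.h>

/* Return the page index of addr within a region starting at region_base,
 * assuming power-of-2 pages of size (1 << page_shift) bytes. */
static uint32_t page_index(uint32_t addr, uint32_t region_base,
                           unsigned page_shift)
{
    return (addr - region_base) >> page_shift;
}
```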

[0089] Peripherals modify the notion of pages, using pages to cover unique
resources within the peripheral. An example of a unique resource might
include separate channels on a multi-channel peripheral that implements
separate register sets for each channel. Another example might be a range
of locations in a DMA parameter associated with one or more channels.

[0090] As a result, pages do not have a uniform size, and might even cover
discontinuous regions of the address space. Different resources may size
their protection pages differently, according to the needs of the
resource. For instance, a small L1 memory might use 1K pages, whereas a
large off-chip memory might use 16 MB pages. A peripheral might define a
`page` that covers only a handful of registers. This range and variation
in page sizes offer a balance between granularity of control and the cost
of implementing the protection architecture.

[0091] For example, the L1 and L2 memory controllers select their page
size based on static configuration parameters. Other resources, such as
peripherals with small, fixed address spaces, have inherently fixed page
sizes.

[0092] Other resources may control large regions of the memory map, and
addresses in these regions may be used in a variety of ways by the end
system designers. Peripherals may act as proxies for resources outside
the control of chip designer, and such peripherals should therefore
support run-time programmable page sizes in order to tailor the memory
protection to the needs of the system. Registers that control a
peripheral's memory protection become part of the peripheral's memory
mapped register set.

[0093] Hardware resources may also assign permissions to individual
registers as required. These permissions may be fixed by the hardware,
and might not be exposed to the programmer in a "page attribute"
register. For instance, several cache control registers provided by
unified memory controller (UMC), the data memory controller (DMC) and
program memory controller (PMC) have fixed permissions, and no associated
register for configuring those permissions.

[0094] The memory protection architecture defines a per page permission
structure with three permission fields in a 16-bit permission entry. As
shown in FIG. 5, bits 16-31 are reserved, bits 8-15 contain the Allowed
IDs, bits 6-7 set the security and bits 0-5 define the Access types.
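The permission-entry layout above can be expressed as field extractors.
These macro names are hypothetical; only the bit positions (Allowed IDs
in bits 8-15, security in bits 6-7, access types in bits 0-5) come from
the description.

```c
#include <stdint.h>

/* Field extractors for the 16-bit permission entry: bits 8-15 Allowed
 * IDs, bits 6-7 security, bits 0-5 access types (bits 16-31 of the
 * containing register are reserved). */
#define PERM_AID(e)    (((uint32_t)(e) >> 8) & 0xFFu) /* one bit per allowed ID */
#define PERM_SEC(e)    (((uint32_t)(e) >> 6) & 0x3u)  /* security field         */
#define PERM_ACCESS(e) ((uint32_t)(e) & 0x3Fu)        /* access-type field      */
```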

[0095] Each requestor on the device has an N-bit code associated with it
that identifies it for privilege purposes. This code, referred to as the
PrivID, accompanies all memory accesses made on behalf of that requestor.
That is, when a requestor triggers a transfer directly, either by writing
to DMA registers or by triggering the execution of a set of DMA
parameters, the corresponding DMA engine will capture the requestor's
PrivID and provide that PrivID alongside the transfer.

[0096] Each memory protection entry has an allowed ID field associated
with it that indicates which requestors may access the given page. The
memory protection hardware maps the PrivIDs of all the possible
requestors to the allowed IDs field in the memory protection registers.
The allowed IDs field discriminates between the various CPUs, non-CPU
requestors, and a given CPU's accesses to its own local memories.

[0097] When set to `1`, the AID bit grants access to the corresponding
PrivID. When set to `0`, the AID bit denies access to the corresponding
requestor. Table 3 gives the default mapping of the allowed ID bits to
devices.
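The grant/deny check described above amounts to indexing the allowed-ID
field by the requestor's AID. The sketch below assumes the AID bits sit in
bits 8-15 of the 16-bit permission entry, per FIG. 5; the function name is
hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Access is granted only when the AID bit selected by the requestor's
 * AID is 1 in the page's permission entry. */
static bool aid_allows(uint16_t perm_entry, unsigned aid)
{
    uint32_t aid_bits = ((uint32_t)perm_entry >> 8) & 0xFFu;
    return ((aid_bits >> aid) & 1u) != 0;
}
```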

[0098] The above PrivID assignments for bits AID0 through AID5 apply to
all DMA, IDMA and CPU memory accesses other than to the CPU's local L1
and L2 memories. The LOCAL bit governs CPU accesses to its own local L1
and L2 memories.

[0099] The AIDX bit maps to PrivIDs that do not have dedicated AID bits
associated with them. It is intended but not required that this bit
refers to external mastering peripherals, especially on systems with a
large number of CPUs. If a given device must discriminate among external
mastering peripherals, it can assign lower numbered PrivIDs to these
peripherals.

[0100] As described, the architecture only supports 6 unique PrivIDs. The
remaining PrivIDs map onto a single AID. On devices that need to
discriminate among more than 6 different requestors (CPUs and mastering
peripherals), the AID field may be extended. Alternate useful mappings of
PrivID to AID include: [0101] Fixed N:1 groupings of PrivID to AID.
This groups CPUs and requestors into `clusters.` Such a mapping might be
appropriate for a shared peripheral. For instance, a device with 12 CPUs
might pair up the CPUs such that each AID bit corresponds to two CPUs.
[0102] Topological mapping. In this scheme, each CPU might employ a 1:1
mapping of PrivID to AID for a "global master" and a set of "logical
neighbors," and an N:1 mapping for all other PrivIDs. For example, one
might set up a mesh where each CPU has 3 neighbors, a global master, and
all other CPUs map onto the remaining AID. [0103] Run time programmable
dynamic mapping. This method is the most flexible.
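The fixed N:1 clustering in the 12-CPU example above reduces to integer
division. This sketch is illustrative only; the function name is
hypothetical.

```c
/* Pair up PrivIDs so each AID bit corresponds to two CPUs:
 * PrivID 0,1 -> AID 0; PrivID 2,3 -> AID 1; and so on. */
static unsigned privid_to_aid_paired(unsigned privid)
{
    return privid / 2;
}
```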

[0104] A device with many CPUs might employ a combination of these mapping
schemes. As an example, one might cluster PrivIDs with N:1 mappings for
shared peripherals while using a more topological approach for the CPUs
themselves.

[0105] An expanded implementation of this invention introduces the concept
of flexible mapping of the PRIVID bits to AID bits.

[0106] A set of 16 32-bit PrivID to AID MAP (PAMAP) memory mapped
configuration registers is provided, where each register maps the 4 bit
PrivID value to the 3 bit AID value. These registers are shown in FIG. 10
where bits 0-2 represent the AID bits, bits 3-30 are reserved and bit 31
implements an additional level of memory protection. Bit 31 controls
whether the associated PRIVID is allowed to initiate secure transactions
into the memory system.

[0108] The PAMAP registers are readable by all privilege and security
levels, but are writeable only by secure supervisors on secure devices,
or by supervisors or by emulation on non-secure devices.

[0109] To aid in code development and to add robustness to a system
design, the memory protection architecture described distinguishes a
CPU's accesses to its local memory from the DMAs it issues that access
the same memory.

[0110] All requests issued by the CPU carry the CPU's PrivID as the ID of
the requestor. This allows each peripheral to permit or deny the accesses
according to the list of allowed IDs. In the case of accesses to a CPU's
local L1 and L2 memories, it is useful to distinguish between the CPU's
direct accesses to these memories and DMA accesses. Generally,
applications will only devote a subset of the local memory to DMA
activity, and DMA accesses outside these areas are incorrect accesses.

[0111] With respect to the allowed ID field, the architecture treats all
remote CPU accesses to a given module's memories identically to DMA
accesses from that remote CPU. Therefore, the PrivID is used to determine
the corresponding AID bit when considering remote accesses, and the LOCAL
bit to determine whether the CPU may access its local memory.

[0112] In a complex system there are multiple threads of execution and
multiple masters making requests at any given time. With a distributed
memory protection system, it becomes necessary to make decisions and take
memory management actions based on the privilege context associated with
a request.

[0113] It is not sufficient to consider the master that made the request.
For example, DMA controllers are masters, and make requests on behalf of
multiple CPUs in the system. When it comes time to make a memory
protection decision on a DMA request, it is more valuable to consider the
CPU that programmed the DMA than the fact that the request came from the
DMA controller.

[0114] In the described embodiment, each request is accompanied by a
Privilege Identifier (PrivID). Smart masters, such as CPUs, carry unique
PrivIDs. Deputy masters, such as DMA controllers, inherit the PrivID of
the component that programmed the DMA.

[0115] The PrivID may then select different sets of MPAX (Memory
Protection and Address eXtension) segments to apply to requests. This
lets each of the multiple CPUs in the system (4 or 8 in this embodiment)
define the MPAX segments to match its image of the memory map. When a
given CPU then requests DMA transfers, the DMA parameters (by virtue of
inheriting that CPU's PrivID) will get compared against and translated by
the appropriate CPU's set of segment registers in the Extended Memory
Controller (XMC).

[0116] This method allows each of the smart masters to manage its own
traffic without having to coordinate excessively with other smart
masters. Furthermore, it allows each smart master to request as many DMAs
as it requires, ignoring other masters as long as the DMAs inherit the
privilege ID from the master that initiated the DMA.

[0117] This can be applied to any distributed memory protection
architecture whereby primary masters can "deputize" secondary masters.

[0118] With this architecture, a program could designate a page as direct
CPU access only by setting "LOCAL=1" and setting all other allowed IDs to
zero. Conversely, a program could designate a page as "DMAs issued from
this CPU only" by setting its AID bit and clearing the LOCAL bit. Such a
setting can be useful in a paranoid environment such as a secure device,
or in the context of a device driver. In most cases only the L1 and L2
memory controllers implement the LOCAL bit. In theory, a mastering
peripheral that can make self-referential requests may also use the LOCAL
bit, but in all likelihood such support is unnecessary. Nothing in the
described architecture prevents such an implementation.

[0119] The described memory protection model defines three fundamental
functional access types: Read, Write and Execute. Read and Write refer to
data accesses--accesses originating via the load/store units on the CPU
or via the DMA engines. Execute refers to accesses associated with
program fetch. The memory protection model allows controlling Read, Write
and Execute permissions independently for both User and Supervisor mode.
This results in 6 permission bits, as shown in Table 4.

[0120] For each bit, a `1` permits the access type, and a `0` denies it.
Thus "UX=1" means that User Mode may execute from the given page.

[0121] This method allows a small number of blanket permission settings to
describe the general mapping requirements with coarse granularity for
large ranges of system addresses without describing all of the fine
details of the system.

[0122] In a single processor system, this approach is sufficient to ensure
protected memory accesses. However, in the multiprocessor shared memory
system described there may be a number of processors with their own
memory controllers interacting with one or more shared memories. While
the memory management unit controlling the local shared memory has the
required permission information for requests originating locally, it does
not have adequate information regarding the permissions associated with
the access requests originating with other processors. This invention
implements a distributed protection approach to solve this problem.

[0123] On the first level, the MPAX unit may assign coarse, default
permissions to a memory segment. As an example, it may allow read and
execute but not write access to a segment, and these permissions will
apply to all remote accesses to that memory segment irrespective of the
source of the request.

[0124] The processor module originating the access request may need to
modify these permissions through its own memory management unit. As a
further example, the process executing within the remote requestor may
have execute, but not read or write permission. In order to ensure that
only allowed memory operations take place, the final permissions must
take into consideration both the permissions assigned by the MPAX unit
controlling the shared memory, and the permissions assigned by the memory
controller originating the request.

[0125] As shown in FIG. 9, the MPAX unit controlling the shared memory
accomplishes this by merging its permissions with the permissions
provided by the requestor as part of the request, performing a logical
AND operation between the two permissions. Register
901 is part of the segment register selected by the address of the
request as shown on FIG. 6, and register 903 is the register holding the
permission bits provided by the requestor. The contents of these
registers are merged in AND block 902, resulting in the final permission
bits shown in register 904. The local MPAX unit will use the merged
permissions in forming the decision whether to grant access to the
request or not. The meanings of the various permission bits are described
in Tables 4 and 5. Requestors that do not perform permission checks
return all 1's, thus not modifying the local MPAX permissions.

[0126] The DMC and UMC perform the fine-grain access-type checks for
accesses that result in cache line fills, not the peripheral. The
peripheral just makes a coarse grain check to ensure that "any" access is
possible. Further, the peripheral does not perform fine-grain access-type
checks for cache writebacks. Rather, the peripheral only checks that the
region of memory is minimally writeable.

[0127] By default, the architecture permits emulation accesses to pages,
regardless of the settings in the allowed ID and access type fields.
While this is appropriate for typical debug environments, it is not
appropriate for a secure device.

[0128] To support Secure devices, the NS and EMU bits indicate the
security configuration of the page. Table 5 illustrates the encoding of
these two bits:

TABLE 6

NS=0, EMU=0:
  CPU Effects: Page is secure. Only Secure mode may access this page.
  Secure code executing within this page retains its secure status.
  Emulation Effects: EDI reads/writes to this page are NOT permitted.
  Emulation halts and trace are NOT permitted when executing from this
  page.

NS=0, EMU=1:
  CPU Effects: Page is secure. Only Secure mode may access this page.
  Secure code executing within this page retains its secure status.
  Emulation Effects: EDI reads/writes to this page ARE permitted.
  Emulation halts and trace ARE permitted when executing from this page.

NS=1, EMU=x:
  CPU Effects: Page is not secure. Both Secure and Non-secure code may
  access this page. Secure code may branch to this page, but upon doing
  so it will lose its Secure status.
  Emulation Effects: EDI reads/writes to this page ARE permitted.
  Emulation halts and trace ARE permitted when executing from this page.

[0129] Setting `NS=1` makes the page `not secure.` Both secure supervisor
and non-secure supervisor mode may modify the permissions for this page.
Setting `NS=0` marks the page `secure.` Only secure supervisor CPU
accesses may modify the permissions of secure pages. Only secure
supervisor code may mark a page as secure by setting NS=0.

[0130] The EMU bit allows for debugging secure code. When NS=1, the
hardware ignores the EMU bit. When NS=0, the hardware disables emulation
for that page unless EMU=1. Specifically, when NS=0 and EMU=0, the
hardware:

[0131] Blocks all EDI accesses to the affected page;

[0132] Blocks emulation accesses through secondary routes;

[0133] When the CPU executes secure code from a page with NS=0 and
EMU=0, it:

[0134] Prevents emulation from halting code when the CPU executes from
the secure page. The CPU ignores both software and hardware breakpoints.
(This is the equivalent of "Run Free");

[0135] Prevents emulation from tracing code or events generated by that
code.

[0136] When either NS=1 or EMU=1, the hardware permits full emulation
access to the page. Trace and breakpoints function normally allowing the
debugging of secure code. This helps the most for debugging program
issues that arise when modifying an otherwise correct algorithm to work
in secure mode. On non-secure devices, the architecture forces NS to
always be 1, thereby preventing unwanted operation.

[0137] While the above emulation architecture will give the programmer a
consistent view between emulation debugging of the program and normal
execution of the program, it does not allow for the arbitrary inspection
of the entire physical memory map. In order to enable this, an XEDIAX
(eXtended memory controller, Emulation Data Interface, Address eXtension)
window may be implemented.

[0138] What makes this window unique is that it is truly parallel to
anything the user's program might try to do. A section of arbitrary
logical address space (0801_0000 through 0801_FFFF in the present
embodiment) may be cordoned off as "emulation only." In other words, we
now have 16 dedicated functional segments with programmable size, base
address and replacement, and 1 dedicated emulation segment with a fixed
size and fixed base address and only programmable replacement.

[0139] There is no possible way for functional application code to
influence this emulation-dedicated memory resource without generating
fake debugger requests. This means that the debugger will always have a
means of displaying the entire *physical* address map, regardless of what
the application has configured for everything else.

[0140] Software must define the mechanism for setting/clearing EMU on
secure pages. A secure authentication algorithm, likely embedded in a ROM
on a secure device, should perform this operation. Developers debugging
secure code would then run a short sequence of code on the target to
invoke this mechanism. A debugging environment could automate such an
unlocking sequence by downloading and executing a short, predefined code
sequence along with authentication data that the system requires.

[0141] When it encounters an invalid access, the memory protection
hardware has two distinct duties:

[0142] Prevent the access from occurring,

[0143] Report the error to the operating environment.

[0144] Invalid accesses are those memory accesses which require greater
permissions than those specified for the page or register involved. When
presented with an invalid access, the memory protection hardware must
prevent the requestor from making the access. What this means
specifically varies with the hardware being protected, and whether the
request originated from a cache controller. The guiding principle that
the memory protection hardware must uphold can be summarized in one
statement: The memory or peripheral being protected must not change in
state due to an invalid access.

[0145] The hardware must always drop invalid writes. Invalid writes must
never modify memory or change the status of a peripheral.

[0146] The hardware must never allow illegally read data to reach the
requestor. Further, if the read would cause side-effects (that is, update
the state of a peripheral, such as a FIFO or a status bit), the hardware
must prevent those side-effects from occurring. The read may return
garbage, so long as the garbage is unrelated to the contents of the
memory being accessed.

[0147] An incorrect access may change some of the cache's internal state,
so long as it does not change the semantic meaning of the cache's
contents. Specifically, when a cache encounters an invalid access:

[0148] The cache may update its LRU;

[0149] The cache may allocate and/or evict lines;

[0150] The cache may not mark lines `dirty`;

[0151] The cache may not change the contents of the cache line.

[0152] Upon detecting an invalid access, the memory protection hardware
must report the error as appropriate. The memory protection hardware
generates exception signals that are routed to an interrupt/exception
selector block, and the selector then routes these exceptions to one or
more CPUs. Only CPU, peripheral and DMA accesses to protected resources
generate exceptions.

[0153] Emulation accesses never generate exceptions, rather they generate
unique debug access status (DAS) codes. The architecture always permits
emulation accesses unless the page has both NS=0 and EMU=0 (or
equivalent, in the case of control registers with fixed implicit
permissions).

[0154] Each peripheral which implements memory protection checks provides
at least one exception output for signaling privilege violations. When
the protection hardware detects a fault, it records the information
regarding the fault and signals an exception. Under most circumstances,
it does not record new faults or signal new exceptions until the
supervisor clears the initial fault. Hardware which implements the LOCAL
bit has an additional notion of "local" versus "remote" exceptions. Such
hardware allows local faults to replace the information recorded for a
remote fault, and subsequently trigger a new exception.

[0155] Each peripheral may provide more than one exception output, in
order to signal different exceptions based on the faulting requestor, or
some other aspect of the request. Indeed, the DMC, PMC and UMC specify
separate exception outputs for local vs. global faults.

[0156] When a given piece of memory protection hardware detects a
privilege violation, it captures some basic information about the
violation as part of the exception-triggering process. Specifically, it
captures the address of the fault, and the type of access that generated
the fault. Invalid emulation accesses never generate memory protection
exceptions. Rather, the memory controller reports the fault to emulation
through special debug access status (DAS) codes.

[0157] The Memory Protection and Address eXtension (MPAX) unit addresses
the needs above with a segment-based protection and translation scheme
that has the following features:

[0158] Small, fixed number of segments (8 or 16 in our initial
embodiment);

[0159] Programmable power of two segment size (4 kB through 4 GB in the
initial embodiment);

[0160] Programmable permissions per segment (rather than segments
dedicated by purpose);

[0161] Strict priority between segments whose logical addresses overlap;

[0162] Support for merging segment permissions with permissions returned
from remote endpoints. This allows stacking various protection units in a
distributed system;

[0163] Replacement address associated with each segment can have more
bits than the logical address bits it replaces, thereby supporting a
large physical address space.

Benefits of this approach include:

[0164] Complex and compact mappings possible with a small number of
segments, thanks in part to the priority scheme;

[0165] Enables a distributed protection scheme that scales up to the
needs of a complex system, but which can be scaled down for or omitted
from cost sensitive devices. Multiple MPAX units or other
firewalls/protection mechanisms can exist along the path to a given
endpoint, each providing its say (and potentially translation) over
whether and how the access can proceed;

[0166] Physical address space on a large system can be sized to cover the
overall memory requirements of the entire system, while the individual
processors can stick with 32-bit addressing internally, thereby avoiding
changing their datapaths;

[0167] Because the segments have dedicated storage, the segment
descriptors do not compete for memory space, and the maximum cost of an
MPAX lookup, permission check and address translation is 2 clock cycles.
In some cases, part of the lookup can be avoided, reducing this to 1
clock cycle;

[0168] One can alias the same physical memory at multiple logical
addresses, as there are no constraints on the logical-to-physical
translation other than page size and address alignment. This can be
useful in the context of a processor megamodule that assigns memory
attributes (cacheability, prefetchability, writethrough) on the basis of
logical address;

[0169] The MPAX registers may also be used to redirect requests from a
master port back to a slave port via the system-level interconnect to
facilitate embedded functional testing.

[0170] Device security in a shared memory system refers to the concept of
supporting truly secure regions of memory and truly secure code
execution, where the chain of trust extends from a known boot image up
through the actual secure code execution, and through access to secure
resources.

[0171] In the described system security starts with a secure boot ROM and
a set of security keys programmed into eFuse. The secure boot ROM puts
the system into a known state, and erases potentially sensitive data from
the RAM before allowing code execution.

[0172] Additional rules are implemented in the MPAX unit to ensure that
security cannot be subverted by the memory protection and address
translation functions:

[0173] Only a secure supervisor can change the security level of any
given segment. As segments have a strict priority, a secure supervisor
can claim the highest priority segments for itself.

[0174] Non secure memory requests that match a secure segment generate an
access fault.

[0175] Secure memory requests that match a non secure segment are demoted
to non secure status. As non secure space controls the address mapping of
non secure segments, the purpose here is to prevent a bug or malicious
code from attempting a secure access to a logical address whose mapping
is controlled by non secure space. The secure level of a request leaving
the XMC should match the level for the addressed segment, and the
endpoint must then allow or disallow the request based on the secure
level.

[0176] The XMC must match the secure level of the endpoint with the
secure level of the segment matched. If they mismatch, the XMC faults the
request.

[0177] The secure ROM only accepts secure requests and reports itself as
secure.

[0178] The Multicore Shared Memory Controller (MSMC) keeps a "secure
address threshold" that is programmable by a secure supervisor only, and
will only allow secure requests that are above the secure threshold.

[0179] Emulation requests are only allowed for a secure segment if the
segment is marked emulable, thus introducing the concept of secure
emulation at the system level.

[0180] Security in this implementation is ultimately enforced by the
secure endpoints. The system provides a mechanism to extend the chain of
trust from the fixed ROM and the effectively fixed eFuse key values to
establish and maintain security.

[0181] Address translation may be done via adding an address offset, or by
bit replacement. The described implementation uses bit-replacement to
save silicon area and increase the likelihood of meeting timing
constraints. A lower clock-rate implementation (or a more aggressive
implementation at the same clock rate) could use address-offset.

[0182] Address extension is implemented in the preferred embodiment by bit
replacement. The upper address bits of the logical address are replaced
with the corresponding bits from the replacement address (RADDR). The
replacement address field may be wider than the field it replaces, thus
extending the addressing range. In the current implementation the 32 bit
logical address space is extended to a 36 bit physical address space.
Table 6 shows the bits replaced, depending on segment size (SEGSZ).

[0183] FIG. 6 shows the layout of the segment registers. There is a
segment register for each memory segment, 16 in the preferred
implementation. The registers occupy two 32 bit words shown as 601 and
602 in the memory map, although not all 64 bits are implemented in the
current implementation. The fields in the segment registers are described
in Table 7.

[0185] FIG. 7 shows a memory map of a typical system. 701 represents a 32
bit addressable logical memory being mapped into the 36 bit address space
of the available physical memory 703. Mapping and address translation is
implemented in two of the segment registers in register block 702. In the
mapping shown in FIG. 7, segment register 704 contains BADDR=00000h;
RADDR=000000h; and SEGSZ=11110b representing 2 GB. Register 705 contains
BADDR=80000h; RADDR=080000h; and SEGSZ=11110b.

[0186] FIG. 8 shows a high level description of one implementation of the
Extended Memory Controller. Register bank select block 803 selects the
appropriate segment register bank from segment register banks 801 or 802
by looking at the PrivID 804 accompanying the memory request. Comparator
block 806 then selects the requested segment register by comparing
logical address 805 to the segment base addresses and segment sizes in
the appropriate segment registers, and then compares the permission bits
of the selected segment register to the permissions associated with
logical address 805 as communicated by PrivID 804. If the comparison
indicates a valid memory request, block 806 generates physical address
807 by replacing the appropriate bits in logical address 805 by the
corresponding bits in the replacement address field RADDR of the selected
segment register.

[0187] The exact number of segments, segment register banks and the range
of segment sizes are not fundamental to this invention. In the current
preferred embodiments we have 8 or 16 segments, and segment sizes from 4
kB to 4 GB. One could easily envision a lower cost realization with fewer
segments, a larger smallest-segment size, or no address extension.