Patents

This patent is for the split-stream method of encoding that permits Mill CPUs to decode extremely wide instructions (those with many independent operations in each instruction) using a compact and flexible variable-length encoding that is also fast to decode. The stream of instructions that is the program being executed is split into two streams of half-instructions that are stored separately in memory but are processed in lock-step by the CPU decoder. Because wide fixed-length instruction encodings use an impractical amount of cache and memory, and variable-length encodings take decode time polynomial in the width, instruction decode on legacy CPUs is limited to eight operations per cycle or fewer; Mill split-stream encoding supports instruction widths of over thirty operations decoded per cycle. In addition, split-stream encoding permits doubling the amount of instruction cache with no clock or pipeline penalty.
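As a rough illustration of the lock-step idea, the sketch below models two length-prefixed half-instruction streams laid head-to-head in memory, one read forward and one read backward from a shared entry address, with the decoder advancing both cursors together. The layout, the one-byte length prefix, and the "flow"/"exu" names are assumptions for illustration, not the patented encoding.

```python
# Sketch of split-stream decode. Assumption for illustration: the two
# half-instruction streams sit head-to-head in memory, the "flow" half
# read forward and the "exu" half read backward from the entry address,
# each half prefixed by a one-byte payload length.

def decode(memory, entry, n_instructions):
    flow_pc, exu_pc = entry, entry            # one cursor per stream
    program = []
    for _ in range(n_instructions):
        flen = memory[flow_pc]                # flow half: length, payload
        flow = bytes(memory[flow_pc + 1 : flow_pc + 1 + flen])
        flow_pc += 1 + flen
        elen = memory[exu_pc - 1]             # exu half: mirrored layout
        exu = bytes(memory[exu_pc - 1 - elen : exu_pc - 1])
        exu_pc -= 1 + elen
        program.append((flow, exu))           # both halves, in lock-step
    return program

# Two instructions laid out around entry address 6:
# exu stream (backward): "A" then "BC"; flow stream (forward): "xy" then "z".
memory = bytearray([0, 66, 67, 2, 65, 1, 2, 120, 121, 1, 122])
```

Here `decode(memory, 6, 2)` yields `[(b"xy", b"A"), (b"z", b"BC")]`, each tuple being one full instruction reassembled from its two half-streams.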

This patent is for the Belt, the Mill CPU mechanism that replaces the function of the general registers used for temporary storage in legacy CPUs. Because each Belt entry is write-once-read-many, the Mill is immune to the false ordering hazards (WAW and WAR) that force the use of massive numbers of rename registers in legacy CPUs. Removing the rename registers and their associated power-hungry circuitry leads to a more compact layout for better yield and lower cost, and saves the pipeline delay of the several stages devoted to rename translation.
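The hazard immunity follows from single assignment: a result is only ever dropped at the front of the belt and addressed by position, so no operation ever overwrites a value another operation might still read. A minimal model, with illustrative names:

```python
from collections import deque

class Belt:
    """Toy belt: a fixed-length queue of single-assignment values.
    Results drop in at the front, old values fall off the end, and
    operands are named by position (0 = newest), so no value is ever
    overwritten and there is nothing to rename. Illustrative only."""

    def __init__(self, length=8):
        self.slots = deque(maxlen=length)

    def drop(self, value):       # every result enters at the front
        self.slots.appendleft(value)

    def read(self, pos):         # operand reference by belt position
        return self.slots[pos]

belt = Belt()
belt.drop(3)                             # result of some operation
belt.drop(4)                             # 3 slides to position 1
belt.drop(belt.read(0) + belt.read(1))   # "add b0, b1" drops 7 at front
```

Note that the add neither names nor disturbs any storage location: its operands are positions, and its result simply becomes the new position 0.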

This patent is for the per-byte validation used in Mill caches. Each byte has an extra bit that indicates whether the data in that byte is valid or must instead be fetched from lower in the memory hierarchy. The valid bits obviate the write buffers and consolidating buffers used by legacy CPUs, which must update entire cache lines, a substantial saving in power, area, and complexity. In addition, a Mill store operation takes effect at once and need not wait for a line to be read from external memory, so slow memory-barrier operations are not needed by Mill programs.
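A toy model of the mechanism, with the line size and names assumed for illustration: a store marks its own bytes valid immediately, and a later load consults the next level of the hierarchy only for bytes still marked invalid.

```python
LINE = 16  # bytes per cache line (illustrative size)

class CacheLine:
    """Cache line with one valid bit per byte. A store completes at
    once into an otherwise-empty line by marking only its own bytes
    valid; a later load fetches from the next level only the bytes
    still invalid. A sketch, not the Mill's circuits."""

    def __init__(self):
        self.data = bytearray(LINE)
        self.valid = [False] * LINE

    def store(self, offset, value):
        for i, b in enumerate(value):     # no line fill needed first
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def load(self, offset, size, fetch_lower):
        out = bytearray()
        for i in range(offset, offset + size):
            out.append(self.data[i] if self.valid[i] else fetch_lower(i))
        return bytes(out)

line = CacheLine()
line.store(0, b"hi")                       # takes effect immediately
val = line.load(0, 4, lambda i: 0xFF)      # bytes 2-3 come from below
```

The store never waits for the rest of the line, which is the behavior that makes the write buffers of legacy designs unnecessary in this scheme.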

This patent covers two different ways to use meta-information in pointer formats. In the first use, each pointer carries a few “event” bits alongside its target address. The event bits are checked against several mask registers in the CPU whenever a memory operation executes (a load, a store, or specifically the storing of a pointer); a match triggers a trap to application software or the runtime system. When set appropriately, the event bits speed up certain kinds of garbage collection and detect several kinds of security violations.
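A sketch of how such a check might work. The field position, the mask names, and the trap type are all assumptions for illustration:

```python
EVENT_SHIFT = 60                 # assumed position of the event bits
EVENT_MASK = 0xF << EVENT_SHIFT

class Trap(Exception):
    """Trap delivered to application software or the runtime system."""

class MemUnit:
    """One mask register per class of memory operation; a pointer whose
    event bits intersect the relevant mask traps instead of executing.
    Field positions and mask names are assumptions for illustration."""

    def __init__(self):
        self.masks = {"load": 0, "store": 0, "store_ptr": 0}

    def check(self, op, ptr):
        events = (ptr & EVENT_MASK) >> EVENT_SHIFT
        if events & self.masks[op]:
            raise Trap(f"{op} trapped on event bits {events:#x}")
        return ptr & ~EVENT_MASK   # plain address for the access

mu = MemUnit()
mu.masks["store_ptr"] = 0b0001            # e.g. a GC write barrier
old_gen_ptr = (0b0001 << EVENT_SHIFT) | 0x1000
```

Loading through `old_gen_ptr` proceeds normally, but storing it as a pointer traps, which is the kind of hook a generational garbage collector needs.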

In the second use, the pointer format holds granularity information, and pointer arithmetic and array access operations can be checked by hardware for bounds violations without requiring memory tag bits or increasing the size of a pointer.
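One plausible way to encode such granularity is sketched below, under the assumption that an object occupies the naturally aligned block of 2**g bytes containing its base address; the patent's actual format will differ.

```python
GRAN_SHIFT = 56   # assumed position of the granularity exponent field

def make_ptr(addr, gran):
    """Pointer whose spare high bits carry a size exponent: the object
    is taken to occupy the naturally aligned 2**gran-byte block that
    contains addr. No extra pointer width, no memory tag bits."""
    return (gran << GRAN_SHIFT) | addr

def ptr_add(ptr, offset):
    """Checked pointer arithmetic: the result must stay inside the
    object's aligned block, or the hardware would fault."""
    gran = ptr >> GRAN_SHIFT
    addr = ptr & ((1 << GRAN_SHIFT) - 1)
    new = addr + offset
    if new >> gran != addr >> gran:       # left the allocation's block?
        raise ValueError("bounds violation")
    return (gran << GRAN_SHIFT) | new
```

For a 16-byte object at `0x1000`, `ptr_add(make_ptr(0x1000, 4), 15)` succeeds while an offset of 16 faults, all from bits that fit in the pointer itself.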

This patent describes memory mapping hardware that performs implicit zero memory initialization of cache lines upon local memory frame activation. This same memory mapping hardware also performs invalidation of the cache lines when the local memory frame activation is terminated.

The hardware described eliminates the need to perform store operations to initialize the frame to zero. Consistent zero initialization of local frame data gives function code a stable starting state, which eliminates a class of bugs and also hides any data left behind by functions that used the same address region for their local frame data at other times.

The hardware described also prevents modified cache lines containing meaningless data from being copied out to the lowest level of the hierarchical memory system after function exit.
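The three effects above can be modeled together in one sketch (an illustrative model, not the patented hardware): frame entry is a single mapping step with no store traffic, loads from untouched lines read zero, and frame exit discards lines instead of writing them back.

```python
LINE = 64   # bytes per cache line (illustrative)

class StackCache:
    """Model of implicit zero: entering a frame marks its lines
    'implicitly zero' in one mapping step with no store traffic; loads
    from such lines read zero; a store materializes a real zeroed line;
    exiting the frame discards the lines, so stale data is neither
    visible to later frames nor written back down the hierarchy."""

    def __init__(self):
        self.lines = {}                  # line address -> bytearray or "IZ"

    def enter_frame(self, base, nlines):
        for a in range(base, base + nlines * LINE, LINE):
            self.lines[a] = "IZ"         # no memory writes at all

    def exit_frame(self, base, nlines):
        for a in range(base, base + nlines * LINE, LINE):
            self.lines.pop(a, None)      # invalidate: no write-back

    def load(self, addr):
        data = self.lines.get(addr - addr % LINE)
        return data[addr % LINE] if isinstance(data, bytearray) else 0

    def store(self, addr, byte):
        base = addr - addr % LINE
        if not isinstance(self.lines.get(base), bytearray):
            self.lines[base] = bytearray(LINE)   # materialize as zeros
        self.lines[base][addr % LINE] = byte

sc = StackCache()
sc.enter_frame(0, 2)     # frame activation: two lines, instantly zero
```

Note that `enter_frame` and `exit_frame` touch only the mapping, never the data, which is where the store-elimination and write-back-elimination claims come from.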

For a less patentese explanation of this Mill technology, see our memory talk video here, or look at the corresponding PowerPoint slides starting at slide 59.

This patent describes the deferred-load hardware of the Mill CPU architecture. By allowing a program to issue a load early and avoid stalling while the requested data is still on its way through the memory hierarchy, the invention allows a statically scheduled CPU to achieve load performance comparable to that of a dynamically scheduled out-of-order CPU.

The patent describes two types of deferred load:

- a load that completes upon expiration of its programmed delay in machine cycles, and
- a load that completes upon issuance by the program of a corresponding pickup operation.
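Both types can be sketched in one toy load unit. In the sketch, memory is read at retire time rather than issue time, so a deferred load observes stores that land before it retires; all names are illustrative, not the patent's terms.

```python
class LoadUnit:
    """The two kinds of deferred load, sketched. A load issues at once
    but retires later, either after a stated cycle count or when the
    program executes a matching pickup. Memory is read at retire time,
    not issue time, which is how the deferral hides latency."""

    def __init__(self, memory):
        self.memory = memory       # addr -> value
        self.cycle = 0
        self.pending = []          # (retire_cycle, dest, addr)
        self.stations = {}         # tag -> addr, awaiting pickup

    def load(self, addr, delay, dest):     # first kind: timed retire
        self.pending.append((self.cycle + delay, dest, addr))

    def load_tagged(self, addr, tag):      # second kind: explicit pickup
        self.stations[tag] = addr

    def pickup(self, tag):
        return self.memory[self.stations.pop(tag)]

    def tick(self, belt):
        self.cycle += 1
        still = []
        for when, dest, addr in self.pending:
            if when <= self.cycle:
                belt[dest] = self.memory[addr]   # value as of retire
            else:
                still.append((when, dest, addr))
        self.pending = still

mem = {0x10: 1}
lu, belt = LoadUnit(mem), {}
lu.load(0x10, delay=2, dest="b0")   # issue now, retire in two cycles
lu.tick(belt)                       # cycle 1: still in flight, no stall
mem[0x10] = 42                      # a store lands before retire
lu.tick(belt)                       # cycle 2: retires, observes 42
```

The program keeps executing during the two cycles the load is in flight; only an actual use of the not-yet-retired value would have to wait.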