One of the tasks that a hardware designer must face is balancing conflicting design goals. This thesis focuses on one such conflict between desirable objectives in high-performance out-of-order processors.
In particular, we refer to the tension between operating at a high frequency and the ability to expose and exploit the instruction-level parallelism of the executed code. In out-of-order processors, the issue queue and the scheduler are responsible for exposing and exploiting instruction-level parallelism. One of the design parameters of these two elements is the number of entries (instructions) of the issue queue. This parameter is directly related to the amount of parallelism that the processor will be able to expose. On the other hand, another design parameter is the number of instructions that can be issued to execution in parallel, the issue width. This parameter determines how much instruction-level parallelism the processor can exploit. A high-performance processor requires these parameters to be as high as possible. However, the cost (area and delay) of the issue queue and the scheduler also grows with these parameters, so increasing them may compromise, for example, the processor cycle time.
In this thesis we employ two different techniques that can alleviate the conflict between frequency and the cost (area and delay) of the scheduling stage: on the one hand, pipelining the scheduling stage; on the other hand, slicing the selection logic. The aim of the first technique is to increase the number of entries in the issue queue without increasing the cycle time. The aim of the second technique is to remove the implicit serialization of the arbiters in the selection logic that arises because several issue ports can handle the same instruction type; this second technique also reduces the area and delay of the arbiters. However, these techniques involve some performance loss when applied in a straightforward manner.
The problem of pipelining the instruction scheduling logic lies in the hardware loop between the two main elements that constitute it: the wakeup and the selection logic. On the one hand, ready instructions start competing for execution in the selection logic. On the other hand, scheduled instructions must wake up their dependent instructions in the wakeup logic. When this hardware loop has a latency of a single cycle, the scheduling logic is able to schedule dependent instructions in consecutive cycles, and back-to-back execution is therefore possible. In contrast, when the scheduling logic is pipelined, instructions with an execution latency shorter than the hardware-loop latency cannot wake up their dependent instructions in time for back-to-back execution. According to our evaluations, this latency in the hardware loop results in a performance loss of around 10% on average for a 32-entry issue queue.
This thesis proposes the Dependence Level Scheduler (DLS), which is able to tolerate the latency of the wakeup-select hardware loop. The proposal relies on the observation that the selection logic usually schedules all the instructions competing for selection in a single cycle. In DLS, producer instructions (those with an execution latency shorter than the hardware-loop latency) can wake up their dependent instructions in advance, that is, as soon as they begin to compete for selection. In this way, the selection phase of the producer instructions and the advance wakeup phase of their consumer instructions are overlapped. Since the observation on which DLS relies does not always hold, it is necessary to handle the situation in which there are more producer instructions than can be scheduled in a single cycle. Therefore, every cycle, DLS checks that all producer instructions have been selected before allowing their woken-in-advance consumers to compete for selection.
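The sketch below (again hypothetical and purely illustrative) captures the policy described above: instructions are grouped into dependence levels of one-cycle instructions, the consumer level is woken up in advance while its producer level is being selected, and it may compete for selection only once every producer of the level has issued.

    # Toy model of the Dependence Level Scheduler idea (illustrative names,
    # not the thesis's actual design). Each level contains one-cycle
    # instructions whose producers all belong to the previous level.
    def dls_schedule(levels, issue_width):
        issue_cycle = {}
        cycle = 0
        for level in levels:
            pending = list(level)
            while pending:
                # Selection of the producer level; its consumer level is being
                # woken up in advance during these same cycles.
                selected, pending = pending[:issue_width], pending[issue_width:]
                for i in selected:
                    issue_cycle[i] = cycle
                cycle += 1
            # Non-speculative check: only when the whole level has issued may
            # the next (already woken) level start competing for selection.
        return issue_cycle

    # Two producers and their common consumer. With a wide enough issue width
    # the consumer issues in the cycle right after its producers (back to back,
    # despite a pipelined scheduler); with issue_width = 1 the check holds the
    # consumer back until both producers have issued.
    print(dls_schedule([['A', 'B'], ['C']], issue_width=4))  # {'A': 0, 'B': 0, 'C': 1}
    print(dls_schedule([['A', 'B'], ['C']], issue_width=1))  # {'A': 0, 'B': 1, 'C': 2}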

Pipelining the scheduling logic, which exposes and exploits instruction-level parallelism, degrades processor performance. In a 4-issue processor, our evaluations show that pipelining the scheduling logic over two cycles degrades performance by 10% on the SPEC-2000 integer benchmarks. This performance degradation is due to sacrificing the ability to execute dependent instructions in consecutive cycles.
Speculative selection is a previously proposed technique that boosts the performance of a processor with pipelined scheduling logic. However, this new speculation source increases the overall number of misspeculated instructions, and this useless work wastes energy. In this work we introduce a non-speculative mechanism named Dependence Level Scheduler (DLS), which not only tolerates the scheduling-logic latency but also reduces the number of misspeculated instructions with respect to a scheduler with speculative selection. In DLS, the selection of a group of one-cycle instructions (the producer level) is overlapped with the advance wakeup of its group of dependent instructions. DLS is not speculative because the group of instructions woken up in advance competes for selection only after all producer-level instructions have been issued. On average, DLS reduces the number of misspeculated instructions with respect to a speculative scheduler by 17.9%. From the IPC point of view, the speculative scheduler outperforms DLS by 0.3%. Moreover, we propose two non-speculative improvements to DLS.

Mitigating the effect of the large latency of load instructions is one of the challenges faced by microprocessor designers. This thesis analyses one of the alternatives for tackling this problem: address prediction and speculative execution. Several authors have noticed that the effective addresses computed by load instructions are quite predictable. First of all, we study why this predictability appears; our study tries to detect the high-level language structures that are compiled into predictable load instructions. We also analyse conventional address predictors in order to determine which ones are most appropriate for typical applications.

Our study continues by proposing address predictors that use their storage structures more efficiently. Address predictors track history information of the load instructions; however, the requirements of predictable instructions differ from those of unpredictable instructions. We therefore propose an organization of the prediction tables that takes the existence of both kinds of instructions into account. We also show that there is a certain degree of redundancy in the prediction tables of address predictors, and we propose organizing these tables to reduce this redundancy. These proposals allow us to reduce the area cost of address predictors without impacting their performance.

After that, we evaluate the impact of address prediction on processor performance. Our evaluations assume that address prediction is used to start some memory accesses speculatively and to execute their dependent instructions speculatively. On a correct prediction, all the speculative work is considered correct; on a misprediction, the speculative work must be discarded. Our study focuses on several aspects such as the interaction between address prediction and branch prediction, the implementation of verification mechanisms, the recovery mechanism on address mispredictions, and the influence of several processor parameters (the issue-queue size, the cache latency and the issue width) on the performance impact of address prediction.

Finally, we evaluate several recovery mechanisms for latency mispredictions. Latency prediction is a speculative technique used by the schedulers of some superscalar processors to deal with variable-latency instructions (for instance, load instructions). Our evaluations focus on a conventional recovery mechanism for latency mispredictions and on a new proposal. We also evaluate the proposed recovery mechanism in the scope of address prediction; we conclude that it represents a cost-effective alternative to the conventional recovery mechanisms used for address mispredictions.
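As a concrete illustration of the address predictability discussed above, the following is a minimal sketch of a stride-based predictor, one of the kinds of conventional address predictors referred to in this summary; the table organization and size used here are illustrative assumptions, not the organizations proposed in this work.

    # Toy stride address predictor (illustrative layout). Each load PC maps to
    # an entry holding the last effective address and the last observed stride;
    # the prediction for the next execution is last_address + stride.
    class StridePredictor:
        def __init__(self, entries=1024):
            self.entries = entries
            self.table = {}  # index -> (last_address, stride)

        def predict(self, pc):
            entry = self.table.get(pc % self.entries)
            if entry is None:
                return None                  # no history yet: no prediction
            last_address, stride = entry
            return last_address + stride

        def update(self, pc, actual_address):
            index = pc % self.entries
            entry = self.table.get(index)
            if entry is None:
                self.table[index] = (actual_address, 0)
            else:
                last_address, _ = entry
                self.table[index] = (actual_address, actual_address - last_address)

    # A load walking an array of 8-byte elements becomes predictable after two
    # executions: the predictor learns the stride and then guesses every address.
    predictor = StridePredictor()
    for address in range(0x1000, 0x1040, 8):
        guess = predictor.predict(0x400123)
        predictor.update(0x400123, address)
        print(hex(address), 'predicted:', hex(guess) if guess is not None else 'none')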