Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors in order to be able to better utilize the additional transistors ensured by Moore's law. While parallel programs are going to be able to take most of the advantage of these CMPs, single thread applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speedup single-threaded applications. Our proposal improves single thread performance by reconfiguring two cores with the aim of collaborating on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated using a dedicated hardware that has minimum and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals on the extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism on large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speedup single thread by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium sized and small sized 2-core CMP respectively for Spec 2006 benchmarks.

Microprocessor industry has recently shifted
towards multi-core to take advantage of the ever increasing number of transistors provided by the new technologies. Unfortunately, the multi-core approach does not allow single threaded applications to benefit from the additional cores to improve their execution time. Speculative multithreading (SpMT) has been proposed in the past to boost performance of irregular applications in multi-core environments. In this work, we study the main bottlenecks of these architectures, such as the memory behavior and the pre-computation slices
and propose two novel schemes that allow SpMT to get 25% average speedup over single threaded execution. We propose Selective Replication as a technique to improve the
performance of the SpMT memory system. This technique does not introduce additional traffic in the bus and improves the performance of a conventional SpMT memory model by 6% on average and up to 21% for some applications. Also, we propose a scheme called Slice Specialization that reduces the
number of instructions in the pre-computation slices by adapting the slice to every single speculative thread spawned.
The later proposal outperforms previous schemes with slices by 15% and overall, both techniques combined achieve an improvement of 20% over a conventional SpMT processor.