Adapting to Intermittent Faults in Multicore Systems

Philip M. Wells, Koushik Chakraborty, and Gurindar S. Sohi

Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults.

We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults, and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system -- one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this proposed technique can gracefully degrade performance during intermittent faults of various duration with low overhead, without involving system software, and without requiring spare cores.

Serializing Instruction in System-Intensive Workloads: Amdahl's Law Strikes Again

Philip M. Wells and Gurindar S. Sohi

Serializing instructions (SIs), such as writes to control registers, have many complex dependencies, and are difficult to execute out-of-order (OoO). To avoid unnecessary complexity, processors often serialize the pipeline to maintain sequential semantics for these instructions.

We observe frequent SIs across several system-intensive workloads and three ISAs, SPARC V9, X86-64, and PowerPC. As explained by Amdahl's Law, these SIs, which create serial regions within the instruction-level parallel execution of a single thread, can have a significant impact on performance. For the SPARC ISA (after removing TLB and register window effects), we show that operating system (OS) code incurs a 8-45% performance drop from SIs.

We observe that the values produced by most control register writes are quickly consumed, but the writes are often effectively useless (EU), i.e., they do not actually change the execution of the consuming instructions. We propose EU prediction, which allows younger instructions to proceed, possibly reading a stale value, and yet still execute correctly. This technique improves the performance of OS code by 6-35%, and overall performance by 2-12%.

Cooperative Cache Partitioning for Chip Multiprocessors

Jichaun Chang and Gurindar S. Sohi

This paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among co-scheduled applications running on CMPs. Unlike single-partition based cache partitioning schemes, CCP resolves cache contention between thrashing applications using multiple time-sharing partitions (MTP). Time sharing cache resources among partitions allows each thrashing application to speedup dramatically in at least one partition by unfairly shrinking other applications' cache allocations, while achieving fair speedups by giving different partitions equal chance to execute. Quality-of-Service (QoS) is guaranteed by orchestrating each application's shrinking/expansion across partitions to bound its average slowdown. CCP further integrates MTP with previously proposed CMP cooperative caching to combine the strengths of cache partitioning schemes and LRU-based latency-reduction designs, which leads to a simplified partitioning algorithm and better performance for workloads that do not benefit from cache partitioning.

We evaluate the effectiveness of CCP by simulating a 4-core CMP with 4MB L2 cache running 210 multiprogramming combinations of 7 SPEC2000 benchmarks. For workloads that benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than an exhaustive search of single partitions. Overall, CCP is the best performing scheme on almost all metrics and performs robustly under high capacity pressure with half the cache size.

Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly

Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes.

When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.

Hardware Support for Spin Management in Overcommitted Virtual Machines

Philip M. Wells, Koushik Chakraborty, and Gurindar S. Sohi

Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and is inflexible in other domains.

In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in runtime when transparently virtualizing a chip-multiprocessor's cores. To combat this problem, we propose a hardware technique to detect several cases when a VCPU is not performing useful work, and suggest preempting that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software.

We then present a case study, typical of server consolidation, to demonstrate the potential of more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10-25% for consolidated server workloads due to improved cache locality and core utilization, and substantially improves performance isolation in private caches.

Program Demultiplexing: Data-flow based Speculative Execution of Methods in Sequential Programs

Saisanthosh Balakrishnan and Gurindar S. Sohi

We present Program Demultiplexing (PD), an execution paradigm that creates concurrency in sequential programs by "demultiplexing" methods (functions or subroutines). Call sites of a demultiplexed method in the program are associated with handlers that allow the method to be separated from the sequential program and executed on an auxiliary processor. The demultiplexed execution of a method (and its handler) is speculative and occurs when the inputs of the method are (speculatively) available, which is typically far in advance of when the method is actually called in the sequential execution. A trigger, composed of predicates that are based on program counters and memory write addresses, launches the speculative execution of the method on another processor.

Our implementation of PD is based on a full-system execution-based chip multi-processor simulator with software to generate triggers and handlers from an x86-program binary. We evaluate eight integer benchmarks from the SPEC2000 suite -- programs written in C with no explicit concurrency and/or motivation to create concurrency -- and achieve a harmonic mean speedup of 1.8x with our implementation of PD.

Cooperative Caching for Chip Multiprocessors

Jichaun Chang and Gurindar S. Sohi

This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper.

Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.

The Impact of Performance Asymmetry in Emerging Multicore Architectures

Saisanthosh Balakrishnan, Ravi Rajwar, Mike Upton, and Konrad Lai

Performance asymmetry in multicore architectures arises when individual cores have different performance. Building such multicore processors is desirable because many simple cores together provide high parallel performance while a few complex cores ensure high serial performance. However, application developers typically assume computational cores provide equal performance, and performance asymmetry breaks this assumption. This paper is concerned with the behavior of commercial applications running on performance asymmetric systems. We present the first study investigating the impact of performance asymmetry on a wide range of commercial applications using a hardware prototype. We quantify the impact of asymmetry on an application's performance variance when run multiple times, and the impact on the application's scalability. Performance asymmetry adversely affects behavior of many workloads. We study ways to eliminate these effects. In addition to asymmetry-aware operating system kernels, the application often itself needs to be aware of performance asymmetry for stable and scalable performance.

Characterization of Problem Stores

Allison L. Holloway and Gurindar S. Sohi

This paper introduces the concept of problem stores: static stores whose dependent loads often miss in the cache. Accurately identifying problem stores allows the early determination of addresses likely to cause later misses, potentially allowing for the development of novel, proactive prefetching and memory hierarchy management schemes.

We present a detailed empirical characterization of problem stores using the SPEC2000 CPU benchmarks. The data suggests several key observations about problem stores. First, we find that the number of important problem stores is typically quite small; the worst 100 problem stores write the values that will lead to about 90% of non-cold misses for a variety of cache configurations. We also find that problem stores only account for 1 in 8 dynamic stores, though they result in 9 of 10 misses. Additionally, the problem stores' dependent loads miss in the L2 cache a larger fraction of the time than loads not dependent on problem stores. We also observe the set of problem stores is stable across a variety of cache configurations. Finally, we found that the instruction distance from problem store to miss and problem store to evict is often greater than one million instructions, but the value is often needed within 100,000 instructions of the eviction.

Coherence Decoupling: Making Use of Incoherence

Jaehyuk Huh, Jichuan Chang, Doug Burger, and Gurindar S. Sohi

This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce -- if not eliminate -- the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.

Use Based Register Caching

J. Adam Butts and Gurindar S. Sohi

Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce this penalty suffer from several problems including poor insertion and replacement decisions and the need for a fully-associative cache for good performance. We present novel mechanisms for managing and indexing register caches that address these problems using knowledge of the number of consumers of each register value.

The insertion policy reduces pollution by not caching a register value when all of its predicted consumers are satisfied by the bypass network. The replacement policy selects register cache entries with the fewest remaining uses (often zero), lowering the miss rate. We also introduce a new, general method of mapping physical registers to register cache sets that improves the performance of set-associative cache organizations by reducing conflicts. Our results indicate that a 64-entry, two-way set associative cache using these techniques outperforms multi-cycle monolithic register files andp reviously proposed hierarchical register files.

Exploiting Value Locality in Physical Register Files

Saisanthosh Balakrishnan and Gurindar S. Sohi

The physical register file is an important component of a dynamically-scheduled processor. Increasing the amount of parallelism places increasing demands on the physical register file, calling for alternative file organization and management strategies. This paper considers the use of value locality to optimize the operation of physical register files.

We present empirical data showing that: (i) the value produced by an instruction is often the same as a value produced by another recently executed instruction, resulting in multiple physical registers containing the same value, and (ii) the values 0 and 1 account for a considerable fraction of the values written to and read from physical registers. The paper then presents three schemes to exploit the above observations.

The first scheme extends a previously-proposed scheme to use only a single physical register for each unique value. The second scheme is a special case for the values 0 and 1. By restricting optimization to these values, the second scheme eliminates many of the drawbacks of the first scheme. The third scheme further improves on the second, resulting in an optimization that reduces physical register requirements with simple micro-architectural extensions. A performance evaluation of the three schemes is also presented.

Parallelism in the Front-End

Paramjit S. Oberoi and Gurindar S. Sohi

As processor back-ends get more aggressive, front-ends will have to scale as well. Although the back-ends of superscalar processors have continued to become more parallel, the front-ends remain sequential. This paper describes techniques for fetching and renaming multiple non-contiguous portions of the dynamic instruction stream in parallel using multiple fetch and rename units. It demonstrates that parallel front-ends are a viable alternative to high-performance sequential front-ends.

Compared with an equivalently-sized trace cache, our technique increases cache bandwidth utilization by 17%, front-end throughput by 20%, and performance by 5%. Parallelism also enhances latency tolerance: a parallel front-end loses only 6% performance as the cache size is decreased from 128 KB to 8 KB, compared with a 50-65% performance loss for sequential fetch mechanisms.

Using Coherence Value Speculation to Improve Multiprocessor Performance

Jichaun Chang, Jaehyuk Huh, Rajagopalan Desikan, Doug Burger, Gurindar Sohi

Transmission of cache lines in cache-coherent shared memory machines is necessary for communication but can cause significant latencies across the system. The ongoing growth in cache capacities shifts the distribution of cache misses from capacity and conflict misses to coherence misses, which consist of misses caused by both true and false sharing. In this paper we propose coherence decoupling and coherent value speculation to improve multiprocessor performance. Coherence decoupling splits the caching operation from the coherence operation, allowing processors to speculate on the load value or the data access permission, despite not having both for that operation. Coherent value speculation provides speculatively-coherent values for a coherence-decoupled system. In this paper we focus on using shared value prediction for coherent value speculation. We describe and profile the potential of three value prediction schemes: address-cache-based, PC-table-based, and PC+address-table-based. The first scheme simply uses the stale values existing in the cache, while the other two predict using recent value tables. We present early results that suggest excellent potential for coherent value speculation.

Characterizing and Predicting Value Degree of Use

J. Adam Butts and Gurindar S. Sohi

A value's degree of use--the number of dynamic uses of that value--provides the most essential information needed to optimize its communication. We present simulation results demonstrating the properties of degree of use of values, including their predictability: most static instructions generate values with few degrees of use and these exhibit temporal locality. We use these results to guide the design of a degree of use predictor. The development and detailed characterization of this predictor is the focus of this paper. Our predictor leverages future control flow information (e.g., branch predictions) to select among different possible degrees of use. We study the effects of several optimizations and variations in the predictor s algorithms to tune the predictor for maximum performance. The resulting predictor generates correct degree of use predictions for over 92% of all dynamic values and has a misprediction rate below 2.5%. Such a predictor has a wide range of potential applications in optimizing value communication.

Dynamic Dead-Instruction Detection and Elimination

J. Adam Butts and Gurindar S. Sohi

We observe a non-negligible fraction3 to 16% in our benchmarks of dynamically dead instructions, dynamic instruction instances that generate unused results. The majority of these instructions arise from static instructions that also produce useful results. We find that compiler optimization (specifically instruction scheduling) creates a significant portion of these partially dead static instructions. We show that most of the dynamically dead instructions arise from a small set of static instructions that produce dead values most of the time.

We leverage this locality by proposing a dead instruction predictor and presenting a scheme to avoid the execution of predicted-dead instructions. Our predictor achieves an accuracy of 93% while identifying over 91% of the dead instructions using less than 5 KB of state. We achieve such high accuracies by leveraging future control flow information (i.e., branch predictions) to distinguish between useless and useful instances of the same static instruction.

We then present a mechanism to avoid the register allocation, instruction scheduling, and execution of predicted dead instructions. We measure reductions in resource utilization averaging over 5% and sometimes exceeding 10%, covering physical register management (allocation and freeing), register file read and write traffic, and data cache accesses. Performance improves by an average of 3.6% on an architecture exhibiting resource contention. Additionally, our scheme frees future compilers from the need to consider the costs of dead instructions, enabling more aggressive code motion and optimization. Simultaneously, it mitigates the need for good path profiling information in making inter-block code motion decisions.

Out-of-Order Instruction Fetch using Multiple Sequencers

Paramjit S. Oberoi and Gurindar S. Sohi

Conventional instruction fetch mechanisms fetch contiguous blocks of instructions in each cycle. They are difficult to scale since taken branches make it hard to increase the size of these blocks beyond eight instructions. Trace caches have been proposed as a solution to this problem, but they use cache space inefficiently.

We show that fetching large blocks of contiguous instructions, or wide fetch, is inefficient for modern out-oforder processors. Instead of the usual approach of fetching large blocks of instructions from a single point in the program, we propose a high-bandwidth fetch mechanism that fetches small blocks of instructions from multiple points in a program.

In this paper, we demonstrate that it is possible to achieve high-bandwidth fetch by using multiple narrow fetch units operating in parallel. Our mechanism performs as well as a trace cache, does not waste cache space, is more resilient to instruction cache misses, and is a natural fit for techniques that require fetching multiple threads, like multithreading, dual-path execution, and speculative threads.

Execution-based Prediction Using Speculative Slices

Craig B. Zilles and Gurindar S. Sohi

A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately anticipated using existing prefetching or branch prediction mechanisms.

The behavior of many problem instructions can be predicted by executing a small code fragment called a speculative slice. If a speculative slice is executed before the corresponding problem instructions are fetched, then the problem instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups up to 43 percent over an aggressive baseline machine.

To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the programis execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.

A Programmable Co-processor for Profiling

Craig B. Zilles and Gurindar S. Sohi

Aggressive program optimization requires accurate profile information, but such accuracy requires many samples to be collected. We explore a novel profiling architecture that reduces the overhead of collecting each sample by including a programmable co-processor that analyzes a stream of profile samples generated by a microprocessor. From this stream of samples, the co-processor can detect correlations between instructions (e.g., memory dependence profiling) as well as those between different dynamic instances of the same instruction (e.g., value profiling). The profiler's programmable nature allows a broad range of data to be extracted, post-processed, and formatted, as well as provides the flexibility to tailor the profiling application to the program under test. Because the co-processor is specialized for profiling, it can execute profiling applications more efficiently than a general-purpose processor. The co-processor should not significantly impact the cost or performance of the main processor because it can be implemented using a small number of transistors at the chip's periphery.

We demonstrate the proposed design through a detailed evaluation of load value profiling. Our implementation quickly and accurately estimates the value invariance of loads, with time overhead roughly proportional to the size of the instruction working set of the program. This algorithm demonstrates a number of general techniques for profiling, including: estimating the completeness of a profile, a means to focus profiling on particular instructions, and management of profiling resources.

Speculative Data-Driven Multithreading

Amir Roth and Gurindar S. Sohi

Mispredicted branches and loads that miss in the cache cause the majority of retirement stalls experienced by sequential processors; we call these critical instructions. Despite their importance, a sequential processor has difficulty prioritizing critical computations (computations of critical instructions), because it must fetch all computations sequentially, regardless of their contribution to performance. Speculative data-driven multithreading (DDMT) is a general-purpose mechanism for overcoming this limitation.

In DDMT, critical computations are annotated so that they can execute standalone. When the processor predicts an upcoming instance of a critical instruction, it microarchitecturally forks a copy of its computation as a new kind of speculative thread: a data-driven thread (DDT). The DDT executes in parallel with the main program thread, but typically generates the critical result much faster since it fetches and executes only the critical computation and not the whole program. A DDT pre-executes a critical computation and effectively consumes its latency on behalf of the main thread. A DDMT component called integration incorporates results computed in DDTs directly into the main thread, sparing it from having to repeat the work.

We simulate an implementation of DDMT on top of a simultaneous multithreading (SMT) processor and use program pro- files to create DDTs and annotate them into the executable. Our experiments show that DDMT pre-execution of critical loads and branches can improve performance significantly.

A Static Power Model for Architects

J. Adam Butts and Gurindar S. Sohi

Static power dissipation due to transistor leakage constitutes an increasing fraction of the total power in modern semiconductor technologies. Current technology trends indicate that the contribution will increase rapidly, reaching one half of total power dissipation within three process generations. Developing power efficient products will require consideration of static power in the earliest phases of design, including architecture and microarchitecture definition. We propose a simple equation for estimating static power consumption at the architectural level: Pstatic = Vcc * N * kdesign * Ileak, where Vcc is the supply voltage, N is the number of transistors, kdesign is a design dependent parameter, and Ileak is a technology dependent parameter. This model enables high-level reasoning about the likely static power demands of alternative microarchitectures. Reasonably accurate values for the factors within the equation may be obtained directly from the high-level designs or by straightforward scaling arguments. The factors within the equation also suggest opportunities for static power optimization, including reducing the total number of devices, partitioning the design to allow for lower supply voltages or slower, less leaky transistors, turning off unused devices, favoring certain design styles, and favoring high bandwidth over low latency. Speculation is also examined as a means to employ slower transistors without a significant performance penalty.

Register Integration: A Simple and Efficient Implementation of Squash Reuse

Amir Roth and Gurindar S. Sohi

Register integration (or simply integration) is a mechanism for incorporating speculative results directly into a sequential execution using data-dependence relationships. In this paper, we use integration to implement squash reuse, the salvaging of instruction results that were needlessly discarded during the course of sequential recovery from a control- or data- mis-speculation.

To implement integration, we first allow the results of squashed instructions to remain in the physical register file past mis-speculation recovery. As the processor re-traces portions of the squashed path, integration logic examines each instruction as it is being renamed. Using an auxiliary table, this circuit searches the physical register file for the physical register belonging to the corresponding squashed instance of the instruction. If this register is found, integration succeeds and the squashed result is re-validated by a simple update of the rename table. Once integrated, an instruction is complete and may bypass the out-of-order core of the machine entirely. Integration reduces contention for queuing and execution resources, collapses dependent chains of instructions and accelerates the resolution of branches. It achieves this using only rename-table manipulations; no additional values are read from or written to the physical registers.

Our preliminary evaluation shows that a minimal integration configuration can provide performance improvements of up to 8% when applied to current-generation micro-architectures and up to 11.5% when applied to more aggressive micro-architectures. Integration also reduces the amount of wasteful speculation in the machine, cutting the number of instructions executed by up to 15% and the number of instructions fetched along mis-speculated paths by as much as 6%.

Speculative Miss/Execute Decoupling

Amir Roth, Craig B. Zilles and Gurindar S. Sohi

The decoupled access/execute architecture described a machine that enables the access of memory values to be decoupled from the consumption of those values. Although never widely adopted in its original form, the decoupled design is a compelling way to tolerate memory latency. In this paper, we propose and demonstrate a novel implementation of decoupling, one based on the following two refinements of the original idea. First, because the latency of cache hits can generally be tolerated, we only decouple from the main program accesses that are likely to miss in the cache. Second, our decoupling takes place at the microarchitectural level, not the architectural level. By treating the access stream as a speculative thread and not allowing it to modify the architectural state of the machine, we relax the correctness constraints that were placed on it in the original design. For many programs, this added flexibility enables a level of decoupling and, consequently, latency tolerance that could not be achieved under the more constrained architectural model.

Understanding the Backward Slices of Performance Degrading Instructions

Craig B. Zilles and Gurindar S. Sohi

For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices.

This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.

The Use of Multithreading for Exception Handling

Craig B. Zilles, Joel S. Emer, and Gurindar S. Sohi

Common hardware exceptions, when implemented by trapping, unnecessarily serialize program execution in dynamically scheduled superscalar processors. To avoid the consequences of trapping the main thread, multithreaded CPUs can exploit control and data independence by executing the exception handler in a seperate hardware context. The main thread doesn't squash instructions after the squashing instruction, conserving fetch bandwidth and allowing execution of instructions independent of the exception. This leads to earlier branch resolution in the branch resolution code and additional memory latency tolerance. As proof of concept, using threads to handle software TLB misses is shown to provide performance approaching that of an agressive hardware TLB miss handler.

Improving Virtual Function Call Target Prediction via Dependence-Based Pre-Computation

Amir Roth, Andreas Moshovos and Gurindar S. Sohi

We introduce dependence-based pre-computation as a complement to history-based target prediction schemes. We present pre-computation in the context of virtual function calls (v-calls), a class of control transfers that is becoming increasingly important and has resisted conventional prediction. Our proposed technique dynamically identifies the sequence of operations that computes a v-call's target. When the first instruction in such a sequence is encountered, a small execution engine speculatively and aggressively pre-executes the rest. The pre-computed target is stored and subsequently used when a prediction needs to be made. We show that a common v-call instruction sequence can be exploited to implement pre-computation using a previously proposed prefetching mechanism and minimal additional hardware. In a suite of C++ programs, dependence-based pre-computation eliminates 46% of the mispredictions incurred by a simple BTB and 24% of those associated with a path-based two-level predictor.

Effective Jump-Pointer Prefetching for Linked Data Structures

Amir Roth and Gurindar S. Sohi

Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency. Jump-pointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports four prefetching idioms: queue, full, chain, and root jumping and three implementations: software-only, hardware-only, and a cooperative software/hardware technique. On a suite of pointer intensive programs, jump-pointer prefetching reduces memory stall time by 72% for software, 83% for cooperative and 55% for hardware, producing speedups of 15%, 20% and 22% respectively.

Understanding the Differences Between Value Prediction and Instruction Reuse

Avinash Sodani and Gurindar S. Sohi

Recently two hardware techniques - Value Prediction (VP) and Instruction Reuse (IR) - have been proposed for exploiting the redundancy in programs to collapse data dependences. In this paper, we attempt to understand the different ways in which VP and IR interact with other microarchitectural features and the impact of such interactions on net performance. More specifically, we perform the following tasks: (i) we identify the various differences between the two techniques and qualitatively discuss their microarchitectural interactions, (ii) we evaluate the impact on performance of these interactions, and (iii) since IR is more restrictive of the two techniques, we also estimate the amount of total redundancy, present in programs, that can be captured by IR. Our results show that the performance obtained by VP is sensitive to the way branches with value-speculative operands are handled. We also see that, although IR captures less amount of redundancy, it may perform equally well because it validates results early, it is non-speculative, and it reduces branch misprediction penalty. Finally, we show that 84-97% of redundancy in programs can be reused, implying that the approach of detecting redundant instructions non-speculatively, based on their operands, does not significantly restrict IR's ability to capture redundancy present in programs.

Task Selection for a Multiscalar Processor

T.N. Vijaykumar and Gurindar S. Sohi

The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP windows.

An Empirical Analysis of Instruction Repetition

Avinash Sodani and Gurindar S. Sohi

We study the phenomenon of instruction repetition, where the inputs and outputs of multiple dynamic instances of a static instruction are repeated. We observe that over 80% of the dynamic instructions executed in several programs are repeated and most of the repetition is due to a small number of static instructions. We attempt to understand the source of this repetitive behavior by categorizing dynamic program instructions into dynamic program slices at both a global level and a local (within function) level. We observe that repeatability is more an artifact of how computation is expressed, and less of program inputs. Function-level analysis suggests that many functions are called with repeated arguments, though almost all of them have side effects. We provide commentary on exploiting the observed phenomenon and its sources in both software and hardware.`

Dependence Based Prefetching for Linked Data Structures

Amir Roth, Andreas Moshovos and Gurindar S. Sohi

We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.

Exploitation Idle Floating-Point Resources for Integer Execution

S. Subramanya Sastry, Subbarao Palacharla, and Jim Smith

In conventional superscalar microarchitectures with partitioned integer and floating-point resources, all floating-point resources are idle during execution of integer programs. Palacharla and Smith \cite{intdec} addressed this drawback and proposed that the floating-point subsystem be augmented to support integer operations. The hardware changes required are expected to be fairly minimal. To exploit these idle floating resources, the compiler must identify integer code that can be profitably offloaded to the augmented floating-point subsystem. In this paper, we present two compiler algorithms to do this. The basic scheme offloads integer computation to the floating-point subsystem using existing program loads/stores for inter-partition communication. For the SPECINT95 benchmarks, we show that this scheme offloads from 5% to 29% of the total dynamic instructions to the floating-point subsystem. The advanced scheme inserts copy instructions and duplicates some instructions to further offload computation. We evaluate the effectiveness of the two schemes using timing simulation. We show that the advanced scheme can offload from 9% to 41% of the total dynamic instructions to the floating-point subsystem. In doing so, speedups from 3% to 23% are achieved over a conventional microarchitecture.

Speculative Versioning Cache

Sridhar Gopal, T.N. Vijaykumar, J. E. Smith and G. S. Sohi

Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB) uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. A preliminary evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance, and private cache solutions trade-off hit rate for hit latency.

Exploiting Dead Value Information

Milo M. Martin, Amir Roth, Charles N. Fischer

We describe Dead Value Information (DVI) and introduce three new optimizations which exploit it. DVI provides assertions that certain register values are dead, meaning they will not be read before being overwritten. The processor can use DVI to track dead registers and dynamically eliminate unnecessary save and restore instructions from the execution stream at procedure calls and context switches. Our results indicate that dynamic saves and restore instances can be reduced by 46% for procedure calls and by 51% for context switches. In addition, save/restore elimination for procedure calls can improve overall performance by up to 5%. DVI also allows the processor manage physical registers to efficiently, reducing the size requirements of the physical register file. When the system clock rate is proportional to the register file cycle time, this optimization can improve performance. All of these optimizations can be supported with only a few new instructions and minimal additional hardware structures.

Streamlining Inter-operation Memory Communication via Data Dependence Prediction

Andreas Moshovos and Gurindar S. Sohi

We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication:

(1) We use data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access.

(2) We also use data dependence prediction to convert, def-store-load-use chains within the instruction window into def-use chains prior to address calculation and disambiguation.

(3) We use true and output data dependence status prediction to introduce and manage a small storage structure called the Transient Value Cache (TVC). The TVC captures memory values that are short-lived. It also captures recently stored values that are likely to be accessed soon. Accesses that are serviced by the TVC do not have to be serviced by other parts of the memory hierarchy, e.g., the data cache.

The first two techniques are aimed at reducing the effective communication latency whereas the last technique is aimed at reducing data cache bandwidth requirements. Experimental analysis of the proposed techniques shows that: (i) the proposed speculative communication methods correctly handle a large fraction of memory dependences, and (ii) a large number of the loads and stores do not have to ever reach the data cache when the TVC is in place.

Path-Based Next Trace Prediction

Quinn Jacobson, Eric Rotenberg and James E. Smith

The trace cache has been proposed as a mechanism for providing increased fetch bandwidth by allowing the processor to fetch across multiple branches in a single cycle. But to date predicting multiple branches per cycle has meant paying a penalty in prediction accuracy. We propose a next trace predictor that treats the traces as basic units and explicitly predicts sequences of traces. The predictor collects histories of trace sequences (paths) and makes predictions based on these histories. The basic predictor is enhanced to a hybrid configuration that reduces performance losses due to cold starts and aliasing in the prediction table. The Return History Stack is introduced to increase predictor performance by saving path history information across procedure call/returns. Overall, the predictor yields about a 26% reduction in misprediction rates when compared with the most aggressive previously proposed, multiple-branch-prediction methods.

The Predictability of Data Values

Yiannakis Sazeides and James E. Smith

The predictability of data values is studied at a fundamental level. Two basic predictor models are defined: Computational predictors perform an operation on previous values to yield predicted next values. Examples we study are stride value prediction (which adds a delta to a previous value) and last value prediction (which performs the trivial identity operation on the previous value); Context Based} predictors match recent value history (context) with previous value history and predict values based entirely on previously observed patterns.

To understand the potential of value prediction we perform simulations with unbounded prediction tables that are immediately updated using correct data values. Simulations of integer SPEC95 benchmarks show that data values can be highly predictable. Best performance is obtained with context based predictors; overall prediction accuracies are between 56% and 91%. The context based predictor typically has an accuracy about 20% better than the computational predictors (last value and stride).
Comparison of context based prediction and stride prediction shows that the higher accuracy of context based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors. Among different instruction types, predictability varies significantly. In general, load and shift instructions are more difficult to predict correctly, whereas add instructions are more predictable.

Keywords: Prediction, Value Prediction, Context Based Prediction, Stride Prediction, Last Value Prediction

Trace Processors

Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, Jim Smith

Traces are dynamic instruction sequences constructed and cached by hardware. A microarchitecture organized around traces is presented as a means for efficiently executing many instructions per cycle. Trace processors exploit both control flow and data flow hierarchy to overcome complexity and architectural limitations of conventional superscalar processors by (1) distributing execution resources based on trace boundaries and (2) applying control and data prediction at the trace level rather than individual branches or instructions.

Three sets of experiments using the SPECInt95 benchmarks are presented. (i) A detailed evaluation of trace processor configurations: the results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle). We also isolate the impact of distributed resources, and quantify the value of successively doubling the number of distributed elements. (ii) A trace processor with data prediction applied to inter-trace dependences: potential performance improvement with perfect prediction is around 45% for all benchmarks. With realistic prediction, gcc achieves an actual improvement of 10%. (iii) Evaluation of aggressive control flow: some benchmarks benefit from control independence by as much as 10%.

Dynamic Speculation and Synchronization of Data Dependences

Andreas I. Moshovos, Scott E. Breach, T.N. Vijaykumar, Gurindar S. Sohi

Data dependence speculation is used in instruction-level parallel processors to allow the early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated) In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented with the context of a Multiscalar processor.

Dynamic Instruction Reuse

Avinash Sodani and Gurindar S. Sohi

This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly -- their results can be obtained from a buffer where they were saved previously. This paper presents three hardware schemes for exploiting the phenomenon of dynamic instruction reuse, and evaluates their effectiveness using execution-driven simulation. We find that in some cases over 50% of the instructions can be reused. The speedups so obtained, though less striking than the percentage of instructions reused, are still quite significant.

Complexity-Effective Superscalar Processors

Subbarao Palacharla, Norman P. Jouppi, James E. Smith

The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8um, 0.35um, and 0.18um. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster -- consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.

Control Flow Speculation in Multiscalar Processors

Quinn Jacobson, Steve Bennett, Nikhil Sharma, J. E. Smith

The Multiscalar architecture executes a single sequential program following multiple flows of control. In the Multiscalar hardware, a global sequencer, with help from the compiler, takes large steps through the program's control flow graph (CFG) speculatively, starting a new thread of control (task) at each step. This is inter-task control flow speculation. Within a task, traditional control flow speculation is used to extract instruction level parallelism. This is intra-task control flow speculation. This paper focuses on mechanisms to implement inter-task control flow speculation (task prediction) in a Multiscalar implementation. This form of speculation has fundamental differences from traditional branch prediction. We look in detail at the issues of prediction automata, history generation and target buffers. We present implementations in each of these areas that offer good accuracy, size and performance characteristics.

The Performance Potential of Data Dependence Speculation & Collapsing

Yiannakis Sazeides, Stamatis Vassiliadis, J. E. Smith

(To be made available soon)

Assigning Confidence to Conditional Branch Predictions

Erik Jacobsen, Eric Rotenberg, J. E. Smith

(To be made available soon)

Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching

Eric Rotenberg, Steve Bennett, J. E. Smith

As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4 kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.

High-Bandwidth Address Translation for Multiple-Issue Processors

T. M. Austin and G. S. Sohi

In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency. We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multi ported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggy back ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times. We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physi cally indexed caches.

Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency

T. M. Austin and G. S. Sohi

Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mech anism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for pipelines with a single stage of instruction decode, and another for pipelines with multiple decode stages . We evaluate these designs in a number of contexts: with and without software support, in-order vs. out-of-order issue, and on architectures with many and few registers. We find that our approach is quite ef fective at reducing the impact of load latency, even more so on architectures with in-order issue and few registers.

The Microarchitecture of Superscalar Processors

J. E. Smith and G. S. Sohi

(To be made available soon)

ARB: A Hardware Mechanism for Dynamic Reordering of Memory References

M. Franklin and G. S. Sohi

To exploit instruction level parallelism, it is important not only to execute multiple memory references per cycle, but also to reorder memory references, especially to execute loads before stores that precede them in the sequential instruction stream. To guarantee correctness of execution in such situations, memory reference addresses have to be disambiguated. This paper presents a novel hardware mechanism, called an Address Resolution Buffer (ARB), for performing dynamic reordering of memory references. The ARB supports the following features: (i) dynamic memory disambiguation in a decentralized manner, (ii) multiple memory references per cycle, (iii) out-of-order execution of memory references, (iv) unresolved loads and stores, (v) speculative loads and stores, and (vi) memory renaming. The paper presents the results of a simulation study that we conducted to verify the efficacy of the ARB for a superscalar processor. The paper also shows the ARB's application in a multiscalar processor.

Multiscalar Processors

G. S. Sohi, S. Breach, and T. N. Vijaykumar

Multiscalar processors use a new, aggressive implementation paradigm for extracting large quantities of instruction level parallelism from ordinary high level language programs. A single program is divided into a collection of tasks by a combination of software and hardware. The tasks are distributed to a number of parallel processing units which reside within a processor complex. Each of these units fetches and executes instructions belonging to its assigned task. The appearance of a single logical register file is maintained with a copy in each parallel processing unit. Register results are dynamically routed among the many parallel processing units with the help of compiler-generated masks. Memory accesses may occur speculatively without knowledge of preceding loads or stores. Addresses are disambiguated dynamically, many in parallel, and processing waits only for true data dependences. This paper presents the philosophy of the multiscalar paradigm, the structure of multiscalar programs, and the hardware architecture of a multiscalar processor. The paper also discusses performance issues in the multiscalar model, and compares the multiscalar paradigm with other paradigms. Experimental results evaluating the performance of a sample of multiscalar organizations are also presented.

Streamlining Data Cache Access with Fast Address Calculation

T. M. Austin, D. N. Pnevmatikatos, and G. S. Sohi

For many programs, especially integer codes, untolerated load in struction latencies account for a significant portion of total execution time. In this paper,we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores. Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache. If the prediction is correct, the cache access is overlapped with non-speculative effective address calculation. Otherwise, the cache is accessed again in the following cycle, this time using the correct effective address. The impact on the cache access critical path is minimal; the prediction circuitry adds only a single OR operation before cache access can commence. In addition, verification of the predicted effective address is completely decoupled from the cache access critical path. Analyses of program reference behavior and subsequent performance analysis of this approach shows that this design is a good one, servicing enough accesses early enough to result in speedups for all the programs we tested. Our approach also responds well to software support, which can significantly reduce the number of mispredicted effective addresses, in many cases providing better program speedups and reducing cache bandwidth requirements.

The Anatomy of the Register File in a Multiscalar Processor

S. Breach, T. N. Vijaykumar, and G. S. Sohi

This paper presents the operation of the register file in the Multiscalar architecture. The register file provides the appearance of a logically centralized register file, yet is implemented as physically decentralized register files, queues, and control logic in a Multiscalar processor. We address the key issues of storage, communication, and synchronization required for a successful design and discuss the complications that arise in the face of speculation. In particular, the hardware required to implement the register file is detailed, and software support to streamline the operation of the register file is described. Illustrative examples detailing important aspects of the operation of the register file and an evaluation of its effectiveness are provided.

Efficient Detection of All Pointer and Array Access Errors

T. M. Austin, S. E. Breach and G. S. Sohi

We present a pointer and array access checking technique that provides complete error coverage through a simple set of program transformations. Our technique, based on an extended safe pointer representation, has a number of novel aspects. Foremost, it is the first technique that detects all spatial and temporal access errors. Its use is not limited by the expressiveness of the language; that is, it can be applied successfully to compiled or interpreted languages with subscripted and mutable pointers, local references, and explicit and typeless dynamic storage management, e.g., C. Because it is a source level transformation, it is amenable to both compile- and run-time optimization. Finally, its performance, even without compile-time optimization, is quite good. We implemented a prototype translator for the C language and analyzed the checking overheads of six non-trivial, pointer intensive programs. Execution overheads range from 130% to 540%; with text and data size overheads typically below 100%.

Guarded Execution and Branch Prediction in Dynamic ILP Processors

D. Pnevmatikatos and G. S. Sohi

We evaluate the effects of guarded (or conditional, or predicated) execution on the performance of an instruction level parallel processor employing dynamic branch prediction. First, we assess the utility of guarded execution, both qualitatively and quantitatively, using a variety of application programs. Our assessment shows that guarded execution significantly increases the opportunities, for both compiler and dynamic hardware, to extract and exploit parallelism. However , existing methods of specifying guarded execution have several drawbacks that limit its use. Second, we study the interaction of guarded execution and dynamic branch prediction and show that the use of guarded execution significantly increases the number of instructions between mispredicted branches. Third, we propose a new method of specifying guarded execution. The proposed method uses special GUARD instructions, which can be used to incorporate guarded execution into existing instruction sets. GUARD instructions realize the full power of guarded execution, without the drawbacks of existing methods of specifying guarded execution.

Control Flow Prediction for Dynamic ILP Processors

D. Pnevmatikatos, M. Franklin and G. S. Sohi

We introduce a technique to enhance the ability of dynamic ILP processors to exploit (speculatively exe cuted) parallelism. Existing branch prediction mechanisms used to establish a dynamic window from which ILP can be extracted are limited in their abilities to: (i) create a large, accurate dynamic window, (ii) initiate a large number of instructions into this window in every cycle, and (iii) traverse multiple branches of the control flow graph per prediction. We introduce control flow prediction which uses information in the control flow graph of a program to overcome these limitations. We discuss how information present in the control flow graph can be represented using multiblocks, and conveyed to the hardware using Control Flow Tables and Control Flow Prediction Buffers. We evaluate the potential of control flow prediction on an abstract machine and on a dynamic ILP processing model. Our results indicate that control flow prediction is a powerful and effective assist to the hardware in making more informed run time decisions about program control flow.

Register Traffic Analysis for Streamlining Inter-operation Communication in Fine-Grain Parallel Processors

M. Franklin and G. S. Sohi

Traditionally, register files have been the primary agent for inter-operation communication in load/store architectures. As processors start issuing multiple instructions per cycle, a centralized register file can easily become a bottleneck . This paper analyzes the register file traffic in a load/store architecture with a view to motivate the development of alternate inter-operation communication mechanisms that reduce the bandwidth demanded of a centralized register file.

We first provide metrics to characterize the register traffic. These metrics deal with the degree and locality of use of the register instances created. We then present the results of a simulation study that uses the MIPS R2000 architecture and the SPEC benchmark programs. We have two major results. First, a large number of the register instances are used only once, and the average degree of use of register instances is about 2. Second, most of the register instances are used up soon after they are created (within about 30-40 instructions). This suggests that alternate inter-operation communication mechanisms that exploit the temporal locality of use of register instances are likely to be effective in reducing the traffic burden on the centralized register file. The second result was pivotal in the design of the distributed register file for the multiscalar processing paradigm.

The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism

M. Franklin and G. S. Sohi

We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain parallelism by overlapping the execution of multiple windows. The basic idea is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. This processing paradigm shares a number of properties of the restricted dataflow machines, but was derived from the sequential von Neumann architecture. We also present an implementation of the Expandable Split Window execution model, and preliminary performance results.

Dynamic Dependency Analysis of Ordinary Programs

T.M. Austin and G. S. Sohi

(To be made available soon)