California State University Fullerton Microarchitecture of Superscalar Processors Discussion

Question Description

Review and critically evaluate the concepts, techniques, designs and observations in the IEEE invited paper by James Smith and Gurindar Sohi (attached below).

Create a set of MS-PowerPoint slides that explain the contents of the paper. Do not use more than five bullet points per slide. Use Times New Roman font with font size 24 for slide text. Use clear and labeled figures wherever necessary. Figures must be explained. Convert your set of slides into a single PDF document. Upload the PDF document on TITANium.

Unformatted Attachment Preview

The Microarchitecture of Superscalar Processors

JAMES E. SMITH, MEMBER, IEEE, AND GURINDAR S. SOHI, SENIOR MEMBER, IEEE

Invited Paper

Superscalar processing is the latest in a long series of innovations aimed at producing ever faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with a discussion of the general problem solved by superscalar processors: converting an ostensibly sequential program into a more parallel one. The principles underlying this process, and the constraints that must be met, are discussed. The paper then provides a description of the specific implementation techniques used in the important phases of superscalar processing. The major phases include: 1) instruction fetching and conditional branch processing, 2) the determination of data dependences involving register values, 3) the initiation, or issuing, of instructions for parallel execution, 4) the communication of data values through memory via loads and stores, and 5) committing the process state in correct order so that precise interrupts can be supported. Examples of recent superscalar microprocessors, the MIPS R10000, the DEC 21164, and the AMD K5, are used to illustrate a variety of superscalar methods.

I. INTRODUCTION

Superscalar processing, the ability to initiate multiple instructions during the same clock cycle, is the latest in a long series of architectural innovations aimed at producing ever faster microprocessors. Introduced at the beginning of this decade, superscalar microprocessors are now being designed and produced by all the microprocessor vendors for high-end products. Although viewed by many as an extension of the reduced instruction set computer (RISC) movement of the 1980’s, superscalar implementations are in fact heading toward increasing complexity.
And superscalar methods have been applied to a spectrum of instruction sets, ranging from the DEC Alpha, the “newest” RISC instruction set, to the decidedly non-RISC Intel x86 instruction set.

Manuscript received February 9, 1995; revised August 23, 1995. This work was supported in part by NSF Grants CCR-9303030 and MIP-9505853, in part by ONR Grant N00014-93-1-0465, in part by the University of Wisconsin Graduate School, and in part by Intel Corporation. J. E. Smith is with the Department of Electrical and Computer Engineering, The University of Wisconsin, Madison, WI 53706 USA. G. S. Sohi is with the Computer Sciences Department, The University of Wisconsin, Madison, WI 53706 USA. IEEE Log Number 9415183.

A typical superscalar processor fetches and decodes the incoming instruction stream several instructions at a time. As part of the instruction fetching process, the outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions. The incoming instruction stream is then analyzed for data dependences, and instructions are distributed to functional units, often according to instruction type. Next, instructions are initiated for execution in parallel, based primarily on the availability of operand data, rather than their original program sequence. This important feature, present in many superscalar implementations, is referred to as dynamic instruction scheduling. Upon completion, instruction results are resequenced so that they can be used to update the process state in the correct (original) program order in the event that an interrupt condition occurs. Because individual instructions are the entities being executed in parallel, superscalar processors exploit what is referred to as instruction level parallelism (ILP).

A. Historical Perspective

Instruction level parallelism in the form of pipelining has been around for decades.
A pipeline acts like an assembly line with instructions being processed in phases as they pass down the pipeline. With simple pipelining, only one instruction at a time is initiated into the pipeline, but multiple instructions may be in some phase of execution concurrently. Pipelining was initially developed in the late 1950’s [8] and became a mainstay of large scale computers during the 1960’s. The CDC 6600 [61] used a degree of pipelining, but achieved most of its ILP through parallel functional units. Although it was capable of sustained execution of only a single instruction per cycle, the 6600’s instruction set, parallel processing units, and dynamic instruction scheduling are similar to the superscalar microprocessors of today. Another remarkable processor of the 1960’s was the IBM 360/91 [3]. The 360/91 was heavily pipelined, and provided a dynamic instruction issuing mechanism, known as Tomasulo’s algorithm [63] after its inventor. As with the CDC 6600, the IBM 360/91 could sustain only one instruction per cycle and was not superscalar, but the strong influence of Tomasulo’s algorithm is evident in many of today’s superscalar processors.

0018-9219/95$04.00 © 1995 IEEE. PROCEEDINGS OF THE IEEE, VOL. 83, NO. 12, DECEMBER 1995.

The pipeline initiation rate remained at one instruction per cycle for many years and was often perceived to be a serious practical bottleneck. Meanwhile other avenues for improving performance via parallelism were developed, such as vector processing [28], [49] and multiprocessing [5], [6]. Although some processors capable of multiple instruction initiation were considered during the 1960’s and 1970’s [50], [62], none were delivered to the market. Then, in the mid-to-late 1980’s, superscalar processors began to appear [21], [43], [54]. By initiating more than one instruction at a time into multiple pipelines, superscalar processors break the single-instruction-per-cycle bottleneck.
In the years since its introduction, the superscalar approach has become the standard method for implementing high performance microprocessors.

B. The Instruction Processing Model

Because hardware and software evolve, it is rare for a processor architect to start with a clean slate; most processor designs inherit a legacy from their predecessors. Modern superscalar processors are no different. A major component of this legacy is binary compatibility, the ability to execute a machine program written for an earlier generation processor. When the very first computers were developed, each had its own instruction set that reflected specific hardware constraints and design decisions at the time of the instruction set’s development. Then, software was developed for each instruction set. It did not take long, however, until it became apparent that there were significant advantages to designing instruction sets that were compatible with previous generations and with different models of the same generation [2]. For a number of very practical reasons, the instruction set architecture, or binary machine language level, was chosen as the level for maintaining software compatibility.

The sequencing model inherent in instruction sets and program binaries, the sequential execution model, closely resembles the way processors were implemented many years ago. In the sequential execution model, a program counter is used to fetch a single instruction from memory. The instruction is then executed; in the process, it may load or store data to main memory and operate on registers held in the processor. Upon completing the execution of the instruction, the processor uses an incremented program counter to fetch the next instruction, with sequential instruction processing occasionally being redirected by a conditional branch or jump.

Should the execution of the program need to be interrupted and restarted later, for example in case of a page fault or other exception condition, the state of the machine needs to be captured. The sequential execution model has led naturally to the concept of a precise state. At the time of an interrupt, a precise state of the machine (architecturally visible registers and memory) is the state that would be present if the sequential execution model was strictly followed and processing was stopped precisely at the interrupted instruction. Restart could then be implemented by simply resuming instruction processing with the interrupted instruction.

Today, a computer designer is usually faced with maintaining binary compatibility, i.e., maintaining instruction set compatibility and a sequential execution model (which typically implies precise interrupts).¹ For high performance, however, superscalar processor implementations deviate radically from sequential execution; much has to be done in parallel. As a result, the program binary nowadays should be viewed as a specification of what has to be done, not how it is done in reality. A modern superscalar microprocessor takes the sequential specification as embodied in the program binary and removes much of the nonessential sequentiality to turn the program into a parallel, higher-performance version, yet the processor retains the outward appearance of sequential execution.

C. Elements of High Performance Processing

Simply stated, achieving higher performance means processing a given program in a smaller amount of time. Each individual instruction takes some time to fetch and execute; this time is the instruction’s latency. To reduce the time to execute a sequence of instructions (e.g., a program), one can: 1) reduce individual instruction latencies, or 2) execute more instructions in parallel.
Because superscalar processor implementations are distinguished by the latter (while adequate attention is also paid to the former), we will concentrate on the latter method in this paper. Nevertheless, a significant challenge in superscalar design is to not increase instruction latencies due to increased hardware complexity brought about by the drive for enhanced parallelism.

Parallel instruction processing requires: the determination of the dependence relationships between instructions, adequate hardware resources to execute multiple operations in parallel, strategies to determine when an operation is ready for execution, and techniques to pass values from one operation to another. When the effects of instructions are committed, and the visible state of the machine updated, the appearance of sequential execution must be maintained. More precisely, in hardware terms, this means a superscalar processor implements:

1) Instruction fetch strategies that simultaneously fetch multiple instructions, often by predicting the outcomes of, and fetching beyond, conditional branch instructions.
2) Methods for determining true dependences involving register values, and mechanisms for communicating these values to where they are needed during execution.
3) Methods for initiating, or issuing, multiple instructions in parallel.
4) Resources for parallel execution of many instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references.
5) Methods for communicating data values through memory via load and store instructions, and memory interfaces that allow for the dynamic and often unpredictable performance behavior of memory hierarchies. These interfaces must be well matched with the instruction execution strategies.
6) Methods for committing the process state in correct order; these mechanisms maintain an outward appearance of sequential execution.

Although we will discuss the above items separately, in reality they cannot be completely separated, nor should they be. In good superscalar designs they are often integrated in a cohesive, almost seamless, manner.

¹ Some recently developed instruction sets relax the strictly sequential execution model by allowing a few exception conditions to result in an “imprecise” saved state where the program counter is inconsistent with respect to the saved registers and memory values.

D. Paper Overview

In Section II, we discuss the general problem solved by superscalar processors: converting an ostensibly sequential program into a parallel one. This is followed in Section III with a description of specific techniques used in typical superscalar microprocessors. Section IV focuses on three recent superscalar processors that illustrate the spectrum of current techniques. Section V presents conclusions and discusses future directions for instruction level parallelism.

II. PROGRAM REPRESENTATION, DEPENDENCES AND PARALLEL EXECUTION

An application begins as a high level language program; it is then compiled into the static machine level program, or the program binary. The static program in essence describes a set of executions, each corresponding to a particular set of data that is given to the program. Implicit in the static program is the sequencing model, the order in which the instructions are to be executed. Fig. 1 shows the assembly code for a high level language program fragment (the assembly code is the human-readable version of the machine code). We will use this code fragment as a working example.

Fig. 1(a), high level language program fragment:

    for (i=0; i<last; i++) {
        if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
            change++;
        }
    }

As a static program executes with a specific set of input data, the sequence of executed instructions forms a dynamic instruction stream.
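To make the distinction between the static program and the dynamic instruction stream concrete, the following minimal sketch interprets the working example's assembly code in Python. The tuple encoding, the `run` function, the initial register values (r4 = last, r5 = change, r6 = i, r7 pointing at a[0]) and the 4-byte word size are assumptions made for illustration, inferred from the comments in Fig. 1; they are not taken from the paper.

```python
# Static program: the Fig. 1 assembly as (label, opcode, operands) tuples.
# This encoding and the tiny interpreter below are illustrative assumptions.
PROGRAM = [
    ("L2", "move", ("r3", "r7")),
    (None, "lw",   ("r8", "r3")),
    (None, "add",  ("r3", "r3", 4)),
    (None, "lw",   ("r9", "r3")),
    (None, "ble",  ("r8", "r9", "L3")),
    (None, "move", ("r3", "r7")),
    (None, "sw",   ("r9", "r3")),
    (None, "add",  ("r3", "r3", 4)),
    (None, "sw",   ("r8", "r3")),
    (None, "add",  ("r5", "r5", 1)),
    ("L3", "add",  ("r6", "r6", 1)),
    (None, "add",  ("r7", "r7", 4)),
    (None, "blt",  ("r6", "r4", "L2")),
]


def run(array, base=0):
    """Run one pass over `array` (at least two elements).

    Returns (dynamic_stream, resulting_array, change), where dynamic_stream
    is the list of static instruction indices in the order they executed.
    """
    labels = {lab: idx for idx, (lab, _, _) in enumerate(PROGRAM) if lab}
    mem = {base + 4 * k: v for k, v in enumerate(array)}
    regs = {"r4": len(array) - 1, "r5": 0, "r6": 0, "r7": base}
    pc, stream = 0, []
    while pc < len(PROGRAM):
        _, op, args = PROGRAM[pc]
        stream.append(pc)
        next_pc = pc + 1                      # default: incremented program counter
        if op == "move":
            regs[args[0]] = regs[args[1]]
        elif op == "lw":
            regs[args[0]] = mem[regs[args[1]]]
        elif op == "sw":
            mem[regs[args[1]]] = regs[args[0]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "ble" and regs[args[0]] <= regs[args[1]]:
            next_pc = labels[args[2]]         # program counter updated by a taken branch
        elif op == "blt" and regs[args[0]] < regs[args[1]]:
            next_pc = labels[args[2]]
        pc = next_pc
    result = [mem[base + 4 * k] for k in range(len(array))]
    return stream, result, regs["r5"]


if __name__ == "__main__":
    for data in ([3, 1, 2], [1, 2, 3]):
        stream, result, change = run(data)
        print(data, "->", result, "change =", change, "dynamic length =", len(stream))
```

Under these assumptions, an input that is already sorted takes the ble branch on every iteration and so produces a shorter dynamic stream than an input that needs swaps, even though the static program is identical in both cases.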
As long as instructions to be executed are consecutive, static instructions can be entered into the dynamic sequence simply by incrementing the program counter, which points to the next instruction to be executed. When there is a conditional branch or jump, however, the program counter may be updated to a nonconsecutive address. An instruction is said to be control dependent on its preceding dynamic instruction(s), because the flow of program control must pass through preceding instructions first. The two methods of modifying the program counter, incrementing and updating, result in two types of control dependences (though typically when people talk about control dependences, they tend to ignore the former).

    L2: move r3,r7      #r3->a[i]
        lw   r8,(r3)    #load a[i]
        add  r3,r3,4    #r3->a[i+1]
        lw   r9,(r3)    #load a[i+1]
        ble  r8,r9,L3   #branch a[i]>a[i+1]
        move r3,r7      #r3->a[i]
        sw   r9,(r3)    #store a[i]
        add  r3,r3,4    #r3->a[i+1]
        sw   r8,(r3)    #store a[i+1]
        add  r5,r5,1    #change++
    L3: add  r6,r6,1    #i++
        add  r7,r7,4    #r4->a[i]
        blt  r6,r4,L2   #branch i<last

Fig. 1. […] a[i+1]. The variable change keeps track of the number of switches (if change = 0 at the end of a pass through the array, then the array is sorted).

The first step in increasing instruction level parallelism is to overcome control dependences. Control dependences due to an incrementing program counter are the simplest, and we deal with them first. One can view the static program as a collection of basic blocks, where a basic block is a contiguous block of instructions, with a single entry point and a single exit point [1]. In the assembly code of Fig. 1, there are three basic blocks. The first basic block consists of the five instructions between the label L2 and the ble instruction, inclusive, the second basic block consists of the five instructions between the ble instruction, exclusive, and the label L3, and the third basic block consists of the three instructions between the label L3 and the blt instruction, inclusive.
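The partition into basic blocks described above can be reproduced with the common "leader" rule: a block starts at the first instruction, at every branch target, and at every instruction that follows a branch. The sketch below is a minimal illustration in Python; the `(label, opcode)` encoding, the `BRANCHES` set and the `basic_blocks` function are hypothetical names, and treating every labeled instruction as a branch target is a simplifying assumption that happens to hold for Fig. 1.

```python
# Opcodes that end a basic block in this example (assumption: only the
# two conditional branches that appear in Fig. 1).
BRANCHES = {"ble", "blt"}

# The Fig. 1 assembly reduced to (label, opcode) pairs; operands are not
# needed to find block boundaries.
PROGRAM = [
    ("L2", "move"), (None, "lw"), (None, "add"), (None, "lw"), (None, "ble"),
    (None, "move"), (None, "sw"), (None, "add"), (None, "sw"), (None, "add"),
    ("L3", "add"), (None, "add"), (None, "blt"),
]


def basic_blocks(program):
    """Split `program` into basic blocks using the leader rule."""
    leaders = {0}                                  # the first instruction
    for idx, (label, opcode) in enumerate(program):
        if label is not None:                      # a labeled instruction (branch target)
            leaders.add(idx)
        if opcode in BRANCHES and idx + 1 < len(program):
            leaders.add(idx + 1)                   # the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [program[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    for number, block in enumerate(basic_blocks(PROGRAM), start=1):
        print("block", number, [opcode for _, opcode in block])
```

Applied to the working example, this yields three blocks of five, five and three instructions, matching the description in the text.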
Once a basic block has been entered by the instruction fetcher, it is known that all the instructions in the basic block will be executed eventually. Therefore, any sequence of instructions in a basic block can be initiated into a conceptual window of execution, en masse. We consider the window of execution to be the full set of instructions that may be simultaneously considered for parallel execution. Once instructions have been initiated into this window of execution, they are free to execute in parallel, subject only to data dependence constraints (which we will discuss shortly).

Within individual basic blocks there is some parallelism, but to get more parallelism, control dependences due to updates of the program counter, especially due to conditional branches, have to be overcome. A method for doing this is to predict the outcome of a conditional branch and speculatively fetch and execute instructions from the predicted path. Instructions from the predicted path are entered into the window of execution. If the prediction is later found to be correct, then the speculative status of the instructions is removed, and their effect on the state is the same as any other instruction. If the prediction is later found to be incorrect, the speculative execution was incorrect, and recovery actions must be initiated so that the architectural process state is not incorrectly modified. We will discuss branch prediction and speculative execution in more detail in Section III.

In our example of Fig. 1, the branch instruction (ble) creates a control dependence. To overcome this dependence, the branch could be predicted as not taken, for example, with instructions between the branch and the label L3 being executed speculatively. Instructions that have been placed in the window of execution may begin execution subject to data dependence constraints.
Data dependences occur among instructions because the instructions may access (read or write) the same storage (a register or memory) location. When instructions reference the same storage location, a hazard is said to exist, i.e., there is the possibility of incorrect operation unless some steps are taken to make sure the storage location is accessed in correct order. Ideally, instructions can be executed subject only to true dependence constraints. These true dependences appear as read-after-write (RAW) hazards, because the consuming instruction can only read the value after the producing instruction has written it. It is also possible to have artificial dependences, and in the process of executing a program, these artificial dependences have to be overcome to increase the available level of parallelism. These artificial dependences result from write-after-read (WAR) and write-after-write (WAW) hazards. A WAR hazard occurs when an instruction needs to write a new value into a storage location, but must wait until all preceding instructions needing to read the old value have done so. A WAW hazard occurs when multiple instructions update the same storage location; it must appear that these updates occur in proper sequence. Artificial dependences can be caused in a number of ways, for example: by unoptimized code, by limited register storage, by the desire to economize on main memory storage, and by loops where an instruction can cause a hazard with itself. Fig. 2 shows some of the data hazards that are present in a segment of our working example. The move instruction produces a value in register r3 that is used both by the first lw and add instructions. This is a RAW hazard because a true dependence is present. The add instruction also creates a value which is bound to register r3. Accordingly, there is a WAW hazard involving the move and the add instructions.
A dynamic execution must ensure that accesses to r3 made by instructions that occur after the add in the program access the value bound to r3 by the add instruction and not the move instruction. Likewise, there is a WAR hazard involving the first lw and the add instructions. Execution must ensure that the value of r3 used in the lw instruction …

[Fig. 2. Example of data hazards involving registers.]
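The register hazards described for the first five instructions of the working example can be found mechanically. The sketch below is a minimal illustration in Python, not a description of any hardware mechanism in the paper: the `SEGMENT` encoding (text, registers written, registers read) and the `register_hazards` function are hypothetical. It tracks, for each register, the most recent writer and the instructions that have read the register since that write.

```python
# First basic block of the working example as
# (text, registers written, registers read). Encoding is an assumption.
SEGMENT = [
    ("move r3,r7",   {"r3"}, {"r7"}),
    ("lw r8,(r3)",   {"r8"}, {"r3"}),
    ("add r3,r3,4",  {"r3"}, {"r3"}),
    ("lw r9,(r3)",   {"r9"}, {"r3"}),
    ("ble r8,r9,L3", set(),  {"r8", "r9"}),
]


def register_hazards(instrs):
    """Return a list of (kind, register, earlier_index, later_index)."""
    last_writer = {}   # register -> index of the most recent writer
    readers = {}       # register -> indices that read it since that write
    hazards = []
    for j, (_, writes, reads) in enumerate(instrs):
        for reg in sorted(reads):
            if reg in last_writer:                 # true dependence
                hazards.append(("RAW", reg, last_writer[reg], j))
        for reg in sorted(writes):
            for i in readers.get(reg, []):         # earlier readers of the old value
                hazards.append(("WAR", reg, i, j))
            if reg in last_writer:                 # earlier write to the same register
                hazards.append(("WAW", reg, last_writer[reg], j))
        for reg in reads:
            readers.setdefault(reg, []).append(j)
        for reg in writes:
            last_writer[reg] = j
            readers[reg] = []                      # later readers see the new value
    return hazards


if __name__ == "__main__":
    for kind, reg, i, j in register_hazards(SEGMENT):
        print(kind, reg, ":", SEGMENT[i][0], "->", SEGMENT[j][0])
```

For this segment the sketch reports the RAW hazards from the move to the first lw and to the add, the WAW hazard between the move and the add, and the WAR hazard between the first lw and the add, which are the cases discussed in the text.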