In this chapter you will learn about:

• Pipelining as a means of executing machine instructions concurrently
In this chapter, we discuss in detail the concept of pipelining, which is used in modern computers to achieve high performance. We begin by explaining the basics of pipelining and how it can lead to improved performance. Then we examine machine instruction features that facilitate pipelined execution, and we show that the choice of instructions and instruction sequencing can have a significant effect on performance. Pipelined organization requires sophisticated compilation techniques, and optimizing compilers have been developed for this purpose. Among other things, such compilers rearrange the sequence of operations to maximize the benefits of pipelined execution.
The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

We have encountered concurrent activities several times before. Chapter 1 introduced the concept of multiprogramming and explained how it is possible for I/O transfers and computational activities to proceed simultaneously. DMA devices make this possible because they can perform I/O transfers independently once these transfers are initiated by the processor.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet another group is preparing a new chassis for a third car. It may take days to complete work on a given car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor executes a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 8.1a.

Now consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 8.1b. The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction. The results of execution are deposited in the destination location specified by the instruction. For the purposes of this discussion, we assume that both the source and the destination of the data operated on by the instructions are inside the block labeled "Execution unit."
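The payoff of overlapping the fetch and execute units is easy to quantify. The following Python sketch is purely illustrative (the function names are ours, not the text's): it compares the time to run n instructions strictly sequentially, at two cycles each, against the overlapped schedule in which a new instruction completes every cycle after the first.

def sequential_time(n):
    # Each instruction needs one fetch cycle followed by one execute cycle.
    return 2 * n

def pipelined_time(n):
    # Fetching of instruction i overlaps execution of instruction i-1,
    # so after the first instruction, one instruction completes per cycle.
    return n + 1

for n in (1, 10, 1_000_000):
    print(n, sequential_time(n) / pipelined_time(n))
    # The speedup approaches 2 for long instruction sequences.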
By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 8.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 8.1a.

In summary, the fetch and execute units in Figure 8.1b constitute a two-stage pipeline in which each stage performs one step in processing an instruction. An interstage storage buffer, B1, is needed to hold the information being passed from one stage to the next. New information is loaded into this buffer at the end of each clock cycle.

The processing of an instruction need not be divided into only two steps. For example, a pipelined processor may process each instruction in four steps, as follows:
F   Fetch: read the instruction from the memory.
D   Decode: decode the instruction and fetch the source operand(s).
E   Execute: perform the operation specified by the instruction.
W   Write: store the result in the destination location.
The sequence of events for this case is shown in Figure 8.2a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 8.2b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. For example, during clock cycle 4, the information in the buffers is as follows:

• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded.

• Buffer B2 holds the source operands for instruction I2 and the specification of the operation to be performed, produced by the decoding hardware in cycle 3. The buffer also carries the information needed for the write step of instruction I2; although not needed by stage E, this information must be passed on to stage W.

• Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
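The schedule of Figure 8.2a follows from a single rule: under ideal conditions, instruction i (counting from 1) occupies stage s (counting F as stage 1) during clock cycle i + s - 1. A small sketch of that rule, ours rather than the book's:

def stage_occupancy(num_instructions, stages="FDEW"):
    # Under ideal conditions, instruction i is in stage s during
    # clock cycle i + s - 1 (s is 0-based below, hence i + s).
    for i in range(1, num_instructions + 1):
        cycles = {stage: i + s for s, stage in enumerate(stages)}
        print(f"I{i}:", cycles)

stage_occupancy(4)
# In cycle 4, I4 is in F, I3 in D, I2 in E, and I1 in W:
# four instructions are in progress at once.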
[Figure 8.2 A 4-stage pipeline. (a) Instruction execution divided into four steps; (b) hardware organization, with interstage buffers B1, B2, and B3 between the F, D, E, and W stages.]

Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence, the clock period should be sufficiently long to complete the task being performed in any stage. If different units require different amounts of time, the clock period must allow the longest task to be completed. A unit that completes its task early is idle for the remainder of the clock period. Hence, pipelining is most effective in improving performance if the tasks being performed in different stages require about the same amount of time.

This consideration is particularly important for the instruction fetch step, which is assigned one clock period in Figure 8.2a. The clock cycle has to be equal to or greater than the time needed to complete a fetch operation. However, the access time of the main memory may be as much as ten times greater than the time needed to perform basic pipeline stage operations inside the processor, such as adding two numbers. Thus, if each instruction fetch required access to the main memory, pipelining would be of little value.

The use of cache memories solves the memory access problem. In particular, when a cache is included on the same chip as the processor, access time to the cache is usually the same as the time needed to perform other basic operations inside the processor.
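That constraint can be written down directly: the clock period is set by the slowest stage. The sketch below, an illustration under our own assumed stage delays, shows why a main-memory fetch inside the pipeline would be ruinous and why a fast on-chip cache restores balance.

def clock_period(stage_delays_ns):
    # Every stage gets the same clock period, so the slowest stage sets it;
    # faster stages sit idle for the remainder of each cycle.
    return max(stage_delays_ns)

balanced = [1.0, 1.0, 1.0, 1.0]      # fetch from an on-chip cache
unbalanced = [10.0, 1.0, 1.0, 1.0]   # fetch from main memory (~10x slower)
print(clock_period(balanced))        # 1.0 ns per instruction in steady state
print(clock_period(unbalanced))      # 10.0 ns: pipelining is of little value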
Pipelined operation in Figure 8.3 is said to have been stalled for two clock cycles. Normal pipelined operation resumes in cycle 7.

Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed, and the pipeline stalls.

The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory. Such hazards are often called control hazards or instruction hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 8.4. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.
[Figure 8.4 Pipeline stall caused by a cache miss in F2. (a) Instruction execution steps in successive clock cycles; (b) function performed by each processor stage in successive clock cycles, with the D, E, and W stages idle while F2 waits for the miss to be serviced.]
An alternative representation of the operation of a pipeline in the case of a cache miss is shown in Figure 8.4b. This figure gives the function performed by each pipeline stage in each clock cycle. Note that the Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls. They are also often referred to as bubbles in the pipeline. Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

A third type of hazard that may be encountered in pipelined operation is known as a structural hazard. This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other instruction is delayed. Many processors use separate instruction and data caches to avoid this delay.

An example of a structural hazard is shown in Figure 8.5. This figure shows how the load instruction
Load X(R1),R2
can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is computed in step E2 in cycle 4, then memory access takes place in cycle 5. The operand read from memory is written into register R2 in cycle 6. This means that the execution step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2 and I3 require access to the register file in cycle 6. Even though the instructions and their data are all available, the pipeline is stalled because one hardware resource, the register file, cannot handle two operations at once.
[Figure 8.5 Effect of a Load instruction on pipeline timing. The Load's Execute phase takes two cycles, E2 for the address calculation in cycle 4 and M2 for the memory access in cycle 5, so W2 occurs in cycle 6 and the following instructions are delayed by one cycle.]
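The one-cycle stall can be reproduced with a simple occupancy model: an instruction may enter a stage only after it has finished the previous stage and the instruction ahead of it has vacated that stage. The sketch below is our own illustration (it captures the register-file conflict indirectly, by letting the Load's two-cycle Execute phase hold up the stages behind it):

def completion_cycles(latencies, stages=4):
    # latencies[i][s] = clock cycles instruction i spends in stage s (F, D, E, W).
    finish = [[0] * stages for _ in latencies]
    for i in range(len(latencies)):
        for s in range(stages):
            done_prev_stage = finish[i][s - 1] if s > 0 else 0
            stage_vacated = finish[i - 1][s] if i > 0 else 0
            finish[i][s] = max(done_prev_stage, stage_vacated) + latencies[i][s]
    return [row[-1] for row in finish]

# I1 and I3 are ordinary instructions; I2 is the Load, whose Execute
# phase takes two cycles (address calculation, then memory access).
print(completion_cycles([[1, 1, 1, 1],
                         [1, 1, 2, 1],
                         [1, 1, 1, 1]]))  # [4, 6, 7]: I3 finishes one cycle late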
[Figure 8.6 Pipeline stalled by data dependency between D2 and W1. The Add's decode step is delayed to cycle 5 (shown as D2A) until the Mul's W1 step completes in cycle 4.]
This example illustrates a basic constraint that must be enforced to guarantee correct results. When two operations depend on each other, they must be performed sequentially in the correct order. This rather obvious condition has far-reaching consequences. Understanding its implications is the key to understanding the variety of design alternatives and trade-offs encountered in pipelined computers.

Consider the pipeline in Figure 8.2. The data dependency just described arises when the destination of one instruction is used as a source in the next instruction. For example, the two instructions

Mul R2,R3,R4
Add R5,R4,R6
give rise to a data dependency. The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. Assuming that the multiply operation takes one clock cycle to complete, execution would proceed as shown in Figure 8.6. As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence, the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two cycles.
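The interlock that produces this stall can be phrased as a simple rule in the Decode stage: if one of the new instruction's source registers is the destination of an instruction that has not yet completed its Write step, hold the instruction in D until that Write finishes. A sketch of the cycle arithmetic, in our own formulation and with no operand forwarding assumed:

def decode_stall(dest, sources, producer_write_cycle, consumer_decode_cycle):
    # Without forwarding, Decode may not complete before the producer's
    # Write step has deposited the needed value in the register file.
    if dest in sources:
        return max(0, producer_write_cycle - consumer_decode_cycle + 1)
    return 0

# Mul R2,R3,R4 performs W1 in cycle 4; Add R5,R4,R6 would decode in cycle 3,
# so its D step slips to cycle 5 -- a two-cycle stall, as in Figure 8.6.
print(decode_stall("R4", {"R5", "R4"}, producer_write_cycle=4,
                   consumer_decode_cycle=3))  # 2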
The data hazard just described arises because one instruction, instruction I2 in Figure 8.6, is waiting for data to be written in the register file. However, these data are available at the output of the ALU once the Execute stage completes step E1. Hence, the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2.

Figure 8.7a shows a part of the processor datapath involving the ALU and the register file. This arrangement is similar to the three-bus structure in Figure 7.8, except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the interstage buffers needed for pipelined operation; their position in the pipeline is shown in Figure 8.7b.
[Figure 8.7 Operand forwarding in a pipelined processor. (a) Datapath: the ALU operates on source registers SRC1 and SRC2 and deposits its output in RSLT before the result is written to the register file; (b) position of the source and result registers in the pipeline, with a forwarding path from RSLT back to the ALU inputs.]
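The benefit of the forwarding path can be expressed with the same cycle arithmetic as before. Without forwarding, the consumer's Decode waits for the producer's Write; with forwarding, the consumer's Execute only has to follow the producer's Execute, because the value is taken from RSLT rather than the register file. A hedged sketch (function names are ours; the cycle numbers come from Figure 8.6):

def stall_without_forwarding(producer_write, consumer_decode):
    # Decode must complete after the producer's Write step.
    return max(0, producer_write - consumer_decode + 1)

def stall_with_forwarding(producer_execute, consumer_execute):
    # The ALU result in RSLT is fed straight back to the ALU input,
    # so the consumer's Execute only needs to follow the producer's.
    return max(0, producer_execute - consumer_execute + 1)

# Mul: E1 in cycle 3, W1 in cycle 4.  Add: D2 in cycle 3, E2 in cycle 4.
print(stall_without_forwarding(4, 3))  # 2 -- the stall of Figure 8.6
print(stall_with_forwarding(3, 4))     # 0 -- the stall is eliminated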
Sometimes an instruction changes the contents of a register other than the one named as the destination. An instruction that uses an autoincrement or autodecrement addressing mode is an example. In addition to storing new data in its destination location, the instruction changes the contents of a source register used to access one of its operands. All the precautions needed to handle data dependencies involving the destination location must also be applied to the registers affected by an autoincrement or autodecrement operation. When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. For example, stack instructions, such as push and pop, produce similar side effects because they implicitly use the autoincrement and autodecrement addressing modes.

Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry. Suppose that registers R1 and R2 hold a double-precision integer number that we wish to add to another double-precision number in registers R3 and R4. This may be accomplished as follows:
Add R1,R3
AddWithCarry R2,R4
An implicit dependency exists between these two instructions through the carry flag. This flag is set by the first instruction and used in the second instruction, which performs the operation
R4 ← [R2] + [R4] + carry
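The implicit dependency is easy to see in a small model of the two instructions. This sketch assumes 32-bit registers and is our own illustration of the semantics, not code from the text:

MASK32 = 0xFFFFFFFF

def add(a, b):
    # Add: produces a 32-bit result and sets the carry flag from the overflow.
    total = a + b
    return total & MASK32, total >> 32

def add_with_carry(a, b, carry):
    # AddWithCarry: consumes the carry produced by the preceding Add.
    total = a + b + carry
    return total & MASK32, total >> 32

# R1,R2 hold one double-precision number (R1 = low word); R3,R4 the other.
R1, R2 = 0xFFFFFFFF, 0x00000001
R3, R4 = 0x00000001, 0x00000002
R3, carry = add(R1, R3)                # Add R1,R3 -- sets the carry flag
R4, _ = add_with_carry(R2, R4, carry)  # R4 <- [R2] + [R4] + carry
print(hex(R4), hex(R3))                # 0x4 0x0: the carry rippled into R4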
Instructions that have side effects give rise to multiple data dependencies, which lead to a substantial increase in the complexity of the hardware or software needed to resolve them. For this reason, instructions designed for execution on pipelined hardware should have few side effects. Ideally, only the contents of the destination location, either a register or a memory location, should be affected by any given instruction. Side effects, such as setting the condition code flags or updating the contents of an address pointer, should be kept to a minimum.

However, Chapter 2 showed that the autoincrement and autodecrement addressing modes are potentially useful. Condition code flags are also needed for recording such information as the generation of a carry or the occurrence of overflow in an arithmetic operation. In Section 8.4 we show how such functions can be provided by other means that are consistent with a pipelined organization and with the requirements of optimizing compilers.
The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions. Whenever this stream is interrupted, the pipeline stalls, as Figure 8.4 illustrates for the case of a cache miss. A branch instruction may also cause the pipeline to stall. We will now examine the effect of branch instructions and the techniques that can be used for mitigating their impact. We start with unconditional branches.
Figure 8.8 shows a sequence of instructions being executed in a two-stage pipeline. Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the branch target be instruction Ik. In clock cycle 3, the fetch operation for instruction I3 is in progress at the same time that the branch instruction is being decoded and the target address computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched, and fetch instruction Ik. In the meantime, the hardware unit responsible for the Execute (E) step must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock cycle.

The time lost as a result of a branch instruction is often referred to as the branch penalty. In Figure 8.8, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty may be higher. For example, Figure 8.9a shows the effect of a branch instruction on a four-stage pipeline. We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.

Reducing the branch penalty requires the branch address to be computed earlier in the pipeline. Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. With this additional hardware, both of these tasks can be performed in step D2, leading to the sequence of events shown in Figure 8.9b. In this case, the branch penalty is only one clock cycle.
[Figure 8.8 An idle cycle caused by a branch instruction. I3 is fetched in cycle 3 and discarded; the branch target Ik is fetched in cycle 4, leaving the execution unit idle for one cycle.]
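The pattern generalizes in an obvious way: every instruction fetched between the branch and the moment its target address becomes known must be discarded. If the target is known at the end of stage k, counting Fetch as stage 1, the penalty is k - 1 cycles. A one-function sketch, in our own formulation:

def branch_penalty(stage_target_known):
    # Stage 1 is Fetch; one wrongly fetched instruction is discarded for
    # each stage the branch passes through before its target is known.
    return stage_target_known - 1

print(branch_penalty(3))  # target computed in Execute (Figure 8.9a): 2 cycles
print(branch_penalty(2))  # target computed in Decode (Figure 8.9b): 1 cycle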
[Figure 8.10 Use of an instruction queue in the hardware organization of Figure 8.2b. The fetch unit places instructions in an instruction queue, and a dispatch/decode unit takes them from the queue and sends them to the execution unit.]
A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. This leads to the organization shown in Figure 8.10. The dispatch unit also performs the decoding function.

To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.

Figure 8.11 illustrates how the queue length changes and how it affects the relationship between different pipeline stages. We have assumed that initially the queue contains one instruction. Every fetch operation adds one instruction to the queue and every dispatch operation reduces the queue length by one. Hence, the queue length remains the same for the first four clock cycles. (There is both an F and a D step in each of these cycles.) Suppose that instruction I1 introduces a 2-cycle stall. Since space is available in the queue, the fetch unit continues to fetch instructions and the queue length rises to 3 in clock cycle 6.

Instruction I5 is a branch instruction. Its target instruction, Ik, is fetched in cycle 7, and instruction I6 is discarded. The branch instruction would normally cause a stall in cycle 7 as a result of discarding instruction I6. Instead, instruction I4 is dispatched from the queue to the decoding stage. After discarding I6, the queue length drops to 1 in cycle 8. The queue length will be at this value until another stall is encountered.

Now observe the sequence of instruction completions in Figure 8.11. Instructions I1, I2, I3, I4, and Ik complete execution in successive clock cycles. Hence, the branch instruction does not increase the overall execution time. This is because the instruction fetch unit has executed the branch instruction (by computing the branch address) concurrently with the execution of other instructions. This technique is referred to as branch folding.
[Figure 8.11 Branch timing in the presence of an instruction queue; the branch target address is computed in the D stage. The queue length in successive cycles is 1, 1, 1, 1, 2, 3, 2, 1, 1, 1, and instructions I1 through I4 and Ik complete in successive clock cycles.]
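The queue-length row in Figure 8.11 follows from simple bookkeeping: each fetch adds one instruction, and each instruction leaving the queue, whether dispatched or discarded like I6, removes one. The per-cycle event counts below are one consistent reading of the figure; the code itself is only an illustration:

def queue_length_trace(events, initial=1):
    # events: per-cycle (instructions fetched, instructions removed);
    # removals cover both dispatched and discarded instructions.
    length, trace = initial, []
    for fetched, removed in events:
        length += fetched - removed
        trace.append(length)
    return trace

events = [(1, 1)] * 4 + [(1, 0), (1, 0)] + [(1, 2), (1, 2)] + [(1, 1)] * 2
print(queue_length_trace(events))  # [1, 1, 1, 1, 2, 3, 2, 1, 1, 1]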
Note that branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch instruction is available in the queue. If only the branch instruction is in the queue, execution would proceed as in Figure 8.9b. Therefore, it is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing. This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache. In many processors, the width of the connection between the fetch unit and the instruction cache allows reading more than one instruction in each clock cycle. If the fetch unit replenishes the instruction queue quickly after a branch has occurred, the probability that branch folding will occur increases.

Having an instruction queue is also beneficial in dealing with cache misses. When a cache miss occurs, the dispatch unit continues to send instructions for execution as long as the instruction queue is not empty. Meanwhile, the desired cache block is read from the main memory or from a secondary cache. When fetch operations are resumed, the instruction queue is refilled. If the queue does not become empty, a cache miss will have no effect on the rate of instruction execution.

In summary, the instruction queue mitigates the impact of branch instructions on performance through the process of branch folding. It has a similar effect on stalls caused by cache misses.
[Figure 8.12 Reordering of instructions for a delayed branch.
(a) Original program loop:
LOOP  Shift_left  R1
      Decrement   R2
      Branch≠0    LOOP
NEXT  Add         R1,R3
(b) Reordered instructions:
LOOP  Decrement   R2
      Branch≠0    LOOP
      Shift_left  R1
NEXT  Add         R1,R3]
[Figure 8.13 Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12b. In the last-but-one pass the branch is taken and the Shift instruction in the delay slot is executed; in the last pass the branch is not taken and execution falls through to the Add instruction.]
The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions as in Figure 8.12. Experimental data collected from many programs indicate that sophisticated compilation techniques can use one branch delay slot in as many as 85 percent of the cases. For a processor with two branch delay slots, the compiler attempts to find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The chances of finding two such instructions are considerably less than the chances of finding one. Thus, if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized.
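A delay-slot pass in a compiler can be sketched as a search for an instruction ahead of the branch that can legally be moved behind it. The check below is deliberately simplified and entirely our own (a real scheduler must also consider memory operations and other subtleties): an instruction may move into the delay slot only if it has no dependence on the instructions it moves past and does not produce anything the branch itself reads, such as the condition codes.

def fill_delay_slot(body, branch):
    # body: (mnemonic, reads, writes) tuples before the branch, in order.
    _, branch_reads, _ = branch
    for i in range(len(body) - 1, -1, -1):
        _, reads, writes = body[i]
        later = body[i + 1:]
        independent = all(
            not (set(writes) & (set(r) | set(w))) and not (set(reads) & set(w))
            for _, r, w in later)
        if independent and not (set(writes) & set(branch_reads)):
            return body[:i] + later + [branch, body[i]]
    return body + [branch, ("Nop", (), ())]  # nothing safe: fill with a Nop

loop = [("Shift_left R1", ("R1",), ("R1",)),
        ("Decrement R2", ("R2",), ("R2", "CC"))]
branch = ("Branch LOOP", ("CC",), ())
print([ins[0] for ins in fill_delay_slot(loop, branch)])
# ['Decrement R2', 'Branch LOOP', 'Shift_left R1'] -- as in Figure 8.12b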
Branch Prediction

Another technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis. Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Hence, care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed. If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed.

An incorrectly predicted branch is illustrated in Figure 8.14 for a four-stage pipeline. The figure shows a Compare instruction followed by a Branch>0 instruction.
[Figure 8.14 Timing when a branch decision has been incorrectly predicted as not taken. Instructions I3 and I4 are fetched speculatively and discarded when the branch decision becomes known, and the target instruction Ik is then fetched.]
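The cost of the "assume not taken" strategy is easy to quantify. Under this policy, only branches that turn out to be taken pay the misprediction penalty of discarding the speculatively fetched instructions. The figures below are assumed values chosen for illustration; the formula itself is standard:

def wasted_cycles(num_branches, fraction_taken, penalty):
    # Each mispredicted (i.e., taken) branch discards the speculatively
    # fetched instructions, wasting `penalty` cycles of pipeline work.
    return num_branches * fraction_taken * penalty

# One million branches, an assumed 60% of them taken, and a two-cycle
# penalty (two discarded instructions, as in Figure 8.14):
print(wasted_cycles(1_000_000, 0.60, 2), "cycles lost to mispredictions")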