

CHAPTER 8

PIPELINING

CHAPTER OBJECTIVES

In this chapter you will learn about:

• Pipelining as a means for executing machine instructions concurrently
• Various hazards that cause performance degradation in pipelined processors and means for mitigating their effect
• Hardware and software implications of pipelining
• Influence of pipelining on instruction set design
• Superscalar processors


The basic building blocks of a computer are introduced in preceding chapters. In this chapter, we discuss in detail the concept of pipelining, which is used in modern computers to achieve high performance. We begin by explaining the basics of pipelining and how it can lead to improved performance. Then we examine machine instruction features that facilitate pipelined execution, and we show that the choice of instructions and instruction sequencing can have a significant effect on performance. Pipelined organization requires sophisticated compilation techniques, and optimizing compilers have been developed for this purpose. Among other things, such compilers rearrange the sequence of operations to maximize the benefits of pipelined execution.

8.1 BASIC CONCEPTS

The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

We have encountered concurrent activities several times before. Chapter 1 introduced the concept of multiprogramming and explained how it is possible for I/O transfers and computational activities to proceed simultaneously. DMA devices make this possible because they can perform I/O transfers independently once these transfers are initiated by the processor.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet another group is preparing a new chassis for a third car. It may take days to complete work on a given car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor executes a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 8.1a. Now consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 8.1b. The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction. The results of execution are deposited in the destination location specified by the instruction. For the purposes of this discussion, we assume that both the source and the destination of the data operated on by the instructions are inside the block labeled "Execution unit."


By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 8.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 8.1a.

In summary, the fetch and execute units in Figure 8.1b constitute a two-stage pipeline in which each stage performs one step in processing an instruction. An interstage storage buffer, B1, is needed to hold the information being passed from one stage to the next. New information is loaded into this buffer at the end of each clock cycle.

The processing of an instruction need not be divided into only two steps. For example, a pipelined processor may process each instruction in four steps, as follows:

F    Fetch: read the instruction from the memory.
D    Decode: decode the instruction and fetch the source operand(s).
E    Execute: perform the operation specified by the instruction.
W    Write: store the result in the destination location.

The sequence of events for this case is shown in Figure 8.2a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 8.2b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. For example, during clock cycle 4, the information in the buffers is as follows:

• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be performed. This is the information produced by the decoding hardware in cycle 3. The buffer also holds the information needed for the write step of instruction I2 (step W2). Even though it is not needed by stage E, this information must be passed on to stage W in the following clock cycle to enable that stage to perform the required Write operation.
• Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.

Figure 8.2 A 4-stage pipeline: (a) instruction execution divided into four steps; (b) hardware organization, with interstage buffers B1, B2, and B3 between successive stages.
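To make this buffer traffic concrete, here is a minimal sketch, not from the text, that steps four instructions through the pipeline of Figure 8.2 and prints what each interstage buffer holds in every cycle (instructions are represented simply by their names):

    # Interstage buffers are reloaded at the end of each clock cycle, so
    # the values printed for a cycle are what the stages see during it.
    instructions = iter(["I1", "I2", "I3", "I4"])

    b1 = b2 = b3 = None  # B1, B2, and B3 are empty before cycle 1
    for cycle in range(1, 8):
        print(f"during cycle {cycle}: B1={b1}  B2={b2}  B3={b3}")
        # End of cycle: each value moves one buffer downstream, and the
        # fetch unit deposits the next instruction (if any) in B1.
        b1, b2, b3 = next(instructions, None), b1, b2

During cycle 4 the sketch prints B1=I3, B2=I2, and B3=I1, matching the list above.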

8.1.1 ROLE OF CACHE MEMORY

Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence, the clock period should be sufficiently long to complete the task being performed in any stage. If different units require different amounts of time, the clock period must allow the longest task to be completed. A unit that completes its task early is idle for the remainder of the clock period. Hence, pipelining is most effective in improving performance if the tasks being performed in different stages require about the same amount of time.

This consideration is particularly important for the instruction fetch step, which is assigned one clock period in Figure 8.2a. The clock cycle has to be equal to or greater than the time needed to complete a fetch operation. However, the access time of the main memory may be as much as ten times greater than the time needed to perform basic pipeline stage operations inside the processor, such as adding two numbers. Thus, if each instruction fetch required access to the main memory, pipelining would be of little value.

The use of cache memories solves the memory access problem. In particular, when a cache is included on the same chip as the processor, access time to the cache is usually the same as the time needed to perform other basic operations inside the processor. This makes it possible to divide instruction fetching and processing into steps that are more or less equal in duration.
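As a back-of-the-envelope illustration of why stage balance matters, the following sketch uses invented per-stage delays (the figures are assumptions for the example, not measurements from the text):

    # The slowest stage sets the clock period for the whole pipeline.
    stage_delays = {"F": 2.0, "D": 1.5, "E": 2.0, "W": 1.5}  # nanoseconds

    clock_period = max(stage_delays.values())
    print(f"clock period = {clock_period} ns; one instruction completes every {clock_period} ns")

    # If every fetch went to the main memory instead (say, ten times
    # slower), the fetch stage would dominate and pipelining would gain little:
    stage_delays["F"] = 20.0
    print(f"with main-memory fetches: clock period = {max(stage_delays.values())} ns")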


Pipelined operation in Figure 8.3 is said to have been stalled for two clock cycles. Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed, and the pipeline stalls.

The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory. Such hazards are often called control hazards or instruction hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 8.4. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.

Figure 8.4 Pipeline stall caused by a cache miss in F2: (a) instruction execution steps in successive clock cycles; (b) function performed by each processor stage in successive clock cycles.
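The following sketch reproduces the stall pattern described for Figure 8.4; the 3-cycle miss penalty is an assumption chosen to match the example:

    # I2's fetch misses in the cache, so it occupies the fetch unit for
    # cycles 2 through 5. With no other hazards, each later stage simply
    # trails the fetch by one cycle, idling (a bubble) while it waits.
    MISS_PENALTY = 3
    fetch_cycle = 0
    for instr in ["I1", "I2", "I3"]:
        fetch_cycle += 1 + (MISS_PENALTY if instr == "I2" else 0)
        f = fetch_cycle
        print(f"{instr}: F ends in cycle {f}, D in {f+1}, E in {f+2}, W in {f+3}")

As in the figure, I2 completes its fetch in cycle 5, is decoded in cycle 6, and writes its result in cycle 8.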


An alternative representation of the operation of a pipeline in the case of a cache miss is shown in Figure 8.4b. This figure gives the function performed by each pipeline stage in each clock cycle. Note that the Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls. They are also often referred to as bubbles in the pipeline. Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.

A third type of hazard that may be encountered in pipelined operation is known as a structural hazard. This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other instruction is delayed. Many processors use separate instruction and data caches to avoid this delay.

An example of a structural hazard is shown in Figure 8.5. This figure shows how the load instruction

Load X(R1),R2

can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is computed in step E2 in cycle 4, then memory access takes place in cycle 5. The operand read from memory is written into register R2 in cycle 6. This means that the execution step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2 and I3 require access to the register file in cycle 6. Even though the instructions and their data are all available, the pipeline is stalled because one hardware resource, the register file, cannot handle two accesses at the same time.

Figure 8.5 Effect of a Load instruction on pipeline timing.
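One plausible transcription of the timing in Figure 8.5, based on the description above (treat the exact cycle numbers for I3 as a reading of the figure rather than as given data):

    # The Load's Execute phase takes two cycles: address calculation (E)
    # in cycle 4 and memory access (M) in cycle 5, so its register write
    # slips to cycle 6 and I3 stalls one cycle behind it.
    timing = {
        "I1":        {"F": 1, "D": 2, "E": 3, "W": 4},
        "I2 (Load)": {"F": 2, "D": 3, "E": 4, "M": 5, "W": 6},
        "I3":        {"F": 3, "D": 4, "E": 6, "W": 7},  # Execute stage busy in cycle 5
    }
    for instr, stages in timing.items():
        print(instr.ljust(10), " ".join(f"{s}:{c}" for s, c in stages.items()))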


Figure 8.6 Pipeline stalled by data dependency between D2 and W1.

This example illustrates a basic constraint that must be enforced to guarantee correct results. When two operations depend on each other, they must be performed sequentially in the correct order. This rather obvious condition has far-reaching consequences. Understanding its implications is the key to understanding the variety of design alternatives and trade-offs encountered in pipelined computers.

Consider the pipeline in Figure 8.2. The data dependency just described arises when the destination of one instruction is used as a source in the next instruction. For example, the two instructions

Mul R2,R3,R4
Add R5,R4,R6

give rise to a data dependency. The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. Assuming that the multiply operation takes one clock cycle to complete, execution would proceed as shown in Figure 8.6. As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence, the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two cycles.
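A sketch of the check that forces this stall (the encoding is hypothetical; a real decoder compares register fields, not strings):

    # Detect a read-after-write dependency between adjacent instructions,
    # each written as (operation, sources, destination).
    mul = ("Mul", ("R2", "R3"), "R4")
    add = ("Add", ("R5", "R4"), "R6")

    def must_stall(earlier, later):
        """True if `later` reads a register that `earlier` has not yet written."""
        _, _, dest = earlier
        _, sources, _ = later
        return dest in sources

    print(must_stall(mul, add))  # True: Add needs R4 before Mul's W step completes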

8.2.1 OPERAND FORWARDING

The data hazard just described arises because one instruction, instruction I2 in Figure 8.6, is waiting for data to be written in the register file. However, these data are available at the output of the ALU once the Execute stage completes step E1. Hence, the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2.

Figure 8.7a shows a part of the processor datapath involving the ALU and the register file. This arrangement is similar to the three-bus structure in Figure 7.8, except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the interstage buffers B2 and B3.

Figure 8.7 Operand forwarding in a pipelined processor: (a) datapath, showing the forwarding path from RSLT back to the ALU inputs SRC1 and SRC2; (b) position of the source (SRC1, SRC2) and result (RSLT) registers between the E (Execute/ALU) and W (Write/register file) stages of the pipeline.
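An illustrative software model, not the book's hardware, of what the forwarding path accomplishes for the Mul/Add pair above:

    # The Execute stage takes its operand from the forwarded result (RSLT)
    # when the previous instruction's destination matches, instead of
    # reading the not-yet-updated register file.
    regs = {"R2": 3, "R3": 5, "R4": 0, "R5": 10, "R6": 0}

    # Step E1: Mul R2,R3,R4 latches its result in RSLT.
    rslt_value, rslt_dest = regs["R2"] * regs["R3"], "R4"

    # Step E2: Add R5,R4,R6 needs R4 before step W1 has written it back,
    # so the operand comes from the forwarding path.
    operand = rslt_value if rslt_dest == "R4" else regs["R4"]
    regs["R6"] = regs["R5"] + operand

    regs["R4"] = rslt_value  # W1 eventually updates the register file
    print(regs["R6"])        # 25, obtained without any stall cycles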


Sometimes an instruction changes the contents of a register other than the one named as the destination. An instruction that uses an autoincrement or autodecrement addressing mode is an example. In addition to storing new data in its destination location, the instruction changes the contents of a source register used to access one of its operands. All the precautions needed to handle data dependencies involving the destination location must also be applied to the registers affected by an autoincrement or autodecrement operation. When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. For example, stack instructions, such as push and pop, produce similar side effects because they implicitly use the autoincrement and autodecrement addressing modes.

Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry. Suppose that registers R1 and R2 hold a double-precision integer number that we wish to add to another double-precision number in registers R3 and R4. This may be accomplished as follows:

Add R1,R3
AddWithCarry R2,R4

An implicit dependency exists between these two instructions through the carry flag. This flag is set by the first instruction and used in the second instruction, which performs the operation

R4 ← [R2] + [R4] + carry
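A numeric sketch of this implicit carry-flag dependency (Python integers stand in for registers; the 32-bit word size is an assumption made for the illustration):

    MASK = 0xFFFFFFFF

    R1, R2 = 0xFFFFFFFF, 0x00000001  # low and high words of the first number
    R3, R4 = 0x00000001, 0x00000002  # low and high words of the second number

    # Add R1,R3: add the low-order words; setting the carry flag is the side effect.
    total = R1 + R3
    R3, carry = total & MASK, total >> 32

    # AddWithCarry R2,R4: R4 <- [R2] + [R4] + carry.
    R4 = (R2 + R4 + carry) & MASK

    print(hex(R4), hex(R3))  # 0x4 0x0, i.e. the 64-bit sum 0x400000000

If the two instructions were reordered, or separated so that something else set the carry flag in between, R4 would be wrong; this is exactly the kind of dependency the pipeline hardware must track.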

Instructions that have side effects give rise to multiple data dependencies, which lead to a substantial increase in the complexity of the hardware or software needed to resolve them. For this reason, instructions designed for execution on pipelined hardware should have few side effects. Ideally, only the contents of the destination location, either a register or a memory location, should be affected by any given instruction. Side effects, such as setting the condition code flags or updating the contents of an address pointer, should be kept to a minimum. However, Chapter 2 showed that the autoincrement and autodecrement addressing modes are potentially useful. Condition code flags are also needed for recording such information as the generation of a carry or the occurrence of overflow in an arithmetic operation. In Section 8.4 we show how such functions can be provided by other means that are consistent with a pipelined organization and with the requirements of optimizing compilers.

8.3 INSTRUCTION HAZARDS

The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions. Whenever this stream is interrupted, the pipeline stalls, as Figure 8.4 illustrates for the case of a cache miss. A branch instruction may also cause the pipeline to stall. We will now examine the effect of branch instructions and the techniques that can be used for mitigating their impact. We start with unconditional branches.


8.3.1 UNCONDITIONAL BRANCHES

Figure 8.8 shows a sequence of instructions being executed in a two-stage pipeline. Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the branch target be instruction Ik. In clock cycle 3, the fetch operation for instruction I3 is in progress at the same time that the branch instruction is being decoded and the target address computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched, and fetch instruction Ik. In the meantime, the hardware unit responsible for the Execute (E) step must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock cycle.

The time lost as a result of a branch instruction is often referred to as the branch penalty. In Figure 8.8, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty may be higher. For example, Figure 8.9a shows the effect of a branch instruction on a four-stage pipeline. We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.

Reducing the branch penalty requires the branch address to be computed earlier in the pipeline. Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. With this additional hardware, both of these tasks can be performed in step D2, leading to the sequence of events shown in Figure 8.9b. In this case, the branch penalty is only one clock cycle.
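A rough sense of what these penalties cost overall (the 20 percent branch frequency is an assumed figure, not from the text):

    # Average cycles per instruction on a pipeline that otherwise completes
    # one instruction per cycle, for the two target-computation points.
    branch_fraction = 0.20
    for penalty in (2, 1):  # target computed in stage E vs. stage D
        cpi = 1 + branch_fraction * penalty
        print(f"penalty of {penalty} cycle(s): {cpi:.2f} cycles per instruction")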

Figure 8.8 An idle cycle caused by a branch instruction.


Figure 8.10 Use of an instruction queue in the hardware organization of Figure 8.2b: the instruction fetch unit places instructions in a queue, and a dispatch/decode unit takes them from the queue and issues them for execution (F: fetch instruction; D: dispatch/decode; E: execute instruction; W: write results).

A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. This leads to the organization shown in Figure 8.10. The dispatch unit also performs the decoding function.

To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.

Figure 8.11 illustrates how the queue length changes and how it affects the relationship between different pipeline stages. We have assumed that initially the queue contains one instruction. Every fetch operation adds one instruction to the queue and every dispatch operation reduces the queue length by one. Hence, the queue length remains the same for the first four clock cycles. (There is both an F and a D step in each of these cycles.) Suppose that instruction I1 introduces a 2-cycle stall. Since space is available in the queue, the fetch unit continues to fetch instructions and the queue length rises to 3 in clock cycle 6.

Instruction I5 is a branch instruction. Its target instruction, Ik, is fetched in cycle 7, and instruction I6 is discarded. The branch instruction would normally cause a stall in cycle 7 as a result of discarding instruction I6. Instead, instruction I4 is dispatched from the queue to the decoding stage. After discarding I6, the queue length drops to 1 in cycle 8. The queue length will be at this value until another stall is encountered.

Now observe the sequence of instruction completions in Figure 8.11. Instructions I1, I2, I3, I4, and Ik complete execution in successive clock cycles. Hence, the branch instruction does not increase the overall execution time. This is because the instruction fetch unit has executed the branch instruction (by computing the branch address) concurrently with the execution of other instructions. This technique is referred to as branch folding.
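A toy model of how the queue hides a fetch interruption (the initial queue contents and the 3-cycle interruption are assumptions for the illustration):

    from collections import deque

    # While fetching is blocked, the dispatch unit keeps issuing from the
    # queue, so the execution stages never see a bubble.
    queue = deque(["I1", "I2", "I3"])
    fetch_blocked = {3, 4, 5}  # e.g. a cache miss stops fetching for 3 cycles
    next_fetch = 4

    for cycle in range(1, 9):
        if cycle not in fetch_blocked:
            queue.append(f"I{next_fetch}")
            next_fetch += 1
        issued = queue.popleft() if queue else "bubble"
        print(f"cycle {cycle}: dispatch {issued}, queue length {len(queue)}")

Because three instructions were queued when the interruption hit, an instruction is dispatched in every one of the eight cycles.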


Figure 8.11 Branch timing in the presence of an instruction queue; the branch target address is computed in the D stage. The queue length in successive clock cycles is 1, 1, 1, 1, 2, 3, 2, 1, 1, 1.

Note that branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch instruction is available in the queue. If only the branch instruction is in the queue, execution would proceed as in Figure 8.9b. Therefore, it is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing. This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache. In many processors, the width of the connection between the fetch unit and the instruction cache allows reading more than one instruction in each clock cycle. If the fetch unit replenishes the instruction queue quickly after a branch has occurred, the probability that branch folding will occur increases.

Having an instruction queue is also beneficial in dealing with cache misses. When a cache miss occurs, the dispatch unit continues to send instructions for execution as long as the instruction queue is not empty. Meanwhile, the desired cache block is read from the main memory or from a secondary cache. When fetch operations are resumed, the instruction queue is refilled. If the queue does not become empty, a cache miss will have no effect on the rate of instruction execution.

In summary, the instruction queue mitigates the impact of branch instructions on performance through the process of branch folding. It has a similar effect on stalls caused by cache misses.


(a) Original program loop:

LOOP    Shift_left   R1
        Decrement    R2
        Branch≠0     LOOP
NEXT    Add          R1,R3

(b) Reordered instructions:

LOOP    Decrement    R2
        Branch≠0     LOOP
        Shift_left   R1
NEXT    Add          R1,R3

Figure 8.12 Reordering of instructions for a delayed branch.

Figure 8.13 Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12b (instruction sequence: Decrement, Branch, Shift in the delay slot, then Decrement with the branch taken, Branch, Shift in the delay slot, and finally Add with the branch not taken).


The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions as in Figure 8.12. Experimental data collected from many programs indicate that sophisticated compilation techniques can use one branch delay slot in as many as 85 percent of the cases. For a processor with two branch delay slots, the compiler attempts to find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The chances of finding two such instructions are considerably less than the chances of finding one. Thus, if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized.

Branch Prediction

Another technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis. Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Hence, care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed. If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed.

An incorrectly predicted branch is illustrated in Figure 8.14 for a four-stage pipeline. The figure shows a Compare instruction followed by a Branch>0 instruction. Branch prediction takes place in cycle 3, while instruction I3 is being fetched.
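An order-of-magnitude sketch of what this simplest scheme costs (the penalty and the branch mix are assumed figures for a four-stage pipeline that resolves branches in stage E):

    # With "predict not taken", only taken branches pay the squash penalty
    # for the speculatively fetched instructions.
    MISPREDICT_PENALTY = 2  # instructions squashed per mispredicted branch

    def extra_branch_cycles(n_branches, taken_fraction):
        """Cycles lost to squashing under an always-not-taken prediction."""
        return n_branches * taken_fraction * MISPREDICT_PENALTY

    print(extra_branch_cycles(1000, 0.6))  # 1200.0 extra cycles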

Figure 8.14 Timing when a branch decision has been incorrectly predicted as not taken.