pipelining
Pipelining is a technique of executing instructions in such a way that the
execution of the current instruction overlaps with the execution of
subsequent instructions. A pipeline acts like an assembly line: it is
divided into stages, and the stages are connected to form a pipe-like
structure. Instructions enter at one end and exit at the other. The goal
of pipelining is to increase the overall instruction throughput. Nowadays,
state-of-the-art machines are quite heavily pipelined.
sequential processing vs pipelining
Without a pipeline, a processor gets the first instruction from memory,
performs the operation requested by this instruction, then proceeds
to get the next instruction from memory, and so forth. While an instruction
is being fetched, the arithmetic part of the processor is idle, since it has
to wait for that instruction to arrive. With pipelining, however, while the
arithmetic unit is performing the computation, a different part of the CPU can
already work on fetching the next instruction from memory (perhaps storing it
in an internal buffer until it can be executed).
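To see the difference in cycle counts, here is a minimal Python sketch (a toy
two-stage machine with one fetch cycle and one execute cycle per instruction;
nothing here models a real CPU):

def trace(n, pipelined):
    cycles = []
    if not pipelined:
        # the fetch unit and the ALU take turns; one is always idle
        for i in range(1, n + 1):
            cycles.append(f"fetch i{i}   | ALU idle")
            cycles.append(f"fetch idle | execute i{i}")
    else:
        # while the ALU executes instruction i, the fetch unit
        # is already working on instruction i + 1
        cycles.append("fetch i1   | ALU idle")
        for i in range(1, n + 1):
            nxt = f"fetch i{i + 1}" if i < n else "fetch idle"
            cycles.append(f"{nxt:<10} | execute i{i}")
    print("\n".join(cycles) + f"\ntotal: {len(cycles)} cycles")

trace(3, pipelined=False)   # 6 cycles for 3 instructions
trace(3, pipelined=True)    # 4 cycles for the same work

As the instruction count grows, the overlapped version approaches one
completed instruction per cycle, while the sequential one stays at one per
two cycles.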
pipeline and its stages
A textbook example of pipelined execution is the classic 5-stage RISC
pipeline, which looks like this:
- Instruction Fetch (IF) - fetches the instruction from memory or cache,
- Instruction Decode (ID) - decodes the instruction and reads its operands (which must be in registers),
- Execute (EX) - executes the instruction or, if it is a memory access instruction, computes the memory address,
- Memory Access (MEM) - reads or writes memory; a no-op if the instruction does not access memory,
- Write Back (WB) - writes results back to registers.
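To visualize how instructions march through these stages, here is a small
Python sketch (an idealized model with no hazards, every stage taking exactly
one cycle) that prints which stage each instruction occupies in each cycle:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def print_pipeline(n_instructions):
    # instruction i enters IF one cycle after instruction i - 1
    total_cycles = len(STAGES) + n_instructions - 1
    for i in range(n_instructions):
        row = ["    "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:<4}"
        print(f"i{i + 1}: " + "".join(row))

print_pipeline(4)

Once the pipeline is full, all five stages work in parallel on five different
instructions, which is where the throughput gain comes from.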
This five-stage pipeline will potentially produce results five times as fast
as non-pipelined sequential processing, as a quick calculation shows.
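Assuming an idealized model in which every stage takes exactly one cycle and
no stalls ever occur: with k stages, n instructions take n * k cycles when
executed strictly sequentially, but only k + (n - 1) cycles when pipelined
(k cycles for the first instruction to flow through, then one additional
cycle per remaining instruction). The speedup is therefore

    speedup = (n * k) / (k + n - 1)

which approaches k as n grows, so a five-stage pipeline approaches, but never
quite reaches, a factor of five.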
The five-stage pipeline is a commonly used learning example, though modern
processors often have more than ten pipeline stages, and even 20 stages is
not unheard of. However, deciding how many stages to use is not a simple
task, and a number of things need to be taken into account.
Each stage should ideally take the same amount of time, because the clock
period of the whole pipeline is dictated by the slowest stage. If some stages
are slower than others, the parts of the processor implementing the faster
stages are not used at their full potential and waste part of each cycle
waiting. A possible solution is to split these big, slow stages into multiple
smaller ones, as the sketch below illustrates.
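The effect can be sketched in a few lines of Python (the stage latencies are
made-up numbers, chosen only to illustrate the point):

def cycle_time(stage_latencies_ns):
    # every stage must finish within one clock period,
    # so the slowest stage dictates the clock for all of them
    return max(stage_latencies_ns)

unbalanced = [1.0, 1.0, 3.0, 1.0, 1.0]       # one stage three times slower
balanced   = [1.0, 1.0, 1.5, 1.5, 1.0, 1.0]  # the slow stage split in two

print(cycle_time(unbalanced))  # 3.0 ns per cycle
print(cycle_time(balanced))    # 1.5 ns per cycle

Splitting the slow stage halves the clock period, at the price of a longer
pipeline. Another source of issues is resource conflicts between stages.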
Some resource conflicts can be resolved by duplicating the resource, but this
is not always possible (or practical). If one of the stages needs to wait to
use a certain resource, we say that the pipeline stalls. When that happens,
the entire pipeline loses a cycle: the units in front (upstream) of the stuck
unit cannot push their results down the pipe, while behind it (downstream)
more and more units become idle, since no work arrives for them in that cycle.
While it might be tempting to make pipelines longer (more stages means more
parallelism, and hardware is good at parallelism), there are also some serious
drawbacks. Everything works fine as long as the pipeline is kept full, but
this becomes hard when it is not clear which instruction should enter the
pipeline next. This is exactly what happens when a jump instruction enters
the pipeline (and it gets worse if the jump is conditional or indirect).
The IF stage (or its analog) needs to know the address of the instruction to
fetch, but this address is not available until the jump is at least decoded
(for a direct jump) or executed (for conditional and indirect jumps). If the
IF does not have an address, it cannot
do anything and a bubble starts forming in the pipeline (each item is a single
clock cycle):
- the IF fetches a jump instruction,
- the IF is idle, ID works on the jump,
- both IF and ID are idle, the conditional jump reaches EX which resolves the jump and passes the address M to IF,
- IF fetches the instruction from M, ID is idle, EX is idle,
- IF fetches (M + 1), ID decodes instruction M, EX and MEM are idle,
- IF fetches (M + 2), ID decodes (M + 1), EX executes M, MEM and WB are idle.
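The walkthrough can be condensed into a small Python sketch (assuming, as
above, that a direct jump is resolved at the end of ID and a conditional one
at the end of EX):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def fetch_bubbles(resolve_stage):
    # the fetch unit idles for one cycle per stage the jump must
    # traverse after IF before the target address is known
    return STAGES.index(resolve_stage)

print(fetch_bubbles("ID"))   # 1 bubble: direct jump, target known after decode
print(fetch_bubbles("EX"))   # 2 bubbles: conditional jump, as walked through above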
The critical metric here is the distance between instruction fetch (the unit
which needs the address of the next instruction) and the units that resolve
jump addresses. However, those units gravitate toward the end of the
pipeline, since the computation needs its inputs ready, and those are usually
only available fairly late. A common approach to reducing the effective
distance is branch prediction:
something that can be done around the time of ID (instruction decode) and
hence not far away from IF. However, at this stage, the best we can do is
guess, and sometimes we are going to guess wrong. When that happens, part of
the pipeline holds incorrect intermediate results, which need to be discarded
(the pipeline is flushed).
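The cost of wrong guesses can be quantified. A minimal Python sketch with
made-up numbers (suppose 20% of instructions are branches, the predictor is
right 90% of the time, and each misprediction costs 3 cycles of flushed work,
roughly matching the walkthrough above):

def effective_cpi(base_cpi, branch_fraction, accuracy, flush_penalty):
    # every mispredicted branch wastes flush_penalty cycles of work
    mispredictions = branch_fraction * (1 - accuracy)
    return base_cpi + mispredictions * flush_penalty

print(effective_cpi(1.0, 0.20, 0.90, 3))   # 1.06 cycles per instruction
print(effective_cpi(1.0, 0.20, 0.50, 3))   # 1.30 with a coin-flip predictor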
hazards
For a pipelined processor, there are three sources of hazard:
- Structural (resource) hazards occur when two or more instructions that are already in the pipeline need the same resource. The result is that the instructions must be executed in series rather than in parallel for a part of the pipeline.
- Control hazards occur when a branch is predicted incorrectly and the instructions that were speculatively brought into the pipeline must be discarded.
- Data hazards occur when an instruction needs to use data which are not yet available. There are three forms of data hazards (read after write, write after read, and write after write), but only one of them, the read after write (RAW) hazard, is a real issue for pipelined processors (see the sketch below).
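To make the read-after-write case concrete before working through an example,
here is a toy Python sketch (instructions are reduced to a mnemonic, a
destination register, and source registers; the window parameter is a
stand-in for how many earlier instructions may still be in flight):

# each instruction reduced to: (mnemonic, destination, sources)
program = [
    ("ADD", "R1", ("R1", "R6")),
    ("MUL", "R5", ("R1", "R2")),   # reads R1 right after ADD writes it
]

def raw_hazards(instructions, window=3):
    # a RAW hazard: an instruction reads a register that an earlier,
    # still-in-flight instruction has not yet written back
    found = []
    for i, (_, _, sources) in enumerate(instructions):
        for j in range(max(0, i - window), i):
            name, dest, _ = instructions[j]
            if dest in sources:
                found.append((name, instructions[i][0], dest))
    return found

print(raw_hazards(program))   # [('ADD', 'MUL', 'R1')]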
data dependencies
Consider the following sequence of instructions (we will remain faithful to
the classic five-stage RISC pipeline). All relevant registers are set to 1
at the outset.
ADD R1, R1, R6 ; R1 = R1+R6
MUL R5, R1, R2 ; R5 = R1*R2
The instructions flow down the pipeline, but notice the data dependency (a
type of data hazard) between them: MUL reads R1, which ADD writes.
Let us step through the computation with no hazard consideration:
- in cycle 3, the ADD operation calculates the new value for R1 (= 2),
- also in cycle 3, the MUL is decoded, fetching the value of R1 from the register file; at this point, that value is still 1, because ADD did not reach the WB stage yet (that is going to happen in cycle 5),
- in cycle 4, MUL reaches EX and computes its result as 1, which is clearly incorrect.
The straightforward fix is to stall MUL until ADD reaches the WB stage,
making the entire computation take 10 cycles instead of 6.
operand forwarding
An alternative is forwarding (or bypassing), which can eliminate certain
data hazards without inducing a stall. In our case, we can pass the data that
was computed by the ADD operation back to the EX stage of the MUL
instruction. Since the operands are up to date, the result of the MUL
instruction is computed correctly and a stall is not required. However, we
cannot always do forwarding. Consider this case, where the value has not yet
been computed at the time it is needed:
LD R1, 8(R2)
MUL R5, R1, R4
Stepping through the computation: the loaded value only becomes available at
the end of the MEM stage of the LD, but the MUL would need it at the start of
its EX stage in the very same cycle. Our only option that is guaranteed to
work here is stalling (delaying the execution of an instruction in order to
resolve a hazard): the MUL waits one cycle, after which the loaded value can
be forwarded to its EX stage.
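The required number of stall cycles can be worked out mechanically. A small
Python sketch (stage indices refer to the five-stage pipeline above; ALU
results are ready at the end of EX, loaded values at the end of MEM, and
forwarding delivers a value in the cycle after it is produced):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
# index of the stage at whose end a result becomes available for forwarding
READY = {"alu": STAGES.index("EX"), "load": STAGES.index("MEM")}

def stall_cycles(producer_kind, distance=1):
    # distance = how many instructions later the consumer is issued;
    # the consumer needs the value at the start of its own EX stage
    return max(0, READY[producer_kind] - STAGES.index("EX") + 1 - distance)

print(stall_cycles("alu"))              # 0: ADD -> MUL, forwarding suffices
print(stall_cycles("load"))             # 1: LD -> MUL, one stall is needed
print(stall_cycles("load", distance=2)) # 0: an unrelated instruction between them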
forwarding hardware support
The ALU result generated in the EX stage is normally passed through the
pipeline registers to the MEM and WB stages, before it is finally written to
the register file. This is shown in the following pipelined data path without
forwarding support.
The pipelined data path with forwarding support contains a forwarding unit.
The forwarding unit selects the correct ALU inputs for the EX stage. If no
hazard is detected, the operands of the ALU will come from the register file.
If a data hazard is detected, the operands will come from either the EX/MEM or
MEM/WB pipeline registers. The ALU sources will be selected by two new
multiplexers and corresponding control signals.
The following diagram shows a pipelined data path with a forwarding unit and
the two new multiplexers with their control signals.
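The selection logic itself is small. Here is a Python sketch of the textbook
forwarding conditions (field names are illustrative, not taken from any
particular design; real hardware implements the same comparisons with
comparators feeding the two multiplexers):

from dataclasses import dataclass

@dataclass
class PipelineReg:
    reg_write: bool   # does the instruction in this register write back?
    rd: int           # its destination register number

def forward_select(ex_mem, mem_wb, src_reg):
    # decide where one ALU operand for the EX stage should come from
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src_reg:
        return "EX/MEM"    # the freshest result, straight from the ALU
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src_reg:
        return "MEM/WB"    # one stage older, about to be written back
    return "REGFILE"       # no hazard: the register file is up to date

# ADD (writing R1) sits in EX/MEM while MUL needs R1 as a source
print(forward_select(PipelineReg(True, 1), PipelineReg(False, 0), 1))  # EX/MEM

The EX/MEM register is checked first, since it holds the more recent result;
register 0 is excluded, assuming a hardwired zero register as in many RISC
designs.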