pipelining
Pipelining is a technique of executing instructions in such a way that the
execution of the current instruction overlaps with the execution of
subsequent instructions. A pipeline acts like an assembly line: it is
divided into stages, and the stages are connected to form a pipe-like
structure. Instructions enter at one end and exit at the other. The goal
of pipelining is to increase the overall instruction throughput. Nowadays,
state-of-the-art machines are quite heavily pipelined.
sequential processing vs pipelining
Without a pipeline, a processor gets the first instruction from memory,
performs the operation requested by this instruction, then proceeds
to get the next instruction from memory, and so forth. While an instruction
is being fetched, the arithmetic part of the processor is idle, since it has
to wait for that instruction to arrive. With pipelining, however, while the
arithmetic unit is performing the computation, a different part of the CPU can
already work on fetching the next instruction from memory (perhaps storing it
in an internal buffer until it can be executed).
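To see the difference in cycle counts, here is a minimal Python sketch (a toy
two-stage machine with one fetch cycle and one execute cycle per instruction;
nothing here models a real CPU):

def trace(n, pipelined):
    cycles = []
    if not pipelined:
        # the fetch unit and the ALU take turns; one is always idle
        for i in range(1, n + 1):
            cycles.append(f"fetch i{i}   | ALU idle")
            cycles.append(f"fetch idle | execute i{i}")
    else:
        # while the ALU executes instruction i, the fetch unit
        # is already working on instruction i + 1
        cycles.append("fetch i1   | ALU idle")
        for i in range(1, n + 1):
            nxt = f"fetch i{i + 1}" if i < n else "fetch idle"
            cycles.append(f"{nxt:<10} | execute i{i}")
    print("\n".join(cycles) + f"\ntotal: {len(cycles)} cycles")

trace(3, pipelined=False)   # 6 cycles for 3 instructions
trace(3, pipelined=True)    # 4 cycles for the same work

As the instruction count grows, the overlapped version approaches one
completed instruction per cycle, while the sequential one stays at one per
two cycles.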
pipeline and its stages
A textbook example of pipelined execution is the classic 5-stage RISC
pipeline, which looks like this:
- Instruction Fetch (IF) - fetches the instruction from memory or cache,
- Instruction Decode (ID) - decodes the instruction and reads its operands (which must be in registers),
- Execute (EX) - executes the instruction or, if it is a memory access instruction, computes the memory address,
- Memory Access (MEM) - reads or writes memory; a no-op if the instruction does not access memory,
- Write Back (WB) - writes results back to registers.
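To visualize how instructions march through these stages, here is a small
Python sketch (an idealized model with no hazards, every stage taking exactly
one cycle) that prints which stage each instruction occupies in each cycle:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def print_pipeline(n_instructions):
    # instruction i enters IF one cycle after instruction i - 1
    total_cycles = len(STAGES) + n_instructions - 1
    for i in range(n_instructions):
        row = ["    "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:<4}"
        print(f"i{i + 1}: " + "".join(row))

print_pipeline(4)

Once the pipeline is full, all five stages work in parallel on five different
instructions, which is where the throughput gain comes from.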
This five-stage pipeline will potentially produce results five times as fast
as non-pipelined sequential processing, as a quick calculation shows.
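Assuming an idealized model in which every stage takes exactly one cycle and
no stalls ever occur: with k stages, n instructions take n * k cycles when
executed strictly sequentially, but only k + (n - 1) cycles when pipelined
(k cycles for the first instruction to flow through, then one additional
cycle per remaining instruction). The speedup is therefore

    speedup = (n * k) / (k + n - 1)

which approaches k as n grows, so a five-stage pipeline approaches, but never
quite reaches, a factor of five.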
The five-stage pipeline is a commonly used learning example, though modern
processors often have more than ten pipeline stages, and even 20 stages is
not unheard of. However, deciding how many stages to use is not a simple
task, and a number of things need to be taken into account.
Each stage should ideally take the same amount of time, because the clock
period of the whole pipeline is dictated by the slowest stage. If some stages
are slower than others, the parts of the processor implementing the faster
stages are not used at their full potential and waste part of each cycle
waiting. A possible solution is to split these big, slow stages into multiple
smaller ones, as the sketch below illustrates.
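The effect can be sketched in a few lines of Python (the stage latencies are
made-up numbers, chosen only to illustrate the point):

def cycle_time(stage_latencies_ns):
    # every stage must finish within one clock period,
    # so the slowest stage dictates the clock for all of them
    return max(stage_latencies_ns)

unbalanced = [1.0, 1.0, 3.0, 1.0, 1.0]       # one stage three times slower
balanced   = [1.0, 1.0, 1.5, 1.5, 1.0, 1.0]  # the slow stage split in two

print(cycle_time(unbalanced))  # 3.0 ns per cycle
print(cycle_time(balanced))    # 1.5 ns per cycle

Splitting the slow stage halves the clock period, at the price of a longer
pipeline. Another source of issues is resource conflicts between stages.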
Some resource conflicts can be resolved by duplicating the resource, but this
is not always possible (or practical). If one of the stages needs to wait to
use a certain resource, we say that the pipeline stalls. When that happens,
the entire pipeline loses a cycle: the units in front (upstream) of the stuck
unit cannot push their results down the pipe, while behind it (downstream)
more and more units become idle, since no work arrives for them in that cycle.
While it might be tempting to make pipelines longer (more stages means more
parallelism, and hardware is good at parallelism), there are also some serious
drawbacks. Everything works fine as long as the pipeline is kept full, but
this becomes hard when it is not clear which instruction should enter the
pipeline next. This is exactly what happens when a jump instruction enters
the pipeline (and it gets worse if the jump is conditional or indirect).
The IF stage (or its analog) needs to know the address of the instruction to
fetch, but this address is not available until the jump is at least decoded
(for a direct jump) or executed (for conditional and indirect jumps). If the
IF does not have an address, it cannot
do anything and a bubble starts forming in the pipeline (each item is a single
clock cycle):
- the IF fetches a jump instruction,
- the IF is idle, ID works on the jump,
- both IF and ID are idle, the conditional jump reaches EX which resolves the jump and passes the address M to IF,
- IF fetches the instruction from M, ID is idle, EX is idle,
- IF fetches (M + 1), ID decodes instruction M, EX and MEM are idle,
- IF fetches (M + 2), ID decodes (M + 1), EX executes M, MEM and WB are idle.
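The walkthrough can be condensed into a small Python sketch (assuming, as
above, that a direct jump is resolved at the end of ID and a conditional one
at the end of EX):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def fetch_bubbles(resolve_stage):
    # the fetch unit idles for one cycle per stage the jump must
    # traverse after IF before the target address is known
    return STAGES.index(resolve_stage)

print(fetch_bubbles("ID"))   # 1 bubble: direct jump, target known after decode
print(fetch_bubbles("EX"))   # 2 bubbles: conditional jump, as walked through above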
The critical metric here is the distance between instruction fetch (the unit
which needs the address of the next instruction) and the units that resolve
jump addresses. However, those units gravitate toward the end of the
pipeline, since the computation needs its inputs ready, and those are usually
only available fairly late. A common approach to reducing the effective
distance is branch prediction:
something that can be done around the time of ID (instruction decode) and
hence not far away from IF. However, at this stage, the best we can do is
guess, and sometimes we are going to guess wrong. When that happens, part of
the pipeline holds incorrect intermediate results, which need to be discarded
(the pipeline is flushed).
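The cost of wrong guesses can be quantified. A minimal Python sketch with
made-up numbers (suppose 20% of instructions are branches, the predictor is
right 90% of the time, and each misprediction costs 3 cycles of flushed work,
roughly matching the walkthrough above):

def effective_cpi(base_cpi, branch_fraction, accuracy, flush_penalty):
    # every mispredicted branch wastes flush_penalty cycles of work
    mispredictions = branch_fraction * (1 - accuracy)
    return base_cpi + mispredictions * flush_penalty

print(effective_cpi(1.0, 0.20, 0.90, 3))   # 1.06 cycles per instruction
print(effective_cpi(1.0, 0.20, 0.50, 3))   # 1.30 with a coin-flip predictor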
hazards
For a pipelined processor, there are three sources of hazard:
- Structural (resource) hazards occur when two or more instructions that are already in the pipeline need the same resource. The result is that the instructions must be executed in series rather than in parallel for a part of the pipeline.
- Control hazards occur when a branch is predicted incorrectly and the instructions that were speculatively brought into the pipeline must be discarded.
- Data hazards occur when an instruction needs to use data which are not yet available. There are three forms of data hazards (read after write, write after read, and write after write), but only one of them, the read after write (RAW) hazard, is a real issue for pipelined processors (see the sketch below).
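To make the read-after-write case concrete before working through an example,
here is a toy Python sketch (instructions are reduced to a mnemonic, a
destination register, and source registers; the window parameter is a
stand-in for how many earlier instructions may still be in flight):

# each instruction reduced to: (mnemonic, destination, sources)
program = [
    ("ADD", "R1", ("R1", "R6")),
    ("MUL", "R5", ("R1", "R2")),   # reads R1 right after ADD writes it
]

def raw_hazards(instructions, window=3):
    # a RAW hazard: an instruction reads a register that an earlier,
    # still-in-flight instruction has not yet written back
    found = []
    for i, (_, _, sources) in enumerate(instructions):
        for j in range(max(0, i - window), i):
            name, dest, _ = instructions[j]
            if dest in sources:
                found.append((name, instructions[i][0], dest))
    return found

print(raw_hazards(program))   # [('ADD', 'MUL', 'R1')]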
data dependencies
Consider the following sequence of instructions (we will remain faithful to
the classic five-stage RISC pipeline). All relevant registers are set to 1
at the outset.
ADD R1, R1, R6 ; R1 = R1+R6
MUL R5, R1, R2 ; R5 = R1*R2
The instructions flow down the pipeline, but notice the data dependency (a
type of data hazard) between them: MUL reads R1, which ADD writes.
Let us step through the computation with no hazard consideration:
- in cycle 3, the ADD operation calculates the new value for R1 (= 2),
- also in cycle 3, the MUL is decoded, fetching the value of R1 from the register file; at this point, that value is still 1, because ADD did not reach the WB stage yet (that is going to happen in cycle 5),
- in cycle 4, MUL reaches EX and computes its result as 1, which is clearly incorrect.
The straightforward fix is to stall MUL until ADD reaches the WB stage,
making the entire computation take 10 cycles instead of 6.
operand forwarding
An alternative is forwarding (or bypassing), which can eliminate certain
data hazards without inducing a stall. In our case, we can pass the data that
was computed by the ADD operation back to the EX stage of the MUL
instruction. Since the operands are up to date, the result of the MUL
instruction is computed correctly and a stall is not required. However, we
cannot always do forwarding. Consider this case, where the value has not yet
been computed at the time it is needed:
LD R1, 8(R2)
MUL R5, R1, R4
Stepping through the computation: the loaded value only becomes available at
the end of the MEM stage of the LD, but the MUL would need it at the start of
its EX stage in the very same cycle. Our only option that is guaranteed to
work here is stalling (delaying the execution of an instruction in order to
resolve a hazard): the MUL waits one cycle, after which the loaded value can
be forwarded to its EX stage.
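The required number of stall cycles can be worked out mechanically. A small
Python sketch (stage indices refer to the five-stage pipeline above; ALU
results are ready at the end of EX, loaded values at the end of MEM, and
forwarding delivers a value in the cycle after it is produced):

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
# index of the stage at whose end a result becomes available for forwarding
READY = {"alu": STAGES.index("EX"), "load": STAGES.index("MEM")}

def stall_cycles(producer_kind, distance=1):
    # distance = how many instructions later the consumer is issued;
    # the consumer needs the value at the start of its own EX stage
    return max(0, READY[producer_kind] - STAGES.index("EX") + 1 - distance)

print(stall_cycles("alu"))              # 0: ADD -> MUL, forwarding suffices
print(stall_cycles("load"))             # 1: LD -> MUL, one stall is needed
print(stall_cycles("load", distance=2)) # 0: an unrelated instruction between them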
forwarding hardware support
The ALU result generated in the EX stage is normally passed through the
pipeline registers to the MEM and WB stages, before it is finally written to
the register file. This is shown in the following pipelined data path without
forwarding support.
The pipelined data path with forwarding support contains a forwarding unit.
The forwarding unit selects the correct ALU inputs for the EX stage. If no
hazard is detected, the operands of the ALU will come from the register file.
If a data hazard is detected, the operands will come from either the EX/MEM or
MEM/WB pipeline registers. The ALU sources will be selected by two new
multiplexers and corresponding control signals.
The following diagram shows a pipelined data path with a forwarding unit and
the two new multiplexers with their control signals.
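The selection logic itself is small. Here is a Python sketch of the textbook
forwarding conditions (field names are illustrative, not taken from any
particular design; real hardware implements the same comparisons with
comparators feeding the two multiplexers):

from dataclasses import dataclass

@dataclass
class PipelineReg:
    reg_write: bool   # does the instruction in this register write back?
    rd: int           # its destination register number

def forward_select(ex_mem, mem_wb, src_reg):
    # decide where one ALU operand for the EX stage should come from
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src_reg:
        return "EX/MEM"    # the freshest result, straight from the ALU
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src_reg:
        return "MEM/WB"    # one stage older, about to be written back
    return "REGFILE"       # no hazard: the register file is up to date

# ADD (writing R1) sits in EX/MEM while MUL needs R1 as a source
print(forward_select(PipelineReg(True, 1), PipelineReg(False, 0), 1))  # EX/MEM

The EX/MEM register is checked first, since it holds the more recent result;
register 0 is excluded, assuming a hardwired zero register as in many RISC
designs.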