superscalar architectures
Superscalar processing is the ability to initiate multiple instructions during
the same clock cycle, significantly improving throughput of processors
designed in this style. In older processor designs (including those based on
pipelining), at most one instruction enters the processor and at most one
completes execution in any given cycle. Thus, in the best case, the processor
can achieve an average execution rate of at most one instruction per cycle,
i.e. an IPC (instructions per cycle) of at most one. A superscalar processor,
on the other hand, can retire multiple instructions in one clock cycle and,
under ideal conditions, run at an average IPC greater than one.
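To put numbers on that (the figures below are made up, purely for
illustration): a processor that retires 1000 instructions in 400 cycles
achieves an IPC of 1000 / 400 = 2.5 – something only a design that issues
several instructions per cycle can do, since a strictly scalar processor
would need at least 1000 cycles for the same work.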
Typically, a superscalar processor fetches and decodes several instructions at
a time. During instruction fetch, many pipelined CPUs attempt to predict the
outcomes of conditional branch instructions to ensure an uninterrupted stream
of instructions (this is not, in principle, limited to superscalar processors).
Additionally, the instruction stream is analyzed for possible data
dependencies and instructions are distributed to multiple functional units
for parallel execution (this is where ‘superscalarness’ enters the picture;
these units are often specialized and each can execute certain instruction
types: think ALU, …).
Now, since we are no longer processing the instruction stream strictly in
sequence, care must be taken to execute an instruction only once its inputs
(operands) are ready. This is most often done by the CPU dynamically (and is
known as dynamic instruction scheduling), though somewhat ill-fated
experiments have also been done with compiler-scheduled parallel instruction
execution (the VLIW approach). Of course, this whole approach can improve
execution speed (and achieve IPC > 1) only when sufficient instruction-level
parallelism is available in the instruction stream.
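As a very rough sketch of dynamic scheduling (in Python; the instruction
tuples, unit names and single-cycle latencies are all invented for
illustration – real hardware tracks this with reservation stations and far
more state):

# A toy dynamic scheduler: issue an instruction as soon as its operands are
# ready and a matching functional unit is free, regardless of program order.
# Instruction format: (name, unit, source registers, destination register).
instructions = [
    ("I1", "alu", [],           "R1"),
    ("I2", "alu", ["R1"],       "R2"),   # data dependency on I1
    ("I3", "mul", [],           "R3"),   # independent of I1 and I2
    ("I4", "mul", ["R2", "R3"], "R4"),   # depends on I2 and I3
]

UNITS = {"alu": 1, "mul": 1}   # one ALU and one multiplier
ready = set()                  # registers whose values have been produced
pending = list(instructions)
cycle = 0

while pending:
    cycle += 1
    free = dict(UNITS)
    issued = []
    for name, unit, srcs, dst in pending:
        # issue only if all operands are ready and a suitable unit is free
        if all(s in ready for s in srcs) and free[unit] > 0:
            free[unit] -= 1
            issued.append((name, unit, srcs, dst))
            print(f"cycle {cycle}: issued {name}")
    for ins in issued:
        pending.remove(ins)
        ready.add(ins[3])      # the result becomes visible in the next cycle

Note how I3 is issued alongside I1 in the first cycle even though I2 sits
between them in program order – exactly the reordering described above.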
instruction-level parallelism
A degree of parallel processing in the form of pipelining has been around for
decades. A pipeline acts like an assembly line with instructions being
processed in phases as they pass down the pipeline. With simple pipelining,
only one instruction at a time is initiated into the pipeline, but multiple
instructions may be in some phase of execution concurrently. The pipeline
initiation rate remained at one instruction per cycle for many years and was
often considered to be a serious bottleneck. Of course, superscalar processors
address this problem by feeding multiple instructions into what are effectively
multiple, parallel pipelines. A somewhat simplistic example, assuming a
classic five-stage fetch (F), decode (D), execute (E), memory (M),
write-back (W) pipeline:
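a scalar pipeline (one instruction enters per cycle):

cycle:  1  2  3  4  5  6
I1:     F  D  E  M  W
I2:        F  D  E  M  W

a two-wide superscalar (two instructions enter per cycle):

cycle:  1  2  3  4  5  6
I1:     F  D  E  M  W
I2:     F  D  E  M  W
I3:        F  D  E  M  W
I4:        F  D  E  M  W

In the same six cycles, the scalar pipeline completes two instructions while
the two-wide one completes four.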
dependencies and conflicts
The main limitations for superscalar processors come from the dependencies
between instructions. There are three possible kinds of dependencies or
conflicts, depending on why they arise – due to data, control or resources.
- Data conflicts are caused by data dependencies between instructions – an instruction cannot be executed until its operands are available (generally produced by some earlier instruction).
- Instructions after a branch have a procedural (control) dependency on the branch instruction, and cannot be completely executed until the branch itself is executed (which becomes complicated if the branch instruction in question is conditional or, worse, indirect).
- Resource conflicts are perhaps the easiest to understand: these arise when multiple instructions require the same resource for their completion (where a resource might be a register, the memory bus, or a functional unit). Unlike the other two conflict types, this type of conflict can be solved by literally throwing more resources at it (which is what superscalar processors generally do). All three kinds show up in the short example after this list.
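To make the three kinds concrete, consider this made-up instruction sequence
(the same notation as in the renaming example below; the registers and
opcodes are arbitrary):

I1: ADD R1, R2, R3 ; [R1 ← (R2) + (R3)]
I2: MUL R4, R1, R5 ; data conflict – needs the R1 produced by I1
I3: BEQ R4, L1     ; conditional branch, outcome unknown until executed
I4: MUL R6, R7, R8 ; control conflict – whether I4 runs at all depends on I3;
                   ; also a resource conflict with I2 if there is only one multiplier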
Depending on how aggressively the processor deals with these dependencies
when issuing instructions, three policies are usually distinguished:
- static - instructions are issued and executed in program order,
- dynamic - out-of-order execution is permitted, though avoided in complicated cases (the canonical example of a difficult case would be a control dependency on a conditional branch),
- speculative - like the dynamic policy, with the added ability to partially execute instructions with unresolved control dependencies (in this case, it may later turn out that the instruction wasn't supposed to execute at all and must be aborted, and any effects which already took place need to be rolled back).
register renaming
A superscalar processor uses register renaming and out-of-order execution
to detect and extract as much instruction-level parallelism as possible, with
the goal of maximizing its IPC.
Register renaming eliminates resource conflicts on registers (what you could
call anti-dependencies) by decoupling the physical registers from their names.
Essentially, the processor performs an on-the-fly data-flow analysis of the
instruction stream and detects cases where the same logical register (i.e.
the same register name) is being used for two unrelated purposes.
Consider this: when a write ‘kills’ a value stored in some register without
also reading that particular register, this write instruction (and anything
that depends on the value that it wrote) depends on any previous reads of that
register – this would be known as a write-after-read dependency. However, in
this scenario, the dependency is accidental (as opposed to fundamental) and
the CPU can basically re-allocate those conflicting logical registers onto
non-conflicting physical registers, thus removing the dependency.
The implementation of register renaming is straightforward. If the processor
encounters an instruction that addresses a destination register, it
temporarily writes the result of this instruction into a dynamically allocated
rename buffer rather than into the specified destination register. For
instance, in the case of the following write-after-read dependency:
I1: ADD ..., R2, ...; [... ← (R2) + (...)]
I2: MUL R2, ..., ...; [R2 ← (...) ∗ (...)]
R2, the destination register of I2, is renamed, say to R31. Then,
instruction I2 effectively becomes:
I2: MUL R31, ..., ...; [R31 ← (...) ∗ (...)]
The result of I2 is written into R31 instead of into R2. This resolves the
previous write-after-read dependency between I1 and I2. Obviously, in
subsequent instructions, references to source registers must be redirected to
the rename buffers allocated to them as long as this renaming remains valid.
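As a very rough sketch of the bookkeeping involved (in Python; the free
list, the register names R5/R7/R8/R9 and the helper function are invented
for illustration – real hardware keeps a rename table and a free list of
physical registers in dedicated logic):

# Toy register renaming: every write to a logical register is given a fresh
# physical register; reads go through the current logical-to-physical map.
free_list = ["P%d" % i for i in range(32)]  # unused physical registers
rename_map = {}                             # logical name -> physical name

def rename(dst, srcs):
    # sources read whichever physical register currently holds their value
    phys_srcs = [rename_map.get(s, s) for s in srcs]
    # the destination gets a brand-new physical register, so this write can
    # no longer conflict with earlier reads of the same logical name
    phys_dst = free_list.pop(0)
    rename_map[dst] = phys_dst
    return phys_dst, phys_srcs

# the write-after-read example from above:
print(rename("R5", ["R2", "R7"]))  # I1 reads the architectural R2
print(rename("R2", ["R8", "R9"]))  # I2 now writes P1 – the conflict is gone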
speculative execution
A major limitation on the amount of instruction-level parallelism is
uncertainty in the program execution sequence. While conditional and indirect
branch instructions serve essential purposes in high-level languages, they
introduce uncertainty into program execution. The classic approach to this
problem is branch prediction: a mechanism that predicts the next instructions
to execute when branches are encountered during program execution. However,
branch prediction alone is not sufficient. One also needs to speculatively
execute the instructions on the predicted path of execution and to recover
from any mispredictions.
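A cartoon version of that recovery (in Python; the branch condition, the
predictor's verdict and checkpointing-by-copying are placeholders, not how
any real CPU implements it):

# Toy speculative execution: run past an unresolved branch based on a
# prediction, and restore a checkpoint if the prediction was wrong.
regs = {"R1": 10, "R2": 0}
checkpoint = dict(regs)            # architectural state saved at the branch

predicted_taken = True             # what the (hypothetical) predictor said
regs["R2"] = regs["R1"] * 2        # speculatively execute the predicted path

actual_taken = regs["R1"] > 100    # the branch finally resolves
if predicted_taken != actual_taken:
    regs = checkpoint              # misprediction: discard speculative work
    print("mispredicted, rolled back:", regs)
else:
    print("prediction correct, state kept:", regs)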
Branch prediction is also the topic of our future blog posts, so stay tuned
for more details to come.