Superscalar processing is the ability to initiate multiple instructions during the same clock cycle, significantly improving the throughput of processors designed in this style. In older processor designs (including those based on pipelining), at most one instruction enters the processor and at most one completes execution in any given cycle. Thus, in the best case, the processor can achieve an average execution rate of one instruction per cycle (an IPC of 1). A superscalar processor, on the other hand, can retire multiple instructions in one clock cycle, and under ideal conditions can sustain an average IPC (instructions per clock) greater than one.
Typically, a superscalar processor fetches and decodes several instructions at a time. During instruction fetch, many pipelined CPUs attempt to predict the outcomes of conditional branch instructions to ensure an uninterrupted stream of instructions (this is not, in principle, limited to superscalar processors).
Additionally, the instruction stream is analyzed for possible data dependencies and instructions are distributed to multiple functional units for parallel execution (this is where ‘superscalarness’ enters the picture; these units are often specialized and each can execute certain instruction types: think ALU, …).
Now since we are no longer processing the instruction stream in sequence, care must be taken to only execute instructions if their inputs (operands) are ready. This is most often done by the CPU dynamically (and is known as dynamic instruction scheduling), though somewhat ill-fated experiments have also been done with compiler-scheduled parallel instruction execution. Of course, this whole approach can improve execution speed (and achieve IPC > 1) when sufficient instruction-level parallelism is available in the instruction stream.
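The core idea of dynamic scheduling – issue an instruction as soon as its operands are ready, regardless of program order – can be sketched in a few lines of Python. This is a toy model (unlimited functional units, single-cycle latency, each register written at most once; all instruction names are illustrative), not how real issue logic is built:

```python
def schedule(instrs):
    """instrs: list of (name, dest, srcs) in program order; each register is
    written at most once, so a source is ready once its producer has issued
    (or immediately, if no instruction in the window writes it)."""
    producers = {dest: name for name, dest, _ in instrs}
    done = set()                     # names of completed instructions
    pending = list(instrs)
    cycles = []
    while pending:
        # issue every instruction whose sources are already available
        group = [i for i in pending
                 if all(src not in producers or producers[src] in done
                        for src in i[2])]
        assert group, "cyclic dependency"
        cycles.append([name for name, _, _ in group])
        done.update(name for name, _, _ in group)
        pending = [i for i in pending if i not in group]
    return cycles

# Four instructions forming two independent chains -> two issued per cycle:
prog = [("I1", "R1", []), ("I2", "R2", ["R1"]),
        ("I3", "R3", []), ("I4", "R4", ["R3"])]
print(schedule(prog))   # [['I1', 'I3'], ['I2', 'I4']]
```

Even this crude model shows where IPC > 1 comes from: the two chains contain no dependencies between them, so their instructions can be issued side by side.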
A degree of parallel processing in the form of pipelining has been around for decades. A pipeline acts like an assembly line with instructions being processed in phases as they pass down the pipeline. With simple pipelining, only one instruction at a time is initiated into the pipeline, but multiple instructions may be in some phase of execution concurrently. The pipeline initiation rate remained at one instruction per cycle for many years and was often considered to be a serious bottleneck. Of course, superscalar processors address this problem by feeding multiple instructions into what are effectively multiple, parallel pipelines. A somewhat simplistic example:
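Suppose we push 1000 independent instructions through a hypothetical 5-stage pipeline and vary the issue width (a back-of-the-envelope model that ignores dependencies and stalls; the stage count is made up for illustration):

```python
from math import ceil

# Cycles needed to run n independent instructions through a pipeline with
# `stages` stages when `width` instructions are initiated per cycle: the
# first group takes the full pipeline depth, every later group adds a cycle.
def cycles(n, stages=5, width=1):
    return stages + ceil(n / width) - 1

n = 1000
for width in (1, 2, 4):
    c = cycles(n, width=width)
    print(f"width {width}: {c} cycles, IPC = {n / c:.2f}")
# width 1: 1004 cycles, IPC = 1.00
# width 2:  504 cycles, IPC = 1.98
# width 4:  254 cycles, IPC = 3.94
```

The scalar pipeline tops out just below IPC = 1, while each doubling of the issue width roughly doubles throughput – as long as the instructions really are independent, which is exactly where the trouble starts.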
dependencies and conflicts
The main limitations for superscalar processors come from the dependencies between instructions. There are three possible kinds of dependencies or conflicts, depending on why they arise – due to data, control or resources.
- Data conflicts are caused by data dependencies between instructions – an instruction cannot be executed until its operands are available (generally produced by some earlier instruction).
- Instructions after a branch have a procedural (control) dependency on the branch instruction, and cannot be completely executed until the branch is executed (which becomes complicated if the branch instruction in question is conditional or worse, indirect).
- Resource conflicts are perhaps the easiest to understand: those arise when multiple instructions require the same resource for their completion (where a resource might be a register, the memory bus, or a functional unit). Unlike the other two conflict types, this type of conflict can be solved by literally throwing more resources at it (which is what superscalar processors generally do).
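The register-operand cases of these categories can be made concrete with a tiny classifier, using the standard hazard names: RAW (read-after-write) is a true data dependency, while WAR and WAW are register-reuse conflicts that renaming can remove. The tuple encoding here is made up for this sketch:

```python
# Classify the hazards between two instructions in program order, each
# given as (dest_reg, set_of_src_regs).
def hazards(first, second):
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")      # second reads what first writes
    if d2 in s1:
        found.add("WAR")      # second overwrites a register first reads
    if d1 == d2:
        found.add("WAW")      # both write the same register
    return found

# I1: ADD R3, R1, R2   I2: MUL R1, R3, R4
print(hazards(("R3", {"R1", "R2"}), ("R1", {"R3", "R4"})))
# prints a set containing 'RAW' and 'WAR'
```

Only the RAW case forces the second instruction to wait for the first's result; the other two merely reflect a shortage of register names.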
Superscalar designs differ in how aggressively they issue instructions in the face of these dependencies:
- static - instructions are issued and executed in program order,
- dynamic - out-of-order program execution is permitted, though avoided in complicated cases (the canonical example of a difficult case is a control dependency on a conditional branch),
- speculative - like dynamic, but with the added ability to partially execute instructions with unresolved control dependencies (it may later turn out that such an instruction wasn't supposed to execute at all, in which case it must be aborted and any effects that already took place rolled back).
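The rollback mechanics of speculation can be illustrated with a toy model in which speculative writes go into a scratch copy of the register file and are committed only if the prediction was correct. (Real hardware does this with reorder buffers and renamed registers rather than wholesale copies; the names here are illustrative.)

```python
# Toy model of speculative execution past a predicted branch: writes land
# in a shadow copy of the register file, committed only on a correct guess.
def run_speculative(regs, spec_writes, prediction_correct):
    scratch = dict(regs)          # shadow state for speculative results
    scratch.update(spec_writes)   # speculatively executed writes
    if prediction_correct:
        regs.update(scratch)      # commit: speculation becomes architectural
    # else: drop `scratch` entirely -- architectural state is untouched
    return regs

state = {"R1": 10, "R2": 0}
print(run_speculative(dict(state), {"R2": 99}, prediction_correct=False))
# -> {'R1': 10, 'R2': 0}: the speculative write to R2 was rolled back
```

The key property is that a misprediction never lets a speculative write reach architectural state; aborting is just discarding the scratch copy.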
A superscalar processor uses register renaming and out-of-order execution to detect and extract as much instruction-level parallelism as possible, with the goal of maximizing its IPC (instructions per clock) rate.
Register renaming eliminates resource conflicts on registers (what you could call anti-dependencies) by decoupling the physical registers from their names. Essentially, the processor performs an on-the-fly data-flow analysis of the instruction stream and detects cases where the same logical register (i.e. the same register name) is being used for two unrelated purposes.
Consider this: when a write ‘kills’ a value stored in some register without also reading that particular register, this write instruction (and anything that depends on the value that it wrote) depends on any previous reads of that register – this would be known as a write-after-read dependency. However, in this scenario, the dependency is accidental (as opposed to fundamental) and the CPU can basically re-allocate those conflicting logical registers onto non-conflicting physical registers, thus removing the dependency.
The implementation of register renaming is straightforward. If the processor encounters an instruction that addresses a destination register, it temporarily writes the result of this instruction into a dynamically allocated rename buffer rather than into the specified destination register. For instance, in the case of the following write-after-read dependency:
```
I1: ADD ..., R2, ...;   [... ← (R2) + (...)]
I2: MUL R2,  ..., ...;  [R2 ← (...) * (...)]
```

R2, the destination register of I2, is renamed, say to R31. Instruction I2 then becomes:

```
I2: MUL R31, ..., ...;  [R31 ← (...) * (...)]
```

The result of I2 is written into R31 instead of into R2. This resolves the write-after-read dependency between I1 and I2. Obviously, in subsequent instructions, references to source registers must be redirected to the rename buffers allocated to them for as long as this renaming remains valid.
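The scheme above can be sketched as a map table plus a free list of physical registers. This is a toy model: physical registers are never reclaimed (real renamers free them at retirement), and the register count and `P*` naming are made up:

```python
# A sketch of register renaming: every write to a logical register gets a
# fresh physical register from a free list; reads go through a map table.
def rename(instrs, num_phys=32):
    """instrs: list of (dest, srcs) using logical names like 'R2'.
    Returns the same program rewritten onto physical names 'P0', 'P1', ..."""
    free = [f"P{i}" for i in range(num_phys)]
    table = {}                       # logical name -> current physical reg
    out = []
    for dest, srcs in instrs:
        # sources are renamed *before* the destination, so a WAR pair
        # (read R2, then write R2) ends up in different physical registers
        new_srcs = []
        for s in srcs:
            if s not in table:       # live-in value: give it a home
                table[s] = free.pop(0)
            new_srcs.append(table[s])
        table[dest] = free.pop(0)    # a fresh register "kills" the old name
        out.append((table[dest], new_srcs))
    return out

# I1: ADD R5, R2, R3   I2: MUL R2, R6, R7   (WAR on R2)
print(rename([("R5", ["R2", "R3"]), ("R2", ["R6", "R7"])]))
# [('P2', ['P0', 'P1']), ('P5', ['P3', 'P4'])]
```

After renaming, I1 reads P0 while I2 writes P5, so the two instructions no longer touch the same storage and can execute in either order.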
A major limitation on the amount of instruction-level parallelism is uncertainty in the program's execution sequence. While conditional and indirect branch instructions serve essential purposes in high-level languages, they introduce uncertainty into program execution. The classic approach to this problem is branch prediction: a mechanism that predicts which instructions to execute next when a branch is encountered. However, branch prediction alone is not sufficient – one also needs to speculatively execute the instructions on the predicted path and recover from any mispredictions.
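As a small taste of what is to come, here is one classic scheme, the 2-bit saturating counter: each branch gets a tiny counter that is nudged toward "taken" or "not taken" by actual outcomes, so a single surprise (such as a loop exit) does not flip the prediction. This is just one textbook predictor, not what any particular modern CPU ships:

```python
# A minimal 2-bit saturating-counter branch predictor: one counter per
# branch address; values 0-1 predict not-taken, 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                       # branch address -> 0..3

    def predict(self, addr):
        return self.counters.get(addr, 2) >= 2   # unseen: weakly taken

    def update(self, addr, taken):
        c = self.counters.get(addr, 2)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch: taken 8 times, one exit, taken 8 more times.
p = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8
hits = 0
for taken in outcomes:
    hits += (p.predict(0x40) == taken)           # 0x40: made-up address
    p.update(0x40, taken)
print(f"{hits}/{len(outcomes)} correct")         # 16/17 correct
```

The saturation is the point: the single not-taken outcome only moves the counter from "strongly taken" to "weakly taken", so the very next iteration is still predicted correctly.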
Branch prediction is also the topic of our future blog posts, so stay tuned for more details to come.