I recently implemented support for conditional move (or more precisely, conditional select) for the MRISC32 ISA, and I wanted to share some thoughts on that. It’s one of those pieces that just fit into the MRISC32 ISA puzzle very nicely.
The problems with branches
In any pipelined CPU architecture, one of the biggest problems is branching. Like any CPU instruction, a branch instruction takes several pipeline cycles to fetch, decode and execute. This means that by the time you know whether or not a branch is taken, and to what program address it shall jump, you will already have several following instructions lined up in the pipeline that, if the branch is taken, will have to be canceled (and obviously that translates directly to wasted CPU cycles). The longer your pipeline is, the bigger the problem is.
Here is where branch prediction usually comes to the rescue. Simply put, any modern CPU will guess where a branch is going to jump, and whether or not it will perform that jump, by looking at the code structure and by keeping track of the outcome of previous branch events. Statistically, the vast majority of the branches are correctly predicted (e.g. loops and error handling).
However, there are a few problems that branch prediction can not solve.
- Some conditions are by nature very hard to predict, approaching a random 50:50 taken/not taken probability. Even the best branch predictor will be of little use, and if those conditions appear in hot loops you will see a noticeable performance hit.
- Vectorized code can not use branches in the same way as scalar code, since conditions need to be evaluated for each individual vector element.
- The machine code for the branches often requires many instructions and that takes up valuable space in the instruction cache (among other things).
For instance, consider the following code that picks one of two values based on the value of a status variable:
if (status == 8) y = a; else y = b;
This will typically be translated to something like this:
seq s1, s1, #8 ; status == 8? bns s1, status_not_8 ; branch if not true mov s1, s2 ; y = a b continue ; branch (skip the else part) status_not_8: mov s1, s3 ; y = b continue: ...
That’s five instructions (of which at least three will have to be executed one way or the other). Can we do better? Of course we can!
The previous example is were conditional moves can be used instead of branches. There are various ways in which a conditional move can be implemented in an ISA. Below are some examples.
On the x86 you can combine a compare instruction (that sets flags in the condition code register) with one of several “move if condition X is set” instructions.
In this scenario, the previous example could translate to this:
mov eax, esi ; y = a cmp edi, 8 ; status == 8? cmovne eax, edx ; y = b if not equal
This code is shorter than the corresponding branch-based solution, and we also get rid of the branches. However there are some problems with this implementation. First, you need to use two explicit move instructions. Second, the CMOV[cc] instruction can be “nullified” (i.e. it may or may not produce a result), which makes it problematic to form a continuous dependency chain between instructions for efficient execution in a pipeline.
The latter problem can be solved by using the output operand as an input operand too and treat the CMOV[cc] instruction as a conditional selection instead (but that can apparently have unexpected side effects).
The 32-bit ARMv7 ISA provides the ability to predicate most instructions (including moves) so that they are executed or not depending on the result of a comparison operation. For instance, we can do this:
cmp r0, #8 ; status == 8? moveq r0, r1 ; y = a if equal movne r0, r2 ; y = b if not equal
This is quite similar to x86, and comes with similar caveats.
With the 64-bit AArch64 ISA, ARM did away with predication (it turns out that it’s a bad fit for modern CPUs), but introduced a “conditional select” instruction instead. The CSEL instruction selects one of two source operands based on the result of a comparison operation.
cmp w0, 8 ; status == 8? csel w0, w1, w2, eq ; y = a if equal, else y = b
This is an improvement over both ARMv7 and x86, since the CSEL instruction always produces a result, and we only require two instructions instead of three to perform the conditional move.
Most high performance CPU architectures have some sort of conditional move (including MIPS and Alpha etc). RISC-V on the other hand does not have a conditional move instruction, largely due to its design philosophy as far as I can tell.
Instead some RISC-V implementations treat short forward branches as predicates (i.e. the CPU front end transforms a branch into a predicate: execute or skip the following instruction). A conditional move could then look like this:
mv a4, a1 ; y = a li a5, 8 beq a0, a5, skip ; Skip next instruction if status == 8 mv a4, a2 ; y = b skip:
However, this is neither compact nor very efficient. At best it’s equivalent to the x86 CMOV instruction.
The MRISC32 way
As usual, MRISC32 does it slightly differently. Unlike most of the architectures discussed previously, MRISC32 does not use a condition code register. Instead comparison instructions produce bit masks that can be stored in any register.
For instance, the instruction “seq s1, s2, #8” will compare the register S2 with the value 8, and depending on the result of the comparison the value 0xFFFFFFFF (true) or the value 0x00000000 (false) will be stored in S1.
Thus, we can implement conditional selection by introducing a SEL instruction that does bitwise selection, like this:
y = (a & mask) | (b & ~mask)
Here, mask can be the result of a comparison operation (like SEQ), and a and b are the two values to select between. Thus we get:
seq s1, s1, #8 ; status == 8? sel s1, s2, s3 ; y = a if true, else y = b
This is very similar to the AArch64 solution, but it comes with a few twists as we’ll see shortly.
Note that the first operand is both the source mask operand and the destination operand. This is because MRISC32 instruction encoding only allows a maximum of three operands. However, there are four variants of the SEL instruction that have different permutations of the operands, which makes it possible to choose which source register to overwrite.
Of course the SEL instruction works just as well with floating-point values as with integer values. Remember, the MRISC32 stores integer and floating-point values in the same registers, so no special “floating-point select” instruction is needed.
By extension, it is trivial to combine integer conditions with floating-point conditional selection, and vice versa. For instance:
fsunord s1, s1, s2 ; Is S1 or S2 unordered (NaN)? ldi s2, #3 sel s1, s2, #8 ; y = 3 if true, else y = 8
The conditional selection instruction also works with vector operations. This is usually the best way to handle conditions for vector instructions, since there is no way to execute different branch outcomes for different vector elements. For example:
seq v1, v1, #8 ; status[i] == 8? sel v1, v2, v3 ; y[i] = a[i] if true, else y[i] = b[i]
Packed data types
Recall that MRISC32 supports packed data types by storing two 16-bit half-words or four 8-bit bytes in a single 32-bit register (or in a single 32-bit element of a vector register). Thus, a packed “set” instruction will produce true/false bit masks for the individual sub-words of the 32-bit word, which in turn works perfectly with the SEL instruction (since it does selection at the bit level).
In the following example, the SEQ.H instruction performs independent equality checks for the upper and the lower half-words of the “status” variable, and consequently the SEL instruction will perform individual selections for the upper and lower half-words:
ldi s4, #0x00080008 seq.h s1, s1, s4 ; status == 8? sel s1, s2, s3 ; y = a if true, else y = b
The selection mask in this case can have one of the following values:
- 0x00000000 – false, false
- 0x0000FFFF – false, true
- 0xFFFF0000 – true, false
- 0xFFFFFFFF – true, true
An additional feature of the bit-mask representation is that it is very straight forward to combine several conditions into a single conditional move (or branch) using regular bitwise operations (and, or, xor, etc). For instance, consider the following code:
if (status == 3 || status == 8) y = a; else y = b;
This can be implemented as:
seq s4, s1, #3 ; status == 3? seq s1, s1, #8 ; status == 8? or s1, s1, s4 ; status == 3 || status == 8? sel s1, s2, s3 ; y = a if true, else y = b
I believe that the SEL instruction is a good addition to the MRISC32 ISA, and since it works in many different situations it is a fairly powerful instruction.
The only thing that is special about the instruction, compared to other instructions, is that it has three source operands and one destination operand. This requires more hardware than if we only allowed instructions with two source operands (for instance the register file of an in-order, single issue CPU needs three read ports instead of two).
On the other hand the store instructions (e.g. STW) already require three source registers (value to store, base address and scaled index). Also, at least one more three-source + one-destination instruction is planned in the future: fused multiply-accumulate.