For some time now I have been working on my first CPU design as a hobby project. It started with a desire to learn more about CPU architecture design, and in particular instruction set architecture (ISA) design.
To be clear, I’m not a hardware or CPU designer by trade, but I have worked with performance critical software since the late 80’s, and I have done low level assembly code development on several different CPU architectures, including 8-bit CISC (6502), 16/32-bit CISC (68000) and 32-bit RISC (ARMv7). As far as possible I have stayed away from the abominable x86 ISA, but I have done a fair amount of SSE intrinsics coding in C/C++.
The initial idea was to make a very simple RISC CPU similar to the MIPS architecture. Early on, however, I realized that I needed to come up with a solution for floating point and SIMD (single instruction multiple data) operations, so I looked at different contemporary solutions.
Being familiar with both x86 SSE/AVX and ARM NEON, I knew that they are hard to work with, set an upper limit to parallelism (due to a fixed SIMD register size), and are bolted on to an existing architecture as a fairly separate ISA (effectively creating different execution environment silos). Similarly almost every CPU architecture from the last four decades treat floating point as a separate extension (e.g. via the coprocessor interface on the MIPS). In short, I wasn’t convinced that the prevalent ISA:s of today give the right answer to how integer, floating point and SIMD instructions should work and interact with each other.
The turning point came when I learned more about the Cray-1 architecture from the mid 70’s. It was an epiphany! The Cray clearly had an architecture way ahead of its time, and in many respects the modern CPU industry is still trying to catch up with the ideas and inventions from that architecture:
If you look closer at the diagram above, you can see that the Cray had a few interesting features not found in most modern architectures:
- A huge vector register file: Eight registers with 64×64 bits (4096 bits) in each register!
- There are no floating point registers! Scalar and vector registers can hold both integer and floating point values.
- Scalar and vector operands can be fed to the same execution units, and can even be mixed!
- There is a special vector length (VL) register that controls the length of vector operations. Unlike most current SIMD ISA:s, there is no need to split vector loops into a “main” (vector) and a “tail” (scalar) loop!
This inspired me to go for an ISA that borrows a mix of ideas from Cray-1 (the vector parts and the mixed float/integer register file parts) and MIPS (the RISC parts and the use-register-content-as-branch-condition part).
After many iterations (but with much more work still remaining), an ISA and an accompanying CPU design is starting to take shape:
MRISC32 – “Mostly harmless Reduced Instruction Set Computer – 32-bit edition”
Documentation, code and tools can be found at GitHub: mbitsnbites/mrisc32
So, what is it?
In essence, it’s a 32-bit RISC ISA designed with a holistic view on integer, floating point, scalar and vector operations. In addition there is a hardware implementation of a single issue, in order, pipelined CPU. The hardware implementation mostly serves as an aid in the design of the ISA (at the time of writing the CPU is still incomplete).
Some key features are:
- 32-bit RISC instruction set.
- 32 32-bit scalar registers.
- 32 Nx32-bit vector registers (where N is typically 16 or more).
- All registers can hold integer or floating point values.
- Variable length vector instructions.
- Almost all instructions can use either scalar or vector operands, or a mix of the two (including immediate scalar values).
- Branches are based on register content (there are no condition code flags).
Some of the design choices/consequences are described below.
A quick guide to the assembly language examples:
- Scalar registers are prefixed ‘s’.
- Vector registers are prefixed ‘v’.
- The leftmost operand is usually the destination operand, and the other operands are source operands.
- Most instruction names should be self explanatory. If not, see Instructions.md.
Integer and floating point can be mixed
Since all registers can be used by both integer and floating point instructions, you can easily use integer instructions for dealing with floating point data.
Example: Load a floating point value from memory and compute x = 2.0 * |x|.
ldi s12, 0x7fffffff ; s12 = sign mask ldw s13, s7, 0 ; Load a 32-bit word from addr (s7 + 0) and s13, s13, s12 ; Mask off the sign bit from the word fadd s13, s13, s13 ; Calculate x = x + x (floating point) stw s13, s7, 0 ; Store the result back to memory
Vector and scalar operations can be mixed
Virtually all scalar instructions have identical vector instruction counterparts (the notable exception is branch instructions, which only exist as scalar operations).
Example: Add integer values.
add s12, s1, s2 ; Scalar operation: s12 = s1 + s2 add v12, v1, v2 ; Vector operation: v12 = v1 + v2
Most vector operations can use one scalar operand (even immediate values).
Example: Multiply all vector elements by 10 and right shift by 2 bits
ldi s12, 10 ; Scalar operation: s12 = 10 mul v12, v1, s12 ; Vector operation: v12 = v1 * 10 asr v12, v12, 2 ; Vector operation: v12 = v12 >> 2
Vector operations are variable length
The length of each vector operation (i.e. the number of vector elements to process) is determined by the scalar register VL.
Example: Perform a vector addition for three elements.
ldi vl, 3 ; Set the vector length to 3 add v12, v1, v2 ; v12 = v1 + v2 (first three elements)
The maximum vector length (i.e. the size of one vector register) is implementation dependent, and can be queried using the CPUID instruction. In a loop this can be conveniently combined with the MIN instruction to decide how many elements to process in each iteration.
Example: Process as many elements as possible per loop iteration.
; s1: Data array start address (32-bit floating point elements) ; s2: Number of data elements (at least 1) cpuid s8, z, z ; s8 = the maximum vector length lsl s9, s8, 2 ; s9 = memory increment per iteration (4*s8) loop: min vl, s8, s2 ; Vector length = min(max_vl, elements_left) sub s2, s2, vl ; Decrement the loop counter ldw v8, s1, 4 ; Load elements into v8 (stride: 4 bytes/element) fmul v8, v8, v8 ; Calculate the floating point square (x = x^2) stw v8, s1, 4 ; Store the elements back to memory add s1, s1, s9 ; Increment the memory pointer bgt s2, loop ; Loop if there are any elements left
- This code will utilize maximum vector parallelism, regardless of the size of the vector registers of the CPU.
- No special tail loop is required to handle non vector sized operations.
Branches and conditions
One thing that sticks out is the compare & branch model. Unlike most architectures (x86, ARM, PowerPC, 68k, etc), but like MIPS and Alpha, the MRISC32 does not use condition code flags for keeping track of branch conditions. Instead any scalar register can be used as a branch condition.
Example: Branch as long as a register value is > 0.
ldi s12, 1234 ; Set the loop counter loop: add s12, s12, -1 ; Decrement the loop counter ... bgt s12, loop ; Continue loop if the loop counter > 0
Example: Compare floating point values and branch.
fslt s12, s7, s8 ; Set s12 if s7 < s8 bns s12, not_less_than ; Branch if s12 is not set ... not_less_than:
One feature of the “compare and set” operations is that they produce a bit mask. I.e. “set” means all bits in the register are set to 1, whereas the opposite is that all bits are cleared to 0. This is a concept that has been borrowed from SIMD ISA:s such as NEON and SSE. This makes the standard compare instructions useful for masking of vector registers too.
Example: Conditionally add elements of one vector register to another vector register.
slt v12, v7, 100 ; Set elements of v12 where v7[k] < 100 and v13, v7, v12 ; Clear (zero) elements that are >= 100 add v13, v13, v8 ; Add the masked vector to v8 and store in v13
Another side effect of the bit mask concept is that it works well with packed data types (see below). For instance it is possible to do packed compare-and-set operations on each byte of a word, and then branch if all, none or some of the packed bytes matched the condition.
Small data types are packed
This is the thing that I am the least satisfied with at the moment, since packed data is inherently cumbersome to work with, but it does have some advantages and at least it is semi-elegant.
In many instruction sets there are special mechanisms for working with data types that are smaller than the basic word size. For instance the 68000 had .B, .W and .L suffixes for treating each operand as 8, 16 or 32 bits, respectively. On the x86 you have special virtual registers to deal with sub-regions of a register (e.g. AX is the lowest 16 bits of EAX, which in turn is the lowest 32 bits of RAX). The RISC-V RV64I ISA has special instructions for doing 32-bit operations on 64-bit registers (e.g. ADDW, SRLW, etc).
In the MRISC32 ISA this kind of functionality is combined with increased parallelism by treating each 32-bit register as a container of packed smaller data types. To this end, many instructions support the following “packed” operation modes:
- Packed Bytes (.B): each 32-bit register/element is treated as four bytes (4 x 8 bits).
- Packed Half-Words (.H): each 32-bit register/element is treated as two half-words (2 x 16 bits).
- Unpacked: each 32-bit register/element is treated as a single 32-bit word.
Example: Perform packed unsigned byte operations: w = z * (x + y).
add.b s12, s1, s2 ; 4 packed byte additions mul.b s12, s12, s3 ; 4 packed byte multiplications
In most situations the packed operations can be used for single byte or half-word sized operations too, by simply ignoring the unused sub-elements.
As a bonus, packed floating point data types are supported too. “Packed Half-Word” translates into 2 x half-precision (16-bit floating point) and “Packed Byte” translates into 4 x quarter-precision (8-bit floating point).
Example: Perform packed half-precision floating point operations: w = z * (x + y).
fadd.h s12, s1, s2 ; 2 half-precision additions fmul.h s12, s12, s3 ; 2 half-precision multiplications
I’m still working on how to pack, unpack and convert between the different packed data types, but so far it seems to land in a reasonable set of instructions.
How about 64-bit support?
Simple answer: This is my first CPU design and I wanted to keep things simple, and I wanted to make sure that a CPU implementation would fit easily in a cheap FPGA. Thus I went for a 32-bit ISA.
On the other hand most of the concepts in the MRISC32 ISA are easily transferable to a corresponding 64-bit ISA (although some parts need to be redesigned).
One notable problem with the current MRISC32 ISA is that it does not support 64-bit floating point, at all. The reason is that it would complicate the design too much. A reasonable solution might be to work with even:odd register pairs, but it would likely increase the number of register file read and write ports and complicate operand forwarding logic (or similar), etc.
Instead, it is anticipated that if/when a 64-bit MRISC ISA is developed, it will naturally support 64-bit floating point (just as the Cray-1 did), and 32-bit floating point would be implemented as a packed data type.
The MRISC32 ISA is in constant development. Many pieces are still missing (interrupts, supervisor mode, memory management, cache management, multi-core support, etc, etc), but the parts that are there are starting to stabilize and are enough to run simple programs.
In the MRISC32 GitHub project you can find documentation, a CPU simulator (written in C++), a simple assembler (written in Python) and work-in-progress VHDL implementation of a pipelined MRISC32 CPU, called MRISC32-A1.
Here is a rough overview of the CPU pipeline:
As an example, the following (rather dull) image of a Mandelbrot fractal was generated by the MRISC32-A1 (VHDL simulation):
That’s about it for now. I’ll probably post more updates as I make more progress.