The MRISC32 – A vector first CPU design

For some time now I have been working on my first CPU design as a hobby project. It started with a desire to learn more about CPU architecture design, and in particular instruction set architecture (ISA) design.

To be clear, I’m not a hardware or CPU designer by trade, but I have worked with performance critical software since the late 80’s, and I have done low level assembly code development on several different CPU architectures, including 8-bit CISC (6502), 16/32-bit CISC (68000) and 32-bit RISC (ARMv7). As far as possible I have stayed away from the abominable x86 ISA, but I have done a fair amount of SSE intrinsics coding in C/C++.

Background

The initial idea was to make a very simple RISC CPU similar to the MIPS architecture. Early on, however, I realized that I needed to come up with a solution for floating point and SIMD (single instruction multiple data) operations, so I looked at different contemporary solutions.

Being familiar with both x86 SSE/AVX and ARM NEON, I knew that they are hard to work with, set an upper limit to parallelism (due to a fixed SIMD register size), and are bolted on to an existing architecture as a fairly separate ISA (effectively creating different execution environment silos). Similarly almost every CPU architecture from the last four decades treat floating point as a separate extension (e.g. via the coprocessor interface on the MIPS). In short, I wasn’t convinced that the prevalent ISA:s of today give the right answer to how integer, floating point and SIMD instructions should work and interact with each other.

The turning point came when I learned more about the Cray-1 architecture from the mid 70’s. It was an epiphany! The Cray clearly had an architecture way ahead of its time, and in many respects the modern CPU industry is still trying to catch up with the ideas and inventions from that architecture:

Cray-1 computation section
Cray-1 computation section

If you look closer at the diagram above, you can see that the Cray had a few interesting features not found in most modern architectures:

  • A huge vector register file: Eight registers with 64×64 bits (4096 bits) in each register!
  • There are no floating point registers! Scalar and vector registers can hold both integer and floating point values.
  • Scalar and vector operands can be fed to the same execution units, and can even be mixed!
  • There is a special vector length (VL) register that controls the length of vector operations. Unlike most current SIMD ISA:s, there is no need to split vector loops into a “main” (vector) and a “tail” (scalar) loop!

This inspired me to go for an ISA that borrows a mix of ideas from Cray-1 (the vector parts and the mixed float/integer register file parts) and MIPS (the RISC parts and the use-register-content-as-branch-condition part).

The MRISC32

After many iterations (but with much more work still remaining), an ISA and an accompanying CPU design is starting to take shape:

MRISC32 – “Mostly harmless Reduced Instruction Set Computer – 32-bit edition”

Documentation, code and tools can be found at GitHub: mbitsnbites/mrisc32

So, what is it?

In essence, it’s a 32-bit RISC ISA designed with a holistic view on integer, floating point, scalar and vector operations. In addition there is a hardware implementation of a single issue, in order, pipelined CPU. The hardware implementation mostly serves as an aid in the design of the ISA (at the time of writing the CPU is still incomplete).

Some key features are:

  • 32-bit RISC instruction set.
  • 32 32-bit scalar registers.
  • 32 Nx32-bit vector registers (where N is typically 16 or more).
  • All registers can hold integer or floating point values.
  • Variable length vector instructions.
  • Almost all instructions can use either scalar or vector operands, or a mix of the two (including immediate scalar values).
  • Branches are based on register content (there are no condition code flags).

Some of the design choices/consequences are described below.

A quick guide to the assembly language examples:

  • Scalar registers are prefixed ‘s’.
  • Vector registers are prefixed ‘v’.
  • The leftmost operand is usually the destination operand, and the other operands are source operands.
  • Most instruction names should be self explanatory. If not, see Instructions.md.

Integer and floating point can be mixed

Since all registers can be used by both integer and floating point instructions, you can easily use integer instructions for dealing with floating point data.

Example: Load a floating point value from memory and compute x = 2.0 * |x|.

    ldi   s12, 0x7fffffff   ; s12 = sign mask
    ldw   s13, s7, 0        ; Load a 32-bit word from addr (s7 + 0)
    and   s13, s13, s12     ; Mask off the sign bit from the word
    fadd  s13, s13, s13     ; Calculate x = x + x (floating point)
    stw   s13, s7, 0        ; Store the result back to memory

Vector and scalar operations can be mixed

Virtually all scalar instructions have identical vector instruction counterparts (the notable exception is branch instructions, which only exist as scalar operations).

Example: Add integer values.

    add   s12, s1, s2       ; Scalar operation: s12 = s1 + s2
    add   v12, v1, v2       ; Vector operation: v12 = v1 + v2

Most vector operations can use one scalar operand (even immediate values).

Example: Multiply all vector elements by 10 and right shift by 2 bits

    ldi   s12, 10           ; Scalar operation: s12 = 10
    mul   v12, v1, s12      ; Vector operation: v12 = v1 * 10
    asr   v12, v12, 2       ; Vector operation: v12 = v12 >> 2

Vector operations are variable length

The length of each vector operation (i.e. the number of vector elements to process) is determined by the scalar register VL.

Example: Perform a vector addition for three elements.

    ldi   vl, 3        ; Set the vector length to 3
    add   v12, v1, v2  ; v12 = v1 + v2 (first three elements)

The maximum vector length (i.e. the size of one vector register) is implementation dependent, and can be queried using the CPUID instruction. In a loop this can be conveniently combined with the MIN instruction to decide how many elements to process in each iteration.

Example: Process as many elements as possible per loop iteration.

    ; s1: Data array start address (32-bit floating point elements)
    ; s2: Number of data elements (at least 1)
    cpuid s8, z, z     ; s8 = the maximum vector length
    lsl   s9, s8, 2    ; s9 = memory increment per iteration (4*s8)
  .loop:
    min   vl, s8, s2   ; Vector length = min(max_vl, elements_left)
    sub   s2, s2, vl   ; Decrement the loop counter
    ldw   v8, s1, 4    ; Load elements into v8 (stride: 4 bytes/element)
    fmul  v8, v8, v8   ; Calculate the floating point square (x = x^2)
    stw   v8, s1, 4    ; Store the elements back to memory
    add   s1, s1, s9   ; Increment the memory pointer
    bgt   s2, .loop    ; Loop if there are any elements left

Note that:

  • This code will utilize maximum vector parallelism, regardless of the size of the vector registers of the CPU.
  • No special tail loop is required to handle non vector sized operations.

Branches and conditions

One thing that sticks out is the compare & branch model. Unlike most architectures (x86, ARM, PowerPC, 68k, etc), but like MIPS and Alpha, the MRISC32 does not use condition code flags for keeping track of branch conditions. Instead any scalar register can be used as a branch condition.

Example: Branch as long as a register value is > 0.

    ldi  s12, 1234      ; Set the loop counter
  loop:
    add  s12, s12, -1   ; Decrement the loop counter
    ...
    bgt  s12, loop      ; Continue loop if the loop counter > 0

Example: Compare floating point values and branch.

    fslt s12, s7, s8           ; Set s12 if s7 < s8
    bns  s12, .not_less_than   ; Branch if s12 is not set
    ...
  .not_less_than:

One feature of the “compare and set” operations is that they produce a bit mask. I.e. “set” means all bits in the register are set to 1, whereas the opposite is that all bits are cleared to 0. This is a concept that has been borrowed from SIMD ISA:s such as NEON and SSE. This makes the standard compare instructions useful for masking of vector registers too.

Example: Conditionally add elements of one vector register to another vector register.

    slt  v12, v7, 100    ; Set elements of v12 where v7[k] < 100
    and  v13, v7, v12    ; Clear (zero) elements that are >= 100
    add  v13, v13, v8    ; Add the masked vector to v8 and store in v13

Another side effect of the bit mask concept is that it works well with packed data types (see below). For instance it is possible to do packed compare-and-set operations on each byte of a word, and then branch if all, none or some of the packed bytes matched the condition.

Small data types are packed

This is the thing that I am the least satisfied with at the moment, since packed data is inherently cumbersome to work with, but it does have some advantages and at least it is semi-elegant.

In many instruction sets there are special mechanisms for working with data types that are smaller than the basic word size. For instance the 68000 had .B, .W and .L suffixes for treating each operand as 8, 16 or 32 bits, respectively. On the x86 you have special virtual registers to deal with sub-regions of a register (e.g. AX is the lowest 16 bits of EAX, which in turn is the lowest 32 bits of RAX). The RISC-V RV64I ISA has special instructions for doing 32-bit operations on 64-bit registers (e.g. ADDW, SRLW, etc).

In the MRISC32 ISA this kind of functionality is combined with increased parallelism by treating each 32-bit register as a container of packed smaller data types. To this end, many instructions support the following “packed” operation modes:

  • Packed Bytes (.B): each 32-bit register/element is treated as four bytes (4 x 8 bits).
  • Packed Half-Words (.H): each 32-bit register/element is treated as two half-words (2 x 16 bits).
  • Unpacked: each 32-bit register/element is treated as a single 32-bit word.

Example: Perform packed unsigned byte operations: w = z * (x + y).

    add.b s12, s1, s2    ; 4 packed byte additions
    mul.b s12, s12, s3   ; 4 packed byte multiplications

In most situations the packed operations can be used for single byte or half-word sized operations too, by simply ignoring the unused sub-elements.

As a bonus, packed floating point data types are supported too. “Packed Half-Word” translates into 2 x half-precision (16-bit floating point) and “Packed Byte” translates into 4 x quarter-precision (8-bit floating point).

Example: Perform packed half-precision floating point operations: w = z * (x + y).

    fadd.h s12, s1, s2   ; 2 half-precision additions
    fmul.h s12, s12, s3  ; 2 half-precision multiplications

I’m still working on how to pack, unpack and convert between the different packed data types, but so far it seems to land in a reasonable set of instructions.

How about 64-bit support?

Simple answer: This is my first CPU design and I wanted to keep things simple, and I wanted to make sure that a CPU implementation would fit easily in a cheap FPGA. Thus I went for a 32-bit ISA.

On the other hand most of the concepts in the MRISC32 ISA are easily transferable to a corresponding 64-bit ISA (although some parts need to be redesigned).

One notable problem with the current MRISC32 ISA is that it does not support 64-bit floating point, at all. The reason is that it would complicate the design too much. A reasonable solution might be to work with even:odd register pairs, but it would likely increase the number of register file read and write ports and complicate operand forwarding logic (or similar), etc.

Instead, it is anticipated that if/when a 64-bit MRISC ISA is developed, it will naturally support 64-bit floating point (just as the Cray-1 did), and 32-bit floating point would be implemented as a packed data type.

Status

The MRISC32 ISA is in constant development. Many pieces are still missing (interrupts, supervisor mode, memory management, cache management, multi-core support, etc, etc), but the parts that are there are starting to stabilize and are enough to run simple programs.

In the MRISC32 GitHub project you can find documentation, a CPU simulator (written in C++), a simple assembler (written in Python) and work-in-progress VHDL implementation of a pipelined MRISC32 CPU, called MRISC32-A1.

Here is a rough overview of the CPU pipeline:

MRISC32-A1 pipeline
MRISC32-A1 pipeline

As an example, the following (rather dull) image of a Mandelbrot fractal was generated by the MRISC32-A1 (VHDL simulation):

MRISC32 Mandelbrot
MRISC32 Mandelbrot

That’s about it for now. I’ll probably post more updates as I make more progress.

29 thoughts on “The MRISC32 – A vector first CPU design

  1. This is great! We need to *move forward* in computing and I love to see new ISAs and microarchitectures.

    Question: When registers are able to hold both integers and floats, is there a performance or complexity cost? I don’t know much about the ground level engineering and physics here, but I assume floating point requires computational structures that integers don’t. Is that correct?

    Random thoughts:

    1. I think string processing is underappreciated, and I’d love to see instructions for string comparison like SSE 4.2 in the x86-64 ISA. Would these sorts of instructions break the RISC-ness of MRISC32?

    Strings are treated as integers (or arrays of integers), and I wonder if it would pay off to treat strings as a basic type, either having string registers or registers that could hold strings, integers, and floats, with the recognition that strings ≠ integers. We’d know that various integer operations could not be performed on strings, and I wonder if that could give a performance advantage.

    2. I wish we had UI primitives in our instruction sets. It’s extremely unclear to me how we get from these UI-unaware ISAs to a screen full of visually rich applications. I know GPUs have something to do with it, but most of the work is still actually coming from the CPU. I wonder what it would mean to have UI primitives in an ISA, what that would look like.

    1. Marcus Geelnard

      There’s no inherent cost of storing floating point in the same registers as integers (that’s how SSE/AVX/NEON/etc support both floating point and integer operations, for instance). On the contrary, I have found it to be simpler in a regular scalar architecture (which the MRISC32-A1 is). The registers don’t really care what data is stored in them, and the execution units don’t really care where it’s operands are coming from.

      The situation may be different in a superscalar architecture (where register file traffic may be a limiting factor), but I have not come that far yet.

      1. Christian

        I think historically, FPUs were physically separate from the IU. Most of the current architectures were defined in the early to mid 1980’s, or late 70’s for 16-bit x86. So from a practical point of view, it made sense to have separate register files, as the FPU was a separate chip.

        The Motorola 88000 (successor to the 68k) defined a combined IU/FP register file, though many consider this a mistake. Each read/wrote port added to the register file (required for super scaler access) adds (electrical) load and complexity to the register file, though I don’t know the details as hardware isn’t my thing.

        Still, I personally like the idea of a combined IU/FP register file, it has a nice orthogonality, and the implementation details are probably not even that much of an issue anymore.

  2. Interesting. I am working on a similar instruction set with variable vector length. My ISA is called ForwardCom. See http://www.forwardcom.info
    I think that a global VL register is a problem for superscalar processors when different vectors have different lengths. During a task switch you have to save all vector registers with their maximum length even if they contain nothing. My approach is to give each vector register a tag with its length. In a register operation like v1 = v2 + v3, the destination will get the same vector length as the first source operand. With memory operands you have to specify the length explicitly using an extra scalar register operand to indicate the length.
    My ForwardCom instruction set has many other interesting features. There is a discussion forum at http://www.forwardcom.info/forum

    1. Marcus Geelnard

      I’ll look into the ForwardCom ISA more closely (I’ve stumbled across it before, and it sounds very interesting). Actually, my plan is to add a vector length tag to each register (it’s hidden in the GitHub docs somewhere). As you suggest, writing to a vector register updates the register length (RL) to the vector operation length (VL), and if you read elements beyond the register length you get zeros. So the following code would effectively do a quick clear of a vector register:

      ldi vl, 0
      or v7, vz, vz

      I have still not figured out how to efficiently deal with subroutine calls and interrupts, but the idea is that you should only need to store RL elements on the stack.

      1. Riz

        Many modern CPUs have more unlisted cores than one would think. When in doubt, make it asynchronous.

        Although chances are, the most efficient way to deal with interrupts is located in a patent.

        1. Marcus Geelnard

          Since I like to think big I’d like to find a solution that allows for zero overhead regardless of the vector size (e.g. support 1024 vector elements, which equals 128 KB memory for the full register file -> not pretty for a regular stack push/pop).

          One train of thought I’ve had leads to some sort of switchable set of register files (e.g. 4-16 copies), and then add a dedicated memory/bus/cache that a vector register file switcheroo-manager can use to asynchronously push/pop the LRU register file. Or similar… Not sure how it would actually work, but I think that something similar is required to transparently support low overhead task switching.

          1. Peter Rundberg

            The “switchable set of register files” sounds a lot like the register windows used with the Sun SPARC architecture. If I am not mistaken, some people have attributed the difficulty of scaling the performance of the SPARC to match the competition to its use of register windows. May be worth thinking about.

          2. Marcus Geelnard

            Yes you risk eventually running into slow casing every thread switch when you run out of hw register sets. I think that the solution would have to be more like a cache to make it work well. For instance each thread could have a unique “register set id” (say, a 16-32 bit id), and in supervisor mode you can load/store a specific register set, which transparently swaps in data from a small local dedicated cache. The cache would asynchronously sync the cache with ram, using prediction techniques to have the most relevant register sets cached.

  3. DJ DefE D

    The Hitachi SH-4, the CPU used in the Dreamcast, had both vector and matrix operations – not nearly as well implemented as you’re planning, but it might be worth looking at if you haven’t already.

  4. Aaron Peacock

    Ignorant programmer guy here: would such an architecture be suitable for processing audio streams/buffers?

    1. Marcus Geelnard

      Yes. It’s one of the best applications for this architecture, actually. It even has fixed point Q15 support, which is very common in audio DSP applications.

  5. Mentoles

    How did you very started in this space? What kind of documentation did you use or resources did you read?

    1. Marcus Geelnard

      First off, I have programmed assembly language on a few CPU architectures, I took a course in microprocessor design at the university (decades ago), and I programmed VHDL/FPGA as part of my master thesis.

      For this particular project I studied a lot of different CPU architectures. The best architectures to study, IMO, are from the 60’s, 70’s and 80’s (you’ll find a greater diversity, and smaller and easier to grasp architectures). The RISC-V is also very interesting and well though through.

      Google, find articles, blogs, old papers and manuals.

  6. UX-admin

    I hate the fact that you chose “dst, src” rather than “src, dst”.

    In real life you’d tell someone to take something (src) and put it somewhere (dst), why make this illogical and unintuitive in assembler? Change the assembler before it’s too late!

    You say you coded on MC68000 and that CPU has move.l src, dst. How could you do this, what posessed you to implement the retarded intel dst, src?!?

    1. gus3

      RISC (e.g ARM and MIPS) uses a variable number of source operands: one for move, two for math ops (usually). When a math op is in the for “a = a + b” (or “a += b”), the internal form is ADD R0, R0, R1, but the human-readable form can be condensed to ADD R0, R1, with R0 being assumed the destination and a source.

      CISC (x86 and M68K) have no such: an ADD instruction always treats the destination register as a source register.

    2. Marcus Geelnard

      Ha ha! As you noticed I’m a 68k veteran, and for many years I hated assembler syntax with the destination operand to the left (I also hated little endian). However, with age comes wisdom I guess, and now I just go for the model that is most common since that is less likely to confuse people. And as others have said, thinking in math terms makes it more logical.

    3. Sarreq Teryx

      I’m not all that experienced with assembly, but my assumption was that having the dst first allowed you to trail with as many srcs as you want without special notation for the destination. so:
      add dst1, src1, src2, src3, src4, src5, etc…
      so: dst1=src1+src2+src3+src4+src5+…
      am I wrong?

      1. Marcus Geelnard

        If you have a look at the instruction format at the top of mr32asm.py (each instruction is 32 bits), you’ll see that there is room for up to three operands (5 bits each, i.e. register 0-31). The orders in the instruction word and in the assembly syntax do not have to be the same, so it’s purely a syntactical decision. Since the convention for most 3-operand architectures (e.g. ARM) is to use “INSTRUCTION DST, SRC1, SRC2”, I used that aswell.

        1. Sykobee

          You could be funky – and this will look nice in the text editor with modern ligature programmer fonts like FiraCode…

          ‘Load v1 <- v0 + 5' is more readable than 'LD v1, v0, 5' for example.

          as you say, it's just syntactic sugar on top of the real ISA, so no need to worry about funky short mnemonics and the like.

          Add.pb v4 v4
          Multiply.ph s0 <- s0, 5

          1. Sykobee

            odd, the comment got munged up… ah, the pointy brackets aren’t escaped by the commenting software.

            ‘Add.pb v4 <- v1, 8’ == ‘PBADD v4,v1,8’ == ‘Add.pb v1, 8 -> v4’

            could all be accepted by the assembler.

  7. gus3

    You’re familiar with ARM NEON, which is somewhat compatible with the obsolete Vector Floating Point (VFP) extension. The VFP was available on the original Raspberry Pi’s BCM2835; I explored VFP programming here:

    Part 1: Introduction
    https://mindplusplus.wordpress.com/2013/06/25/arm-vfp-vector-programming-part-1-introduction/

    Part 2: Examples
    https://mindplusplus.wordpress.com/2013/06/27/arm-vfp-vector-programming-part-2-examples/

    The Raspberry Pi 2 uses NEON by default, but supports 4 variants of VFP according to /proc/cpuinfo: vfp, vfpv3, vfpv4, and vfpd32. I have run vector-enabled code on my RPi2, and it generates correct results with no instruction exceptions.

    1. Marcus Geelnard

      Cool! I always wondered what the V in VFP meant. Nice explanations.

  8. pax

    math is both logical and intuitive, and is used in real life:
    A (dst) = B (src)

  9. Sarreq Teryx

    probably a silly question, how about Packed Nibbles (4 bits)? I know there are cases where calculating tiny numbers super fast is desirable, so for flexibility’s sake, why not?

    Packed Nibble (PN/F): each 32-bit register/element is treated as eight nibbles (8 x 4 bits). / eighth precision
    Packed Bytes (PB/F): each 32-bit register/element is treated as four bytes (4 x 8 bits). / quarter precision
    Packed Half-Words (PH/F): each 32-bit register/element is treated as two half-words (2 x 16 bits). / Half precision
    Unpacked: each 32-bit register/element is treated as a single 32-bit word.

    1. Marcus Geelnard

      I essentially have 2 bits (i.e. 4 combinations) in the instruction word to describe the packed mode. I decided to go easy and use only three of the combinations (leaving one for future use). It could get used for nibbles, but I doubt that it would be worth it. First of all there is no native 4-bit data type in popular languages (e.g. C), so no compiler would generate code for it. Second it might introduce undesirable hardware overhead to have that many sub-parts of a word.

      On the other hand I was bold and let packed bytes be 8-bit floating point for floating point instructions. In terms of floating point, they are truly tiny numbers, but fast 😀 E.g. Microsoft uses something called “ms-fp8” for neural networks.

      1. You could implement floating point numbers as posits for better range (e.g. 8 bit floats).

        1. Marcus Geelnard

          Yes I considered that, but I’d rather have IEEE 754 first. Posits could be added later.

Leave a Reply

Your email address will not be published. Required fields are marked *