I’ve been developing software for over 30 years, and during those years I have frequently come across problems relating to a few features of the IEEE 754 floating point standard, namely:

- Denormalized numbers
- Configurable rounding modes

Recently I have also developed a custom CPU (MRISC32), and when doing the hardware implementation of the floating point unit I simply ignored those (and some other) IEEE 754 features, because the hardware costs outweigh the functional benefits.

Not only do these features provide little value to software developers, they actually present a number of problems (both for software and hardware developers).

UPDATE (2020-01-11): A more detailed specification can be found here: LeanFloat (GitLab).

## Problems with denormals

Denormalized numbers (denormals for short) are an encoding trick that extends the exponent range of floating point numbers slightly, making it possible to represent smaller numbers than is possible with regular normalized numbers. In addition we get “gradual underflow”, which gives a few algorithmic guarantees and simplifies things in some edge cases.

For instance, the smallest possible normalized binary64 number is roughly 2.2 × 10^{−308}, while the smallest possible denormalized number is roughly 5 × 10^{−324}.
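To make the encoding difference concrete, here is a minimal Python sketch that splits a binary64 value into its bit fields. The helper name `decode_binary64` is made up for this illustration, and the sketch assumes that Python's `float` is an IEEE 754 binary64 (true on all common platforms).

```python
import struct
import sys

def decode_binary64(x: float):
    """Split a binary64 value into (sign, biased exponent, fraction) bit fields."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

# Smallest normalized number: 2**-1022, exponent field 1, implicit leading 1.
min_normal = sys.float_info.min
# Smallest denormal: 2**-1074, exponent field 0, no implicit leading 1.
min_denormal = min_normal * sys.float_info.epsilon

print(min_normal, decode_binary64(min_normal))
print(min_denormal, decode_binary64(min_denormal))

# Gradual underflow: the difference of two distinct tiny numbers is nonzero.
x = 1.5 * min_normal
y = min_normal
diff = x - y
print(diff, decode_binary64(diff))
```

An exponent field of zero marks the denormal encoding. The last lines show the kind of guarantee that gradual underflow provides: `x - y` is nonzero whenever `x != y`, which would not hold if the denormal result were flushed to zero.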

This all sounds great, but the problem is that since denormals use a different encoding format, additional logic is required for handling them (decoding, rounding and encoding). In practice this means that hardware developers have two solutions to choose from:

- Include extra logic in the hardware for handling denormals, thus making the floating point pipeline longer and lowering the performance for all floating point operations.
- Assume that denormals are very uncommon and allow them to run slower than normalized numbers, e.g. by handling them in software via interrupts.

All CPUs that I have come in contact with use the latter method. The implementation details differ between CPU architectures, but in general you can expect operations on denormals to be about **one or two orders of magnitude slower** (or more) than operations on normalized numbers.

Thus, if you are a software developer and you care about performance, you need to make sure that your floating point calculations do not involve denormalized numbers. This can be a real headache since it is very hard to predict when and where denormals may occur, and adding logic in your software to detect and prevent denormals further reduces the performance of your program.
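As a hedged illustration of how easily denormals sneak in, consider a value that decays through repeated scaling (for example a feedback term that fades towards zero). The sketch below records the steps at which such a value is denormal. The names `is_denormal` and `denormal_steps` are hypothetical helpers, not part of any library.

```python
import sys

MIN_NORMAL = sys.float_info.min  # 2**-1022 for binary64

def is_denormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than the smallest normal."""
    return 0.0 < abs(x) < MIN_NORMAL

def denormal_steps(start: float, gain: float, max_steps: int = 5000):
    """Repeatedly scale a value and record the steps at which it is denormal."""
    steps = []
    y = start
    for k in range(1, max_steps + 1):
        y *= gain
        if is_denormal(y):
            steps.append(k)
        elif y == 0.0:
            break  # the value has underflowed completely
    return steps

steps = denormal_steps(1.0, 0.5)
print(f"first denormal at step {steps[0]}, {len(steps)} denormal steps in total")
```

Nothing in the loop looks suspicious, yet every iteration in the denormal range may hit the slow path on hardware that handles denormals via traps or microcode. Note that the `is_denormal` check itself costs a comparison per value, which is the overhead mentioned above.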

One solution is to configure the floating point unit to disable the support for denormalized numbers (effectively treat them as zero), but unfortunately this is non-trivial.

One problem with this approach is that it is not standardized: it relies on hardware-specific configuration functionality, so you need different low-level routines (sometimes assembly code snippets) for different CPU architectures and platforms. In addition, the setting affects **all** floating point operations of the software process. This can be quite problematic if, for instance, you are implementing a software library that is to be used in different programs.

## Problems with configurable rounding modes

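The IEEE 754 floating point standard mandates that you (as a software developer) must be able to select which rounding mode you prefer. The standard defines five different rounding modes (two variants of round-to-nearest, and three directed modes); binary floating point implementations are required to provide four of them, since round-to-nearest with ties away from zero is only mandatory for decimal formats.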
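For readers who have not met the five modes, the sketch below shows how they differ on a few sample values. It is only an illustration: Python has no portable API for switching the rounding mode of binary floats, so the `decimal` module is used as a stand-in, and the mapping from IEEE 754 names to `decimal` constants is my own pairing.

```python
from decimal import (
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_EVEN,
    ROUND_HALF_UP,
)

# IEEE 754 rounding attribute -> closest decimal-module rounding constant.
MODES = {
    "roundTiesToEven": ROUND_HALF_EVEN,
    "roundTiesToAway": ROUND_HALF_UP,
    "roundTowardPositive": ROUND_CEILING,
    "roundTowardNegative": ROUND_FLOOR,
    "roundTowardZero": ROUND_DOWN,
}

def round_all_modes(text: str):
    """Round a decimal string to an integer under each of the five modes."""
    value = Decimal(text)
    return {name: int(value.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}

for sample in ("2.5", "-2.5", "2.25"):
    print(sample, round_all_modes(sample))
```

The two round-to-nearest variants only disagree on exact ties such as 2.5, while the directed modes differ in which neighbor they pick for every inexact value.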

In my entire career I have *never* found a need for utilizing this functionality for solving a mathematical problem or for improving the precision of any calculations.

On the other hand I have spent weeks and months debugging and fixing problems related to inconsistencies in rounding mode configurations. A particularly tricky bug was caused by a certain antivirus software that occasionally re-configured the floating point rounding mode for our software process.
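One defensive measure against this class of bug is to probe the active rounding mode from the results of a few additions, without touching any platform-specific control register. The following is a rough sketch under the assumption that Python floats are binary64 and that the interpreter performs the additions with the process-wide hardware rounding mode. The function name `probe_rounding_mode` is hypothetical.

```python
def probe_rounding_mode() -> str:
    """Guess the active binary64 rounding mode from the results of three sums."""
    one = 1.0
    tiny = 2.0 ** -60  # far below half an ulp of 1.0, so it only nudges the rounding

    if one + tiny > one:
        return "toward positive"   # inexact sum was rounded up
    if -one - tiny < -one:
        return "toward negative"   # inexact sum was rounded down
    if one - tiny < one:
        return "toward zero"       # magnitude was truncated
    return "to nearest"

mode = probe_rounding_mode()
print("rounding mode:", mode)
```

The probe cannot tell the two round-to-nearest variants apart, but it does catch a process that has been switched to one of the directed modes. A library could run such a check at its entry points and report a mismatch instead of silently producing slightly different results.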

From a hardware point of view, supporting different rounding modes is less costly than supporting denormalized numbers. Still, the required hardware is non-negligible: with a single, fixed rounding mode, floating point operations could be made faster (if ever so slightly), and the hardware would be simpler and easier to test.

Additionally the control logic, usually in the form of a floating point control register, adds complexity (e.g. you need to make sure that the correct configuration is used for each floating point operation, which has implications in a pipelined, parallel architecture).

## Consequences

As a result of the above-mentioned problems, several hardware floating point implementations are simplified (e.g. do not support denormals) and as such are not fully IEEE 754 compliant (or they do not support floating point at all).

Examples include GPUs, DSPs, AI accelerators, SIMD instruction sets and CPUs for embedded applications.

Another consequence is the increased difficulty of writing portable software, since the rounding and underflow behavior may differ between platforms. This can be due to hardware implementation differences, because the FPU configuration is hard to control in a consistent manner, or simply because most software developers are not aware of these subtle effects.

## Suggestion: A “core” subset

I think that it would be great if the IEEE 754 standard could be split into a lightweight “core” subset, and a “full” feature set (the latter being more or less identical to the current standard).

Compared to the full version, the core subset would have the following properties:

- No denormalized numbers. All denormalized numbers are treated as zero, and any result that would be a denormalized number will be turned into zero (a.k.a. “flush-to-zero”).
- Only a single rounding mode: Round to nearest, ties to even (the default rounding mode in IEEE 754).
- No exceptions. Any operation that would throw an exception will silently produce a well-defined result (e.g. sqrt(-1) will return NaN).
- All NaNs are treated as quiet NaNs. There are no signalling NaNs.

The last two bullet points aim to simplify hardware implementations and optimize for parallelized floating point execution (SIMD, out-of-order, etc). For some solid arguments against fault trapping in floating point hardware, see Agner Fog’s “NAN propagation versus fault trapping in floating point code”.
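The properties listed above can be emulated in software to get a feel for the semantics. The sketch below is an assumption-laden illustration, not a reference implementation of the proposed subset: `flush`, `core_mul` and `core_sqrt` are hypothetical names, and the code relies on Python floats being binary64 with the default round-to-nearest, ties-to-even mode.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest normalized binary64 value

def flush(x: float) -> float:
    """Replace a denormal by a zero of the same sign; pass everything else through."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

def core_mul(a: float, b: float) -> float:
    """Multiply with denormal inputs and results flushed to zero."""
    return flush(flush(a) * flush(b))

def core_sqrt(x: float) -> float:
    """Square root that returns a quiet NaN instead of raising for negative input."""
    x = flush(x)
    if math.isnan(x) or x < 0.0:
        return math.nan
    return math.sqrt(x)

print(core_mul(MIN_NORMAL, 0.5))        # would be denormal, so it is flushed
print(core_sqrt(-1.0))                  # no exception, just NaN
print(core_mul(core_sqrt(-1.0), 2.0))   # the NaN propagates quietly
```

Python's own `math.sqrt(-1.0)` raises a `ValueError`, which is the kind of control flow interruption that the "no exceptions" rule avoids: the invalid result simply travels through the rest of the calculation as a NaN and can be checked once at the end.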

The core subset would acknowledge the current state of diverse uses of floating point in modern hardware (such as for graphics and AI), and improve software portability between different platforms. Additionally it would lower the barrier for adding standards compliant floating point functionality to new hardware.

Hear, hear. I agree.

I have spent a lot of time implementing my ‘fictional’ multiplier, which handles both normal and subnormal numbers. Without subnormals it would have taken me a quarter of the time.

In fact I am still puzzling over where to inject the rounding bits for a subnormal number (leading zero count of the subnormal number + 1?).

> In my entire career I have never found a need for utilizing this functionality for solving a mathematical problem or for improving the precision of any calculations.

I used floor() in a voxel game before casting to int. As I remember, the default rounding during the cast was either round to zero or round to nearest. Maybe with some tricks I could have used the default one, but meh.

NaNs are funny: A != A, and (A > B) != !(B > A). AFAIK IEEE recommends that min and max return the non-NaN operand.