In 2008 the IEEE 754 standard was revised to include the fused multiply-add operation (FMA). The FMA operation computes rn(X × Y + Z) with only one rounding step. Without the FMA operation the result would have to be computed as rn(rn(X × Y) + Z), with two rounding steps: one for the multiply and one for the add. Because the FMA uses only a single rounding step, the result is computed more accurately.
Let's consider an example to illustrate how the FMA operation works, using decimal arithmetic first for clarity. Let's compute x² − 1 with four digits of precision after the decimal point, or a total of five digits of precision including the leading digit before the decimal point.
For x = 1.0008, the correct mathematical result is x² − 1 = 1.60064 × 10⁻³. The closest number using only four digits after the decimal point is 1.6006 × 10⁻³. In this case rn(x² − 1) = 1.6006 × 10⁻³, which corresponds to the fused multiply-add operation rn(x × x + (−1)). The alternative is to compute separate multiply and add steps. For the multiply, x² = 1.00160064, so rn(x²) = 1.0016. The final result is rn(rn(x²) − 1) = 1.6000 × 10⁻³.
Rounding the multiply and add separately yields a result that is off by 0.00064. The corresponding FMA computation is wrong by only 0.00004, and its result is closest to the correct mathematical answer. The results are summarized below:

x              = 1.0008
x²             = 1.00160064
x² − 1         = 1.60064 × 10⁻³   true value
rn(x² − 1)     = 1.6006 × 10⁻³    fused multiply-add
rn(x²)         = 1.0016
rn(rn(x²) − 1) = 1.6000 × 10⁻³    multiply, then add
Below is another example, using binary single precision values:

A = 2⁰  × 1.00000000000000000000001
B = −2⁰ × 1.00000000000000000000010
rn(A × A + B)     = 2⁻⁴⁶ × 1.00000000000000000000000
rn(rn(A × A) + B) = 0
In this particular case, computing rn(rn(A × A) + B) as an IEEE 754 multiply followed by an IEEE 754 add loses all bits of precision, and the computed result is 0. The alternative of computing the FMA rn(A × A + B) provides a result equal to the mathematical value, 2⁻⁴⁶. In general, the fused multiply-add operation generates more accurate results than computing one multiply followed by one add. The choice of whether or not to use the fused operation depends on whether the platform provides the operation and also on how the code is compiled.
Figure 1 shows CUDA C++ code and output corresponding to inputs A and B and operations from the example above. The code was executed on two different hardware platforms: an x86-class CPU using SSE in single precision, and an NVIDIA GPU with compute capability 2.0. At the time this paper was written (Spring 2011), there were no commercially available x86 CPUs offering hardware FMA. Because of this, the result computed in single precision with SSE is 0. NVIDIA GPUs with compute capability 2.0 do offer hardware FMA, so the result of executing this code is the more accurate one by default. However, both results are correct according to the IEEE 754 standard. The code fragment was compiled without any special intrinsics or compiler options for either platform.
The fused multiply-add helps avoid loss of precision during subtractive cancellation. Subtractive cancellation occurs during the addition of quantities of similar magnitude with opposite signs. In this case many of the leading bits cancel, leaving fewer meaningful bits of precision in the result. The fused multiply-add computes a double-width product during the multiplication. Thus even if subtractive cancellation occurs during the addition, there are still enough valid bits remaining in the product to produce an accurate result.
Figure 1. Multiply and Add Code Fragment and Output for x86 and NVIDIA Fermi GPU
union {
    float f;
    unsigned int i;
} a, b;
float r;

a.i = 0x3F800001;
b.i = 0xBF800002;
r = a.f * a.f + b.f;

printf("a %.8g\n", a.f);
printf("b %.8g\n", b.f);
printf("r %.8g\n", r);
x86-64 output:
a: 1.0000001
b: -1.0000002
r: 0
NVIDIA Fermi output:
a: 1.0000001
b: -1.0000002
r: 1.4210855e-14