Clemson University
CPSC 464/664 Lecture Notes
Fall 2002
Mark Smotherman


Integer and Floating Point Arithmetic

  1. Integer Arithmetic

    1. ripple carry adder => carry must ripple through all n bit positions, roughly 2n gate delays, so fast arithmetic needs carry prediction (carry lookahead)

    2. unsigned multiplication by 3-register data path, add and shift logic

    3. unsigned division by 3-register data path, shift and subtract logic (need extra bit in registers and in adder path; restoring and non-restoring algorithms)

    4. sign notation is 2's complement (one zero, easier addition)

    5. signed multiplication by Booth recoding

    6. signed division typically by first converting operands to unsigned

    7. system issues
      1. multiple-cycle integer operations - e.g., multiplication
        1. single-step instruction
          1. library routines to call (or in-line) to obtain general case
          2. optimizing compiler generates short shift and add sequences for special cases (but note that integer overflow may not be correctly detected by shifts)
        2. alternative is use of special function unit or FPU (but integer unit must stall until multiple-cycle operation is complete, or processor must support out-of-order completion)
      2. double-length operands => register pairs, or separate instructions to generate high and low
      3. signed integer overflow
        1. carry into sign bit differs from carry out of sign bit
        2. set condition code bit, trap to exception handler, or do nothing
        3. execution mode bit or separate instructions to provide different options
      4. multiple precision needs carry-out of previous operation as a third operand

  2. Floating Point Arithmetic

    1. floating point is an approximation of real numbers
      1. finite representation
        1. => results of certain calculations will be wrong!
        2. => results of most calculations will be approximate! (e.g., 0.1)
        3. => arithmetic laws may not hold! (e.g., x + (y+z) != (x+y) + z)
      2. quasi-logarithmic format extends range while easing addition
        1. (-1)^sign * 1.fraction * (base)^exp
        2. range determined by number of digits in exponent (i.e., scaling factor)
        3. precision determined by number of digits in fraction
      3. non-unique representations => choose normal form
        1. minimum exponent preserves maximum precision
        2. for base of 2, leading 1 can be implicit => extra precision
      4. seven regions:
        1. negative overflow
        2. expressible negative numbers
        3. negative underflow
        4. zero
        5. positive underflow
        6. expressible positive numbers
        7. positive overflow
      5. geometric spacing within expressible regions
      6. result of operation typically not exactly representable => choose nearest (rounding or truncation)

    2. IEEE standard has replaced vendor-specific formats - IBM S/360, DEC VAX, Cray, ...
      1. semantics varied - problems encountered included:
        1. x + x != 2.0*x
        2. (x-y) + y = y
        3. x*y != y*x
        4. x > y but 1.0*x < y
      2. portability problems

    3. IEEE standard 754
      1. specifies 4 precisions, 4 rounding modes, special numbers, and exceptions
      2. single precision
        1. 8-bit exponent field, bias of 127 (field values 1 to 254 for normalized numbers, giving exponents -126 to 127)
        2. 1-bit sign and 23-bit fraction with hidden bit to left of binary point
        3. exp_field = 0, fraction = 0 => zero
        4. exp_field = 0, fraction != 0 => denormal number (gradual underflow)
        5. exp_field = 255, fraction = 0 => infinity
        6. exp_field = 255, fraction != 0 => NaN
      3. special values allow continued execution even under exceptional conditions (the user program can check for these special values without dealing with defining operating system trap handlers)
      4. can use denormals rather than flush to zero to deal with small numbers (e.g., never allows x != y but x-y = 0) but complicates hardware implementation
      5. rounding to even (default mode)
        1. guard bit - used for rounding and prevents loss of significance during effective subtraction (i.e., when post-normalization requires a left shift)
        2. round bit
        3. sticky bit - keeps record of any 1's in bits shifted to right

    4. addition/subtraction
      1. take difference of exponents
      2. equalize exponents by shifting operand with smaller exponent to right, setting sticky bit
      3. add using carry position on left and guard and round positions on right
      4. post-normalize if carry set or if leading bits zero, adjusting exponent
      5. round

    5. multiplication
      1. add exponent fields, and subtract bias
      2. multiply to obtain double-length product and round to single-length

    6. division
      1. divide hardware for nonrestoring division
      2. iterative schemes (number of correct bits doubles at each step)
        1. Newton's iteration
        2. Goldschmidt's algorithm
      3. SRT

    7. multiply-accumulate (fused multiply add)
      1. useful in dense linear algebra and DSP
      2. only option for multiplication on EDSAC (1949)
      3. IBM Stretch (1961) was an accumulator-based architecture but also included a factor register for use in multiply-and-add; vector inner product required only four instructions:
            LP:  LOAD FACTOR NORMALIZED, 0(X4)
                 MULTIPLY AND ADD NORMALIZED, 0(X5)
                 ADD IMMEDIATE TO VALUE, X5, COLUMN LENGTH
                 COUNT BRANCH AND REFILL PLUS, X4, LP
                 
      4. DSP microprocessors
      5. high-end workstation architectures - IBM RS/6000, MIPS R10000 (but note that intermediate product rounded in R10000 but not in RS/6000)
      6. Itanium

    8. exceptions - set bits in FPSR or invoke user trap handlers
      1. underflow and overflow
      2. divide by zero
      3. inexact - when result must be rounded or when it overflows
      4. invalid - e.g., sqrt(-1), 0/0, (+inf) - (+inf)

  3. Floating-point examples
    1. summation of ten million fractions
          #include <stdio.h>

          int main(void){
              int i;
              float sum;
              float c,y,t;
        
              sum = 0.0;
              for(i=1; i<=10000000; i++){
                  sum = sum + 1.0/((float)i);
              }
              printf("decreasing order: %f\n",sum);
        
              sum = 0.0;
              for(i=10000000; i>0; i--){
                  sum = sum + 1.0/((float)i);
              }
              printf("increasing order: %f\n",sum);
        
              /* sum formula suggested by W. Kahan */
              sum = 1.0/((float)1);
              c = 0.0;
              for(i=2; i<=10000000; i++){
                  y = 1.0/((float)i) - c;
                  t = sum + y;
                  c = (t - sum) - y;
                  sum = t;
              }
              printf("kahan summation1: %f\n",sum);
        
              sum = 1.0/((float)10000000);
              c = 0.0;
              for(i=9999999; i>0; i--){
                  y = 1.0/((float)i) - c;
                  t = sum + y;
                  c = (t - sum) - y;
                  sum = t;
              }
              printf("kahan summation2: %f\n",sum);
              return 0;
          }
      
          /* output:
      
          float:
      
          decreasing order: 15.403683
          increasing order: 16.686031
          kahan summation1: 16.695311
          kahan summation2: 16.695311
      
          double:
      
          decreasing order: 16.695311
          increasing order: 16.695311
          kahan summation1: 16.695311
          kahan summation2: 16.695311
      
          */
          
    2. representation and arithmetic errors
          #include <stdio.h>
          #define FP_TYPE float

          int main(void)
          {
              FP_TYPE eps,epsp1,small,lastsmall,x,y,h,a,b,c,d,q;
              int i;
      
      
            /* find smallest eps such that eps + one is not equal to one */
              eps=1.0;
              epsp1=eps+1.0;
              while(epsp1>1.0){
                  eps/=2.0;
                  epsp1=eps+1.0;
              }
              printf("part 1: eps=%23.16e\n",2.0*eps);
      
      
              /* find smallest non-zero number */
              small=1.0;
              while(small>0.0){
                  lastsmall=small;
                  small/=2.0;
              }
              printf("part 2: small=%23.16e\n",lastsmall);
      
      
              /* find the error in 0.1 added ten times */
              x=0.0;
              h=0.1;
              for(i=0;i<10;i++) x+=h;
              y=1.0-x;
              printf("part 3: x=%23.16e   y=%23.16e\n",x,y);
      
      
              /* compare calculations */
              h=1.0/2.0;
              a=2.0/3.0-h;      /* 2/3 - 1/2 should equal 1/6    */
              b=3.0/5.0-h;      /* 3/5 - 1/2 should equal 1/10   */
              c=(a+a+a)-h;      /* 3*(1/6) - 1/2 should equal 0  */
              d=(b+b+b+b+b)-h;  /* 5*(1/10) - 1/2 should equal 0 */
              q=c/d;            /* 0/0 should give an error      */
              printf("part 4: a=%23.16e   b=%23.16e\n",a,b);
              printf("        c=%23.16e   d=%23.16e\n",c,d);
              printf("        q=%23.16e\n",q);
      
              return 0;
          }
      
          /* single precision output:
      
          part 1: eps= 1.1920928955078125e-07
          part 2: small= 1.4012984643248171e-45
          part 3: x= 1.0000001192092896e+00   y=-1.1920928955078125e-07
          part 4: a= 1.6666667163372040e-01   b= 1.0000000149011612e-01
                  c= 0.0000000000000000e+00   d= 0.0000000000000000e+00
                  q=                    NaN
          */
          /* double precision output:
      
          part 1: eps= 2.2204460492503131e-16
          part 2: small=4.9406564584124654e-324
          part 3: x= 9.9999999999999989e-01   y= 1.1102230246251565e-16
          part 4: a= 1.6666666666666663e-01   b= 9.9999999999999978e-02
                  c=-1.1102230246251565e-16   d=-1.1102230246251565e-16
                  q= 1.0000000000000000e+00
          */
          
    3. use of guard bit
          Assume a binary significand with two digits to the
          right of the binary point.  Subtract 1.11 * (2)^0
          from 1.00 * (2)^1 without and then with a guard digit.
          That is, subtract 1 3/4 from 2.
      
          Without a guard digit
      
                1.00 * (2)^1
              - 1.11 * (2)^0
              --------------
      
              requires that we align the operands to equalize
              the exponents
      
                1.00 * (2)^1
              - 0.11 * (2)^1
              --------------
                0.01 * (2)^1
              = 1.00 * (2)^-1  = 1/2  => 100% error
      
          With a guard digit
      
                1.00 * (2)^1
              - 1.11 * (2)^0
              --------------
      
              requires that we align the operands to equalize
              the exponents
      
                    g
                1.000 * (2)^1
              - 0.111 * (2)^1
              ---------------
                0.001 * (2)^1
              = 1.000 * (2)^-2
              = 1.00  * (2)^-2  =  1/4  => correct
          



mark@cs.clemson.edu