Floating point operations

Single-precision floating point operations

The ANSI standard for C has a provision that allows expressions to be evaluated in single-precision arithmetic if there is no double (or long double) operand in the expression. The C compiler supports this provision.

Floating point constants are double-precision, unless explicitly stated to be float. For example, in the statements

   float a,b;
    ...
   a = b + 1.0;

because the constant 1.0 has type double, b is promoted to double before the addition and the result is converted back to float. However, the constant can be made explicitly a float:

   a = b + 1.0f;
     /* or */
   a = b + (float) 1.0;

In this case, the statement can potentially be compiled to a single instruction. Single-precision operations tend to be faster than double-precision operations.
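The effect of the suffix on the type of the expression can be checked with sizeof, which reports the size of the result type without evaluating its operand. The following is a minimal sketch; the function names are hypothetical, and it assumes an implementation where float and double have different sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the result type of b + 1.0; the constant 1.0 is a double,
   so the whole expression has type double. */
size_t size_with_double_constant(float b)
{
    return sizeof(b + 1.0);
}

/* Size of the result type of b + 1.0f; both operands are float,
   so the expression has type float. */
size_t size_with_float_constant(float b)
{
    return sizeof(b + 1.0f);
}

/* Hypothetical self-check of the two helpers above. */
void check_constant_types(void)
{
    assert(size_with_double_constant(2.0f) == sizeof(double));
    assert(size_with_float_constant(2.0f) == sizeof(float));
}
```

Note that this shows only the type of each expression. Whether the compiler actually emits a single-precision instruction depends on the target and the optimization settings.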

Whether a computation can be done in single-precision is decided based on the operands of each operator. Consider the following:

   float s;
   double d;

   d = d + s * s;

s * s is computed to produce a single-precision result, which is then promoted to double precision and added to d. Note that using single-precision rather than double-precision arithmetic can result in loss of precision, as illustrated in the following example.

   float f  = 8191.f * 8191.f;     /* evaluate as a float  */
   double d = 8191.  * 8191. ;     /* evaluate as a double */
   printf ("As float:  %f\nAs double: %f\n", f, d);

The result is:

   As float:  67092480.000000
   As double: 67092481.000000
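The same loss can appear in a mixed expression such as d + s * s, and it can be avoided by converting one operand of the multiplication to double. The following sketch uses hypothetical function names and assumes IEEE single- and double-precision formats. The intermediate float variable is there to force the product to be rounded to single precision even on implementations that evaluate float expressions in a wider format.

```c
#include <assert.h>

/* The product is formed in single precision, then promoted to double
   for the addition. */
double add_square_float(double d, float s)
{
    float p = s * s;    /* storing to a float rounds to single precision */
    return d + p;
}

/* Converting one operand to double makes the multiplication itself
   double-precision, so the product is not rounded to float. */
double add_square_double(double d, float s)
{
    return d + (double)s * s;
}

/* Hypothetical self-check using the value 8191 from the example above. */
void check_add_square(void)
{
    assert(add_square_float(0.0, 8191.0f) == 67092480.0);
    assert(add_square_double(0.0, 8191.0f) == 67092481.0);
}
```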

Also, long int variables (which are the same as int) have more precision than float variables: a 32-bit integer holds 31 value bits, whereas a float has a 24-bit significand. Consider the following example:

   int i,j;
   i = 0x7fffffff;
   j = i * 1.0;
   printf("j = %x\n", j);
   j = i * 1.0f;
   printf("j = %x\n", j);

The first printf statement outputs 7fffffff, while the second prints 0. The second printf prints 0 because the nearest float to 0x7fffffff has a value of 0x80000000. When that value is converted to an integer, the result is 0, and a floating point imprecise result exception occurs. A trap occurs if this exception is enabled.
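One way to guard against this kind of out-of-range conversion is to test the float value before converting it. The following is a minimal sketch, not a feature of the compiler; the function names are hypothetical, and it assumes a two's complement int (so that INT_MIN is a power of two and converts to float exactly) and round-to-nearest conversion from int to float.

```c
#include <assert.h>
#include <limits.h>

/* Stores (int)f in *out and returns 1 if f is within the range of int;
   otherwise leaves *out untouched and returns 0. A NaN fails both
   comparisons and is therefore rejected. */
int float_to_int_checked(float f, int *out)
{
    /* (float)INT_MAX may round up past INT_MAX, so the upper bound is
       taken from -(float)INT_MIN, which is exact. */
    if (!(f >= (float)INT_MIN && f < -(float)INT_MIN))
        return 0;
    *out = (int)f;
    return 1;
}

/* Hypothetical self-check. */
void check_float_to_int(void)
{
    int out = 123;
    assert(float_to_int_checked(1000.5f, &out) == 1 && out == 1000);
    /* With a 32-bit int, (float)INT_MAX is 2^31, which is rejected. */
    assert(float_to_int_checked((float)INT_MAX, &out) == 0 && out == 1000);
}
```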

A function that is declared to return a float may actually return either a float or a double. If the function declaration is a prototype declaration in which at least one of the parameters is float, the function returns a float. Otherwise, it returns a double whose precision is limited to that of a float. (The difference is transparent to the calling code.) For example:

   float retflt(float);    /* actually returns a float  */
   float retdbl1();        /* actually returns a double */
   float retdbl2(int);     /* actually returns a double */

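Arguments follow a similar rule. A float parameter declared in a prototype is passed as a float, whereas a float parameter in an old-style (non-prototype) function definition is passed as a double: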

   double takeflt(float x);    /* takes a float  */

   double takedbl(x)
   float x;                    /* takes a double */
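The promotion of float arguments to double when no prototype describes the parameter also applies to the trailing arguments of a variadic function, which gives a portable way to observe it. The following sketch is an illustration only; sum_promoted and check_sum_promoted are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` trailing floating point arguments. Callers may pass
   float values, but the default argument promotions convert each one
   to double, so va_arg must read them as double. */
double sum_promoted(int count, ...)
{
    va_list ap;
    double total = 0.0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

/* Hypothetical self-check: both arguments are floats at the call site. */
void check_sum_promoted(void)
{
    float a = 1.5f, b = 2.25f;
    assert(sum_promoted(2, a, b) == 3.75);
}
```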