The ANSI standard for C has a provision that allows expressions to be evaluated in single-precision arithmetic if there is no double (or long double) operand in the expression. The C compiler supports this provision.
Floating point constants are double-precision, unless explicitly stated to be float. For example, in the statements
float a,b;
...
a = b + 1.0;
because the constant 1.0 has type double, b is
promoted to double before the addition and the result
is converted back to float.
However, the constant can be made explicitly a float:
a = b + 1.0f;
/* or */
a = b + (float) 1.0;
In this case, the statement can potentially be
compiled to a single instruction.
Single-precision operations
tend to be faster than double-precision operations.
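The effect of the constant's suffix on the type of the expression can be checked with sizeof, which reports the size of an expression's type without evaluating it. The following is only a minimal sketch: the helper names are hypothetical, and it says nothing about which instructions a particular compiler generates.

```c
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */

/* Size of the result type when the constant is a double. */
size_t size_with_double_constant(void)
{
    float b = 2.0f;
    return sizeof(b + 1.0);    /* b is promoted, so the expression is a double */
}

/* Size of the result type when the constant is a float. */
size_t size_with_float_constant(void)
{
    float b = 2.0f;
    return sizeof(b + 1.0f);   /* both operands are float, so the expression is a float */
}
```

On a system where float and double have different sizes, the two helpers return different values, matching sizeof(double) and sizeof(float) respectively.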
Whether a computation can be done in single-precision is decided based on the operands of each operator. Consider the following:
float s; double d;
d = d + s * s;
s * s is computed to produce a single-precision result, which is promoted to double-precision and added to d.
Note that using single-precision (as opposed to double-precision) arithmetic can result in loss of precision, as illustrated in the following example.
float f = 8191.f * 8191.f; /* evaluate as a float */
double d = 8191. * 8191. ; /* evaluate as a double */
printf ("As float: %f\nAs double: %f\n", f, d);
The result is:
As float: 67092480.000000
As double: 67092481.000000
Also, long int variables (same as int) have more precision than float variables. Consider the following example:
int i,j;
i = 0x7fffffff;
j = i * 1.0;
printf("j = %x\n", j);
j = i * 1.0f;
printf("j = %x\n", j);
The first printf statement
outputs 7fffffff, while the second prints 0.
The second printf prints 0 because the nearest float to
0x7fffffff has a value of 0x80000000.
When the value is converted to an integer,
the result is 0,
and a floating point imprecise result exception occurs.
A trap occurs if this exception was enabled.
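The underlying limit can be checked without triggering the overflow described above. The sketch below assumes an IEEE 754 single-precision float (24-bit significand) and an int of at least 32 bits; survives_float is a hypothetical helper. It compares in double, which represents every 32-bit int exactly, so the float value is never converted back to int.

```c
/* Hypothetical helper: returns 1 if n keeps its value when stored in a float. */
int survives_float(int n)
{
    /* volatile forces the value to be rounded to float precision */
    volatile float f = (float)n;

    /* Compare in double so no float-to-int conversion takes place. */
    return (double)f == (double)n;
}
```

Under those assumptions, survives_float(16777216) returns 1 because 2^24 is exactly representable, while survives_float(16777217) and survives_float(0x7fffffff) return 0.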
A function that is declared to return a float may actually return either a float or a double. If the function declaration is a prototype declaration in which at least one of the parameters is float, the function returns a float. Otherwise, it returns a double with precision limited to that of a float. (All of this is transparent.) For example:
float retflt(float); /* actually returns a float */
float retdbl1(); /* actually returns a double */
float retdbl2(int); /* actually returns a double */
Arguments work as follows:
double takeflt(float x); /* takes a float */
double takedbl(x) float x; /* takes a double */
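The widening of float arguments in the non-prototype case is the same default argument promotion that standard C applies to arguments passed through an ellipsis, which gives a portable way to observe it. The following is a minimal sketch; first_vararg is a hypothetical helper and is not taken from the compiler's documentation.

```c
#include <stdarg.h>

/* Hypothetical helper: returns the first argument passed after count.
   Arguments passed through "..." undergo the default argument promotions,
   so a float supplied by the caller arrives here as a double. */
double first_vararg(int count, ...)
{
    va_list ap;
    double value;

    va_start(ap, count);
    value = va_arg(ap, double);   /* must be read as double, never as float */
    va_end(ap);
    return value;
}
```

A call such as first_vararg(1, x) with float x = 1.5f returns 1.5, because x was widened to double before the call.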