How floating-point numbers are represented in the IEEE-754 standard
The maximal and minimal numbers
FP | maximal | minimal (normal) |
---|---|---|
32 | 1.11111111111111111111111*2^127^ = 3.4028234663852886e038 | 1.0*2^-126^ = 1.1754943508222875e-038 |
64 | 1.7976931348623157e308 | 2.2250738585072014e-308 |

For details please check this. Never forget that the exponent bits cannot be all 1s; that pattern is reserved for NaN and Inf. Likewise, an all-0s exponent marks subnormal numbers, which is the difference between a subnormal double and a normal double: the minima above are the smallest normal values, and subnormals extend further down, to 2^-149^ for FP32 and about 4.9e-324 for FP64.
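These limits can be queried directly with NumPy's `finfo` (a quick sketch; `tiny` is NumPy's name for the smallest positive normal number):

```python
import numpy as np

# Print the extreme values for FP32 and FP64.
# `tiny` is the smallest positive *normal* number; subnormals go lower still.
for dt in (np.float32, np.float64):
    info = np.finfo(dt)
    print(dt.__name__, info.max, info.tiny)
```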
Some facts
It’s a known fact that associativity of FP addition does NOT hold, and that’s why summing the same four values in different orders, e.g. a1+c1+b1+d1 versus a1+b1+c1+d1, can give distinct results. Another important fact is that rounding from FP64 to FP32 behaves differently for 1+e and 1-e, where e is a small positive number. Rounding of 1-e is more accurate than that of 1+e: just below 1 the FP32 spacing is 2^-24^, half of the spacing 2^-23^ just above 1, so the worst-case rounding error is halved.
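The non-associativity is easy to demonstrate with plain Python (FP64) floats; the original listing is lost, so this is a minimal sketch with different values:

```python
# Same three addends, different grouping, different FP64 results:
left = (0.1 + 0.2) + 0.3   # 0.1 + 0.2 already rounds to 0.30000000000000004
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 happens to round to exactly 0.5
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```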
At the rounding stage, the last 29 bits of the FP64 mantissa (52 - 23 = 29) are dropped, rounding up or down depending on whether they exceed 10...0 (29 bits). It’s not hard to infer that the smallest (in absolute value) e for which 1-e converts to FP32 without any rounding error is e = 2^-24^, since 1 - 2^-24^ = 1.11...1*2^-1^ with exactly 23 ones in the mantissa. Likewise, e = 2^-53^ for FP64.
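Both cases can be checked with NumPy's `float32` conversion (a small sketch; e = 2^-24^ is the boundary value discussed above):

```python
import numpy as np

e = 2.0 ** -24
# 1 + e sits halfway between 1.0 and 1 + 2^-23; round-to-nearest-even drops it to 1.0.
print(np.float32(1.0 + e) == np.float32(1.0))  # True
# 1 - e = 1.11...1 * 2^-1 (23 ones) fits FP32 exactly: no rounding error at all.
print(float(np.float32(1.0 - e)) == 1.0 - e)   # True
```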
Largest contiguously representable integer
Precision is solely determined by the number of mantissa bits. Assume the exponent is 23 and look at the number a = 1.11...1*2^23^, where 23 ones follow the binary point. Every positive integer up to a is exactly representable in FP32. The next integer on the real axis is b = a + 1; don’t worry about the exponent for the moment, since it can go well beyond 23. The mantissa of b is all 0s, thus b = 1.00...0*2^24^ = 16777216 in decimal. The next number in the binary sense is c = 1.00...1*2^24^, but unfortunately that is 16777218, so the integer 16777217 is missing. The same induction carries through, and it’s not hard to imagine that more and more gaps appear as the numbers grow larger.
```
>>> np.float32(2**24 + 1)
16777216.0
```
In summary, by the IEEE-754 standard, the integer range that can be represented exactly is

| FP | Range |
|----|-----------------|
| 32 | [-2^24^, 2^24^] |
| 64 | [-2^53^, 2^53^] |
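The FP64 row can be verified with plain Python integers and floats (a quick sketch):

```python
n = 2 ** 53
# 2^53 + 1 is the first integer FP64 cannot hold: it rounds back to 2^53.
print(float(n + 1) == float(n))    # True
# Every integer up to 2^53 converts exactly and round-trips.
print(int(float(n - 1)) == n - 1)  # True
```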
Why 0.1+0.2 != 0.3
By now we should be aware that not every decimal number can be represented exactly: e.g. 1.25 can, but 3.1415926... cannot. 0.1 and 0.2 each happen to fall between two representable FP64 numbers, so both are stored with a small rounding error, and the sum is not a perfect 0.3.
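The stored values can be inspected with the standard `decimal` module, since `Decimal(x)` prints the exact binary value the FP64 `x` holds (a small sketch):

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)    # False
print(Decimal(0.1))        # exact value stored for 0.1 (slightly above 0.1)
print(Decimal(0.2))        # exact value stored for 0.2 (slightly above 0.2)
print(Decimal(0.1 + 0.2))  # the rounded sum...
print(Decimal(0.3))        # ...is not the same double as the one nearest 0.3
```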
Conclusion
If you’re still thirsty for more on this topic, please check this blog.