3.1. Math IEEE-754

Floating-point numbers are not real numbers, so the result of 1.0/3.0 cannot be represented exactly without infinite precision. In the decimal (base 10) number system, one-third is a repeating fraction, so it has an infinite number of digits. Even simple non-repeating decimal numbers can be a problem. One-tenth (0.1) is obviously non-repeating, so we can express it exactly with a finite number of digits. As it turns out, since numbers within computers are stored in binary (base 2) form, even one-tenth cannot be represented exactly with floating-point numbers. [1]
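
The claim about one-tenth can be checked directly in Python. The following is a minimal illustrative sketch, not part of the original material. It uses only the standard-library `fractions` module and the `float.as_integer_ratio()` method to show the exact rational value stored for the literal `0.1`; the variable names are chosen for illustration.

```python
from fractions import Fraction

# The exact rational value that the literal 0.1 is stored as.
numerator, denominator = (0.1).as_integer_ratio()

# The denominator is a power of two, because the stored value is a binary fraction.
is_power_of_two = denominator & (denominator - 1) == 0

# The stored value is very close to, but not equal to, one-tenth.
stored = Fraction(numerator, denominator)
one_tenth = Fraction(1, 10)
error = stored - one_tenth

print(numerator, denominator)
print(is_power_of_two)
print(stored == one_tenth)
print(float(error))
```

Since no power of two is divisible by 10, no fraction with a power-of-two denominator can equal 1/10, which is why the comparison with `Fraction(1, 10)` fails.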

When should you use integers and when should you use floating-point numbers? A good rule of thumb is this: use integers to count things and use floating-point numbers for quantities obtained from a measuring device. As examples, we can measure length with a ruler or a laser range finder; we can measure volume with a graduated cylinder or a flow meter; we can measure mass with a spring scale or triple-beam balance. In all of these cases, the accuracy of the measured quantity is limited by the accuracy of the measuring device and the competence of the person or system performing the measurement. Environmental factors such as temperature or air density can affect some measurements. In general, the degree of inexactness of such measured quantities is far greater than that of the floating-point values that represent them. [1]

Despite their inexactness, floating-point numbers are used every day throughout the world to solve sophisticated scientific and engineering problems. The limitations of floating-point numbers are unavoidable since values with infinite characteristics cannot be represented in a finite way. Floating-point numbers provide a good trade-off of precision for practicality. [1]

>>> 0.1
0.1
>>>
>>> 0.2
0.2
>>>
>>> 0.3
0.3
>>>
>>> 0.1 + 0.2 == 0.3
False
>>> round(0.1+0.2, 16) == 0.3
True
>>>
>>> round(0.1+0.2, 17) == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004
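
Rather than rounding both sides before using `==`, floats are commonly compared within a tolerance. The sketch below assumes the standard-library function `math.isclose` is acceptable for the use case; the tolerance value passed as `abs_tol` is an arbitrary choice made for illustration.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of the tiny representation error.
exact_match = total == 0.3

# math.isclose compares within a relative tolerance (default rel_tol=1e-09).
close_match = math.isclose(total, 0.3)

# A relative tolerance is useless when comparing against zero, so pass abs_tol.
near_zero = math.isclose(total - 0.3, 0.0, abs_tol=1e-12)

print(exact_match, close_match, near_zero)  # False True True
```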

3.1.1. Problem

Why are floating-point calculations so inaccurate? Users are often surprised by results like this:

>>> 1.2 - 1.0
0.19999999999999996

and think it is a bug in Python. It's not. This has little to do with Python, and much more to do with how the underlying platform handles floating-point numbers.

The float type in CPython uses a C double for storage. A float object's value is stored in binary floating-point with a fixed precision (typically 53 bits) and Python uses C operations, which in turn rely on the hardware implementation in the processor, to perform floating-point operations. This means that as far as floating-point operations are concerned, Python behaves like many popular languages including C and Java.

Many numbers that can be written easily in decimal notation cannot be expressed exactly in binary floating-point. For example, after:

>>> x = 1.2

the value stored for x is a (very good) approximation to the decimal value 1.2, but is not exactly equal to it. On a typical machine, the actual stored value is:

1.0011001100110011001100110011001100110011001100110011 (binary)

which is exactly:

1.1999999999999999555910790149937383830547332763671875 (decimal)

The typical precision of 53 bits provides Python floats with 15–16 decimal digits of accuracy.
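
The stored value quoted above can be reproduced from Python itself. This is a small sketch using only the standard library: `sys.float_info.mant_dig` reports the significand width, `Decimal(x)` converts the stored binary value exactly, and `float.hex()` shows the same bits in hexadecimal.

```python
import sys
from decimal import Decimal

x = 1.2

# Number of bits in the significand of a float on this platform (typically 53).
bits = sys.float_info.mant_dig

# Decimal(x) converts the stored binary value exactly, digit for digit.
exact = Decimal(x)

# float.hex() shows the same stored value with a hexadecimal fraction.
hex_form = x.hex()

print(bits)
print(exact)
print(hex_form)
```

Each hexadecimal digit `3` in the output corresponds to the repeating binary group `0011` in the binary expansion shown above.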

For a fuller explanation, please see the floating point arithmetic chapter in the Python tutorial.

>>> candy = 0.10      # price in dollars
>>> cookie = 0.20     # price in dollars
>>>
>>> result = candy + cookie
>>> print(result)
0.30000000000000004
>>> (candy+cookie) * 1
0.30000000000000004
>>>
>>> (candy+cookie) * 10
3.0000000000000004
>>>
>>> (candy+cookie) * 100
30.000000000000004
>>>
>>> (candy+cookie) * 1000
300.00000000000006
>>>
>>> (candy+cookie) * 10000
3000.0000000000005
>>>
>>> (candy+cookie) * 100000
30000.000000000004
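
Scaling the result does not remove the error, and repeated additions can let it accumulate. The sketch below is an illustrative addition rather than part of the original example: it adds ten prices of 0.10 with an explicit loop and compares the total with `math.fsum`, a standard-library function that tracks the lost low-order bits while summing.

```python
import math

prices = [0.1] * 10  # ten items priced at 0.10 each

# Plain left-to-right addition: every step rounds, and the errors add up.
naive_total = 0.0
for price in prices:
    naive_total += price

# math.fsum keeps track of intermediate rounding errors.
accurate_total = math.fsum(prices)

print(naive_total == 1.0)     # False
print(accurate_total == 1.0)  # True
```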

3.1.2. IEEE 754 standard

>>> import numpy as np
>>> a = 1.234
>>> b = 1234 * 10**-3
>>>
>>> a == b
True
>>>
>>> 1234 * 10**-3
1.234
>>>
>>> 1.234 == 1234 * 10e-4
True

Write to memory:

>>> sign = 0  # 0 is plus; 1 is minus
>>> mantissa = 1234
>>> exponent = -3
>>>
>>> sign, exponent, mantissa
(0, -3, 1234)
>>>
>>> sign = np.binary_repr(0, width=1)          # '0'
>>> exponent = np.binary_repr(-3, width=8)     # '11111101'
>>> mantissa = np.binary_repr(1234, width=23)  # '00000000000010011010010'
>>>
>>> print(sign, exponent, mantissa, sep='')
01111110100000000000010011010010

Read from memory:

>>> sign = 0  # 0 is plus; 1 is minus
>>> mantissa = 1234
>>> exponent = -3
>>>
>>> mantissa * 10 ** exponent
1.234

Warning

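This is only a demonstration of the idea behind the conversion. It uses a simplified, decimal formula to show how a number can be split into a sign, an exponent and a mantissa. The actual IEEE 754 binary formats differ from the example above: the exponent is a power of two stored with a bias rather than in two's complement, and the mantissa is a normalized binary fraction with an implicit leading 1.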
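
For comparison with the simplified example, the sketch below reads the actual bit fields of a value stored in the IEEE 754 single-precision (binary32) format, which has the same 1 + 8 + 23 bit layout. It relies on the standard-library `struct` module; the helper names `decompose_float32` and `rebuild_float32` are hypothetical and chosen only for this illustration. Python's own `float` is a 64-bit double, so the value is first rounded to 32 bits.

```python
import struct


def decompose_float32(value: float) -> tuple[int, int, int]:
    """Round value to binary32 and return (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction


def rebuild_float32(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normal binary32 value (not zero, subnormal, inf or NaN)."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


sign, exponent, fraction = decompose_float32(1.234)
print(f"{sign:01b} {exponent:08b} {fraction:023b}")
print(rebuild_float32(sign, exponent, fraction))
```

The rebuilt number is the nearest binary32 value to 1.234, so it differs slightly from both the decimal 1.234 and the 64-bit float that Python stores for that literal.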

../../_images/math-ieee754-parts.png

Figure 3.5. What a float is, as defined by the IEEE 754 standard

../../_images/math-ieee754-expression.png

Figure 3.6. Points chart

../../_images/math-ieee754-mantissa-1.png

Figure 3.7. How a computer stores a float, as defined by the IEEE 754 standard

../../_images/math-ieee754-mantissa-2.png

Figure 3.8. How to read/write float from/to memory?

../../_images/math-ieee754-normalized.png

Figure 3.9. Normalized Line

3.1.3. Floats in Doctest

>>> def add(a, b):
...     """
...     >>> add(1.0, 2.0)
...     3.0
...
...     >>> add(0.1, 0.2)
...     0.30000000000000004
...
...     >>> add(0.1, 0.2)  # doctest: +ELLIPSIS
...     0.3000...
...     """
...     return a + b
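
For `...` in an expected output to act as a wildcard, doctest needs the `ELLIPSIS` option, which can be enabled per example with a `# doctest: +ELLIPSIS` directive. The sketch below is one possible way to check this, not the only one: it parses the docstring of a sample `add` function and runs it with `doctest.DocTestParser` and `doctest.DocTestRunner`. The second example in the docstring, which rounds the result, is an added alternative for illustration.

```python
import doctest


def add(a, b):
    """
    >>> add(0.1, 0.2)  # doctest: +ELLIPSIS
    0.3000...

    >>> round(add(0.1, 0.2), 4)
    0.3
    """
    return a + b


# Parse the docstring and run its examples directly, without module discovery.
parser = doctest.DocTestParser()
test = parser.get_doctest(add.__doc__, {"add": add}, "add", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, results.attempted)
```

In an ordinary module, calling `doctest.testmod()` or running `python -m doctest file.py` would do the same job with less code.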

3.1.4. Decimal Type

>>> from decimal import Decimal
>>> a = Decimal('0.1')
>>> b = Decimal('0.2')
>>>
>>> a + b
Decimal('0.3')
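
The example above builds each `Decimal` from a string, and that detail matters. As a short sketch of the difference: a `Decimal` built from a float copies the float's binary representation error, so the sum no longer equals `Decimal('0.3')`.

```python
from decimal import Decimal

# Built from strings: the decimal digits are taken exactly as written.
from_strings = Decimal("0.1") + Decimal("0.2")

# Built from floats: each Decimal inherits the float's representation error.
from_floats = Decimal(0.1) + Decimal(0.2)

print(from_strings == Decimal("0.3"))  # True
print(from_floats == Decimal("0.3"))   # False
```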

3.1.5. Performance Comparison

  • Python 3.12.0

>>> 
... a = Decimal('0.1')
... b = Decimal('0.2')
...
... %%timeit -r1000 -n1000
... a + b
85.2 ns ± 16.4 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... a = 0.1
... b = 0.2
...
... %%timeit -r1000 -n1000
... a + b
32.8 ns ± 6.9 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... a = float('0.1')
... b = float('0.2')
...
... %%timeit -r1000 -n1000
... a + b
32.9 ns ± 6.26 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... %%timeit -r1000 -n1000
... Decimal('0.1') + Decimal('0.2')
415 ns ± 79.6 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... %%timeit -r1000 -n1000
... 0.1 + 0.2
9.24 ns ± 5.45 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... %%timeit -r1000 -n1000
... float(0.1) + float(0.2)
64.5 ns ± 11.5 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
>>> 
... %%timeit -r1000 -n1000
... float('0.1') + float('0.2')
160 ns ± 33 ns per loop (mean ± std. dev. of 1000 runs, 1,000 loops each)
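
The measurements above use the IPython `%%timeit` magic. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `best_ns_per_loop` and the repeat counts are assumptions made for this illustration, and the absolute numbers will differ between machines and Python versions.

```python
import timeit
from decimal import Decimal


def best_ns_per_loop(stmt: str, setup: str = "pass", number: int = 10_000) -> float:
    """Return the best observed time per loop, in nanoseconds."""
    timings = timeit.repeat(
        stmt,
        setup=setup,
        repeat=5,
        number=number,
        globals={"Decimal": Decimal},
    )
    return min(timings) / number * 1e9


# Each case is (statement, setup); setup runs once and is not timed.
cases = {
    "Decimal, prebuilt": ("a + b", "a = Decimal('0.1'); b = Decimal('0.2')"),
    "float, prebuilt": ("a + b", "a = 0.1; b = 0.2"),
    "Decimal, built each time": ("Decimal('0.1') + Decimal('0.2')", "pass"),
    "float, parsed each time": ("float('0.1') + float('0.2')", "pass"),
}

results = {name: best_ns_per_loop(stmt, setup) for name, (stmt, setup) in cases.items()}

for name, nanoseconds in results.items():
    print(f"{name}: {nanoseconds:.1f} ns per loop")
```

Timing the prebuilt and built-each-time cases separately distinguishes the cost of the addition itself from the cost of constructing the operands.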

3.1.6. Solutions

  • Round values to 4 decimal places (generally acceptable)

  • Store values as int, do the operation and then divide. For example, instead of 1.99 USD, store the price as 199 US cents

  • Use the Decimal type (it represents decimal values such as 0.1 exactly, but is much slower than float)

Problem:

>>> candy = 0.10      # price in dollars
>>> cookie = 0.20     # price in dollars
>>>
>>> result = candy + cookie
>>> print(result)
0.30000000000000004

Round values to 4 decimal places (generally acceptable):

>>> candy = 0.10      # price in dollars
>>> cookie = 0.20     # price in dollars
>>>
>>> result = round(candy + cookie, 4)
>>> print(result)
0.3

Store values as int, do operation and then divide:

>>> CENT = 1
>>> DOLLAR = 100 * CENT
>>>
>>> candy = 10*CENT
>>> cookie = 20*CENT
>>>
>>> result = (candy + cookie) / DOLLAR
>>> print(result)
0.3

Use Decimal type:

>>> from decimal import Decimal
>>>
>>>
>>> candy = Decimal('0.10')     # price in dollars
>>> cookie = Decimal('0.20')    # price in dollars
>>>
>>> result = candy + cookie
>>> print(result)
0.30
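
As a variation on the integer approach above, the total can stay in integer cents and be converted to text only for display, so that no float is created at all. This is a sketch under that assumption; the helper name `format_cents` is hypothetical.

```python
def format_cents(amount: int) -> str:
    """Format an integer number of cents as a dollar amount, e.g. 30 -> '0.30'."""
    sign = "-" if amount < 0 else ""
    dollars, cents = divmod(abs(amount), 100)
    return f"{sign}{dollars}.{cents:02d}"


candy = 10   # price in cents
cookie = 20  # price in cents

total = candy + cookie  # exact integer arithmetic
print(format_cents(total))  # 0.30
```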

3.1.7. References

3.1.8. Assignments

Code 3.43. Solution
"""
* Assignment: Math IEEE754 NoFix
* Complexity: easy
* Lines of code: 3 lines
* Time: 2 min

English:
    1. Define variables with prices:
        a. candy = 0.10 USD
        b. cookie = 0.20 USD
    2. Define `result: float` with sum of prices for a candy and a cookie
    3. Do not fix precision error from IEEE-754
    4. Run doctests - all must succeed

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result
    0.30000000000000004
"""

# Define variables with prices:
# - candy = 0.10 USD
# - cookie = 0.20 USD
candy = ...
cookie = ...

# Total price for both a candy and a cookie
# Do not fix precision error from IEEE-754
# type: float
result = ...

Code 3.44. Solution
"""
* Assignment: Math IEEE754 IntFix
* Complexity: easy
* Lines of code: 3 lines
* Time: 2 min

English:
    1. Define variables with prices:
        a. candy = 0.10 USD
        b. cookie = 0.20 USD
    2. Define `result: float` with sum of prices for a candy and a cookie
    3. Fix precision error from IEEE-754
    4. Use `int` type for that reason
    5. Run doctests - all must succeed

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result
    0.3
"""

CENT = 1
DOLLAR = 100*CENT

# Define variables with prices:
# - candy = 0.10 USD
# - cookie = 0.20 USD
candy = ...
cookie = ...

# Total price for both a candy and a cookie
# Fix precision error from IEEE-754
# Use `int` type for that reason
# type: float
result = ...


Code 3.45. Solution
"""
* Assignment: Math IEEE754 DecimalFix
* Complexity: easy
* Lines of code: 3 lines
* Time: 2 min

English:
    1. Define variables with prices:
        a. candy = 0.10 USD
        b. cookie = 0.20 USD
    2. Define `result: Decimal` with sum of prices for a candy and a cookie
    3. Fix precision error from IEEE-754
    4. Use `Decimal` type for that reason
    5. Run doctests - all must succeed

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result
    Decimal('0.30')
"""

from decimal import Decimal


# Define variables with prices:
# - candy = 0.10 USD
# - cookie = 0.20 USD
candy = ...
cookie = ...

# Total price for both a candy and a cookie
# Fix precision error from IEEE-754
# Use `Decimal` type for that reason
# type: Decimal
result = ...