Floating Point Numbers¶

The Base Ten System¶

The number system that we are used to dealing with in everyday life is based on powers of 10. In this system, an integer is represented as a sequence of digits from 0 to 9:

$\ldots d_2 d_1 d_0$

In the base 10 system, the value of a number is based on its position in the sequence. The right-most digit $d_0$ indicates ones (or $10^0$ ), the next digit $d_1$ indicates tens (or $10^1$ ), the next digit $d_2$ indicates hundreds (or $10^2$ ). So, the number $d_2 d_1 d_0$ means $10^2 \cdot d_2 + 10^1 \cdot d_1 + 10^0 \cdot d_0$ . This system allows us to represent integers as large as we like by adding increasing powers of ten to the left.

We can extend this system to include numbers less than 1 by the addition of a decimal point:

$\ldots d_2 d_1 d_0 . d_{-1} d_{-2} \ldots$

The numbers to the right of the decimal point indicate negative powers of ten. The decimal part of this number therefore means $10^{-1} \cdot d_{-1} + 10^{-2} \cdot d_{-2} + \ldots$

Putting the decimal part together with the integer part to the left of the decimal point, we have:

$\ldots d_2 d_1 d_0 . d_{-1} d_{-2} \ldots &= \ldots + 10^2 \cdot d_2 + 10^1 \cdot d_1 + 10^0 \cdot d_0 + 10^{-1} \cdot d_{-1} + 10^{-2} \cdot d_{-2} + \ldots \\ &= \ldots + 100 d_2 + 10 d_1 + d_0 + \dfrac{d_{-1}}{10} + \dfrac{d_{-2}}{100} + \ldots$

The dots $\ldots$ here indicate that we can make the number as large as we like by adding digits to the left, and as precise as we like by adding digits to the right.

Binary Numbers¶

A binary number is based on powers of 2. As in the base 10 system, the value of a number is also based on its position in the sequence, and the powers used are powers of 2 rather than powers of 10. In the base 2 system, the only digits are 0 and 1. A number in this system is represented as a sequence of 1’s and 0’s:

$\ldots b_2 b_1 b_0 . b_{-1} b_{-2} \ldots$

The digits to the left of the point again represent the integer part of the number, and those to the right represent the fractional part.

$\ldots b_2 b_1 b_0 . b_{-1} b_{-2} \ldots &= \ldots + 2^2 \cdot b_2 + 2^1 \cdot b_1 + 2^0 \cdot b_0 + 2^{-1} \cdot b_{-1} + 2^{-2} \cdot b_{-2} + \ldots \\ &= \ldots + 4 b_2 + 2 b_1 + b_0 + \dfrac{b_{-1}}{2} + \dfrac{b_{-2}}{4} + \ldots$

An example of a binary integer is the number $(1011)_2$ . To convert this to base 10, we add up the powers of 2:

$(1011)_2 = 2^3 \cdot 1 + 2^2 \cdot 0 + 2^1 \cdot 1 + 2^0 \cdot 1 = 8 + 2 + 1 = (11)_{10}$

The fractional part to the right of the point is dealt with in the same way, but with negative powers of 2. For example:

$(.1101)_2 = \dfrac{1}{2} + \dfrac{1}{4} + \dfrac{1}{16} = \left(\dfrac{13}{16}\right)_{10} = (0.8125)_{10}$

Scientific Notation in Base Ten¶

Scientific notation allows us to represent very large and very small numbers without needing to write a decimal with a large number of zeros before or after the decimal point. Instead, we write a number as a decimal between 1 and 10, multiplied by a power of 10.

To convert a number to scientific notation, the decimal point is shifted to be immediately to the right of the leading digit. The number of spaces we need to move the decimal point tell us how large the power of 10 has to be.

If the decimal point is moved to the left, we have a positive power of 10.
If the decimal point is moved to the right, we have a negative power of 10.

Some examples are:

1234	=	$1.234 \times 10^3$	decimal point moved 3 places to the left
0.0000412	=	$4.12 \times10^{-5}$	decimal point moved 5 places to the right
456.789	=	$4.56789 \times 10^2$	decimal point moved 2 places to the left

Scientific Notation in the Binary (Base 2) System¶

Scientific notation works equally well in the binary (base 2) system. Just as in the base 10 system, we move the point to be immediately to the right of the leading digit, but in this case we multiply by a power of 2. Some examples are:

1011	=	$1.011 \times 2^3$	point moved 3 places to the left
0.00000110101	=	$1.10101 \times 2^{-6}$	point moved 6 places to the right
11001.11	=	$1.100111 \times 2^4$	point moved 4 places to the left

Floating Point Representation¶

The IEEE 754 standard defines a way to store a binary number on a computer. The standard is built on the following foundation:

Every number is stored using the same, fixed number of binary digits, or bits (zeros and ones).
The number of bits used depends on the computer and operating system used, but is usually either 32 or 64.
Numbers are stored in binary scientific notation, as defined in the previous section.
The power of 2 is referred to as the exponent, and the digits after the point as the mantissa.
A fixed number of bits (E) are set aside to store the exponent, and a fixed number (M) to store the mantissa.
A single bit - the sign bit - is used to indicate if the number stored is positive or negative. A sign bit of 0 represents a positive number, and a sign bit of 1 represents a negative number.

The Bias¶

The number stored in the bits set aside for the exponent represents a binary integer. As such, the minimum exponent value consists of all zeros, corresponding to the binary integer zero. This is not particularly useful, as it means that the smallest number we can store is $2^0$ i.e. 1.0.

We would like however to be able to store both very large numbers (with a large positive exponent) and very small numbers (with a large negative exponent). To overcome this problem, when a number in binary scientific notation is stored, a fixed integer (called the bias) is added to the actual exponent before it is stored.

For example, if we wanted to store the binary number $2^{-100}$ , we would add the bias to -100 before it is stored in the exponent. The exponent stored would therefore be -100 + bias, rather than -100. When the stored exponent is converted back to the actual exponent, the bias therefore needs to be subtracted to get back to where we started.

Putting this together, we have a formula for converting the computer representation of a number back into an actual number in binary scientific notation:

$(-1)^s \times (1.\textbf{mantissa}) \times 2^{\textbf{exponent} -\textbf{bias}}$

The letter ‘s’ here denotes the value of the sign bit. Some examples are:

Actual Number		Stored Number
Binary number	Scientific Notation	Sign	Exponent	Mantissa
1011	$1.011 \times 2^3$	0	3 + bias	011 $\cdots$ 000
0.00000110101	$1.10101 \times 2^{-6}$	0	-6 + bias	10101 $\cdots$ 000
11001.11	$1.100111 \times 2^4$	0	4 + bias	100111 $\cdots$ 000

Denormalized Numbers¶

A last problem to be solved is how to represent the number zero in this system. The assumption that a ‘1’ is always present before the mantissa bits doesn’t work, since the number zero doesn’t start with a ‘1’.

To solve this problem, a stored exponent of all zeros is treated differently from all other exponents. For the case when there is a stored exponent of all zeros:

The assumption that the mantissa is preceded by a ‘1’ is dropped, and we assume that the mantissa is preceded by a ‘0’ instead.
The actual exponent is taken to be 1 - bias.

Such a number is referred to as denormalized. This gives rise to a separate formula for converting the computer representation of a denormalized number back into an actual number in binary scientific notation:

$(-1)^s \times (0.\textbf{mantissa}) \times 2^{1 -\textbf{bias}}$

Report Description¶

The first part of the report involves conducting experiments in Python to determine how many bits are used for the exponent (E) and mantissa (M), and what the value of the bias (B) is. We want to find:

The “machine epsilon” (the smallest number that, when added to 1, gives a result greater than 1).
The largest floating point number that can be represented in Python.
The smallest floating point number that can be represented in Python.
The number of bits M reserved for the mantissa.
The number of bits E reserved for the exponent.
The bias B used for the exponent.
Is denormalization used?

Run experiments to find the machine epsilon, the smallest and the largest floating-point numbers. Use the results of these experiments to infer the three parameters and if denormalization is used.

Exercise 1. Write a function find_epsilon that finds the machine epsilon by determining the smallest floating point number larger than 1 that can be stored. This should also return or print the base 2 exponent for this number.

Exercise 2. Write a function find_largest that finds the largest floating point number that can be stored. This should also return or print the base 2 exponent for this number.

Exercise 3. Write a function find_smallest that finds the smallest floating point number that can be stored. This should also return or print the base 2 exponent for this number.

Hint

The technique for finding each of these numbers is essentially the same: choose a power of 2, and increase (or decrease) it until the largest (or smallest) number that can be represented in Python is found. The algorithms to do this will follow that described in class for finding the machine epsilon.

Note

Separate marks will be given for finding each of these numbers, for inferring the values of M, E, and B from them, and for determining if denormalization is used. The main thing is to justify the approach taken and reasoning used.

The second part of the report involves the Taylor series approximation covered in class.

Exercise 4. Plot the function $f(x) = log(1 + x)/x$ close to zero. Describe the main features of this graph and how they relate to the loss of precision when numbers of very different magnitudes are added or subtracted.

Exercise 5. Extend the Taylor series approximation for this function that was begun in class by one more term. Plot the graph from Exercise 4 again along with the Taylor series approximation on the same graph.

Exercise 6. Find out how big $|x|$ can be using this approximation while maintaining a relative error of less than 1 part in 10¹⁶.