Floating-point type
Created by Masashi Satoh | 31/01/2026
Introduction
This article reinforces the concepts covered in “Learning Data Models.”
When covering floating-point types, we will adopt the following approach:
We will treat floating-point types as extensions of the integer types studied thus far. We will not treat the mantissa and exponent parts as a single entity, but handle each of them as a separate integer.
A positive exponent means the number was shifted that many digits to the right when it was normalized into the mantissa (so the true value is the mantissa times 2 raised to the exponent), while a negative exponent means it was shifted to the left. We will represent negative numbers by storing the mantissa in two's complement. Since the sign bit is thereby integrated into the mantissa, this model deliberately has no separate sign field.
Following this approach means we will study a model that differs from the IEEE 754 standard for floating-point types. The aim is to let students enter the concept of floating-point types through a model that is as simple as possible.
After covering the basics, you can briefly touch on IEEE 754, which actual computers use, as needed. You can give a rough explanation of how it incorporates mechanisms for saving bits and improving efficiency, such as the biased exponent that lets values be compared, and thus sorted, as plain integers.
How do we perceive the world?
The true delight of learning about floating-point numbers lies in how that learning connects to philosophical epistemology. Let’s consider several examples to reflect on how we perceive the world. First, let’s limit our consideration to vision.

When we gaze at a mountain from afar, its entire form, blue and hazy, fills our field of vision. The next day, heading toward that mountain for a picnic, as we approach, the mountain looms ever larger before us, filling our field of vision. What we see then is the mountain’s vegetation and the exposed geology of its slopes. Getting even closer, the branches of individual trees, the green moss and ferns covering the ground, fallen leaves, and nuts should come into view.
Each of these elements contributes to the mountain we recognized from afar. Visiting the mountain for our picnic allowed us to enrich our understanding of it.
Let’s consider this a bit more deeply.
When viewing the mountain from afar, we could take in its entire panorama, but we couldn’t discern the individual elements that make up the mountain. As we approached the mountain, the panorama receded from view, and the constituent elements came to dominate our field of vision.
In other words, we always encounter the world through the frame of our own “field of vision,” and the scope of our perception at any given moment is limited by that frame. To expand this limited scope of perception, we move closer to or farther from phenomena, thereby gaining a broader understanding.
In this way, we wander through the world, striving to make the image of the world we hold within us more complete.
Lenses are devices that support this mode of perception using optical mechanisms.
Through mechanisms combining lenses, we can artificially enlarge or reduce our field of view without moving from the spot: devices like telescopes, binoculars, and microscopes. Zoom lenses in particular, which combine convex and concave lenses, let us appear to move closer or farther away while keeping the subject in focus.
Range Manipulation in Thought
In fact, our thoughts engage with the world of ideas in a similar manner. Let’s consider the mathematical elements here. For instance, when we contemplate a specific numerical value, the number of digits we can handle within our thoughts is limited.
We can focus our thoughts on 3.14, but unless we possess some extraordinary talent, we'd be overwhelmed by 3.141592653589793238462643383279502884197. Seeing the number 122,950,000 doesn't immediately register, but 120 million is manageable. The same applies to 0.0000000001 F (farads); we'd much rather think in terms of 100 pF (picofarads).
In this way, we routinely perform operations that zoom in on the groups of numbers we want to manipulate, bringing them onto the stage of our thinking. The floating-point type is a data model that applies this mechanism, enabling us to handle a wide range of values, from extremely small numbers to extremely large ones.
The floating-point type is composed of two integers. One is called the exponent, and the other is called the mantissa.
[ exponent | mantissa ]
The mantissa corresponds to the numbers placed on the stage of thought, representing the numerical values that are the subject of operations. As seen in the study of integer types, integers are represented by a finite sequence of bits, so the number of digits that can be contained depends on that size. This corresponds to the limitations on the size of the stage of thought or the frame of vision considered earlier.
The exponent holds an offset indicating the true magnitude of the numerical value held in the mantissa. Using a zoom lens as an analogy, the exponent corresponds to the zoom function: adjusting its value positions the subject within the viewfinder (the mantissa) at an appropriate size. Conversely, it can also be read as a value indicating the actual size of the image appearing in that small viewfinder.
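To make this model concrete, here is a minimal Python sketch (the class and its names are my own illustration; the article itself prescribes no code): a value is held as two plain integers, and the represented number is recovered as mantissa × 2^exponent, using the Earth-diameter example worked out later in this article.

    # A minimal sketch of the two-integer model described above.
    # The value represented is mantissa * 2**exponent.

    class SimpleFloat:
        def __init__(self, mantissa: int, exponent: int):
            self.mantissa = mantissa  # signed integer (two's complement in hardware)
            self.exponent = exponent  # signed integer: power of 2 applied to the mantissa

        def value(self) -> float:
            # Recover the represented value by scaling the mantissa.
            return self.mantissa * 2.0 ** self.exponent

    # 12,756 stored as 99 * 2**7, as derived later in this article:
    print(SimpleFloat(99, 7).value())   # 12672.0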
Consider an example: given the known diameter of the Earth, let’s think about the calculation to find its circumference.
How Floating-Point Numbers Work: Using Multiplication as an Example
We take the Earth's diameter to be its equatorial diameter, 12,756 km. To find the circumference from this value, we simply multiply by pi.
12,756 × 3.1415
In this way, multiplying or dividing numbers that differ greatly in magnitude and contain decimal points is a frequent occurrence in everyday life. Performing such operations using integer types is impractical. This led to the development of the floating-point type.
Let’s examine the mechanism for performing the above calculation using floating-point format. First, let’s normalize these two numbers while keeping them in decimal form. Normalization for floating-point types refers to the operation of fitting the number into the mantissa’s digit count (number of bits) so that the significant digits of the target number are maximized.
Here, let’s consider them as 4-digit decimals.
We shift the digits so that the most significant digit of each number is placed in the most significant position. This is the normalization process for floating-point types.
12,756 → 1275.6 (digits below the decimal point are truncated, giving 1275)
3.1415 → 3141.5 (digits below the decimal point are truncated, giving 3141)
By aligning the digits and setting them in memory in this manner, we reliably secure the maximum number of significant digits within the prepared memory size. At the same time, we record the number of places shifted as the exponent field; otherwise we could not scale the result back to its actual magnitude after the calculation.
12,756 → 1275 : 10^1
3.1415 → 3141 : 10^-3
To recover the original value, multiply the mantissa by the recorded power: 1275 × 10^1 = 12,750 and 3141 × 10^-3 = 3.141. Let's confirm that this is indeed the case.
Now, since 10^1 and 10^-3 are powers rather than plain integers, we cannot store them directly in the exponent part. You might ask the students how to handle this. Right: since the base is always the same, we only need to store the number written at the upper right, the exponent itself. That is why this field is called the exponent part.
exponent 1 | mantissa 1275
exponent -3 | mantissa 3141
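This decimal normalization is easy to mimic in a few lines of Python (a toy sketch under the 4-digit assumption above; the function name is my own):

    # Toy decimal normalization: fit a positive number into 4 significant
    # digits and record the power of 10 that was shifted out.

    def normalize10(x: float, digits: int = 4) -> tuple[int, int]:
        exponent = 0
        # Shift right (divide by 10) while too many digits remain.
        while x >= 10 ** digits:
            x /= 10
            exponent += 1
        # Shift left (multiply by 10) while a leading digit is still free.
        while x < 10 ** (digits - 1):
            x *= 10
            exponent -= 1
        return int(x), exponent      # the fractional part is truncated

    print(normalize10(12756))   # (1275, 1)   -> 1275 * 10**1  = 12750
    print(normalize10(3.1415))  # (3141, -3)  -> 3141 * 10**-3 = 3.141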
Now that the concept is clear, let's try this in binary. Here we'll assume both the exponent and the mantissa are 8-bit integers. Compared to a decimal digit, each binary digit carries only about three-tenths as much information, so with 8 bits we can only represent pi to about 3.1. But let's start small and try it at this level.
First, convert these two decimal numbers to binary notation. The common method for the integer part is repeated division by 2: the remainders, read from the last one back to the first, give the binary digits.
12,756 → 11 0001 1101 0100
When converting pi to binary, note that the conversion method differs for the integer part and the fractional part after the decimal point. This is because the weighting for each digit in the fractional part is the reciprocal of a power of 2.
The integer part 3 is 2 + 1, so it is 11 in binary. For the fractional part, conversely, we convert to binary by multiplying by 2: the digit that carries into the integer part becomes the next binary digit, and we work downward through the places in order. During this conversion an infinitely repeating expansion may occur; in that case the calculation of the fractional part would never end, so we truncate at an appropriate number of digits.
3.1415 → 11.0010 0100 001
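The multiply-by-2 procedure for the fractional part can be sketched in Python as follows (illustrative code; the cut-off length is passed in explicitly, since the expansion may repeat forever):

    # Convert the fractional part of a number to binary by repeated
    # doubling: the integer carry at each step is the next binary digit.

    def fraction_to_binary(frac: float, max_digits: int) -> str:
        digits = []
        for _ in range(max_digits):   # cut off: the expansion may never end
            frac *= 2
            carry = int(frac)         # 0 or 1: the next digit after the point
            digits.append(str(carry))
            frac -= carry
            if frac == 0:
                break
        return "".join(digits)

    print(fraction_to_binary(0.1415, 11))  # 00100100001 -> 3.1415 ~ 11.0010 0100 001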
For the mantissa of this floating-point model, we adjust the value to fit two's complement format. For a positive number, at most 7 bits of digits are available, and the 8th (most significant) bit is set to 0. A negative number is stored as its two's complement, whose most significant bit is naturally 1.
For the Earth's diameter, shift the digits 7 places to the right, truncating the bits shifted out, so that the value fits in 7 bits; the 8th (sign) bit is set to 0. Since we shifted right by 7, the exponent is 2^7. Conversely, pi is shifted 5 places to the left so that its leading 1 fills the 7-bit field; its exponent is 2^-5.
11 0001 1101 0100 (12,756₁₀) → 0110 0011 : 2^7 (shifted 7 places to the right)
11.0010 0100 001 (3.1415₁₀) → 0110 0100 : 2^-5 (shifted 5 places to the left)
Shifting digits is performed using the shift operation, which we briefly touched upon when learning about integers. For students who have observed mechanical shift operations, discovering that this corresponds to a mathematical operation will be a delightful experience.
exponent 0000 0111 | mantissa 0110 0011 (12,756₁₀ ≈ 99 × 2^7)
exponent 1111 1011 | mantissa 0110 0100 (3.1415₁₀ ≈ 100 × 2^-5)
This allows us to represent the approximate values of 12,756 and 3.1415 in floating-point style. To see how approximate these values are, let’s convert them back to decimal form.
Original number: 12,756 → 7-bit representation: 12,672
Original number: 3.1415 → 7-bit representation: 3.125
Well now, the accuracy loss turns out to be greater than one might expect, though it is about what 7 bits of mantissa can deliver. This is a good learning experience too.
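Both approximations can be checked with a short Python sketch that mirrors the normalization just performed: shift until the value fits in 7 bits, count the shifts in the exponent, then undo the scaling (names and structure are my own illustration):

    # Normalize a positive value into a 7-bit mantissa (leaving the top
    # bit of the 8-bit field as the 0 sign bit) and an exponent, then
    # convert back to see how much precision was lost.

    def normalize2(x: float, mantissa_bits: int = 7) -> tuple[int, int]:
        exponent = 0
        while x >= 2 ** mantissa_bits:       # too big: shift right, exponent grows
            x /= 2
            exponent += 1
        while x < 2 ** (mantissa_bits - 1):  # leading bit not yet set: shift left
            x *= 2
            exponent -= 1
        return int(x), exponent              # bits shifted out are truncated

    for original in (12756, 3.1415):
        m, e = normalize2(original)
        print(f"{original} -> {m} * 2**{e} = {m * 2.0 ** e}")
    # 12756  -> 99  * 2**7  = 12672.0
    # 3.1415 -> 100 * 2**-5 = 3.125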
Now it's finally time to calculate the circumference of the Earth. The calculation itself is simple: multiply the mantissa parts of the two floating-point numbers together; it is plain integer multiplication. Since multiplying two 8-bit numbers yields at most 16 bits, we prepare 16 bits of memory in advance.
  0110 0011
× 0110 0100
= 0010 0110 1010 1100
To determine the actual magnitude of this result, we use the two exponents. For multiplication, simply add the exponents together.
7 + (-5) = 2
This is the exponent for this result. To convert the 16-bit product back into a floating-point value with an 8-bit mantissa, we must normalize again. Since we shift it 7 digits to the right, we add 7 to the exponent, making it 9.
exponent 0000 1001 | mantissa 0100 1101 (77 × 2^9)
Converting this to decimal yields the following number.
39,424 km
Approximately 40,000 km. We’ve finally managed floating-point calculations. If any students express dissatisfaction with the precision, you could have them try the same task using double the number of bits.
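The entire multiplication can be replayed in Python (a sketch of the procedure above, assuming the 7-bit mantissa of this model; the helper name is my own):

    # Floating-point multiplication in this article's model:
    # multiply the mantissas as integers, add the exponents,
    # then renormalize the product back into 7 mantissa bits.

    def fmul(m1, e1, m2, e2, mantissa_bits=7):
        m = m1 * m2                     # at most 16 bits for two 8-bit inputs
        e = e1 + e2                     # exponents simply add
        while m >= 2 ** mantissa_bits:  # renormalize: shift right, bump exponent
            m >>= 1
            e += 1
        return m, e

    m, e = fmul(99, 7, 100, -5)   # Earth's diameter times pi
    print(m, e, m * 2 ** e)       # 77 9 39424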
In this way, by combining the mantissa and exponent parts, we can represent values ranging from very small numbers containing decimal points to enormously large numbers using only a small number of bits of memory. Isn’t that wonderful?
At the same time, let’s be sure to keep firmly in mind that this representation method models the way humans think.
Floating-point addition
Now, let’s look at floating-point addition. Unlike multiplication, addition cannot directly add the mantissas unless the exponents are the same. Conversely, this means that if you first match the exponents and then add the mantissas, you can perform the addition.
In the example of the Earth’s circumference, since we were seeking an approximate value, an estimated result was sufficient. However, in many addition cases, the number of significant digits becomes problematic.
The maximum value representable by an 8-bit integer, including sign representation, is 127. This is less than three decimal digits. No matter how wide a range the exponent part can specify, this limited number of significant digits makes it impractical for real-world use.
exponent 0000 0000 | mantissa 0111 1111 (127₁₀)
In actual implementations, even the single-precision floating-point type provides only 23 bits for the mantissa (IEEE 754). Even this falls just short of 7 full decimal digits.
exponent 0000 0000 | mantissa 0111 1111 1111 1111 1111 1111 (8,388,607₁₀)
For example, let's represent the population of Kanagawa Prefecture, 9,219,618, using this number of digits. Converting it directly to binary gives 1000 1100 1010 1110 0010 0010. Since this is 24 bits and exceeds the 23-bit mantissa, we adjust the exponent to fit it.
exponent 0000 0001 | mantissa 0100 0110 0101 0111 0001 0001 (9,219,618₁₀ = 4,609,809 × 2^1)
If this population increases by 3,000 people, let's calculate the new total.
exponent 1111 0101 | mantissa 0101 1101 1100 0000 0000 0000 (3,000₁₀ = 6,144,000 × 2^-11)
The issue here is which number’s digits to align when adding. If you try to align the larger number with the exponent of the smaller number, the higher-order digits will be lost, rendering the value meaningless. Therefore, the exponent of the smaller number is aligned with the larger number.
  exponent 0000 0001 | mantissa 0100 0110 0101 0111 0001 0001 (9,219,618₁₀)
+ exponent 0000 0001 | mantissa 0000 0000 0000 0101 1101 1100 (3,000₁₀, mantissa shifted 12 places right to align)
= exponent 0000 0001 | mantissa 0100 0110 0101 1100 1110 1101 (9,222,618₁₀)
The calculation is complete.
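A small Python sketch makes the alignment step explicit (an illustrative helper of my own, using the mantissa/exponent pairs from this example):

    # Floating-point addition in this article's model: align the smaller
    # exponent to the larger one by shifting its mantissa right, then
    # add the mantissas as plain integers.

    def fadd(m1, e1, m2, e2):
        if e1 < e2:          # make (m1, e1) the operand with the larger exponent
            m1, e1, m2, e2 = m2, e2, m1, e1
        m2 >>= (e1 - e2)     # align: low-order bits of the smaller number are lost
        return m1 + m2, e1

    m, e = fadd(4_609_809, 1, 6_144_000, -11)   # 9,219,618 + 3,000
    print(m, e, m * 2 ** e)                     # 4611309 1 9222618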
Now, suppose the previous population increases by one person. Prepare the floating-point value 1 to be added.
exponent 1110 1010 | mantissa 0100 0000 0000 0000 0000 0000 (1₁₀ = 4,194,304 × 2^-22)
Align with the exponent of the larger number.
exponent 0000 0001 | mantissa 0000 0000 0000 0000 0000 0000 (1₁₀: the lone 1 bit has been shifted out 23 places to the right)
Oh no, the 1 got pushed out of the mantissa, leaving it as zero!
  exponent 0000 0001 | mantissa 0100 0110 0101 0111 0001 0001 (9,219,618₁₀)
+ exponent 0000 0001 | mantissa 0000 0000 0000 0000 0000 0000 (0₁₀)
No matter how many times you add 1, the value will never increase. While floating-point types can handle a wide range of values, unexpected results occur if you don’t pay attention to the number of significant digits.
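The same sketch shows the absorption directly: aligning 1 (mantissa 4,194,304, exponent -22 in this model) to exponent 1 shifts its only significant bit away.

    # Adding 1 to 9,219,618 in the same model: aligning the tiny operand
    # to the large exponent shifts its single significant bit out entirely.

    def fadd(m1, e1, m2, e2):
        if e1 < e2:
            m1, e1, m2, e2 = m2, e2, m1, e1
        m2 >>= (e1 - e2)     # here: 4_194_304 >> 23 == 0, the 1 vanishes
        return m1 + m2, e1

    m, e = fadd(4_609_809, 1, 4_194_304, -22)   # 9,219,618 + 1
    print(m * 2 ** e)                           # 9219618, unchanged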
This is the difficulty that wasn’t present with integer types. With integers, the number of significant digits is clear, but with floating-point types, it tends to be ambiguous.
Whenever possible, use integer types. When you must use floating-point types, you can avoid this problem by increasing the significant digits in the mantissa. The IEEE 754 double-precision floating-point type allocates a large 52 bits for the mantissa. This provides precision of over 15 decimal digits.
exponent 0000 0000 | mantissa 0 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 (4,503,599,627,370,495₁₀ = 2^52 - 1)
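Python's built-in float is an IEEE 754 double on common platforms, so the earlier "plus one" problem disappears at this scale and only reappears beyond 53 significant bits:

    # Python's float is an IEEE 754 double (52-bit mantissa), so integers
    # up to 2**53 are represented exactly and adding 1 is not absorbed.

    population = 9_219_618.0
    print(population + 1)               # 9219619.0, the 1 survives
    print(2.0 ** 53 + 1 == 2.0 ** 53)   # True: absorption returns beyond 53 bits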
Not all numbers can be represented using floating-point types.
The introduction of the exponent part allows the floating-point type to handle a wide range of values. However, even from what we’ve seen so far, it should be clear that issues like the number of significant digits and the error arising from truncating recurring decimals during the conversion from decimal to binary cannot be avoided. The floating-point type is not a dream mechanism capable of representing any number.
A frequently cited example is the calculation 0.1 + 0.2. Due to the occurrence of a repeating decimal during the conversion from decimal to binary, an error already exists within the truncated range of digits.
  exponent 1111 0111 | mantissa 0011 0011 (0.1 → 0.099609375₁₀, aligned to exponent -9)
+ exponent 1111 0111 | mantissa 0110 0110 (0.2 → 0.19921875₁₀)
= exponent 1111 1000 | mantissa 0100 1100 (0.296875₁₀, renormalized after the sum overflowed 7 bits)
As you can see, the answer does not become 0.3. This is not a matter of significant digits; no matter how many digits are added to the mantissa, the cycle continues infinitely, making it impossible to eliminate this error.
Thus, it is crucial to treat numbers represented by floating-point types as approximations or approximate values. Particular caution is needed in programs that strictly compare calculation results for conditional branching.
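The same caution can be demonstrated in Python, whose float is such a binary floating-point type; a tolerance-based comparison such as math.isclose avoids the strict-equality trap:

    import math

    # 0.1 and 0.2 are repeating fractions in binary, so their IEEE 754
    # doubles carry a small error and the strict comparison fails.
    print(0.1 + 0.2 == 0.3)               # False
    print(0.1 + 0.2)                      # 0.30000000000000004

    # Compare floating-point results with a tolerance instead.
    print(math.isclose(0.1 + 0.2, 0.3))   # True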
Moreover, this principle applies to all information processed by computers. Computers present us with a papier-mâché model, crafted from mere surface-level slices of world events on finite memory, as if it were the absolute truth. We must ask ourselves why people swallow this whole.
We have seen that errors frequently occur when converting between the decimal notation humans commonly use and the binary notation used internally by computers. To avoid this inconvenience, the Binary-Coded Decimal (BCD) format was devised.
This format allows decimal numbers to be handled without converting them to binary: the values 0 through 9 are each represented by a 4-bit binary code, with a carry into the next higher 4-bit group when a digit reaches 10. Performing all calculations consistently in this format prevents the errors described above, which makes it widely used in applications such as handling money.
However, this format has the drawback of making operations more complex and time-consuming. Furthermore, it shares with ordinary binary notation the limitation that certain numbers (1/3, for instance) cannot be represented exactly.
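A toy sketch of the encoding in Python (illustrative only; real BCD arithmetic operates digit by digit in hardware, commonly with an add-6 correction on carries):

    # Toy BCD: store one decimal digit per 4-bit group. Encoding and
    # decoding never convert the number to pure binary, so decimal
    # values round-trip without error.

    def to_bcd(n: int) -> int:
        bcd, shift = 0, 0
        while True:
            bcd |= (n % 10) << shift   # one decimal digit per 4 bits
            n //= 10
            shift += 4
            if n == 0:
                return bcd

    def from_bcd(bcd: int) -> int:
        n, place = 0, 1
        while bcd:
            n += (bcd & 0xF) * place   # read back 4 bits per decimal digit
            bcd >>= 4
            place *= 10
        return n

    print(hex(to_bcd(1234)))        # 0x1234, the digits are visible in hex
    print(from_bcd(to_bcd(1234)))   # 1234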
In closing
We have now examined the floating-point type.
In this way, by modeling world phenomena as sets of parameters that compose them, we can handle more complex phenomena. Simple examples include representing the three primary colors of light as sets of red, green, and blue luminance values, or examples like the string type covered in the next section.
Objects in programming languages and records in relational databases are also expansions of this approach.
Using floating-point types as a starting point, it would be beneficial to explore with students how various real-world phenomena can be represented using bit sequences in memory. From there, the exploration of finding algorithms to manipulate these data models and derive desired results will naturally emerge.