Floating-point Format

CS220 April 11, 2007

Floating-Point Format
  • Scientific Notation
    – Coefficient/mantissa, exponent
    – Decimal
      • Example: 2.429843 × 10^5, 7.3434 × 10^-3
    – Binary
      • Example: 1.0111 × 2^2 => 101.11 (in binary)
        = 2^2 × 1 + 2^1 × 0 + 2^0 × 1 + 2^-1 × 1 + 2^-2 × 1
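The binary example can be checked by expanding a binary numeral digit by digit. This is a minimal Python sketch; the helper name `binary_to_float` is hypothetical and exists only for illustration.

```python
def binary_to_float(digits: str) -> float:
    """Evaluate a binary numeral such as '101.11' by summing bit * 2**position."""
    whole, _, frac = digits.partition(".")
    value = 0.0
    # Integer part: the rightmost bit is worth 2**0, the next 2**1, and so on.
    for power, bit in enumerate(reversed(whole)):
        value += int(bit) * 2.0 ** power
    # Fractional part: the first bit after the point is worth 2**-1.
    for power, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -power
    return value

# 1.0111 x 2^2 moves the binary point two places right, giving 101.11.
print(binary_to_float("1.0111") * 2 ** 2)  # 5.75
print(binary_to_float("101.11"))           # 5.75
```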

IEEE 754 Floating-Point Format
  • Components
    – Sign (1 negative, 0 positive)
    – Significand/Coefficient/Mantissa/Fraction
      • Normalized or Denormalized
    – Exponent (positive unsigned, biased)

Exponent Bias
  • The stored value of the exponent is offset from the actual value
  • Two's complement would make comparison harder
  • The exponent is adjusted to fall within an unsigned range suitable for comparison, biased by 2^(e-1) - 1 (here e is the size of the exponent field in bits)
  • For single precision, an exponent in the range -126 to +127 is biased by adding 127 to get a value in the range 1 to 254
    – 0 is reserved for denormalized numbers or zero
    – 255 is reserved for infinity or NaN
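The bias rule can be written out directly. The sketch below assumes only the formula 2^(e-1) - 1 and the reserved field values stated above; the function names `bias` and `stored_exponent` are hypothetical.

```python
def bias(exponent_bits: int) -> int:
    """Bias for an exponent field of the given width: 2**(e-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1

def stored_exponent(actual: int, exponent_bits: int) -> int:
    """Return the biased (stored) field for a normalized number's exponent."""
    field = actual + bias(exponent_bits)
    # Field values 0 and all-ones are reserved (zero/denormal, infinity/NaN).
    if not 1 <= field <= 2 ** exponent_bits - 2:
        raise ValueError("exponent outside the normalized range")
    return field

print(bias(8), bias(11))         # 127 1023
print(stored_exponent(-126, 8))  # 1
print(stored_exponent(127, 8))   # 254
```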

Comparison

  Bit pattern   Unsigned   One's complement   Two's complement   Biased
  11111111      255        -0                 -1                 128
  10000000      128        -127               -128               1
  01111111      127        127                127                0
  00000000      0          +0                 0                  -127
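The four readings of an 8-bit pattern can be computed rather than memorized. This is an illustrative sketch (the function name `interpretations` is hypothetical, and the biased column assumes a bias of 127). It also shows why the biased form suits comparison: sorting the patterns as plain unsigned integers gives the same order as sorting by the exponent they represent.

```python
def interpretations(pattern: str) -> dict:
    """Read an 8-bit pattern as unsigned, one's/two's complement, and biased."""
    u = int(pattern, 2)
    negative = pattern[0] == "1"
    return {
        "unsigned": u,
        # One's complement: a negative value is the bitwise inverse of its
        # magnitude. Python integers have no -0, so 11111111 reads as 0.
        "ones": -(u ^ 0xFF) if negative else u,
        # Two's complement: subtract 2**8 when the top bit is set.
        "twos": u - 256 if negative else u,
        # Biased (excess-127): subtract the bias from the unsigned reading.
        "biased": u - 127,
    }

patterns = ["11111111", "10000000", "01111111", "00000000"]
for p in patterns:
    print(p, interpretations(p))

by_unsigned = sorted(patterns, key=lambda p: int(p, 2))
by_biased = sorted(patterns, key=lambda p: interpretations(p)["biased"])
by_twos = sorted(patterns, key=lambda p: interpretations(p)["twos"])
print(by_unsigned == by_biased)  # True
print(by_unsigned == by_twos)    # False
```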

Precision
  • 32 bits, Single-Precision (1,8,23)
    – (1.18 × 10^-38 to 3.40 × 10^38)
  • 64 bits, Double-Precision (1,11,52)
    – (2.23 × 10^-308 to 1.79 × 10^308)
  • 80 bits, Double-Extended-Precision (1,15,64)
    – Intel format, not IEEE standard
    – (3.37 × 10^-4932 to 1.18 × 10^4932)

Single Precision
  • Exponent is biased by 2^(8-1) - 1 = 127
    – Represents -126 to 127
    – Example: with sign bit 0, exponent field 01111100 (124, so the exponent is 124 - 127 = -3) and fraction field 01000000000000000000000, the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2^-3, which is +0.15625.
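The +0.15625 example can be reproduced by packing the value as a 32-bit float and splitting the bits. This sketch uses Python's standard `struct` module; `decode_single` is a hypothetical helper name.

```python
import struct

def decode_single(x: float):
    """Split x, stored as IEEE 754 single precision, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                    # 1 bit
    exponent_field = (bits >> 23) & 0xFF  # 8 bits, still biased
    fraction = bits & 0x7FFFFF            # 23 bits
    return sign, exponent_field, fraction

sign, exp_field, fraction = decode_single(0.15625)
print(sign, exp_field - 127, format(fraction, "023b"))
# 0 -3 01000000000000000000000

# Rebuild the value: (1 + fraction / 2^23) x 2^(exponent field - 127)
print((1 + fraction / 2 ** 23) * 2.0 ** (exp_field - 127))  # 0.15625
```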

Single Precision Number Ranges
  • The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
    – ±2^-149 ≈ ±1.4012985 × 10^-45
  • The smallest non-zero positive and largest non-zero negative normalized numbers (represented with the binary value 1 in the Exp field and 0 in the Fraction field) are
    – ±2^-126 ≈ ±1.175494351 × 10^-38
  • The largest finite positive and smallest finite negative numbers (represented by the value with 254 in the Exp field and all 1s in the Fraction field) are
    – ±(2^128 - 2^104) ≈ ±3.4028235 × 10^38
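These three limits can be confirmed by building the bit patterns described above and reinterpreting them as single-precision values. A small sketch with the standard `struct` module; `single_from_bits` is a hypothetical helper name.

```python
import struct

def single_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_denormal = single_from_bits(0x00000001)  # Exp = 0,   Fraction = 1
smallest_normal = single_from_bits(0x00800000)    # Exp = 1,   Fraction = 0
largest_finite = single_from_bits(0x7F7FFFFF)     # Exp = 254, Fraction = all 1s

print(smallest_denormal == 2.0 ** -149)           # True
print(smallest_normal == 2.0 ** -126)             # True
print(largest_finite == 2.0 ** 128 - 2.0 ** 104)  # True
```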

Example
  • Encode the decimal number -118.625 using the IEEE 754 system
    – First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".
    – Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.
    – Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2^6. This is a normalized floating point number. The mantissa is the part at the right of the radix point, filled with 0s on the right until we get all 23 bits. That is 11011010100000000000000.
    – The exponent is 6, but we need to bias it and convert it to binary (so that every stored exponent is a non-negative binary number). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.
    – Putting the fields together (sign, exponent, fraction) gives 1 10000101 11011010100000000000000.
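The hand encoding of -118.625 can be cross-checked in two directions: assemble the fields from the steps above and decode them, or let the standard `struct` module encode the value and print the fields. The helper name `encode_single` is hypothetical.

```python
import struct

def encode_single(x: float) -> str:
    """Return 'sign exponent fraction' bit strings for x in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(bits, "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"

# Fields worked out by hand in the example.
sign = 1
exponent_field = 6 + 127                       # 133 = 10000101
fraction = int("11011010100000000000000", 2)   # 23 bits after the point
bits = (sign << 31) | (exponent_field << 23) | fraction

print(hex(bits))                                          # 0xc2ed4000
print(struct.unpack(">f", struct.pack(">I", bits))[0])    # -118.625
print(encode_single(-118.625))
# 1 10000101 11011010100000000000000
```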

Double Precision
  • Exponent is biased by 2^(11-1) - 1 = 1023
    – Represents -1022 to 1023

Double Precision Number Ranges
  • The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
    – ±2^-1074 ≈ ±5 × 10^-324
  • The smallest non-zero positive and largest non-zero negative normalized numbers (represented by the value with the binary value 1 in the Exp field and 0 in the Fraction field) are
    – ±2^-1022 ≈ ±2.2250738585072014 × 10^-308
  • The largest finite positive and smallest finite negative numbers (represented by the value with 2046 in the Exp field and all 1s in the Fraction field) are
    – ±(2^1024 - 2^971) ≈ ±1.7976931348623157 × 10^308
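The double-precision limits can be inspected directly, assuming (as on all mainstream platforms) that Python's `float` is an IEEE 754 double. The sketch uses only `math.ldexp` and `sys.float_info` from the standard library.

```python
import math
import sys

smallest_denormal = math.ldexp(1.0, -1074)  # 1.0 x 2^-1074
smallest_normal = math.ldexp(1.0, -1022)    # 1.0 x 2^-1022
largest_finite = sys.float_info.max

print(smallest_denormal)                      # 5e-324
print(smallest_normal)                        # 2.2250738585072014e-308
print(smallest_normal == sys.float_info.min)  # True
print(largest_finite)                         # 1.7976931348623157e+308
# 2^1024 - 2^971, computed with exact integers before converting to float.
print(largest_finite == float(2 ** 1024 - 2 ** 971))  # True
```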

Special Cases
  • Zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero.
  • If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (-1)^s × 0.f × 2^-1022. From this you can interpret zero as a special type of denormalized number.
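The two formulas (implicit leading 1 for normalized numbers, none for denormalized numbers) can be compared against the hardware interpretation. This is a sketch for finite single-precision values only; `single_from_fields` and `single_from_bits` are hypothetical helper names.

```python
import struct

def single_from_fields(sign: int, exponent_field: int, fraction: int) -> float:
    """Evaluate single-precision fields by formula (finite values only)."""
    if exponent_field == 0:
        # Zero or denormalized: no implicit leading 1, exponent fixed at -126.
        significand = fraction / 2 ** 23
        exponent = -126
    else:
        # Normalized: implicit leading 1, exponent field minus the bias.
        significand = 1 + fraction / 2 ** 23
        exponent = exponent_field - 127
    return (-1) ** sign * significand * 2.0 ** exponent

def single_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Denormalized: fraction 0.1 (binary) -> 0.5 x 2^-126 = 2^-127
print(single_from_fields(0, 0, 0x400000) == single_from_bits(0x00400000))  # True
# Zero falls out of the same formula with a zero fraction.
print(single_from_fields(0, 0, 0))  # 0.0
```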

Special Cases (cont.)
  • The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations.
  • The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction.

Special Cases Summary

  Type           Exponent       Mantissa
  Zeroes         0              0 (Positive/Negative Zero depends on sign)
  Denormalized   0              non zero
  Normalized     1 to 2^e - 2   any
  Infinities     2^e - 1        0 (Positive/Negative Infinity depends on sign)
  NaNs           2^e - 1        non zero

  (here e is the size of the exponent field in bits)
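The summary table translates into a short classifier. The sketch below handles single precision (e = 8, so 2^e - 1 = 255); `classify_single` is a hypothetical name.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern using the summary table."""
    exponent_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent_field == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent_field == 0xFF:  # 2^e - 1 with e = 8
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"         # exponent field 1 to 2^e - 2

for bits in (0x00000000, 0x80000000, 0x00000001,
             0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{bits:#010x}", classify_single(bits))
```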

Special Operations
  • Overflow: exponent too large, producing an infinity.
  • Underflow: exponent too small, producing a denorm or zero.
  • Zerodivide: nonzero number is divided by zero, producing an infinity of the appropriate sign.
  • Operand Error: such as division of zero by zero, or taking the square root of -1, producing a NaN.

Special Operations

  Operation               Result
  n ÷ ±Infinity           0
  ±Infinity × ±Infinity   ±Infinity
  ±nonzero ÷ 0            ±Infinity
  Infinity + Infinity     Infinity
  ±0 ÷ ±0                 NaN
  Infinity - Infinity     NaN
  ±Infinity ÷ ±Infinity   NaN
  ±Infinity × 0           NaN
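Most rows of this table can be tried out with ordinary floating-point arithmetic. The sketch below uses Python floats (assumed to be IEEE 754 doubles). One caveat: Python itself raises `ZeroDivisionError` for division by zero instead of returning infinity or NaN, so those two rows are shown as a trapped error rather than a result.

```python
import math

inf = math.inf

print(1.0 / inf)              # 0.0   (n / Infinity)
print(inf * -inf)             # -inf  (Infinity x Infinity, sign from operands)
print(inf + inf)              # inf
print(math.isnan(inf - inf))  # True
print(math.isnan(inf / inf))  # True
print(math.isnan(inf * 0.0))  # True

# Python traps division by zero instead of producing Infinity or NaN.
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print("trapped:", exc)
```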

FPU
  • Coprocessor: supplements the functions of the primary processor.
  • Coprocessor examples: floating point arithmetic, graphics, signal processing, string processing, or encryption.
  • FPU registers: eight 80-bit data registers, three 16-bit registers (control, status, and tag)

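FPU Register Stack
  • Circular; the top of the stack is recorded in the status register (bits 11-13)
  • Registers are addressed relative to the top as %st(n), with %st(0) being the top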

Load and Store
  • finit: initializes the FPU, sets control and status registers to default values
  • flds/fldl/fldt: loads a floating point number from memory onto the FPU register stack
    – S: single precision
    – L: double precision
    – T: Intel double-extended-precision
  • fsts/fstl: copies the top value on the FPU register stack to a memory location without popping it (the 80-bit store exists only in the popping form, fstpt)
  • Example:
      finit
      flds value1
      fsts -4(%ebp)

Preset Values

  Instruction   Action
  FLD1          Push 1.0
  FLDL2T        Push log2(10)
  FLDL2E        Push log2(e)
  FLDPI         Push Pi
  FLDLG2        Push log10(2)
  FLDLN2        Push ln(2) (loge(2))
  FLDZ          Push 0.0

FPU register contents (debugger output) as the preset values are pushed:

Initial state
    R7-R1: Empty 0x00000000000000000000
  =>R0:    Empty 0x00000000000000000000
    st0-st7: 0 (raw 0x00000000000000000000)

fld1
  =>R7:    Valid 0x3fff8000000000000000 +1
    R6-R0: Empty 0x00000000000000000000
    st0:     1 (raw 0x3fff8000000000000000)
    st1-st7: 0 (raw 0x00000000000000000000)

fldpi
    R7:    Valid 0x3fff8000000000000000 +1
  =>R6:    Valid 0x4000c90fdaa22168c235 +3.141592653589793239
    R5-R0: Empty 0x00000000000000000000
    st0:     3.1415926535897932385128089594061862 (raw 0x4000c90fdaa22168c235)
    st1:     1 (raw 0x3fff8000000000000000)
    st2-st7: 0 (raw 0x00000000000000000000)

fldz
    R7:    Valid 0x3fff8000000000000000 +1
    R6:    Valid 0x4000c90fdaa22168c235 +3.141592653589793239
  =>R5:    Zero  0x00000000000000000000 +0
    R4-R0: Empty 0x00000000000000000000
    st0:     0 (raw 0x00000000000000000000)
    st1:     3.1415926535897932385128089594061862 (raw 0x4000c90fdaa22168c235)
    st2:     1 (raw 0x3fff8000000000000000)
    st3-st7: 0 (raw 0x00000000000000000000)
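The raw 80-bit values in the dump can be decoded by hand using the double-extended layout (1 sign bit, 15 exponent bits, 64 significand bits). The sketch below assumes the x87 convention that the significand's top bit is an explicit integer bit and that the bias is 2^14 - 1 = 16383; `decode_extended` is a hypothetical helper and covers finite values only.

```python
import math
from fractions import Fraction

def decode_extended(raw: int) -> Fraction:
    """Decode an 80-bit double-extended pattern exactly (finite values only)."""
    sign = raw >> 79
    exponent_field = (raw >> 64) & 0x7FFF
    significand = raw & 0xFFFFFFFFFFFFFFFF  # 64 bits, explicit integer bit
    # An exponent field of 0 (zero/denormal) uses the same exponent as field 1.
    exponent = (exponent_field if exponent_field else 1) - 16383
    # The significand holds 1 integer bit and 63 fraction bits.
    return (-1) ** sign * Fraction(significand, 2 ** 63) * Fraction(2) ** exponent

print(decode_extended(0x3FFF8000000000000000))         # 1
print(float(decode_extended(0x4000C90FDAA22168C235)))  # 3.141592653589793
print(decode_extended(0x00000000000000000000))         # 0
```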

Status Register
  • Indicates the operating condition of the FPU

  Status Bit   Description
  0            Invalid operation exception flag
  1            Denormalized operand exception flag
  2            Zero divide exception flag
  3            Overflow exception flag
  4            Underflow exception flag
  5            Precision exception flag
  6            Stack fault
  7            Error summary status
  8            Condition code bit 0 (C0)
  9            Condition code bit 1 (C1)
  10           Condition code bit 2 (C2)
  11-13        Top of stack pointer
  14           Condition code bit 3 (C3)
  15           FPU busy flag

  • fstsw oldvalue stores the status word to memory. The x87 instruction set has no fldsw; the status word is reloaded only as part of the FPU environment (fldenv, frstor).
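A status word saved with fstsw can be unpacked field by field using the bit positions in the table. This is an illustrative sketch; `decode_status_word` is a hypothetical helper, and the sample value 0x3800 corresponds to a top-of-stack pointer of 7 with no flags set (the state after finit followed by a single push).

```python
EXCEPTION_FLAGS = ["invalid operation", "denormalized operand", "zero divide",
                   "overflow", "underflow", "precision"]

def decode_status_word(sw: int) -> dict:
    """Split a 16-bit x87 status word into the fields listed in the table."""
    return {
        # Bits 0-5: one flag per exception type.
        "exceptions": [name for bit, name in enumerate(EXCEPTION_FLAGS)
                       if (sw >> bit) & 1],
        "stack_fault": bool((sw >> 6) & 1),
        "error_summary": bool((sw >> 7) & 1),
        # Condition codes C0-C2 sit in bits 8-10, C3 in bit 14.
        "condition_codes": {"C0": (sw >> 8) & 1, "C1": (sw >> 9) & 1,
                            "C2": (sw >> 10) & 1, "C3": (sw >> 14) & 1},
        "top": (sw >> 11) & 0b111,  # bits 11-13
        "busy": bool((sw >> 15) & 1),
    }

print(decode_status_word(0x3800)["top"])         # 7
print(decode_status_word(0x0004)["exceptions"])  # ['zero divide']
```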

Control Register
  • Controls the FPU functions, such as calculation precision and rounding method

  Control Bit   Description
  0             Invalid operation exception mask
  1             Denormal operand exception mask
  2             Zero divide exception mask
  3             Overflow exception mask
  4             Underflow exception mask
  5             Precision exception mask
  6-7           Reserved
  8-9           Precision control
  10-11         Rounding control
  12            Infinity control
  13-15         Reserved

  • fstcw oldvalue stores the control word to memory
  • fldcw newvalue loads the control word from memory

Control Register
  • Precision Control
    – 00 -- single-precision (24-bit significand)
    – 01 -- not used
    – 10 -- double-precision (53-bit significand)
    – 11 -- double-extended-precision (64-bit significand)
  • Rounding Control
    – 00 -- round to nearest
    – 01 -- round down (toward negative infinity)
    – 10 -- round up (toward positive infinity)
    – 11 -- round toward zero
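The precision and rounding fields can be read from, or patched into, a control word with simple masks. The sketch below is illustrative: the helper names are hypothetical, and 0x037F is used as the sample because it is the control word finit installs (all exceptions masked, double-extended precision, round to nearest). A patched value like the one computed here is what a program would store in memory before executing fldcw.

```python
PRECISION = {0b00: "single (24-bit significand)",
             0b01: "not used",
             0b10: "double (53-bit significand)",
             0b11: "double-extended (64-bit significand)"}
ROUNDING = {0b00: "round to nearest",
            0b01: "round down (toward negative infinity)",
            0b10: "round up (toward positive infinity)",
            0b11: "round toward zero"}

def decode_control_word(cw: int) -> dict:
    """Extract the exception masks, precision and rounding fields."""
    return {
        "exception_masks": cw & 0x3F,           # bits 0-5
        "precision": PRECISION[(cw >> 8) & 3],  # bits 8-9
        "rounding": ROUNDING[(cw >> 10) & 3],   # bits 10-11
    }

def set_rounding(cw: int, mode: int) -> int:
    """Return cw with the rounding-control field (bits 10-11) replaced."""
    return (cw & ~(0b11 << 10)) | ((mode & 0b11) << 10)

print(decode_control_word(0x037F))
print(hex(set_rounding(0x037F, 0b11)))  # 0xf7f
```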

Tag Register
  • Identifies the values within the eight 80-bit FPU data registers (2 bits per register)
    – A valid double-extended-precision value (code 00)
    – A zero value (code 01)
    – A special floating-point value (code 10)
    – Nothing (empty) (code 11)
  • The x87 instruction set has no fsttw/fldtw instructions; the tag word is saved and restored as part of the FPU environment (fstenv/fldenv, fsave/frstor)
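The 2-bit codes can be unpacked from a 16-bit tag word as follows. This sketch assumes physical register Rn occupies bits 2n and 2n+1 (R0 in the lowest bits); `decode_tag_word` is a hypothetical helper. The sample value 0x07FF describes a stack holding two valid values in R7 and R6 and a zero in R5.

```python
TAGS = {0b00: "Valid", 0b01: "Zero", 0b10: "Special", 0b11: "Empty"}

def decode_tag_word(tw: int) -> dict:
    """Map each physical register R0-R7 to its tag (2 bits per register)."""
    return {f"R{n}": TAGS[(tw >> (2 * n)) & 0b11] for n in range(8)}

print(decode_tag_word(0xFFFF))  # every register Empty
print(decode_tag_word(0x07FF))  # R7 Valid, R6 Valid, R5 Zero, R4-R0 Empty
```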
