Title: Number Representation: Fixed and Floating Point
1Number Representation: Fixed and Floating Point
- No Method Capable of Representing ALL Real Numbers Using Finite Register Lengths
- Must Use Approximations to Represent Values
- Concentrate on Two Forms
- Fixed Point
- Floating Point
- Others Include
- Rational Number Systems: use ratios of integers
- Logarithmic Number Systems: use signs and logarithms of values
2Fixed Versus Floating Point
- Fixed Point Values Represent Values where Any Two Consecutive Values Differ by 1 unit in the last place (ulp)
- Equal Spacing Between Numbers
- Floating Point Values Use Two Multi-Bit Words
- Mantissa
- Exponent
- Both Forms Must be Capable of Representing Signed Quantities
- Fixed Point Values CAN be Used to Represent Fractional Quantities
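A minimal sketch of signed fixed point with fractional values, assuming a hypothetical 8-bit two's complement word with 4 fractional bits. The names `FRAC_BITS`, `WORD_BITS`, `to_fixed` and `from_fixed` are illustrative and not from the slides.

```python
FRAC_BITS = 4                   # assumed number of fractional bits
WORD_BITS = 8                   # assumed register length
ULP = 1 / (1 << FRAC_BITS)      # 1 unit in the last place = 2**-4

def to_fixed(x):
    """Quantize x to the nearest representable code (a signed integer)."""
    code = round(x * (1 << FRAC_BITS))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    if not lo <= code <= hi:
        raise OverflowError("value outside fixed-point range")
    return code

def from_fixed(code):
    """Interpret a signed integer code as a fixed-point value."""
    return code * ULP

# Every pair of consecutive codes is exactly one ulp apart.
values = [from_fixed(c) for c in range(-128, 128)]
gaps = {b - a for a, b in zip(values, values[1:])}
```

Under these assumptions `gaps` contains the single value `ULP`, which is the equal-spacing property of fixed point.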
3Floating Point Characteristics
- Total Number of Representations is the Total Number of Bit Strings
- For an n-bit Register we have $2^n$
- Range of Values is Larger than Fixed Point
- Precision of Values is Smaller
- Distance Between Two Consecutive Values Increases with Magnitude
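The growing distance between neighbours can be observed directly. This is a sketch that assumes Python floats are 64-bit IEEE doubles (true on mainstream platforms); `next_up` and `spacing` are illustrative helper names.

```python
import struct

def next_up(x):
    """Next representable double above a positive, finite x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits + 1))[0]

def spacing(x):
    """Distance from x to the next larger representable double."""
    return next_up(x) - x

# Spacing just above 2**0, 2**10 and 2**20.
gaps = [spacing(2.0 ** k) for k in (0, 10, 20)]

# A 64-bit register still has only 2**64 distinct bit strings.
total_bit_strings = 2 ** 64
```

Each factor of 2 in magnitude doubles the gap, so the same number of bit strings is spread over a much wider range than in fixed point.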
4Floating Point
s | e | m
- s: Sign Bit (signed magnitude)
- e: Exponent (in 2's Complement Form)
- m: Mantissa (significand or fraction), $m \in [0, 1)$, $m_{MAX} = 1 - \mathrm{ulp}$
- hidden bit
- float: BIAS = 127 (32 bits: 23 for m and 8 for e)
- double: BIAS = 1023 (64 bits: 52 for m and 11 for e)
- Sign of Exponent is Complement of its MSb
- Thus, adding/subtracting bias is just complementation of MSb (exact for a bias of $2^{k-1}$ with a k-bit exponent; the biases 127 and 1023 are $2^{k-1} - 1$)
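A minimal sketch of how the s, e, m fields, the hidden bit and the bias fit together for a 32-bit float. It assumes Python's `struct` module to reach the bit pattern and reads e as the stored (biased) exponent field; `fields32` and `value32` are illustrative names, not from the slides.

```python
import struct

BIAS = 127  # float (single precision) bias

def fields32(x):
    """Return (s, e, m) for x stored as a 32-bit float.
    e is the stored (biased) exponent, m the fraction in [0, 1)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = (bits & 0x7FFFFF) / (1 << 23)
    return s, e, m

def value32(s, e, m):
    """Rebuild a normalized value; the hidden bit supplies the leading 1."""
    return (-1) ** s * (1 + m) * 2.0 ** (e - BIAS)

def exponent_msb(e):
    """Most significant bit of the 8-bit stored exponent."""
    return (e >> 7) & 1
```

For example, `fields32(-0.75)` gives `(1, 126, 0.5)`: the true exponent is 126 - 127 = -1 and the MSb of the stored exponent is 0, while `fields32(6.5)` gives a stored exponent of 129 with MSb 1 for the positive true exponent 2.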
5Floating Point Example
double: 00000000 bfe80000 (Big Endian; MSW has Higher Address)
s | e | m
1 | 011 1111 1110 | 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
- s = 1, e = 1022, m = 0.5
- Value = $(-1)^1 \times 1.5 \times 2^{(1022-1023)}$
- Value = $-(1.5)(0.5) = -0.75$
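The worked example can be checked mechanically. This sketch assumes the two 32-bit words are combined most significant word first, and compares the field-by-field formula with the value the machine itself decodes.

```python
import struct

# The two 32-bit words of the example, most significant word first.
msw, lsw = 0xBFE80000, 0x00000000
bits = (msw << 32) | lsw

s = bits >> 63                               # sign bit
e = (bits >> 52) & 0x7FF                     # 11-bit stored exponent
m = (bits & ((1 << 52) - 1)) / (1 << 52)     # 52-bit fraction in [0, 1)

# Value from the formula, with the hidden bit added to m.
by_formula = (-1) ** s * (1 + m) * 2.0 ** (e - 1023)

# Value as decoded by the platform from the same 8 bytes.
by_hardware = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
```

Both `by_formula` and `by_hardware` come out as -0.75, with s = 1, e = 1022 and m = 0.5.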
6Floating Point Normalization
- Redundant Representations are Possible!
- Hidden Bit Helps
- Out of All Possible Representations, Choose the One With Fewest Leading Zeros in Significand
- This is Normalization
- After Performing Arithmetic, Renormalization May Need to be Performed
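A minimal sketch of normalization on an integer significand, assuming a hypothetical 8-bit significand width `P` and no hidden bit. The function name `normalize` and the convention chosen for zero are assumptions made for illustration.

```python
P = 8  # assumed significand width in bits

def normalize(sig, exp):
    """Shift sig left until its top bit (bit P-1) is 1, decrementing exp
    each time so that sig * 2**exp is unchanged.
    sig is an unsigned integer below 2**P."""
    if sig == 0:
        return 0, 0          # zero has no normalized form; convention assumed here
    while sig >> (P - 1) == 0:
        sig <<= 1
        exp -= 1
    return sig, exp

# Three redundant representations of the same value, 24.
redundant = [(0b00011000, 0), (0b00110000, -1), (0b01100000, -2)]
normalized = {normalize(sig, exp) for sig, exp in redundant}
```

All three inputs collapse to the single pair `(0b11000000, -3)`, the representation with no leading zeros.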
7Floating Point Special Numbers
Value v when exponent e and fraction f are special values (IEEE standard)
Note: NaN = Not a Number
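As a sketch of how the special exponent and fraction patterns are told apart, the following classifies a 64-bit double by its e and f fields following the IEEE 754 encoding rules. The function name `classify` and the returned labels are illustrative.

```python
import struct

def classify(x):
    """Classify a double by its stored exponent e and fraction f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    if e == 0x7FF:                       # exponent all ones
        return "NaN" if f else "infinity"
    if e == 0:                           # exponent all zeros
        return "denormal" if f else "zero"
    return "normal"
```

The sign bit is ignored here, so `-0.0` classifies as `"zero"` and `float("-inf")` as `"infinity"`.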
8IEEE/ANSI 754/854 Standard
9Denormalized Numbers
- Allows for Gradual Degradation for Underflow
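Gradual underflow can be observed by halving the smallest normalized double until it reaches zero. This sketch assumes IEEE doubles with denormals enabled (no flush-to-zero), which is the default behaviour of Python floats on mainstream platforms.

```python
import math
import sys

smallest_normal = sys.float_info.min     # 2**-1022 for IEEE doubles

# Halve repeatedly: the value passes through the denormals,
# losing one bit of precision per step, before reaching zero.
x = smallest_normal
steps = 0
while x > 0.0:
    x /= 2.0
    steps += 1

# Two distinct values near the underflow threshold still have a
# nonzero difference, because the result is a denormal.
a = 1.5 * smallest_normal
b = smallest_normal
diff = a - b
```

Without denormals the first halving would already give zero, and `a - b` would underflow to zero even though `a != b`.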
10Denormals
11Operations Internal Precision
12Floating Point Addition/Subtraction
13Floating Point Multiplication/Division
14Conversions and Roundings
15Exceptions
16Rounding Schemes
Signed Magnitude
Two's Complement
17Round to Nearest (Signed Magnitude)
18Rounding Comments
19Round to Nearest Even/Odd
Round to Nearest Even
Round to Nearest Odd (R)
20Jamming/von Neumann Rounding
21ROM Rounding
22Rounding
23Rounding Examples
Round Towards
Downward Directed Rounding
24Floating Point Operations
25Adders/Subtractors
26Operand Packing/Unpacking
27Other Key Parts of FP Add/Sub Unit
28Pre-Shifting
29Four-stage Combinational Shifter
Pre-shifts Operand by 0 to 15 Bits
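A behavioural sketch of a four-stage shifter covering shifts of 0 to 15 bits. It assumes a right shift (as used when aligning significands), a 16-bit operand, and stages with fixed distances 8, 4, 2 and 1; the slide does not specify these details, and `barrel_shift_right` is an illustrative name.

```python
def barrel_shift_right(operand, amount, width=16):
    """Right-shift operand by 0..15 bits using four fixed stages.
    Each stage either passes its input through or shifts it by its
    fixed distance, selected by one bit of the 4-bit shift amount."""
    if not 0 <= amount <= 15:
        raise ValueError("amount must fit in 4 bits")
    value = operand & ((1 << width) - 1)
    for stage in (3, 2, 1, 0):           # distances 8, 4, 2, 1
        if (amount >> stage) & 1:
            value >>= (1 << stage)
    return value
```

Because every shift amount from 0 to 15 is a sum of distinct powers of two, four stages suffice, and the delay is four multiplexer levels regardless of the amount.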
30Leading Zeros/Ones Counting vs. Prediction
31Leading Zeros Prediction
32Guard Digits
- What is the smallest number of extra digits needed for rounding? For post-normalization?
- Multiplication: Double Length Result
- Add/Sub w/ differing exp.: Can have Double Length Result
- FP Unit Provides Single Length Result
33Significand Ranges
- Assume Significand $M \in [0, 1 - \mathrm{ulp}]$
- Then Normalized M ranges as
- For postnormalization need at most one shift left to get
34Significand Ranges (cont)
- Need at most one shift right to get
- Conclusion
- 1 Extra Digit Needed for Postnormalization
- 1 Extra Digit Needed for Round-to-Nearest
- 2 Extra Digits Needed
- G - guard
- R - round
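One way to see where the single-shift bounds come from is the following derivation. It assumes radix 2 and normalized fractional significands, and it takes the left-shift case to be a product and the right-shift case to be a same-sign sum after an alignment shift of $d \ge 0$ bits; these are assumptions, since the slides may use a general radix or other operations.

```latex
\begin{align*}
\text{Normalized operands:}\quad & \tfrac12 \le M_1, M_2 \le 1 - \mathrm{ulp} \\
\text{Product:}\quad & \tfrac14 \le M_1 M_2 < 1
  && \Rightarrow\ \text{at most one left shift restores } M \ge \tfrac12 \\
\text{Same-sign sum:}\quad & \tfrac12 \le M_1 + 2^{-d} M_2 < 2
  && \Rightarrow\ \text{at most one right shift restores } M < 1
\end{align*}
```

The digit shifted in on a left shift is the guard digit G, and the next digit below the rounding position is the round digit R.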
35Sticky Bit in std754
- Round-to-Nearest-Even Requires 1 Extra Bit
- The sticky bit, S
- Turns out to be Logical-OR of Other Additional Bits
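A minimal sketch of round-to-nearest-even using guard, round and sticky bits, operating on an integer significand that carries extra low-order bits. The function name `round_nearest_even` and its interface are illustrative assumptions, not the hardware design from the slides.

```python
def round_nearest_even(sig, extra):
    """Round an integer significand that carries `extra` low-order bits
    beyond the kept precision, and return the kept significand.
    Assumes extra >= 2 so guard, round and sticky are all defined."""
    kept = sig >> extra
    guard = (sig >> (extra - 1)) & 1                     # first discarded bit
    rnd = (sig >> (extra - 2)) & 1                       # second discarded bit
    sticky = int(sig & ((1 << (extra - 2)) - 1) != 0)    # OR of all remaining bits
    if guard == 0:
        return kept                      # below half: truncate
    if rnd == 0 and sticky == 0:
        return kept + (kept & 1)         # exactly half: round to even
    return kept + 1                      # above half: round up
```

The sticky bit is what separates an exact tie from a value just above half. A round-up can carry out of the kept width, in which case the result needs renormalization, which this sketch leaves to the caller.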
36Floating Point Multiplier
37Floating Point Divider