GRIFFITH UNIVERSITY

Design of Double Precision IEEE-754 Floating-Point Units Extending the Open Source LTH FPU for the Gaisler Leon2 SPARC8 Microprocessor

DISSERTATION

SUBMITTED BY MICHAEL KENNEDY B.I.T.

SUPERVISOR PROFESSOR PETER LISNER

SCHOOL OF MICROELECTRONICS

SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS OF THE DEGREE OF MASTER OF ENGINEERING IN VLSI

MARCH 2007

©2007 Michael Kennedy All Rights Reserved

This work has not previously been submitted for a degree or diploma in any university. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made in the thesis itself.

Michael Kennedy

Table of Contents

LIST OF FIGURES ........ 7
ABBREVIATIONS AND SYMBOLS ........ 9
ACKNOWLEDGEMENTS ........ 10
ABSTRACT ........ 11
1 INTRODUCTION ........ 12
2 LITERATURE REVIEW ........ 14
  2.1 INTRODUCTION ........ 14
  2.2 FLOATING POINT NUMBERS ........ 14
  2.3 THE LTH FPU ........ 14
  2.4 ARITHMETIC OPERATIONS ........ 15
  2.5 DESIGN ........ 18
  2.6 VERIFICATION ........ 18
  2.7 PERFORMANCE ........ 19
  2.8 COMPARISON ........ 19
  2.9 CONCLUSION ........ 20
3 DESIGN ........ 21
  3.1 INTRODUCTION ........ 21
  3.2 LTH FLOATING POINT UNIT ........ 21
  3.3 MULTIPLICATION ........ 23
  3.4 SRT DIVISION ........ 27
  3.5 SRT SQUARE ROOT ........ 32
  3.6 CONCLUSION ........ 36
4 IMPLEMENTATION ........ 37
  4.1 INTRODUCTION ........ 37
  4.2 HDL MODELING ........ 37
  4.3 SYNTHESIS ........ 41
  4.4 CONCLUSION ........ 42
5 RESULTS ........ 43
  5.1 INTRODUCTION ........ 43
  5.2 TIMING AND AREA ANALYSIS ........ 43
  5.3 IC LAYOUT ........ 49
  5.4 PERFORMANCE ........ 51
  5.5 CONCLUSION ........ 52
6 CONCLUSION ........ 53
  6.1 INTRODUCTION ........ 53
  6.2 ACHIEVEMENTS ........ 53
  6.3 FUTURE DIRECTIONS ........ 54
  6.4 CONCLUSION ........ 55
BIBLIOGRAPHY ........ 56
APPENDIX A: ARITHMETIC BY EXAMPLE ........ 60
  INTRODUCTION ........ 60
  RADIX-2 UNSIGNED MULTIPLICATION ........ 60
  RADIX-4 MULTIPLICATION USING MODIFIED BOOTH ENCODING ........ 61
  SRT RADIX-2 DIVISION ........ 62
  SRT RADIX-2 SQUARE ROOT ........ 64
  RADIX-2 ON-THE-FLY CONVERSION ........ 66
APPENDIX B: COUNTER STUDY ........ 67
  INTRODUCTION ........ 67
  TIMING ANALYSIS ........ 67
  AREA ANALYSIS ........ 68
  CONCLUSION ........ 68
APPENDIX C: HDL VERIFICATION ........ 69
  INTRODUCTION ........ 69
  MULTIPLICATION ........ 69
  DIVISION ........ 69
  SQUARE ROOT ........ 70
APPENDIX D: CODE LISTING ........ 72
  INTRODUCTION ........ 72
  FILE LOCATIONS AND NAMING CONVENTIONS ........ 72

List of Figures

Figure 3.1 LTH Overview ........ 21
Figure 3.2 Extending the LTH ........ 22
Figure 3.3 Floating Point Multiplication Operation Properties ........ 23
Figure 3.4 Modified Booth-2 Encoding ........ 23
Figure 3.5 Partial Product Sign Extension ........ 24
Figure 3.6 Partial Product Counter Over Sizing ........ 24
Figure 3.7 CSA Counter Configuration ........ 25
Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks ........ 26
Figure 3.9 Floating Point Division Operation Properties ........ 27
Figure 3.10 SRT Radix-2 Division Equations ........ 27
Figure 3.11 SRT Radix-2 Division Overview ........ 28
Figure 3.12 SRT Radix-2 Division Circuit ........ 28
Figure 3.13 SRT Radix-2 Quotient Selection ........ 29
Figure 3.14 Radix-2 On-The-Fly Conversion Equations ........ 30
Figure 3.15 Radix-2 On-The-Fly Conversion for Q ........ 30
Figure 3.16 Radix-2 On-The-Fly Conversion for QM ........ 30
Figure 3.17 Sticky Bit Formula for a Redundant Residual ........ 31
Figure 3.18 SRT Radix-4 Division using Layered Radix-2 Stages ........ 31
Figure 3.19 Floating Point Square Root Operation Properties ........ 32
Figure 3.20 SRT Radix-2 Square Root Equations ........ 33
Figure 3.21 SRT Radix-2 Square Root Overview ........ 33
Figure 3.22 SRT Radix-2 Square Root Circuit ........ 34
Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations ........ 35
Figure 3.24 SRT Radix-2 Division and Square Root Equations ........ 35
Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit ........ 36
Figure 4.1 Multiplier Pipeline ........ 37
Figure 4.2 SRT Radix-4 Division and Square Root Pipeline ........ 39
Figure 4.3 Leonardo Spectrum Settings ........ 41
Figure 4.4 Preparing a Netlist for IC Layout ........ 42
Figure 5.1 Original LTH Timing Analysis ........ 43
Figure 5.2 Original LTH Area Analysis ........ 43
Figure 5.3 Multiplier Timing Analysis ........ 44
Figure 5.4 Multiplier Area Analysis ........ 45
Figure 5.5 Combined Division and Square Root Timing Analysis ........ 46
Figure 5.6 Combined Division and Square Root Area Analysis ........ 47
Figure 5.7 Exponent Component Timing Analysis ........ 48
Figure 5.8 Exponent Component Area Analysis ........ 48
Figure 5.9 Multiplier IC Layout ........ 49
Figure 5.10 Combined Division and Square Root IC Layout ........ 50
Figure 5.11 Pre-Layout Clock Frequency ........ 51
Figure 5.12 Processor Latency and Frequency Comparison ........ 51
Figure A.1 Unsigned Radix-2 Multiplication Example ........ 60
Figure A.2 Booth-2 Multiplication Example ........ 61
Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram ........ 61
Figure A.4 SRT Radix-2 Division Example ........ 63
Figure A.5 SRT Radix-2 Square Root Example ........ 65
Figure A.6 On-The-Fly Conversion Example ........ 66
Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis ........ 67
Figure B.2 Std. Cell v. Synthesized Counter Area Analysis ........ 68
Figure C.1 FPU Multiplication Test Bench ........ 69
Figure C.2 FPU Division Test Bench ........ 70
Figure C.3 FPU SQRT Test Bench ........ 70
Figure C.4 FPU Square Root Result Verification ........ 71
Figure D.1 File Location Map ........ 72
Figure D.2 File Naming Conventions ........ 72
Figure D.3 HDL Synthesis and Simulation Highlights ........ 73

Abbreviations and Symbols

ADK - Academic Design Kit
CLA - Carry Look-Ahead Adder
CPA - Carry Propagate Adder
CSA - Carry Save Adder or Full Adder
CSEA - Carry Select Adder
D - Divisor
DIV - Division (may refer to combined division and square root unit)
FA - Full Adder
F(i) - Square root functional divisor (F(i) and -F(i), for positive and negative products)
FPU - Floating Point Unit
G - Guard bit
I - Iteration
J - Bit position
LTH - Open source FPU distributed by Gaisler Research (www.gaisler.com)
MUX - Multiplexer
OTFC - On-The-Fly Conversion, used for quotient generation
PLA - Programmable Logic Array
PPi - Partial product
PR(s,c) - Partial remainder in carry-save form
Q - Quotient
QM - Negative quotient factor (Qi - 2^(-i))
q(i+1) - QSLC result for the current iteration
QSLC - Quotient Selection Logic Circuit
r - Radix
R - Round bit
SQRT - Square Root
Syn - Synthesis or synthesized
T - Sticky bit
ULP - Unit in Least Significant Position

Acknowledgements

I would like to thank my supervisor, Professor Peter Lisner, for his patience and support; my parents, Jennifer and Michael Kennedy, for making this all possible; Professor Nicholas Bellamy, for editorial advice; David Clarke, for research assistance; and Mal Gillespie, for software support.

Abstract

This thesis explores the design and implementation of a synchronous pipelined Floating Point Unit (FPU) for the AMI 0.5 micron process. Specifically, it focuses on designing a multiplier, divider, and square root unit for the LTH, an open source FPU provided by Gaisler Research for their Leon2 SPARC8-compliant processor. The LTH is a partial implementation of an arithmetic unit for working with IEEE-754 single and double precision floating point numbers. The new functional units are modeled in the Verilog hardware description language (HDL) and implemented using the Mentor Graphics ADK design flow. The goal is to produce a design that implements the essential operations of the floating point standard and is capable of running at or above 100 MHz.

1 Introduction

This thesis presents an overview of designing a pipelined synchronous IEEE-754 double precision floating point unit for the AMI 0.5 micron process, using the Mentor Graphics ADK 2.5 design flow. The first section provides the relevant background for working with floating point numbers and the base implementation. The second section illustrates the choice of arithmetic designs to provide the missing functionality. The final two sections provide details of the implementation and the results achieved.

Essentially, floating point numbers are used to represent very small to very large numbers with varying levels of precision. The normal set of mathematical operations such as addition, subtraction, multiplication, and division is defined, but these operations are slightly more complex than their integer counterparts. Floating point operations performed in software, where only an integer adder is available, may take several hundred or even thousands of cycles. Speeding up these operations is important because floating point numbers are used in a wide range of applications, including CAD, games, graphics, multimedia, and scientific computing. Hence, the main objective in designing an FPU is to reduce the number of cycles needed to perform a floating point operation from the large count required in software to the much smaller count achievable in hardware.

The first section, the literature review, provides the background for this thesis. It discusses the IEEE-754 floating point standard and the existing implementation provided with the Gaisler Leon2 SPARC8-compatible processor. It also provides some details of how various implementations were verified and how they performed.

The second section, design, provides an overview of the arithmetic algorithms and architecture employed to extend the LTH FPU. It illustrates the circuit design from a high-level perspective: how the arithmetic algorithms are implemented in logic blocks that fit together to provide the required functionality.

In the third section, implementation, the design process is discussed with reference to the Mentor Graphics ADK 2.5 design flow. It covers the basics, from modeling the FPU in Verilog HDL using standard cells through to layout with IC Station. While not all elements used in the FPU are fully detailed, a general outline of their construction is provided, with a selection of examples from each stage of the development cycle. Also covered are issues with the tool flow discovered during the implementation process.

The fourth section, results, discusses the developed FPU implementation. It comments on testing and verifying the design and provides performance measurements. It covers verification of both the HDL design and the completed IC layouts, for both logic and electrical rule checks.


The final section, the conclusion, outlines the success of the developed implementation. It details known limitations of the design and future improvements. Performance measurements and comparisons with industry implementations are also discussed. In brief, this section summarizes the design and results, and indicates where future modifications should be aimed to provide increased performance.


2 Literature Review

2.1 Introduction

This section provides an overview of the background literature required for understanding the floating point standard, its operations, and the design of floating point units. In particular, it focuses on the multiplication, division, and square root operations as they relate to extending the LTH into a more complete implementation of IEEE-754 floating point arithmetic for the Leon2 microprocessor.

2.2 Floating Point Numbers

Floating point numbers [8] [11] are essentially used to describe very small to very large numbers with a varying level of precision. They are comprised of three fields: a sign, a fraction, and an exponent field. Floating point numbers are also defined for several levels of precision, including single, double, quad, and several extended forms. The IEEE-754 standard defines these floating point formats and the size of the fields that comprise them. Adding to their complexity, several specific bit patterns represent special numbers, which include zero, infinity, and Not-a-Number (NaN). Additionally, floating point numbers may be in a normalized or de-normalized state. The IEEE-754 standard further defines several rounding schemes: round toward zero, round toward positive infinity, round toward negative infinity, and round to nearest [8] [9] [11] [14].

For the design of this FPU, only double precision normalized floating point numbers are of interest. This is because de-normalized numbers are not required to be implemented for an FPU to be considered compatible with the IEEE-754 standard, and the LTH converts all floating point numbers to double precision for internal use. Also, all special bit patterns result in special conditions, such as divide by zero, which immediately raises an exception and requires no computation. This essentially means the floating point numbers used will have a 1-bit sign field, a 53-bit fraction field, and an 11-bit exponent field. The format of the fraction and exponent fields is also specified, with the MSB of the fraction always being one and the exponent never being all ones or all zeros.
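As a concrete illustration of the format just described, the sketch below unpacks a double into the three fields this thesis works with. It is an explanatory Python model, not part of the LTH design; the function name is my own.

```python
import struct

def decode_double(x: float):
    """Split an IEEE-754 double into sign, biased exponent, and fraction."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                     # 1-bit sign field
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52 stored fraction bits
    if 0 < exponent < 0x7FF:              # normalized: restore the implicit 1,
        fraction |= 1 << 52               # giving the 53-bit fraction used here
    return sign, exponent, fraction
```

For example, decode_double(-1.5) yields sign 1, biased exponent 1023, and the 53-bit fraction pattern 11 followed by 51 zeros; the MSB of the restored fraction is always one for normalized inputs, as noted above.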

2.3 The LTH FPU

2.3.1 The Leon2

The Leon2 [1] is a pipelined SPARC8 [4] software-configurable synthesizable processor designed by Gaisler Research for the European Space Agency and distributed under the GPL open source license [5]. Internally it does not include a floating point unit, but it does define an interface for using one as a coprocessor [2]. It provides two methods of interfacing with a floating point unit: a direct and an indirect approach. The main difference is that when a floating point operation is issued using the direct approach, the CPU is stalled until the FPU completes that operation. The advantage is that an operations queue is not needed, reducing the complexity of any implementation. While the Leon2 does not contain an FPU itself, Gaisler does distribute it with several options [3] [6], including a partial implementation called the LTH [3] that is made available under the same GPL license.


2.3.2 The LTH

The LTH [3] is a partial implementation of a floating point unit. In its original form it supports addition, subtraction, and comparison operations on both single and double precision floating point numbers. While the IEEE-754 standard [11] does not require all floating point operations to be implemented in hardware, it does require several rounding schemes, and most implementations normally include multiplier, divider, and square root units [7] [14]. Further, as the LTH is aimed at the SPARC8 ISA, it should be able to handle single and double precision floating point operations. Quad precision instructions, while listed in the SPARC8 standard, are not often implemented in industry designs [29] [30]. Incorporating the new instructions into the LTH also requires limited changes to the Leon2, for identifying the floating point operations available in hardware.

2.4 Arithmetic Operations

2.4.1 Floating Point Arithmetic

Floating point units usually consist of two main data paths for computing the fraction and exponent fields of a result [8]. A floating point operation can broadly be divided into decoding, normalization, exception checking, fraction and exponent operations, and rounding, where the operations on the fractions of the operands take the most computation [8] [11]. It should be noted that the floating point standard requires rounded results as if they had been computed to infinite precision [11]. To accomplish this, all operations are generally computed to several extra bits of precision (guard, round, and sticky bits). Some designs can produce this extra information for some operations in parallel with their normal execution [16] [25].

2.4.2 Multiplication

Multiplication of two normalized floating point numbers is performed by multiplying the fractional components, adding the exponents, and taking the exclusive-OR of the sign fields of the two operands [9] [11] [14]. The most complicated part is the integer-like multiplication of the fraction fields (Figure A.2). This can be done by using a full tree [15], Wallace [20] or Dadda tree [12], staged [17], or serial array multiplier architecture [8] to sum the partial products. Essentially, the multiplication is done in two steps: partial product generation and partial product addition. For double precision operands (53-bit fraction fields), a total of 53 53-bit partial products are generated [16]. To speed up this process, the two obvious approaches are to generate fewer partial products and to sum them faster [8].

In its simplest form, producing the radix-2 unsigned partial products is accomplished by a logical AND operation [8] on the operands. To produce fewer partial products, the normal method is some form of modified Booth encoding [15] [36]. This allows for computation at a higher radix, with partial products produced in various forms. The choice of radix [14] determines how many partial products are generated, and the style of encoding determines their form. For double precision operands, modified Booth-2 radix-4 encoding produces 27 partial products, and Booth-3 radix-8 produces 18. With an increased radix, however, comes increased complexity and normally increased delay [8] [9]. The form of the produced partial products can be as simple as 2's complement, or a more redundant number system [18] that allows for faster summation of the products.

After the partial products have been produced, they must all be added together. The fastest method is a carry-free form of addition that sums the products in parallel using a CSA tree [8] [15]. The CSA tree can be implemented in many forms, the main variations being wiring regularity [8] [14] [20], order of addition [8], and counter or compressor sizing [15] [16]. In the case of Booth-2 radix-4 encoding with a full binary CSA tree, the benefit of producing fewer partial products is not a major speed increase, because only a decrease in the critical path of one full adder is realized; it does, however, provide a significant area saving. Further, while it is possible to generate fewer partial products with higher levels of Booth encoding, the time and area penalties increase with higher radix encodings and can outweigh any improvement offered by a smaller CSA tree [8]. Care must also be taken with the layout of the counter tree [16] [20], as its natural shape is irregular and area intensive [19]. The CSA tree from Samsung [15] attempts to eliminate much of the wiring irregularity by defining the counter tree in slices ranging from 28:2 to 4:2 counters built from seven levels of full adder cells. Alternately, the Parallel Structured Full Array Multiplier [17], built using a Wallace tree from four levels of 4:2 counters, offers increased regularity. Assuming that the 4:2 counters are internally two full adder cells, the Samsung design has one less level in the CSA tree, while offering a less regular wiring structure than the Wallace tree design.

The last stage in the multiplication is a carry propagate addition using a wide 106-bit adder for double precision numbers. Adders, while performing a simple operation, are still an area of interest, and many implementations exist for providing the final carry propagate addition (CPA) [8]. Among these are Carry Look-Ahead (CLA) [14], Carry-Select (CSEA) [11], Ling [14], and hybrid adder architectures comprised of elements from several different design approaches [17]. Additionally, there are adders that operate on different number systems with different levels of redundancy. Some partial products may even be produced in a highly redundant form that makes a full CPA irrelevant [18], instead using a slightly faster and less area intensive conversion circuit to convert the redundant result into standard binary notation. However, many of the designs in the literature become less attractive when implemented using standard cells, such as the 4ns 32-bit CMOS Ling Adder [14]. This presents its own problems, as the selected adder must be feasible using standard cells and be able to perform the full addition in under 10ns. This makes the XOR-based design offered by Samsung [15] more attractive.

Optimizing a multiplier is usually done by constructing the CSA tree into a regular layout instead of the irregular trapezoid that results from the shifted partial products [20]. The goal is to decrease the total area used by the multiplier and to make the wiring delay between the adders more uniform. Some other work uses more redundant number systems, which allow for higher levels of modified Booth encoding without incurring the time penalties that normally result from forming the hard multiples (such as 3x) [18]. Generally, however, the most common implementations revolve around radix-4 Booth-2 encoding, some form of CSA tree, and a wide hybrid adder (at least 106 bits).
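The Booth-2 recoding step can be illustrated with a small behavioral sketch. This is a Python model for exposition only (the thesis implements the encoders as standard-cell logic, and sums the partial products in a CSA tree rather than sequentially); the function name is my own.

```python
def booth2_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit unsigned integers using radix-4 modified Booth-2
    recoding: scan overlapping 3-bit windows of y (with an implicit 0 below
    the LSB) and form one partial product in {-2x, -x, 0, x, 2x} per digit."""
    yy = y << 1                          # append the implicit y[-1] = 0
    num_digits = (n + 2) // 2            # 27 digits for n = 53
    total = 0
    for i in range(num_digits):
        window = (yy >> (2 * i)) & 0b111
        # digit = y[2i-1] + y[2i] - 2*y[2i+1], always in {-2, -1, 0, 1, 2}
        digit = (window & 1) + ((window >> 1) & 1) - 2 * ((window >> 2) & 1)
        total += (digit * x) << (2 * i)  # partial product at weight 4^i
    return total
```

The sequential accumulation here only checks that the recoding is value-preserving; a hardware implementation would add the 27 partial products (for 53-bit fractions) in the CSA tree described above.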


2.4.3 Division

Division of two normalized floating point numbers is performed by a subtraction of the exponents, an integer-style division of the fractions (Figure A.4), and an exclusive-OR of the sign fields of the two operands [11] [14]. As with multiplication, the operation on the fraction fields is the most time consuming part. There are two main ways of performing the integer division of the fraction fields: a subtractive recursive approach [11] and a multiplicative approximation [8]. Both present their own benefits [21] [22] in terms of area, performance, and complexity. The subtractive approach provides a simpler implementation but produces a constant number of bits per iteration. Alternately, the multiplicative approach produces an increasing number of bits per iteration and as a result requires fewer iterations, but each iteration takes several cycles and needs a modified multiplier. The subtractive approach is the more popular architecture because it can be implemented independently and does not need complicated modifications to the multiplier [28].

The subtractive approach, which is performed by repeated addition, subtraction, or shifting operations, produces a constant number of result bits per iteration (1 bit per iteration for radix-2, 2 bits per iteration for radix-4). This implies that for a 53-bit result, at least 53 radix-2 iterations are needed. Two ways to increase performance are to increase the number of result bits generated during an iteration, or to perform more iterations per cycle [8]. This can be done either through a more complex higher radix iteration, or by layering several simple lower radix iterations together [23] [26]. Another way to reduce the delay is to keep the partial remainder in a redundant form, such as carry-save [25], which removes the need for a CPA in each iteration. The critical path then becomes the quotient selection logic (QSL), which determines whether a subtraction, addition, or shift operation takes place during an iteration. The quotient selection logic can be implemented as purely combinational logic or as a look-up table (LUT), referred to in SRT as a Quotient Selection Table (QST). For the second approach, several tools are available to automatically generate the QSTs [8] [12] [13]. With a higher radix comes a more complex QSL; layering and overlapping of lower radix iterations allows for computation at a higher radix without this complexity [23]. A design by Sun Microsystems [25] demonstrates the layering of low radix iterations and also provides a more optimized SRT table than some other designs [24]. This optimization has the potential to remove the final restoring iteration for the remainder normally required by many SRT implementations. When used in conjunction with high radix cycles, this single cycle saving provides a relatively large performance increase without any extra delay in the critical path.

2.4.4 Square Root

The square root operation is very similar to the division operation, with both multiplicative approximation and subtractive recursive approaches available [22] (Figure A.4). The reasons for choosing an architecture are very similar to those for division [21]: either the multiplier can be modified or an extra subtractive unit can be implemented. In the case of the latter, the difference between the subtractive division and square root units is essentially very small, and an SRT divider can be modified to encompass square root operations [25] with only a few changes.
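The shared SRT recurrence can be sketched for the division case as follows. This is an illustrative Python model using non-redundant arithmetic, with one common choice of selection constants (comparisons against plus and minus one half); a hardware version keeps the remainder in carry-save form and the QSL examines only a few leading bits. The function name and fixed-point scaling are my own.

```python
def srt2_divide(x: int, d: int, n: int):
    """Radix-2 SRT division of n-bit fixed-point fractions, assuming x < d
    and d >= 1/2 (i.e. d >= 2^(n-1)). One signed quotient digit in
    {-1, 0, 1} is produced per iteration."""
    half = 1 << (n - 1)        # fixed-point constant 1/2
    w = x                      # partial remainder, scaled by 2^n
    digits = []
    for _ in range(n):
        w <<= 1                # shift left one bit position
        if w >= half:          # quotient digit selection: the redundant
            q = 1              # digit set tolerates this coarse estimate
        elif w <= -half:
            q = -1
        else:
            q = 0
        w -= q * d             # subtract, add, or just shift
        digits.append(q)
    return digits, w           # signed digits and final scaled remainder
```

The signed digits convert (in hardware, by on-the-fly conversion) to a conventional quotient Q satisfying x * 2^n = d * Q + w, with the remainder bounded by the divisor.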


Essentially, the biggest difference between the SRT division and square root operations is that in the square root operation the divisor is adjusted during every iteration [11] [13], whereas in SRT division it remains unchanged. Another difference between the two operations is in the handling of exponents and base exceptions. For the most part, the SRT division algorithm can be extended to implement the square root operation [37], allowing the advantages of the more complex division unit designs to be carried over to the square root operation in a combined division and square root unit.
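The per-iteration divisor adjustment can be seen in the following sketch of the radix-2 square root recurrence, written with exact rationals for clarity rather than as hardware. The initialization, digit selection constants, and names are my own assumptions, mirroring the division sketch's plus-and-minus-one-half thresholds.

```python
from fractions import Fraction

def srt2_sqrt(x: Fraction, n: int) -> Fraction:
    """Radix-2 SRT square root for x in [1/4, 1). Unlike division, the
    subtracted term F = 2*S + s/2^i depends on the partial root S, so this
    'functional divisor' must be rebuilt during every iteration."""
    S = Fraction(1, 2)                  # first root digit is always 1
    w = 2 * (x - Fraction(1, 4))        # partial remainder 2^1 * (x - S^2)
    for i in range(2, n + 1):
        w = 2 * w
        s = 1 if w >= Fraction(1, 2) else (-1 if w <= Fraction(-1, 2) else 0)
        w -= s * (2 * S + Fraction(s, 2**i))   # iteration-dependent divisor
        S += Fraction(s, 2**i)          # append digit s at weight 2^-i
    return S
```

By construction the invariant w = 2^i * (x - S^2) holds exactly, so a bounded remainder after n iterations means S matches the square root of x to roughly n fraction bits; only the divisor update line differs structurally from the division recurrence.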

2.5 Design

The design of the original LTH [3] is a single pipeline with several stages operating on double precision floating point numbers. Data are inserted into the pipeline at several stages, including decode, operands, and rounding. The missing functionality is normally implemented in parallel independent pipelines because of the marked differences in the requirements of the respective algorithms. These pipelines still normally share some common stages, which may include decoding, rounding, and normalization. This implies that the new functionality can be incorporated by adding the new functional blocks in parallel to the existing addition/subtraction unit of the LTH. The new blocks will also be required to operate on a single pair of double precision numbers, although other industry designs offer a wider variety of word sizes for computation [19] [30] [31].

The original design uses a pipelined architecture, but is not truly pipelined: while it is divided into stages, it cannot accept a new operation every cycle. This is an intended feature of the LTH because of its interface with the Leon2 [2], which means that, with the inclusion of the new functional units, at most one stage in a single pipeline will be in operation at any time. While it may be possible to extend the modified LTH to a more modern out-of-order superscalar design [7] [32], neither the SPARC8 standard [4] nor the co-processor interface of the Leon2 requires such complexities, and therefore executing floating point instructions in parallel and out of order is beyond the scope of this project.

2.6 Verification

A floating point unit is a set of complex algorithms implemented in an integrated architecture [33] [34]. Demonstrating a satisfactory level of compliance with the IEEE-754 standard and implementation reliability can be grouped into two separate concepts: design testing and implementation testing [36]. The first is a logical proof of the well defined algorithms translated into Verilog HDL, tested by running a limited number of test vectors against the developed HDL modules. The second, implementation testing, is usually accomplished by generating test vectors to cover a specific set of test cases, and possibly a percentage of total input coverage, against a physical model. This testing can be performed in a mechanized form on a taped-out implementation. The importance of testing both the algorithm and the implementation can be seen in the problems encountered with the Intel Pentium division SRT lookup table [35], where the algorithms were sound but the implementation contained an incomplete data set.


2.7 Performance

The performance of a floating point unit is commonly expressed in terms of its latency and throughput. However, the real-world performance of an FPU can be application specific, and is normally measured using benchmarks comprising tests from common applications [29]. While this allows a comparison of different architectures based on execution cycles alone, it does not permit a meaningful overall comparison of designs, since area, power, and technology are not considered. FUPA, the Floating-point Unit Performance Analysis metric discussed by Flynn and Oberman [14], is a more complete metric for comparing different implementations, as it incorporates latency, area, scaling, power, minimum feature size, and application profiling.

2.8 Comparison

During this literature review several different approaches and designs were examined to determine the optimal architecture for extending the LTH. The overall structure of the extended design is well defined by the original LTH and by the marked differences between the functional requirements of the different operational units. The largest issues concern the additional units, which can be implemented using a variety of methods, but are limited to designs that use standard cells [12] [13] and can achieve a 10ns cycle time. The design of a multiplier is generally accepted to be a three stage process: modified Booth encoding, a CSA tree, and a final CPA. The approach offered by Cho et al [15] provides a complete modular description of both Booth-2 encoders and a CSA tree constructed from full adders. Its other main advantage is that it avoids some of the layout issues of counter trees [8], as the blocks are clearly defined. Other implementations utilizing a more redundant number system [18], while interesting, offer a perhaps faster solution but are vastly more complex and at times incomplete. Flynn [14] also suggests that, for both area and latency, a modified Booth-2 encoder with a full binary adder tree is an optimal solution. The final stage of multiplication, the 106-bit CPA, presents the only real choice. While many designs exist, only fast adders that can be implemented using standard cells are relevant. The hybrid CSEA built from 16-bit CLAs offered by Mori [17] provides a suitable delay and a modular, well documented design optimal for layout, especially in comparison to a fully synthesized random-logic wide adder. There is a larger difference in the design of division and square root units. Essentially, the subtractive SRT approaches are favored because they do not require any modification to the multiplier and can be implemented independently.
The next choice is then that of radix: more bits per iteration incurs more complexity, while more iterations per cycle increases the area. The implementation offered by SUN Microsystems and others, utilizing several lower radix iterations to form a higher radix iteration, is clearly the best approach [25]. While the SUN design is not markedly different from other layered overlapped radix-2 designs, it does provide significantly improved QSLC by utilizing a fifth bit. The methods used by industry for testing a final product are not available for the extended LTH. The design approach of testing on an FPGA is not possible given the direct instantiation of ADK standard cells in the HDL. Secondly, since there is no planned fabrication, the process of mechanical


testing of a physical processor is not possible, and if it were, the approximate two-month testing period would be restrictive [36]. Testing is therefore limited to HDL testing and IC layout verification. Finally, in terms of performance, the least favored methods of comparison, latency and throughput, will be the only metrics available for comparing the extended LTH with other floating point units. Without a fabricated design it is impossible to verify the operational frequency or power requirements, which makes running any application specific benchmarks [29] and generating any FUPA metric [11] impossible. The other reason that running benchmarks is not possible is that, while the extended LTH may conform to the original LTH-Leon2 interface [2], the Leon2 needs some modification to operate correctly with the new design, which is beyond the scope of this work.

2.9 Conclusion

Extending the LTH to perform division, multiplication, and square root operations on single and double precision floating point numbers can be accomplished by adding parallel pipelines alongside the addition/subtraction pipeline. The best approach for the multiplier relies on the Samsung design for the modified Booth encoding and CSA tree. There are many alternate approaches to the final wide fast adder, which is confined to those implementable using standard cells. The optimal architecture for a joint division and square root unit is a layered subtractive radix-2 SRT style approach, benefiting from the quotient selection logic outlined by SUN. This design is appropriate because of its independence from the multiplier and its ability to be layered to perform higher radix computations and meet timing requirements. In terms of implementation, the design decisions are fairly well dictated. To achieve a cycle time comparable to that of industry designs using a similar process technology, optimizing for delay at the expense of area is necessary, as a result of being confined to a standard cell library. The use of standard cells also means a host of other designs are not feasible, including custom-transistor 4:2 counters and pseudo-CMOS Ling adders. Further, since the design is intended to be converted into an IC layout, a modular design is also beneficial. This can be seen in the selection of the CPA for the wide adder, where a 106-bit CSEA made of 16-bit CLA blocks offers a more regular layout than the random logic of a completely synthesized approach.


3

Design

3.1 Introduction

This section covers the basic details of the computation of the fractional component of double precision floating point numbers. It illustrates the underlying elements of the selected multiplication, division, and square root designs, and provides some details of how this new functionality can be incorporated into the existing LTH architecture.

3.2 LTH Floating Point Unit

3.2.1 Introduction

The LTH is an open source FPU distributed with the Leon2 [3]. It is a partial implementation of a floating point unit that uses the double precision format for all calculations. This section presents the original datapath organization and the changes required to integrate the missing functionality.

3.2.2 Original Design

The original LTH [3] is divided into five stages (Figure 3.1): “Decode”, “Prenorm”, “Add/Sub”, “Postnorm”, and “Roundnorm”. The Decode stage unpacks the operands and instructions and checks for special conditions. Prenorm, the second stage, formats the operands for addition or subtraction by placing them in the correct order and performing any necessary shift operations. The third stage, Add/Sub, performs the addition or subtraction operation. The Postnorm stage normalizes the result for the final stage, which rounds and packages the result ready for output.

[Figure 3.1 LTH Overview — Decode (stage 1), Prenorm (2), Add/Sub (3), Postnorm (4), Roundnorm (5), with a busy signal back to the input.]

This presents several problems, mostly related to the time taken to execute different operations. When an invalid operation is sent to the LTH, an exception should be generated by the floating point unit, indicating a problem with the instruction or operand [11]. In this design, where the error is known at


the end of the first stage, it takes a further four stages for the exception to be written to the output. The same is true for the comparison operation, which requires fewer than five cycles. While this may not appear to be a problem, since the Leon2 is stalled while any FPU operation is underway, for an invalid operation the CPU may have to wait four cycles for the FPU to do nothing.

3.2.3 Integrating Additional Functions

[Figure 3.2 Extending the LTH — Decode (stage 1) feeds parallel datapaths (Prenorm/Add/Sub, Multiplier, and DIV/SQRT, each with its own normalization) that converge on Postnorm (4) and Roundnorm (5).]

Integrating the additional features, such as division, square root, and multiplication, requires extending the LTH's decoder, modifying its rounding stage to accept inputs from multiple datapaths, and redesigning its busy signal unit. Further, an extra path from the decoding stage to the last stage should be added, which for a division operation with invalid operands would avoid over 57 stalled CPU cycles. The key point is that the decoder recognizes all valid operations and correct operands for the new features, and can dispatch each operation down the correct datapath. At the end of each datapath, where a result is available, the already defined rounding module can be utilized to round and package it for output. This leaves only the design of a multiplier, divider, square root unit, and associated normalization units required to construct a full FPU (Figure 3.2).


3.3 Multiplication

3.3.1 Introduction

Multiplication Operation Properties

  Property          Value
  Operand A Range   1 ≤ A < 2
  Operand B Range   1 ≤ B < 2
  Result R Range    1 ≤ R < 4 (may require left shift and exponent increment)
  Result Sign       Sign A XOR Sign B
  Result Exponent   Exponent A + Exponent B - Bias

Figure 3.3 Floating Point Multiplication Operation Properties
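The sign and exponent rows of the table translate directly into integer operations on the unpacked IEEE-754 fields. The following Python fragment is an illustrative sketch only (the function name and the use of the standard double precision bias of 1023 are my own; the fraction datapath occupies the rest of this section):

```python
BIAS = 1023  # IEEE-754 double precision exponent bias

def mul_sign_exp(sign_a, exp_a, sign_b, exp_b):
    """Sign and biased exponent of A*B before the normalization shift.

    Signs are single bits; exponents are the biased 11-bit fields."""
    return sign_a ^ sign_b, exp_a + exp_b - BIAS
```

If the fraction result falls in [2, 4), the later normalization step increments this exponent by one.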

Multiplication includes four logically separate stages: partial product generation, partial product addition, a final carry propagate addition, and result normalization. To accomplish these tasks, Booth encoding, a full CSA tree, a fast CSEA CPA, and a left shifter were designed. This section outlines the fundamentals behind the algorithms used to implement each of the four stages.

3.3.2 Partial Product Generation

The first part of designing a multiplier circuit is to produce the partial products that have to be summed. Modified Booth-2 radix-4 encoding was selected, as it produces half the number of partial products of an un-encoded method, and does not require the calculation of any hard multiples, as would be needed by a higher level of encoding.

Booth-2 Products

  Xj+1  Xj  Xj-1  Booth Multiple  Dir.  Shift  Add
  0     0   0     0Y              0     0      0
  0     0   1     1Y              0     -      1
  0     1   0     1Y              0     -      1
  0     1   1     2Y              0     1      0
  1     0   0     -2Y             1     1      0
  1     0   1     -1Y             1     -      1
  1     1   0     -1Y             1     -      1
  1     1   1     0Y              1     0      0

Figure 3.4 Modified Booth-2 Encoding
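As a cross-check on Figure 3.4, the recoding can be modeled in a few lines of Python. This is an illustrative sketch of the arithmetic only (the function names are mine, and the encoder's add/dir/shift signal encoding is not modeled): each group of three multiplier bits selects a signed multiple in {-2Y, -1Y, 0Y, +1Y, +2Y}.

```python
def booth2_digit(x, j):
    """Signed digit in {-2, -1, 0, +1, +2} selected by bits x[j+1], x[j], x[j-1].

    Bit -1 is taken as zero (the implicit bit below the LSB); the digit
    value equals -2*x[j+1] + x[j] + x[j-1], matching Figure 3.4."""
    hi = (x >> (j + 1)) & 1
    mid = (x >> j) & 1
    lo = (x >> (j - 1)) & 1 if j > 0 else 0
    return -2 * hi + mid + lo

def booth2_multiply(x, y, n):
    """Multiply by summing one partial product per Booth group (n even,
    0 <= x < 2**n): group j contributes booth2_digit(x, j) * y << j."""
    return sum(booth2_digit(x, j) * (y << j) for j in range(0, n + 1, 2))
```

For 8-bit operands this produces at most five signed partial products instead of eight; the 54-bit encoding described below produces the 27 partial products summed by the CSA tree.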

Instead of a simple AND operation on a bit J of operand A with all bits of operand B to produce a single partial product (Figure A.1), Booth encoding uses three bits of operand A to generate a multiple of operand B (Figure A.2). Essentially this requires constructing an encoder for the truth table presented above (Figure 3.4), and a set of connected MUXes to select the correct multiples of operand B. The design from Samsung [15] offers a full description of both the Booth encoders and the MUXes required for implementation. Their encoders produce three select signals, “add”, “dir”, and “shift”, to generate the required Booth multiple.

3.3.3 Multi-Operand Addition

The modified Booth encoding produces 27 54-bit partial products in 1's complement form. To convert these to the required 2's complement form, the sign bit can be extracted and added to the shifted product's LSB. For the multiplication of radix-2 unsigned numbers, the shifted results can simply be added together. However, when signed numbers are concerned, a special sign extension method (Figure 3.5) is generally used to keep the operands short [14], which significantly reduces area. The alternative would be to sign extend each partial product to the maximum width of the operation, which is obviously undesirable.

[Figure 3.5 Partial Product Sign Extension — bit-matrix diagram of the compressed sign-extension scheme, using each product's sign bit S and inserted 1s instead of full-width extension.]

Adding these products together using a carry save adder (CSA) tree can be done through at least seven levels [8] of 3:2 counters (also known as full adders). The design presented by Samsung offers an optimized form of carry propagation (Figure 3.6), where each counter has bit zero of its partial product input port optimized for late arrival (Figure 3.7). This allows counters to receive an input on their partial product port from their neighbouring counter's carry out with no added delay.

[Figure 3.6 Partial Product Counter Over Sizing — bit-matrix diagram showing where counters must be widened to absorb the carry (C) bits produced by neighbouring counters.]


[Figure 3.7 CSA Counter Configuration — 4:2 and 5:2 counters built from chained full adders (FA), with grounded unused inputs and the late-arrival input on the partial product port.]
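The arithmetic performed by one level of 3:2 counters, and by a tree of such levels, can be sketched bit-parallel in Python (an illustrative model of the sums only; the wiring, counter sizing, and late-arrival optimization above are not represented):

```python
def csa_3_2(a, b, c):
    """One level of 3:2 counters (full adders) across all bit positions.

    Returns (sum, carry) with a + b + c == sum + carry; the carry word is
    already shifted left one position, as it feeds the next column."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa_reduce(terms):
    """Reduce a list of partial products to two carry-save words, then let a
    single carry-propagate addition (the CPA stage) produce the final sum."""
    terms = list(terms)
    while len(terms) > 2:
        s, c = csa_3_2(terms.pop(), terms.pop(), terms.pop())
        terms += [s, c]
    return sum(terms)  # stands in for the final wide CPA
```

Each pass through the loop removes one operand from the list, which is why reducing 27 partial products to two carry-save words takes several counter levels before the single carry propagate addition.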

While the Samsung design is fairly complete, it does present some concerns. It proposes a maximum counter size of 28:2, which suggests that in a column there will be 28 bits to sum. However, the last partial product generated is never negative, because the MSB of the other operand is always zero (operand B[52:0]; the MSB for Booth encoding is bit 53). As a result, the final partial product is always a positive multiple, which makes its sign bit zero. This limits the maximum height of a column to 27, and the required counter size to 27:2. In terms of implementation, this requires a range of counters to be constructed, from 27:2 down to 4:2, each with port zero optimized for late arrival to handle carry propagation. The other consequence of carry propagation is that some counters need to be oversized to handle the extra carry signals generated by previous counters (Figure 3.6).

3.3.4 Fast Carry Propagation Addition

For the final stage, the wide adder, two candidates exist for performing the CPA. The fastest adder, but also the most area intensive, is the Carry Look-Ahead Adder (CLA) [14], which is only suitable for at most 16 bits. However, a hybrid approach using a Carry Select Adder (CSEA) comprised of 16-bit CLA blocks provides most of the speed advantage of the CLA without the vastly increased area requirements that would be expected of such a wide adder [17]. The second possibility for performing the wide CPA is a fully synthesized adder [8]. Using a 16-bit CLA with the output signal for bit 16 as a block carry output, it is possible to construct a larger adder. This effectively makes each 16-bit CLA with no carry-out signal into a 15-bit adder cell with both carry-in and carry-out signals defined. A 15-bit CSEA block can be made by using two of the CLA blocks to compute the addition of the same input bits but with different carry-in signals.
Then a 106-bit CPA can be constructed by chaining several 15-bit CLA CSEA blocks together; for the most significant block, where no carry-out signal is required, the 15-bit CSEA block can be used as a full 16-bit adder cell.


Thus, a 106-bit CSEA can be built from 13 16-bit CLAs, with a delay of slightly more than one 16-bit CLA and six 16-bit 2:1 multiplexers (Figure 3.8).

[Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks — each 15-bit carry-select stage uses a pair of 16-bit CLAs evaluated with carry-in 0 (GND) and carry-in 1 (VDD), with MUXes selecting Result[14:0] and the block carry; a single 16-bit CLA forms the least significant block.]
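The carry-select principle of Figure 3.8 can be illustrated with a behavioural Python model. This is a simplification: the operand is split into uniform 16-bit blocks rather than the 15-bit CSEA blocks described above, and block_add merely stands in for a 16-bit CLA.

```python
BLOCK = 16
MASK = (1 << BLOCK) - 1

def block_add(a, b, cin):
    """Behavioural stand-in for a 16-bit CLA: returns (sum, carry_out)."""
    t = a + b + cin
    return t & MASK, t >> BLOCK

def carry_select_add(a, b, width=112):
    """Each block is evaluated for both possible carry-ins in parallel; the
    incoming block carry then merely selects a precomputed result, so the
    carry chain is a chain of MUXes rather than a chain of adders."""
    result = carry = 0
    for i in range(0, width, BLOCK):
        ab, bb = (a >> i) & MASK, (b >> i) & MASK
        s0, c0 = block_add(ab, bb, 0)   # speculative: carry-in = 0 (GND)
        s1, c1 = block_add(ab, bb, 1)   # speculative: carry-in = 1 (VDD)
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << i
    return result
```

The speed advantage comes from computing both speculative sums while the block carry is still rippling through the MUX chain.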

3.3.5 Normalization and Sticky Bit Generation

The result of a double precision operand multiplication is 106 bits long, in the range 1 ≤ R < 4. If the leading bit is a one, the result is above the required range, and a left shift is needed to normalize it; if the left shift is performed, the exponent must also be incremented. At this stage, where more than the required 53 bits of precision are available, a sticky bit “T” must be generated before the extra 50 or 51 bits are discarded. The sticky bit indicates whether the truncated bits are all zero, and is required for the next stage, rounding.
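A behavioural sketch of this step in Python (prec would be 53 for double precision; the guard and round bits of the later rounding stage are not split out, only the normalization decision and the sticky bit T are shown, and the adjustment the text describes as a left shift appears here as the choice of which bits are kept):

```python
def normalize_product(p, prec=53):
    """Given the double-width product p of two prec-bit significands, return
    the kept prec-bit significand, the exponent increment, and the sticky
    bit T (1 iff any discarded bit is non-zero)."""
    width = p.bit_length()            # 2*prec if the leading bit is set
    exp_inc = 1 if width == 2 * prec else 0
    shift = width - prec              # number of low-order bits discarded
    kept = p >> shift
    sticky = int(p & ((1 << shift) - 1) != 0)
    return kept, exp_inc, sticky
```

With prec=4, for example, 1.001 x 1.111 (9 x 15 = 135) keeps 1000, increments the exponent, and sets T because non-zero bits were discarded.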


3.4 SRT Division

3.4.1 Introduction

SRT Division Operation Properties

  Property                     Value
  Operand X Range (Dividend)   1 ≤ X < 2 and X ≤ D (right shift X: 0.5 ≤ X < 1)
  Operand D Range (Divisor)    1 ≤ D < 2
  Result Q Range               0.5 ≤ Q < 2 (with right shift of X: 0.25 ≤ Q < 1)
  Result Sign                  Sign X XOR Sign D
  Result Exponent              (Exponent X - Exponent D) + Bias

Figure 3.9 Floating Point Division Operation Properties

SRT is a non-restoring residual recurrence approach to division that estimates a constant number of quotient bits per iteration [8] [11]. The estimation is performed by a quotient selection function (QSLC), which determines the operation performed during an iteration. Further, as each quotient bit is produced, it must be incorporated into the existing quotient. To accommodate possible negative values, a scheme called “on-the-fly” conversion is presented. While the designs presented are predominantly for a radix-2 SRT division unit, extending the base architecture to perform the calculation at a higher radix is also possible.

3.4.2 Radix-2

PRi+1 = 2·PRi - qi+1·D,  where i = 0, 1, 2, 3, ..., m;  Q0 = 0, PR0 = X/2

Figure 3.10 SRT Radix-2 Division Equations

For a radix-2, bit-at-a-time implementation (Figure A.4), the iteration formula is summarized in the figure above. In its simplest form, it involves examining the most significant bits of the partial remainder and performing an operation based on the result. Depending on the implementation of the QSLC, the time to determine the correct operation can be quite long. By calculating all three possible results in parallel while the QSLC determines which operation to perform (Figure 3.11), it is possible to reduce the critical path significantly [21].


[Figure 3.11 SRT Radix-2 Division Overview — from the residual registers, the QSLC and the -1, 0, +1 product generators operate in parallel; the QSLC output drives a MUX selecting the new residual, followed by the Q/QM update logic. The QSLC and MUX lie on the critical path.]

Generating the three candidate partial remainders when a carry-save form residual is used (Figure 3.12) can be accomplished by using one level of full adders. When “qi+1” is zero, a left shift of the partial remainder is sufficient. When “qi+1” is negative one or positive one, the new partial remainder is the result of the left-shifted remainder plus an addition or subtraction of the divisor respectively. For the subtraction of the divisor, a 2's complement form can be used, which involves only inverting the divisor for the positive product and adding a one to the carry-in at the LSB.

[Figure 3.12 SRT Radix-2 Division Circuit — per-bit parallel product cell: two CSAs form the + and - products from 2PRsj, 2PRcj, and Dj, the 0 product is the shifted residual itself, and the QSLC-driven MUX selects the new PRsj, PRcj pair.]
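The recurrence and the digit selection can be exercised with an exact-arithmetic model. This Python sketch is illustrative only: Fraction replaces the carry-save residual of the hardware, and the comparisons against ±1/2 play the role of the truncated-residual QSLC described in the next subsection.

```python
from fractions import Fraction

def srt2_divide(x, d, bits=56):
    """Radix-2 SRT division sketch for 1 <= x < 2 and 1 <= d < 2.

    PR0 = x/2; each iteration produces a digit q in {-1, 0, +1} with
    PR(i+1) = 2*PR(i) - q*d, so the returned Q satisfies
    |Q - x/(2*d)| <= 2**-bits."""
    pr, q = Fraction(x) / 2, Fraction(0)
    for i in range(1, bits + 1):
        two_pr = 2 * pr
        if two_pr >= Fraction(1, 2):
            digit = 1                 # subtract the divisor
        elif two_pr < Fraction(-1, 2):
            digit = -1                # earlier over-estimate: add it back
        else:
            digit = 0                 # shift only
        pr = two_pr - digit * d
        q += Fraction(digit, 2 ** i)
    return q
```

Doubling the returned Q undoes the initial right shift of the dividend, recovering x/d.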

To obtain the correct 53-bit result, at least 56 iterations are required. These iterations produce the 53 bits of the result, two guard bits G, and a round bit R. The first guard bit is used for the case in which 0.25 ≤ Q < 0.5, and the second guard bit corrects for any over-estimation in the last produced bit of the quotient. The round bit R is required for all floating point operations; it


effectively calculates the result to an extra level of precision, which can be used in subsequent rounding stages [8].

3.4.3 Quotient Selection

Quotient Selection

  CPA Result                               QSL Product
  000.0 and 5th MSBs of PRs and PRc = 0    0
  0XX.X                                    +1
  111.1                                    0
  1XX.X                                    -1

Figure 3.13 SRT Radix-2 Quotient Selection
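The table can be modeled directly in Python. This is an illustrative sketch with my own naming: prs and prc are the residual's sum and carry words as two's-complement integers with f fraction bits, and the masked addition below plays the role of the 4-bit CPA.

```python
def qslc_radix2(prs, prc, f):
    """Estimate the next quotient digit from the four most significant bits
    of a carry-save residual, with the fifth-bit zero test of [25]."""
    t = ((prs >> (f - 1)) + (prc >> (f - 1))) & 0xF    # 4-bit CPA, format xxx.x
    fifth_or = ((prs >> (f - 2)) | (prc >> (f - 2))) & 1
    if t == 0b0000 and fifth_or == 0:
        return 0                     # exact-zero case (000.0, 5th MSBs zero)
    if t == 0b1111:
        return 0                     # 111.1: residual just below zero
    return -1 if t & 0b1000 else 1   # 1XX.X -> -1, otherwise 0XX.X -> +1
```

Note that a small positive residual below 1/2 still selects +1 here, exactly as in the table; the resulting over-estimate is what the negative digit later corrects.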

Quotient selection is the process of estimating the next bit of the final quotient. It samples several of the most significant bits of the partial remainder and estimates the next quotient bit “qi+1” (Figure 3.13). Each estimate is either correct or an over-estimation of the next quotient bit, which may require correction in a later iteration; this is why the negative quotient digit is required. Further, since only a few of the most significant bits are sampled, the partial remainder does not need to be fully calculated, and a redundant form such as carry-save is suitable [11] [24]. However, traditional QSL does not function as expected for an exact result, such as 25/5. An improved version proposed by SUN Microsystems [25] corrects this behaviour by modifying the case where the first four most significant bits of the partial remainder sum to zero. In terms of implementation, this requires a fast adder to sum the four most significant bits of the residual and some logic to determine which case applies. Alternatively, by examining each case, a ROM containing slightly more than 256 entries can be constructed [35].

3.4.4 Partial Quotient Generation

Every iteration of a radix-2 operation generates a new quotient bit “qi+1”. Since this bit can be positive or negative, it would at first appear that a CPA might be required for negative results. By using the “on-the-fly” conversion method [11], two registers, Q and QM, are maintained, which removes the need for a full CPA of the quotient (Figure A.6). Q is the current quotient at iteration i, and QM is its decremented form (Qi - 2-i). Calculating Q and QM involves selecting one of the two registers and concatenating a value to its end at position 2-i (Figure 3.14). In terms of implementation, this requires no more than two 2:1 MUXes per bit and the distribution of the new quotient bit.
However, since the parallel product generation also includes a 3:1 MUX for selecting the correct product, the critical path of a single iteration is increased by less than a 4:1 MUX.


q = +1:  Qi+1 = (Qi, 1),   QMi+1 = (Qi, 0)
q =  0:  Qi+1 = (Qi, 0),   QMi+1 = (QMi, 1)
q = -1:  Qi+1 = (QMi, 1),  QMi+1 = (QMi, 0)

where (X, b) denotes concatenating bit b to X at position 2-(i+1).

Figure 3.14 Radix-2 On-The-Fly Conversion Equations

The radix-2 form of these equations is quite simple in comparison to the generic formula. This leads to a basic design for a single bit of the Q and QM logic, depicted in the next two figures. The only complication is knowing which position 2-(i+1) is during an iteration. By using a one-hot state machine “I”, in which a single high signal is propagated from left to right in a register one bit per iteration, it becomes trivial to select the bit position that will be changed.

[Figure 3.15 Radix-2 On-The-Fly Conversion for Q — a MUX selects Q (QSL ≥ 0, i.e. QSL = 01 or 00) or QM (QSL = 11); the bit at position 2-(i+1), marked by Ij, is set to QSLLSB, while all other bits are read from the selected register.]

[Figure 3.16 Radix-2 On-The-Fly Conversion for QM — a MUX selects Q (QSL = 01) or QM (QSL = 00 or 11); the bit at position 2-(i+1) is set to ~QSLLSB, while all other bits are read from the selected register.]
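The update rules of Figures 3.15 and 3.16 amount to appending one bit to either Q or QM each iteration. The following Python sketch (digit lists and strings in place of registers and MUXes) shows the conversion:

```python
def on_the_fly(digits):
    """Convert quotient digits in {-1, 0, +1} to a binary fraction string
    without any carry-propagate subtraction, keeping QM = Q - 2**-i."""
    q = qm = ""
    for d in digits:
        if d == 1:
            q, qm = q + "1", q + "0"
        elif d == 0:
            q, qm = q + "0", qm + "1"
        else:                          # d == -1: borrow from QM
            q, qm = qm + "1", qm + "0"
    return q, qm
```

For example, the digit sequence +1, -1, +1 (value 1/2 - 1/4 + 1/8 = 3/8) converts to Q = "011" with QM = "010", with no subtraction ever performed.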

3.4.5 Normalization and Sticky Bit Generation

Like multiplication, the division operation may produce a result outside the required range, in this case 0.5 ≤ Q < 2. By detecting whether the leading quotient bit is a zero (accounting for the initial required right shift of the dividend), the quotient can be seen to lie either in the range 0.5 ≤ Q < 1 or 1 ≤ Q < 2. In the first case, the quotient must be left shifted and the exponent decremented. However, unlike multiplication, the sticky bit T is more complicated to calculate. For division and square root it is produced from a residual in carry-save form [25] (Figure 3.17). Since it is not based on the quotient, this is normally performed in the same stage as the recurrence iteration.

T = 1 iff the final residual PRs + PRc is non-zero, where M = residual length

Figure 3.17 Sticky Bit Formula for a Redundant Residual
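Behaviourally the check reduces to a zero test of the final residual. The Python line below makes that explicit by performing the full addition; a hardware implementation would instead use a carry-free zero detector on the redundant PRs/PRc pair, which this sketch (with my own naming) does not model.

```python
def sticky_from_residual(prs, prc, m):
    """T = 1 iff the true residual (PRs + PRc over an m-bit word) is
    non-zero, i.e. iff the operation did not come out exact."""
    return int((prs + prc) & ((1 << m) - 1) != 0)
```

Because the two carry-save words can be non-zero individually while their sum is zero, neither word can be tested on its own.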

3.4.6 Radix-4

[Figure 3.18 SRT Radix-4 Division using Layered Radix-2 Stages — the second-stage QSLCs and products are computed speculatively for each of the three possible first-stage outcomes, leaving a single radix-2 QSLC, plus MUXes, on the critical path.]

The largest delay in an SRT division algorithm is the quotient selection logic circuit (QSLC) and the fan-out of its output. While it is possible to compute at a higher radix, the QSLC becomes more complicated, and its delay and area increase [25]. Overlapping two lower radix stages allows the second QSLC to begin before the first has completed, essentially leaving only one radix-2 QSLC in the critical path (Figure 3.18). This is done by calculating the second-stage quotient products for each of the three possible products from the first stage before the correct first-stage product has been selected. Once the


first stage QSLC has completed and the correct product has been selected, the next parallel product generation can commence, and by its completion the second stage QSLC should also have completed. The circuits described for radix-2 parallel product generation remain unchanged at radix-4. While no extra logic may be required, the five most significant bits of each parallel product need to be available for output to the second level QSLC before the correct product has been selected.

3.4.7 Higher Radix

It is possible to compute at radix-8 using three overlapped radix-2 stages. Essentially this has a critical path of one QSLC, three 3:1 MUXes, and two parallel product generation stages. While this may be possible, as the SUN design shows [25], using standard cells on an AMI 0.5µ process the increased delay and area compared to a radix-4 design may be restrictive. It is also important to note that the radix-4 figure above does not show the extra delay of the first stage new input selection, the sticky bit generation, nor the partial quotient calculation. These matter because, while it may be possible to combine three stages in under 8ns, the extra delay may be problematic. Further, at any radix higher than radix-8, the additional parallel product generation delay becomes a limiting factor [23].

3.5 SRT Square Root

3.5.1 Introduction

SRT Square Root Operation Properties

  Property           Value
  Operand X Range    1 ≤ X < 2
  Operand D Range    N/A
  Result Q Range     1 ≤ Q < 2
  Result Sign        Sign X (should be positive from decoding)
  Result Exponent

Figure 3.19 Floating Point Square Root Operation Properties

The SRT approach can be adapted to handle the square root operation [11]. However, unlike division, where the integer SRT algorithm can handle the floating point numbers directly, the square root iteration has some minor differences [37]. These differences can be seen by comparing the formulas presented for division and square root, and noting the value of PR0 for each. For the most part, the division and square root algorithms can share an underlying design, and most approaches to speeding up SRT division are applicable to the square root operation. This section covers a base SRT square root implementation and the elements that make it different from division, such as the creation of a functional divisor.


3.5.2 Radix-2

PRi+1 = 2·PRi - qi+1·(2Qi + qi+1·2-(i+1)),  where i = 0, 1, 2, 3, ..., m
Q0 = 1, QM0 = 0, PR0 = (X - 1) for an even exponent
Q0 = 0, QM0 = 0, PR0 = (X/2) for an odd exponent

Figure 3.20 SRT Radix-2 Square Root Equations

The formula for SRT square root (Figure 3.20), unlike division, has four base factors, PRc, PRs, 2Q, and 2-(i+1), which can be reduced to three: PRc, PRs, and F(i) (Figure 3.23). This makes the approaches and concepts for reducing the delay of SRT square root architectures (Figure 3.21) similar to those for division (Figure 3.11). The only added complexity is the computation of F(i), which must occur before an iteration can start.

[Figure 3.21 SRT Radix-2 Square Root Overview — as Figure 3.11, with an F[i]/-F[i] generation block between the registers and the product generators.]

While for the positive product the divisor could be produced as -(F(i)), as is shown later it is simpler to keep it in the form F(i) (Figure 3.24). However, unlike division, where there is a single divisor D, the square root algorithm uses two: a separate one for each of the positive and negative products.


[Figure 3.22 SRT Radix-2 Square Root Circuit — the division cell of Figure 3.12 with the divisor inputs replaced by -F[i]j and +F[i]j.]

For floating point numbers in the range 1 ≤ X < 2, the square root operation produces a result guaranteed to be in the range 1 ≤ Q < 2, the first bit of which is produced at the initial stage. This means only 54 iterations are required: 52 for the remaining 52 bits, one for the round bit R, and one for a guard bit G to correct the last quotient bit produced.

3.5.3 Square Root Functional Divisor

The original square root formula (Figure 3.20) can be simplified to involve only three base factors instead of four (Figure 3.23), the new factor being a functional divisor F(i). Two different divisors, F(i) and -F(i), are required for the positive and negative products. Both versions of F(i) can be generated using the two different versions of the current quotient, Qi and QMi. From these elements, the four base factors of a square root operation, PRsi, PRci, 2Qi, and 2-(i+1), can be treated as three factors: PRsi, PRci, and ±F(i). The advantage of using three factors instead of four is that the underlying division architecture can be reused and a single level of full adders remains sufficient (Figure 3.12). In terms of implementation, this only requires a left shift of Q or QM and two OR operations with the iteration marker.


F(i) = 2Qi + 2-(i+1)  (subtracted when qi+1 = +1)
-F(i) = 2QMi + 3·2-(i+1)  (added when qi+1 = -1)

Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations
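The recurrence with the functional divisor can be exercised in the same exact-arithmetic style as the division sketch. This Python model is illustrative only (even-exponent case, my own naming): Fraction again replaces the carry-save residual, the comparisons against ±1/2 stand in for the shared QSLC, and the functional divisor is folded algebraically rather than built with the shift-and-OR circuit described above.

```python
from fractions import Fraction

def srt2_sqrt(x, bits=54):
    """Radix-2 SRT square root sketch for 1 <= x < 2 with an even exponent:
    Q0 = 1, PR0 = x - 1, and each digit q in {-1, 0, +1} updates the
    residual using the functional divisor 2*Q + q * 2**-i."""
    q, pr = Fraction(1), Fraction(x) - 1
    for i in range(1, bits + 1):
        two_pr = 2 * pr
        if two_pr >= Fraction(1, 2):
            digit = 1
        elif two_pr < Fraction(-1, 2):
            digit = -1
        else:
            digit = 0
        pr = two_pr - digit * (2 * q + digit * Fraction(1, 2 ** i))
        q += Fraction(digit, 2 ** i)
    return q
```

Since PR(i) = 2**i * (x - Q(i)**2) is maintained exactly, the returned Q squares back to x within the weight of the final residual; for perfect squares such as 25/16 the residual reaches zero and Q is exact.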

3.5.4 Integration with Division

Division:     PRi+1 = 2·PRi - qi+1·D
Square root:  PRi+1 = 2·PRi - qi+1·F(i),  with F(i) = 2Qi + qi+1·2-(i+1)

Figure 3.24 SRT Radix-2 Division and Square Root Equations

The division and square root operations present a great deal of similarity (Figure 3.24). Integrating the two functions into a single unit requires a mechanism for selecting the correct divisor (Figure 3.25). In division the divisor is a constant single factor; in a square root operation the divisor is the result of multiple changing factors (Figure 3.23). By constructing the square root functional divisors prior to their input into the division iteration, the four base factors are reduced to three, and the circuit only requires an extra MUX to choose the correct divisor for both the positive and negative products. The integrated square root and division architecture then has the critical path of the original division unit with the addition of a MUX and the F(i) generation (which is at most a three-input OR gate).


[Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit — the division cell of Figure 3.12 with MUXes, controlled by the operation, selecting between Dj and ±F[i]j as divisor inputs.]

3.6 Conclusion

Extending the LTH to provide most of the capabilities of a modern FPU requires the development of a double precision divider, multiplier, and square root unit. To meet the clock frequency of 100MHz using standard cells on a 0.5µ process, a focus must be placed on reducing delay. This can be done by using a radix-4 multiplier consisting of modified Booth-2 partial product generation, a full CSA tree, and a hybrid 106-bit CSEA. To accommodate the division and square root operations, a combined SRT radix-4 division and square root unit is proposed. While the design should meet the target frequency, it may come with a large area cost.


4 Implementation

4.1 Introduction

This section presents an overview of the HDL modeling, testing, and synthesis stages used to develop the multiplier and combined division and square root unit designs into a form ready for IC layout. Due to the complexity and large volume of code and scripts produced, only a high level overview of the important aspects of each design is presented. For a full code listing, please see "Code Listing".

4.2 HDL Modeling

4.2.1 Introduction

This section discusses the modeling of the designs presented in the last chapter using the Verilog-97 HDL. It elaborates on the choice of coding styles, ranging from direct ADK standard cell instantiation to synthesized alternatives. Further, the processes of macro and micro cell optimization are explored, and the reasons for performing these levels of modeling are investigated.

4.2.2 Multiplication

(Pipeline: Input → Stage 1: Booth Partial Product Generation → Stage 2: CSA Tree → Stage 3: 106-Bit CSEA → Stage 4: Normalization → Output)

Figure 4.1 Multiplier Pipeline

4.2.2.1 Partial Product Generation

The partial product generation step for multiplication is by all accounts one of the fastest stages [8] [11] [14]. A design from the literature was directly implemented [15], which consisted of two cells. The first cell was a Booth encoder that read three bits from one of the operands to determine the correct Booth multiple. The second cell, the Booth MUX, read in a single bit from the second operand and, based on the outputs from the Booth encoder, performed a shift or invert operation with its neighboring cells. As the design was already defined, the only addition to the schematics presented by Samsung was an inverter tree. The inverter tree was used to buffer the output of a Booth encoder and provide the signals to all associated Booth MUX for that partial product. Therefore, each partial product contained a Booth encoder, three inverter trees (one for each bit of the Booth encoder's output), and 54 Booth MUX (one for each possible bit of the partial product). These partial products were then grouped into a single module called "Booth_Partial_Products" that produced all the required partial products.
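The Booth-2 recoding that the encoder cells implement can be sketched as follows. This is a hedged illustration: the function names are invented, and the two's-complement edge cases of the real 54-bit datapath are ignored.

```python
def booth2_digits(y, n_bits):
    """Recode an n_bits-bit multiplier y into radix-4 digits from
    {-2,-1,0,1,2} by scanning overlapping 3-bit groups
    (bits 2i+1, 2i, 2i-1, with an implied zero below bit 0)."""
    digits = []
    prev = 0                                # the implied y[-1] = 0
    for i in range(0, n_bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # value of the triple
        prev = b1
    return digits

def booth2_multiply(x, y, n_bits):
    """Sum of the shifted partial products x * d * 4**i. For a non-negative
    y whose top bit is clear inside n_bits this equals x * y, using only
    about half as many partial products as bit-by-bit multiplication."""
    return sum((d * x) << (2 * i)
               for i, d in enumerate(booth2_digits(y, n_bits)))
```

Halving the partial product count (27 rows for a 54-bit operand) is what makes the downstream CSA tree tractable.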

4.2.2.2 CSA Tree

The CSA tree was the most complex part of this thesis. It involved designing a range of counters from 27:2 to 4:2 using full adders, connecting them in such a way that the critical path was only seven adders, and retrieving columns of bits from the rows of partial products produced in the last stage for each counter's input. The first step was to build the required counters. The Samsung design [15] specifies carry inputs and outputs for each counter on all seven levels. For each carry signal on a level, there must also be an associated counter. Using this as a guide, all 23 counters could be constructed, with the only optimization being that port zero of the initial partial product column input for a particular counter would be used as late as possible. It was then observed, upon synthesizing the HDL, that the produced netlist did not contain any full adder cells (ADK "Fadd1"). Since the design implied these cells were to be used, a style of coding was adopted that directly instantiated the ADK standard cells. As a result, the Verilog code is no longer portable outside of an ADK environment. Once all counters were created, the CSA tree needed to be constructed. This entailed placing 106 counters, ranging from 27:2 down to half adder cells, in such a way that the carry signals propagated correctly and the partial product rows were correctly aligned into columns for counter input. By hand, this at first seemed an impossible task due to the overwhelmingly vast number of ports and nets used. Opting for a more automated approach, a PERL script was used. This resulted in an approximately 1800-line computer-generated module that required only minor changes.
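The counters in the tree are ultimately built from 3:2 compression, whose principle can be sketched in a few lines. This is an illustrative model only — the names are invented, and no attempt is made to reproduce the 27:2 to 4:2 counter structure or the seven-adder critical path.

```python
def csa(a, b, c):
    """3:2 carry-save compressor over integers: sum + carry == a + b + c,
    with each output bit computed independently (no carry ripple)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_to_two(rows):
    """Wallace-style reduction: push triples of rows through CSAs until only
    two rows remain; a single carry-propagate add then finishes the sum."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        while len(rows) >= 3:
            s, c = csa(rows.pop(), rows.pop(), rows.pop())
            nxt += [s, c]
        rows = nxt + rows        # at most two leftover rows ride along
    return rows
```

The point of the structure is that each reduction level costs one full-adder delay regardless of word width, deferring all carry propagation to the final CPA.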

4.2.2.3 106-Bit Wide Adder

In some texts, a fully synthesized design [8] is suggested to provide sufficient results for the final multiplication CPA. This proposal resulted in a module with a single line of HDL in its body. While initial timing and area reports were promising, the goal of 100MHz could not be met with this design. Using a hybrid CSEA design [17] made from 16-bit CLA blocks [12], a better result was obtained. However, the delay achieved was very close to the acceptable cut-off limit of 8ns. Since this design was explicitly described in structural Verilog, a version with directly instantiated ADK standard cells was also created. This proved to be the better candidate.
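The carry-select principle behind the hybrid CSEA can be sketched as follows. This is a hedged model in which Python integer addition stands in for each 16-bit CLA block; the names and the exact block partitioning are chosen for illustration, not taken from the thesis netlist.

```python
def carry_select_add(a, b, width=106, block=16):
    """Carry-select addition: every block computes two candidate sums
    (carry-in 0 and carry-in 1) in parallel; the incoming carry then merely
    selects one, so the critical path is one block add plus a MUX chain."""
    mask = (1 << block) - 1
    out, carry, shift = 0, 0, 0
    for _ in range(0, width, block):
        x, y = a & mask, b & mask
        s0 = x + y            # candidate assuming carry-in 0
        s1 = x + y + 1        # candidate assuming carry-in 1
        s = s1 if carry else s0
        out |= (s & mask) << shift
        carry = s >> block
        a >>= block
        b >>= block
        shift += block
    return out | (carry << shift)
```

In hardware the two candidates exist simultaneously, which is where the speed over a plain ripple structure (and the roughly doubled adder area) comes from.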

4.2.2.4 Normalization

The normalization section for the multiplier was relatively trivial. It consists of a MUX and several OR gates. The MUX is used to left shift the required bits if needed, and the OR gates are used to detect whether any of the discarded bits are ones.
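As an illustrative model of this stage (the name is invented, and the round bit and exact bit indexing of the thesis datapath are simplified away into the sticky flag):

```python
def normalize_product(p):
    """Normalize the positive 106-bit product of two 53-bit significands
    (a value in [1, 4) with 104 fraction bits): select the 53-bit
    significand, report the exponent increment, and OR the discarded low
    bits into a sticky flag."""
    if p >> 105:                                         # product in [2, 4)
        return p >> 53, 1, (p & ((1 << 53) - 1)) != 0
    return p >> 52, 0, (p & ((1 << 52) - 1)) != 0        # product in [1, 2)
```

The sticky OR is what later lets the rounding logic distinguish an exact half-way case from an inexact one.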


4.2.3 Division and Square Root

(Pipeline: Input → Input Selection → Div4_Iteration [first level QSLC → square root functional divisor generation → OTFC Q and QM calculation → parallel product generation; second level QSLC → square root functional divisor generation → OTFC Q and QM calculation → parallel product generation; sticky bit calculation] → Normalization Stage → Output)

Figure 4.2 SRT Radix-4 Division and Square Root Pipeline

4.2.3.1 QSLC

The quotient selection logic is one of the most critical sections of the SRT design. SUN [25] and Parhami [8] suggest either using a 4-bit fast adder with some combinational logic, or implementing it as a ROM. Both avenues were explored, though there was some hesitation with the second [35]. The first approach called for a fast 4-bit adder. The fastest adder suitable for this word size was taken to be the CLA [12], and as it is only four bits wide, the increased area used by this style of adder was assumed not to be a problem. The second approach was the ROM version. When testing the original CLA in ModelSim, a test bench was created that contained a simple behavioral QSLC. By modifying this to output the resulting quotient bit for all variations of the most significant four bits, with an extra internal condition added for the case where the result was all zeros, a Verilog module was created. This module has only a single CASE statement with 256 entries, which Leonardo Spectrum recognized as a ROM.
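The ROM construction can be sketched as follows. This is a hedged reinterpretation, not the thesis test bench: the digit rule and the wrap-around of the truncated 4-bit add reflect one standard radix-2 SRT reading, and all names are invented. It does, however, reproduce the 256-entry table shape.

```python
def qslc_digit(est_halves):
    """Radix-2 digit from a truncated estimate of 2w, in units of 1/2.
    The truncation is a lower bound with error under 1/2, so an estimate
    >= 0 picks +1, exactly -1/2 picks 0, and anything lower picks -1."""
    if est_halves >= 0:
        return 1
    return 0 if est_halves == -1 else -1

def build_qslc_rom(bits=4):
    """Enumerate every pair of 4-bit sum/carry slices of the carry-save
    remainder, form their truncated (wrap-around) total, and store the
    selected digit -- a 256-entry table, like the single CASE statement."""
    size = 1 << bits
    rom = {}
    for s in range(size):
        for c in range(size):
            total = (s + c) & (size - 1)                  # truncated add
            signed = total - size if total >= size // 2 else total
            rom[(s, c)] = qslc_digit(signed)
    return rom
```

Precomputing every address is exactly what turns the "fast adder plus logic" path into a flat lookup with no carry chain in the critical loop.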

4.2.3.2 Parallel Product Generation

The parallel product generation essentially performs all three variations of the formula described in the last chapter for division and square root at the same time (Figure 3.24). The first target was to transfer the designs presented in the integrated division and square root section directly into an initial design. This was done by defining a basic cell for a single bit that produced a single-bit result for all three variations of "qi+1" and had a final MUX to select the correct one. This resulted in a cell with a large number of ports requiring connection to at least 58 other cells. Moving directly to an automated approach, a PERL script was created that would instantiate all the required bit cells and provide two inverter trees to buffer the input bits from the QSLC. In the first version, ADK cells were directly used. However, the parallel product cell's internal logic presented itself as slightly more random than that of the CSA tree. By creating a second version of the basic fraction bit cell, which used Verilog primitives and synthesizable full adder cells, as in the last section, an alternative was produced. Initial timing reports were promising. However, it is important to remember that the timing report tool assumed all inputs would be available at time zero. This is not the case, as the new quotient bits arrive late: they must first be generated and propagated. The solution was to synthesize the full adders and provide the produced netlist as an input to a parallel product bit cell synthesis run with the hierarchy set to "PRESERVE". The final step was to create a netlist for all bits using the synthesized bit cell. This could be described as macro optimization of micro optimized cells. By noticing that the synthesis tool did not modify the types of gates used when ADK cells were directly called in a module, a twofold approach was developed. The first step was to synthesize the bit cells with the behavioral full adders. The second step was to run the synthesis tool again, providing it with the full 59-bit parallel product generation module and the synthesized netlist as inputs. The result is that synthesized logic is confined to a single micro cell, and the added delay induced by random logic wiring irregularity is prevented from propagating to the macro cell.

4.2.3.3 On-The-Fly Conversion

On-the-fly conversion is used to update the quotient at the end of each iteration. Its complexity derives from the fact that it must handle negative quotient bits. In the last chapter, a high level design was shown for the updating of both Q and QM (Figure 3.14). This led to a basic module for a single bit consisting of several MUX levels, which was expanded and replaced by two levels of NAND gates. These cells were optimized in the same way as the parallel product generator: they were synthesized at the cell level, linked together using a PERL script, and synthesized again together using the macro cell and the synthesized micro cell netlists.
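The update rules can be modelled in a few lines (illustrative names; integer-scaled values stand in for the Q and QM registers). The key property is that appending a digit, including a negative one, never triggers a carry ripple: each step is only a register select plus an OR of the new bit.

```python
def otfc_update(q, qm, digit, ulp):
    """One on-the-fly conversion step for a radix-2 digit in {-1, 0, 1}
    appended at integer weight `ulp` (registers pre-scaled to integers).
    Keeps the invariant qm == q - ulp after the update, so a later -1
    digit can fall back to QM instead of subtracting from Q."""
    if digit == 1:
        return q | ulp, q        # Q -> Q.1, QM -> Q.0
    if digit == 0:
        return q, qm | ulp       # Q -> Q.0, QM -> QM.1
    return qm | ulp, qm          # Q -> QM.1, QM -> QM.0
```

For example, the digit sequence +1, −1, 0 starting from Q = 1.0 yields 1 + 1/2 − 1/4 + 0/8 = 1.25 without any subtraction being performed.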

4.2.3.4 Radix-2 Iteration

This was one of the easiest modules to design. Its purpose is to perform a single radix-2 iteration of a division or square root operation on the fractional component of the operands. Essentially it uses only four sub-modules: a QSLC, a parallel fraction product generator, on-the-fly conversion, and a square root functional divisor module. The inputs and outputs were all clearly defined by the internals. While not used in the final design, it was an invaluable tool for testing, since with a higher radix the outputs are the result of multiple iterations, which may make it unclear where a problem has occurred.


4.2.3.5 Radix-4 Iteration

This module serves the same purpose as the radix-2 version. The difference is that it implements two layered radix-2 iterations with overlapped QSLC. It uses slightly more than double the gates, as it requires a total of four QSLC blocks instead of one.

4.2.3.6 Normalization

The normalization stage for division is much simpler than that of multiplication. It takes in a 57-bit quotient and returns a 54-bit result: 53 bits for the quotient and an extra round bit R. For a square root operation, the required result is the first 54 bits, whereas for division it is 54 bits starting from the second or third MSB. Essentially this can be done using three levels of MUX.

4.3 Synthesis

4.3.1 Configuration

Mentor Graphics Leonardo Spectrum Synthesis Level 3 Configuration

| Property | Setting |
|---|---|
| Temperature | 27 |
| Voltage | 5 |
| Max Fan-out Load | 5 |
| Register 2 Register | 10ns (Data arrival time 8.0ns) |
| Input 2 Register | 10ns |
| Register 2 Output | 10ns |
| Optimization | Timing with Standard Effort (maximum possible) |

Figure 4.3 Leonardo Spectrum Settings

Leonardo Spectrum was used to generate netlists for behavioral HDL modules and to flatten structural hierarchies in preparation for layout. Further, it was also utilized to provide timing and area reports, and to verify a fan-out of no more than five for all key modules. While for the most part the HDL was designed in a structural form, the buffering and stage connections were done using a high level form of Verilog-97. The most important elements of the above configuration are the timing settings. Any data arrival times produced by Leonardo are only a sum of the delay through each component; they do not include wiring delays. As a result, any pre-layout timing will be increased post-layout. To account for this, for a cycle time of 10ns, approximately 2ns must be set aside. For any design unit, a synthesized register to register data arrival time of 8ns is assumed to indicate that it can meet a cycle time of 10ns post-layout.

4.3.2 Automation and Reporting

The Mentor Graphics synthesis tool, Leonardo Spectrum, was found to be highly unresponsive when run in GUI mode. In comparison, the same tool was more reliable when used under a terminal environment. This led to a series of scripts for generating the netlists and producing timing and area reports.

4.3.3 IC Preparation

When creating an ADK netlist by flattening or synthesizing a design, the output from Leonardo may be incompatible with the IC layout tool. This occurs in two areas: the first being signal assignment, and the second being power supply component problems. It is unclear why either of these issues occurs, but the workaround is to manually alter the produced netlist (Figure 4.4).

Preparing a Netlist for IC Layout

| Synthesis Result | IC Layout Compatible Verilog |
|---|---|
| Fake_gnd gFgXX(.Y(wGnd)); | Supply0 wGND; // replace fake ground |
| Fake_vcc gFvccXX(.Y(wVdd)); | Supply1 wVcc; // replace fake power |
| Assign j = k; | Buf02 gJKXX( .Y(j) , .A(k) ); // replace assign statements with the fastest ADK buffer |

Figure 4.4 Preparing a Netlist for IC Layout

After synthesis and some fine tuning of the produced netlist, the next step was to create a layout. This is done in two stages: the first is to import the netlist into the schematic editor to generate a technology viewport; the second uses this viewport to generate the layout. By synthesizing each pipeline stage, and feeding the produced netlists into the synthesis run of the macro cell, where essentially all pipeline stages are instantiated with the addition of registers, a generic floor plan can be created. For example, when synthesizing the multiplier, the netlists for the Booth partial products, CSA tree, and CPA are required. For this to produce the correct result, the synthesis run should have its hierarchy set to preserve. In this way, either a semi-automated or a fully automated IC layout is possible.

4.4 Conclusion

The implementation of division, multiplication, and square root units is a complicated task. By designing the units in a hierarchical manner, where each micro or leaf cell is modeled using both ADK standard cells and Verilog primitives, a variety of options becomes available for the final implementations. To handle the large number of interconnects required, especially for the CSA tree, an automated approach using PERL was adopted. Scripting was further embraced to design the QSLC, where a Verilog test bench for the CLA QSLC was modified to generate a full ROM QSLC.


5 Results

5.1 Introduction

This section presents the results of the HDL design in terms of area, timing, and IC layouts; discusses the reasons for the achieved area and timing of both the multiplier and the combined division and square root unit; and elaborates on why integrating them into the original LTH was not possible. Finally, a comparison between the projected performance of the designed architecture and industry implementations is presented.

5.2 Timing and Area Analysis

5.2.1 LTH

LTH Pre-Layout Data Arrival Times (ns)

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Decode | 8.65 | 4.64 | 3.13 |
| Pre-normalization | 23.72 | 12.59 | 9.94 |
| Add/Sub | 26.66 | 17.68 | 11.66 |
| Post-normalization | 30.37 | 15.98 | 11.23 |
| Round-normalization | 23.01 | 10.76 | 9.87 |
| LTH (all stages and buffers) | 36.87 | 24.73 | 16.78 |

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.

Figure 5.1 Original LTH Timing Analysis

LTH Area by Gate Count

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Decode | 631 | 615 | 582 |
| Pre-normalization | 1947 | 1771 | 1720 |
| Add/Sub | 5407 | 3975 | 4213 |
| Post-normalization | 2582 | 2384 | 2654 |
| Round-normalization | 1147 | 1049 | 1018 |
| LTH (all stages and buffering) | 11547 | 10200 | 9859 |

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.

Figure 5.2 Original LTH Area Analysis

The above is a timing and area analysis of the original LTH FPU HDL separated into pipeline stages. By the author's own admission [3], it is not optimized. The HDL is coded in a very high level of behavioral VHDL, and as a result the timing is much higher than desired. Even using the maximum possible computational effort of the synthesis tool, the original LTH fails to meet the required cycle time of 10ns, which becomes worse considering this is pre-layout and does not include any allowance for wiring delay incurred during layout. While it was possible to generate the above tables with apparently borderline results, the time taken often exceeded a week for synthesizing a single stage for a single variation of the AMI 0.5 micron process. The situation is worsened by most stages requiring additional logic, which would add extra delay and more than likely increase synthesis time. This begs the question of whether the task is one of integrating or of completely redesigning the LTH. One possible solution to the large delay would be to split each of the pipeline stages that exceed the 8.5ns for data arrival. For example, the three internal logic operations of the "Add/Sub" stage could be split into three stages, namely a pre-shift, an addition using the multiplier's CSEA, and a result conversion. Since several stages require a similar form of pipeline stage splitting, this method would move the number of stages for addition or subtraction operations closer to 10. This latency does not compare well against other FPU implementations [14], which normally perform these operations in fewer than five cycles. The problem is magnified by the fact that the Leon2 stalls until a single FPU operation has completed, so extra stages in the pipeline have significant effects on overall system performance.

5.2.2 Multiplier

Multiplier Pipeline Pre-Layout Data Arrival Times (ns)

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Partial Product Generation: | | | |
| Booth-2 Product Generation | 6.26 | 3.59 | 2.31 |
| Carry Save Adder Tree: | | | |
| CSA Tree (ADK Std. Cells) | 21.06 | 11.68 | 7.34 |
| *Synthesized CSA (Area) | 13.74 | 7.88 | 5.03 |
| *Synthesized CSA (Delay) | 15.64 | 7.70 | 5.01 |
| 106-Bit Final CPA: | | | |
| 106-Bit CSEA (ADK Std. Cells) | 18.75 | 10.17 | 8.02 |
| *Synthesized CSEA (Cells Preserved) | 28.22 | 12.60 | 10.04 |
| *Synthesized 106-Bit Adder | 48.67 | 20.77 | 11.84 |
| Normalization: | | | |
| Normalization (Fraction Adjustment) | 4.56 | 2.66 | 1.73 |
| Multiplier (Includes: Buffers and Exponent): | | | |
| Radix-4 Multiplier | 24.04 | 13.52 | 8.91 |

Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only. "Area" and "Delay" refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.

Figure 5.3 Multiplier Timing Analysis

Multiplier Pipeline Area by Gate Count

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Partial Product Generation: | | | |
| Booth-2 Product Generation | 22946 | 22946 | 22946 |
| Carry Save Adder Tree: | | | |
| CSA Tree (ADK Std. Cells) | 7719 | 6415 | 6415 |
| *Synthesized CSA (Area) | 8202 | 7998 | 7997 |
| *Synthesized CSA (Delay) | 15303 | 13875 | 14397 |
| 106-Bit Final CPA: | | | |
| 106-Bit CSEA (ADK Std. Cells) | 3128 | 2963 | 2931 |
| *Synthesized CSEA (Cells Preserved) | 3074 | 3260 | 3126 |
| *Synthesized 106-Bit Adder | 2445 | 1588 | 1930 |
| Normalization: | | | |
| Normalization (Fraction Adjustment) | 188 | 185 | 185 |
| Multiplier (Includes: Buffers and Exponent): | | | |
| Radix-4 Multiplier | 48908 | 47131 | 47132 |

Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only. "Area" and "Delay" refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.

Figure 5.4 Multiplier Area Analysis

Generating the partial products is one of the fastest operations in the multiplication process, as is shown by the timing analysis for "Booth-2 Product Generation" (Figure 5.3). It is also one of the largest in terms of gate count. This can be explained by the large number of Booth MUX required (1458), with one needed for each bit of each partial product (27 products, each 54 bits long). Further, connecting a Booth encoder to each Booth MUX in a Booth product required buffering to solve the fan-out problem. The solution was to use an inverter tree, with three of these required for each partial product, and a total of 81 for all the partial products. At first it may appear that this stage could include more logic levels. However, the extra slack is beneficial for placing the input registers for the next stage in a more optimal position for the CSA tree. Adding the generated partial products using a full binary tree of full adder cells achieved a balanced result for both area and timing. This stage, whether implemented using full adders or 4:2 counters, requires a significant number of adder cells, with long carry propagation paths. As can be seen from the timing analysis, the CSEA is actually slower than the CSA, meaning that a satisfactory delay was achieved for the adder tree using ADK Fadd1 cells. While it is clear that a 4:2 counter approach (Figure B.1) would offer a significantly shorter delay, it would also come at a vastly increased gate count (Figure B.2), with no overall increase in performance. Further, the synthesized options, while apparently faster, may have a significantly worse critical path post-layout than the one using Fadd1 cells. Thus, while 4:2 counter and fully synthesized CSA trees offer increased pre-layout performance, the gain is either irrelevant or unlikely to be realized post-layout, and comes at a cost of significantly more gates.
For the final stage, it was initially hoped that a synthesized 106-bit CPA might provide satisfactory performance in terms of both gate count and delay. As the above timing and area analysis shows, the custom ADK standard cell based CSEA is around 33% faster than the fully synthesized 106-bit adder option on a fast process, but at a cost of 50% more gates. The synthesized version of the CSEA provides no advantage in either delay or area. To meet the target of a 10ns cycle time, or approximately 8ns pre-layout, the only viable option is to use the ADK standard cell version of the CSEA on a fast process.


5.2.3 Combined Division and Square Root

Combined Divide Square Root Pre-Layout Data Arrival Times (ns)

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Quotient Generation (Q and QM): | | | |
| *OTFC (ADK Std. Cells) | 4.84 | 2.79 | 1.79 |
| OTFC (Syn.) | 4.15 | 2.49 | 1.61 |
| Quotient Selection Logic (QSLC): | | | |
| *QSLC (CLA) | 8.26 | 4.61 | 2.91 |
| QSLC (ROM) | 3.26 | 1.83 | 1.15 |
| Square Root Functional Divisor (F(i) and –F(i)): | | | |
| Square Root Functional Divisor | 0.70 | 0.42 | 0.28 |
| Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX): | | | |
| *Bit Slice (ADK Std. Cells) | 4.97 | 2.81 | 1.77 |
| *Bit Slice (Syn.) | 3.77 | 2.01 | 1.34 |
| Bit Slice (Micro Syn.) | 4.30 | 2.43 | 1.54 |
| Sticky Bit Generation | 4.19 | 2.35 | 1.55 |
| Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC): | | | |
| *Radix-2 Iteration (ADK Std. Cells) | 10.68 | 6.06 | 3.92 |
| *Radix-2 Iteration (Syn.) | 9.30 | 6.19 | 4.02 |
| *Radix-2 Iteration (Micro Syn.) | 9.64 | 5.89 | 3.83 |
| Radix-4 Input Selection (Syn.) | 4.39 | 3.09 | 2.09 |
| *Radix-4 Iteration (ADK Std. Cells) | 17.51 | 9.95 | 6.57 |
| *Radix-4 Iteration (Syn. Product) | 14.01 | 9.90 | 6.41 |
| Radix-4 Iteration (Micro Syn.) | 16.03 | 9.39 | 6.06 |
| Combined Division and Square Root Unit (Including Buffers and Exponent): | | | |
| SRT Divider Radix-4 | 22.08 | 12.92 | 8.93 |

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also includes sticky bit generation.

Figure 5.5 Combined Division and Square Root Timing Analysis

Combined Divide Square Root Area by Gate Count

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Quotient Generation (Q and QM): | | | |
| *OTFC (ADK Std. Cells) | 856 | 856 | 856 |
| OTFC (Syn.) | 779 | 652 | 652 |
| Quotient Selection Logic (QSLC): | | | |
| *QSLC (CLA) | 46 | 46 | 46 |
| QSLC (ROM) | 33 | 41 | 42 |
| Square Root Functional Divisor (F(i) and –F(i)): | | | |
| Square Root Functional Divisor | 251 | 251 | 250 |
| Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX): | | | |
| *Bit Slice (ADK Std. Cells) | 34 | 34 | 34 |
| *Bit Slice (Syn.) | 38 | 33 | 37 |
| Bit Slice (Micro Syn.) | 29 | 29 | 29 |
| Sticky Bit Generation | 423 | 420 | 417 |
| Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC): | | | |
| *Radix-2 Iteration (ADK Std. Cells) | 3156 | 3155 | 3155 |
| *Radix-2 Iteration (Syn.) | 3125 | 3124 | 3124 |
| Radix-2 Iteration (Micro Syn.) | 2872 | 2875 | 2875 |
| Radix-4 Input Selection (Syn.) | 1057 | 955 | 955 |
| *Radix-4 Iteration (ADK Std. Cells) | 7046 | 6987 | 6898 |
| *Radix-4 Iteration (Syn.) | 7075 | 6969 | 6969 |
| Radix-4 Iteration (Micro Syn.) | 6714 | 6530 | 6532 |
| Combined Division and Square Root Unit (Including Buffers and Exponent): | | | |
| SRT Divider Radix-4 | 12418 | 12524 | 12419 |

Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also includes sticky bit generation.

Figure 5.6 Combined Division and Square Root Area Analysis

The two key areas of the combined division and square root unit are the QSLC and the layering of iterations. The fast CLA based QSLC was larger (Figure 5.6) and slower (Figure 5.5) than the synthesized ROM QSLC. Also from the timing analysis, it is possible to see the benefits of overlapping QSLC, where two iterations are performed with a data arrival time significantly less than twice what would be expected by layering two non-overlapped radix-2 iterations. The second iteration was further helped by micro synthesis, where the parallel product effectively used bit slices containing several smaller synthesized cells ("Micro Syn."), instead of a completely synthesized bit cell ("Syn."). Surprisingly, each of these bit cells used two synthesized full adders (Figure B.2), which would be expected to make the cell larger than the ADK version, yet it was both faster and smaller.

5.2.4 Exponent

Important Exponent Component Pre-Layout Data Arrival Times (ns)

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Addition or Subtraction of Exponents: | | | |
| *CLA 12-Bit (ADK Std. Cells) | 11.17 | 6.52 | 4.12 |
| CLA 12-Bit (Syn.) | 9.67 | 5.73 | 3.63 |
| Decrementing Exponent: | | | |
| *Dec 12-Bit (ADK Std. Cells) | 6.55 | 3.26 | 2.06 |
| Dec 12-Bit (Syn.) | 4.79 | 2.74 | 1.99 |
| Incrementing Exponent: | | | |
| *Inc 12-Bit (ADK Std. Cells) | 5.14 | 2.96 | 1.92 |
| Inc 12-Bit (Syn.) | 3.73 | 2.25 | 1.45 |

*Unused Cells or Provided for Information Purposes only.

Figure 5.7 Exponent Component Timing Analysis

Important Exponent Component Area by Gate Count

| Module | AMI 0.5µ Slow | AMI 0.5µ Typical | AMI 0.5µ Fast |
|---|---|---|---|
| Addition or Subtraction of Exponents: | | | |
| *CLA 12-Bit (ADK Std. Cells) | 157 | 150 | 50 |
| CLA 12-Bit (Syn.) | 104 | 96 | 95 |
| Decrementing Exponent: | | | |
| *Dec 12-Bit (ADK Std. Cells) | 46 | 42 | 42 |
| Dec 12-Bit (Syn.) | 49 | 49 | 47 |
| Incrementing Exponent: | | | |
| *Inc 12-Bit (ADK Std. Cells) | 47 | 46 | 45 |
| Inc 12-Bit (Syn.) | 46 | 45 | 45 |

*Unused Cells or Provided for Information Purposes only.

Figure 5.8 Exponent Component Area Analysis

The exponent datapaths for both the multiplier and the combined division and square root unit have to this point not been discussed in any detail. While their design is trivial in comparison to the fraction datapath for multiplication, division, and square root operations, it is important to understand the associated timing (Figure 5.7) and area (Figure 5.8) that they add to an FPU. Essentially, three different operations are performed on the operands' exponents: addition, increment, and decrement. While the exponent field of IEEE-754 operands is only 11 bits, an extra 12th bit of precision is required for overflow and underflow detection.
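A behavioural sketch of the multiply exponent path illustrates why extra precision is needed; the names are invented and the 12-bit hardware encoding is abstracted to plain integers.

```python
DP_BIAS = 1023   # IEEE-754 double precision exponent bias

def multiply_exponent(ea, eb, bias=DP_BIAS):
    """Biased result exponent for a multiply: ea + eb - bias. Computing it
    with precision beyond the 11-bit field (whose finite biased range is
    1..2046) makes overflow and underflow directly visible before rounding;
    division would use ea - eb + bias instead."""
    e = ea + eb - bias
    return e, e > 2046, e < 1   # (exponent, overflow, underflow)
```

The sum of two maximal biased exponents reaches 3069 and the smallest falls to −1021, both outside the 11-bit field, which is why the datapath carries guard precision.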


5.3 IC Layout

5.3.1 Multiplier

Figure 5.9 Multiplier IC Layout

When transferring the multiplier's netlist into an IC layout (Figure 5.9), more standard cell limitations were discovered. For example, a four-port OR gate was missing from the Design Architect installation. It was not possible to avoid using this cell in a similar way as with the "AND04", where the synthesized CPA was used instead of the faster ADK CSEA; the workaround was to use two NOR gates and a NAND gate. Similar occurrences happened with other ADK standard cells. At this stage it was realized that any IC layout would not accurately reflect the merits of the HDL model, and with limited time, a fully automated place and route was generated. That being said, placing the Booth MUX closer to their respective connections in the CSA tree may have been very difficult with a full SDL design using a manual layout process.


5.3.2 Combined Division and Square Root

Figure 5.10 Combined Division and Square Root IC Layout

Unlike the multiplier, the layout for the division and square root unit is more regular, which should make a non-automated IC layout much simpler. This can be seen from the regularity of the integrated division and square root cell (Figure 3.25) in comparison to that of the CSA tree counters [15]. While the netlist produced from the HDL for the division unit contained no missing cells, time constraints limited any IC layout to a fully automated place and route (Figure 5.10). The internal wiring and cell placement is noticeably more uniform than that of the multiplier.


5.4 Performance

5.4.1 Pre-Layout

Pre-Layout Clock Frequency for FPU Components on an AMI 0.5µ Fast Process

| Component | Pre-Layout Clock Frequency (MHz) |
|---|---|
| LTH FPU (Original) | 58.8 |
| Multiplier | 109.3 |
| Multiplier (Using Synthesized CPA for IC Layout) | 102.7 |
| Combined Division and Square Root Unit | 108.9 |

Figure 5.11 Pre-Layout Clock Frequency

The maximum possible clock frequency of an FPU containing the new components is 102.7MHz (Figure 5.11). Post-layout, this figure is likely to decrease as wiring delays are incorporated. It does, however, illustrate that the newly designed components have achieved a pre-layout frequency almost double that of the original LTH. This figure is more interesting when it is considered that the speed of the multiplier was sacrificed for software compatibility issues, as a slower synthesized CPA was used for IC layouts. Alternatively, assuming there were no standard cell compatibility issues, the maximum clock rate would be 108.9MHz. Allowing roughly 10% added wiring delay results in a projected post-layout clock frequency of 98.01MHz, or approximately 100MHz for the optimal design.

5.4.2 Industry Comparison

1994-95 0.5µ Processor Latency and Clock Frequency Comparison

| Processor | Multiplication Latency | Division Latency | Square Root Latency | Clock Rate (MHz) |
|---|---|---|---|---|
| Intel P6 | 5 | 38 | 38 | 150 |
| MIPS R10000 | 2 | 19 | 19 | 200 |
| Sun UltraSPARC | 3 | 22 | 22 | 167 |
| DEC 21164 | 4 | 23 | 23 | 200 |
| AMD K5 | 7 | 39 | 39 | 75 |
| PA8000 | 3 | 31 | 31 | 200 |
| LTH Extended | 6 (4+2LTHL) | 30 (28+2LTHL) | 31 (29+2LTHL) | 100 |

All operation latencies are assumed to include FPU overhead (e.g. decoding, rounding, and operand packaging). LTHL = LTH FPU operation overhead latency (this includes decoding and rounding stages). Statistics provided by Flynn and Obermann [14].

Figure 5.12 Processor Latency and Frequency Comparison

In a comparison of 0.5µ microprocessors released during 1994-95 (Figure 5.12), the new components for the LTH performed quite well. While all statistics for the new design are projected, as the LTH was not updated, and the clock frequency is based on pre-layout data arrival times, the extended LTH is not the slowest architecture, even if the post-layout data arrival times turn out 25% worse than pre-layout. However, the quoted clock frequencies of industry implementations may not reflect the operating clock of the respective FPU, but rather the operating frequency of its CPU.


Operation latency and clock frequency alone, however, are not adequate metrics for an architecture design. The area, feature size, and power consumed are also important [14]. This places a restriction on measuring the benefits of the new modules. As the existing LTH was not updated due to its own limitations, the FPU area could not be approximated, and a clock frequency could only be related to the critical path of the new units (Figure 5.11).

5.5 Conclusion

For the multiplier and the combined division square root unit, both a structural form of Verilog with direct ADK instantiation and a synthesizable version were implemented. This led to a variety of options for creating the final layouts. The ADK-based models provided a satisfactory delay with a low gate count, while the synthesized modules provided, in general, a vastly decreased pre-layout delay at an increased gate count. When selecting the optimal solution for the multiplier, the decisions were determined by the delay of the final CPA. While various solutions were available for the other stages, the ones with the lowest gate counts were selected: no matter how fast the other stages may be, the cycle time is only altered by the slowest part in the pipeline, which is the CPA, and there is no need to add extra gates for no improvement in performance.

In the division and square root unit, the use of synthesized components was increased. The original CLA-based QSLC provided the expected delay; the synthesized ROM version, however, was significantly faster and used fewer gates. For the parallel product generation, which included buffering of the QSLC output, a synthesized cell was also appropriate. By confining the results of the synthesis tool to a small area, the random logic was not expected to produce a large increase in post-layout data arrival times.

While a variety of options were investigated for both the multiplier and division units, the LTH proved elusive. Working with the original LTH was problematic as a result of unsatisfactory synthesis results, where the components produced could not meet the required register-to-register delays. Nearly all stages required some optimization, which would extend their critical paths and produce worse data arrival times. This leaves two solutions: the first is to completely redesign the LTH, and the second is to use a different open source FPU, such as that provided by SUN Microsystems [7].
For the implemented units, an IC layout was constructed using a fully automated process. The results were for illustration only, as many sacrifices had to be made on the original design due to software compatibility issues. Thus, any post-layout timings would not accurately reflect the true merits of the architecture, and performance evaluation must be based on pre-layout data arrival times.


6 Conclusion

6.1 Introduction

This section provides an overview of what was achieved, what remains to be done, and the future direction of extending the FPU. It discusses each of the implemented functional units in terms of design, area and timing, and layout. The directions in which the original LTH could be extended, and the new units improved to further increase performance, are also elaborated. Finally, there is some discussion of the overall achievements of this project and directions for future implementations.

6.2 Achievements

6.2.1 Design

A double precision floating-point multiplier was designed with four pipelined stages. The first was a Booth-2 radix-4 partial product generator that produced 27 54-bit partial products. The second stage was a full CSA tree, where each column of the partial products was an input into a counter. The third stage was a CPA, constructed from 15-bit CLA blocks to form a 106-bit CSEA. The fourth and final stage was normalization, where the produced result was prepared for rounding.

The division approach used was a radix-2 SRT subtractive algorithm using a modified QSLC from SUN Microsystems. It consisted of two pipeline stages: a recursively called division stage and a normalization stage. The first stage used two layered radix-2 iterations with overlapped parallel quotient selection, which permitted the calculation of a double precision result in 27 cycles. The square root operation was also incorporated into the division unit. Like the CSA from the multiplier, the design was realized through a bit-slice implementation. However, unlike the multiplier, the delay of the SRT division and square root unit was never a concern, as the number of SRT iterations could be scaled.

Both the multiplication and division units were verified against a set of over 1000 random test vectors. The square root unit was also subjected to a similar number of vectors; however, its results had to be checked by hand.

6.2.2 Implementation

Over 100 modules were created to model the architecture of the multiplier and division unit for both direct ADK and synthesized implementations. This led to a low area ADK full adder based CSA tree, a quick ADK CSEA, and a fast synthesized QSLC ROM. Without this depth and variety of HDL modeling, it is doubtful that a sub-10ns pre-layout cycle time would have been possible under any 0.5µ process. Transferring the HDL to an IC layout was more complicated than expected.
This resulted from software compatibility issues, where the synthesized netlists would consist of unknown or unavailable cells. The solution was to model the missing cells using available ones, creating a slower and larger design. This meant no produced IC layout could come close in terms of clock frequency or area to the pre-layout calculations. Considering this, and time limitations, a fully auto-place and route IC layout was produced, if for no other reason than to demonstrate size and complexity.


6.2.3 Timing and Area

Using the Mentor Graphics ADK2.5 standard cell design flow with an aggressive AMI 0.5 micron static CMOS process, a target pre-layout speed of 100MHz was possible. However, using a more typical AMI 0.5 micron process, a maximum data arrival time of 13.52ns was obtained for the multiplier, in contrast to the more aggressive fast process, where the cycle time was reduced to 8.91ns. Thus, for a 10ns cycle time, or an FPU clocked at 100MHz, an aggressive process is required.

This does not include the timing of the original LTH. Due to its high level of HDL, extensive synthesis was required to generate a netlist. This presented several problems. In particular, some of the original code was not synthesizable, and some modules produced a critical path of at least 50ns on a typical AMI 0.5 micron process. This caused significant problems with integrating the data paths, as only a complete redesign of the original LTH code would permit it to achieve a cycle time of 10ns, similar to that of the new pipeline stages.

Regarding the new datapaths, the pipeline stage with the largest area was the modified Booth-2 radix-4 partial product generator. Using a fast AMI process, it was estimated to have an area of at least 22K gates, yet surprisingly it had one of the fastest pre-layout critical paths at 3.5ns. Overall the multiplier required 47K gates and the combined division square root unit 12K gates; since multiplication is the more frequently used operation [29], the ratio of required area to utilization is appropriate.
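The CSEA described in 6.2.1 rests on the carry-select principle: each block's sum is computed for both possible carry-in values, and a multiplexer chain then selects the correct one. The following Python sketch models only that principle (the 15-bit block and 106-bit width follow the description above; the code is an illustrative software model, not the thesis's Verilog):

```python
def carry_select_add(a: int, b: int, width: int = 106, block: int = 15) -> int:
    """Model a carry-select adder: each block is summed twice
    (carry-in 0 and carry-in 1); a mux chain picks per-block results."""
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for shift in range(0, width, block):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y                        # block sum assuming carry-in = 0
        sum1 = x + y + 1                    # block sum assuming carry-in = 1
        chosen = sum1 if carry else sum0    # the "mux selection"
        result |= (chosen & mask) << shift
        carry = chosen >> block             # carry-out feeds the next mux
    return result & ((1 << width) - 1)
```

In hardware all the per-block sums are computed in parallel by the CLA blocks, so only the multiplexer chain lies on the critical path; the sequential loop here exists purely to model the selection logic.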

6.3 Future Directions

6.3.1 Design

There are several possibilities for improving this design that may lead to achieving a sub-10ns cycle time under a typical AMI 0.5µ process. The first is modifying the multiplier to perform more computation in the Booth encoding stage, perhaps extending the Booth encoding to radix-8. This would increase the delay and area of that stage, but a higher radix multiplication would reduce the delay and area of the CSA tree, as fewer counters would be required. The second is that the CSA tree could be modified to have its first few levels of counters placed in the Booth encoding stage. The result would be a slightly longer Booth encoding stage, a slightly shorter CSA stage, and a significant saving of area, as fewer partial products would need to be buffered. This would come at the expense of increased design time, as many counters would need to be created with pipeline registers inserted directly after the first few levels. The third possibility is to perform the final CPA over several stages, where each CLA block could produce an output for the CSEA, and the MUX selection could occur in the next stage. This reduces the critical path of the first stage of the CPA to one 16-bit CLA, and of the second stage to six multiplexers. The reason this was not done initially is that it adds an extra pipeline stage and more gates, due to the extra buffering required.

For the division and square root unit, the goal of implementing the Sun design [25], which used three layered overlapped stages, seemed out of reach for standard cells on the AMI 0.5µ process. Without considering the input selection required for the SRT iterations, three levels of parallel product generation and overlapped QSLC come close to 8.5ns. This limits the number of iterations per cycle for this design to two. Future implementations may wish to investigate alternate architectures, such as a radix-512 approach [37] or a multiplicative approach [19]. However, for an SRT approach there is probably not much more performance to be gained without significant increases in area [23].

6.3.2 Implementation

Clearly the different versions of the ADK offered by the synthesis tool, and the missing cells of the schematic entry and layout software, are a cause for concern. With these issues resolved, and a significant amount of time, the HDL implementation could be realized in a hand-placed IC layout.

6.3.3 Architecture Integration

The LTH FPU offered most of the functionality required for integrating the new units. It did not, however, provide a satisfactory level of performance. While the missing (multiplier, divider, and square root) units were fully realized in HDL, integrating them with the LTH would result in halving their projected clock frequencies (Figure 5.11). For the designed units to be used as part of a larger FPU capable of a 100MHz clock frequency, a more optimal solution than the current LTH needs to be found, which may include a complete redesign or the use of another open source FPU [7].

6.4 Conclusion

The LTH FPU provides, for Gaisler Research's Leon2, a partial implementation of a floating-point unit that includes instruction and operand decoding, special number detection, addition and subtraction, and rounding. However, it does not have a multiplier, a divider, or a square root unit. These components have been designed in Verilog HDL with the ADK 2.5 design flow, and verified to operate at 100MHz pre-layout using an aggressive AMI 0.5 micron process. While the components have been built and tested, they have not been integrated into the original LTH, nor realized in an IC layout.

While the designs of the two new units are complete, the implementation could benefit from a hand-placed IC layout, specifically for the CSA tree. This would drastically reduce the area consumed and improve the post-layout critical path of the multiplier compared to the auto-place and route layout produced.

Since the start of development, an open source CPU including a fully pipelined FPU with many improvements over the Leon2 and the original LTH has become available from SUN Microsystems, with the release of the OpenSPARC T1. The FPU provided with the T1 offers a better choice for future innovations, as it is not confined by the limitations placed on it by the Leon2 and the original LTH. Further, the original LTH would need to be completely redesigned to achieve a performance level similar to that of the new units, or to be a comparable option to the T1's FPU for future projects.


Bibliography

[1] Gaisler Research, "Leon 2 Processor" [Online]. Available: http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

[2] Gaisler Research, "Leon2 Processor Users Manual" [Online]. Available: http://www.gaisler.com/doc/leon2-1.0.30-xst.pdf

[3] M. Kasprzyk, "Floating Point Unit Design IC Project 2001" [Online]. Available: http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

[4] SPARC International, "The SPARC Architecture Manual V8" [Online]. Available: http://www.sparc.com/standards/V8.pdf

[5] GNU, "General Public License GPL" [Online]. Available: http://www.gnu.org/copyleft/gpl.html

[6] Gaisler Research, "GRFPU High-Performance Floating-Point Unit" [Online]. Available: http://www.gaisler.com/cms4_5_3/index.php?option=com_content&task=view&id=138&Itemid=54

[7] SUN Microsystems, "OpenSPARC-T1" [Online]. Available: http://opensparct1.sunsource.net/index.html

[8] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.

[9] I. Koren, Computer Arithmetic Algorithms, 2nd Ed., Prentice-Hall, 2002.

[10] I. Koren, "Computer Arithmetic Algorithms Simulator" [Online]. Available: http://www.ecs.umass.edu/ece/koren/arith/simulator/

[11] M. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, 2004.

[12] J. E. Stine, Digital Computer Arithmetic Datapath Design using Verilog HDL, Kluwer Academic Publishers, 2004.

[13] J. Deschamps, G. J. Antoine Bioul, and G. D. Sutter, Synthesis of Arithmetic Circuits, John Wiley & Sons, 2006.

[14] M. J. Flynn and S. F. Oberman, Advanced Computer Arithmetic Design, John Wiley & Sons, 2001.

[15] K. Cho, J. Hong, and G. Choi, "54x54-bit Radix-4 Multiplier based on Modified Booth Algorithm," in Proc. 13th ACM Symp. VLSI, pp. 233-236, Apr. 2003.

[16] R. K. Yu and G. B. Zyner, "167MHz Radix-4 Floating-Point Multiplier," in Proc. 12th IEEE Symp. Computer Arithmetic, pp. 149-154, Jul. 1995.

[17] J. Mori, M. Nagamatsu, and M. Hirano, "A 10-ns 54*54-bit Parallel Structured Full Array Multiplier with 0.5 micron CMOS technology," IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 600-606, Apr. 1991.

[18] N. Besli and R. G. DeshMukh, "A 54*54-bit Multiplier with a new Redundant Booth's Encoding," IEEE Conf. Electrical and Computer Engineering, vol. 2, pp. 597-602, 12-15 May 2002.

[19] E. M. Schwarz, R. M. Averill, and L. J. Sigal, "A Radix-8 S/390 Multiplier," in Proc. 13th IEEE Symp. Computer Arithmetic, pp. 2-9, Jul. 1999.

[20] N. Itoh, Y. Naemura, H. Makino, and Y. Nakase, "A Compact 54*54-bit Multiplier With Improved Wallace-Tree Structure," in Dig. Technical Papers of Symp. VLSI Circuits, pp. 15-16, Jun. 1999.

[21] M. J. Flynn and S. F. Oberman, "Division Algorithms and Implementations," IEEE Trans. Computers, vol. 46, no. 8, Aug. 1997.

[22] P. Soderquist and M. Leeser, "Area and Performance Tradeoffs in Floating-Point Divide and Square-Root Implementations," ACM Computing Surveys, vol. 28, no. 3, Sep. 1996.

[23] D. L. Harris, S. F. Oberman, and M. A. Horowitz, "SRT Division Architectures and Implementations," in Proc. 13th IEEE Symp. Computer Arithmetic, pp. 18-25, Jul. 1997.

[24] P. Kornerup, "Digit Selection for SRT Division and Square Root," IEEE Trans. Computers, vol. 54, no. 3, Mar. 2005.

[25] J. A. Prabhu and G. B. Zyner, "167MHz Radix-8 Divide and Square Root Using Overlapped Radix-2 Stages," in Proc. 12th IEEE Symp. Computer Arithmetic, pp. 155-162, Jul. 1995.

[26] J. Chiang, H. Chung, and M. Tsai, "Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider," Tamkang Journal of Science and Engineering, vol. 3, no. 4, pp. 249-255, 2000.

[27] M. Kuhlmann and K. K. Parhi, "Fast Low-Power Shared Division and Square Root Architecture," in Proc. IEEE Int. Conf. on Computer Design (ICCD), pp. 128-135, Oct. 1998.

[28] G. Gerwig, H. Wetter, E. M. Schwarz, and J. Haess, "Higher Performance Floating-Point Unit with 116 wide Divider," in Proc. IEEE Symp. Computer Arithmetic, pp. 87-94, June 2003.

[29] SPEC, "SPEC CFP2006 Benchmark Description" [Online]. Available: http://www.spec.org/cpu2006/CFP2006

[30] M. Darley, B. Kronlage, D. Bural, B. Churchill, D. Pulling, and P. Wang, "The TMS390602A Floating-Point Coprocessor for SPARC Systems," IEEE Micro, vol. 10, no. 3, pp. 36-47, May 1990.

[31] B. J. Benchschneider, W. J. Bowhill, E. M. Cooper et al., "A Pipelined 50-MHz CMOS 64-bit Floating-Point Arithmetic Processor," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, Oct. 1989.

[32] L. Min, B. Yong-Qiang, S. Xu-Bang, and G. De-Yuan, "The implementation of an out-of-order execution floating point unit," in Proc. IEEE Int. Conf. Solid State and Integrated Circuits, vol. 2, pp. 1384-1387, Oct. 2004.

[33] M. Aagaard and C. H. Seger, "The Formal Verification of a Pipelined Double Precision IEEE Floating Point Multiplier," in Proc. IEEE/ACM Int. Conf. Computer Aided Design, pp. 7-10, 1995.

[34] M. Leeser and J. O'Leary, "The Verification of a Subtractive Radix-2 Square Root Algorithm and Implementation," in Proc. IEEE Int. Conf. on Computer Design (ICCD), pp. 526-531, Oct. 1995.

[35] S. Siewert, "Big Iron Lessons, Part 1: FPU Architecture Now and Then" [Online]. Available: http://www-128.ibm.com/developerworks/library/pabigiron1/?ca=dnt-65

[36] N. Weste and D. Harris, CMOS VLSI Design, 3rd Ed., Int. Ed., Addison Wesley, 2004.

[37] A. Nannarelli, "Low Power Division and Square Root," PhD thesis, Univ. of California, Irvine, 1999.

Appendix A: Arithmetic by Example

Introduction

The following sections demonstrate some of the arithmetic methods used to compute the multiplication, division, and square root operations in this design. They illustrate the broad concepts of each function and provide insight into the correct operation of the most important sections of each unit.

Radix-2 Unsigned Multiplication

Unsigned multiplication is one of the simplest operations: a set of partial products is produced and summed (Figure A.1). Each partial product is the result of an AND operation between one bit of operand B and the entire operand A, shifted left according to that bit's position in B.

A × B = R
21₁₀ × 26₁₀ = 546₁₀
10101₂ × 11010₂ = 1000100010₂

                        1 0 1 0 1          A[4:0] = 21₁₀
                      × 1 1 0 1 0          B[4:0] = 26₁₀
                      -----------
  PP1                   0 0 0 0 0          A AND B0
  PP2                 1 0 1 0 1            A AND B1
  PP3               0 0 0 0 0              A AND B2
  PP4             1 0 1 0 1                A AND B3
  PP5           1 0 1 0 1                  A AND B4
          -----------------------
  Result    1 0 0 0 1 0 0 0 1 0            PP Sum = 546₁₀

Figure A.1 Unsigned Radix-2 Multiplication Example
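The shift-and-add scheme above can be sketched as a short software model (Python is used here purely for illustration; the thesis hardware is Verilog):

```python
def radix2_multiply(a: int, b: int, n: int = 5):
    """Radix-2 unsigned multiplication: form partial products
    PPi = (A AND Bi) << i and sum them."""
    pps = [(a if (b >> i) & 1 else 0) << i for i in range(n)]
    return pps, sum(pps)

pps, result = radix2_multiply(0b10101, 0b11010)  # 21 x 26
# result == 546
```

In hardware the summation is performed by the CSA tree rather than a sequential accumulation, but the partial products are identical.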

Radix-4 Multiplication using Modified Booth Encoding

The following example is an unsigned multiplication of two numbers using the modified Booth-2 algorithm. After the partial products are generated, they are sign extended and then summed. This is required because Booth encoding can generate negative partial products; using a special form of sign encoding, all negative partial products require only flipping the bits and adding a one at the correct position.

A × B = R
21₁₀ × 26₁₀ = 546₁₀
10101₂ × 11010₂ = 1000100010₂

Booth-2 recoding of B = 11010₂ (scanning overlapping 3-bit groups):
  (B1, B0, 0)  = 100 → −2A    (PP1, weight 2⁰)
  (B3, B2, B1) = 101 → −1A    (PP2, weight 2²)
  (0, B4, B3)  = 011 → +2A    (PP3, weight 2⁴)

PP Sum = (−2 × 21) × 1 + (−1 × 21) × 4 + (+2 × 21) × 16
       = −42 − 84 + 672 = 546₁₀

Figure A.2 Booth-2 Multiplication Example
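The digit recoding used above can be modeled directly from the standard Booth-2 table. This is an illustrative Python sketch of the digit selection only, not the thesis's partial product generator:

```python
def booth2_digits(b: int, n_bits: int):
    """Recode multiplier b into radix-4 Booth digits in {-2,-1,0,1,2}.
    Each digit comes from the overlapping group (b[i+1], b[i], b[i-1])."""
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    padded = b << 1                    # append the implicit b[-1] = 0
    digits = []
    for i in range(0, n_bits, 2):
        group = (padded >> i) & 0b111
        digits.append(table[group])
    return digits

# For B = 11010 (26) the digits are [-2, -1, 2] (weights 4^0, 4^1, 4^2)
```

Summing digit × 4^k × A reproduces the worked example: (−2 − 1·4 + 2·16) × 21 = 26 × 21 = 546.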

The sign extension can be extended to a larger number of partial products by repeating the form of encoding used on the second line for the rest of the products (Figure A.3). In a generic 8-bit multiplication, it may appear as if an error has occurred because the last partial product is shorter than the rest. This is due to the modified Booth encoding, whose final group can only receive "00X" and can therefore only produce a zero or positive-one partial product. Further, the sign extension of the fourth row is not continued to include a one, because an 8-bit by 8-bit multiplication can only produce a 16-bit number; the extra sign extension on row four and the carry out from the most significant column are ignored.

Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram

SRT Radix-2 Division

X / D = Q
25₁₀ / 5₁₀ = 5₁₀
11001₂ / 00101₂ = 00101₂
In floating-point terms: 1.1001₂ / 1.0100₂ = 1.0100₂
Let PRs0 = 000.11001₂ and Result Exponent = Result Exponent + 1
Let PRc0 = 0
All QSLC calculations are based on the SUN design [25].

At each iteration the partial remainder is kept in carry-save form (PRs, PRc). The QSLC examines a 4-bit assimilated sum of the most significant bits to select the next quotient digit, and the selected multiple of D (−D, 0, or +D) is added through a carry-save addition.

Iteration 0: q = +1, so Q1 = 1.0              and PR1 = 2PR0 − D
Iteration 1: q = +1 (as MSB5 = 1), Q2 = 1.1   and PR2 = 2PR1 − D
Iteration 2: q = −1, so Q3 = 1.01             and PR3 = 2PR2 + D
Iteration 3: q = 0 (as MSB5 = 0), Q4 = 1.010  and PR4 = 2PR3
Iteration 4: q = 0, so Q5 = 1.0100            and PR5 = 2PR4

Figure A.4 SRT Radix-2 Division Example

No more iterations are required, as in the next iteration all of the bits for the QSLC input will be zero; further iterations would only generate extra zero quotient bits. This method has produced a result of "1.0100…..0", which matches what was expected. In further iterations the partial remainder would be completely zero, so the sticky bit generated would also be zero.
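The iteration scheme above can be modeled with exact rational arithmetic. This sketch selects each digit from the true partial remainder rather than from a carry-save QSLC estimate, so it illustrates the recurrence rather than the hardware; the ±1/2 selection thresholds are the standard radix-2 SRT choice:

```python
from fractions import Fraction

def srt2_divide(x: Fraction, d: Fraction, iters: int = 5) -> Fraction:
    """Radix-2 SRT division with quotient digits in {-1, 0, 1}.
    PR0 = X/2 (exponent bumped, as in the example); each step doubles
    the partial remainder, picks a digit, and subtracts digit * D."""
    pr = x / 2
    q = Fraction(0)
    for i in range(1, iters + 1):
        pr = 2 * pr
        if pr >= Fraction(1, 2):
            digit = 1
        elif pr < -Fraction(1, 2):
            digit = -1
        else:
            digit = 0
        pr -= digit * d
        q += Fraction(digit, 2 ** i)
    return 2 * q               # undo the exponent adjustment

# srt2_divide(Fraction(25, 16), Fraction(20, 16)) reproduces the
# digit sequence +1, +1, -1, 0, 0 and the quotient 1.0100 (5/4)
```

Note how the quotient can momentarily overshoot (after iteration 1 it is 1.1) and is pulled back by the −1 digit, exactly the over-estimation behaviour discussed for the test benches in Appendix C.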


SRT Radix-2 Square Root

The following is a demonstration of a radix-2 SRT square root operation.
√X = Q
√25₁₀ = 5₁₀
√11001₂ = 00101₂
In floating-point terms: √1.1001₂ = 1.0100₂
Let PRs0 = 000.10010₂
Let PRc0 = 0
All QSLC calculations are based on the SUN design [25].

As with division, the partial remainder is kept in carry-save form (PRs, PRc) and the QSLC examines a 4-bit assimilated sum of its top bits. Unlike division, the "divisor" F(i) changes every iteration, being formed from the partially developed root Q and the weight 2^−(i+1).

Iteration 0: q = +1, Q = 1.1,    PR1 = 2PR0 − (2Q + 2^−(i+1))
Iteration 1: q = −1, Q = 1.01,   PR2 = 2PR1 + (2Q − 2^−(i+1))
Iteration 2: q = 0,  Q = 1.010,  PR3 = 2PR2
Iteration 3: q = 0,  Q = 1.0100, PR4 = 2PR3

Figure A.5 SRT Radix-2 Square Root Example

Q3 = 1.0100, which means that after several iterations the desired result has been obtained. No further iterations were completed, because all subsequent "qi+1" will be zero. The main difference between an integer and the demonstrated floating-point version of SRT square root is that PRsi and Qi are initialized on the first iteration to (X − 1) and 1 respectively.
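The square root recurrence above can also be modeled with exact rationals. As with the division sketch, the digit is chosen from the true partial remainder with ±1/2 thresholds rather than from a carry-save QSLC estimate, so this illustrates the recurrence PR(i+1) = 2PR(i) − q(2Q + q·2^−(i+1)), not the hardware:

```python
from fractions import Fraction

def srt2_sqrt(x: Fraction, iters: int = 8) -> Fraction:
    """Radix-2 SRT square root with root digits in {-1, 0, 1}.
    The invariant is PR(i) = 2^i * (x - Q(i)^2), so PR -> 0 as Q -> sqrt(x)."""
    pr = x - 1                 # PR0 = X - 1 ...
    q_root = Fraction(1)       # ... and Q initialized to 1
    for i in range(iters):
        w = Fraction(1, 2 ** (i + 1))
        pr = 2 * pr
        if pr >= Fraction(1, 2):
            digit = 1
        elif pr < -Fraction(1, 2):
            digit = -1
        else:
            digit = 0
        pr -= digit * (2 * q_root + digit * w)   # subtract digit * F(i)
        q_root += digit * w
    return q_root

# srt2_sqrt(Fraction(25, 16)) reproduces the digits +1, -1, 0, 0
# and the root 1.0100 (5/4)
```

For X = 1.1001₂ the partial remainder goes exactly to zero after the −1 digit, just as in the worked example.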


Radix-2 On-The-Fly Conversion

The following is an example of how the quotient is formed for division and square root operations; specifically it uses the SRT radix-2 division example from above. The key is to notice that only the bit position specified in the third column is new for both Q and QM; all other bits are simply the result of a select between the old values. This is illustrated in iteration three, where the QSLC output is negative one and Q selects the previous value from QM. The square root operation is more complicated than division, as its divisor F(i) is variable. There are two separate versions of F(i), one each for positive and negative products in each iteration; they can, however, simply be formed by using the final three columns.

Iteration   qi     2^−i      Qi        QMi
0           N/A    1.00000   0.XXXXX   0.XXXXX
1           1      0.10000   0.1XXXX   0.0XXXX
2           1      0.01000   0.11XXX   0.10XXX
3           −1     0.00100   0.101XX   0.100XX
4           0      0.00010   0.1010X   0.1001X
5           0      0.00001   0.10100   0.10011

Figure A.6 On-The-Fly Conversion Example
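The Q/QM update rules in the table can be expressed compactly. An illustrative Python sketch (string-based for clarity; hardware keeps Q and QM in registers and appends one bit per digit with no carry propagation):

```python
def on_the_fly(digits):
    """Convert signed quotient digits in {-1, 0, 1} to binary on the fly.
    Q holds the value assuming all remaining digits are >= 0;
    QM holds Q minus one unit in the last place."""
    q, qm = "", ""
    for d in digits:
        if d == 1:
            q, qm = q + "1", q + "0"
        elif d == 0:
            q, qm = q + "0", qm + "1"
        else:  # d == -1: a negative digit selects from QM
            q, qm = qm + "1", qm + "0"
    return q, qm

# Digits from the division example: on_the_fly([1, 1, -1, 0, 0])
# yields Q = "10100" and QM = "10011", matching the table above.
```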


Appendix B: Counter Study

Introduction

A counter is a functional unit able to produce a value equal to the sum of its input bits, where each input bit carries the same weight. For example, a standard binary counter given the input vector "10011010" may produce the output "0100", meaning four bits are high. While many counters produce the same output given identical inputs, there is great variation in implementation. This section delves into the various styles of counters to discover what advantages they offer, in terms of both performance and area, for the multiplication and division units.

Timing Analysis

Counter Pre-Layout Data Arrival Times (ns)

Module                              AMI 0.5µ Slow   AMI 0.5µ Typical   AMI 0.5µ Fast
3:2 Counters (Full Adders)
  ADK Fadd1 Std. Cell                    2.17            1.23              0.78
  Synthesized                            1.25            0.71              0.45
4:2 Counters
  Two ADK Fadd1 Std. Cells               5.26            2.95              1.86
  Synthesized (Adders Preserved)         2.96            1.69              1.10
  Custom ADK Design [11]                 4.95            2.84              1.80
  Synthesized (Adders Dissolved)         2.97            1.70              1.04
27:2 Counters
  Samsung ADK Fadd1 Design [15]         21.06           11.68              7.34
  Synthesized (Adders Preserved)        11.55            6.63              4.34
  Synthesized (Adders Dissolved)        11.31            6.64              4.29

Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis

Several different types of counters were studied (Figure B.1) to determine the best architecture for a multiplier design. In particular this relates to the second stage of multiplication, fast multi-operand addition. The two common methods for adding the partial products into a carry-save form rely on either seven levels of 3:2 counters or four levels of 4:2 counters. For both types of counter, the synthesized versions offer a greatly reduced delay. While the 3:2 counter may seem to have a data arrival time of less than 50% of the larger counter, it is important to remember that these figures do not account for carry propagation. As a result, the synthesized adders may suffer, since the synthesis tool assumes all inputs are available at time zero, whereas the ADK std. cell versions may have a port optimized for late arrival.

Also of particular interest to the multiplier is a comparison of the delays of 4:2 and 27:2 counters. The Samsung design [15] relies on seven levels of full adder cells to produce a 27:2 counter. An alternate design [17] utilizes four levels of 4:2 counters, which would produce a delay of approximately 4.16ns (four levels of synthesized 4:2 counters with the cells dissolved). This second approach offers a substantial decrease in delay compared to the 7.34ns required by the 27:2 ADK std. cell counter. However, the decreased delay may come at the cost of a significant area increase. Also, providing a very fast CSA for the multiplier may not yield any performance increase, as there may be other slower pipeline stages.
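The counters under comparison all compress operands without propagating carries. A bitwise Python model of the 3:2 and 4:2 compression (illustrative only; the thesis implements these as ADK standard cells or synthesized netlists):

```python
def csa_3to2(a: int, b: int, c: int):
    """3:2 counter (a row of full adders): compress three operands into
    a sum word and a carry word with no carry propagation."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def csa_4to2(a: int, b: int, c: int, d: int):
    """4:2 counter built from two chained 3:2 counters."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)

# Invariant: the two output words always sum to the input total.
s, c = csa_4to2(27, 54, 13, 9)
assert s + c == 27 + 54 + 13 + 9
```

Because the outputs stay in carry-save form, levels of these counters can be stacked into a tree, and only the final CPA pays the carry-propagation cost.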

Area Analysis

Counter Area by Gate Count

Module                              AMI 0.5µ Slow   AMI 0.5µ Typical   AMI 0.5µ Fast
3:2 Counters (Full Adders)
  ADK Fadd1 Std. Cell                      5               5                 5
  Synthesized                              7               7                 8
4:2 Counters
  Two ADK Fadd1 Std. Cells                10               9                 9
  Synthesized (Adders Preserved)          14              14                15
  Custom ADK Design [11]                  16              16                16
  Synthesized (Adders Dissolved)          14              14                18
27:2 Counters
  Samsung ADK Fadd1 Design [15]          133             114               114
  Synthesized (Adders Preserved)         170             170               190
  Synthesized (Adders Dissolved)         199             197               246

Figure B.2 Std. Cell v. Synthesized Counter Area Analysis

In contrast to the timing analysis (Figure B.1), the gate counts for the synthesized counters (Figure B.2) are vastly increased. The 4:2 counter with all of its internal full adder cells dissolved uses 3.6 times the number of gates of the ADK std. cell full adder. While an extreme example, it illustrates the trend of synthesized adders requiring a larger area. It is important to remember that, to achieve a satisfactory delay for the AMI 0.5 micron process, all adders were optimized for delay; it is this optimization that drives the increase in area.

Conclusion

The case for a fully ADK standard cell instantiated HDL counter design is a low gate count and, most likely, similar pre-layout and post-layout timing results. In comparison, the synthesized options produce a vastly faster pre-layout result but with an increased gate count. The selection of counter should therefore be based on how many counters are used and on the importance of the data arrival times offered by each. For the multiplier, which uses a vast number of counters, area is the overriding factor. In contrast, the combined division square root unit uses significantly fewer counters, and delay is more important.


Appendix C: HDL Verification

Introduction

This section presents the outputs from the HDL simulator for the three main operations. For multiplication and division, it was possible to model their complete functionality in a Verilog test bench. In the case of square root, no simple arithmetic HDL constructs were available, and the produced results were verified by hand. For each of the three operations a separate test bench was constructed, and 1000 random normalized operands were generated. While the majority of the HDL written has an accompanying test bench, the units tested here utilize all of the other sub-components, and as a result provide a means of testing all designed modules.

Multiplication

# Griffith University Floating-Point Multiplier Test Bench.
# Simulation File: [./hdl/testbench/tb_fpu_multply_vec.v]
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_multply_vec.out]
# Initializing test vectors...
# Performing 1000 verification tests using random normalized vectors.
# Test bench complete
# Details:
#   Tests completed: 1000
#   Calculation errors: 0
#   Fraction error: 0
#   Exponent error: 0
#   Sign error: 0
# Processing Details:
#   Time required: 6004000
#   Cycles required: 6004

Figure C.1 FPU Multiplication Test Bench

When testing the functionality of the designed multiplier, it was possible to model the calculations for all three fields (exponent, fraction, and sign) in a test bench (Figure C.1). Essentially, one operation was run at a time (similar conditions to the LTH FPU), and when that operation completed, the produced result was checked against a pre-calculated one. The inputs were then reset to random values, which remained constant until the next test had completed.

Division

# Griffith University Floating-Point Combined DIVSRT Test Bench.
# Performs only division tests.
# Simulation File: [./hdl/testbench/tb_fpu_divsqrt_vec.v]
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_divsqrt_vec.out]
# Initializing test vectors...
# Performing 1000 verification tests using random normalized vectors.
# Test bench complete
# Details:
#   Tests completed: 1000
#   Calculation errors (after corrections): 0
#     Fraction errors before correction: 119
#     ULP corrections required: 119
#   Exponent error: 0
#   Sign error: 0
# Processing Details:
#   Time required: 31004000
#   Cycles required: 31004

Figure C.2 FPU Division Test Bench

Similar to multiplication, it was possible to model all aspects of a division operation in a Verilog test bench (Figure C.2). In SRT calculations, however, the quotient can be overestimated and then corrected at a later stage by a negative QSLC result. This adjustment is not guaranteed to happen directly after an overestimation, and as a result the final quotient can still be high by an ULP. Accounting for this, the division test bench first compared the result produced by the circuit with a pre-calculated one; when they did not match, the pre-calculated result was then compared against the circuit result minus an ULP. From 1000 normalized operands, allowing for quotient correction, no problems were observed.
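The two-step comparison can be sketched as follows (a Python model of the check, with the quotients treated as 53-bit integer significands; the function name is illustrative):

```python
def division_result_matches(circuit_frac: int, expected_frac: int) -> bool:
    # An SRT quotient may come out one ULP high when a later negative
    # quotient digit never arrives to cancel an earlier overestimate
    if circuit_frac == expected_frac:
        return True
    return circuit_frac - 1 == expected_frac   # retry with one ULP removed
```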

Square Root

# Griffith University Floating-Point Combined DIVSRT Test Bench.
# Performs only square root tests.
# Simulation File: [./hdl/testbench/tb_fpu_sqrt_simple_vec.v]
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_sqrt_simple_vec.out]
# Initializing test vectors...
# Performing 1000 verification tests using random normalized vectors.
# Test bench complete
# Processing Details:
#   Time required: 30004000
#   Cycles required: 30004

Figure C.3 FPU SQRT Test Bench

Unlike the division and multiplication operations, the square root function could not be modeled in Verilog. By modifying the division test bench to simply generate operands and record results (Figure C.3), it was possible to create a list of operations whose results could be verified at a later stage (Figure C.4).


When verifying the recorded results, the logged quotients generated by VSIM were compared against those produced by the Microsoft Windows Power Toy Calculator. Given that the square root operation suffers from a similar error to division, where the result may be overestimated by an ULP, the results were considered matching if either the simulated quotient or the simulated quotient minus an ULP equaled the result produced by the calculator.
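A reference square root for this kind of check can be computed exactly with integer arithmetic (a Python sketch, not the calculator-based procedure used in the thesis). Shifting by one extra bit folds an odd exponent into the significand:

```python
import math

def sqrt_significand(frac: int, exp_is_odd: bool) -> int:
    # frac is a 53-bit significand with implicit 1 (value = frac / 2**52).
    # Returns floor(sqrt(value * 2 if exponent odd, else value) * 2**52),
    # i.e. the truncated 53-bit result significand.
    return math.isqrt(frac << (53 if exp_is_odd else 52))
```

For instance, frac = 9 << 49 (the value 1.125) with an odd exponent gives 3 << 51, i.e. sqrt(2.25) = 1.5.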

Further, as the division and square root functions share many common blocks, verifying the division operation with a large set of test vectors also verifies the square root function.

Square Root Test: 0 (First test with odd exponent)
Operand A    1.1001010111101000000100010010000101010011010100100100
Calculator   1.11000111111000001001100001000001001110010101111111001
Circuit      1.11000111111000001001100001000001001110010101111111010
Correction   1.11000111111000001001100001000001001110010101111111001   (Odd exponent, Q = Q - ULP)

Square Root Test: 3 (First test with even exponent)
Operand A    1.0000011010011111001001000111111011001101101110001111
Calculator   1.00000011010010100010100100000110011101000101100001000
Circuit      1.00000011010010100010100100000110011101000101100001000   (Even exponent)

Square Root Test: 612 (Selected at random)
Operand A    1.0101011111101110010100010111010110011011001100101110
Calculator   1.00101000101110011101101100001010000100101011000100100
Circuit      1.00101000101110011101101100001010000100101011000100100   (Even exponent)

Square Root Test: 726 (Selected at random)
Operand A    1.0000010100001000010011100000010101001101000011000000
Calculator   1.00000010100000010000010001110110111110001001011110010
Circuit      1.00000010100000010000010001110110111110001001011110010   (Even exponent)

Square Root Test: 996 (Last test with odd exponents)
Operand A    1.0101101110001010110000000001000011010101001100000010
Calculator   1.10100101110101001100010011001101010000110011101001011
Circuit      1.10100101110101001100010011001101010000110011101001011   (Odd exponent)

Test cases are from the log file produced and referenced in the square root test bench (Figure C.3).

Figure C.4 FPU Square Root Result Verification


Appendix D:

Code Listing

Introduction

A CD/DVD accompanies this dissertation. This section outlines the contents of that disc, in particular the HDL source, area and timing reports, and the produced netlists. Also included is some of the early work performed on the LTH FPU, including its separation into pipeline stages and the associated synthesis reports.

File Locations and Naming Conventions

File Locations

Location                  Description
./hdl                     Base directory for created modules
./hdl/scripts             Contains all PERL and Verilog scripts used to generate other modules (e.g. CSA tree and QSLC ROM)
./hdl/testbench           Contains test benches for created modules
./hdl/testbench/output    Stores the log files produced by a limited number of the main test benches
./lth                     Has a similar directory structure to that of "./hdl". It also contains a "synth" directory, similar to that described below
./synth                   Base directory for synthesis results
./synth/area              Area reports (e.g. gate count)
./synth/delay             Timing reports
./synth/netlist           Produced netlists, either from synthesized or flattened modules
./synth/script            Contains all synthesis settings and Linux Bash shell scripts to run them in terminal mode

Figure D.1 File Location Map

File Naming Conventions

File Pattern                Description
*.ami05[fast|slow|typ].*    Result or output of synthesis, where the name of the module that produced this file precedes the process variation
*.area.rpt                  Area report
*.delay.rpt                 Pre-layout delay report
*.out                       Log file produced from VSIM simulation
*_syn.*                     A module designed for, or produced from, a design utilizing Verilog primitives. A similarly named file without "_syn" also exists, based on ADK cells
tb_*.v                      Verilog test bench

Figure D.2 File Naming Conventions


Verilog test bench

Highlights

File                                        Description
./hdl/fpu_divsqrt.v                         Radix-4 combined division and square root unit
./hdl/fpu_multply.v                         Multiplication unit
./synth/delay/fpu_divsqrt.ami05fast.rpt     Delay report for the division square root unit on an AMI 0.5 fast process
./synth/delay/fpu_mulyply.ami05fast.rpt     Delay report for the multiplication unit on an AMI 0.5 fast process
./hdl/scripts/csatree.pl                    Generates connectivity for the CSA tree; instantiates 27:2 to 4:2 counters
./hdl/testbench/tb_fpu_multiply_vec.v       FPU multiplication test bench
./hdl/testbench/tb_fpu_divsqrt_vec.v        FPU division test bench

Figure D.3 HDL Synthesis and Simulation Highlights

