High Performance DSP Architectures

CHAPTER 1
EVOLUTION OF DSP PROCESSORS

INTRODUCTION
Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks. These devices have seen tremendous growth in the last decade, finding use in everything from cellular telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing. DSP has become a key component in many consumer, communications, medical, and industrial products. These products use a variety of hardware approaches to implement DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate arrays (FPGAs) to custom integrated circuits (ICs). Programmable “DSP processors,” a class of microprocessors optimized for DSP, are a popular solution for several reasons. In comparison to fixed-function solutions, they have the advantage of potentially being reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-effective than custom hardware, particularly for low-volume applications, where the development cost of ICs may be prohibitive. DSP processors also often have an advantage over general-purpose microprocessors in terms of speed, cost, and energy efficiency.

DSP ALGORITHMS MOULD DSP ARCHITECTURES
From the outset, DSP processor architectures have been moulded by DSP algorithms. For nearly every feature found in a DSP processor, there are associated DSP algorithms whose computation is in some way eased by inclusion of this feature. Therefore, perhaps the best way to understand the evolution of DSP architectures is to examine typical DSP algorithms and identify how their computational requirements have influenced the architectures of DSP processors.

FAST MULTIPLIERS
The FIR filter is mathematically expressed as a dot product of a vector of input data and a vector of filter coefficients. For each “tap” of the filter, a data sample is multiplied by a filter coefficient, with the result added to a running sum for all of the taps. Hence, the main component of the FIR filter algorithm is a dot product: multiply and add, multiply and add. These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of the most common operations performed in signal processing, and convolution, IIR filtering, and Fourier transforms also all involve heavy use of multiply-accumulate operations. Originally, microprocessors implemented multiplications by a series of shift and add operations, each of which consumed one or more clock cycles. As might be expected, faster multiplication hardware yields faster performance in many DSP algorithms, and for this reason all modern DSP processors include at least one dedicated single-cycle multiplier or combined multiply-accumulate (MAC) unit.
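As a concrete illustration, the FIR kernel described above reduces to a loop of multiply-accumulates. The following C sketch (the function name is illustrative, not from any particular DSP library) shows the dot-product form that a single-cycle MAC unit accelerates:

```c
#include <stddef.h>

/* Direct-form FIR tap sum: each loop iteration is one
 * multiply-accumulate (MAC), the operation a DSP's MAC unit
 * performs in a single cycle. */
double fir_tap_sum(const double *x, const double *h, size_t taps)
{
    double acc = 0.0;                /* running sum over all taps */
    for (size_t k = 0; k < taps; ++k)
        acc += h[k] * x[k];          /* multiply and add, per tap */
    return acc;
}
```

On a conventional DSP, the multiply, the add, and both operand fetches in the loop body would complete in one cycle per tap.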


Department Of Electronics & Communication Engineering, GEC Thrissur.


MULTIPLE EXECUTION UNITS
DSP applications typically have very high computational requirements in comparison to other types of computing tasks, since they often must execute DSP algorithms in real time on lengthy segments of signals sampled at 10-100 kHz or higher. Hence, DSP processors often include several independent execution units that are capable of operating in parallel; for example, in addition to the MAC unit, they typically contain an arithmetic-logic unit (ALU) and a shifter.

EFFICIENT MEMORY ACCESSES
Executing a MAC in every clock cycle requires more than just a single-cycle MAC unit. It also requires the ability to fetch the MAC instruction, a data sample, and a filter coefficient from memory in a single cycle. To address the need for increased memory bandwidth, early DSP processors developed memory architectures that could support multiple memory accesses per cycle. Often, instructions were stored in one memory bank, while data was stored in another. With this arrangement, the processor could fetch an instruction and a data operand in parallel in every cycle.

Since many DSP algorithms consume two data operands per instruction, a further optimization commonly used is to include a small bank of RAM near the processor core that is used as an instruction cache. When a small group of instructions is executed repeatedly, the cache is loaded with those instructions, freeing the instruction bus to be used for data fetches instead of instruction fetches, thus enabling the processor to execute a MAC in a single cycle. High memory bandwidth requirements are often further supported via dedicated hardware for calculating memory addresses. These address generation units operate in parallel with the DSP processor’s main execution units, enabling the processor to access data at new locations in memory without pausing to calculate the new address. Memory accesses in DSP algorithms tend to exhibit very predictable patterns; for example, in an FIR filter the coefficients are accessed sequentially from start to finish for each sample, and then accesses start over from the beginning of the coefficient vector when processing the next input sample. DSP processor address generation units take advantage of this predictability by supporting specialized addressing modes that enable the processor to efficiently access data in the patterns commonly found in DSP algorithms. The most common of these modes is register-indirect addressing with post-increment, which automatically increments the address pointer for algorithms where repetitive computations are performed on a series of data stored sequentially in memory. Many DSP processors also support “circular addressing,” which allows the processor to access a block of data sequentially and then automatically wrap around to the beginning address, exactly the pattern used to access coefficients in FIR filtering. Circular addressing is also very helpful in implementing first-in, first-out buffers, commonly used for I/O and for FIR filter delay lines.
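In C terms, the two addressing patterns just described look like the following sketch. The hardware address generation units perform these pointer updates for free, in parallel with the arithmetic; here they cost ordinary instructions (function names are illustrative):

```c
#include <stddef.h>

/* Register-indirect addressing with post-increment: fetch through a
 * pointer, then advance it, as in *p++ over sequential data. */
double sum_post_increment(const double *p, size_t n)
{
    double acc = 0.0;
    while (n--)
        acc += *p++;     /* use the operand, then bump the address */
    return acc;
}

/* Circular addressing: the index wraps back to the start of the
 * block, exactly the coefficient-access pattern of an FIR filter. */
size_t circular_next(size_t index, size_t buffer_len)
{
    return (index + 1) % buffer_len;   /* hardware wraps implicitly */
}
```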

DATA FORMAT
DSP applications typically must pay careful attention to numeric fidelity. Since numeric fidelity is far more easily maintained using a floating-point format, it may seem surprising that most DSP processors use a fixed-point format. DSP processors tend to use the shortest data word that will provide adequate accuracy in their target applications. Most fixed-point DSP processors use 16-bit data words, because that data word width is sufficient for many DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable better accuracy in applications that are difficult to implement well with 16-bit data, such as high-fidelity audio processing. To ensure adequate signal quality while using fixed-point data, DSP processors typically include specialized hardware to help programmers maintain numeric fidelity throughout a series of computations. For example, most DSP processors include one or more “accumulator” registers to hold the results of summing several multiplication products. Accumulator registers are typically wider than other registers; they often provide extra bits, called “guard bits,” to extend the range of values that can be represented and thus avoid overflow. In addition, DSP processors usually include good support for saturation arithmetic, rounding, and shifting, all of which are useful for maintaining numeric fidelity.

ZERO-OVERHEAD LOOPING
DSP algorithms typically spend the vast majority of their processing time in relatively small sections of software that are executed repeatedly, i.e., in loops. Hence, most DSP processors provide special support for efficient looping. Often, a special loop or repeat instruction is provided which allows the programmer to implement a for-next loop without expending any clock cycles for updating and testing the loop counter or branching back to the top of the loop.
This feature is often referred to as “zero-overhead looping.”

STREAMLINED I/O
Finally, to allow low-cost, high-performance input and output, most DSP processors incorporate one or more specialized serial or parallel I/O interfaces, and streamlined I/O handling mechanisms, such as low-overhead interrupts and direct memory access (DMA), to allow data transfers to proceed with little or no intervention from the processor's computational units.

SPECIALIZED INSTRUCTION SETS
DSP processor instruction sets have traditionally been designed with two goals in mind. The first is to make maximum use of the processor's underlying hardware, thus increasing efficiency. The second goal is to minimize the amount of memory space required to store DSP programs, since DSP applications are often quite cost-sensitive and the cost of memory contributes substantially to overall chip and/or system cost. To accomplish the first goal, conventional DSP processor instruction sets generally allow the programmer to specify several parallel operations in a single instruction, typically including one or two data fetches from memory in parallel with the main arithmetic operation. With the second goal in mind, instructions are kept short by restricting which registers can be used with which operations, and restricting which operations can be combined in an instruction.
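Returning to the fixed-point hardware described under DATA FORMAT, the accumulate-then-saturate pattern can be sketched in C as follows. This is a simplified model, not any vendor's intrinsics; a 64-bit accumulator stands in for the guard bits of a hardware accumulator:

```c
#include <stdint.h>

/* Saturate a wide accumulator back to 32 bits: clamp instead of
 * wrapping on overflow, as DSP saturation hardware does. */
int32_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}

/* 16-bit fixed-point MAC loop: each 16x16 product is 32 bits wide;
 * summing the products in a 64-bit accumulator leaves "guard bits"
 * above bit 31, so intermediate sums cannot overflow. */
int32_t mac16(const int16_t *x, const int16_t *h, int n)
{
    int64_t acc = 0;                     /* wide accumulator */
    for (int i = 0; i < n; ++i)
        acc += (int32_t)x[i] * h[i];     /* 16x16 -> 32-bit product */
    return sat32(acc);                   /* saturate on the way out */
}
```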


CHAPTER 2
TRADITIONAL SOLUTIONS FOR REAL-TIME PROCESSING

DSP architecture designs have traditionally focused on meeting real-time constraints. Advanced signal processing algorithms, such as those in base station receivers, present difficulties to the designer due to the implementation of complex algorithms, higher data rates, and the desire for more channels per hardware module. A key constraint from the manufacturing point of view is attaining a high channel density. Traditionally, real-time architecture designs employ a mix of DSPs, coprocessors, FPGAs, ASICs, and Application-Specific Standard Parts (ASSPs) for meeting real-time requirements in high-performance applications. The chip-rate processing is handled by the ASSP, ASIC, or FPGA, while the DSPs handle the symbol-rate processing and use coprocessors for decoding. The DSP can also implement parts of the MAC (medium access control) layers and control protocols, or can be assisted by a RISC processor.

However, dynamic variations in the system workload, such as variations in the number of users in wireless base stations, require a dynamic re-partitioning of the algorithms, which may not be possible to implement in traditional FPGAs and ASICs in real time.

LIMITATIONS OF SINGLE-PROCESSOR DSP ARCHITECTURES
Single-processor DSPs can have only a limited number of arithmetic units, and their architectures cannot be directly extended to hundreds of arithmetic units. This is because, as the number of arithmetic units increases in an architecture, the size of the register files and the port interconnections start dominating the architecture.

PROGRAMMABLE MULTIPROCESSOR DSP ARCHITECTURES
Multiprocessor architectures can be classified into Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) architectures. Data-parallel DSPs exploit data parallelism, instruction-level parallelism, and subword parallelism. Other levels of parallelism, such as thread-level parallelism, exist and can be considered after this architecture space has been fully studied and explored.


MULTI-CHIP MIMD PROCESSORS
Each processor in a loosely coupled system has a set of I/O devices and a large local memory. Processors communicate by exchanging messages using some form of message-transfer system. Loosely coupled systems are efficient when interactions between tasks are minimal. The tradeoffs of this processor design have been the increase in programming complexity and the need for high I/O bandwidth and inter-processor support. Such MIMD solutions are also difficult to scale with the number of processors (e.g., the TI ’C4xx).

[Figure: Register file explosion in traditional DSPs with centralized register files.]

The disadvantages of the multi-chip MIMD model and architectures are the following:
1. Load-balancing algorithms for such MIMD architectures are not straightforward, similar to the situation in heterogeneous systems. This makes it difficult to partition algorithms on this architecture model, especially when the workload changes dynamically.
2. The loosely coupled model is not scalable with the number of processors due to interconnection and I/O bandwidth issues.
3. I/O impacts the real-time performance and power consumption of the architecture.
4. The design of a compiler for a MIMD model on a loosely coupled architecture is difficult, and the burden is left to the programmer to decide on the algorithm partitioning on the multiprocessor.

SINGLE-CHIP MIMD PROCESSORS
Single-chip MIMD processors can be classified into three categories: single-threaded chip multiprocessors (CMPs), multi-threaded multiprocessors (MTs), and clustered VLIW architectures. A CMP integrates two or more complete processors on a single chip. Therefore, every unit of a processor is duplicated and used independently of its copies.


In contrast, a multi-threaded processor interleaves the execution of instructions of various threads of control in the same pipeline. Therefore, multiple program counters are available in the fetch unit, and multiple contexts are stored in multiple register sets on the chip. Multithreading increases instruction-level parallelism in the arithmetic units by providing access to more than a single independent instruction stream. The programmer is responsible for scheduling the threads of the application. Clustered VLIW architectures solve the register explosion problem by employing clusters of functional units and register files. Clustering improves cycle time in two ways: by reducing the distance the signals have to travel within a cycle and by reducing the load on the bus. Clustering is beneficial for applications which have limited inter-cluster communication. However, compiling for clustered VLIW architectures can be difficult, since the compiler must schedule across the various clusters while minimizing inter-cluster operations and their latency. Although single-chip MIMD architectures eliminate the I/O bottleneck between multiple processors, the load balancing and architecture scaling issues still remain. The availability of data parallelism in signal processing applications is not utilized efficiently in MIMD architectures.

SIMD ARRAY PROCESSORS
SIMD processing refers to a set of identical processors in the architecture that execute the same instruction but work on different sets of data in parallel. An SIMD array processor refers to processor designs targeted toward the implementation of arrays or matrices. There are various types of interconnection methodologies used for array processors, such as linear array (vector), ring, star, tree, mesh, systolic arrays, and hypercubes; examples include the Illiac-IV and the Burroughs Scientific Processor (BSP).
Although vector processors have been the most popular version of array processors, mesh-based processors are still being used in scientific computing.

SIMD VECTOR PROCESSORS
Data parallelism allows vector processors to approach the performance and power efficiency of custom designs, while simultaneously providing the flexibility of a programmable processor. Vector machines were the first attempt at building supercomputers, starting from the Cray-1 machine. These processors executed vector instructions, such as vector adds and multiplications, out of a vector register file. The number of memory banks is equal to the number of processors, so that all processors can access memory in parallel.
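The essence of a vector instruction can be sketched in C: a single vector add performs, over whole vector registers, what scalar code expresses as a loop. This is illustrative only; real vector machines operate on vector register files rather than memory arrays:

```c
/* Element-wise vector add: the work of one vector-add instruction.
 * Every lane performs the same operation on different data, and with
 * as many memory banks as lanes, all operands can be fetched in
 * parallel. */
void vector_add(const float *a, const float *b, float *out, int len)
{
    for (int i = 0; i < len; ++i)
        out[i] = a[i] + b[i];
}
```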


DATA-PARALLEL DSPS
Data-parallel DSPs are architectures that exploit data parallelism in addition to instruction-level parallelism. Stream processors are state-of-the-art programmable architectures aimed at media processing applications. Stream processors enhance data-parallel DSPs by providing a bandwidth hierarchy for data flow in signal processing applications that enables support for hundreds of arithmetic units in the data-parallel DSP.

PIPELINING MULTIPLE PROCESSORS
An alternate method to attain high data rates is to provide multiple processors that are pipelined. Such processors would be able to take advantage of the streaming flow of data through the system. The disadvantages of such a design are that the architecture would need to be carefully designed to match the system throughput and is not flexible enough to adapt to changes in system workload. Also, such a pipelined system would be difficult to program and would suffer from I/O bottlenecks unless implemented as a system-on-chip (SoC). However, this is the only way to provide the desired system performance if the amount of exploitable parallelism does not meet the system requirements.


CHAPTER 3
CURRENT DSP LANDSCAPE

CONVENTIONAL DSP PROCESSORS
The performance and price range among DSP processors is very wide. In the low-cost, low-performance range are the industry workhorses, which are based on conventional DSP architecture. They issue and execute one instruction per clock cycle, and use the complex, multi-operation type of instructions described earlier. These processors typically include a single multiplier or MAC unit and an ALU, but few additional execution units, if any. Included in this group are Analog Devices' ADSP-21xx family, Texas Instruments' TMS320C2xx family, and Motorola's DSP560xx family. These processors generally operate at around 20-50 MHz, and provide good DSP performance while maintaining very modest power consumption and memory usage. Midrange DSP processors achieve higher performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures.

ENHANCED CONVENTIONAL DSP PROCESSORS
DSP processor architects improved performance by extending conventional DSP architectures with additional parallel execution units, typically a second multiplier and adder. These hardware enhancements are combined with an extended instruction set that takes advantage of the additional hardware by allowing more operations to be encoded in a single instruction and executed in parallel. We refer to this type of processor as an “enhanced-conventional DSP processor,” because it is based on the conventional DSP processor architectural style rather than being an entirely new approach. With this increased parallelism, enhanced-conventional DSP processors can execute significantly more work per clock cycle; for example, two MACs per cycle instead of one. Enhanced-conventional DSP processors typically have wider data buses to allow them to retrieve more data words per clock cycle in order to keep the additional execution units fed. They may also use wider instruction words to accommodate specification of additional parallel operations within a single instruction.

MULTI-ISSUE ARCHITECTURES
With the goals of achieving high performance and creating an architecture that lends itself to the use of compilers, some newer DSP processors use a “multi-issue” approach.


In contrast to conventional and enhanced-conventional processors, multi-issue processors use very simple instructions that typically encode a single operation. These processors achieve a high level of parallelism by issuing and executing instructions in parallel groups rather than one at a time. Using simple instructions simplifies instruction decoding and execution, allowing multi-issue processors to execute at higher clock rates than conventional or enhanced-conventional DSP processors; an example is the TMS320C62xx. The two classes of architectures that execute multiple instructions in parallel are referred to as VLIW and superscalar. These architectures are quite similar, differing mainly in how instructions are grouped for parallel execution. VLIW and superscalar architectures provide many execution units, each of which executes its own instruction. VLIW DSP processors typically issue a maximum of between four and eight instructions per clock cycle, which are fetched and issued as part of one long superinstruction, hence the name “Very Long Instruction Word.” Superscalar processors typically issue and execute fewer instructions per cycle, usually between two and four. In a VLIW architecture, the assembly language programmer specifies which instructions will be executed in parallel. Hence, instructions are grouped at the time the program is assembled, and the grouping does not change during program execution. Superscalar processors, in contrast, contain specialized hardware that determines which instructions will be executed in parallel based on data dependencies and resource contention, shifting the burden of scheduling parallel instructions from the programmer to the processor. The processor may group the same set of instructions differently at different times in the program's execution; for example, it may group instructions one way the first time it executes a loop, then group them differently for subsequent iterations.
The difference in the way these two types of architectures schedule instructions for parallel execution is important in the context of using them in real-time DSP applications. Because superscalar processors dynamically schedule parallel operations, it may be difficult for the programmer to predict exactly how long a given segment of software will take to execute. The execution time may vary based on the particular data accessed, whether the processor is executing a loop for the first time or the third, or whether it has just finished processing an interrupt, for example. Dynamic features also complicate software optimization. As a rule, DSP processors have traditionally avoided dynamic features for just these reasons; this may be why there is currently only one example of a commercially available superscalar DSP processor. In VLIW architectures, a wide instruction word may be required in order to specify information about which functional unit will execute the instruction. Wider instructions allow the use of larger, more uniform register sets, which in turn enables higher performance. There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW processors tend to require many more instructions to perform a given task. Combined with the fact that the instruction words are typically wider than those found on conventional DSP processors, this characteristic results in relatively high program memory usage. High program memory usage, in turn, may result in higher chip or system cost because of the need for additional ROM or RAM. VLIW processors typically use either wide buses or a large number of buses to access data memory and keep the multiple execution units fed with data. The architectures of VLIW DSP processors are in some ways more like those of general-purpose processors than like those of the highly specialized conventional DSP architectures.


VLIW and superscalar processors often suffer from high energy consumption relative to conventional DSP processors; in general, multi-issue processors are designed with an emphasis on increased speed rather than energy efficiency. These processors often have more execution units active in parallel than conventional DSP processors, and they require wide on-chip buses and memory banks to accommodate multiple parallel instructions and to keep the multiple execution units supplied with data, all of which contribute to increased energy consumption. Because they often have high memory usage and energy consumption, VLIW and superscalar processors have mainly targeted applications which have very demanding computational requirements but are not very sensitive to cost or energy efficiency. For example, a VLIW processor might be used in a cellular base station, but not in a portable cellular phone. On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD operations varies widely. Analog Devices, for example, modified their basic conventional floating-point DSP architecture, the ADSP-2106x, by adding a second set of execution units that exactly duplicate the original set. The augmented architecture can issue a single instruction and execute it in parallel in both sets of execution units using different data, effectively doubling performance in some algorithms. In contrast, instead of having multiple sets of the same execution units, some DSP processors can logically split their execution units into multiple sub-units that process narrower operands. These processors treat operands in long registers as multiple short operands. Perhaps the most extensive SIMD capabilities we have seen in a DSP processor to date are found in Analog Devices' TigerSHARC processor.
TigerSHARC is a VLIW architecture, and combines the two types of SIMD: one instruction can control execution of the processor's two sets of execution units, and this instruction can specify a split-execution-unit operation that will be executed in each set. Using this hierarchical SIMD capability, TigerSHARC can execute eight 16-bit multiplications per cycle. SIMD is only effective in algorithms that can process data in parallel; for algorithms that are inherently serial, SIMD is generally not of use.
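A rough C model of the split-execution-unit idea: a 32-bit word holds two 16-bit operands, and one "instruction" performs two independent 16-bit multiplications. The packing helpers are illustrative; real hardware splits the datapath rather than shifting and masking:

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Split-execution-unit multiply: two independent 16x16 multiplies on
 * the packed sub-words, keeping the low 16 bits of each product. */
uint32_t simd_mul16(uint32_t a, uint32_t b)
{
    int16_t ahi = (int16_t)(a >> 16), alo = (int16_t)a;
    int16_t bhi = (int16_t)(b >> 16), blo = (int16_t)b;
    return pack16((int16_t)(ahi * bhi), (int16_t)(alo * blo));
}
```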


CHAPTER 4
DIVERGING ARCHITECTURES

Until recently, DSP processor designs were improved primarily by incremental enhancements; new DSPs tended to maintain a close resemblance to their predecessors. In the last couple of years, however, DSP architectures have become much more interesting, with a number of vendors announcing new architectures that are completely different from preceding generations.

HIGH-PERFORMANCE DSPS
Processor designers who want higher DSP performance than can be squeezed out of traditional architectures have come up with a variety of performance-boosting strategies. The main idea is that if you want to improve performance beyond the increase afforded by faster clock speeds, you need to increase the amount of useful work that gets done every clock cycle. This is accomplished by increasing the number of operations that are performed in parallel, which can be implemented in two main ways: by increasing the number of operations performed by each instruction, or by increasing the number of instructions that are executed in every instruction cycle.

INCREASING THE WORK PERFORMED BY EACH INSTRUCTION
Traditionally, DSP processors have used complex, compound instructions that allow the programmer to encode multiple operations in a single instruction. In addition, DSP processors traditionally issue and execute only one instruction per instruction cycle. This single-issue, complex-instruction approach allows DSP processors to achieve very strong DSP performance without requiring a large amount of program memory. One method of increasing the amount of work performed by each instruction, while maintaining the basics of the traditional DSP architecture and instruction set described above, is to augment the data path with extra execution units. We refer to processors that follow this approach as “Enhanced Conventional DSPs”; their basic architecture is similar to previous generations of DSPs, but has been enhanced by adding execution units.
Lucent Technologies' DSP16000 architecture is based on that of the earlier DSP1600, but Lucent added a second multiplier, an adder, and a bit manipulation unit. To support more parallel operations and keep the processor from starving for data, Lucent also increased the data bus widths to 32 bits. The net result is a processor that is able to sustain a throughput of two multiply-accumulates per instruction cycle.

EXECUTING MULTIPLE INSTRUCTIONS PER CYCLE
A few designers have opted for a more RISC-like instruction set coupled with an architecture that supports execution of multiple instructions in every instruction cycle, e.g., the TMS320C62xx family. In TI's version, the processor fetches a 256-bit instruction “packet,” parses the packet into eight 32-bit instructions, and routes them to its eight independent execution units. VLIW processors typically suffer from high program memory requirements and high power consumption. Like VLIW processors, superscalar processors issue and execute multiple instructions in parallel. Unlike VLIW processors, in which the programmer explicitly specifies which instructions will be executed in parallel, superscalar processors use dynamic instruction scheduling to determine “on the fly” which instructions will be executed concurrently, based on the processor's available resources, on data dependencies, and on a variety of other factors. Superscalar architectures have long been used in high-performance general-purpose processors such as the Pentium and PowerPC.

CIRCULAR BUFFERING
In off-line processing, the entire input signal resides in the computer at the same time. The key point is that all of the information is simultaneously available to the processing program. This is common in scientific research and engineering, but not in consumer products. Off-line processing is the realm of personal computers and mainframes. In real-time processing, the output signal is produced at the same time that the input signal is being acquired. To calculate an output FIR sample, we must have access to a certain number of the most recent samples from the input. When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead. Circular buffers are efficient because only one value needs to be changed when a new sample is acquired. Four parameters are needed to manage a circular buffer. First, there must be a pointer that indicates the start of the circular buffer in memory. Second, there must be a pointer indicating the end of the array, or a variable that holds its length. Third, the step size of the memory addressing must be specified. These three values define the size and configuration of the circular buffer, and will not change during program operation. The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is updated based on the first three values.

DSP/MICROCONTROLLER HYBRIDS
Many applications require a mixture of control-oriented software and DSP software. A prime example is the digital cellular phone, which must implement both supervisory tasks and voice-processing tasks. In general, microcontrollers provide good performance in controller tasks and poor performance in DSP tasks; dedicated DSP processors have the opposite characteristics.
Hence, until recently, combination controller/signal processing applications were typically implemented using two separate processors: a microcontroller and a DSP.
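The four-parameter circular buffer described earlier can be modeled in a few lines of Python (an illustrative sketch only; the class and field names are invented, and `start` is carried just to mirror the start-of-buffer parameter):

```python
class CircularBuffer:
    """Software model of DSP circular buffering: a fixed region holds
    the N most recent samples, and only one memory value plus one
    pointer change when a new sample arrives."""

    def __init__(self, start, length, step=1):
        self.start = start        # pointer to start of buffer in memory
        self.length = length      # buffer length (or end-pointer equivalent)
        self.step = step          # address step size
        self.newest = 0           # pointer to the most recent sample
        self.mem = [0] * length   # backing storage

    def acquire(self, sample):
        """Advance the pointer with wraparound, overwriting the oldest sample."""
        self.newest = (self.newest + self.step) % self.length
        self.mem[self.newest] = sample

    def recent(self, k):
        """Return the k-th most recent sample (k = 0 is the newest)."""
        return self.mem[(self.newest - k * self.step) % self.length]

buf = CircularBuffer(start=0x2000, length=4)
for s in [10, 20, 30, 40, 50]:
    buf.acquire(s)
# After five samples in a 4-deep buffer, the first sample has been overwritten.
```

The first three parameters stay fixed for the life of the program; only `newest` is updated per sample, which is exactly why the hardware address generators described later can do this update for free.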

Department Of Electronics & Communication Engineering, GEC Thrissur.


In the past couple of years, however, a number of microcontroller vendors have begun to offer DSP-enhanced versions of their microcontrollers as an alternative to the dual-processor solution. Using a single processor to implement both types of software is attractive, because it can potentially:

• simplify the design task
• save board space
• reduce total power consumption
• reduce overall system cost

Microcontroller vendors like Hitachi offer DSP-enhanced versions of their microcontrollers. Hitachi's version of the SH-2, called the SH-DSP, adds a complete 16-bit fixed-point DSP data path to the SH-2. In contrast, ARM took a different approach and developed a DSP co-processor, ``Piccolo,'' that is meant to be used as an add-on to their ARM7 microcontroller. Each processor has its own instruction set and processes its own instruction stream, so it is possible for the two to operate in parallel, with the caveat that Piccolo relies on the ARM7 to perform all data transfers.

RECONFIGURABLE ARCHITECTURES
Reconfigurable architectures are defined as programmable architectures that change the hardware or the interconnections dynamically so as to provide flexibility, with simultaneous benefits in execution time due to the reconfiguration, as opposed to turning off units to conserve power. There have been various approaches to providing and using this reconfigurability in programmable architectures. The first approach is the 'FPGA+' approach, which adds a number of high-level configurable functional blocks to a general-purpose device to optimize it for a specific purpose such as wireless. The second approach is to develop a reconfigurable system around a programmable ASSP. The third approach is based on a parallel array of processors on a single die, connected by a reconfigurable fabric. These kinds of architectures are still in the initial stages of their evolution.


CHAPTER 5
NOVEL DSP ARCHITECTURES

"POST-HARVARD" TECHNOLOGY
After remaining unchanged for more than a decade, DSP architectures have started to evolve, even expanding to encompass control operations. Conventional DSPs typically use a Harvard-style architecture, with separate data and instruction buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or 48-bit words in either fixed-point or floating-point format. Whatever the word width, these conventional DSPs offer fixed-width instructions, executing one instruction per clock cycle.

Figure: The conventional DSP architecture uses separate data and instruction buses and features fixed-width instructions, executing one instruction per clock cycle.
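The multiply-accumulate pattern this data path is built around is simply a dot product; one FIR output sample rendered in plain Python (illustrative only, the function name is invented):

```python
def fir_sample(coeffs, samples):
    """One output sample of an FIR filter: for each tap, multiply a
    data sample by a coefficient and add it to a running sum -- the
    step a single-cycle MAC unit performs once per tap."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x          # one MAC operation per tap
    return acc

y = fir_sample([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6 = 32
```

A conventional DSP executes each iteration of that loop body, including the two data fetches and the pointer updates, in a single cycle.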

The instructions themselves can be fairly complex. A single instruction may embody two data moves, a MAC operation, and two pointer updates. These complex instructions help the conventional DSP offer a high degree of code density when performing repeated mathematical operations on arrays of numbers. As control devices, however, they leave something to be desired. The fixed-width instructions are inefficient when tasked with performing simple counter increments as part of a control loop, for instance. Even if the counter is only going as high as 10, the processor needs to use the full word width for the values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting. Still, because of their number-crunching proficiency, conventional DSPs soon gained popularity in communications and media applications. The communications devices, including modem and telephony processors, needed the computational power for echo canceling, voice coding, and filtering. Media applications, including digital audio, video, and imaging, needed computational power for compression and filtering along with program flexibility to track evolving standards. DSPs also found a home in disk-drive and other servo-motor-control applications. ENHANCED DSPS EMERGE As semiconductor process technology evolved, conventional DSPs began to acquire a number of on-chip peripherals such as local memory, I/O ports, timers, and DMA controllers. Their basic architecture, however, didn't change for more than a decade. Eventually, though, the relative weakness in bit-level manipulation began to catch up with conventional DSPs, as did the incessant demand for greater performance.


One common feature of these enhanced DSPs is the presence of a second MAC, which allows for some parallelism in computation. In many cases, this parallelism extends to other elements in the DSP, allowing the device to perform single-instruction, multiple-data (SIMD) operations. Often this is accomplished with data packing, which allows registers, data paths, and the like to handle two half-word operands each clock cycle. Along with data packing, many enhanced DSPs allow the instructions themselves to use fractional word widths, which allows multiple instructions to launch simultaneously.

The enhanced DSPs also tend to incorporate features that speed execution of algorithms in a specific application space, as well as add special-purpose peripherals and memory. The exact nature of the specialization varies with the application an enhanced DSP targets, which makes direct comparisons difficult. Many include hardware accelerators for frequently used operations as well as specialized addressing modes and augmented instruction sets that target the application space. The augmented instruction sets may include both special DSP instructions and RISC-like instructions for improved control operation.

Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets voice, video, and data communications signal processing along with control operations. The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In addition, a control unit handles sequencing of instructions so that a mix of 16-bit control and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or 32-bit format.
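The data-packing idea, two 16-bit half-words sharing one 32-bit register and data path, can be modeled as follows (an illustrative sketch; a real SIMD ALU performs both lanes in a single cycle, and the function names are invented):

```python
MASK16 = 0xFFFF

def pack(hi, lo):
    """Pack two 16-bit half-words into one 32-bit register value."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add16(a, b):
    """Add two packed registers as two independent 16-bit lanes,
    discarding the carry out of each lane (wraparound arithmetic),
    so one lane cannot corrupt the other."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo

r = simd_add16(pack(1, 0xFFFF), pack(2, 1))
# high lane: 1 + 2 = 3; low lane: 0xFFFF + 1 wraps to 0
```

The key point is that the lane boundary is enforced by masking: a carry generated in the low half-word never propagates into the high half-word, which is what distinguishes a packed SIMD add from an ordinary 32-bit add.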

Figure: Analog Devices' Blackfin DSP architecture handles multi-width data words and can simultaneously execute 16-bit control and 32-bit DSP instructions.

The core also includes two data address generators (DAGs) to simplify both DSP and control operations. DSP addressing operations include circular buffering, for matrix operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in conventional DSPs.
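Bit-reversed addressing, used for unscrambling FFT results, maps an index by reversing its low-order address bits; in software it looks like this (an illustrative sketch, function names invented):

```python
def bit_reverse(i, bits):
    """Reverse the low 'bits' bits of index i -- the address
    transformation a bit-reversed addressing mode applies for free
    when stepping through radix-2 FFT output."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def unscramble(x):
    """Reorder a bit-reversed FFT result into natural order.
    The length of x is assumed to be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[bit_reverse(i, bits)] for i in range(n)]

# For an 8-point FFT, indices 0..7 appear in the order 0,4,2,6,1,5,3,7.
```

A DSP with this addressing mode performs the reversal inside the address generator, so the reorder costs no extra instructions at all.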


INSTRUCTION SETS TARGET APPLICATIONS
The instruction set of the Blackfin core includes both general DSP instructions and RISC-like control instructions. In addition, the core has complex instructions geared toward the needs of the intended applications. For Huffman coding, used in communications algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine Transform, used in imaging and video, an IEEE 1180 rounding operation is available. Video compression algorithms can take advantage of the "Sum Absolute Difference" instruction.

These specialty instructions are one way that the Blackfin family targets applications. The other way is the peripheral mix each family member offers. The ADSP-21532, for example, aims at low-cost consumer multimedia applications by including peripherals supporting surround-sound and video-specific operating modes. The ADSP-21535 goes after high-performance communications applications with USB and PCI interfaces as well as substantial amounts of on-chip SRAM. The range and variety of variations within the Blackfin family, as well as the nature of its specialized instructions, mirror the diversity of enhanced conventional DSPs available from companies such as Cirrus Logic, Motorola, and Texas Instruments.

But for all the enhancements, these DSPs follow basically the same programming model as the conventional device. Other DSP architectures have emerged that follow a different programming model. In search of the highest performance levels, these architectures allow the DSP to launch multiple instructions at the same time for parallel execution. While these approaches result in greater code execution speed, they also make software more difficult to optimize. They require careful instruction ordering to avoid needing simultaneous access to the same data. They also need to avoid attempting simultaneous execution of instructions where one instruction depends on the results of the other for its operands.
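As an aside, the "Sum Absolute Difference" operation mentioned above reduces, in software, to a few lines; a dedicated instruction collapses several of these subtract/absolute/accumulate steps into a single cycle (a Python sketch, names invented):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel
    blocks -- the block-matching metric video encoders evaluate
    thousands of times per frame during motion estimation."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

score = sad([10, 20, 30], [12, 18, 30])   # |10-12| + |20-18| + |30-30| = 4
```

The candidate block with the lowest SAD score is taken as the best motion-estimation match, which is why hardware acceleration of this one metric pays off so handsomely in video compression.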
Not all DSP application software has a structure suitable for multiple-launch execution, but when it does, these DSPs offer the highest performance.

PARALLELISM ARISES
Two different forms of multiple-launch DSPs have arisen: very long instruction word (VLIW) and superscalar architectures. Both have multiple execution units configured to operate in parallel and use RISC-like instruction sets. The instructions of a VLIW architecture are explicitly parallel, being composed of several sub-instructions that control different resources. The superscalar architectures, on the other hand, load instructions in bulk, then use hardware run-time scheduling to identify instructions that can run in parallel and map them to the execution units.

Of the multi-launch architectures, VLIW designs are the most common. Devices from Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into this category, although they vary considerably in the type and number of parallel execution units they offer. The TI TMS320C64xx processors, for instance, have eight execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP, on the other hand, is scalable from two to 32 clusters, each with several execution units. The Adelante Saturn DSP core, as shown in the following figure, demonstrates the essence of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to
deliver data and 96-bit wide instructions to an array of execution units simultaneously. These units include two multipliers (MPY), four 16-bit ALUs that can combine to form two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and loop (LCU) controllers, address controllers (ACU), and an ability for design teams to add application-specific execution units (AXU) to speed processing.

Figure: Adelante's Saturn DSP core handles VLIW instructions that can comprise several sub-instructions that control different resources. The core also handles application-specific execution units (AXUs) to accelerate processing.

The Saturn core uses a unique approach to get around one of the problems the wide word widths of VLIW architectures cause. Accessing external memory is a challenge for these DSPs because of their need to work with buses that can be as wide as 128 bits. The Saturn core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses internally. Adelante developed this mapping after analyzing millions of lines of code for common applications. However, the core also allows developers to create their own application-specific instructions that map into the VLIW.

SUPERSCALAR DSPS
While the 16-bit external instruction width of the Saturn processor is unusual for VLIW architectures, it is typical for superscalar architectures. These devices pull in several instructions at a time and dynamically map them to the execution units. Internally the effect is much the same as in a VLIW architecture, in that the execution units operate in parallel. But from the software development viewpoint, the approach reduces programming complexity. With hardware handling the sequencing and arranging of instructions, the developer is free to work with the more manageable short instructions.

Consider the structure of a sample superscalar DSP, the LSI Logic ZSP600. Because it is a core, its memory interface isn't constrained, making it look like a VLIW architecture. But the presence of the instruction-sequencing unit (ISU) and the pipeline control unit betray its superscalar nature. The ZSP600 fetches eight instructions at a time and can execute as many as six, using its four MAC and two ALU execution units simultaneously. Data packing allows the units to perform 16- or 32-bit operations.
The architecture also allows for the addition of coprocessors to speed specific DSP functions.


Figure: Superscalar DSPs, such as LSI Logic's ZSP600, fetch several instructions simultaneously and dynamically map these instructions to the execution units.

This ability to add coprocessors is becoming a common feature of high-performance DSP cores. In many cases the core's creators have also created coprocessors for functions such as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't available, however, creating your own can be a major design challenge.

A recently introduced DSP architecture, the PulseDSP from Systolix, might make the task easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure. It is designed as a systolic array, which means that all data transfers occur synchronously on a clock edge. Each processing element in the array has selectable I/O paths, local data memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a programming bus running through it. The combination makes the array reprogrammable, either statically or dynamically. The array structure is intended to handle low-complexity but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a coprocessor.
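The systolic idea, where every element computes and passes data to its neighbor on each clock edge, can be modeled with a transposed-form FIR pipeline (a toy behavioral model, not the PulseDSP's actual programming interface; all names are invented):

```python
def systolic_fir(coeffs, samples):
    """Transposed-form FIR as a systolic pipeline: on each clock,
    every processing element multiplies the broadcast input sample by
    its own coefficient and adds its neighbor's delayed partial sum.
    All elements update simultaneously, like a real systolic array."""
    n = len(coeffs)
    regs = [0] * n            # partial-sum registers between elements
    out = []
    for x in samples:
        out.append(coeffs[0] * x + regs[0])
        # one synchronous clock edge: every register updates at once
        regs = [coeffs[k + 1] * x + regs[k + 1] for k in range(n - 1)] + [0]
    return out

# An impulse through a 3-tap pipeline reproduces the coefficients.
h = systolic_fir([1, 2, 3], [1, 0, 0, 0])
```

Because each element only ever talks to its immediate neighbor, the structure scales to many elements without long global wires, which is the property that lets systolic arrays run at high clock rates.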

Figure: Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a standalone unit for applications such as filters and FFTs. The array is programmable, with each processing element having its own selectable I/O paths, local data memory, and an ALU.


The array can also be used as a stand-alone processor for some types of algorithms, such as filters and FFTs. One commercial implementation of the array, in fact, provides filtering in an Analog Devices data acquisition part, the AD7725. The device combines the PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired data. The DSP array implements various filter algorithms.

Innovations such as the PulseDSP, as well as the proliferation within the other DSP architectures, are a strong indicator of how important these once-arcane processors have become. In many applications, especially communications, they share the spotlight with the RISC processor: the DSP handles the data and the RISC handles the protocols. There are problems with the two-processor approach, of course, including increased cost and software development complexity. One reason many DSPs are adding RISC-like instructions to their sets is to be able to edge out the other processor in such applications.

The same thing is happening with some RISC processors. Extensible cores, such as the Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements so that communications applications need only one processor. These enhancements follow the architecture of the conventional DSP, but merge the DSP functions into the instruction set of the RISC core. The ARCtangent demonstrates how the two get blended. The DSP instruction-decode and processing elements both connect with the rest of the core, allowing them to use the core's resources as well as their own. The extensions have full access to registers and operate in the same instruction stream as the RISC core. ARC's DSP offerings include MACs in varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also support DSP addressing modes such as bit-reversal.

Figure: The ARCtangent core from ARC International blends DSP functionality into a RISC processor. Both DSP instruction-decode and processing elements connect with the rest of the core, allowing these elements to use the core's resources as well as their own.

These extended RISC processors, enhanced conventional DSPs, and high-performance architectures have all proliferated in the last few years, a sure sign of the importance DSPs have acquired. Furthermore, that proliferation is likely to continue. With process technology allowing integration of multiple peripherals with DSP cores and instruction sets extending to match application needs, DSPs are headed the way of the microcontroller. From obscure, specialized parts, they are evolving to become a fundamental building block for virtually any system.


CHAPTER 6
ARCHITECTURE OF LATEST DSP PROCESSORS

TEXAS INSTRUMENTS TMS320C67xx FAMILY

OVERVIEW
The TMS320C67xx family comprises the highest-performance floating-point DSPs from Texas Instruments. It is based on the advanced VelociTI very-long-instruction-word (VLIW) architecture, which allows it to execute up to eight RISC-like instructions per clock cycle, making this DSP an excellent choice for multichannel and multifunction applications. It adds support for floating-point arithmetic and 64-bit data. It delivers up to 1 giga floating-point operations per second (GFLOPS) at a clock rate of 167 MHz, uses a 1.8-volt core supply, and executes up to 334 million MACs per second at 167 MHz.

The TMS320C67xx's two data paths extend hardware support for 64-bit data and IEEE-754 32-bit single-precision and 64-bit double-precision floating-point arithmetic. Each data path includes a set of four execution units, a general-purpose register file, and paths for moving data between memory and registers. The four execution units in each data path comprise two ALUs, a multiplier, and an adder/subtractor which is used for address generation. The ALUs support both integer and floating-point operations, and the multipliers can perform both 16x16-bit and 32x32-bit integer multiplies and 32-bit and 64-bit floating-point multiplies. The two register files each contain sixteen 32-bit general-purpose registers. These registers can be used for storing addresses or data. To support 64-bit floating-point arithmetic, pairs of adjacent registers can be used to hold 64-bit data.

The C6701 DSP possesses the operational flexibility of high-speed controllers and the numerical capability of array processors. This processor has 32 general-purpose registers of 32-bit word length and eight highly independent functional units. The eight functional units provide four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-point multipliers.
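The register-pairing scheme, in which two adjacent 32-bit registers hold one 64-bit double, can be illustrated with Python's struct module (this shows only the idea of splitting an IEEE-754 double across two 32-bit words; it is not TI's exact register-file encoding, and the function names are invented):

```python
import struct

def double_to_regpair(value):
    """Split an IEEE-754 double into two 32-bit words, the way a DSP
    stores a 64-bit operand in a pair of adjacent 32-bit registers."""
    raw = struct.pack('<d', value)          # 8 bytes, little-endian
    lo, hi = struct.unpack('<II', raw)      # low word, high word
    return lo, hi

def regpair_to_double(lo, hi):
    """Reassemble the register pair into the original double."""
    return struct.unpack('<d', struct.pack('<II', lo, hi))[0]

lo, hi = double_to_regpair(1.5)
assert regpair_to_double(lo, hi) == 1.5    # lossless round trip
```

The round trip is exact because nothing is converted: the pair is simply the 64-bit bit pattern viewed as two 32-bit halves, which is why a register pair can feed a 64-bit floating-point unit with no extra work.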
Program memory consists of a 64K-byte block that is user-configurable as cache or memory-mapped program space. Data memory consists of two 32K-byte blocks of RAM. The peripheral set includes two multichannel buffered serial ports (McBSPs), two general-purpose timers, a host-port interface (HPI), and a glueless external memory interface (EMIF) capable of interfacing to SDRAM or SBSRAM and asynchronous peripherals.

The on-chip memory system of the TMS320C67xx implements a modified Harvard architecture, providing separate address spaces for program and data memory. Program memory has a 32-bit address bus and a 256-bit data bus. Each of the two data paths is connected to data memory by a 32-bit address bus and two 32-bit data buses. Since there are two 32-bit data buses for each data path, the TMS320C67xx can load two 64-bit words per instruction cycle. The TMS320C6701 has 64 Kbytes of 32-bit on-chip program RAM and 64 Kbytes of 16-bit on-chip data RAM. The TMS320C6701 has one external memory interface, which provides a 23-bit address bus and a 32-bit data bus. These buses are multiplexed between program and data memory accesses. Addressing modes supported include register-direct, register-indirect, indexed register-indirect, and modulo addressing. Immediate data is also supported.


The TMS320C67xx does not support hardware looping, and hence all loops must be implemented in software. However, the parallel architecture of the processor allows the implementation of software loops with virtually no overhead. The peripherals on the TMS320C6701 include a host port, a four-channel DMA controller, two TDM-capable buffered serial ports, and two 32-bit timers.

CPU ARCHITECTURE


CPU DESCRIPTION
Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The variable-length execute packets are a key memory-saving feature, distinguishing the C67x CPU from other VLIW architectures.

The CPU features two sets of functional units. Each set contains four units and a register file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units .D2, .M2, .S2, and .L2. The two register files contain sixteen 32-bit registers each, for a total of 32 general-purpose registers. The two sets of functional units, along with the two register files, compose sides A and B of the CPU. The four functional units on each side of the CPU can freely share the 16 registers belonging to that side. Additionally, each side features a single data bus connected to all registers on the other side, by which the two sets of functional units can access data from the register files on opposite sides.

In addition to the C62x DSP fixed-point instructions, six of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle. Another key feature of the C67x CPU is the load/store architecture, where all instructions operate on registers. The two data-addressing units (.D1 and .D2) are responsible for all data transfers between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data to or from the other register file. The C67x CPU supports a variety of indirect-addressing modes using either linear or circular addressing with 5- or 15-bit offsets. All instructions are conditional, and most can access any one of the 32 registers.
Some registers, however, are singled out to support specific addressing or to hold the condition for conditional instructions. The two .M functional units are dedicated multipliers. The two .S and .L functional units perform a general set of arithmetic, logical, and branch functions with results available every clock cycle.

The processing flow begins when a 256-bit-wide instruction fetch packet is fetched from program memory. The 32-bit instructions destined for the individual functional units are “linked” together by “1” bits in the least significant bit (LSB) position of the instructions. The instructions that are “chained” together for simultaneous execution compose an execute packet. A “0” in the LSB of an instruction breaks the chain, effectively placing the instructions that follow it in the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits wide), the assembler places it in the next fetch packet, while the remainder of the current fetch packet is padded with NOP instructions. The number of execute packets within a fetch packet can vary from one to eight. Execute packets are dispatched to their respective functional units at the rate of one per clock cycle, and the next 256-bit fetch packet is not fetched until all the execute packets from the current fetch packet have been dispatched. After decoding, the instructions simultaneously drive all active functional units for a maximum execution rate of eight instructions every clock cycle. While most results are stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-words as well. All load and store instructions are byte-, half-word-, or word-addressable.
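The LSB-chaining rule described above can be modeled directly (a behavioral sketch only; real C6x instruction words carry many other fields, and the word values here are made up):

```python
def split_execute_packets(fetch_packet):
    """Group a fetch packet (a list of 32-bit instruction words) into
    execute packets: a 1 in an instruction's LSB chains the following
    instruction into the same packet; a 0 ends the packet."""
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p-bit clear: packet boundary
            packets.append(current)
            current = []
    if current:                    # trailing chain would spill into the
        packets.append(current)    # next fetch packet
    return packets

# Three instructions chained together, then one standing alone:
fp = [0x11, 0x21, 0x30, 0x40]
groups = split_execute_packets(fp)   # [[0x11, 0x21, 0x30], [0x40]]
```

Everything inside one returned group would be dispatched to its functional units in the same clock cycle, which is how the assembler expresses parallelism without any run-time scheduling hardware.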


ANALOG DEVICES ADSP-21XX FAMILY

OVERVIEW
The ADSP-21xx is the first single-chip DSP processor family from Analog Devices. The family consists of a large number of processors based on a common 16-bit fixed-point architecture core with a 24-bit instruction word. Each processor combines the core DSP architecture (computation units, data address generators, and program sequencer) with differentiating features such as on-chip program and data RAM, a programmable timer, and one or two serial ports. The fastest members of the family operate at 75 MIPS at 2.5 volts, 52 MIPS at 3.3 volts, and 40 MIPS at 5.0 volts. Analog Devices has recently announced the ADSP-219x series, which offers projected speeds of up to 300 MIPS, as well as architectural enhancements. ADSP-21xx processors are targeted at modem, audio, PC multimedia, and digital cellular applications.

Fabricated in a high-speed, submicron, double-layer metal CMOS process, the highest-performance ADSP-21xx processors operate at 25 MHz with a 40 ns instruction cycle time. Every instruction can execute in a single cycle. Fabrication in CMOS results in low power dissipation. The ADSP-2100 family's flexible architecture and comprehensive instruction set support a high degree of parallelism.

The ADSP-21xx data path consists of three separate arithmetic execution units: an arithmetic/logic unit (ALU), a multiplier/accumulator (MAC), and a barrel shifter. Each unit is capable of single-cycle execution, but only one of these units can be active during a single instruction cycle. The ALU operates on 16-bit data. In addition to the usual ALU operations, the ALU provides increment/decrement, absolute value, and add-with-carry functions. ALU results are saturated upon overflow if the appropriate configuration bit is set by the programmer. The MAC unit includes a 16x16->32-bit multiplier, four input registers, a feedback register, a 40-bit adder, and a single 40-bit result register/accumulator providing eight guard bits.
Besides signed operands, the multiplier can operate on unsigned/unsigned or on signed/unsigned operands, thus supporting multi-precision arithmetic. The barrel shifter shifts 16-bit inputs from an input register or from the ALU/MAC/barrel shifter result registers into a 32-bit result register. Logical and arithmetic shifts of up to 32 bits are supported in either direction. The barrel shifter also supports block floating-point arithmetic with block exponent detect (which determines the maximum exponent of a block of data), single-word exponent detect, normalize, and exponent adjust instructions.

ADSP-21xx processors use a modified Harvard architecture with separate memory spaces and on-chip bus sets for program and data. All processors in the ADSP-21xx family include on-chip program RAM or ROM and on-chip data RAM. On-chip program memory can be used for both instructions and data, and it can be accessed via a 14-bit address bus and a 24-bit data bus. On-chip program memory is dual-ported to allow the processor to fetch both a data operand and the next instruction in a single instruction cycle. The on-chip data memory can be accessed via a 14-bit address bus and a 16-bit data bus. One access to the on-chip data memory can be performed in a single instruction cycle. Three memory accesses (one instruction and two data operands) can be performed in one instruction cycle.
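Returning to the computation units, the saturate-on-overflow and guard-bit behavior described above can be sketched as follows (a behavioral illustration, not a cycle-accurate model of the ADSP-21xx; the names are invented):

```python
def saturate(value, bits=16):
    """Clamp a signed result to the representable range, as the
    ADSP-21xx ALU does on overflow when saturation mode is enabled,
    instead of letting the result wrap around."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

# 16x16 products accumulated in 40 bits: each product fits in 32 bits,
# so the 8 guard bits let up to 256 full-scale products accumulate
# before the total itself can overflow the accumulator.
acc = 0
for _ in range(256):
    acc += (-32768) * (-32768)     # largest-magnitude 16-bit product
assert acc < (1 << 39)             # still fits in the 40-bit accumulator
```

Saturation turns a catastrophic wraparound (a large positive sum suddenly becoming negative) into a bounded clipping error, which is far less audible or visible in signal-processing output.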


Both of the on-chip memory spaces can be extended off-chip. All ADSP-21xx processors have one external memory interface, providing a 14-bit address bus and a 24-bit data bus. This external interface is multiplexed between program and data memory accesses. The ADSP-21xx supports register-direct, memory-direct and register-indirect addressing modes. Immediate data is also supported. The ADSP-21xx provides zero-overhead program looping through its DO instruction. Any length sequence of instructions can be contained in a hardware loop, and up to 16,384 repetitions are supported.

ARCHITECTURE OVERVIEW
The processors contain three independent computational units: the ALU, the multiplier/accumulator (MAC), and the shifter. The ALU performs a standard set of arithmetic and logic operations; division primitives are also supported. The MAC performs single-cycle multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and derive-exponent operations. The shifter can be used to efficiently implement numeric format control, including multiword floating-point representations. The internal result (R) bus directly connects the computational units so that the output of any unit may be used as the input of any unit on the next cycle.

A powerful program sequencer and two dedicated data address generators ensure efficient use of these computational units. The sequencer supports conditional jumps, subroutine calls, and returns in a single cycle. With internal loop counters and loop stacks, the ADSP-21xx executes looped code with zero overhead; no explicit jump instructions are required to maintain the loop. Two data address generators (DAGs) provide addresses for simultaneous dual operand fetches (from data memory and program memory). Each DAG maintains and updates four address pointers. Whenever a pointer is used to access data (indirect addressing), it is post-modified by the value of one of four modify registers. A length value may be associated with each pointer to implement automatic modulo addressing for circular buffers. The circular buffering feature is also used by the serial ports for automatic data transfers to on-chip memory.

Efficient data transfer is achieved with the use of five internal buses, namely: the Program Memory Address (PMA) bus, the Program Memory Data (PMD) bus, the Data Memory Address (DMA) bus, the Data Memory Data (DMD) bus, and the Result (R) bus.


The two address buses (PMA, DMA) share a single external address bus, allowing memory to be expanded off-chip, and the two data buses (PMD, DMD) share a single external data bus. The BMS, DMS, and PMS signals indicate which memory space is using the external buses. Program memory can store both instructions and data, permitting the ADSP-21xx to fetch two operands in a single cycle, one from program memory and one from data memory. The processor can fetch an operand from on-chip program memory and the next instruction in the same cycle. The memory interface supports slow memories and memory-mapped peripherals with programmable wait-state generation. External devices can gain control of the processor's buses with the use of the bus request/grant signals.

One bus grant execution mode (GO Mode) allows the ADSP-21xx to continue running from internal memory. A second execution mode requires the processor to halt while buses are granted. Each ADSP-21xx processor can respond to several different interrupts. There can be up to three external interrupts, configured as edge- or level-sensitive. Internal interrupts can be generated by the timer, the serial ports, and, on the ADSP-2111, the host interface port. There is also a master RESET signal.

Booting circuitry provides for loading on-chip program memory automatically from byte-wide external memory. After reset, three wait states are automatically generated. This allows, for example, a 60 ns ADSP-2101 to use a 200 ns EPROM as external boot memory. Multiple programs can be selected and loaded from the EPROM with no additional hardware.

The data receive and transmit pins on SPORT1 (Serial Port 1) can alternatively be configured as a general-purpose input flag and output flag. These pins can be used for event signalling to and from an external device.

A programmable interval timer can generate periodic interrupts. A 16-bit count register (TCOUNT) is decremented every n cycles, where n-1 is a scaling value stored in an 8-bit register (TSCALE). When the value of the count register reaches zero, an interrupt is generated and the count register is reloaded from a 16-bit period register (TPERIOD).


BLACKFIN PROCESSOR
Blackfin Processors are a new breed of embedded media processor. Based on the Micro Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin Processors combine a 32-bit RISC-like instruction set and dual 16-bit multiply-accumulate (MAC) signal processing functionality with the ease-of-use attributes found in general-purpose microcontrollers. This combination of processing attributes enables Blackfin Processors to perform equally well in both signal processing and control processing applications, in many cases eliminating the need for separate heterogeneous processors. The processor family also offers industry-leading power consumption, as low as 0.15 mW/MMAC at 0.8 V. This combination of high performance and low power is essential in meeting the needs of today's and tomorrow's signal processing applications, including broadband wireless, audio/video-capable Internet appliances, and mobile communications.

HIGH PERFORMANCE SIGNAL PROCESSING
The core architecture employs a fully interlocked instruction pipeline, multiple parallel computational blocks, efficient DMA capability, and instruction set enhancements designed to accelerate video processing.

FULLY INTERLOCKED INSTRUCTION PIPELINE
All Blackfin Processors utilize a multi-stage, fully interlocked pipeline that guarantees code executes as the programmer expects and that all data hazards are hidden. This type of pipeline guarantees result accuracy by stalling when necessary to achieve proper results.

HIGHLY PARALLEL COMPUTATIONAL BLOCKS
The basis of the Blackfin Processor architecture is the Data Arithmetic Unit, which includes two 16-bit multiplier-accumulators (MACs), two 40-bit arithmetic logic units (ALUs), four 8-bit video ALUs, and a single 40-bit barrel shifter. Each MAC can perform a 16-bit by 16-bit multiply on four independent data operands every cycle. The 40-bit ALUs can accumulate either two 40-bit numbers or four 16-bit numbers.
With this architecture, 8-, 16-, and 32-bit data word sizes can be processed natively for maximum efficiency. Two Data Address Generators (DAGs) are complex load/store units designed to generate addresses to support sophisticated DSP filtering operations. For DSP addressing, bit-reversed addressing and circular buffering are supported. The DAGs also include two loop counters for nested zero-overhead looping and hardware support for on-the-fly saturation and clipping.

HIGH BANDWIDTH DMA CAPABILITY
All Blackfin Processors have multiple, independent DMA controllers that support automated data transfers with minimal overhead from the processor core. DMA transfers can occur between the internal memories and any of the many DMA-capable peripherals.

VIDEO INSTRUCTIONS
In addition to native support for 8-bit data, the word size common to many pixel processing algorithms, the Blackfin Processor architecture includes instructions specifically defined to


enhance performance in video processing applications. These enhanced instructions accelerate common video compression algorithms.

EFFICIENT CONTROL PROCESSING
Blackfin Processors offer control processing features similar to those of RISC control processors. These features include a hierarchical memory architecture, superior code density, and a variety of microcontroller-style peripherals, including a watchdog timer, a real-time clock, and an integrated SDRAM controller. The L1 memory is connected directly to the processor core, runs at full system clock speed, and offers maximum system performance for time-critical algorithm segments. The L2 memory is a larger, bulk memory storage block that offers slightly reduced performance, but is still faster than off-chip memory. The L1 memory structure has been implemented to provide the performance needed for signal processing while offering the programming ease found in general-purpose microcontrollers. By supporting both SRAM and cache programming models, system designers can allocate critical DSP data sets that require high bandwidth and low latency to SRAM, while retaining the simple programming model of the data cache for operating system (OS) and microcontroller code.

The Memory Management Unit provides a memory protection model that can support a full OS kernel. The OS kernel runs in Supervisor mode and partitions blocks of memory and other system resources for the actual application software, which runs in User mode. This is a unique and powerful feature not present on traditional DSPs.

SUPERIOR CODE DENSITY
The Blackfin Processor architecture supports multi-length instruction encoding. Very frequently used control-type instructions are encoded as compact 16-bit words, while more mathematically intensive DSP instructions are encoded as 32-bit values.

DYNAMIC POWER MANAGEMENT
Blackfin Processors employ multiple power-saving techniques. They are based on a gated-clock core design that selectively powers down functional units on an instruction-by-instruction basis. They also support multiple power-down modes for periods where little or no CPU activity is required. Lastly, and probably most importantly, Blackfin Processors support a dynamic power management scheme whereby both the operating frequency and the voltage can be tailored to meet the performance requirements of the algorithm currently being executed.

BLACKFIN PROCESSOR CORE BASICS
The Blackfin Processor core is a load-store architecture consisting of a Data Arithmetic Unit, an Address Arithmetic Unit, and a sequencer unit. Blackfin Processors combine a high-performance, dual-MAC DSP architecture with the programming ease of a RISC MCU in a single instruction set architecture.


GENERAL PURPOSE REGISTER FILES
The Blackfin Processor core includes an 8-entry by 32-bit data register file for general use by the computational units. Supported data types include 8-, 16-, or 32-bit signed or unsigned integer and 16- or 32-bit signed fractional. In every clock cycle, this multiported register file supports two 32-bit reads and two 32-bit writes. It can also be accessed as a 16-entry by 16-bit data register file. The address register file provides a general-purpose addressing mechanism in addition to supporting circular buffering and stack maintenance. This register file consists of 8 entries and includes a frame pointer and a stack pointer. The frame pointer is useful for subroutine parameter passing, while the stack pointer is useful for storing the return address from subroutine calls.

DATA ARITHMETIC UNIT
It contains:
• Two 16-bit MACs
• Two 40-bit ALUs
• Four 8-bit video ALUs
• A single 40-bit barrel shifter

All computational resources can process 8-, 16-, or 32-bit operands from the data register file (R0 through R7). Each register can be accessed as a 32-bit register or as a 16-bit high or low half. In a single clock cycle, the dual data paths can read and write up to two 32-bit values. However, since the high and low halves of the R0 through R7 registers are individually addressable (Rx, Rx.H, or Rx.L), each computational block can choose from either two 32-bit input values or four 16-bit input values with no restrictions on input data. The results of the computation can be written back into the register file as either a 32-bit entity or as the high or low 16-bit half of a register. Additionally, the method of accumulation can vary between data paths.


Both accumulators are 40 bits in length, providing 8 bits of extended precision. Like the general-purpose registers, both accumulators can be accessed in 16-, 32-, or 40-bit increments. The Blackfin architecture also supports a combined add/subtract instruction that can generate two 16-, 32-, or 40-bit results or four 16-bit results. In the case where four 16-bit results are desired, the high and low half results can be interchanged. This is a very powerful capability and significantly improves, for instance, the FFT benchmark results.

ADDRESS ARITHMETIC UNIT
Two data address generators (DAGs) provide addresses for simultaneous dual operand fetches from memory. The DAGs share a register file that contains four sets of 32-bit index (I), length (L), base (B), and modify (M) registers. There are also eight additional 32-bit address registers (P0 through P5, the frame pointer, and the stack pointer) that can be used as pointers for general indexing of variables and stack locations. The four sets of I, L, B, and M registers are useful for implementing circular buffering: used together, each set of index, length, base, and modify registers can implement a unique circular buffer in internal or external memory. The Blackfin architecture also supports a variety of addressing modes, including indirect, auto-increment and decrement, indexed, and bit-reversed. Finally, all address registers are 32 bits in length, supporting the full 4 Gbyte address range of the Blackfin Processor architecture.

PROGRAM SEQUENCER UNIT
The program sequencer controls the flow of instruction execution and supports conditional jumps and subroutine calls, as well as nested zero-overhead looping. A multistage, fully interlocked pipeline guarantees code is executed as expected and that all data hazards are hidden from the programmer, stalling when necessary to achieve proper results.
The Blackfin architecture supports 16- and 32-bit instruction lengths in addition to limited multi-issue 64-bit instruction packets. This ensures maximum code density by encoding the most frequently used control instructions as compact 16-bit words and the more challenging math operations as 32-bit double words.


LSI LOGIC ZSP600 QUAD-MAC SUPERSCALAR CORE
OVERVIEW
The ZSP600 is a quad-MAC superscalar DSP core that addresses the high-performance data throughput and signal processing requirements of emerging communications platforms. The ZSP600 supports up to six-instruction-per-cycle DSP performance at a peak clock rate of 300 MHz. It includes quad-MAC and quad-ALU computational resources, a high-performance load/store memory architecture, and dedicated co-processor interfaces, combined with state-of-the-art power reduction techniques. These attributes make the ZSP600 core an ideal solution for a variety of embedded DSP algorithms, including those required for wireless infrastructure, mobile (3G), IAD/home gateway, central office, and access/network applications.

ZSP600 instruction parallelism is supported by user-transparent instruction grouping and pipeline control, delivering superscalar DSP performance while programming with a RISC instruction set. The ZSP600 is a fully synthesizable, single-phase clocked architecture, with all core I/Os registered for ease of process migration and design flexibility.

The ZSP600 provides extensive computational resources, including four 16-bit multipliers/MACs, dual 40-bit ALUs, and dual 16-bit ALUs, all capable of supporting 16- and 32-bit operations. The ZSP600 can perform four independent 16x16 MUL/MAC operations into four 16-bit or two 40-bit results, two 32x32-bit MUL/MACs into a 32-bit result, or two Viterbi (add-compare-select) results per cycle.

The ZSP600 is based upon a high-bandwidth memory architecture with a separate eight-instruction-per-cycle prefetch interface and dual 64-bit data interfaces, over a 24-bit address space. The instruction memory architecture allows multi-instruction-per-cycle prefetch into an integrated instruction cache. The data memory architecture incorporates dual independent 64-bit load/store units, with dedicated address generation, allowing up to eight 16-bit word or four 32-bit word load/store operations per cycle.
The ZSP600 integrates a bi-directional co-processor interface to support hardware acceleration. The memory subsystem (MSS) is decoupled from the DSP operations to provide increased flexibility in supporting different memory schemes. It also includes instruction set enhancements to the RISC architecture for improved broadband and wireless application support.


A WORD ON SUPERSCALAR DSP
A superscalar architecture simply implies that the architecture is responsible for resolving operand and resource hazards and that it has the resources to achieve an instruction throughput greater than one instruction per clock. Logic dedicated to pipeline control is kept to a minimum by enforcing in-order execution and by isolating the control to a single stage at the head of the pipeline. This stage issues sequential groups of instructions that have no data dependencies or other resource conflicts. Once a group of instructions has been issued, they advance through the pipeline in lock step.

A VLIW machine, by contrast, does not employ instruction scheduling or pipeline protection. Instructions in a VLIW pipeline are statically issued, and it is the programmer's responsibility to prevent data hazards and resource conflicts. Superscalar architectures also facilitate software compatibility, not only between implementations of the same architecture but also from one generation of the architecture to the next, thus increasing software lifetime.

ARCHITECTURE OVERVIEW
The G2 architecture is scalable in terms of arithmetic resources, data bandwidth, and pipeline capacity. This scalable nature allows the architecture to support multiple implementations that target different application spaces. All address and data I/O communication across the core boundary is registered. This feature is highly desirable from an SOC system designer's point of view for a number of reasons, one being the removal of timing budget ambiguities between system logic and the core.

The prefetch unit (PFU) is at the head of the instruction pipeline. The ZSP600 can prefetch eight 16-bit words per cycle. The PFU is responsible for maximizing the probability that the instruction cache has the data required by the instruction sequencing unit (ISU) for any given fetch cycle.
The prefetch unit performs limited decoding to identify code discontinuities and to apply static branch prediction when necessary. The ISU is responsible for instruction fetch and decode, instruction grouping, and instruction issue. Instruction grouping refers to the pipeline stage in which operand dependencies are resolved. The ISU issues groups of in-order instructions that will not cause any operand conflicts. This is the only unit (and the only stage in the execution pipeline) that enforces pipeline protection; isolating the pipeline protection logic in this manner simplifies pipeline control logic significantly.

The ZSP600 ISU can issue up to six instructions per cycle, one to each of the six primary datapaths: two address generation units (AGUs), two arithmetic logic units (ALUs), and two multiply/accumulate/arithmetic units (MAUs) that are capable of performing up to four MAC operations per cycle. The pipeline control unit (PCU) stages the control associated with each of the primary data paths and the bypass logic. The PCU is also responsible for managing interrupt control, the co-processor interface, the debug interface, and the on-core timers. The bypass unit (BYP) handles all data forwarding between execution units.


PIPELINE
The pipeline of the G2 architecture is an eight-stage pipeline. The existing architecture uses a data prefetch mechanism, called data linking, to efficiently sustain the required data bandwidth for its dual MAC. All pipeline protection and resource allocation is performed during the grouping stage. Instruction groups are issued by the grouping stage and advance in lock step down the remainder of the pipeline. Data address generation is performed in the AG stage. This stage is also responsible for enforcing the boundaries of the circular buffers. A load or store that straddles a boundary of a circular buffer is split by the AGU into two sequential accesses.

Stages M0 and M1 are allocated for data memory loads and are optimized for systems using synchronous RAM. M0 is allocated for address decode and M1 for data access and return. Load and store requests are registered and issued to the memory subsystem in M0. The memory interface is stallable: if the MSS determines that it cannot return the requested data during M1, it stalls the core until the data is ready.

ARITHMETIC RESOURCES
By adding two AGUs, along with dedicated address registers, the arithmetic throughput of G2 demonstrates an immediate improvement. The two AGUs allow the core to issue any combination of two loads or stores per cycle. The data size of the load/store is implementation specific. Each data port in the ZSP600 is 64 bits wide, allowing a total of 128 bits (8 words) of data to be loaded per cycle. The AGUs have dedicated hardware to support four circular buffers and reverse-carry addressing. The circular buffer support has been enhanced to support load/store operations with positive and negative offsets and signed indexes. Circular buffer logic also applies to address arithmetic and has no alignment restrictions.


REGISTER RESOURCES
With 32-bit address registers, the architecture allows implementations of the core to remain flexible in defining the physical linear address space. The address register remains a 32-bit register to ensure pointer sizes stay the same from one implementation to the next. This also allows the address registers to be used as temporary registers for the GPRs. Dedicated address registers simplify the instruction decoder and issue logic, which can now identify address-related operations and assign the datapath resources appropriately. The primary operand resource of the AGUs is the address register file, allowing the general-purpose register file to be physically optimized for data moving to and from the ALUs and MAUs.

The current generation defines two 32-bit registers and an additional 16-bit register whose low and high bytes serve as the upper byte of each accumulator, yielding two 40-bit accumulators. A guard byte is now available for each of the eight extended 32-bit registers of the GPRs. Accumulators are also recognized in the programming model through associated instruction set support for 40-bit arithmetic and for 40-bit data loads and stores.

INSTRUCTION SET ENHANCEMENTS
A powerful enhancement to the new architecture is the ability to conditionally execute instructions. The programming model for G2 allows programmers to define packets of instructions that are predicated on a specified condition. The programmer defines a bracketed set of up to eight instructions that will be predicated in the execution pipeline based on that condition. A packet of instructions can be issued over multiple cycles, using the same operand and resource rules enforced by the grouping stage, but it is atomic in the sense that a packet of instructions cannot be interrupted; interrupts can occur between successive packets.
Due to the inclusion of stack-based operations in combination with the quad-word data support, all general-purpose registers, address registers, and index registers can be pushed or popped in eleven clock cycles. The enhanced instruction set also includes new bit-field insert and extract operations, instructions to support 40-bit arithmetic, multiply/accumulate instructions that accept both signed and unsigned operands, and a division-assist instruction that returns a 16-bit quotient and remainder in 16 cycles.

POWER REDUCTION
The core implements a multi-tiered power saving scheme. At the highest level, the core's power consumption can be controlled via instructions that idle the core when desired. This feature, which is common among DSPs, allows the core to effectively shut down when it is not being used; an interrupt wakes the core when needed. The second level of power savings comes from an internal unit that dynamically controls the clocks of other units on a clock-by-clock basis.

PERFORMANCE
The pipeline of the G2 architecture has been designed to achieve 300 MHz operation in a 0.13 µm technology. Performance modeling suggests an average improvement of roughly three times that of the existing architecture.


SHARC DSP FAMILY
ARCHITECTURE OVERVIEW
The von Neumann architecture contains a single memory and a single bus for transferring data into and out of the central processing unit (CPU). Multiplying two numbers requires at least three clock cycles, one to transfer each of the three numbers over the bus from the memory to the CPU. The Harvard architecture uses separate memories for data and program instructions, with separate buses for each. Since the buses operate independently, program instructions and data can be fetched at the same time, improving speed over the single-bus design.

SHARC® DSPs take their name from a contraction of the longer term Super Harvard ARChitecture. The idea is to build upon the Harvard architecture by adding features to improve throughput. SHARC DSPs add two such features: an instruction cache and an I/O controller.

The instruction cache improves the performance of the Harvard architecture. A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus: when two numbers are multiplied, two binary values must be passed over the data memory bus, while only one binary value is passed over the program memory bus. DSP algorithms generally spend most of their execution time in loops, which means that the same set of program instructions continually passes from program memory to the CPU. The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU. This is a small memory that contains about 32 of the most recent program instructions. The first time through a loop, the program instructions must be passed over the program memory bus; on subsequent executions of the loop, they can be pulled from the instruction cache.
This means that all of the memory-to-CPU information transfers can be accomplished in a single cycle: the sample from the input signal comes over the data memory bus, the coefficient comes over the program memory bus, and the program instruction comes from the instruction cache. In the jargon of the field, this efficient transfer of data is called a high memory-access bandwidth. The SHARC DSPs also provide both serial and parallel communication ports. These are extremely high-speed connections; the six parallel ports each provide a 40 Mbytes/second data transfer, and when all six are used together the data transfer rate is an incredible 240 Mbytes/second. This type of high-speed data transfer is characteristic of DSPs.


At the top of the diagram are two blocks labeled Data Address Generator (DAG), one for each of the two memories. These control the addresses sent to the program and data memories, specifying where information is to be read from or written to. In SHARC DSPs, each of the two DAGs can control eight circular buffers. This means that each DAG holds 32 variables, plus the required logic. This abundance of circular buffers greatly simplifies DSP code generation, both for the human programmer and for high-level language compilers such as C.

The data register section of the CPU is used in the same way as in traditional microprocessors. In the ADSP-210xx SHARC DSPs, there are 16 general-purpose registers of 40 bits each. These can hold intermediate calculations, prepare data for the math processor, serve as a buffer for data transfer, hold flags for program control, and so on. If needed, these registers can also be used to control loops and counters; however, the SHARC DSPs have extra hardware registers to carry out many of these functions.

The math processing is broken into three sections: a multiplier, an arithmetic logic unit (ALU), and a barrel shifter. The multiplier takes the values from two registers, multiplies them, and places the result into another register. The ALU performs addition, subtraction, absolute value, logical operations (AND, OR, XOR, NOT), conversion between fixed- and floating-point formats, and similar functions. Elementary binary operations are carried out by the barrel shifter, such as shifting, rotating, and extracting and depositing segments. A powerful feature of the SHARC family is that the multiplier and the ALU can be accessed in parallel: in a single clock cycle, data from registers 0-7 can be passed to the multiplier, data from registers 8-15 can be passed to the ALU, and the two results returned to any of the 16 registers. Another feature is the use of shadow registers for all of the CPU's key registers.
These are duplicate registers that can be switched with their counterparts in a single clock cycle. They are used for fast context switching: the ability to handle interrupts quickly. When an interrupt occurs in traditional microprocessors, all the internal data must be saved before the interrupt can be handled, which usually involves pushing all of the occupied registers onto the stack, one at a time. In comparison, an interrupt in the SHARC family is handled by moving the internal data into the shadow registers in a single clock cycle. When the interrupt routine is completed, the registers are just as quickly restored.


Because of its highly parallel nature, the SHARC DSP can simultaneously carry out all of these tasks. Specifically, within a single clock cycle, it can perform a multiply, an addition, two data moves, update two circular buffer pointers, and control the loop. There will be extra clock cycles associated with beginning and ending the loop; however, these tasks are also handled very efficiently, and if the loop is executed more than a few times this overhead is negligible.

The important point is that the fixed-point programmer must understand dozens of ways to carry out the very basic task of multiplication, whereas the floating-point programmer can concentrate on the algorithm. The SHARC family can represent numbers in 32-bit fixed point, a mode that is common in digital audio applications. This places the 2^32 quantization levels uniformly over a relatively small range, say, between -1 and 1. In comparison, floating-point notation places the 2^32 quantization levels logarithmically over a huge range, typically ±3.4×10^38. This gives 32-bit fixed point better precision; that is, the quantization error on any one sample will be lower. However, 32-bit floating point has a higher dynamic range, meaning there is a greater difference between the largest number and the smallest number that can be represented.

To handle high-power tasks, several DSPs can be combined into a single system. This is called multiprocessing or parallel processing. The SHARC DSPs were designed with this type of multiprocessing in mind, and include special features to make it as easy as possible. For instance, no external hardware logic is required to connect the external buses of multiple SHARC DSPs together; all of the bus arbitration logic is already contained within each device. As an alternative, the link ports (4-bit, parallel) can be used to connect multiple processors in various configurations.


CONCLUSION
Many different DSP architectures have evolved over the past few years, and most of them have already marked their presence in their respective application areas. Recent developments in DSP architectures are mainly marked by the introduction of VLIW and superscalar designs, as well as the Super Harvard Architecture. The embedded field has also been refined by the introduction of hybrid DSP devices. The current trend is focused on reducing power consumption while increasing performance. From the detailed analysis of the architectures of currently available DSPs, we conclude that there is no common platform for evaluating the performance of these devices; nevertheless, each is well competitive in its own application area.


