Design & Implementation of an Area Efficient FIR Filter Using FPGA & VHDL Presented By: K.Praveen V.Sathish kumar A.Yasar Arafath R.Venkateshwaran
Outline
Introduction FIR filter implementation MAG algorithm RSG Multiplier block synthesis algorithm Generation & Presentation of results Overview Advantages Applications Conclusion
Introduction High-speed digital signal processor (DSP) systems are increasingly being implemented on field programmable gate array (FPGA) hardware platforms. The techniques in use are applicable to implementation on application-specific integrated circuit (ASIC) and Structured ASIC technologies, analysis is performed using field programmable gate array (FPGA) hardware. The FPGA platform provides high performance and flexibility with the option to reconfigure and is the technology focused on for the remainder of this paper. Finite impulse response (FIR) digital filters are common DSP functions and are widely used in FPGA implementations. If very high sampling rates are required, full-parallel hardware must be used where every clock edge feeds a new input sample and produces a new output sample.
continued……. Such filters can be implemented on FPGAs using combinations of the general purpose logic fabric, on-board RAM and embedded arithmetic hardware. Full-parallel filters cannot share hardware over multiple clock cycles and so tend to occupy large amounts of resource. Hence, efficient implementation of such filters is important to minimize hardware requirement. When implementing a DSP system on a platform containing dedicated arithmetic blocks, it is normal practice to utilize such blocks as far as possible in preference to any general purpose logic fabric.
FIR Filter
A finite impulse response (FIR) filter is a filter structure that can be used to implement almost any sort of frequency response digitally. An FIR filter is usually implemented by using a series of delays, multipliers, and adders to create the filter's output. The general formula of FIR filter is
Y(n) =∑ x(k) h(n-k) where k = 0,1,2…..L-1
FIR Filter Add/Shift Implementation Replacing Constant Multiplication by Multiplier Block
X [n] X[n]
Multiplier Block
+
+
+
+
y0 Delay Block
+
+
+
y2
y1
x Z-1
+
Z-1
+
yL-1
hL-1 Z-1
Z-1
+
y[n]
x
FIR Filter Add/Shift Implementation Registered Adder at no Additional Cost
Shift/Add Operation
RSG Algorithm
Reduced Slice Graph Algorithm algorithm performing synthesis for low FPGA area. This can be achieved by synthesising low-logic depth multiplier blocks. Considering above multiplier block diagram for RSG synthesis, that multiplies the input by 3, 13, 21,37 in 2 clock cycles. input is fed to the ‘3’ adder untouched and after being left shifted once (multiplied by 2). Hence, the output of the ‘3’ adder is 2x + x = 3x required. This product can be used as a graph output be fed to the filter summation chain.
Contd..
It generate +ve, odd integers since negative filter coefficient weightings can be restored at the summation chain by subtracting the +ve equivalent generated by the multiplier block. Once a multiplier has been synthesised, the multiplier & any additional values used to generate it are placed in a set called ‘Graph Set’. RSG stores data that describes the synthesised graph topology & the mathematical association between the vertices and edges (shifts)
Contd…
RSG requires the least FPGA area with the slice and flip-flop cost. RSG takes the approach , starts with the highest cost values and simply inserts the ‘best graph’ required for each, ensuring no duplicate adders are created and that adder outputs are shared as far as possible. As the number of slices reduced compared to MAG algorithm, RSG synthesis is followed to reduce filter complexity in verl large scale integration (VLSI) design.
RSG Multiplier Block Synthesis Algorithm
RSG coding: library IEEE; use IEEE.STD_LOGIC_1164.ALL; use IEEE.STD_LOGIC_ARITH.ALL; use IEEE.STD_LOGIC_SIGNED.ALL; entity test is Port ( clk,reset : in std_logic; xin : in std_logic_vector(3 downto 0); yout3 : out std_logic_vector(5 downto 0); yout13: out std_logic_vector(7 downto 0); yout21: out std_logic_vector(8 downto 0); yout37: out std_logic_vector(9 downto 0) ); end test; architecture bhv of test is signal add_in,s1 :std_logic_vector(5 downto 0):=(others=>'0');
Contd… signal add_3,add_21,mul3:std_logic_vector(8 downto 0):=(others=>'0'); signal s:std_logic_vector(8 downto 0):=(others=>'0'); signal sh_5,add_5_0,add_37 ,xin_5 :std_logic_vector(9 downto 0):=(others=>'0'); signal sh_2,add_5:std_logic_vector(6 downto 0):=(others=>'0'); signal sh_22, xin_0 :std_logic_vector(7 downto 0):=(others=>'0'); begin process(clk,reset) begin if reset='1' then s1<=(others=>'0'); s<=(others=>'0'); yout3<=(others=>'0'); add_5<=(others=>'0'); add_5_0<=(others=>'0');
Contd… elsif clk'event and clk='1' then s1<= '0' & xin(3 downto 0) & '0' + ("00" & xin ); yout3<=s1; s<="000" & s1; add_5<=('0' & xin(3 downto 0) & "00") + ("000" & xin);--7 add_5_0<="000" & add_5;--10 end if; end process; process(clk,reset) begin if reset='1' then yout21<=(others=>'0'); xin_5<=(others=>'0'); yout37<=(others=>'0'); sh_22<=(others=>'0'); yout13<=(others=>'0');
Contd.. xin_0<=(others=>'0'); yout13<=(others=>'0'); elsif clk'event and clk='1' then yout21<=( s(5 downto 0) & "000") - s; xin_5<="000000" & xin; yout37<= xin_5(4 downto 0) & "00000" + add_5_0;--10 sh_22<= s1(5 downto 0) & "00";--8 xin_0<="0000" & xin;--8 yout13<=sh_22 + xin_0;--8 end if; end process; end bhv;
MAG algorithm
The minimised adder graph (MAG) algorithm was defined byDempster and Macleod. It generates minimum adder graphs consisting of shifts,adds&subtracts for implementing constant integer multiplication of individual values up to12-bits. MAG is required for recent FPGA architectures to further reduce the area requirement of conventional constant coefficient multipliers. MAG finds numerous graphs for each value that implement the required multiplication with the number of adders/vertices.
Contd.. This
algorithm attempts to synthesise all required multiplications by initially placing coefficients of adder cost 1 into the multiplier block the ‘3’ adder is synthesised first and all other coefficients are built from it. This approach leads to high logic depth blocks and hence pushes up FPGA area requirement. calculating the number of single-bit full adders required to implement each graph and selecting the graph requiring the least. Minimise VLSI area consumed when implementing the multiplication. MAG also generates a table containing the adder cost of each.
Multiplier block with 4 adders, 4 pipeline stages costing 44 slices
MAG Coding: library IEEE; use IEEE.STD_LOGIC_1164.ALL; use IEEE.STD_LOGIC_ARITH.ALL; use IEEE.STD_LOGIC_SIGNED.ALL; entity mag is Port ( clk,reset : in std_logic; xin : in std_logic_vector(3 downto 0); yout3 : out std_logic_vector(5 downto 0); yout21: out std_logic_vector(8 downto 0); yout13: out std_logic_vector(7 downto 0); yout37: out std_logic_vector(9 downto 0) ); end mag; architecture bhv of mag is signal add_in,s1 :std_logic_vector(5 downto 0):=(others=>'0'); signal add_3,add_21,mul3:std_logic_vector(8 downto 0):=(others=>'0');
Contd.. signal s:std_logic_vector(8 downto 0):=(others=>'0'); signal sh_5,add_5_0,add_37 ,xin_5 :std_logic_vector(9 downto 0):=(others=>'0'); signal sh_2,add_5:std_logic_vector(6 downto 0):=(others=>'0'); signal sh_22, xin_0 :std_logic_vector(7 downto 0):=(others=>'0'); begin process(clk,reset)-----3 begin if reset='1' then s1<=(others=>'0'); yout3<=(others=>'0'); elsif clk'event and clk='1' then s1<= '0' & xin(3 downto 0) & '0' + ("00" & xin ); yout3<=s1; end if; end process;
Contd.. process(clk,reset)----13 begin if reset='1' then sh_22<=(others=>'0'); yout13<=(others=>'0'); elsif clk'event and clk='1' then sh_22<= s1(5 downto 0) & "00";--8 xin_0<="0000" & xin;--8 yout13<=sh_22 + xin_0;--8 end if; end process; process(clk,reset)----21 begin if reset='1' then s<=(others=>'0'); yout21 <=(others=>'0'); elsif clk'event and clk='1' then
Contd.. s<="000" & s1; yout21<=( s(5 downto 0) & "000") - s; end if; end process; process(clk,reset)---37 begin if reset='1' then add_5<=(others=>'0'); add_5_0<=(others=>'0'); yout37<=(others=>'0'); elsif clk'event and clk='1' then add_5<=('0' & xin(3 downto 0) & "00") + ("000" & xin);--7 add_5_0<="000" & add_5;--10 xin_5<="000000" & xin; yout37<= xin_5(4 downto 0) & "00000" + add_5_0;--10 end if; end process; end bhv;
Generation & presentation of Results
VHDL filters were generated to compare RSG, RAG and C1, coefficient bitwidth varied from 1 to 20 and filter length fixed at 51taps. Fig 4a demonstrates that RSG requires less FPGA area in general, although fig 4b shows that RSG uses more adders in all cases,fig 4c confirms that FPGA area is correlated with flip-flop usage,fig 4d RSG is lowest as expected and shift-register usage. Fig 4e LUT data is correlated with adder usage with RSG using fractionally more LUTs in general.
Contd.. Device Utilization of RSG synthesis multiplier block --------------------------------------------------------------------Resource Used Available Utilization ---------------------------------------------------------------------IOs 39 86 45.35% Function Generators 27 384 7.03% CLB Slices 25 192 13.02% Dffs or Latches 50 672 7.44%
Overview….
Minimising multiplier block adder cost has been demonstrated not to minimise FPGA hardware for full-parallel pipelined FIR filters. Reducing flip-flop count through minimising multiplier logic depth has instead been shown to yield the lowest area solutions. The results presented establish a clear area advantage of RSG over prior algorithms for typical filter parameters with comparablemaximum clock rates. In addition, the industrial relevance of the transposed FIR with multiplier block architecture & the RSG algorithm has been established through comparison with filters implemented using the DA technique.
Contd..
RSG is shown to consume more LUTs which is the primary cause of reducing RSG area advantage with increasing bitwidth. Maximumclock rates were generally well into the fullparallelrange although for bit-widths 18 and 20, RSG filter performance drops as filter length increases. RSG filters provide lowest area implementations capable of being clocked at full-parallel rates.
Advantages
There is a constant requirement for efficient use of FPGA resources where occupying less hardware for a given system can yield significant cost-related benefits: 1.Reduced power consumption. 2. Area for additional application functionality. 3. Potential to use a smaller, cheaper FPGA. (i)No input sample shift registers are required since each sample is fed to each tap simultaneously. (ii) the pipelined addition chain maps efficiently. (iii) filter latency is reduced. (iv) identical tap coefficient magnitudes can share multiplication hardware because taps receive the input sample simultaneously.
Applications:
Extensive use of FPGAs in computationally intensive applications such as DSP More available logic resources in current FPGAs Broad applications of FIR filters in multimedia and Audio communication. Need to efficient design methods to save area/power. Parameterizing FIR filters in VHDL by using required multiplier block synthesis algorithms. Minimise VLSI area consumed when implementing the synthesis of multiplier blocks.
Conclusion
Presented a multiplierless technique, based on the add and shift method.
And Common sub expression elimination for low area, low power and high speed implementations of FIR filters.
FIR filters can perform at sample rates greatly exceeding those of a state-of-the-art programmable DSP.
The FPGA solution offers complete flexibility in the design and, by reducing chip count, improves the overall reliability of the system.