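2.2 CORDIC Algorithm
The CORDIC algorithm in the circular coordinate system is as follows [19].

$$x(i+1) = x(i) - \sigma_i 2^{-i} y(i) \qquad (3)$$

$$y(i+1) = y(i) + \sigma_i 2^{-i} x(i) \qquad (4)$$

$$z(i+1) = z(i) - \sigma_i \alpha(i) \qquad (5)$$

$$\alpha(i) = \tan^{-1} 2^{-i} \qquad (6)$$

where $\sigma_i = \mathrm{sign}(z(i))$ with $z(i) \to 0$ in the rotation mode, and $\sigma_i = -\mathrm{sign}(x(i)) \cdot \mathrm{sign}(y(i))$ with $y(i) \to 0$ in the vectoring mode. The scale factor $k(i)$ is equal to $\sqrt{1 + \sigma_i^2 2^{-2i}}$. After $n$ micro-rotations, the product of the scale factors is given by

$$K_c = \prod_{i=0}^{n-1} k(i) = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}} \qquad (7)$$

Notice that CORDIC in the circular coordinate system with rotation mode can be written as

$$\begin{bmatrix} x_n \\ y_n \end{bmatrix} = K_c \begin{bmatrix} \cos z_0 & -\sin z_0 \\ \sin z_0 & \cos z_0 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \qquad (8)$$

where $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ and $\begin{bmatrix} x_n \\ y_n \end{bmatrix}$ are the input vector and the output vector, respectively, $z_0$ is the rotation angle, and $K_c$ is the scale factor. In [1], the circular rotation computation of CORDIC was used for complex multiplication with $e^{-j\theta}$, which is given by

$$\begin{bmatrix} \mathrm{Re}[X'] \\ \mathrm{Im}[X'] \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \mathrm{Re}[X] \\ \mathrm{Im}[X] \end{bmatrix} \qquad (9)$$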
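As an illustration of the rotation-mode recurrence in equations (3) to (6) and the scale factor in equation (7), the following is a minimal floating-point sketch in Python. It is not the fixed-point pipelined unit described in this paper; the function names `cordic_rotate` and `scale_factor`, the default of 12 iterations, and the convention that sign(0) is taken as +1 are assumptions made for illustration only.

```python
import math


def cordic_rotate(x0, y0, z0, n=12):
    """Rotation-mode CORDIC in circular coordinates (floating-point sketch).

    Returns (x_n, y_n, z_n). The outputs x_n and y_n are still scaled by
    the product of the per-stage scale factors, as in equation (8).
    """
    x, y, z = x0, y0, z0
    for i in range(n):
        sigma = 1.0 if z >= 0 else -1.0      # sigma_i = sign(z(i)); sign(0) assumed +1
        shift = 2.0 ** -i                    # 2^-i, a right shift in hardware
        alpha = math.atan(shift)             # alpha(i) = arctan(2^-i), equation (6)
        # Equations (3) and (4) use the old x and y on the right-hand side.
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * alpha                   # equation (5)
    return x, y, z


def scale_factor(n=12):
    """Product of sqrt(1 + 2^-2i) for i = 0 .. n-1, as in equation (7)."""
    k = 1.0
    for i in range(n):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k


if __name__ == "__main__":
    K = scale_factor(12)
    # Multiplying X = 0.3 + 0.4j by exp(-j*theta) corresponds to z0 = -theta.
    xr, xi, _ = cordic_rotate(0.3, 0.4, -0.7, 12)
    print(K, xr / K, xi / K)
```

In this sketch, dividing the outputs by the scale factor recovers the rotated vector; multiplication by $e^{-j\theta}$ as in equation (9) is obtained by passing $z_0 = -\theta$.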

3 Reusable IP 128-point CORDIC-Based Split-Radix FFT Core
Figure 1 shows the proposed 128-point CORDIC-based split-radix FFT processor, which can be used as a reusable IP core for FFTs whose sizes are multiples of 128 points. Notice that the modified split-radix 2/8 FFT butterfly processor and the ROM-free twiddle factor generator are used. In addition, an internal (128×32-bit) SRAM is used to store the input and output data for hardware efficiency, through the use of the in-place computation algorithm [1].

3.1 CORDIC-Based Split-Radix 2/8 FFT Processor
For the butterfly computation of the proposed CORDIC-based split-radix 2/8 FFT processor, sixteen complex additions, two constant multiplications (CM), and four CORDIC operations are needed, as shown in Figure 2. The CORDIC algorithm has been widely used in various DSP applications because of its hardware simplicity. According to equation (9), the twiddle factor multiplication of FFT can be considered a 2-D vector rotation in the circular coordinate system. Thus, CORDIC in the circular coordinate system with rotation mode is adopted to compute the complex multiplications of FFT.

The pipelined CORDIC arithmetic unit can be obtained by decomposing the CORDIC algorithm into a sequence of operational stages. In [20], we derived the error analysis of fixed-point CORDIC arithmetic, based on which the number of CORDIC stages can be determined effectively. For example, the number of CORDIC stages is 12 if the overall relative error of 16-bit CORDIC arithmetic is required to be less than $10^{-3}$. Here, the pre-calculated scaling factor is $K_c \approx 1.64676$, and the Booth binary recoded format leads to 1.101001. The main concern for the design of the CORDIC arithmetic unit is throughput rather than latency. Table 1 shows a comparison between the conventional complex multiplier using 4 real Booth multipliers and the proposed CORDIC arithmetic unit in terms of gate counts. In addition, the power consumption can be reduced significantly by using the proposed CORDIC arithmetic unit; it has been reduced by 30% according to the report of PrimePower® distributed by Synopsys.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS, Tze-Yun Sung, Hsi-Chin Hsin, Lu-Ting Ko, ISSN: 1109-2734, p. 466, Issue 6, Volume 8, June 2009

As the twiddle factors $W_8^1$ and $W_8^3$ are equal to $\frac{\sqrt{2}}{2}(1 - j)$ and $-\frac{\sqrt{2}}{2}(1 + j)$, respectively, a complex number, say $(a + jb)$, times $W_8^1$ or $W_8^3$ can be written as

$$(a + jb) \times \left(\frac{\sqrt{2}}{2}(1 - j)\right) = \frac{\sqrt{2}}{2}\big((a + b) + j(b - a)\big) \qquad (10)$$

$$(a + jb) \times \left(-\frac{\sqrt{2}}{2}(1 + j)\right) = -\frac{\sqrt{2}}{2}\big((a - b) + j(a + b)\big) \qquad (11)$$

where $\frac{\sqrt{2}}{2}$ can be represented as 0101010.1 using the Booth binary recoded form (BBRF). Thus, the CM unit can be implemented by using simple adders and shifters only. Figure 3 shows the pipelined CM architecture, which uses three subtractions/additions and therefore improves the computation speed significantly.

Based on the above-mentioned CORDIC arithmetic unit and CM unit, the computational circuit and hardware architecture of the CORDIC-based split-radix 2/8 FFT butterfly computation are shown in Figure 4. As one can see, the pipelined CORDIC arithmetic unit aims at increasing the throughput of complex multiplications.
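The shift-and-add idea behind the CM unit in equations (10) and (11) can be sketched in Python as follows. The signed-digit expansion used here, $1 - 2^{-2} - 2^{-4} + 2^{-6} + 2^{-8} = 0.70703125$, is our own illustrative approximation of $\sqrt{2}/2$ and is not necessarily the recoding used in the paper's Figure 3; the function names are hypothetical, and the inputs are assumed to be integers in some fixed-point format.

```python
def times_sqrt2_over_2(v):
    """Approximate v * sqrt(2)/2 for an integer v using shifts and adds only.

    Assumed expansion: 1 - 2^-2 - 2^-4 + 2^-6 + 2^-8 = 0.70703125.
    Python's >> on negative integers is an arithmetic (flooring) shift.
    """
    return v - (v >> 2) - (v >> 4) + (v >> 6) + (v >> 8)


def times_w8_1(a, b):
    """(a + jb) * W_8^1, following equation (10): sqrt(2)/2 * ((a+b) + j(b-a))."""
    return times_sqrt2_over_2(a + b), times_sqrt2_over_2(b - a)


def times_w8_3(a, b):
    """(a + jb) * W_8^3, following equation (11): -sqrt(2)/2 * ((a-b) + j(a+b))."""
    return times_sqrt2_over_2(b - a), -times_sqrt2_over_2(a + b)


if __name__ == "__main__":
    # Hypothetical 16-bit sample values, chosen only for demonstration.
    print(times_w8_1(12000, -5000))
    print(times_w8_3(12000, -5000))
```

Only one addition or subtraction of the inputs is needed before the constant scaling, so no general-purpose multiplier is involved; the accuracy depends on how many signed digits are kept in the expansion.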

3.2 ROM-Free Twiddle Factor Generator
In the conventional FFT processor, a large ROM space is needed to store all the twiddle factors. To reduce the chip area, a twiddle factor generator is thus proposed. Figure 5 shows the ROM-free twiddle factor generator using simple adders and shifters for the 128-point FFT. Here, the 16-bit accumulator generates the value $2\pi n$ for each index $n$ ($n = 2^{\log_2 N - 3} - 1$), the 16-bit shifter divides $2\pi n$ by $N$, and the 16-bit shifter/adder produces the twiddle factors $\theta_N^{1n}$, $\theta_N^{3n}$, $\theta_N^{5n}$ and $\theta_N^{7n}$.

By using the twiddle factor generator, the chip area and power consumption can be reduced significantly at the cost of an additional logic circuit. Table 2 shows the gate counts of the full ROM storing all the twiddle factors, the CORDIC twiddle factor generator [1] and the ROM-free twiddle factor generator.
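The accumulate, shift, and shift/add steps of the twiddle factor generator can be sketched in Python as below. This is a behavioural sketch under several assumptions that are not taken from the paper: angles are held as integers with a hypothetical `FRAC_BITS` fractional bits, the index is assumed to run over $n = 0, \ldots, N/8 - 1$, the multiples 3, 5 and 7 are formed as $(\theta \ll 1) + \theta$, $(\theta \ll 2) + \theta$ and $(\theta \ll 3) - \theta$, and the 16-bit register widths of Figure 5 are not modelled.

```python
import math

FRAC_BITS = 12                                      # hypothetical fraction width
TWO_PI_FX = round(2 * math.pi * (1 << FRAC_BITS))   # 2*pi in fixed point


def twiddle_angles(N=128):
    """Return a list of (theta, 3*theta, 5*theta, 7*theta) in fixed point,
    where theta approximates 2*pi*n/N for n = 0 .. N/8 - 1 (assumed range)."""
    log2_n = N.bit_length() - 1        # N is assumed to be a power of two
    acc = 0                            # accumulator holding 2*pi*n
    angles = []
    for _ in range(N // 8):
        theta = acc >> log2_n          # divide 2*pi*n by N with a right shift
        angles.append((
            theta,
            (theta << 1) + theta,      # 3 * theta
            (theta << 2) + theta,      # 5 * theta
            (theta << 3) - theta,      # 7 * theta
        ))
        acc += TWO_PI_FX               # next index: add 2*pi
    return angles


if __name__ == "__main__":
    for n, row in enumerate(twiddle_angles(128)[:4]):
        print(n, [v / (1 << FRAC_BITS) for v in row])
```

The resulting angles would then be fed to the rotation-mode CORDIC unit as rotation angles, so that no sine or cosine table has to be stored.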

4 Hardware Implementations of FFT Processors by Using IP 128-Point FFT Core
Figure 6 depicts the 128/256/512/1024/2048/4096/8192-point FFT processors; two memory banks (4096/2048/1024/512/256/0×32-bit and 8192/4096/2048/1024/512/256/128×32-bit) are allocated for increased efficiency by using the in-place computation algorithm [1]. The hardware architectures of the 128/256/512/1024/2048/4096/8192-point FFT processors are shown in Figure 7.

The platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. In this platform, the 8051 microcontroller reads data from the PC via a DMA channel and writes the result back to the PC over the USB 2.0 bus; the Xilinx XC2V6000 FPGA chip [21] implements the FFT processors. In addition, the reusable IP CORDIC-based FFT core has been implemented in Matlab® for functional simulations. The hardware code written in Verilog® is running on a workstation with the ModelSim®

5 Performance Analysis of the Proposed FFT Architecture and Programmable FFT Processor
The proposed FFT processors used to compute 128/256/512/1024/2048/4096/8192-point FFT are composed mainly of the 128-point CORDIC-based split-radix 2/8 FFT core; the computation complexity using a single 128-point FFT core is $O(N/6)$ for an N-point FFT. Compared with the CORDIC-based radix-2, radix-4, radix-8 and split-radix 2/4 FFT architectures, the proposed FFT architecture is superior, as shown in Table 4. The plot and log-log plot of the CORDIC computations versus the number of FFT points are shown in Figures 9 and 10, respectively. As one can see, the proposed FFT architecture is able to improve the power consumption and computation speed significantly.

6 Conclusion
This paper presents low-power and high-speed FFT processors based on CORDIC and split-radix techniques for OFDM systems. The architectures are mainly based on a reusable IP 128-point CORDIC-based split-radix FFT core. The pipelined CORDIC arithmetic unit is used to compute the complex multiplications involved in FFT, and the required twiddle factors are obtained by using the proposed ROM-free twiddle factor generator rather than storing them in a large ROM space.

CORDIC-based 128/256/512/1024/2048/4096/8192-point FFT processors have been implemented in 0.18 μm CMOS, which take 395 μs, 176.8 μs, 77.9 μs, 33.6 μs, 14 μs, 5.5 μs and 1.88 μs to compute 8192-point, 4096-point, 2048-point, 1024-point, 512-point, 256-point and 128-point FFT, respectively.

The CORDIC-based FFT processors are designed by using the portable and reusable Verilog®. The 128-point FFT core is a reusable IP, which can be implemented in various processes and combined with an efficient use of hardware resources for the trade-offs of performance, area, and power consumption.