PowerPC and DSP Comparison

I. Comparison of main performance parameters

Parameter | TigerSHARC ADSP-TS101S | TigerSHARC ADSP-TS201S | PowerPC MPC7455 | PowerPC PPC476FP (IBM 45nm SoI)
Core Clock | 250 MHz | 500 MHz | 1,000 MHz | 1,600 MHz
Peak Floating-pt Performance | 1,500 MFLOPS | 3,000 MFLOPS | 8,000 MFLOPS | 3,000 MFLOPS
Memory Bus Size/Speed | 64-bit/100 MHz | 64-bit/100 MHz | 64-bit/133 MHz | 128-bit/800 MHz
External Link Ports | 4 @ 250 MB/sec | 4 @ 250 MB/sec | None | User defined
I/O Bandwidth (inc. memory) | 1,800 MB/sec | 1,800 MB/sec | 1,064 MB/sec | 6,400 MB/sec
Bandwidth-to-Processing Ratio | 1.20 Bytes/FLOP | 1.20 Bytes/FLOP | 0.13 Bytes/FLOP | 2.1 Bytes/FLOP
1024-pt cFFT Benchmark | 39 µs | 19 µs | 13 µs (est.) | 83.2 µs (double precision)
Approx Cycles for 1024-pt cFFT | 9,750 cycles | 9,750 cycles | 13,000 cycles | -
Predicted 1024-pt cFFTs/chip | 25,641 per sec | 12,821 per sec | 64,941* per sec | -

ADSP TigerSHARC main parameters

Part# | Clock Speed (MHz) | MMACS (Max) | On Chip Memory | External Memory Supported | Operating Temp Range | Package | US Price 1000-4999
ADSP-TS201S | 600 MHz | 4800 | 24 Mbit | Async, SDRAM | - | 25 x 25 BGA | $252.25
ADSP-TS202S | 500 MHz | 4000 | 12 Mbit | Async, SDRAM | - | 25 x 25 BGA | $209.51
ADSP-TS203S | 500 MHz | 4000 | 4 Mbit | Async, SDRAM | - | 25 x 25 BGA | $184.49
ADSP-TS101S | 300 MHz | 2400 | 6 Mbit | Async, SDRAM | -40 to +85 | 19 x 19 BGA, 27 x 27 BGA | $193.88

Parameter | C6701 | C6201 | C6203 | MPC7410* | PPC476
Clock (MHz) | 167 | 200 | 300 | 500 | 1600
Instruction Cycle (ns) | 6 | 5 | 3.33 | 2 | -
Instructions Per Cycle | 1 - 8 | 1 - 8 | 1 - 8 | 1 - 3 | 14
Million Instructions/Sec. | 1333 | 1600 | 2400 | 500 | -
Million Fixed-Point Ops/Sec. | 1333 | 1600 | 2400 | 8000 | -
Million Floating-Point Ops/Sec. | 1000, 2000, 3000 (three values; the source does not show which columns they belong to)

General-Purpose Algorithm Benchmarks on TI's C66x DSP Core at 1.25 GHz

Benchmark | Speed | Clock Cycles
32-bit algorithms:
1k point FFT (Radix 4) | 5.47 µs | 6840
64k point FFT (Radix 4) | 0.58 ms | 696588
FIR filter (per real tap) | 0.2 ns | 0.25
8x8 matrix multiply (complex floating point) | 1.06 µs | 1327
16-bit algorithms:
256 point complex FFT (Radix 4) | 0.6 µs | 752

Floating-point performance of the main DSPs compared: Speed Scores for floating-point packaged processors
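BDTImark2000 (BDTI-certified results). (BDTI benchmarks mainly target DSPs, so there is no data for the MPC7410 or other PowerPC parts.)

Some algorithms, such as the FFT, can take full advantage of the 7410's vector math. A 1024-point floating-point complex FFT completes in 27 µs on the 7410, whereas the C6701 needs 108 µs. Other algorithms, such as the turbo decoders used in wireless applications, are handled more efficiently by a VLIW architecture.

Clearly, the PowerPC G4 (74xx) with its AltiVec core has the higher core clock rate and performance. The PowerPC core clock is about 3.3 times that of the current TigerSHARC (1 GHz versus 300 MHz), although a faster version of the TigerSHARC is due to be released soon.

The AltiVec core executes a single instruction per cycle on 128-bit vectors, each holding four independent 32-bit data elements. This is the well-known SIMD (single instruction, multiple data) structure. Peak throughput is reached on a vector multiply-accumulate (MAC), which completes 8 floating-point operations per cycle (4 multiplies and 4 adds). For the 1 GHz MPC7455 this gives a peak of 8,000 MFLOPS. AltiVec can also execute 8 integer or fixed-point operations per cycle, for a peak integer rate of 8,000 MOPS (million operations per second).

By contrast, the TigerSHARC has two independent 32-bit compute units, a MIMD (multiple instruction, multiple data) structure. Each compute unit can perform one multiply plus one sum and one difference per cycle, so the 300 MHz ADSP-TS101S completes 6 floating-point operations per cycle, a peak of 1,800 MFLOPS.

For 16-bit integer work, the TigerSHARC can use its superscalar architecture to split each of its two independent 32-bit compute units into two separate 16-bit SIMD units. Every operation then acts on two data elements, for a total of 12 operations per cycle. The TigerSHARC also has two further dedicated 16-bit integer engines, which add another 12 operations per cycle, giving 24 integer operations per cycle, or 7,200 MOPS.

II. Evaluating the FFT performance of the IBM 476FPE

The FFT implementation is FFTW 3.3.3 (http://www.fftw.org), a well-optimized library whose performance is widely recognized. The test program is benchFFT 3.1 (http://www.fftw.org). The three chips compared are the IBM PPC476FPE, the PowerPC 7447A, and a four-processor 3.06 GHz Intel Pentium machine. Transform sizes of 512 and 1024 are used as the reference points.

Configurations:

1. PPC476FPE, Ubuntu 9.0.4, GCC-4.3.3.

2. Apple iBook G4. 1.06 GHz PowerPC 7447A, linux 2.6.15, gcc-4.0.2, g++-4.0.2, g77-4.0.2. Has Altivec (4-way single precision SIMD).
   Compilers and flags (unless overridden):
   C: gcc -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450
   C++: g++ -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450
   Fortran: gfortran -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=7450

3. Four-processor 3.06 GHz Intel Pentium 4, 512 KB L2. Linux 2.4.25, gcc-3.3.3, g++-3.3.3, g77-3.3.3, AMD Core Math Library (ACML) 3.0.0, Intel Math Kernel Library Version 8.0.1, Intel Integrated Performance Primitives v5.0. Has SSE (4-way single precision SIMD), SSE2 (2-way double precision SIMD).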
The benchmark uses one processor only.

How mflops is computed

To report FFT performance, we plot the mflops of each FFT, which is a scaled version of the speed, defined by:

mflops = 5 N log2(N) / (time for one FFT in microseconds) for complex transforms, and
mflops = 2.5 N log2(N) / (time for one FFT in microseconds) for real transforms,

where N is the number of data points (the product of the FFT dimensions). This is not an actual flop count; it is simply a convenient scaling, based on the fact that the radix-2 Cooley-Tukey algorithm asymptotically requires 5 N log2(N) floating-point operations. It allows us to compare the performance for many different sizes on the same graph, get a sense of the cache effects, and provide a rough measure of efficiency relative to the clock speed.

Transform types

transform-type is a four-character string consisting of precision (double/single = d/s), type (complex/real = c/r), in-place/out-of-place (= i/o), and forward/backward (= f/b). For example, transform-type=dcif denotes a double-precision in-
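
The mflops scaling and the four-character transform-type string described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of benchFFT: the function names `fft_mflops` and `parse_transform_type`, and the dictionary keys `placement` and `direction`, are hypothetical names chosen here.

```python
import math


def fft_mflops(n, time_us, real=False):
    """Scaled FFT speed: 5 N log2(N) / t for complex data, 2.5 N log2(N) / t for real.

    n is the number of data points and time_us the time for one FFT in
    microseconds. The result is a convenient scaling, not a true flop count.
    """
    if n < 2 or time_us <= 0:
        raise ValueError("need n >= 2 and a positive time in microseconds")
    scale = 2.5 if real else 5.0
    return scale * n * math.log2(n) / time_us


# One (field name, letter -> meaning) pair per character of the string.
# The field names are hypothetical labels, not benchFFT identifiers.
_FIELDS = (
    ("precision", {"d": "double", "s": "single"}),
    ("type", {"c": "complex", "r": "real"}),
    ("placement", {"i": "in-place", "o": "out-of-place"}),
    ("direction", {"f": "forward", "b": "backward"}),
)


def parse_transform_type(code):
    """Decode a four-character transform-type string such as 'dcif'."""
    if len(code) != 4:
        raise ValueError("transform-type must have exactly four characters")
    decoded = {}
    for letter, (name, choices) in zip(code, _FIELDS):
        if letter not in choices:
            raise ValueError(f"unexpected {name} letter: {letter!r}")
        decoded[name] = choices[letter]
    return decoded


if __name__ == "__main__":
    # 1024-point complex FFT times quoted earlier: 27 us (7410), 108 us (C6701).
    print(fft_mflops(1024, 27.0))
    print(fft_mflops(1024, 108.0))
    print(parse_transform_type("dcif"))
```

Applied to the 1024-point complex FFT times quoted earlier, the scaling gives roughly 1,896 mflops for 27 µs and roughly 474 mflops for 108 µs, which makes the fourfold gap between the two parts easy to read off a single axis.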