Title: TMS320C54x DSP processor
1TMS320C54x DSP processor
Shahab adin Rahmanian
2Outline
- Introduction
- Architecture
- Applications
- Features
- Instruction Set and addressing
- FIR Filtering
- Accelerating Polynomial Evaluation
- Numerical Issues
- Write code in C
- Conclusion
3Introduction
4TMS320C54x
- A fixed-point digital signal processor (DSP) in the TMS320 family
- Low-power DSP: 0.54 mW/MIPS
- Acceleration for FIR and LMS filtering, codebook search, polynomial evaluation, Viterbi decoding, and the fast Fourier transform
5Some Typical Applications
- General-Purpose
  - Adaptive filtering
  - Digital filtering
  - Fast Fourier transforms
- Control
  - Disk drive control
  - Laser printer control
  - Robotics control
- Military
  - Missile guidance
  - Radar processing
  - Secure communication
- Telecommunications
  - 1200- to 19200-bps modems
  - Adaptive equalizers
  - Cellular telephones
  - Echo cancellation
  - Video conferencing
6Software Applications
- Circular Buffers
- Single-Instruction Repeat (RPT) Loops
- Extended-Precision Arithmetic
- Addition and Subtraction
- Multiplication
- Division
- Square Root
- Floating-Point Arithmetic
- Application-Oriented Operations
- Symmetric FIR Filters
- Adaptive Filtering
- Viterbi Algorithm for Channel Decoding
- Fast Fourier Transforms
7Some key features
- CPU
  - Advanced multibus architecture with three separate 16-bit data buses and one program bus
  - 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators
  - 17-bit × 17-bit parallel multiplier coupled to a 40-bit dedicated adder for non-pipelined single-cycle multiply/accumulate (MAC) operation
- Memory
  - 192K words × 16-bit maximum addressable memory space (64K words program, 64K words data, and 64K words I/O)
  - 28K words × 16-bit single-access on-chip ROM, with 8K words configurable as program or data memory (C541 only)
8Some key features
- On-chip peripherals
  - On-chip phase-locked loop (PLL) clock generator with internal oscillator or external clock source
  - Two full-duplex serial ports supporting 8- and 16-bit transfers (C541 only)
  - Time-division multiplexed (TDM) serial port (C542/C543 only)
  - One 16-bit timer
- Speed: 25/20-ns execution time for a single-cycle fixed-point instruction (40 MIPS/50 MIPS) with a 5-V power supply
9C54x Addressing Modes
- Immediate
  - Operand is part of the instruction
  - Example: ADD #0FFh
- Absolute
  - Address of operand is part of the instruction
  - Example: LD *(LABEL), A
- Register
  - Operand is specified in a register
  - Example: READA DATA (data read from address in accumulator A)
10C54x Addressing Modes
- Direct
  - Address of operand is part of the instruction (added to implied memory page)
  - Example: ADD 010h,A
- Indirect
  - Address of operand is stored in a register
  - Example: ADD *AR1
  - Offset addressing: ADD *AR1(10)
  - Register offset (AR1 + AR0): ADD *AR1+0
  - Autoincrement/decrement: ADD *AR1+
  - Bit-reversed addressing: ADD *AR1+0B
  - Circular addressing: ADD *AR1+%
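The circular and bit-reversed modes listed above can be modeled in plain C as index arithmetic. This is only a sketch of the effect on an index, not of the hardware mechanism; the function names `circ_advance` and `bit_reverse` are hypothetical, and the real C54x places alignment constraints on circular buffers that are not modeled here.

```c
#include <stdint.h>

/* Advance an index by `step` inside a circular buffer of `bk` elements.
   Loosely models a circular post-modify with BK holding the buffer size. */
static unsigned circ_advance(unsigned index, unsigned step, unsigned bk)
{
    return (index + step) % bk;
}

/* Reverse the low `bits` bits of an index: the reordering an FFT needs,
   which bit-reversed addressing provides without extra instructions. */
static unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1u);  /* move the lowest bit to the top */
        index >>= 1;
    }
    return r;
}
```

For an 8-element buffer, advancing index 6 by 3 wraps to index 1, and the 3-bit reversal of index 1 (001) is 4 (100).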
11C54x Instruction Set by Category
- Logical: AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR
- Arithmetic: ADD, MAC, MAS, MPY, NEG, SUB, ZERO
- Program control: B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC
- Application specific: ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS
- Data management: LD, MAR, MV(D,K,M,P), ST
- Notes: CMPL = complement; CMPM = compare memory; MAR = modify address register; MAS = multiply and subtract
12Block FIR Filtering
- y[n] = h[0] x[n] + h[1] x[n-1] + ... + h[N-1] x[n-(N-1)]
- h stored as linear array of N elements (in prog. mem.)
- x stored as circular array of N elements (in data mem.)
; Addresses: ar4 = N samples of x, ar5 = h, ar6 = input buffer, ar7 = output buffer
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld    #firDP,dp           ; initialize data page pointer
         stm   #frameSize-1,brc    ; compute 256 outputs
         rptbd firloop-1
         stm   #N,bk               ; FIR circular buffer size
         ld    *ar6+,a             ; load input value to accumulator a
         stl   a,*ar4+%            ; replace oldest sample with newest
         rptz  a,#(N-1)            ; zero accumulator a, do N taps
         mac   *ar4+0%,*ar5+0%,a   ; one tap, accumulate in a
         sth   a,*ar7+             ; store y[n]
firloop: ret
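The same block FIR structure can be sketched in portable C, which may help when reading the assembly: a circular delay line replaces the oldest sample, and a wide accumulator sums N products before the upper part is stored. The names `fir_state`, `fir_step`, and `NTAPS` are hypothetical, the data is assumed to be Q15, and a 64-bit integer stands in for the 40-bit accumulator.

```c
#include <stdint.h>

#define NTAPS 4  /* hypothetical filter length */

typedef struct {
    int16_t x[NTAPS];  /* circular delay line of Q15 samples */
    unsigned newest;   /* index of the most recent sample */
} fir_state;

/* Insert one Q15 sample and return one Q15 output y[n]. */
static int16_t fir_step(fir_state *s, const int16_t h[NTAPS], int16_t in)
{
    s->newest = (s->newest + 1) % NTAPS;  /* slot holding the oldest sample */
    s->x[s->newest] = in;                 /* replace oldest with newest */

    int64_t acc = 0;                      /* wide accumulator */
    unsigned idx = s->newest;
    for (unsigned k = 0; k < NTAPS; k++) {
        acc += (int32_t)h[k] * s->x[idx];   /* h[k] * x[n-k] */
        idx = (idx + NTAPS - 1) % NTAPS;    /* step back one sample, wrapping */
    }
    /* Q30 -> Q15 by truncation; assumes arithmetic right shift of negatives. */
    return (int16_t)(acc >> 15);
}
```

With four taps of 0.25 (8192 in Q15) and a constant input of 4000, the output climbs 1000, 2000, 3000, 4000 as the delay line fills.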
13Accelerating Symmetric FIR Filtering
- Coefficients in linear phase filters are either symmetric or anti-symmetric
- Symmetric coefficients using 2 mults, 3 adds
  - y[n] = h[0] x[n] + h[1] x[n-1] + h[1] x[n-2] + h[0] x[n-3]
  - y[n] = h[0] (x[n] + x[n-3]) + h[1] (x[n-1] + x[n-2])
- Accelerated by FIRS (FIR Symmetric) instruction
  - x in two circular buffers
  - h in program memory
14Accelerating Symmetric FIR Filtering
; Addresses: ar6 = input buffer, ar7 = output buffer
;   ar4 = array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
;   ar5 = array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld    #firDP,dp                  ; initialize data page pointer
         stm   #frameSize-1,brc           ; compute 256 outputs
         rptbd firloop-1
         stm   #N/2,bk                    ; FIR circular buffer size
         ld    *ar6+,b                    ; load input value to accumulator b
         mvdd  *ar4,*ar5+0%               ; move old x[n-N/2] to new x[n-N/2-1]
         stl   b,*ar4%                    ; replace oldest sample with newest
         add   *ar4+0%,*ar5+0%,a          ; a = x[n] + x[n-N/2-1]
         rptz  b,#(N/2-1)                 ; zero accumulator b, do N/2-1 taps
         firs  *ar4+0%,*ar5+0%,coeffs     ; b += a * h[i], do next a
         mar   *+ar4(2)%                  ; to load the next newest sample
         mar   *ar5+%                     ; position for x[n-N/2] sample
         sth   b,*ar7+
firloop: ret
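The saving from coefficient symmetry can be checked in a few lines of C. The sketch below compares a direct N-multiply sum with a folded N/2-multiply sum that pre-adds the paired samples, which is the arithmetic idea behind FIRS. The function names are hypothetical, `x[k]` is taken to hold x[n-k], and N is assumed even with h[k] equal to h[N-1-k].

```c
#include <stdint.h>

/* Direct form: n multiplies. x[k] holds x[n-k]. */
static int64_t fir_direct(const int16_t *h, const int16_t *x, unsigned n)
{
    int64_t acc = 0;
    for (unsigned k = 0; k < n; k++)
        acc += (int32_t)h[k] * x[k];
    return acc;
}

/* Folded form for symmetric h (h[k] == h[n-1-k], n even): n/2 multiplies. */
static int64_t fir_symmetric(const int16_t *h, const int16_t *x, unsigned n)
{
    int64_t acc = 0;
    for (unsigned k = 0; k < n / 2; k++) {
        int32_t pair = (int32_t)x[k] + x[n - 1 - k];  /* pre-add the pair */
        acc += (int64_t)h[k] * pair;                  /* one multiply per pair */
    }
    return acc;
}
```

For h = {1, 2, 2, 1} and x = {10, 20, 30, 40}, both forms give 150, with the folded form using two multiplies instead of four.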
15Architecture - FIRS
16Accelerating Polynomial Evaluation
- Function approximation and spline interpolation
- Fast polynomial evaluation (N coefficients)
  - y(x) = c0 + c1 x + c2 x^2 + c3 x^3 (expanded form)
  - y(x) = c0 + x (c1 + x (c2 + x c3)) (Horner's form)
- POLY reduces 2N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
;   1. a = c3 + x * 0 = c3,   b = c2
;   2. a = c2 + x * c3,       b = c1
         ld    *ar2+,16,b    ; b = c3 << 16
         ld    *ar3,t        ; t = x (ar3 contains addr of x)
         rptz  a,#3          ; a = 0, repeat next inst. 4 times
         poly  *ar2+         ; a = b + x*a || b = c(i-1) << 16
         sth   a,*ar4        ; store result (ar4 is addr of y)
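Horner's form maps directly onto a short loop, one multiply and one add per coefficient. The sketch below uses floating point for readability rather than the fixed-point arithmetic of the C54x; the function name `horner` is hypothetical, and the coefficient array is assumed to be ordered highest degree first, matching the c3 c2 c1 c0 layout.

```c
/* Evaluate a polynomial whose coefficients are stored highest degree first. */
static double horner(const double *c, unsigned ncoef, double x)
{
    double a = 0.0;
    for (unsigned i = 0; i < ncoef; i++)
        a = c[i] + x * a;  /* one step per coefficient: a = c + x*a */
    return a;
}
```

With c = {4, 3, 2, 1} and x = 2, the loop produces 4, 11, 24, 49, which equals 1 + 2·2 + 3·4 + 4·8.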
17Integer Multiplication
- Integer multiplication yields products larger than the inputs, as a single-digit decimal example shows: 9 × 9 = 81
- Does the user store the lower (1) or upper (8) result?
- Both must be kept, costing additional resources (two cycles, words of code, and RAM locations) to complete the store.
- Worse, how can the double-sized result be used recursively as an input in later calculations, given that the multiplier inputs are single-width?
18Fractional Multiplication
- Multiplication of fractions yields products that never exceed the range of a fraction, as a single-digit decimal example shows: .9 × .9 = .81
- Don't we still have a double-sized result to store?
- In this case, we can store just the upper result (.8)
- This allows storage of the result with fewer resources
- Results may be used recursively
- Has accuracy been lost by dropping the lower accumulator value?
19Accuracy vs. Precision
- Often the programmer wants to retain the fullest accuracy of a calculation, so dropping the 16 LSBs of the result in the previous example seems a bad choice.
- Consider the inputs, though: how much accuracy do they offer?
- The product offers double precision, but its accuracy is limited by the single-width inputs.
- Thus, storing a single-precision result is not only an efficient solution but also represents the limit of the accuracy of the result.
- The accumulator is double-sized for two reasons:
  - To allow for integer operations, which may require the LSBs for the result.
  - So that sum-of-product operations accumulate noise at the 32nd bit rather than the 16th.
20Redundant Sign Bit
- Multiplication of two signed numbers yields a product with two sign bits
- The extra sign bit causes problems if stored to memory as the result
  - Wastes space
  - Creates an off-size Q format
- Solution: the fractional mode bit. When FRCT (a mode bit in ST1) is set, the multiplier output is left-shifted by one
- For the 16-bit C54x: Q15 × Q15 = Q15
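The redundant sign bit is easy to see in C. A Q15 × Q15 product is a Q30 value in 32 bits, so its top two bits are both sign bits; shifting left by one gives Q31, whose upper 16 bits are a Q15 result. The sketch below models that one-bit shift in software. The function name `q15_mul` is hypothetical, and the single corner case of (-1) × (-1), which does not fit in Q15, is deliberately left unhandled.

```c
#include <stdint.h>

/* Multiply two Q15 fractions and return a Q15 fraction (truncated). */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t raw = (int32_t)a * b;    /* Q30: two sign bits */
    int64_t q31 = (int64_t)raw * 2;  /* left shift by one removes the extra sign bit */
    /* Keep the upper 16 bits; assumes arithmetic right shift of negatives. */
    return (int16_t)(q31 >> 16);
}
```

For example, 0.5 × 0.5 (16384 × 16384) returns 8192, which is 0.25 in Q15, and 0.9 × 0.9 (29491 × 29491) returns 26541, about 0.81.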
21Accumulation
- With fractions, we were able to guarantee that no multiplicative overflow could occur, i.e., F × F < F.
- For addition, this rule does not apply, i.e., F + F > F is possible.
- Therefore, we need additional measures to manage the possibility of overflow during accumulation. Two general methods apply:
  - Guard bits: the C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
  - Non-gain systems: these offer additional criteria that allow a simple solution for unlimited-length summations.
22Guard Bits and saturation
- Guard bits: the C54x offers an 8-bit extension above the high accumulator to allow valid representation of the result of up to 256 summations.
- Saturation (SAT)
  - The SAT instruction saturates a value exceeding the 32-bit range in the selected accumulator
  - Examples: SAT A, SAT B
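A small C model shows why 8 guard bits cover 256 summations and what saturation then does. A 64-bit integer stands in for the 40-bit accumulator here, and the function names `accumulate` and `sat32` are hypothetical; this is a sketch of the arithmetic, not of the SAT instruction's exact flag behavior.

```c
#include <stdint.h>

/* Clamp a wide accumulator value to the signed 32-bit range. */
static int32_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}

/* Sum n copies of a 32-bit term in an accumulator with headroom. */
static int64_t accumulate(int32_t term, unsigned n)
{
    int64_t acc = 0;  /* stands in for the 40-bit accumulator */
    for (unsigned i = 0; i < n; i++)
        acc += term;
    return acc;
}
```

Summing the largest positive 32-bit value 256 times stays below 2^39, so it is still representable in 40 bits; saturating that sum afterwards yields the largest 32-bit value.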
23Non-gain Systems
- Many systems can be modeled to have no DC gain
  - Filters with low Q.
  - Any system scaled by its maximum gain value.
- Input values from A/D converters are automatically fractions if the limits of the A/D are presumed to be ±1
- Coefficient values can similarly be bounded by making the largest value the scaling factor for all other values.
- For these systems, it is known that the final value of the process is less than or equal to the input values.
- The accumulator can therefore be allowed to overflow temporarily, since the final result is known to be bounded by ±1.
- Allows maximum usage of the selected A/D and D/A converters
- D/A bits for gain are more expensive than using analog components
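Temporary overflow is harmless in two's complement wraparound arithmetic as long as the final sum fits, which the following C sketch illustrates with a 16-bit accumulator and no guard bits or saturation. The function name `wrap_sum_q15` is hypothetical, and the terms are assumed to be Q15 values.

```c
#include <stdint.h>

/* Sum Q15 terms with 16-bit wraparound (no saturation, no guard bits). */
static int16_t wrap_sum_q15(const int16_t *terms, unsigned n)
{
    uint16_t acc = 0;  /* unsigned arithmetic wraps modulo 2^16 */
    for (unsigned i = 0; i < n; i++)
        acc = (uint16_t)(acc + (uint16_t)terms[i]);
    /* Reinterpret as signed; exact whenever the true sum fits in 16 bits. */
    return (int16_t)acc;
}
```

Adding 0.75 + 0.75 - 0.75 (24576, 24576, -24576) overflows after the second term, yet the final value is 24576, the correct 0.75. The same wraparound sum would be wrong if saturation were applied at the intermediate step.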
24Division
- The C54x does not have a single-cycle 16-bit divide instruction
  - Divide is a rare function in DSP
  - Division hardware is expensive
- The C54x does have a single-cycle 1-bit divide instruction: conditional subtract, or SUBC
  - Preceded by RPT #15, a 16-bit divide is performed
  - Is much faster than without SUBC
- The SUBC process operates only on unsigned operands, so software must:
  - Compare the signs of the input operands
  - If they are alike, plan a positive quotient
  - If they differ, plan to negate (NEG) the quotient
  - Strip the signs of the inputs
  - Perform the unsigned division
  - Attach the proper sign based on the comparison of the inputs
25Division Routine
- B = num × den (the sign of the product tells the sign of the quotient)
- Strip sign of numerator
- Strip sign of denominator
- 16 iterations of the 1-bit divide
- If the result needs to be negative:
  - Invert the sign
  - Store the negative result
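The routine outlined above can be modeled in C as 16 conditional-subtract steps wrapped in sign handling. This is a sketch of the algorithm rather than a cycle-accurate model of SUBC; the function names are hypothetical, the denominator is assumed nonzero, and the overflowing case of -32768 divided by -1 is not handled.

```c
#include <stdint.h>

/* Unsigned 16-bit divide by 16 conditional-subtract steps.
   The quotient ends up in the low 16 bits, the remainder above them. */
static uint16_t udiv16_subc(uint16_t num, uint16_t den)
{
    uint32_t acc = num;
    uint32_t d = (uint32_t)den << 15;    /* divisor aligned for the first step */
    for (int i = 0; i < 16; i++) {
        if (acc >= d)
            acc = ((acc - d) << 1) + 1;  /* subtract succeeded: shift in a 1 */
        else
            acc <<= 1;                   /* subtract failed: shift in a 0 */
    }
    return (uint16_t)(acc & 0xFFFFu);
}

/* Signed divide: compare signs, strip them, divide, reattach the sign. */
static int16_t sdiv16(int16_t num, int16_t den)
{
    int negative = (num < 0) != (den < 0);
    uint16_t n = (uint16_t)(num < 0 ? -(int32_t)num : num);
    uint16_t d = (uint16_t)(den < 0 ? -(int32_t)den : den);
    int32_t q = udiv16_subc(n, d);
    return (int16_t)(negative ? -q : q);
}
```

Dividing 100 by 7 leaves 14 in the low 16 bits of the accumulator and the remainder 2 in the bits above; `sdiv16(100, -7)` then returns -14.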
26Rounding
- The result of a multiplication can be rounded for MPY, MAC, and MAS operations. This is specified by appending an R suffix to the instruction.
- Example: MAC with rounding is MACR. Rounding consists of adding 2^15 to the result and then clearing the low accumulator.
- In a long sum-of-products, only the last MAC operation should specify rounding
- Rounding can also be achieved with a load operation
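The rounding step described above, adding 2^15 and then clearing the low 16 bits, can be written directly in C. The sketch returns a 64-bit value so that a result just above the 32-bit range is still representable, in the spirit of the guard bits; the function name `round_acc` is hypothetical.

```c
#include <stdint.h>

/* Round an accumulator value to its upper 16 bits:
   add 2^15, then clear the low 16 bits. */
static int64_t round_acc(int32_t acc)
{
    int64_t r = (int64_t)acc + (1 << 15);  /* add half of the bit being kept */
    r &= ~(int64_t)0xFFFF;                 /* clear the low accumulator */
    return r;
}
```

A low half of 0x8000 or more rounds the high half up (0x00018000 becomes 0x00020000), while anything smaller rounds down (0x00017FFF becomes 0x00010000).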
27Sign Extension (SXM)
28Write code in C
- Inline Assembly
  - Allows direct access to assembly language from C
  - Useful for operating on components not used by C
  - Note: the first column after the leading quote is the label field
- Long operations should be written in ASM and called from C
  - The main C file retains portability
  - Yields more easily maintained structures
  - Eliminates the risk of interfering with registers in use by C
29Accessing MMRs from C
- Using pointers to access Memory-Mapped Registers
  - Create a pointer and set its value to the assigned memory address
  - Read and write to the register as any other pointer
      volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0022;
      *SPC_REG = 0xC8;
- Accessing I/O Ports from C
  - 1. Create the port
  - 2. Access the port
      ioport unsigned port8000;
      x = port8000;
      port8000 = y;
30Summary and Conclusion
- C54x is a conventional digital signal processor
  - Separate data/program busses (3 reads and 1 write per cycle)
  - Extended precision accumulators
  - Single-cycle multiply-accumulate
  - Saturation and wraparound arithmetic
  - Bit-reversed and circular addressing modes
- C54x has instructions to accelerate algorithms
  - Communications: FIR and LMS filtering, Viterbi decoding
  - Speech coding: vector distances for codebook search
  - Interpolation: polynomial evaluation
31References
- [1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
- [2] TMS320C54x User's Guide
- [3] www.ti.com
- [4] Prof. Brian L. Evans, Signal and Image Processing on the TMS320C54x DSP
- [5] TMS320C54x Assembly Language Tools