Key Issues with TI TMS320DM642 Applications

Provided by: highmanqi
Transcript and Presenter's Notes

1
Key Issues with TI TMS320DM642 Applications
http://www.embeddedcore.com
2
Agenda
  • Background
  • System Platform
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

3
Background: TI DM642
  • Advanced TI C64x DSP core
    • Pipelined VLIW architecture
    • SIMD extensions
    • Two sets of 32 registers
    • Up to 720 MHz
  • Two-level on-chip memory
    • L2 configurable as cache or mapped memory
  • Various integrated peripherals
    • Video Port (capture/display)
    • PCI
    • I2C
    • McBSP/McASP
    • Ethernet MAC

4
Background: Experiences
  • Product requirements
    • Video capture and MPEG-4 encoding (PCI card, or
      standalone as a video server)
    • MPEG-4 video decoding and display (PCI card for
      a TV wall)
    • VGA signal capture and encoding (PCI card)
  • The TI DM642 can do it. We did it.

5
Background: Video Encoding Card
6
Background: 3-TV-output Video Decoding Card
7
Background: VGA Capture and Encoder
8
Background: Conclusion
  • The DM642 is powerful, but complex
    audio/video/VGA processing is a big challenge,
    and the mission may fail:
    • Performance target not reached: the first
      working version is often tens or hundreds of
      times slower than the target
    • Development cycle too long
  • By working through a few key issues, we can
    • Not only fulfill the mission,
    • But also speed up the development process
  • Some of these key issues are shared here.

9
Agenda
  • Background
  • System Framework
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

10
System Basis
  • The system platform issues are the same as those
    of any embedded system, including
    • RTOS on the DSP
    • Device drivers on the DSP
    • Reference Framework
    • Algorithm Interface Standard

11
RTOS is essential
  • 2-D thread grid in a typical project
    • Service axis
      • Main control
      • Video
      • Audio
    • Processing axis
      • Device ISR
      • Soft interrupt
      • Input task: capture or PCI
      • Processing task: encoding or decoding
      • Output task: PCI or display
  • So thread scheduling is essential in complex
    applications, and it is exactly the minimal
    kernel that every RTOS provides

12
DSP/BIOS
  • uC/OS and Linux both run on the DM642, so why not
    TI DSP/BIOS?
    • TI's own, royalty-free, compact, and scalable
      from a 3KB footprint
    • Configurable, including
      • Thread synchronization, such as post()/pend()
      • Memory management
      • Generic devices
      • Data-structure operations such as QUE
      • Other instrumentation
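
The post()/pend() and QUE ideas above can be sketched in portable C. This is only an illustration of the hand-off pattern between input, processing, and output tasks, not DSP/BIOS code: `MsgQueue`, `mq_post`, `mq_pend`, and `run_pipeline` are hypothetical names, and where a real RTOS would block in pend(), this single-threaded sketch simply reports that nothing is available.

```c
#include <assert.h>
#include <stddef.h>

#define QUE_CAP 4

/* Hypothetical stand-in for a message queue guarded by a counting semaphore. */
typedef struct {
    int frames[QUE_CAP];
    int head, tail;
    int count;                      /* plays the role of the semaphore count */
} MsgQueue;

/* post(): enqueue a message and signal the consumer. Returns 0 if full. */
static int mq_post(MsgQueue *q, int frame) {
    assert(q != NULL);
    if (q->count == QUE_CAP) return 0;
    q->frames[q->tail] = frame;
    q->tail = (q->tail + 1) % QUE_CAP;
    q->count++;
    return 1;
}

/* pend(): take a message if one is waiting. A real RTOS would block here. */
static int mq_pend(MsgQueue *q, int *frame) {
    assert(q != NULL && frame != NULL);
    if (q->count == 0) return 0;
    *frame = q->frames[q->head];
    q->head = (q->head + 1) % QUE_CAP;
    q->count--;
    return 1;
}

static MsgQueue captured;           /* input task      -> processing task */
static MsgQueue encoded;            /* processing task -> output task     */

static void input_task(int frame) { mq_post(&captured, frame); }

static void processing_task(void) {
    int f;
    if (mq_pend(&captured, &f))
        mq_post(&encoded, f * 10);  /* placeholder for encoding/decoding */
}

static int output_task(void) {
    int f;
    return mq_pend(&encoded, &f) ? f : -1;
}

/* Push one frame through the three tasks in order and return the output. */
int run_pipeline(int frame) {
    input_task(frame);
    processing_task();
    return output_task();
}
```

Calling `run_pipeline(3)` walks one frame through all three tasks; in a real system each task would run as its own thread and the scheduler would decide the order.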

13
DSP/BIOS Device Driver
  • Device drivers, as on Linux or WinCE
    • Uniform APIs
    • Uniform framework (DDK)
  • They include control of
    • On-chip devices such as EDMA, Video Port,
      McASP, GPIO, I2C (CSL)
    • External devices such as ADC and DAC via I2C
      (EDC/BSP)
    • The associated EDMA
    • ISR

14
Reference Framework
  • It unifies the layout of all applications
  • Thread scheduling
    • Task, including access to device drivers
    • SCOM: inter-task message communication
      • post()/pend()
      • Message QUE
  • Channel and algorithm
    • Channel: a serial collection of cells
    • Cell: an algorithm container
    • ICC: inter-cell communication
  • On-chip memory management interface
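
The channel/cell layering can be pictured with a few lines of C. This is a minimal sketch under assumptions, not the Reference Framework API: the `Cell` and `Channel` types, `channel_execute`, and the two toy algorithms are hypothetical, and passing the value from one cell to the next stands in for ICC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cell: wraps one algorithm behind a uniform interface. */
typedef struct {
    const char *name;
    int (*process)(int sample);     /* algorithm entry point */
} Cell;

/* Hypothetical channel: cells executed in series. */
typedef struct {
    const Cell *cells;
    size_t      count;
} Channel;

/* Two toy algorithms used only for illustration. */
static int scale_by_two(int x) { return x * 2; }
static int add_offset(int x)   { return x + 3; }

/* Run every cell in order; each cell's output feeds the next cell. */
int channel_execute(const Channel *ch, int input) {
    assert(ch != NULL);
    int value = input;
    for (size_t i = 0; i < ch->count; i++)
        value = ch->cells[i].process(value);   /* stands in for ICC hand-off */
    return value;
}

static const Cell demo_cells[] = {
    { "scale",  scale_by_two },
    { "offset", add_offset   },
};

const Channel demo_channel = { demo_cells, 2 };
```

Because every cell exposes the same entry point, an algorithm can be swapped by editing the cell table only; `channel_execute(&demo_channel, 5)` runs the scale cell and then the offset cell.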

15
Algorithm Interface Standard
  • It unifies the algorithm interfaces
  • Used together with the Reference Framework

16
System Platform
  • It is an invest-once, always-useful
    infrastructure
  • Reusable in all other projects
  • Uniformity makes each level more reusable.
    Uniformity is as important as the systems
    themselves, yet it is often ignored.

17
Agenda
  • Background
  • System Framework
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

18
Memory Hierarchy
  • On-chip memory
    • 16KB L1D and 16KB L1P cache
    • 256KB L2 cache/mapped memory
  • Off-chip memory on the EMIF (External Memory
    Interface)
    • SDRAM
  • Typical applications have to use all levels of
    memory
    • Program flow
    • Data flow

19
Data and Program Flow
20
SDRAM
  • Large amounts of data can only be placed there
  • From the DSP core's point of view, SDRAM is a
    slow off-chip device, though much larger than
    on-chip memory.
  • In the DSP processing model, the CPU should not
    access SDRAM directly
    • At least about 20 cycles per off-chip access,
      versus 3-8 cycles for on-chip memory
    • Use burst transfers to L2 memory via DMA or
      the cache controller

21
L2 Memory
  • Part of L2 memory should be configured as L2
    cache for access to SDRAM
    • Consistency between the L2 cache and SDRAM
      (under CPU and DMA updates) must be maintained
      by software; it is not automatic
  • The remainder is addressable memory
    • It can hold frequently accessed program or
      data
    • Stripes of data can be transferred from and
      back to SDRAM via QDMA

22
L1D and L1P Cache
  • 16KB L1D and 16KB L1P cache
    • Consistency is maintained automatically
  • Tune the program segments and the data blocks
    being processed to fit these sizes
  • Example 1
    • Parameterize the size of the image stripe held
      in L2 in the processing software: x MB
    • Different sizes affect processing performance;
      find the best value, x = n
    • Then set the stripe size to n MB
  • Example 2
    • Divide the program into segments that each
      suit one group of data, reducing cache misses
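
Example 1 can be sketched as a stripe loop whose stripe height is a parameter. This is a host-side illustration under assumptions: `onchip_buf` stands in for mapped L2 memory, `memcpy` stands in for a QDMA transfer, the 4096-byte buffer size is arbitrary, and `invert_image_striped` is a hypothetical name. On the real device the best `stripe_rows` would be found by profiling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRIPE_BYTES 4096   /* assumed size of the on-chip working buffer */

static uint8_t onchip_buf[MAX_STRIPE_BYTES];   /* stands in for mapped L2 memory */

/* Invert every pixel of an image held in "SDRAM", one stripe at a time.
 * stripe_rows is the tuning parameter; memcpy stands in for a QDMA transfer. */
void invert_image_striped(uint8_t *sdram_img, int width, int height,
                          int stripe_rows) {
    assert(sdram_img != NULL && width > 0 && height > 0 && stripe_rows > 0);
    assert((size_t)width * (size_t)stripe_rows <= MAX_STRIPE_BYTES);

    for (int row = 0; row < height; row += stripe_rows) {
        /* The last stripe may be shorter than stripe_rows. */
        int rows = (height - row < stripe_rows) ? height - row : stripe_rows;
        size_t bytes = (size_t)rows * (size_t)width;
        uint8_t *src = sdram_img + (size_t)row * (size_t)width;

        memcpy(onchip_buf, src, bytes);            /* SDRAM -> on-chip */
        for (size_t i = 0; i < bytes; i++)         /* work on on-chip data only */
            onchip_buf[i] = (uint8_t)(255 - onchip_buf[i]);
        memcpy(src, onchip_buf, bytes);            /* on-chip -> SDRAM */
    }
}
```

The result is the same for any valid stripe height; only the number and size of the transfers change, which is what makes the stripe size a pure performance knob.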

23
Effective Use of Memory Hierarchy
  • On-chip memory is always a scarce resource, not
    enough for large volumes of data such as video
  • Whether using the transparent cache or explicit
    DMA-based transfers, the developer should plan
    the data flow and its consistency.
  • This brings a big performance boost.

24
Agenda
  • Background
  • System Framework
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

25
Program Optimization
  • Program optimization only makes sense once the
    system and memory frameworks are well tuned
  • Good C/C++ coding is always needed
  • Profiling guides optimization
  • Intrinsics in C
    • Each is equivalent to one instruction
  • Assembly
    • Inline
    • Function

26
Assembly Optimization
  • The DM642 has
    • 6 ALUs and 2 multipliers
    • Each is pipelined
    • 32 x 2 registers
    • SIMD extensions

27
By An Example
  • This is a fun trick that also shows how an
    algorithm can be fitted to the architecture.
  • The DM642 has an avgu4 instruction that computes
    the averages of 4 pairs in parallel in 2 cycles:
    d[i] = (s1[i] + s2[i] + 1)/2, where i = 0, 1, 2, 3
    and d and s are unsigned
  • For an image, what is the fastest way to compute
    d = (s1 + s2 + 1 - r)/2, where r = 0 or 1
    (rounding)?
28
By An Example
  • d = (s1 + s2 + 1 - r)/2, where r = 0 or 1
  • d = ((s1 - r) + s2 + 1) >> 1: s1 - r may
    underflow, as there is no saturating subtract
  • d = (((s1 + 2 - r) + s2 + 1) >> 1) - 1
    • if (r == 1) s1[i] = s1[i] + 1 → saddu4
    • d[i] = (s1[i] + s2[i] + 1) >> 1 → avgu4
    • if (r == 1) d[i] = d[i] - 1 → sub4
  • This conversion avoids the underflow, so 4 pairs
    of pixels can be processed in parallel.
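
The three-step sequence can be checked on a host with portable emulations of the packed operations. This is a sketch, not DM642 code: `emu_saddu4`, `emu_avgu4`, and `emu_sub4` are hypothetical helpers that mimic the per-byte behaviour of the saddu4, avgu4, and sub4 instructions (on the TI compiler the corresponding intrinsics are `_saddu4`, `_avgu4`, and `_sub4`), and `avg4_round` and `avg_ref` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating unsigned add on four packed bytes (emulates saddu4). */
static uint32_t emu_saddu4(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t s = ((a >> (8 * lane)) & 0xFF) + ((b >> (8 * lane)) & 0xFF);
        if (s > 0xFF) s = 0xFF;                 /* clip at 255 */
        out |= s << (8 * lane);
    }
    return out;
}

/* Rounded average on four packed bytes: (a + b + 1) >> 1 (emulates avgu4). */
static uint32_t emu_avgu4(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t s = ((a >> (8 * lane)) & 0xFF) + ((b >> (8 * lane)) & 0xFF) + 1;
        out |= (s >> 1) << (8 * lane);
    }
    return out;
}

/* Wrapping subtract on four packed bytes (emulates sub4). */
static uint32_t emu_sub4(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t s = (((a >> (8 * lane)) & 0xFF) - ((b >> (8 * lane)) & 0xFF)) & 0xFF;
        out |= s << (8 * lane);
    }
    return out;
}

/* The slide's sequence: optional +1, rounded average, optional -1. */
uint32_t avg4_round(uint32_t s1, uint32_t s2, int r) {
    assert(r == 0 || r == 1);
    const uint32_t ones = 0x01010101u;
    if (r) s1 = emu_saddu4(s1, ones);
    uint32_t d = emu_avgu4(s1, s2);
    if (r) d = emu_sub4(d, ones);
    return d;
}

/* Scalar reference for a single pixel pair. */
uint32_t avg_ref(uint32_t s1, uint32_t s2, int r) {
    return (s1 + s2 + 1 - (uint32_t)r) >> 1;
}
```

Under this emulation the sequence matches the scalar reference for every s1 below 255. When r = 1, s1 = 255, and s2 is odd, the saturating add clips at 255 and the result comes out one lower than the reference (for example 127 instead of 128 for s1 = 255, s2 = 1), so that corner is worth checking against the codec's accuracy requirements.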

29
Real Code Scenario
  • ldndw .d1t1 hA_srchsrcastride, hA4567:hA0123
  • [hcnt] b .s1 hloop
  • [hroundingb] saddu4 .s2 hD5678, cst1b, hD5678    1
  • ldndw .d2t2 hB_srchsrcbstride, hD5678:hD1234
  • avgu4 .m1 hA1234, hA0123, hA0123
  • avgu4 .m2x hB5678, hA4567, hB4567
  • [!hroundinga] mv .l1x hD1234, hC1234
  • [hroundinga] saddu4 .s1x hD1234, cst1a, hC1234    2
  • ldndw .d1t1 hA_srchsrcastride, hC4567:hC0123
  • avgu4 .m1 hC1234, hC0123, hC0123
  • avgu4 .m2x hD5678, hC4567, hD4567    3
  • [hcnt] ldndw .d2t2 hB_srchsrcbstride, hB5678:hB1234
  • [hroundinga] sub4 .l1 hA0123, cst1a, hA0123
  • [hroundingb] sub4 .l2 hB4567, cst1b, hB4567    4

30
Agenda
  • Background
  • System Framework
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

31
Advanced Topics
  • A multi-core processor such as OMAP integrates
    the following into one chip. Great!
    • DSP
    • Host interface such as PCI (and its driver)
    • Host (ARM)
    • The related peripherals
  • This unifies the DSP and host applications.

32
Agenda
  • Background
  • System Framework
  • Memory Hierarchy
  • Program Optimization
  • Advanced Topics
  • Conclusion

33
Conclusion
  • These discussions form a fairly complete set of
    system optimization topics, and they clearly
    apply to other embedded systems as well.
  • Thank you