Title: Software Estimation for Application Specific Multiprocessor SoCs
1Software Estimation for Application Specific
Multiprocessor SoCs
- Under the Supervision of
- Prof. M.Balakrishnan
2Presentation Outline
- Introduction and Motivation
- Objectives
- Implementation
- Results
- Conclusions and Future work
- References
3Introduction and Motivation
- Application specific processors
- Multiprocessor SoCs
- SRIJAN flow
4Why Application Specific Multiprocessors
- Figure labels: Compute Intensive Application, Control Part
- General Purpose Multiprocessor: no customization, average performance
- Application Specific Multiprocessor: customization, higher performance
5Role of Processor Customization
- Allows effective utilization of resources
- Makes solution cheaper
6SRIJAN System Level Design Methodology
8Objectives
- Objectives
- Defining multiprocessor architecture description
- Developing a tool to generate a task graph and
annotate with
- Input
- Application IR
- Profiled data
- Architecture description.
- Output
- Annotated task graph
10Implementation
- Defined sections to describe multiprocessor architecture
- Task graph generation
- Modified MACHSUIF library for estimating execution times
11Architecture Description
- Describing the architecture using HMDES and extracting information using MQes.
- There are three sections
- Memory section
- Processor section
- Bus section
12Architecture Description contd
- Memory section
- Memory type
- No of ports
- Memory size
- Bus name
- Processor section
- Register file information
- Cache information
- Instruction set information
- Pipeline information
13Architecture Description contd
- Bus section
- Protocol information
- Connectivity information
- Bit width information
- BCU information
- Main section
- Integrate all the above three sections
- Extracting details with MQes
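The three sections and the main section that integrates them could be pictured as plain records. This is a minimal sketch in Python, not HMDES syntax and not the MQes interface; every class name, field name and value below is hypothetical and only mirrors the bullets above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemorySection:
    mem_type: str        # memory type
    num_ports: int       # no. of ports
    size_bytes: int      # memory size
    bus_name: str        # bus this memory is attached to


@dataclass
class ProcessorSection:
    name: str
    num_registers: int   # register file information
    cache_line_bytes: int
    pipeline_stages: int


@dataclass
class BusSection:
    name: str
    protocol: str
    bit_width: int
    connected: List[str] = field(default_factory=list)


@dataclass
class MainSection:
    """Integrates the memory, processor and bus sections."""
    memories: List[MemorySection]
    processors: List[ProcessorSection]
    buses: List[BusSection]

    def dangling_bus_refs(self) -> List[str]:
        """Bus names used by a memory but missing from the bus section."""
        known = {bus.name for bus in self.buses}
        return [m.bus_name for m in self.memories if m.bus_name not in known]


# Hypothetical single-processor description, for illustration only.
arch = MainSection(
    memories=[MemorySection("SRAM", 1, 64 * 1024, "bus0")],
    processors=[ProcessorSection("cpu0", 32, 32, 5)],
    buses=[BusSection("bus0", "example-protocol", 32, ["cpu0", "SRAM"])],
)
```

A consistency check such as `dangling_bus_refs` is the kind of query a main section makes possible, since it is the only place where all three sections are visible together.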
14Task Graph
- Application model is pthreads
- Task is defined as a piece of sequential code
15Task Graph contd
- Problems encountered
- Thread creation in loops
- Thread creation in if-else statement
- Solutions
- Unrolling loops
- Pruning the less frequently executed part with
the help of profiling information
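The two solutions can be sketched on a toy model of thread-creation sites. This is an illustration under stated assumptions, not the tool's implementation: `CreateSite`, `unroll` and `prune_if_else` are hypothetical names, and the loop trip count is assumed to be known (statically or from the profile).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CreateSite:
    task: str                 # name of the thread function being created
    loop_trip_count: int = 1  # > 1 when the creation call sits inside a loop


def unroll(sites: List[CreateSite]) -> List[str]:
    """Expand a creation site inside a loop into one task per iteration."""
    tasks: List[str] = []
    for site in sites:
        if site.loop_trip_count == 1:
            tasks.append(site.task)
        else:
            tasks.extend(f"{site.task}#{k}" for k in range(site.loop_trip_count))
    return tasks


def prune_if_else(then_sites: List[CreateSite], else_sites: List[CreateSite],
                  then_count: int, else_count: int) -> List[CreateSite]:
    """Keep the branch that the profile says executes more often."""
    return then_sites if then_count >= else_count else else_sites


# Hypothetical program: a loop creating 3 workers, then an if-else.
loop_tasks = unroll([CreateSite("worker", loop_trip_count=3)])
kept = prune_if_else([CreateSite("fast_path")], [CreateSite("slow_path")],
                     then_count=90, else_count=10)
```

After both steps every remaining creation site maps to exactly one task, so the task graph has a fixed set of nodes.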
16Execution time Estimation
- Machine SUIF library
- Extract DDG at basic block level
- Supply the resource model to the scheduler
- Generating the estimates by using scheduler
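The idea of scheduling each basic block and weighting the result by profile counts can be sketched as follows. This is a generic greedy list scheduler written for illustration, not the Machine SUIF scheduling library; the function names, the latencies and the `issue_width` resource model are assumptions, and the dependence graph is assumed to be acyclic.

```python
from typing import Dict, List, Tuple


def schedule_length(latency: Dict[str, int],
                    deps: Dict[str, List[str]],
                    issue_width: int = 1) -> int:
    """Cycle in which the last instruction of a basic block completes.

    latency: instruction -> latency in cycles (dict order = priority order)
    deps:    instruction -> instructions whose results it needs (DDG edges)
    """
    finish: Dict[str, int] = {}
    remaining = list(latency)
    cycle = 0
    while remaining:
        issued = 0
        for ins in list(remaining):
            if issued == issue_width:
                break
            ready = all(p in finish and finish[p] <= cycle
                        for p in deps.get(ins, []))
            if ready:
                finish[ins] = cycle + latency[ins]
                remaining.remove(ins)
                issued += 1
        cycle += 1
    return max(finish.values(), default=0)


def program_estimate(blocks: List[Tuple[int, int]]) -> int:
    """Sum of (block schedule length * profiled execution count)."""
    return sum(length * count for length, count in blocks)


# Hypothetical basic block: two loads feed an add, which feeds a store.
latency = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}
deps = {"add": ["ld1", "ld2"], "st": ["add"]}
single_issue = schedule_length(latency, deps, issue_width=1)
dual_issue = schedule_length(latency, deps, issue_width=2)
```

Changing `issue_width` or the latencies and re-running the estimate is a small-scale version of the architectural exploration the flow is meant to support.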
17MACHSUIF Flow
[Flow diagram; stages as labelled on the slide]
- Application in C -> Lower level SUIF -> SUIF virtual machine -> Target instructions
- Target dependent: Control flow graph, Profiling, Register allocation, Scheduler
- Target machine Description -> Resource Model -> Scheduler -> Estimates
18Resource Model
[Figure: resource usage vectors for instruction classes i1, i2, i3 over resources a, b, c]
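How a resource model leads to collision information can be sketched with reservation tables. The tables below are hypothetical (the slide's actual vectors are not reproduced here), and `collides` and `collision_vector` are illustrative names rather than functions from any scheduling library.

```python
from typing import List, Set

# Reservation[t] = resources an instruction class occupies t cycles after issue
Reservation = List[Set[str]]


def collides(first: Reservation, second: Reservation, distance: int) -> bool:
    """True if `second`, issued `distance` cycles after `first`,
    needs a resource that `first` still occupies."""
    for t, wanted in enumerate(second):
        u = t + distance  # same cycle expressed on `first`'s timeline
        if u < len(first) and first[u] & wanted:
            return True
    return False


def collision_vector(first: Reservation, second: Reservation) -> List[bool]:
    """Entry d is True when issuing `second` d cycles after `first` conflicts."""
    return [collides(first, second, d) for d in range(len(first))]


# Hypothetical instruction classes over resources a, b, c.
i1: Reservation = [{"a"}, {"b"}]
i2: Reservation = [{"a"}, {"c"}, {"c"}]

v11 = collision_vector(i1, i1)
v22 = collision_vector(i2, i2)
v12 = collision_vector(i1, i2)
```

Collecting one such vector per ordered pair of instruction classes gives one collision matrix per class, which is the input a scheduling automaton is built from.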
19Collision matrices for instruction classes
20Generated Automata
[Figure: generated automaton with states F0 to F5; edges are labelled a, b, c and each state carries a collision matrix]
- F0 and F4 are Cycle advancing states
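The way such an automaton can be generated is sketched below: each state records which resources are already reserved in the coming cycles, and edges correspond either to issuing an instruction class or to advancing one cycle. This is a generic reservation-table construction given for illustration, not necessarily the one behind the slide's figure; the single instruction class and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# state[t] = resources already reserved t cycles from now
State = Tuple[FrozenSet[str], ...]


def trim(state: State) -> State:
    """Drop trailing cycles in which nothing is reserved."""
    cells = list(state)
    while cells and not cells[-1]:
        cells.pop()
    return tuple(cells)


def issue(state: State, table: List[Set[str]]) -> Optional[State]:
    """Successor state after issuing `table`, or None on a resource conflict."""
    n = max(len(state), len(table))
    busy = list(state) + [frozenset()] * (n - len(state))
    need = [frozenset(r) for r in table] + [frozenset()] * (n - len(table))
    if any(b & w for b, w in zip(busy, need)):
        return None
    return trim(tuple(b | w for b, w in zip(busy, need)))


def advance(state: State) -> State:
    """Move to the next cycle: the current cycle's reservations expire."""
    return trim(state[1:])


def build_automaton(classes: Dict[str, List[Set[str]]]):
    """Enumerate all states reachable from the empty state."""
    start: State = ()
    states = {start}
    edges: Dict[Tuple[State, str], State] = {}
    work = [start]
    while work:
        s = work.pop()
        moves = [(name, issue(s, table)) for name, table in classes.items()]
        moves.append(("advance", advance(s)))
        for label, nxt in moves:
            if nxt is None:
                continue  # issuing this class here would collide
            edges[(s, label)] = nxt
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states, edges


# Hypothetical class i1: uses resource a in cycle 0 and b in cycle 1.
states, edges = build_automaton({"i1": [{"a"}, {"b"}]})
```

Once the automaton is built, the scheduler only follows edges; a missing edge for an instruction class in the current state means that class cannot be issued in this cycle.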
21Modified Flow
22Branch Delays
- Unconditional Branches
- delay = uncond_delay * cur_block_profile_info
- Conditional Branches
- taken_delay = frequency of branch taken * taken_delay
- not_taken_delay = frequency of branch not taken * not_taken_delay
- delay = taken_delay + not_taken_delay
- Delay information is extracted from the processor pipeline
- Branch frequency information is obtained from gcov profiler
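One reading of the branch-delay bookkeeping, assuming each per-branch penalty is multiplied by its profiled frequency and the two conditional contributions are added, is sketched below. The function and parameter names are hypothetical, and the penalty and count values are made up for the example.

```python
def unconditional_branch_delay(uncond_delay: int, block_count: int) -> int:
    """Pipeline penalty of an unconditional branch, paid once per execution
    of the basic block that ends in it."""
    return uncond_delay * block_count


def conditional_branch_delay(taken_penalty: int, not_taken_penalty: int,
                             taken_count: int, not_taken_count: int) -> int:
    """Total penalty of a conditional branch, weighted by how often the
    profile says it was taken and not taken."""
    taken = taken_count * taken_penalty
    not_taken = not_taken_count * not_taken_penalty
    return taken + not_taken


# Hypothetical numbers: a block executed 100 times, branch taken 70 times.
uncond = unconditional_branch_delay(uncond_delay=1, block_count=100)
cond = conditional_branch_delay(taken_penalty=2, not_taken_penalty=0,
                                taken_count=70, not_taken_count=30)
```

The penalties would come from the pipeline description and the counts from the profiler output; both are plain inputs here.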
23Memory References
- Classifying loads and stores
- Loads and stores involving scalar variables
- Loads and stores involving array references
- Scalar References
- All the scalar variables are stored in consecutive memory locations
- There is only one cache miss corresponding to every cache line containing scalars
24Scalar References
- N, no. of scalar variables involved in the memory access
- M, no. of memory accesses to the N scalar variables
- K, no. of processor cycles to fetch one line to the cache
- L, cache line size
- Processor cycles = (KL) * ceil(N/L) + M - ceil(N/L)
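The scalar model counts one miss per cache line that holds scalars and treats every other access as a hit. A minimal sketch follows; the per-miss cost is passed in as `miss_cycles` instead of being derived from K and L, a one-cycle hit cost is assumed, and M is assumed to be at least ceil(N/L). All names are hypothetical.

```python
import math


def scalar_reference_cycles(n_scalars: int, n_accesses: int, line_size: int,
                            miss_cycles: int, hit_cycles: int = 1) -> int:
    """Cycles spent on scalar loads and stores.

    n_scalars  : N, scalars laid out in consecutive memory locations
    n_accesses : M, total accesses to those scalars
    line_size  : L, scalars per cache line
    miss_cycles: assumed cost of one cache-line miss
    hit_cycles : assumed cost of one cache hit
    """
    misses = math.ceil(n_scalars / line_size)  # one miss per occupied line
    hits = n_accesses - misses                 # every other access hits
    return misses * miss_cycles + hits * hit_cycles


# Hypothetical numbers: 10 scalars, 4 per line, 100 accesses, 20-cycle miss.
cycles = scalar_reference_cycles(n_scalars=10, n_accesses=100,
                                 line_size=4, miss_cycles=20)
```

With these values there are 3 misses and 97 hits, so the miss term contributes 60 cycles and the hit term 97.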
25Array References
- Self-spatial reuse
- A reference accesses the same cache line in different iterations
- Self-temporal reuse
- A reference accesses the same data location in different iterations
- Group-spatial reuse
- Different references access the same cache line in different iterations
- Group-temporal reuse
- Different references access the same data location in different iterations
26Array References contd
- Self-temporal reuse references are moved outside the loop
- Group the remaining references into equivalence classes.
- Each class exhibits self-spatial and group-spatial reuse
- Calculate effective accesses per iteration
27Example
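for (i = 0; i < M; i++)
  for (j = 0; j < M; j++)
    a[i][j] = a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1] + b[i] + c[j][i];
- b[i] has self-temporal reuse
- a[i][j+1] has self-spatial reuse
- a[i][j] and a[i][j+1] have group-temporal reuse
- a[i][j] and a[i][j-1] have group-spatial reuse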
28Example contd
- a[i][j], a[i][j-1], a[i][j+1]
- a[i-1][j]
- a[i+1][j]
- 3 memory accesses in each iteration for A, 3/L per j
- 1 memory access in each iteration for C, i.e. 1 per j
- For B, 1/L off-chip access per iteration of i
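The grouping step for array A can be sketched by treating each reference as an offset pair relative to (i, j) and putting references to the same row into one class. This is an illustration under simplifying assumptions (row-major layout, small column offsets, a hypothetical line size of 8 elements), not the tool's algorithm; the function name is made up.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict, List, Tuple

Offset = Tuple[int, int]  # (di, dj) stands for a[i+di][j+dj]


def equivalence_classes(refs: List[Offset]) -> List[List[Offset]]:
    """Group references that fall in the same row, and so share cache
    lines as the inner index j advances (row-major layout assumed)."""
    by_row: Dict[int, List[Offset]] = defaultdict(list)
    for di, dj in refs:
        by_row[di].append((di, dj))
    return list(by_row.values())


# References to A in the loop body of the example.
a_refs = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
classes = equivalence_classes(a_refs)

line_size = 8  # hypothetical L, in array elements
# Each class streams along one row, so it misses once every L iterations of j.
a_misses_per_j = Fraction(len(classes), line_size)
# c[j][i] walks down a column, so it is assumed to miss on every iteration of j.
c_misses_per_j = Fraction(1, 1)
# b[i] is invariant in the inner loop and streams along i in the outer loop.
b_misses_per_i = Fraction(1, line_size)
```

The three classes found here correspond to the three groups listed above, and the per-iteration rates match the 3/L, 1 and 1/L figures for A, C and B.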
30Tsim vs Our Estimates
Percentage of Error 16, 12, 18
32Conclusions and Contributions
- Facilitated system level architecture description for SRIJAN
- Task graph formulation
- Execution time estimates
- List scheduler
- Branch delays
- Memory
- Leon target library
- Architectural exploration
- Instruction latencies
- Number of FUs
- Memory organizations
- Register file organizations
33Future Work
- Task Graph Formulation
- Synchronization overheads
- Improving Leon Library
- Extracting latency information from HMDES
35References
- SRIJAN
- Trimaran mQs functions in md.h (in trimaran/impact dir)
- SUIF2 documentation
- MACHSUIF documentation
- Instruction scheduling library for SUIF by Gang Chen and Cliff Young, Harvard University
- Efficient instruction scheduling using finite state automata by Vasanth Bala and Norman Rubin
- Local memory exploration and optimization in embedded systems by P R Panda, Nikil D. Dutt, Alexandru Nicolau
- M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House, 1996.
36Acknowledgements
- Main ideas, motivation and support
- Prof. M. Balakrishnan and Prof. Anshul Kumar
- Helpful discussions, Debugging
- Basant Kr Diwedi
- Manoj Kr Jain, Anup Gangwar
- Other members of ESG
- MACHSUIF Libraries
- Glenn Holloway
37Thank You