Title: Hybrid Software-Hardware Dynamic Memory Allocator
1. Hybrid (Software-Hardware) Dynamic Memory Allocator
- Prepared by Mustafa Özgür Akduran
- Istanbul, 2006
- Boğaziçi Üniversitesi
2. Outline
- Introduction
- Related Research
- Proposed Hybrid Allocator
- Complexity and Performance Comparison
- Conclusion
- References
3. Introduction
- Need for efficient implementation of memory management functions
  - Memory usage
  - Execution performance
- Modern programming languages rely on Dynamic Memory Allocation (DMA) and garbage collection
4. Introduction
Dynamic Memory Management (DMM)
- In current systems, execution time spent on memory management is around 42%
- There is still important research on
  - Good execution performance
  - Memory locality
- How to get free chunks of memory?
  - Software allocator
  - Hardware allocator
- A pure software allocator is the low-cost option
5. Introduction
- Software allocator
  - Uses different search techniques to organize the available chunks of free memory
  - Disadvantage
    - The search can be on the critical path of the allocator, causing a major performance bottleneck
- Hardware allocator
  - Parallel search
    - Speeds up memory allocation
    - Improves performance
  - Hides the execution latency of freeing objects
  - Coalescing of free chunks of memory
  - Disadvantage
    - Potential hardware complexity
6. Introduction
A New Hybrid Software-Hardware Allocator
- Combines Chang's hardware allocator with the PHK (Poul-Henning Kamp) allocation algorithm used in the FreeBSD system
- The aim is to balance hardware complexity with performance by using both hardware and software together
7. Related Research
- PHK (Poul-Henning Kamp) Allocator
  - The two most popular general-purpose open-source allocators are
    1. Doug Lea's allocator, used in the Linux system
    2. PHK, used in the FreeBSD system
  - The difference between them is less than 3% for memory-allocation-intensive benchmarks in SPEC CPU 2000
  - The PHK allocator was chosen because of its suitability for hardware/software co-design
  - FreeBSD (Berkeley Software Distribution) is an advanced operating system for x86-compatible architectures (including Pentium and Athlon). It is derived from BSD, the version of UNIX developed at the University of California, Berkeley, and is developed and maintained by a large team of individuals.
8. Related Research
- PHK (Poul-Henning Kamp) Allocator
  - Page-based allocator; each page can only contain objects of one size
  - For a large object, a sufficient number of pages is allocated
  - For small objects (less than half a page), the object size is padded to the nearest power of 2
  - The allocator keeps a page directory for all allocated pages, and at the beginning of each small-object page a bitmap of allocation information is created
  - While allocating small objects, the PHK allocator performs a linear search on the bitmap to find the first available chunk in that page
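As a rough illustration of that linear scan (a sketch only, not the actual FreeBSD source; the bitmap layout and the convention that a set bit marks a free chunk are assumptions here):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of a per-page bitmap scan, assuming a set bit marks a
 * free chunk; the real PHK malloc keeps its own page-directory and bitmap
 * layout. Returns the index of the first free chunk, or -1 if none. */
static int find_free_chunk(const uint32_t *bitmap, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        if (bitmap[i / 32] & (1u << (i % 32)))
            return (int)i;            /* first available chunk in this page */
    }
    return -1;                        /* page is full: try another page */
}
```

Because the scan is sequential, its cost grows with the number of chunks per page, which is exactly the part the hybrid design later offloads to hardware.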
9. Related Research
- Chang's Hardware Allocator
  - Based on the buddy system described by Knuth
  - The buddy memory allocation technique divides memory into partitions to satisfy a memory request as suitably as possible
  - It works by repeatedly splitting memory into halves to try to give a best fit (see the arithmetic sketch below)
  - Compared with the memory allocation techniques (such as paging) used by modern operating systems like MS Windows and Linux, buddy memory allocation is relatively easy to implement and does not require a memory management unit
  - Chang's algorithm was the first method based on a binary OR-tree and a binary AND-tree
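To make the splitting-into-halves idea concrete, here is a minimal sketch of standard buddy arithmetic (a generic illustration, not Chang's gate-level design): a request is rounded up to a power of two, and a block's buddy lies at the offset obtained by flipping the bit equal to the block size.

```c
#include <stddef.h>

/* Smallest power of two >= n: the block sizes a buddy system hands out. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Offset of a block's buddy, for offsets measured from the start of the
 * managed region and blocks aligned to their own size. */
static size_t buddy_of(size_t offset, size_t block_size)
{
    return offset ^ block_size;   /* flip the bit that splitting toggles */
}
```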
10. Related Research
- Chang's Hardware Allocator
  - Each leaf node of the OR-tree represents the base size of the smallest unit of memory that can be allocated
  - The leaves of the OR-tree together represent the entire memory
  - The AND-tree has the same number of leaves as the OR-tree
  - The input of the AND-tree is generated by a complex interconnection network from the OR-tree
(Figure: the OR-tree built from OR gates)
11. Related Research
- Chang's Hardware Allocator
  - OR-tree
    - Determines whether there is a large enough space for the allocation request
  - AND-tree
    - Finds the beginning address of that memory chunk
    - Flips the bits corresponding to the memory chunk in the bit-vector
(Figure: the bit-vector representing the memory chunks)
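Behaviourally, and ignoring the variable-size interconnection network, the OR-tree answers "is there a free chunk?" while the AND-tree side reports where the first free chunk begins. A software model of those two answers over a bit-vector (single object size, set bit = free chunk; both conventions are assumptions of this sketch, and the hardware computes them in parallel rather than with loops):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* OR-tree behaviour: is any chunk in the bit-vector free? */
static bool any_free(const uint64_t *bits, size_t nwords)
{
    uint64_t acc = 0;
    for (size_t w = 0; w < nwords; w++)
        acc |= bits[w];
    return acc != 0;
}

/* AND-tree behaviour as seen from software: index of the first free chunk.
 * Only meaningful when any_free() returned true. */
static size_t first_free(const uint64_t *bits, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++)
        if (bits[w])
            return w * 64 + (size_t)__builtin_ctzll(bits[w]); /* GCC/Clang intrinsic */
    return (size_t)-1;
}
```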
12. Related Research
13. Related Research
- The interconnection between the OR-tree and the AND-tree is the most complex part of Chang's allocator
- The interconnection has the same critical path delay as the OR/AND-tree
- The final allocation result is produced by the output of the AND-tree through a set of multiplexers
- The hardware complexity, in terms of the number of gates, is O(n log n)
(Figure: critical path delay across the memory chunks)
14. Proposed Hybrid Allocator
Problems of hardware-only and software-only allocators
- Pure hardware allocators based on the buddy system
  1. The complexity of the hardware increases with the size of the memory managed
  2. Poor object locality
- Software allocators
  - Poor execution performance
15. Proposed Hybrid Allocator
New Hybrid Allocator
- Uses a small, fixed amount of hardware to help manage the memory
- The software portion, which is based on the PHK algorithm, provides better object locality than the buddy system
- The hardware portion improves the execution performance of the software portion
16. Proposed Hybrid Allocator
- The software portion is responsible for
  - Creating page indexes
  - For large objects (greater than half a page), performing the allocation without any assistance from hardware
  - For a small object, locating the bitmap of a page with free memory and issuing a search request to the hardware (see the sketch below)
- The hardware portion is responsible for
  - Searching the page index (or bitmap) in parallel to find a free chunk
  - Marking the bitmap to indicate an allocation
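A hedged sketch of this division of labour follows; hybrid_malloc, hw_search_and_mark, alloc_pages_for_large_object and page_with_free_chunk are hypothetical names standing in for the real software paths and the memory-mapped hardware interface, and the 16-byte minimum object size matches the design described on the later slides.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u
#define MIN_OBJ   16u

/* Hypothetical hardware hook: hands the page's bitmap to the OR/AND-tree
 * unit, which returns the index of the first free chunk and marks it. */
extern int hw_search_and_mark(void *bitmap, size_t obj_size);

/* Hypothetical software helpers mirroring the PHK-style page directory. */
extern void *alloc_pages_for_large_object(size_t size);
extern void *page_with_free_chunk(size_t obj_size, void **bitmap_out);

static void *hybrid_malloc(size_t size)
{
    if (size > PAGE_SIZE / 2)                 /* large object: software only */
        return alloc_pages_for_large_object(size);

    size_t obj_size = MIN_OBJ;                /* small object: pad to a power of two */
    while (obj_size < size)
        obj_size <<= 1;

    void *bitmap;
    void *page = page_with_free_chunk(obj_size, &bitmap); /* software picks a page */
    if (page == NULL)
        return NULL;

    int idx = hw_search_and_mark(bitmap, obj_size);        /* hardware finds a chunk */
    if (idx < 0)
        return NULL;
    return (char *)page + (size_t)idx * obj_size;
}
```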
17. Proposed Hybrid Allocator
- The OR-tree is responsible for determining if there is a free chunk in a page (similar to Chang's system)
- The AND-tree locates the position of the first free chunk in the page (similar to Chang's system)
- Because each OR-tree and AND-tree pair is dedicated to one object size, the complex interconnections between the OR-tree and the AND-tree are not needed (unlike Chang's design)
18. Proposed Hybrid Allocator
- The MUX uses the opcode to select the address of the bit that needs to be flipped
  - If the opcode is alloc, the address from the AND-tree is chosen
  - If the opcode is free, the address from the request is selected
- D-latches are used as storage devices into which the bitmap is loaded from the page, in accordance with the allocation size
- A DEMUX is used to decode the address from the MUX
19. Proposed Hybrid Allocator
- Bit-flippers use the decoded address and the opcode to determine how to flip the desired bit (modeled in the sketch below)
(Figure: block diagram of the proposed hardware component, for a page size of 4096 bytes and an object size of 16 bytes)
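In software terms, the opcode selects which address reaches the bit-flippers: on alloc the AND-tree's result is used, on free the address carried by the request is used, and the selected bit is toggled. A behavioural model only (not the latch-level circuit; the set-bit-means-free convention is again an assumption of this sketch):

```c
#include <stddef.h>
#include <stdint.h>

enum opcode { OP_ALLOC, OP_FREE };

/* Behavioural model of the MUX + DEMUX + bit-flipper path for one latched
 * bitmap (bit set = chunk free). Returns the chunk index whose bit was flipped. */
static size_t flip_chunk(uint64_t *bitmap, size_t nwords,
                         enum opcode op, size_t request_index)
{
    size_t idx = request_index;            /* opcode free: address from the request */

    if (op == OP_ALLOC) {                  /* opcode alloc: address from the AND-tree */
        for (size_t w = 0; w < nwords; w++) {
            if (bitmap[w]) {
                idx = w * 64 + (size_t)__builtin_ctzll(bitmap[w]);
                break;
            }
        }
    }

    bitmap[idx / 64] ^= 1ull << (idx % 64);    /* the bit-flipper toggles the bit */
    return idx;
}
```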
20. Proposed Hybrid Allocator
- Overall design of the system with 4096-byte pages
- For different object sizes, the hardware needed to support the bitmap is different
- In our design, the preselected object sizes range from 16 bytes to 2048 bytes, and hardware is included to support pages for each of these object sizes
- A MUX is used to select the hardware unit responsible for supporting objects of a given size
- The larger the object size, the smaller the amount of hardware needed to support the bitmaps indicating the availability of chunks in that page
21. Proposed Hybrid Allocator
- With 4096-byte pages, we have 8 different object sizes ranging from 16 bytes to 2048 bytes
- For allocating
  - 2048-byte objects, we need a tree with two leaves
  - 16-byte objects, we need trees with 256 leaves
- For a 16-byte object we need only 255 AND/OR gates
- For the overall system, 1 + 3 + 7 + 15 + 31 + 63 + 127 + 255 = 502 AND gates and 502 OR gates are needed
- This is a very small amount compared to the billions of transistors available on modern processor chips
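The total follows from the fact that a binary tree over n leaves uses n - 1 two-input gates; summing over the eight object sizes just writes out the slide's arithmetic:

```latex
\sum_{s \in \{16, 32, \dots, 2048\}} \left(\frac{4096}{s} - 1\right)
  = 255 + 127 + 63 + 31 + 15 + 7 + 3 + 1 = 502
```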
22. Complexity and Performance Comparison
- Complexity Comparison
  - Existing hardware allocator designs implement the buddy system
  - The amount of hardware used to implement a buddy allocator depends on the size of the memory
  - That makes buddy-system-based allocators not scalable
  - Our design has much lower hardware complexity than Chang's (buddy system) allocator
23. Complexity and Performance Comparison
Legend: M = total dynamic memory size, P = page size, S = smallest allocated object size
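Reading the legend together with the O(n log n) gate count quoted earlier for Chang's design (with n taken as the number of smallest-size chunks, M/S) and with the hybrid design's dependence on page size only, the gate complexities compare roughly as follows (an inference from the figures above, not a quoted result):

```latex
\text{Chang's (buddy) hardware: } O\!\left(\frac{M}{S}\log\frac{M}{S}\right)
\qquad
\text{Hybrid hardware: } O\!\left(\frac{P}{S}\right)
```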
24. Complexity and Performance Comparison
- Conventional CPU simulated using the SimpleScalar tool set V2.0
- Hardware-assisted PHK allocator
25. Complexity and Performance Comparison
26. Complexity and Performance Comparison
27. Complexity and Performance Comparison
- We show the reduced memory management execution cycles, normalized to the original execution cycles spent on memory management functions by the software-only allocator
- The cfrac application shows the best performance improvement
  - Its average object size is 8 bytes, which means that most allocated pages contain 256 objects
  - A linear search over that many objects in the software implementation is very slow
  - The hardware speeds up the search, leading to a 76.2% normalized performance improvement over software-only allocation
- The benchmark espresso, with an average object size of 250 bytes, shows the least improvement from the hybrid allocator
  - Pages allocated for espresso contain fewer than 20 objects
  - A linear search over 20 objects is not significant, so the hardware allocator shows only a 48.0% normalized performance improvement
- The other benchmarks have average object sizes of 16 to 48 bytes, so their performance gains are not as significant as cfrac's, but better than espresso's
- On average, the hybrid allocator reduces memory management time by 58.9%; the average overall execution speedup of our design compared to a software-only allocator implementation is 12.7%
28. Conclusion
Our Design
- Compared to hardware-only allocators
  - Significantly lower hardware complexity
  - Lower critical path delays
  - Fixed hardware complexity that depends on the size of a memory page (not on the total user memory being managed)
- Compared to software-only allocators
  - Overall execution performance is 12.7% better on memory-intensive benchmarks
  - Memory management efficiency is improved by 58.9%
29. Conclusion
- Future Work
  - Exploring variable-sized pages such that the number of allocated objects is the same in each page
  - All the bitmaps would then have the same number of bits
  - Thus, only one pair of AND-tree and OR-tree would be needed in the design
  - That would further reduce the hardware complexity
  - This would also improve the memory management efficiency of the allocator for large objects
30. References
- [1] W. Li, S. P. Mohanty and K. Kavi, "A Page-based Hybrid (Software-Hardware) Dynamic Memory Allocator", IEEE Computer Architecture Letters (accepted in July 2006 for a future issue).
- [2] J. M. Chang and E. F. Gehringer, "A High-Performance Memory Allocator for Object-Oriented Systems", IEEE Transactions on Computers, Mar. 1996, pp. 357-366.
- [3] P.-H. Kamp, "Malloc(3) revisited", http://phk.freebsd.dk/pubs/malloc.pdf
- [4] D. E. Knuth, The Art of Computer Programming, Vol. I: Fundamental Algorithms, Addison-Wesley, 1968.
- [5] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, V2.0", Tech. Report CS-1342, University of Wisconsin-Madison, Jun. 1997.
31. Questions?