Title: Hybrid Software-Hardware Dynamic Memory Allocator
1. Hybrid (Software-Hardware) Dynamic Memory Allocator
- Prepared by Mustafa Özgür Akduran
- Istanbul, 2006
- Boğaziçi Üniversitesi
2. Outline
- Introduction
- Related Research
- Proposed Hybrid Allocator
- Complexity and Performance Comparison
- Conclusion
- References
3. Introduction
- Need for efficient implementation of memory management functions
  - Memory usage
  - Execution performance
- Modern programming languages rely on Dynamic Memory Allocation (DMA) and garbage collection
4. Introduction
Dynamic Memory Management (DMM)
- In current systems, execution time spent on memory management is around 42%
- There is still important research on
  - Good execution performance
  - Memory locality
- How to get free chunks of memory?
  - Software allocator
  - Hardware allocator
- A pure software allocator is the low-cost option
5. Introduction
- Software allocator
  - Uses different search techniques to organize the available chunks of free memory
  - Disadvantage
    - The search can be on the critical path of the allocator, causing a major performance bottleneck
- Hardware allocator
  - Parallel search
    - Speeds up memory allocation
    - Improves performance
  - Hides the execution latency of freeing objects
  - Coalescing of free chunks of memory
  - Disadvantage
    - Potential hardware complexity
6. Introduction
A New Hybrid Software-Hardware Allocator
- Combines Chang's hardware allocator with the PHK (Poul-Henning Kamp) allocation algorithm used in the FreeBSD system
- The aim is to balance hardware complexity with performance by using both hardware and software together
7. Related Research
- PHK (Poul-Henning Kamp) Allocator
  - The two most popular general-purpose open-source allocators are
    1. Doug Lea's allocator, used in the Linux system
    2. PHK, used in the FreeBSD system
  - The difference between them is less than 3% for memory-allocation-intensive benchmarks in SPEC CPU 2000
  - The PHK allocator was chosen because of its suitability for hardware/software co-design
  - FreeBSD (Berkeley Software Distribution) is an advanced operating system for x86-compatible architectures (including Pentium and Athlon). It is derived from BSD, the version of UNIX developed at the University of California, Berkeley, and is developed and maintained by a large team of individuals.
8. Related Research
- PHK (Poul-Henning Kamp) Allocator
  - Page-based allocator; each page can only contain objects of one size
  - For a large object, a sufficient number of pages is allocated
  - For small objects (less than half a page), the object size is padded to the nearest power of 2
  - The allocator keeps a page directory for all allocated pages, and at the beginning of each small-object page a bitmap of allocation information is created
  - While allocating small objects, the PHK allocator performs a linear search on the bitmap to find the first available chunk in that page
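As a rough illustration of that linear scan (a sketch only, not the actual FreeBSD source; the bitmap layout and the convention that a set bit marks a free chunk are assumptions here):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of a per-page bitmap scan, assuming a set bit marks a
 * free chunk; the real PHK malloc keeps its own page-directory and bitmap
 * layout. Returns the index of the first free chunk, or -1 if none. */
static int find_free_chunk(const uint32_t *bitmap, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        if (bitmap[i / 32] & (1u << (i % 32)))
            return (int)i;            /* first available chunk in this page */
    }
    return -1;                        /* page is full: try another page */
}
```

Because the scan is sequential, its cost grows with the number of chunks per page, which is exactly the part the hybrid design later offloads to hardware.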
9. Related Research
- Chang's Hardware Allocator
  - Based on the buddy system described by Knuth
  - The buddy memory allocation technique divides memory into partitions to satisfy a memory request as suitably as possible
  - It works by repeatedly splitting memory into halves to try to give a best fit (see the arithmetic sketch below)
  - Compared with the memory allocation techniques (such as paging) used by modern operating systems like MS Windows and Linux, buddy memory allocation is relatively easy to implement and does not require a memory management unit
  - Chang's algorithm was the first method based on a binary OR-tree and a binary AND-tree
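To make the splitting-into-halves idea concrete, here is a minimal sketch of standard buddy arithmetic (a generic illustration, not Chang's gate-level design): a request is rounded up to a power of two, and a block's buddy lies at the offset obtained by flipping the bit equal to the block size.

```c
#include <stddef.h>

/* Smallest power of two >= n: the block sizes a buddy system hands out. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Offset of a block's buddy, for offsets measured from the start of the
 * managed region and blocks aligned to their own size. */
static size_t buddy_of(size_t offset, size_t block_size)
{
    return offset ^ block_size;   /* flip the bit that splitting toggles */
}
```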
10. Related Research
- Chang's Hardware Allocator
  - Each leaf node of the OR-tree represents the base size of the smallest unit of memory that can be allocated
  - The leaves of the OR-tree together represent the entire memory
  - The AND-tree has the same number of leaves as the OR-tree
  - The input of the AND-tree is generated by a complex interconnection network from the OR-tree
(Figure: the OR-tree built from OR gates)
11. Related Research
- Chang's Hardware Allocator
  - OR-tree
    - Determines whether there is a large enough space for the allocation request
  - AND-tree
    - Finds the beginning address of that memory chunk
    - Flips the bits corresponding to the memory chunk in the bit-vector
(Figure: the bit-vector representing the memory chunks)
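Behaviourally, and ignoring the variable-size interconnection network, the OR-tree answers "is there a free chunk?" while the AND-tree side reports where the first free chunk begins. A software model of those two answers over a bit-vector (single object size, set bit = free chunk; both conventions are assumptions of this sketch, and the hardware computes them in parallel rather than with loops):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* OR-tree behaviour: is any chunk in the bit-vector free? */
static bool any_free(const uint64_t *bits, size_t nwords)
{
    uint64_t acc = 0;
    for (size_t w = 0; w < nwords; w++)
        acc |= bits[w];
    return acc != 0;
}

/* AND-tree behaviour as seen from software: index of the first free chunk.
 * Only meaningful when any_free() returned true. */
static size_t first_free(const uint64_t *bits, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++)
        if (bits[w])
            return w * 64 + (size_t)__builtin_ctzll(bits[w]); /* GCC/Clang intrinsic */
    return (size_t)-1;
}
```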
12. Related Research
13. Related Research
- The interconnection between the OR-tree and the AND-tree is the most complex part of Chang's allocator
- The interconnection has the same critical path delay as the OR/AND-tree
- The final allocation result is produced by the output of the AND-tree through a set of multiplexers
- The hardware complexity, in terms of the number of gates, is O(n log n)
(Figure: critical path delay across the memory chunks)
14. Proposed Hybrid Allocator
Problems of hardware-only and software-only allocators
- Pure hardware allocators based on the buddy system
  1. The complexity of the hardware increases with the size of the memory managed
  2. Poor object locality
- Software allocators
  - Poor execution performance
15. Proposed Hybrid Allocator
New Hybrid Allocator
- Uses a small, fixed amount of hardware to help manage the memory
- The software portion, which is based on the PHK algorithm, provides better object locality than the buddy system
- The hardware portion improves the execution performance of the software portion
16. Proposed Hybrid Allocator
- The software portion is responsible for
  - Creating page indexes
  - For large objects (greater than half a page), performing the allocation without any assistance from hardware
  - For a small object, locating the bitmap of a page with free memory and issuing a search request to the hardware (see the sketch below)
- The hardware portion is responsible for
  - Searching the page index (or bitmap) in parallel to find a free chunk
  - Marking the bitmap to indicate an allocation
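A hedged sketch of this division of labour follows; hybrid_malloc, hw_search_and_mark, alloc_pages_for_large_object and page_with_free_chunk are hypothetical names standing in for the real software paths and the memory-mapped hardware interface, and the 16-byte minimum object size matches the design described on the later slides.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u
#define MIN_OBJ   16u

/* Hypothetical hardware hook: hands the page's bitmap to the OR/AND-tree
 * unit, which returns the index of the first free chunk and marks it. */
extern int hw_search_and_mark(void *bitmap, size_t obj_size);

/* Hypothetical software helpers mirroring the PHK-style page directory. */
extern void *alloc_pages_for_large_object(size_t size);
extern void *page_with_free_chunk(size_t obj_size, void **bitmap_out);

static void *hybrid_malloc(size_t size)
{
    if (size > PAGE_SIZE / 2)                 /* large object: software only */
        return alloc_pages_for_large_object(size);

    size_t obj_size = MIN_OBJ;                /* small object: pad to a power of two */
    while (obj_size < size)
        obj_size <<= 1;

    void *bitmap;
    void *page = page_with_free_chunk(obj_size, &bitmap); /* software picks a page */
    if (page == NULL)
        return NULL;

    int idx = hw_search_and_mark(bitmap, obj_size);        /* hardware finds a chunk */
    if (idx < 0)
        return NULL;
    return (char *)page + (size_t)idx * obj_size;
}
```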
17. Proposed Hybrid Allocator
- The OR-tree is responsible for determining if there is a free chunk in a page (similar to Chang's system)
- The AND-tree locates the position of the first free chunk in the page (similar to Chang's system)
- Because each OR-tree and AND-tree pair is dedicated to one object size, the complex interconnections between the OR-tree and the AND-tree are not needed (unlike Chang's design)
18. Proposed Hybrid Allocator
- The MUX uses the opcode to select the address of the bit that needs to be flipped
  - If the opcode is alloc, the address from the AND-tree is chosen
  - If the opcode is free, the address from the request is selected
- D-latches are used as storage devices into which the bitmap is loaded from the page, in accordance with the allocation size
- A DEMUX is used to decode the address from the MUX
19. Proposed Hybrid Allocator
- Bit-flippers use the decoded address and the opcode to determine how to flip the desired bit (modeled in the sketch below)
(Figure: block diagram of the proposed hardware component, for a page size of 4096 bytes and an object size of 16 bytes)
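In software terms, the opcode selects which address reaches the bit-flippers: on alloc the AND-tree's result is used, on free the address carried by the request is used, and the selected bit is toggled. A behavioural model only (not the latch-level circuit; the set-bit-means-free convention is again an assumption of this sketch):

```c
#include <stddef.h>
#include <stdint.h>

enum opcode { OP_ALLOC, OP_FREE };

/* Behavioural model of the MUX + DEMUX + bit-flipper path for one latched
 * bitmap (bit set = chunk free). Returns the chunk index whose bit was flipped. */
static size_t flip_chunk(uint64_t *bitmap, size_t nwords,
                         enum opcode op, size_t request_index)
{
    size_t idx = request_index;            /* opcode free: address from the request */

    if (op == OP_ALLOC) {                  /* opcode alloc: address from the AND-tree */
        for (size_t w = 0; w < nwords; w++) {
            if (bitmap[w]) {
                idx = w * 64 + (size_t)__builtin_ctzll(bitmap[w]);
                break;
            }
        }
    }

    bitmap[idx / 64] ^= 1ull << (idx % 64);    /* the bit-flipper toggles the bit */
    return idx;
}
```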
20. Proposed Hybrid Allocator
- Overall design of the system with 4096-byte pages
- For different object sizes, the hardware needed to support the bitmap is different
- In our design, the preselected object sizes range from 16 bytes to 2048 bytes, and hardware is included to support pages for each of these object sizes
- A MUX is used to select the hardware unit responsible for supporting objects of a given size
- The larger the object size, the smaller the amount of hardware needed to support the bitmaps indicating the availability of chunks in that page
21. Proposed Hybrid Allocator
- With 4096-byte pages, we have 8 different object sizes ranging from 16 bytes to 2048 bytes
- For allocating
  - 2048-byte objects, we need a tree with two leaves
  - 16-byte objects, we need trees with 256 leaves
- For a 16-byte object we need only 255 AND/OR gates
- For the overall system, 1 + 3 + 7 + 15 + 31 + 63 + 127 + 255 = 502 AND gates and 502 OR gates are needed
- This is a very small amount compared to the billions of transistors available on modern processor chips
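The total follows from the fact that a binary tree over n leaves uses n - 1 two-input gates; summing over the eight object sizes just writes out the slide's arithmetic:

```latex
\sum_{s \in \{16, 32, \dots, 2048\}} \left(\frac{4096}{s} - 1\right)
  = 255 + 127 + 63 + 31 + 15 + 7 + 3 + 1 = 502
```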
22. Complexity and Performance Comparison
- Complexity Comparison
  - Existing hardware allocator designs implement the buddy system
  - The amount of hardware used to implement a buddy allocator depends on the size of the memory
  - That makes buddy-system-based allocators not scalable
  - Our design has much lower hardware complexity than Chang's (buddy system) allocator
23. Complexity and Performance Comparison
Legend: M = total dynamic memory size, P = page size, S = smallest allocated object size
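Reading the legend together with the O(n log n) gate count quoted earlier for Chang's design (with n taken as the number of smallest-size chunks, M/S) and with the hybrid design's dependence on page size only, the gate complexities compare roughly as follows (an inference from the figures above, not a quoted result):

```latex
\text{Chang's (buddy) hardware: } O\!\left(\frac{M}{S}\log\frac{M}{S}\right)
\qquad
\text{Hybrid hardware: } O\!\left(\frac{P}{S}\right)
```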
24. Complexity and Performance Comparison
- Conventional CPU simulated using the SimpleScalar tool set V2.0
- Hardware-assisted PHK allocator
25. Complexity and Performance Comparison
26. Complexity and Performance Comparison
27. Complexity and Performance Comparison
- We show the reduced memory management execution cycles, normalized to the original execution cycles spent on memory management functions by the software-only allocator
- The cfrac application shows the best performance improvement
  - Its average object size is 8 bytes, which means that most allocated pages contain 256 objects
  - A linear search over that many objects in the software implementation is very slow
  - The hardware speeds up the search, leading to a 76.2% normalized performance improvement over software-only allocation
- The benchmark espresso, with an average object size of 250 bytes, shows the least improvement from the hybrid allocator
  - Pages allocated for espresso contain fewer than 20 objects
  - A linear search over 20 objects is not significant, so the hardware allocator shows only a 48.0% normalized performance improvement
- The other benchmarks have average object sizes of 16 to 48 bytes, so their performance gains are not as significant as cfrac's, but better than espresso's
- On average, the hybrid allocator reduces memory management time by 58.9%; the average overall execution speedup of our design compared to a software-only allocator implementation is 12.7%
28. Conclusion
Our Design
- Compared to hardware-only allocators
  - Significantly lower hardware complexity
  - Lower critical path delays
  - Fixed hardware complexity that depends on the size of a memory page (not on the total user memory being managed)
- Compared to software-only allocators
  - Overall execution performance is 12.7% better on memory-intensive benchmarks
  - Memory management efficiency is improved by 58.9%
29. Conclusion
- Future Work
  - Exploring variable-sized pages such that the number of allocated objects is the same in each page
  - All the bitmaps would then have the same number of bits
  - Thus, only one pair of AND-tree and OR-tree would be needed in the design
  - That would further reduce the hardware complexity
  - This would also improve the memory management efficiency of the allocator for large objects
30. References
- [1] W. Li, S. P. Mohanty and K. Kavi, "A Page-based Hybrid (Software-Hardware) Dynamic Memory Allocator", IEEE Computer Architecture Letters (accepted in July 2006 for a future issue).
- [2] J. M. Chang and E. F. Gehringer, "A High-Performance Memory Allocator for Object-Oriented Systems", IEEE Transactions on Computers, Mar. 1996, pp. 357-366.
- [3] P.-H. Kamp, "Malloc(3) revisited", http://phk.freebsd.dk/pubs/malloc.pdf
- [4] D. E. Knuth, The Art of Computer Programming, Vol. I: Fundamental Algorithms, Addison-Wesley, 1968.
- [5] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, V2.0", Tech. Report CS-1342, University of Wisconsin-Madison, Jun. 1997.
31. Questions?