Title: Accelerating Data Management Operations using New Hardware Paradigms
1Accelerating Data Management Operations using New Hardware Paradigms
- Major Area Examination
- Sudipto Das sudipto_at_cs
- Committee: Prof. Divyakant Agrawal (co-chair), Prof. Amr El Abbadi (co-chair), Prof. Timothy Sherwood
2Data Streams Model
- What are Data Streams and how are they processed?
[Figure: Conventional Database vs. Data Stream Model. Courtesy Ahmed Metwally]
- Data is viewed as a passing stream (possibly infinite)
- Only a single pass through the data tuples
- Answers to queries are computed as the stream is viewed
3Applications of Data Streams
- Monitoring network and web traffic
- Internet advertising
- Stock market analysis
- Detecting DoS and DDoS attacks and other malicious activities on the network
- Distributed monitoring for load balancing
4Challenges Involved
- The volume of the stream over its entire lifetime is huge (possibly infinite)
  - Space complexity should be sub-linear in the stream size
  - Stream summaries introduce approximation
  - Goal: reduce error as well as space requirements
- Queries require timely answers
  - Most queries are online and continuous in nature
  - Response time must be small
  - Goal: processing cost must be small
- Answer queries with high accuracy using minimal space and small response time
5Why new hardware paradigms?
- Ever-increasing data rates call for faster processing
- Speed of processing cores is bounded by physical barriers
- The shift is towards multi-core architectures
  - [Ram06] One core to multiple cores: Intel Digital Revolution
  - 64 to 128 cores projected in the near future ([Held06], Intel Tera-scale Computing)
- Cisco recently announced the 40-core QuantumFlow(TM) Network Processor
- Paradigm shift in algorithmic models to exploit these architectures
- Only concurrent programs can effectively exploit the potential of multi-core architectures
- Specialized hardware like TCAMs or GPUs can be utilized to improve the efficiency of certain operations
7Presentation Outline
- Introduction and Motivation
- Common Queries on Data Streams
- New Hardware Paradigms
- Work in Progress
- Conclusion and Discussion
8Frequent Elements
- Find elements with frequency above a certain threshold (also known as support)
- Two classes of algorithms
  - Sketch-based techniques [Char02]
    - Use a set of hash functions to keep a sketch of frequent elements
    - Expensive in terms of computation; error bounds are not stringent
  - Counter-based techniques [Datar02, Dem02, Man02, Karp03, Met05, Pani07]
    - Monitor only a subset of the elements seen
    - Maintain counters for these elements
    - Heuristics to limit the space used
- Handling deletions is not trivial [Corm03]
- Space is
time for reporting
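The counter-based idea on this slide (monitor only a subset of elements, keep a counter for each) can be illustrated with the classic Misra-Gries / Frequent scheme. This is only a minimal sketch: the function name `misra_gries` and the parameter `k` are chosen here for illustration and are not taken from the slides. With at most k - 1 counters, any element occurring more than N/k times in a stream of length N is still monitored at the end, and each reported count undershoots the true count by at most N/k.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Keep at most k - 1 counters; elements with frequency > N/k survive."""
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            # Monitored element: just increment its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Free counter available: start monitoring the new element.
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


summary = misra_gries("aaabacad", k=2)
```

In this small run, `a` occurs 5 times out of 8, which is above N/k = 4, so it is guaranteed to survive; its stored count is a lower bound on the true frequency.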
9Top K
- Return the top k elements based on some scoring function [Met05, Das07]
- Top-k in the sliding window
  - Top-k in a window of data items
  - Window defined as a number of items or based on time
- Distributed Top-k Monitoring [Bab03]
  - Answer queries when monitors are distributed
  - Goal is to minimize communication and support continuous queries
  - Use filters and constraints to limit communication
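To make the count-based sliding window concrete, here is a small exact baseline. It is a sketch under stated assumptions: the class name `SlidingWindowTopK` is hypothetical, the score is plain frequency, and the whole window is stored, so it does not meet the sub-linear space goal that the cited algorithms aim for. It only shows what answer a window-based top-k query is expected to return.

```python
from collections import Counter, deque
from typing import Deque, Hashable, List, Tuple


class SlidingWindowTopK:
    """Exact top-k by frequency over the last `window_size` items."""

    def __init__(self, window_size: int) -> None:
        self.window_size = window_size
        self.window: Deque[Hashable] = deque()
        self.counts: Counter = Counter()

    def observe(self, item: Hashable) -> None:
        # Add the new item, then expire the oldest one if the window is full.
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        return self.counts.most_common(k)


monitor = SlidingWindowTopK(window_size=4)
for element in "aabccc":
    monitor.observe(element)
```

After the six items above, the window holds `b, c, c, c`, so the two `a` items have expired and no longer influence the answer.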
12New Hardware Paradigms
- Network Processing Units (NPU)
- Ternary Content Addressable Memory (TCAM)
- Chip Multiprocessors (CMP)
- Graphics Processing Units (GPU)
- Cell Broadband Engine
13Network Processing Unit (NPU)
- Provides extensive parallelism supporting up to 10 Gb/s line rates
- Examples: Intel IXP family, AMD NPs, Cisco QuantumFlow
- IXP 2855 NPU provides 16 Micro Engines (MEs), each operating at up to 1.5 GHz
  - Each ME has 8 hardware thread contexts
  - MEs are designed for simple data plane operations and have a simple instruction set
- XScale core supports a much more diverse instruction set and is used as a control plane processor
- Built-in hashing and cryptography unit
- [Gold05]: thread-level parallelism (TLP) of the NPU used for accelerating database operations such as sequential scan and hash-based join
15Ternary Content Addressable Memory
- Provides constant-time lookups (searches based on content)
- Ternary capability provides "don't care" bits that allow range matches
- IDT 75K62134 chip has 256K 36-bit words
- Programmable word size
  - From 36 bits up to 576 bits
- Supports pipelining of requests
  - 128 request contexts supported
- 100 million searches per second
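The ternary matching described above can be modelled in software to show how "don't care" bits turn one stored word into a range match. This is an illustrative sketch only: the class `TernaryCam` and its methods are hypothetical names, and the loop in `lookup` stands in for what the hardware does by comparing all stored words in parallel.

```python
from typing import List, Optional, Tuple


class TernaryCam:
    """Software model of a TCAM: each word is a (value, care-mask) pair."""

    def __init__(self, width: int = 8) -> None:
        self.width = width
        self.entries: List[Tuple[int, int]] = []

    def add(self, pattern: str) -> int:
        """Store a pattern of '0', '1' and 'x' (don't care); return its index."""
        if len(pattern) != self.width:
            raise ValueError("pattern must be exactly `width` symbols long")
        value = mask = 0
        for symbol in pattern:
            value <<= 1
            mask <<= 1
            if symbol in "01":
                value |= int(symbol)
                mask |= 1  # this bit must match exactly
            elif symbol != "x":
                raise ValueError("pattern may contain only 0, 1 and x")
        self.entries.append((value, mask))
        return len(self.entries) - 1

    def lookup(self, key: int) -> Optional[int]:
        """Return the index of the first matching word, or None."""
        for index, (value, mask) in enumerate(self.entries):
            if (key & mask) == value:
                return index
        return None


cam = TernaryCam(width=8)
range_entry = cam.add("1010xxxx")  # matches every key in 160..175
exact_entry = cam.add("11111111")  # matches only 255
```

A single word with four don't-care bits covers the 16 keys from 160 to 175, which is the sense in which ternary bits allow range matches.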
16Applications of TCAM
- TCAM-Conscious Heavy Distinct Hitters [Ban07b]
  - TCAM as a hardware implementation of a hash table
  - Adapted the 1-level filtering of [Ven05] to the TCAM setting
  - Acceleration due to the low cost of look-ups
  - Experiments on NLANR repository data confirm the intuition
- Accelerating database operations [Ban05, Ban06]
  - Suggest a CAM-Cache architecture to interface a TCAM with the CPU
  - Modify the nested loop join to exploit O(1) lookups
- Efficient sorting and range intersection [Pani03]
- TCAM-Conscious Frequent Elements [Ban07a]
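The idea of modifying the nested loop join to exploit O(1) lookups can be sketched as follows. This is not the method of [Ban05, Ban06]; it is a software analogy in which a Python dict stands in for the associative memory, and the function name and row layout are assumptions made for illustration. The inner scan of the nested loop is replaced by one content-based lookup per outer row.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def lookup_join(outer: List[Row], inner: List[Row], key: str) -> List[Row]:
    """Equi-join where the inner loop is replaced by an associative lookup."""
    # Load the inner relation into the associative store once.
    store: Dict[Hashable, List[Row]] = {}
    for row in inner:
        store.setdefault(row[key], []).append(row)

    joined: List[Row] = []
    for left in outer:
        # One expected O(1) lookup instead of scanning the inner relation.
        for right in store.get(left[key], []):
            joined.append({**right, **left})
    return joined


orders = [
    {"cust": 1, "item": "x"},
    {"cust": 2, "item": "y"},
    {"cust": 3, "item": "z"},
]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]
result = lookup_join(orders, customers, "cust")
```

The order for customer 3 has no partner in the inner relation, so it does not appear in the result.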
17Chip Multi-Processors (CMP)
- Architecture characterized by multiple cores on a single die and a cache shared between cores
- Two broad categories
  - Lean camp
    - Simple design of cores
    - Rely on thread-level parallelism
    - Sun UltraSPARC T1 and T2, Compaq Piranha
  - Fat camp
    - Target maximum single-thread performance
    - Larger cores compared to the lean camp
    - Intel Core 2 Quad, IBM Power6
18Chip Multi-Processors (CMP)
- Adaptive Aggregation on CMPs [Cie07a]
  - Use the shared cache in a CMP to provide a hash-based aggregation algorithm
  - Local hash table approach (best performance)
  - Shared hash table (plagued by synchronization issues)
  - Hybrid approach with adaptive harness (best of both worlds)
- [Har07] provided good experimental observations for a database server on a CMP
  - Under saturated workloads, the lean camp hides memory latencies better
  - L2 hit latency is a bottleneck for chips with larger caches
- Parallel Buffers on CMP [Cie07b]
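The local hash table approach can be sketched in a few lines: each worker aggregates its own slice into a private table, so no synchronization is needed until the final merge. This is only a structural illustration, not the algorithm of [Cie07a]; the function name and the round-robin split are assumptions, and because of Python's global interpreter lock the threads show the shape of the computation rather than a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Hashable, List


def aggregate_with_local_tables(values: List[Hashable], workers: int = 4) -> Counter:
    """COUNT(*) GROUP BY value, using one private hash table per worker."""
    # Round-robin split: worker i gets items i, i + workers, i + 2*workers, ...
    slices = [values[i::workers] for i in range(workers)]

    # Each worker builds its own Counter, so no locks are required here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_tables = list(pool.map(Counter, slices))

    # Single-threaded merge of the private tables into the global result.
    merged: Counter = Counter()
    for table in local_tables:
        merged.update(table)
    return merged


data = list("abcabcaab")
totals = aggregate_with_local_tables(data, workers=3)
```

The price of avoiding synchronization is that a group appearing in every slice is stored once per worker, which is the space overhead a shared table would avoid.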
21Integrated Frequent Elements/Top-K
- Idea developed in [Met05], called Space Saving
- Authors develop a new counter-based technique
  - Count occurrences of elements and keep them sorted
  - Number of counters depends on the precision sought by the user
  - Occurrence of a monitored element results in incrementing its counter
  - A new element results in replacing the minimum
- Intuition is that high-frequency elements will never be replaced
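The update rule listed above can be written down directly. The following is a minimal sketch of Space Saving as described in [Met05]; the function name, the `errors` dictionary, and the linear scan for the minimum are simplifications chosen here for brevity (a sorted structure would find the minimum in constant time).

```python
from typing import Dict, Hashable, Iterable, Tuple


def space_saving(
    stream: Iterable[Hashable], num_counters: int
) -> Tuple[Dict[Hashable, int], Dict[Hashable, int]]:
    """Return (counts, errors); the true frequency lies in [count - error, count]."""
    counts: Dict[Hashable, int] = {}
    errors: Dict[Hashable, int] = {}
    for item in stream:
        if item in counts:
            # Monitored element: increment its counter.
            counts[item] += 1
        elif len(counts) < num_counters:
            # Spare counter: start monitoring with no overestimation.
            counts[item] = 1
            errors[item] = 0
        else:
            # Replace the minimum: the newcomer inherits min + 1 and
            # remembers min as its maximum possible overestimation.
            victim = min(counts, key=counts.get)  # linear scan for brevity
            min_count = counts.pop(victim)
            errors.pop(victim)
            counts[item] = min_count + 1
            errors[item] = min_count
    return counts, errors


counts, errors = space_saving("aabacaba", num_counters=2)
```

In this run `a` truly occurs 5 times and `b` twice; `b` is reported as 3 with error 2, so its true count is known to lie between 1 and 3. The counters always sum to the stream length, which is why frequent elements are hard to evict.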
22Space-Saving By Example [Met05, Met06]
Courtesy Ahmed Metwally
- Elements should be sorted to identify min in
constant time
23Stream Summary Data Structure [Dem02, Met05, Met06]
[Figure: buckets for frequencies f1, f2, f3, with f1 < f2 < f3. Element e appears in the stream. Assuming f2 = f1 + 1. If f2 > f1 + 1, then a new bucket is added between f1 and f2.]
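A bucket-based summary of this kind can be sketched as below. This is a simplified stand-in, not the linked-list layout of the original structure: a dict keyed by frequency plays the role of the bucket list, and all names (`StreamSummary`, `observe`, `top`) are hypothetical. Because every update raises a count by exactly one, the minimum frequency can be maintained in constant time per element.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


class StreamSummary:
    """Space-Saving counters grouped into per-frequency buckets."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.count_of: Dict[Hashable, int] = {}
        self.buckets: Dict[int, Set[Hashable]] = defaultdict(set)
        self.min_count = 0

    def _move_up(self, item: Hashable, old: int) -> None:
        # Detach the item from bucket `old` (if any) and attach it to old + 1.
        if old:
            bucket = self.buckets[old]
            bucket.discard(item)
            if not bucket:
                del self.buckets[old]
                if old == self.min_count:
                    # The item lands in old + 1, which becomes the new minimum.
                    self.min_count = old + 1
        self.buckets[old + 1].add(item)
        self.count_of[item] = old + 1

    def observe(self, item: Hashable) -> None:
        if item in self.count_of:
            self._move_up(item, self.count_of[item])
        elif len(self.count_of) < self.capacity:
            self._move_up(item, 0)
            self.min_count = 1
        else:
            # Swap the newcomer in for any element of the minimum bucket,
            # then treat it like an ordinary increment.
            low = self.min_count
            victim = next(iter(self.buckets[low]))
            self.buckets[low].discard(victim)
            del self.count_of[victim]
            self.buckets[low].add(item)
            self.count_of[item] = low
            self._move_up(item, low)

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        # Sorting here is for brevity; a linked bucket list would be walked.
        ranked = sorted(self.count_of.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


summary = StreamSummary(capacity=2)
for element in "aabacaba":
    summary.observe(element)
```

Creating the bucket for `old + 1` on demand corresponds to the case in the figure where a new bucket is added between two existing ones.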
24TCAM Adaptation of Space Saving [Ban07a]
- Need to maintain counters per element
- Stream elements looked up efficiently using TCAM
- Need to keep track of minimum element
- Element frequencies also stored in TCAM
- Minimum frequency can be looked up efficiently
- Fast and Efficient
- But elements are not sorted
- Cannot answer top-k queries
- Adaptation for continuous queries not easy
[Figure taken from [Ban07a]. SS: Space Saving [Met05]; LC: Lossy Counting [Man02]]
25TCAM Adapted Stream Summary
[Figure labels: SRAM, TCAM]
- For each stream element, we have to look it up
- Use the TCAM to store the elements, so that each lookup takes O(1) time
26Using the Parallelism of NPU
- NPU has 16 Micro Engines (MEs)
- Previous solution uses only a single ME
- Challenges in using multiple MEs
- Shared resources (TCAM) need synchronization
- Synchronization leads to performance degradation
- To avoid conflict, split the TCAM
- Merge the counters to produce the global result
- Splitting increases space overhead
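One way to picture the split-and-merge idea is to route every element to exactly one partition, so that partitions never share a counter and the merge never has to add two counts for the same element. The sketch below assumes hash-based routing and a compact Space-Saving routine per partition; both are illustrative choices rather than the design used in the experiments, and all names are hypothetical.

```python
import zlib
from typing import Dict, Hashable, Iterable, List, Tuple


def space_saving_counts(stream: Iterable[Hashable], m: int) -> Dict[Hashable, int]:
    """Compact Space-Saving: at most m counters, replace the minimum on overflow."""
    counts: Dict[Hashable, int] = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < m:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


def partitioned_top(
    stream: Iterable[Hashable], partitions: int, counters_per_partition: int
) -> List[Tuple[Hashable, int]]:
    # Route each element to one partition by a stable hash of its text form.
    shards: List[List[Hashable]] = [[] for _ in range(partitions)]
    for item in stream:
        shards[zlib.crc32(str(item).encode()) % partitions].append(item)

    # Each shard could be processed by its own engine without locking.
    merged: Dict[Hashable, int] = {}
    for shard in shards:
        merged.update(space_saving_counts(shard, counters_per_partition))

    # The per-shard results are unsorted, so the merge has to sort.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


ranking = partitioned_top("aaaabbbc", partitions=2, counters_per_partition=4)
```

The total number of counters is `partitions * counters_per_partition`, which makes the space overhead of splitting visible, and the final sort is the merge cost the slide refers to.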
27Experiments
- Experiments are performed on the Intel NPU platform IXDP 2801
- Development using the Teja NP Application Development Environment and Intel Exchange Architecture
- Involves programming in two semi-low-level languages: Teja C(TM) and Micro C(TM)
- Synthetic data: Zipfian distribution with varying Zipfian factors
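A synthetic Zipfian stream of the kind mentioned above can be generated as follows. This is a sketch, not the generator used in the experiments: the function name, the alphabet size, and the seed are assumptions. The element of rank i is drawn with probability proportional to 1 / i^alpha, so a larger Zipfian factor gives a more skewed stream.

```python
import random
from collections import Counter
from typing import List


def zipf_stream(num_items: int, alpha: float, length: int, seed: int = 0) -> List[int]:
    """Draw `length` elements from ranks 1..num_items with P(i) ~ 1 / i**alpha."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    ranks = range(1, num_items + 1)
    weights = [1.0 / rank ** alpha for rank in ranks]
    return rng.choices(ranks, weights=weights, k=length)


stream = zipf_stream(num_items=100, alpha=1.5, length=10_000, seed=1)
most_common_rank, _ = Counter(stream).most_common(1)[0]
```

With alpha = 1.5 and 100 distinct elements, rank 1 carries roughly 40% of the probability mass, so it dominates any reasonably long sample.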
28Experimental Platform
- IXDP 2801 Development Platform from Intel with
Integrated IDT TCAM chip
29Some Preliminary Results
30Some Preliminary Results
- Analysis of performance
  - Pointer manipulation overhead: we have to live with it
  - Words in TCAM not aligned along boundaries
    - Experiments revealed this is not the case
  - Deletions are not O(1) per stream element
    - This is indeed the culprit
- Present TCAM word width supports only singly linked lists
- Adding support to handle doubly linked lists
31Experiments for Parallelism
32Naïve Synchronization
Naïve Synchronization Ruins Parallelism
33Open Questions
- What would be an efficient synchronization scheme?
- Can we do away with synchronization?
  - For a shared structure without synchronization, we have lost updates
  - Can we provide a bound for the error introduced?
- Challenges with merge
  - How do we merge unsorted lists to generate a sorted list?
  - How often do we merge?
  - Do we lose information during merge?
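The lost-update problem can be shown with a deterministic replay of two workers incrementing one shared counter. This is an illustrative model, not code for the NPU: the worker names and the `replay` helper are hypothetical, and each increment is split into a read step and a write step so that a harmful interleaving can be spelled out explicitly.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (worker name, "read" or "write")


def replay(schedule: List[Step]) -> int:
    """Replay unsynchronized read-modify-write increments on a shared counter."""
    counter = 0
    register: Dict[str, int] = {}  # value each worker last read
    for worker, step in schedule:
        if step == "read":
            register[worker] = counter
        elif step == "write":
            # Writes back the stale value plus one, ignoring other writers.
            counter = register[worker] + 1
        else:
            raise ValueError(f"unknown step: {step}")
    return counter


serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

serial_result = replay(serial)  # both increments are applied
racy_result = replay(racy)      # B overwrites A's update
```

In this model a lost update can only make the counter smaller than the true count, never larger, which suggests the error is one-sided; bounding its size is the open question raised above.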
36Conclusion
- Data streams form an important class of applications with specific needs
- Increased data rates and on-line answering constraints necessitate acceleration
- New hardware paradigms (like TCAMs and multi-core processors) can be exploited to accelerate these operations
- Recent trends in multi-core chip design also advocate development of algorithms to exploit these new architectures
- NPU (parallelism bundled with TCAMs) provides a good framework
37Discussions
- Increasing popularity of TCAMs might lead them to be considered commodity chips (just like GPUs)
- Multi-core architectures (16 to 64 cores) open new frontiers for exploring parallelism
- Adapt additional stream operators to best exploit these advanced features
- Vision: design a data management system leveraging modern hardware paradigms to efficiently and quickly answer a diverse set of queries
38Acknowledgements
- My advisors and my committee members
- Computer Science Dept at UCSB
- Colleagues at DSL and at UCSB
39References (I)
- [Ban07a] Bandi et al., Fast Data Stream Algorithms using Associative Memory, SIGMOD 2007
- [Cie07a] Cieslewicz et al., Adaptive Aggregation on Chip Multiprocessors, VLDB 2007
- [Ged07] Gedik et al., Executing Stream Joins on the Cell Processor, VLDB 2007
- [Met05] Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT 2005
- [Ban05] Bandi et al., Hardware Acceleration of Database Operations Using Content-Addressable Memories, DaMoN 2005
- [Gold05] Gold et al., Accelerating Database Operators Using a Network Processor, DaMoN 2005
- [Datar02] Datar et al., Maintaining Stream Statistics over Sliding Windows, SODA 2002
- [Das07] Das et al., Ad-hoc Top-k Query Answering for Data Streams, VLDB 2007
- [Ven05] Venkataraman et al., New Streaming Algorithms for Fast Detection of Superspreaders, NDSS 2005
- [Shri04] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks, SenSys 2004
- [Corm03] Cormode et al., What's Hot and What's Not: Tracking Most Frequent Items Dynamically, PODS 2003
40References (II)
- [Pani07] Panigrahy et al., Finding Frequent Elements in Non-Bursty Streams, ESA 2007
- [Mot03] Motwani et al., Query Processing, Approximation, and Resource Management in a Data Stream Management System, CIDR 2003
- [Corm04] Cormode et al., Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Streams, SIGMOD 2004
- [Corm08] Cormode et al., Finding Hierarchical Heavy Hitters in Streaming Data, ACM TKDD 2008
- [Li07] Hong Li et al., Stochastic Simulation of Biochemical Systems on the Graphics Processing Unit, Bioinformatics Journal, 2007
- [Dem02] Demaine et al., Frequency Estimation of Internet Packet Streams with Limited Space, ESA 2002
- [Cie07b] Cieslewicz et al., Parallel Buffers for Chip Multiprocessors, DaMoN 2007
- [Man02] Manku et al., Approximate Frequency Counts over Data Streams, VLDB 2002
- [Ban07b] Bandi et al., Fast Algorithms for Heavy Distinct Hitters using Associative Memories, ICDCS 2007
- [Har07] Hardavellas et al., Database Servers on Chip Multiprocessors: Limitations and Opportunities, CIDR 2007
- [Corm06] Cormode et al., Space and Time Efficient Deterministic Algorithms for Biased Quantiles over Data Streams, PODS 2006
41References (III)
- [Col07] Colohan et al., CMP Support for Large and Dependent Speculative Threads, IEEE Parallel Distributed Systems, 2007.
- [Yu04] Yu et al., Efficient Multi-Match Packet Classification with TCAM, High Perf. Interconnects 2004.
- [Held06] Held et al., From a Few Cores to Many: A Tera-scale Computing Research Overview, Intel White Paper, 2006.
- [Ram06] Ramanathan, Intel Multicore Processors: Leading the Next Digital Revolution, Intel White Paper, 2006.
- [Ban04] Bandi et al., Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations, VLDB 2004.
- [Karp03] Karp et al., A Simple Algorithm for Finding Frequent Elements in Streams and Bags, TODS 2003.
- [Char02] Charikar et al., Finding Frequent Items in Data Streams, ICALP 2002.
- [Gilb02] Gilbert et al., Fast, Small-Space Algorithms for Approximate Histogram Maintenance, STOC 02.
- [Fang98] Fang et al., Computing Iceberg Queries Efficiently, VLDB 98.
- [Ross07] Ross, Efficient Hash Probes on Modern Processors, ICDE 2007.
- [Gre01] Greenwald et al., Space-Efficient Online Computation of Quantile Summaries, SIGMOD 2001.
- [Ban06] Bandi et al., Fast Computation of Database Operations Using CAM, DEXA 2006.
42References (IV)
- [Gov05] Govindaraju et al., Fast and Approximate Stream Mining of Quantiles and Frequencies using Graphics Processors, SIGMOD 2005
- [He08] He et al., Relational Joins on Graphics Processors, To Appear, SIGMOD 2008
- [Met06] Metwally et al., An Integrated Efficient Solution for Computing Frequent and Top-k Elements in Data Streams, ACM TODS, Sept 2006.
- [Corm06b] Cormode et al., What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams, ICDE 2006
- [Bab03] Babcock et al., Distributed Top-k Monitoring, SIGMOD 2003
- [Shar02] Sharma et al., Sorting and Searching using TCAMs, IEEE HotI 2002
- [Pani03] Panigrahy et al., Sorting and Searching using TCAMs, IEEE Micro 2003
- [Akh07] Akhbarizadeh et al., A TCAM Based Parallel Architecture for High Speed Packet Forwarding, IEEE Trans. on Computers, 2007
- [Est06] Estan et al., Bitmap Algorithms for Counting Active Flows on High Speed Links, IEEE Trans. on Networking, Oct 2006.
- [Kum07] Kumar et al., On Finding Frequent Elements in a Data Stream, Approx and Random 2007
- [Mou06] Mouratidis et al., Continuous Monitoring of Top-k Queries over Sliding Windows, SIGMOD 2006
43ThanQ
Questions
44Back up Slides
45TCAM Adapted Stream Summary Key Observations
- Elements are sorted by frequency
- No overhead for minimum maintenance
- Frequency counting within some error bound
- Can answer both Frequent Elements and Top-k queries
- Can support continuous queries
- Presence of TCAM accelerates look-up
- Sorting needs pointer manipulations, which add overhead
46Hot Items
- Hot items are those that appear in a significant fraction of the stream
- We are looking at frequencies above N/(k+1), where N is the length of the stream seen so far and k is a parameter
- Frequency counting algorithms such as [Dem02, Man02] can be used to answer these queries
- But they cannot handle deletions in the stream
- [Corm03] provides an algorithm that efficiently handles deletions
  - Input space divided into subsets
  - Transactions result in incrementing or decrementing the appropriate subsets
  - Tests to see if a set has a hot item
  - Test results combined to report the hot item set
- Space is
time for reporting
47Quantiles
- Very important for summarization of streams
- Can provide a wide range of statistics about the stream
  - Median
  - Percentiles
  - Distribution of the elements in the stream
- Deterministic algorithm with space bound of O((1/ε) log(εN)) suggested in [Gre01]
- [Shri04] suggested an algorithm with space complexity of O((1/ε) log U), where U is the size of the alphabet
- [Corm06] suggested an algorithm for answering biased quantile queries using O((log U / ε) log(εN)) space
48Heavy Distinct Hitters
- Naïve approach consumes a lot of space
- Bitmap-based techniques [Est06]
  - Flow IDs hashed into a bitmap; flow count is a count of hits
  - To reduce space, sampling is used
  - Multi-resolution sampling reduces dependence on stream length
  - Simple and fast, but prone to error
- Hash-based technique [Ven05]
  - Use of different levels of filtering on the stream
  - Multi-level filtering is complex, but space efficient
  - Streams are sampled to find hosts with multiple connections
  - Probabilistic bounds on accuracy based on sampling rates
  - Requires tuning of a huge number of parameters
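The bitmap idea (hash each flow ID to one bit, then estimate the number of distinct flows from how many bits were hit) can be sketched with the standard linear-counting estimator. This is a simplified stand-in rather than the algorithms of [Est06]: there is no sampling or multi-resolution scheme, and the function name, bitmap size, and use of SHA-256 as the hash are assumptions made for illustration.

```python
import hashlib
import math
from typing import Hashable, Iterable


def bitmap_distinct_estimate(items: Iterable[Hashable], num_bits: int = 1024) -> float:
    """Estimate the number of distinct items from the fraction of unset bits."""
    bitmap = bytearray(num_bits)  # one byte per bit, for readability
    for item in items:
        digest = hashlib.sha256(repr(item).encode()).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[position] = 1  # repeated items hit the same bit again

    zero_bits = num_bits - sum(bitmap)
    if zero_bits == 0:
        return float("inf")  # bitmap saturated; a larger bitmap is needed
    # Linear counting: n is approximately b * ln(b / zero_bits).
    return num_bits * math.log(num_bits / zero_bits)


flows = list(range(200)) * 5  # 200 distinct flow IDs, each seen 5 times
estimate = bitmap_distinct_estimate(flows)
```

Duplicates do not change the bitmap, so the estimate depends only on the set of distinct IDs; accuracy degrades as the bitmap fills up, which is the "prone to error" trade-off noted above.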
49Graphics Processing Units (GPU)
- Characterized by high parallelism and high memory bandwidth
  - 200X more processing than an Intel Core 2 Duo E6700 at 2.6 GHz [Gov05, Li07]
  - Memory bandwidth of 100 GB/s (10-15 GB/s for Intel CPUs)
- Applications
  - Stream mining for quantiles and frequency estimation [Gov05]
    - Develop a GPU-aware sorting algorithm that forms the core of computing the summaries
    - Use Lossy Counting [Man02] for frequency estimation and [Gre01] for quantiles
  - [Ban04] used it for accelerating spatial database operations
50Cell Broadband Engine
- Intended for game consoles and multimedia-rich consumer devices; a product of STI
- Typically contains 1 Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs)
- PPE
  - Dual-threaded 64-bit RISC processor, runs system software
- SPE
  - 128-bit RISC processor specialized for data-rich, compute-intensive SIMD applications
  - Provides instruction-level parallelism in the form of a dual pipeline
- Band joins on stream windows parallelized on Cell processors [Ged07]
51TCAM Architecture
Courtesy Banit Agrawal
52Space Saving
- An element e_i with frequency f_i > min must exist in the Stream-Summary
- Assuming no specific distribution or user-supplied support, to find all frequent elements with error ε, the Space-Saving algorithm uses a number of counters bounded by min(|A|, 1/ε), where A is the alphabet
- Any element e_i with frequency f_i > εN is guaranteed to be in the Stream-Summary
- Zipfian data
  - Zipfian data with parameter α has frequency distribution f_i = N / (i^α ζ(α)), where ζ(α) = Σ_{i=1}^{|A|} 1/i^α
53Comparison of Frequency Counting Techniques
- Sticky Sampling Man02
- Lossy Counting Man02
- GroupTest Corm03
- CountSketch Char02
- Misra-Gries
- Frequent Karp03
- SpaceSaving Met05
- Pani07
54Parallel Computing
- Amdahl's Law
- Multi-core computing: multiple execution units
- Symmetric Multiprocessing (SMP)
  - Memory architecture where two or more identical processors are connected to a single shared memory
  - In multi-core architectures, SMP refers to the shared cache
- Advantage of CMP over SMP is the presence of on-chip cache coherence hardware
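Amdahl's Law, named in the first bullet, can be stated as a one-line function: if a fraction p of the work can be parallelized over n cores, the speedup is 1 / ((1 - p) + p / n). The sketch below only restates that formula; the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


half_parallel_two_cores = amdahl_speedup(0.5, 2)
fully_parallel_16_cores = amdahl_speedup(1.0, 16)
```

For example, with 90% of the work parallelizable the speedup stays below 10 no matter how many cores are added, which is why the serial cost of synchronization matters so much on a 16-engine NPU.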
55References (V)
- [Her04] Hershberger et al., Adaptive Spatial Partitioning for Multidimensional Streams, ISAAC 2004
- [Her05] Hershberger et al., Space Complexity of Hierarchical Heavy Hitters in Multi-Dimensional Data Streams, PODS 2005
- [Pani02] Panigrahy et al., Reducing TCAM Power Consumption and Increasing Throughput, IEEE HotI 2002
- [Lak05] Lakshminarayanan et al., Algorithms for Advanced Packet Classification with Ternary CAMs, SIGCOMM 2005