Accelerating Data Management Operations using New Hardware Paradigms

Transcript and Presenter's Notes
1
Accelerating Data Management Operations using New
Hardware Paradigms
Major Area Examination
  • Sudipto Das sudipto_at_cs

Committee: Prof. Divyakant Agrawal (co-chair), Prof. Amr El Abbadi (co-chair), Prof. Timothy Sherwood
2
Data Streams Model
  • What are Data Streams and how are they processed?

(Figure: Conventional Database vs. Data Stream Model. Courtesy Ahmed Metwally)
  • Data is viewed as a passing stream (possibly
    infinite)
  • Only a single pass through the data tuples
  • Answers to queries are computed as the stream is viewed
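As a minimal illustration of this single-pass model (plain Python, not from any cited system; all names are illustrative), the sketch below consumes each tuple exactly once and answers a continuous query from a constant-size summary, never storing or revisiting the stream:

    # One-pass stream processing: the stream is consumed tuple by tuple and
    # only a small summary is retained; the data itself is never stored.
    def process_stream(stream):
        total = 0            # number of tuples seen so far
        running_sum = 0      # example summary: sum of a numeric attribute
        for value in stream:
            total += 1
            running_sum += value
            yield running_sum / total   # continuous answer: current average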

3
Applications of Data Streams
  • Monitoring network and web traffic
  • Internet Advertising
  • Stock market analysis
  • Detecting DoS and DDoS attacks and other malicious activities on the network
  • Distributed monitoring for load balancing

4
Challenges Involved
  • The volume of the stream over its entire lifetime is huge (possibly infinite)
  • Space complexity should be sub-linear to stream
    size
  • Stream Summaries introduce approximation
  • Goal: reduce the error as well as the space requirements
  • Queries require timely answers
  • Most queries are online and continuous in nature
  • Response time must be small
  • Goal: the processing cost must be small
  • Answer queries with high accuracy using minimal
    space and small response time

5
Why new hardware paradigms?
  • Ever increasing data rates call for faster
    processing
  • Speed of processing cores bounded by physical
    barriers
  • The shift is towards multi-core architectures [Ram06] (one core to multiple cores; Intel Digital Revolution)
  • 64-128 cores projected in the near future [Held06] (Intel Tera-scale Computing)
  • Cisco recently announced the 40-core QuantumFlow Network Processor
  • Paradigm shift in algorithmic models to exploit
    these architectures
  • Only concurrent programs can effectively exploit
    the potential of the multi-core architectures
  • Specialized hardware like TCAM or GPU can be
    utilized to improve efficiency of certain
    operations

6
(No Transcript)
7
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

8
Frequent Elements
  • Find elements with frequency above a certain
    threshold (also known as support)
  • Two classes of algorithms
  • Sketch-Based Techniques [Char02]
  • Use a set of hash functions to keep a sketch of the frequent elements
  • Expensive in terms of computation; error bounds are not stringent
  • Counter-Based Techniques [Datar02, Dem02, Man02, Karp03, Met05, Pani07]
  • Monitor only a subset of the elements seen
  • Maintain counters for these elements
  • Heuristics limit the space to O(1/ε) or O((1/ε) log(εN))
  • Handling deletions is not trivial [Corm03]
  • Space is O(k log k log m), with O(k log k log m) time for reporting
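As a concrete illustration of the counter-based idea above, here is a minimal Python sketch in the style of the Frequent / Misra-Gries algorithm [Karp03] (a sketch of the general technique, not the exact code of any cited paper): with k counters, every element whose frequency exceeds N/(k+1) is retained.

    # Counter-based frequent elements (Misra-Gries / Frequent style [Karp03]).
    # Keeps at most k counters; counts are underestimates of true frequencies.
    def frequent(stream, k):
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k:
                counters[x] = 1
            else:
                # No free counter: decrement all and drop those that hit zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters   # superset of all elements with frequency > N/(k+1)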

9
Top K
  • Return the top k elements based on some scoring function [Met05, Das07]
  • Top-k in the sliding window
  • Top-k in a window of data items
  • Window as number of items or based on time
  • Distributed Top-k Monitoring [Bab03]
  • Answer queries when monitors are distributed
  • Goal is to minimize communication and support
    continuous queries
  • Use filters and constraints to limit communication
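For intuition about the window semantics, the sketch below (illustrative Python; it stores the whole window, unlike the space-efficient summaries discussed here) maintains a count-based sliding window and reports the top-k after every arrival:

    # Top-k over a count-based sliding window (stores the window for clarity).
    from collections import Counter, deque

    def window_topk(stream, window_size, k):
        window = deque()
        counts = Counter()
        for x in stream:
            window.append(x)
            counts[x] += 1
            if len(window) > window_size:
                old = window.popleft()
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]
            yield counts.most_common(k)   # continuous answer after each arrival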

10
(No Transcript)
11
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

12
New Hardware Paradigms
  • Network Processing Units (NPU)
  • Ternary Content Addressable Memory (TCAM)
  • Chip Multiprocessors (CMP)
  • Graphics Processing Units (GPU)
  • Cell Broadband Engine

13
Network Processing Unit (NPU)
  • Provides extensive parallelism, supporting up to 10 Gb/s line rates
  • Examples: Intel IXP family, AMD NPs, Cisco QuantumFlow
  • The IXP 2855 NPU provides 16 MicroEngines (MEs), each operating at up to 1.5 GHz
  • Each ME has 8 hardware thread contexts
  • MEs are designed for simple data-plane operations and have a simple instruction set
  • The XScale core supports a much more diverse instruction set and is used as the control-plane processor
  • Built-in hashing and cryptography units
  • [Gold05] used the TLP of an NPU to accelerate database operations such as sequential scan and hash-based join

14
(No Transcript)
15
Ternary Content Addressable Memory
  • Provides constant-time lookups (searches based on content)
  • The ternary capability provides "don't care" bits that allow range matches
  • The IDT 75K62134 chip has 256K 36-bit words
  • Programmable word size
  • From 36 bits up to 576 bits
  • Supports pipelining of requests
  • 128 request contexts supported
  • 100 million searches per second
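The ternary lookup can be pictured with a small software model (illustrative Python; a real TCAM compares the key against all stored words in parallel and returns the highest-priority match in a single cycle):

    # Functional model of a TCAM: each entry is a (value, mask) pair.
    # Mask bits set to 1 must match the key; mask bits set to 0 are "don't care".
    def tcam_lookup(entries, key):
        for index, (value, mask) in enumerate(entries):
            if (key & mask) == (value & mask):
                return index      # hardware checks all entries simultaneously
        return None

    # Example: match any 8-bit key whose top 4 bits are 1010.
    entries = [(0b10100000, 0b11110000)]
    assert tcam_lookup(entries, 0b10101111) == 0
    assert tcam_lookup(entries, 0b01101111) is None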

16
Applications of TCAM
  • TCAM-Conscious Heavy Distinct Hitters [Ban07b]
  • TCAM as a hardware implementation of a hash table
  • Adapted the 1-level filtering of [Ven05] to the TCAM setting
  • Acceleration due to the low cost of look-ups
  • Experiments on NLANR repository data confirm the intuition
  • Accelerating Database Operations [Ban05, Ban06]
  • Suggest a CAM-Cache architecture to interface a TCAM with the CPU
  • Modify the nested loop join to exploit O(1) lookups
  • Efficient sorting and range intersection [Pani03]
  • TCAM-Conscious Frequent Elements [Ban07a]

17
Chip Multi-Processors (CMP)
  • Architecture characterized by multiple cores on a
    single die and shared cache between cores
  • Two broad categories
  • Lean Camp
  • Simple design of cores
  • Rely on Thread Level Parallelism
  • Sun UltraSPARC T1 & T2, Compaq Piranha
  • Fat Camp
  • Target maximum single thread performance
  • Larger cores compared to LC
  • Intel Core 2 Quad, IBM Power6

18
Chip Multi-Processors (CMP)
  • Adaptive Aggregation on CMPs [Cie07a]
  • Uses the shared cache in a CMP to provide a hash-based aggregation algorithm
  • Local hash table approach (best raw performance; sketched after this list)
  • Shared hash table (plagued by synchronization issues)
  • Hybrid approach with adaptive harness (best of both worlds)
  • [Har07] provided good experimental observations for a database server on a CMP
  • Under saturated workloads, the lean camp hides memory latencies better
  • L2 hit latency is a bottleneck for chips with larger caches
  • Parallel Buffers on CMP [Cie07b]
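The local-hash-table strategy mentioned above can be sketched as follows (illustrative Python, not the code of [Cie07a]): each worker aggregates its partition into a private table with no synchronization, and the partial tables are merged once at the end.

    # Local-table aggregation: one private table per worker, merged at the end.
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def aggregate(partitions):
        def local_aggregate(rows):
            table = Counter()              # private table: no locking needed
            for key, value in rows:
                table[key] += value
            return table

        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(local_aggregate, partitions))

        result = Counter()
        for partial in partials:           # single merge pass over partial tables
            result.update(partial)
        return result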

19
(No Transcript)
20
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

21
Integrated Frequent Elements/Top-K
  • Idea developed in [Met05], called Space-Saving
  • The authors develop a new counter-based technique
  • Count occurrences of elements and keep them sorted
  • The number of counters depends on the precision sought by the user
  • An occurrence of a monitored element increments its counter
  • A new element replaces the element with the minimum count
  • The intuition is that high-frequency elements will never be replaced
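A compact sketch of this idea (illustrative Python; it omits the Stream-Summary bucket structure shown on the next slides, so finding the minimum here is a linear scan):

    # Space-Saving [Met05] sketch with m counters.
    def space_saving(stream, m):
        counts = {}     # element -> estimated count
        errors = {}     # element -> maximum possible overestimation
        for x in stream:
            if x in counts:
                counts[x] += 1
            elif len(counts) < m:
                counts[x] = 1
                errors[x] = 0
            else:
                victim = min(counts, key=counts.get)   # current minimum counter
                min_count = counts.pop(victim)
                errors.pop(victim)
                counts[x] = min_count + 1   # new element inherits the minimum
                errors[x] = min_count       # its count may be off by min_count
        return counts, errors

The linear scan for the minimum is exactly what the sorted Stream-Summary structure on the following slides avoids.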

22
Space-Saving By Example [Met05, Met06]
Courtesy Ahmed Metwally
  • Elements should be kept sorted to identify the minimum in constant time

23
Stream Summary Data Structure [Dem02, Met05, Met06]
(Figure: Stream-Summary buckets ordered by frequency, f1 < f2 < f3. When element e with count f1 appears in the stream: if f2 > f1 + 1, a new bucket is added between f1 and f2; assuming f2 = f1 + 1, e moves into the existing f2 bucket.)
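A simplified model of the structure (illustrative Python; [Met05] uses a doubly linked list of buckets so every step is O(1), whereas plain dictionaries are used here only for brevity):

    # Simplified Stream-Summary: buckets keyed by frequency hold the elements
    # with exactly that count, so the minimum is always in the lowest bucket.
    class StreamSummary:
        def __init__(self):
            self.buckets = {}    # frequency -> set of elements with that count
            self.freq = {}       # element -> current frequency

        def increment(self, e):
            f = self.freq.get(e, 0)
            if f:                                # detach e from its old bucket
                self.buckets[f].discard(e)
                if not self.buckets[f]:
                    del self.buckets[f]
            self.freq[e] = f + 1
            # A new bucket for f+1 is created only if it does not exist yet.
            self.buckets.setdefault(f + 1, set()).add(e)

        def minimum(self):
            return min(self.buckets)   # O(1) with the linked-list bucket layout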
24
TCAM Adaptation of Space Saving [Ban07a]
  • Need to maintain counters per element
  • Stream elements looked up efficiently using TCAM
  • Need to keep track of minimum element
  • Element frequencies also stored in TCAM
  • Minimum frequency can be looked up efficiently
  • Fast and Efficient
  • But elements are not sorted
  • Cannot answer top-k queries
  • Adaptation for continuous queries not easy

Figure taken from [Ban07a]. SS = Space Saving [Met05], LC = Lossy Counting [Man02]
25
TCAM Adapted Stream Summary
(Figure: stream summary split between SRAM and TCAM)
  • For each stream element, we have to look it up
  • Use the TCAM to store the elements to look them
    up in O(1) time

26
Using the Parallelism of NPU
  • NPU has 16 Micro Engines (MEs)
  • Previous solution uses only a single ME
  • Challenges in using multiple MEs
  • Shared resources (TCAM) need synchronization
  • Synchronization leads to performance degradation
  • To avoid conflict, split the TCAM
  • Merge the counters to produce the global result
  • Splitting increases space overhead
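One way to picture the split-and-merge approach (an illustrative sketch under the stated assumptions, reusing the space_saving function sketched earlier; it is not the implementation of [Ban07a]): hash-partition the stream so that all copies of an element land in the same partition, run an independent summary per micro-engine, and sum the per-partition counters for the global result.

    # Split-and-merge: p independent Space-Saving instances, one per partition.
    def parallel_space_saving(stream, p, m):
        partitions = [[] for _ in range(p)]
        for x in stream:
            partitions[hash(x) % p].append(x)    # same element -> same partition

        merged = {}
        for part in partitions:
            counts, _ = space_saving(part, m)    # from the earlier sketch
            for element, count in counts.items():
                merged[element] = merged.get(element, 0) + count
        return merged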

27
Experiments
  • Experiments are performed on the Intel NPU platform IXDP 2801
  • Development using the Teja NP Application Development Environment and the Intel Internet Exchange Architecture (IXA)
  • Involves programming in two semi-low-level languages: Teja C and MicroC
  • Synthetic data: Zipfian distribution with varying Zipfian factors

28
Experimental Platform
  • IXDP 2801 development platform from Intel with an integrated IDT TCAM chip

29
Some Preliminary Results
30
Some Preliminary Results
  • Analysis of performance
  • Pointer manipulation overhead: we have to live with it
  • Words in the TCAM not aligned along boundaries? Experiments revealed this is not the case
  • Deletions are not O(1) per stream element: this is indeed the culprit
  • The present TCAM word width supports only singly linked lists
  • Adding support to handle doubly linked lists (see the sketch below)
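The cost difference behind this fix, in a minimal illustration (plain Python, not the TCAM data layout itself): deleting from a singly linked list requires a scan to find the predecessor, whereas a node in a doubly linked list can be unlinked in O(1) given a reference to it.

    # O(1) unlink from a doubly linked list, given the node itself.
    class Node:
        def __init__(self, value):
            self.value = value
            self.prev = None
            self.next = None

    def unlink(node):
        if node.prev:
            node.prev.next = node.next
        if node.next:
            node.next.prev = node.prev
        node.prev = node.next = None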

31
Experiments for Parallelism
32
Naïve Synchronization
Naïve Synchronization Ruins Parallelism
33
Open Questions
  • What would be an efficient synchronization
    scheme?
  • Can we do away with synchronization?
  • For a shared structure without synchronization,
    we have Lost Updates
  • Can we provide a bound for error introduced?
  • Challenges with merge
  • How do we merge unsorted lists to generate a
    sorted list?
  • How often do we merge?
  • Do we lose information during merge?

34
(No Transcript)
35
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

36
Conclusion
  • Data Streams form an important class of
    applications with specific needs
  • Increased data rates and on-line answering
    constraints necessitate acceleration
  • New hardware paradigms (like TCAMs, multi-core
    processors) can be exploited to accelerate these
    operations
  • Recent Trends in Multi-Core Chip Design also
    advocate development of algorithms to exploit
    these new architectures
  • NPU (parallelism bundled with TCAMs) provides a
    good framework

37
Discussions
  • The increasing popularity of TCAMs might lead to their being considered commodity chips (just like GPUs)
  • Multi-core architectures (16 to 64 cores) bring forward new frontiers for exploiting parallelism
  • Adapt additional stream operators to best exploit these advanced features
  • Vision: design a data management system leveraging modern hardware paradigms to answer a diverse set of queries efficiently and quickly

38
Acknowledgements
  • My advisors and my committee members
  • Computer Science Dept at UCSB
  • Colleagues at DSL and at UCSB

39
References (I)
  • [Ban07a] Bandi et al., Fast Data Stream Algorithms using Associative Memory, SIGMOD 2007
  • [Cie07a] Cieslewicz et al., Adaptive Aggregation on Chip Multiprocessors, VLDB 2007
  • [Ged07] Gedik et al., Executing Stream Joins on the Cell Processor, VLDB 2007
  • [Met05] Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT 2005
  • [Ban05] Bandi et al., Hardware Acceleration of Database Operations Using Content-Addressable Memories, DaMoN 2005
  • [Gold05] Gold et al., Accelerating Database Operators Using a Network Processor, DaMoN 2005
  • [Datar02] Datar et al., Maintaining Stream Statistics over Sliding Windows, SODA 2002
  • [Das07] Das et al., Ad-hoc Top-k Query Answering for Data Streams, VLDB 2007
  • [Ven05] Venkataraman et al., New Streaming Algorithms for Fast Detection of Superspreaders, NDSS 2005
  • [Shri04] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks, SenSys 2004
  • [Corm03] Cormode et al., What's Hot and What's Not: Tracking Most Frequent Items Dynamically, PODS 2003

40
References (II)
  • [Pani07] Panigrahy et al., Finding Frequent Elements in Non-Bursty Streams, ESA 2007
  • [Mot03] Motwani et al., Query Processing, Approximation, and Resource Management in a Data Stream Management System, CIDR 2003
  • [Corm04] Cormode et al., Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Streams, SIGMOD 2004
  • [Corm08] Cormode et al., Finding Hierarchical Heavy Hitters in Streaming Data, ACM TKDD 2008
  • [Li07] Li et al., Stochastic Simulation of Biochemical Systems on the Graphics Processing Unit, Bioinformatics, 2007
  • [Dem02] Demaine et al., Frequency Estimation of Internet Packet Streams with Limited Space, ESA 2002
  • [Cie07b] Cieslewicz et al., Parallel Buffers for Chip Multiprocessors, DaMoN 2007
  • [Man02] Manku et al., Approximate Frequency Counts over Data Streams, VLDB 2002
  • [Ban07b] Bandi et al., Fast Algorithms for Heavy Distinct Hitters using Associative Memories, ICDCS 2007
  • [Har07] Hardavellas et al., Database Servers on Chip Multiprocessors: Limitations and Opportunities, CIDR 2007
  • [Corm06] Cormode et al., Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams, PODS 2006

41
References (III)
  • [Col07] Colohan et al., CMP Support for Large and Dependent Speculative Threads, IEEE Trans. on Parallel and Distributed Systems, 2007
  • [Yu04] Yu et al., Efficient Multi-Match Packet Classification with TCAM, High Performance Interconnects (HotI) 2004
  • [Held06] Held et al., From a Few Cores to Many: A Tera-scale Computing Research Overview, Intel White Paper, 2006
  • [Ram06] Ramanathan, Intel Multicore Processors: Leading the Next Digital Revolution, Intel White Paper, 2006
  • [Ban04] Bandi et al., Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations, VLDB 2004
  • [Karp03] Karp et al., A Simple Algorithm for Finding Frequent Elements in Streams and Bags, TODS 2003
  • [Char02] Charikar et al., Finding Frequent Items in Data Streams, ICALP 2002
  • [Gilb02] Gilbert et al., Fast, Small-Space Algorithms for Approximate Histogram Maintenance, STOC 2002
  • [Fang98] Fang et al., Computing Iceberg Queries Efficiently, VLDB 1998
  • [Ross07] Ross, Efficient Hash Probes on Modern Processors, ICDE 2007
  • [Gre01] Greenwald et al., Space-Efficient Online Computation of Quantile Summaries, SIGMOD 2001
  • [Ban06] Bandi et al., Fast Computation of Database Operations Using CAM, DEXA 2006

42
References (IV)
  • [Gov05] Govindaraju et al., Fast and Approximate Stream Mining of Quantiles and Frequencies using Graphics Processors, SIGMOD 2005
  • [He08] He et al., Relational Joins on Graphics Processors, SIGMOD 2008 (to appear)
  • [Met06] Metwally et al., An Integrated Efficient Solution for Computing Frequent and Top-k Elements in Data Streams, ACM TODS, Sept 2006
  • [Corm06b] Cormode et al., What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams, ICDE 2006
  • [Bab03] Babcock et al., Distributed Top-k Monitoring, SIGMOD 2003
  • [Shar02] Sharma et al., Sorting and Searching using TCAMs, IEEE HotI 2002
  • [Pani03] Panigrahy et al., Sorting and Searching using TCAMs, IEEE Micro 2003
  • [Akh07] Akhbarizadeh et al., A TCAM Based Parallel Architecture for High Speed Packet Forwarding, IEEE Trans. on Computers, 2007
  • [Est06] Estan et al., Bitmap Algorithms for Counting Active Flows on High Speed Links, IEEE/ACM Trans. on Networking, Oct 2006
  • [Kum07] Kumar et al., On Finding Frequent Elements in a Data Stream, APPROX/RANDOM 2007
  • [Mou06] Mouratidis et al., Continuous Monitoring of Top-k Queries over Sliding Windows, SIGMOD 2006

43
ThanQ
Questions
44
Backup Slides
45
TCAM Adapted Stream Summary: Key Observations
  • Elements are sorted by frequency
  • No overhead for maintaining the minimum
  • Frequency counting is within some error bound
  • Can answer both frequent-elements and top-k queries
  • Can support continuous queries
  • The presence of the TCAM accelerates look-ups
  • Sorting needs pointer manipulations, which add overhead

46
Hot Items
  • Hot items are those that appear in a significant fraction of the stream
  • We are looking at frequencies above N/(k+1), where N is the length of the stream seen so far and k is a parameter
  • Frequency counting algorithms such as [Dem02, Man02] can be used to answer these queries
  • But they cannot handle deletions in the stream
  • [Corm03] provides an algorithm that efficiently handles deletions
  • The input space is divided into subsets
  • Transactions result in incrementing/decrementing the counters of the appropriate subsets
  • Tests determine whether a set contains a hot item
  • Test results are combined to report the hot items
  • Space is O(k log k log m), with O(k log k log m) time for reporting
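The bit-group counting idea behind [Corm03] can be illustrated in its simplest form, the majority case k = 1 (a hedged sketch, not the full algorithm, which repeats this test within chosen subsets to handle larger k): keep the total count plus one counter per bit position of the item identifiers; insertions increment and deletions decrement, and the candidate hot item is read off bit by bit.

    # Simplified group-testing tracker for a single hot (majority) item,
    # supporting insertions and deletions. ID_BITS bounds the item identifiers.
    ID_BITS = 32

    class MajorityTracker:
        def __init__(self):
            self.total = 0
            self.bit_counts = [0] * ID_BITS   # items seen with bit j set

        def update(self, item, delta):        # delta = +1 insert, -1 delete
            self.total += delta
            for j in range(ID_BITS):
                if (item >> j) & 1:
                    self.bit_counts[j] += delta

        def candidate(self):
            item = 0
            for j in range(ID_BITS):
                if 2 * self.bit_counts[j] > self.total:
                    item |= 1 << j
            return item   # correct whenever some item occurs > total/2 times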

47
Quantiles
  • Very important for summarization of streams
  • Can provide a wide range of statistics about the
    stream
  • Median
  • Percentiles
  • Distribution of the elements in the stream
  • A deterministic algorithm with a space bound of O((1/ε) log(εN)) was suggested in [Gre01]
  • [Shri04] suggested an algorithm with space complexity O((1/ε) log |U|), where U is the alphabet
  • [Corm06] suggested an algorithm for answering biased quantile queries using O((1/ε) log |U| log(εN)) space

48
Heavy Distinct Hitters
  • Naïve approach consumes a lot of space
  • Bitmap-based techniques [Est06] (sketched after this list)
  • Flow IDs are hashed into a bitmap; the flow count is estimated from the number of set bits
  • To reduce space, sampling is used
  • Multi-resolution sampling reduces the dependence on stream length
  • Simple and fast, but prone to error
  • Hash-based technique [Ven05]
  • Uses different levels of filtering on the stream
  • Multi-level filtering is complex, but space-efficient
  • Streams are sampled to find hosts with multiple connections
  • Probabilistic bounds on accuracy based on sampling rates
  • Requires tuning of a huge number of parameters
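The direct-bitmap estimator of [Est06], with the standard linear-counting correction, can be sketched as follows (illustrative Python; the sampled and multi-resolution variants mentioned above reduce the memory further): hash each flow ID to a bit and estimate the number of distinct flows from the fraction of bits that remain zero.

    # Direct bitmap with a linear-counting estimate of distinct flows.
    import math

    def estimate_distinct(flow_ids, m=4096):
        bitmap = [0] * m
        for flow in flow_ids:
            bitmap[hash(flow) % m] = 1
        zero_bits = bitmap.count(0)
        if zero_bits == 0:
            return float('inf')               # bitmap saturated: use a larger m
        return -m * math.log(zero_bits / m)   # n is approximately -m * ln(z/m)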

49
Graphics Processing Units (GPU)
  • Characterized by high parallelism and high memory
    bandwidth
  • ~200x more processing than an Intel Core 2 Duo E6700 2.6 GHz [Gov05, Li07]
  • Memory bandwidth of ~100 GB/s (10-15 GB/s for Intel CPUs)
  • Applications
  • Stream mining for quantile and frequency estimation [Gov05]
  • Develops a GPU-aware sorting algorithm, which forms the core of computing the summaries
  • Uses Lossy Counting [Man02] for frequency estimation and [Gre01] for quantiles
  • [Ban04] used GPUs to accelerate spatial database operations

50
Cell Broadband Engine
  • Intended for game consoles and multimedia-rich consumer devices; a product of STI (Sony, Toshiba, IBM)
  • Typically contains 1 Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs)
  • PPE
  • Dual-threaded 64-bit RISC processor; runs the system software
  • SPE
  • 128-bit RISC processor specialized for data-rich, compute-intensive SIMD applications
  • Provides instruction-level parallelism in the form of a dual pipeline
  • Band joins on stream windows have been parallelized on the Cell processor [Ged07]
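The band-join predicate used in [Ged07] pairs tuples whose join attributes differ by at most a band; on sorted windows it reduces to a forward-moving lower bound, which is the kind of regular work that maps well onto SPEs. A sequential sketch of the predicate only (illustrative Python, no Cell-specific code):

    # Band join of two sorted windows: emit (r, s) pairs with |r - s| <= band.
    def band_join(r_window, s_window, band):
        result = []
        start = 0
        for r in r_window:                    # both windows sorted ascending
            while start < len(s_window) and s_window[start] < r - band:
                start += 1                    # these s values can never match again
            i = start
            while i < len(s_window) and s_window[i] <= r + band:
                result.append((r, s_window[i]))
                i += 1
        return result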

51
TCAM Architecture
Courtesy Banit Agrawal
52
Space Saving
  • An element ei with frequency fi > min must exist in the Stream-Summary
  • Assuming no specific distribution or user-supplied support, to find all frequent elements with error ε, the Space-Saving algorithm uses a number of counters bounded by min(|A|, 1/ε)
  • Any element ei with frequency fi > εN is guaranteed to be in the Stream-Summary
  • Zipfian Data
  • Zipfian data with parameter α has the frequency distribution fi = N / (i^α ζ(α)), where ζ(α) = Σ(j=1..|A|) 1/j^α
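  • Worked example (illustrative): with error ε = 0.001, the distribution-free bound above gives at most 1/ε = 1,000 counters, independent of the stream length and of the alphabet size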

53
Comparison of Frequency Counting Techniques
  • Sticky Sampling [Man02]
  • Lossy Counting [Man02]
  • GroupTest [Corm03]
  • CountSketch [Char02]
  • Misra-Gries
  • Frequent [Karp03]
  • SpaceSaving [Met05]
  • [Pani07]

54
Parallel Computing
  • Amdahl's Law
  • Multi-core computing: multiple execution units
  • Symmetric Multiprocessing (SMP)
  • Memory architecture where two or more identical processors are connected to a single shared main memory
  • In multi-core architectures, SMP refers to the
    shared cache
  • Advantage of CMP over SMP is the presence of
    on-chip cache coherence hardware
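  • Worked example (illustrative): Amdahl's Law gives speedup = 1 / ((1 - p) + p/n); with a parallel fraction p = 0.9 on n = 16 cores, the speedup is 1 / (0.1 + 0.05625) ≈ 6.4, so the serial fraction (e.g., synchronization on shared structures) quickly dominates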

55
References (V)
  • [Her04] Hershberger et al., Adaptive Spatial Partitioning for Multidimensional Streams, ISAAC 2004
  • [Her05] Hershberger et al., Space Complexity of Hierarchical Heavy Hitters in Multi-Dimensional Data Streams, PODS 2005
  • [Pani02] Panigrahy et al., Reducing TCAM Power Consumption and Increasing Throughput, IEEE HotI 2002
  • [Lak05] Lakshminarayanan et al., Algorithms for Advanced Packet Classification with Ternary CAMs, SIGCOMM 2005