Heartbeat Mechanism and its Applications in Gigascope

1
Heartbeat Mechanism and its Applications in
Gigascope
  • Vladislav Shkapenyuk (speaker),
  • Muthu S. Muthukrishnan
  • Rutgers University
  • Theodore Johnson
  • Oliver Spatscheck
  • AT&T Labs Research

2
Unblocking streaming operators
  • Data stream management systems (DSMS) work with
    infinite streams of tuples
  • How to get answers out of join, aggregation,
    etc., before the end of time?
  • limit the scope of output tuples that an input
    tuple can affect
  • Two views
  • define a window over the input streams for the
    blocking operators (STREAM, TelegraphCQ)
  • use a pipelined operator that makes use of an
    existing sort order (Gigascope, Tribeca)
  • most queries make reference to timestamps

3
Unblocking streaming operators
  • Some stream attributes are labeled with temporal
    properties (e.g., monotone increasing)
  • In an aggregation query, one grouping attribute
    must have timestamp-ness (a temporal property)
  • SELECT tb, srcIP, count(*) FROM TCP
    GROUP BY time/60 as tb, srcIP
  • tb is inferred to be monotone increasing too
  • Similarly, stream merge (union) and join also
    need a set of attributes that have temporal
    properties
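
As a rough illustration of why the temporal property matters, the sketch below counts packets per (tb, srcIP) and emits a bucket as soon as tb advances. It is a minimal Python sketch, not Gigascope code: the function name, the (time, srcIP) input layout, and the sample data are assumptions made for the example.

```python
from collections import Counter

def pipelined_count(packets, bucket_seconds=60):
    """Yield (tb, srcIP, count) rows as soon as the time bucket advances.

    Assumes `packets` is an iterable of (time, srcIP) pairs whose `time`
    values never decrease; the names are illustrative only.
    """
    current_tb = None
    counts = Counter()
    for time, src_ip in packets:
        tb = time // bucket_seconds  # tb inherits the ordering of time
        if current_tb is not None and tb > current_tb:
            # No future tuple can land in the old bucket, so emit it now.
            for ip, n in sorted(counts.items()):
                yield (current_tb, ip, n)
            counts.clear()
        current_tb = tb
        counts[src_ip] += 1
    # The last bucket stays open: nothing has signalled that it is complete.

rows = list(pipelined_count([(5, "a"), (30, "b"), (59, "a"), (61, "a"), (130, "b")]))
```

The bucket containing time 130 is never emitted in this run, which is the blocking behavior the following slides address.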

4
What if a data stream stalls?
  • Consider a query that merges multiple streams
  • Presence of tuples carries temporal information,
    absence doesn't
  • memory overflow at merge
  • Similar issues with every operator with multiple
    input streams (e.g. joins)

5
Stream Punctuations
  • Unblock operators by embedding special marks in
    the stream
  • indicate the end of a subset of the data
  • A stalled stream can notify the parent about the
    end of the epoch
  • Lots of issues
  • How can these punctuations be generated and
    propagated?
  • How do we integrate such a mechanism into a
    high-performance DSMS?

6
Gigascope Architecture
  • DSMS designed for monitoring high-rate data
    streams
  • pure stream database (no stored relations or
    continuous queries)
  • pipelined operators that rely on temporal
    properties of the stream
  • Two-layer architecture for early data reduction
  • fast lightweight data reduction queries (LFTA)
  • high level queries for expensive processing
    (HFTA)

[Figure: architecture diagram with the labels App, high (x2), low (x3), ring buffer, and NIC]

7
Pipelined Operators
  • Aggregation
  • SELECT tb, srcIP, count(*) FROM TCP
    GROUP BY time/60 as tb, srcIP
  • Merge operator performs a union of two streams R
    and S in a way that preserves timestamps
  • MERGE R.tb : S.tb
    FROM Inpackets R, Outpackets S
  • A join query on streams R and S must contain a
    join predicate such as R.tb = S.tb
  • SELECT R.sourceIP, R.tb, R.length_sum + S.length_sum
    OUTER_JOIN FROM Inpackets R, Outpackets S
    WHERE R.sourceIP = S.destIP and R.tb = S.tb

8
Gigascope heartbeats
  • Initially designed to collect statistics about
    operator load
  • Special messages propagated using the regular
    tuple routing mechanism
  • performance monitoring
  • failure detection

9
Unblocking operators using heartbeats
  • Stream punctuation mechanism
  • injects special temporal update tuples into the
    operator's output stream
  • notifies the operator about the end of a subset
    of the data (the end of the time window on which
    aggregation, stream merge, and join operate)
  • Heartbeats are the perfect vehicles for carrying
    the temporal update tuples
  • regular propagation through the operator DAG
  • unblock all operators on their way in a timely
    manner

10
Temporal update tuples
  • Temporal update tuples generated by an operator
    have a schema identical to that of regular tuples
  • only values of temporal attributes are
    initialized (the rest are ignored)
  • future tuples are guaranteed not to violate
    temporal properties of the stream
  • Operator output schema
  • (Timebucket, SrcIP, DestIP, PacketCount)
  • Timebucket is monotone increasing
  • Temporal tuple
  • (T, Uninitialized, Uninitialized, Uninitialized)
  • guarantees that all future tuples will have a
    Timebucket value >= T

11
Heartbeat generation
  • Naïve solution
  • operators emit the last produced tuple cast as a
    temporal tuple
  • too conservative to be useful: heartbeats don't
    carry any additional information
  • Goal: aggressively generate the values of
    temporal attributes
  • set attributes to the maximum values we can
    safely guarantee

12
Heartbeat generation
  • Two approaches
  • infer the values of temporal update tuples based
    on the tuples the operator has received so far
  • infer based on system time
  • Inference based on received tuples
  • works when operators observe some tuples, even if
    selection predicates filter them out
  • works on every level of query execution
  • Inference based on system clock
  • works even with completely stalled streams
  • only for time-based temporal attributes
  • potentially dangerous

13
Inferring temporal attributes
  • Every operator maintains the state required to
    correctly generate temporal update tuples
  • last seen values of all temporal attributes
    referenced in the SELECT clause
  • operator-specific state
  • Attribute values for temporal tuples are computed
    using inference rules
  • SELECT tb, srcIP, count(*) FROM TCP
    GROUP BY time/60 as tb, srcIP
  • If the last seen value of time is X, infer that
    the value of tb for the temporal update tuple
    should be X/60
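
The inference rule for this query can be sketched in a few lines of Python. The class name, the sample tuples, and the selection predicate are assumptions for illustration; the point is that the last seen time is recorded for every input tuple, whether or not it survives filtering.

```python
class TemporalState:
    """Tracks the last seen `time` and infers `tb` for a heartbeat."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.last_time = None

    def observe(self, time):
        # Called for every input tuple, including ones a predicate later drops.
        self.last_time = time

    def infer_tb(self):
        # Rule for GROUP BY time/60 as tb: last seen X gives tb = X/60.
        if self.last_time is None:
            return None  # nothing seen yet, so nothing can be promised
        return self.last_time // self.bucket_seconds

state = TemporalState()
kept = []
for time, src_ip in [(100, "a"), (185, "b"), (250, "c")]:
    state.observe(time)
    if src_ip == "a":  # hypothetical selection predicate
        kept.append((time, src_ip))
```

Only the first tuple passes the predicate here, yet the heartbeat can still advance tb to 250/60.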

14
Inferring temporal attributes
  • What if the stream is completely stalled?
  • cannot advance values of temporal attributes
  • Inference based on system time
  • works if the temporal attribute can be correlated
    with the system clock (usually the case in network
    streams)
  • unsafe for high-level operators (need to reason
    about propagation delays)
  • need to be careful about clock skew
  • Gigascope uses skew information entered by the
    admin to infer the values of temporal attributes
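
A clock-based bound might look like the following Python sketch. It is not Gigascope's actual rule: the function name, the `max_skew` and `max_delay` parameters, and the assumption that timestamps are seconds on the source's clock are all introduced here for illustration.

```python
def clock_based_bound(last_seen, now, max_skew, max_delay=0.0):
    """Lower bound on future timestamps, derived from the system clock.

    Assumptions (not from the slides): timestamps are seconds on the
    source's clock, max_skew bounds the difference between the source clock
    and the local clock, and max_delay bounds how long a tuple can take to
    reach this operator.
    """
    from_clock = now - max_skew - max_delay
    if last_seen is None:
        return from_clock
    # Never fall back below what the stream itself has already shown.
    return max(last_seen, from_clock)

bound = clock_based_bound(last_seen=1000.0, now=1030.0, max_skew=2.0, max_delay=1.0)
tb = int(bound) // 60  # temporal attribute for GROUP BY time/60 as tb
```

Underestimating `max_skew` or `max_delay` would produce a heartbeat that later tuples violate, which is one way to read the "potentially dangerous" warning.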

15
Selection and merge operators
  • Selection operator (filtering)
  • saves the last seen values of temporal attributes
    regardless of whether the tuple passes the
    selection predicate
  • Merge (stream union)
  • combines multiple streams while preserving
    ordering properties
  • requires buffering of input streams
  • maintains the maximum timestamp value observed on
    every input: S1_max, S2_max, ..., Sn_max
  • uses MIN(S1_max, S2_max, ..., Sn_max) to generate
    the temporal update tuple
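
The merge logic can be sketched as below, assuming a single integer timestamp per tuple. The class and method names are hypothetical; the sketch tracks the per-input maxima, releases buffered tuples up to their minimum, and treats a heartbeat as a timestamp update with no payload.

```python
import heapq

class Merge:
    """Order-preserving union of n inputs, unblocked by heartbeats."""

    def __init__(self, n):
        self.s_max = [None] * n   # S1_max .. Sn_max
        self.heap = []            # buffered tuples ordered by timestamp
        self.seq = 0              # tie-breaker so payloads are never compared

    def _low_watermark(self):
        if any(m is None for m in self.s_max):
            return None           # some input has not reported anything yet
        return min(self.s_max)

    def push(self, i, ts, payload=None, heartbeat=False):
        """Feed a tuple or heartbeat from input i; return tuples safe to emit."""
        self.s_max[i] = ts if self.s_max[i] is None else max(self.s_max[i], ts)
        if not heartbeat:
            heapq.heappush(self.heap, (ts, self.seq, payload))
            self.seq += 1
        low = self._low_watermark()
        out = []
        while low is not None and self.heap and self.heap[0][0] <= low:
            ts_out, _, p = heapq.heappop(self.heap)
            out.append((ts_out, p))
        return out

    def heartbeat(self):
        # Value for the temporal update tuple on the merged output stream.
        return self._low_watermark()

m = Merge(2)
a = m.push(0, 10, "r1")             # input 1 is silent: must buffer
b = m.push(0, 20, "r2")             # buffer keeps growing
c = m.push(1, 15, heartbeat=True)   # heartbeat from the stalled input
```

Without the heartbeat on input 1, both tuples from input 0 would stay buffered indefinitely; with it, everything up to timestamp 15 is released.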

16
Aggregation and sampling operators
  • Maintain a hash table of aggregates for the
    current time window
  • when the time window advances, the table content
    is flushed
  • use traffic shaping (slow flush) to avoid
    flushing excessive amounts of data at once
  • Slow flush can lead to incorrect generation of
    temporal tuples
  • if there are unflushed tuples in the hash table,
    generate temporal tuples based on the unflushed
    tuples
  • otherwise use the last seen values saved by the
    operator
17
Join operators
  • Stream join between R and S relates a timestamp
    from R to a timestamp in S (e.g. R.ts = S.ts)
  • critical for guaranteeing bounded memory
  • supports inner, left, right, and full outer
    equi-joins
  • Maintains the maximum values of timestamps
    observed on each stream (Rmax and Smax)
  • Rmax and Smax can be composite structures storing
    the max values of all attributes that are part of
    the timestamp
  • Infers the values of attributes of temporal
    update tuples based on MIN(Rmax, Smax)
18
Experimental Evaluation
  • Two main data feeds
  • DAG4.3GE Gigabit Ethernet interfaces
  • 100,000 packets/sec (about 400Mbit/sec)
  • One low-rate control data feed
  • 100Mbit interface
  • Good representative of backup interface
  • Dual 2.8 GHz P4 server w/ 4 GB of RAM, FreeBSD
    4.8

19
Merge Query
  • High-level aggregation
  • SELECT tb, protocol, srcIP, destIP, srcPort, destPort, count(*)
    FROM DataProtocol
    GROUP BY time/10 as tb, protocol, srcIP, destIP, srcPort, destPort

[Figure: query plan with three low-level aggregations (inputs control, main1, main2), two stream merges, and the high-level aggregation]

20
Performance Evaluation
21
Outer Join Query
  • Query flow1:
  • SELECT tb, protocol, srcIP, destIP, srcPort, destPort, count(*) as cnt
    FROM main0_and_control.DataProtocol
    GROUP BY time/10 as tb, protocol, srcIP, destIP, srcPort, destPort
  • Query flow2:
  • SELECT tb, protocol, srcIP, destIP, srcPort, destPort, count(*) as cnt
    FROM main1.DataProtocol
    GROUP BY time/10 as tb, protocol, srcIP, destIP, srcPort, destPort
  • Query full_flow:
  • SELECT flow1.tb, flow1.protocol, flow1.srcIP,
      flow1.destIP, flow1.srcPort, flow1.destPort,
      flow1.cnt, flow2.cnt
    OUTER_JOIN FROM flow1, flow2
    WHERE flow1.srcIP = flow2.srcIP and
      flow1.destIP = flow2.destIP and
      flow1.srcPort = flow2.srcPort and
      flow1.destPort = flow2.destPort and
      flow1.protocol = flow2.protocol and
      flow1.tb = flow2.tb

22
Outer Join Query
[Figure: query plan with three low-level aggregations (inputs backup, main1, main2), a stream merge, two high-level aggregations, and the outer join]

23
Performance Evaluation
  • CPU load w/ heartbeats enabled: 37.5
  • CPU load w/ heartbeats disabled: 37.3

24
Other heartbeat applications
  • Fault tolerance
  • Heartbeats regularly propagate through query DAGs
  • Easy detection of failed nodes
  • System performance analysis
  • Every heartbeat message is timestamped by the
    receiving node
  • Timestamp traces are well suited to analyzing
    queuing delays
  • Distributed query optimization
  • Every heartbeat message carries runtime
    statistics (operator selectivities, sampling
    rates, in/out rates, memory footprint, etc.)
  • Collected statistics can be fed to a distributed
    query optimizer
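
Failure detection and delay analysis from heartbeat arrivals could be sketched as below. The monitor class, the timeout threshold, the node names, and the `sent_at` field are assumptions for illustration; the slides only state that heartbeats are timestamped by the receiving node.

```python
class HeartbeatMonitor:
    """Tracks heartbeat arrivals per node; names and thresholds are illustrative."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # node -> local receive time of the latest heartbeat
        self.delays = {}      # node -> list of (receive time - send time)

    def on_heartbeat(self, node, sent_at, received_at):
        self.last_seen[node] = received_at
        self.delays.setdefault(node, []).append(received_at - sent_at)

    def failed_nodes(self, now):
        # A node is suspected once no heartbeat has arrived within the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

    def mean_delay(self, node):
        d = self.delays[node]
        return sum(d) / len(d)

mon = HeartbeatMonitor(timeout=3.0)
mon.on_heartbeat("lfta1", sent_at=0.0, received_at=0.5)
mon.on_heartbeat("hfta1", sent_at=0.0, received_at=1.0)
mon.on_heartbeat("hfta1", sent_at=2.0, received_at=4.0)
failed = mon.failed_nodes(now=5.0)
delay = mon.mean_delay("hfta1")
```

Comparing send and receive times across nodes presumes reasonably synchronized clocks, which this sketch does not check.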

25
Conclusions
  • Punctuation-carrying heartbeats
  • effective at unblocking streaming operators on
    all levels
  • significantly reduce query memory utilization
  • capable of working at multiple-Gigabit line
    speeds
  • Variety of other uses
  • fault tolerance, performance analysis,
    distributed query optimization
  • Part of production version of Gigascope