Spark Streaming: Large-scale near-real-time stream processing

Provided by: sparkproj
Learn more at: https://spark.apache.org
Transcript and Presenter's Notes

1
Spark Streaming: Large-scale near-real-time
stream processing
  • Tathagata Das (TD)
  • UC Berkeley

2
What is Spark Streaming?
  • Framework for large-scale stream processing
  • Scales to 100s of nodes
  • Can achieve second-scale latencies
  • Integrates with Spark's batch and interactive
    processing
  • Provides a simple batch-like API for implementing
    complex algorithms
  • Can absorb live data streams from Kafka, Flume,
    ZeroMQ, etc.

3
Motivation
  • Many important applications must process large
    streams of live data and provide results in
    near-real-time
  • Social network trends
  • Website statistics
  • Intrusion detection systems
  • etc.
  • Require large clusters to handle workloads
  • Require latencies of a few seconds

4
Need for a framework
  • for building such complex stream processing
    applications
  • But what are the requirements
    for such a framework?

5
Requirements
  • Scalable to large clusters
  • Second-scale latencies
  • Simple programming model

6
Case study: Conviva, Inc.
  • Real-time monitoring of online video metadata
  • HBO, ESPN, ABC, SyFy, ...
  • Two processing stacks
  • Custom-built distributed stream processing system
  • 1000s of complex metrics on millions of video
    sessions
  • Requires many dozens of nodes for processing
  • Hadoop backend for offline analysis
  • Generating daily and monthly reports
  • Similar computation to the streaming system

7
Case study: XYZ, Inc.
  • Any company that wants to process live streaming
    data has this problem
  • Twice the effort to implement any new function
  • Twice the number of bugs to solve
  • Twice the headache
  • Two processing stacks
  • Custom-built distributed stream processing system
  • 1000s of complex metrics on millions of video
    sessions
  • Requires many dozens of nodes for processing
  • Hadoop backend for offline analysis
  • Generating daily and monthly reports
  • Similar computation to the streaming system

8
Requirements
  • Scalable to large clusters
  • Second-scale latencies
  • Simple programming model
  • Integrated with batch and interactive processing

9
Stateful Stream Processing
  • Traditional streaming systems have an event-driven,
    record-at-a-time processing model
  • Each node has mutable state
  • For each record, update state and send new records
  • State is lost if a node dies!
  • Making stateful stream processing
    fault-tolerant is challenging

10
Existing Streaming Systems
  • Storm
  • Replays a record if it is not processed by a node
  • Processes each record at least once
  • May update mutable state twice!
  • Mutable state can be lost due to failure!
  • Trident: uses transactions to update state
  • Processes each record exactly once
  • Per-state transaction updates are slow
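
The double-update problem with at-least-once replay can be illustrated without any streaming framework. The following is a minimal sketch in plain Scala collections, not Storm or Spark code; the object and function names are hypothetical. It contrasts a mutable counter that is updated on every delivery (including a replayed one) with a count computed as a pure function of the input.

```scala
// Sketch only: plain Scala, no Storm or Spark API involved.
object ReplayDoubleCount {
  // Record-at-a-time model: every delivery mutates the counter,
  // so a replayed record is counted again.
  def countWithMutableState(deliveries: Seq[String]): Map[String, Int] = {
    val state = scala.collection.mutable.Map.empty[String, Int]
    deliveries.foreach(word => state.update(word, state.getOrElse(word, 0) + 1))
    state.toMap
  }

  // Batch model: the result is a pure function of the input,
  // so recomputing it after a failure gives the same answer.
  def countDeterministically(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
}
```

If the record "dog" is delivered twice because an acknowledgement was lost, the mutable counter reports 2 for it, while recomputing the batch count from the original input reports 1 however many times it is rerun.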

11
Requirements
  • Scalable to large clusters
  • Second-scale latencies
  • Simple programming model
  • Integrated with batch and interactive processing
  • Efficient fault-tolerance in stateful computations

12
Spark Streaming
13
Discretized Stream Processing
  • Run a streaming computation as a series of very
    small, deterministic batch jobs

  • Chop up the live stream into batches of X seconds
  • Spark treats each batch of data as RDDs and
    processes them using RDD operations
  • Finally, the processed results of the RDD
    operations are returned in batches

[Figure: live data stream, Spark Streaming, batches of X seconds, Spark]
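
The idea of chopping a stream into small batches and running the same batch job on each one can be sketched in plain Scala. This is an illustration under stated assumptions, not the Spark Streaming implementation: `Record`, `discretize`, and `runBatches` are hypothetical names, timestamps are in milliseconds, and time intervals that receive no records simply produce no batch.

```scala
// Sketch only: plain Scala collections, not the Spark Streaming API.
object MicroBatching {
  // A record tagged with its (hypothetical) arrival time in milliseconds.
  final case class Record(timestampMs: Long, value: String)

  // Chop the stream into batches of `batchMs` milliseconds.
  // Batch k holds the records with timestamps in [k * batchMs, (k + 1) * batchMs).
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records.groupBy(r => r.timestampMs / batchMs).toSeq.sortBy(_._1)

  // Run the same deterministic job on every batch and return one result per batch.
  def runBatches[A](batches: Seq[(Long, Seq[Record])])(job: Seq[Record] => A): Seq[(Long, A)] =
    batches.map { case (batchId, batch) => batchId -> job(batch) }
}
```

For example, `MicroBatching.runBatches(MicroBatching.discretize(records, 1000L))(_.size)` would count the records in each one-second batch.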
14
Discretized Stream Processing
  • Run a streaming computation as a series of very
    small, deterministic batch jobs

  • Batch sizes as low as ½ second, latency of about 1
    second
  • Potential for combining batch processing and
    streaming processing in the same system

[Figure: live data stream, Spark Streaming, batches of X seconds, Spark]
15
Example 1: Get hashtags from Twitter
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream
of data
[Figure: Twitter Streaming API, tweets DStream; each batch stored in memory as an RDD (immutable, distributed)]
16
Example 1: Get hashtags from Twitter
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to
create another DStream
[Figure: tweets DStream, flatMap, new hashTags DStream (cat, dog, ...); new RDDs created for every batch]
17
Example 1: Get hashtags from Twitter
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))
  • hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage
[Figure: for each of batch @ t, batch @ t+1, batch @ t+2: tweets DStream, flatMap, hashTags DStream; every batch saved to HDFS]
18
Java Example
  • Scala
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))
  • hashTags.saveAsHadoopFiles("hdfs://...")
  • Java
  • JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
  • hashTags.saveAsHadoopFiles("hdfs://...")

Function object to define the transformation
19
Fault-tolerance
  • RDDs remember the sequence of operations that
    created them from the original fault-tolerant input
    data
  • Batches of input data are replicated in the memory of
    multiple worker nodes and are therefore fault-tolerant
  • Data lost due to worker failure can be
    recomputed from the input data

[Figure: tweets RDD (input data replicated in memory), flatMap, hashTags RDD (lost partitions recomputed on other workers)]
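
Lineage-based recovery can be illustrated with a small plain-Scala sketch. It is not how Spark stores lineage internally; it only shows the principle that a lost derived partition can be rebuilt by reapplying the recorded transformation to the replicated input. The body of `getTags` (split on whitespace, keep words starting with `#`) and the partition maps are assumptions made for the example.

```scala
// Sketch only: plain Scala maps stand in for partitioned RDDs.
object LineageRecovery {
  // Hypothetical stand-in for the getTags helper named on the slides.
  def getTags(status: String): Seq[String] =
    status.split("\\s+").filter(_.startsWith("#")).toSeq

  // The "lineage" of a hashTags partition: flatMap(getTags) over the tweets partition.
  def computePartition(tweets: Seq[String]): Seq[String] =
    tweets.flatMap(getTags)

  // Derive every hashTags partition from the input partitions.
  def computeAll(input: Map[Int, Seq[String]]): Map[Int, Seq[String]] =
    input.map { case (id, tweets) => id -> computePartition(tweets) }

  // Rebuild one lost partition from the replicated input, leaving the others untouched.
  def recover(derived: Map[Int, Seq[String]],
              replicatedInput: Map[Int, Seq[String]],
              lostId: Int): Map[Int, Seq[String]] =
    derived.updated(lostId, computePartition(replicatedInput(lostId)))
}
```

Because `computePartition` is deterministic, the recovered partition is identical to the one that was lost, and only that partition needs to be recomputed.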
20
Key concepts
  • DStream: sequence of RDDs representing a stream
    of data
  • Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor,
    TCP sockets
  • Transformations: modify data from one DStream to
    another
  • Standard RDD operations: map, countByValue,
    reduce, join, ...
  • Stateful operations: window,
    countByValueAndWindow, ...
  • Output operations: send data to an external entity
  • saveAsHadoopFiles: saves to HDFS
  • foreach: do anything with each batch of results

21
Example 2: Count the hashtags
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))
  • val tagCounts = hashTags.countByValue()

[Figure: for each of batch @ t, batch @ t+1, batch @ t+2: tweets, hashTags, tagCounts, e.g. (cat, 10), (dog, 25), ...]
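
What a per-batch count produces can be shown with plain Scala collections. This is a sketch of the semantics only, not the Spark Streaming API: each inner sequence stands in for the hashtags of one batch, and the function names mirror the slide for readability.

```scala
// Sketch only: each inner Seq stands in for the hashtags of one batch.
object PerBatchCounts {
  // Count how often each distinct value occurs within a single batch.
  def countByValue[T](batch: Seq[T]): Map[T, Long] =
    batch.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size.toLong }

  // Apply the same count independently to every batch: one result map per batch.
  def tagCounts(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
    batches.map(batch => countByValue(batch))
}
```

Counts are not carried over between batches here; each batch yields its own independent map.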
22
Example 3: Count the hashtags over the last 10 mins
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))
  • val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Callouts: window is a sliding window operation; Minutes(10) is the window length; Seconds(1) is the sliding interval]
23
Example 3: Counting the hashtags over the last 10
mins
  • val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Figure: a sliding window over the hashTags batches; countByValue counts over all the data in the window]
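
The roles of the window length and the sliding interval can be sketched with plain Scala collections. This is an illustration, not Spark's implementation: both parameters are expressed in numbers of batches rather than durations, and windows at the start of the stream are assumed to cover whatever batches exist so far.

```scala
// Sketch only: window length and slide are measured in batches, not in time.
object SlidingWindowCount {
  def countByValue(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).map { case (v, occurrences) => v -> occurrences.size }

  // Every `slide` batches, recount everything in the last `windowLen` batches.
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): Seq[Map[String, Int]] = {
    require(windowLen > 0 && slide > 0, "window length and slide must be positive")
    (slide - 1 until batches.length by slide).map { end =>
      val start = math.max(0, end - windowLen + 1)
      countByValue(batches.slice(start, end + 1).flatten)
    }
  }
}
```

Note that this version rereads every batch in the window on each slide, so its cost grows with the window length.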
24
Smart window-based countByValue
  • val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Figure: countByValue over the sliding window yields tagCounts incrementally: add the counts from the new batch in the window, subtract the counts from the batch before the window]
25
Smart window-based reduce
  • Technique to incrementally compute count
    generalizes to many reduce operations
  • Need a function to "inverse reduce" ("subtract"
    for counting)
  • Could have implemented counting as
  • hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), ...)
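
The incremental technique (add the newest batch, subtract the batch that just left the window) can be sketched in plain Scala and checked against a naive recount. This is an illustration under assumptions, not the Spark implementation: the window length is measured in batches, the slide is one batch, and all names are hypothetical.

```scala
// Sketch only: incremental windowed counting with a reduce and an inverse reduce.
object IncrementalWindowReduce {
  type Counts = Map[String, Int]

  def countByValue(values: Seq[String]): Counts =
    values.groupBy(identity).map { case (v, occurrences) => v -> occurrences.size }

  // Combine two count maps key by key with `f`, dropping keys whose result is zero.
  def combine(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // One result per batch: previous window + new batch - batch that fell out of the window.
  def incrementalCounts(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countByValue).toVector
    perBatch.indices.scanLeft(Map.empty[String, Int]) { (window, i) =>
      val added = combine(window, perBatch(i))(_ + _)
      if (i >= windowLen) combine(added, perBatch(i - windowLen))(_ - _) else added
    }.tail
  }

  // Reference version: recount the whole window every time.
  def naiveCounts(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      countByValue(batches.slice(math.max(0, i - windowLen + 1), i + 1).flatten)
    }
}
```

Each step of the incremental version touches only two per-batch summaries regardless of the window length, which is why an invertible reduce function is worth supplying when one exists.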

26
Demo
27
Fault-tolerant Stateful Processing
  • All intermediate data are RDDs, hence can be
    recomputed if lost

[Figure: hashTags and tagCounts at batches t-1, t, t+1, t+2, t+3]
28
Fault-tolerant Stateful Processing
  • State data not lost even if a worker node dies
  • Does not change the value of your result
  • Exactly-once semantics for all transformations
  • No double counting!

29
Other Interesting Operations
  • Maintaining arbitrary state, e.g. tracking sessions
  • Maintain per-user mood as state, and update it
    with his/her tweets
  • tweets.updateStateByKey(tweet => updateMood(tweet))
  • Do arbitrary Spark RDD computation within a DStream
  • Join incoming tweets with a spam file to filter
    out bad tweets
  • tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
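
Keeping per-key state across batches can be sketched in plain Scala. This is an illustration rather than Spark code: the helper below takes the previous state and one batch of key-value pairs and applies an update function of the shape `(new values, old state) => new state`. The mood scoring rule in `updateMood` (plus one for ":)", minus one for ":(") is purely an assumption made for the example.

```scala
// Sketch only: per-key state carried from one batch to the next.
object StatefulUpdate {
  // Hypothetical scoring rule: +1 per happy tweet, -1 per sad tweet.
  def updateMood(newTweets: Seq[String], oldMood: Option[Int]): Option[Int] = {
    val delta = newTweets.map { tweet =>
      if (tweet.contains(":)")) 1 else if (tweet.contains(":(")) -1 else 0
    }.sum
    Some(oldMood.getOrElse(0) + delta)
  }

  // Apply `update` to every key seen so far; returning None would drop that key's state.
  def updateStateByKey[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val grouped: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    (state.keySet ++ grouped.keySet).toSeq.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty[V]), state.get(k)).map(s => k -> s)
    }.toMap
  }
}
```

Feeding the result of one call in as the `state` of the next call models how state would flow from batch to batch; keys with no new values in a batch still pass through the update function with an empty sequence.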

30
Performance
  • Can process 6 GB/sec (60M records/sec) of data on
    100 nodes at sub-second latency
  • Tested with 100 streams of data on 100 EC2
    instances with 4 cores each

31
Comparison with Storm and S4
  • Higher throughput than Storm
  • Spark Streaming: 670k records/second/node
  • Storm: 115k records/second/node
  • Apache S4: 7.5k records/second/node

32
Fast Fault Recovery
  • Recovers from faults/stragglers within 1 sec

33
Real Applications: Conviva
  • Real-time monitoring of video metadata
  • Achieved 1-2 second latency
  • Millions of video sessions processed
  • Scales linearly with cluster size

34
Real Applications: Mobile Millennium Project
  • Traffic transit time estimation using online
    machine learning on GPS observations
  • Markov chain Monte Carlo simulations on GPS
    observations
  • Very CPU intensive, requires dozens of machines
    for useful computation
  • Scales linearly with cluster size

35
Vision - one stack to rule them all
[Figure: Spark, Shark, Spark Streaming]
36
Spark program vs Spark Streaming program
  • Spark Streaming program on Twitter stream
  • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  • val hashTags = tweets.flatMap(status => getTags(status))
  • hashTags.saveAsHadoopFiles("hdfs://...")
  • Spark program on Twitter log file
  • val tweets = sc.hadoopFile("hdfs://...")
  • val hashTags = tweets.flatMap(status => getTags(status))
  • hashTags.saveAsHadoopFile("hdfs://...")
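
The point of the comparison, that one transformation can serve both a stored log and a live stream, can be sketched with plain Scala collections. This is not Spark code: a `Seq[String]` stands in for the log file, a `Seq[Seq[String]]` for the sequence of stream batches, and the body of `getTags` is an assumption.

```scala
// Sketch only: the same transformation reused for a batch input and for stream batches.
object SharedLogic {
  // Hypothetical stand-in for the getTags helper named on the slides.
  def getTags(status: String): Seq[String] =
    status.split("\\s+").filter(_.startsWith("#")).toSeq

  // The shared business logic, written once.
  def hashTags(tweets: Seq[String]): Seq[String] = tweets.flatMap(getTags)

  // "Batch" use: one pass over a stored log.
  def runOnLog(log: Seq[String]): Seq[String] = hashTags(log)

  // "Streaming" use: the same function applied to every incoming batch.
  def runOnStream(batches: Seq[Seq[String]]): Seq[Seq[String]] = batches.map(hashTags)
}
```

Only the source and the sink differ between the two entry points; the transformation in the middle is shared.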

37
Vision - one stack to rule them all
./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = file.map(...)
...
  • Explore data interactively using Spark Shell /
    PySpark to identify problems
  • Use same code in Spark stand-alone programs to
    identify problems in production logs
  • Use similar code in Spark Streaming to identify
    problems in live log streams

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = file.map(...)
    ...
  }
}
object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = stream.map(...)
    ...
  }
}
39
Alpha Release with Spark 0.7
  • Integrated with Spark 0.7
  • Import spark.streaming to get all the
    functionality
  • Both Java and Scala APIs
  • Give it a spin!
  • Run locally or in a cluster
  • Try it out in the hands-on tutorial later today

40
Summary
  • Stream processing framework that is ...
  • Scalable to large clusters
  • Achieves second-scale latencies
  • Has simple programming model
  • Integrates with batch and interactive workloads
  • Ensures efficient fault-tolerance in stateful
    computations
  • For more information, check out our paper:
    http://tinyurl.com/dstreams