Spark Streaming: Large-scale near-real-time stream processing

Slides: 41

Provided by: sparkproj

Learn more at: https://spark.apache.org

1
Spark Streaming: Large-scale near-real-time
stream processing

Tathagata Das (TD)
UC Berkeley

2
What is Spark Streaming?

Framework for large-scale stream processing
Scales to 100s of nodes
Can achieve second-scale latencies
Integrates with Spark's batch and interactive
processing
Provides a simple batch-like API for implementing
complex algorithms
Can absorb live data streams from Kafka, Flume,
ZeroMQ, etc.

3
Motivation

Many important applications must process large
streams of live data and provide results in
near-real-time
Social network trends
Website statistics
Intrusion detection systems
etc.
Require large clusters to handle workloads
Require latencies of a few seconds

4
Need for a framework

for building such complex stream processing
applications
But what are the requirements
from such a framework?

5
Requirements

Scalable to large clusters
Second-scale latencies
Simple programming model

6
Case study: Conviva, Inc.

Real-time monitoring of online video metadata
HBO, ESPN, ABC, SyFy, ...
Two processing stacks

Custom-built distributed stream processing system
1000s of complex metrics on millions of video
sessions
Requires many dozens of nodes for processing

Hadoop backend for offline analysis
Generating daily and monthly reports
Similar computation as the streaming system

7
Case study: XYZ, Inc.

Any company that wants to process live streaming
data has this problem
Twice the effort to implement any new function
Twice the number of bugs to solve
Twice the headache
Two processing stacks

Custom-built distributed stream processing system
1000s of complex metrics on millions of video
sessions
Requires many dozens of nodes for processing

Hadoop backend for offline analysis
Generating daily and monthly reports
Similar computation as the streaming system

8
Requirements

Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch and interactive processing

9
Stateful Stream Processing

Traditional streaming systems have an event-driven
record-at-a-time processing model
Each node has mutable state
For each record, update state and send new records
State is lost if node dies!
Making stateful stream processing
fault-tolerant is challenging

10
Existing Streaming Systems

Storm
Replays record if not processed by a node
Processes each record at least once
May update mutable state twice!
Mutable state can be lost due to failure!
Trident: uses transactions to update state
Processes each record exactly once
Per-state transaction updates are slow
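
The double-update hazard listed above can be illustrated with a toy simulation in plain Scala. This is a sketch only and does not use Storm's API: `ReplayDemo`, `countWithReplay`, and `countAsBatch` are hypothetical names, and the "failure" is modeled by simply delivering selected records a second time.

```scala
import scala.collection.mutable

object ReplayDemo {
  // Record-at-a-time processing: every delivery mutates the running counts.
  // `replayed` holds the indices of records that are delivered twice
  // (a stand-in for at-least-once redelivery after a lost acknowledgement).
  def countWithReplay(records: Seq[String], replayed: Set[Int]): Map[String, Int] = {
    val state = mutable.Map.empty[String, Int].withDefaultValue(0)
    records.zipWithIndex.foreach { case (r, i) =>
      state(r) += 1
      if (replayed.contains(i)) state(r) += 1 // the replay is applied again
    }
    state.toMap
  }

  // Deterministic batch computation: the result is a pure function of the
  // input, so rerunning it after a failure yields the same counts.
  def countAsBatch(records: Seq[String]): Map[String, Int] =
    records.groupBy(identity).map { case (k, v) => k -> v.size }
}
```

With the records `cat, dog, cat` and the second record replayed, the mutable version reports two dogs, while rerunning the batch version any number of times keeps reporting one.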

11
Requirements

Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch and interactive processing
Efficient fault-tolerance in stateful computations

12
Spark Streaming
13
Discretized Stream Processing

Run a streaming computation as a series of very
small, deterministic batch jobs

Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and
processes them using RDD operations
Finally, the processed results of the RDD
operations are returned in batches

[Figure: live data stream -> Spark Streaming -> batches of X seconds -> Spark]
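
The idea of chopping a stream into batches and running the same job on each one can be sketched without Spark. The code below is an illustration in plain Scala, not Spark Streaming's implementation; the `Event` type, `toBatches`, and `processEach` are hypothetical names, and timestamps are assumed to be in milliseconds.

```scala
object Discretize {
  // A record with an arrival time in milliseconds (hypothetical type).
  final case class Event(timeMs: Long, payload: String)

  // Group events into consecutive intervals of `batchMs` milliseconds.
  // Returns (batchIndex, events) pairs in time order; empty intervals are omitted.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // Run the same deterministic function on every batch, like a small batch job.
  def processEach[A](batches: Seq[(Long, Seq[Event])])(job: Seq[Event] => A): Seq[(Long, A)] =
    batches.map { case (idx, evs) => idx -> job(evs) }
}
```

For example, `Discretize.processEach(Discretize.toBatches(events, 1000))(_.size)` counts the events that arrived in each one-second interval.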
14
Discretized Stream Processing

Run a streaming computation as a series of very
small, deterministic batch jobs

Batch sizes as low as ½ second, latency of about 1
second
Potential for combining batch processing and
streaming processing in the same system
15
Example 1: Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream
of data
[Figure: Twitter Streaming API -> tweets DStream; each batch is stored in memory as an RDD (immutable, distributed)]
16
Example 1: Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to
create another DStream
new DStream: hashTags DStream (cat, dog, ...)
new RDDs created for every batch
17
Example 1: Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage
[Figure: for each of batch @ t, batch @ t+1, batch @ t+2: tweets DStream -> flatMap -> hashTags DStream; every batch saved to HDFS]
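
The slides never define `getTags`. One possible version, together with a model of how a transformation is applied batch by batch, might look like the sketch below. It is an assumption-laden illustration: a status is treated as a plain string rather than a Twitter status object, a DStream is modeled as a sequence of in-memory batches, and the regular expression is just one reasonable definition of a hashtag.

```scala
object HashTags {
  // Hypothetical hashtag pattern: '#' followed by word characters.
  private val Tag = "#(\\w+)".r

  // Hypothetical stand-in for the slide's getTags; a status is just its text here.
  def getTags(status: String): Seq[String] =
    Tag.findAllMatchIn(status).map(_.group(1).toLowerCase).toList

  // A stream modeled as a sequence of batches: the transformation is applied
  // to each batch independently and yields a new sequence of batches.
  def hashTagsPerBatch(tweetBatches: Seq[Seq[String]]): Seq[Seq[String]] =
    tweetBatches.map(batch => batch.flatMap(s => getTags(s)))
}
```

Because each output batch depends only on the matching input batch, any single batch can be recomputed on its own.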
18
Java Example

Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")

Function object to define the transformation
19
Fault-tolerance

RDDs remember the sequence of operations that
created them from the original fault-tolerant input
data
Batches of input data are replicated in memory of
multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be
recomputed from input data

[Figure: tweets RDD (input data replicated in memory) -> flatMap -> hashTags RDD (lost partitions recomputed on other workers)]
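
Lineage-based recovery can be modeled in a few lines of plain Scala. This is a conceptual sketch rather than how Spark implements RDDs: `Lineage.Derived` is a hypothetical type that keeps a reference to its (replicated) parent partitions and the function that produced it, and a lost partition is represented as `None`.

```scala
object Lineage {
  // A derived dataset remembers its parent partitions and the function that
  // produced it (its lineage) instead of keeping replicas of its own data.
  final case class Derived[A, B](parent: Vector[Seq[A]], f: Seq[A] => Seq[B]) {
    // Compute every partition; Some(...) marks a partition that is available.
    def compute: Vector[Option[Seq[B]]] = parent.map(p => Some(f(p)))

    // Recompute only the missing partitions by reapplying f to the parent data.
    def recover(partitions: Vector[Option[Seq[B]]]): Vector[Seq[B]] =
      partitions.zipWithIndex.map {
        case (Some(data), _) => data
        case (None, i)       => f(parent(i))
      }
  }
}
```

Only the lost partition is recomputed, and because `f` is deterministic the recovered data equals what was lost.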
20
Key concepts

DStream: sequence of RDDs representing a stream
of data
Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor,
TCP sockets
Transformations: modify data from one DStream to
another
Standard RDD operations: map, countByValue,
reduce, join, ...
Stateful operations: window, countByValueAndWindow, ...
Output operations: send data to external entity
saveAsHadoopFiles: saves to HDFS
foreach: do anything with each batch of results

21
Example 2: Count the hashtags

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

[Figure: for each of batch @ t, batch @ t+1, batch @ t+2: tweets -> hashTags -> tagCounts, e.g. (cat, 10), (dog, 25), ...]
22
Example 3: Count the hashtags over last 10 mins

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

window: sliding window operation
Minutes(10): window length
Seconds(1): sliding interval
23
Example 3: Counting the hashtags over last 10
mins

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Figure: a sliding window moves over the hashTags batches; countByValue counts over all the data in the window]
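
How window length and sliding interval interact can be shown on plain Scala collections. The sketch below is not Spark's implementation: `windowedCounts` is a hypothetical helper, both parameters are expressed in numbers of batches rather than durations, and the counts are recomputed from scratch for every window.

```scala
object Windowing {
  // `batches` holds the records of successive batch intervals.
  // `windowLen` and `slide` are measured in batches and assumed to be >= 1.
  // A result is emitted every `slide` batches and covers the last
  // `windowLen` batches (fewer at the start of the stream).
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): Seq[Map[String, Int]] =
    (slide to batches.length by slide).map { end =>
      val window = batches.slice(math.max(0, end - windowLen), end).flatten
      window.groupBy(identity).map { case (k, v) => k -> v.size }
    }
}
```

With one-second batches, the slide's `window(Minutes(10), Seconds(1))` would correspond to `windowLen = 600` and `slide = 1` in this model, which means 600 batches are recounted every second.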
24
Smart window-based countByValue

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Figure: countByValue on each batch; tagCounts for the new window is the previous window's counts, plus the counts from the new batch in the window, minus the counts from the batch before the window]
25
Smart window-based reduce

Technique to incrementally compute count
generalizes to many reduce operations
Need a function to "inverse reduce" ("subtract"
for counting)
Could have implemented counting as
hashTags.reduceByKeyAndWindow(_ + _, _ - _,
Minutes(1), ...)
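
The add-then-subtract technique can be sketched on plain Scala maps. This is an illustration of the idea only, not Spark's `reduceByKeyAndWindow`: `slidingCounts` and `combine` are hypothetical names, the window length is given in batches, and the window is assumed to slide by exactly one batch.

```scala
object IncrementalWindow {
  type Counts = Map[String, Int]

  private def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (k, v) => k -> v.size }

  // Merge two count maps with `op`, dropping keys whose count reaches zero.
  private def combine(a: Counts, b: Counts)(op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // Each step adds the newest batch's counts (reduce: _ + _) and subtracts the
  // counts of the batch that just left the window (inverse reduce: _ - _),
  // instead of recounting every batch in the window.
  def slidingCounts(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch).toVector
    perBatch.indices
      .scanLeft(Map.empty[String, Int]) { (prev, i) =>
        val added = combine(prev, perBatch(i))(_ + _)
        if (i >= windowLen) combine(added, perBatch(i - windowLen))(_ - _) else added
      }
      .tail
  }
}
```

The work per step depends on the size of two batches rather than on the window length, which is why this only works for reduce functions that have an inverse.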

26
Demo
27
Fault-tolerant Stateful Processing

All intermediate data are RDDs, hence can be
recomputed if lost

[Figure: hashTags and tagCounts RDDs for batches t-1, t, t+1, t+2, t+3]
28
Fault-tolerant Stateful Processing

State data not lost even if a worker node dies
Does not change the value of your result
Exactly-once semantics for all transformations
No double counting!

29
Other Interesting Operations

Maintaining arbitrary state, track sessions
Maintain per-user mood as state, and update it
with his/her tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
Do arbitrary Spark RDD computation within DStream
Join incoming tweets with a spam file to filter
out bad tweets
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
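
Per-key state that is updated once per batch can be modeled as a fold over batches. The sketch below is plain Scala, not the Spark API: `StatefulUpdate.step` is a hypothetical helper, and its update function takes the new values for a key plus the optional previous state, which is an assumption about a convenient shape rather than the signature shown on the slide.

```scala
object StatefulUpdate {
  // Apply one batch of (key, value) pairs to the running per-key state.
  // `update` receives the batch's new values for a key and the previous state
  // (None if the key is new); returning None drops the key from the state.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = (state.keySet ++ newValues.keySet).toSeq
    keys.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s).toList
    }.toMap
  }
}
```

A per-user "mood" could then be a running sum of hypothetical sentiment scores, with `step` called once for every batch of `(user, score)` pairs.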

30
Performance

Can process 6 GB/sec (60M records/sec) of data on
100 nodes at sub-second latency
Tested with 100 streams of data on 100 EC2
instances with 4 cores each

31
Comparison with Storm and S4

Higher throughput than Storm
Spark Streaming: 670k records/second/node
Storm: 115k records/second/node
Apache S4: 7.5k records/second/node

32
Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

33
Real Applications: Conviva

Real-time monitoring of video metadata

Achieved 1-2 second latency
Millions of video sessions processed
Scales linearly with cluster size

34
Real Applications: Mobile Millennium Project

Traffic transit time estimation using online
machine learning on GPS observations

Markov chain Monte Carlo simulations on GPS
observations
Very CPU intensive, requires dozens of machines
for useful computation
Scales linearly with cluster size

35
Vision - one stack to rule them all
Spark + Shark + Spark Streaming
36
Spark program vs Spark Streaming program

Spark Streaming program on Twitter stream
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Spark program on Twitter log file
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")

37
Vision - one stack to rule them all
./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = file.map(...)
...

Explore data interactively using Spark Shell /
PySpark to identify problems
Use same code in Spark stand-alone programs to
identify problems in production logs
Use similar code in Spark Streaming to identify
problems in live log streams

object ProcessProductionData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = file.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]): Unit = {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    // Same filter and map as above, applied to the live stream
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = stream.map(...)
    ...
  }
}
38
Vision - one stack to rule them all
Spark + Shark + Spark Streaming
39
Alpha Release with Spark 0.7