Title: Slide 1
1 PAPER PRESENTATION on "An Efficient and Resilient Approach to Filtering and Disseminating Streaming Data"
CMPE 521 Database Systems
Prepared by Mürsel Tasgin, Onur Kardes
2 Introduction
- The internet and the web are increasingly used to disseminate fast-changing data.
- Several examples of fast-changing data:
- sensors,
- traffic and weather information,
- stock prices,
- sports scores,
- health monitoring information
3 Introduction
- The properties of this data:
- Highly dynamic,
- Streaming,
- Aperiodic.
- Users are interested not only in monitoring streaming data but also in using it for on-line decision making.
4 Introduction
[Figure: a data source disseminating to Repository 1, Repository 2 and Repository 3.]
5 Introduction
- Services like Akamai.net and IBM's edge server technology are exemplars of such networks of repositories, which aim to provide better service by shifting most of the work to the edge of the network (closer to the end users).
- But although such systems scale quite well, if the data changes at a fast rate, the quality of service at a repository farther from the data source deteriorates.
6 Introduction
- In general:
- Replication can reduce the load on the sources,
- But replication of time-varying data introduces new challenges:
- Coherency
- Delays and scalability
7 Introduction
- Coherency requirement (cr): users specify the bound on the tolerable imprecision associated with each requested data item.
[Figure: example. The SOURCE has Microsoft at 60.85 at time 1143; Repository 1 serves USER 1 with Microsoft at 60.89 (time 1136) and Repository 2 serves USER 2 with Microsoft at 60.86 (time 1141).]
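As a minimal sketch (the helper and the cr values are illustrative, not from the slide), the coherency requirement boils down to a bound on how far a user's view may drift from the source value:

    def violates_cr(user_value, source_value, cr):
        # The user's view is acceptable as long as it deviates from the
        # source value by no more than the coherency requirement cr.
        return abs(user_value - source_value) > cr

    # Values from the slide's example: USER 1 sees 60.89 while the source is at 60.85.
    print(violates_cr(60.89, 60.85, cr=0.10))  # False: still within a cr of 0.10
    print(violates_cr(60.89, 60.85, cr=0.01))  # True: too stale for a cr of 0.01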
8 Introduction
- A coherency-preserving system must:
- deliver data that preserves the associated coherency requirements,
- be resilient to failures,
- be efficient.
- Necessary changes are pushed to the users instead of users polling the source independently.
9 Introduction
- Construction of an effective dissemination network of repositories
- A logical overlay network of repositories is created according to:
- the coherency needs of the users attached to each repository,
- the expected delays at each repository.
- This network is called a dynamic data dissemination graph (d3g).
10 Introduction
- Construction of an effective dissemination network of repositories
- The previous algorithm for building a d3g, called LeLA, was unable to cope with a large number of data items.
- A new algorithm (DiTA) to build dissemination networks that are scalable and resilient is introduced.
11 Introduction
- Construction of an effective dissemination network of repositories
- In DiTA, repositories with more stringent coherency requirements are placed closer to the source in the network, as they are likely to get more updates than the ones with looser coherency requirements.
- In DiTA, a dynamic data dissemination tree (d3t) is created for each data item x.
12 Introduction
Construction of an effective dissemination network of repositories
[Figure: a dissemination tree rooted at the SOURCE with Repository 1 (c = 0.2), Repository 2 (c = 0.3), Repository 3 (c = 0.8), Repository 4 (c = 0.7), Repository 5 (c = 0.9) and Repository 6 (c = 0.7); repositories with smaller (more stringent) c sit closer to the source.]
13 Introduction
- Provision for the dissemination of dynamic data in spite of failures in the overlay network
- To handle repository and communication link failures, back-up parents are used.
- A back-up parent is asked to deliver data with a coherency that is less stringent than that associated with the parent.
14 Introduction
Provision for the dissemination of dynamic data in spite of failures in the overlay network
[Figure: repositories serving sets of data items (x,y,z,t; a,b,c,x; y,z,t; x,t; z), with one repository marked as Parent and another as Back-up Parent.]
15 Introduction
- Efficient filtering and scheduling techniques for repositories
- Normally a repository receives updates and selectively disseminates them to its downstream dependents.
- It is not always necessary to disseminate the exact values of the most recent updates, as long as the values presented preserve the coherency of the data.
16 The Basic Framework: Data Coherency and Overlay Network
17 The Basic Framework: Data Coherency and Overlay Network
- A coherency requirement (c) is associated with a data item to denote the maximum permissible deviation of the user's view from the value of data x at the source.
- c can be specified in terms of:
- time (values should never be out-of-sync by more than 5 sec.), or
- value (weather information where the temperature value should never be out-of-sync by more than 2 degrees).
18 The Basic Framework: Data Coherency and Overlay Network
- Each data item in the repository from which a user obtains data must be refreshed in such a way that the user-specified coherency requirement is maintained, i.e. |P(t) - S(t)| < c, where P(t) is the value presented to the user and S(t) is the value at the source at time t.
- The fidelity f observed by a user can be defined as the total length of time for which the above inequality holds.
19 The Basic Framework: Data Coherency and Overlay Network
- Assume x is served by a single source.
- Repositories R1, ..., Rn are interested in x.
- These repositories in turn serve a subset of the remaining repositories, such that the resulting network is in the form of a tree rooted at the source and consisting of the repositories R1, ..., Rn.
- This defines a parent-dependent relationship.
20 The Basic Framework: Data Coherency and Overlay Network
- Since a repository disseminates updates to its users and dependents, the coherency requirement of a repository should be the most stringent requirement that it has to serve.
- When a data change occurs at the source, the source checks which of its direct and indirect dependents are interested in the change and pushes the change to them.
21 Building a d3t
- Start with a physical layout of the communication network in the form of a graph, where the graph consists of a set of sources, repositories and the underlying network.
- Try to build a d3t for a data item x.
- The root of the d3t will be the source which serves x.
- A repository P serving repository Q with data item x is called the parent of Q, and Q is called the dependent of P for x.
22 Building a d3t
[Figure: the source for data item x and the set of data items in each repository.]
23 Building a d3t
- A repository should ideally serve at least as many unique (dependent, data item) pairs as the number of data items served to it.
- If a repository is currently serving fewer than this fixed number, then we say that the repository has the resources to serve a new dependent.
[Table: dependents of repository R1 and the data items served to them: R7 - x, R11 - y, R18 - x, R9 - z, R10 - t, R21 - x.]
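A minimal sketch of this resource rule (the class and the data are illustrative; the set of items served to R1 is not given on the slide): a repository can accept a new dependent while the number of unique (dependent, data item) pairs it serves is below the number of data items served to it.

    class Repository:
        def __init__(self, items_served_to_me):
            # Data items this repository itself receives from its parents.
            self.items_served_to_me = set(items_served_to_me)
            # Unique (dependent, data item) pairs this repository serves.
            self.served_pairs = set()

        def has_resources(self):
            # Below the ideal load it can still take on a new dependent.
            return len(self.served_pairs) < len(self.items_served_to_me)

    # Example mirroring the table above for R1 (the seven items are hypothetical).
    r1 = Repository(items_served_to_me={"x", "y", "z", "t", "u", "v", "w"})
    for pair in [("R7", "x"), ("R11", "y"), ("R18", "x"), ("R9", "z"), ("R10", "t"), ("R21", "x")]:
        r1.served_pairs.add(pair)
    print(r1.has_resources())  # True: 6 pairs < 7 items served to R1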
24 Building a d3t
[Figure: example insertion into a d3t. The SOURCE is the root, with R4 (c = 0.1) and R5 (c = 0.4) below it and R7 (c = 0.8), R8 (c = 0.6), R9 (c = 0.7) and R10 (c = 0.3) further down; subtrees are annotated with their Max(c) values (0.8, 0.7, 0.6). Since c_R6 > c_R10, R10 is replaced with R6 and R6 is pushed down.]
25 Building a d3t
[Figure: the resulting tree, with the SOURCE at the root, R4 (c = 0.1) and R5 (c = 0.4) below it, and R6 (c = 0.5), R7 (c = 0.8), R8 (c = 0.6) and R9 (c = 0.7) further down; subtrees are annotated with their Max(c) values (0.8, 0.7, 0.6, 0.5).]
This algorithm is called the Data-Item-at-a-Time Algorithm (DiTA).
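The slides give no pseudocode for DiTA, so the following is only an illustrative sketch of the stated placement rule (more stringent coherency requirements sit closer to the source); it ignores resource limits, link delays and the recursive descent of the real algorithm, and all names are hypothetical.

    class Node:
        def __init__(self, name, c):
            self.name = name      # repository name
            self.c = c            # coherency requirement for x (smaller = more stringent)
            self.children = []

    def insert(parent, new):
        for i, child in enumerate(parent.children):
            if new.c < child.c:
                # The newcomer has the more stringent requirement: it takes the
                # child's place and the displaced child becomes its dependent.
                parent.children[i] = new
                new.children.append(child)
                return
        # No looser child found: the newcomer becomes a new dependent of 'parent'.
        parent.children.append(new)

    source = Node("SOURCE", 0.0)
    for name, c in [("R4", 0.1), ("R6", 0.5), ("R10", 0.3)]:
        insert(source, Node(name, c))
    # R10 (c = 0.3) displaces the looser R6 (c = 0.5), which is pushed one level down.
    print([n.name for n in source.children])              # ['R4', 'R10']
    print([n.name for n in source.children[1].children])  # ['R6']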
26 Building a d3t
Trace collection procedure and characteristics
- Real-world stock price streams from http://finance.yahoo.com are used.
- 10,000 values were polled over 1,000 traces; approximately one new data value is obtained per second.
27 Building a d3t
Repositories: data, coherency and cooperation characteristics
- A coherency requirement c is associated with each of the chosen data items.
- The c's associated with data in a repository are a mix of stringent tolerances (varying from 0.01 to 0.05) and less stringent tolerances (varying from 0.5 to 0.99).
- T% of the data items have stringent coherency requirements at each repository (the remaining (100 - T)% of the data items have less stringent coherency requirements).
28 Building a d3t
Physical network topology and delays
- The router topology was generated using BRITE (http://www.cs.bu.edu/brite).
- The repositories and the sources are selected randomly.
- Node-node communication delays are derived from a Pareto distribution: x = (1/u)^(1/a) * x1 with u uniform in (0, 1) and a = x̄ / (x̄ - x1).
29 Building a d3t
Physical network topology and delays
- x̄ is the mean delay and x1 is the minimum delay a link can have.
- In the experiments, x̄ = 15 ms and x1 = 2 ms.
- The computational delay for a dissemination is taken to be 12.5 ms.
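As a sketch of this delay model (the constants are a best-effort reading of the slide, so treat them as illustrative), link delays can be drawn by inverse-transform sampling of the Pareto distribution:

    import random

    def pareto_delay(mean_ms=15.0, min_ms=2.0):
        # Shape parameter chosen so the distribution has the desired mean:
        # mean = a * min / (a - 1)  =>  a = mean / (mean - min).
        a = mean_ms / (mean_ms - min_ms)
        u = 1.0 - random.random()            # uniform in (0, 1], avoids division by zero
        return min_ms * (1.0 / u) ** (1.0 / a)

    delays = [pareto_delay() for _ in range(5)]
    print(delays)  # heavy-tailed link delays, each at least 2 ms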
30 Building a d3t
Metrics
- The key metric is the loss in fidelity of the data.
- Fidelity is the length of time (as a percentage of the total time) for which the inequality |P(t) - S(t)| < c holds.
- The fidelity of a repository is the mean over all data items stored in that repository.
- The fidelity of the system is the mean fidelity of all repositories.
- Obviously, the loss of fidelity is (100 - fidelity).
- Another metric is the number of messages in the system (system load).
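A minimal sketch of this metric (the trace format and the values are invented for illustration): fidelity is the percentage of the observation period during which the repository's value stays within c of the source's value.

    def fidelity_percent(samples, c):
        """samples: list of (duration, repo_value, source_value) over the observation period."""
        total = sum(d for d, _, _ in samples)
        in_sync = sum(d for d, p, s in samples if abs(p - s) < c)
        return 100.0 * in_sync / total

    trace = [(10, 60.89, 60.85), (5, 60.89, 61.10), (15, 61.05, 61.10)]  # (seconds, P(t), S(t))
    f = fidelity_percent(trace, c=0.1)
    print(f, "% fidelity;", 100 - f, "% loss in fidelity")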
31 Building a d3t
Performance Evaluation
- For the base performance measurement, 600 routers, 100 repositories and 4 servers were used.
- The total number of data items served by the servers was varied from 100 to 1,000.
- The T parameter was varied from 20% to 80%.
- The previous algorithm, LeLA, was used as a benchmark.
32 Building a d3t
Performance Evaluation
- Each node in DiTA does less work than in LeLA.
- Thus, in DiTA the height of the dissemination tree will be greater.
- So, when computational delays are low but link delays are large, LeLA may perform better.
- But this happens only for negligible computational delays (0.5 ms) and very high link delays (110 ms).
33 Enhancing the Resiliency of the Repository Network
- Active backups vs. passive backups
- Passive backups may increase the load, which causes loss in fidelity.
- So active backup parents are used.
- A backup parent serves data to a dependent Q with a coherency cB > c.
34 Enhancing the Resiliency of the Repository Network
- If all changes are smaller than cB, the dependent cannot know when its parent P fails. So P should send periodic "I'm alive" messages.
- Once P fails, Q requests B to serve it the data at c. When P recovers from the failure, Q requests B to serve the data item at cB again.
- In this approach there is no backup for backups, so when both P and B fail, Q cannot get any updates.
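A minimal sketch of this failover, seen from the dependent Q (class names, the heartbeat timeout and the request_service call are hypothetical, not from the paper):

    import time

    class Peer:
        def __init__(self, name):
            self.name = name
        def request_service(self, dependent, coherency):
            print(f"{self.name} now serves {dependent.name} at coherency {coherency}")

    class Dependent:
        TIMEOUT = 5.0  # seconds without an "I'm alive" message before P is presumed failed

        def __init__(self, name, parent, backup, c, k=2):
            self.name, self.parent, self.backup = name, parent, backup
            self.c, self.cb = c, k * c          # parent serves at c, backup at cB = k * c
            self.last_heartbeat = time.time()
            self.failed_over = False

        def on_heartbeat(self):
            self.last_heartbeat = time.time()
            if self.failed_over:                 # P recovered: backup drops back to cB
                self.backup.request_service(self, self.cb)
                self.failed_over = False

        def check_parent(self, now=None):
            now = time.time() if now is None else now
            if not self.failed_over and now - self.last_heartbeat > self.TIMEOUT:
                self.backup.request_service(self, self.c)   # P presumed failed: B serves at c
                self.failed_over = True

    q = Dependent("Q", parent=Peer("P"), backup=Peer("B"), c=0.1)
    q.check_parent(now=time.time() + 10)   # no heartbeat for 10 s -> fail over to B at c = 0.1
    q.on_heartbeat()                       # P is back -> B returns to cB = 0.2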
35 Enhancing the Resiliency of the Repository Network
Choice of cB using a probabilistic model
- For the sake of simplicity, cB = k * c.
- Here, the choice of k is important.
36 Enhancing the Resiliency of the Repository Network
Choice of cB using a probabilistic model
- Assuming that the data values change with uniform probability, and
- using a Markov chain model:
- Misses = 2k^2 - 2
- 2k^2 - 2 is the number of updates a dependent will miss before it detects that there is a failure.
- According to the experiments, this number is rather pessimistic (nearly an upper limit).
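For example, with the k = 2 used later in the performance evaluation, Misses = 2*2^2 - 2 = 6, i.e. a dependent may miss up to about six updates before it detects that its parent has failed:

    def misses(k):
        # Upper-bound estimate of updates missed before a failure is detected.
        return 2 * k ** 2 - 2

    print(misses(2))  # 6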
37 Enhancing the Resiliency of the Repository Network
Choice of backup parents
[Figure: choosing a backup parent for Q. Q's parent is P, whose own parent is R; P's siblings B and C are the candidate backup parents, and one of them is chosen randomly.]
38 Enhancing the Resiliency of the Repository Network
Choice of backup parents
- In case the coherency at which Q wants x from B is less (tighter) than the coherency at which B itself wants x,
- the parent of B is asked to serve x to Q with the required tighter coherency.
- An advantage of choosing a sibling is that the change in coherency requirement is not percolated all the way up to the source.
- However, if an ancestor of P and B is heavily loaded, then the delay due to the load will be reflected in the updates of both P and B. This might result in additional loss in fidelity.
39 Enhancing the Resiliency of the Repository Network
Effect of repository failures on loss of fidelity
- Because these kinds of failures are memory-less, an exponential probability distribution is used for simulating them.
- Pr(X > t) = e^(-λt)
- λ = λ1 for the time to failure
- λ = λ2 for the time to recovery
- In this approach link failures are not taken into account, so the model is incomplete.
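A small sketch of this failure model (the simulation loop is illustrative; the rates are the ones quoted in the later evaluation slides): up and down periods are drawn from exponential distributions with rates λ1 and λ2.

    import random

    def simulate_uptime(lambda_fail=0.0001, lambda_recover=2.0, horizon=10_000.0):
        """Alternate exponentially distributed up/down periods; return the fraction of time up."""
        t, up_time, alive = 0.0, 0.0, True
        while t < horizon:
            rate = lambda_fail if alive else lambda_recover
            period = random.expovariate(rate)     # Pr(X > t) = exp(-rate * t)
            period = min(period, horizon - t)
            if alive:
                up_time += period
            t += period
            alive = not alive
        return up_time / horizon

    print(simulate_uptime())  # close to 1.0: failures are rare and recovery is fast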
40 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- The effect of adding resiliency is shown.
- k = 2 is used.
- When 100 data items are used, 23% of the updates sent by backups are disseminated.
- Some updates sent by backups arrived before those sent by the parents.
41 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- But when the backup parents are loaded (> 400 data items), their updates are of no use and increase the loss of fidelity.
- The dependent should handle this by time-stamping the updates.
42 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- During the experiment, about 80-90% of the repositories experienced at least one failure,
- and the maximum number of failures in the system at any given time for λ2 = 0.001 was around 12.
- For λ2 = 0.01 the maximum number of failures was 5, and for λ2 = 0.1 the maximum was 2.
43 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- The effect of quick recovery is shown.
- λ1 = 0.0001 and λ2 = 2
- For high coherency requirements, resiliency improves fidelity even for transient failures.
44 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- However, with resiliency and a very large number of data items (e.g., 1,000), fidelity drops.
- This is because, at this point, the cost of resiliency exceeds the benefits obtained from it, and hence the loss in fidelity increases.
45 Reducing the Delay at a Repository
- Delays:
- Queuing delay: the time between the arrival of an update and the time its processing starts.
- Processing delay: check delay (deciding whether the update should be processed) + computation delay (the delay of computing the update and pushing data to the dependents).
46 Reducing the Delay at a Repository
- Question: How can we reduce the average delays to improve fidelity?
- This can be done by:
- Better filtering, i.e. reducing the processing delay in determining if an update needs to be disseminated to one or more dependents,
- Better scheduling of disseminations.
47 Reducing the Delay at a Repository
- For each dependent, a repository maintains the coherency requirement (cr) and the last value pushed to it; the upper bound of a dependent's window is (last pushed value + cr) and the lower bound is (last pushed value - cr).
- The cr values of the dependents residing at the repository are kept sorted, e.g. C1 = 0.7, C2 = 0.6, C3 = 0.5, C4 = 0.3, C5 = 0.1, C6 = 0.05.
- Algorithm to find the dependents to disseminate data to: scan the sorted list for the first dependent (the one with the largest cr) that needs the update to be disseminated.
- A rule relating the windows must hold for every window; if an update violates this rule, a pseudo value is generated and pushed as the actual value.
- This technique is called dependent ordering.
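A minimal sketch of dependent ordering (a simplification, not the paper's exact algorithm; it assumes the dependents' windows have been kept consistent, so in the example all dependents share the same last pushed value):

    def dependents_to_push(dependents, new_value):
        """dependents: sorted by decreasing cr; each entry has 'name', 'cr', 'last_pushed'.

        Once the first (largest-cr) dependent that needs the update is found, the
        dependents with smaller cr are taken to need it too, so the per-dependent
        checks can stop there; this is what keeping the dependents ordered buys us.
        """
        for i, d in enumerate(dependents):
            if abs(new_value - d["last_pushed"]) >= d["cr"]:
                return [x["name"] for x in dependents[i:]]
        return []

    deps = [{"name": "C1", "cr": 0.7, "last_pushed": 60.0},
            {"name": "C4", "cr": 0.3, "last_pushed": 60.0},
            {"name": "C6", "cr": 0.05, "last_pushed": 60.0}]
    print(dependents_to_push(deps, 60.4))   # ['C4', 'C6']: C1's window still covers the new value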
48 Reducing the Delay at a Repository
- Better filtering provides:
- Sending updates of dynamic data only to the end users who are actually interested in that update.
- By filtering, no garbage data flows on the network (no flooding of data over the network). This improves communication time in the network and provides better response times.
- With the help of filtering, a more scalable system can be established that resists unexpected heavy loads.
49 Reducing the Delay at a Repository
- Better scheduling of disseminations
[Figure: the total delay of processing updates u1 and u2; C(u) denotes the cost (delay) of processing an update and b(u) the beneficiaries of an update.]
- Approach:
- Instead of standard queueing of the update requests, a form of prioritization gives better performance: each update is given the SCORE b(u)/C(u).
- Each update request is scheduled according to this score. b(u) is the number of dependents that will receive the update, and C(u) is the cost of disseminating it to all dependents. The b(u) values are stored at each repository, so they are effectively precomputed.
- Advantages:
- Update requests that are important to many dependents are processed earlier (BUSINESS IMPORTANCE).
- Updates with a low ratio get delayed, and if a new update arrives the older ones are dropped, which improves performance especially in heavily loaded environments (SCALABILITY).
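A small sketch of this scheduling policy (the priority queue, the cost values and the rule for dropping superseded updates are illustrative assumptions): pending updates are ordered by the score b(u)/C(u), and a newer queued update for the same data item causes the older one to be dropped.

    import heapq

    class Scheduler:
        def __init__(self):
            self.heap = []     # (-score, seq, item, value) so the highest score pops first
            self.latest = {}   # newest queued sequence number per data item
            self.seq = 0

        def enqueue(self, item, value, beneficiaries, cost):
            self.seq += 1
            self.latest[item] = self.seq          # older queued updates for 'item' become stale
            score = beneficiaries / cost          # b(u) / C(u)
            heapq.heappush(self.heap, (-score, self.seq, item, value))

        def next_update(self):
            while self.heap:
                neg_score, seq, item, value = heapq.heappop(self.heap)
                if self.latest.get(item) == seq:  # drop superseded (out-of-date) updates
                    return item, value
            return None

    s = Scheduler()
    s.enqueue("MSFT", 60.85, beneficiaries=8, cost=2.0)   # score 4.0
    s.enqueue("IBM", 91.10, beneficiaries=2, cost=2.0)    # score 1.0
    s.enqueue("MSFT", 60.95, beneficiaries=8, cost=2.0)   # supersedes the earlier MSFT update
    print(s.next_update())   # ('MSFT', 60.95): highest score, newest value
    print(s.next_update())   # ('IBM', 91.10)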
50 Reducing the Delay at a Repository
Better scheduling of disseminations
- Scheduling provides:
- A priority scheme and a business-importance approach that achieve better results.
- Like filtering, it improves scalability: some out-of-date update requests are discarded from the queue, which saves unnecessary computation and queueing delay.
51 Reducing the Delay at a Repository
- Experimental results
- Dependent ordering has a lower loss of fidelity than the simple algorithm. However, scheduling does better still (by up to 15%).
- Dependent ordering needs fewer pushes than the simple algorithm.
- The scheduling algorithm decreases computation delays because some updates are dropped from the queue when new updates arrive and the older ones become out of date.
- Fidelity loss with scheduling is shown with some numbers. Fidelity drops with an increase in the number of data items, but even with large increases in the number of data items and high update rates the loss of fidelity stays within a range of about 10%.
- This provides better scalability.
52 Reducing the Delay at a Repository
- Advantages of the better-performance approaches
- Approach 1: maintaining the dependents ordered by cr values
- Reduces the number of checks required for processing each update
- Reduces the number of pushes
- Approach 2: scheduling
- Reduces the overall delay to the end clients by processing updates which provide a higher benefit at a lower cost
- Gives a better choice in dropping updates, as low-score updates are dropped
- Due to lower propagation delay, it provides better scalability and degrades gracefully under unexpected heavy loads
53 Related Work
- A simple decision procedure is superior, because many complex algorithms and database systems take a lot of computation time to keep a data repository up to date.
- Some dynamic web data dissemination algorithms also use a push-based scheme. However, if coherency is used, scalability is improved, and another important feature is that data repositories do not need to cooperate with each other to maintain coherence information (it is up to date already!).
- This approach deals with rapidly changing dynamic data, while some similar approaches focus on web content that changes at slower time-scales.
- The strongest side of this approach is that it deals with the problem of failures and forms a resilient dissemination network.
54 Conclusion
- The key points in this architecture are:
- Design of a push-based dissemination scheme for time-varying data. Not all updates are disseminated to each repository; only the updates that meet the coherency requirements are pushed (EFFICIENT).
- Design of a cooperative dissemination network. This provides a resilient network: even if a failure occurs in the network, data coherency is not completely lost (RESILIENT).
- Intelligent filtering, scheduling and selective dissemination reduce the overhead in the network. This provides better scalability and makes the approach a good alternative for dynamic data publishing (SCALABLE).