Title: Slide 1
1 PAPER PRESENTATION on "An Efficient and Resilient Approach to Filtering and Disseminating Streaming Data"
CMPE 521 Database Systems
Prepared by Mürsel Tasgin, Onur Kardes
2 Introduction
- The internet and the web are increasingly used to disseminate fast-changing data.
- Several examples of fast-changing data:
- sensors,
- traffic and weather information,
- stock prices,
- sports scores,
- health monitoring information
3 Introduction
- The properties of this data:
- Highly dynamic,
- Streaming,
- Aperiodic.
- Users are interested not only in monitoring streaming data but also in using it for on-line decision making.
4 Introduction
[Figure: a data source disseminating to Repository 1, Repository 2 and Repository 3.]
5 Introduction
- Services like Akamai.net and IBM's edge server technology are exemplars of such networks of repositories, which aim to provide better service by shifting most of the work to the edge of the network (closer to the end users).
- But although such systems scale quite well, if the data changes at a fast rate, the quality of service at a repository farther from the data source deteriorates.
6 Introduction
- In general:
- Replication can reduce the load on the sources,
- But replication of time-varying data introduces new challenges:
- Coherency
- Delays and scalability
7 Introduction
- Coherency requirement (cr): users specify the bound on the tolerable imprecision associated with each requested data item.
[Figure: example. The SOURCE has Microsoft at 60.85 at time 1143; Repository 1 serves USER 1 with Microsoft at 60.89 (time 1136) and Repository 2 serves USER 2 with Microsoft at 60.86 (time 1141).]
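As a minimal sketch (the helper and the cr values are illustrative, not from the slide), the coherency requirement boils down to a bound on how far a user's view may drift from the source value:

    def violates_cr(user_value, source_value, cr):
        # The user's view is acceptable as long as it deviates from the
        # source value by no more than the coherency requirement cr.
        return abs(user_value - source_value) > cr

    # Values from the slide's example: USER 1 sees 60.89 while the source is at 60.85.
    print(violates_cr(60.89, 60.85, cr=0.10))  # False: still within a cr of 0.10
    print(violates_cr(60.89, 60.85, cr=0.01))  # True: too stale for a cr of 0.01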
8 Introduction
- A coherency-preserving system must:
- deliver data that preserves the associated coherency requirements,
- be resilient to failures,
- be efficient.
- Necessary changes are pushed to the users instead of users polling the source independently.
9 Introduction
- Construction of an effective dissemination network of repositories
- A logical overlay network of repositories is created according to:
- the coherency needs of the users attached to each repository,
- the expected delays at each repository.
- This network is called a dynamic data dissemination graph (d3g).
10 Introduction
- Construction of an effective dissemination network of repositories
- The previous algorithm for building a d3g, called LeLA, was unable to cope with a large number of data items.
- A new algorithm (DiTA) to build dissemination networks that are scalable and resilient is introduced.
11 Introduction
- Construction of an effective dissemination network of repositories
- In DiTA, repositories with more stringent coherency requirements are placed closer to the source in the network, as they are likely to get more updates than the ones with looser coherency requirements.
- In DiTA, a dynamic data dissemination tree (d3t) is created for each data item x.
12 Introduction
Construction of an effective dissemination network of repositories
[Figure: a dissemination tree rooted at the SOURCE with Repository 1 (c = 0.2), Repository 2 (c = 0.3), Repository 3 (c = 0.8), Repository 4 (c = 0.7), Repository 5 (c = 0.9) and Repository 6 (c = 0.7); repositories with smaller (more stringent) c sit closer to the source.]
13 Introduction
- Provision for the dissemination of dynamic data in spite of failures in the overlay network
- To handle repository and communication link failures, back-up parents are used.
- A back-up parent is asked to deliver data with a coherency that is less stringent than that associated with the parent.
14 Introduction
Provision for the dissemination of dynamic data in spite of failures in the overlay network
[Figure: repositories serving sets of data items (x,y,z,t; a,b,c,x; y,z,t; x,t; z), with one repository marked as Parent and another as Back-up Parent.]
15 Introduction
- Efficient filtering and scheduling techniques for repositories
- Normally a repository receives updates and selectively disseminates them to its downstream dependents.
- It is not always necessary to disseminate the exact values of the most recent updates, as long as the values presented preserve the coherency of the data.
16 The Basic Framework: Data Coherency and Overlay Network
17 The Basic Framework: Data Coherency and Overlay Network
- A coherency requirement (c) is associated with a data item to denote the maximum permissible deviation of the user's view from the value of data x at the source.
- c can be specified in terms of:
- time (values should never be out-of-sync by more than 5 sec.), or
- value (weather information where the temperature value should never be out-of-sync by more than 2 degrees).
18 The Basic Framework: Data Coherency and Overlay Network
- Each data item in the repository from which a user obtains data must be refreshed in such a way that the user-specified coherency requirement is maintained, i.e. |P(t) - S(t)| < c, where P(t) is the value presented to the user and S(t) is the value at the source at time t.
- The fidelity f observed by a user can be defined as the total length of time for which the above inequality holds.
19 The Basic Framework: Data Coherency and Overlay Network
- Assume x is served by a single source.
- Repositories R1, ..., Rn are interested in x.
- These repositories in turn serve a subset of the remaining repositories, such that the resulting network is in the form of a tree rooted at the source and consisting of the repositories R1, ..., Rn.
- This defines a parent-dependent relationship.
20 The Basic Framework: Data Coherency and Overlay Network
- Since a repository disseminates updates to its users and dependents, the coherency requirement of a repository should be the most stringent requirement that it has to serve.
- When a data change occurs at the source, the source checks which of its direct and indirect dependents are interested in the change and pushes the change to them.
21 Building a d3t
- Start with a physical layout of the communication network in the form of a graph, where the graph consists of a set of sources, repositories and the underlying network.
- Try to build a d3t for a data item x.
- The root of the d3t will be the source which serves x.
- A repository P serving repository Q with data item x is called the parent of Q, and Q is called the dependent of P for x.
22 Building a d3t
[Figure: the source for data item x and the set of data items in each repository.]
23 Building a d3t
- A repository should ideally serve at least as many unique (dependent, data item) pairs as the number of data items served to it.
- If a repository is currently serving fewer than this fixed number, then we say that the repository has the resources to serve a new dependent.
[Table: dependents of repository R1 and the data items served to them: R7 - x, R11 - y, R18 - x, R9 - z, R10 - t, R21 - x.]
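A minimal sketch of this resource rule (the class and the data are illustrative; the set of items served to R1 is not given on the slide): a repository can accept a new dependent while the number of unique (dependent, data item) pairs it serves is below the number of data items served to it.

    class Repository:
        def __init__(self, items_served_to_me):
            # Data items this repository itself receives from its parents.
            self.items_served_to_me = set(items_served_to_me)
            # Unique (dependent, data item) pairs this repository serves.
            self.served_pairs = set()

        def has_resources(self):
            # Below the ideal load it can still take on a new dependent.
            return len(self.served_pairs) < len(self.items_served_to_me)

    # Example mirroring the table above for R1 (the seven items are hypothetical).
    r1 = Repository(items_served_to_me={"x", "y", "z", "t", "u", "v", "w"})
    for pair in [("R7", "x"), ("R11", "y"), ("R18", "x"), ("R9", "z"), ("R10", "t"), ("R21", "x")]:
        r1.served_pairs.add(pair)
    print(r1.has_resources())  # True: 6 pairs < 7 items served to R1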
24 Building a d3t
[Figure: example insertion into a d3t. The SOURCE is the root, with R4 (c = 0.1) and R5 (c = 0.4) below it and R7 (c = 0.8), R8 (c = 0.6), R9 (c = 0.7) and R10 (c = 0.3) further down; subtrees are annotated with their Max(c) values (0.8, 0.7, 0.6). Since c_R6 > c_R10, R10 is replaced with R6 and R6 is pushed down.]
25 Building a d3t
[Figure: the resulting tree, with the SOURCE at the root, R4 (c = 0.1) and R5 (c = 0.4) below it, and R6 (c = 0.5), R7 (c = 0.8), R8 (c = 0.6) and R9 (c = 0.7) further down; subtrees are annotated with their Max(c) values (0.8, 0.7, 0.6, 0.5).]
This algorithm is called the Data-Item-at-a-Time Algorithm (DiTA).
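The slides give no pseudocode for DiTA, so the following is only an illustrative sketch of the stated placement rule (more stringent coherency requirements sit closer to the source); it ignores resource limits, link delays and the recursive descent of the real algorithm, and all names are hypothetical.

    class Node:
        def __init__(self, name, c):
            self.name = name      # repository name
            self.c = c            # coherency requirement for x (smaller = more stringent)
            self.children = []

    def insert(parent, new):
        for i, child in enumerate(parent.children):
            if new.c < child.c:
                # The newcomer has the more stringent requirement: it takes the
                # child's place and the displaced child becomes its dependent.
                parent.children[i] = new
                new.children.append(child)
                return
        # No looser child found: the newcomer becomes a new dependent of 'parent'.
        parent.children.append(new)

    source = Node("SOURCE", 0.0)
    for name, c in [("R4", 0.1), ("R6", 0.5), ("R10", 0.3)]:
        insert(source, Node(name, c))
    # R10 (c = 0.3) displaces the looser R6 (c = 0.5), which is pushed one level down.
    print([n.name for n in source.children])              # ['R4', 'R10']
    print([n.name for n in source.children[1].children])  # ['R6']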
26 Building a d3t
Trace collection procedure and characteristics
- Real-world stock price streams from http://finance.yahoo.com are used.
- 10,000 values were polled over 1,000 traces; approximately one new data value is obtained per second.
27 Building a d3t
Repositories: data, coherency and cooperation characteristics
- A coherency requirement c is associated with each of the chosen data items.
- The c's associated with data in a repository are a mix of stringent tolerances (varying from 0.01 to 0.05) and less stringent tolerances (varying from 0.5 to 0.99).
- T% of the data items have stringent coherency requirements at each repository (the remaining (100 - T)% of the data items have less stringent coherency requirements).
28 Building a d3t
Physical network topology and delays
- The router topology was generated using BRITE (http://www.cs.bu.edu/brite).
- The repositories and the sources are selected randomly.
- Node-node communication delays are derived from a Pareto distribution: x = (1/u)^(1/a) * x1 with u uniform in (0, 1) and a = x̄ / (x̄ - x1).
29 Building a d3t
Physical network topology and delays
- x̄ is the mean delay and x1 is the minimum delay a link can have.
- In the experiments, x̄ = 15 ms and x1 = 2 ms.
- The computational delay for a dissemination is taken to be 12.5 ms.
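As a sketch of this delay model (the constants are a best-effort reading of the slide, so treat them as illustrative), link delays can be drawn by inverse-transform sampling of the Pareto distribution:

    import random

    def pareto_delay(mean_ms=15.0, min_ms=2.0):
        # Shape parameter chosen so the distribution has the desired mean:
        # mean = a * min / (a - 1)  =>  a = mean / (mean - min).
        a = mean_ms / (mean_ms - min_ms)
        u = 1.0 - random.random()            # uniform in (0, 1], avoids division by zero
        return min_ms * (1.0 / u) ** (1.0 / a)

    delays = [pareto_delay() for _ in range(5)]
    print(delays)  # heavy-tailed link delays, each at least 2 ms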
30 Building a d3t
Metrics
- The key metric is the loss in fidelity of the data.
- Fidelity is the length of time (as a percentage of the total time) for which the inequality |P(t) - S(t)| < c holds.
- The fidelity of a repository is the mean over all data items stored in that repository.
- The fidelity of the system is the mean fidelity of all repositories.
- Obviously, the loss of fidelity is (100 - fidelity).
- Another metric is the number of messages in the system (system load).
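A minimal sketch of this metric (the trace format and the values are invented for illustration): fidelity is the percentage of the observation period during which the repository's value stays within c of the source's value.

    def fidelity_percent(samples, c):
        """samples: list of (duration, repo_value, source_value) over the observation period."""
        total = sum(d for d, _, _ in samples)
        in_sync = sum(d for d, p, s in samples if abs(p - s) < c)
        return 100.0 * in_sync / total

    trace = [(10, 60.89, 60.85), (5, 60.89, 61.10), (15, 61.05, 61.10)]  # (seconds, P(t), S(t))
    f = fidelity_percent(trace, c=0.1)
    print(f, "% fidelity;", 100 - f, "% loss in fidelity")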
31 Building a d3t
Performance Evaluation
- For the base performance measurement, 600 routers, 100 repositories and 4 servers were used.
- The total number of data items served by the servers was varied from 100 to 1,000.
- The T parameter was varied from 20% to 80%.
- The previous algorithm, LeLA, was used as a benchmark.
32 Building a d3t
Performance Evaluation
- Each node in DiTA does less work than in LeLA.
- Thus, in DiTA the height of the dissemination tree will be greater.
- So, when computational delays are low but link delays are large, LeLA may perform better.
- But this happens only for negligible computational delays (0.5 ms) and very high link delays (110 ms).
33 Enhancing the Resiliency of the Repository Network
- Active backups vs. passive backups
- Passive backups may increase the load, which causes loss in fidelity.
- So active backup parents are used.
- A backup parent serves data to a dependent Q with a coherency cB > c.
34 Enhancing the Resiliency of the Repository Network
- If all changes are smaller than cB, the dependent cannot know when its parent P fails. So P should send periodic "I'm alive" messages.
- Once P fails, Q requests B to serve it the data at c. When P recovers from the failure, Q requests B to serve the data item at cB again.
- In this approach there is no backup for backups, so when both P and B fail, Q cannot get any updates.
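A minimal sketch of this failover, seen from the dependent Q (class names, the heartbeat timeout and the request_service call are hypothetical, not from the paper):

    import time

    class Peer:
        def __init__(self, name):
            self.name = name
        def request_service(self, dependent, coherency):
            print(f"{self.name} now serves {dependent.name} at coherency {coherency}")

    class Dependent:
        TIMEOUT = 5.0  # seconds without an "I'm alive" message before P is presumed failed

        def __init__(self, name, parent, backup, c, k=2):
            self.name, self.parent, self.backup = name, parent, backup
            self.c, self.cb = c, k * c          # parent serves at c, backup at cB = k * c
            self.last_heartbeat = time.time()
            self.failed_over = False

        def on_heartbeat(self):
            self.last_heartbeat = time.time()
            if self.failed_over:                 # P recovered: backup drops back to cB
                self.backup.request_service(self, self.cb)
                self.failed_over = False

        def check_parent(self, now=None):
            now = time.time() if now is None else now
            if not self.failed_over and now - self.last_heartbeat > self.TIMEOUT:
                self.backup.request_service(self, self.c)   # P presumed failed: B serves at c
                self.failed_over = True

    q = Dependent("Q", parent=Peer("P"), backup=Peer("B"), c=0.1)
    q.check_parent(now=time.time() + 10)   # no heartbeat for 10 s -> fail over to B at c = 0.1
    q.on_heartbeat()                       # P is back -> B returns to cB = 0.2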
35 Enhancing the Resiliency of the Repository Network
Choice of cB using a probabilistic model
- For the sake of simplicity, cB = k * c.
- Here, the choice of k is important.
36 Enhancing the Resiliency of the Repository Network
Choice of cB using a probabilistic model
- Assuming that the data values change with uniform probability, and
- using a Markov chain model:
- Misses = 2k^2 - 2
- 2k^2 - 2 is the number of updates a dependent will miss before it detects that there is a failure.
- According to the experiments, this number is rather pessimistic (nearly an upper limit).
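For example, with the k = 2 used later in the performance evaluation, Misses = 2*2^2 - 2 = 6, i.e. a dependent may miss up to about six updates before it detects that its parent has failed:

    def misses(k):
        # Upper-bound estimate of updates missed before a failure is detected.
        return 2 * k ** 2 - 2

    print(misses(2))  # 6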
37 Enhancing the Resiliency of the Repository Network
Choice of backup parents
[Figure: choosing a backup parent for Q. Q's parent is P, whose own parent is R; P's siblings B and C are the candidate backup parents, and one of them is chosen randomly.]
38 Enhancing the Resiliency of the Repository Network
Choice of backup parents
- In case the coherency at which Q wants x from B is less (tighter) than the coherency at which B itself wants x,
- the parent of B is asked to serve x to Q with the required tighter coherency.
- An advantage of choosing a sibling is that the change in coherency requirement is not percolated all the way up to the source.
- However, if an ancestor of P and B is heavily loaded, then the delay due to the load will be reflected in the updates of both P and B. This might result in additional loss in fidelity.
39 Enhancing the Resiliency of the Repository Network
Effect of repository failures on loss of fidelity
- Because these kinds of failures are memory-less, an exponential probability distribution is used for simulating them.
- Pr(X > t) = e^(-λt)
- λ = λ1 for the time to failure
- λ = λ2 for the time to recovery
- In this approach link failures are not taken into account, so the model is incomplete.
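A small sketch of this failure model (the simulation loop is illustrative; the rates are the ones quoted in the later evaluation slides): up and down periods are drawn from exponential distributions with rates λ1 and λ2.

    import random

    def simulate_uptime(lambda_fail=0.0001, lambda_recover=2.0, horizon=10_000.0):
        """Alternate exponentially distributed up/down periods; return the fraction of time up."""
        t, up_time, alive = 0.0, 0.0, True
        while t < horizon:
            rate = lambda_fail if alive else lambda_recover
            period = random.expovariate(rate)     # Pr(X > t) = exp(-rate * t)
            period = min(period, horizon - t)
            if alive:
                up_time += period
            t += period
            alive = not alive
        return up_time / horizon

    print(simulate_uptime())  # close to 1.0: failures are rare and recovery is fast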
40 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- The effect of adding resiliency is shown.
- k = 2 is used.
- When 100 data items are used, 23% of the updates sent by backups are disseminated.
- Some updates sent by backups arrived before those sent by the parents.
41 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- But when the backup parents are loaded (> 400 data items), their updates are of no use and increase the loss of fidelity.
- The dependent should handle this by time-stamping the updates.
42 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- During the experiment, about 80-90% of the repositories experienced at least one failure,
- and the maximum number of failures in the system at any given time for λ2 = 0.001 was around 12.
- For λ2 = 0.01 the maximum number of failures was 5, and for λ2 = 0.1 the maximum was 2.
43 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- The effect of quick recovery is shown.
- λ1 = 0.0001 and λ2 = 2
- For high coherency requirements, resiliency improves fidelity even for transient failures.
44 Enhancing the Resiliency of the Repository Network
Performance Evaluation
- However, with resiliency and a very large number of data items (e.g., 1,000), fidelity drops.
- This is because, at this point, the cost of resiliency exceeds the benefits obtained from it, and hence the loss in fidelity increases.
45 Reducing the Delay at a Repository
- Delays:
- Queuing delay: the time between the arrival of an update and the time its processing starts.
- Processing delay: check delay (deciding whether the update should be processed) + computation delay (the delay of computing the update and pushing data to the dependents).
46 Reducing the Delay at a Repository
- Question: How can we reduce the average delays to improve fidelity?
- This can be done by:
- Better filtering, i.e. reducing the processing delay in determining if an update needs to be disseminated to one or more dependents,
- Better scheduling of disseminations.
47 Reducing the Delay at a Repository
- For each dependent, a repository maintains the coherency requirement (cr) and the last value pushed to it; the upper bound of a dependent's window is (last pushed value + cr) and the lower bound is (last pushed value - cr).
- The cr values of the dependents residing at the repository are kept sorted, e.g. C1 = 0.7, C2 = 0.6, C3 = 0.5, C4 = 0.3, C5 = 0.1, C6 = 0.05.
- Algorithm to find the dependents to disseminate data to: scan the sorted list for the first dependent (the one with the largest cr) that needs the update to be disseminated.
- A rule relating the windows must hold for every window; if an update violates this rule, a pseudo value is generated and pushed as the actual value.
- This technique is called dependent ordering.
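A minimal sketch of dependent ordering (a simplification, not the paper's exact algorithm; it assumes the dependents' windows have been kept consistent, so in the example all dependents share the same last pushed value):

    def dependents_to_push(dependents, new_value):
        """dependents: sorted by decreasing cr; each entry has 'name', 'cr', 'last_pushed'.

        Once the first (largest-cr) dependent that needs the update is found, the
        dependents with smaller cr are taken to need it too, so the per-dependent
        checks can stop there; this is what keeping the dependents ordered buys us.
        """
        for i, d in enumerate(dependents):
            if abs(new_value - d["last_pushed"]) >= d["cr"]:
                return [x["name"] for x in dependents[i:]]
        return []

    deps = [{"name": "C1", "cr": 0.7, "last_pushed": 60.0},
            {"name": "C4", "cr": 0.3, "last_pushed": 60.0},
            {"name": "C6", "cr": 0.05, "last_pushed": 60.0}]
    print(dependents_to_push(deps, 60.4))   # ['C4', 'C6']: C1's window still covers the new value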
48 Reducing the Delay at a Repository
- Better filtering provides:
- Sending updates of dynamic data only to the end users who are actually interested in that update.
- By filtering, no garbage data flows on the network (no flooding of data over the network). This improves communication time in the network and provides better response times.
- With the help of filtering, a more scalable system can be established that resists unexpected heavy loads.
49 Reducing the Delay at a Repository
- Better scheduling of disseminations
[Figure: the total delay of processing updates u1 and u2; C(u) denotes the cost (delay) of processing an update and b(u) the beneficiaries of an update.]
- Approach:
- Instead of standard queueing of the update requests, a form of prioritization gives better performance: each update is given the SCORE b(u)/C(u).
- Each update request is scheduled according to this score. b(u) is the number of dependents that will receive the update, and C(u) is the cost of disseminating it to all dependents. The b(u) values are stored at each repository, so they are effectively precomputed.
- Advantages:
- Update requests that are important to many dependents are processed earlier (BUSINESS IMPORTANCE).
- Updates with a low ratio get delayed, and if a new update arrives the older ones are dropped, which improves performance especially in heavily loaded environments (SCALABILITY).
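A small sketch of this scheduling policy (the priority queue, the cost values and the rule for dropping superseded updates are illustrative assumptions): pending updates are ordered by the score b(u)/C(u), and a newer queued update for the same data item causes the older one to be dropped.

    import heapq

    class Scheduler:
        def __init__(self):
            self.heap = []     # (-score, seq, item, value) so the highest score pops first
            self.latest = {}   # newest queued sequence number per data item
            self.seq = 0

        def enqueue(self, item, value, beneficiaries, cost):
            self.seq += 1
            self.latest[item] = self.seq          # older queued updates for 'item' become stale
            score = beneficiaries / cost          # b(u) / C(u)
            heapq.heappush(self.heap, (-score, self.seq, item, value))

        def next_update(self):
            while self.heap:
                neg_score, seq, item, value = heapq.heappop(self.heap)
                if self.latest.get(item) == seq:  # drop superseded (out-of-date) updates
                    return item, value
            return None

    s = Scheduler()
    s.enqueue("MSFT", 60.85, beneficiaries=8, cost=2.0)   # score 4.0
    s.enqueue("IBM", 91.10, beneficiaries=2, cost=2.0)    # score 1.0
    s.enqueue("MSFT", 60.95, beneficiaries=8, cost=2.0)   # supersedes the earlier MSFT update
    print(s.next_update())   # ('MSFT', 60.95): highest score, newest value
    print(s.next_update())   # ('IBM', 91.10)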
50 Reducing the Delay at a Repository
Better scheduling of disseminations
- Scheduling provides:
- A priority scheme and a business-importance approach that achieve better results.
- Like filtering, it improves scalability: some out-of-date update requests are discarded from the queue, which saves unnecessary computation and queueing delay.
51 Reducing the Delay at a Repository
- Experimental results
- Dependent ordering has a lower loss of fidelity than the simple algorithm. However, scheduling does better still (by up to 15%).
- Dependent ordering needs fewer pushes than the simple algorithm.
- The scheduling algorithm decreases computation delays because some updates are dropped from the queue when new updates arrive and the older ones become out of date.
- Fidelity loss with scheduling is shown with some numbers. Fidelity drops with an increase in the number of data items, but even with large increases in the number of data items and high update rates the loss of fidelity stays within a range of about 10%.
- This provides better scalability.
52 Reducing the Delay at a Repository
- Advantages of the better-performance approaches
- Approach 1: maintaining the dependents ordered by cr values
- Reduces the number of checks required for processing each update
- Reduces the number of pushes
- Approach 2: scheduling
- Reduces the overall delay to the end clients by processing updates which provide a higher benefit at a lower cost
- Gives a better choice in dropping updates, as low-score updates are dropped
- Due to lower propagation delay, it provides better scalability and degrades gracefully under unexpected heavy loads
53 Related Work
- A simple decision procedure is superior, because many complex algorithms and database systems take a lot of computation time to keep a data repository up to date.
- Some dynamic web data dissemination algorithms also use a push-based scheme. However, if coherency is used, scalability is improved, and another important feature is that data repositories do not need to cooperate with each other to maintain coherence information (it is up to date already!).
- This approach deals with rapidly changing dynamic data, while some similar approaches focus on web content that changes at slower time-scales.
- The strongest side of this approach is that it deals with the problem of failures and forms a resilient dissemination network.
54 Conclusion
- The key points in this architecture are:
- Design of a push-based dissemination scheme for time-varying data. Not all updates are disseminated to each repository; only the updates that meet the coherency requirements are pushed (EFFICIENT).
- Design of a cooperative dissemination network. This provides a resilient network: even if a failure occurs in the network, data coherency is not completely lost (RESILIENT).
- Intelligent filtering, scheduling and selective dissemination reduce the overhead in the network. This provides better scalability and makes the approach a good alternative for dynamic data publishing (SCALABLE).