Title: Data Quality and Query Cost in Pervasive Sensing Systems
1 Data Quality and Query Cost in Pervasive Sensing Systems
Bentley College, Computer Information Systems Dept.
Waltham, Massachusetts, USA
dyates_at_bentley.edu
2 Joint Work With
- Erich Nahum
  - IBM T.J. Watson Research Center
  - 19 Skyline Drive
  - Hawthorne, New York, USA
- James Kurose and Prashant Shenoy
  - Dept. of Computer Science
  - University of Massachusetts
  - Amherst, Massachusetts, USA
3 Talk Outline
- Data quality and query cost for pervasive sensing systems
- Motivation and introduction
- Pervasive sensing applications
- Resource-constrained sensor fields
- Sensor networks and backbone networks
- Data management techniques to conserve resources
- Sensor network data server and cache
- Query cost, data quality, delay, value deviation
- Cost and quality performance
- Summary and Conclusions
4 Research Contributions
- Define and quantify data quality and query cost performance in pervasive sensing systems
- Develop policies that approximate sensor field values using cached values for nearby locations
- Prove analytic upper bound on sensor field query rate
- Show cost and quality win-win for pervasive sensing applications for which response time is most important
- Show cost vs. quality tradeoff for sensing applications for which accuracy is most important
- Results are robust with respect to the manner in which the query workload changes
5 Pervasive Sensing Applications
- Microsensors, on-board processing, and wireless interfaces, feasible at very small scale, can monitor phenomena up close
- Enables spatially and temporally dense monitoring and control
- Pervasive sensing will reveal previously unobservable phenomena
Data center management
Manufacturing engineering
Environmental monitoring
Natural disaster response
Embedded, energy-constrained (wireless, small
form-factor), unattended systems
6 Sensors Embedded in Infrastructure
- The day after a moderate earthquake jolts the
city of San Francisco, building inspectors check
on the structural integrity of an office building
in the financial district. Sensors embedded in
the walls of the building to monitor and record
vibration data confirm that the structure is safe
to enter. (Intel 2005)
7 From Sensor Networks to Applications
- Sensor fields (blue), backbone (yellow), monitoring and control applications (red)
- Queries submitted from sensing applications
- Replies received from sensor fields
- Our focus: Data management at data server
[Figure labels: Sensing Application; Routers, Switches; Data server / Gateway (and cache); sensor fields (Sound, Light): embedded, energy-constrained (wireless, small form-factor), unattended systems]
8 Data Server Node Without Cache
[Figure: sensor field of sensors (s) with query locations l_1, l_2 and timestamps t_1, t_2; queries enter through the sensor network query queue and replies leave through the gateway reply queue]
Legend:
- l_i: query location i
- t_i: timestamp associated with value sampled in sensor field at location i
- s: sensor
9 Data Server Node Without Cache
End-to-end delay occurs between Query_m and Reply_m. Value deviation is between the value in Reply_m and the value at l_i as Reply_m leaves the gateway reply queue.
[Figure: sensor field of sensors (s) with query locations l_1, l_2 and timestamps t_1, t_2; Query_m enters through the sensor network query queue and Reply_m leaves through the gateway reply queue]
Legend:
- l_i: query location i
- t_i: timestamp associated with value sampled in sensor field at location i
- s: sensor
10 Data Server Node With Cache
For a cache hit or a miss, end-to-end delay occurs between Query_m and Reply_m. Also, value deviation is between the value in Reply_m and the value at l_i as Reply_m leaves the gateway reply queue.
[Figure: Query_m arrives through the gateway query queue; a hit is answered from the cache, while a miss or prefetch goes through the sensor network query queue to the sensor field (sensors s; locations l_1, l_2, l_3); updates and replies return through the cache update queue, and Reply_m leaves through the gateway reply queue]
Legend:
- l_i: query location i
- e_li: cache entry for query location l_i, holding (l_i, v_i, t_i)
- v_i: value in cache associated with location i
- t_i: timestamp of value associated with location i
- s: sensor
Locations l_1 and l_2 are cached in entries e_l1 and e_l2
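The two metrics on this slide can be made concrete with a small sketch. This is an illustration under stated assumptions, not the simulator used in the talk: `field_value`, the latencies, and the rule that a miss samples the field at query time are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    value: float      # v_i, value cached for the location
    timestamp: float  # t_i, time at which the value was sampled


def field_value(location, t):
    """Hypothetical environment: light level rising 10 Lux per second."""
    return 100.0 + 10.0 * t


def serve(cache, location, now, max_age, hit_latency=0.01, miss_latency=0.5):
    """Answer one query and return (value, delay, value_deviation).

    Latencies are made-up constants; a real system would measure them.
    """
    entry = cache.get(location)
    if entry is not None and now - entry.timestamp <= max_age:
        # Hit: reply with the cached value after a short delay.
        value, delay = entry.value, hit_latency
    else:
        # Miss: sample the sensor field (assumed to happen at query time),
        # update the cache, and reply after the longer round trip.
        value = field_value(location, now)
        cache[location] = CacheEntry(value, now)
        delay = miss_latency
    # Value deviation is measured as the reply leaves the gateway.
    deviation = abs(value - field_value(location, now + delay))
    return value, delay, deviation
```

In this toy setting a hit is faster but returns an older value, so it shows lower delay and higher value deviation than the miss that populated the entry.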
11 Query Cost and Data Quality
Cost to query location l_i is normalized such that
Normalized quality using softmax normalization
12 Caching and Lookup Policies
- All hits
- All misses
- Simple lookup
- Piggyback queries
- Greedy age-based lookup
- Greedy distance-based lookup
- Median-of-3 lookup
approximate lookups and queries
Policies incorporate an age parameter T. T can be 0, finite, or infinite.
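The slide names the policies without defining them. The sketch below shows one plausible reading of the two greedy lookups, assuming that locations are 2-D coordinates, that a cached entry is usable only if its age is at most T, and that "nearby" means within a hypothetical `radius`. None of these details come from the talk.

```python
import math


def usable_entries(cache, query_loc, now, max_age, radius):
    """Cached entries no older than max_age (T) and within radius of query_loc.

    cache maps a location (x, y) to a (value, timestamp) pair.
    """
    found = []
    for loc, (value, timestamp) in cache.items():
        fresh = now - timestamp <= max_age
        near = math.dist(loc, query_loc) <= radius
        if fresh and near:
            found.append((loc, value, timestamp))
    return found


def greedy_distance_lookup(cache, query_loc, now, max_age, radius):
    """Return the value of the nearest usable entry, or None on a miss."""
    candidates = usable_entries(cache, query_loc, now, max_age, radius)
    if not candidates:
        return None
    nearest = min(candidates, key=lambda c: math.dist(c[0], query_loc))
    return nearest[1]


def greedy_age_lookup(cache, query_loc, now, max_age, radius):
    """Return the value of the freshest usable entry, or None on a miss."""
    candidates = usable_entries(cache, query_loc, now, max_age, radius)
    if not candidates:
        return None
    freshest = max(candidates, key=lambda c: c[2])
    return freshest[1]
```

With `max_age=0` essentially every lookup misses, and with `max_age=math.inf` entries never expire, which matches the three cases for T listed on the slide.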
13 Research Contributions
- Defined and quantified data quality and query cost performance in pervasive sensing systems
- Developed policies that approximate sensor field values using cached values for nearby locations
- Prove analytic upper bound on sensor field query rate
- Show cost and quality win-win for pervasive sensing applications for which response time is most important
- Show cost vs. quality tradeoff for sensing applications for which accuracy is most important
- Results are robust with respect to the manner in which the query workload changes
14 Lab Trace Data
Trace data from multi-sensor motes deployed at
Intel Berkeley lab (Deshpande 2004)
15 Lab Environment and Workload
- 2.3 million readings taken over 35 days
- Use readings with largest changes in value in our simulator (light measured in Lux)
- Changes occur slowly relative to correlated changes (about 1 location every 1.4 seconds)
- But, range of values is large
- Applications determine values for A and T
16 Bounded Resource Consumption
- N is the set of locations in the sensor field
- Cache entry for each location is used by multiple queries for periods of T seconds (requires blocking behind pending queries)
- Sensor field query rate can be bounded, in queries per second
- Proof: Induction on size of N
- Sensor field transmissions dominate resource consumption
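The expression for the bound did not survive in the slide text. The sketch below assumes it is |N| / T queries per second, which follows if each location is fetched from the sensor field at most once per T seconds. The simulation is a hypothetical illustration, not the proof from the talk: fetches are treated as instantaneous, so blocking behind pending queries is modeled only implicitly.

```python
import math
import random


def count_field_queries(num_locations, max_age, query_rate, duration, seed=0):
    """Count sensor field queries when each cache entry is reused for max_age seconds.

    Application queries arrive as a Poisson process with rate query_rate and
    pick a location uniformly at random (both are assumptions of this sketch).
    """
    rng = random.Random(seed)
    last_fetch = {}      # location -> time of the most recent field query
    field_queries = 0
    now = rng.expovariate(query_rate)
    while now < duration:
        loc = rng.randrange(num_locations)
        fetched = last_fetch.get(loc)
        if fetched is None or now - fetched >= max_age:
            # Entry missing or expired: exactly one field query refreshes it.
            last_fetch[loc] = now
            field_queries += 1
        now += rng.expovariate(query_rate)
    return field_queries


def field_query_bound(num_locations, max_age, duration):
    """Assumed bound: at most one fetch per location per max_age seconds."""
    return num_locations * math.ceil(duration / max_age)
```

For example, with 50 locations and T = 90 sec, the assumed bound is 50 / 90, or about 0.56 field queries per second, no matter how fast application queries arrive.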
17 Data Quality Driven by Response Time
Picking a large value of A means delay is more important than value deviation
Consider normalized quality when A = 0.9
18 Cost and Quality Performance when Response Time drives Quality
Trace-driven Changes
A = 0.9, T = 90 sec
Query rate = 0.9 lps
Change rate = 1.4 lps
Approximate greedy lookups outperform other policies
There is a win-win here!
19 Delay when Response Time drives Quality
Trace-driven Changes
20 Research Contributions
- Defined and quantified data quality and query cost performance in pervasive sensing systems
- Developed policies that approximate sensor field values using cached values for nearby locations
- Proved analytic upper bound on sensor field query rate
- Showed cost and quality win-win for pervasive sensing applications for which response time is most important
- Show cost vs. quality tradeoff for sensing applications for which accuracy is most important
- Results are robust with respect to the manner in which the query workload changes
21 Data Quality Driven by Accuracy
Choosing a small value of A means value deviation is more important to data quality than delay
For example, consider normalized quality when A = 0.1
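The quality formula itself is not in the slide text. As a rough illustration of how A shifts the balance, the sketch below assumes quality is a weighted sum in which A weights normalized delay and 1 - A weights normalized value deviation, both already scaled to the range 0 to 1 with lower being better. The policy numbers are invented for the example and are not results from the talk.

```python
def quality(norm_delay, norm_deviation, a):
    """Assumed form: weighted sum of (1 - delay) and (1 - deviation), both in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("A must lie between 0 and 1")
    return a * (1.0 - norm_delay) + (1.0 - a) * (1.0 - norm_deviation)


# Hypothetical (normalized delay, normalized value deviation) per policy.
policies = {
    "approximate from cache": (0.2, 0.7),      # fast, less accurate
    "always query sensor field": (0.8, 0.1),   # slow, more accurate
}


def best_policy(a):
    """Name of the policy with the highest assumed quality for a given A."""
    return max(policies, key=lambda name: quality(*policies[name], a))
```

Under these assumptions `best_policy(0.9)` favors the cached approximation and `best_policy(0.1)` favors always querying the field, mirroring the two regimes the talk contrasts.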
22 Cost vs. Quality when Accuracy drives Quality
Trace-driven Changes
A = 0.1, T = 90 sec
Query rate = 0.9 lps
Change rate = 1.4 lps
There is a tradeoff between cost and quality here
23 Value Deviation when Accuracy drives Quality
Trace-driven Changes
Significant differences in accuracy between
policies
24 Cost and Quality Trends when Response Time drives Quality
Trace-driven Changes
A = 0.9, T = 9 sec
Query rate = 90, 9, and 0.9 lps
Again, there is a win-win here!
25 Cost vs. Quality Trends when Accuracy drives Quality
Trace-driven Changes
A = 0.1, T = 9 sec
Query rate = 90, 9, and 0.9 lps
Same relative performance
26 Talk Summary
- Define and quantify data quality and query cost performance in pervasive sensing systems
- Develop policies that approximate sensor field values using cached values for nearby locations
- Prove analytic upper bound on sensor field query rate
- Show cost and quality win-win for pervasive sensing applications for which response time is most important
- Show cost vs. quality tradeoff for sensing applications for which accuracy is most important
- Results are robust with respect to the manner in which the query workload changes
27 Thank You!
David J. Yates
Bentley College, Computer Information Systems Dept.
Waltham, Massachusetts, USA
dyates_at_bentley.edu
28 Emergency Response Applications
- Fire erupts in a warehouse in an industrial
section of town. A sensing system installed in
the building feeds detailed data to fire crews
arriving on the scene, describing the location,
characteristic and etiology of the fire, and
predicting its future path. The result:
firefighters are able to work quickly and safely
to bring the blaze under control. (Intel 2005)
29 Technology Market Trends
- Three of the 7 companies named by Gartner as "Cool Vendors in Emerging Trends and Technologies" in 2005 produced hardware and/or software for sensor networks (Reynolds et al. 2005)
- IDC has identified supply chain management as the largest sensor network market in the short-term and predicts that the domestic market for RFID sensors will exceed $1 billion in 2007 (C. Boone 2003)
30 Data Quality and Query Cost Research Issues
- What form do data quality and query cost performance take?
- Can we bound resource consumption?
- Which policies provide best cost and quality when value deviation is more important than delay?
- Which policies provide best performance when delay is more important than value deviation?
- How does the manner in which the environment changes impact performance?
31 Softmax Normalization
- Requires that we know only the mean and standard deviation for our system delays and value deviations
- Normalization makes transformed values lie in the range [0,1]
- Used in neural networks and data mining for pattern recognition and data classification (Bridle 1990, Bishop 1995, Han and Kamber 2000)
- Reaches "softly" towards maximum and minimum values, never quite getting there (Rodríguez 2004)
- Transformation is more or less linear in the middle range, and has a nonlinearity at both ends (Rodríguez 2004)
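A minimal sketch of softmax normalization in its common logistic form is given below: each sample is converted to a z-score and passed through the logistic function. The talk's exact variant is not shown in the slide text, so the formula and the function name `softmax_normalize` are assumptions of this example.

```python
import math
import statistics


def softmax_normalize(samples):
    """Map samples into (0, 1) using only their mean and standard deviation."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    if std == 0:
        # All samples are identical, so each sits at the midpoint.
        return [0.5 for _ in samples]
    # Logistic of the z-score: close to linear near the mean,
    # flattening towards 0 and 1 for outliers.
    return [1.0 / (1.0 + math.exp(-(x - mean) / std)) for x in samples]
```

A sample equal to the mean maps to 0.5, and even a very large outlier (for instance one slow reply among many fast ones) stays strictly below 1, which is the "never quite getting there" behavior noted above.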
32 Query Workload Model
- Query workload consists of polling component and random component
- Parameterize to yield many workloads proposed by others
  - e.g., Madd03 (Berkeley), Lu02 (Virginia), Deme03 (Cornell), Jami03 (MIT), Inta03 and Zhao03 (USC), Desh03 (CMU), Olst03 (Stanford)
- These components are specified using two parameters
  - the period of the polling component
  - the average query arrival rate for a process that represents the random component
- Example: 9 queries with fixed interarrival times + 81 queries with exponentially distributed interarrival times = 90 queries / second
- All locations are equally likely to be queried
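The two-component workload can be sketched as follows. This is a hypothetical generator, not the one used in the talk: the names `poll_period` and `random_rate` stand in for the two parameters, and both components pick locations uniformly at random.

```python
import random


def generate_workload(poll_period, random_rate, num_locations, duration, seed=0):
    """Return a time-sorted list of (time, location, component) query events.

    poll_period: seconds between polling queries (fixed interarrival times).
    random_rate: mean arrivals per second of the random component, whose
                 interarrival times are exponentially distributed.
    """
    rng = random.Random(seed)
    events = []
    if poll_period > 0:
        k = 1
        while k * poll_period < duration:
            events.append((k * poll_period, rng.randrange(num_locations), "poll"))
            k += 1
    if random_rate > 0:
        t = rng.expovariate(random_rate)
        while t < duration:
            events.append((t, rng.randrange(num_locations), "random"))
            t += rng.expovariate(random_rate)
    events.sort()
    return events
```

The example on the slide corresponds to `generate_workload(poll_period=1/9, random_rate=81, ...)`, which yields roughly 90 queries per second in total.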
33 Models for Changes to Environment
1. Changes at each location are independent
2. Changes at each location correlated in space and time: models developed at USC (Jindal 2004)
3. Changes taken from real-world sensor readings at Intel Berkeley lab (Deshpande 2004)
- Our focus: Models 2. and 3.
34 Delay when Accuracy drives Quality
Correlated Changes
Trace-driven Changes
Large "all misses" delay has an important impact on quality, but is discounted by the choice of A = 0.1
35 Results from Two Models
- For correlated and trace-driven sensor network models
  - When delay is more important than value deviation, policies that approximate values using cached values for nearby locations provide best cost and best quality performance
  - When value deviation is more important than delay, there is a cost vs. quality tradeoff
    - Policies that always query (and cache) the specified location provide the best quality performance
    - Policies that approximate values using cached values for nearby locations provide best cost performance
- What happens if we vary the query rate relative to the rate at which the environment changes?
36 Summary and Conclusions
- Data Quality and Query Cost in Pervasive Sensing Systems
  - Define and quantify data quality and query cost performance in Pervasive Sensing Systems
  - Blocking behind pending queries bounds sensor field query rate
  - When delay is more important than value deviation, policies that approximate values using cached values for nearby locations provide best cost and quality performance
  - When value deviation is more important than delay, there is a cost vs. quality tradeoff
    - Policies that always query (and cache) the specified location provide the best quality performance
    - Policies that approximate values using cached values for nearby locations provide best cost performance
  - Results are robust with respect to the manner in which the environment changes
37 References I
- (Christopher Bishop 1995) Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
- (C. Boone 2003) U.S. RFID for the Retail Supply Chain Spending Forecast and Analysis, 2003-2008, IDC, December 2003.
- (John Bridle 1990) Probabilistic interpretation of feed-forward classification network outputs, with relationships to statistical pattern recognition, In Neurocomputing: Algorithms, Architectures and Applications, Volume 6, Springer-Verlag, Berlin.
- (A. Deshpande, C. Guestrin, S. Madden, J.M. Hellerstein, W. Hong 2004) Model-Driven Data Acquisition in Sensor Networks, In International Conference on Very Large Data Bases (VLDB), Toronto, August 2004.
38 References II
- (J. Han and M. Kamber 2000) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, California.
- (Intel 2005) Intel Corp., Expanding Usage Models for Pervasive Sensing Systems, Technology@Intel Magazine, August 2005.
- (A. Jindal and K. Psounis 2004) Modeling spatially-correlated sensor network data, In IEEE International Conference on Sensor and Ad hoc Communications and Networks (SECON), Santa Clara, California, October 2004.
- (Reynolds et al. 2005) Martin Reynolds, Alan Mac Neela, Carol Rozwell, and Anne-Marie Roussel, Cool Vendors in Emerging Trends and Technologies, Gartner Research Report, March 2005.
39 References III
- (Caroline Rodríguez 2004) A computational environment for data preprocessing in supervised classification, M.Sc. Thesis, Department of Mathematics, University of Puerto Rico, Mayagüez, July 2004.
- (J. Sikander 2004) Microsoft RFID Technology Overview, Microsoft Corp., November 2004.