Title: Cost Conscious Cleaning of Massive RFID Data Sets
1Cost Conscious Cleaning of Massive RFID Data Sets
- Hector Gonzalez, Jiawei Han, Xuehua Shen
- University of Illinois at Urbana Champaign
- Department of Computer Science
- The Database and Information System Laboratory
- ICDE 2007
2Outline
- Motivation
- RFID Data
- Error Sources
- Cleaning Methods
- Smoothing methods
- Rule based Methods
- DBN based methods
- Cost conscious cleaning
- Architecture
- Cleaning Methods
- Cleaning Sequence
- Cleaning Plan Induction
- Performance Study
3Motivation
- The reliability of current RFID systems is far
from optimal.
- Under a wide variety of environments more than
50% of all tag readings are missed.
- The volume of data generated by RFID systems is
huge.
- A large retailer with hundreds of readers can
generate thousands of tag readings every second.
- An accurate and efficient cleaning process is
essential to the successful implementation of
RFID technology.
4Example
- A large retailer with RFID readers at warehouses,
distribution centers, and store backrooms.
- A variety of factors impact correct tag
detections:
- Diverse reader/tag manufacturers and generations
- Moving (conveyor belts, doors) and static tags
(shelves)
- Different levels of RF noise caused by metal or
water in the environment or in products
- No single method can efficiently clean such a
large volume of data, generated under such
diverse circumstances.
5RFID Data
- Readers conduct interrogation cycles at periodic
intervals:
- An RF signal is issued.
- Tags awake and transmit via RF their EPC
(electronic product code).
- A singulation protocol is used to prevent tag
collisions.
- In order to improve accuracy, multiple
interrogation cycles are issued during a read cycle.
- Readings of the form (EPC, Reader, Time) are
generated at the end of each read cycle. Readings
have extra information such as:
- Total responses obtained during the read cycle
- Antenna used for detection, tag type, and signal
strength
6Error Sources
- There are two types of errors:
- False negatives: a tag that is present is not
detected.
- False positives: a tag that is not present is detected.
- Why do we observe errors?
- Collisions: multiple tags transmit
simultaneously, or multiple readers transmit
simultaneously.
- Environment interference: metal or water near the
tag or reader causes RF interference.
- Physical configuration: the tag moves too
quickly, or is located in a blind spot.
- Logical errors: a door reader detects tags that
go nearby but not through.
7Cleaning Methods
- A cleaning method M is a classifier that assigns
a location to tag cases.
- A tag case is a tuple of the form
(<EPC, time>, <f_1, f_2, …, f_k>),
where each f_i is a feature (e.g. tag type,
signal strength).
- The label assigned to each tag case is of the
form (<EPC, time>: location, confidence),
where confidence is in the range 0.0 to 1.0 and
indicates the level of certainty about the
location.
- Terminal classifiers do not provide a confidence
value.
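The tag case and label tuples can be sketched as plain data types. This is only an illustration under assumptions: the names `TagCase`, `Label`, `CleaningMethod`, and the toy method `always_at_reader` are hypothetical and do not come from the slides, and a missing confidence is modeled here as `None`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class TagCase:
    """One tag case: (<EPC, time>, <f_1, ..., f_k>)."""
    epc: str
    time: int
    features: Dict[str, object]


@dataclass(frozen=True)
class Label:
    """Output of a cleaning method for one tag case."""
    epc: str
    time: int
    location: str
    confidence: Optional[float]  # None models a terminal classifier (assumption)


# Hypothetical signature: a cleaning method maps a tag case to a label.
CleaningMethod = Callable[[TagCase], Label]


def always_at_reader(case: TagCase) -> Label:
    """Toy terminal method: place the tag at the reader that saw it."""
    return Label(case.epc, case.time, str(case.features["reader"]), None)


case = TagCase("epc-001", 17, {"reader": "door_1", "signal_strength": 0.8})
label = always_at_reader(case)
```

Keeping features in a dictionary lets different readers attach different feature sets without changing the type.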
8Fixed size smoothing window
- Fixed window smoothing:
- The window is made up of the last k (fixed) read
cycles.
- If there is any reading inside the window, mark
the tag as present.
- Problem:
- Difficult to define the best window size for
different conditions
- Benefit:
- Cheap method to apply; the only requirement is to
remember the last k readings
[Figure: timelines labeled Truth, Readings, and Smooth]
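A minimal sketch of fixed window smoothing, assuming readings arrive as one 0/1 value per read cycle and that the window includes the current cycle. The function name `fixed_window_smooth` is hypothetical.

```python
from typing import List


def fixed_window_smooth(readings: List[int], k: int) -> List[int]:
    """Mark the tag present at cycle t if it was read in any of the last k cycles."""
    if k < 1:
        raise ValueError("k must be at least 1")
    smoothed = []
    for t in range(len(readings)):
        # Window covers cycles t-k+1 .. t (shorter at the start of the stream).
        window = readings[max(0, t - k + 1): t + 1]
        smoothed.append(1 if any(window) else 0)
    return smoothed


raw = [1, 0, 0, 1, 0, 0, 0, 0]
smoothed = fixed_window_smooth(raw, k=3)
```

With `k=3` the two missed readings between detections are filled in, and the tag is also kept present for two cycles after its last reading, which is the window size trade-off noted above.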
9Adaptive window size smoothing
- Adaptive window size smoothing:
- Change the size of the window according to the
observed probability of tag detections.
- Use a binomial model:
- Let p be the probability of detecting a present
tag.
- Choose window size w such that
(1 - p)^w < threshold.
- Benefit: we adapt the window size to the current
conditions.
- Problems:
- It can be expensive to store and maintain p for
every single tag in the system.
- All readings inside the window have the same
weight; it is better to put more weight on recent
readings.
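The window size rule can be sketched as a search for the smallest w with (1 - p)^w below the threshold. The function name `window_size` and the example values of p and the threshold are illustrative assumptions.

```python
def window_size(p: float, threshold: float) -> int:
    """Smallest window w such that (1 - p)**w < threshold."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be in (0, 1)")
    miss = 1.0 - p  # probability of missing a present tag in one read cycle
    w = 1
    while miss ** w >= threshold:
        w += 1
    return w


w_noisy = window_size(p=0.5, threshold=0.05)
w_clean = window_size(p=0.9, threshold=0.05)
```

A lower detection probability yields a longer window, so tags in noisy areas are smoothed over more read cycles.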
10Rule Based Cleaning
- Rules can be derived from the data or given by a
domain expert:
- A door reader should only recognize tags with a
bell shaped signal strength.
- The shelf reader should only recognize items that
stay there for more than 5 minutes.
- Rules can be derived from an RFID warehouse:
- Flowgraphs can be used to complete missing
readings and to decide location conflicts.
11DBN Based Cleaning
- We can model tag detections using a dynamic but
hidden process.
- Tag readings correspond to noisy observations on
the true, but hidden, location of the tag.
- We dynamically update our belief on tag location
based on the sequence of observations.
- More recent observations weigh more heavily on our
belief.
- DBNs allow us to differentiate between the
following two cases:
[Figure: two reading sequences, labeled "No question, detect tag!" and "Should we detect this tag?"]
12DBN Structure
- We learn the transition and observation models
from data.
- DBNs can represent complex structures, e.g.,
variables for RF noise and tag speed interacting
with detection.
[Figure: DBN structure. Hidden nodes: Present t-1, Present t (transition model). Observed nodes: Detect t-1, Detect t (observation model).]
13Belief state updating
Belief update equation, with terms:
- Belief at time t+1 given all evidence up to t+1
- Observation model
- Update to belief state given transition model
The belief state is updated dynamically: we do not
need to remember a window of observations, we
only need to keep the latest belief state.
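The update can be sketched for a two-state model (tag present or absent) with the standard forward filtering recursion: predict with the transition model, then weight by the observation model and normalize. This is an assumption about the form of the equation, not a transcription of the slide. The function name `update_belief` and all probability values are illustrative. In the talk these models are learned from data.

```python
def update_belief(belief: float, detected: bool,
                  stay: float = 0.9, arrive: float = 0.1,
                  p_detect: float = 0.6, p_false: float = 0.05) -> float:
    """Return P(present at t+1 | evidence up to t+1) from the previous belief.

    stay     = P(present at t+1 | present at t)   (transition model, assumed)
    arrive   = P(present at t+1 | absent at t)    (transition model, assumed)
    p_detect = P(detect | present)                (observation model, assumed)
    p_false  = P(detect | absent)                 (observation model, assumed)
    """
    # Prediction step: push the previous belief through the transition model.
    predicted = stay * belief + arrive * (1.0 - belief)
    # Observation likelihoods for the current reading.
    like_present = p_detect if detected else 1.0 - p_detect
    like_absent = p_false if detected else 1.0 - p_false
    # Correction step: weight by the likelihoods and normalize.
    numerator = like_present * predicted
    return numerator / (numerator + like_absent * (1.0 - predicted))


belief = 0.5
history = []
for reading in [True, False, False, True]:
    belief = update_belief(belief, reading)
    history.append(belief)
```

Only `belief` is carried from one read cycle to the next, which matches the point that no window of observations needs to be stored.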
14Cost Conscious Cleaning
- Given a collection of cleaning methods, a set of
labeled tag cases, and a cost model for each
cleaning method, design an efficient cleaning
plan that defines the conditions under which each
cleaning method or sequence of cleaning methods
should be applied
15System Architecture
[Figure: system architecture with an offline part (Labeled Data, Cleaning Methods, Cleaning Plan Induction) and an online part (RFID Stream, Apply Plan, Clean Data)]
16Cost Model
- In order to apply a cleaning method to a tag case
we need to incur a cost:
- Classification cost: the cost, in terms of CPU and
storage, of labeling each tag case.
- Amortized per-tuple training cost: the cost to train
the cleaning method, e.g., in a DBN we need to
learn the transition and observation models.
- Error cost:
- When we make an error in deciding the location of
a tag, a cost is incurred.
- Error cost can be a scalar or a function of the
distance of the correct location to the predicted
one, or even the price of the item.
17Cleaning Sequence
- Ordered application of cleaning methods from a set
M to a set of tag cases D:
- S_{D,M} = M_{s1} → M_{s2} → … → M_{sk}
- Apply M_{s1} to the entire data set D
- Apply M_{s2} to the cases that M_{s1} failed to
classify
- …
- Apply M_{sk} to the cases that every other method
failed to classify
- The cost of applying a cleaning sequence, C(S_{D,M}),
is the cost of applying each method as described
plus the error cost on tag cases misclassified by
every method.
- The optimal cleaning sequence S_{D,M} is the cleaning
sequence with minimal cost among all possible
cleaning sequences given D and M.
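The ordered application can be sketched as a cascade in which each method only sees the cases left unclassified by the methods before it. Here "failed to classify" is modeled as a method returning `None`, which is an assumption. The function `apply_sequence` and the two toy methods are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical signature: a method returns a location, or None if it cannot decide.
Method = Callable[[dict], Optional[str]]


def apply_sequence(cases: Sequence[dict],
                   methods: Sequence[Method]) -> Tuple[Dict[int, str], List[int]]:
    """Apply methods in order; return labels by case index and unresolved indices."""
    labels: Dict[int, str] = {}
    remaining = list(range(len(cases)))
    for method in methods:
        still_unclassified = []
        for i in remaining:
            location = method(cases[i])
            if location is None:
                still_unclassified.append(i)
            else:
                labels[i] = location
        remaining = still_unclassified
        if not remaining:
            break
    return labels, remaining


def strong_signal(case: dict) -> Optional[str]:
    """Toy method: decide only when the signal is strong."""
    return case["reader"] if case["signal"] >= 0.5 else None


def any_read(case: dict) -> Optional[str]:
    """Toy method: decide when the tag responded at least once."""
    return case["reader"] if case["reads"] > 0 else None


cases = [
    {"reader": "door", "signal": 0.9, "reads": 3},
    {"reader": "shelf", "signal": 0.2, "reads": 1},
    {"reader": "shelf", "signal": 0.1, "reads": 0},
]
labels, unresolved = apply_sequence(cases, [strong_signal, any_read])
```

Cases that remain in `unresolved` after the last method are the ones that would contribute to the error cost.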
18Cleaning Sequence Approximation
[Figure: selection among M1, M2, and M3 over Steps 1 to 3, by accuracy adjusted cleaning cost]
C(M1) = 1, C(M2) = 1.5, C(M3) = 0.5, C(Error) = 5
S_{D,M} = M1 → M3 → M2
C(S_{D,M}) = 1 + 0.52 × 0.5 + 0.43 × 1.5 + 0.19 × 5
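The cost expression on this slide can be checked numerically. The sketch reads the numbers as follows, which is an interpretation rather than something the slide states: M1 is applied to all cases, M3 to the 52% still unclassified, M2 to the remaining 43%, and 19% of cases end up misclassified. The function name `sequence_cost` is hypothetical.

```python
from typing import Sequence


def sequence_cost(method_costs: Sequence[float],
                  fraction_reaching: Sequence[float],
                  error_fraction: float,
                  error_cost: float) -> float:
    """Per-case cost of a cleaning sequence.

    method_costs[i]      : cost of applying the i-th method to one case
    fraction_reaching[i] : fraction of all cases that reach the i-th method
    error_fraction       : fraction of cases misclassified by every method
    """
    if len(method_costs) != len(fraction_reaching):
        raise ValueError("one fraction is needed per method")
    applied = sum(f * c for f, c in zip(fraction_reaching, method_costs))
    return applied + error_fraction * error_cost


# Sequence M1 -> M3 -> M2 with C(M1)=1, C(M3)=0.5, C(M2)=1.5, C(Error)=5.
cost = sequence_cost([1.0, 0.5, 1.5], [1.0, 0.52, 0.43], 0.19, 5.0)
```

Under this reading the total is 1 + 0.26 + 0.645 + 0.95 = 2.855 per tag case.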
19Cleaning Plan
- Input:
- D: set of labeled tag cases <(EPC, time), (features)>
- M: set of available cleaning methods
M1, M2, …, Mk
- C: cost model C(M1), C(M2), …, C(Mk), C(Error)
- Output:
- A decision tree that splits D according to
feature values.
- For each leaf in the tree a cleaning sequence is
defined.
- Application:
- For each test case,
- use feature values to get to the appropriate leaf.
- apply the cleaning sequence defined in the leaf.
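Applying a plan amounts to walking the decision tree with the feature values of a test case and returning the cleaning sequence stored at the leaf. The sketch below assumes categorical features. The classes `Split` and `Leaf` and the function `find_sequence` are hypothetical names, and the tree mirrors the door/shelf/metal example shown later in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    sequence: List[str]  # names of cleaning methods, in application order


@dataclass
class Split:
    feature: str                                  # feature tested at this node
    children: Dict[str, Union["Split", Leaf]]     # one child per feature value


def find_sequence(node: Union[Split, Leaf], features: Dict[str, str]) -> List[str]:
    """Follow feature values from the root down to a leaf."""
    while isinstance(node, Split):
        node = node.children[features[node.feature]]
    return node.sequence


plan = Split("reader", {
    "door": Leaf(["pat"]),
    "shelf": Split("metal", {
        "yes": Leaf(["fix_1"]),
        "no": Leaf(["dbn"]),
    }),
})

sequence = find_sequence(plan, {"reader": "shelf", "metal": "yes"})
```

A feature value that was never seen during induction raises `KeyError` here. A real system would need a fallback sequence for that case.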
20Available Features
- Tag features:
- Communications protocol, manufacturer, price,
quality
- Detection history
- Reader features:
- Number of antennas
- Protocol, price, vendor
- Location features:
- Type of area being covered (e.g. door, shelf,
conveyor belt)
- Interference level (e.g. presence of metal or
water)
- Item features:
- Type of item, contents, price
21Cleaning Plan Induction
- Use a traditional Top Down Induction of Decision
Trees (TDIDT) algorithm.
- Node splitting criterion:
- Split nodes based on expected cost reduction:
the cleaning sequence cost before the split minus
the average cost of the cleaning sequences after the
split.
22Example
[Table: labeled tag cases]
Cleaning plan:
- reader = door: method pat, accuracy 100%
- reader = shelf, metal = yes: method fix_1, accuracy 75%
- reader = shelf, metal = no: method dbn, accuracy 100%
- We use 3 cleaning methods: fix_1, DBN, and pat.
- The label is 1 if the tag is indeed present, and
0 otherwise.
- Each method predicts 1 for present, 0 for absent.
- The cleaning plan selects when to apply each
method, e.g. we should use fix_1 to clean cases
from shelf readers when there is metal.
23Example of node splitting
C(fix_1) = 1.0, C(DBN) = 2.0, C(pat) = 2.0,
C(Error) = 5.0
Cleaning sequence for D: DBN → pat
Cost = 2.0 + 0.33 × 2.0 + 0.11 × 5.0 = 3.21
After the split:
DBN → fix_1: cost = 2.0 + 0.16 × 1.0 = 2.16
pat: cost = 2.0
Cost reduction = 3.21 - (2.16 × 0.67 + 2.0 × 0.33)
≈ 1.11
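The node splitting arithmetic can be reproduced with a short sketch: the cost of the best sequence before the split, minus the average cost of the best sequences in the child nodes, weighted by the fraction of cases in each child. The function name `expected_cost_reduction` is hypothetical, and the fractions 0.67 and 0.33 are read off the slide as the share of cases in each child.

```python
from typing import Sequence, Tuple


def expected_cost_reduction(parent_cost: float,
                            children: Sequence[Tuple[float, float]]) -> float:
    """children holds (fraction_of_cases, sequence_cost) for each child node."""
    total_fraction = sum(fraction for fraction, _ in children)
    if abs(total_fraction - 1.0) > 1e-6:
        raise ValueError("child fractions must sum to 1")
    cost_after = sum(fraction * cost for fraction, cost in children)
    return parent_cost - cost_after


# Before the split: DBN -> pat on all of D.
parent = 2.0 + 0.33 * 2.0 + 0.11 * 5.0
# After the split: DBN -> fix_1 in one child, pat alone in the other.
child_a = 2.0 + 0.16 * 1.0
child_b = 2.0
reduction = expected_cost_reduction(parent, [(0.67, child_a), (0.33, child_b)])
```

With the rounded figures this evaluates to about 1.10. The 1.11 on the slide is consistent with the same calculation carried out on unrounded fractions.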
24Experimental Setup
- Data generator simulates a complex RFID system
with multiple locations, readers, and tag types.
- The simulation is controlled by several
parameters:
- Item flow characteristics, paths traversed,
predictable vs random movements
- Item speed
- Reader, tag, and item characteristics (e.g.
protocol, manufacturer, item contents)
- Location characteristics and RF noise levels
- The simulation is run for a number of read
cycles; at each cycle the probability of tag
detection is a function of reader, item, tag, and
location characteristics.
25Cleaning methods
- We compare the performance of several cleaning
methods:
- DBN: a DBN based cleaner
- var: adaptive window smoothing
- fix_k: fixed (size k) window smoothing
- Rule based methods:
- pat: a pattern recognition method based on signal
strength shape
- maj: used to resolve multi reader detection
conflicts by majority voting
- Cost model:
- C(DBN) = 2.0, C(var) = 2.0, C(fix_1) = 1.0,
C(fix_3) = 1.4, C(pat) = 2.5, C(maj) = 1.4, C(Error) = 10
- All methods are terminal except for DBN, which
uses P(tag present) as confidence.
26Complex Setup I
- readers in low noise
- readers detecting far away tags
- readers with variable detection rates
- reader at a conveyor belt
- readers detecting tags with water and metal
- In some areas there can be conflict of multiple
patterns detecting the same tag
27Complex Setup II
28Reader Setup
29Noise Level
30Conclusions
- A cost conscious cleaning framework for RFID data
can increase accuracy at lower cost than any
single cleaning method.
- The cleaning plan can be efficiently learnt from
data by applying the idea of cleaning cost
reduction to node splitting.
- DBN based cleaning methods capture the intrinsic
dynamic behavior of tag detection, and deliver
high accuracy at lower costs than smoothing
window based techniques.
31Thanks