Title: Case Study: BibFinder
1Case Study BibFinder
- BibFinder: a popular CS bibliographic mediator
- Integrates 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer
- More than 58000 real user queries collected
- Mediated schema relation in BibFinder
- paper(title, author, conference/journal, year)
- Primary key: title+author+year
- Focus on selection queries
- Q(title, author, year) :- paper(title, author, conference/journal, year), conference="SIGMOD"
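As a rough illustration of the selection query above, the following sketch treats the mediated relation as an in-memory list and binds only the conference/journal attribute. The `Paper` type, the `select` helper, the field name `venue`, and the three toy tuples are assumptions made for this example; they are not part of BibFinder.

```python
from typing import NamedTuple

class Paper(NamedTuple):
    title: str
    author: str
    venue: str  # stands in for the conference/journal attribute
    year: int

def select(papers, title=None, author=None, venue=None, year=None):
    """Return (title, author, year) for tuples matching every bound attribute.

    An attribute left as None is unbound and matches anything.
    """
    bound = {"title": title, "author": author, "venue": venue, "year": year}
    result = []
    for p in papers:
        if all(v is None or getattr(p, k) == v for k, v in bound.items()):
            result.append((p.title, p.author, p.year))
    return result

# Toy data, invented for illustration only.
papers = [
    Paper("Paper A", "Author X", "SIGMOD", 2001),
    Paper("Paper B", "Author Y", "VLDB", 2002),
    Paper("Paper C", "Author X", "SIGMOD", 2003),
]

# Q(title, author, year) :- paper(title, author, venue, year), venue="SIGMOD"
answers = select(papers, venue="SIGMOD")
```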
4Selecting top-K sources for a given query
- Given a query Q and sources S1..Sn, we need the coverage and overlap statistics of the sources Si w.r.t. Q
- P(S|Q) is the coverage (probability that a random tuple belonging to Q is exported by source S)
- P(S1..Sj|Q) is the overlap between S1..Sj w.r.t. query Q (probability that a random tuple belonging to Q is exported by all the sources S1..Sj)
- If we have the coverage and overlap statistics, then it is possible to pick the top-K sources that will give the maximal number of tuples for Q.
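To make the two definitions concrete, here is a minimal sketch that computes coverage and overlap when the answer sets are known exactly. That is an assumption for illustration: a mediator normally has to estimate these statistics. The function names and the toy tuple ids are hypothetical.

```python
def coverage(source_tuples, query_tuples):
    """P(S|Q): fraction of the tuples of Q that source S exports."""
    return len(source_tuples & query_tuples) / len(query_tuples)

def overlap(sources, query_tuples):
    """P(S1..Sj|Q): fraction of the tuples of Q exported by every given source."""
    common = set(query_tuples)
    for s in sources:
        common &= s
    return len(common) / len(query_tuples)

# Toy data: Q has 10 answer tuples, identified by the ids 1..10.
q = set(range(1, 11))
s1 = {1, 2, 3, 4, 5, 6}
s2 = {4, 5, 6, 7, 8}
s3 = {1, 9}

cov_s1 = coverage(s1, q)            # 6 of 10 tuples
ov_s1_s2 = overlap([s1, s2], q)     # tuples 4, 5, 6 are in both sources
ov_all = overlap([s1, s2, s3], q)   # no tuple is in all three sources
```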
5Computing Effective Coverage provided by a set of
sources
Suppose we are calling 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1 ∪ S2 ∪ S3 | Q). In order to compute this union, we need the intersection (overlap) statistics in addition to the coverage statistics.
Given the above, we can pick the optimal 3 sources for answering Q by considering all 3-sized subsets of the source set S1..Sn and picking the set with the highest coverage.
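The union can be obtained from the coverage and overlap statistics by the standard inclusion-exclusion identity: P(S1 ∪ S2 ∪ S3 | Q) = P(S1|Q) + P(S2|Q) + P(S3|Q) − P(S1 ∧ S2|Q) − P(S1 ∧ S3|Q) − P(S2 ∧ S3|Q) + P(S1 ∧ S2 ∧ S3|Q). The sketch below applies it and then enumerates all K-sized subsets. The layout of `stats` (a dict keyed by frozensets of source names), the source names A to D, and the numbers are assumptions for illustration. Missing entries are treated as zero overlap.

```python
from itertools import combinations

def union_coverage(sources, stats):
    """P(S1 U ... U Sk | Q) by inclusion-exclusion over overlap statistics.

    `stats` maps a frozenset of source names to the probability that a tuple
    of Q is exported by all of them; singleton keys are plain coverages.
    """
    total = 0.0
    for r in range(1, len(sources) + 1):
        sign = 1 if r % 2 == 1 else -1
        for subset in combinations(sources, r):
            total += sign * stats.get(frozenset(subset), 0.0)
    return total

def best_subset(all_sources, stats, k):
    """Exhaustively pick the k-sized subset with the highest union coverage."""
    return max(combinations(all_sources, k),
               key=lambda subset: union_coverage(subset, stats))

# Toy statistics for one query (invented numbers; absent keys mean 0 overlap).
stats = {
    frozenset({"A"}): 0.6, frozenset({"B"}): 0.5,
    frozenset({"C"}): 0.2, frozenset({"D"}): 0.2,
    frozenset({"A", "B"}): 0.3, frozenset({"A", "C"}): 0.1,
    frozenset({"A", "D"}): 0.2, frozenset({"B", "D"}): 0.2,
    frozenset({"A", "B", "D"}): 0.2,
}

best = best_subset(["A", "B", "C", "D"], stats, 3)
```

The enumeration visits every K-sized subset, so its cost grows combinatorially with the number of sources.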
6Selecting top-K sources the greedy way
Selecting the optimal K sources is hard in general. One way to reduce the cost is to select sources greedily, one after the other. For example, to select 3 sources, we first select the source Si with the highest P(Si|Q) value. To pick the jth source, we compute the residual coverage of each of the remaining sources, given the 1, 2, ..., j-1 sources we have already picked (the residual coverage computation requires overlap statistics). For example, picking a third source S3 in the context of sources S1 and S2 will require us to calculate the residual coverage P(S3 ∧ ¬S1 ∧ ¬S2 | Q).
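The residual coverage of a candidate source is the probability that a tuple of Q is exported by the candidate and by none of the sources already picked. By inclusion-exclusion, for a third source S3 given S1 and S2 this expands to P(S3|Q) − P(S3 ∧ S1|Q) − P(S3 ∧ S2|Q) + P(S3 ∧ S1 ∧ S2|Q). The following is a minimal sketch of greedy selection built on that identity. The `stats` layout, the source names, and the numbers are assumptions for illustration, not BibFinder's actual data structures.

```python
from itertools import combinations

def residual_coverage(candidate, picked, stats):
    """P(candidate and none of `picked` | Q) via inclusion-exclusion.

    `stats` maps a frozenset of source names to their joint overlap;
    missing entries are treated as zero overlap.
    """
    total = 0.0
    for r in range(len(picked) + 1):
        sign = 1 if r % 2 == 0 else -1
        for subset in combinations(picked, r):
            total += sign * stats.get(frozenset((candidate,) + subset), 0.0)
    return total

def greedy_select(all_sources, stats, k):
    """Pick k sources one at a time, each with the highest residual coverage."""
    picked = []
    remaining = list(all_sources)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: residual_coverage(s, picked, stats))
        picked.append(best)
        remaining.remove(best)
    return picked

# Toy statistics for one query (invented numbers; absent keys mean 0 overlap).
stats = {
    frozenset({"A"}): 0.6, frozenset({"B"}): 0.5,
    frozenset({"C"}): 0.2, frozenset({"D"}): 0.2,
    frozenset({"A", "B"}): 0.3, frozenset({"A", "C"}): 0.1,
    frozenset({"A", "D"}): 0.2, frozenset({"B", "D"}): 0.2,
    frozenset({"A", "B", "D"}): 0.2,
}

plan = greedy_select(["A", "B", "C", "D"], stats, 3)
```

With no sources picked yet, the residual coverage reduces to the plain coverage, so the first pick is simply the source with the highest P(Si|Q).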
7What good is a high coverage source that is off-line?
- Sources vary significantly in terms of their response times
- The response time depends both on the source itself and on the query that is asked of it
- Specifically, which fields are bound in the selection query can make a difference
- It is hard enough to get a high coverage plan or a low response time plan. But now we have to combine them
- Qn: How do we define an optimal plan in the context of both coverage/overlap and response time requirements?
8Response time can depend on the query type
Range queries on year
Effect of binding author field
- Response times can also depend on the time of the day and the day of the week.
9Multi-objective Query optimization
- Need to optimize queries jointly for both high coverage and low response time
- Staged optimization won't quite work.
- An idea: make the source selection depend on both (residual) coverage and response time
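One way the idea could be realized is to score each candidate source with a single utility that rewards residual coverage and penalizes response time. The slide does not say how the two are combined, so the weighted-sum formula, the weight `w`, the normalization by the slowest candidate, and the numbers below are all assumptions for illustration.

```python
def pick_next(candidates, residual_cov, response_time, w=0.5):
    """Pick the candidate maximizing a hypothetical weighted utility.

    utility = w * residual coverage - (1 - w) * normalized response time.
    Assumes every response time is positive. w=1 ignores response time,
    w=0 ignores coverage.
    """
    slowest = max(response_time[s] for s in candidates)

    def utility(s):
        return w * residual_cov[s] - (1 - w) * response_time[s] / slowest

    return max(candidates, key=utility)

# Toy inputs: residual coverages w.r.t. one query, response times in seconds.
residual_cov = {"A": 0.6, "B": 0.5, "C": 0.2}
response_time = {"A": 10.0, "B": 1.0, "C": 0.5}

balanced = pick_next(["A", "B", "C"], residual_cov, response_time, w=0.5)
coverage_only = pick_next(["A", "B", "C"], residual_cov, response_time, w=1.0)
speed_only = pick_next(["A", "B", "C"], residual_cov, response_time, w=0.0)
```

In this toy case the balanced setting prefers B over the higher-coverage but much slower A. A staged approach that first filters on coverage alone and then on response time could not make that trade.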
10Results on BibFinder
11Challenges
- Sources are incomplete and partially overlapping
- Calling every possible source is inefficient and impolite
- Need coverage and overlap statistics to figure out which sources are most relevant for every possible query!
- We introduce a frequency-based approach for mining these statistics
12Outline
- Motivation
- BibFinder/StatMiner Architecture
- StatMiner Approach
- Automatically learning AV Hierarchies
- Discovering frequent query classes
- Learning coverage and overlap Statistics
- Using Coverage and Overlap Statistics
- StatMiner evaluation with BibFinder
- Related Work
- Conclusion
13Motivation
- We introduce StatMiner
- A threshold-based hierarchical mining approach
- Store statistics w.r.t. query classes
- Keep more accurate statistics for more frequently asked queries
- Handle the efficiency and accuracy tradeoffs by adjusting the thresholds
14BibFinder/StatMiner
15Query List
16AV Hierarchies and Query Classes
17StatMiner
18Using Coverage and Overlap Statistics to Rank
Sources
20BibFinder/StatMiner Evaluation
- Experimental setup with BibFinder
- Mediator relation: Paper(title, author, conference/journal, year)
- 25000 real user queries are used. Among them, 4500 queries are randomly chosen as test queries.
- AV Hierarchies for all four attributes are learned automatically.
- 8000 distinct values in author, 1200 frequently asked keyword itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year.
21Learned Conference Hierarchy
22Space Consumption for Different minfreq and
minoverlap
- We use a threshold on the support of a class, called minfreq, to identify frequent classes
- We use a minimum support threshold, minoverlap, to prune overlap statistics for uncorrelated source sets.
- As we increase either of these two thresholds, the memory consumption drops, especially in the beginning.
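The role of the two thresholds can be sketched as follows. This is a simplified illustration: it uses flat class labels and ignores the AV hierarchies, and the function names, the query log, and the statistics are assumptions rather than StatMiner's actual implementation.

```python
from collections import Counter

def frequent_classes(query_classes, minfreq):
    """Return {class: support} for classes whose support is at least minfreq.

    Support is the fraction of logged queries that fall into the class.
    """
    counts = Counter(query_classes)
    total = len(query_classes)
    return {c: n / total for c, n in counts.items() if n / total >= minfreq}

def prune_overlaps(stats, minoverlap):
    """Keep all single-source coverages; drop overlaps below minoverlap."""
    return {k: v for k, v in stats.items() if len(k) == 1 or v >= minoverlap}

# Toy query log: each entry is the class a logged query was mapped to.
log = ["SIGMOD"] * 5 + ["VLDB"] * 3 + ["ICDE"] + ["AAAI"]
frequent = frequent_classes(log, minfreq=0.3)

# Toy statistics for one frequent class (invented numbers).
stats = {
    frozenset({"A"}): 0.6, frozenset({"B"}): 0.5, frozenset({"C"}): 0.2,
    frozenset({"A", "B"}): 0.3, frozenset({"A", "C"}): 0.1,
}
kept = prune_overlaps(stats, minoverlap=0.2)
```

Raising either threshold shrinks what has to be stored: fewer classes survive `minfreq`, and fewer source sets per class survive `minoverlap`.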
23Accuracy of the Learned Statistics
- Absolute Error
- No dramatic increases
- Keeping very detailed overlap statistics would not necessarily increase the accuracy, while requiring much more space. For example, minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0
24Plan Precision
- Here we observe the average precision of the top-2 source plans
- The plans using our learned statistics have high precision compared to random select, and the precision decreases very slowly as we change the minfreq and minoverlap thresholds.
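The slide does not define plan precision. The sketch below assumes it is the fraction of sources in the chosen top-K plan that also appear in the true top-K plan, averaged over the test queries. The plan contents are invented; only the source names come from the deck.

```python
def plan_precision(estimated_plan, true_plan):
    """Fraction of sources in the estimated plan that are in the true plan."""
    if not estimated_plan:
        return 0.0
    return len(set(estimated_plan) & set(true_plan)) / len(estimated_plan)

def average_precision(plan_pairs):
    """Mean plan precision over (estimated_plan, true_plan) pairs."""
    return sum(plan_precision(e, t) for e, t in plan_pairs) / len(plan_pairs)

# Toy top-2 plans for three test queries (invented for illustration).
pairs = [
    (["DBLP", "CSB"], ["DBLP", "CSB"]),          # both sources correct
    (["DBLP", "ACM DL"], ["DBLP", "CiteSeer"]),  # one of two correct
    (["CSB", "CiteSeer"], ["DBLP", "ACM DL"]),   # none correct
]
avg = average_precision(pairs)
```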
25Plan Precision on Controlled Sources
We observe the plan precision of top-5 source plans (25 simulated sources in total). Using greedy select does produce better plans. See Section 3.8 and Section 3.9 for detailed information.
26Number of Distinct Results
- Here we observe the average number of distinct results of top-2 source plans.
- Our method gets on average 50 distinct answers, while random search gets only about 30 answers.
27Applications
- Path Selection in Bioinformatics [LNRV03]
- More and more bioinformatics sources are available on the Internet
- Thousands of paths exist for answering users' queries
- Path coverage and overlap statistics are needed
- Text Database Selection in Information Retrieval
- StatMiner can provide a better way of learning and storing representatives of the databases
- Main Ideas
- Maintain a query list and discover frequently asked keyword-sets
- Learn a keyword-set hierarchy based on the statistics distance
- Learn and store coverage (document frequency) for frequently asked keyword-set classes.
- A new query is mapped to a set of close classes, whose statistics are used to estimate the statistics for the query.
- Advantages
- Multiple-word terms
- Scalability