Title: Databases and Information Retrieval: Rethinking the Great Divide
1 Databases and Information Retrieval: Rethinking the Great Divide
- SIGMOD Panel
- 14 Jun 2005
- Jayavel Shanmugasundaram
- Cornell University
2 10,000-Foot View of Data Management
[Figure: diagram with a Queries axis (Ranked Keyword Search; Complex and Structured) and a Data axis (Structured; Unstructured), locating Information Retrieval Systems and Database Systems.]
3 Bridging the Great Divide
- Option 1: Tie together existing DB and IR systems
  - Example: Approaches based on SQL/MM
- Option 2: Extend existing DB systems with IR functionality, or vice versa
  - Example: Add searching and ranking to RDBMSs
- Option 3: Design a new data management system from the ground up
  - Example: Quark data management system
4 Why Option 1 Won't Work
[Figure: diagram with a Queries axis (Ranked Keyword Search; Complex and Structured) and a Data axis (Structured; Unstructured), locating Information Retrieval Systems and Database Systems.]
5 Bridging the Great Divide
- Option 1: Tie together existing DB and IR systems
  - Example: Approaches based on SQL/MM
  - Drawback: Not powerful enough
- Option 2: Extend existing DB systems with IR functionality, or vice versa
  - Example: Add searching and ranking to RDBMSs
- Option 3: Design a new data management system from the ground up
  - Example: Quark data management system
6
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <abstract>We consider the recently proposed language</abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"></cite>
    </paper>
Query: Find relevant elements in important workshops between the years 1999 and 2001 that are about Ricardo and XML
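The query above mixes a structured condition (the workshop year) with a keyword condition (Ricardo and XML). A minimal Python sketch of one way to evaluate it follows. Several things here are assumptions and not part of the slide: the closing tags that make the fragment well-formed, the omission of the cite element, the helper names (`text_of`, `contains_all`, `smallest_matches`, `search`), and the choice to return the smallest element containing every keyword. The notion of an "important" workshop is not modeled at all.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed so that it parses (closing tags are assumed).
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <abstract>We consider the recently proposed language</abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
      </section>
    </paper>
  </proceedings>
</workshop>
"""


def text_of(elem):
    """All text in the element's subtree, lowercased, whitespace-normalized."""
    return " ".join("".join(elem.itertext()).lower().split())


def contains_all(elem, keywords):
    """True if every keyword occurs (as a substring) in the subtree text."""
    text = text_of(elem)
    return all(k.lower() in text for k in keywords)


def smallest_matches(root, keywords):
    """Elements that contain all keywords while no single child does."""
    return [
        elem
        for elem in root.iter()
        if contains_all(elem, keywords)
        and not any(contains_all(child, keywords) for child in elem)
    ]


def search(xml_text, keywords, first_year, last_year):
    """Structured filter on the workshop year, then keyword search."""
    root = ET.fromstring(xml_text)
    year = int(root.get("date").split()[-1])  # "28 July 2000" -> 2000
    if not first_year <= year <= last_year:
        return []
    return smallest_matches(root, keywords)


results = search(DOC, ["Ricardo", "XML"], 1999, 2001)
print([elem.tag for elem in results])
```

Under these assumptions the paper element is returned, not the workshop: "Ricardo" appears only in an author element and "XML" only in the section, so the paper is the smallest element covering both. A different definition of the result unit would give a different answer.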
7 Why Extending (R)DBMSs Won't Work
- Violates many assumptions hardwired into current database systems
  - Structured queries over structured fields, keyword search queries over text fields
    - Is author name a structured or text field?
  - Operators have precise, well-defined semantics
    - Even the query result is not well-defined: do we return a paper or a workshop?
  - Scoring is tacked on as a relational attribute
    - How can this scoring generalize IR scoring?
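The field-type and scoring concerns above can be illustrated with a small sketch using Python's standard `sqlite3` module. The `paper` table, its rows, and the helpers `exact_author` and `keyword_scores` are hypothetical, and the score (one point per keyword found in the abstract) is an ad hoc stand-in, not a real IR ranking function.

```python
import sqlite3


def make_db():
    """In-memory database with a hypothetical paper table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paper (id INTEGER PRIMARY KEY, author TEXT, abstract TEXT)"
    )
    conn.executemany(
        "INSERT INTO paper (id, author, abstract) VALUES (?, ?, ?)",
        [
            (1, "Ricardo Baeza-Yates", "Searching structured text with XML"),
            (2, "Gonzalo Navarro", "Relational query optimization"),
        ],
    )
    return conn


def exact_author(conn, name):
    """Author treated as a structured field: exact equality only."""
    return conn.execute("SELECT id FROM paper WHERE author = ?", (name,)).fetchall()


def keyword_scores(conn, keywords):
    """Score as a computed column: one point per keyword found in the abstract.

    Expects at least one keyword.
    """
    score_expr = " + ".join("(abstract LIKE ?)" for _ in keywords)
    params = [f"%{k}%" for k in keywords]
    sql = f"SELECT id, {score_expr} AS score FROM paper ORDER BY score DESC, id"
    return conn.execute(sql, params).fetchall()


conn = make_db()
exact = exact_author(conn, "Ricardo")
scores = keyword_scores(conn, ["Ricardo", "XML"])
print(exact, scores)
```

In this sketch the exact-match predicate finds nothing for "Ricardo", while the keyword score never looks at the author column, so paper 1 gets credit only for "XML". Which behavior is right depends on whether author is a structured or a text field, which is the question the slide raises.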
8 Why Extending IR Systems Won't Work
- IR systems provide little support for structured data
  - No support for complex operators
    - How can complex queries be evaluated?
  - Scoring does not take structure into account
    - How can scoring capture both structured and unstructured data?
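To make the point about structure-blind scoring concrete, here is a minimal sketch of a flat inverted index with term-frequency scoring. The documents, the identifiers `w1` and `w2`, and the functions `tokenize`, `build_index`, and `rank` are all hypothetical, and summing raw term frequencies is a deliberately simple stand-in for a real IR scoring function.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens; all structure is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def rank(index, query):
    """Score each document by summed term frequency of the query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    # "Ricardo" is the author here.
    "w1": "XQL and Proximal Nodes Ricardo Baeza-Yates Searching structured text XML",
    # "Ricardo" is only thanked in passing here.
    "w2": "We thank Ricardo for comments on our XML parser",
}
ranking = rank(build_index(docs), "Ricardo XML")
print(ranking)
```

Both documents receive the same score, because once the text is flattened the index cannot tell an author field from a passing mention.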
9 Bridging the Great Divide
- Option 1: Tie together existing DB and IR systems
  - Example: Approaches based on SQL/MM
  - Drawback: Not powerful enough
- Option 2: Extend existing DB systems with IR functionality, or vice versa
  - Example: Add searching and ranking to RDBMSs
  - Drawback: Shoehorns alien functionality into already complex systems
- Option 3: Design a new data management system from the ground up
  - Example: Quark data management system
10 Why Option 3 Will Work
- Designed from the ground up with three principles
  - Structural data independence
    - Users can issue any query (complex and keyword) over any data (structured and unstructured)
  - Generalized scoring
    - Scoring works over any mix of structured and unstructured data (e.g., XRank over HTML and XML)
  - Flexible query language
    - Allows for arbitrary return results and scores (e.g., TeXQuery, the precursor to XQuery Full-Text; NEXI)
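As a rough illustration of what scoring over a mix of structured and unstructured data could look like, the sketch below scores every element of a tree and discounts keyword hits that lie deeper in the subtree. This is not the XRank formula or the Quark implementation: the function `element_scores`, the `DECAY` constant, and the rule that an element must cover every keyword to score above zero are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DECAY = 0.5  # hypothetical per-level discount for hits found deeper in the tree


def element_scores(root, keywords, decay=DECAY):
    """Return {element: score} for every element under root.

    A keyword found in an element's own text counts 1.0; a keyword found
    d levels below counts decay**d. An element scores the sum over all
    keywords, or 0.0 if any keyword is missing from its subtree.
    """
    keywords = [k.lower() for k in keywords]
    scores = {}

    def visit(elem):
        # Text directly inside this element (its text plus child tails).
        direct = [elem.text or ""] + [child.tail or "" for child in elem]
        own_text = " ".join(direct).lower()
        best = {k: (1.0 if k in own_text else 0.0) for k in keywords}
        for child in elem:
            child_best = visit(child)
            for k in keywords:
                best[k] = max(best[k], decay * child_best[k])
        covered = all(value > 0 for value in best.values())
        scores[elem] = sum(best.values()) if covered else 0.0
        return best

    visit(root)
    return scores


# A flat, unstructured document and a nested, structured one use the same code.
flat = ET.fromstring("<doc>Ricardo writes about XML</doc>")
nested = ET.fromstring(
    "<paper><author>Ricardo</author><section>XML search</section></paper>"
)
flat_scores = element_scores(flat, ["Ricardo", "XML"])
nested_scores = element_scores(nested, ["Ricardo", "XML"])
print(flat_scores[flat], nested_scores[nested])
```

The flat document degenerates to plain keyword matching on one element, while the nested document yields a per-element score that reflects how far each keyword sits below the element being returned.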
11 Bridging the Great Divide
- Option 1: Tie together existing DB and IR systems
  - Example: Approaches based on SQL/MM
  - Drawback: Not powerful enough
- Option 2: Extend existing DB systems with IR functionality, or vice versa
  - Example: Add searching and ranking to RDBMSs
  - Drawback: Shoehorns alien functionality into already complex systems
- Option 3: Design a new data management system from the ground up
  - Example: Quark data management system
  - Most promising alternative!