Title: High Performance Database Lab
1High Performance Database Lab
Ph.D. Candidates: Jun Zhang, Zhaohui Peng
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25
2Outline
- Introduction to Our Lab
- Introduction to KSORD
- Our work on KSORD
3High Performance Database Lab
- High Performance Database Lab, Key Laboratory of Data Engineering and Knowledge Engineering, MOE
- Cooperation with NCR Teradata: Data Warehouse and Business Intelligence United Lab
- Cooperation with HP Labs, China: Main Memory DBMS, Parallel DBMS
- Composition: 7 faculty, 20 graduate students (9 Ph.D. students), 3 research groups
  - Core Technology of Databases
  - Grid Data Management
  - Database and IR
4High Performance Database Lab
- 1. Core Technology of Databases
- (1) KingbaseES
  - A large-scale, general-purpose, highly efficient national relational DBMS of China with independent intellectual property rights (database company: Basesoft IT Ltd)
  - Awarded an 863 research project of China; KingbaseES got the highest score in the national DBMS testing of China in 2005
  - KingbaseES is mainly applied in e-government, education, manufacturing, mobile communications, etc.
- (2) DBMS Self-tuning and Self-management
  - Self-configuration, self-optimization, self-healing and self-protection make the DBMS more intelligent and available.
- (3) Main Memory DBMS, Parallel DBMS
- (4) XML RDBMS
5High Performance Database Lab
2. Grid Data Management
(1) Information grid and platform construction
(2) Massive data management in the grid
(3) PDBMS: Existing P2P systems lack the data management capabilities that are typically found in a DBMS. PDBMS presents a flexible framework for data sharing among heterogeneous data sources. Our main research focuses on the quality of semantic mapping, query algorithms and data consistency.
6High Performance Database Lab
- DBIR
- Information retrieval on relational databases
  - (1) keyword search techniques
  - (2) semantic keyword search
  - (3) clustering search results
  - (4) relevance feedback
7Outline
- Introduction to Our Lab
- Introduction to KSORD
- Our work on KSORD
8DB-IR is a Hot Topic
- SIGMOD 2006: Effective Keyword Search in Relational Databases
- SIGMOD 2005 Panel: Databases and Information Retrieval: Rethinking the Great Divide
- VLDB 2005: two sessions (Session 18: DB and IR 1; Session 22: DB and IR 2)
- VLDB 2004: DB-IR Tutorial
- VLDB 2002 Tutorial: Text Search for Fine-grained Semi-structured Data
9How to integrate DB and IR ?
- Option 1: Tie together existing DB and IR systems
  - Example: approaches based on SQL/MM
- Option 2: Extend existing DB systems with IR functionality, or vice versa
  - Example: add searching and ranking to an RDBMS
- Option 3: Design a new data management system from the ground up
  - Example: the Quark data management system
10Why is KSORD Necessary ?
- IR systems search unstructured data by keywords.
  - Results are usually imprecise and incomplete.
- DB systems search structured data by SQL.
  - Results are sound and complete; all results are equally good.
Can we search databases with keywords? Yes.
Is searching databases with keywords necessary? (Motivation)
11Why is KSORD Necessary ?
- Web users and casual users expect to query databases with free-form keyword queries.
  - Just like searching the web with a search engine
  - Web users don't know the database schema or SQL
- Hidden Web problem: searching hidden databases with keyword queries.
  - Most data on the Web are stored in databases and hidden from search engines; only a small part of the data on the Web can be found by search engines
  - There is a mismatch of search interfaces between search engines and databases
  - If database systems support keyword search, publishing or searching a database on the web is expected to be simpler and easier, and the deep web problem can be alleviated.
12Why is KSORD Necessary ?
- Keyword query: a unified query language/interface for integrating diverse kinds of information systems
- Modern information systems should manage many kinds of data
  - structured relational data: SQL
  - semi-structured XML documents: XQuery
  - unstructured text documents: keyword query
- Different query languages must be used for searching different kinds of data
13What is KSORD ?
Query: gray transaction

AUTHOR
Tuple  AuthorID  Name
a1     A101      Donald D. Chamberlin
a2     A102      Jim Gray
a3     A103      Vera Watson

WRITE
Tuple  PaperID  AuthorID
w1     P101     A101
w2     P102     A102
w3     P103     A102

PAPER
Tuple  PaperID  Title                                           Year  Type
p1     P101     Specifying Queries as Relational Expressions    1974  inproceedings
p2     P102     Transaction Processing Concepts and Techniques  1982  book
p3     P103     Database and Transaction Processing Benchmarks  1992  inproceedings

Result: a2-w2-p2, a2-w3-p3
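The example above can be reproduced with a small sketch. It loads the three relations into an in-memory SQLite database and evaluates one hand-written join for the query "gray transaction". The table name `writes` (standing in for WRITE), the column names, and the function names are assumptions made for this illustration; a real KSORD system would derive the joins automatically rather than hard-code them.

```python
import sqlite3

def build_demo_db():
    """Load the three example relations into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (tid TEXT, author_id TEXT, name TEXT);
        CREATE TABLE writes (tid TEXT, paper_id TEXT, author_id TEXT);
        CREATE TABLE paper  (tid TEXT, paper_id TEXT, title TEXT, year INTEGER, type TEXT);
    """)
    conn.executemany("INSERT INTO author VALUES (?, ?, ?)", [
        ("a1", "A101", "Donald D. Chamberlin"),
        ("a2", "A102", "Jim Gray"),
        ("a3", "A103", "Vera Watson"),
    ])
    conn.executemany("INSERT INTO writes VALUES (?, ?, ?)", [
        ("w1", "P101", "A101"),
        ("w2", "P102", "A102"),
        ("w3", "P103", "A102"),
    ])
    conn.executemany("INSERT INTO paper VALUES (?, ?, ?, ?, ?)", [
        ("p1", "P101", "Specifying Queries as Relational Expressions", 1974, "inproceedings"),
        ("p2", "P102", "Transaction Processing Concepts and Techniques", 1982, "book"),
        ("p3", "P103", "Database and Transaction Processing Benchmarks", 1992, "inproceedings"),
    ])
    return conn

def keyword_search(conn, author_kw, title_kw):
    """Return joined tuple trees (as 'a-w-p' strings) where one keyword
    matches an author name and the other matches a paper title."""
    sql = """
        SELECT a.tid, w.tid, p.tid
        FROM author a
        JOIN writes w ON a.author_id = w.author_id
        JOIN paper  p ON w.paper_id  = p.paper_id
        WHERE a.name LIKE ? AND p.title LIKE ?
        ORDER BY p.tid
    """
    rows = conn.execute(sql, (f"%{author_kw}%", f"%{title_kw}%")).fetchall()
    return ["-".join(row) for row in rows]

if __name__ == "__main__":
    print(keyword_search(build_demo_db(), "gray", "transaction"))
    # prints ['a2-w2-p2', 'a2-w3-p3']
```

Only one join shape (AUTHOR-WRITE-PAPER) is tried here; the keyword-to-column assignment is fixed by hand, which is exactly the part a KSORD system has to discover on its own.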
14How to Realize KSORD ?
In the integration of database and IR techniques, we focus on how to implement IR in relational databases.
- Method 1: improve the DBMS
  - Improve the relational algebra model, the query processor and SQL to enhance proximity search, top-k and semantic search in databases, e.g. [FR97] (PRA, ACM 1997), [IAE03] (top-k, VLDB 2003), [DCE04] (OSS, VLDB 2004)
  - New-generation DBMSs should have these properties
- Method 2: middleware
  - Based on the full-text indexing provided by the RDBMS, keyword search over relational databases enables casual users to use keyword queries (a set of keywords) to search relational databases just like searching the Web, without any knowledge of the database schema or any need to write SQL queries
  - e.g. [GSV98] (VLDB 1998), [BHN02] (ICDE 2002), [HP02] (VLDB 2002, VLDB 2003), [BHP04] (VLDB 2004)
15How to Realize KSORD ?
- Method 2: middleware
- (1) Offline systems
  - Retrieve results for a keyword query from an intermediate representation generated by "crawling" the database in advance. Offline systems execute queries efficiently, but they cannot query up-to-date data in time, and they need a long preprocessing time and large physical space to generate the intermediate representation.
  - EKSO [SU03] (Stanford 2003): indexes text objects (virtual documents)
  - DataSpot [DEG98] (VLDB 1998): builds an external graph-based hyperbase
  - DbSurfer [WLK03]: indexes the textual content of tuples as virtual web pages
  - ObjectRank [BHP04] (VLDB 2004): ObjectRank is similar to PageRank
16How to Realize KSORD ?
- (2) Online systems
  - Convert a keyword query into many SQL queries and run them against the database itself. Online systems can retrieve the latest data from the database, but their execution may be inefficient because the converted SQL queries usually contain many join operators.
  - Data-graph based methods
    - BANKS [BHN02] (ICDE 2002): models the whole database as a graph
  - Schema-graph based methods
    - DBXplorer [ACD02] (ICDE 2002), DISCOVER [HP02] (VLDB 2002), Efficient IR-Style [HGP03] (VLDB 2003): model the database schema as a graph
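As a rough illustration of the schema-graph based idea, the sketch below finds a join path between the two tables that contain the keywords and emits one parameterized SQL query. The schema edges, table and column names, and function names are hypothetical. Real systems such as DISCOVER enumerate many candidate join networks, whereas this sketch produces only the shortest one.

```python
from collections import deque

# Hypothetical schema graph: each edge is a foreign-key join condition.
SCHEMA_EDGES = {
    ("author", "writes"): "author.author_id = writes.author_id",
    ("writes", "paper"): "writes.paper_id = paper.paper_id",
}

def neighbors(table):
    """Yield (adjacent table, join condition) pairs for a table."""
    for (a, b), cond in SCHEMA_EDGES.items():
        if a == table:
            yield b, cond
        elif b == table:
            yield a, cond

def join_path(src, dst):
    """Breadth-first search for the shortest join path from src to dst.
    Returns (tables, join_conditions) or None if the tables are unconnected."""
    queue = deque([(src, [src], [])])
    seen = {src}
    while queue:
        table, tables, conds = queue.popleft()
        if table == dst:
            return tables, conds
        for nxt, cond in neighbors(table):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, tables + [nxt], conds + [cond]))
    return None

def keyword_query_to_sql(kw1, loc1, kw2, loc2):
    """Translate two keywords, each located in a (table, column) pair,
    into one parameterized SQL query along the shortest join path."""
    (t1, c1), (t2, c2) = loc1, loc2
    path = join_path(t1, t2)
    if path is None:
        return None
    tables, conds = path
    where = conds + [f"{t1}.{c1} LIKE ?", f"{t2}.{c2} LIKE ?"]
    sql = f"SELECT * FROM {', '.join(tables)} WHERE {' AND '.join(where)}"
    return sql, [f"%{kw1}%", f"%{kw2}%"]

if __name__ == "__main__":
    print(keyword_query_to_sql("gray", ("author", "name"),
                               "transaction", ("paper", "title")))
```

Even in this tiny schema the generated query already needs two joins, which hints at why online systems can become slow as the number of keywords and the join network size grow.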
17What is Our Previous Work ( Basic Work)
- Survey: Searching relational databases with keywords [SK05] (JCST)
- Prototype systems implementation: BANKS, DISCOVER
- Prototype study: IR-Style (DISCOVER II)
18What is Our Previous Work ( Innovative Work)
- SEEKER: Keyword-based Information Retrieval over Relational Databases
- DETECTOR: A universal database retrieval system based on dynamic database
- Research on New Preprocessing Technology for Keyword Search in Databases
- A Study of Content-based Search Techniques in Peer-to-Peer Network
- A Study of Integration Techniques of Heterogeneous Information Sources Based on Grid
19Outline
- Introduction to Our Lab
- Introduction to KSORD
- Our work on KSORD
20What is Worth Doing?
- How to Realize KSORD?
  - SEEKER (schema-graph based online system, done)
  - DETECTOR (data-graph based online system, done)
  - ITREKS (offline system, in progress)
- How to Improve KSORD?
  - Efficiency
    - HUNTER (preprocessing techniques, done)
  - Effectiveness
    - Effective Keyword Search in Relational Databases (Fang Liu, SIGMOD 2006)
21What are We Doing?
- How to Improve KSORD?
  - Efficiency
    - QuickCN
    - PreCN: Preprocessing Candidate Networks for Efficient Keyword Search over Databases (submitted)
    - CLASCN: Candidate Network Selection for Efficient Top-k Keyword Queries over Databases (submitted)
    - JoinCN (in progress)
  - Effectiveness
    - Result representation
      - TreeCluster: Clustering Results of Keyword Search over Databases. WAIM 2006, Hong Kong (accepted)
    - Semantic search
      - Si-SEEKER: Ontology-based Semantic Search over Databases. KSEM 2006, August 5, Guilin, China (accepted)
22Our Framework
SemCN (Semantic Search, Ontology)
PreCN (Preprocessing)
CLASCN (Classification, Learning, and Selection)
JoinCN (DBMS, Top-k Algorithm)
23SemCN
SemCN (Semantic Search Ontology)
25SemCN
- Basic Ideas
  - Exploit a domain ontology to construct semantic indexes in a relational database
  - Semantic indexes support semantic search, just as full-text indexes support keyword search.
  - Use the Generalized Vector Space Model to compute semantic similarity by utilizing the hierarchical structure of the domain ontology
  - Semantic search is combined with keyword search
27SemCN
- Domain Ontology
  - ACM CCS 1998 (1475 concepts, 2 relationships: subClassOf, relatedTo)
- Data Set: DBLP
- Annotation
  - SIGMOD XML data set (477 annotated papers and 1369 semantic index entries)
  - Crawling the ACM Digital Library: about 20,000 papers (in progress)
- Concept Extractor
  - A simple concept extractor (Stanford Parser)
28SemCN
Domain Ontology: Computer Science
29SemCN
- How to compute semantic similarity?
  - Data Concept Vector and Query Concept Vector
  - Structured data: semantic indexes
    - D -> Concept Extractor -> D(C1, C2, C3, ...)
  - Keyword query: Concept Extractor
    - Q -> Concept Extractor -> Q(C1, C2, C3, ...)
- For example
  - Keyword query "Data Management" -> Q(H.2)
  - Data (paper title) "Enriching the conceptual basis for query formulation through relationship semantics in databases." -> D(H.2.1.1, H.2.3.4)
- In the classic vector space model, the concept vectors are supposed to be perpendicular to each other, so their dot product is zero?
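To make the contrast concrete, here is a minimal sketch comparing a classic cosine similarity with a hierarchy-aware one for the example vectors Q(H.2) and D(H.2.1.1, H.2.3.4). It assumes that the similarity of two concepts is 2 * depth(LCA) / (depth(c1) + depth(c2)), in the spirit of the generalized model of [GMW03], and that depth can be read off the dotted classification code. Both are assumptions of this sketch rather than the exact SemCN formulas, and the function names are hypothetical.

```python
import math

def concept_sim(c1, c2):
    """Hierarchy-aware similarity of two dotted concept codes (assumed form):
    2 * depth(lowest common ancestor) / (depth(c1) + depth(c2))."""
    p1, p2 = c1.split("."), c2.split(".")
    common = 0
    for x, y in zip(p1, p2):
        if x != y:
            break
        common += 1
    return 2 * common / (len(p1) + len(p2))

def generalized_dot(u, v):
    """Dot product in which different concepts are no longer orthogonal."""
    return sum(wu * wv * concept_sim(cu, cv)
               for cu, wu in u.items() for cv, wv in v.items())

def gvsm_cosine(u, v):
    """Cosine similarity under the generalized dot product."""
    return generalized_dot(u, v) / math.sqrt(generalized_dot(u, u) * generalized_dot(v, v))

def classic_cosine(u, v):
    """Classic cosine: distinct concepts contribute nothing."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values()) * sum(w * w for w in v.values()))
    return dot / norm

if __name__ == "__main__":
    q = {"H.2": 1.0}                      # Q(H.2)
    d = {"H.2.1.1": 1.0, "H.2.3.4": 1.0}  # D(H.2.1.1, H.2.3.4)
    print(classic_cosine(q, d))  # 0.0: no concept is shared literally
    print(gvsm_cosine(q, d))     # about 0.77: both data concepts lie under H.2
```

Under the classic model the paper would not match the query "Data Management" at all, while the hierarchy-aware score recognizes that both data concepts are descendants of H.2.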
30SemCN
Generalized Vector Space Model [GMW03]
32SemCN
Full-Text Indexes
Semantic Indexes
33SemCN
Show the Effectiveness
Efficiency is future work!
34SemCN
Tuple Sets Merger
Score Normalization
Combine two kinds of Scores
37PreCN
PreCN (Preprocessing)
38PreCN
Fig. 2: Comparing Tts, Tcn and Tsql. MaxCNsize is fixed at 5 and top-k at 100; KeywNum varies.
40PreCN
- Basic Ideas
  - Preprocess the maximum Tuple Sets Graph to generate CNs in advance
  - Preprocess the schema information (stable), not the data themselves (changeable).
  - At query time, retrieve the set of CNs for a user query instead of generating CNs on the fly
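The preprocessing idea can be sketched as follows: join paths are enumerated once from the schema graph and stored in a lookup table, so that a query only needs a dictionary lookup. The schema, the `venue` table, and all function names are hypothetical. Actual candidate networks distinguish free and non-free tuple sets and may repeat relations, which this simplified sketch ignores, and it only covers keywords that fall into at most two distinct tables.

```python
from itertools import combinations

# Hypothetical schema graph as an adjacency list of foreign-key links.
SCHEMA = {
    "author": ["writes"],
    "writes": ["author", "paper"],
    "paper": ["writes", "venue"],
    "venue": ["paper"],
}

def simple_paths(src, dst, max_size):
    """All simple join paths from src to dst using at most max_size tables."""
    results = []

    def dfs(path):
        node = path[-1]
        if node == dst:
            results.append(tuple(path))
            return
        if len(path) == max_size:
            return
        for nxt in SCHEMA[node]:
            if nxt not in path:
                dfs(path + [nxt])

    dfs([src])
    return results

def precompute_cns(max_size):
    """Offline step: depends only on the (stable) schema, not on the data."""
    index = {frozenset([t]): [(t,)] for t in SCHEMA}
    for a, b in combinations(sorted(SCHEMA), 2):
        index[frozenset([a, b])] = simple_paths(a, b, max_size)
    return index

def lookup_cns(index, keyword_tables):
    """Query-time step: fetch the stored networks for the tables hit by the keywords."""
    return index.get(frozenset(keyword_tables), [])

if __name__ == "__main__":
    index = precompute_cns(max_size=5)
    print(lookup_cns(index, {"author", "paper"}))
    # prints [('author', 'writes', 'paper')]
```

Because the index is keyed by table sets rather than by data values, it stays valid while tuples are inserted or deleted and only needs rebuilding when the schema changes.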
48CLASCN
CLASCN (Classification, Learning, and Selection)
50CLASCN
Important observation: the top-k results are distributed over only a few CNs, while tens or hundreds of CNs can be generated for a keyword query. For example, for the DBLP (http://dblp.uni-trier.de/) database, the top 100 results per user query are distributed over only 3 CNs on average, and the top 100 results for a large number of user queries are distributed over only about 22% of all CNs.
51CLASCN
- Basic Ideas
  - Each CN is viewed as a database
  - Construct a CN language model, CN(k1, k2, ..., km)
  - Compute the similarity between a user query Q(k1, k2, ..., km) and the CNs
  - Select the most promising CNs to produce the top-k results.
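One way to picture these steps is a smoothed unigram language model per CN, with the query scored by its log-likelihood under each model. This is a generic sketch, not the CLASCN method itself: the keyword counts, the CN labels, the additive smoothing, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, cn_counts, vocab_size, alpha=1.0):
    """Log-likelihood of the query keywords under one CN's unigram model,
    using additive (Laplace-style) smoothing with parameter alpha."""
    total = sum(cn_counts.values())
    return sum(math.log((cn_counts[k] + alpha) / (total + alpha * vocab_size))
               for k in query)

def select_cns(query, cn_models, n):
    """Rank CNs by query likelihood and keep the n most promising ones."""
    vocab = set(query)
    for counts in cn_models.values():
        vocab.update(counts)
    ranked = sorted(
        cn_models,
        key=lambda cn: query_log_likelihood(query, cn_models[cn], len(vocab)),
        reverse=True,
    )
    return ranked[:n]

if __name__ == "__main__":
    # Hypothetical keyword statistics, e.g. collected from earlier query results.
    cn_models = {
        "author-writes-paper": Counter({"gray": 5, "transaction": 7, "database": 3}),
        "paper": Counter({"transaction": 4, "benchmark": 2}),
        "author": Counter({"gray": 6, "watson": 1}),
    }
    print(select_cns(["gray", "transaction"], cn_models, n=2))
    # prints ['author-writes-paper', 'author']
```

Only the selected CNs would then be translated into SQL and executed, which is where the saving comes from when most CNs contribute nothing to the top-k results.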
55CLASCN
Selecting
Learning
62References(1)
[HW05] Yingjie He, Shan Wang. Efficient Top-k Query Processing in Pure Peer-to-Peer Network. Journal of Software, 2005, 16(4): 540-552.
[HFW04] Yingjie He, Yanfeng Su, Shan Wang, Xiaoyong Du. Efficient Top-k Query Processing in P2P Network. Database and Expert Systems Applications (DEXA 2004), Lecture Notes in Computer Science 3180, pp. 381-390, August 2004, Spain.
[SK05] Shan Wang, Kun-Long Zhang. Searching Databases with Keywords. Journal of Computer Science and Technology, 2005, 20(1): 55-62.
[WW05] Jijun Wen, Shan Wang. SEEKER: Keyword-based Information Retrieval over Relational Databases. Journal of Software, 2005.
[WZ04] Shan Wang, Kunlong Zhang. Database System on the Grid. Journal of Computer Applications, 2004, 24(10): 1-3.
[MZW04] Xiaofeng Meng, Longxiang Zhou, Shan Wang. State of the Art and Trends in Database Research. Journal of Software, 2004, 15(12): 1822-1836.
[HYJ04] Yingjie He. A Study of Content-based Search Techniques in Peer-to-Peer Network. PhD thesis, Renmin University of China, 2004.
63References(2)
[WEN05] Jijun Wen. Keyword-based Information Retrieval over Relational Databases. PhD thesis, Renmin University of China, 2005.
[ZKL05] Kunlong Zhang. Research on New Preprocessing Technology for Keyword Search in Databases. PhD thesis, Renmin University of China, 2005.
[YJL05] Jiali Yao. DETECTOR: A Universal Database Retrieval System Based on Dynamic Database. Master thesis, Renmin University of China, 2005.
[WYT05] Yunting Wang. A Study of Integration Techniques of Heterogeneous Information Sources Based on Grid. Master thesis, Renmin University of China, 2005.
[WAN05] Shan Wang et al. Database and Information System Research and Challenge (1988-2003 research reports). Higher Education Press, 2005.8.
64References(3)
[FR97] Norbert Fuhr, Thomas Rölleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Transactions on Information Systems, 15(1), 1997, 32-66.
[IAE03] I. Ilyas, W. Aref, and A. Elmagarmid. Supporting Top-k Join Queries in Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[LCI05] Chengkai Li, Kevin Chen-Chuan Chang, Ihab F. Ilyas, Sumin Song. RankSQL: Query Algebra and Optimization for Relational Top-k Queries. SIGMOD 2005, 131-142.
[DCE04] Souripriya Das, Eugene Inseok Chong, George Eadon, Jagannathan Srinivasan. Supporting Ontology-Based Semantic Matching in RDBMS. In Proceedings of the 30th International Conference on Very Large Data Bases, 2004, 1054-1065.
[SB98] G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5), 1988, 513-523.
65References(4)
[GSV98] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[BHN02] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[ABP02] B. Aditya, Gaurav Bhalotia, Parag, Charuta Nakhe, Arvind Hulgeri, Soumen Chakrabarti, and S. Sudarshan. BANKS: Browsing and Keyword Searching in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002. Demonstration.
[ACD02] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[HP02] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
66References(5)
[HGP03] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[SW03] Qi Su, Jennifer Widom. Efficient and Extensible Keyword Search over Relational Databases. Stanford University Technical Report, 2003.
[BHP04] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In Proceedings of the 30th International Conference on Very Large Data Bases, 2004.
[SU03] Q. Su, J. Widom. Indexing Relational Database Content Offline for Efficient Keyword-Based Search. Technical Report, http://dbpubs.stanford.edu/pub/2003-13, Stanford University, 2003.
[DEG98] S. Dar, G. Entin, S. Geva, and E. Palmon. DTL's DataSpot: Database Exploration Using Plain Language. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[WLK03] R. Wheeldon, M. Levene, and K. Keenoy. Search and Navigation in Relational Databases. http://arxiv.org/abs/cs.DB/0307073
[GMW03] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting Hierarchical Domain Structure to Compute Similarity. ACM Trans. Inf. Syst. 21(1), 2003, 64-93.
67High Performance Database Lab
Q & A
Thanks!
Ph.D. Candidates: Jun Zhang (zhangjun11_at_ruc.edu.cn), Zhaohui Peng (pengch_at_ruc.edu.cn)
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25