High Performance Database Lab


1
High Performance Database Lab
Ph.D. Candidates: Jun Zhang, Zhaohui Peng
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25
2
Outline

Introduction to Our Lab
Introduction to KSORD
Our work on KSORD

3
High Performance Database Lab

High Performance Database Lab, Key Laboratory of Data Engineering and Knowledge Engineering, MOE
Cooperation with NCR Teradata: Data Warehouse and Business Intelligence United Lab
Cooperation with HP Labs China: Main Memory DBMS, Parallel DBMS
Composition: 7 faculty, 20 graduate students (9 Ph.D. students), 3 research groups
1. Core Technology of Databases
2. Grid Data Management
3. Database and IR

4
High Performance Database Lab

1. Core Technology of Databases
(1) KingbaseES
A large-scale, general-purpose, highly efficient relational DBMS developed in China with independent intellectual property rights (database company: Basesoft IT Ltd).
Awarded under the 863 research program of China, KingbaseES received the highest score in the national DBMS testing of China in 2005.
KingbaseES is mainly applied in e-government, education, manufacturing, mobile communications, etc.
(2) DBMS Self-tuning and Self-management
Self-configuration, self-optimization, self-healing, and self-protection make the DBMS more intelligent and more available.
(3) Main Memory DBMS, Parallel DBMS
(4) XML RDBMS

5
High Performance Database Lab
2. Grid Data Management
(1) Information grid and platform construction
(2) Massive data management in the grid
(3) PDBMS
Existing P2P systems lack the data management capabilities typically found in a DBMS. PDBMS presents a flexible framework for sharing data across heterogeneous data sources. Our research focuses mainly on the quality of semantic mapping, query algorithms, and data consistency.
6
High Performance Database Lab

3. DB-IR
Information retrieval on relational databases
(1)keyword search techniques
(2)semantic keyword search
(3)clustering search results
(4)relevance feedback

7
Outline

Introduction to Our Lab
Introduction to KSORD
Our work on KSORD

8
DB-IR is a Hot Topic
SIGMOD 2006: Effective Keyword Search in Relational Databases
SIGMOD 2005 panel: Databases and Information Retrieval: Rethinking the Great Divide
VLDB 2005: two sessions (Session 18: DB and IR 1; Session 22: DB and IR 2)
VLDB 2004: DB-IR tutorial
VLDB 2002 tutorial: Text Search for Fine-grained Semi-structured Data
9
How to integrate DB and IR ?

Option 1: Tie together existing DB and IR systems
Example: approaches based on SQL/MM
Option 2: Extend existing DB systems with IR functionality, or vice versa
Example: add searching and ranking to an RDBMS
Option 3: Design a new data management system from the ground up
Example: the Quark data management system

10
Why is KSORD Necessary ?

IR systems search unstructured data by keywords.
Results are usually imprecise and incomplete.

DB systems search structured data with SQL.
Results are sound and complete, and all results are equally good.

Can we search databases with keywords? Yes.
Is searching databases with keywords necessary? Motivation:
11
Why is KSORD Necessary ?

Web users and casual users expect to query databases with free-form keyword queries, just as they search the Web with a search engine.
Web users don't know the database schema or SQL.
Hidden Web problem: searching hidden databases with keyword queries.
Most data on the Web is stored in databases and hidden from search engines; only a small part of it can be found by search engines.
There is a mismatch between the search interfaces of search engines and those of databases.
If database systems support keyword search, publishing and searching a database on the Web is expected to become simpler, and the deep Web problem can be alleviated.

12
Why is KSORD Necessary ?

Keyword query: a unified query language/interface for integrating diverse kinds of information systems
Modern information systems must manage many kinds of data:
structured relational data: SQL
semi-structured XML documents: XQuery
unstructured text documents: keyword query
Different query languages currently have to be used to search different kinds of data.

13
What is KSORD ?
Query: gray transaction

AUTHOR
Tuple  AuthorID  Name
a1     A101      Donald D. Chamberlin
a2     A102      Jim Gray
a3     A103      Vera Watson

WRITE
Tuple  PaperID  AuthorID
w1     P101     A101
w2     P102     A102
w3     P103     A102

PAPER
Tuple  PaperID  Title                                           Year  Type
p1     P101     Specifying Queries as Relational Expressions    1974  inproceedings
p2     P102     Transaction Processing Concepts and Techniques  1982  book
p3     P103     Database and Transaction Processing Benchmarks  1992  inproceedings

Result: a2-w2-p2, a2-w3-p3
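
The example above can be reproduced with a brute-force sketch: join AUTHOR, WRITE, and PAPER on their keys and keep every joined tuple tree whose text contains all query keywords. The dictionary layout and the function name keyword_search are illustrative assumptions, not part of any KSORD system; real systems avoid enumerating all joins.

```python
# Toy data copied from the slide; tuple labels (a1, w1, p1, ...) follow the slide.
AUTHOR = {
    "a1": ("A101", "Donald D. Chamberlin"),
    "a2": ("A102", "Jim Gray"),
    "a3": ("A103", "Vera Watson"),
}
WRITE = {  # (PaperID, AuthorID)
    "w1": ("P101", "A101"),
    "w2": ("P102", "A102"),
    "w3": ("P103", "A102"),
}
PAPER = {
    "p1": ("P101", "Specifying Queries as Relational Expressions"),
    "p2": ("P102", "Transaction Processing Concepts and Techniques"),
    "p3": ("P103", "Database and Transaction Processing Benchmarks"),
}


def keyword_search(keywords):
    """Return joined tuple trees (author-write-paper) containing every keyword."""
    results = []
    for a_id, (author_id, name) in AUTHOR.items():
        for w_id, (w_paper_id, w_author_id) in WRITE.items():
            if w_author_id != author_id:
                continue  # WRITE tuple belongs to another author
            for p_id, (paper_id, title) in PAPER.items():
                if paper_id != w_paper_id:
                    continue  # WRITE tuple points at another paper
                words = set((name + " " + title).lower().split())
                if all(k.lower() in words for k in keywords):
                    results.append(f"{a_id}-{w_id}-{p_id}")
    return results


print(keyword_search(["gray", "transaction"]))
```

Each result is a small tree of tuples connected by key references, which is why the answer lists a2-w2-p2 and a2-w3-p3 rather than single rows.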
14
How to Realize KSORD ?
In integrating database and IR techniques, we focus on how to implement IR in relational databases.

Method 1: improve the DBMS
Improve the relational algebra model, the query processor, and SQL to enhance proximity search, top-k, and semantic search in databases, e.g. [FR97] (PRA, ACM 1997), [IAE03] (top-k, VLDB 2003), [DCE04] (OSS, VLDB 2004).
A new-generation DBMS should have these properties.
Method 2: middleware
Built on the full-text indexing provided by the RDBMS, keyword search over relational databases lets casual users search relational databases with keyword queries (sets of keywords), just like searching the Web, without any knowledge of the database schema and without writing SQL queries,
e.g. [GSV98] (VLDB 1998), [BHN02] (ICDE 2002), [HP02] (VLDB 2002), VLDB 2003, [BHP04] (VLDB 2004).

15
How to Realize KSORD ?

Method 2: middleware
(1) Offline systems
Retrieve results for a keyword query from an intermediate representation generated by "crawling" the database in advance. Offline systems execute queries efficiently, but they cannot see up-to-date data, and they need a long preprocessing time and a large amount of storage to generate the intermediate representation.
EKSO [SU03] (Stanford 2003): indexes text objects (virtual documents)
DataSpot [DEG98] (VLDB 1998): builds an external graph-based hyperbase
DbSurfer [WLK03]: indexes the textual content of tuples as virtual web pages
ObjectRank [BHP04] (VLDB 2004): ranking similar to PageRank
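
A minimal sketch of the offline idea, assuming each virtual document is simply the concatenated text of a paper and its authors: an inverted index is built once in advance and queries are answered from the index alone. The names build_index and search and the document layout are hypothetical and are not taken from any of the systems listed above.

```python
from collections import defaultdict

# Hypothetical virtual documents: one text object per paper, built by "crawling"
# the database ahead of time (paper title plus author names).
VIRTUAL_DOCS = {
    "p1": "Specifying Queries as Relational Expressions Donald D. Chamberlin",
    "p2": "Transaction Processing Concepts and Techniques Jim Gray",
    "p3": "Database and Transaction Processing Benchmarks Jim Gray",
}


def build_index(docs):
    """Offline step: map each lowercase word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, keywords):
    """Online step: ids of virtual documents that contain every keyword."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(VIRTUAL_DOCS)
print(search(index, ["gray", "transaction"]))
```

The trade-off named on the slide is visible here: search never touches the database, so it is fast, but any tuple inserted after build_index ran stays invisible until the index is rebuilt.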

16
How to Realize KSORD ?

(2) Online systems
Convert a keyword query into many SQL queries and run them against the database itself. Online systems can retrieve the latest data from the database, but execution may be inefficient because the generated SQL queries usually contain many joins.
Data-graph-based method
BANKS [BHN02] (ICDE 2002): models the whole database as a graph
Schema-graph-based method
DBXplorer [ACD02] (ICDE 2002), DISCOVER [HP02] (VLDB 2002), Efficient IR-Style [HGP03] (VLDB 2003): model the database schema as a graph
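
The conversion of one keyword query into several SQL queries can be sketched with Python's built-in sqlite3 module. This is a simplification under stated assumptions: it uses a single fixed join tree (author, writes, paper) and only varies which text column each keyword must match, whereas schema-graph-based systems also enumerate different join trees. The table names, the column list TEXT_COLUMNS, and the helper names are illustrative.

```python
import itertools
import sqlite3

TEXT_COLUMNS = ["a.name", "p.title"]  # text attributes a keyword may match

JOIN_TEMPLATE = (
    "SELECT a.author_id, p.paper_id "
    "FROM author a "
    "JOIN writes w ON w.author_id = a.author_id "
    "JOIN paper p ON p.paper_id = w.paper_id "
    "WHERE {conditions}"
)


def build_demo_db():
    """Create an in-memory database holding the toy AUTHOR/WRITE/PAPER data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author (author_id TEXT, name TEXT)")
    conn.execute("CREATE TABLE paper (paper_id TEXT, title TEXT)")
    conn.execute("CREATE TABLE writes (paper_id TEXT, author_id TEXT)")
    conn.executemany("INSERT INTO author VALUES (?, ?)", [
        ("A101", "Donald D. Chamberlin"), ("A102", "Jim Gray"), ("A103", "Vera Watson"),
    ])
    conn.executemany("INSERT INTO paper VALUES (?, ?)", [
        ("P101", "Specifying Queries as Relational Expressions"),
        ("P102", "Transaction Processing Concepts and Techniques"),
        ("P103", "Database and Transaction Processing Benchmarks"),
    ])
    conn.executemany("INSERT INTO writes VALUES (?, ?)", [
        ("P101", "A101"), ("P102", "A102"), ("P103", "A102"),
    ])
    return conn


def keyword_query_to_sql(keywords):
    """Yield one (sql, params) pair per assignment of keywords to text columns."""
    for columns in itertools.product(TEXT_COLUMNS, repeat=len(keywords)):
        conditions = " AND ".join(f"{col} LIKE ?" for col in columns)
        params = [f"%{kw}%" for kw in keywords]
        yield JOIN_TEMPLATE.format(conditions=conditions), params


def search(conn, keywords):
    """Run every generated query and merge the distinct result rows."""
    rows = set()
    for sql, params in keyword_query_to_sql(keywords):
        rows.update(conn.execute(sql, params).fetchall())
    return sorted(rows)


conn = build_demo_db()
print(search(conn, ["gray", "transaction"]))
```

Even in this tiny setting two keywords produce four three-way join queries, which illustrates why online systems always see fresh data but can be slow.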

17
What is Our Previous Work (Basic Work)
Survey: searching relational databases with keywords [SK05] (JCST)
Prototype system implementations: BANKS, DISCOVER
Prototype study: IR-Style (DISCOVER II)
18
What is Our Previous Work (Innovative Work)

SEEKER: Keyword-based Information Retrieval over Relational Databases
DETECTOR: A universal database retrieval system based on dynamic database

Research on New Preprocessing Technology for
Keyword Search in Databases
A Study of Content-based Search Techniques in
Peer-to-Peer Network

A Study of Integration Techniques of
Heterogeneous Information Sources Based on Grid

19
Outline

Introduction to Our Lab
Introduction to KSORD
Our work on KSORD

20
What is Worth Doing?

How to Realize KSORD?
SEEKER (schema-graph-based online system, done)
DETECTOR (data-graph-based online system, done)
ITREKS (offline system, in progress)
How to Improve KSORD?
Efficiency
HUNTER (preprocessing techniques, done)
Effectiveness
Effective Keyword Search in Relational Databases [Fang Liu, SIGMOD 2006]

21
What are We Doing?

How to Improve KSORD?
Efficiency
QuickCN
PreCN: Preprocessing Candidate Networks for Efficient Keyword Search over Databases (submitted).
CLASCN: Candidate Network Selection for Efficient Top-k Keyword Queries over Databases (submitted).
JoinCN: in progress.
Effectiveness
Result Representation
TreeCluster: Clustering Results of Keyword Search over Databases. WAIM 2006, Hong Kong (accepted).
Semantic Search
Si-SEEKER: Ontology-based Semantic Search over Databases. KSEM 2006, August 5, Guilin, China (accepted).

22
Our Framework
SemCN (Semantic Search Ontology)
PreCN (Preprocessing)
CLASCN (Classification, Learning, and Selection)
JoinCN (DBMS, Top-k Algorithm)
23
SemCN
SemCN (Semantic Search Ontology)
24
SemCN
25
SemCN

Basic Ideas
Exploit a domain ontology to construct semantic indexes in the relational database.
Semantic indexes support semantic search, just as full-text indexes support keyword search.
Use the Generalized Vector Space Model to compute semantic similarity from the hierarchical structure of the domain ontology.
Combine semantic search with keyword search.

26
SemCN
27
SemCN

Domain Ontology
ACM CCS98 (1475 concepts, 2 relationships: subClassOf, relatedTo)
Data Set: DBLP
Annotation
SIGMOD XML data set (477 annotated papers and 1369 semantic index entries)
Crawling the ACM Digital Library: about 20,000 papers (in progress)
Concept Extractor
A simple concept extractor (Stanford Parser)

28
SemCN
Domain Ontology: Computer Science
29
SemCN

How to compute semantic similarity?
Data concept vector and query concept vector
Structured data: semantic indexes
D -> Concept Extractor -> D(C1, C2, C3, ...)
Keyword query: concept extractor
Q -> Concept Extractor -> Q(C1, C2, C3, ...)
For example:
Keyword query: "Data Management" -> Q(H.2)
Data (paper title): "Enriching the conceptual basis for query formulation through relationship semantics in databases." -> D(H.2.1.1, H.2.3.4)
In the classic vector space model, different concepts are assumed to be perpendicular to each other, so their dot product is zero. Is that assumption appropriate here?

30
SemCN
Generalized Vector Space Model [GMW03]
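
A small sketch of how a generalized vector space model can give Q(H.2) and D(H.2.1.1, H.2.3.4) a non-zero similarity. It assumes a depth-based dot product between concepts, 2 * depth(LCA) / (depth(c1) + depth(c2)), in the spirit of [GMW03], with unit concept weights and the dotted classification code as the path from the root. Whether SemCN uses exactly this formula and weighting is an assumption.

```python
import math


def depth(concept):
    """Depth of a concept in the hierarchy, e.g. 'H.2.1.1' has depth 4."""
    return len(concept.split("."))


def lca_depth(c1, c2):
    """Depth of the lowest common ancestor, i.e. length of the shared code prefix."""
    shared = 0
    for x, y in zip(c1.split("."), c2.split(".")):
        if x != y:
            break
        shared += 1
    return shared


def concept_dot(c1, c2):
    """Dot product of two concept axes: 1.0 if identical, 0.0 if unrelated."""
    return 2.0 * lca_depth(c1, c2) / (depth(c1) + depth(c2))


def similarity(query_concepts, data_concepts):
    """Cosine similarity of two unit-weight concept vectors under concept_dot."""
    def dot(u, v):
        return sum(concept_dot(a, b) for a in u for b in v)

    if not query_concepts or not data_concepts:
        return 0.0
    return dot(query_concepts, data_concepts) / math.sqrt(
        dot(query_concepts, query_concepts) * dot(data_concepts, data_concepts)
    )


print(round(similarity(["H.2"], ["H.2.1.1", "H.2.3.4"]), 4))
```

Under the classic model the query and the paper share no concept and would score zero; here the shared ancestor H.2 contributes to every pairwise dot product, so the paper is still retrieved.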
31
SemCN
32
SemCN
Full-Text Indexes
Semantic Indexes
33
SemCN
Show the Effectiveness
Efficiency is the future work!
34
SemCN
Tuple Sets Merger
Score Normalization
Combine two kinds of Scores
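
One possible way to merge the two tuple sets and combine their scores is sketched below. The slide does not state the normalization or the combination rule, so min-max normalization, the linear weight alpha, and the function names are all assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {tuple_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {tid: 1.0 for tid in scores}  # all scores equal
    return {tid: (s - low) / (high - low) for tid, s in scores.items()}


def combine(keyword_scores, semantic_scores, alpha=0.5):
    """Merge both tuple sets and mix normalized scores with weight alpha."""
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    merged_ids = set(kw) | set(sem)  # a tuple may appear in only one set
    return {
        tid: alpha * kw.get(tid, 0.0) + (1 - alpha) * sem.get(tid, 0.0)
        for tid in merged_ids
    }


combined = combine({"p1": 2.0, "p2": 4.0}, {"p2": 0.1, "p3": 0.3})
print(sorted(combined.items()))
```

Normalizing first matters because full-text scores and semantic similarities usually live on different scales; without it one kind of score would dominate the ranking.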
35
SemCN
36
SemCN
37
PreCN
PreCN (Preprocessing)
38
PreCN
Fig. 2: Comparing Tts, Tcn, and Tsql. MaxCNsize is fixed at 5 and top-k at 100; KeywNum varies.
39
PreCN
40
PreCN

Basic Ideas
Preprocess the maximum tuple sets graph to generate CNs in advance.
Preprocess the schema information (stable), not the data itself (which changes).
Retrieve the precomputed set of CNs for a user query instead of generating CNs at query time.
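
The precompute-then-look-up idea can be sketched as follows. This is a strong simplification and not the PreCN algorithm: candidate networks are reduced to simple join paths over a toy schema graph, and they are keyed only by the relations at their two ends. The schema, the function names, and the keying scheme are assumptions.

```python
from collections import defaultdict

# Toy schema graph: relation -> relations it joins with (assumed for illustration).
SCHEMA = {
    "AUTHOR": ["WRITE"],
    "WRITE": ["AUTHOR", "PAPER"],
    "PAPER": ["WRITE"],
}


def precompute_networks(schema, max_size):
    """Offline step: enumerate simple join paths of at most max_size relations."""
    table = defaultdict(list)

    def extend(path):
        # Store each undirected path once, in its smaller orientation.
        if path <= path[::-1]:
            table[frozenset((path[0], path[-1]))].append(tuple(path))
        if len(path) < max_size:
            for nxt in schema[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in schema:
        extend([start])
    return dict(table)


def lookup_networks(table, keyword_relations):
    """Online step: fetch precomputed networks whose end relations hold the keywords."""
    return table.get(frozenset(keyword_relations), [])


table = precompute_networks(SCHEMA, max_size=3)
print(lookup_networks(table, {"AUTHOR", "PAPER"}))
```

Because only the schema is traversed, the table stays valid when tuples are inserted or deleted, and it needs rebuilding only when the schema or the size limit changes.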

41
PreCN
42
PreCN
43
PreCN
44
PreCN
45
PreCN
46
PreCN
47
PreCN
48
CLASCN
CLASCN (Classification, Learning, and Selection)
49
CLASCN
50
CLASCN
Important observation: the top-k results are concentrated in a few CNs, while tens or hundreds of CNs can be generated for a keyword query. For example, for the DBLP database (http://dblp.uni-trier.de/), the top 100 results of a user query are distributed over only 3 CNs on average, and the top 100 results for a large number of user queries are distributed over only about 22 of all CNs.
51
CLASCN

Basic Ideas
View each CN as a database.
Construct a CN language model, CN(k1, k2, ..., km).
Compute the similarity between a user query Q(k1, k2, ..., km) and each CN.
Select the most promising CNs to produce the top-k results.
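
The selection step can be illustrated with a unigram language model per CN, learned from past queries, and a query-likelihood ranking. The training data format, the additive smoothing parameter mu, and the function names are assumptions; the slides do not specify the model CLASCN actually uses.

```python
import math
from collections import Counter


def build_cn_models(training):
    """Learn keyword counts per CN from (cn_id, keywords) pairs of past queries."""
    models = {}
    for cn_id, keywords in training:
        models.setdefault(cn_id, Counter()).update(keywords)
    return models


def log_likelihood(query, counts, vocab_size, mu=1.0):
    """Log-probability of the query under one CN model with additive smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[k] + mu) / (total + mu * vocab_size)) for k in query
    )


def select_cns(query, models, n):
    """Return the n CNs most likely to have produced the query keywords."""
    vocab = set(query).union(*models.values())
    return sorted(
        models,
        key=lambda cn: log_likelihood(query, models[cn], len(vocab)),
        reverse=True,
    )[:n]


# Hypothetical training log: which CN produced top-k results for which keywords.
training = [
    ("CN1", ["gray", "transaction"]),
    ("CN1", ["gray", "benchmark"]),
    ("CN2", ["xml", "index"]),
    ("CN3", ["xml", "query"]),
]
models = build_cn_models(training)
print(select_cns(["gray", "transaction"], models, 1))
```

Only the selected CNs are then turned into SQL and executed, which is how the observation that top-k results concentrate in a few CNs translates into saved join work.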

52
CLASCN
53
CLASCN
54
CLASCN
55
CLASCN
Selecting
Learning
56
CLASCN
57
CLASCN
58
CLASCN
59
CLASCN
60
CLASCN
61
CLASCN
62
References(1)
[HW05] Yingjie He, Shan Wang. Efficient Top-k Query Processing in Pure Peer-to-Peer Network. Journal of Software, 2005, 16(4): 540-552.
[HFW04] Yingjie He, Yanfeng Su, Shan Wang, Xiaoyong Du. Efficient Top-k Query Processing in P2P Network. Database and Expert Systems Applications (DEXA 2004), Proceedings, Lecture Notes in Computer Science 3180, pp. 381-390, August 2004, Spain.
[SK05] Shan Wang, Kun-Long Zhang. Searching Databases with Keywords. Journal of Computer Science and Technology, 2005, 20(1): 55-62.
[WW05] Jijun Wen, Shan Wang. SEEKER: Keyword-based Information Retrieval over Relational Databases. Journal of Software, 2005.
[WZ04] Shan Wang, Kunlong Zhang. Database System on the Grid. Journal of Computer Applications, 24(10): 1-3, 2004.
[MZW04] Xiaofeng Meng, Longxiang Zhou, Shan Wang. State of the Art and Trends in Database Research. Journal of Software, 2004, 15(12): 1822-1836.
[HYJ04] Yingjie He. A Study of Content-based Search Techniques in Peer-to-Peer Network. PhD thesis, Renmin University of China, 2004.
63
References(2)
[WEN05] Jijun Wen. Keyword-based Information Retrieval over Relational Databases. PhD thesis, Renmin University of China, 2005.
[ZKL05] Kunlong Zhang. Research on New Preprocessing Technology for Keyword Search in Databases. PhD thesis, Renmin University of China, 2005.
[YJL05] Jiali Yao. DETECTOR: A Universal Database Retrieval System Based on Dynamic Database. Master's thesis, Renmin University of China, 2005.
[WYT05] Yunting Wang. A Study of Integration Techniques of Heterogeneous Information Sources Based on Grid. Master's thesis, Renmin University of China, 2005.
[WAN05] Shan Wang et al. Database and Information System Research and Challenge (1988-2003 research reports). Higher Education Press, 2005.8.
64
References(3)
[FR97] Norbert Fuhr, Thomas Rolleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Transactions on Information Systems, 15(1), 1997: 32-66.
[IAE03] I. Ilyas, W. Aref, A. Elmagarmid. Supporting Top-k Join Queries in Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[LCI05] Chengkai Li, Kevin Chen-Chuan Chang, Ihab F. Ilyas, Sumin Song. RankSQL: Query Algebra and Optimization for Relational Top-k Queries. SIGMOD 2005: 131-142.
[DCE04] Souripriya Das, Eugene Inseok Chong, George Eadon, Jagannathan Srinivasan. Supporting Ontology-Based Semantic Matching in RDBMS. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, 2004: 1054-1065.
[SB98] G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5), 1988: 513-523.
65
References(4)
[GSV98] R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina. Proximity Search in Databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[BHN02] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[ABP02] B. Aditya, Gaurav Bhalotia, Parag, Charuta Nakhey, Arvind Hulgeri, Soumen Chakrabarti, S. Sudarshan. BANKS: Browsing and Keyword Searching in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002. Demonstration.
[ACD02] S. Agrawal, S. Chaudhuri, G. Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In Proceedings of the 18th International Conference on Data Engineering, 2002.
[HP02] V. Hristidis, Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
66
References(5)
[HGP03] V. Hristidis, L. Gravano, Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003.
[SW03] Qi Su, Jennifer Widom. Efficient and Extensible Keyword Search over Relational Databases. Stanford University Technical Report, 2003.
[BHP04] A. Balmin, V. Hristidis, Y. Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In Proceedings of the 30th International Conference on Very Large Data Bases, 2004.
[SU03] Q. Su, J. Widom. Indexing Relational Database Content Offline for Efficient Keyword-Based Search. Technical Report, http://dbpubs.stanford.edu/pub/2003-13, Stanford University, 2003.
[DEG98] S. Dar, G. Entin, S. Geva, E. Palmon. DTL's DataSpot: Database Exploration Using Plain Language. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[WLK03] R. Wheeldon, M. Levene, K. Keenoy. Search and Navigation in Relational Databases. http://arxiv.org/abs/cs.DB/0307073
[GMW03] P. Ganesan, H. Garcia-Molina, J. Widom. Exploiting Hierarchical Domain Structure to Compute Similarity. ACM Trans. Inf. Syst. 21(1), 2003: 64-93.
67
High Performance Database Lab
Q & A
Thanks!
Ph.D. Candidates: Jun Zhang (zhangjun11_at_ruc.edu.cn), Zhaohui Peng (pengch_at_ruc.edu.cn)
Supervisor: Prof. Shan Wang
School of Information, Renmin University of China
Key Laboratory of Data Engineering and Knowledge Engineering, MOE
2006-05-25
