Title: Data Integration and Information Retrieval
1. Data Integration and Information Retrieval
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- February 18, 2008
Some slides by Berthier Ribeiro-Neto
2. Reminders and Announcements
- Homework 3 handed out
- Midterm on Thu 3/20, 80 minutes, closed-book
- Past midterm available on the web site
3. Where We Left Off
- We could use XQueries to translate data from one
XML schema to another
4. Translating Values with a Concordance Table

- Variables: $pid is a PennID, $n a name, $s an ssn, $tr a treatment; the concordance maps $f (a PennID) to $t (an ssn)

  for $p in doc("student.xml")/db/student,
      $pid in $p/pennid/text(), $n in $p/name/text(),
      $d in doc("dental.xml")/db/patient,
      $s in $d/ssn/text(), $tr in $d/treatment/text(),
      $m in doc("concord.xml")/db/mapping,
      $f in $m/from/text(), $t in $m/to/text()
  where $pid = $f and $s = $t
  return <student> <name> {$n} </name> <treatment> {$tr} </treatment> </student>

- student.xml:
  <student><pennid>12346</pennid>
   <name>Mary McDonald</name>
   <taking><sem>F03</sem><class>cse330</class></taking>
  </student>

- dental.xml:
  <patient><ssn>323-468-1212</ssn>
   <treatment>Dental sealant</treatment></patient>

- concord.xml:
  <mapping><from>12346</from><to>323-468-1212</to></mapping>
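The concordance-table join above can be sketched in plain Python, with dicts standing in for the three XML documents (the data values are the ones on the slide; the function name is illustrative):

```python
# Data from the slide: students keyed by PennID, dental patients keyed by SSN,
# and a concordance table mapping PennIDs to SSNs.
students = [{"pennid": "12346", "name": "Mary McDonald"}]
patients = [{"ssn": "323-468-1212", "treatment": "Dental sealant"}]
concordance = [{"from": "12346", "to": "323-468-1212"}]

def translate():
    """Join students to patients through the PennID -> SSN concordance,
    mirroring the where clause $pid = $f and $s = $t."""
    results = []
    for s in students:
        for m in concordance:
            if m["from"] != s["pennid"]:
                continue
            for p in patients:
                if p["ssn"] == m["to"]:
                    results.append({"name": s["name"],
                                    "treatment": p["treatment"]})
    return results

print(translate())
```

The nested loops correspond directly to the FLWOR `for` clauses: the concordance table plays the role of a join bridge between the two incompatible key spaces.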
5. Drawbacks to Point-to-Point Mappings

- They can get data from one source to another, but what if you want to see elements that aren't shared?
- Painful to create n² mappings
- Sometimes we don't actually want to ship the data from one source to another, but to see both
  - We don't want to put Barnes & Noble's inventory INTO Amazon's, but we want to see books from both
- Two alternate strategies:
  - Hierarchy: map everything to a mediator
  - Peer-to-peer: map data across a web of mappings (PDMS; see CIS 650)
6. Data Integration and Warehousing

- Create a middleware "mediator" or data integration system over the sources
- All sources are mapped to a common mediated schema
  - Warehouse approach: actually has a central database, and loads data from the sources into it
  - Virtual approach: has just a schema; it consults the sources to answer each query
- The mediator accepts queries over the central schema and returns all relevant answers
7. Typical Data Integration Components

[Architecture diagram: a query goes to the Data Integration System / Mediator, which holds the Mediated Schema and a Source Catalog (query-based schema mappings in the catalog); wrappers sit between the mediator and each source's data; results flow back to the user.]
8. Mediator / Virtual Integration Systems

- The subject of much research since the '80s and especially the '90s
  - Examples: TSIMMIS, Information Manifold, MIX, Garlic, ...
- Original focus was on the Web
  - Real-world integration companies (IBM, BEA/Oracle, Actuate, ...) are focusing on the enterprise more!
- A common model:
  - Take the source data
  - Define a schema mapping that produces content for the mediated schema, based on the source data
  - The data for the mediated schema is the union of all of the mappings
9. Answering Queries

- Based on view unfolding: composing a query and a view
- The query is posed over the mediated schema:

  for $b in document("dblp.xml")/root/book
  where $b/title/text() = "Distributed Systems" and
        $b/author/text() = "Tanenbaum"
  return $b

- Wrappers are responsible for converting data from the source into a subset of the mediated schema:

  for $c in sql("select author, year, title from CISbook")
  return <book> {$c/*} </book>
10. The Mediated Schema as a Union of Views from Wrappers

- Wrappers have names and some sort of output schema:

  define function GetCISBooks() as book
    for $c in sql("select author, year, title from CISbook")
    return <book> {$c/*} </book>

- This gets unioned with output from other wrappers:

  return <root>
    { GetCISBooks() }
    { GetEEBooks() }
  </root>

[Schema diagram: the mediated schema's book element, with author, year, and title children.]
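The "union of wrapper views" idea can be sketched in Python; the wrapper functions and their row contents below are invented for illustration, standing in for the XQuery wrappers over real sources:

```python
# Each wrapper converts one (simulated) source into the common "book" shape;
# the mediated schema's content is simply the union of all wrapper outputs.

def get_cis_books():
    # Wrapper over a hypothetical relational source of CIS books.
    rows = [{"author": "Tanenbaum", "year": 2002, "title": "Distributed Systems"}]
    return [{"book": r} for r in rows]

def get_ee_books():
    # Wrapper over a second hypothetical source of EE books.
    rows = [{"author": "Oppenheim", "year": 1999, "title": "Signals and Systems"}]
    return [{"book": r} for r in rows]

def mediated_books():
    """The mediated schema's extent: the union of all wrapper outputs."""
    return get_cis_books() + get_ee_books()

print(len(mediated_books()))  # one book from each wrapper
```

The mediator never stores this union in the virtual approach; it materializes it (or the relevant part of it) only when a query arrives.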
11. How to Answer the Query

- Given our query:

  for $b in document("dblp.xml")/root/book
  where $b/title/text() = "Distributed Systems" and
        $b/author/text() = "Tanenbaum"
  return $b

- We want to find all wrapper definitions that output the right structure to match our query
  - book elements with titles and authors (and any other attributes)
12. Query Composition with Views

- We find all views that define book with author and title, and we compose the query with each of these
- In our example, we find one wrapper definition that matches:

  define function GetCISBooks() as book
    for $b in sql("select author, year, title from CISbook")
    return <book> {$b/*} </book>

  for $b in document("mediated-schema")/root/book
  where $b/title/text() = "Distributed Systems" and
        $b/author/text() = "Tanenbaum"
  return $b

  return <root> { GetCISBooks() } </root>
13. Making It Work

- for $b in doc()/root/book
  where $b/title/text() = "Dist. Systems" and
        $b/author/text() = "Tanenbaum"
  return $b

[Diagram: the query's pattern (root/book with author, year, title children) is matched against the view's output; the wrapper variable $c's author, year, and title subelements are bound to the corresponding terms of the query.]
14. The Final Step: Unfolded View

- The query and the view definition are merged (the view is unfolded), yielding, e.g.:

  for $b in sql("select author, title, year from CISbook where author='Tanenbaum'")
  where $b/title/text() = "Distributed Systems"
  return $b
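A hypothetical Python sketch of this unfolding, with the relational source simulated as a list of rows (the table contents are invented): the author predicate is pushed down into the source's "SQL" call, while the title predicate is evaluated at the mediator.

```python
# Simulated relational source table CISbook.
CISBOOK = [
    {"author": "Tanenbaum", "year": 2002, "title": "Distributed Systems"},
    {"author": "Stevens",   "year": 1994, "title": "TCP/IP Illustrated"},
]

def sql(predicate=None):
    """Stand-in for the sql() wrapper call; a non-None predicate models a
    WHERE clause pushed down to the source."""
    return CISBOOK if predicate is None else [r for r in CISBOOK if predicate(r)]

def unfolded_query():
    # Pushed down: author = 'Tanenbaum' runs at the source.
    rows = sql(lambda r: r["author"] == "Tanenbaum")
    # Remaining predicate (title = 'Distributed Systems') runs at the mediator.
    return [r for r in rows if r["title"] == "Distributed Systems"]

print(unfolded_query())
```

Pushing the author predicate into the source query is what makes unfolding worthwhile in practice: the source ships only matching rows instead of its whole table.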
15. Summary: Mapping, Integrating, and Sharing Data

- Based on XQuery rather than XSLT
- Views (in XQuery, functions) as the bridge between schemas
- Joins and nesting are important in creating these views
- Can do point-to-point mappings to exchange data
- Very common approach: mediated schema or warehouse
  - Create a central schema (may be virtual)
  - Map sources to it
  - Pose queries over this
- UDDI versus this approach?
- What about search and its relationship to integration? In particular, search over Amazon, Google Maps, Google, Yahoo, ...
16. Web Search

- Goal is to find information relevant to a user's interests
- Challenge 1: a significant amount of content on the web is not quality information
  - Many pages contain nonsensical rants, etc.
  - The web is full of misspellings, multiple languages, etc.
  - Many pages are designed not to convey information but to get a high ranking (e.g., search engine optimization)
- Challenge 2: billions of documents
- Challenge 3: hyperlinks encode information
17. Our Discussion of Web Search
- Begin with traditional information retrieval
- Document models
- Stemming and stop words
- Web-specific issues
- Crawlers and robots.txt
- Scalability
- Models for exploiting hyperlinks in ranking
- Google and PageRank
- Latent Semantic Indexing
18. Information Retrieval

- Traditional information retrieval is basically text search
  - A corpus or body of text documents, e.g., in a document collection in a library or on a CD
- Documents are generally high-quality and designed to convey information
- Documents are assumed to have no structure beyond words
- Searches are generally based on meaningful phrases, perhaps including predicates over categories, dates, etc.
- The goal is to find the document(s) that best match the search phrase, according to a search model
- Assumptions are typically different from the Web: quality text, limited-size corpus, no hyperlinks
19. Motivation for Information Retrieval

- Information Retrieval (IR) is about:
  - Representation
  - Storage
  - Organization of, and access to, information items
- Focus is on the user's "information need" rather than a precise query
  - "March Madness": find information on college basketball teams which (1) are maintained by a US university and (2) participate in the NCAA tournament
- Emphasis is on the retrieval of information (not data)
20. Data vs. Information Retrieval

- Data retrieval, analogous to database querying: which docs contain a set of keywords?
  - Well-defined, precise logical semantics
  - A single erroneous object implies failure!
- Information retrieval:
  - Information about a subject or topic
  - Semantics is frequently loose; we want approximate matches
  - Small errors are tolerated (and in fact inevitable)
- An IR system:
  - Interprets the contents of information items
  - Generates a ranking which reflects relevance
- The notion of relevance is most important; it needs a model
21. Basic Model

[Diagram: documents are reduced to index terms; the user's information need is expressed as a query; matching the query against the index terms produces a ranking of documents.]
22. Information Retrieval as a Field

- IR addressed many issues in the last 20 years:
  - Classification and categorization of documents
  - Systems and languages for searching
  - User interfaces and visualization of results
- Area was seen as of narrow interest (libraries, mainly)
- Sea-change event: the advent of the web
  - Universal library
  - Free (low-cost) universal access
  - No central editorial board
  - Many problems in finding information; IR seen as key to finding the solutions!
23. The Full Info Retrieval Process

[Pipeline diagram: a crawler / data-access layer pulls documents from the Web or a database; text processing and modeling produce a logical view of each document, which indexing turns into an inverted index; the user's interest, entered through the browser/UI, becomes a query; query operations (refined by user feedback) feed the searcher, which consults the index to retrieve docs; ranking orders the retrieved docs before they are returned.]
24. Terminology

- IR systems usually adopt index terms to process queries
- Index term:
  - a keyword or group of selected words
  - any word (more general)
- Stemming might be used:
  - connect: connecting, connection, connections
- An inverted index is built for the chosen index terms
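The two ideas on this slide can be combined in a short sketch: a toy suffix-stripping stemmer (illustrative only; a real system would use something like the Porter stemmer) feeding an inverted index that maps each stemmed term to the documents containing it.

```python
def stem(word):
    # Toy rule-based stemming: strip a few common suffixes. This collapses
    # connecting / connection / connections to "connect", as on the slide.
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_index(docs):
    """Map each stemmed index term to the sorted list of doc ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["connecting to the network", "a network connection", "two connections"]
index = build_inverted_index(docs)
print(index["connect"])  # all three documents collapse onto the stem "connect"
```

Queries are then answered by intersecting or uniting the posting lists of the query's stemmed terms, instead of scanning documents.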
25. What's a Meaningful Result?

- Matching at the index-term level is quite imprecise
- Users are frequently dissatisfied
- One problem: users are generally poor at posing queries
  - Hence the frequent dissatisfaction of Web users (who often give single-keyword queries)
- The issue of deciding relevance is critical for IR systems: ranking
26. Rankings

- A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model
27. Types of IR Models

[Taxonomy diagram: models are organized by user task: retrieval (ad hoc or filtering) versus browsing.]
28. Classic IR Models: Basic Concepts

- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Traditionally, index terms were nouns, because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full-text representation)
29. Classic IR Models: Ranking

- Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
- The importance of the index terms is represented by weights associated with them
- Let:
  - ki be an index term
  - dj be a document
  - wij be a weight associated with (ki, dj)
- The weight wij quantifies the importance of the index term for describing the document contents
30. Classic IR Models: Notation

- ki is an index term (keyword)
- dj is a document
- t is the total number of index terms
- K = (k1, k2, ..., kt) is the set of all index terms
- wij ≥ 0 is a weight associated with (ki, dj)
  - wij = 0 indicates that the term does not belong to the doc
- vec(dj) = (w1j, w2j, ..., wtj) is a weighted vector associated with the document dj
- gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)
31. Boolean Model

- Simple model based on set theory
- Queries specified as Boolean expressions
  - precise semantics
  - neat formalism
- Terms are either present or absent; thus wij ∈ {0, 1}
- An example query:
  - q = ka ∧ (kb ∨ ¬kc)
- Disjunctive normal form: vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  - Conjunctive component: vec(qcc) = (1,1,0)
32. Boolean Model for Similarity

- q = ka ∧ (kb ∨ ¬kc)
- sim(q, dj) =
  - 1 if ∃ vec(qcc) s.t. (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
  - 0 otherwise
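A minimal sketch of this similarity in Python, for the slide's query q = ka ∧ (kb ∨ ¬kc): each DNF conjunctive component is a binary pattern over (ka, kb, kc), and a document matches iff its presence/absence vector equals some component.

```python
# Disjunctive normal form of q = ka AND (kb OR NOT kc), from the slide:
# the components over (ka, kb, kc) under which the query evaluates to true.
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
TERMS = ("ka", "kb", "kc")

def sim(doc_words):
    """Return 1 if the document satisfies the query, 0 otherwise.
    doc_words is the set of index terms present in the document."""
    vec = tuple(int(t in doc_words) for t in TERMS)  # gi(vec(dj)) for each ki
    return 1 if vec in QDNF else 0

print(sim({"ka", "kb", "kc"}))  # matches component (1,1,1)
print(sim({"kb"}))              # ka is absent: no component matches
```

Note the all-or-nothing behavior: a document either matches a conjunctive component exactly or scores zero, which is exactly the lack of partial matching criticized on the next slide.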
33. Drawbacks of the Boolean Model

- Retrieval is based on a binary decision criterion with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- The information need has to be translated into a Boolean expression, which most users find awkward
- The Boolean queries formulated by users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
34. Vector Model

- A refinement of the Boolean model, which focused strictly on exact matches
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides for better matching
35. Vector Model

- Define:
  - wij > 0 whenever ki ∈ dj
  - wiq ≥ 0 associated with the pair (ki, q)
  - vec(dj) = (w1j, w2j, ..., wtj);  vec(q) = (w1q, w2q, ..., wtq)
- With each term ki, associate a unit vector vec(i)
- The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The t unit vectors vec(i) form an orthonormal basis for a t-dimensional space
- In this space, queries and documents are represented as weighted vectors
36. Vector Model

- sim(q, dj) = cos(θ) = (vec(dj) · vec(q)) / (|dj| |q|) = Σi wij wiq / (|dj| |q|)
- Since wij ≥ 0 and wiq ≥ 0, we have 0 ≤ sim(q, dj) ≤ 1
- A document is retrieved even if it matches the query terms only partially

[Diagram: vectors dj and q in term space, with θ the angle between them.]
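The cosine formula above translates directly into code; the weight vectors below are illustrative placeholders for tf-idf weights computed later.

```python
import math

def cosine_sim(d, q):
    """sim(q, dj) = (dj . q) / (|dj| |q|); returns 0.0 for a zero vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

doc = [0.5, 0.8, 0.0]   # weights w1j, w2j, w3j (illustrative)
qry = [0.5, 0.0, 0.7]   # weights w1q, w2q, w3q (illustrative)
print(round(cosine_sim(doc, qry), 3))
```

Even though the document and the query share only one term (the first coordinate), the similarity is nonzero: this is the partial matching the Boolean model lacks.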
37. Weights in the Vector Model

- sim(q, dj) = Σi wij wiq / (|dj| |q|)
- How do we compute the weights wij and wiq?
- A good weight must take into account two effects:
  - quantification of intra-document contents (similarity)
    - the tf factor: the term frequency within a document
  - quantification of inter-document separation (dissimilarity)
    - the idf factor: the inverse document frequency
- wij = tf(i,j) × idf(i)
38. TF and IDF Factors

- Let:
  - N be the total number of docs in the collection
  - ni be the number of docs which contain ki
  - freq(i,j) be the raw frequency of ki within dj
- A normalized tf factor is given by:
  - f(i,j) = freq(i,j) / maxl freq(l,j)
  - where the maximum is computed over all terms which occur within the document dj
- The idf factor is computed as:
  - idf(i) = log(N / ni)
  - the log is used to make the values of tf and idf comparable
  - it can also be interpreted as the amount of information associated with the term ki
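The two factors can be computed in a few lines; the toy corpus below is invented for illustration.

```python
import math

def tf(term, doc_words):
    """Normalized tf factor: f(i,j) = freq(i,j) / max_l freq(l,j)."""
    freqs = {w: doc_words.count(w) for w in set(doc_words)}
    return freqs.get(term, 0) / max(freqs.values())

def idf(term, corpus):
    """idf(i) = log(N / n_i), with n_i the number of docs containing the term."""
    n_i = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_i) if n_i else 0.0

corpus = [["web", "search", "web"], ["web", "index"], ["ranking"]]
print(tf("web", corpus[0]))          # "web" is the most frequent term: tf = 1.0
print(round(idf("web", corpus), 3))  # log(3/2): "web" appears in 2 of 3 docs
```

Note how the two factors pull in opposite directions: "web" has the highest tf in the first document, but a low idf because it appears in most of the collection.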
39. Vector Model: Example I
40. Vector Model: Example II
41. Vector Model: Example III

[Slides 39-41 show worked numerical examples as figures; their contents are not recoverable from the text.]
42. Vector Model, Summarized

- The best term-weighting schemes use tf-idf weights:
  - wij = f(i,j) × log(N / ni)
- For the query term weights, a suggestion is:
  - wiq = (0.5 + 0.5 × freq(i,q) / maxl freq(l,q)) × log(N / ni)
- This model is very good in practice:
  - tf-idf works well with general collections
  - it is simple and fast to compute
  - the vector model is usually as good as the known ranking alternatives
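Putting the pieces together, here is a sketch of a complete tf-idf ranker using the document weights wij = f(i,j) × log(N/ni), the suggested query weights wiq = (0.5 + 0.5 × freq(i,q)/max freq(l,q)) × log(N/ni), and cosine ranking; the corpus and query are invented for illustration.

```python
import math

def weights(term_freqs, idf, query=False):
    """tf-idf weights; for queries, use the smoothed 0.5 + 0.5*tf variant."""
    mx = max(term_freqs.values())
    w = {}
    for t, f in term_freqs.items():
        tf = f / mx
        if query:
            tf = 0.5 + 0.5 * tf
        w[t] = tf * idf.get(t, 0.0)
    return w

def rank(corpus, query_terms):
    """Return (cosine score, doc id) pairs, best first."""
    N = len(corpus)
    vocab = {t for doc in corpus for t in doc}
    idf = {t: math.log(N / sum(1 for d in corpus if t in d)) for t in vocab}
    counts = lambda words: {w: words.count(w) for w in set(words)}
    qw = weights(counts(query_terms), idf, query=True)
    scores = []
    for i, doc in enumerate(corpus):
        dw = weights(counts(doc), idf)
        dot = sum(dw.get(t, 0.0) * wq for t, wq in qw.items())
        nd = math.sqrt(sum(w * w for w in dw.values()))
        nq = math.sqrt(sum(w * w for w in qw.values()))
        scores.append((dot / (nd * nq) if nd and nq else 0.0, i))
    return sorted(scores, reverse=True)

corpus = [["search", "engine", "ranking"],
          ["database", "query", "engine"],
          ["cat", "videos"]]
print(rank(corpus, ["search", "ranking"]))  # doc 0 should rank first
```

The only document sharing terms with the query gets a positive score; the other two score zero, and the ranking (rather than a yes/no answer) is what gets returned to the user.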
43. Pros & Cons of the Vector Model

- Advantages:
  - term weighting improves the quality of the answer set
  - partial matching allows retrieval of docs that approximate the query conditions
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Disadvantages:
  - assumes independence of index terms; not clear if this is a good or bad assumption
44. Comparison of Classic Models

- The Boolean model does not provide for partial matches and is considered to be the weakest classic model
- Some experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general
- Recent IR research has focused on improving probabilistic models, but these haven't made their way to Web search
- Generally we use a variation of the vector model in most text search systems