Concept Search: Semantics Enabled Syntactic Search

1
Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
Concept Search: Semantics Enabled Syntactic Search
June 2nd, 2008, Tenerife, Spain
2
Outline

Information Retrieval (IR)
Syntactic IR
Problems of Syntactic IR
Semantic Continuum
Concept Search (C-Search)
C-Search via Inverted Indices
Preliminary Evaluation
Conclusion and Future work

3
Information Retrieval (IR)

IR can be represented as a mapping function
IR: Q → D
Q - natural language queries which specify user information needs
D - a set of documents in the document collection which meet these needs, (optionally) ordered according to the degree of relevance.
Ex. document collection
Ex. queries

4
Information Retrieval System

IR_System = <Model, Data_Structure, Term, Match>
Model: IR models used for document and query representations, for computing query answers, and for relevance ranking.
  Bag of words model (representation)
  Boolean Model, Vector Space Model, Probabilistic Model (retrieval)
Data_Structure: data structures used for indexing and retrieval.
  Inverted Index
  Signature File
Term: an atomic element in document and query representations.
  a word or a multi-word phrase
Match: the matching technique used for term matching.
  a syntactic matching of words or phrases:
    search for equivalent words
    search for words with common prefixes
    search for words within a certain edit distance of a given word
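
The three syntactic matching techniques listed above can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the function names (`match_equal`, `match_prefix`, `match_edit_distance`) and the default thresholds are assumptions chosen for the example.

```python
def match_equal(a: str, b: str) -> bool:
    """Equivalent words: case-insensitive string equality."""
    return a.lower() == b.lower()


def match_prefix(a: str, b: str, min_len: int = 4) -> bool:
    """Words that share a common prefix of at least `min_len` characters.

    `min_len` is an assumed threshold, not a value from the slides.
    """
    a, b = a.lower(), b.lower()
    return len(a) >= min_len and len(b) >= min_len and a[:min_len] == b[:min_len]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb
        prev = curr
    return prev[-1]


def match_edit_distance(a: str, b: str, max_dist: int = 1) -> bool:
    """Words within `max_dist` edits of each other (assumed threshold)."""
    return edit_distance(a.lower(), b.lower()) <= max_dist
```

None of these checks looks at meaning: `match_equal("mark", "print")` is false even though the two words can denote the same concept.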

5
Syntactic IR (Example: Inverted Index)
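
The figure for this slide is not part of the transcript. As a stand-in, here is a minimal sketch of a word-level inverted index with Boolean AND retrieval. The toy documents and the function names are assumptions made for illustration; only the sentence in document 2 is taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return dict(index)


def boolean_and(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Toy collection (assumed for the example).
docs = {
    1: "A small dog runs after a white cat.",
    2: "A laptop computer is on a coffee table.",
    3: "A little dog or a huge cat left a mark on a table.",
}
index = build_inverted_index(docs)

hits_phrase = boolean_and(index, "computer table")   # matches document 2
hits_general = boolean_and(index, "carnivores")      # matches nothing
```

With this collection the query "computer table" returns document 2, although that document is about a laptop computer and a coffee table, and the query "carnivores" returns nothing, although documents 1 and 3 mention dogs and cats.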
6
Problems of Syntactic IR

(I) Ambiguity of Natural Language
Polysemy: one word → multiple meanings
e.g., "baby" is a young mammal or a human child
Synonymy: different words → same meaning
e.g., "mark" and "print": a visible indication made on a surface
(II) Complex Concepts
Syntactic IR does not take into account complex concepts formed by natural language phrases (e.g., noun phrases).
E.g., "computer table" vs. "A laptop computer is on a coffee table"
(III) Related Concepts
Syntactic IR does not take into account related concepts.
E.g., carnivores (flesh-eating mammals) is more general than dog OR cat

7
Syntactic IR

We can think of Syntactic IR as a point in a
space of IR approaches

8
(1) Ambiguity: Natural Language → Formal Language

E.g., baby → C(baby): a human child
print → C(print): a visible indication made on a surface

9
(2) Complex Concepts: Words → Multi-word Phrases

E.g., computer table → C(computer table)
A laptop computer is on a coffee table → C(laptop computer), C(coffee table)

10
(3) Related Concepts: String similarity → Knowledge

E.g., carnivores ≠ dog → C(carnivores) ⊒ C(dog)

11
Semantic Continuum

C-Search

12
C-Search in Semantic Continuum

NL2FL-axis - Lack of background knowledge
It is not always possible to find a concept which corresponds to a given word (e.g., the concept does not exist in the lexical database).
In this case, the word itself is used as the identifier for a concept.
W2P-axis - Descriptive phrases
(Complex) concepts are extracted from descriptive phrases
descriptive phrase ::= noun phrase {OR noun phrase}
E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
KNOW-axis - Lexical knowledge
We use synonyms, hyponyms, and hypernyms
Semantic matching: search for related complex concepts.
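
A complex concept of this kind can be held as a formula in disjunctive normal form: a set of conjunctive clauses, each a set of atomic concepts. The sketch below is an assumption-laden illustration, not the authors' extraction algorithm. `TOY_LEXICON` is a hypothetical stand-in for a lexical database lookup plus word sense disambiguation, and a word missing from it is used as its own concept identifier, as described for the NL2FL-axis.

```python
ARTICLES = {"a", "an", "the"}

# Hypothetical word -> sense-tagged concept mapping (stand-in for a
# lexical database lookup followed by word sense disambiguation).
TOY_LEXICON = {
    "little": "little-2",
    "dog": "dog-1",
    "huge": "huge-1",
    "cat": "cat-3",
}


def atomic_concept(word: str) -> str:
    """Return the concept for a word, or the word itself if none is known."""
    return TOY_LEXICON.get(word, word)


def extract_concept(phrase: str) -> frozenset:
    """Turn 'noun phrase {OR noun phrase}' into a DNF concept.

    Each noun phrase becomes a conjunctive clause (a frozenset of atomic
    concepts); the clauses are combined by disjunction (an outer frozenset).
    """
    clauses = set()
    for noun_phrase in phrase.split(" OR "):
        words = [w for w in noun_phrase.lower().split() if w not in ARTICLES]
        clauses.add(frozenset(atomic_concept(w) for w in words))
    return frozenset(clauses)


c1 = extract_concept("A little dog OR a huge cat")
```

Here `c1` holds the two clauses {little-2, dog-1} and {huge-1, cat-3}, mirroring the example above.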

13
C-Search in Semantic Continuum
14
C-Search via Inverted Indices

Moving from Syntactic IR to C-Search does not require the introduction of new data structures or retrieval models.
The current implementation of C-Search:
Model: bag of concepts (representation), Boolean Model (retrieval), Vector Space Model (ranking)
Data_Structure: Inverted Index
Term: an atomic or a complex concept
Match: semantic matching of concepts

15
C-Search (Example: Inverted Index)
16
Concept Matching

Goal: to find the set of document concepts matching the query concept
1st approach - directly via S-Match
Sequentially iterate through all document concepts
Compare each document concept with the query concept (using S-Match)
Collect those concepts for which S-Match returns "more specific" (⊑)
It can be slow! (because the number of document concepts > 10E6)
2nd approach - via Inverted Indices (brief overview)
A-Index: index atomic concepts by more general atomic concepts
⊓-Index: index conjunctive clauses by their components (i.e., atomic concepts)
⊔-Index: index DNF formulas by their components (i.e., conjunctive clauses)
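
The first approach can be sketched as a linear scan. The code below does not call S-Match. `concept_leq` is a simplified stand-in that tests a sufficient condition for "more specific" over a toy is-a hierarchy, and every name and data item in it is assumed for illustration.

```python
# Toy is-a links, child -> parent (assumed for illustration).
HYPERNYM = {
    "dog": "canine", "wolf": "canine",
    "cat": "feline", "lion": "feline",
    "canine": "carnivore", "feline": "carnivore",
}


def dnf(*clauses):
    """Build a DNF concept: a frozenset of conjunctive clauses."""
    return frozenset(frozenset(c) for c in clauses)


def atom_leq(a, b):
    """True if atomic concept a equals b or lies below b in the hierarchy."""
    while a is not None:
        if a == b:
            return True
        a = HYPERNYM.get(a)
    return False


def clause_leq(doc_clause, query_clause):
    """Every atom of the query clause is covered by some atom of the doc clause."""
    return all(any(atom_leq(a, b) for a in doc_clause) for b in query_clause)


def concept_leq(doc, query):
    """Every clause of the document concept is below some clause of the query."""
    return all(any(clause_leq(d, q) for q in query) for d in doc)


def sequential_match(doc_concepts, query):
    """Scan all document concepts; cost grows linearly with their number."""
    return [name for name, c in doc_concepts.items() if concept_leq(c, query)]


doc_concepts = {
    "C1": dnf({"little", "dog"}, {"huge", "cat"}),
    "C2": dnf({"little", "dog"}),
    "C3": dnf({"huge", "cat"}),
    "C4": dnf({"coffee", "table"}),
}
matches = sequential_match(doc_concepts, dnf({"canine"}, {"feline"}))
```

Every query touches every document concept, which is the cost that the index-based second approach is meant to avoid.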

17
Concept Indices (An example)

Let us consider the following concept:
C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
Fragments of concept indices for document concept C1

18
Concept Retrieval (An example)

0. Query concept: Cq = canine ⊔ feline
1. For each atomic concept → more specific atomic concepts
Search A-Index
E.g., canine → {dog, wolf, ...}, and feline → {cat, lion, ...}
2. For each atomic concept → more specific conjunctive clauses
Search ⊓-Index
E.g., dog → {C2 = little ⊓ dog, ...}, and cat → {C3 = huge ⊓ cat, ...}
(Note that canine → {C2 = little ⊓ dog, ...}, and feline → {C3 = huge ⊓ cat, ...})
3. For each disjunctive clause → more specific conjunctive clauses
Union of conjunctive clauses
E.g., canine ⊔ feline → {C2 = little ⊓ dog, C3 = huge ⊓ cat, ...}
4. For each disjunctive clause → more specific DNF formulas
Search ⊔-Index
E.g., canine ⊔ feline → {C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), ...}
5.
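
Steps 1 to 4 can be sketched with three dictionary-based indices. This is a toy reconstruction under stated assumptions, not the authors' implementation. The clause `C4`, the formula `C5`, and all function names are hypothetical, the A-Index here also maps each concept to itself, and step 5 is cut off in the transcript, so it is not sketched.

```python
from collections import defaultdict

# Toy data (assumed): is-a links, conjunctive clauses, and DNF formulas.
HYPERNYM = {"dog": "canine", "wolf": "canine", "cat": "feline", "lion": "feline"}
CLAUSES = {
    "C2": frozenset({"little", "dog"}),
    "C3": frozenset({"huge", "cat"}),
    "C4": frozenset({"coffee", "table"}),   # hypothetical extra clause
}
DNFS = {
    "C1": frozenset({"C2", "C3"}),
    "C5": frozenset({"C2", "C4"}),          # hypothetical extra formula
}


def build_a_index(hypernym):
    """Atomic concept -> itself plus all more specific atomic concepts."""
    index = defaultdict(set)
    for atom in set(hypernym) | set(hypernym.values()):
        node = atom
        while node is not None:
            index[node].add(atom)
            node = hypernym.get(node)
    return index


def build_component_index(items):
    """Component -> names of the items that contain it."""
    index = defaultdict(set)
    for name, components in items.items():
        for comp in components:
            index[comp].add(name)
    return index


A_INDEX = build_a_index(HYPERNYM)
AND_INDEX = build_component_index(CLAUSES)  # stands in for the ⊓-Index
OR_INDEX = build_component_index(DNFS)      # stands in for the ⊔-Index


def clauses_for_atom(atom):
    """Steps 1-2: more specific atoms, then the clauses containing them."""
    found = set()
    for specific in A_INDEX.get(atom, {atom}):
        found |= AND_INDEX.get(specific, set())
    return found


def retrieve(disjunctive_clause):
    """Steps 3-4 for one disjunctive clause of atomic concepts."""
    clauses = set()
    for atom in disjunctive_clause:
        clauses |= clauses_for_atom(atom)            # step 3: union
    candidates = set()
    for clause in clauses:
        candidates |= OR_INDEX.get(clause, set())    # step 4: index lookup
    # Keep a formula only if all of its clauses are more specific.
    dnfs = {name for name in candidates if DNFS[name] <= clauses}
    return clauses, dnfs


clauses, dnfs = retrieve({"canine", "feline"})
```

For the query canine ⊔ feline this yields the clauses C2 and C3 and the formula C1. The hypothetical C5 is rejected because its clause C4 is not more specific than the query.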

19
Evaluation Settings

Data_set_1: Home sub-tree of DMoz web directory
Document set: documents classified to nodes (29506)
Query set: concatenation of a node's label and its parent's label (890)
Relevance judgment: node-document links
Data_set_2: the only difference from Data_set_1 is the document set
Document set: concatenation of titles and descriptions of docs in DMoz
WordNet is used as the lexical DB
GATE is used as the NLP tool
Lucene is used as the inverted index

20
Evaluation results

Data_set_1
Data_set_2

21
Conclusion and Future work

Conclusion
In C-Search, syntactic IR is extended with a semantics layer.
C-Search performs as well as syntactic search while allowing for an improvement when semantics is available.
In principle, C-Search supports a continuum from purely syntactic IR to fully semantic IR, in which indexing and retrieval can be performed at any point of the continuum depending on how much semantics is available.
Future work
Development of a more accurate concept extraction algorithm
Development of document relevance metrics based on both syntactic and semantic similarities of query and document descriptions
Allowing a semantic scope (such as equivalence, more/less general, disjoint)
Comparing the performance of the proposed solution with state-of-the-art syntactic IR systems using a syntactic IR benchmark