Title: Concept Search: Semantics Enabled Syntactic Search
1 Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
Concept Search: Semantics Enabled Syntactic Search
June 2nd, 2008, Tenerife, Spain
2 Outline
- Information Retrieval (IR)
- Syntactic IR
- Problems of Syntactic IR
- Semantic Continuum
- Concept Search (C-Search)
- C-Search via Inverted Indices
- Preliminary Evaluation
- Conclusion and Future work
3 Information Retrieval (IR)
- IR can be represented as a mapping function
- IR: Q → D
- Q: natural language queries which specify user information needs
- D: a set of documents in the document collection which meet these needs, (optionally) ordered according to the degree of relevance
- Ex. document collection
- Ex. queries
4 Information Retrieval System
- IR_System = <Model, Data_Structure, Term, Match>
- Model: IR models used for document and query representations, for computing query answers and relevance ranking
  - Bag of words model (representation)
  - Boolean Model, Vector Space Model, Probabilistic Model (retrieval)
- Data_Structure: data structures used for indexing and retrieval
  - Inverted Index
  - Signature File
- Term: an atomic element in document and query representations
  - a word or a multi-word phrase
- Match: matching technique used for term matching
  - a syntactic matching of words or phrases
  - search for equivalent words
  - search for words with common prefixes
  - search for words within a certain edit distance of a given word
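The three syntactic matching techniques listed above can be sketched as follows. This is a minimal illustration, not the matcher of any particular IR system; the function names and the thresholds `min_prefix` and `max_distance` are assumptions made for the example.

```python
def exact_match(a: str, b: str) -> bool:
    """Equivalent words: case-insensitive string equality."""
    return a.lower() == b.lower()


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest prefix shared by two words."""
    n = 0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca != cb:
            break
        n += 1
    return n


def prefix_match(a: str, b: str, min_prefix: int = 4) -> bool:
    """Words with a common prefix of at least `min_prefix` characters (assumed threshold)."""
    return common_prefix_len(a, b) >= min_prefix


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Words within `max_distance` edits of each other (assumed threshold)."""
    return edit_distance(a.lower(), b.lower()) <= max_distance


print(exact_match("Dog", "dog"))
print(prefix_match("computer", "computing"))
print(fuzzy_match("colour", "color"))
```

None of these checks looks at meaning: they compare character strings only, which is why the problems on the following slides arise.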
5 Syntactic IR (Ex. Inv. Index)
[Figure: inverted index example (Q3)]
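Since the figure for this slide did not survive, here is a small sketch of a word-level inverted index with Boolean retrieval. The three documents, the tokenizer, and the function names are hypothetical and chosen only for illustration; they are not the example from the original slide.

```python
import re
from collections import defaultdict

# Hypothetical toy collection (not the original slide's example).
DOCS = {
    "D1": "A small baby dog runs after a huge white cat",
    "D2": "A laptop computer is on a coffee table",
    "D3": "A little dog or a huge cat left a paw mark on the table",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map every word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: documents containing every query word."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


def search_or(index, query):
    """Boolean OR: documents containing at least one query word."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return sorted(set().union(*postings))


index = build_inverted_index(DOCS)
print(search_and(index, "computer table"))
print(search_or(index, "computer table"))
```

In this toy collection the AND query "computer table" returns D2 because both words occur in it, although the document describes a laptop computer and a coffee table rather than a computer table.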
6 Problems of Syntactic IR
- (I) Ambiguity of Natural Language
  - Polysemy: one word → multiple meanings
    - e.g., baby is a young mammal or a human child
  - Synonymy: different words → same meaning
    - e.g., mark and print: a visible indication made on a surface
- (II) Complex Concepts
  - Syntactic IR does not take into account complex concepts formed by natural language phrases (e.g., noun phrases)
  - E.g., "computer table" vs. "A laptop computer is on a coffee table"
- (III) Related Concepts
  - Syntactic IR does not take into account related concepts
  - E.g., carnivores (flesh-eating mammals) is more general than dog OR cat
7 Syntactic IR
- We can think of Syntactic IR as a point in a
space of IR approaches
8 (1) Ambiguity: Natural Language → Formal Language
- E.g., baby → C(baby): a human child
- print → C(print): a visible indication made on a surface
9 (2) Complex Concepts: Words → Multi-word Phrases
- E.g., Computer table → C(computer table)
- A laptop computer is on a coffee table → C(laptop computer), C(coffee table)
10 (3) Related Concepts: String similarity → Knowledge
- E.g., carnivores ≠ dog → C(carnivores) ⊒ C(dog)
11 Semantic Continuum
12 C-Search in Semantic Continuum
- NL2FL-axis: lack of background knowledge
  - It is not always possible to find a concept which corresponds to a given word (e.g., the concept does not exist in the lexical database).
  - In this case, the word itself is used as the identifier for a concept.
- W2P-axis: descriptive phrases
  - (Complex) concepts are extracted from descriptive phrases
  - descriptive phrase ::= noun phrase {OR noun phrase}
  - E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
- KNOW-axis: lexical knowledge
  - We use synonyms, hyponyms, hypernyms
  - Semantic Matching: search for related complex concepts
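The NL2FL and W2P steps above can be sketched as a toy concept extractor. This is an illustration under strong assumptions: the `LEXICON` table stands in for a lexical database and simply hard-codes the senses used in the slide's example, no word sense disambiguation or real noun-phrase parsing is performed, and all names are hypothetical.

```python
import re

# Hypothetical stand-in for a lexical database: word -> sense identifier.
LEXICON = {"little": "little-2", "dog": "dog-1", "huge": "huge-1", "cat": "cat-3"}
STOP_WORDS = {"a", "an", "the"}


def word_to_concept(word):
    """NL2FL step: map a word to a concept; fall back to the word itself."""
    return LEXICON.get(word, word)


def phrase_to_dnf(phrase):
    """W2P step: 'NP OR NP ...' becomes a list of conjunctive clauses.

    Each noun phrase becomes one conjunction of atomic concepts; the
    noun phrases joined by OR form the disjunction.
    """
    clauses = []
    for noun_phrase in re.split(r"\s+or\s+", phrase.lower()):
        atoms = [word_to_concept(w) for w in noun_phrase.split()
                 if w not in STOP_WORDS]
        if atoms:
            clauses.append(atoms)
    return clauses


def format_dnf(clauses):
    """Render the clauses with the conjunction and disjunction symbols."""
    return " ⊔ ".join("(" + " ⊓ ".join(clause) + ")" for clause in clauses)


print(format_dnf(phrase_to_dnf("A little dog OR a huge cat")))
print(format_dnf(phrase_to_dnf("A huge zebra")))  # 'zebra' is not in LEXICON
```

The second call shows the fallback on the NL2FL-axis: a word with no entry in the lexicon is kept as its own concept identifier.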
13 C-Search in Semantic Continuum
14 C-Search via Inverted Indices
- Moving from Syntactic IR to C-Search does not require the introduction of new data structures or retrieval models
- The current implementation of C-Search:
  - Model: Bag of concepts (representation), Boolean Model (retrieval), Vector Space Model (ranking)
  - Data_Structure: Inverted Index
  - Term: an atomic or a complex concept
  - Match: semantic matching of concepts
15 C-Search (Ex. Inv. Index)
16 Concept Matching
- Goal: to find the set of document concepts matching a query concept
- 1st approach: directly via S-Match
  - Sequentially iterate through all document concepts
  - Compare each document concept with the query concept (using S-Match)
  - Collect those concepts for which S-Match returns "more specific" (⊑)
  - It can be slow! (because the number of document concepts > 10E6)
- 2nd approach: via Inverted Indices (brief overview)
  - A-Index: index atomic concepts by more general atomic concepts
  - ⊓-Index: index conjunctive clauses by their components (i.e., atomic concepts)
  - ⊔-Index: index DNF formulas by their components (i.e., conjunctive clauses)
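The first approach can be sketched as a linear scan. The code below does not call S-Match: `dnf_leq` is a simplified, purely structural "more specific" test written for this example, and the `PARENTS` is-a table is a hypothetical stand-in for lexical knowledge.

```python
# Hypothetical is-a knowledge: atom -> directly more general atoms.
PARENTS = {"dog": {"canine"}, "wolf": {"canine"},
           "cat": {"feline"}, "lion": {"feline"},
           "canine": {"carnivore"}, "feline": {"carnivore"}}


def ancestors(atom):
    """All atoms strictly more general than `atom`."""
    seen, stack = set(), [atom]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def atom_leq(a, b):
    """Atom a is equal to, or more specific than, atom b."""
    return a == b or b in ancestors(a)


def clause_leq(c1, c2):
    """Conjunction c1 is more specific if it covers every atom of c2."""
    return all(any(atom_leq(a, b) for a in c1) for b in c2)


def dnf_leq(d1, d2):
    """DNF d1 is more specific if each of its clauses fits some clause of d2."""
    return all(any(clause_leq(c1, c2) for c2 in d2) for c1 in d1)


def dnf(*clauses):
    """Build a DNF formula as a frozenset of frozenset clauses."""
    return frozenset(frozenset(clause) for clause in clauses)


def sequential_match(query, doc_concepts):
    """Compare the query with every document concept, one at a time."""
    return [name for name, concept in doc_concepts.items()
            if dnf_leq(concept, query)]


DOC_CONCEPTS = {
    "C1": dnf({"little", "dog"}, {"huge", "cat"}),
    "C2": dnf({"little", "dog"}),
    "C4": dnf({"table"}),
}
QUERY = dnf({"canine"}, {"feline"})
print(sequential_match(QUERY, DOC_CONCEPTS))
```

The scan touches every document concept for every query, so its cost grows linearly with the number of concepts, which is the motivation for the index-based second approach.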
17 Concept Indices (An example)
- Let us consider the following concept:
- C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
- Fragments of concept indices for document concept C1
18 Concept Retrieval (An example)
- 0. Query concept: Cq = canine ⊔ feline
- 1. For each atomic concept → more specific atomic concepts
  - Search A-Index
  - E.g., canine → dog, wolf, ... and feline → cat, lion, ...
- 2. For each atomic concept → more specific conjunctive clauses
  - Search ⊓-Index
  - E.g., dog → C2 = little ⊓ dog, ... and cat → C3 = huge ⊓ cat, ...
  - (Note that canine → C2 = little ⊓ dog, ... and feline → C3 = huge ⊓ cat, ...)
- 3. For each disjunctive clause → more specific conjunctive clauses
  - Union of conjunctive clauses
  - E.g., canine ⊔ feline → C2 = little ⊓ dog, C3 = huge ⊓ cat, ...
- 4. For each disjunctive clause → more specific DNF formulas
  - Search ⊔-Index
  - E.g., canine ⊔ feline → C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), ...
- 5.
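Steps 0 to 4 can be sketched with three dictionaries standing in for the A-Index, ⊓-Index and ⊔-Index. This is a simplified illustration, not the C-Search implementation: the is-a table, the document concepts (including C5) and all function names are hypothetical, the query is restricted to a disjunction of atomic concepts as in the example, and the final check that every clause of a formula was matched is an assumption about how step 4 filters candidates.

```python
from collections import defaultdict

# Hypothetical is-a knowledge: atom -> directly more general atoms.
PARENTS = {"dog": {"canine"}, "wolf": {"canine"},
           "cat": {"feline"}, "lion": {"feline"}}


def ancestors(atom):
    """All atoms strictly more general than `atom`."""
    seen, stack = set(), [atom]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_indices(doc_concepts):
    """Build toy versions of the three concept indices."""
    a_index = defaultdict(set)    # atom -> equal or more specific atoms
    and_index = defaultdict(set)  # atom -> conjunctive clauses containing it
    or_index = defaultdict(set)   # clause -> names of DNF formulas containing it
    for name, formula in doc_concepts.items():
        for clause in formula:
            or_index[clause].add(name)
            for atom in clause:
                and_index[atom].add(clause)
                a_index[atom].add(atom)
                for general in ancestors(atom):
                    a_index[general].add(atom)
    return a_index, and_index, or_index


def retrieve(query_atoms, doc_concepts):
    """Find document concepts more specific than a disjunction of atoms."""
    a_index, and_index, or_index = build_indices(doc_concepts)
    clauses = set()
    for atom in query_atoms:                        # step 3: union over disjuncts
        for specific in a_index.get(atom, ()):      # step 1: A-Index lookup
            clauses |= and_index.get(specific, set())  # step 2: clause lookup
    hits = defaultdict(int)
    for clause in clauses:                          # step 4: DNF formula lookup
        for name in or_index[clause]:
            hits[name] += 1
    # Assumed filter: keep a formula only if all of its clauses matched.
    return sorted(n for n, count in hits.items() if count == len(doc_concepts[n]))


def dnf(*clauses):
    """Build a DNF formula as a frozenset of frozenset clauses."""
    return frozenset(frozenset(clause) for clause in clauses)


DOC_CONCEPTS = {
    "C1": dnf({"little", "dog"}, {"huge", "cat"}),
    "C2": dnf({"little", "dog"}),
    "C3": dnf({"huge", "cat"}),
    "C5": dnf({"little", "dog"}, {"table"}),  # hypothetical non-matching concept
}
print(retrieve({"canine", "feline"}, DOC_CONCEPTS))
```

C5 is found as a candidate through its "little ⊓ dog" clause but is dropped, because its "table" clause is not more specific than either query disjunct. Query clauses that are themselves conjunctions would need an intersection step, which this sketch omits.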
19 Evaluation Settings
- Data_set_1: Home sub-tree of the DMoz web directory
  - Document set: documents classified to nodes (29506)
  - Query set: concatenation of a node's and its parent's labels (890)
  - Relevance judgment: node-document links
- Data_set_2: the only difference from Data_set_1 is
  - Document set: concatenation of titles and descriptions of docs in DMoz
- WordNet is used as Lexical DB
- GATE is used as NLP Tool
- Lucene is used as Inverted Index
20 Evaluation results
21 Conclusion and Future work
- Conclusion
  - In C-Search, syntactic IR is extended with a semantics layer
  - C-Search performs as well as syntactic search while allowing for an improvement when semantics is available
  - In principle, C-Search supports a continuum from purely syntactic IR to fully semantic IR, in which indexing and retrieval can be performed at any point of the continuum depending on how much semantics is available
- Future work
  - Development of a more accurate concept extraction algorithm
  - Development of document relevance metrics based on both syntactic and semantic similarities of query and document descriptions
  - Allow semantic scope (such as equivalence, more/less general, disjoint)
  - Comparing the performance of the proposed solution with state-of-the-art syntactic IR systems using a syntactic IR benchmark