Concept Search: Semantics Enabled Syntactic Search


Transcript and Presenter's Notes

1
Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
Concept Search: Semantics Enabled Syntactic Search
June 2nd, 2008, Tenerife, Spain
2
Outline
  • Information Retrieval (IR)
  • Syntactic IR
  • Problems of Syntactic IR
  • Semantic Continuum
  • Concept Search (C-Search)
  • C-Search via Inverted Indices
  • Preliminary Evaluation
  • Conclusion and Future work

3
Information Retrieval (IR)
  • IR can be represented as a mapping function
  • IR: Q → D
  • Q - natural language queries which specify user information needs
  • D - the set of documents in the collection which meet these needs, (optionally) ordered by degree of relevance
  • (Example document collection and example queries shown as figures.)

4
Information Retrieval System
  • IR_System = <Model, Data_Structure, Term, Match>
  • Model - the IR models used for document and query representations, for computing query answers and relevance ranking
  • Bag of words model (representation)
  • Boolean Model, Vector Space Model, Probabilistic Model (retrieval)
  • Data_Structure - the data structures used for indexing and retrieval
  • Inverted Index
  • Signature File
  • Term - an atomic element in document and query representations
  • a word or multi-word phrase
  • Match - the matching technique used for term matching
  • syntactic matching of words or phrases:
  • search for equivalent words
  • search for words with common prefixes
  • search for words within a certain edit distance of a given word
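The Data_Structure and Match components above can be sketched together. The following is a minimal, hypothetical illustration (not the actual implementation behind the slides): a bag-of-words inverted index plus the three syntactic term-matching modes listed above (equivalent words, common prefixes, edit distance).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of documents containing it (bag of words)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_terms(index, query_term, max_dist=0, prefix=False):
    """Syntactic term matching: exact, common-prefix, or within edit distance."""
    hits = set()
    for term in index:
        if (term == query_term
                or (prefix and term.startswith(query_term))
                or (max_dist and edit_distance(term, query_term) <= max_dist)):
            hits |= index[term]
    return hits
```

For example, with `docs = {1: "a little dog", 2: "a huge cat"}`, `match_terms(build_inverted_index(docs), "dogs", max_dist=1)` matches document 1, since "dog" is within edit distance 1 of "dogs".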

5
Syntactic IR (Ex. Inv. Index)
6
Problems of Syntactic IR
  • (I) Ambiguity of Natural Language
  • Polysemy one word ? multiple meanings
  • e.g., baby is a young
    mammal or a human child
  • Synonymy different words ? same meaning
  • e.g., mark and print a visible indication
    made on a surface
  • (II) Complex Concepts
  • Syntactic IR does not take into account complex
    concepts formed by Natural Language Phrases
    (e.g., Noun Phrases).
  • E.g., Computer table ? A laptop computer is on a
    coffee table
  • (III) Related Concepts
  • Syntactic IR does not take into account related
    concepts
  • E.g., carnivores (flesh-eating mammals) is more
    general than
  • dog OR cat

7
Syntactic IR
  • We can think of Syntactic IR as a point in a
    space of IR approaches

8
(1) Ambiguity: Natural Language → Formal
Language
  • E.g., baby → C(baby) = a human child
  • print → C(print) = a visible indication made on a surface

9
(2) Complex Concepts: Words → Multi-word
Phrases
  • E.g., computer table → C(computer table)
  • A laptop computer is on a coffee table →
  • C(laptop computer), C(coffee table)

10
(3) Related Concepts: String Similarity →
Knowledge
  • E.g., carnivores ≠ dog, but C(carnivores) ⊒ C(dog)

11
Semantic Continuum
  • C-Search

12
C-Search in Semantic Continuum
  • NL2FL-axis - lack of background knowledge
  • It is not always possible to find a concept corresponding to a given word (e.g., the concept does not exist in the lexical database)
  • In this case, the word itself is used as the identifier for a concept
  • W2P-axis - descriptive phrases
  • (Complex) concepts are extracted from descriptive phrases
  • descriptive phrase = noun phrase OR noun phrase
  • E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
  • KNOW-axis - lexical knowledge
  • We use synonyms, hyponyms, and hypernyms
  • Semantic matching → search for related complex concepts
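The NL2FL and W2P steps above can be sketched as follows. This is a toy illustration under stated assumptions: `LEXICON` is a hypothetical stand-in for the lexical database (e.g., WordNet), and the sense identifiers mirror the slide's example (little-2, dog-1, huge-1, cat-3).

```python
# Hypothetical toy lexicon standing in for the lexical database:
# word -> list of sense identifiers.
LEXICON = {
    "little": ["little-2"],
    "dog": ["dog-1"],
    "huge": ["huge-1"],
    "cat": ["cat-3"],
}

def word_to_concept(word, lexicon=LEXICON):
    """NL2FL step: map a word to a concept from the lexical database.
    When no concept exists (lack of background knowledge), fall back to
    using the word itself as the concept identifier, which keeps search
    purely syntactic for that word."""
    senses = lexicon.get(word)
    return senses[0] if senses else word

def phrase_to_concept(noun_phrases):
    """W2P step: a descriptive phrase (noun phrase OR noun phrase) becomes
    a DNF formula: each noun phrase is the conjunction of its words'
    concepts, and the phrases are joined by disjunction."""
    return [[word_to_concept(w) for w in np] for np in noun_phrases]
```

For instance, `phrase_to_concept([["little", "dog"], ["huge", "cat"]])` yields `[["little-2", "dog-1"], ["huge-1", "cat-3"]]`, i.e., (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3); an unknown word such as "foobar" stays "foobar".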

13
C-Search in Semantic Continuum
14
C-Search via Inverted Indices
  • Moving from Syntactic IR to C-Search does not require the introduction of new data structures or retrieval models
  • The current implementation of C-Search:
  • Model: bag of concepts (representation), Boolean Model (retrieval), Vector Space Model (ranking)
  • Data_Structure: Inverted Index
  • Term: an atomic or a complex concept
  • Match: semantic matching of concepts
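The point of this slide, that the syntactic machinery is reused unchanged, can be sketched: the inverted-index structure stays exactly as in syntactic IR, and only the indexed terms change from words to concept identifiers (a "bag of concepts"). The concept identifiers below are illustrative, not taken from a real run.

```python
from collections import defaultdict

def index_concepts(doc_concepts):
    """Same inverted-index structure as in syntactic IR, but the 'terms'
    are (atomic or complex) concept identifiers rather than words."""
    index = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for c in concepts:
            index[c].add(doc_id)
    return index

# Bag of concepts instead of bag of words (illustrative identifiers):
docs = {1: ["little-2", "dog-1", "C2"], 2: ["huge-1", "cat-3", "C3"]}
index = index_concepts(docs)
```

Here `index["dog-1"]` is `{1}`: retrieval by concept identifier works with the same data structure a word-based engine would use.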

15
C-Search (Ex. Inv. Index)
16
Concept Matching
  • Goal: to find the set of document concepts matching a query concept
  • 1st approach - directly via S-Match
  • Sequentially iterate through all document concepts
  • Compare each document concept with the query concept (using S-Match)
  • Collect those concepts for which S-Match returns "more specific" (⊑)
  • It can be slow! (the number of document concepts can exceed 10^6)
  • 2nd approach - via inverted indices (brief overview)
  • A-Index: indexes atomic concepts by more general atomic concepts
  • ⊓-Index: indexes conjunctive clauses by their components (i.e., atomic concepts)
  • ⊔-Index: indexes DNF formulas by their components (i.e., conjunctive clauses)

17
Concept Indices (An example)
  • Let us consider the following concept:
  • C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
  • Fragments of the concept indices for document concept C1:
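One plausible way to picture those index fragments is as plain dictionaries. The contents below are hypothetical: the hypernym links (canine-1, feline-1, wolf-1, lion-1) are assumed to come from the lexical database, and the clause names follow the slide (C2 = little-2 ⊓ dog-1, C3 = huge-1 ⊓ cat-3).

```python
# Hypothetical fragments of the three concept indices for
# C1 = (little-2 AND dog-1) OR (huge-1 AND cat-3).

# A-Index: more general atomic concept -> more specific atomic concepts
# (links assumed to come from the lexical database's hypernym hierarchy).
A_INDEX = {
    "canine-1": ["dog-1", "wolf-1"],
    "feline-1": ["cat-3", "lion-1"],
}

# AND-Index: atomic concept -> conjunctive clauses it occurs in.
AND_INDEX = {
    "little-2": ["C2"],
    "dog-1": ["C2"],
    "huge-1": ["C3"],
    "cat-3": ["C3"],
}

# OR-Index: conjunctive clause -> DNF formulas (document concepts)
# that contain it.
OR_INDEX = {
    "C2": ["C1"],  # C2 = little-2 AND dog-1
    "C3": ["C1"],  # C3 = huge-1 AND cat-3
}
```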

18
Concept Retrieval (An example)
  • 0. Query concept: Cq = canine ⊔ feline
  • 1. For each atomic concept → more specific atomic concepts
  • Search the A-Index
  • E.g., canine → {dog, wolf, ...} and feline → {cat, lion, ...}
  • 2. For each atomic concept → more specific conjunctive clauses
  • Search the ⊓-Index
  • E.g., dog → {C2 = little ⊓ dog, ...} and cat → {C3 = huge ⊓ cat, ...}
  • (Note that canine → {C2 = little ⊓ dog, ...} and feline → {C3 = huge ⊓ cat, ...})
  • 3. For each disjunctive clause → more specific conjunctive clauses
  • Union of the conjunctive clauses
  • E.g., canine ⊔ feline → {C2 = little ⊓ dog, C3 = huge ⊓ cat, ...}
  • 4. For each disjunctive clause → more specific DNF formulas
  • Search the ⊔-Index
  • E.g., canine ⊔ feline → {C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), ...}
  • 5.
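Steps 1-4 above can be sketched as a single retrieval function over the three indices. This is my reading of the algorithm, not the authors' code; index contents and clause names are the illustrative ones from the running example, and `FORMULA_CLAUSES` is an assumed reverse map recording which conjunctive clauses make up each DNF formula.

```python
# Hypothetical index contents for the running example, where
# C1 = (little AND dog) OR (huge AND cat), C2 = little AND dog,
# C3 = huge AND cat.
A_INDEX = {"canine": ["dog", "wolf"], "feline": ["cat", "lion"]}
AND_INDEX = {"dog": ["C2"], "cat": ["C3"]}
OR_INDEX = {"C2": ["C1"], "C3": ["C1"]}
FORMULA_CLAUSES = {"C1": ["C2", "C3"]}  # assumed reverse map of OR_INDEX

def retrieve(query_atoms):
    """Find document concepts more specific than the disjunctive query
    clause (e.g., canine OR feline), following steps 1-4."""
    # 1. Each query atom -> more specific atomic concepts (A-Index),
    #    keeping the atom itself as well.
    atoms = set()
    for a in query_atoms:
        atoms.add(a)
        atoms.update(A_INDEX.get(a, []))
    # 2. Each atomic concept -> more specific conjunctive clauses
    #    (AND-Index); 3. the disjunctive clause subsumes their union.
    clauses = set()
    for a in atoms:
        clauses.update(AND_INDEX.get(a, []))
    # 4. Candidate DNF formulas via the OR-Index. A formula is more
    #    specific than the query clause only if ALL of its conjunctive
    #    clauses are, so filter candidates by that condition.
    candidates = set()
    for c in clauses:
        candidates.update(OR_INDEX.get(c, []))
    return {f for f in candidates if set(FORMULA_CLAUSES[f]) <= clauses}
```

With these indices, `retrieve(["canine", "feline"])` returns `{"C1"}`, while `retrieve(["canine"])` returns the empty set: the huge ⊓ cat disjunct of C1 is not more specific than canine alone, so C1 must not match.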

19
Evaluation Settings
  • Data_set_1: the Home subtree of the DMoz web directory
  • Document set: documents classified to nodes (29,506)
  • Query set: concatenations of a node's and its parent's labels (890)
  • Relevance judgments: node-document links
  • Data_set_2: the only difference from Data_set_1 is
  • Document set: concatenations of the titles and descriptions of docs in DMoz
  • WordNet is used as the lexical DB
  • GATE is used as the NLP tool
  • Lucene is used as the inverted index

20
Evaluation results
  • Results for Data_set_1 and Data_set_2 (tables shown on slide)

21
Conclusion and Future work
  • Conclusion
  • In C-Search, syntactic IR is extended with a semantics layer
  • C-Search performs as well as syntactic search while allowing for an improvement when semantics is available
  • In principle, C-Search supports a continuum from purely syntactic IR to fully semantic IR, in which indexing and retrieval can be performed at any point of the continuum depending on how much semantics is available
  • Future work
  • Development of a more accurate concept extraction algorithm
  • Development of document relevance metrics based on both syntactic and semantic similarities of query and document descriptions
  • Allowing the semantic scope to be specified (e.g., equivalence, more/less general, disjoint)
  • Comparing the performance of the proposed solution with state-of-the-art syntactic IR systems using a syntactic IR benchmark

22
  • Thank You!