Presentation of mandatory assignment in TDT4215

Slides: 18
Provided by: trul

1
  • Presentation of mandatory assignment in TDT4215

2
System overview
3
System architecture
4
System architecture continued
  • Stemmer
  • Spring
  • Lingo clustering (for reference)
  • The rest is self-made

5
The (compressed) inverted file (binary)
6
The inverted file (simplified)
  • One inverted list with 3 hits: document numbers 3,
    7 and 8
  • Each list is represented as the differences
    between successive document numbers
  • Each difference is coded using a local Bernoulli
    model and a Golomb code
  • The resulting compressed inverted list is shown
    below
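
The gap-plus-Golomb scheme described on this slide can be sketched as follows. This is a minimal illustration, not the system's actual binary format: the collection size of 10 documents is a made-up value, the first gap is assumed to be measured from document number 0, and the parameter rule b ≈ ln(2) · N / f_t is the usual textbook choice for a local Bernoulli model. All function names are hypothetical.

```python
import math

def to_gaps(doc_numbers):
    """Turn sorted document numbers into differences (first gap counted from 0)."""
    gaps, previous = [], 0
    for doc in doc_numbers:
        gaps.append(doc - previous)
        previous = doc
    return gaps

def golomb_parameter(num_docs, doc_freq):
    """Local Bernoulli model: b is roughly ln(2) * N / f_t (assumed rule), at least 1."""
    return max(1, math.ceil(math.log(2) * num_docs / doc_freq))

def golomb_encode(x, b):
    """Encode integer x >= 1: unary quotient, then truncated-binary remainder."""
    q, r = divmod(x - 1, b)
    bits = "1" * q + "0"
    k = (b - 1).bit_length()          # ceil(log2(b))
    if k == 0:                        # b == 1: no remainder bits at all
        return bits
    short = (1 << k) - b              # how many remainders get k-1 bits
    if r < short:
        return bits + format(r, "b").zfill(k - 1)
    return bits + format(r + short, "b").zfill(k)

def golomb_decode(bits, b):
    """Decode a concatenation of Golomb codewords back into integers."""
    k = (b - 1).bit_length()
    short = (1 << k) - b
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":         # unary part
            q += 1
            i += 1
        i += 1                        # skip the terminating 0
        r = 0
        if k > 0:
            if k > 1:
                r = int(bits[i:i + k - 1], 2)
                i += k - 1
            if r >= short:            # long codeword: read one more bit
                r = (r << 1) + int(bits[i]) - short
                i += 1
        values.append(q * b + r + 1)
    return values

gaps = to_gaps([3, 7, 8])                     # [3, 4, 1]
b = golomb_parameter(num_docs=10, doc_freq=3) # hypothetical N = 10
compressed = "".join(golomb_encode(g, b) for g in gaps)
print(gaps, b, compressed)
```

Frequent terms have small gaps and get a small b, so their lists use short codewords; rare terms get a larger b that keeps the unary part short.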

7
Query language
  • Single terms
  • +/-
  • Phrases
  • NEAR

8
Phrases and proximity
  • Phrases are supported by checking, for each entry
    (document) that occurs in all of the inverted
    lists, whether the term locations are adjacent to
    each other (in the correct order)
  • Proximity is implemented in the same basic way as
    phrases
  • Proximity is used for two things:
    • NEAR queries
    • Proximity boost in ranking
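
The phrase and NEAR checks above can be sketched over a positional index. This is an illustrative sketch only: the index layout (term → document number → sorted word positions), the sample postings, and all function names are assumptions, not the system's actual data structures.

```python
def candidate_docs(index, terms):
    """Documents that appear in the inverted lists of all query terms."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

def phrase_match(positions_per_term):
    """True if the terms occur at adjacent positions, in query order."""
    later = [set(positions) for positions in positions_per_term[1:]]
    return any(
        all(start + offset + 1 in positions
            for offset, positions in enumerate(later))
        for start in positions_per_term[0]
    )

def near_match(positions_a, positions_b, max_distance):
    """True if some pair of occurrences is at most max_distance words apart."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= max_distance:
            return True
        if positions_a[i] < positions_b[j]:   # advance the smaller position
            i += 1
        else:
            j += 1
    return False

def phrase_query(index, terms):
    return [doc for doc in candidate_docs(index, terms)
            if phrase_match([index[term][doc] for term in terms])]

def near_query(index, term_a, term_b, max_distance):
    return [doc for doc in candidate_docs(index, [term_a, term_b])
            if near_match(index[term_a][doc], index[term_b][doc], max_distance)]

# Hypothetical postings: term -> {document number: sorted word positions}
index = {
    "test":   {3: [0, 5], 7: [2]},
    "ban":    {3: [1], 7: [9]},
    "treaty": {3: [2], 8: [4]},
}
print(phrase_query(index, ["test", "ban", "treaty"]))
print(near_query(index, "test", "ban", max_distance=3))
```

The same distance computation could feed a proximity boost in ranking by returning the smallest distance found instead of a boolean.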

9
Ranking
  • Cosine similarity
  • TF-IDF
  • LTC
  • Proximity
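
As a rough illustration of the ranking components listed here, the sketch below scores documents with LTC weighting (logarithmic term frequency, inverse document frequency, cosine normalization) and takes the dot product of the normalized query and document vectors. The sample documents and function names are hypothetical, and the proximity component is left out.

```python
import math
from collections import Counter

def ltc_vector(term_counts, doc_freq, num_docs):
    """LTC weights: (1 + ln tf) * ln(N / df), then cosine-normalized."""
    weights = {}
    for term, tf in term_counts.items():
        df = doc_freq.get(term, 0)
        if tf > 0 and df > 0:
            weights[term] = (1.0 + math.log(tf)) * math.log(num_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in weights.items()}

def rank(query_terms, documents):
    """Return (doc_id, cosine score) pairs, best first; zero scores dropped."""
    num_docs = len(documents)
    doc_freq = Counter(term for tokens in documents.values()
                       for term in set(tokens))
    query_vec = ltc_vector(Counter(query_terms), doc_freq, num_docs)
    scores = {}
    for doc_id, tokens in documents.items():
        doc_vec = ltc_vector(Counter(tokens), doc_freq, num_docs)
        score = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
        if score > 0.0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical mini-collection
documents = {
    1: ["test", "ban", "treaty"],
    2: ["test", "test", "data"],
    3: ["ban", "treaty", "provisions", "treaty"],
}
for doc_id, score in rank(["ban", "treaty"], documents):
    print(doc_id, round(score, 3))
```

Because both vectors are normalized to unit length, the dot product is the cosine of the angle between them, so long documents are not favored merely for containing more terms.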

10
Collocations
  • Make an index of all possible bigrams (d < 4).
  • Extract the relevant ones using a chi-squared
    test.
  • Use the resulting collocation index to propose
    phrase and near queries to the user.
  • Proposal of collocations of length 3.
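
The collocation extraction can be sketched as below. Several things here are assumptions rather than details from the slides: d is read as the word distance between the two members of a bigram, the cut-off 3.841 is the standard chi-squared critical value for one degree of freedom at the 5% level, and a minimum count is added because the test is unreliable for pairs seen only once. Function names are hypothetical.

```python
from collections import Counter

def windowed_bigrams(tokens, max_distance=3):
    """Yield ordered word pairs whose distance is 1..max_distance."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + 1 + max_distance]:
            yield first, second

def chi_squared(o11, o12, o21, o22):
    """Pearson chi-squared statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

def collocations(tokens, max_distance=3, threshold=3.841, min_count=2):
    """Return ((first, second), score) for pairs passing the test, best first."""
    pairs = Counter(windowed_bigrams(tokens, max_distance))
    firsts, seconds = Counter(), Counter()
    for (first, second), count in pairs.items():
        firsts[first] += count
        seconds[second] += count
    total = sum(pairs.values())
    kept = []
    for (first, second), o11 in pairs.items():
        if o11 < min_count:
            continue
        o12 = firsts[first] - o11        # first word, other second word
        o21 = seconds[second] - o11      # other first word, same second word
        o22 = total - o11 - o12 - o21    # neither
        score = chi_squared(o11, o12, o21, o22)
        # keep only pairs that co-occur MORE often than chance predicts
        if score >= threshold and o11 * total > firsts[first] * seconds[second]:
            kept.append(((first, second), score))
    return sorted(kept, key=lambda item: item[1], reverse=True)

tokens = "the test ban treaty and the test ban talks on the test ban".split()
for pair, score in collocations(tokens):
    print(pair, round(score, 2))
```

Counting every windowed pair is what makes this step expensive: the number of pairs grows with the window size times the collection length.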

11
Clustering
  • Used for result presentation
  • K-means
  • Suffix tree clustering
  • Processing teasers
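
For reference, K-means in its simplest form (Lloyd's algorithm) looks like the sketch below. It is not the system's implementation: the points are made-up 2D coordinates standing in for teaser term vectors, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=20):
    """Return (cluster index per point, centroids) using Lloyd's algorithm."""
    centroids = [list(p) for p in points[:k]]   # naive seeding (assumption)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:        # converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]   # hypothetical vectors
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Every iteration compares each point with each centroid, which is one reason clustering a large result set at query time becomes costly.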

12
Evaluation (Precision/Recall)
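
For readers unfamiliar with the two measures, a minimal sketch of how they are computed for a single query follows. The document numbers and relevance judgments are invented for illustration and are not results from this evaluation.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_recall_curve(ranked, relevant):
    """(recall, precision) after each rank position that holds a relevant doc."""
    relevant = set(relevant)
    return [precision_recall(ranked[:i + 1], relevant)[::-1]
            for i, doc in enumerate(ranked) if doc in relevant]

ranked = [3, 7, 8, 2]          # hypothetical ranked result list
relevant = {3, 8, 5}           # hypothetical relevance judgments
print(precision_recall(ranked, relevant))
print(precision_recall_curve(ranked, relevant))
```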
13
Evaluation (PROVISIONS OF THE TEST BAN TREATY)
14
Evaluation of the extended system
  • Proximity
    • Either good or bad
  • Collocations
    • Computationally expensive
    • May be useful
    • Possible to extend with linguistic techniques

15
Evaluation of the extended system
  • Clustering
    • Works, but not always well
    • Does not scale
  • Language detection
    • Nice feature
  • Spelling correction
    • Limited use
    • May be useful

16
Demo
17
Link
  • http://styggen.idi.ntnu.no/brille
  • QUESTIONS?