SMART Mini TREC - PowerPoint PPT Presentation

Provided by: jason88
Learn more at: http://www.cs.cmu.edu

1
SMART Mini TREC
  • Danyel Fisher
  • Jason I. Hong
  • Jonathan Huang

2
Results
3
Overview
  • How SMART works
  • Problems we encountered and workarounds
  • How we got our results

4
About SMART
  • Developed 1961-64 at Harvard
  • Maintained at Cornell University
  • Vector-based analysis, tf x idf weighting
  • Version 11 still written in K&R C

5
How SMART Works (Batch)
  • Interactive mode and batch mode
  • Index data
  • Index queries
  • Run queries

6
How SMART Works
  • Indexing options
  • Normalize term-frequency
  • n, b, m, a, s, l
  • Alter document weight from TF-IDF
  • n, t, p, f, s
  • Normalize vector
  • n, s, c, f, m
  • Examples
  • nnc, ats
  • So 6 x 4 x 3 = 72 combinations
  • These options apply to queries and docs
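A minimal sketch of how a three-letter weighting code could be applied to one document or query. The formulas for `n`, `l`, `a`, `t` and `c` follow the commonly published SMART notation (natural, log and augmented term frequency; idf; cosine normalization). They are an assumption here, not taken from the SMART source, and the other letters on the slide are deliberately left unimplemented. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def weight_vector(tokens, doc_freq, num_docs, scheme):
    """Weight a bag of tokens with a three-letter scheme such as 'ltc'.

    Only a subset of letters is handled (assumed meanings):
      term frequency: n = raw count, l = 1 + ln(tf), a = 0.5 + 0.5 * tf / max_tf
      collection:     n = none, t = multiply by ln(N / df)
      normalization:  n = none, c = cosine (unit length)
    """
    tf_code, idf_code, norm_code = scheme
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())

    weights = {}
    for term, tf in counts.items():
        if tf_code == "n":
            w = float(tf)
        elif tf_code == "l":
            w = 1.0 + math.log(tf)
        elif tf_code == "a":
            w = 0.5 + 0.5 * tf / max_tf
        else:
            raise ValueError(f"tf code {tf_code!r} not covered by this sketch")

        if idf_code == "t":
            w *= math.log(num_docs / doc_freq[term])
        elif idf_code != "n":
            raise ValueError(f"idf code {idf_code!r} not covered by this sketch")
        weights[term] = w

    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    elif norm_code != "n":
        raise ValueError(f"norm code {norm_code!r} not covered by this sketch")
    return weights


# Toy collection: three documents, document frequencies counted by hand.
DOC_FREQ = {"bank": 2, "loan": 1, "farm": 2, "tank": 1}
print(weight_vector(["bank", "bank", "loan"], DOC_FREQ, 3, "nnc"))
print(weight_vector(["bank", "bank", "loan"], DOC_FREQ, 3, "ltc"))
```

Because each letter is handled independently, the same function covers both documents and queries. Only the scheme string changes.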

7
How SMART Works with TREC
  • Create SMART doc <-> TREC doc map
  • Index TREC documents
  • Index queries
  • Run queries
  • Convert SMART output to TREC output

8
Step 1 SMART <-> TREC map
  • Script provided with SMART
  • Allows us to convert SMART results to TREC
    results after results are generated
  • Errors in data file (see rows 23 and 27)
  • 22 0 456080 1.00000 ft911-22
  • 23 0 456113 1.00000 719
  • 24 0 456146 1.00000 ft911-24
  • 25 0 456179 1.00000 ft911-25
  • 26 0 456212 1.00000 ft911-26
  • 27 0 456245 1.00000
  • 28 0 456278 1.00000 ft911-28
  • Couldn't figure out why; Financial Times data
    seemed regular and consistent
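A rough sketch of how the bad rows in the map file could be flagged automatically. The slide does not say what the columns mean, so this assumes the first field is the SMART document number and the fifth is the TREC document ID. The ID pattern is inferred only from the sample rows, and `find_bad_rows` is a hypothetical helper, not part of SMART.

```python
import re

# Sample rows copied from the slide.
SAMPLE_ROWS = """\
22 0 456080 1.00000 ft911-22
23 0 456113 1.00000 719
24 0 456146 1.00000 ft911-24
25 0 456179 1.00000 ft911-25
26 0 456212 1.00000 ft911-26
27 0 456245 1.00000
28 0 456278 1.00000 ft911-28
"""

# Assumed shape of a Financial Times ID, guessed from the sample rows only.
TREC_ID = re.compile(r"^ft\d+-\d+$")


def find_bad_rows(text):
    """Return (smart_doc_number, trec_id_or_None) for rows with a missing or odd ID."""
    bad = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        doc_id = fields[4] if len(fields) >= 5 else None
        if doc_id is None or not TREC_ID.match(doc_id):
            bad.append((fields[0], doc_id))
    return bad


for smart_num, doc_id in find_bad_rows(SAMPLE_ROWS):
    print(smart_num, doc_id)
```

A check like this only locates the damaged rows. It does not explain why the map script produced them.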

9
Step 2 Index TREC Documents
  • Creates vectors and inverted lists
  • doc.nnc, doc.nnc.var (vector); inv.nnc,
    inv.nnc.var (inverted list)
  • 250 megs of data from Financial Times
  • Takes 45-70 minutes to index
  • Uses tagged SMART format
  • Needs to be pre-parsed from
  • Changed from FTimes <TITLE> format
  • T.Gorbachev visits tank in farm

10
Step 2 Index TREC Documents
  • Stopwords and Stemming
  • By default, tokens eliminated if on stopword
    list, stemmed otherwise
  • SMART has standard stopword list (570 words)
  • e.g. a, a's, ain't, thanx, howbeit, ltd, uucp,
    latterly
  • e.g. one, two, th, t's, que, hither
  • We just used the standard list
  • Uses triestem for stemming
  • Find longest matching legal stem by comparing to
    trie data struct
  • Also available but didn't try: remove_s, none
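The stopword-then-stem step can be illustrated with a small sketch. This is not SMART's triestem code: it only shows the general idea of dropping stopwords and then walking a trie to find the longest legal stem that is a prefix of the token. The stopword subset is taken from the slide. The stem list, class name and function name are made up for illustration.

```python
# Tiny subset of the stopwords quoted on the slide.
STOPWORDS = {"a", "a's", "ain't", "one", "two", "hither"}

_END = ""  # marker key; never collides with a single character


class StemTrie:
    """Character trie over a list of legal stems (hypothetical list)."""

    def __init__(self, stems):
        self.root = {}
        for stem in stems:
            node = self.root
            for ch in stem:
                node = node.setdefault(ch, {})
            node[_END] = True

    def longest_stem(self, word):
        """Return the longest legal stem that prefixes word, or word itself."""
        node = self.root
        best = None
        for i, ch in enumerate(word):
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                best = word[: i + 1]
        return best if best is not None else word


def index_tokens(tokens, trie):
    """Lowercase, drop stopwords, stem everything else."""
    lowered = (t.lower() for t in tokens)
    return [trie.longest_stem(t) for t in lowered if t not in STOPWORDS]


trie = StemTrie(["in", "index", "visit", "farm", "tank"])
print(index_tokens("One farmer visits a tank".split(), trie))
```

Tokens with no matching stem are passed through unchanged here. What SMART does in that case is not stated on the slide.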

11
Step 2 Index TREC Documents
  • Problems
  • Need to estimate dictionary size beforehand
  • Default is 30001
  • Also tried 500000, 700000, 900000, 2000000
  • When printing dictionary
  • smart Undetermined error detected - Quit

12
Step 3 Index Queries
  • Pretty much same as document indexing
  • e.g. query.atc, query.atc.var (vectors)
  • Indexes quickly
  • A few minutes for 50 MINI TREC queries
  • Files much smaller too
  • 8K per query type for MINI TREC

13
Step 4 Run Queries
  • Takes about 10 minutes per query
  • Stores top-ranked files in own format
  • run_eval.tr.ltc.mtn, run_eval.tr.ltc.mtn.var

14
Step 5 SMART to TREC
  • Didn't work at all (corrupt dictionary?)
  • On running conversion from TREC to SMART
  • smart in dict '/projects/is240/smart/trec/dict'
    Illegal value for seek - Quit
  • trec_eval written by SMART people
  • Output of a SMART run is similar to TREC run
  • Can't get top-ranked TREC ID, but can get other
    info
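For reference, a sketch of what the conversion step amounts to once a usable SMART-to-TREC map exists. The output line layout (`query Q0 docno rank score tag`) is the standard TREC run format read by trec_eval. The input tuples, the dictionary map and the function name are assumptions for illustration, since the slides do not show SMART's own result format.

```python
def to_trec_run(results, smart_to_trec, run_tag="smart"):
    """Convert (query_id, smart_doc_num, rank, score) tuples to TREC run lines.

    Returns the formatted lines plus the SMART doc numbers that had no
    entry in the map, so missing IDs are reported rather than dropped silently.
    """
    lines, unmapped = [], []
    for query_id, smart_doc, rank, score in results:
        trec_id = smart_to_trec.get(smart_doc)
        if trec_id is None:
            unmapped.append(smart_doc)
            continue
        lines.append(f"{query_id} Q0 {trec_id} {rank} {score:.4f} {run_tag}")
    return lines, unmapped


# Hypothetical inputs: doc 27 has no TREC ID, as in the damaged map rows.
doc_map = {22: "ft911-22", 24: "ft911-24"}
hits = [(401, 22, 1, 0.61), (401, 27, 2, 0.50), (401, 24, 3, 0.42)]
run_lines, missing = to_trec_run(hits, doc_map)
print("\n".join(run_lines))
print("unmapped:", missing)
```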

15
Alternatives
  • Discovered that we could get TREC-like results if
    we had qrels file
  • Unfortunately defeats purpose of blind run
  • Interesting problem
  • qrels file for Financial Times docs 401-450
  • SMART re-numbers indexed docs 401-450
    internally to 1-50
  • SMART becomes mismatched with qrels
  • Manually renumbered MINI TREC query file and MINI
    TREC qrels file to start at 1
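The manual renumbering could be scripted along these lines. The sketch assumes that 401-450 are the query (topic) numbers in the first column of a standard four-column qrels file (`topic iteration docno relevance`). The function name and the sample lines are hypothetical.

```python
def renumber_qrels(lines, first_topic=401):
    """Shift topic numbers so that first_topic becomes 1; other columns are kept."""
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            out.append(line)  # leave anything unexpected untouched
            continue
        topic, iteration, docno, rel = parts
        new_topic = int(topic) - first_topic + 1
        out.append(f"{new_topic} {iteration} {docno} {rel}")
    return out


sample = ["401 0 FT911-3 1", "401 0 FT911-7 0", "450 0 FT911-5 1"]
for row in renumber_qrels(sample):
    print(row)
```

The same offset would have to be applied to the query file so that the two stay aligned.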

16
Observations
  • Query Indexing
  • Files are small (8K)
  • Files are fast to index (1 minute)
  • Document Indexing
  • Files are big (250 megs)
  • Files are slow to index (45-70 minutes)

17
Observations (cont.)
  • Process
  • Indexed all 72 query types
  • Indexed MINI TREC documents as nnc
  • Ran all 72 query types
  • Took 15 query types that had best results
  • Re-indexed MINI TREC docs several times
  • Tried a few doc indexing schemes arbitrarily:
    nnn, nnc, atn, atc, ats, ltn, ltc, lts, msn, mtn
  • Ran top 15 query types on each

18
Results
  • Results varied widely
  • Precision at recall 0.0 ranged from 0.11 to 0.61
  • Various combinations worked well; didn't see
    trends in doc and query indexing types
  • One strange observation
  • out.<query>.<doc>
  • out.atc.nnc, out.atn.nnc, out.ats.nnc: same
    results
  • out.ltc.nnc, out.ltn.nnc, out.lts.nnc: same
    results
  • For queries, normalize vector has no effect?
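One plausible explanation for the last observation, sketched under the assumption that documents are ranked by the inner product of query and document vectors. Normalizing a query divides all of its weights by one positive constant, which scales every document's score by the same factor and leaves the ranking unchanged. The vectors and function names below are made up for illustration.

```python
import math


def rank_docs(query, docs):
    """Rank document names by inner product with the query, best first."""
    scores = {
        name: sum(w * vec.get(term, 0.0) for term, w in query.items())
        for name, vec in docs.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


def cosine_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()} if length else dict(vec)


# Toy document vectors and an unnormalized query.
docs = {
    "d1": {"bank": 0.8, "loan": 0.6},
    "d2": {"bank": 0.6, "farm": 0.8},
    "d3": {"tank": 0.6, "farm": 0.8},
}
query = {"bank": 2.0, "farm": 1.0}

print(rank_docs(query, docs))
print(rank_docs(cosine_normalize(query), docs))
```

The scores themselves differ between the two runs, so any measure computed from ranks alone would be identical while raw similarity values would not.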

19
Results
  • Best Results

20
Results
  • Best Results

21
Other Problems Encountered
  • Extremely opaque system
  • Lots of parameter values, switches, etc
  • Commands don't have parameters listed
  • Specification file doesn't have documentation
  • Really obscure error messages
  • Just short of "The source code is the docs"
  • Best practice - view tar file for docs
  • Spent more time on SMART than IR