SMART Mini TREC

About This Presentation


Slides: 22

Provided by: jason88

Learn more at: http://www.cs.cmu.edu


Transcript and Presenter's Notes

1
SMART Mini TREC

Danyel Fisher
Jason I. Hong
Jonathan Huang

2
Results

3
Overview

How SMART works
Problems we encountered and workarounds
How we got our results

4
About SMART

Developed 1961-64 at Harvard
Maintained at Cornell University
Vector-based analysis, tf x idf weighting
Version 11 still written in K&R C

5
How SMART Works (Batch)

Interactive mode and batch mode
Index data
Index queries
Run queries

6
How SMART Works

Indexing options
Normalize term-frequency
n, b, m, a, s, l
Alter document weight from TF-IDF
n, t, p, f, s
Normalize vector
n, s, c, f, m
Examples
nnc, ats
So 6 x 4 x 3 = 72 combinations
These options apply to queries and docs
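
Each three-letter code above picks one option per stage: term frequency, then the document-weight (idf) factor, then vector normalization. A minimal Python sketch of that idea follows. It covers only a few letters (n, b, a, l for term frequency; n, t for idf; n, c for normalization), and both the formulas and the name `weight_vector` are assumptions based on commonly cited definitions of the SMART notation, not code taken from SMART itself.

```python
import math

def weight_vector(tf, df, n_docs, scheme):
    """Weight a {term: raw count} vector using a three-letter code such as 'atc'.

    tf: raw term counts for one document or query
    df: number of documents containing each term
    n_docs: collection size
    Formulas are assumed textbook definitions, not read from the SMART source.
    """
    if not tf:
        return {}
    tf_code, idf_code, norm_code = scheme
    max_tf = max(tf.values())
    weights = {}
    for term, count in tf.items():
        # Stage 1: term-frequency component
        if tf_code == "n":
            w = float(count)                    # raw count
        elif tf_code == "b":
            w = 1.0                             # binary
        elif tf_code == "a":
            w = 0.5 + 0.5 * count / max_tf      # augmented
        elif tf_code == "l":
            w = 1.0 + math.log(count)           # log
        else:
            raise ValueError("tf option not covered by this sketch: " + tf_code)
        # Stage 2: document-frequency component
        if idf_code == "t":
            w *= math.log(n_docs / df[term])    # idf
        elif idf_code != "n":
            raise ValueError("idf option not covered by this sketch: " + idf_code)
        weights[term] = w
    # Stage 3: vector normalization
    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    elif norm_code != "n":
        raise ValueError("norm option not covered by this sketch: " + norm_code)
    return weights
```

For example, `weight_vector({"tank": 2, "farm": 1}, {"tank": 1, "farm": 1}, 10, "nnc")` leaves the counts unweighted and only scales the vector to unit length.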

7
How SMART Works with TREC

Create SMART doc <-> TREC doc map
Index TREC documents
Index queries
Run queries
Convert SMART output to TREC output

8
Step 1: SMART <-> TREC map

Script provided with SMART
Allows us to convert SMART results to TREC results after they are generated
Errors in data file
22 0 456080 1.00000 ft911-22
23 0 456113 1.00000 719
24 0 456146 1.00000 ft911-24
25 0 456179 1.00000 ft911-25
26 0 456212 1.00000 ft911-26
27 0 456245 1.00000
28 0 456278 1.00000 ft911-28
Couldn't figure out why, Financial Times data
seemed regular and consistent
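
Entries like 23 and 27 above can be flagged automatically before they cause trouble downstream. The Python sketch below assumes each map line has five whitespace-separated fields with the TREC doc ID last, and that a valid ID looks like `ft911-22`. Both assumptions are inferred from the excerpt shown, and `find_bad_map_lines` is a hypothetical helper, not part of SMART.

```python
import re

# Assumed shape of a Financial Times doc ID, inferred from the excerpt.
DOC_ID = re.compile(r"^ft\d+-\d+$")

def find_bad_map_lines(lines):
    """Return (smart_id, reason) for lines whose TREC ID is missing or malformed."""
    bad = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        smart_id = fields[0]
        if len(fields) < 5:
            bad.append((smart_id, "missing TREC ID"))
        elif not DOC_ID.match(fields[4]):
            bad.append((smart_id, "unexpected TREC ID " + fields[4]))
    return bad

sample = [
    "22 0 456080 1.00000 ft911-22",
    "23 0 456113 1.00000 719",
    "24 0 456146 1.00000 ft911-24",
    "25 0 456179 1.00000 ft911-25",
    "26 0 456212 1.00000 ft911-26",
    "27 0 456245 1.00000",
    "28 0 456278 1.00000 ft911-28",
]
for smart_id, reason in find_bad_map_lines(sample):
    print(smart_id, reason)
```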

9
Step 2: Index TREC Documents

Creates vectors and inverted lists
doc.nnc, doc.nnc.var: vector
inv.nnc, inv.nnc.var: inverted list
250 megs of data from Financial Times
Takes 45-70 minutes to index
Uses tagged SMART format
Needs to be pre-parsed from
Changed from FTimes <TITLE> format
T.Gorbachev visits tank in farm

10
Step 2: Index TREC Documents

Stopwords and Stemming
By default, tokens eliminated if on stopword
list, stemmed otherwise
SMART has standard stopword list (570 words)
e.g. a, a's, ain't, thanx, howbeit, ltd, uucp,
latterly
e.g. one, two, th, t's, que, hither
We just used the standard list
Uses triestem for stemming
Find longest matching legal stem by comparing to
trie data struct
Also available but didn't try remove_s, none
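
The stopword-then-stem behavior described on this slide can be sketched as follows. The stopword set is a tiny subset taken from the examples above, the stem list is made up for illustration, and the longest-prefix lookup is one plausible reading of "longest matching legal stem"; none of this is SMART's actual triestem code.

```python
# Tiny subset of the examples on the slide; the real list has 570 words.
STOPWORDS = {"a", "a's", "ain't", "thanx", "howbeit", "ltd", "uucp", "latterly"}

_END = object()  # marks the end of a legal stem; can never collide with a character

def build_trie(stems):
    """Build a nested-dict trie from a list of legal stems."""
    root = {}
    for stem in stems:
        node = root
        for ch in stem:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def longest_stem(trie, token):
    """Return the longest legal stem that is a prefix of token, else token itself."""
    node, best = trie, None
    for i, ch in enumerate(token):
        if ch not in node:
            break
        node = node[ch]
        if _END in node:
            best = token[: i + 1]
    return best if best is not None else token

def index_tokens(tokens, trie):
    """Drop stopwords, stem everything else."""
    lowered = (tok.lower() for tok in tokens)
    return [longest_stem(trie, tok) for tok in lowered if tok not in STOPWORDS]

trie = build_trie(["vis", "visit", "index"])  # hypothetical stem list
```

With this trie, `index_tokens(["A", "visits", "Ltd", "tank"], trie)` drops the two stopwords, stems "visits" to "visit", and passes "tank" through unchanged.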

11
Step 2: Index TREC Documents

Problems
Need to estimate dictionary size beforehand
Default is 30001
Also tried 500000, 700000, 900000, 2000000
When printing dictionary
smart Undetermined error detected - Quit
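
One way to pick a dictionary size with less guesswork is to count distinct tokens in the collection first and add headroom. The Python sketch below shows the idea. The tokenizer, the headroom factor, and the assumption that the size setting should track the number of distinct terms are all illustrative guesses, not documented SMART behavior.

```python
import re

def estimate_dict_size(texts, headroom=1.5):
    """Count distinct lowercase word tokens and pad the count by a headroom factor."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z0-9']+", text.lower()))
    return int(len(vocab) * headroom) + 1

size = estimate_dict_size(["Tank farm", "tank visit"], headroom=2.0)
```

Running the count on a sample of the collection rather than all 250 megs would give only a lower bound, since vocabulary keeps growing as more text is read.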

12
Step 3: Index Queries

Pretty much same as document indexing
e.g. query.atc, query.atc.var: vectors
Indexes quickly
A few minutes for 50 MINI TREC queries
Files much smaller too
8K per query type for MINI TREC

13
Step 4: Run Queries

Takes about 10 minutes per query
Stores top-ranked files in own format
run_eval.tr.ltc.mtn, run_eval.tr.ltc.mtn.var

14
Step 5: SMART to TREC

Didn't work at all (corrupt dictionary?)
On running conversion from TREC to SMART
smart in dict '/projects/is240/smart/trec/dict'
Illegal value for seek - Quit
trec_eval written by SMART people
Output of a SMART run is similar to TREC run
Can't get top-ranked TREC ID, but can get other
info
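
Had the map been usable, the conversion step would amount to something like the Python sketch below. It writes lines in the six-column layout that trec_eval reads for a run (query ID, the literal `Q0`, doc ID, rank, score, run tag). The input structures, the run tag, and the name `to_trec_run` are hypothetical; SMART's own result format is not modeled here.

```python
def to_trec_run(ranked, smart_to_trec, run_tag="smart"):
    """Format ranked results as trec_eval run lines.

    ranked: {query_id: [(smart_doc_id, score), ...]}, each list sorted best first
    smart_to_trec: {smart_doc_id: trec_doc_id}
    Docs with no mapped TREC ID are skipped, and ranks are assigned after skipping.
    """
    lines = []
    for qid in sorted(ranked):
        rank = 0
        for doc_id, score in ranked[qid]:
            docno = smart_to_trec.get(doc_id)
            if docno is None:
                continue  # e.g. a map entry with a missing ID
            rank += 1
            lines.append(f"{qid} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines

run = to_trec_run(
    {401: [(22, 0.9), (27, 0.5), (24, 0.4)]},
    {22: "ft911-22", 24: "ft911-24"},
)
```

Skipping unmapped documents is a design choice for the sketch; a stricter version could raise an error instead so that a broken map is noticed early.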

15
Alternatives

Discovered that we could get TREC-like results if
we had qrels file
Unfortunately defeats purpose of blind run
Interesting problem
qrels file for Financial Times docs 401-450
SMART re-numbers indexed docs 401-450
internally to 1-50
SMART becomes mismatched with qrels
Manually renumbered MINI TREC query file and MINI
TREC qrels file to start at 1
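
The manual renumbering could also be scripted. The Python sketch below shifts the query ID in each qrels line so that 401 becomes 1. It assumes the usual four-column qrels layout (query ID, iteration, doc ID, relevance), and `renumber_qrels` is a hypothetical helper name.

```python
def renumber_qrels(lines, first_old_id=401):
    """Shift the query ID in each qrels line so that first_old_id becomes 1."""
    offset = first_old_id - 1
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        fields[0] = str(int(fields[0]) - offset)
        out.append(" ".join(fields))
    return out

shifted = renumber_qrels(["401 0 ft911-22 1", "450 0 ft911-24 0"])
```

The same offset has to be applied to the query file, otherwise the two files drift apart again.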

16
Observations

Query Indexing
Files are small (8K)
Files are fast to index (1 minute)
Document Indexing
Files are big (250 megs)
Files are slow to index (45-70 minutes)

17
Observations (cont.)

Process
Indexed all 72 query types
Indexed MINI TREC documents as nnc
Ran all 72 query types
Took 15 query types that had best results
Re-indexed MINI TREC docs several times
Tried a few doc indexing schemes arbitrarily
nnn, nnc, atn, atc, ats, ltn, ltc, lts, msn, mtn
Ran top 15 query types on each

18
Results

Results varied widely
Precision at recall 0.0 ranged from 0.11 to 0.61
Various combinations worked well, didn't see
trends in doc and query indexing types
One strange observation
out.<query>.<doc>
out.atc.nnc, out.atn.nnc, out.ats.nnc: same results
out.ltc.nnc, out.ltn.nnc, out.lts.nnc: same results
For queries, normalize vector has no effect?
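
One possible explanation, assuming documents are ranked by inner product: normalizing a query divides every query weight by the same positive constant, so every document score is scaled by that constant and the ranking order is unchanged. The Python sketch below checks this on toy data. The vectors and function names are made up for illustration, and the sketch does not establish that this is what happened in the runs above.

```python
import math

def rank_docs(query, docs):
    """Rank doc IDs by inner product with the query, best first (ties by doc ID)."""
    scores = {
        doc_id: sum(w * vec.get(term, 0.0) for term, w in query.items())
        for doc_id, vec in docs.items()
    }
    return sorted(scores, key=lambda d: (-scores[d], d))

def cosine_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()} if length else dict(vec)

docs = {
    1: {"tank": 1.0, "farm": 2.0},
    2: {"tank": 3.0},
    3: {"farm": 0.5},
}
query = {"tank": 2.0, "farm": 1.0}

raw_ranking = rank_docs(query, docs)
normalized_ranking = rank_docs(cosine_normalize(query), docs)
```

The two rankings come out identical, so any rank-based measure such as precision at a recall level would match as well. Normalizing the documents is different, because each document is divided by its own length.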

19
Results