Title: What is a document
1. What is a document?
- Information need: Where did the metaphor "doing X is like herding cats" arise?
- quotation? "Managing senior programmers is like herding cats." (Dave Platt)
- or paper/article?
- video?
2. Basic IR Documents
- Assume
- free text, from a quotation through a book (unstructured or semi-structured data)
- English
- available electronically (on-line repositories)
- generally, too many documents to store locally in an index
- generally, infer semantics through low-level units (e.g., terms) and metadata
3. Logical View of Documents
(Figure taken from on-line course resources for
Modern Information Retrieval by Baeza-Yates and
Ribeiro-Neto)
4. Structure
- Metadata is information on the organization of the data.
- external to meaning: length, author, date
- subject matter: subject codes, keywords, taxonomic indicators
- Organizational conventions
- articles have a title, author list, abstract, sections, etc.
- web pages have headings, title, keywords, etc.
(Figure labels: Docs, structure, accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing)
5. Markup Languages
- Markup is extra syntax that describes formatting, attributes, semantics, etc.
- Tags provide direction and delineate beginning and end of marks.
- Examples
- TeX,
- Standard Generalized Markup Language (SGML),
- eXtensible Markup Language (XML),
- HyperText Markup Language (HTML).
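As a rough illustration of how tags delineate marked-up spans, the sketch below uses Python's standard `html.parser` module to separate tag names from the text they enclose. The class name `TagCollector`, the helper `split_markup`, and the sample string are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names and text content separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tags = []  # opening tag names, in document order
        self.text = []  # non-empty text chunks found between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        chunk = data.strip()
        if chunk:
            self.text.append(chunk)


def split_markup(html):
    """Return (tag names, plain text) for an HTML fragment."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, " ".join(collector.text)


sample = "<h1>Metaphors</h1><p>Doing X is like herding cats.</p>"
tags, text = split_markup(sample)
```

Here `tags` holds the markup and `text` holds what is left for indexing; whether the tags are kept as metadata or thrown away is a design decision.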
6. Term Separators: Accents, Spacing, etc.
- Lexical analysis divides text into distinct terms.
- usually disregard punctuation, numbers, spaces
- Decisions
- how to treat case and hyphens?
- disregard comments?
- whether and how to use formatting directives?
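A minimal sketch of lexical analysis, assuming a regular-expression tokenizer is acceptable. The function name `tokenize` and its two flags are hypothetical; they only make the case and hyphen decisions explicit.

```python
import re


def tokenize(text, lowercase=True, keep_hyphens=False):
    """Split text into terms, dropping punctuation, numbers, and spaces."""
    if lowercase:
        text = text.lower()
    # [^\W\d_] matches a letter (a word character that is not a digit or underscore).
    # With keep_hyphens, "web-based" stays one term; otherwise it splits in two.
    if keep_hyphens:
        pattern = r"[^\W\d_]+(?:-[^\W\d_]+)*"
    else:
        pattern = r"[^\W\d_]+"
    return re.findall(pattern, text)
```

For example, `tokenize("Web-based systems, 2002!")` gives `['web', 'based', 'systems']`, while passing `keep_hyphens=True` gives `['web-based', 'systems']`.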
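7. Information in Terms
- Information entropy quantifies information content: $E = -\sum_{i \in s} p_i \log_2 p_i$
- where s is the set of terms and $p_i$ is the relative frequency of term i (its occurrences divided by the total number of term occurrences).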
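A short sketch of the entropy calculation, assuming the usual Shannon definition with base-2 logarithms (so the result is in bits). The function name `term_entropy` is made up for this example.

```python
import math
from collections import Counter


def term_entropy(terms):
    """Shannon entropy, in bits, of the term distribution in `terms`."""
    counts = Counter(terms)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total  # relative frequency of this term
        entropy -= p * math.log2(p)
    return entropy
```

Four equally frequent terms give 2 bits; a text that repeats a single term gives 0 bits, since seeing that term conveys nothing new.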
8. Term Distribution
- Zipf's Law approximates the distribution of term frequencies in a text.
- Frequency of the i-th most frequent term is $1/i^{Q}$ times that of the most frequent term, where $1.5 < Q < 2.0$
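A small sketch of the Zipf estimate, assuming the commonly cited form in which the term at rank i occurs 1/i^Q times as often as the top term. The function name `zipf_frequency` and the default `q=1.8` are illustrative choices within the stated range, not values from the slides.

```python
def zipf_frequency(rank, top_frequency, q=1.8):
    """Predicted frequency of the term at `rank` (1 = most frequent)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return top_frequency / rank ** q
```

With `q=2.0` and a top term seen 1000 times, the 2nd-ranked term is predicted at 250 occurrences and the 10th at 10, which shows how quickly frequencies fall off.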
9. Stop Words
- words that either
- appear so frequently that they do not distinguish documents (e.g., www) or
- have more syntactic than semantic role (e.g., the).
- Advantage: Filtering out stop words shrinks the document description and focuses attention on terms that convey more information.
- Disadvantage: May reduce recall
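A minimal sketch of stop word filtering. The list `STOP_WORDS` is a tiny hypothetical sample, not a standard list; real systems use much longer ones.

```python
# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"a", "and", "as", "be", "in", "is", "not", "of", "or", "the", "to", "will"}


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not in the stop list (case-insensitive)."""
    return [t for t in terms if t.lower() not in stop_words]
```

The recall risk shows up with a query such as "to be or not to be": every term is on this stop list, so nothing is left to match against.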
10. Vocabulary Size
- Heaps' Law models the size of the vocabulary (V) as a function of
- the size of the text (n),
- a baseline ($10 < K < 100$),
- a growth factor ($b < 1$),
- giving $V = K n^{b}$.
(Figure: vocabulary size (Voc) plotted against text size)
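A short sketch of the Heaps estimate, assuming the usual form V = K * n^b. The function name and the defaults `k=50.0` and `b=0.5` are illustrative picks inside the stated ranges, not fitted values.

```python
def heaps_vocabulary_size(n, k=50.0, b=0.5):
    """Estimated number of distinct terms in a text of n term occurrences."""
    if n < 0:
        raise ValueError("text size must be non-negative")
    return k * n ** b
```

Because b is below 1, growth is sublinear: with these defaults, making the text 100 times longer only makes the estimated vocabulary 10 times larger.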
11. Noun Groups
- Further focus the term set by filtering for particular subsets selected manually (e.g., classifications or index terms).
- Discard terms that are not nouns.
- Fix spelling errors.
- Use a thesaurus to combine similar words.
- From the Google web site: the Top 20 gaining queries of 2002 contain only nouns.
12. Stemming
- Grammars permit minor modifications of terms that change their type rather than meaning, e.g., plurals, gerunds, some prefixes and suffixes
- Stemming reduces a term to just its core (stem).
- Advantages: reduces the set of terms, combines terms with the same meaning
- Disadvantage: may reduce precision by incorrectly combining meanings (e.g., skies and ski)
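A deliberately crude suffix-stripping sketch, not the Porter algorithm or any other published stemmer. The rule list `SUFFIXES` and the `min_stem` cutoff are assumptions chosen to keep the example short.

```python
# Hypothetical rule list, checked in order; the first match wins.
SUFFIXES = ("ing", "es", "s")


def naive_stem(term, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    term = term.lower()
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= min_stem:
            return term[: -len(suffix)]
    return term
```

With these rules `naive_stem("topics")` gives `'topic'`, but `naive_stem("skies")` and `naive_stem("ski")` both give `'ski'`, which is the kind of incorrect merge mentioned above.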
13. Putting it together: Document
- The purpose of the course is to teach theory and
practice underlying the construction of Web based
information systems. As such, the course will
devote equal time to information retrieval and
software engineering topics. The theory will be
put into practice through a semester long team
programming project.
48 words, 307 characters
14. Putting it together: Stop Word Removal
- purpose course teach theory practice underlying
construction Web based information course devote
equal time information retrieval software
engineering topics theory practice semester long
team programming project
26 words, 213 chars
15. Putting it together: Only Nouns
- purpose course theory practice construction Web
information course equal time information
retrieval software engineering topics theory
practice semester team programming project
21 words, 179 chars
16. Putting it together: Stemming and Alphabetizing
- construct course course engineer equal informat
informat practice practice program project
purpose retrieve semester software team theory
theory time topic web
21 words, 161 chars
17. Indexing
- Terms remaining after document processing must be stored to facilitate retrieval.
- Typically, they are stored in an inverted index. More on that later.
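As a preview, a minimal sketch of an inverted index: a mapping from each term to the documents that contain it. The function name `build_inverted_index`, the document ids, and the two toy documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Toy collection: document id -> terms left after processing.
docs = {
    1: ["course", "theory", "practice"],
    2: ["information", "retrieval", "theory"],
}
index = build_inverted_index(docs)
```

Looking up `index["theory"]` returns both document ids, so a query term is resolved with one dictionary lookup rather than a scan of every document.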