What is a document - PowerPoint PPT Presentation

Slides: 18
Provided by: CSU67

Transcript and Presenter's Notes

1
What is a document?
  • Information need: Where did the metaphor
    "doing X is like herding cats" arise?
  • quotation? "Managing senior programmers is like
    herding cats." (Dave Platt) or
  • paper/article?
  • video?

2
Basic IR Documents
  • Assume:
  • free text, from a quotation through a book
    (unstructured or semi-structured data)
  • English
  • available electronically (on-line repositories)
  • generally, too many documents to store locally in
    an index
  • generally, infer semantics through low-level
    units (e.g., terms) and metadata

3
Logical View of Documents
(Figure taken from on-line course resources for
Modern Information Retrieval by Baeza-Yates and
Ribeiro-Neto)
4
Structure
  • Metadata is information on the organization of
    the data.
  • external to meaning: length, author, date
  • subject matter: subject codes, keywords,
    taxonomic indicators
  • Organizational conventions:
  • articles have a title, author list, abstract,
    sections, etc.
  • web pages have headings, title, keywords, etc.

(Figure: document processing roadmap: docs, structure, accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing)
5
Markup Languages
  • Markup is extra syntax that describes formatting,
    attributes, semantics, etc.
  • Tags provide direction and delineate the beginning
    and end of marked-up spans.
  • Examples:
  • TeX,
  • Standard Generalized Markup Language (SGML),
    eXtensible Markup Language (XML),
  • HyperText Markup Language (HTML).
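
As a rough illustration of how tags delineate marked-up spans, the sketch below separates an HTML page's title from its body text using Python's standard `html.parser` module. The class name `TextAndTitle` and the helper `split_markup` are hypothetical names chosen for this example, not part of the course material.

```python
from html.parser import HTMLParser


class TextAndTitle(HTMLParser):
    """Collects the <title> text separately from all other text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # An opening tag marks the beginning of a span.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        # A closing tag marks the end of a span.
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def split_markup(html):
    """Return (title, body_text) for an HTML string."""
    parser = TextAndTitle()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.text_parts)


page = "<html><head><title>IR Course</title></head><body><h1>Stemming</h1><p>Reduce terms.</p></body></html>"
print(split_markup(page))
```

Here the title acts as metadata, while the remaining text is what later processing steps would tokenize.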

6
Term Separators: Accents, Spacing, etc.
  • Lexical analysis divides text into distinct
    terms.
  • usually disregards punctuation, numbers, spaces
  • Decisions:
  • how to treat case and hyphens?
  • disregard comments?
  • whether and how to use formatting directives?
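
A minimal sketch of lexical analysis, assuming English text and the simplest possible policy: punctuation, numbers, and spaces are discarded, and the case and hyphen decisions are exposed as flags. The function name `tokenize` and its parameters are illustrative assumptions.

```python
import re


def tokenize(text, fold_case=True, split_hyphens=True):
    """Split text into terms, dropping punctuation, digits, and whitespace."""
    if fold_case:
        text = text.lower()
    if split_hyphens:
        # Treat "web-based" as the two terms "web" and "based".
        text = text.replace("-", " ")
    # Keep runs of letters, optionally joined by internal hyphens.
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


print(tokenize("Web-based systems, 2 of them!"))
# ['web', 'based', 'systems', 'of', 'them']
print(tokenize("Web-based systems, 2 of them!", split_hyphens=False))
# ['web-based', 'systems', 'of', 'them']
```

Each flag corresponds to one of the decisions listed above; a real system would also need a policy for accents and formatting directives.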

7
Information in Terms
  • Information entropy quantifies information
    content:
    $E = -\sum_{i=1}^{s} p_i \log_2 p_i$
  • where there is a set of $s$ terms and $p_i$ is the
    relative frequency of the $i$th term.
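
A small sketch of computing entropy over a list of terms, assuming the standard Shannon form (the negative sum of p times log base 2 of p over the distinct terms), with p estimated as each term's relative frequency. The function name `term_entropy` is an assumption for this example.

```python
import math
from collections import Counter


def term_entropy(terms):
    """Shannon entropy, in bits, of the term distribution in `terms`."""
    counts = Counter(terms)
    total = sum(counts.values())
    # p = count / total is the relative frequency of each distinct term.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(term_entropy("purpose course theory course".split()))  # 1.5
```

A text that repeats a single term has entropy 0, while a text whose terms are all equally frequent has the maximum entropy for its vocabulary size.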

8
Term Distribution
  • Zipf's Law approximates the distribution of term
    frequencies in a text.
  • The frequency of the $i$th most frequent term is
    $1/i^{Q}$ times that of the most frequent term,
    where $1.5 < Q < 2.0$.
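
To make the shape of this distribution concrete, the sketch below predicts term frequencies by rank, assuming the frequency of the i-th most frequent term is the top frequency divided by i raised to the exponent Q. The function name, the top frequency of 1000, and the default exponent of 1.75 are arbitrary choices for illustration.

```python
def zipf_frequencies(top_frequency, ranks, q=1.75):
    """Predicted frequencies for ranks 1..ranks under Zipf's Law.

    q is the exponent (called Q on the slide); 1.75 is an arbitrary
    value inside the stated range 1.5 < Q < 2.0.
    """
    return [top_frequency / (i ** q) for i in range(1, ranks + 1)]


for rank, freq in enumerate(zipf_frequencies(1000, 5), start=1):
    print(rank, round(freq, 1))
```

The predicted frequencies fall off quickly with rank, which is why a handful of terms accounts for much of a text.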

9
Stop Words
  • Words that either
  • appear so frequently that they do not distinguish
    documents (e.g., "www") or
  • have more syntactic than semantic role (e.g.,
    "the").
  • Advantage: Filtering out stop words shortens the
    document description and focuses attention on
    terms that convey more information.
  • Disadvantage: May reduce recall.
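
A minimal sketch of stop word filtering. The stop list below is a tiny hypothetical sample made up for this example; it is not the list used to produce the slides' later results.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = frozenset({"the", "of", "is", "to", "and", "a", "an", "in", "will", "be"})


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Drop terms that appear in the stop list (case-insensitive)."""
    return [t for t in terms if t.lower() not in stop_words]


terms = "The purpose of the course is to teach theory".split()
print(remove_stop_words(terms))
# ['purpose', 'course', 'teach', 'theory']
```

The recall risk is visible here: a query made up entirely of stop words would be filtered down to nothing.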

10
Vocabulary Size
  • Heaps' Law models the size of the vocabulary,
    $V = K n^{b}$, as a function of
  • the size of the text ($n$),
  • a baseline ($10 < K < 100$),
  • a growth factor ($b < 1$).
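
A short sketch of the estimate, assuming the usual form of Heaps' Law in which the vocabulary size is K times n raised to the power b. The function name and the default values K = 50 and b = 0.5 are arbitrary picks within the stated ranges, not fitted to any corpus.

```python
def heaps_vocabulary_size(n, k=50.0, b=0.5):
    """Estimated number of distinct terms in a text of n words."""
    return k * n ** b


for n in (10_000, 1_000_000):
    print(n, round(heaps_vocabulary_size(n)))
```

Because b is below 1, the vocabulary grows sublinearly: under these assumed parameters, a text 100 times longer has a vocabulary only 10 times larger.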

(Figure: vocabulary size plotted against text size)
11
Noun Groups
  • Further focus the term set by filtering for
    particular subsets selected manually (e.g.,
    classifications or index terms).
  • Discard terms that are not nouns.
  • Fix spelling errors.
  • Use a thesaurus to combine similar words.
  • According to the Google web site, the Top 20
    gaining queries of 2002 contain only nouns.

12
Stemming
  • Grammars permit minor modifications of terms that
    change their type rather than meaning, e.g.,
    plurals, gerunds, some prefixes and suffixes.
  • Stemming reduces a term to just its core (stem).
  • Advantages: reduces the set of terms, combines
    terms with the same meaning.
  • Disadvantage: may reduce precision by incorrectly
    combining meanings (e.g., "skies" and "ski").
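
The sketch below is a deliberately naive suffix stripper, not a real stemming algorithm such as Porter's; the suffix list and the minimum stem length are assumptions made for illustration. It shows both the benefit (plurals collapse to one term) and the "skies" and "ski" problem mentioned above.

```python
# Assumed, minimal suffix list; checked in order, longest first.
SUFFIXES = ("ing", "es", "s")


def naive_stem(term, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    term = term.lower()
    for suffix in SUFFIXES:
        stem = term[: len(term) - len(suffix)]
        if term.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return term


print(naive_stem("topics"))  # topic
print(naive_stem("skies"))   # ski  (wrongly merged with the sport)
print(naive_stem("ski"))     # ski
```

Production stemmers apply many more rules and conditions, but they face the same trade-off between merging variants and merging unrelated words.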

13
Putting it together: Document
  • The purpose of the course is to teach theory and
    practice underlying the construction of Web based
    information systems. As such, the course will
    devote equal time to information retrieval and
    software engineering topics. The theory will be
    put into practice through a semester long team
    programming project.

48 words, 307 characters
14
Putting it together: Stop Word Removal
  • purpose course teach theory practice underlying
    construction Web based information course devote
    equal time information retrieval software
    engineering topics theory practice semester long
    team programming project

26 words, 213 chars
15
Putting it together: Only Nouns
  • purpose course theory practice construction Web
    information course equal time information
    retrieval software engineering topics theory
    practice semester team programming project

21 words, 179 chars
16
Putting it together: Stemming and Alphabetizing
  • construct course course engineer equal informat
    informat practice practice program project
    purpose retrieve semester software team theory
    theory time topic web

21 words, 161 chars
17
Indexing
  • Terms remaining after document processing must be
    stored to facilitate retrieval.
  • Typically, they are stored in an inverted index.
    More on that later.
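
As a preview, here is a minimal sketch of an inverted index, assuming each document has already been reduced to a list of terms. It maps every term to the sorted list of document ids that contain it; the function name and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents containing it.

    `docs` maps a document id to that document's list of processed terms.
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: ["course", "theory", "course"],
    2: ["theory", "web"],
}
print(build_inverted_index(docs))
# {'course': [1], 'theory': [1, 2], 'web': [2]}
```

Looking up a query term then costs one dictionary access instead of a scan over every document.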
