1
Introduction to Content Analysis
  • Prof. Marti Hearst
  • SIMS 202, Lecture 14

2
Topics for Today
  • Overview of Content Analysis
  • Text Representation
  • Statistical Characteristics of Text Collections

3
Content Analysis
  • Automated transformation of raw text into a form
    that represents some aspect(s) of its meaning
  • Including, but not limited to:
  • Automated Thesaurus Generation
  • Phrase Detection
  • Categorization
  • Clustering
  • Summarization

4
Techniques for Content Analysis
  • Statistical
  • Single Document
  • Full Collection
  • Linguistic
  • Syntactic
  • Semantic
  • Pragmatic
  • Knowledge-Based (Artificial Intelligence)
  • Hybrid (Combinations)

5
Text Processing
  • Standard Steps
  • Recognize document structure
  • titles, sections, paragraphs, etc.
  • Break into tokens
  • usually space and punctuation delineated
  • special issues with Asian languages
  • Stemming/morphological analysis
  • Store in inverted index (to be discussed later)
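The tokenization step above can be sketched in a few lines of Python — a minimal illustration only, assuming space- and punctuation-delimited text; real pipelines also recognize document structure first, and Asian languages need genuine word segmentation:

```python
import re

def tokenize(text):
    """Break text into lowercase tokens, splitting on space and
    punctuation (fine for English; Asian languages need real
    word segmentation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Recognize document structure: titles, sections, paragraphs."))
# → ['recognize', 'document', 'structure', 'titles', 'sections', 'paragraphs']
```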

6
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never changes grammatical class
  • dog, dogs
  • tengo, tienes, tiene, tenemos, tienen
  • Derivational Morphology
  • Derive one word from another
  • Often changes grammatical class
  • build, building; health, healthy

7
Automated Methods
  • Powerful multilingual tools exist for
    morphological analysis
  • PCKimmo, Xerox Lexical technology
  • Require a grammar and dictionary
  • Use two-level automata
  • Stemmers
  • Very dumb rules work well (for English)
  • Porter Stemmer: iteratively remove suffixes
  • Improvement: pass results through a lexicon
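The "very dumb rules" idea can be illustrated with a toy iterative suffix stripper. This is emphatically not the real Porter algorithm — the suffix list and length check below are invented for illustration; Porter's actual rules are ordered into passes and condition on a syllable-like "measure":

```python
# Illustrative suffix list (NOT Porter's actual rule set).
SUFFIXES = ["ational", "ization", "ing", "ness", "ies", "es", "s", "ed"]

def crude_stem(word):
    """Iteratively strip suffixes until no rule applies,
    keeping at least a 3-letter stem."""
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= 3:
                word = word[:-len(suf)]
                changed = True
                break
    return word

print(crude_stem("dogs"), crude_stem("buildings"))
# → dog build
```

Note how "buildings" collapses all the way to "build": exactly the kind of conflation — sometimes useful, sometimes an error — that Krovetz (93) catalogued for the Porter stemmer on the next slide.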

8
Errors Generated by Porter Stemmer (Krovetz 93)
9
Statistical Properties of Text
  • Token occurrences in text are not uniformly
    distributed
  • They are also not normally distributed
  • They do exhibit a Zipf distribution
  • (in-class demonstration of distribution types)

10
Zipf Distribution
  • The product of the frequency of a word (f) and
    its rank (r) is approximately constant: f × r ≈ C
  • Rank = order of words by frequency of occurrence
  • Main Characteristics
  • a few elements occur very frequently
  • a medium number of elements have medium frequency
  • many elements occur very infrequently
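The rank–frequency relationship is easy to check on any token list: rank words by frequency and compute f × r, which Zipf's law predicts stays roughly constant across ranks. A minimal sketch:

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Rank words by frequency (rank 1 = most frequent) and
    report f * r, which Zipf's law predicts is roughly constant."""
    counts = Counter(tokens).most_common()
    return [(rank, word, f, f * rank)
            for rank, (word, f) in enumerate(counts, start=1)][:top]

tokens = "the cat and the dog and the bird".split()
for rank, word, f, product in zipf_table(tokens):
    print(rank, word, f, product)
```

On a toy sample like this the products are noisy; the constancy only emerges over a real collection like the housing listings on the following slides.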

11
What Kinds of Data Exhibit a Zipf Distribution?
  • Words in a text collection
  • Library book checkout patterns
  • Incoming Web Page Requests (Nielsen)
  • Outgoing Web Page Requests (Cunha & Crovella)
  • Document Size on Web (Cunha & Crovella)

12
Housing Listing Frequency Data
6208 tokens, 1318 unique (very small collection)
13
Words that occur few times (housing listings)
14
Medium and very frequent words (housing listings)
15
A More Standard Collection
Government documents, 157734 tokens, 32259 unique
Most frequent words: the (8164), of (4771), to (4005), a (2834), and (2827),
in (2802), The (1592), for (1370), is (1326), s (1324), that (1194), by (973),
on (969), FT (915), Mr (883), was (860), be (855), Pounds (849), TEXT (798),
PUB (798), PROFILE (798), PAGE (798), HEADLINE (798), DOCNO (798)
Words occurring once include: ABC, ABFT, ABOUT, ACFT, ACI, ACQUI,
ACQUISITIONS, ACSIS, ADFT, ADVISERS, AE
16
Word Frequency vs. Resolving Power (from van
Rijsbergen 79)
The most frequent words are not the most
descriptive.
17
Statistical Independence vs. Dependence
  • How likely is a red car to drive by given we've
    seen a black one?
  • How likely is word W to appear, given that we've
    seen word V?
  • Colors of cars driving by are independent
    (although more frequent colors are more likely)
  • Words in text are not independent (although again
    more frequent words are more likely)

18
Statistical Independence
  • Compute for a window of words

(diagram: overlapping windows w1, w11, w21 sliding over a token sequence
a b c d e f g h i j k l m n o p)
19
Lexical Associations
  • Subjects write first word that comes to mind
  • doctor/nurse, black/white (Palermo & Jenkins 64)
  • Text corpora yield similar associations
  • One measure: Mutual Information (Church & Hanks
    89)
  • If word occurrences were independent, the
    numerator and denominator would be equal (if
    measured across a large collection)

20
Interesting Associations with "Doctor" (AP
Corpus, N ≈ 15 million, Church & Hanks 89)
21
Un-Interesting Associations with "Doctor" (AP
Corpus, N ≈ 15 million, Church & Hanks 89)
These associations were likely to happen because
the non-doctor words shown here are very
common and therefore likely to co-occur with any
noun.
22
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating-point
    numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse
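Turning a bag of words into such a vector might look like this minimal sketch — the vocabulary and raw-count weights are illustrative (real systems use tf.idf-style weights and store only the nonzero entries, since most slots are zero):

```python
from collections import Counter

def doc_vector(tokens, vocabulary):
    """Bag-of-words vector: one slot per term in the collection
    vocabulary, so most entries are zero (the vector is sparse)."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]

vocab = ["nova", "galaxy", "heat", "film", "role"]   # illustrative vocabulary
print(doc_vector("nova nova galaxy film".split(), vocab))
# → [2.0, 1.0, 0.0, 1.0, 0.0]
```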

23
Document Vectors
Document ids A–I; terms: nova, galaxy, heat, h'wood, film, role, diet, fur
  • A: 1.0 0.5 0.3
  • B: 0.5 1.0
  • C: 1.0 0.8 0.7
  • D: 0.9 1.0 0.5
  • E: 1.0 1.0
  • F: 0.9 1.0
  • G: 0.5 0.7 0.9
  • H: 0.6 1.0 0.3 0.2 0.8
  • I: 0.7 0.5 0.1 0.3
24
Documents in 3D Space
25
Documents and Query in 3D Space
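One standard way to compare a query vector with document vectors in this space — not spelled out on the slides, but the usual choice in the vector-space model — is the cosine of the angle between them, which uses direction and ignores magnitude. A sketch, with made-up three-term vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v:
    1.0 = same direction, 0.0 = orthogonal (no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1.0, 0.0, 1.0]   # hypothetical weights on axes (nova, galaxy, film)
doc_a = [1.0, 0.5, 0.0]
doc_c = [0.0, 0.0, 0.8]
print(cosine(query, doc_a), cosine(query, doc_c))
```

Documents are then ranked by their cosine with the query: vectors pointing the same way score near 1.0, vectors sharing no terms score 0.0.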
26
Summary
  • Content Analysis: transforming raw text into more
    computationally useful forms
  • Words in text collections exhibit interesting
    statistical properties
  • Word frequencies have a Zipf distribution
  • Word co-occurrences exhibit dependencies
  • Text documents are transformed to vectors
  • pre-processing includes tokenization, stemming,
    collocations/phrases
  • Documents occupy multi-dimensional space