1
CS 277 Data Mining, Lecture 2: Text Mining and Information Retrieval
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 1 available. Due Tuesday Oct 9.
  • Added to list of ideas for projects (link on
    class website)
  • Project proposal due in 2 weeks (Oct 16)

3
Lecture Topics in Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

4
Text Mining Applications
  • Information Retrieval
    • Query-based search of large text archives, e.g., the Web
  • Text Classification
    • Automated assignment of topics to Web pages, e.g., Yahoo, Google
    • Automated classification of email into spam and non-spam (ham)
  • Text Clustering
    • Automated organization of search results in real-time into categories
    • Discovering clusters and trends in technical literature (e.g., CiteSeer)
  • Information Extraction
    • Extracting standard fields from free text
    • Extracting names and places from reports, newspapers
    • Extracting resume information automatically from resumes

5
Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

6
General Concepts in Information Retrieval
  • Representation language
    • Typically a vector of W features
    • images: set of color, intensity, texture, gradient features characterizing the image
    • text: word counts
  • Data: set of D objects
    • Typically represented as a D x W matrix
  • Query q
    • User poses a query to search the data set
    • Query is expressed in the same representation language as the data
    • each text document is a set of words that occur in the document
    • query q is also expressed as a set of words, e.g., "data" and "mining"

7
Query by Content
  • traditional DB query: exact matches
    • query q: level = MANAGER AND age < 30
  • or, Boolean match on text
    • query "Irvine AND fun": return all docs with Irvine and fun
    • not useful when there are many matches
    • "data mining" in Google returns 60 million documents
  • query-by-content: query more general / less precise
    • what record is most similar to a query q?
    • for text data, often called information retrieval (IR)
    • can also be used for images, sequences, video, ...
    • q can itself be an object (a document) or a shorter version (1 word)
  • Goal
    • Match query q to the D objects in the database
    • Return a ranked list of the most relevant objects in the data set given q

8
Issues in Query by Content
  • What representation language to use?
  • How to measure similarity between q and each d in
    D?
  • How to compute the results in real-time?
  • How to rank the results for the user?
  • Allowing user feedback (query modification)
  • How to evaluate and compare different IR
    algorithms/systems?

9
Text Retrieval
  • document: book, paper, WWW page, ...
  • term: word, word-pair, phrase, ... (often W = 100,000)
  • query q: set of terms, e.g., "data mining"
  • NLP (natural language processing) is too hard, so...
  • we want a (vector) representation for text which
    • retains maximum useful semantics
    • supports efficient distance computations between docs and q
  • term weights
    • Boolean (term in document or not): bag of words
    • real-valued (frequency of term in doc relative to all docs), ...
  • note: this loses word order and sentence structure

10
Processing one document (pipeline figure): tokenize → stem → vocab filter
11
Practical Issues
  • Tokenization (see the sketch after this list)
    • Convert document to word counts
    • word token = any nonempty sequence of characters
    • for HTML (etc.) need to remove formatting
  • Canonical forms, Stopwords, Stemming
    • Remove capitalization
    • Stopwords: remove very frequent words (the, and); can use a standard list
    • Can also remove very rare words
    • Stemming (next slide)
  • Data representation
    • 3-column: <docID, termID, position>
    • Inverted index (faster): list of sorted <termID, docID> pairs; useful for finding docs containing certain terms
    • Equivalent to a sparse representation of the term x doc matrix
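
A minimal Python sketch of the tokenize / stopword-filter / count steps above; the regular expression and the toy stopword list are illustrative assumptions, not the course's exact setup.

    import re
    from collections import Counter

    # assumed toy stopword list; standard lists are much larger
    STOPWORDS = {"the", "and", "a", "an", "of", "is", "in", "to"}

    def tokenize(text):
        # lowercase, then treat maximal runs of letters/digits as word tokens
        return re.findall(r"[a-z0-9]+", text.lower())

    def word_counts(text):
        # drop stopwords, then convert the document to word counts
        tokens = [t for t in tokenize(text) if t not in STOPWORDS]
        return Counter(tokens)

    print(word_counts("The Internet is an engineering feat of no small magnitude."))
    # Counter({'internet': 1, 'engineering': 1, 'feat': 1, 'no': 1, ...})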

12
Stemming
  • May want to reduce all morphological variants of a word to a single index term
    • a document containing the words "fish" and "fisher" will not be retrieved by a query containing "fishing" (no "fishing" explicitly contained in the document)
  • Stemming: reduce words to their root form
    • "fish" becomes a new index term
  • Porter stemming algorithm (1980)
    • relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    • BINARIZATION -> BINARIZE
  • Not always desirable: university, universal -> univers (in Porter's)
  • WordNet: dictionary-based approach
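
A minimal stemming sketch using NLTK's Porter implementation; NLTK is an assumption for illustration here (Homework 1 uses the provided porter.pl instead).

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["fish", "fishing", "fished", "university", "universal"]:
        # print each word next to its stemmed form
        print(word, "->", stemmer.stem(word))
    # "fishing" and "fished" both reduce to "fish", so a query on "fishing"
    # now matches the document; "university" and "universal" both collapse
    # to "univers", illustrating the over-stemming noted above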

13
Porter Stemmer
  • Aside: you will use the Porter Stemmer in Homework 1. I have provided you with porter.pl
  • If you don't already have access to perl, it is available on your unix account

14
Porter Stemmer (example)
input:
  • The Internet is an engineering feat of no small
    magnitude. It operates using a method of data
    transmission invented in the 1960s called packet
    switching. Send e-mail, for instance, and your
    message is broken into any number of little
    bundles; each takes a separate route and reunites
    with the others at the destination.
output:
  • the internet is an engin feat of no small
    magnitud. it oper us a method of data transmiss
    invent in the 1960s call packet switch. send e
    mail, for instanc, and your messag is broken into
    ani number of littl bundl each take a separ rout
    and reunit with the other at the destin.

Q: Do you think Google uses stemming?
Q: What might stemming be good for?
15
Toy example of a document-term matrix
16
Inverted index
  • Queries
    • q1: database
    • q2: database schema
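
A minimal inverted-index sketch in Python; the toy document texts are illustrative assumptions, and queries use AND semantics (intersection of posting sets).

    from collections import defaultdict

    docs = {
        "d1": "the database stores records",
        "d2": "database schema design",
        "d3": "schema evolution",
    }

    # build the inverted index: term -> set of docIDs containing that term
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def query(terms):
        # intersect posting sets: docs containing every query term
        postings = [index[t] for t in terms]
        return set.intersection(*postings) if postings else set()

    print(query(["database"]))            # q1 -> {'d1', 'd2'}
    print(query(["database", "schema"]))  # q2 -> {'d2'}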

17
Distance metric
  • Vector-space model
    • 1 x W vector d
  • What is a suitable distance metric between q and d?
  • → whiteboard

18
Distance metric cosine similarity
  • Go with cosine similarity
    • sim(q, d) = <q, d> / (||q|| ||d||) = cos θ
  • slightly unnatural: uses inner products and norms from L2
  • Bias
    • favors shorter documents?
    • or longer documents?
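
A minimal sketch of this cosine similarity in Python, checked against the toy doc-term matrix on the following slides.

    import math

    def cosine_similarity(q, d):
        # sim(q, d) = <q, d> / (||q|| ||d||), the cosine of the angle between them
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d)

    q  = [1, 0, 1, 0, 0, 0]    # binary query vector: terms t1 and t3
    d1 = [24, 21, 9, 0, 0, 3]  # TF row for d1 from the toy matrix
    print(round(cosine_similarity(q, d1), 2))  # 0.70, matching slide 24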

19
Distance matrices for toy document-term data
Euclidean distance matrix (figure not reproduced)

TF doc-term matrix:

         t1   t2   t3   t4   t5   t6
  d1     24   21    9    0    0    3
  d2     32   10    5    0    3    0
  d3     12   16    5    0    0    0
  d4      6    7    2    0    0    0
  d5     43   31   20    0    3    0
  d6      2    0    0   18    7   16
  d7      0    0    1   32   12    0
  d8      3    0    0   22    4    2
  d9      1    0    0   34   27   25
  d10     6    0    0   17    4   23

Cosine distance matrix (figure not reproduced)
20
q = "database schema"
21
TF-IDF Term Weighting Schemes
  • Not all terms in a query or document may be equally important...
  • TF (term frequency): term weight = number of times the term occurs in that document
    • problem: a term common to many docs → low discrimination
  • IDF (inverse document frequency of a term)
    • n_j documents contain term j, D documents in total
    • IDF = log(D / n_j)
    • Favors terms that occur in relatively few documents
  • TF-IDF weight = TF(term) x IDF(term)
  • No real theoretical basis, but works well empirically and is widely used
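
A minimal sketch computing IDF and TF-IDF from the toy TF matrix on the next slide; with the natural log it reproduces the IDF row (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) shown there.

    import math

    # toy TF matrix from the slides (rows = d1..d10, cols = t1..t6)
    TF = [
        [24, 21,  9,  0,  0,  3],
        [32, 10,  5,  0,  3,  0],
        [12, 16,  5,  0,  0,  0],
        [ 6,  7,  2,  0,  0,  0],
        [43, 31, 20,  0,  3,  0],
        [ 2,  0,  0, 18,  7, 16],
        [ 0,  0,  1, 32, 12,  0],
        [ 3,  0,  0, 22,  4,  2],
        [ 1,  0,  0, 34, 27, 25],
        [ 6,  0,  0, 17,  4, 23],
    ]
    D, W = len(TF), len(TF[0])

    # n_j = number of documents containing term j; IDF_j = log(D / n_j)
    n = [sum(1 for d in range(D) if TF[d][j] > 0) for j in range(W)]
    idf = [math.log(D / nj) for nj in n]

    # TF-IDF weight = TF(term) * IDF(term)
    tfidf = [[TF[d][j] * idf[j] for j in range(W)] for d in range(D)]

    print([round(x, 1) for x in idf])       # [0.1, 0.7, 0.5, 0.7, 0.4, 0.7]
    print([round(x, 1) for x in tfidf[0]])  # [2.5, 14.6, 4.6, 0.0, 0.0, 2.1]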

22
TF-IDF Example
TF
t1 t2 t3 t4 t5 t6 d1 24 21 9 0 0 3 d2 32
10 5 0 3 0 d3 12 16 5 0 0 0 d4 6 7 2
0 0 0 d5 43 31 20 0 3 0 d6 2 0 0 18 7
16 d7 0 0 1 32 12 0 d8 3 0 0 22 4 2 d9
1 0 0 34 27 25 d10 6 0 0 17 4 23 idf 0.1 0.7
0.5 0.7 0.4 0.7
TF-IDF
t1 t2 t3 t4 t5 t6 d1 2.5 14.6 4.6
0 0 2.1 d2 3.4 6.9 2.6 0 1.1 0 d3 1.3
11.1 2.6 0 0 0 d4 0.6 4.9 1.0 0 0
0 d5 4.5 21.5 10.2 0 1.1 0 ...
IDF weights log(D/nj) (0.1, 0.7, 0.5, 0.7,
0.4, 0.7)
23
Baseline Document Querying System
  • Queries q: binary term vectors
  • Documents represented by TF-IDF weights
  • Cosine distance used for retrieval and ranking (an end-to-end sketch follows the matrices on the next slide)

24
Baseline Document Querying System
TF doc-term matrix:

         t1   t2   t3   t4   t5   t6
  d1     24   21    9    0    0    3
  d2     32   10    5    0    3    0
  d3     12   16    5    0    0    0
  d4      6    7    2    0    0    0
  d5     43   31   20    0    3    0
  d6      2    0    0   18    7   16
  d7      0    0    1   32   12    0
  d8      3    0    0   22    4    2
  d9      1    0    0   34   27   25
  d10     6    0    0   17    4   23

TF-IDF doc-term matrix:

         t1    t2    t3   t4   t5   t6
  d1    2.5  14.6   4.6    0    0  2.1
  d2    3.4   6.9   2.6    0  1.1    0
  d3    1.3  11.1   2.6    0    0    0
  d4    0.6   4.9   1.0    0    0    0
  d5    4.5  21.5  10.2    0  1.1    0
  ...

q = (1, 0, 1, 0, 0, 0)

Cosine similarity:

         TF   TF-IDF
  d1   0.70     0.32
  d2   0.77     0.51
  d3   0.58     0.24
  d4   0.60     0.23
  d5   0.79     0.43
  ...
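
A minimal end-to-end sketch of the baseline system: a binary query vector scored by cosine similarity against the TF-IDF rows above (d1..d5).

    import math

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    # TF-IDF rows for d1..d5 from the matrix above; q asks for terms t1 and t3
    tfidf = {
        "d1": [2.5, 14.6,  4.6, 0,   0, 2.1],
        "d2": [3.4,  6.9,  2.6, 0, 1.1,   0],
        "d3": [1.3, 11.1,  2.6, 0,   0,   0],
        "d4": [0.6,  4.9,  1.0, 0,   0,   0],
        "d5": [4.5, 21.5, 10.2, 0, 1.1,   0],
    }
    q = [1, 0, 1, 0, 0, 0]

    # rank documents by decreasing cosine similarity to the query
    ranking = sorted(tfidf, key=lambda d: cos(q, tfidf[d]), reverse=True)
    print(ranking)  # ['d2', 'd5', 'd1', 'd3', 'd4'], as in the TF-IDF column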
25
Precision versus Recall
  • Rank documents (numerically) with respect to the query
  • Compute precision and recall by thresholding the rankings
  • precision
    • fraction of retrieved objects that are relevant
  • recall
    • fraction of relevant objects that are retrieved (retrieved relevant objects / total relevant objects)
  • Tradeoff: high precision ↔ low recall, and vice-versa
  • Very similar to ROC in concept
  • For multiple queries, precision for specific ranges of recall can be averaged (so-called interpolated precision)

26
Precision versus Recall
  • Chakrabarti, p. 55
  • recall(k) = (1/R) Σ_{i=1..k} r_i
  • precision(k) = (1/k) Σ_{i=1..k} r_i

R = 5
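
A minimal sketch of these formulas, where r_i = 1 if the document at rank i is relevant; the ranked relevance vector is an illustrative assumption, with R = 5 relevant documents as on the slide.

    # relevance of the ranked list: r[i] = 1 if the doc at rank i+1 is relevant
    r = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # assumed ranking with 5 relevant docs
    R = sum(r)                          # total relevant documents, R = 5

    def recall_at(k):
        return sum(r[:k]) / R

    def precision_at(k):
        return sum(r[:k]) / k

    for k in (1, 5, 10):
        print(k, precision_at(k), recall_at(k))
    # 1 1.0 0.2   /   5 0.6 0.6   /   10 0.5 1.0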
27
Precision-Recall Curve (form of ROC)
(precision-recall curves for systems A, B, C; figure not reproduced)
alternative (point) values: precision where recall = precision, or precision for a fixed number of retrievals, or average precision over multiple recall levels
C is universally worse than A and B
28
TREC evaluations
  • Text REtrieval Conference (TREC)
  • Web site: trec.nist.gov
  • Annual impartial evaluation of IR systems
    • e.g., D = 1 million documents
  • TREC organizers supply contestants with several hundred queries q
  • Each competing system provides its ranked list of documents
  • Union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query q
  • Precision, recall, etc., are then calculated and systems compared

29
Other Examples of Evaluation Data Sets
  • Cranfield data
    • Number of documents: 1400
    • 225 queries: medium-length, manually constructed test questions
    • Relevance determined by expert committee (from 1968)
  • Newsgroups
    • Articles from 20 Usenet newsgroups
    • Queries: randomly selected documents
    • Relevance: is the document d in the same category as the query doc?

30
Performance on Cranfield Document Set
31
Performance on Newsgroups Data
32
Related Types of Data
  • Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining
  • transaction data
    • Rows: customers
    • Columns: products
  • Web log data (ignoring sequence)
    • Rows: Web surfers
    • Columns: Web pages
  • Recommender systems
    • Given some products from user i, suggest other products to the user
    • e.g., Amazon.com's book recommender
  • Collaborative filtering
    • use the k nearest individuals as the basis for predictions (see the sketch after this list)
  • Many similarities with querying and information retrieval
    • use of cosine distance to normalize vectors
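
A minimal collaborative-filtering sketch, ranking users by cosine similarity to a target user; the user names and the toy count matrix are illustrative assumptions.

    import math

    # rows = users, columns = products (purchase counts); toy data
    users = {
        "alice": [2, 0, 1, 0],
        "bob":   [1, 1, 0, 0],
        "carol": [0, 0, 2, 1],
    }

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    # k nearest individuals to alice by cosine similarity; their other
    # products would be the basis for recommendations
    target = users["alice"]
    neighbors = sorted((name for name in users if name != "alice"),
                       key=lambda name: cos(target, users[name]),
                       reverse=True)
    print(neighbors)  # ['bob', 'carol']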

33
Web-based Retrieval
  • Additional information in Web documents
    • Link structure (e.g., PageRank, to be discussed later)
    • HTML structure
    • Link/anchor text
    • Title text
  • This information can be leveraged for better retrieval
  • Additional issues in Web retrieval
    • Scalability: size of corpus is huge (10 to 100 billion docs)
    • Constantly changing: crawlers update document-term information; need schemes for efficiently updating indices
    • Evaluation is more difficult: how is relevance measured? How many documents in total are relevant?

34
Further Reading
  • Chakrabarti, Chapter 3
  • General reference on text and language modeling:
    • Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
  • Very useful reference on indexing and searching text:
    • Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, by Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Morgan Kaufmann, 1999.
  • Web-related document search
    • Information on how real Web search engines work: http://searchenginewatch.com/