Search Engine Technology (http://www.cs.columbia.edu/radev/SET07.html)

1
Search Engine Technology
http://www.cs.columbia.edu/radev/SET07.html

January 17, 2007
Prof. Dragomir R. Radev
radev_at_umich.edu

2
SET Winter 2007

Introduction

7
Examples of search engines

Conventional (library catalog): search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
Clustering systems (Vivisimo, Clusty)
Research systems (Lemur, Nutch)

8
What does it take to build a search engine?

Decide what to index
Collect it
Index it (efficiently)
Keep the index up to date
Provide user-friendly query facilities

9
What else?

Understand the structure of the web for efficient
crawling
Understand user information needs
Preprocess text and other unstructured data
Cluster data
Classify data
Evaluate performance

10
Goals of the course

Understand how to collect, store, index, analyze,
search and present large quantities of
unstructured text.
Understand the dynamics of the Web by building
appropriate mathematical models.
Build working systems that assist users in
finding useful information on the Web.
Understand and use third party software.

11
Course logistics

Wednesdays 6-8 PM in 415 CEPSR
Office hour: TBA in 703 CEPSR
Web site: http://www.cs.columbia.edu/radev/SET07
Instructor: Dragomir Radev (PhD, Columbia-CS), associate professor at U. Michigan (EECS and SI)
Email: radev_at_umich.edu (please do not send me mail at Columbia)
TA: Malek Ben Salem (malek_at_cs.columbia.edu)

12
Course outline

Classic document retrieval: storing, indexing, retrieval.
Web retrieval: crawling, query processing.
Text and web mining: classification, clustering.
Network analysis: random graph models, centrality, diameter and clustering coefficient.

13
Syllabus

(Jan 17) Introduction
(Jan 17) Queries and Documents. Models of
Information retrieval. The Boolean model. The
Vector model.
(Jan 24) Document preprocessing. Tokenization.
Stemming. The Porter algorithm. Storing, indexing
and searching text. Inverted indexes.
(Jan 24) Word distributions. The Zipf
distribution. The Benford distribution. Heaps'
law. TFIDF.
(Jan 31) Vector space similarity and ranking.
Relevance feedback and query expansion.
(Jan 31) Retrieval Evaluation. Precision and
Recall. F-measure. Reference collections. The
TREC conferences.
String matching. Approximate matching.
Compression and coding. Optimal codes.

14
Syllabus

Vector space clustering. k-means clustering. EM
clustering.
Text classification. Linear classifiers.
k-nearest neighbors. Naive Bayes.
Maximum margin classifiers. Support vector
machines.
Singular value decomposition and Latent Semantic
Indexing.
Probabilistic models of IR. Document models.
Language models. Burstiness.
Crawling the Web. Hyperlink analysis. Measuring
the Web.
Hypertext retrieval. Web-based IR. Document
closures.
Random graph models. Properties of random graphs:
clustering coefficient, betweenness, diameter,
giant connected component, degree distribution.
Social network analysis. Small worlds and
scale-free networks. Power law distributions.

15
Syllabus

Models of the Web. The Bow-tie model.
Graph-based methods. Harmonic functions. Random
walks. PageRank.
Hubs and authorities. HITS and SALSA. Bipartite
graphs.
Webometrics. Measuring the size of the Web.
Focused crawling. Resource discovery. Discovering
communities.
Collaborative filtering. Recommendation systems.
Information extraction. Hidden Markov Models.
Conditional Random Fields.
Adversarial IR. Spamming and anti-spamming
methods.
Additional topics, e.g., natural language
processing, XML retrieval, text tiling, text
summarization, question answering, spectral
clustering, human behavior on the web,
semi-supervised learning

16
Readings

Required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html), freely available, mirrored on January 2, 2007.
Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
Papers from SIGIR, WWW and journals (to be announced in class).

17
Prerequisites

Linear algebra: vectors, matrices, and operations on them, determinants, eigenvectors.
Calculus: differentiation, finding extrema of functions.
Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.
Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
Required: CS account (check CS web site)

18
Course requirements

Four (mostly programming) assignments (40%)
Some of them will be in Perl. The rest can be done in any appropriate language.
Reading assignments (10%)
Final project (40%)
Students will present their final project in a poster session in class.
Class participation (10%)
No final exam.

19
Final project format

Research paper - using the SIGIR format. Students
will be in charge of problem formulation,
literature survey, hypothesis formulation,
experimental design, implementation, and possibly
submission to a conference like SIGIR or WWW.
Software system - develop a working system or
API. Students will be responsible for identifying
a niche problem, implementing it and deploying
it, either on the Web or as an open-source
downloadable tool. The system can be either stand
alone or an extension to an existing one.

20
Project ideas

Build a question answering system.
Build a language identification system.
Social network analysis from the Web.
Participate in the Netflix challenge.
Query log analysis.
Build models of Web evolution.
Information diffusion in blogs or web.
Author-topic models of web pages.
Using the web for machine translation.
Building evolving models of web documents.
News recommendation system.
Compress the text of Wikipedia (losslessly).
Spelling correction using query logs.
Automatic query expansion.

21
Available corpora

Enron email
CIA world factbook
DBLP: papers in CS
NNDB: information about people
BLOGS: collection of blogs
US congressional speeches
AOL queries
Netflix recommendations
IMDB
NIE: news articles
PUBMED: biomedical paper abstracts
Wikipedia
ACL Anthology: collection of papers in NLP/CL
DOTGOV: download of .GOV
biocreative: biomedical papers
WT100G: 100GB download of the web
Google n-grams
webfreq: frequency of words on the web
SMS corpus
Citeseer: CS papers
DMOZ: the open directory project
corpus of paraphrases
multilingual parallel parliamentary proceedings
textual entailment corpus
question answering corpus
summarization corpus
various text classification corpora (Reuters-21578, 20NG)
Peekaboom (from the game)

22
Related courses elsewhere

Stanford (Chris Manning, Prabhakar Raghavan, and
Hinrich Schuetze)
Cornell (Jon Kleinberg)
CMU (Yiming Yang and Jamie Callan)
UMass (James Allan)
UTexas (Ray Mooney)
Illinois (Chengxiang Zhai)
Johns Hopkins (David Yarowsky)
For a long list of courses related to Search Engines, Natural Language Processing, Machine Learning look here: http://clair.si.umich.edu:8080/wordpress/?p=11

23
SET Winter 2007
2. Models of Information retrieval
The Vector model
The Boolean model
24
Sample queries (from Excite)

In what year did baseball become an offical
sport?
play station codes . com
birth control and depression
government
"WorkAbility I"conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's
Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running

25
Key Terms Used in IR

QUERY: a representation of what the user is looking for; can be a list of words or a phrase.
DOCUMENT: an information entity that the user wants to retrieve.
COLLECTION: a set of documents.
INDEX: a representation of information that makes querying easier.
TERM: word or concept that appears in a document or a query.

26
Mappings and abstractions
[Figure: mappings between Reality, Data, Information need, and Query. From Robert Korfhage's book.]
27
Documents

Not just printed paper
Can be records, pages, sites, images, people,
movies
Document encoding (Unicode)
Document representation
Document preprocessing

28
Sample query sessions (from AOL)

toley spies grames -> tolley spies games -> totally spies games
tajmahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny 11209
do you love me like you say -> do you love me like you say lyrics -> do you love me like you say lyrics marvin gaye

29
Characteristics of user queries

Sessions: users revisit their queries.
Very short queries: typically 2 words long.
A large number of typos.
A small number of popular queries. A long tail of infrequent ones.
Almost no use of advanced query operators, with the exception of double quotes.

30
Queries as documents

Advantages
Mathematically easier to manage
Problems
Different lengths
Syntactic differences
Repetitions of words (or lack thereof)

31
Document representations

Term-document matrix (m x n)
Document-document matrix (n x n)
Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
Typical example on the Web: n = 30,000,000,000, m = 1,000,000
Boolean vs. integer-valued matrices
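
A minimal sketch of the two matrix flavors named on this slide, assuming whitespace tokenization and a three-document toy collection (both are illustrative assumptions, not part of the course material). The function name `term_document_matrix` is hypothetical. Rows are terms (m) and columns are documents (n).

```python
from collections import Counter

def term_document_matrix(docs, boolean=False):
    """Return (terms, matrix); matrix[i][j] describes term i in document j."""
    # Count term occurrences per document (naive whitespace tokenization).
    counts = [Counter(text.lower().split()) for text in docs]
    # The vocabulary is the sorted union of all terms seen.
    terms = sorted(set().union(*counts))
    matrix = []
    for term in terms:
        row = [c[term] for c in counts]          # integer-valued: raw counts
        if boolean:
            row = [1 if v > 0 else 0 for v in row]  # Boolean: presence only
        matrix.append(row)
    return terms, matrix

# Toy collection (assumed for illustration).
docs = ["computer information retrieval",
        "computer retrieval retrieval",
        "information"]
terms, counts = term_document_matrix(docs)
_, incidence = term_document_matrix(docs, boolean=True)
for term, c_row, b_row in zip(terms, counts, incidence):
    print(term, c_row, b_row)
```

At the sizes quoted above a dense matrix like this would not fit in memory, which is one reason real systems store only the nonzero entries.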

32
Major IR models

Boolean
Vector
Probabilistic
Language modeling
Fuzzy retrieval
Latent semantic indexing

33
The Boolean model
Venn diagrams
[Figure: Venn diagram with labels w, x, y, z, D1, D2]
34
Boolean queries

Operators: AND, OR, NOT, parentheses
Example:
CLEVELAND AND NOT OHIO
(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
Ambiguous uses of AND and OR in human language
Exclusive vs. inclusive OR
Restrictive operator: AND or OR?

35
Canonical forms of queries

De Morgan's Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)

Normal forms
Conjunctive normal form (CNF)
Disjunctive normal form (DNF)
Reference librarians prefer CNF - why?
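
De Morgan's laws can be checked mechanically by enumerating all truth assignments for A and B. The following is a small sketch of that check; the function name `de_morgan_holds` is hypothetical.

```python
from itertools import product

def de_morgan_holds():
    """Check both De Morgan identities over every truth assignment of A and B."""
    for a, b in product([False, True], repeat=2):
        # NOT (A AND B) must equal (NOT A) OR (NOT B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
        # NOT (A OR B) must equal (NOT A) AND (NOT B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())
```

These identities are what allow a negation to be pushed inward when rewriting a query into CNF or DNF.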

36
Evaluating Boolean queries

Incidence vectors
CLEVELAND: 1100010
OHIO: 1000111
Examples
CLEVELAND AND OHIO
CLEVELAND AND NOT OHIO
CLEVELAND OR OHIO
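
One way to evaluate these queries is with bitwise operations on the incidence vectors, one bit per document. The sketch below assumes the seven-document collection implied by the vectors above; the helper `show` and the constants are hypothetical names.

```python
N_DOCS = 7                       # the vectors above have one bit per document
MASK = (1 << N_DOCS) - 1         # keeps results within N_DOCS bits

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def show(bits):
    """Format an incidence vector as a fixed-width bit string."""
    return format(bits & MASK, "0{}b".format(N_DOCS))

both = show(cleveland & ohio)             # CLEVELAND AND OHIO
only_cleveland = show(cleveland & ~ohio)  # CLEVELAND AND NOT OHIO
either = show(cleveland | ohio)           # CLEVELAND OR OHIO

print(both)            # 1000010
print(only_cleveland)  # 0100000
print(either)          # 1100111
```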

37
Exercise

D1: computer information retrieval
D2: computer retrieval
D3: information
D4: computer information
Q1: information AND retrieval
Q2: information AND NOT computer
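
The exercise can also be worked with a small inverted index that maps each term to the set of documents containing it, so that AND becomes set intersection and AND NOT becomes set difference. This is a sketch under the assumption of whitespace tokenization; `build_index` is a hypothetical helper name.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

# Q1: information AND retrieval -> intersection of the posting sets
q1 = index["information"] & index["retrieval"]
# Q2: information AND NOT computer -> set difference
q2 = index["information"] - index["computer"]

print(sorted(q1))  # ['D1']
print(sorted(q2))  # ['D3']
```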

38
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
39
How to deal with?

Multi-word phrases?
Document ranking?

40
The Vector model
[Figure: documents Doc 1, Doc 2, Doc 3 shown in a space with axes Term 1, Term 2, Term 3]
41
Vector queries

Each document is represented as a vector
Non-efficient representation
Dimensional compatibility

42
The matching process

Document space
Matching is done between a document and a query
(or between two documents)
Distance vs. similarity measures.
Euclidean distance, Manhattan distance, Word
overlap, Jaccard coefficient, etc.
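
The measures listed above can be sketched in a few lines. This assumes that distances are computed on equal-length numeric vectors and that word overlap and the Jaccard coefficient are computed on sets of terms. The function names, and the convention of returning 0.0 for two empty sets, are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def word_overlap(x_terms, y_terms):
    """Number of distinct terms shared by the two term lists."""
    return len(set(x_terms) & set(y_terms))

def jaccard(x_terms, y_terms):
    """Shared terms divided by all distinct terms in either list."""
    x, y = set(x_terms), set(y_terms)
    union = x | y
    # Assumed convention: two empty sets get similarity 0.0.
    return len(x & y) / len(union) if union else 0.0

print(euclidean([0, 0], [3, 4]))                    # 5.0
print(manhattan([0, 0], [3, 4]))                    # 7
print(jaccard("a b c".split(), "b c d".split()))    # 0.5
```

Distances grow as documents become less alike, while similarities such as Jaccard shrink, so a ranking function has to sort in the matching direction.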

43
Miscellaneous similarity measures

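The Cosine measure (normalized dot product)

$$\sigma(D,Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}$$

For documents viewed as term sets X and Y:

$$\sigma(D,Q) = \frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}}$$

The Jaccard coefficient

$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$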
44
Exercise

Compute the cosine scores σ(D1,D2) and σ(D1,D3) for the documents D1 = <1,3>, D2 = <100,300> and D3 = <3,1>.
Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
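
A short sketch of the cosine computation for the three vectors in this exercise. The function name `cosine` and the choice to return 0.0 for a zero vector are assumptions. Results are floating point, so they should be compared with a tolerance rather than for exact equality.

```python
import math

def cosine(x, y):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumed convention: similarity with a zero vector is 0.0.
    return dot / norm if norm else 0.0

d1, d2, d3 = (1, 3), (100, 300), (3, 1)

# D2 is D1 scaled by 100, so the angle between them is zero
# and the cosine is 1 up to rounding.
print(cosine(d1, d2))
# dot = 1*3 + 3*1 = 6 and both norms are sqrt(10), so the cosine is 6/10.
print(cosine(d1, d3))
```

The cosine ignores vector length, which is why D1 and D2 score as identical here even though they are far apart in Euclidean distance.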

45
Readings