Title: Search Engine Technology http://www.cs.columbia.edu/radev/SET07.html
1Search Engine Technology
http://www.cs.columbia.edu/radev/SET07.html
- January 17, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2SET Winter 2007
7Examples of search engines
- Conventional (library catalog): search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
- Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
- Clustering systems (Vivisimo, Clusty)
- Research systems (Lemur, Nutch)
8What does it take to build a search engine?
- Decide what to index
- Collect it
- Index it (efficiently)
- Keep the index up to date
- Provide user-friendly query facilities
9What else?
- Understand the structure of the web for efficient crawling
- Understand user information needs
- Preprocess text and other unstructured data
- Cluster data
- Classify data
- Evaluate performance
10Goals of the course
- Understand how to collect, store, index, analyze, search and present large quantities of unstructured text.
- Understand the dynamics of the Web by building appropriate mathematical models.
- Build working systems that assist users in finding useful information on the Web.
- Understand and use third party software.
11Course logistics
- Wednesdays 6-8 PM in 415 CEPSR
- Office hour: TBA in 703 CEPSR
- Web site: http://www.cs.columbia.edu/radev/SET07
- Instructor: Dragomir Radev (PhD, Columbia-CS), associate professor at U. Michigan (EECS and SI)
- Email: radev_at_umich.edu (please do not send me mail at Columbia)
- TA: Malek Ben Salem (malek_at_cs.columbia.edu)
12Course outline
- Classic document retrieval: storing, indexing, retrieval.
- Web retrieval: crawling, query processing.
- Text and web mining: classification, clustering.
- Network analysis: random graph models, centrality, diameter and clustering coefficient.
13Syllabus
- (Jan 17) Introduction
- (Jan 17) Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
- (Jan 24) Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.
- (Jan 24) Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TFIDF.
- (Jan 31) Vector space similarity and ranking. Relevance feedback and query expansion.
- (Jan 31) Retrieval Evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- String matching. Approximate matching.
- Compression and coding. Optimal codes.
14Syllabus
- Vector space clustering. k-means clustering. EM clustering.
- Text classification. Linear classifiers. k-nearest neighbors. Naive Bayes.
- Maximum margin classifiers. Support vector machines.
- Singular value decomposition and Latent Semantic Indexing.
- Probabilistic models of IR. Document models. Language models. Burstiness.
- Crawling the Web. Hyperlink analysis. Measuring the Web.
- Hypertext retrieval. Web-based IR. Document closures.
- Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
- Social network analysis. Small worlds and scale-free networks. Power law distributions.
15Syllabus
- Models of the Web. The Bow-tie model.
- Graph-based methods. Harmonic functions. Random walks. PageRank.
- Hubs and authorities. HITS and SALSA. Bipartite graphs.
- Webometrics. Measuring the size of the Web.
- Focused crawling. Resource discovery. Discovering communities.
- Collaborative filtering. Recommendation systems.
- Information extraction. Hidden Markov Models. Conditional Random Fields.
- Adversarial IR. Spamming and anti-spamming methods.
- Additional topics, e.g., natural language processing, XML retrieval, text tiling, text summarization, question answering, spectral clustering, human behavior on the web, semi-supervised learning
16Readings
- required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html), freely available, mirrored on January 2, 2007.
- optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
- papers from SIGIR, WWW and journals (to be announced in class).
17Prerequisites
- Linear algebra: vectors, matrices, and operations on them, determinants, eigenvectors.
- Calculus: differentiation, finding extrema of functions.
- Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.
- Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
- Required: CS account (check CS web site)
18Course requirements
- Four (mostly programming) assignments (40%)
- Some of them will be in Perl. The rest can be done in any appropriate language.
- Reading assignments (10%)
- Final project (40%)
- Students will present their final project in a poster session in class.
- Class participation (10%)
- No final exam.
19Final project format
- Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.
- Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing a solution and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.
20Project ideas
- Build a question answering system.
- Build a language identification system.
- Social network analysis from the Web.
- Participate in the Netflix challenge.
- Query log analysis.
- Build models of Web evolution.
- Information diffusion in blogs or web.
- Author-topic models of web pages.
- Using the web for machine translation.
- Building evolving models of web documents.
- News recommendation system.
- Compress the text of Wikipedia (losslessly).
- Spelling correction using query logs.
- Automatic query expansion.
21Available corpora
- Enron email
- CIA world factbook
- DBLP: papers in CS
- NNDB: information about people
- BLOGS: collection of blogs
- US congressional speeches
- AOL queries
- Netflix recommendations
- IMDB
- NIE news articles
- PUBMED: biomedical paper abstracts
- Wikipedia
- ACL Anthology: collection of papers in NLP/CL
- DOTGOV: download of .GOV
- biocreative: biomedical papers
- WT100G: 100GB download of the web
- Google n-grams
- webfreq: frequency of words on the web
- SMS corpus
- Citeseer: CS papers
- DMOZ: the open directory project
- corpus of paraphrases
- multilingual parallel parliamentary proceedings
- textual entailment corpus
- question answering corpus
- summarization corpus
- various text classification corpora (Reuters-21578, 20NG)
- Peekaboom (from the game)
22Related courses elsewhere
- Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)
- Cornell (Jon Kleinberg)
- CMU (Yiming Yang and Jamie Callan)
- UMass (James Allan)
- UTexas (Ray Mooney)
- Illinois (Chengxiang Zhai)
- Johns Hopkins (David Yarowsky)
- For a long list of courses related to Search Engines, Natural Language Processing, and Machine Learning, look here: http://clair.si.umich.edu:8080/wordpress/?p=11
23SET Winter 2007
2. Models of Information retrieval
The Vector model
The Boolean model
24Sample queries (from Excite)
- In what year did baseball become an offical sport?
- play station codes . com
- birth control and depression
- government
- "WorkAbility I"conference
- kitchen appliances
- where can I find a chines rosewood
- tiger electronics
- 58 Plymouth Fury
- How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
- emeril Lagasse
- Hubble
- M.S Subalaksmi
- running
25Key Terms Used in IR
- QUERY: a representation of what the user is looking for; can be a list of words or a phrase.
- DOCUMENT: an information entity that the user wants to retrieve
- COLLECTION: a set of documents
- INDEX: a representation of information that makes querying easier
- TERM: word or concept that appears in a document or a query
26Mappings and abstractions
Reality -> Data
Information need -> Query
(From Robert Korfhage's book)
27Documents
- Not just printed paper
- Can be records, pages, sites, images, people, movies
- Document encoding (Unicode)
- Document representation
- Document preprocessing
28Sample query sessions (from AOL)
- toley spies grames -> tolley spies games -> totally spies games
- tajmahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny -> taj mahal restaurant brooklyn ny 11209
- do you love me like you say -> do you love me like you say lyrics -> do you love me like you say lyrics marvin gaye
29Characteristics of user queries
- Sessions: users revisit their queries.
- Very short queries: typically 2 words long.
- A large number of typos.
- A small number of popular queries. A long tail of infrequent ones.
- Almost no use of advanced query operators, with the exception of double quotes
30Queries as documents
- Advantages
- Mathematically easier to manage
- Problems
- Different lengths
- Syntactic differences
- Repetitions of words (or lack thereof)
31Document representations
- Term-document matrix (m x n)
- Document-document matrix (n x n)
- Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
- Typical example on the Web: n = 30,000,000,000, m = 1,000,000
- Boolean vs. integer-valued matrices
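A minimal sketch of how a term-document matrix (m x n) might be built, assuming plain lowercase whitespace tokenization. The toy collection and the function names `term_document_matrix` and `to_boolean` are illustrative assumptions, not part of the course material. It produces both the integer-valued matrix and its Boolean counterpart.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, counts), where counts[i][j] is the number of
    times terms[i] occurs in docs[j]. Rows are terms (m), columns
    are documents (n). Tokenization is a lowercase whitespace split,
    an assumption made for brevity."""
    tokenized = [Counter(d.lower().split()) for d in docs]
    terms = sorted(set().union(*tokenized))
    counts = [[tf[t] for tf in tokenized] for t in terms]
    return terms, counts

def to_boolean(counts):
    """Collapse integer counts to 0/1 incidence values."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]

# Hypothetical toy collection with n = 3 documents.
docs = ["web search web crawling", "search engine", "web graph"]
terms, counts = term_document_matrix(docs)
incidence = to_boolean(counts)

for term, row in zip(terms, counts):
    print(f"{term:10s} {row}")
```

At Web scale such a matrix is extremely sparse, so a dense list of lists like this one is only suitable for illustration.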
32Major IR models
- Boolean
- Vector
- Probabilistic
- Language modeling
- Fuzzy retrieval
- Latent semantic indexing
33The Boolean model
Venn diagrams (figure with sets D1 and D2 and regions labeled w, x, y, z)
34Boolean queries
- Operators: AND, OR, NOT, parentheses
- Example
- CLEVELAND AND NOT OHIO
- (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
- Ambiguous uses of AND and OR in human language
- Exclusive vs. inclusive OR
- Restrictive operator: AND or OR?
35Canonical forms of queries
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
- Normal forms
- Conjunctive normal form (CNF)
- Disjunctive normal form (DNF)
- Reference librarians prefer CNF - why?
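The two identities above (De Morgan's laws) and the equivalence of a query's CNF and DNF forms can be checked by brute force over all truth assignments. The sketch below assumes a hypothetical helper `equivalent` and a made-up three-term query; neither comes from the slides.

```python
from itertools import product

def equivalent(f, g, nvars):
    """True if Boolean functions f and g agree on every one of the
    2**nvars truth assignments (exhaustive truth-table check)."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=nvars)
    )

# De Morgan's laws.
law1 = equivalent(lambda a, b: not (a and b),
                  lambda a, b: (not a) or (not b), 2)
law2 = equivalent(lambda a, b: not (a or b),
                  lambda a, b: (not a) and (not b), 2)

# A hypothetical query written in DNF and in CNF.
dnf = lambda a, b, c: (a and b) or c          # (A AND B) OR C
cnf = lambda a, b, c: (a or c) and (b or c)   # (A OR C) AND (B OR C)
same_query = equivalent(dnf, cnf, 3)

print(law1, law2, same_query)
```

The exhaustive check is exponential in the number of terms, so it is only practical for small queries.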
36Evaluating Boolean queries
- Incidence vectors
- CLEVELAND: 1100010
- OHIO: 1000111
- Examples
- CLEVELAND AND OHIO
- CLEVELAND AND NOT OHIO
- CLEVELAND OR OHIO
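The three example queries can be evaluated with bitwise operations on the incidence vectors. This is a sketch, assuming each vector is stored as an integer with one bit per document (7 documents here, leftmost bit first); the names `N_DOCS`, `MASK` and `show` are illustrative.

```python
N_DOCS = 7
MASK = (1 << N_DOCS) - 1  # keeps NOT results within 7 bits

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def show(bits):
    """Render an incidence vector as a zero-padded bit string."""
    return format(bits, f"0{N_DOCS}b")

and_result = cleveland & ohio              # CLEVELAND AND OHIO
and_not_result = cleveland & ~ohio & MASK  # CLEVELAND AND NOT OHIO
or_result = cleveland | ohio               # CLEVELAND OR OHIO

print(show(and_result))      # 1000010
print(show(and_not_result))  # 0100000
print(show(or_result))       # 1100111
```

The mask matters for NOT: Python integers are unbounded, so `~ohio` alone would set infinitely many high bits.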
37Exercise
- D1: computer information retrieval
- D2: computer retrieval
- D3: information
- D4: computer information
- Q1: information AND retrieval
- Q2: information AND NOT computer
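One way to check answers to the exercise above is a small inverted index that maps each term to the set of documents containing it, so that AND becomes set intersection and AND NOT becomes set difference. This is a sketch with an assumed helper `build_index`, not the course's reference solution.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

# Q1: information AND retrieval
q1 = index["information"] & index["retrieval"]
# Q2: information AND NOT computer
q2 = index["information"] - index["computer"]

print(sorted(q1))  # ['D1']
print(sorted(q2))  # ['D3']
```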
38Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
39How to deal with?
- Multi-word phrases?
- Document ranking?
40The Vector model
(Figure: documents Doc 1, Doc 2 and Doc 3 shown as vectors in a space with axes Term 1, Term 2 and Term 3)
41Vector queries
- Each document is represented as a vector
- Inefficient representation
- Dimensional compatibility
42The matching process
- Document space
- Matching is done between a document and a query (or between two documents)
- Distance vs. similarity measures.
- Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
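The measures listed above can be sketched in a few lines, assuming documents are given either as equal-length numeric vectors (for the two distances) or as collections of words (for overlap and Jaccard). The function names and sample inputs are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def word_overlap(a, b):
    """Number of distinct words the two texts share."""
    return len(set(a) & set(b))

def jaccard(a, b):
    """Shared words divided by all distinct words in either text.
    Two empty texts are treated as identical (an assumed convention)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical inputs.
p, q = [0, 0], [3, 4]
doc = "computer information retrieval".split()
query = "computer retrieval".split()

print(euclidean(p, q))           # 5.0
print(manhattan(p, q))           # 7
print(word_overlap(doc, query))  # 2
print(jaccard(doc, query))
```

Distances grow as items become less alike, while overlap and Jaccard grow as they become more alike, which is the distinction the slide draws between distance and similarity measures.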
43Miscellaneous similarity measures
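- The Cosine measure (normalized dot product)
$$\sigma(D,Q) = \frac{\sum_i d_i \, q_i}{\sqrt{\sum_i d_i^2} \, \sqrt{\sum_i q_i^2}} = \frac{|X \cap Y|}{\sqrt{|X|} \, \sqrt{|Y|}}$$
(the second form applies to binary vectors, with X and Y the term sets of D and Q)
- The Jaccard coefficient
$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$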
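A minimal sketch of the cosine measure as a normalized dot product, assuming documents and queries are equal-length lists of term weights. The function name `cosine` and the zero-vector convention are assumptions made for illustration.

```python
import math

def cosine(d, q):
    """Dot product of d and q divided by the product of their
    Euclidean norms. Returns 0.0 if either vector is all zeros
    (an assumed convention, since the ratio is undefined there)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical vectors.
print(cosine([1, 0], [0, 1]))  # orthogonal vectors: 0.0
print(cosine([2, 0], [5, 0]))  # same direction, different length: 1.0
print(cosine([1, 1], [1, 0]))  # 45 degrees apart
```

Because of the normalization, the score depends only on the angle between the vectors and not on their lengths.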
44Exercise
- Compute the cosine scores σ(D1,D2) and σ(D1,D3) for the documents D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
- Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
45Readings
- For January 24: MRS1, MRS2, MRS5 (Zipf)
- For January 31: MRS7, MRS8