Search Engine Technology (http://www.cs.columbia.edu/radev/SET07.html) - PowerPoint PPT Presentation

1
Search Engine Technology
http://www.cs.columbia.edu/radev/SET07.html
  • January 17, 2007
  • Prof. Dragomir R. Radev
  • radev@umich.edu

2
SET Winter 2007
  • Introduction

3-6
(No transcript: slides 3-6 are image-only)
7
Examples of search engines
  • Conventional (library catalog). Search by
    keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, Yahoo!). Search by keywords. Limited search using queries in natural language.
  • Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
  • Question answering systems (Ask, NSIR, AnswerBus). Search in (restricted) natural language.
  • Clustering systems (Vivisimo, Clusty)
  • Research systems (Lemur, Nutch)

8
What does it take to build a search engine?
  • Decide what to index
  • Collect it
  • Index it (efficiently; see the inverted-index sketch below)
  • Keep the index up to date
  • Provide user-friendly query facilities
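
To make the indexing step concrete, here is a minimal Perl sketch (Perl being the language recommended later in the deck) of an in-memory inverted index. The toy collection reuses the four documents from the Boolean exercise later in the deck; the hash names and output format are made up for the illustration, not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy collection: the four documents from the Boolean exercise later in the deck.
my %docs = (
    1 => "computer information retrieval",
    2 => "computer retrieval",
    3 => "information",
    4 => "computer information",
);

# Build the inverted index: term => { doc_id => term frequency }.
my %index;
while ( my ($id, $text) = each %docs ) {
    $index{$_}{$id}++ for split /\s+/, lc $text;
}

# Look up the posting list for each query term.
for my $term (qw(information retrieval)) {
    my @postings = sort { $a <=> $b } keys %{ $index{$term} || {} };
    print "$term -> documents @postings\n";
}
```

A real engine would add posting-list compression, on-disk storage, and incremental updates; the sketch only shows the core data structure.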

9
What else?
  • Understand the structure of the web for efficient
    crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance

10
Goals of the course
  • Understand how to collect, store, index, analyze,
    search and present large quantities of
    unstructured text.
  • Understand the dynamics of the Web by building
    appropriate mathematical models.
  • Build working systems that assist users in
    finding useful information on the Web.
  • Understand and use third party software.

11
Course logistics
  • Wednesdays 6-8 PM in 415 CEPSR
  • Office hour: TBA, in 703 CEPSR
  • Web site: http://www.cs.columbia.edu/radev/SET07
  • Instructor: Dragomir Radev (PhD, Columbia CS), associate professor at U. Michigan (EECS and SI)
  • Email: radev@umich.edu (please do not send me mail at Columbia)
  • TA: Malek Ben Salem (malek@cs.columbia.edu)

12
Course outline
  • Classic document retrieval: storing, indexing, retrieval.
  • Web retrieval: crawling, query processing.
  • Text and web mining: classification, clustering.
  • Network analysis: random graph models, centrality, diameter and clustering coefficient.

13
Syllabus
  • (Jan 17) Introduction
  • (Jan 17) Queries and Documents. Models of
    Information retrieval. The Boolean model. The
    Vector model.
  • (Jan 24) Document preprocessing. Tokenization.
    Stemming. The Porter algorithm. Storing, indexing
    and searching text. Inverted indexes.
  • (Jan 24) Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TF-IDF.
  • (Jan 31) Vector space similarity and ranking.
    Relevance feedback and query expansion.
  • (Jan 31) Retrieval Evaluation. Precision and
    Recall. F-measure. Reference collections. The
    TREC conferences.
  • String matching. Approximate matching.
  • Compression and coding. Optimal codes.

14
Syllabus
  • Vector space clustering. k-means clustering. EM
    clustering.
  • Text classification. Linear classifiers.
    k-nearest neighbors. Naive Bayes.
  • Maximum margin classifiers. Support vector
    machines.
  • Singular value decomposition and Latent Semantic
    Indexing.
  • Probabilistic models of IR. Document models.
    Language models. Burstiness.
  • Crawling the Web. Hyperlink analysis. Measuring
    the Web.
  • Hypertext retrieval. Web-based IR. Document
    closures.
  • Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
  • Social network analysis. Small worlds and
    scale-free networks. Power law distributions.

15
Syllabus
  • Models of the Web. The Bow-tie model.
  • Graph-based methods. Harmonic functions. Random
    walks. PageRank.
  • Hubs and authorities. HITS and SALSA. Bipartite
    graphs.
  • Webometrics. Measuring the size of the Web.
  • Focused crawling. Resource discovery. Discovering
    communities.
  • Collaborative filtering. Recommendation systems.
  • Information extraction. Hidden Markov Models.
    Conditional Random Fields.
  • Adversarial IR. Spamming and anti-spamming
    methods.
  • Additional topics, e.g., natural language
    processing, XML retrieval, text tiling, text
    summarization, question answering, spectral
    clustering, human behavior on the web,
    semi-supervised learning

16
Readings
  • required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html), freely available, mirrored on January 2, 2007.
  • optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
  • papers from SIGIR, WWW and journals (to be
    announced in class).

17
Prerequisites
  • Linear algebra: vectors, matrices, and operations on them, determinants, eigenvectors.
  • Calculus: differentiation, finding extrema of functions.
  • Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.
  • Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
  • Required: CS account (check CS web site)

18
Course requirements
  • Four (mostly programming) assignments (40%)
  • Some of them will be in Perl. The rest can be done in any appropriate language.
  • Reading assignments (10%)
  • Final project (40%)
  • Students will present their final project in a poster session in class.
  • Class participation (10%)
  • No final exam.

19
Final project format
  • Research paper - using the SIGIR format. Students
    will be in charge of problem formulation,
    literature survey, hypothesis formulation,
    experimental design, implementation, and possibly
    submission to a conference like SIGIR or WWW.
  • Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing a solution, and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.

20
Project ideas
  • Build a question answering system.
  • Build a language identification system.
  • Social network analysis from the Web.
  • Participate in the Netflix challenge.
  • Query log analysis.
  • Build models of Web evolution.
  • Information diffusion in blogs or web.
  • Author-topic models of web pages.
  • Using the web for machine translation.
  • Building evolving models of web documents.
  • News recommendation system.
  • Compress the text of Wikipedia (losslessly).
  • Spelling correction using query logs.
  • Automatic query expansion.

21
Available corpora
  • Enron email
  • CIA World Factbook
  • DBLP: papers in CS
  • NNDB: information about people
  • BLOGS: collection of blogs
  • US congressional speeches
  • AOL queries
  • Netflix recommendations
  • IMDB
  • NIE: news articles
  • PUBMED: biomedical paper abstracts
  • Wikipedia
  • ACL Anthology: collection of papers in NLP/CL
  • DOTGOV: download of .GOV
  • BioCreative: biomedical papers
  • WT100G: 100 GB download of the web
  • Google n-grams
  • webfreq: frequency of words on the web
  • SMS corpus
  • CiteSeer: CS papers
  • DMOZ: the Open Directory Project
  • corpus of paraphrases
  • multilingual parallel parliamentary proceedings
  • textual entailment corpus
  • question answering corpus
  • summarization corpus
  • various text classification corpora (Reuters-21578, 20NG)
  • Peekaboom (from the game)

22
Related courses elsewhere
  • Stanford (Chris Manning, Prabhakar Raghavan, and
    Hinrich Schuetze)
  • Cornell (Jon Kleinberg)
  • CMU (Yiming Yang and Jamie Callan)
  • UMass (James Allan)
  • UTexas (Ray Mooney)
  • Illinois (Chengxiang Zhai)
  • Johns Hopkins (David Yarowsky)
  • For a long list of courses related to Search Engines, Natural Language Processing, and Machine Learning, look here: http://clair.si.umich.edu:8080/wordpress/?p=11

23
SET Winter 2007
2. Models of Information Retrieval: The Vector model. The Boolean model
24
Sample queries (from Excite)
  • In what year did baseball become an offical
    sport?
  • play station codes . com
  • birth control and depression
  • government
  • "WorkAbility I"conference
  • kitchen appliances
  • where can I find a chines rosewood
  • tiger electronics
  • 58 Plymouth Fury
  • How does the character Seyavash in Ferdowsi's
    Shahnameh exhibit characteristics of a hero?
  • emeril Lagasse
  • Hubble
  • M.S Subalaksmi
  • running

25
Key Terms Used in IR
  • QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user wants to retrieve
  • COLLECTION: a set of documents
  • INDEX: a representation of information that makes querying easier
  • TERM: a word or concept that appears in a document or a query

26
Mappings and abstractions
(Diagram from Robert Korfhage's book: Reality is mapped to Data, and an Information need is mapped to a Query.)
27
Documents
  • Not just printed paper
  • Can be records, pages, sites, images, people,
    movies
  • Document encoding (Unicode)
  • Document representation
  • Document preprocessing

28
Sample query sessions (from AOL)
  • toley spies grames → tolley spies games → totally spies games
  • tajmahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny 11209
  • do you love me like you say → do you love me like you say lyrics → do you love me like you say lyrics marvin gaye

29
Characteristics of user queries
  • Sessions: users revisit their queries.
  • Very short queries: typically 2 words long.
  • A large number of typos.
  • A small number of popular queries. A long tail of infrequent ones.
  • Almost no use of advanced query operators, with the exception of double quotes.

30
Queries as documents
  • Advantages
  • Mathematically easier to manage
  • Problems
  • Different lengths
  • Syntactic differences
  • Repetitions of words (or lack thereof)

31
Document representations
  • Term-document matrix (m x n)
  • Document-document matrix (n x n)
  • Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
  • Typical example on the Web: n = 30,000,000,000, m = 1,000,000
  • Boolean vs. integer-valued matrices (see the sketch below)
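
To make the Boolean vs. integer-valued distinction concrete, here is an illustrative Perl sketch that builds a term frequency matrix over the same toy four-document collection used earlier and derives its Boolean counterpart. The collection, variable names, and output format are assumptions made for the example, not part of the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy collection: n = 4 documents, used only to illustrate the matrix shapes.
my @docs = (
    "computer information retrieval",
    "computer retrieval",
    "information",
    "computer information",
);

# Term-document matrix stored sparsely: $tf{$term}[$doc] = raw count.
my %tf;
for my $d (0 .. $#docs) {
    $tf{$_}[$d]++ for split /\s+/, lc $docs[$d];
}

# Print each term's integer-valued row and its Boolean (0/1) version.
for my $term (sort keys %tf) {
    my @counts  = map { defined $_ ? $_ : 0 } @{ $tf{$term} }[0 .. $#docs];
    my @boolean = map { $_ ? 1 : 0 } @counts;
    printf "%-12s tf=(%s)  boolean=(%s)\n",
        $term, join(',', @counts), join(',', @boolean);
}
```

At Web scale the dense m x n matrix is never materialized; sparse structures like the one above (or the inverted index shown earlier) are used instead.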

32
Major IR models
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

33
The Boolean model
Venn diagrams
(Figure: two overlapping document sets D1 and D2, with regions labeled w, x, y, z.)
34
Boolean queries
  • Operators: AND, OR, NOT, parentheses
  • Examples:
  • CLEVELAND AND NOT OHIO
  • (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
  • Ambiguous uses of AND and OR in human language
  • Exclusive vs. inclusive OR
  • Restrictive operator: AND or OR?

35
Canonical forms of queries
  • De Morgan's Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
  • Normal forms
  • Conjunctive normal form (CNF)
  • Disjunctive normal form (DNF)
  • Reference librarians prefer CNF - why?

36
Evaluating Boolean queries
  • Incidence vectors
  • CLEVELAND = 1100010
  • OHIO = 1000111
  • Examples (evaluated in the sketch below)
  • CLEVELAND AND OHIO
  • CLEVELAND AND NOT OHIO
  • CLEVELAND OR OHIO
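
A minimal Perl sketch of how such queries can be evaluated over the incidence vectors above: each bit string is turned into the set of document positions it covers, and AND, AND NOT, and OR become set intersection, difference, and union. The helper routine and the document numbering (positions 0-6) are assumptions made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Incidence vectors from the slide: one bit per document, seven documents.
my %incidence = (
    CLEVELAND => '1100010',
    OHIO      => '1000111',
);

# Turn a bit string into a hash set of the document positions that contain the term.
sub to_set {
    my ($bits) = @_;
    my @b = split //, $bits;
    return { map { $_ => 1 } grep { $b[$_] } 0 .. $#b };
}

my $cleveland = to_set( $incidence{CLEVELAND} );
my $ohio      = to_set( $incidence{OHIO} );

# AND: intersection of the two sets.
my @and     = grep { $ohio->{$_} } sort keys %$cleveland;
# AND NOT: documents in CLEVELAND but not in OHIO.
my @and_not = grep { !$ohio->{$_} } sort keys %$cleveland;
# OR: union of the two sets.
my %union   = ( %$cleveland, %$ohio );
my @or      = sort keys %union;

print "CLEVELAND AND OHIO     -> documents @and\n";
print "CLEVELAND AND NOT OHIO -> documents @and_not\n";
print "CLEVELAND OR OHIO      -> documents @or\n";
```

Production systems evaluate the same operations directly on compressed posting lists or bitmaps rather than on hash sets, but the set semantics is identical.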

37
Exercise
  • D1 = computer information retrieval
  • D2 = computer retrieval
  • D3 = information
  • D4 = computer information
  • Q1 = information AND retrieval
  • Q2 = information AND NOT computer

38
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
39
How to deal with?
  • Multi-word phrases?
  • Document ranking?

40
The Vector model
(Figure: Doc 1, Doc 2, and Doc 3 drawn as vectors in a three-dimensional space whose axes are Term 1, Term 2, and Term 3.)
41
Vector queries
  • Each document is represented as a vector
  • Inefficient representation
  • Dimensional compatibility

42
The matching process
  • Document space
  • Matching is done between a document and a query
    (or between two documents)
  • Distance vs. similarity measures.
  • Euclidean distance, Manhattan distance, Word
    overlap, Jaccard coefficient, etc.

43
Miscellaneous similarity measures
  • The Cosine measure (normalized dot product)

\sigma(D,Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}} = \frac{X \cdot Y}{|X|\,|Y|}

  • The Jaccard coefficient

\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}
44
Exercise
  • Compute the cosine scores \sigma(D1,D2) and \sigma(D1,D3) for the documents D1 = <1,3>, D2 = <100,300>, and D3 = <3,1>.
  • Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients. (A sketch of these computations follows below.)
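
A Perl sketch of these computations, for checking answers. Cosine, Euclidean, and Manhattan follow the definitions above; for the Jaccard coefficient the sketch assumes the common vector (Tanimoto) generalization X·Y / (|X|^2 + |Y|^2 - X·Y), which reduces to the set form on the previous slide when the vectors are Boolean.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Document vectors from the exercise.
my %vec = (
    D1 => [ 1, 3 ],
    D2 => [ 100, 300 ],
    D3 => [ 3, 1 ],
);

sub dot       { my ($x, $y) = @_; my $s = 0; $s += $x->[$_] * $y->[$_] for 0 .. $#$x; return $s }
sub cosine    { my ($x, $y) = @_; return dot($x, $y) / ( sqrt( dot($x, $x) ) * sqrt( dot($y, $y) ) ) }
sub euclid    { my ($x, $y) = @_; my $s = 0; $s += ( $x->[$_] - $y->[$_] ) ** 2 for 0 .. $#$x; return sqrt($s) }
sub manhattan { my ($x, $y) = @_; my $s = 0; $s += abs( $x->[$_] - $y->[$_] ) for 0 .. $#$x; return $s }

# One common vector generalization of the Jaccard coefficient (Tanimoto);
# on Boolean vectors it equals |intersection| / |union|.
sub jaccard   { my ($x, $y) = @_; my $d = dot($x, $y); return $d / ( dot($x, $x) + dot($y, $y) - $d ) }

for my $pair ( [qw(D1 D2)], [qw(D1 D3)] ) {
    my ($p, $q) = @$pair;
    printf "sigma(%s,%s): cos=%.3f  euclid=%.3f  manhattan=%.3f  jaccard=%.3f\n",
        $p, $q,
        cosine(    $vec{$p}, $vec{$q} ),
        euclid(    $vec{$p}, $vec{$q} ),
        manhattan( $vec{$p}, $vec{$q} ),
        jaccard(   $vec{$p}, $vec{$q} );
}
```

Note how D1 and D2 get a cosine of 1.0 despite a large Euclidean distance: cosine ignores vector length, which is exactly why it is preferred for documents of different lengths.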

45
Readings
  • For January 24: MRS1, MRS2, MRS5 (Zipf)
  • For January 31: MRS7, MRS8