Title: http://comet.lehman.cuny.edu/jung/presentation/presentation.html
1. Introduction to Modern Information Retrieval and Search Engines, and Some Research Issues
- Professor Gwang Jung
- Department of Mathematics and Computer Science
- Lehman College, CUNY
- November 10, Fall 2005
2. Outline
- Introduction to Information Retrieval
- Introduction to Search Engines (IR Systems for the Web)
- Search Engine Example: Google
- Brief Introduction to Semantic Web
- Useful Tools for IR System Building and Resources for Advanced Research
- Research Issues
3. Introduction to Information Retrieval
4. Information Age
5. IR in General
- Information Retrieval in general deals with
- Retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
- User query
- Structured (e.g., a Boolean expression of keywords or terms)
- Unstructured (e.g., terms, a sentence, a document)
- In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
- Efficiency with respect to
- Algorithms, query processing, data organization/structure
- Effectiveness with respect to
- Retrieval results
6. IR Systems
7. Formal Definition of an IR System
- IRS = (T, D, Q, F, R)
- T: set of index terms (terms)
- D: set of documents in a document database
- Q: set of user queries
- F: D x Q → R (retrieval function)
- R: real numbers (RSV, Retrieval Status Value)
- Relevance judgment is given by users.
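As a minimal sketch (not from the slides; the term sets, documents, and the overlap-counting score are illustrative assumptions), the tuple above can be written down directly: D and Q hold term sets, and F maps a (document, query) pair to a real-valued RSV.

```python
# Toy illustration of IRS = (T, D, Q, F, R).
# Here F simply counts shared terms; any real system would use a richer model.
from typing import Dict, List, Set

T: Set[str] = {"information", "retrieval", "search", "engine", "web"}

D: Dict[str, Set[str]] = {                       # document database
    "d1": {"information", "retrieval", "web"},
    "d2": {"search", "engine", "web"},
}

Q: List[Set[str]] = [{"web", "search"}]          # user queries

def F(doc_terms: Set[str], query_terms: Set[str]) -> float:
    """Retrieval function F: D x Q -> R, returning an RSV."""
    return float(len(doc_terms & query_terms))   # overlap-based score

for name, terms in D.items():
    print(name, F(terms, Q[0]))                  # rank documents by RSV
```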
8. IRS versus DBMS
9. IR Systems Focus on Retrieval Effectiveness
- The effective retrieval of relevant information depends on
- User task (formulating an effective query for the information need)
- Indexing
- IR systems in general adopt index terms to represent documents and queries.
- Indexing is the process of developing document representations by assigning index terms to documents (information items).
- Retrieval model (often called IR model) and logical view of documents
- The logical view of documents (logical representation of documents) depends on the IR model.
10. Indexing
- The process of developing document representations by assigning descriptions to information items (texts, documents, or multimedia items).
- Descriptors = index terms = terms
- Descriptors also lead users to participate in formulating information requests.
- Two types of index terms
- Objective: author name, publisher, date of publication
- Subjective: keywords selected from full text
- Two types of indexing methods
- Manual: performed by human experts (for very effective IR systems); may use an ontology
- Automatic: performed by computer hardware and software
11. Indexing Aims (1)
- Recall: the proportion of relevant items (documents) retrieved.
- R = (number of relevant items retrieved) / (total number of relevant items in the database)
- Precision: the proportion of retrieved documents that are relevant.
- P = (number of relevant items retrieved) / (total number of items retrieved)
- Effectiveness of indexing is mainly controlled by term specificity.
- Broader terms may retrieve both useful (relevant) and useless (non-relevant) information items for the user.
- Narrower (specific) index terms favor precision at the expense of recall.
- Index language (set of well-selected index terms)
- T = set of index terms t
- Pre-specified (controlled): easy maintenance, poor adaptability
- Uncontrolled (dynamic): expanded dynamically; terms are taken freely from the texts to be indexed and from the users' queries.
- Synonymous terms can be added to T via a thesaurus, an e-dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
12. Indexing Aims (2)
- Recall and precision values vary from 0 to 1.
- Average users want to have both high recall and high precision.
- In practice, a compromise must be reached (a middle point).
[Figure: recall-precision trade-off curve, with the P and R axes ranging from 0 to 1.0]
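As a small illustration (a sketch with made-up document IDs, not from the slides), recall and precision can be computed directly from the sets of retrieved and relevant documents:

```python
# Toy precision/recall computation; document IDs are hypothetical.
retrieved = {"d1", "d2", "d3", "d4"}          # documents returned by the system
relevant  = {"d2", "d4", "d7"}                # documents judged relevant by the user

hits = retrieved & relevant                   # relevant items that were retrieved
recall    = len(hits) / len(relevant)         # 2 / 3 ≈ 0.67
precision = len(hits) / len(retrieved)        # 2 / 4 = 0.50

print(f"recall={recall:.2f}, precision={precision:.2f}")
```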
13. Steps for Indexing
- Objective attributes of a document are extracted (e.g., title, author, URL, structure).
- Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and).
- Case folding might be performed.
- Stemming might be used.
- The frequency of non-function words is used to specify term importance.
- Term frequency weighting fulfils only one of the indexing aims, i.e., recall.
- Terms that occur rarely in the document database may be used to distinguish documents in which they occur from those in which they do not occur, which could improve precision.
- Document frequency: the number of documents in the collection in which a term tj ∈ T occurs (see the sketch after this list).
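A minimal sketch of these steps (the stop-word list, the two documents, and the naive suffix-stripping "stemmer" are all illustrative assumptions, not from the slides):

```python
# Toy indexing pipeline: case folding, stop-word removal, crude stemming,
# then term frequency (per document) and document frequency (per term).
from collections import Counter

STOP_WORDS = {"of", "then", "this", "and", "the", "a"}   # assumed stop list

docs = {                                                 # hypothetical documents
    "d1": "The retrieval of web documents and web pages",
    "d2": "This search engine indexes web pages",
}

def tokenize(text: str) -> list[str]:
    terms = [w.lower().strip(".,") for w in text.split()]
    terms = [t for t in terms if t and t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") else t for t in terms]  # naive "stemmer"

tf = {d: Counter(tokenize(text)) for d, text in docs.items()}  # term frequency per document
df: Counter = Counter()                                        # document frequency per term
for counts in tf.values():
    df.update(counts.keys())

print(tf["d1"]["web"])   # how often 'web' occurs in d1
print(df["web"])         # in how many documents 'web' occurs
```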
14. Inverted Index File
[Figure: inverted index entries, optionally with postings (the positions of the term in a document)]
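A minimal sketch of such a structure (the documents are hypothetical): each term maps to the documents that contain it, with the term positions kept as postings.

```python
# Toy inverted index: term -> {doc_id: [positions]}.
from collections import defaultdict

docs = {
    "d1": ["information", "retrieval", "and", "web", "search"],
    "d2": ["web", "search", "engine"],
}

index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
for doc_id, terms in docs.items():
    for pos, term in enumerate(terms):
        index[term][doc_id].append(pos)      # postings: positions per document

print(dict(index["web"]))      # {'d1': [3], 'd2': [0]}
print(dict(index["search"]))   # {'d1': [4], 'd2': [1]}
```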
15. Retrieval Models (1)
- Set-theoretic IR models
- Documents are represented by a set of terms
- Well-known set-theoretic models
- Boolean IR model
- Retrieval function is based on Boolean operations (e.g., AND, OR, NOT)
- Query is formulated in Boolean logic
- Fuzzy-set IR model
- Retrieval function is based on fuzzy-set operations
- Query is formulated in Boolean logic
- Rough-set IR model
- Various set operations were examined.
- Ad-hoc Boolean query
- Probabilistic IR model
- Mainly used for probabilistic index term weighting
- Provides a mathematical framework for the well-known tf-idf indexing scheme (a sketch follows this list)
- Language-model based
- Infers the query concept from a document as the retrieval process
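A minimal sketch of tf-idf weighting, using the common tf x log(N/df) formulation (one of several variants; the documents below are hypothetical):

```python
# Toy tf-idf weighting: weight(t, d) = tf(t, d) * log(N / df(t)).
import math
from collections import Counter

docs = {
    "d1": ["web", "search", "engine", "web"],
    "d2": ["information", "retrieval", "web"],
    "d3": ["database", "systems"],
}

N = len(docs)                                            # collection size
tf = {d: Counter(terms) for d, terms in docs.items()}    # term frequency
df = Counter(t for terms in tf.values() for t in terms)  # document frequency

def tfidf(term: str, doc_id: str) -> float:
    if df[term] == 0:
        return 0.0
    return tf[doc_id][term] * math.log(N / df[term])

print(round(tfidf("web", "d1"), 3))      # frequent term, but common across documents
print(round(tfidf("engine", "d1"), 3))   # rare term gets a higher idf factor
```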
16. Retrieval Models (2)
- Vector space model
- Queries and documents are represented as weighted vectors.
- Vectors in the basis are called term vectors and are assumed to be semantically independent.
- A document (or query) is represented as a linear combination of the vectors in the generating set.
- Retrieval function is based on the dot product or cosine measure between document and query vectors (see the sketch after this list).
- Extended Boolean IR model
- Combines characteristics of the vector space IR model with properties of Boolean algebra.
- Retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured using p-norms, where 1 ≤ p ≤ ∞.
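A minimal sketch of cosine-based ranking in the vector space model (the term weights below are illustrative raw values rather than any particular weighting scheme):

```python
# Toy vector space retrieval: rank documents by cosine(query, document).
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {                                   # hypothetical weighted term vectors
    "d1": {"web": 2.0, "search": 1.0, "engine": 1.0},
    "d2": {"information": 1.0, "retrieval": 2.0},
}
query = {"web": 1.0, "search": 1.0}

ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
for d in ranking:
    print(d, round(cosine(query, docs[d]), 3))
```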
17. The Retrieval Process
18. The Retrieval Process in an IR System
19. Introduction to Search Engines (IR Systems for the Web)
20. World Wide Web History
- 1965: Hypertext
- Ted Nelson developed the idea of hypertext in 1965.
- Late 1960s
- Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
- Early 1970s
- ARPANET was developed in the early 1970s.
- 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
- 1989: WWW
- Developed by Tim Berners-Lee and others in 1990 at CERN to organize research documents available on the Internet.
- Combined the idea of documents available by FTP with the idea of hypertext to link documents.
- Developed the initial HTTP network protocol, URLs, HTML, and the first web server.
21. Search Engine (Web-based IR System) History
- By the late 1980s, many files were available by anonymous FTP.
- In 1990, Alan Emtage of McGill University developed Archie (short for "archives").
- Assembled lists of files available on many FTP servers.
- Allowed regular-expression search of these file names.
- In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.
- In 1993, early web robots (spiders) were built to collect URLs.
- Wanderer
- ALIWEB (Archie-Like Index of the WEB)
- WWW Worm (indexed URLs and titles for regex search)
- In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
22. Search Engine History (contd.)
- In early 1994, Brian Pinkerton developed WebCrawler as a class project at the University of Washington (it eventually became part of Excite and AOL).
- A few months later, Fuzzy Maudlin, a professor at CMU, developed Lycos with his graduate students.
- First to use a standard IR system as developed for the DARPA Tipster project.
- First to index a large set of pages.
- In late 1995, DEC developed AltaVista.
- Used a large farm of Alpha machines to quickly process large numbers of queries.
- Supported Boolean operators, phrases, and reverse-pointer queries.
- In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford University.
- Used link analysis to rank documents.
23. How Do Web Search Engines Work?
- Search engines for the general web
- Search a database of the full text of web pages selected from billions of web pages.
- Searching is based on inverted index entries.
- Search engine databases
- Full-text documents are collected by software robots (also called softbots or spiders), which navigate the web to collect pages.
- The web can be viewed as a graph structure.
- The navigation can be based on DFS (Depth-First Search), BFS (Breadth-First Search), or some combined navigation heuristics (a BFS crawling sketch follows the two figures below).
- How to detect cycles? A research issue.
- The indexer then builds inverted index entries and stores them in inverted files.
- If necessary, the inverted files may be compressed.
- Some types of pages and links are excluded from the search engine.
- These form the invisible web (maybe many times bigger than the visible web).
24. Breadth-First Crawling
[Figure: breadth-first crawling order over the web link graph]
25. Depth-First Crawling
[Figure: depth-first crawling order over the web link graph]
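A minimal sketch of breadth-first crawling over a web graph (the link graph is hypothetical; a visited set is one simple way to handle the cycle-detection issue mentioned above):

```python
# Toy breadth-first crawl of a link graph; 'visited' prevents revisiting
# pages, which is how cycles in the graph are handled here.
from collections import deque

links = {                                   # hypothetical web graph: URL -> outgoing links
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "d.html"],         # back-link to a.html forms a cycle
    "c.html": ["d.html"],
    "d.html": [],
}

def bfs_crawl(seed: str) -> list[str]:
    visited, order = {seed}, []
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()            # FIFO queue -> breadth-first order
        order.append(url)                   # a real crawler would fetch and index the page here
        for out in links.get(url, []):
            if out not in visited:
                visited.add(out)
                frontier.append(out)
    return order

print(bfs_crawl("a.html"))                  # ['a.html', 'b.html', 'c.html', 'd.html']
```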
26. Web Search Engine System Architecture
27. [Figure: search engine components and data flow: User, Interface, Retrieval mechanism, Robot, Internet websites, Temporary storage, Parser, Stopper/Stemmer, Indexer, Logical document representation (based on IR models), and Inverted files (which can be based on different physical data structures)]
28. Distributed Architecture (Example)
- Harvest (http://harvest.sourceforge.net/)
- Distributed web search engine
- Distributes the load among different machines
- The indexer doesn't run on the same machine as the broker or web server
29. What Makes a Search Engine Good?
- Database of web documents
- Size of the database
- Freshness (recency, or up-to-dateness)
- Types of documents offered
- Retrieval speed
- The search engine's capabilities
- Search options
- Effectiveness of the retrieval mechanism
- Support for concept-based search (related to the semantic web)
- Concept-based search systems try to determine what you mean, not just what you say.
- Concept-based search often works better in theory than in practice; concept-based indexing is a difficult task to perform.
- Presentation of the results
- Keywords highlighted in context
- Showing a summary of the matching web page
30. Search Engine Example (Google)
31. Google
- The most popular web search engine
- Crawls the web (by robots) and stores a local cache of found pages
- Builds a lexicon of common words
- For each word, creates an index list of pages containing it
- Also uses human-compiled information from the Open Directory
- Cached links let you see older versions of recently changed pages
- Link analysis system
- PageRank heuristic
- Estimated size of index
- 580 million pages visited and recorded
- Uses link data to reach another 500 million pages (via the link analysis system)
- A recent estimate is around 4 billion pages (??)
- Index refresh
- Updated monthly/weekly, or daily for popular pages
- Serves queries from three data centres (service replication)
- Service updates are synchronized.
- Two on the West Coast of the US, one on the East Coast.
32. Google Founders
- Larry Page, Co-founder and President, Products
- Sergey Brin, Co-founder and President, Technology
- PhD students at Stanford
- Became a public company last year
33. Google Architecture Overview
34. Google Indexer
[Figure: Google indexer producing term frequencies]
35. Google Lexicon
36. Google Searcher
37. Google Features
- Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
- Other services also use link popularity, but none to the extent that Google does.
- Traditional IR (lightly used)
- Link popularity (heavily used)
- Citation importance ranking (quality of the links pointing at a page)
- Relevancy
- Similarity between the query and a page
- Number of links
- Link quality
- Link content
- Ranking boosts based on text styles
- PageRank
- Usage simulation and citation importance ranking
- User randomly navigates
- Process modelled by a Markov chain
38. Collecting Links in Google
- Submission (by web promotion)
- "Add URL" page (you may not need to do a "deep" submit)
- The best way to ensure that your site is indexed is to build links. The more other sites point at you, the more likely you are to be crawled and ranked well.
- Crawling and index depth
- Aims to refresh its index on a monthly basis.
- Even if Google doesn't actually index a page, it may still return it in a search because it makes extensive use of the text within hyperlinks.
- This text is associated with the pages the link points at, and it makes it possible for Google to find matching pages even when those pages cannot themselves be indexed.
39. Google Guidelines for Web Submission
40. Deep SubmitPro
41. Link Analysis for Relevancy (1)
- Inspired by CiteSeer (NEC, Princeton, NJ) and the IBM Clever project
- CiteSeer...
- http://www.almaden.ibm.com/cs/k53/clever.html
- Google ranks web pages based on the number, quality, and content of links pointing at them (citations).
- Number of links
- All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
- Link quality
- Numbers aren't everything. A single link from an important site might be worth more than many links from relatively unknown sites.
- Weights page importance: links from important pages are weighted higher.
42. Link Analysis for Relevancy (2)
- Link content
- The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word "travel" in them or near them on the linking page. It also helps if the page itself is textually relevant for travel.
- Ranking boosts based on text styles
- The appearance of terms in bold text, in header text, or in a large font size is all taken into account. None of these are dominant factors, but they do figure into the overall equation.
43. PageRank
- Usage simulation and citation importance ranking
- Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
- User randomly navigates
- Jumps to a random page with probability p
- Follows a random hyperlink from the current page with probability 1 - p
- Does not go back to a previously visited page by following a previously traversed link backwards
- Intuitively, Google finds a type of universally important page
- Locations that are heavily visited in a random traversal of the web's link structure.
44. PageRank Heuristics
- The process is modelled by the following heuristics (a worked sketch follows the illustration below).
- The probability of being at each page is computed; p is set by the system.
- wj: PageRank of page j
- ni: number of outgoing links on page i
- m: the number of nodes in G (the number of web pages in the collection)
45. PageRank Illustration
[Figure: PageRank illustration showing page weights w1, w2, w3, ..., wj, ..., wn, wm and the link-following factor (1 - p)]
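A minimal sketch of the iteration these definitions suggest, using the standard formulation wj = p/m + (1 - p) * sum over pages i linking to j of wi/ni (the link graph and the value of p below are illustrative assumptions, not from the slides):

```python
# Toy PageRank iteration: w_j = p/m + (1 - p) * sum_{i -> j} w_i / n_i.
links = {                       # hypothetical web graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links: dict[str, list[str]], p: float = 0.15, iters: int = 50) -> dict[str, float]:
    m = len(links)                                   # number of pages in the collection
    w = {page: 1.0 / m for page in links}            # start from a uniform distribution
    for _ in range(iters):
        new_w = {page: p / m for page in links}      # random-jump component
        for i, outgoing in links.items():
            n_i = len(outgoing)
            for j in outgoing:
                new_w[j] += (1.0 - p) * w[i] / n_i   # weight flowing along link i -> j
        w = new_w
    return w

print(pagerank(links))          # C accumulates the most weight in this toy graph
```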
46. Google Spamming
- The link popularity ranking system leaves Google relatively immune to traditional spamming techniques.
- It goes beyond the text on pages to decide how good they are: no links, low rank.
- Common spam idea
- Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
- "The (Evil) Genius of Comment Spammers" by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7
47. http://www.wired.com/wired/archive/12.03/google.html?pg=7
48. Topic Search: http://www.google.com/options/index.html
49. Brief Introduction to the Semantic Web
50. Machine-Processable Knowledge on the Web
- Unique identity of resources and objects: URI
- Metadata annotations
- Data describing the content and meaning of resources
- But everyone must speak the same language
- Terminologies
- Shared and common vocabularies
- But everyone must mean the same thing
- Ontologies
- Shared and common understanding of a domain
- Essential for the exchange and discovery of knowledge
- Inference
- Apply the knowledge in the metadata and the ontology to create new metadata and new knowledge
51. The Semantic Web
52. Ontologies: The Semantic Backbone
53. Language Tower in the Semantic Web
Web Ontology Language 1.0 Reference: http://www.w3.org/TR/owl-ref/
[Figure: semantic web language tower with layers for Identity, Standard syntax, Metadata annotations, Ontologies, Rules/Inference, Explanation, and Attribution]
54. [Figure: example ontology graph with concepts Person, Sport, Team-based Sport (participants > 1), Sports Club, Sport Club, Organisation, Country, and Europe, and instances Soccer, Blackburn Rovers, Blackburn Soccer Club, and the UK (partof Europe)]
55. [Figure: example ontology graph with concepts Event, Competition, Tournament, Sports Tournament, Soccer Tournament, Sports Player, and Soccer Player, and instances the Worthington Cup, Blackburn Rovers, Andy Cole, and Brad Friedel]
56. [Figure: example ontology graph for Andy Cole, a Soccer Player (a Sports Player, a Person), with birthplace Nottingham and nationality UK (a Country, partof Europe), linked to Blackburn Rovers]
57. [Figure: example ontology graph for Brad Friedel, a Soccer Player (a Sports Player, a Person), with birthplace Lakewood and nationality USA (a Country), linked to Blackburn Rovers in the UK (a Country, partof Europe)]
58. Useful IR System Building Software and Resources
59. Lucene API (http://lucene.apache.org/)
- Pure Java (data abstraction, platform independence, reusable components)
- High-performance indexing
- Supports both incremental indexing and batch indexing
- Provides accurate and efficient searching mechanisms
- Complex queries based on Boolean and phrase queries, and queries over specific document fields
- Ranked searching, with the highest-scoring results returned first
- Allows users to develop a variety of new applications
- Searchable email
- CD-based documentation search
- DBMS object ID management
60. http://www.getopt.org/luke/
61. http://www.egothor.org/ (supports EBIR)
62. http://nltk.sourceforge.net/
63. http://ciir.cs.umass.edu/research/indri/
64. http://www.summarization.com/
65. http://wordnet.princeton.edu/
66. http://protege.stanford.edu/
67. http://www.google.com/apis/
68. http://www.amazon.com/gp/browse.html/103-1065429-7111805?5FencodingUTF8node3435361 (then click "Alexa Web Information Service 1.0 Released")
69. http://mg4j.dsi.unimi.it/
70. http://www.xapian.org/history.php (probabilistic IR model)
71. http://www.searchtools.com/info/info-retrieval.html (IR research resources)
72. http://www-db.stanford.edu/db_pages/projects.html
73. http://dbpubs.stanford.edu:8090/aux/index-en.html
74. http://citeseer.ist.psu.edu/
75. Web Challenges for the IR Research Community
76. Research Issues (1)
- The IR research field is interdisciplinary in nature.
- Traditionally focused on retrieval effectiveness
- Retrieval models and mechanisms (e.g., various ad-hoc models, probabilistic/statistical reasoning, language models such as the INDRI system at UMass)
- Use of relevance feedback for improving effectiveness (e.g., query reformulation, pseudo-thesauri, document categorization/clustering through machine learning techniques as knowledge acquisition tools)
- Knowledge/semantically richer retrieval approaches (e.g., RUBRIC rule-based IR, some recent concept-based IR based on rules)
- Information filtering based on user profiling
- Traditionally based on a small set of text collections
- Little work has been done on retrieval efficiency, although there are some reports (e.g., use of parallel architectures for handling index files based on signature files).
77. Research Issues (2)
- Challenges
- Distributed data: documents spread over millions of different web servers.
- Volatile data: many documents change or disappear rapidly (e.g., dead links); information recency (up-to-dateness) matters.
- Large volume: billions of separate documents.
- Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
- Quality of data: no editorial control, false information, poor-quality writing, typos, etc.
- Need for large-scale, knowledge/semantically rich retrieval applications.
- Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc.
78. Research Issues (3)
- Retrieval effectiveness (all at large scale) with efficiency in mind
- Test the effectiveness of IR models with efficiency as an important consideration
- Effective and efficient indexing for both documents and queries
- Natural language processing (some statistical)
- Distributed incremental indexing
- System and physical data structure/algorithm issues
- Distributed brokering architecture for information recency
- Investigation of semantically richer approaches
- Semantic web and other rule-based approaches
- Effective and efficient knowledge indexing
- Use of users' relevance feedback
- Automatic feedback acquisition
- User profiling and information filtering
- Evaluation measures (predictable)
- Text summarization for better presentation
- Text categorization (clustering) for topic search (e.g., Yahoo subject directory, Google topic search).
79. Research Issues (4)
- Multimedia indexing
- IBM QBIC project (http://wwwqbic.almaden.ibm.com/)
- Indexing tools for various media types (e.g., an image of a mountain with a lake covered by snow; SemCap)
- Develop a test bed for controllable experiments
- Internet emulator/simulator
- Distributed IR subsystems
- Appropriate performance measures (e.g., RB Precision)
- Refer to the recent papers by Stanford researchers addressing both retrieval effectiveness and efficiency.