Title: Web-based Information Architectures MSEC 20-760 Mini II
1Web-based Information Architectures MSEC 20-760 Mini II
- Location: GSIA Simon Auditorium
- Time: 1:30-3:20 pm, Tues. and Thurs.
- Instructor: Prof. Jaime Carbonell (Office: NSH 4519, Email: jgc_at_cs.cmu.edu, Tel: 268-7279), augmented with expert guest lectures
- Teaching assistant: Jian Zhang (Office: NSH 4605, Email: jianzhang_at_cmu.edu, Tel: 268-6521, Office Hours: TBD)
- Administrative assistant: TBD (Office: NSH 4517, Email: rcpomp_at_cs.cmu.edu, Tel: 268-4788)
2Administrative Issues
- Prerequisites
- Basic programming skills (preferably Java)
- Familiarity with the web (HTML, browsing, etc.)
- Fundamentals of Web Programming (20-753)
- Grading
- 30% homeworks (2 programming assignments)
- 30% miniproject (student teams will propose)
- 15% midterm (5 pages notes, calculator OK, no laptops)
- 25% final (10 pages notes, calculator OK, no laptops)
- Bulletin Board
- Schedule/syllabus
- Lecture notes (in PowerPoint)
- Homework
- Announcements
- Discussions
3Textbook and Reference Materials (1)
- Required: Class notes (slides on web site) and handouts (to be provided)
- Required: "Understanding Search Engines: Mathematical Modeling and Text Retrieval" by Michael W. Berry and Murray Browne. Available at http://www.siam.org (tel 1-800-447-7426)
- Optional: Background reading material provided
4Textbook and Reference Materials (2)
- Optional: "Advances in Information Retrieval", edited by Croft, Kluwer Academic Pub., 2000. A more detailed, state-of-the-art IR book.
- Optional: "Machine Learning" by Tom M. Mitchell, WCB McGraw-Hill. Tools for text categorization and data mining.
5Information Retrieval: The Challenge (1)
Text DB includes:
(1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements in 1961. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers...
(2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited...
(3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...
6Information Retrieval: The Challenge (2)
User query states: "Decline in rainfall and impact on farms near Sahara"
Challenges:
- How to retrieve (1) and (3) and not (2)?
- How to rank (3) as best?
- How to cope with no shared words?
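To make the challenge concrete, the following is a minimal sketch (not part of the original slides) that counts the content words each sample document shares with the query. The stopword list, the tokenizer, and the abbreviated document texts are illustrative assumptions.

```python
import re

# Illustrative stopword list (an assumption, not from the course material).
STOPWORDS = {"a", "and", "are", "for", "from", "in", "of", "on", "the", "to"}

def content_words(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

# Abbreviated versions of the three sample documents.
docs = {
    1: "Rainfall measurements in the Sahara continue to show a steady decline "
       "starting from the first measurements in 1961.",
    2: "Dan Marino states that professional football risks losing the number one "
       "position. Declines in TV audience ratings are cited.",
    3: "Alarming reductions in precipitation in desert regions are blamed for "
       "desert encroachment of previously fertile farmland in Northern Africa.",
}
query = "Decline in rainfall and impact on farms near Sahara"

def shared_words(query, docs):
    """Map each document id to the sorted content words it shares with the query."""
    q = content_words(query)
    return {doc_id: sorted(q & content_words(text)) for doc_id, text in docs.items()}

overlap = shared_words(query, docs)
for doc_id, words in overlap.items():
    print(doc_id, words)
# 1 ['decline', 'rainfall', 'sahara']
# 2 []
# 3 []
```

With exact word matching, document (1) is found, but the relevant document (3) scores the same as the irrelevant document (2): "precipitation" and "farmland" never match "rainfall" and "farms". Naive stemming would make things worse here, since "Declines" in (2) would then match "Decline".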
7Information Retrieval in eCommerce (1)
Bringing in Customers
- How do Web-search engines work?
- How to maximize hits on my eCommerce pages?
- How to maximize preselection of customers who will transact?
8Information Retrieval in eCommerce (2)
Analyzing the Competition
- How do we find the competition?
- How will customers find the competition?
- Can we do preemptive information strikes?
Text Mining
- How to learn what customers want most?
- How to find out what they missed, but wanted?
- How to discover customer search/browsing patterns?
9Information Retrieval Assumption (1)
Basic IR task
- There exists a document collection {Dj}
- User enters an ad hoc query Q
- Q correctly states the user's interest
- User wants the subset {Di} ⊂ {Dj} most relevant to Q
10Information Retrieval Assumption (2)
"Shared Bag of Words" assumption
- Every query = {wi}
- Every document = {wk}
- ...where wi and wk are in the same vocabulary Σ
- All syntax is irrelevant (e.g. word order)
- All document structure is irrelevant
- All meta-information is irrelevant (e.g. author, source, genre)
- => Words suffice for relevance assessment
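As a small illustration of the bag-of-words assumption (a sketch, with a whitespace tokenizer chosen purely for brevity), two texts with the same words in a different order become indistinguishable once they are reduced to word counts:

```python
from collections import Counter

def bag_of_words(text):
    """Represent text as word counts; word order and structure are discarded."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

same_bag = (a == b)
print(same_bag)  # True
```

This is exactly what "all syntax is irrelevant" costs: the representation cannot tell who bit whom.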
11Information Retrieval Assumption (3)
- Retrieval by shared words
- If Q and Dj share some wi, then Relevant(Q, Dj)
- If Q and Dj share all wi, then Relevant(Q, Dj)
- If Q and Dj share over K of wi, then Relevant(Q, Dj)
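The three alternative criteria above can be sketched as set operations. This is one possible reading, not the course's definition: K is treated here as a count of shared words (it could equally be read as a fraction), and the function names are hypothetical.

```python
def shared(query_words, doc_words):
    """Words that occur in both the query and the document."""
    return set(query_words) & set(doc_words)

def relevant_some(query_words, doc_words):
    """Relevant if at least one query word occurs in the document."""
    return len(shared(query_words, doc_words)) > 0

def relevant_all(query_words, doc_words):
    """Relevant only if every query word occurs in the document."""
    return set(query_words) <= set(doc_words)

def relevant_over_k(query_words, doc_words, k):
    """Relevant if more than k query words occur in the document (k as a count is an assumption)."""
    return len(shared(query_words, doc_words)) > k

q = {"decline", "rainfall", "sahara", "farms"}
d = {"rainfall", "measurements", "sahara", "steady", "decline"}

print(relevant_some(q, d))       # True
print(relevant_all(q, d))        # False, "farms" is missing
print(relevant_over_k(q, d, 2))  # True, 3 shared words > 2
```

The "some" rule favors recall, the "all" rule favors precision, and the threshold K trades one against the other.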
12Boolean Queries (1)
Topic: Industrial use of silver
- Q: silver
- R: "The Count's silver anniversary..."
     "Even the crash of '87 had a silver lining..."
     "The Lone Ranger lived on in syndication..."
     "Silver dropped to a new low in London..."
     ...
- Q: silver AND photography
- R: "Posters of Tonto and the Lone Ranger..."
     "The Queen's Silver Anniversary photos..."
     ...
13Boolean Queries (2)
- Q: (silver AND (NOT anniversary)
             AND (NOT lining)
             AND emulsion)
     OR (AgI AND crystal
             AND photography)
- R: "Silver Iodide Crystals in Photography..."
     "The emulsion was worth its weight in silver..."
     ...
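A Boolean query of this kind maps directly onto set operations over an inverted index. The sketch below is an illustration under stated assumptions: the four document texts are made-up stand-ins for the slide's truncated snippets, and the helper names (`build_index`, `term`, `not_`) are hypothetical.

```python
from collections import defaultdict

# Toy collection; texts are illustrative stand-ins, not the actual documents.
docs = {
    0: "the count s silver anniversary",
    1: "even the crash had a silver lining",
    2: "silver iodide agi crystal in photography",
    3: "the emulsion was worth its weight in silver",
}

def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

def term(word):
    """Posting set for a word (empty if the word is not indexed)."""
    return index.get(word, set())

def not_(ids):
    """Boolean NOT: complement with respect to the whole collection."""
    return all_ids - ids

# AND is set intersection (&), OR is set union (|).
result = (
    (term("silver") & not_(term("anniversary")) & not_(term("lining")) & term("emulsion"))
    | (term("agi") & term("crystal") & term("photography"))
)
print(sorted(result))  # [2, 3]
```

The query returns an unordered set: a document either satisfies the expression or it does not, so there is no notion of a best match.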
14Boolean Queries (3)
- Boolean queries are
- a) easy to implement
- b) confusing to compose
- c) seldom used (except by librarians)
- d) prone to low recall
- e) all of the above
15Beyond the Boolean Boondoggle (1)
- Desiderata (1)
- Query must be natural for all users
- Sentence, phrase, or word(s)
- No ANDs, ORs, NOTs, ...
- No parentheses (no structure)
- System focus on important words
- Q: I want laser printers now
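One simple way for a system to "focus on important words" is to drop low-content words from a natural-language query. The sketch below assumes a tiny hand-made stopword list; which words count as fluff is an assumption here, not something the slides specify.

```python
# Hypothetical list of low-content query words (an assumption for illustration).
STOPWORDS = {"i", "want", "now", "a", "the", "please", "some"}

def important_words(query):
    """Keep only the query words that are not in the stopword list."""
    return [w for w in query.lower().split() if w not in STOPWORDS]

keywords = important_words("I want laser printers now")
print(keywords)  # ['laser', 'printers']
```

The user types a sentence with no operators or parentheses, and the system reduces it to the terms that carry the request.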
16Beyond the Boolean Boondoggle (2)
- Desiderata (2)
- Find what I mean, not just what I say
- Q: cheap car insurance
    (pAND (pOR "cheap" 1.0
               "inexpensive" 0.9
               "discount" 0.5)
          (pOR "car" 1.0
               "auto" 0.8
               "automobile" 0.9
               "vehicle" 0.5)
          (pOR "insurance" 1.0
               "policy" 0.3))
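The slide does not define how pAND and pOR combine weights, so the following sketch picks one plausible interpretation and should be read as an assumption: pOR takes the best weight among the synonyms present in a document, and pAND multiplies the group scores. Only the terms and weights come from the slide.

```python
# Weighted synonym groups copied from the example query.
QUERY = [
    {"cheap": 1.0, "inexpensive": 0.9, "discount": 0.5},
    {"car": 1.0, "auto": 0.8, "automobile": 0.9, "vehicle": 0.5},
    {"insurance": 1.0, "policy": 0.3},
]

def p_or(group, doc_words):
    """Assumed semantics: highest weight among the group's terms found in the document."""
    return max((w for t, w in group.items() if t in doc_words), default=0.0)

def p_and(groups, doc_words):
    """Assumed semantics: product of group scores, so a missing group gives 0."""
    score = 1.0
    for group in groups:
        score *= p_or(group, doc_words)
    return score

def score(text):
    """Score a document text against the expanded query."""
    return p_and(QUERY, set(text.lower().split()))

exact = score("cheap car insurance quotes")   # every original query word present
loose = score("discount auto policy offers")  # only weaker synonyms present
miss = score("cheap car rentals")             # nothing from the insurance group
print(exact, loose, miss)
```

Under these assumptions the exact match scores 1.0, the synonym-only document scores roughly 0.12 (0.5 × 0.8 × 0.3), and the document with no insurance term scores 0, so documents that say what the user means still rank, just lower than those that say it literally.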
17Beyond the Boolean Boondoggle (3)
- Desiderata (3)
- Speech-recognized queries
- Coming soon, to a system near you
- longer queries
- more fluff words to filter
- acoustic recognition errors
18INFORMATION RETRIEVAL
[Diagram with components labeled: User, The Web, Spider, Search Engine, Inverted Index, Library, etc.]
19INFORMATION RETRIEVAL APPLICATIONS
- Searching Document Archives
- Libraries (title, subject, full-text)
- Databases of patents and applications
- DBs of legal cases (e.g. Lexis, Westlaw)
- Searching the Web
- Pure search engines (Google, Inktomi, ...)
- Browsing & Search (Yahoo, Terra-Lycos, ...)
- Meta-search (Metacrawler, Vivisimo, ...)
- Corporate or Government Intranets
- Non-traditional (e.g. Software DBs, News)
20INFORMATION RETRIEVAL (IR) EVOLUTION
- IR in the 1980s
- Single collection with < 10^6 documents (archive)
- Boolean queries with unordered-set answer
- IR circa 2000
- Single collection with > 10^9 documents (web)
- Free-form queries with ranked-list answer
- IR circa 2010
- Multiple collections, > 10^12 docs (invisible web)
- "Find what I mean" queries with clustering, summarization and customization
21Content for Rest of the Course (1)
- See the course BB for the latest updates to the course schedule.
- Under the Hood
- The vector space model for retrieval
- Building an inverted index
- Term weighting and selection
- Web spidering
- Automated text categorization
22Content for Rest of the Course (2)
- IR Uses in eCommerce
- How to make search engines work for you
- How to build optimal search-attractive web sites
- The business(es) of web-based information
- Beyond Web Search Engines
- Speech processing primer
- Information extraction from web pages
- Data mining primer
- Multi-media applications
- Business models
23Optional Quick Review of Linear Algebra
- If you know n-dimensional vectors, matrices, computing inner products, etc., then you do not need this review. You may take a break.
- If you learned this material, but do not remember it, please stay and listen to refresh your knowledge.
- If you never learned linear algebra, stay, listen and (optionally) read either
- G. Hadley. Linear Algebra. Addison-Wesley, 1961. Ch. 3.
- Or, Stephen W. Goode. An Introduction to Differential Equations and Linear Algebra. Prentice Hall, 1991. Ch. 3.