Title: Opportunities and Challenges of Web Search and Mining
1Opportunities and Challenges of Web Search and
Mining
Academia Sinica National Taiwan University
2Outline
- Web SE
- Inside SE
- Google's Business Models
- Google's Impacts
- Recent Development
- Next-Generation WSE
- Web Mining
3WSE Google
Globalization!
4WSE Google
5Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
6Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
Business: Profitable, Models, Competition
7Problems of WSE
Business: Profitable, Models, Competition
Inside WSE: Fast, Coverage, Accuracy
Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
8I. Some Must-Know Statistics
9Online Language Populations
- Source: Global Reach (global-reach.biz/globstats)
10Top Ten Languages in the Web
Top Ten Languages in the Internet

Language                 Internet Users,   Average           World Population         Language as % of
                         by Language       Penetration (%)   Estimate for Language    Total Internet Users
English                  287,369,520       26.2              1,098,654,265            35.9
Chinese                  105,484,112        8.0              1,321,669,200            13.2
Japanese                  66,548,060       52.1                127,853,600             8.3
German                    54,035,201       56.3                 95,893,300             6.8
Spanish                   53,670,063       13.9                386,413,200             6.7
French                    35,034,269        9.3                375,164,185             4.4
Korean                    30,670,000       41.0                 74,730,000             3.8
Italian                   28,610,000       49.3                 57,987,100             3.6
Portuguese                23,058,254       10.3                224,664,100             2.9
Dutch                     13,657,170       56.6                 24,125,950             1.7
Top Ten Languages        698,353,773       18.4              3,787,154,900            87.3
Rest of the Languages    101,686,725        3.9              2,602,992,587            12.7
World Total              800,040,498       12.5              6,390,147,487           100.0
More and more non-English users!
- Source: Internet World Stats
11Web Content
More and more non-English pages
Source: Network Wizards, Jan 99 Internet Domain Survey
12Web Users and Pages (5 years ago)
Challenge of scalability!
Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
13Number of Chinese Web Pages
573,000,000 pages
Scalability Problem !
14Number of Web Pages
Billions of textual documents indexed, as of Sept 2, 2003
The world's largest search engine?
- 4,285,199,774 pages (Google)
- 4.28 billion Web pages, 880 million images, and other documents
Key: GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista. Source: Search Engine Watch
15The top 10 Internet trends 2004 predicted by
eOneNet.com
- 1. World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia having more than 100 million Internet users.
- 2. Broadband Internet penetration will continue to grow with China and the US in the lead, each with an expected growth rate exceeding 30%.
- 3. Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion.
- 4. Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.
16The top 10 Internet trends 2004 predicted by
eOneNet.com
- 5. Spam will increase at least 20% despite the new US anti-spam law. The US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law.
- 6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers.
- 7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media content.
17The top 10 Internet trends 2004 predicted by
eOneNet.com
- 8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel.
- 9. Online entertainment will grow at a rapid pace, with more sites offering videos and digital music download services.
- 10. The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring.
18II. Inside WSE
19Components
- Crawler/Spider
- Index Server
- Query Server
- Document Delivery
20Architecture
[Figure: search engine architecture, steps (1)-(5). Components: Spider, Web, Archive, Indexer, multiple SE/Index servers, Browser, Log. Annotations: 1B queries/day, 5B pages, quality results, freshness, spam, scalable.]
21Spider
- Get all Pages from the Web
- Web Traverse
- Challenges
- Performance, e.g., pages per PC
- Coverage
- Currency
- Spam Filtering
- Hidden Web
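The "Web Traverse" step can be pictured as a breadth-first walk over the link graph. The following is only a minimal sketch under stated assumptions: `crawl`, `fetch_links`, and `TOY_WEB` are hypothetical names introduced for illustration, and the toy in-memory graph stands in for real HTTP fetching, politeness rules, spam filtering, and hidden-Web handling, none of which are modeled here.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of a link graph.

    fetch_links(url) is a hypothetical callback returning the URLs that
    the page at `url` links to; a real spider would download and parse
    the page at this point.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph used instead of the real Web (assumption for illustration).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

visited = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(visited)
```

The `max_pages` cap is one simple way to bound the crawl; coverage and currency would in practice call for revisit scheduling, which this sketch leaves out.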
22Index Server
- Index occurrences of all words in the pages
- Data cleanliness
- Challenges
- Space overhead, pages per PC
- Incremental indexing
- Scalability: distributed processing
- Multiple Languages
23System Anatomy
24Data Structure
- Lexicon: fits in memory; kept in two different forms
- Hit lists: account for most of the space; each hit uses 2 bytes to save space
- Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
- Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
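To make the relation between the forward and inverted index concrete, here is a small sketch. It is a simplification under assumptions: plain word strings are used in place of wordIDs, hits are bare word positions rather than packed 2-byte records, and barrels are not modeled. The function names are hypothetical.

```python
from collections import defaultdict


def build_forward_index(docs):
    """docs: {doc_id: text}. Returns {doc_id: {word: [positions]}}."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            hits[word].append(pos)  # a "hit" here is just a word position
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the same content by word.

    Returns {word: [(doc_id, positions), ...]} where each doc list is
    sorted by doc_id.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word, positions in forward[doc_id].items():
            inverted[word].append((doc_id, positions))
    return dict(inverted)


docs = {2: "web search engine", 1: "web mining and web search"}
forward = build_forward_index(docs)
inverted = invert(forward)
print(inverted["web"])
```

Looking up a query word in `inverted` yields its doc list directly, which is what makes query-time lookup cheap compared with scanning the forward index.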
25Query Server
- Search relevant URLs for queries by looking up indices
- Challenges
- Speed, e.g., queries per second
- Functions supported
- Localization
26PageRank
27PageRank (Cont.)
- Let u be a Web page, and let $B_u$ be the set of pages that point to u.
- Let $N_u$ be the number of links from u and let c be a factor used for normalization; then a simplified version of PageRank is
- $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
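The simplified PageRank definition (each page shares its rank equally among its out-links, and a factor c normalizes the result) can be iterated to a fixed point. The sketch below is illustrative only and rests on assumptions: the function name and toy graph are hypothetical, every page must have at least one out-link, every link target must be a key of `links`, and c is realized by renormalizing the total rank to 1 after each pass. Damping and dangling-page handling are not included.

```python
def simplified_pagerank(links, iterations=50):
    """links: {page: [pages it links to]}.

    Iterates R(u) = c * sum over v in B_u of R(v) / N_v, where B_u is
    the set of pages pointing to u and N_v is the out-link count of v.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:
            out = links[v]
            for u in out:
                new_rank[u] += rank[v] / len(out)  # v passes R(v)/N_v to u
        total = sum(new_rank.values())
        # The normalization factor c keeps the ranks summing to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(links)
print(ranks)
```

On this three-page graph the ranks settle near 0.4 for a and c and 0.2 for b, since b receives only half of a's rank.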
28Search Functions
- Phrase search, e.g. "petite galerie"
- Truncation, e.g. librar*, wom*n
- Constraining search, e.g. title:"The Wall Street Journal"
- Proximity search, e.g. gold near silver
- Boolean, e.g. noir film -"pinot noir"
- Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
- Limit search, e.g. limit by date range
- Capitalization, e.g. turkey vs. Turkey
- Ranking fields and refine search
- LiveTopics
- Translate Service
- Other
29Document Delivery
- Bottleneck of Bandwidth
- Presentation
- Caching
- Queries, Search Results
- Akamai model
30III. Business
31What is Google?
- Specialized web search engine
- Founded in 1998 by 2 graduate students at Stanford University (Larry Page and Sergey Brin)
- Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free)
- Google's features: fast, unbiased, and accurate results; access to over 4 billion web pages and over 800 million images (most important valid web pages)
32Company Facts
Employees: 1,300
Languages spoken: 34
Worldwide offices: 21 (mostly in US and Europe)
Annual revenues: $900m
33Google Revenue
- Revenue (an e-business)
- ½ from selling relevant text-based ads (sponsored links near search results)
- ½ from licensing its search technology to companies like Yahoo
- Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)
34Sources of Revenue
- AdWords (150,000 advertisers): sponsored-link ads
- Cost-per-click pricing: advertisers pay only when people click on the link
- Advertising is extremely cheap and effective
- e.g., Edmunds.com spent $250,000 a month on advertising because every $1 spent generated $1.70.
- Google Search Appliance
- An integrated hardware/software solution that extends the power of Google to corporate intranets and web servers
- Customers include Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc.
35Challenges (cont.)
- Easy entry into the search engine industry
- Lack of customer lock-in (vs. Microsoft)
- Google will focus on creating services to voluntarily draw in customers
- Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon)
- Customers are becoming competitors (Yahoo, AOL)
36Competitors Ebay and Amazon
- eBay (www.ebay.com): e-commerce
- Web-based marketplace in which a community of buyers and sellers is brought together to browse, buy, and sell various items
- Business revenue: charges fees on sale proceeds
- 5% on $0.01-$25, 2.5% on $25-$1,000, 1.25% over $1,000
- Amazon (www.amazon.com): e-commerce
- A customer-centric company that sells a range of products that it purchases from manufacturers and distributors
37Competitors Microsoft and Yahoo
- Microsoft is developing its own search engine
- Can lasso users into its search engine through its operating system
- Has the brainiacs to implement top-of-the-line search engine technology
- Yahoo was a customer of Google (may now become Google's biggest competitor)
- Offers placement under sponsored links and within actual results (unethical)
38IV. Impacts
39Impacts
- Web Computing
- Knowledge Windows
- New Web OS
40Web Computing
- Faster than local search
- Very-large scale of computing systems
- Understand global users' behavior
- Acquire global information sources
41Web Computing
- Local disc or global disc?
- Personal information management?
- Gmail
- Photo search
42Knowledge Windows
- Windows of Information Search
- Alliance with online databases
- Windows of Personal Knowledge Management
- Knowledge Windows
43New Web OS
- Merged with Linux OS
- Software download from end-users
- Information Service OS
44V. New Gen. of WSE
45Advanced Google
- Is Google good enough?
- Takano
- Takano NII
- Takano NII Japan
- More about Google Services
- http://www.google.com/options/
46New Features in Google
- Google Labs: http://labs.google.com/
- Google Desktop Search
- Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger
- Google SMS
- Searching phone book, dictionary, product prices, ...
- Google Print
- Searching books
48Other Search Tools
- A9.com (by Amazon)
- Bookmark, history, discover, diary
- Books, movies,
- Clusty.com (by Vivisimo)
- Clustering engine
- Snap.com (by Idealab)
- Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, ...
- Alexa.com (by Amazon)
- Average user review ratings, ...
- Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, AltaVista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, ...
49Clusty.com
50Example on Vivisimo
51Vivisimo (cont.)
52New Directions
- Personalization
- Photo search, email search filtering
- Information Extraction
- e.g., scholar search
- Information Agent
- Deep Web Search
53VI. Web Mining
54Web Search/Information Retrieval
Millions of Users
55Improving Search via Mining
Millions of Users
56Valuable Web Resources
Knowledge Discovery
- Hyperlinks
- Anchor texts
- Search-result pages
- Query logs
- Query session logs
- Clickstream logs
- Deep Web, ...
Web logs, texts, images,
Millions of Users
57Discovered Knowledge
Knowledge Discovery
- Users' preferences/needs: topic, location, timing, ...
- Authority/popularity: site, file, people, company, product, ...
- Clusters/associations/relations: site, page, people, company, product, query, ...
Web logs, texts, images,
Millions of Users
58Web Mining for IR
Knowledge Discovery
- Search
- Classification
- Clustering
- Cross-language IR
- Information extraction
- Text mining
- Filtering
Web logs, texts, images,
Millions of Users
59CS 276 / LING 239I: Information Retrieval and Web Mining
- Prabhakar Raghavan and Hinrich Schütze
- Course Description
- Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval.
60Computational Linguistics, Volume 29, Issue 3, September 2003
61Research at Web Knowledge Discovery Lab
62Research at Web Knowledge Discovery Lab
- Live series
- LiveTrans
- SIGIR04, ACL04, JCDL04
- ACM Trans. on Information Systems, 2004
- Online translation of unknown queries via the Web
- LiveClassifier
- WWW04, IJCNLP04
- ACM Trans. on Asian Language Information Processing, 2004
- Training classifiers and classifying short text via the Web
63Research at Web Knowledge Discovery Lab
- LiveCluster
- CIKM04
- ACM Trans. on Information Systems, 2004
- Generating taxonomy from terms or documents
64LiveTrans Cross-language Web Search
65LiveClassifier Classifying search results into
user-defined classification tree
66LiveClassifier: Paper Title Categorization
Note: no labeled training data
67LiveCluster Taxonomy Generation
68Terms Clustering
69Query Clustering
71Outline
- Translating Unknown Queries (SIGIR04)
- Training Text Classifiers (WWW04)
- Generating Taxonomy/Topic Hierarchies (TOIS04)
72Translating Unknown Queries
- Anchor Text Mining
- Probabilistic Modeling (ACM TALIP02)
- Transitive Translation (ACM TOIS04)
- Search-Result Page Mining
- Translation extraction and selection (JCDL04)
- CLIR and other applications (SIGIR04, ACL04)
Note: first work dealing with online translation
73Introduction (cont.)
- Bottleneck of CLIR service
- Real queries are often short
- Out-of-dictionary terms
- and might have local variations
- e.g., proper nouns, new terminologies, ...
- Need for a powerful query translation engine
- Up-to-date dictionary
English Terminologies Chinese Translation
Digital library ?????/?????
Banff ??/??
Ishikawa ???
NII Japan ????????
louvre museum ???
SARS ??????????/??/??
Clinton ???/???
Bill Gates ????
74Web Mining of Query Translations
[Figure: Web mining of term translations. An out-of-dictionary (OOD) source term is mapped to target translations (e.g., Yahoo <-> ??) by Web mining, using anchor-text mining and search-result mining.]
- Different problems for different resources
75Anchor Text (Yahoo <-> ??)
- Applies to most languages
- Translation candidates are likely to appear in
the same anchor-text-set
76Search Result Page (National Palace Museum vs.
?????)
- Mixed-language characteristic in Chinese pages
77Problems
- Term extraction
- Translation selection and noise reduction
- Language pairs with limited corpora
- Processing speed
- Data cleanness (language identification)
- Language independence
78Term Extraction SCPCD
79Term Selection Probabilistic Inference Model
- Integrating anchor texts and link structures into a probabilistic inference model
- Based on co-occurrence and page authority
Page Authority
Co-occurrence
Page Rank
80Observation of Anchor Text
81Observation of Anchor Text
[Figure: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw. Labels: Source Query; anchor texts "Yahoo", "Taiwan - Yahoo", "- in USA".]
82Observation of Anchor Text
[Figure: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw. Labels: Translation Candidates (????, ??); Anchor-Text Set; anchor texts "Yahoo", "Taiwan - Yahoo", "- in USA", "?? - ??".]
83Observation of Anchor Text
[Figure: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw. Labels: Page Authority (in-link 187, in-link 21); Co-occurrence; translation candidates ????, ??; anchor texts "Yahoo", "Taiwan - Yahoo", "- in USA", "?? - ??".]
84Search Result Mining
[Figure: search-result mining pipeline. Source Query -> Search Engine -> Web Pages -> Term Extraction -> Term Selection -> Target Translations.]
Term extraction uses the PAT-tree-based term extraction method [Chien, SIGIR 97].
85Term Selection
- How to decide the ranking?
- S and Ti frequently co-occur in the same pages
- Not necessarily true for synonyms and antonyms
- S and Ti share similar co-occurring context terms, taken from the result pages, as feature vectors
Query S
86Chi-Square Test
- Chi-square test: a statistical method for co-occurrence analysis [Gale & Church 91]
- a: number of pages containing both terms s and t
- b: number of pages containing term s but not t
- c: number of pages containing term t but not s
- d: number of pages containing neither term s nor t
- N: the total number of pages, i.e., N = a + b + c + d
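The page counts a, b, c, and d form a 2x2 contingency table for the terms s and t. The scoring formula itself is not in this transcript, so the sketch below assumes the standard 2x2 chi-square statistic, N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)); the function name is hypothetical, and in practice the counts would come from search engine page counts.

```python
def chi_square_score(a, b, c, d):
    """Standard 2x2 chi-square statistic for co-occurrence of s and t.

    a: pages with both s and t      b: pages with s but not t
    c: pages with t but not s       d: pages with neither
    Assumption: this is the textbook formula, used here as a stand-in
    for whatever exact variant the original slide showed.
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association evidence
    return n * (a * d - b * c) ** 2 / denom


perfect = chi_square_score(10, 0, 0, 10)     # s and t always together
independent = chi_square_score(5, 5, 5, 5)   # no association
print(perfect, independent)
```

A higher score suggests that s and t co-occur more often than chance would predict, which is why candidates can be ranked by it.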
87Context Vector Analysis
- Context vector analysis: co-occurring context terms as feature vectors
- Similarity measure: cosine measure
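The cosine measure over context vectors can be sketched as follows. The vectors are represented as sparse dictionaries from context term to weight; this representation, the function name, and the toy weights are assumptions for illustration, not details taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors {context_term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Toy context vectors (hypothetical weights).
source = {"museum": 3.0, "paris": 4.0}
candidate = {"museum": 3.0}
print(cosine(source, candidate))
```

Because the measure depends only on the angle between vectors, a candidate with few but well-matched context terms can still score highly.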
88 Indirect Association Problem
89Competitive Linking Algorithm
[Figure: bipartite graph linking source terms s (system, Cisco) to translation candidates t: ?? (Cisco), ?? (system), ?? (information), ?? (network), ?? (computer).]
Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2.
90Combined Method
- To take advantage of both methods
- Anchor-text-based: higher precision
- Search-result-based: higher coverage
Rm(s,t): the rank of the score of candidate t for source term s under method m
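The combination formula is not preserved in this transcript. One plausible way to merge per-method rankings Rm(s,t) is a weighted sum of reciprocal ranks, sketched below. This is an assumption, not necessarily the authors' formula; the function name, method labels, weights, and candidates are all hypothetical.

```python
def combined_scores(ranks_by_method, weights):
    """Weighted reciprocal-rank combination (an assumed scheme).

    ranks_by_method: {method: {candidate: rank}}, where rank 1 is best.
    weights: {method: weight}. A candidate missing from a method's
    ranking simply contributes nothing for that method.
    """
    scores = {}
    for method, ranks in ranks_by_method.items():
        for candidate, rank in ranks.items():
            scores[candidate] = scores.get(candidate, 0.0) + weights[method] / rank
    return scores


ranks_by_method = {
    "anchor_text": {"t1": 1, "t2": 2},
    "search_result": {"t2": 1, "t3": 2},
}
weights = {"anchor_text": 0.5, "search_result": 0.5}
scores = combined_scores(ranks_by_method, weights)
best = max(scores, key=scores.get)
print(best, scores)
```

Using ranks rather than raw scores sidesteps the fact that the methods produce scores on incomparable scales.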
91Experiments
- Performance on Query Translation
- Test bed: real query terms from the Dreamer search engine log in Taiwan
- 228,566 unique terms, during a period of 3 months in 1998
- Random-query test set
- 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log
- 40 of them were out-of-dictionary
92Random Query Test Set
Table 2. Coverage and top-1 to top-5 inclusion rates obtained with the four different methods for the random-query set.

Method      Top-1 (%)   Top-3 (%)   Top-5 (%)   Coverage (%)
CV          40.0        54.0        54.0        68
X2          36.0        50.0        52.0        68
AT          20.0        32.0        32.0        32
Combined    44.0        64.0        66.0        72

- Many query terms didn't appear in anchor-text sets (coverage)
93Other Experiments
- 430 popular Chinese queries: 67.4% top-1 inclusion rate
- Common terms: randomly selected 100 common nouns and 100 common verbs from a general-purpose Chinese dictionary
94Transitive Translation
95Transitive Translation Model
96Chinese-Japanese Translation
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.

Source term             Extracted target translations
(Traditional Chinese)   English       Simplified Chinese   Japanese
??                      Sony          ??                   ???
??                      Nike          ??                   ???
???                     Stanford      ???                  ???????
??                      Sydney        ??                   ????
????                    internet      ???                  ???????
??                      network       ??                   ??????
??                      homepage      ??                   ??????
??                      computer      ???                  ???????
???                     database      ???                  ??????
??                      information   ??                   ?????????

Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.

Model        Top1   Top2   Top3   Top4   Top5
Direct       10.5   12.8   14.3   15.1   15.1
Indirect     40.2   49.4   56.6   58.6   59.6
Transitive   42.9   51.4   58.6   61.3   61.9
97Translation Lexicons with Regional Variations
(a) Taiwan
(b) Mainland China
(c) Hong Kong
Figure 1. Examples of search-result pages in different Chinese regions that were obtained via the English query words "George Bush" from Google.
98Summary
- A work dealing with live translation of unknown queries
- Anchor-text-based
- High precision for high-frequency terms
- Effective for proper nouns in multiple languages
- Not applicable if the anchor-text set is not large enough
- Search-result-based
- Exploits rich Web resources
- High coverage for the English-Chinese language pair
99- LiveCluster
- Generating Taxonomy from terms or documents
100- Taxonomy Generation from Terms
101Hierarchical Query Clustering
102The Steps
- Feature Extraction
- Use co-occurring seed terms extracted from the retrieved top pages
- Term Vector
- Each query term is assigned a term vector
- Record the co-occurring feature terms and their frequency values in the retrieved documents.
- Term Similarity
- TF-IDF-based cosine measure
- Hierarchical Term Clustering
- Cluster popular query terms in the log into initial categories
- Query terms with similar features are grouped into clusters.
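The term-vector step (recording which feature terms co-occur with a query term in its retrieved snippets, with their frequencies) might be sketched as below. This is a simplified illustration under assumptions: `tf_df` is a hypothetical helper, the snippets are toy strings rather than real search results, matching is case-insensitive whole-word matching, and the TF-IDF weighting that would follow is not shown.

```python
import re


def tf_df(snippets, feature_terms):
    """Return {feature_term: (tf, df)} over the retrieved snippets.

    tf: total occurrences of the feature term across all snippets.
    df: number of snippets containing it at least once.
    Terms that never occur are left out of the vector.
    """
    stats = {}
    for term in feature_terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        counts = [len(pattern.findall(s.lower())) for s in snippets]
        tf = sum(counts)
        df = sum(1 for c in counts if c > 0)
        if tf > 0:
            stats[term] = (tf, df)
    return stats


# Toy snippets standing in for the top retrieved pages of one query term.
snippets = [
    "Web search and Web mining tutorial",
    "Search engine index design",
    "Mining query logs",
]
stats = tf_df(snippets, ["web", "search", "mining", "index server"])
print(stats)
```

Each query term ends up with one such vector, and the vectors are what the similarity and clustering steps compare.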
103Feature Extraction
- Use co-occurred seed terms extracted from
retrieved top pages
nude
Co-occurred feature terms
Creative Nude Photography Network -- Fine Art
Nude and ... ... The Creative Nude and Erotic
Photography Network is the number one net portal
to the best in fine art nude and erotic
photography! Over 100 CNPN Member Sites ...
Nude Places... to be naked. Walking in the
forest, cruising the lake in open boats,
swimming, picnicking and nude photography are all
enjoyed in the nude. 60 minutes 39.95. ... A
Brave Nude World... A Brave Nude World! Warning
This site contains links to fine art nude
erotic photography. If you are under 18 or do not
wish to view this material, You can ...
term                 tf/df
erotic photography   3/2
naked                1/1
photography          2/2
art                  3/2
104Term Weighting
105Extraction of Basic Feature Terms
- Performance of different features: randomly selected, high-frequency, and seed terms
- Popular queries not affected by ephemeral trends, e.g., movie, basketball, mutual fund, etc.
- More expressive and distinguishable in describing a particular category
- Two logs compared, and 9,709 overlapping top query terms extracted as feature terms
106Task I Query Clustering (Cont.)
- Feature Extraction
- Use co-occurring seed terms extracted from the retrieved top pages
- Term Vector
- Each query term is assigned a term vector
- Record the co-occurring feature terms and their frequency values in the retrieved documents.
- Term Similarity
- TF-IDF-based cosine measure
- Hierarchical Term Clustering
- Cluster popular query terms in the log into initial categories
- Query terms with similar features are grouped into clusters.
107Term Similarity
108Hierarchical Term Clustering
- Agglomerative hierarchical clustering (AHC)
- 1. Compute the similarity between all pairs of clusters
- Estimate the similarity between all pairs of composed terms
- Use the lowest term similarity value as the cluster similarity value
- 2. Merge the most similar (closest) two clusters
- Complete linkage method
- 3. Update the cluster vector of the new cluster
- 4. Repeat steps 2 and 3 until only a single cluster remains
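The agglomerative procedure with complete linkage can be sketched in a few lines. This is a naive illustration, not the authors' implementation: instead of maintaining a cluster vector, it recomputes the cluster similarity as the lowest pairwise term similarity each round, and the function names and toy term vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def complete_linkage_ahc(vectors):
    """vectors: {term: {feature: weight}}.

    Returns the merge history as (cluster_a, cluster_b, similarity),
    ending when a single cluster remains.
    """
    clusters = [frozenset([t]) for t in sorted(vectors)]
    merges = []

    def cluster_sim(c1, c2):
        # Complete linkage: the lowest similarity between member terms.
        return min(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cluster_sim(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


vectors = {
    "nba": {"basketball": 1.0, "game": 1.0},
    "lakers": {"basketball": 1.0, "game": 0.5},
    "stock": {"fund": 1.0, "market": 1.0},
    "bond": {"fund": 1.0},
}
merges = complete_linkage_ahc(vectors)
for a, b, sim in merges:
    print(sorted(a), sorted(b), round(sim, 3))
```

The merge history is a binary tree over the terms; cutting it at a chosen similarity level yields a flat partition into clusters.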
110Clustering Results
111Cluster Partition
112Quality Function
113Quality Function (Cont.)
114Quality Function (Cont.)
115Preliminary Experiment
- Test queries
- Two sets: top 1k queries and random 1k queries
- Each of the test queries has been manually assigned to appropriate classes
- Evaluation metrics
- F-measure
116Evaluation F-Measure
117Obtained F-Measures
119Results of Hierarchical Structure Generation