Text Information Retrieval and Applications

1
Text Information Retrieval and Applications
Advanced Topics
  • By J. H. Wang
  • May 27, 2009

2
Outline
  • Advanced Retrieval Technologies
  • Cross-Language Information Retrieval
  • Multimedia Information Retrieval
  • Semantic Retrieval
  • Applications to IR
  • Advanced Google
  • Meta Search
  • Search Result Clustering

3
Advanced Retrieval Technologies
  • Cross-Language Information Retrieval (CLIR)
  • Multimedia IR (image, speech, music, video)
  • Semantic retrieval (XML, Semantic Web)

4
Cross-Language Information Retrieval
  • Cross Language Information Retrieval (CLIR) -- A
    technology enabling users to query in one
    language and retrieve relevant documents written
    or indexed in another language

5
Cross Language Web Search
  • A technology enabling users to query in one
    language and retrieve relevant Web pages written
    or indexed in another language

6
Why Cross-Language?
  • Source: Global Reach (global-reach.biz/globstats)

7
Internet World Users by Language
8
Top Ten Languages Used in the Web (Number of Internet Users by Language)
Source: Internet World Stats (Mar. 31, 2009)

Language               Internet Users  Penetration (%)  Growth 2000-2008 (%)  Users (% of total)  Population for Language (2008 est.)
English                463,790,410     37.2             226.7                 29.1                1,247,862,351
Chinese                321,361,613     23.5             894.8                 20.1                1,365,138,028
Spanish                130,775,144     32.0             619.3                 8.2                 408,760,807
Japanese               94,000,000      73.8             99.7                  5.9                 127,288,419
French                 73,609,362      17.8             503.4                 4.6                 414,043,695
Portuguese             72,555,800      29.7             857.7                 4.5                 244,080,690
German                 65,243,673      67.7             135.5                 4.1                 96,402,666
Arabic                 41,396,600      14.2             1,545.2               2.6                 291,073,346
Russian                38,000,000      27.0             1,125.8               2.4                 140,702,094
Korean                 36,794,800      51.9             93.3                  2.3                 70,944,739
TOP 10 LANGUAGES       1,337,527,402   30.4             329.2                 83.8                4,406,296,835
Rest of the Languages  258,742,706     11.2             424.5                 16.2                2,303,732,235
WORLD TOTAL            1,596,270,108   23.8             342.2                 100.0               6,710,029,070

More and more non-English users!
9
Web Content
More and more non-English pages
Source: Network Wizards Internet Domain Survey
(Jan. 1999)
10
Chart of Web Content (by Language)
Source: Vilaweb.com, as quoted by eMarketer
(Feb. 2001)
  • Total Web pages: 313 B
  • English: 68.4%
  • Japanese: 5.9%
  • German: 5.8%
  • Chinese: 3.9%
  • French: 3.0%
  • Spanish: 2.4%
  • Russian: 1.9%
  • Italian: 1.6%
  • Portuguese: 1.4%
  • Korean: 1.3%
  • Other: 4.6%

11
Language Percent of Public Sites
  • English: 72%
  • German: 7%
  • Japanese: 6%
  • Spanish: 3%
  • French: 3%
  • Italian: 2%
  • Dutch: 2%
  • Chinese: 2%
  • Korean: 1%
  • Portuguese: 1%
  • Russian: 1%
  • Polish: 1%

Source: OCLC, 2002
12
Web Users and Pages (10 years ago)
Challenge of scalability!
Total users: 800M. Chinese users: 110M, including
87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M
(SG), 1.5M (US), and others. Source: Global
Reach, 2004
13
Number of Chinese Web Pages
10,030,000,000 pages
Scalability Problem !
14
Number of Web Pages
The world's largest search engine?
Billions of Textual Documents Indexed, December
1995-September 2003
Search Engine   Reported Size            Page Depth
Google          8.1 billion              101K
MSN             5.0 billion              150K
Yahoo           4.2 billion (estimate)   500K
Ask Jeeves      2.5 billion              101K
KEY: GG = Google, ATW = AllTheWeb, INK = Inktomi,
TMA = Teoma, AV = AltaVista. Source: Search Engine
Watch (Nov. 2004)
15
Number of Web Pages
  • Estimated size
  • Web pages in the world: 19.2 billion pages
    (indexed by Yahoo as of August 2005)
  • Websites in the world: 70,392,567 websites
    (indexed by Netcraft as of August 2005)
  • Web pages per website: 273 (19.2 billion / 70,392,567,
    rounded to the nearest whole number)
  • Updated estimate
  • 231,510,169 distinct websites (as found by the
    Netcraft Web Server Survey in April 2009)
  • 63.2 billion pages (231,510,169 websites x 273
    pages per website)

Source: http://news.netcraft.com/archives/web_server_survey.html
Source: http://www.boutell.com/newfaq/misc/sizeofweb.html
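
The arithmetic behind the updated estimate can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted on the slide; it assumes, as the slide implicitly does, that the 2005 pages-per-website ratio still held in 2009. The variable names are illustrative.

```python
# Figures quoted on the slide (Yahoo and Netcraft, August 2005; Netcraft, April 2009).
PAGES_2005 = 19_200_000_000
SITES_2005 = 70_392_567
SITES_2009 = 231_510_169

# Average pages per website in 2005, rounded to the nearest whole number.
pages_per_site = round(PAGES_2005 / SITES_2005)

# Extrapolation: assumes the 2005 ratio is unchanged in 2009 (a strong assumption).
pages_2009 = SITES_2009 * pages_per_site

print(pages_per_site)                        # 273
print(f"{pages_2009 / 1e9:.1f} billion")     # 63.2 billion
```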
16
Number of Web Pages
  • 1 trillion unique URLs ("We knew the web was big",
    by Jesse Alpert and Nissan Hajaj, Software
    Engineers, Web Search Infrastructure Team, 25
    July 2008)
  • 19,200,000,000 pages (Mayer, Tim, 8 August 2005,
    "Our Blog is Growing Up And So Has Our Index")
  • 320,000,000 pages ("World Wide Web is 320 million
    and growing", BBC News Sci/Tech, 3 April 1998)
  • 1,000,000,000 pages (Internet. How much
    information? 2000. Regents of the University of
    California.)
  • 800,000,000 pages (Maran, Ruth, and Paul
    Whitehead. "Web Pages." Internet and World Wide
    Web Simplified, 3rd ed. Foster City: IDG Books
    Worldwide, 1999.)
  • 8,034,000,000 pages (Miller, Colleen. web sites
    number of pages. NEC Research, IDC.)

Source: http://hypertextbook.com/facts/2007/LorantLee.shtml
17
Challenge of Cross-Language Web Search
  • Existing CLIR systems mostly rely on bilingual
    dictionaries and dictionary lookup
  • 81% of the search terms could not be obtained
    from common English-Chinese translation
    dictionaries

????? (CPU), ???? (E-commerce), ??????(PDA), ??
(Yahoo), ???? (NASA), ???? (Star War), ?????
(SARS),
18
Challenge
  • Existing CLIR systems mostly rely on bilingual
    dictionaries and dictionary lookup
  • 81% of the search requests could not be obtained
    from common English-Chinese translation
    dictionaries
  • How to find effective translations automatically
    for query terms not included in a dictionary ?
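
The dictionary-lookup limitation described above can be illustrated with a short sketch. The toy dictionary, the function name `translate_query`, and the sample query are assumptions made for illustration; they do not come from the slides or from any particular CLIR system.

```python
# Toy English-Chinese dictionary; the entries are illustrative assumptions.
DICTIONARY = {
    "computer": ["電腦"],
    "library": ["圖書館"],
    "museum": ["博物館"],
}

def translate_query(terms, dictionary):
    """Split query terms into dictionary translations and out-of-dictionary terms."""
    translated, unknown = {}, []
    for term in terms:
        key = term.lower()
        if key in dictionary:
            translated[term] = dictionary[key]
        else:
            # Proper nouns and new terminology typically end up here.
            unknown.append(term)
    return translated, unknown

translated, unknown = translate_query(["museum", "SARS", "Yahoo"], DICTIONARY)
print(translated)
print(unknown)
```

Terms that land in `unknown` are the ones a Web-based translation method would need to handle.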

19
Query Translation: CLIR in DL
[Diagram: Chinese Query; Mono-Lingual Document Search; Chinese Digital Libraries]
Possible global use
20
Query Translation: CLIR in DL
[Diagram: Chinese Query; Mono-Lingual Document Search; Chinese Digital Libraries]
Need for CLIR services
21
Query Translation: CLIR in DL
[Diagram: Chinese Query; Query Translation; Mono-Lingual Document Search; Chinese Digital Libraries]
22
Query Translation: CLIR in DL
[Diagram: Chinese Query; Query Translation; Mono-Lingual Document Search; Chinese Digital Libraries]
Cost-ineffective to construct translation
dictionaries
23
Query Translation: CLIR in DL
[Diagram: Chinese Query; Query Translation; Web; Mono-Lingual Document Search; Chinese Digital Libraries]
Taking the Web as online corpus to deal with
translation of unknown terms
24
Query Translation: CLIR in DL
[Diagram: English Query (National Palace Museum); Query Translation; Web; Chinese Query; Mono-Lingual Document Search; Chinese Digital Libraries]
Online Term Translation Suggestions
25
Query Translation: CLIR in DL
[Diagram: English/Japanese/Korean Queries; Query Translation; Web; Chinese Query; Mono-Lingual Document Search; Chinese Digital Libraries]
Auto-generated Translation Lexicons
26
CLIR
  • Conventional approach to query translation
  • Parallel documents as the corpus
  • Assume long queries
  • Problems of CLIR in digital libraries
  • No corpus for cross-lingual training
  • Short queries
  • Out-of-dictionary terms
  • E.g., proper nouns, new terminology, ...

English Terminologies Chinese Translation
mechanical strain ????
viscous damping ????
Richard Feynman ??
Hypoplastic Left Heart Syndrome ?????????
NII Japan ????????
SARS ??????????
Extracorporeal Shock Wave Lithotripsy ????
Davinci ???
27
Translation Lexicon Construction for CLIR
  • To use the Web as the corpus for query
    translation
  • Web mining techniques
  • Anchor-text-based [ACM TOIS '04, ACM TALIP '02]
  • Search-result-based [JCDL '04]
  • To extract terms from real document collections
    as possible queries
  • Term extraction method [SIGIR '97]

28
Web Mining Approach to Term Translation Extraction
The Web
Source query
Anchor texts
Academia Sinica
LiveTrans Engine
Search results
Target translations
?????/???
  • LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/

29
National Palace Museum vs. ?????: Search-Result Page
Noises
  • Mixed-language characteristic in Chinese pages
  • How to extract translation candidates?
  • Which candidates to choose?

30
Yahoo vs. ?? -- Anchor-Text Set
  • Anchor text (link text)
  • The descriptive text of a link on a Web page
  • Anchor-text set
  • A set of anchor texts pointing to the same page
    (URL)
  • Multilingual translations
  • Yahoo/??/??
  • America/??/????
  • Anchor-text-set corpus
  • A collection of anchor-text sets

??-USA
Korea
Yahoo Search Engine
Yahoo! America
?????Yahoo!
http://www.yahoo.com
????
??????
Japan
Taiwan
China
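
As a rough illustration of the anchor-text-set idea on this slide, the sketch below groups anchor texts by the URL they point to and scores a candidate translation pair by how often both terms label the same page. The Dice coefficient is a simple stand-in chosen for this example, not the similarity model of the cited papers. The link data, the `.example` URL, and the function names are assumptions.

```python
from collections import defaultdict

# (anchor text, target URL) pairs; toy data assumed for illustration.
LINKS = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("Yahoo", "http://tw.yahoo.example"),
    ("雅虎", "http://tw.yahoo.example"),
    ("Google", "http://www.google.com"),
]

def build_anchor_text_sets(links):
    """Anchor-text set: all anchor texts pointing to the same URL."""
    sets = defaultdict(set)
    for text, url in links:
        sets[url].add(text)
    return dict(sets)

def dice(term_a, term_b, anchor_sets):
    """Dice coefficient over the anchor-text sets that contain each term."""
    has_a = {url for url, texts in anchor_sets.items() if term_a in texts}
    has_b = {url for url, texts in anchor_sets.items() if term_b in texts}
    if not has_a or not has_b:
        return 0.0
    return 2 * len(has_a & has_b) / (len(has_a) + len(has_b))

corpus = build_anchor_text_sets(LINKS)
print(dice("Yahoo", "雅虎", corpus))
print(dice("Yahoo", "Google", corpus))
```

Terms that repeatedly co-occur in the same anchor-text sets score high and become translation candidates; unrelated terms score zero.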
31
Term Translation Extraction from Different
Resources
WebSpider
Term Extraction
Search Engine
Similarity Estimation
Source Query
Target Translation
National Palace Museum
???????, ??, ?????
32
LiveTrans Cross-language Web Search
33
More Examples
34
More Examples
35
Multimedia IR
  • Different forms of information need
  • Image retrieval
  • Speech information retrieval
  • Music information retrieval
  • Video information retrieval

36
Image Retrieval
  • Content-based
  • Query by image content
  • Query by example (????)
  • Similarity in visual features
  • Color, texture, shape,
  • Relevance feedback
  • Text-based
  • Annotation

37
Content-Based Image Retrieval (CBIR)
  • Example systems
  • CIRES (Content-based Image Retrieval System)
    http://amazon.ece.utexas.edu/qasim/research.htm
  • SIMPLIcity: http://www-db.stanford.edu/IMAGE/
  • National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid-1

38
Relevance Feedback (RF)
Source: Dr. Cheng
Image
Similar images (no RF)
39
Similar Images Using Relevance Feedback
Image
Similar images using RF
40
Automatic Image Annotation
Problem 1
Keywords?
Visual Similarity
polar bear ice snow
white bear snow tundra
polar bears snow fight
Image Banks with Annotations
41
Spoken Document Retrieval
  • Spoken document retrieval
  • Indexing speech messages using speech recognition
  • Retrieving relevant messages for a text/speech
    query
  • Techniques
  • Document processing: acoustic change detection,
    speech/non-speech detection, Mandarin/non-Mandarin
    detection, story segmentation, speaker
    recognition/clustering
  • Speech Recognition
  • Indexing/Retrieval

42
SoVideo
43
Music Information Retrieval
  • Finding a song by similar melody
  • Query by singing
  • Query by humming
  • Singer identification
  • Background noise
  • Singer voice model

44
Video Information Retrieval
  • Difference with CBIR
  • Temporal information
  • Structural organization
  • Complexity of querying system
  • Techniques
  • Video segmentation
  • Keyframe identification

45
Semantic Retrieval
  • HTML vs. XML
  • Semantic Web (Agent, Ontology, RDF)

46
Common Language of the Web
  • HTML
  • Link: Pi → Pj
  • URL (URI), anchor text
  • Part-of

National Taiwan University
http//www.ntu.edu.tw/
NTU
47
Link Analysis: Hubs & Authorities in PageRank
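
Since this slide survives only as a title, here is a minimal sketch of the standard PageRank power iteration for readers who want something concrete. It covers PageRank only, not hub and authority scores. The three-page graph, the damping factor of 0.85, and the iteration count are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Toy graph: A and B both link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ranks highest because it has two in-links, and B ranks lowest because nothing links to it.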
48
Current Web Search
  • Keyword-based search (e.g., Google)
  • Full text indexing
  • Page authority (link analysis)
  • Page popularity (query logs and user clicks)
  • Problems
  • Not specific
  • Data in pages have no semantic annotations
  • Yo-Yo Ma's most recent CD
  • No topic disambiguation
  • Documents with different topics mix together
  • Yo-Yo Ma's CDs, concerts, biography, gossip, ...

49
Search on Semantic Web
  • Metadata search
  • To increase precision and flexibility
  • Topic-based search
  • To help contextualize queries and overlay results
    in terms of a knowledge base

50
XML (Extensible Markup Language)
  • More flexible tags
  • DTD (Document Type Definition)
  • Definition of the tags
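
To show what descriptive tags make possible, the sketch below answers a metadata query in the spirit of the earlier "most recent CD" example using Python's standard `xml.etree.ElementTree` module. The tag names, album titles, and years are made up for illustration; no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Made-up catalog; tag names, titles, and years are illustrative assumptions.
CATALOG_XML = """
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><year>2005</year></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><year>2008</year></cd>
  <cd><artist>Another Artist</artist><title>Album C</title><year>2009</year></cd>
</catalog>
"""

def most_recent_cd(xml_text, artist):
    """Return the title of the newest <cd> whose <artist> matches, or None."""
    root = ET.fromstring(xml_text)
    cds = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not cds:
        return None
    newest = max(cds, key=lambda cd: int(cd.findtext("year")))
    return newest.findtext("title")

print(most_recent_cd(CATALOG_XML, "Yo-Yo Ma"))
```

Because artist and year are explicit elements, the query needs no keyword matching or guessing about which numbers in the text are dates.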

51
XML Search
  • XML Text Search Engines
  • Amberfish (Etymon)
  • X3 (X-cubed) (DocSoft)
  • UltraSeek (Verity)
  • XML Structured Query Engines
  • Fxgrep
  • Cheshire II (UC Berkeley)
  • XML Query Languages
  • XQuery (W3C XMLQuery)
  • XQL
  • XML-QL

52
Semantic Web
  • "The Semantic Web is an extension of the current
    Web in which information is given well-defined
    meaning, better enabling computers and people to
    work in cooperation." -- Tim Berners-Lee, James
    Hendler, Ora Lassila, The Semantic Web,
    Scientific American, May 2001

53
Semantic Web
Agent
Agent
RDF
ontology
Agent
54
Semantic Web
  • RDF (Resource Description Framework)
  • Common language
  • Ontology
  • Knowledge representation
  • Agent

55
Why Semantic Web?
  • Standardizing knowledge sharing and reusability
    on the Web
  • Interoperable (independent of devices and
    platforms)
  • Machine readable: enabling intelligent processing
    of information

56
An Example of Semantic Relation
author
work
written by
publisher
publish
57
What is a Software Agent?
  • A paradigm shift of information utilization from
    direct manipulation to indirect access and
    delegation
  • A kind of middleware between information demand
    (client) and information supply (server)
  • Software that has autonomous, personalized,
    adaptive, mobile, communicative, social, and
    decision-making abilities

58
What is Ontology?
  • An ontology is a formal and explicit
    specification of a shared conceptualization of a
    domain of interest (T. Gruber)
  • Formal semantics
  • Consensus of terms
  • Machine readable and processable
  • Model of real world
  • Domain specific

59
What is Ontology?(2)
  • Generalization of
  • Entity relationship diagrams
  • Object database schemas
  • Taxonomies
  • Thesauri
  • Conceptualization contains phenomena like
  • Concepts/classes/frames/entity types
  • Constraints
  • Axioms, rules

60
Agents and Ontology
  • Agents must have domain knowledge to solve
    domain-specific problems
  • Agents must have common sharable ontology to
    communicate and share knowledge with each other
  • The common sharable ontology must be represented
    in a standard format so that all software agents
    can understand and communicate

61
Agents and Semantic Web
  • Semantic Web provides the structure for
    meaningful content of Web pages, so that software
    agents roaming from page to page will carry out
    sophisticated tasks
  • An agent coming to a clinic's web page will know
    that Dr. Henry works at the clinic on Monday,
    Wednesday and Friday without having the full
    intelligence to understand the text
  • The assumption is that Dr. Henry made the page using
    an off-the-shelf tool, as well as the resources
    listed on the Physical Therapy Association's site

62
Knowledge Representation on the Web
  • The challenge of the Web is to provide a language
    that expresses both data and rules for reasoning
    about the data, and that allows rules from
    any existing knowledge representation system to
    be exported onto the Web
  • Adding logic to the Web means using rules to
    make inferences, choose actions and answer
    questions. The logic must be powerful enough to
    be useful, but not so powerful that agents can
    be tricked into considering a paradox

63
Language Layers on the Web
Trust
DAML-L (logic)
Declarative Languages: OIL, DAML-Ont
PICS
DC
XHTML SMIL
RDF
XML
HTML
Semantic Web infrastructure is built on the RDF data
model
64
Languages on the Web
  • HTML: URL
  • XML: DTD (Document Type Definition)
  • RDF: RDF Schema

65
Statements: RDF
  • The basic structure of RDF is object-attribute-value
  • In terms of a labeled graph: O -A-> V

[Figure: node O connected to node V by an edge labeled A]
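
The object-attribute-value structure can be sketched with plain Python tuples. This is only an illustration of the data model, not an RDF library or a serialization format. The `example.org` URIs, attribute names, and the helper `values_of` are assumptions, loosely following the author/work/publisher relations from the earlier example slide.

```python
# Each statement is an (object, attribute, value) triple, in the slide's terms.
# All URIs and literal values below are made up for illustration.
statements = [
    ("http://example.org/work1", "written_by", "http://example.org/author1"),
    ("http://example.org/work1", "published_by", "http://example.org/publisher1"),
    ("http://example.org/author1", "name", "A. Author"),
]

def values_of(obj, attribute, triples):
    """Follow the edge labeled `attribute` out of node `obj` in the graph."""
    return [v for o, a, v in triples if o == obj and a == attribute]

# A value can itself be the object of another statement, which forms a graph.
author = values_of("http://example.org/work1", "written_by", statements)[0]
print(values_of(author, "name", statements))
```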
66
Semantic Web Search Engine
  • Swoogle: http://swoogle.umbc.edu/ [CIKM 2004]
  • SHOE (Simple HTML Ontology Extensions):
    http://www.cs.umd.edu/projects/plus/SHOE/search/
  • SWSE: http://www.swse.org/
  • http://www.semanticwebsearch.com/

67
Applications to IR
  • Advanced Google
  • Meta Search
  • Search Result Clustering

68
What do Users Really Want?
  • Topic-based vs. keyword-based
  • NTU
  • How to improve current search engines?
  • Resources about Search Engines
  • Search Engine Watch: http://searchenginewatch.com/
  • Research Buzz: http://researchbuzz.com/

69
Advanced Google
  • Is Google good enough?
  • NTU
  • NTU university
  • NTU university Singapore
  • More and more Services
  • Google Web, Image, News, Video, Google Desktop
    Search ,
  • Google Groups, Gmail, Google Talk, Google
    Calendar,
  • Google Mobile, Google SMS, Google Local,
  • Google Print (Book Search), Google Maps, Google
    Earth,
  • Google Scholar, Translate, Finance, Docs, Reader,
  • More about Google Services
  • http://www.google.com/options/
  • Google Labs: http://labs.google.com/

70
More Types of Document Search
  • Google Web, Image, News, Groups, Desktop
    (Office, mail),
  • Microsoft Lookout (mail)
  • Yahoo Stata (mail), Adobe (PDF)

71
Searching Different Media
  • Multimedia Search MP3, Blog, messenger, mobile,
  • Baidu.com MP3, image, news,
  • Singingfish.com (AOL) audio/video,
  • GoFish.com audio, video, mobile, games
  • AllTheWeb.com pictures, audio, video,
  • Blog search engines
  • Daypop, Bloogz, Waypath,
  • A9.com (by Amazon)
  • Books, movies,
  • Bookmark, history, discover, diary
  • Mobissimo.com
  • Airfare search, hotel search
  • Yahoo-OCLC toolbar library search
  • Searching Open WorldCat (OCLC union catalog)

72
Different Forms of Presentation
  • Clusty.com (by Vivisimo)
  • Clustering engine
  • Snap.com (by Idealab)
  • Sorting by popularity, satisfaction, Web
    popularity, Web satisfaction, domain,
  • Alexa.com (by Amazon)
  • Average user review ratings,
  • Visualization
  • TouchGraph Google Browser: http://www.touchgraph.com/TGGoogleBrowser.html
  • Kartoo.com a visual meta search engine
  • Girafa
  • ConceptSpace
  • LostGoggles (formerly MoreGoogle) thumbnail
    preview

73
Focused Search Engines
  • Scirus: http://scirus.landingzone.nl
  • For scientific information only
  • Google Scholar: http://scholar.google.com/
  • For scholarly literature

74
Some Google Hacks and Searching Tricks
  • References
  • Tara Calishain and Rael Dornfest, Google Hacks,
    O'Reilly
  • Kevin Hemenway and Tara Calishain, Spidering
    Hacks, O'Reilly
  • http://douweosinga.com/projects/googlehacks
  • Tara Calishain, Web Search Garage, Prentice
    Hall
  • Chris Sherman, Google Power: Unleash the Full
    Potential of Google, McGraw Hill

75
Further Utilizing Google
  • Google API: http://www.google.com/apis/
  • 1,000 automated queries per day
  • Google Hacks
  • Google Talk
  • Word Color
  • Google Battle
  • Google Date
  • Google Best Time to Visit
  • Google Protocol

76
Meta (Federated) Search
  • To search several individual search engines and
    their databases of web pages simultaneously
  • Ixquick, Metacrawler, Dogpile,
  • Clustering meta-searchers
  • Vivisimo, KillerInfo,
  • Meta-search engines for deep digging
  • SurfWax, Copernic Agent,

77
Meta Search Engine
Web
SE1
MetaSearchEngine
SE2
User
SEn
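
One way a meta search engine might merge the ranked lists returned by SE1 ... SEn is sketched below. The slides do not say which merging strategy is intended; reciprocal rank fusion is used here as one simple, well-known option. The result lists are stand-ins for real engine responses, and the constant `k=60` is a conventional choice, not a value from the slides.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists; each hit contributes 1 / (k + rank) to its URL."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Stand-ins for the ranked results of three underlying search engines.
se1 = ["a.example", "b.example", "c.example"]
se2 = ["b.example", "a.example", "d.example"]
se3 = ["b.example", "c.example"]

merged = reciprocal_rank_fusion([se1, se2, se3])
print(merged)
```

Pages returned near the top by several engines rise in the merged list, and duplicates are collapsed into a single entry.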
78
Search Result Clustering
  • Why search result clustering?
  • Why is SRC different from document clustering?
  • In assessment of algorithm quality
  • Precision, recall vs. user-oriented, subjective
    assessment

79
Example of Search Result Clustering
National Taiwan University
NTU Hospital
NTU?
Nanyang Technological University, Singapore
80
Example Clustering Search Engines
  • Vivisimo.com
  • Clusty.com
  • WebClust.com
  • KillerInfo.com
  • InfoNetWare.com
  • SnakeT (Snippet Aggregation for Knowledge
    ExTraction): http://roquefort.unipi.it/
  • A hierarchical clustering engine for snippets
  • Mooter.com

81
Example on Vivisimo
82
Vivisimo (cont.)
83
Clusty.com
84
InfoNetWare.com
85
Thanks for Your Attention!