Title: Finding Authoritative People from the Web
1Finding Authoritative People from the Web
- Masanori Harada, Shin-ya Sato, Kazuhiro Kazama
- harada,sato,kazama_at_ingrid.org
- NTT Network Innovation Labs.
2Contents
- Motivation
- Why study finding people?
- Examples
- Approach
- Extract personal names on the web
- Find relevant people using a search engine
- Results
- Performance evaluation
- Summary and future plans
3Background
- As the web is connected to the real world, we
can - Find real-world things by searching the web.
- Understand the real world by investigating the
web (and vice versa).
the real world
... connections
the web
searching
4Objective
- Find authoritative people for all sorts of
topics by extending a web search engine - Why find people?
- Once people have been found, many other things
(e.g. books) can be retrieved using digital
libraries - What is authoritative?
- People mentioned in many web pages with regard to
a queried topic
5Screenshot
Relevant personal names
Relationships
Relevant web pages
6Example (1) subject to people
- digital libraries (1007 pages)
- Possible application book finder
- Using library catalogs, it could suggest relevant
books written by these authoritative people
- Shigeo Sugimoto Univ. Library Information
Science - Koichi Tabata Univ. Library Information
Science - Jun Adachi National Institute of Informatics
- Takeo Yamamoto National Institute of Informatics
- Hiroyuki Taya National Diet Library
7Example (2) thing to people
- Spirited Away (35,936 pages)
- Possible application movie recommender
- Using movie databases, it could suggest movies
which share key people for any queried topic
- Hayao Miyazaki director
- Bunta Sugawara voice actor
- Mari Natsuki voice actress
- Yumi Kimura singer of theme song
- Joe Hisaishi composer
8Example (3) person to people
- Masanori Harada (205 pages)
- Possible application social networking
- Unlike social networking services like orkut,
there is no need to enter relationships manually
- Masanori Harada me
- Shin-ya Sato co-author of this paper
- Kazuhiro Kazama co-author of this paper
- Kent Tamura web search researcher
- Isao Asai web search researcher
9NEXAS
- Named entity extraction and association search
- Associate an entity and a web page by extracting
names identifying the entity. - Find entities associated with top-ranked web
pages retrieved for a query.
A
More relevant (authoritative)
More relevant (authoritative)
B
web search
Less relevant
Less relevant
C
Irrelevant
10Extracting personal names
- Web data
- 52 million Japanese web pages collected in July
2003 - Japanese personal name extractor
- Extracts only full names
- Assumes a full name can identify a person
- Big name dictionaries enables accurate extraction
- Precision 93.5, Recall 85.3
11Personal names on the web
- Personal names appear frequently on the web
- 6.6M unique names extracted from 52M web pages
- 1/4 of web pages contain full names
- Celebrities appear gt10 thousand times
- singers, actors, sports stars, novelists,
politicians, etc. - Most names appear only a few times
- Name frequency indicates popularity
- But number of pages is easily affected by
automatically generated texts and spams - Number of servers is less affected
- Authoritative people are mentioned in many servers
12Procedure for finding people
- 1. Find web pages using a full-text search engine
- 2. List personal names extracted from top T
relevant and authoritative web pages - 3. Calculate relevance scores and output top k
relevant people, who - Appear frequently on top-ranked web pages
- Do not appear frequently on irrelevant web pages
13Calculating relevance score
- Scoring functions df, sf, dfidf, and sfidf
- df document frequency within top T pages
- sf server frequency within top T pages
0.01 df - idf log (N / fp )
- N number of pages in the collection
- fp document frequency
-
- sf Alleviates effects of generated texts
and spams - idf Weight for moderating generally famous
person
14Performance evaluation
- Compared 4 scoring functions for varying T
- Precision average of scores of top k people
- Judged if a person was relevant (Score 1) or not
(Score 0) by searching for the personal name
using Google - 45 simple topics were used
- 15 musical instruments players or not?
- 15 sports players or not?
- 15 information technologies experts or not?
15Precision of scoring functions
- sf is very effective
- idf is fairly effective
sfidf sf dfidf df
Precision
Number of people evaluated
16Precision with varying T
- More web pages do not necessary yield better
results
sfidf T 500 T 1000 T 200 T 2000 T 5000 T
100
Precision
Number of people evaluated
17Future work
- Apply for other languages, especially English
- Can we distinguish different John Smith?
- Find books, companies, shops, etc.
- By extracting ISBNs, domain names, places, etc.
- Analyze co-occurrences as a social network
Online demo is available at http//valhalla.ingr
id.org28080/ throughout the conference
Japanese fonts and Java2 plug-in are required
18Precison vs. result size
- Too-specific or too-general topics are difficult.
sfidf T1000
Precision of top 10 people
Number of pages retrieved for a topic
databases
compiler theory
19Popular Japanese names
- singers, actors, sports starts, novelists,
politicians, etc.
Table Top 10 most frequent names
20Related work
- ReferralWeb Kautz 1997
- Finds experts around the user by searching the
web - Tested only with computer science topics
- Web Question Answering Kwok 2001Brill 2001
- Retrieve one exact answer to a long, complete
natural language question - Our contributions
- Observed the distribution of personal names on
the web - Extended a web search engine so that it
accurately finds relevant people for all sorts of
queries.
21Common failures
- Too specific topics
- Too general topics
- Name extraction errors
- Falsely extracted non-names
- Missed (not extracted) names
- Historical/Fictional characters
- Celebrities
- Popular names often appear without regard to a
specific topic
22Numbers
- Number of web pages 52,302,805
- Number of web servers 664,139
- Number of pages w/ names 13,922,012
- Number of name occurrences 117,091,977
- Number of unique names 6,161,805
- Total size of web pages 450GB
- Size of inverted index 113GB
- Size of dictionaries
- Family names 21,141
- Personal names 12,130
- Full names 19,675
23Topics for the experiment
- Musical instruments
- violin, cello, trumpet, clarinet, harp,
percussion, synthesizer, ocarina, accordion,
contrabass, pipe organ, and marimba - Sports
- soccer, baseball, marathon, swimming, rugby,
volleyball, basketball, boxing, badminton, ice
hockey, speed skating, fencing, lacrosse, pole
vault, and discus throw - Information technology terms
- databases, java, information retrieval, XML,
IPv6, speech recognition, P2P, data mining,
machine translation, complexity theory, web
search engines, probabilistic reasoning,
simulated annealing, compiler theory, and
randomized algorithms
24Name extraction method
- Procedure
- 1. Remove HTML tags.
- 2. Using a morphological analyzer, split each
sentence into morphemes and assign
Part-Of-Speech tags. - 3. Extract ltfamily namegtltpersonal namegt
sequences. - Performance improved by enriching dictionaries
- 17k families 12k personals 2k popular full
names Precision 78.4, Recall 75.0 - 21k families 40k personals 19k popular full
names Precision 93.5, Recall 85.3
25The same-name problem
- Not very serious when we query a subject
- Different people having the same name rarely
exist in a specific area - Japanese family/personal names are diverse
- Still, not a few people share the same name
- Solutions (under consideration)
- Classifies web pages by topic
- Analyze a social network around the name
(Different people have different friends)
26Popularity and frequency
- Personal name frequencies indicate web users'
interest in celebrities. - The number of pages is prone to be affected by
automatically generated pages and spams. - The number of servers is better to find popular
people.