Finding Authoritative People from the Web

1
Finding Authoritative People from the Web
  • Masanori Harada, Shin-ya Sato, Kazuhiro Kazama
  • {harada, sato, kazama}@ingrid.org
  • NTT Network Innovation Labs.

2
Contents
  • Motivation
      • Why study finding people?
      • Examples
  • Approach
      • Extract personal names on the web
      • Find relevant people using a search engine
  • Results
      • Performance evaluation
  • Summary and future plans

3
Background
  • As the web is connected to the real world, we
    can
  • Find real-world things by searching the web.
  • Understand the real world by investigating the
    web (and vice versa).

[Figure: the real world and the web, linked by connections; searching operates on the web]

4
Objective
  • Find authoritative people for all sorts of
    topics by extending a web search engine
  • Why find people?
  • Once people have been found, many other things
    (e.g. books) can be retrieved using digital
    libraries
  • What is authoritative?
  • People mentioned in many web pages with regard to
    a queried topic

5
Screenshot
[Screenshot: search results showing relevant personal names, relationships, and relevant web pages]

6
Example (1): subject to people
  • Query: "digital libraries" (1,007 pages)
  • Possible application: book finder
  • Using library catalogs, it could suggest relevant
    books written by these authoritative people
  • Shigeo Sugimoto (Univ. of Library and Information
    Science)
  • Koichi Tabata (Univ. of Library and Information
    Science)
  • Jun Adachi (National Institute of Informatics)
  • Takeo Yamamoto (National Institute of Informatics)
  • Hiroyuki Taya (National Diet Library)

7
Example (2): thing to people
  • Query: "Spirited Away" (35,936 pages)
  • Possible application: movie recommender
  • Using movie databases, it could suggest movies
    which share key people for any queried topic
  • Hayao Miyazaki (director)
  • Bunta Sugawara (voice actor)
  • Mari Natsuki (voice actress)
  • Yumi Kimura (singer of theme song)
  • Joe Hisaishi (composer)

8
Example (3): person to people
  • Query: "Masanori Harada" (205 pages)
  • Possible application: social networking
  • Unlike social networking services such as Orkut,
    there is no need to enter relationships manually
  • Masanori Harada (me)
  • Shin-ya Sato (co-author of this paper)
  • Kazuhiro Kazama (co-author of this paper)
  • Kent Tamura (web search researcher)
  • Isao Asai (web search researcher)

9
NEXAS
  • Named entity extraction and association search
  • Associate an entity and a web page by extracting
    names identifying the entity.
  • Find entities associated with top-ranked web
    pages retrieved for a query.

[Figure: web search results linked to entities A (more relevant, authoritative), B (less relevant), and C (irrelevant)]

10
Extracting personal names
  • Web data
  • 52 million Japanese web pages collected in July
    2003
  • Japanese personal name extractor
  • Extracts only full names
  • Assumes a full name can identify a person
  • Large name dictionaries enable accurate extraction
  • Precision 93.5%, recall 85.3%

11
Personal names on the web
  • Personal names appear frequently on the web
  • 6.6M unique names extracted from 52M web pages
  • 1/4 of web pages contain full names
  • Celebrities appear >10 thousand times
  • singers, actors, sports stars, novelists,
    politicians, etc.
  • Most names appear only a few times
  • Name frequency indicates popularity
  • But the number of pages is easily affected by
    automatically generated text and spam
  • The number of servers is less affected
  • Authoritative people are mentioned on many servers
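
The difference between counting pages and counting servers can be illustrated with a minimal sketch. The URLs, names, and the `name_frequencies` helper below are hypothetical and are not taken from the presentation; the host part of each URL stands in for "server".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def name_frequencies(pages):
    """Return {name: (page_count, server_count)}.

    pages: iterable of (url, names) pairs, where names are the full
    names extracted from that page (hypothetical input format).
    """
    page_urls = defaultdict(set)
    servers = defaultdict(set)
    for url, names in pages:
        host = urlsplit(url).hostname  # the host stands in for the server
        for name in set(names):        # count each name once per page
            page_urls[name].add(url)
            servers[name].add(host)
    return {n: (len(page_urls[n]), len(servers[n])) for n in page_urls}


# One server repeats the same name on generated pages; the other name
# is mentioned on fewer pages but on two independent servers.
pages = [
    ("http://spam.example/p1", ["Taro Yamada"]),
    ("http://spam.example/p2", ["Taro Yamada"]),
    ("http://spam.example/p3", ["Taro Yamada"]),
    ("http://a.example/x", ["Hanako Suzuki"]),
    ("http://b.example/y", ["Hanako Suzuki"]),
]
freq = name_frequencies(pages)
```

Here the first name wins on page count but loses on server count, which is the effect the slide describes for generated text and spam.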

12
Procedure for finding people
  • 1. Find web pages using a full-text search engine
  • 2. List personal names extracted from top T
    relevant and authoritative web pages
  • 3. Calculate relevance scores and output top k
    relevant people, who
  • Appear frequently on top-ranked web pages
  • Do not appear frequently on irrelevant web pages
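
The three steps can be sketched roughly as follows. This is only an outline under stated assumptions: the search engine is replaced by an already ranked list of page ids, the extracted names are given as a dictionary, and `find_people` and its parameters are hypothetical names. The default score is a plain count, so the "do not appear on irrelevant pages" criterion is left to a pluggable `score` function.

```python
from collections import Counter


def find_people(ranked_pages, names_on_page, T=1000, k=10, score=None):
    """Return the top k names found on the top T ranked pages.

    ranked_pages: page ids in search-rank order (output of step 1).
    names_on_page: dict mapping page id -> set of extracted full names.
    score: optional function (name, count_in_top_T) -> float.
    """
    top = ranked_pages[:T]                                  # step 2
    counts = Counter(n for p in top for n in names_on_page.get(p, ()))
    if score is None:
        score = lambda name, count: count                   # plain count
    # Step 3: highest score first; ties broken alphabetically.
    ranked = sorted(counts, key=lambda n: (-score(n, counts[n]), n))
    return ranked[:k]


ranked_pages = ["p1", "p2", "p3", "p4"]
names_on_page = {"p1": {"A", "B"}, "p2": {"A"}, "p3": {"A", "C"}, "p4": {"D"}}
top_people = find_people(ranked_pages, names_on_page, T=3, k=2)
```

With `T=3`, page `p4` is never inspected, so name `D` cannot appear in the result.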

13
Calculating relevance score
  • Scoring functions: df, sf, dfidf, and sfidf
  • df: document frequency within top T pages
  • sf: server frequency within top T pages
    + 0.01 df
  • idf = log(N / fp)
  • N: number of pages in the collection
  • fp: document frequency of the person in the
    collection
  • sf alleviates the effects of generated text
    and spam
  • idf is a weight for moderating generally famous
    people
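
A minimal sketch of the four scoring functions, under several assumptions that the slide does not state explicitly: dfidf and sfidf are taken to be the products df × idf and sf × idf, the "0.01 df" term is read as a small addition to the server frequency, and the logarithm is natural. The function name and input format are hypothetical.

```python
import math


def relevance_scores(top_pages, N, collection_df):
    """Compute df, sf, dfidf, and sfidf for each name.

    top_pages: list of (server, names) for the top T pages.
    N: number of pages in the whole collection.
    collection_df: dict name -> document frequency in the collection (fp).
    """
    df, servers = {}, {}
    for server, names in top_pages:
        for name in set(names):
            df[name] = df.get(name, 0) + 1
            servers.setdefault(name, set()).add(server)
    scores = {}
    for name in df:
        # Assumed reading of the slide: server count plus 0.01 * df.
        sf = len(servers[name]) + 0.01 * df[name]
        idf = math.log(N / collection_df[name])  # natural log assumed
        scores[name] = {
            "df": df[name],
            "sf": sf,
            "dfidf": df[name] * idf,   # assumed product form
            "sfidf": sf * idf,         # assumed product form
        }
    return scores


# "Famous" is common in the whole collection and appears on one server;
# "Expert" is rare overall and appears on two servers.
top_pages = [("s1", ["Famous", "Expert"]), ("s1", ["Famous"]), ("s2", ["Expert"])]
result = relevance_scores(top_pages, N=1000, collection_df={"Famous": 500, "Expert": 10})
```

Both names have the same df here, yet sfidf ranks the rare, multi-server name higher, which is the moderating effect the idf weight is meant to have.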

14
Performance evaluation
  • Compared 4 scoring functions for varying T
  • Precision: average of scores of top k people
  • Each person was judged relevant (score 1) or not
    (score 0) by searching for the personal name
    on Google
  • 45 simple topics were used
  • 15 musical instruments: players or not?
  • 15 sports: players or not?
  • 15 information technologies: experts or not?
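
The precision measure can be sketched as below. The function names and the sample judgments are hypothetical; dividing by the number of people actually returned when fewer than k are available is one possible choice, not something the slide specifies.

```python
def precision_at_k(ranked_people, relevant, k):
    """Average of the 0/1 relevance scores of the top k people."""
    top = ranked_people[:k]
    if not top:
        return 0.0
    return sum(1 for person in top if person in relevant) / len(top)


def mean_precision(results, k):
    """Average precision_at_k over topics.

    results: dict topic -> (ranked_people, set of relevant people).
    """
    values = [precision_at_k(ranked, rel, k) for ranked, rel in results.values()]
    return sum(values) / len(values)


# Hypothetical judgments for two topics.
results = {
    "violin": (["a", "b", "c", "d"], {"a", "c"}),
    "soccer": (["e", "f", "g", "h"], {"e", "f", "g", "h"}),
}
p_violin = precision_at_k(*results["violin"], k=4)
p_mean = mean_precision(results, k=4)
```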

15
Precision of scoring functions
  • sf is very effective
  • idf is fairly effective

[Figure: precision vs. number of people evaluated, for sfidf, sf, dfidf, and df]

16
Precision with varying T
  • More web pages do not necessarily yield better
    results

[Figure: precision vs. number of people evaluated for sfidf, with T = 100, 200, 500, 1000, 2000, and 5000]

17
Future work
  • Apply to other languages, especially English
  • Can we distinguish different John Smiths?
  • Find books, companies, shops, etc.
  • By extracting ISBNs, domain names, places, etc.
  • Analyze co-occurrences as a social network

An online demo is available at http://valhalla.ingrid.org:28080/
throughout the conference.
Japanese fonts and the Java 2 plug-in are required.

18
Precision vs. result size
  • Too-specific or too-general topics are difficult.

[Figure: precision of top 10 people vs. number of pages retrieved for a topic, for sfidf with T = 1000; labeled points include "databases" and "compiler theory"]

19
Popular Japanese names
  • singers, actors, sports stars, novelists,
    politicians, etc.

[Table: top 10 most frequent names]

20
Related work
  • ReferralWeb [Kautz 1997]
  • Finds experts around the user by searching the
    web
  • Tested only with computer science topics
  • Web question answering [Kwok 2001; Brill 2001]
  • Retrieves one exact answer to a long, complete
    natural language question
  • Our contributions
  • Observed the distribution of personal names on
    the web
  • Extended a web search engine so that it
    accurately finds relevant people for all sorts of
    queries.

21
Common failures
  • Too specific topics
  • Too general topics
  • Name extraction errors
  • Falsely extracted non-names
  • Missed (not extracted) names
  • Historical/Fictional characters
  • Celebrities
  • Popular names often appear without regard to a
    specific topic

22
Numbers
  • Number of web pages: 52,302,805
  • Number of web servers: 664,139
  • Number of pages with names: 13,922,012
  • Number of name occurrences: 117,091,977
  • Number of unique names: 6,161,805
  • Total size of web pages: 450 GB
  • Size of inverted index: 113 GB
  • Size of dictionaries:
      • Family names: 21,141
      • Personal names: 12,130
      • Full names: 19,675
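
A few ratios follow directly from these counts. The short calculation below uses only the figures on this slide; the variable names are illustrative.

```python
pages_total = 52_302_805
pages_with_names = 13_922_012
name_occurrences = 117_091_977
servers = 664_139

# Share of pages containing at least one extracted full name.
share_with_names = pages_with_names / pages_total
# Average number of name occurrences on a page that has names.
names_per_named_page = name_occurrences / pages_with_names
# Average number of collected pages per web server.
pages_per_server = pages_total / servers

print(f"pages with names: {share_with_names:.1%}")
print(f"name occurrences per page with names: {names_per_named_page:.1f}")
print(f"pages per server: {pages_per_server:.1f}")
```

The first ratio comes out a little above one quarter, in line with the earlier statement that about 1/4 of web pages contain full names.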

23
Topics for the experiment
  • Musical instruments
  • violin, cello, trumpet, clarinet, harp,
    percussion, synthesizer, ocarina, accordion,
    contrabass, pipe organ, and marimba
  • Sports
  • soccer, baseball, marathon, swimming, rugby,
    volleyball, basketball, boxing, badminton, ice
    hockey, speed skating, fencing, lacrosse, pole
    vault, and discus throw
  • Information technology terms
  • databases, java, information retrieval, XML,
    IPv6, speech recognition, P2P, data mining,
    machine translation, complexity theory, web
    search engines, probabilistic reasoning,
    simulated annealing, compiler theory, and
    randomized algorithms

24
Name extraction method
  • Procedure
  • 1. Remove HTML tags.
  • 2. Using a morphological analyzer, split each
    sentence into morphemes and assign
    Part-Of-Speech tags.
  • 3. Extract <family name><personal name>
    sequences.
  • Performance improved by enriching dictionaries
  • 17k family names, 12k personal names, 2k popular
    full names: precision 78.4%, recall 75.0%
  • 21k family names, 40k personal names, 19k popular
    full names: precision 93.5%, recall 85.3%
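
The procedure above can be sketched as follows. No real morphological analyzer is called here: step 2 is represented by an already tagged list of morphemes, and the tag names `FAMILY_NAME` and `PERSONAL_NAME` are hypothetical placeholders for whatever part-of-speech tags the analyzer and its name dictionaries produce. The tag stripper is a crude regular expression, not a full HTML parser.

```python
import re


def strip_tags(html):
    """Step 1: crude removal of HTML tags (illustration only)."""
    return re.sub(r"<[^>]+>", " ", html)


def extract_full_names(morphemes):
    """Step 3: collect <family name><personal name> sequences.

    morphemes: list of (surface, pos) pairs as produced by step 2.
    """
    names = []
    i = 0
    while i < len(morphemes) - 1:
        (s1, p1), (s2, p2) = morphemes[i], morphemes[i + 1]
        if p1 == "FAMILY_NAME" and p2 == "PERSONAL_NAME":
            names.append(s1 + s2)  # Japanese full names have no space
            i += 2
        else:
            i += 1
    return names


# Hypothetical analyzer output for a short sentence.
tagged = [
    ("山田", "FAMILY_NAME"),
    ("太郎", "PERSONAL_NAME"),
    ("は", "PARTICLE"),
    ("研究", "NOUN"),
    ("者", "SUFFIX"),
]
found = extract_full_names(tagged)
```

Because only adjacent family-name and personal-name morphemes are accepted, the quality of the result depends on dictionary coverage, which matches the precision and recall gains reported for the larger dictionaries.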

25
The same-name problem
  • Not very serious when we query a subject
  • Different people having the same name rarely
    exist in a specific area
  • Japanese family/personal names are diverse
  • Still, quite a few people share the same name
  • Solutions (under consideration)
  • Classify web pages by topic
  • Analyze a social network around the name
    (Different people have different friends)

26
Popularity and frequency
  • Personal name frequencies indicate web users'
    interest in celebrities.
  • The number of pages is prone to be affected by
    automatically generated pages and spam.
  • The number of servers is a better indicator for
    finding popular people.