Mining Academic Community

1
Mining Academic Community

Jan-Ming Ho
hoho@iis.sinica.edu.tw
Computer System and Communication Lab
Institute of Information Science
Academia Sinica

2
What is Community?

In Graph Theory
Densely connected groups of vertices, with
sparser connections between groups
In Social Network Analysis
groups of entities that share similar properties
or connect to each other via certain relations
A social network is a structure made up of nodes,
representing entities from different conceptual
groups, that are linked with different types of
relations
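
The graph-theoretic definition above can be checked numerically. The following is a minimal sketch, assuming an undirected graph given as an edge list; the function names and the toy graph are illustrative and do not come from the slides.

```python
def edge_density(nodes, edges):
    """Fraction of possible undirected edges inside `nodes` that are present."""
    nodes = set(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(1 for u, v in edges if u in nodes and v in nodes)
    return present / possible


def between_density(group_a, group_b, edges):
    """Fraction of possible edges between two disjoint groups that are present."""
    group_a, group_b = set(group_a), set(group_b)
    possible = len(group_a) * len(group_b)
    if possible == 0:
        return 0.0
    present = sum(
        1 for u, v in edges
        if (u in group_a and v in group_b) or (u in group_b and v in group_a)
    )
    return present / possible


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
group_1 = ["a", "b", "c"]
group_2 = ["d", "e", "f"]

inside = edge_density(group_1, edges)
across = between_density(group_1, group_2, edges)
```

A partition looks like a community structure in this sense when the density inside each group is clearly higher than the density between groups.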

3
Why is Community Important?

Interesting data with community structure
researcher collaboration, friendship networks,
the WWW, massively multiplayer online gaming,
electronic communications.
Groups of web pages that link to more web pages
in the community than to pages outside correspond to
web pages on related topics
Groups in social networks correspond to social
communities, which can be used to understand
organizational structure, academic collaboration,
shared interests and affinities, etc.

4
Motivation

Understand the research network between authors,
conferences and topics (rank entities by
relevance for given entities)
Find and justifiably recommend research
collaborators for given authors
Explore the academic social network
Find the most important papers, researchers and
venues for a given topic

5
Related Systems

Many digital library systems exist
ACM Digital Library
IEEE Xplore
DBLP
CiteSeer
Libra
DBConnect
Problems
The coverage of the datasets is not large enough
The name ambiguity problem exists in
Web pages
Citation records

6
Libra Academic Search

http://libra.msra.cn
Free computer science bibliography search engine
A test-bed for object-level vertical search
research
Currently the following types of paper-related
objects can be searched
Papers, Authors, Conferences, Journals, Research
Communities

9
DBConnect: Conference
10
DBConnect: Topic
11
DBConnect: Author
12
ZoomInfo
(1) People Directory (2) Developer Tools (3)
Social Network, Profile Statistics, Employment
History (4) Ability to resolve ambiguous names? E.g.,
a search returns 21 different people called Bing Liu
13
ArnetMiner
14
Our goal

Developing an automatic system to
Explore the academic social network
Find the most important papers, researchers and
venues for a given topic
Provide solutions for existing problems
Collecting larger citation datasets
Retrieving data from web pages
Publication list finder
Extracting citation strings from web pages
Citation parser
Multilingual data sources
Chinese and English corpora
Name disambiguation mechanism in
Web pages
Citation records

15
Our contributions

Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee,
and Jan-Ming Ho, "Web Appearance Disambiguation
of Personal Names Based on Network Motif," in the
2006 IEEE/WIC/ACM International Conference on Web
Intelligence (WI 2006), Hong Kong, Dec. 18-22,
2006
Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho,
"PLF: A Publication List Web Page Finder for
Researchers," in Proceedings of the 2007
IEEE/WIC/ACM International Conference on Web
Intelligence (WI 2007), Silicon Valley, USA, Nov.
2-5, 2007
Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and
Jan-Ming Ho, "Mining Translations of Chinese Name
from Web Corpora by Using Query Expansion
Technique and Support Vector Machine," in
Proceedings of the 2007 IEEE/WIC/ACM
International Conference on Web Intelligence (WI
2007), Silicon Valley, USA, Nov. 2-5, 2007
Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming
Lee, "AEFS: Authoritative Expert Finding System
Based on a Language Model and Social Network
Analysis," in Proceedings of the 12th Conference
on Artificial Intelligence and Applications
(TAAI2007), Nov 16-17, 2007
Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho,
"BibPro: A Citation Parser Based on Sequence
Alignment Techniques," to appear in Proceedings
of the IEEE 22nd International Conference on
Advanced Information Networking and Applications
(AINA-08)

16
PLF: A Publication List Web Page Finder for
Researchers
17
Agenda

Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work

18
Overview of a Publication List Web Page

Keeps readers abreast of state-of-the-art research.
Contains citations not found elsewhere.
May provide reference materials, such as
slides and talks.
Challenges
How to find the publication list web pages
given only a name.
Various versions or multiple copies
An author may have many affiliations.
Name ambiguity problem
E.g., for Dr. Bing Liu, we found 26 people sharing
the same name by querying ZoomInfo (a people
search engine).

19
Problem
Publication List Web Page?
20
Definition of Publication List
Affiliated Personal Publication List Web Page
(APPL): a web page that belongs to the affiliated web
site of a specific person with the given name.
Affiliation: Institute of Information Science,
Academia Sinica
Citation string
21
Agenda

Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work

22
Process Flow
23
Basic Concept
A publication list web page may
contain many citation strings
24
Agenda

Introduction
Publication List Web Page Finder, PLF
Performance Evaluation
Conclusion, Future Work

25
Dataset

Scenario
Seminar members have usually published major
research works
We randomly collected 200 names from the WWW 06
Conference Committee website

APPL type      APPLs   People   Population
others         0       22       11%
single-group   1       120      60%
multi-group    2       35       17.5%
multi-group    3       16       8%
multi-group    4       7        3.5%
26
Experiment Evaluation

Evaluation metrics
We consider the top-5 results derived by each
link and focus on the top-5 recall metric, which
is calculated by

Top-5 recall = R / Ra

Notation   Definition
Ra         the number of publication list web pages belonging to researchers listed in the dataset
R          the number of publication list web pages contained in the top-5 results
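
The top-5 recall described above can be sketched in a few lines. This is a minimal sketch, assuming recall is R divided by Ra as the two definitions suggest; the data layout (one ranked URL list per researcher) and all names are hypothetical.

```python
def top_k_recall(ranked_results, relevant_pages, k=5):
    """Recall over the top-k results.

    ranked_results: dict mapping researcher name -> ranked list of URLs
    relevant_pages: dict mapping researcher name -> set of true publication
                    list pages (their total size plays the role of Ra)
    """
    total_relevant = sum(len(pages) for pages in relevant_pages.values())  # Ra
    if total_relevant == 0:
        return 0.0
    found = 0  # R: relevant pages that show up in the top-k results
    for name, pages in relevant_pages.items():
        top_k = ranked_results.get(name, [])[:k]
        found += len(set(top_k) & set(pages))
    return found / total_relevant


# Hypothetical toy data: researcher B has a second page ranked sixth,
# so it does not count towards top-5 recall.
relevant = {"A": {"u1"}, "B": {"u2", "u3"}}
ranked = {"A": ["x", "u1"], "B": ["u2", "y", "z", "w", "v", "u3"]}
recall = top_k_recall(ranked, relevant)
```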
27
Parameter Analysis for Single-Group
Parameters (m, n)
(a) Fixed n with varying m
(b) Fixed m with varying n

Figure (a)
When m increases, the recall rate also
increases.
Figure (b)
System performance may be constrained by m.

28
Parameter Analysis for Multi-Group
(a) Fixed n with varying m
(b) Fixed m with varying n

Figure (a)
It is clear that the performance when m = 40 is
always better than with the other settings.
Figure (b)
The best performance (top-5 recall of 70%)
occurs when n = 75.

29
Performance Evaluations
(a) Performance of approaches in single-group
(b) Performance of approaches in multi-group

The parameter m has a strong influence on the
system's performance; for example, an oversized m
may degrade the performance.
The parameter n has little influence on the
system's performance.
The PLF system outperforms the other two
approaches on both the single-group and the
multi-group datasets.

30
Conclusion

We have defined the problem of finding the
publication list web pages of a researcher, and
proposed the PLF system
Ongoing work
Name ambiguity problem
How to merge the multiple publication list web
pages of a specific person into a single page.

31
Discussion: Name Ambiguity Problem

Scenario
We take the name Bing Liu
as an example
Analyze manually
Observation
Citation Count
Name translation problem
Partial matching problem

32
Extracting Citation Strings from Web Pages
33
Extract Citation Records
Extract
Web Page
Structured Data
34
Challenges

The formats of publication list web pages vary
There are no fixed syntactic rules for parsing
citation records
Hence, we cannot apply simple rules to extract
citation records automatically

35
Challenges: Complex Layouts of Publication List
Pages
36
Ideas

The semantic structure of web pages is organized
by visual arrangement.
We can utilize the semi-structured (visual)
information of web pages to help the extraction task.
With its hierarchical structure and geometric
information, the DOM tree is not only a good
structure for representing Web pages, but also very
helpful for visual pattern analysis.

37
DOM Tree Representation of a Web Page
38
Architecture of Citation Extraction System
39
Modules of Citation Extraction System

Common Style Finder
Finds all common style patterns for each level
of granularity in web pages
Citation Extractor
Explores data regions with common style patterns
Distills extraction rules from those data regions
Ranks extraction patterns based on a normal word
count distribution probability
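
The idea of looking for repeated style patterns in the DOM can be illustrated with a simplified stand-in. This is a minimal sketch, assuming that the tag path from the root to a text node is a usable proxy for "style"; the slides refer to visual and geometric information, which this sketch does not model. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Groups text nodes by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop until the matching open tag is removed; ignore stray end tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.text_by_path.setdefault(path, []).append(text)


def most_common_style(html):
    """Return the tag path that carries the most text nodes, with its texts."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    if not collector.text_by_path:
        return None, []
    groups = collector.text_by_path
    path = max(groups, key=lambda p: len(groups[p]))
    return path, groups[path]


# Hypothetical page: the repeated <li> path is the candidate data region.
page = ("<html><body><h1>Publications</h1><ul>"
        "<li>A. Author. Title one. 2006.</li>"
        "<li>B. Author. Title two. 2007.</li>"
        "</ul></body></html>")
path, records = most_common_style(page)
```

A path that repeats many times with record-like text is a plausible candidate for a region of citation strings; ranking those candidates is a separate step.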

40
BibPro: A Citation Parser Based on Sequence
Alignment Techniques
41
System Goal
42
Basic Idea(1/2)

Encode a citation as a protein sequence
Only keep the citation style information
order of fields
field separators

43
Basic Idea(2/2)

To determine citation style by the order of
punctuation marks and reserved words

44
How to encode a citation as a protein sequence?

Keep the citation style information
Which fields should be included? (only 23
symbols can be used)
Which punctuation marks are used to separate fields?
By observing different citation styles, we define
an encode table that translates each token of a
citation to an amino acid symbol

45
Encode Table
Symbol   Meaning
A        Author
T        Title
L        Journal
F        Volume value
W        Issue value
H        Page value
M        Month
Y        Year
X        noise (unrecognized token)
S        Issue key, e.g. no, No
P        Page key, e.g. pp, page
V        Volume key, e.g. Vol, vo
N        numeral
Q        @ \ _ / ! ? ?
I        ( < ?
K        ) > ?
D        .
G        "
R        ,
C        -
E        '
Z        (not legible in the transcript)
B        blank
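
The encode table can be applied token by token. The following is a minimal sketch, assuming a simple regular-expression tokenizer and only the symbols that can be assigned without field recognition (punctuation, keys, months, years, numerals, blanks). Everything else is mapped to X here, whereas the real system also assigns A, T, L, F, W and H; the tokenizer and helper names are hypothetical.

```python
import re

# Punctuation symbols taken from the legible part of the encode table.
PUNCTUATION = {".": "D", '"': "G", ",": "R", "-": "C", "'": "E",
               "(": "I", "<": "I", ")": "K", ">": "K",
               "@": "Q", "\\": "Q", "_": "Q", "/": "Q", "!": "Q"}
# Key words listed in the table (matched case-insensitively).
KEYWORDS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = set(MONTH_NAMES) | {name[:3] for name in MONTH_NAMES}


def encode_citation(citation):
    """Translate a citation string into a sequence of one-letter symbols."""
    symbols = []
    # Tokens: runs of letters, runs of digits, runs of whitespace, or one
    # other character.
    for token in re.findall(r"[A-Za-z]+|\d+|\s+|.", citation):
        lower = token.lower()
        if token.isspace():
            symbols.append("B")
        elif token in PUNCTUATION:
            symbols.append(PUNCTUATION[token])
        elif lower in KEYWORDS:
            symbols.append(KEYWORDS[lower])
        elif lower in MONTHS:
            symbols.append("M")
        elif re.fullmatch(r"(19|20)\d{2}", token):
            symbols.append("Y")  # assumption: four-digit 19xx/20xx is a year
        elif token.isdigit():
            symbols.append("N")
        else:
            symbols.append("X")  # unrecognized token
    return "".join(symbols)


encoded = encode_citation("Vol. 12, no. 3, pp. 45-67, 2007")
```

The resulting string keeps only the order of fields and separators, which is the style information the slides say should be preserved.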
46
How to use protein sequences to extract metadata?

Transform the extraction problem into a sequence
alignment problem
Form translation
Unknown Answer
BASE FORM
ALIGN FORM
INDEX FORM
Known Answer
RESULT FORM
STYLE FORM
INDEX FORM
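
Treating extraction as sequence alignment can be illustrated with a textbook global alignment. This is a minimal sketch using the Needleman-Wunsch algorithm as a stand-in; the slides use BLAST for template matching, which is a different and much faster heuristic. The scoring values, the example sequences and the function names are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two symbol strings.

    Returns (score, aligned_a, aligned_b), where '-' marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


def best_template(query, templates):
    """Pick the template whose encoded form aligns best with the query."""
    return max(templates, key=lambda t: needleman_wunsch(query, t)[0])


# Hypothetical encoded forms: author, comma, title, period, year, with and
# without a journal field in between.
total, aligned_query, aligned_template = needleman_wunsch("ARTRLDY", "ARTDY")
```

Once a query is aligned with a template whose field positions are known, the known positions can be carried over to the query through the alignment columns.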

47
RESULT FORM (Known Answer)
48
BASE FORM (Unknown Answer)
49
System Structure

System PreProcess
(Template Generating System)
Citation Crawler
Template Builder
Online Parsing
(Parsing System)
Template Matching
Metadata Extraction

50
Citation Crawler
51
BLAST-powered Template Matching
52
Evaluation for CiteSeer DataSet

Consider the inconsistency between the citation
string and the BibTeX file (metadata)
Old Measurement
New Measurement

53
Definition

Token(parsed field): tokens that appear in
the parsed subfield
Token(query citation): tokens that appear in
the query citation string
Token(BibTeX field): tokens that appear in
the specific subfield in the BibTeX file
Token(BibTeX): all tokens that appear in
the BibTeX file
These tokens do not include punctuation

54
Compare with ParaCite

DataSet
Collected from CiteSeer
Training Set 2416
Testing Set 4131
ParaCite
Uses its default template database
Adding templates to its database is not easy
Test Testing Set
Our System
Uses a template database built from the Training Set
Test Testing Set

55
Experimental Results
ParaCite
Evaluation   Author   Title   Journal   Page     Issue   Year    Score
new          32.90    73.35   29.83     4.58     25.05   77.04   50.22
old          99.08    62.72   30.46     100.00   93.96   99.70   78.81

Our system
Evaluation   Author   Title   Journal   Volume   Page    Issue   Month   Year    Score
new          93.73    73.32   51.34     83.52    94.62   85.11   89.18   96.49   84.80
old          90.58    89.51   67.66     93.58    96.69   91.79   99.49   99.50   91.45
56
Analysis

ParaCite can extract only one author name
The old evaluation has a problem: it is highly
probable that you will obtain high accuracy if
you extract less information

57
Evaluation for clean DataSet

The citation string is fully composed of the corresponding
metadata

58
Compare with INFOMAP

DataSet
Includes 160000 records
Training dataset: 10000 x 6 (JMIS, ACM, IEEE,
APA, MISQ, and ISR)
Testing dataset: 10000 x 6 (JMIS, ACM, IEEE, APA,
MISQ, and ISR)

59
Result
Style     Author   Title   Journal   Volume   Page    Issue   Year    Overall average
APA       99.67    96.38   97.06     98.99    98.71   98.12   99.42   98.33
IEEE      98.72    98.12   99.12     99.30    98.40   98.39   99.40   98.78
ACM       97.14    95.01   93.93     97.19    97.92   97.03   98.88   96.73
ISR       99.48    96.17   96.96     99.15    98.55   98.39   99.35   98.29
MISQ      98.59    97.99   98.98     99.41    98.83   98.61   99.54   98.85
JMIS      91.95    87.90   90.46     99.23    98.76   98.03   99.46   95.11
Average   97.59    95.26   96.09     98.88    98.53   98.09   99.34   97.68
60
Evaluation for Cora DataSet

500 records
Used as a benchmark in many papers
(HMM, SVM, CRF)

61
Evaluation

Divide words into four kinds
TP, FP, TN, FN
Four metrics
Word Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-measure = (2 x Precision x Recall) / (Precision + Recall)
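
The four metrics can be computed directly from the word counts. This is a minimal sketch; the function name and the example counts are hypothetical and are not results from the slides.

```python
def word_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from word-level counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts for one field: 8 words correctly labelled as the field,
# 2 wrongly labelled, 2 missed, 88 correctly left out.
accuracy, precision, recall, f1 = word_metrics(tp=8, fp=2, tn=88, fn=2)
```

Because TN usually dominates for a single field, word accuracy tends to be much higher than F1, which matches the gap between the two columns reported for some fields.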

62
Our system
Field     Accuracy   F1
Author    97.17      93.98
Title     94.17      90.13
Journal   93.58      83.27
Volume    99.21      84.62
Page      99.21      92.09
Date      99.92      98.96
63
Mining Translations of Chinese Names from Web
Corpora by Using a Query Expansion Technique and
Support Vector Machine
64
Agenda

Introduction
Proposed Approach
Experiments
Conclusions and Future Work

65
Background

Most academic information can be found on the
Web
Google Scholar, DBLP, etc.

66
Problems in Searching Chinese Name
Only Chinese Corpus
67
Challenges in Chinese Name Translation

Many pronunciation rules in different areas
? ? Chen (Taiwan)
? ? Tsun (Hong Kong)
? ? Tan (Fukien)
Some additional words exist.
E.g., ??? (Kwang-Ming Frank Hwang)
E.g., ??? (Jane Win-Shih Liu)

68
Common Chinese Name Translation Format
Name Format Examples
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name) ??? (Fon-Che Liu) ??? (Ng Tian Hann) ?? (Ngau Lam)
Type-2. (Merged Chinese given name) (Surname) ??? (Derchyi Wu)
Type-3. (Western first name) (Surname) ??? (Anne Chao)
Type-4. (Chinese given name) (Western first name) (Surname) ??? (Kwang-Ming Frank Hwang)
Type-5. (Abbreviated Chinese given name) (Surname) ??? (S.-Y. Chang)
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname) ??? (Jack-C. Lee)
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname) ???(Gwei-Hung H. Tsai)
Type-8. (Chinese given name) (Unpredictable Surname) ???(Jane Win-Shih Liu)
69
Goal

Design an automatic mechanism to translate a
given Chinese name into its related English name

70
Agenda

Introduction
Proposed Approach
Experiments
Conclusions and Future Work

71
Concepts of Proposed Approach
No corresponding translations
72
Three Major Techniques

Query expansion technique
Uses: translation of the surname
Obtains the Web page snippets related to the
Chinese name translation.
Solves the problem of unrelated terms appearing
in the name translation.
Knowledge-based method
Uses: Chinese surname database, a common dictionary,
Western first name database
Obtains all the name-like terms from the
returned Web page snippets.
SVM
Uses: Chinese pronunciation database, the phonetic
feature and the distance feature,
selected training samples
Selects the appropriate Chinese name
translations from the candidates.

73
System Architecture
Data flow
Chinese names
Query expander
Returned Web page snippets
Candidate extractor
Name candidates
SVM-based name selector
Translated English names
Resources
Chinese surname database
Western first name database
Chinese pronunciation database
On-line dictionary
74
Query Expander

Goal
To retrieve Web page snippets that contain both
a person's Chinese name and the translation of
the person's surname.
Name splitter
Determines whether the input Chinese name
contains a compound surname
(using the Chinese surname database)
Divides the input Chinese name into a surname
part and a given name part.
Surname translator
Selects appropriate surname translations
(using the Chinese surname database).
The strength of the relationship between each surname
translation and the person is determined by the
distance from the person's Chinese name to the
surname's translation.
Web page retriever
Makes the query more specific.
Retrieves the related Web pages.
The new query will be (Chinese name)
(surname's translation).
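
The name splitter and the expanded query can be sketched as follows. This is a minimal sketch, assuming a toy surname database; the entries, the example names and the function names are made up for illustration, and the sketch emits one query per listed translation rather than applying the distance-based selection described above.

```python
# Hypothetical toy database: surname -> romanized variants.
SURNAME_DB = {
    "陳": ["Chen", "Chan", "Tan"],
    "歐陽": ["Ouyang"],
}


def split_name(chinese_name, surname_db):
    """Split a Chinese name into (surname, given name).

    Two-character (compound) surnames are tried first so that they win over
    a single-character surname sharing the same first character.
    """
    for length in (2, 1):
        prefix = chinese_name[:length]
        if prefix in surname_db and len(chinese_name) > length:
            return prefix, chinese_name[length:]
    return None


def expand_queries(chinese_name, surname_db):
    """Build one query per surname translation: (Chinese name) (translation)."""
    parts = split_name(chinese_name, surname_db)
    if parts is None:
        return []
    surname, _given = parts
    return [f"{chinese_name} {translation}" for translation in surname_db[surname]]


queries = expand_queries("陳偉達", SURNAME_DB)
```

Adding the surname translation to the query is what steers the search engine towards snippets that contain both the Chinese name and a romanized form.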

75
Distance between Two Terms

Calculation of the distance between two terms,
where D is the distance and N is the number of
non-words between the two terms.

???(Wei-Da Chen)
The distance from the person's Chinese name (???)
to the surname's translation (Chen) is 3.
76
Candidate Extractor

Goal
To extract possible candidates from the
retrieved Web page snippets.
Steps
Remove all HTML tags.
Identify all the positions of the Chinese
surnames in the snippets
(using the Chinese surname database).
Extract any English term near each surname in
the snippets if the term has one of the following
properties
The term cannot be found in a common dictionary.
The term is a Western first name.
The length of the term is 1.
At most three English terms in the
neighborhood of the surname will be extracted.
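
The extraction steps can be sketched as a small filter. This is a minimal sketch, assuming the surname is given in its romanized form, that "near" means a fixed window of tokens on either side, and that the dictionary and first-name list are plain sets; these choices and all names are assumptions.

```python
import re


def is_name_like(term, dictionary, western_first_names):
    """A term qualifies if it is not a dictionary word, is a Western first
    name, or has length 1 (an initial)."""
    lower = term.lower()
    return (lower not in dictionary
            or lower in western_first_names
            or len(term) == 1)


def extract_candidates(snippet, surname, dictionary, western_first_names,
                       window=3, max_terms=3):
    """Return, for each occurrence of `surname`, up to `max_terms` name-like
    English terms found within `window` tokens on either side."""
    text = re.sub(r"<[^>]+>", " ", snippet)      # remove HTML tags
    tokens = re.findall(r"[A-Za-z]+", text)      # English terms only
    results = []
    for pos, token in enumerate(tokens):
        if token.lower() != surname.lower():
            continue
        start = max(0, pos - window)
        neighbours = tokens[start:pos] + tokens[pos + 1:pos + 1 + window]
        picked = [t for t in neighbours
                  if is_name_like(t, dictionary, western_first_names)]
        results.append(picked[:max_terms])
    return results


# Hypothetical snippet and resources.
dictionary = {"prof", "teaches", "at", "the", "university"}
first_names = {"frank", "jane"}
found = extract_candidates("<b>Prof. Wei-Da Chen</b> teaches at the university",
                           "Chen", dictionary, first_names)
```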

77
System Architecture 4/10 - Candidate extractor
Step 1: Identify all the
positions of the Chinese surnames in the
snippets.
The extracted terms become the name translation
candidates and are sent to the SVM-based name selector
for processing.

Step 2: Extract
any English term near each surname in the
snippets if the term has one of the following
properties
The term cannot be found in a common
dictionary.
The term is a Western first name.
The length of the term is 1.

78
SVM-based Name Selector

Goal
To extract each candidate's features and use
them to determine whether the candidate is the
correct translation of the input Chinese name.
Features
The phonetic feature
Phonetic similarity
(Soundex algorithm)
The distance feature
Smallest distance (between the Chinese name and
the translation candidates)
Number of appearances in the neighborhood
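
The Soundex algorithm named above maps similar-sounding names to the same four-character code. This is a minimal sketch of standard American Soundex; how the slides turn Soundex codes into a phonetic similarity value is not stated, so comparing codes for equality is only an assumption.

```python
# Consonant classes of American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code
    return (encoded + "000")[:4]


def same_sound(a, b):
    """Assumed binary phonetic feature: do two names share a Soundex code?"""
    return soundex(a) != "" and soundex(a) == soundex(b)
```

For example, the romanizations Hwang and Huang receive the same code, so they would count as phonetically similar under this assumed feature.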

79
Distance Features

The neighborhood
The area close to each occurrence of the Chinese
name.
The close area is defined by a given threshold on
the distance, measured in number of words.

Smallest distance: 2
Number of appearances of the candidate win-shih
in the neighborhood: 2
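
The two distance features can be computed from token positions. This is a minimal sketch, assuming distance is the difference between token indices (the exact distance formula is not preserved in the transcript) and using the placeholder NAME for the Chinese name; the token list and threshold are hypothetical.

```python
def distance_features(tokens, name, candidate, threshold=3):
    """Return (smallest distance, appearances within the neighborhood).

    tokens:    tokenized snippet text
    name:      the token standing for the Chinese name
    candidate: the translation candidate (compared case-insensitively)
    threshold: neighborhood radius in tokens around each name occurrence
    """
    name_positions = [i for i, t in enumerate(tokens) if t == name]
    candidate_positions = [i for i, t in enumerate(tokens)
                           if t.lower() == candidate.lower()]
    if not name_positions or not candidate_positions:
        return None, 0
    smallest = min(abs(i - j)
                   for i in name_positions for j in candidate_positions)
    appearances = sum(
        1 for j in candidate_positions
        if any(abs(i - j) <= threshold for i in name_positions)
    )
    return smallest, appearances


# Hypothetical snippet with two occurrences of the name.
tokens = ["NAME", "Jane", "Win-Shih", "Liu", "is", "a", "professor", "and",
          "NAME", "Jane", "Win-Shih"]
smallest, appearances = distance_features(tokens, "NAME", "win-shih")
```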
80
Summary

Query expansion technique
Retrieving related Web pages.
Knowledge-based method
Extracting appropriate name translation
candidates from the retrieved Web pages.
SVM
Learning the verification rule and
Selecting appropriate name translation from
extracted candidates.

81
Agenda

Introduction
Proposed Approach
Experiments
Conclusions and Future Work

82
Testing Environment and Dataset 1/3

The following tools are used
Cambridge on-line dictionary
Google search engine
LIBSVM
Two datasets are used
Dataset I (training and testing)
Collected from the directory of scholars of the
Institute of Mathematics.
Contains 78 entries.
Dataset II (testing)
Collected by our program from the Website of the
Directory of the Division of Computer Science of the
National Science Council.
Contains 1,157 entries; the name
translations of 40 entries cannot be found via Google.

83
Testing Environment and Dataset 2/3
Columns: Name format, Example, Dataset I (count, %), Dataset II (count, %)
Type-1. (Chinese given name) (Surname) or (Surname), (Chinese given name) ???(Jen-Wen Ding) ???(Der-Rong Din) ???(Ming Ouhyang) 19 24.3 1000 89.5
Type-2. (Merged Chinese given name) (Surname) ???(Piyu Tsai) 10 12.8 42 3.8
Type-3. (Western first name) (Surname) ???(Eugene Lai) 9 11.5 9 0.8
Type-4. (Chinese given name) (Western first name) (Surname) ???(Alan Li-Sung liu) ???(Jia-Yih Joy Chen) ???(Fongray Frank Young) 14 17.9 50 4.5
Type-5. (Abbreviated Chinese given name) (Surname) ???(I.-C. Hung) 3 3.8 0 0
Type-6. (Western first name) (Abbreviated Chinese given name) (Surname) ???(Judy C. R. Tseng) 8 10.3 9 0.8
Type-7. (Chinese given name) (Abbreviated Chinese given name) (Surname) ???(Tetz C. Huang) 3 3.8 3 0.4
Type-8. (Chinese given name) (Unpredictable Surname) ???(Trieu-Kien Truong) 12 15.4 4 0.4
84
Testing Environment and Dataset 3/3

The alignment accuracy
Proposed by Huang (2005).
The probability of selecting the correct answers
when the searched snippets contain the correct
answers.
A
where
Ai: the alignment accuracy of candidate i.
Nd: the number of testing data.
Ncc: the number of correct translations.
Performance measurement: Top-1 to Top-5 alignment
accuracy.

85
Results and Analysis 1/3 - Overall performance on
Dataset I
Top-1 accuracy: 70.5%; top-5 accuracy: 91%
86
Results and Analysis 2/3 - Overall performance
on Dataset II
Top-1 accuracy: 57.9%; top-5 accuracy: 86.2%
87
Results and Analysis 3/3 - Performance of each
name type
Name format Example
Type-1 ???(Jen-Wen Ding) ???(Der-Rong Din) ???(Ming Ouhyang)
Type-2 ???(Piyu Tsai)
Type-3 ???(Eugene Lai)
Type-4 ???(Alan Li-Sung liu) ???(Jia-Yih Joy Chen)
Type-5 ???(I.-C. Hung)
Type-6 ???(Judy C. R. Tseng)
Type-7 ???(Tetz C. Huang)
Type-8 ???(Trieu-Kien Truong)
Our system performs better on Type-1, Type-2,
Type-4 and Type-6.
88
Discussions

Major reasons for the low performance on Type-3,
Type-5, Type-7 and Type-8
The lack of Web information.
Usually more than one correct name translation
is found for an input Chinese name.
The name ambiguity problem.

89
Limitations