1
Mining Academic Community
  • Jan-Ming Ho
  • hoho@iis.sinica.edu.tw
  • Computer System and Communication Lab
  • Institute of Information Science
  • Academia Sinica

2
What is Community?
  • In Graph Theory
  • densely connected groups of vertices, with
    sparser connections between groups
  • In Social Network Analysis
  • groups of entities that share similar properties
    or connect to each other via certain relations
  • A social network is a structure made up of nodes,
    representing entities from different conceptual
    groups, that are linked with different types of
    relations

3
Why is Community Important?
  • Interesting data with community structure
  • researcher collaboration, friendship networks,
    the WWW, massively multiplayer online gaming,
    electronic communications.
  • Groups of web pages that link to more web pages
    in the community than pages outside correspond to
    web pages on related topics
  • Groups in social networks correspond to social
    communities, which can be used to understand
    organizational structure, academic collaboration,
    shared interests and affinities, etc.

4
Motivation
  • Understand the research network between authors,
    conferences and topics (rank entities by
    relevance for given entities)
  • Find and justifiably recommend research
    collaborators for given authors
  • Explore the academic social network
  • Find the most important papers, researchers and
    venues for a given topic

5
Related Systems
  • Many digital library systems exist
  • ACM Digital Library
  • IEEE Xplore
  • DBLP
  • CiteSeer
  • Libra
  • DBConnect
  • Problems
  • The coverage of the datasets is not large enough
  • The name ambiguity problem exists in
  • Web pages
  • Citation records

6
Libra Academic Search
  • http://libra.msra.cn
  • Free computer science bibliography search engine
  • A test-bed for object-level vertical search
    research
  • Currently the following types of paper-related
    objects can be searched
  • Papers, Authors, Conferences, Journals, Research
    Communities

9
DBconnect Conference
10
DBconnect Topic
11
DBconnect Author
12
ZoomInfo
  • (1) People directory
  • (2) Developer tools
  • (3) Social network, profile statistics, employment
    history
  • (4) Ability to resolve ambiguous names? E.g., a
    search returns 21 different people called Bing Liu
13
ArnetMiner
14
Our goal
  • Developing an automatic system to
  • Explore the academic social network
  • Find the most important papers, researchers and
    venues for a given topic
  • Provide solutions for existing problems
  • Collecting larger citation datasets
  • Retrieving data from web pages
  • Publication list finder
  • Extracting citation strings from web pages
  • Citation parser
  • Multilingual data sources
  • Chinese and English corpora
  • Name disambiguation mechanism in
  • Web pages
  • Citation records

15
Our contributions
  • Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee,
    and Jan-Ming Ho, "Web Appearance Disambiguation
    of Personal Names Based on Network Motif," in the
    2006 IEEE/WIC/ACM International Conference on Web
    Intelligence (WI 2006), Hong Kong, Dec. 18-22,
    2006
  • Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho,
    "PLF: A Publication List Web Page Finder for
    Researchers," in Proceedings of the 2007
    IEEE/WIC/ACM International Conference on Web
    Intelligence (WI 2007), Silicon Valley, USA, Nov.
    2-5, 2007
  • Kai-Hsiang Yang, Wei-Da Chen, Hahn-Ming Lee and
    Jan-Ming Ho, "Mining Translations of Chinese Names
    from Web Corpora by Using a Query Expansion
    Technique and Support Vector Machine," in
    Proceedings of the 2007 IEEE/WIC/ACM
    International Conference on Web Intelligence (WI
    2007), Silicon Valley, USA, Nov. 2-5, 2007
  • Chia-Ching Chou, Kai-Hsiang Yang and Hahn-Ming
    Lee, "AEFS: Authoritative Expert Finding System
    Based on a Language Model and Social Network
    Analysis," in Proceedings of the 12th Conference
    on Artificial Intelligence and Applications
    (TAAI2007), Nov 16-17, 2007
  • Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho,
    "BibPro: A Citation Parser Based on Sequence
    Alignment Techniques," to appear in Proceedings
    of the IEEE 22nd International Conference on
    Advanced Information Networking and Applications
    (AINA-08)

16
PLF: A Publication List Web Page Finder for
Researchers
17
Agenda
  • Introduction
  • Publication List Web Page Finder, PLF
  • Performance Evaluation
  • Conclusion, Future Work

18
Overview of a Publication List Web Page
  • Keep abreast of state-of-the-art research
  • Contains citations not found elsewhere.
  • May provide some reference materials, such as
    slides and talks.
  • Challenges
  • How to find the publication list web pages
  • Only the person's name is given.
  • Various versions or multiple copies
  • An author may have many affiliations.
  • Name ambiguity problem
  • E.g., for Dr. Bing Liu, a query to ZoomInfo (a
    people search engine) shows that 26 people share
    the same name.

19
Problem
Publication List Web Page?
20
Definition of Publication List
Affiliated Personal Publication List Web Page
(APPL): a web page that belongs to the affiliated
web site of a specific person with the given name.
[Figure labels: Affiliation: Institute of Information Science, Academia Sinica; citation string]
21
Agenda
  • Introduction
  • Publication List Web Page Finder, PLF
  • Performance Evaluation
  • Conclusion, Future Work

22
Process Flow
23
Basic Concept
A publication list web page may
contain many citation strings
24
Agenda
  • Introduction
  • Publication List Web Page Finder, PLF
  • Performance Evaluation
  • Conclusion, Future Work

25
Dataset
  • Scenario
  • Seminar members have usually published major
    research works
  • We randomly collected 200 names from the WWW 06
    Conference Committee website

APPL type      APPLs per person   People   Population (%)
others         0                  22       11
single-group   1                  120      60
multi-group    2                  35       17.5
multi-group    3                  16       8
multi-group    4                  7        3.5
26
Experiment Evaluation
  • Evaluation metrics
  • We consider the top-5 results derived by each
    link and focus on the top-5 recall metric, which
    is calculated by $\text{top-5 recall} = R / R_a$

Notation Definition
R_a the number of publication list web pages belonging to researchers listed in the dataset
R the number of publication list web pages contained in the top-5 results
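
The metric above can be sketched in a few lines. This is only an illustration under assumptions: the function name `top5_recall` and the two dictionaries (ranked result URLs per researcher, and the set of true publication list pages per researcher) are hypothetical and do not come from the slides.

```python
def top5_recall(results, relevant):
    """Top-5 recall = R / R_a.

    results:  dict mapping a researcher name to a ranked list of result URLs
              (hypothetical input format).
    relevant: dict mapping a researcher name to the set of that researcher's
              true publication list pages.
    """
    # R_a: all publication list pages belonging to researchers in the dataset
    r_a = sum(len(pages) for pages in relevant.values())
    # R: publication list pages that show up within the top-5 results
    r = sum(len(set(results.get(name, [])[:5]) & pages)
            for name, pages in relevant.items())
    return r / r_a if r_a else 0.0


example_results = {
    "a": ["u1", "x", "y"],
    "b": ["q1", "q2", "q3", "q4", "q5", "u2"],  # u2 is ranked 6th, so it is not counted
}
example_relevant = {"a": {"u1"}, "b": {"u2", "u3"}}
recall = top5_recall(example_results, example_relevant)
```

In this toy input one of three true pages is found within the top five, so the recall is one third.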
27
Parameter Analysis for Single-Group
[Figure panels: (a) fixed n with varying m; (b) fixed m with varying n]
  • Figure (a)
  • When m increases, the recall rate also
    increases.
  • Figure (b)
  • System performance may be constrained by m.

28
Parameter Analysis for Multi-Group
[Figure panels: (a) fixed n with varying m; (b) fixed m with varying n]
  • Figure (a)
  • It is clear that the performance when m = 40 is
    always better than with the other settings.
  • Figure (b)
  • The best performance (top-5 recall is 70%)
    occurs when n = 75.

29
Performance Evaluations
(a) Performance of the approaches on the single-group dataset
(b) Performance of the approaches on the multi-group dataset
  1. The parameter m has a strong influence on the
    system's performance; for example, an oversized m
    may degrade the performance.
  2. The parameter n has little influence on the
    system's performance.
  3. The PLF system outperforms the other two
    approaches on both the single-group and the
    multi-group datasets.

30
Conclusion
  • We have defined the problem of finding the
    publication list web pages of a researcher, and
    proposed the PLF system
  • Ongoing work
  • Name ambiguity problem
  • How to merge the multiple publication list web
    pages for a specific person into a single page.

31
Discussion: Name Ambiguity Problem
  • Scenario
  • We take the name Bing Liu as an example
  • Analyze manually
  • Observation
  • Citation Count
  • Name translation problem
  • Partial matching problem

32
Extracting Citation Strings from Web Pages
33
Extract Citation Records
[Diagram: Web Page → Extract → Structured Data]
34
Challenges
  • The formats of publication list web pages vary
  • There are no fixed syntactic rules for parsing
    citation records
  • Hence, we cannot apply simple rules to extract
    citation records automatically

35
Challenges: Complex Layouts of Publication List
Pages
36
Ideas
  • The semantic structure of web pages is organized
    by visual arrangement.
  • We can utilize the semi-structured (visual)
    information of web pages to help the extraction task.
  • With its hierarchical structure and geometric
    information, the DOM tree is not only a good
    structure for representing Web pages, but also very
    helpful for visual pattern analysis.

37
DOM Tree Presentation of Web page
38
Architecture of Citation Extraction System
39
Modules of Citation Extraction System
  • Common Style Finder
  • find out all common style patterns for each level
    of granularity in web pages
  • Citation Extractor
  • explore data regions with common style patterns
  • distill extraction rules from those data regions
  • rank extraction patterns based on a normal word
    count distribution probability
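
As a rough illustration of the Common Style Finder idea, the sketch below counts how often each tag path occurs in a page and keeps the repeated ones. It rests on assumptions: the slides do not define a "style pattern", so a tag path such as `ul/li` is used here as a stand-in, and the names `TagPathCounter` and `common_style_patterns` are hypothetical. It uses only Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags without a closing tag; they are skipped so the path stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCounter(HTMLParser):
    """Counts occurrences of each root-to-element tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def common_style_patterns(html, min_count=2):
    """Return the tag paths that repeat at least min_count times."""
    counter = TagPathCounter()
    counter.feed(html)
    counter.close()
    return {path: n for path, n in counter.paths.items() if n >= min_count}


page = "<ul><li><i>A</i></li><li><i>B</i></li><li><i>C</i></li></ul>"
patterns = common_style_patterns(page)
```

Paths of different depth play the role of the levels of granularity: a repeated `ul/li` suggests a data region of list items, each of which may hold one citation record.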

40
BibPro: A Citation Parser Based on Sequence
Alignment Techniques
41
System Goal
42
Basic Idea(1/2)
  • Encode a citation as a protein sequence
  • Only keep the citation style information
  • order of fields
  • field separators

43
Basic Idea(2/2)
  • To determine citation style by the order of
    punctuation marks and reserved words

44
How to encode a citation as a protein sequence?
  • Keep the citation style information
  • Which fields should be included? (only 23
    symbols can be used)
  • Which punctuation marks are used to separate fields?
  • By observing different citation styles, we define
    an encode table to translate each token of
    citation to an amino acid symbol

45
Encode Table
Symbol  Meaning
A       Author
T       Title
L       Journal
F       Volume value
W       Issue value
H       Page value
M       Month
Y       Year
X       noise (unrecognized token)
S       Issue key, e.g. no, No
P       Page key, e.g. pp, page
V       Volume key, e.g. Vol, vo
N       numeral
Q       @ \ _ / ! ? ?
I       ( < ?
K       ) > ?
D       .
G       "
R       ,
C       -
E       '
Z
B       blank
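
To make the encoding concrete, here is a minimal sketch that maps the tokens of a citation string to symbols of the table above. It is an assumption-laden illustration, not the BibPro implementation: the function name `encode_style`, the tokenizing regular expression, and the case-insensitive key matching are all hypothetical. Field symbols such as A, T, and L need recognizers that the slides do not describe, so ordinary words fall through to X (noise) here.

```python
import re

# Key words from the encode table (matched case-insensitively, an assumption).
KEYS = {"no": "S", "pp": "P", "page": "P", "vol": "V", "vo": "V"}

# Punctuation rows of the encode table that survived extraction.
PUNCT = {".": "D", '"': "G", ",": "R", "-": "C", "'": "E",
         "(": "I", "<": "I", ")": "K", ">": "K"}


def encode_style(citation):
    """Translate each token of a citation string into one style symbol."""
    symbols = []
    # Tokens: runs of letters, runs of digits, runs of whitespace, or one other character.
    for token in re.findall(r"[A-Za-z]+|\d+|\s+|.", citation):
        if token.isspace():
            symbols.append("B")          # blank
        elif token.isdigit():
            symbols.append("N")          # numeral
        elif token.lower() in KEYS:
            symbols.append(KEYS[token.lower()])
        elif token in PUNCT:
            symbols.append(PUNCT[token])
        else:
            symbols.append("X")          # unrecognized token
    return "".join(symbols)


encoded = encode_style("Vol. 5, pp. 10-20")
```

The resulting string keeps only the order of keys, numerals, and separators, which is the style information that the later alignment step works on.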
46
How to use protein sequences to extract metadata?
  • Transform the extraction problem into a sequence
    alignment problem
  • Form translation
  • Unknown Answer
  • BASE FORM
  • ALIGN FORM
  • INDEX FORM
  • Known Answer
  • RESULT FORM
  • STYLE FORM
  • INDEX FORM

47
RESULT FORM (Known Answer)
48
BASE FORM (Unknown Answer)
49
System Structure
  • System PreProcess
  • (Template Generating System)
  • Citation Crawler
  • Template Builder
  • Online Parsing
  • (Parsing System)
  • Template Matching
  • Metadata Extraction

50
Citation Crawler
51
BLAST-powered Template Matching
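
Template matching can be pictured as scoring an encoded query citation against each stored template and keeping the best one. The slides use BLAST for this step; the sketch below swaps in a plain Needleman-Wunsch global alignment score as a simple stand-in, so it illustrates the idea rather than the system's actual matcher. The function names and the scoring values (match 1, mismatch -1, gap -1) are assumptions.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two symbol strings."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_template(query, templates):
    """Pick the stored template whose encoding aligns best with the query."""
    return max(templates, key=lambda template: alignment_score(query, template))


# Hypothetical encoded templates and query.
templates = ["VDBNRBPDBNCN", "NRN"]
chosen = best_template("VDBNRBPDNCN", templates)
```

Once a template is chosen, its known field positions (the RESULT FORM) can be transferred to the aligned positions of the query.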
52
Evaluation for CiteSeer DataSet
  • Consider the inconsistency between the citation
    string and the BibTeX file (metadata)
  • Old Measurement
  • New Measurement

53
Definition
  • Token_{parsed field}: tokens that appear in
    the parsed subfield
  • Token_{query citation}: tokens that appear in
    the query citation string
  • Token_{BibTeX field}: tokens that appear in
    the specific subfield in the BibTeX file
  • Token_{BibTeX}: all tokens that appear in
    the BibTeX file
  • These tokens do not include punctuation

54
Compare with ParaCite
  • Dataset
  • Collected from CiteSeer
  • Training set: 2416
  • Testing set: 4131
  • ParaCite
  • Uses its default template database
  • Adding templates to its database isn't easy
  • Evaluated on the testing set
  • Our system
  • Uses a template database built from the training set
  • Evaluated on the testing set

55
Experimental Results
System    Evaluation  Author  Title  Journal  Volume  Page    Issue  Month  Year   Score
ParaCite  new         32.90   73.35  29.83    -       4.58    25.05  -      77.04  50.22
ParaCite  old         99.08   62.72  30.46    -       100.00  93.96  -      99.70  78.81
Ours      new         93.73   73.32  51.34    83.52   94.62   85.11  89.18  96.49  84.80
Ours      old         90.58   89.51  67.66    93.58   96.69   91.79  99.49  99.50  91.45
(-: field not reported for ParaCite)
56
Analysis
  • ParaCite can extract only one author name
  • The old evaluation has a problem: you are highly
    likely to obtain high accuracy if you extract
    less information

57
Evaluation for clean DataSet
  • Each citation string is fully composed of its
    corresponding metadata

58
Compare with INFOMAP
  • DataSet
  • Includes 160000 records
  • Training dataset: 10000 × 6 (JMIS, ACM, IEEE,
    APA, MISQ, and ISR)
  • Testing dataset: 10000 × 6 (JMIS, ACM, IEEE, APA,
    MISQ, and ISR)

59
Result
Style    Author  Title  Journal  Volume  Page   Issue  Year   Overall average
APA      99.67   96.38  97.06    98.99   98.71  98.12  99.42  98.33
IEEE     98.72   98.12  99.12    99.30   98.40  98.39  99.40  98.78
ACM      97.14   95.01  93.93    97.19   97.92  97.03  98.88  96.73
ISR      99.48   96.17  96.96    99.15   98.55  98.39  99.35  98.29
MISQ     98.59   97.99  98.98    99.41   98.83  98.61  99.54  98.85
JMIS     91.95   87.90  90.46    99.23   98.76  98.03  99.46  95.11
Average  97.59   95.26  96.09    98.88   98.53  98.09  99.34  97.68
60
Evaluation for Cora DataSet
  • 500 records
  • Used as a benchmark in many papers
  • (HMM, SVM, CRF)

61
Evaluation
  • Divide words into four kinds
  • TP,FP,TN,FN
  • Four metrics
  • Word accuracy: $(TP + TN) / (TP + FP + FN + TN)$
  • Precision: $TP / (TP + FP)$
  • Recall: $TP / (TP + FN)$
  • F1-measure: $(2 \times Precision \times Recall) / (Precision + Recall)$
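
The four metrics follow directly from the word counts. The sketch below is a small illustration; the function name `word_metrics` and the example counts are made up for demonstration and are not results from the slides.

```python
def word_metrics(tp, fp, tn, fn):
    """Compute word accuracy, precision, recall, and F1 from word counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one field.
metrics = word_metrics(tp=8, fp=2, tn=85, fn=5)
```

Because most words of a citation do not belong to a given field, TN is usually large, which is why word accuracy can be much higher than F1 for the same field.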

62
Our system
Field    Acc.   F1
Author   97.17  93.98
Title    94.17  90.13
Journal  93.58  83.27
Volume   99.21  84.62
Page     99.21  92.09
Date     99.92  98.96
63
Mining Translations of Chinese Names from Web
Corpora by Using a Query Expansion Technique and
Support Vector Machine
64
Agenda
  • Introduction
  • Proposed Approach
  • Experiments
  • Conclusions and Future Work

65
Background
  • Most academic information can be found on the
    Web
  • Google Scholar, DBLP, etc.

66
Problems in Searching Chinese Name
Only Chinese Corpus
67
Challenges in Chinese Name Translation
  • Many pronunciation rules in different areas
  • ? ? Chen (Taiwan)
  • ? ? Tsun (Hong Kong)
  • ? ? Tan (Fukien)
  • Some additional words exist.
  • Ex ??? (Kwang-Ming Frank Hwang)
  • Ex ??? (Jane Win-Shih Liu)

68
Common Chinese Name Translation Format
Type-1. Format: (Chinese given name) (Surname) or (Surname), (Chinese given name). Examples: ??? (Fon-Che Liu), ??? (Ng Tian Hann), ?? (Ngau Lam)
Type-2. Format: (Merged Chinese given name) (Surname). Example: ??? (Derchyi Wu)
Type-3. Format: (Western first name) (Surname). Example: ??? (Anne Chao)
Type-4. Format: (Chinese given name) (Western first name) (Surname). Example: ??? (Kwang-Ming Frank Hwang)
Type-5. Format: (Abbreviated Chinese given name) (Surname). Example: ??? (S.-Y. Chang)
Type-6. Format: (Western first name) (Abbreviated Chinese given name) (Surname). Example: ??? (Jack-C. Lee)
Type-7. Format: (Chinese given name) (Abbreviated Chinese given name) (Surname). Example: ??? (Gwei-Hung H. Tsai)
Type-8. Format: (Chinese given name) (Unpredictable Surname). Example: ??? (Jane Win-Shih Liu)
69
Goal
  • Design an automatic mechanism to translate a
    given Chinese name into its related English name

70
Agenda
  • Introduction
  • Proposed Approach
  • Experiments
  • Conclusions and Future Work

71
Concepts of Proposed Approach
No corresponding translations
72
Three Major Techniques
  • Query expansion technique
  • Uses: translation of the surname
  • Obtains the Web page snippets related to the
    Chinese name translation.
  • Solves the problem of unrelated terms appearing
    in the name translation.
  • Knowledge-based method
  • Uses: Chinese surname database, a common dictionary,
    Western first name database
  • Obtains all the name-like terms from the
    returned Web page snippets.
  • SVM
  • Uses: Chinese pronunciation database, the phonetic
    feature and the distant feature,
    selected training samples
  • Selects the appropriate Chinese name
    translations from the candidates.

73
System Architecture
[Diagram: Chinese names → Query expander → returned Web page snippets → Candidate extractor → name candidates → SVM-based name selector → translated English names. Supporting resources: Chinese surname database, Western first name database, Chinese pronunciation database, on-line dictionary.]
74
Query Expander
  • Goal
  • To retrieve Web page snippets that contain both
    a person's Chinese name and the translation of
    the person's surname.
  • Name splitter
  • Determining whether the input Chinese name
    contains a compound surname
  • Uses: Chinese surname database
  • Dividing the input Chinese name into a surname
    part and a given name part.
  • Surname translator
  • Selecting appropriate surname translations.
  • Uses: Chinese surname database
  • The strength of the relationship between each surname
    translation and the person is determined by the
    distance from the person's Chinese name to the
    surname's translation.
  • Web page retriever
  • Making the intent of the query clearer.
  • Retrieving the related Web pages.
  • The new query will be (Chinese name)
    (Surname's translation).
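
The name splitter and the query construction can be sketched as follows. Everything concrete here is an assumption for illustration: the two small tables stand in for the Chinese surname database, the sample name is hypothetical, and the sketch emits one query per known surname translation instead of selecting translations by distance as the slide describes.

```python
# Stand-ins for the Chinese surname database (illustrative sample entries only).
COMPOUND_SURNAMES = {"歐陽", "司馬", "諸葛"}
SURNAME_TRANSLATIONS = {
    "陳": ["Chen", "Chan", "Tan"],
    "歐陽": ["Ouyang"],
}


def split_name(chinese_name):
    """Split a Chinese name into (surname, given name)."""
    # A two-character prefix found in the compound-surname list is the surname.
    if len(chinese_name) > 2 and chinese_name[:2] in COMPOUND_SURNAMES:
        return chinese_name[:2], chinese_name[2:]
    return chinese_name[:1], chinese_name[1:]


def expand_queries(chinese_name):
    """Build one expanded query per known translation of the surname."""
    surname, _given_name = split_name(chinese_name)
    return [f'"{chinese_name}" {translation}'
            for translation in SURNAME_TRANSLATIONS.get(surname, [])]


# Hypothetical input name.
queries = expand_queries("陳偉達")
```

Each query pairs the full Chinese name with one surname translation, so the returned snippets are more likely to contain both the Chinese name and its English form.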

75
Distance between Two Terms
  • Calculation of the distance between two terms,
    where D is the distance and N is the number of
    non-words between the two terms.

???( Wei-Da Chen)
The distance from the person's Chinese name (???)
to the surname's translation (Chen) is 3.
76
Candidate Extractor
  • Goal
  • To extract possible candidates from the
    retrieved Web page snippets.
  • Steps
  • Removing all HTML tags.
  • Identifying all the positions of the Chinese
    surnames in the snippets.
  • Uses: Chinese surname database
  • Extracting any English term near each surname in
    the snippets if the term has one of the following
    properties
  • The term cannot be found in a common dictionary.
  • The term is a Western first name.
  • The length of the term is 1.
  • At most three English terms in the
    neighborhood of the surname will be extracted.
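
The steps above can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' extractor: the function names, the window of three terms on each side of the surname, the nearest-first ordering, and the toy dictionary and snippet are all hypothetical. Hyphenated given names are split into separate terms by the simple tokenizer.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def extract_candidates(snippet_html, surname, dictionary, western_names, window=3):
    """Return, per surname occurrence, up to three qualifying nearby English terms."""
    terms = re.findall(r"[A-Za-z]+", strip_tags(snippet_html))

    def qualifies(term):
        lowered = term.lower()
        return (lowered not in dictionary      # not a common dictionary word
                or lowered in western_names    # a Western first name
                or len(term) == 1)             # a single letter (initial)

    candidates = []
    for i, term in enumerate(terms):
        if term.lower() != surname.lower():
            continue
        nearby = []
        for offset in range(1, window + 1):    # nearest terms first
            for j in (i - offset, i + offset):
                if 0 <= j < len(terms):
                    nearby.append(terms[j])
        candidates.append([t for t in nearby if qualifies(t)][:3])
    return candidates


# Hypothetical snippet and resources.
snippet = "<b>陳偉達</b> (Wei-Da Chen) teaches networks"
found = extract_candidates(snippet, "Chen", {"teaches", "networks"}, {"frank"})
```

The candidate lists produced this way would then be handed to the SVM-based name selector.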

77
System Architecture 4/10 - Candidate extractor
  • Step 1: Identifying all the positions of the
    Chinese surnames in the snippets.
  • Step 2: Extracting any English term near each
    surname in the snippets if the term has one of
    the following properties
  • The term cannot be found in a common
    dictionary.
  • The term is a Western first name.
  • The length of the term is 1.
  • The extracted terms become the name translation
    candidates and are sent to the SVM-based name
    selector for processing.

78
SVM-based Name Selector
  • Goal
  • To extract each candidate's features and utilize
    them to determine whether the candidate is the
    correct translation of the input Chinese name.
  • Features
  • The phonetic feature
  • Phonetic similarity
  • Uses: Soundex algorithm
  • The distant feature
  • Smallest distance (between the Chinese name and
    the translation candidates)
  • Number of appearances in the neighborhood
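
For readers unfamiliar with Soundex, the sketch below implements the standard American Soundex code. How the system turns Soundex codes into a phonetic similarity value is not stated in the slides, so the final comparison (equal codes for a candidate and a romanization) is only an assumed illustration.

```python
import string

# Standard Soundex digit groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit                # vowels reset the previous code
    return (code + "000")[:4]


# Assumed use: two romanizations that sound alike share a code.
same_sound = soundex("Chen") == soundex("Chan")
```

Names that differ only in vowels map to the same code, which is useful when one Chinese name is romanized under several pronunciation rules.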

79
Distant Features
  • The neighborhood
  • The close area around each occurrence of the Chinese
    name.
  • The close area is defined by a given distance
    threshold, measured in number of words.

Smallest distance: 2
Number of appearances in the neighborhood of the
candidate win-shih: 2
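
The two distant features can be sketched on a tokenized snippet. This is an illustration under assumptions: distance is taken here as the difference between token positions, which may differ from the distance formula used in the system, and the function name, threshold, and token list are hypothetical.

```python
def distance_features(tokens, chinese_name, candidate, threshold=3):
    """Return (smallest distance, number of appearances in the neighborhood)."""
    name_positions = [i for i, t in enumerate(tokens) if t == chinese_name]
    candidate_positions = [i for i, t in enumerate(tokens)
                           if t.lower() == candidate.lower()]
    if not name_positions or not candidate_positions:
        return None, 0
    # Smallest gap between any occurrence of the name and of the candidate.
    smallest = min(abs(c - n) for n in name_positions for c in candidate_positions)
    # Candidate occurrences lying within the threshold of some name occurrence.
    in_neighborhood = sum(
        1 for c in candidate_positions
        if any(abs(c - n) <= threshold for n in name_positions)
    )
    return smallest, in_neighborhood


# Hypothetical token sequence; NAME marks an occurrence of the Chinese name.
tokens = ["NAME", "Jane", "Win-Shih", "Liu", "and", "NAME", "Prof", "Win-Shih"]
features = distance_features(tokens, "NAME", "win-shih")
```

Both values, together with the phonetic feature, would form the feature vector passed to the SVM.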
80
Summary
  • Query expansion technique
  • Retrieving related Web pages.
  • Knowledge-based method
  • Extracting appropriate name translation
    candidates from the retrieved Web pages.
  • SVM
  • Learning the verification rule and
  • Selecting appropriate name translation from
    extracted candidates.

81
Agenda
  • Introduction
  • Proposed Approach
  • Experiments
  • Conclusions and Future Work

82
Testing Environment and Dataset 1/3
  • The following tools are used
  • Cambridge on-line dictionary
  • Google search engine
  • LIBSVM
  • Two datasets are used
  • Dataset I (training and testing)
  • Collected from the directory of scholars of the
    Institute of Mathematics.
  • Contains 78 entries.
  • Dataset II (testing)
  • Collected by our program from the website of the
    Directory of the Division of Computer Science of
    the National Science Council.
  • Contains 1,157 entries; the name translations
    of 40 of them do not exist in Google.

83
Testing Environment and Dataset 2/3
Each entry lists the name format, examples, and the number (share) of names of that type in Dataset I and Dataset II.
Type-1. Format: (Chinese given name) (Surname) or (Surname), (Chinese given name). Examples: ???(Jen-Wen Ding), ???(Der-Rong Din), ???(Ming Ouhyang). Dataset I: 19 (24.3%). Dataset II: 1000 (89.5%).
Type-2. Format: (Merged Chinese given name) (Surname). Example: ???(Piyu Tsai). Dataset I: 10 (12.8%). Dataset II: 42 (3.8%).
Type-3. Format: (Western first name) (Surname). Example: ???(Eugene Lai). Dataset I: 9 (11.5%). Dataset II: 9 (0.8%).
Type-4. Format: (Chinese given name) (Western first name) (Surname). Examples: ???(Alan Li-Sung liu), ???(Jia-Yih Joy Chen), ???(Fongray Frank Young). Dataset I: 14 (17.9%). Dataset II: 50 (4.5%).
Type-5. Format: (Abbreviated Chinese given name) (Surname). Example: ???(I.-C. Hung). Dataset I: 3 (3.8%). Dataset II: 0 (0%).
Type-6. Format: (Western first name) (Abbreviated Chinese given name) (Surname). Example: ???(Judy C. R. Tseng). Dataset I: 8 (10.3%). Dataset II: 9 (0.8%).
Type-7. Format: (Chinese given name) (Abbreviated Chinese given name) (Surname). Example: ???(Tetz C. Huang). Dataset I: 3 (3.8%). Dataset II: 3 (0.4%).
Type-8. Format: (Chinese given name) (Unpredictable Surname). Example: ???(Trieu-Kien Truong). Dataset I: 12 (15.4%). Dataset II: 4 (0.4%).
84
Testing Environment and Dataset 3/3
  • The alignment accuracy
  • Proposed by Huang (2005).
  • The probability of selecting the correct answers
    when the searched snippets contain the correct
    answers.
  • A
  • where
  • A_i: the alignment accuracy of candidate i.
  • N_d: the number of testing data.
  • N_cc: the number of correct translations.
  • Performance measurement Top-1 to Top-5 alignment
    accuracy.

85
Results and Analysis 1/3 - Overall performance on
Dataset I
70.5% top-1 accuracy; 91% top-5 accuracy
86
Results and Analysis 2/3 - Overall performance
on Dataset II
57.9% top-1 accuracy; 86.2% top-5 accuracy
87
Results and Analysis 3/3 - Performance of each
name type
Name format Example
Type-1 ???(Jen-Wen Ding) ???(Der-Rong Din) ???(Ming Ouhyang)
Type-2 ???(Piyu Tsai)
Type-3 ???(Eugene Lai)
Type-4 ???(Alan Li-Sung liu) ???(Jia-Yih Joy Chen)
Type-5 ???(I.-C. Hung)
Type-6 ???(Judy C. R. Tseng)
Type-7 ???(Tetz C. Huang)
Type-8 ???(Trieu-Kien Truong)
Our system performs better on Type-1, Type-2,
Type-4, and Type-6.
88
Discussions
  • Major reasons for the low performance on Type-3,
    Type-5, Type-7 and Type-8
  • The lack of Web information.
  • Usually more than one correct name translation
    is found for an input Chinese name.
  • The name ambiguity problem.

89
Limitations
  • Uncommon surname
  • Rely on Web resources
  • Search engine selection
  • No name disambiguation

90
Agenda
  • Introduction
  • Proposed Approach
  • Experiments
  • Conclusions

91
Conclusions
  • Mining information from Web corpora is
    effective for dealing with the person name
    translation problem
  • Name ambiguity problem arises frequently

92
Thank You
Jan-Ming Ho
hoho@iis.sinica.edu.tw
Institute of Information Science
Academia Sinica