Alias Detection in Link Data Sets

Slides: 37
Provided by: csC76
Learn more at: http://www.cs.cmu.edu

1
Alias Detection in Link Data Sets
  • Master's Thesis
  • Paul Hsiung

2
Alias Definition
  • Aliases of names
  • Dubya = G.W. Bush
  • Usama = Osama
  • G.W. Bush = the President
  • Osama bin Laden = the Emir, the Prince
  • Misspelled words
  • Unintentional (typos)
  • Intentional: mortgage = m0rtg@ge (spam)

3
In What Context Do Aliases Occur?
  • Newspaper articles
  • Web pages
  • Spam emails
  • Any collection of text

4
Link Data Set
  • A way to represent the context
  • Composed of a set of names and a set of links
  • Names are extracted from the text
  • Different names can refer to the same entity
    (Dubya and G.W. Bush)
  • Each link is a collection of names and represents
    a relationship between those names

5
Example
  • Wanted al-Qaeda terror network chief Osama bin
    Laden and his top aide, Ayman al-Zawahri, have
    moved out of Pakistan and are believed to have
    crossed the mountainous border back into
    Afghanistan.
  • (Osama bin Laden, Ayman al-Zawahri, al-Qaeda)
  • (Pakistan, Osama bin Laden)
  • (Afghanistan, Osama bin Laden)
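
As a rough illustration, the three links above can be held as plain tuples and turned into a graph. This is a minimal sketch, not the thesis implementation; the helper names `names_in` and `neighbors` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# The three links from the example; each link is a tuple of names.
links = [
    ("Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"),
    ("Pakistan", "Osama bin Laden"),
    ("Afghanistan", "Osama bin Laden"),
]

def names_in(links):
    """Return the set of all names mentioned in any link."""
    return {name for link in links for name in link}

def neighbors(links):
    """Map each name to the set of names it shares at least one link with."""
    graph = defaultdict(set)
    for link in links:
        for a, b in combinations(link, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

graph = neighbors(links)
```

Here `graph["Pakistan"]` contains only Osama bin Laden, because that is the only name Pakistan shares a link with.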

6
Graph Representation
  [Figure: graph whose nodes are Pakistan, al-Qaeda, Afghanistan, Osama, and Ayman]

7
Advantages
  • A link data set is easily processed by computers
  • Mimics the way intelligence communities gather data

8
Alias Detection
  • Given two names in a link data set, are they
    aliases (i.e. do they refer to the same entity?)
  • How to measure their alias-ness?
  • Semi-supervised learning

9
Orthographic Measures
  • String edit distance
  • Minimum number of insertions, deletions, and
    substitutions required to transform one name into
    the other
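  • SED(Osama, Usama) = 2
  • SED(Osama, Bush) = 7
  • (These values are consistent with a substitution
    costing 2, i.e. one deletion plus one insertion)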
  • Intuitive measure
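
A small dynamic-programming sketch of string edit distance follows. The `sub_cost` parameter is an assumption added for illustration: with a substitution cost of 2 the function reproduces the slide's figures (2 and 7), while the common unit-cost Levenshtein variant gives 1 and 5 for the same pairs.

```python
def string_edit_distance(a, b, sub_cost=1):
    """Edit distance between a and b; insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama", sub_cost=2))  # 2
print(string_edit_distance("Osama", "Bush", sub_cost=2))   # 7
```

The comparison is case-sensitive, so "Usama" and "Bush" do not share their "u".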

10
Some Orthographic Measures
  • String edit distance
  • Normalized string edit distance
  • Discretized string edit distance

11
Semantic Measures
  • But what about aliases such as the Prince and
    Osama?
  • Define friends of Osama as people who have
    occurred in same links with Osama
  • Through link data sets, number of occurrences of
    each friend can be collected
  • Intuition: the friends of the Prince look like
    the friends of Osama
  • Treat friends as probability vectors
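
One way to collect friends and turn the counts into a probability vector is sketched below. The toy links and the function names `friends` and `to_probabilities` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical toy links, not taken from the thesis data.
links = [
    ("Osama", "al-Qaeda", "CNN"),
    ("Osama", "al-Qaeda"),
    ("the Prince", "al-Qaeda"),
    ("the Prince", "CNN"),
]

def friends(name, links):
    """Count how often each other name occurs in a link together with name."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(other for other in link if other != name)
    return counts

def to_probabilities(counts):
    """Scale friend counts so that they sum to 1."""
    total = sum(counts.values())
    return {friend: count / total for friend, count in counts.items()}

osama_friends = friends("Osama", links)
osama_vector = to_probabilities(osama_friends)
```

With these toy links, Osama co-occurs with al-Qaeda twice and with CNN once, so the vector assigns them 2/3 and 1/3.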

12
Example of Friends
  [Figure: Osama linked to each friend with a co-occurrence count: al-Qaeda 10, CNN 2, Islam 5]

13
Comparing Two Friends Lists
  [Figure: two friends lists side by side. Osama: al-Qaeda 10, CNN 2, Islam 5. The Prince: al-Qaeda 2, CNN 8, Music 50]

14
Some Semantic Measures
  • Dot product: 10 × 2 + 2 × 8 = 36
  • Normalized dot product
  • Common friends: 2 (CNN, al-Qaeda)
  • KL distance
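
The four measures can be sketched on the counts from the friends-list example. Several details are assumptions, since the slides do not define them: the normalized dot product is taken to be cosine similarity, the KL distance uses a small additive smoothing constant `eps` to avoid zero probabilities, and the Islam and Music counts are a reading of the figure. The dot product and common-friends values do not depend on that reading.

```python
import math

# Friend counts as read from the friends-list example (see assumptions above).
osama = {"al-Qaeda": 10, "CNN": 2, "Islam": 5}
prince = {"al-Qaeda": 2, "CNN": 8, "Music": 50}

def dot_product(p, q):
    """Sum of products of counts over friends present in both lists."""
    return sum(p[k] * q[k] for k in p.keys() & q.keys())

def normalized_dot_product(p, q):
    """Cosine similarity (assumed normalization)."""
    norm = math.sqrt(dot_product(p, p)) * math.sqrt(dot_product(q, q))
    return dot_product(p, q) / norm if norm else 0.0

def common_friends(p, q):
    """Number of friends that appear in both lists."""
    return len(p.keys() & q.keys())

def kl_distance(p, q, eps=1e-6):
    """KL divergence of smoothed, normalized friend vectors (assumed form)."""
    keys = p.keys() | q.keys()
    p_total = sum(p.values()) + eps * len(keys)
    q_total = sum(q.values()) + eps * len(keys)
    total = 0.0
    for k in keys:
        pk = (p.get(k, 0) + eps) / p_total
        qk = (q.get(k, 0) + eps) / q_total
        total += pk * math.log(pk / qk)
    return total

print(dot_product(osama, prince))     # 36
print(common_friends(osama, prince))  # 2
```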

15
Classifier
  • So we have a link data set
  • We have some measures of what aliases are
  • We can easily hand-pick some examples of aliases
  • Let's build a classifier!

16
Classifier Training Set
  • Positive examples: hand-picked pairs of names in
    the link data set that are known aliases
  • Negative examples: randomly picked pairs of names
    from the same link data set
  • Calculate measures for all the pairs and insert
    them as attributes into the training set
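
A minimal sketch of assembling such a training set is shown below. The two measures are hypothetical stand-ins for the orthographic and semantic measures, and skipping known alias pairs when drawing random negatives is an assumption the slides do not state.

```python
import random
from itertools import combinations

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (attributes, label) rows: 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in alias_pairs}
    # Assumption: random negatives skip pairs already marked as aliases.
    candidates = [pair for pair in combinations(sorted(names), 2)
                  if frozenset(pair) not in known]
    negatives = rng.sample(candidates, min(n_negative, len(candidates)))
    rows = [([m(a, b) for m in measures], 1) for a, b in alias_pairs]
    rows += [([m(a, b) for m in measures], 0) for a, b in negatives]
    return rows

# Hypothetical stand-in measures, only to keep the sketch self-contained.
def length_difference(a, b):
    return abs(len(a) - len(b))

def shared_letters(a, b):
    return len(set(a.lower()) & set(b.lower()))

names = {"Osama", "Usama", "Dubya", "G.W. Bush", "CNN", "Pakistan"}
alias_pairs = [("Osama", "Usama"), ("Dubya", "G.W. Bush")]
rows = build_training_set(
    names, alias_pairs, [length_difference, shared_letters], n_negative=4
)
```

Each row holds one attribute per measure, followed by the label.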

17
Classifier Example
18
Classifier Cross-Validation
  • Experimented with Decision Trees, k-Nearest
    Neighbors, Naïve Bayes, Support Vector Machines,
    and Logistic Regression
  • Logistic Regression performed the best
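
For readers unfamiliar with logistic regression, here is a minimal gradient-descent sketch in plain Python. It is not the thesis implementation: the learning rate, epoch count, and the toy rows (an edit distance and a normalized dot product per pair) are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in rows:
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def score(model, x):
    """Estimated probability that a pair with attributes x is an alias pair."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical rows: ([edit distance, normalized dot product], label).
toy_rows = [([1.0, 0.9], 1), ([2.0, 0.8], 1), ([6.0, 0.1], 0), ([7.0, 0.2], 0)]
model = train_logistic_regression(toy_rows)
```

On this toy data, a pair with a small edit distance and similar friends scores above 0.5, and a distant pair scores below it.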

19
Prediction
  • Given a query name in the link data set with
    known aliases
  • Pair query name with ALL other names
  • Calculate attributes for all pairs
  • Run each pair through the classifier and obtain a
    score (how likely are they to be aliases?)

20
Example
21
Prediction
  • Use the score to sort the pairs from most likely
    to be an alias to least likely
  • See where the true aliases lie in the sorted list
    and produce a ROC curve
  • Evaluate classifier based on ROC curve
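
The ranking step can be sketched as follows. The scoring function here is a hypothetical stand-in (letter overlap) for the trained classifier's score, used only so that the example runs on its own.

```python
def rank_candidates(query, names, score_pair):
    """Pair the query with every other name and sort by descending score."""
    candidates = [name for name in names if name != query]
    return sorted(candidates, key=lambda name: score_pair(query, name), reverse=True)

def letter_overlap(a, b):
    """Stand-in score: Jaccard overlap of the letters in the two names."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

names = ["Osama", "Usama", "Bush", "CNN"]
ranking = rank_candidates("Osama", names, letter_overlap)
print(ranking)  # ['Usama', 'Bush', 'CNN']
```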

22
Summary
  [Diagram: true alias pairs (excluding the query name) and random pairs -> Calc Attributes -> Train Logistic Regression; query name -> Calc Attributes -> Run Classifier -> ROC curve]

23
ROC Curve
  • Start from (0,0) on the graph
  • Go down the sorted list
  • If the name on the list is a true alias, move y
    by one unit
  • If the name on the list is not a true alias, move
    x by one unit
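
The walk described above can be written directly. The ranked list and the set of true aliases in this sketch are hypothetical.

```python
def roc_points(ranked_names, true_aliases):
    """Trace the curve: step up for a true alias, step right otherwise."""
    x = y = 0
    points = [(0, 0)]
    for name in ranked_names:
        if name in true_aliases:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points

# Hypothetical sorted list for the query "Osama" and its true aliases.
ranked = ["Usama", "Bush", "the Prince", "CNN"]
aliases = {"Usama", "the Prince"}
print(roc_points(ranked, aliases))
# [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```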

24
Perfect ROC Example
  [Figure: ROC plot with both axes running from 0 to 3]

25
ROC Example
  [Figure: ROC plot with both axes running from 0 to 3]

26
ROC Normalize
  • Balance positive and negative examples
  • Area under curve (AUC) = 5/9
  • Able to average multiple curves
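
A sketch of the normalized area computation follows. The ranking used here is hypothetical: it was chosen with three true aliases and three non-aliases so that the result matches the 5/9 quoted above, and it is not claimed to be the ranking behind the original figure.

```python
from fractions import Fraction

def normalized_auc(is_alias_flags):
    """Area under the ROC walk after rescaling both axes to [0, 1]."""
    positives = sum(is_alias_flags)
    negatives = len(is_alias_flags) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one alias and one non-alias")
    area = 0
    height = 0
    for flag in is_alias_flags:
        if flag:
            height += 1     # step up: one more true alias found
        else:
            area += height  # step right: add a column of the current height
    return Fraction(area, positives * negatives)

# Hypothetical sorted list: True marks a true alias.
print(normalized_auc([True, False, True, False, False, True]))  # 5/9
print(normalized_auc([True, True, True, False, False, False]))  # 1
```

Because the area is divided by the product of the two counts, curves built from lists of different sizes share the same unit square and can be averaged.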

  [Figure: normalized ROC plot with both axes rescaled to run from 0 to 1]

27
Empirical Results
  • Tested on one web page link data set and two spam
    link data sets
  • Aliases are hand-picked for each set

28
Empirical Results
  • Choose an alias from the set of hand-picked
    aliases as the query name
  • Build the classifier from the other aliases that
    are not aliases of the query name
  • Do prediction and obtain a ROC curve
  • Repeat for each alias in the set of hand-picked
    aliases
  • Average all ROC curves on normalized axes

29
Evaluation
  • We want to know how significant each group of
    attributes is
  • Train one classifier with just orthographic
    attributes
  • Train another with just semantic attributes
  • Train a third with both sets of attributes
  • Compare the curves and their areas under the
    curve (AUC)

30
Terrorist Data Set
  • Manually extracted from public web pages
  • News and articles related to terrorism
  • Names mentioned in the articles are subjectively
    linked
  • Used 919 alias pairs for training

31
Web Page Chart
32
Spam Data Set
  • Collection of spam emails
  • Filter out html tags
  • All the words are converted to tokens with white
    spaces being the boundaries
  • Common tokens are filtered (e.g. "the", "a")
  • Each email represents a link
  • Each link contains tokens from corresponding email

33
Example
  • Subject: Mortgage rates as low as 2.95
  • Ref<suyzvigcffl>ina<swwvvcobadtbo>nce
    to<shecpgkgffa>day to as low as
  • 2.<sppyjukbywvbqc>95 Sa<scqzxytdcua>ve
    thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of
    dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or
    b<skaavzibaenix>uy the <br>
  • ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur
    dr<sxjsfyvhhejoldl>eams!<br>
  • Filtered to:
  • (mortgage, rates, low, refinance, today,
    save, thousands, dollars, home, dreams)
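
The filtering steps for the spam data set can be sketched as below. The stop-word list, the punctuation stripping, the removal of non-alphabetic tokens, and the de-duplication are assumptions made for illustration; the slides state only that HTML tags and common tokens are removed and that whitespace separates tokens.

```python
import re

# Hypothetical stop-word list; the actual list of "common tokens" is not given.
STOP_WORDS = {"the", "a", "as", "to", "of", "or", "your"}

def email_to_link(text):
    """Turn one email into a link: a tuple of distinct, filtered tokens."""
    without_tags = re.sub(r"<[^>]*>", "", text)  # drop real and fake HTML tags
    tokens = without_tags.lower().split()        # whitespace is the boundary
    cleaned = (token.strip(".,!?:;") for token in tokens)
    kept = (t for t in cleaned if t.isalpha() and t not in STOP_WORDS)
    return tuple(dict.fromkeys(kept))            # de-duplicate, keep order

sample = ("Mortgage rates as low as 2.95 "
          "Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day")
print(email_to_link(sample))
# ('mortgage', 'rates', 'low', 'refinance', 'today')
```

Removing the tags first is what rejoins words such as "refinance" that the spammer split with fake markup.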

34
Spam I Chart
35
Spam II Chart
36
Conclusion
  • Orthographic measures work well
  • Semantic measures are sometimes better and
    sometimes worse than orthographic ones
  • Combining them produces the best results
  • Future work includes adding other measures such
    as phonetic string edit distance
  • Larger question: many aliases to many names