Instance-based mapping between thesauri and folksonomies

1
Instance-based mapping between thesauri and
folksonomies
  • Christian Wartena
  • Rogier Brussee
  • Telematica Instituut

2
Outline
  • Interoperability of Keywords
  • Wikipedia and del.icio.us
  • Keyword similarity
  • Experiment
  • Conclusion

3
Interoperability of Keywords
  • Documents (pictures, movies, ...) are annotated
    with keywords for organization and retrieval.
  • In different collections/communities different
    sets of keywords are used.
  • The set of selectable keywords is often organized
    in and delimited by a thesaurus.
  • The set of freely generated end-user keywords
    (tags) forms a folksonomy.
  • Align keywords/tags by comparing usage.
  • Tested on del.icio.us tags and Wikipedia
    categories.

4
del.icio.us and Wikipedia
  • Del.icio.us
  • Social bookmarking site
  • Bookmarks in most cases can be interpreted as
    labels or tags for the bookmarked URL.
  • Many Wikipedia articles are tagged by del.icio.us
    users
  • Wikipedia
  • Articles are labeled with one or more categories
    by the article authors.
  • Categories are organized hierarchically.
  • Categories are organized deliberately, as in a
    thesaurus.
  • New categories are introduced after discussions
    between active Wikipedians.

5
Keyword alignment
  • Problem
  • Given a keyword k in system A, what is the most
    similar keyword k' in system B?
  • Given a tag from del.icio.us, what is the most
    similar Wikipedia category (or vice versa)?
  • Approach
  • Interpret similarity as similarity of usage.
  • Compute similarity of usage on a common
    sub-collection.
  • Evaluation
  • Compare results to human judgment of similarity.

6
Keyword similarity
  • Basic assumption: similarity is similarity of
    usage.
  • If two keywords have similar usage they will give
    similar results in retrieval tasks.
  • Two keywords have similar usage if they
  • Have a similar distribution over documents
  • Divergence (relative entropy) of distributions
  • Cosine
  • Often co-occur
  • Jaccard coefficient
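
The two classical measures named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the toy `usage` table and the function names `jaccard` and `cosine` are assumptions made for the example. Note that the Jaccard coefficient only looks at which documents a keyword occurs on, while the cosine also uses how often it was assigned.

```python
import math

# Hypothetical toy data: keyword -> {document id: number of times assigned}
usage = {
    "python": {"d1": 3, "d2": 1, "d3": 2},
    "programming": {"d1": 2, "d3": 1, "d4": 1},
    "cooking": {"d5": 4},
}

def jaccard(a, b):
    """Jaccard coefficient of the sets of documents two keywords occur on."""
    docs_a, docs_b = set(a), set(b)
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between two keyword-by-document count vectors."""
    dot = sum(count * b.get(doc, 0) for doc, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(jaccard(usage["python"], usage["programming"]))
print(cosine(usage["python"], usage["programming"]))
```

Both functions return 0 for keywords that never share a document, which is the sparse-data weakness the co-occurrence-based measure on the next slide is meant to address.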

7
New measure for keyword similarity
  • Keywords have similar usage if they co-occur with
    similar frequency with all other keywords.
  • We use the frequency with which a tag/keyword is
    assigned to a document.
  • We include co-occurrence information with other
    terms.
  • Helps to cope with sparse data
  • In other words
  • Terms are similar if they have similar
    co-occurrence patterns
  • Similar to Tag Context Similarity in Cattuto et
    al.'s presentation tomorrow (Semantic Social
    Networks Session)

8
(No Transcript)
9
Formalization: Distribution of co-occurring terms
  • $\bar{p}_z(t) = \sum_d q(t|d)\, Q(d|z)$
  • where
  • q(t|d) is the keyword distribution of d
  • Q(d|z) is the document distribution of z
  • The fraction of the occurrences of z that is
    found in d
  • Weighted average of the keyword distributions of
    documents
  • The weight is the relevance of d for z given by
    the probability Q(d|z)
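
The weighted average described in these bullets can be sketched in a few lines. The sketch assumes the reading that the co-occurrence distribution of a keyword z is the sum over documents d of q(t|d) times Q(d|z), with q(t|d) = n(d,t)/N(d) and Q(d|z) = n(d,z)/n(z), where n(d,t) counts how often t was assigned to d. The toy `annotations` data and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy annotations: document -> {keyword: assignment count n(d, t)}
annotations = {
    "d1": {"python": 2, "programming": 2},
    "d2": {"python": 1, "snake": 3},
}

def cooccurrence_distribution(z, annotations):
    """Return the distribution over keywords t co-occurring with z.

    Computes sum_d q(t|d) * Q(d|z) under the assumptions stated above.
    """
    total_z = sum(doc.get(z, 0) for doc in annotations.values())  # n(z)
    if total_z == 0:
        raise ValueError(f"keyword {z!r} does not occur in the collection")
    p_z = defaultdict(float)
    for doc in annotations.values():
        n_dz = doc.get(z, 0)
        if n_dz == 0:
            continue  # documents without z get weight Q(d|z) = 0
        weight = n_dz / total_z        # Q(d|z)
        doc_total = sum(doc.values())  # N(d)
        for t, n_dt in doc.items():
            p_z[t] += weight * n_dt / doc_total  # q(t|d) * Q(d|z)
    return dict(p_z)

p_python = cooccurrence_distribution("python", annotations)
print(p_python)
```

Because each q(.|d) sums to 1 and the weights Q(d|z) sum to 1, the result is again a probability distribution. Note that z itself receives probability mass, since it occurs on every document that contributes.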

10
Distance of keywords
  • For each keyword there is a distribution over all
    (other) keywords.
  • Similarity is expressed by divergence of these
    distributions
  • Kullback-Leibler divergence:
    $D(p \| q) = \sum_t p(t) \log_2 \frac{p(t)}{q(t)}$
  • Bits per keyword saved by compressing a
    subcollection with keyword distribution p using p
    instead of a general distribution q.
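
As a small worked example of the Kullback-Leibler divergence in bits, the sketch below uses the standard definition, the sum over t of p(t) log2(p(t)/q(t)). The dictionaries `p` and `q` are made-up distributions and `kl_divergence` is a name chosen for this example.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; p and q map keywords to probabilities.

    Terms with p(t) == 0 contribute nothing. The result is infinite if
    q(t) == 0 for some t with p(t) > 0.
    """
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return math.inf
        total += p_t * math.log2(p_t / q_t)
    return total

# Hypothetical distributions: p is concentrated on two keywords,
# q spreads its mass over three.
p = {"python": 0.5, "programming": 0.5}
q = {"python": 0.25, "programming": 0.25, "snake": 0.5}

print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The two calls give different values, which illustrates that the Kullback-Leibler divergence is not symmetric and can be infinite when the supports differ.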

11
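Distance of keywords (cont'd)
  • Jensen-Shannon divergence:
    $\mathrm{JSD}(p, q) = \tfrac{1}{2} D(p \| m) + \tfrac{1}{2} D(q \| m)$,
    where D is the Kullback-Leibler divergence
  • Mean distribution: $m = \tfrac{1}{2}(p + q)$
  • Jensen-Shannon divergence is symmetric.
  • Jensen-Shannon divergence is the square of a
    non-negative distance satisfying the triangle
    inequality.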
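
The properties listed on this slide can be checked numerically with a short sketch. It uses the standard definition of the Jensen-Shannon divergence (the average Kullback-Leibler divergence of p and q to their mean distribution) with base-2 logarithms, which is an assumption about the slides' convention. The function names and the toy distributions are hypothetical.

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD(p, q) in bits for dicts mapping keywords to probabilities."""
    jsd = 0.0
    for t in set(p) | set(q):
        p_t, q_t = p.get(t, 0.0), q.get(t, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mean distribution; positive if p_t or q_t is
        if p_t > 0:
            jsd += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            jsd += 0.5 * q_t * math.log2(q_t / m_t)
    return jsd

def jensen_shannon_distance(p, q):
    """Square root of the divergence; this is the quantity that is a metric."""
    return math.sqrt(max(jensen_shannon_divergence(p, q), 0.0))

# Hypothetical keyword distributions
a = {"python": 0.6, "code": 0.4}
b = {"python": 0.3, "code": 0.3, "snake": 0.4}
c = {"snake": 1.0}

print(jensen_shannon_divergence(a, b), jensen_shannon_divergence(b, a))
print(jensen_shannon_distance(a, c)
      <= jensen_shannon_distance(a, b) + jensen_shannon_distance(b, c))
```

Unlike the Kullback-Leibler divergence, this measure stays finite when the two distributions have different supports; with base-2 logarithms it reaches its maximum of 1 bit for distributions with disjoint support.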

12
Alignment
  • Consider a collection of documents annotated with
    different sets of keywords.
  • Represent a keyword by a distribution over terms
    from both collections.
  • For each term find the closest term from the
    other collection.
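
The three alignment steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy collection, the "tag:" and "cat:" prefixes, and the helper names are all hypothetical, and the co-occurrence distribution is taken to be the sum over documents d of q(t|d) times Q(d|z).

```python
import math
from collections import defaultdict

# Hypothetical toy collection: each document carries annotations from both
# vocabularies. The prefixes only keep the two vocabularies apart.
annotations = {
    "d1": {"tag:python": 3, "tag:code": 1, "cat:Programming languages": 1},
    "d2": {"tag:python": 1, "tag:code": 2, "cat:Programming languages": 1},
    "d3": {"tag:recipe": 4, "cat:Cooking": 1},
    "d4": {"tag:recipe": 2, "tag:food": 1, "cat:Cooking": 1},
}

def cooccurrence_distribution(z, annotations):
    """Distribution over terms of both vocabularies co-occurring with z."""
    total_z = sum(doc.get(z, 0) for doc in annotations.values())
    p_z = defaultdict(float)
    for doc in annotations.values():
        n_dz = doc.get(z, 0)
        if n_dz:
            doc_total = sum(doc.values())
            for t, n_dt in doc.items():
                p_z[t] += (n_dz / total_z) * (n_dt / doc_total)
    return dict(p_z)

def jsd(p, q):
    """Jensen-Shannon divergence in bits."""
    result = 0.0
    for t in set(p) | set(q):
        p_t, q_t = p.get(t, 0.0), q.get(t, 0.0)
        m_t = 0.5 * (p_t + q_t)
        if p_t > 0:
            result += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            result += 0.5 * q_t * math.log2(q_t / m_t)
    return result

def align(sources, targets, annotations):
    """Map each source term to the target term at the smallest divergence."""
    dist = {k: cooccurrence_distribution(k, annotations)
            for k in list(sources) + list(targets)}
    return {s: min(targets, key=lambda t: jsd(dist[s], dist[t]))
            for s in sources}

keywords = {k for doc in annotations.values() for k in doc}
tags = sorted(k for k in keywords if k.startswith("tag:"))
categories = sorted(k for k in keywords if k.startswith("cat:"))

tag_to_category = align(tags, categories, annotations)
category_to_tag = align(categories, tags, annotations)
print(tag_to_category)
print(category_to_tag)
```

The mapping is computed separately in each direction, so it need not be symmetric: the closest category for a tag may itself have a different closest tag.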

13
Experiment I
  • Mapping between Teleblik keywords and User Tags
  • Educational videos.
  • Professional keywords from public broadcasting
    archive.
  • Keywords assigned in an experiment by high school
    students.
  • Data
  • 100 videos
  • 12,414 tags
  • 4,348 different tags
  • 269 different keywords

14
Experiment II
  • Mapping between del.icio.us tags and Wikipedia
    categories
  • Del.icio.us tags collected by Mathias Lux
    (Klagenfurt Univ.)
  • Data
  • 58,345 Wikipedia articles
  • 500,618 tags and category annotations
  • 42,425 different Wikipedia categories
  • 49,603 different tags
  • Mappings computed for tags occurring on at least
    10 docs.
  • Mappings for 2,355 tags
  • Mappings for 1,827 categories
  • Using co-occurrence data with all 49,603
    tags/categories
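
The frequency cut-off mentioned above (mappings only for terms occurring on at least 10 documents) could be applied as in the following sketch. The function name, the `min_docs` parameter and the toy data are assumptions for illustration. Only the set of terms to be mapped is restricted; the co-occurrence distributions can still be computed over the full vocabulary, as the last bullet states.

```python
from collections import Counter

def frequent_keywords(annotations, min_docs=10):
    """Return the keywords that are assigned to at least min_docs documents."""
    # Each document is a dict, so every keyword is counted once per document.
    doc_freq = Counter(k for doc in annotations.values() for k in doc)
    return {k for k, n in doc_freq.items() if n >= min_docs}

# Hypothetical toy data: document -> {keyword: assignment count}
annotations = {
    "d1": {"python": 3, "code": 1},
    "d2": {"python": 1, "snake": 2},
    "d3": {"code": 5},
}

# A threshold of 2 is used here only because the toy collection is tiny.
print(frequent_keywords(annotations, min_docs=2))
```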

15
Evaluation of mapping
  • Manual evaluation
  • Classification of a sample of mappings into
  • Broader term
  • Narrower term
  • Related term
  • Unrelated
  • Source term is not a keyword (e.g. to read)
  • Meaning unknown

16
Evaluation of aligning Wikipedia and del.icio.us
17
Distance vs. mapping quality
  • Pairs with a small distance are rated better
    than pairs with a large distance.
  • Evaluation of mappings with smallest and largest
    distance
  • a) Categories to tags
  • b) Tags to categories

18
Effect of keyword frequency
  • No correlation between keyword frequency and
    divergence with best mapping found.

19
Comparison with Jaccard-coefficient
  • Evaluation of mapping using two different
    distance measures.
  • The evaluation classes broader, narrower and
    related are merged
  • Results for
  • a) Categories to tags
  • b) Tags to categories

20
Discussion of results
  • The method works very well in the test
  • Good mapping results
  • Distance is a good indication of quality
  • Insensitive to frequency (up to a certain degree)
  • Better than Jaccard, because it uses
  • co-occurrence with other tags (tag context)
  • frequency with which a tag is assigned to a
    document.
  • Frequency information is typical for user
    generated tags.
  • We expect this method to perform less well for
    aligning keywords with other keywords (without
    assignment frequencies).
  • Distance measure also works well for clustering
    tags.

21
Future work
  • Evaluating relatedness using external sources
    (e.g. WordNet)
  • Compare to other distance measures
  • We used documents annotated completely according
    to two annotation schemes.
  • How large does the overlap have to be to obtain
    decent results?
  • We can create partial overlap of disjoint
    document sets by a partial identification of the
    keywords.
  • Detect asymmetry in relations (broader vs.
    narrower term)

22
Conclusion
  • Using co-occurrence patterns is a fruitful
    approach.
  • Frequent terms from folksonomies behave
    similarly to carefully assigned keywords,
  • because the usage-based similarity measure yields
    good mappings.
  • Folksonomy seems to work!