Concept and Theme Discovery through Probabilistic Models and Clustering

1
Concept and Theme Discovery through Probabilistic
Models and Clustering
  • Qiaozhu Mei
  • Oct. 12, 2005

2
Concepts and Themes
  • Language units in biology literature mining
  • Terms
  • Phrases
  • Entities
  • Concepts (tight groups of terms/entities
    representing semantics e.g. Gene Synonyms)
  • Themes (loose groups of terms representing
    topic/subtopics)

3
Theme Discovery
  • What we've got now
  • A generative model to extract k themes from a
    collection
  • Each theme is a language model, represented by
    the top-probability words in that model
  • KL divergence to measure the distance/similarity
    between themes
  • Retrieve the themes most similar to a term group
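
A minimal sketch of how KL divergence could rank themes against a term group, assuming each theme is stored as a word-to-probability dictionary. The function names, the smoothing constant `eps`, and the toy probabilities are illustrative assumptions, not values from the talk.

```python
import math
from typing import Dict, List

def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-9) -> float:
    """D(p || q) over the union vocabulary, with additive smoothing so q is never zero."""
    vocab = set(p) | set(q)
    # Smooth both distributions, then renormalise over the shared vocabulary.
    p_s = {w: p.get(w, 0.0) + eps for w in vocab}
    q_s = {w: q.get(w, 0.0) + eps for w in vocab}
    p_z = sum(p_s.values())
    q_z = sum(q_s.values())
    return sum(
        (p_s[w] / p_z) * math.log((p_s[w] / p_z) / (q_s[w] / q_z))
        for w in vocab
    )

def most_similar_themes(query: Dict[str, float],
                        themes: Dict[str, Dict[str, float]],
                        top_n: int = 1) -> List[str]:
    """Rank themes by increasing KL divergence from the query distribution."""
    scored = sorted((kl_divergence(query, model), name) for name, model in themes.items())
    return [name for _, name in scored[:top_n]]

# Toy theme language models (hypothetical probabilities).
themes = {
    "varroa": {"mite": 0.5, "varroa": 0.3, "brood": 0.2},
    "learning": {"memory": 0.4, "olfactory": 0.4, "brain": 0.2},
}
# A term group treated as a uniform distribution over its terms.
query = {"mite": 0.5, "brood": 0.5}
ranking = most_similar_themes(query, themes, top_n=2)
print(ranking)
```

KL divergence is asymmetric, so the direction (query against theme, or theme against query) is a design choice this sketch fixes arbitrarily.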

4
Theme Discovery (cont.)
  • What we've got now (cont.)
  • Use an HMM to segment the whole collection with
    the extracted themes
  • Use MMR to find the most representative and least
    redundant phrases to represent a theme (currently
    using n-gram probability as relevance and edit
    distance as similarity; performance to be tuned)
  • Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
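
A rough sketch of MMR phrase selection under the setup described above: a relevance score per phrase and an edit-distance-based similarity to penalise redundancy. The trade-off weight `lam`, the length normalisation, and the relevance numbers are assumptions for illustration only.

```python
from typing import Dict, List

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(relevance: Dict[str, float], k: int, lam: float = 0.5) -> List[str]:
    """Greedily pick k phrases, trading relevance against redundancy."""
    selected: List[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(phrase: str) -> float:
            # Redundancy is the similarity to the closest phrase already chosen.
            redundancy = max((similarity(phrase, s) for s in selected), default=0.0)
            return lam * relevance[phrase] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores standing in for n-gram probabilities.
relevance = {
    "juvenile hormone": 0.90,
    "juvenile hormones": 0.85,
    "vibration signal": 0.60,
}
picked = mmr_select(relevance, k=2, lam=0.5)
print(picked)
```

With these toy numbers the near-duplicate "juvenile hormones" is skipped in favour of the less relevant but non-redundant "vibration signal".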

5
Some justifications
  • Fly collection
  • Cluster 0: circadian
  • Cluster 1: adh, evolution
  • Cluster 2: a mixture of two topics, apoptosis and
    promoters
  • Cluster 6: brain development
  • Cluster 8: cell division
  • Cluster 12: drosophila immunity
  • Cluster 13: nervous systems
  • Cluster 14: hedgehog, segment polarity gene
  • Cluster 16: Histone, Polycomb
  • Cluster 17: visual system

6
Theme Discovery (cont.)
  • Problems
  • How to select k? (How many themes do we believe
    there are in the collection? The bee collection
    should have a smaller k than the fly collection.)
  • Can we find themes in a hierarchical manner?
  • This can solve the former problem; however, when
    to cut off?
  • How to represent a theme?
  • Top words: sometimes difficult to tell the
    semantics
  • Phrases?
  • Sentences?
  • Other possible approaches to extract themes?
    (LDA, clustering methods)

7
Hierarchical Theme Discovery
  • A straightforward approach (top down splitting)
  • Discover k themes from the initial collection
  • Segment the collection by the k themes
  • For each theme, build a sub-collection from the
    segments of the previous step
  • For each sub-collection, extract k themes
  • Repeat these steps iteratively
  • Problem: when to stop the splitting iteration?

[Diagram: Collection splits into Theme1, Theme2, Theme3; Theme2 splits into Theme2.1, Theme2.2, Theme2.3]
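
The top-down splitting loop can be sketched as a recursion over sub-collections. This is only a skeleton: `toy_split` is a placeholder standing in for "extract k themes, then segment the collection", and the `max_depth` and `min_size` stopping rules are assumptions, since the slide leaves the stopping criterion open.

```python
from typing import Any, Callable, Dict, List

# A splitter takes a collection and k, and returns k sub-collections.
Splitter = Callable[[List[str], int], List[List[str]]]

def build_hierarchy(docs: List[str], k: int, split: Splitter,
                    max_depth: int = 2, min_size: int = 4,
                    depth: int = 0) -> Dict[str, Any]:
    """Recursively split a collection into k sub-collections per level."""
    node: Dict[str, Any] = {"size": len(docs), "children": []}
    # Assumed stopping rule: a depth limit, or too few documents to split again.
    if depth >= max_depth or len(docs) < max(min_size, k):
        return node
    for part in split(docs, k):
        if part:
            node["children"].append(
                build_hierarchy(part, k, split, max_depth, min_size, depth + 1)
            )
    return node

def toy_split(docs: List[str], k: int) -> List[List[str]]:
    """Placeholder splitter: deals documents round-robin into k groups.
    A real system would extract k themes and segment by them instead."""
    return [docs[i::k] for i in range(k)]

tree = build_hierarchy([f"doc{i}" for i in range(18)], k=3, split=toy_split)
print(len(tree["children"]), [c["size"] for c in tree["children"]])
```

Swapping `toy_split` for a real theme extractor keeps the recursion unchanged; only the stopping rule would need rethinking.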

8
Hierarchical Theme Discovery (results)
A bee collection with 929 documents
Level 1: 5 themes
Level 2: 3 sub-themes for each higher-level theme

9
Hierarchical Theme Discovery (results)
  • african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
  • queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
  • pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
  • learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
  • varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality

10
Hierarchical Theme Discovery (results)
Level 1 themes: the same five themes shown on slide 9.
Level 2 sub-themes of level 1 theme 1 (african, jelly, royal, european, venom, ...):
  • venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality
  • african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
  • larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates

11
Hierarchical Theme Discovery (results)
Level 1 themes: the same five themes shown on slide 9.
Level 2 sub-themes of level 1 theme 2 (queen, workers, worker, signal, jh, ...):
  • food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
  • queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
  • mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg

12
Hierarchical Theme Discovery (results)
Level 1 themes: the same five themes shown on slide 9.
Level 2 sub-themes of level 1 theme 3 (pollinator, plants, pollination, flowers, ...):
  • ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
  • seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
  • pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis

13
Hierarchical Theme Discovery (results)
Level 1 themes: the same five themes shown on slide 9.
Level 2 sub-themes of level 1 theme 4 (learning, brain, conditioning, olfactory, ...):
  • bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
  • dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
  • imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein

14
Hierarchical Theme Discovery (results)
Level 1 themes: the same five themes shown on slide 9.
Level 2 sub-themes of level 1 theme 5 (varroa, mite, mites, jacobsoni, acarina, ...):
  • mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
  • viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positives p apv entomopathogen
  • pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal

15
Phrase Representations
biochemistry and molecular biophysics endocrine
system chemical coordination and homeostasis
molecular genetics biochemistry and molecular
biophysics sense organs sensory reception
animals arthropods chordates insects
invertebrates mammals system chemical
coordination and homeostasis vertebrata chordata
animalia honey bee behavior terrestrial
ecology mammalia vertebrata chordata animalia
juvenile hormone queen rodentia mammalia
vertebrata chordata animalia worker laid eggs
vibration signal genetics biochemistry and
molecular biophysics dufour s gland mammals
nonhuman mammals workers egg laying queen
mandibular gland pheromone nonhuman vertebrates
iridaceae ixia arthropoda invertebrata animalia
muridae aves vertebrata chordata animalia mug
ml
16
Hierarchical Theme Discovery (cont.)
  • A bottom-up agglomerative approach
  • Find many micro-themes
  • Group similar micro-themes into larger ones
  • Borrow a strategy from data mining
  • BIRCH: incrementally form many micro-clusters,
    organized in a tree structure
  • Macro-clustering based on the micro-clusters
  • Problem: again, when to stop?
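
A small sketch of the bottom-up idea: start from micro-themes (word distributions) and repeatedly merge the closest pair until no pair is close enough. This is plain agglomerative merging, not BIRCH itself; the Jensen-Shannon distance and the `threshold` stopping rule are assumptions chosen for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Theme = Dict[str, float]

def js_divergence(p: Theme, q: Theme) -> float:
    """Jensen-Shannon divergence (natural log), symmetric and bounded by ln 2."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl_to_m(a: Theme) -> float:
        return sum(a[w] * math.log(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)

def merge(p: Theme, q: Theme, wp: int, wq: int) -> Theme:
    """Average two themes, weighted by how many micro-themes each already holds."""
    total = wp + wq
    return {w: (wp * p.get(w, 0.0) + wq * q.get(w, 0.0)) / total
            for w in set(p) | set(q)}

def agglomerate(themes: List[Theme], threshold: float) -> List[Tuple[Theme, int]]:
    """Merge the closest pair until the closest distance exceeds the threshold."""
    clusters: List[Tuple[Theme, int]] = [(dict(t), 1) for t in themes]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: js_divergence(clusters[ij[0]][0], clusters[ij[1]][0]))
        if js_divergence(clusters[i][0], clusters[j][0]) > threshold:
            break  # assumed stopping rule: nothing left that is similar enough
        (p, wp), (q, wq) = clusters[i], clusters[j]
        rest = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters = rest + [(merge(p, q, wp, wq), wp + wq)]
    return clusters

# Toy micro-themes (hypothetical probabilities).
micro = [
    {"mite": 0.6, "varroa": 0.4},
    {"mite": 0.5, "varroa": 0.5},
    {"pollen": 0.7, "flower": 0.3},
]
macro = agglomerate(micro, threshold=0.1)
print(sorted(size for _, size in macro))
```

The threshold plays the role of the open "when to stop?" question; any value chosen here is a tuning decision, not something the slides settle.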

17
Hierarchical Theme Discovery (cont.)
  • Model-based approach
  • Hofmann, IJCAI '99.
  • Assume we know the collection is generated from a
    hierarchical structure, and use a generative model
    to learn the themes (e.g. make use of GO
    hierarchies).
  • Problem: in most cases we don't know the
    hierarchies.

18
Other Research Problems
  • Representing a theme
  • Using top words: where to cut?
  • Using phrases: have to tune the MMR (many
    possible strategies and parameters to tune)
  • Using sentences? Like summarization
  • Themes are interesting, but how to make use of
    them?
  • How to evaluate themes?

19
Concept Extraction
  • What we have now
  • N-gram algorithm (actually 2-gram): iteratively
    group the pair of terms that are most likely to be
    replaceable, considering a context of one term
    before/after each.
  • Time complexity O(N^3), space complexity now
    O(N^2). The Beespace server can handle < 9000
    terms now (2.4 GB memory). (Performance not
    evaluated; acceptable given the small data size.)
  • Problem: based on mutual information, so it
    prefers 2-grams with low frequency. Doesn't make
    use of farther context.
  • Will removing stop words help or hurt the
    performance?
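
A toy illustration of the low-frequency bias noted above, assuming the score is pointwise mutual information over adjacent word pairs. The exact scoring in the talk's algorithm is not given, so `bigram_pmi` and the token stream are stand-ins.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bigram_pmi(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy stream: one frequent pair, one pair that occurs exactly once.
tokens = ["honey", "bee"] * 10 + ["dsrc", "dabl"]
scores = bigram_pmi(tokens)
print(scores[("honey", "bee")] < scores[("dsrc", "dabl")])
```

The pair seen once outscores the pair seen ten times because both of its words are rare, which is the behaviour the slide flags as a problem.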

20
Some findings
  • A small dataset (200 abstracts containing gene
    synonyms)
  • Only 600 iterations (600 merges)
  • Most of the merges are reasonable, but not really
    useful
  • E.g. head-to-head, tail-to-tail
  • E.g. within-locus, between-locus
  • FBgn0000017: Dsrc, Dabl
  • FBgn0000078: amylase-null, AMY-null
  • Problem: document set too small, n-grams too
    sparse to find useful concepts.

21
Concept Extraction (cont.)
  • Other possible strategies
  • Lin et al., KDD '02: use a feature vector to
    represent each term; the weights are the mutual
    information between the term and each context
    feature. Thus more flexible than n-grams. (If only
    2-grams are considered as context features, this
    is similar to what we have.)
  • Use a committee to represent a cluster, which
    ensures the clusters are tight and robust.
  • Problem: not sure how to select features
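
A hedged sketch of the feature-vector idea: each term gets a vector of context features weighted by pointwise mutual information, and terms are compared by cosine similarity. The feature definition (relative position plus neighbouring word), the window size, and the toy sentences are assumptions; the committee step is not shown.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[int, str]  # (relative position, neighbouring word)

def context_vectors(sentences: List[List[str]], window: int = 1) -> Dict[str, Dict[Feature, float]]:
    """Build PMI-weighted context vectors, keeping only positive weights."""
    pair: Counter = Counter()
    term: Counter = Counter()
    feat: Counter = Counter()
    total = 0
    for sent in sentences:
        for i, t in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    f = (j - i, sent[j])
                    pair[(t, f)] += 1
                    term[t] += 1
                    feat[f] += 1
                    total += 1
    vectors: Dict[str, Dict[Feature, float]] = defaultdict(dict)
    for (t, f), count in pair.items():
        pmi = math.log(count * total / (term[t] * feat[f]))
        if pmi > 0:
            vectors[t][f] = pmi
    return vectors

def cosine(u: Dict[Feature, float], v: Dict[Feature, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy sentences; only the two synonym strings come from the slides.
sentences = [
    ["the", "amylase-null", "allele", "was", "tested"],
    ["the", "AMY-null", "allele", "was", "tested"],
    ["bees", "visit", "flowers"],
]
vecs = context_vectors(sentences)
print(cosine(vecs["amylase-null"], vecs["AMY-null"]))
print(cosine(vecs["amylase-null"], vecs["flowers"]))
```

Widening `window` or adding other feature types (syntactic relations, for example) is where this becomes more flexible than the 2-gram context, and also where the feature-selection question arises.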

22
Summary
  • Theme extraction
  • Generally performs well, if we can find a good k.
  • Hierarchical clustering can solve this problem,
    but we still need to find a reasonable stopping
    criterion.
  • Representation is an interesting problem: MMR
    phrase extraction should be further tuned
  • Difficult to evaluate other than by expert
    justification
  • Concept extraction
  • N-gram has space constraints; we haven't really
    tested the performance. Generally, the
    performance should be better on large data sets
  • Other clustering algorithms can be explored.