Title: University of Sheffield CIIR, University of Massachusetts
1 University of Sheffield; CIIR, University of
Massachusetts
Deriving concept hierarchies from text
Mark Sanderson, Bruce Croft
2 The question is...
- What paper already presented at this SIGIR is
most like the one you're about to see?
- We'll have the answer, right after this!
3 Concept hierarchies from documents?
- Hierarchy of concepts, Yahoo
- General down to specific
- Child under one or more parents
- No training data
- Why?
- Understandable
4 Current methods
5 An alternative?
- Monothetic clustering
- Clusters based on a single feature
- More Yahoo/Dewey decimal like?
- Easier to understand?
- Preferable to users?
- What about hierarchies of clusters?
6 How to arrange cluster terms?
- Existing techniques
- WordNet
- earthquake, volcano (eruption?)
- Key phrases (Hearst 1998)
- such as, especially
- Phrase classification (Grefenstette 1997)
- NP head or modifier: types of research from
research things
- Hierarchical phrase analysis (Woods 1997)
- Head modifier again, car washing under
washing, not car
7 WordNet (aside)
- 1 sense of earthquake, sense 1
- earthquake, quake, temblor, seism -- (shaking and
vibration at the surface of the earth resulting
from underground movement along a fault plane or
from volcanic activity)
- geological phenomenon -- (a natural phenomenon
involving the structure or composition of the
earth)
- natural phenomenon, nature -- (all non-artificial
phenomena)
- phenomenon -- (any state or process known through
the senses rather than by intuition or reasoning)
8 WordNet (aside)
- 5 senses of eruption, sense 1
- volcanic eruption, eruption -- (the sudden
occurrence of a violent discharge of steam and
volcanic material)
- discharge -- (the sudden giving off of energy)
- happening, occurrence, natural event -- (an event
that happens)
- event -- (something that happens at a given place
and time)
9 Start with something simpler?
- Term clustering?
- simple monothetic clusters
- No ordering.
10 Use subsumption
- Initially using subsumption.
- Finds related terms
- Decides which is more general, which is more
specific (idf?)
- Strict interpretation
- X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1
- In practice
- X s Y iff P(x|y) > 0.8, P(y|x) < 1
- P(x|y) > 0.8, P(y|x) < P(x|y)
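The subsumption test above can be sketched in a few lines of Python. This is only an illustration under assumptions: documents are modelled as sets of terms, the probabilities are estimated from document co-occurrence counts, and the last "in practice" variant (P(x|y) > 0.8, P(y|x) < P(x|y)) is used. The function name `subsumption_pairs`, the `threshold` parameter, and the toy documents are invented for the example and are not taken from the talk.

```python
from itertools import permutations

def subsumption_pairs(docs, threshold=0.8):
    """Return (x, y) pairs where term x subsumes term y.

    docs: list of sets of terms, one set per document (an assumed model).
    x subsumes y when P(x|y) > threshold and P(y|x) < P(x|y).
    """
    # Record which documents each term occurs in.
    postings = {}
    for doc_id, terms in enumerate(docs):
        for t in terms:
            postings.setdefault(t, set()).add(doc_id)

    pairs = []
    for x, y in permutations(sorted(postings), 2):
        both = len(postings[x] & postings[y])
        p_x_given_y = both / len(postings[y])  # share of y's documents that contain x
        p_y_given_x = both / len(postings[x])  # share of x's documents that contain y
        if p_x_given_y > threshold and p_y_given_x < p_x_given_y:
            pairs.append((x, y))
    return pairs

# Invented toy collection, for illustration only.
docs = [
    {"polio", "vaccine"},
    {"polio", "vaccine", "salk"},
    {"polio", "paralysis"},
    {"polio"},
    {"virus"},
]
pairs = subsumption_pairs(docs)
```

In this toy collection "polio" occurs in every document that mentions "vaccine", but not the other way round, so "polio" comes out as the more general term.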
11 How to build a hierarchy
- X s Y
- X s Z
- X s M
- X s N
- Y s Z
- A s B
- A s Z
- B s Z
[Figure: the terms X, A, Y, M, N, B, Z drawn as a graph]
Really, it's a DAG
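One plausible way to turn the subsumption pairs listed on this slide into a graph is sketched below. It is an assumption, not necessarily the authors' procedure: links that are already implied by a longer path (X s Z given X s Y and Y s Z) are dropped, and the pairs are assumed to contain no cycles. The helper name `build_dag` is hypothetical.

```python
def build_dag(pairs):
    """Map each parent term to its direct children, dropping implied links."""
    children = {}
    for parent, child in pairs:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())

    def descendants(node, seen=None):
        # Collect every term reachable below `node`.
        seen = set() if seen is None else seen
        for c in children[node]:
            if c not in seen:
                seen.add(c)
                descendants(c, seen)
        return seen

    direct = {}
    for parent, kids in children.items():
        # A child is redundant if another child of the same parent reaches it.
        implied = set()
        for k in kids:
            implied |= descendants(k)
        direct[parent] = sorted(kids - implied)
    return direct

# The pairs from the slide, written as (parent, child).
pairs = [("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
         ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z")]
dag = build_dag(pairs)
z_parents = sorted(p for p, kids in dag.items() if "Z" in kids)
```

With these pairs Z keeps two direct parents, Y and B, which is why the result is a DAG rather than a tree.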
12 How to display it?
- DAGs were big
- Unlikely to get all on screen
- Only want to see current focus plus route
taken to get there?
- Use a method users are familiar with
- Hierarchical menus
[Figure: the terms X, A, Y, M, N, B, Z shown as hierarchical menus; Z is listed twice]
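A menu view of a DAG can be sketched by walking down from each root and repeating any term that has more than one parent. The snippet below is a minimal illustration, assuming the graph is held as a hypothetical parent-to-children dictionary; the name `menu_lines` and the indentation scheme are invented for the example.

```python
def menu_lines(dag, indent="  "):
    """Flatten a parent -> children mapping into indented menu lines."""
    has_parent = {c for kids in dag.values() for c in kids}
    roots = sorted(n for n in dag if n not in has_parent)
    lines = []

    def walk(node, depth):
        lines.append(indent * depth + node)
        for child in dag.get(node, []):
            walk(child, depth + 1)  # a shared child is emitted once per parent

    for root in roots:
        walk(root, 0)
    return lines

# Assumed example graph: Z sits under both Y and B.
dag = {"X": ["M", "N", "Y"], "Y": ["Z"], "A": ["B"], "B": ["Z"],
       "M": [], "N": [], "Z": []}
lines = menu_lines(dag)
print("\n".join(lines))
```

Duplicating a shared child keeps each menu path a simple tree walk, at the cost of showing the same term in more than one place.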
13 What about ambiguity?
- Monothetic clusters of ambiguous terms?
- Derive hierarchy from retrieved documents
- Take a query and retrieve on it,
- take top 500 documents,
- build hierarchy from them.
- Topics/concepts are words/phrases taken from
- Query
- Retrieved documents
- Comparison of frequencies
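The "comparison of frequencies" step is not spelled out on the slide. One way it might work is to keep terms that are markedly more common in the retrieved documents than in the collection as a whole. The sketch below assumes that reading; the function name `select_terms`, the `min_ratio` cut-off of 2.0, and all counts are invented for illustration.

```python
def select_terms(retrieved_df, collection_df, n_retrieved, n_collection, min_ratio=2.0):
    """Keep terms over-represented in the retrieved set (assumed criterion).

    retrieved_df / collection_df: term -> number of documents containing it.
    """
    selected = []
    for term, df in retrieved_df.items():
        rate_retrieved = df / n_retrieved
        rate_collection = collection_df[term] / n_collection
        if rate_retrieved / rate_collection >= min_ratio:
            selected.append(term)
    return sorted(selected)

# Invented counts: 500 retrieved documents out of a 500,000 document collection.
retrieved_df = {"polio": 400, "vaccine": 150, "year": 300}
collection_df = {"polio": 600, "vaccine": 2000, "year": 300000}
terms = select_terms(retrieved_df, collection_df, n_retrieved=500, n_collection=500000)
```

Here "year" is dropped because it is just as common in the whole collection as in the retrieved set, so it says nothing specific about the topic.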
14-34 Poliomyelitis and Post-Polio (TREC topic 302)
35 Did you guess the paper?
- Bit like Peter Anick's work?
36 Experiment
- Test properties of hierarchy
- Does it mimic (in some way) Yahoo-like
categories?
- Parent related to child?
- Parent more general than child?
37 Experimental set-up
- Gathered eight subjects
- Presented subsumption categories and random
categories.
- Asked if a parent/child pair is interesting.
- If yes, then what type of relationship (roughly
from WordNet)?
- Aspect of
- Type of
- Same as
- Opposite of
- Don't know
38 Results
- Question of parent/child pairing interesting or
not
- Random, 51%
- Subsumption, 67%
- Difference significant from t-test, p < 0.002
- If interesting, what is parent/child type?
Odd?
39 Yahoo categories?
40 Results and conclusions
- Interesting AND (aspect of OR type of)
- Random, 28% (51% × (47% + 8%))
- Subsumption, 48% (67% × (49% + 23%))
- Appears that subsumption and an ordering based on
document frequency do a reasonable job.
- For term frequency work, see:
- Sparck Jones, K. (1972) A statistical
interpretation of term specificity and its
application in retrieval, in Journal of
Documentation, 28(1): 11-21
- Caraballo, S.A., Charniak, E. (1999) Determining
the specificity of nouns from text, in
Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP)
41 Future work?
- More user studies.
- Incorporate other term relationship techniques
- Other visualisations
- Application of techniques to whole document
collections.
- Presentation of Cross Language IR results?