Title: Text Data Mining
1Text Data Mining
- Prof. Marti Hearst
- UC Berkeley SIMS
- Guest Lecture, ME 290M
- Prof. Agogino
- May 4, 1999
2There's Lots of Text Out There
- Is it Information Overload?
3- Why not TURBO-Text?
- How can we SYNTHESIZE what's there to make new
discoveries?
4Talk Outline
- Definitions
- What is Data Mining?
- What is Text Data Mining?
- Text data mining examples
- Lexical knowledge acquisition
- Merging textual records
- Finding cures for diseases (from medical
literature)
- Future Directions
5What is Data Mining? (Fayyad & Uthurusamy 96,
Fayyad 97)
- Fitting models to or determining patterns from
very large datasets.
- A regime which enables people to interact
effectively with massive data stores.
- Deriving new information from data.
- finding patterns across large datasets
- discovering heretofore unknown information
6What is Data Mining?
- Potential point of confusion
- The "extracting ore from rock" metaphor does not
really apply to the practice of data mining
- If it did, then standard database queries would
fit under the rubric of data mining
- Find all employee records in which the employee earns
$300/month less than their manager
- In practice, DM refers to
- finding patterns across large datasets
- discovering heretofore unknown information
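To make the contrast concrete, the employee/manager request above is an ordinary self-join, not data mining. The sketch below uses Python's built-in sqlite3 module; the table layout, column names, and rows are all hypothetical, and "300/month less" is read here as "at least 300 less", which is an assumption.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, monthly_pay INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "manager_a", 5000, None),
        (2, "employee_b", 4800, 1),
        (3, "employee_c", 4500, 1),
    ],
)

# Self-join: pair each employee (e) with their manager (m), then filter on
# the pay gap. Assumption: "300/month less" means a gap of at least 300.
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_pay - e.monthly_pay >= 300
    ORDER BY e.name
    """
).fetchall()
print(rows)
```

The query retrieves records that satisfy a condition stated in advance; it finds no pattern across the dataset and discovers nothing previously unknown.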
7Why Data Mining?
- Because the data is there.
- Because current DBMS technology does not support
data analysis.
- Because
- larger disks
- faster cpus
- high-powered visualization
- networked information
- are becoming widely available.
8DM Touchstone Applications (CACM 39 (11) Special
Issue)
- Finding patterns across data sets
- Reports on changes in retail sales
- to improve sales
- Patterns of sizes of TV audiences
- for marketing
- Patterns in NBA play
- to alter, and so improve, performance
- Deviations in standard phone calling behavior
- to detect fraud
- for marketing
9DM Touchstone Applications (CACM 39 (11) Special
Issue)
- Separating signal from noise
- Classifying faint astronomical objects
- Finding genes within DNA sequences
- Discovering novel tectonic activity
10What is Text Data Mining?
- People's first thought
- Make it easier to find things on the Web.
- This is information retrieval!
- The metaphor of extracting ore from rock does
make sense for extracting documents of interest
from a huge pile.
- But it does not reflect notions of DM in practice
- finding patterns across large collections
- discovering heretofore unknown information
11Text DM ≠ IR
- Data Mining
- Patterns, Nuggets, Exploratory Analysis
- Information Retrieval
- Finding and ranking documents that match users'
information need
- ad hoc query
- filtering/standing query
- Rarely Patterns, Exploratory Analysis
12Real Text DM
- The point
- Discovering heretofore unknown information is not
what we usually do with text.
- (If it weren't known, it could not have been
written by someone.)
- However
- There is a field whose goal is to learn about
patterns in text for its own sake ...
13Computational Linguistics
- Goal: automated language understanding
- this isn't possible
- instead, go for subgoals, e.g.,
- word sense disambiguation
- phrase recognition
- semantic associations
- Current approach
- statistical analyses of very large text
collections
14WordNet: A Lexical Database
A list of hypernyms for each sense of crow
15Lexicographic Knowledge Acquisition
- Given a large lexical database ...
- WordNet: Miller, Fellbaum et al. at Princeton
- http://www.cogsci.princeton.edu/wn
- and a huge text collection
- How to automatically add new relations?
16Idea: Use Simple Lexico-Syntactic Analysis
- Patterns of the following type work
- NP0 such as NP1, NP2, ..., (and | or) NPi
- i >= 1, implies
- for all NPi, i >= 1, hyponym(NPi, NP0)
- Example
- Agar is a substance prepared from a mixture of
red algae, such as Gelidium, for laboratory or
industrial use.
- implies hyponym(Gelidium, red algae)
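A minimal sketch of the "such as" pattern on this slide, written as a regular expression. It assumes an upstream chunker has already bracketed each noun phrase as `[ ... ]`; that bracket notation and the function name `such_as_relations` are hypothetical conveniences, not part of the original method.

```python
import re

# Assumption: noun phrases arrive pre-bracketed, e.g. "[red algae]".
NP = r"\[([^\]]+)\]"

# NP0 such as NP1, NP2, ..., (and | or) NPi
SUCH_AS = re.compile(
    r"\[(?P<hypernym>[^\]]+)\],? such as "
    r"(?P<hyponyms>\[[^\]]+\](?:, \[[^\]]+\])*(?:,? (?:and|or) \[[^\]]+\])?)"
)

def such_as_relations(chunked_sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("hypernym")
        # Every bracketed NP in the matched list is a candidate hyponym.
        for hyponym in re.findall(NP, match.group("hyponyms")):
            pairs.append((hyponym, hypernym))
    return pairs

agar = (
    "[Agar] is [a substance] prepared from [a mixture] of [red algae], "
    "such as [Gelidium], for [laboratory or industrial use]."
)
print(such_as_relations(agar))  # [('Gelidium', 'red algae')]
```

Without the bracketing, a plain word-level regex would happily read "for laboratory" as another list item, which is why some phrase parsing is needed before the pattern can be trusted.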
17More Examples
- "Felonies, such as shootings and stabbings" implies
- hyponym(shootings, felonies)
- hyponym(stabbings, felonies)
- Is this in the WordNet hierarchy?
18Linking Killing to Felonies
19Another Example
- Einstein is (was) a physicist.
- Is/was he a genius?
20Making Einstein a Genius
21Results from "such as" lexico-syntactic relation
22Results with the "or other" lexico-syntactic
relation
23Procedure
- Discover a pattern that indicates a lexical
relationship
- Scan through a large collection; extract
sentences that match the pattern
- Extract the NPs from the sentence
- requires some phrase parsing
- Check if suggested relation is in WordNet or not
- this part not automated, but could be
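The last step, checking a suggested relation against WordNet, is the one described as not yet automated. The sketch below shows one way it could be automated; it uses a tiny dictionary as a stand-in for WordNet (an assumption), in which only the (crow, bird) link comes from the slides and the rest is illustrative. The names `is_known` and `triage` are hypothetical.

```python
# Toy stand-in for WordNet (assumption): each word maps to its direct hypernym.
# The (crow, bird) link is from the slides; (bird, animal) is illustrative only.
TOY_HYPERNYMS = {"crow": "bird", "bird": "animal"}

def is_known(hyponym, hypernym, hierarchy):
    """Walk up the hypernym chain from `hyponym`, looking for `hypernym`."""
    seen = set()
    current = hyponym
    while current in hierarchy and current not in seen:
        seen.add(current)  # guard against accidental cycles in the toy data
        current = hierarchy[current]
        if current == hypernym:
            return True
    return False

def triage(candidates, hierarchy):
    """Split candidate (hyponym, hypernym) pairs into already-known and new."""
    known, new = [], []
    for hyponym, hypernym in candidates:
        bucket = known if is_known(hyponym, hypernym, hierarchy) else new
        bucket.append((hyponym, hypernym))
    return known, new

candidates = [("crow", "animal"), ("Gelidium", "red algae")]
known, new = triage(candidates, TOY_HYPERNYMS)
print("already in hierarchy:", known)
print("candidate additions:", new)
```

Pairs that land in `new` are only suggestions; deciding whether to add them to the lexical database would still need review.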
24Discovering New Patterns
- Suggested algorithm
- Decide on a lexical relation of interest, e.g.,
hyponymy
- Derive a list of word pairs from WordNet that are
known to hold that relation
- e.g., (crow, bird)
- Extract sentences from text collection in which
both terms occur
- Find commonalities among lexico-syntactic context
- Test these out against other word pairs known to
hold the relationship in WordNet
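The suggested algorithm can be sketched roughly as follows: for each known pair, find sentences containing both terms, replace the terms with placeholders, and count which connecting contexts recur. The sentences and the second word pair are invented for illustration, and matching on exact surface forms (no stemming, no parsing) is a simplifying assumption.

```python
import re
from collections import Counter

def contexts_between(pairs, sentences):
    """Count the text found between the two terms of each known pair."""
    counts = Counter()
    for hyponym, hypernym in pairs:
        for sentence in sentences:
            text = sentence.lower()
            # Try both orders, since a pattern may put either term first.
            for first, second in ((hypernym, hyponym), (hyponym, hypernym)):
                match = re.search(
                    rf"\b{re.escape(first)}\b(.*?)\b{re.escape(second)}\b", text
                )
                if match:
                    left, right = (
                        ("HYPER", "HYPO") if first == hypernym else ("HYPO", "HYPER")
                    )
                    counts[f"{left}{match.group(1)}{right}"] += 1
    return counts

# Hypothetical inputs: pairs assumed known to hold the hyponymy relation.
known_pairs = [("crows", "birds"), ("oaks", "trees")]
sentences = [
    "Birds such as crows gather at dusk.",
    "Trees such as oaks shed leaves.",
    "Crows and other birds gather at dusk.",
]
pattern_counts = contexts_between(known_pairs, sentences)
print(pattern_counts.most_common())
```

Contexts that recur across many different pairs are the candidate patterns; they would then be tested against further known pairs, as the last step above describes.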
25Text Merging Example: Discovering Hypocritical
Congresspersons
26Discovering Hypocritical Congresspersons
- Feb 1, 1996
- US House of Reps votes to pass Telecommunications
Reform Act
- this contains the CDA (Communications Decency
Act)
- violators subject to fines of $250,000 and 5
years in prison
- eventually struck down by court
27Discovering Hypocritical Congresspersons
- Sept 11, 1998
- US House of Reps votes to place the Starr report
online
- the content would (most likely) have violated the
CDA
- 365 people were members for both votes
- 284 members voted aye both times
- 185 (94%) Republicans voted aye both times
- 96 (57%) Democrats voted aye both times
30How to find Hypocritical Congresspersons?
- This must have taken a lot of work
- Hand cutting and pasting
- Lots of picky details
- Some people voted on one but not the other bill
- Some people share the same name
- Check for different county/state
- Still messed up on Bono
- Taking stats at the end on various attributes
- Which state
- Which party
- Tools should help streamline, reuse results
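A rough sketch of how a tool might streamline the merge described above: index each roll call by member, keep only members present in both, and tally the result by party. All names, states, parties, and votes below are invented; keying members by (name, state) is an assumption and would still fail when two members share both.

```python
from collections import Counter

# Hypothetical roll-call records: (name, state, party, vote). All values invented.
first_vote = [
    ("Smith", "CA", "R", "aye"),
    ("Smith", "TX", "D", "aye"),
    ("Jones", "NY", "D", "nay"),
    ("Lee", "OH", "R", "aye"),   # voted only on the first bill
]
second_vote = [
    ("Smith", "CA", "R", "aye"),
    ("Smith", "TX", "D", "nay"),
    ("Jones", "NY", "D", "aye"),
    ("Park", "WA", "D", "aye"),  # voted only on the second bill
]

def index_by_member(records):
    """Key each record by (name, state) so same-named members stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}

def aye_both_times(first, second):
    """Members present in both roll calls who voted aye each time."""
    a, b = index_by_member(first), index_by_member(second)
    in_both = a.keys() & b.keys()  # drops members who voted on only one bill
    return sorted(k for k in in_both if a[k][1] == "aye" and b[k][1] == "aye")

members = aye_both_times(first_vote, second_vote)
parties = index_by_member(first_vote)
by_party = Counter(parties[m][0] for m in members)
print(members, dict(by_party))
```

The same tally could be grouped by state instead of party by counting `m[1]` for each member in `members`.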
31How to find Hypocritical Congresspersons?
- The hard part?
- Knowing to compare these two sets of voting
records.
32How to find causes of disease? Don Swanson's
Medical Work
- Given
- medical titles and abstracts
- a problem (incurable rare disease)
- some medical expertise
- find causal links among titles
- symptoms
- drugs
- results
33Swanson Example (1991)
- Problem: Migraine headaches (M)
- stress associated with M
- stress leads to loss of magnesium
- calcium channel blockers prevent some M
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) implicated in
M
- high levels of magnesium inhibit SCD
- M patients have high platelet aggregability
- magnesium can suppress platelet aggregability
- All extracted from medical journal titles
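The chain of reasoning on this slide can be viewed as a search for indirect links: terms that never appear together with migraine but share intermediate terms with it. The sketch below encodes the bullet points as term pairs and finds such links; the pair list is a hand transcription of the bullets above, not real title data, and the function names are hypothetical.

```python
from collections import defaultdict

# Term pairs read off the bullets above; each pair stands for two terms
# linked in some title. Hand-transcribed for illustration, not real title data.
links = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]

def build_graph(pairs):
    """Undirected graph: term -> set of terms it is directly linked to."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def indirect_links(graph, start):
    """Terms with no direct link to `start`, mapped to the connecting terms."""
    found = defaultdict(set)
    for middle in graph[start]:
        for end in graph[middle]:
            if end != start and end not in graph[start]:
                found[end].add(middle)
    return dict(found)

hypotheses = indirect_links(build_graph(links), "migraine")
for term, via in hypotheses.items():
    print(term, "via", sorted(via))
```

On this input the only indirect term is magnesium, reached through all four intermediates. On real literature the candidate list would be far larger, which is where the medical expertise comes in.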
34Swanson's TDM
- Two of his hypotheses have received some
experimental verification.
- His technique
- Only partially automated
- Required medical expertise
- Few people are working on this.
35How to Automate This?
- Idea: mixed-initiative interaction
- User applies tools to help explore the hypothesis
space
- System runs suites of algorithms to help explore
the space, suggest directions
36Our Proposed Approach
- Three main parts
- UI for building/using strategies
- Backend for interfacing with various databases
and translating different formats
- Content analysis/machine learning for figuring
out good hypotheses/throwing out bad ones
37The UI part
- Need support for building strategies
- Mixed-initiative system
- Trade off between user-initiated hypotheses
exploration and system-initiated suggestions
- Information visualization
- Another way to show lots of choices
38Candidate Associations
Suggested Strategies
Current Retrieval Results
39Lindi: Linking Information for Novel Discovery
and Insight
- Just starting up now (fall 98)
- Initial work: Hao Chen, Ketan Mayer-Patel,
Shankar Raman
40Ore-Filled Text Collections
- Congressional Voting Records
- Answer questions like
- Who are the most hypocritical congresspeople?
- Medical Articles
- Create hypotheses about causes of rare diseases
- Create hypotheses about gene function
- Patent Law
- Answer questions like
- Is government funding of research worthwhile?
41Summary
- Text Data Mining
- Extracting heretofore undiscovered information
from large text collections
- Not the same as information retrieval
- Examples
- Lexicographic knowledge acquisition
- Merging of text representations
- Linking related information
- The truth is out there!