Title: Text Mining
1Text Mining
- Mike Evans, Doug Svendsen, Stephanie Huls, Matt
Lietzke
2Outline
- Text mining mines natural language text
- Preprocessing makes natural text minable
- Many techniques used in text mining
- We will use classification
- SIAM Competition held in Twin Cities
- We will create a program and compare results
3A Little Bit About Text Mining
- The non-trivial extraction of previously
unknown, interesting facts from a collection of
texts.
- Key element
- Linking together of extracted information
- Text Mining vs. Data Mining
- Natural language text vs. structured database of
facts
4Origination
- 1950s
- Attempts to understand and model the information
processing capabilities of the human brain
- Original approach
- Analyzed a natural language text at the level of
individual sentences
- Objective
- Create a semantic representation of a sentence in
the form of structured relations between
important words comprising this sentence
5To solve objective
- Pre-developed linguistic molds were tried with
the sentence and its components
- Match
- Corresponding semantic construction was
associated with the sentence
- Proved to be a good first guidance for
understanding the meaning of a text
6Problems
- Too many different pre-developed molds were
needed to build a set for analyzing different
types of sentences
- List of exceptional constructions in this
approach quickly grows prohibitively large
- Works well only for a limited subset of natural
language texts
7Not Information Extraction
- Information extraction is
- extracting names, addresses, etc.
- Text mining is
- Finding new pieces of data
- Turn information extraction into text mining
- Find relationships between extracted data
- Human analysis
- Ex. Wireless Technology
8Applications
- Biosciences
- Links in different subsets of literature to form
hypotheses
- Ex. Don Swanson
- Genomics
- Ex. Proteins
- Hospital charts
- Improve patient outcomes
- Shorten hospital stays
9Limitations/Problems
- Programs cannot fully interpret text like the
human mind
- Needed information is not in textual format
- Conversations
- Radio shows
- Television
- Noise
- Spelling errors
- Abbreviations
- Acronyms
10Text Characteristics Lead to Problems
- Dimensionality
- Each word/phrase is considered a dimension
- Dependency
- Relevant information in form of complex
conjunction of words/phrases
- Informality
- Emails: "R u available"
11Ambiguity
- Word ambiguity
- Pronouns
- He/she
- Synonyms
- Buy/purchase
- Multiple meanings
- Bat mammal/baseball bat
- Semantic ambiguity
- The chicken is ready to eat
- Police squad help dog bite victim
- We saw the Eiffel Tower flying from London to
Paris
- The police were ordered to stop drinking after
midnight
12Historical Text Mining
- Most text mining tools focus on present-day
English
- Language of text depends upon
- Where and when it was created
- Broken text flows
13Goals
- Improved document classification
- Automatic semantic annotation of documents
- Improved search by semantics and concepts
- Improved clustering of documents by concept
- Summarization
14Problems with Text Mining
- What it is not
- Data retrieval
- Computational linguistics
- Computers do not understand natural speech and
text
- Most writing consists of
- Non-technical words
- Slang
- Abbreviations
15Problems with Text Mining
- Lots of trivial words in text
- Common words unique to subject
- Not helpful
- Verbs, Nouns not in simple forms
- Hard to put value on words
- Not all problems need to be fixed
16Preprocessing
- Get rid of useless words (conjunctions, articles,
prepositions)
- Turn all words into basic form (just take stems,
not prefix or suffix)
- Makes it language specific
- Remove words that are common among all records
17Preprocessing
- Filtering
- Remove predefined words
- Dictionary of bad words
- Remove extremely common words
- Lemmatization
- Change verbs to infinitive
- Nouns to singular
- Difficult and time consuming
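A minimal sketch of the filtering step, assuming whitespace-tokenized text. The stop-word list, the `max_doc_fraction` threshold and the sample reports are illustrative assumptions, not the lists used for the competition data.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not the real dictionary).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "was"}

def filter_tokens(docs, stop_words=STOP_WORDS, max_doc_fraction=0.9):
    """Drop predefined words and words found in almost every document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    too_common = {w for w, n in doc_freq.items()
                  if n / len(docs) > max_doc_fraction}
    return [[w for w in tokens if w not in stop_words and w not in too_common]
            for tokens in tokenized]

docs = ["the aircraft brake fail on the ramp",
        "a loud noise on the aircraft",
        "the aircraft return to the airport"]
filtered = filter_tokens(docs)
```

Here "aircraft" is removed because it occurs in every report, which matches the idea of removing words that are common among all records.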
18Preprocessing
- Stemming
- Simpler version of Lemmatization
- Tries to make basic words
- Ex. Takes "s" and "ing" off words
- Index Term Selection
- Selects words based on entropy
- Frequent words have low entropy
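A rough sketch of both ideas on this slide. `naive_stem` only strips "ing" and "s", as in the example above, and is far cruder than a real stemmer. `entropy_weight` is one common entropy-based term weight, assumed here for illustration: it is 1 for a word concentrated in a single document and 0 for a word spread evenly over all documents.

```python
import math

def naive_stem(word):
    """Strip a trailing 'ing' or 's' if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def entropy_weight(term, tokenized_docs):
    """1 + (1/log2 |D|) * sum_d P(d,t) * log2 P(d,t), where P(d,t) is the
    share of the term's occurrences that fall in document d."""
    counts = [tokens.count(term) for tokens in tokenized_docs]
    total = sum(counts)
    if total == 0 or len(tokenized_docs) < 2:
        return 0.0
    s = sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1.0 + s / math.log2(len(tokenized_docs))

reports = [["aircraft", "brake"], ["aircraft", "noise"],
           ["aircraft", "deer"], ["aircraft", "fog"]]
common = entropy_weight("aircraft", reports)  # word in every report
rare = entropy_weight("deer", reports)        # word in one report
```

Words with a weight near 0 are candidates for removal from the index.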
19Preprocessing
- Other advanced methods commonly used
- Part-of-speech tagging
- Tags words as noun, verb, etc
- Text chunking
- Evaluates chunks of sentences
- Parsing
- Accounts for nearby words in sentences by parsing
into a tree
20Competition
- SIAM Text Mining Competition
- SIAM: Society for Industrial and Applied
Mathematics
- Conference in Twin Cities April 28th
- Competition already done
- SIAM Provides
- Preprocessed Text Dataset
- Program evaluator
- Winners' results
21Competition Goal
- Dataset
- Aviation Safety Reports
- No labels given
- Problem
- Document Classification
- Determine problem(s) in document
- What kind of problem
- Report confidence/precision
22Project Goal
- Use provided dataset
- Possibly try more preprocessing
- Try competition
- Use classification algorithms
- Need to identify problems
- Need to classify documents with problem(s)
- Compare results
23Dataset
- Aviation Safety Reports
- Already preprocessed
- 21,519 reports with 1 report per record
- All reports in one file
- Standard text mining format
- Average document is over a paragraph long
24Dataset Example
- 1AFTER takeoff ON runway _ A loudnoise WAS hear
come FROM FRONT AREA OF aircraft.FOR A WHILE I
AND CREW THOUGHT IT WAS THE AIR drive generate
THAT deploy FROM right NOSE OF aircraft.UPON
FURTHER troubleshoot FOUND THAT THE AIR drive
generate COULD NOT HAVE deploy DUE TO ABSENCE OF
icon AND message ON THE engineindicationandcrewale
rtingsystem system.WE immediate return TO THE
airport FOR AN UNEVENTFUL land.FURTHER examine AT
THE GATE show THE OXYGEN accesspanel pop OPEN
AFTER takeoff cause THE NOISE.PRIOR TO flight THE
normalpreflight show NO AJAR OR OPEN panel ON THE
aircraft.moderateturbulence WAS encounter AFTER
takeoffdue TO STRONG crosswind AND
lowlevelwindshearadvisories IN EFFECT.
- 2taxi OFF THE parkingramp THE brake system fail
TO STOP THE aircraft.LATER determine TO BE A BAD
TRUNION SWITCH IN THE right maingear NEITHER
pilot HAD ani control OVER THE aircraft speed AND
DUE TO frequencycongestion WE COULD NOT ALERT
ground OF OUR problem.BECAUSE OF THIS WE WERE
UNABLE TO HOLD SHORT OF THE control PORTION OF
THE airport.THE INCURSION ON THE taxiway DID NOT
PUT US IN DANGER OF collide WE DID BLOCK AN
intersect AFTER WE coast TO A STOP.ground WAS
immediate notify AND COMPANY WAS call TO GET A
TUG AND BRING US BACK TO THE RAMP.I HAVE NEVER
see train TO DEAL WITH brakefailure ON THE ground
IN AN aircraft BUT I SURE WOULD LIKE TO.
25Problems with Dataset
- Words run together - generalaviation
- Possibly introduced by SIAM for competition
- Noise
- Label suggestions
- Too common for typos
- Abbreviations
- Missing spaces
- Overall, text mining is flexible
26Preprocessing Our Dataset
- PLADS SIAM
- Performed stemming and acronym expansion
- Removed non-informative terms
- Place names, etc
- With our goal, place names are not necessary
- Additional preprocessing
- Fix spaces by periods
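One possible way to fix the missing spaces after periods, sketched with a regular expression. The function name is hypothetical. It assumes a period followed directly by a letter is a sentence boundary, so it would also split abbreviations such as "e.g".

```python
import re

def fix_period_spacing(text):
    """Insert a space after any period that is directly followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)

fixed = fix_period_spacing("HIT THE DEER.THE GEAR retract normal.I decide")
```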
27Pre & Post Preprocessing
- Before preprocessing
- After takeoff on runway Zeta a loud noise was
heard coming from the front area of hangar Alpha.
- After preprocessing
- 1AFTER takeoff ON runway _ A loudnoise WAS hear
come FROM FRONT AREA OF hangar _.
28Classification
- Classification is used to generate class labels
- For text mining, it is used to classify documents
- For our dataset, we could classify the type of
problem that submission was about
29Our Data Set and Classification
- Some of the classes we could use
- Service problems, Time delay problems
- Part Problems
- Personnel Problems
- Etc.
- Part of working with the dataset is to
determine all of the class labels
30Using ARM to generate Keywords
- Using Apriori, we can generate our keywords from
our dataset
- Modify the algorithm to work with highly preprocessed
data.
- Filter our data for frequent items (keywords)
- (augmented with exclusion list)
- Generate frequent itemsets and rules.
- Use the generated rules to draw out related keywords
and their frequencies
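A compact sketch of the Apriori idea, treating each report as a set of keywords. This is a textbook-style version, not our modified algorithm. The sample reports and the `min_support` value are made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(items): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that
        # contain an infrequent (k-1)-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent (both must be frequent)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

reports = [{"noise", "engine", "takeoff"}, {"noise", "engine"},
           {"runway", "landinggear"}, {"runway", "landinggear", "noise"},
           {"engine", "takeoff"}]
frequent = apriori(reports, min_support=2)
rule_conf = confidence(frequent, {"runway"}, {"landinggear"})
```

A rule with high confidence, such as runway implying landinggear in this toy data, suggests a pair of related keywords.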
31ARM Applied to our data set
- Relationships
- Noise implies Engine
- Runway implies Landing Gear
32How to use Classification
- Steps to proper classification use
- Define Keywords List
- Use Information Gain Equation
- This equation determines how effective a word is
based on frequency in known documents
- Use Match Files Technique
- This technique takes a list of words that the user
has supplied, or that is based on desired search terms or
thesaurus and dictionary entries
- Compare list on data to redefine keywords
33Information Gain Equation Explained
- Here p(L_c) is the fraction of training documents
with classes L_1 and L_2, p(t_j=1) and p(t_j=0) are
the fractions of documents with / without term t_j,
and p(L_c|t_j=m) is the conditional probability of
classes L_1 and L_2 if term t_j is contained in the
document or is missing. It measures how useful t_j
is for predicting L_1 from an information-theoretic
point of view. We may determine IG(t_j) for all
terms and remove those with very low information
gain from the dictionary.
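A small sketch of how information gain could be computed, assuming the standard form IG(t_j) = H(L) - sum over m in {0, 1} of p(t_j=m) * H(L | t_j=m), where H is the entropy of the class distribution. Whether this is exactly the equation shown on the slide is an assumption. Documents are modeled as sets of terms, and the labels are invented.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """docs: list of term sets; labels: class label of each document."""
    n = len(docs)
    classes = sorted(set(labels))
    gain = entropy([labels.count(c) / n for c in classes])
    for present in (True, False):
        # Labels of the documents that contain (or lack) the term.
        subset = [lab for d, lab in zip(docs, labels) if (term in d) == present]
        if subset:
            cond = [subset.count(c) / len(subset) for c in classes]
            gain -= (len(subset) / n) * entropy(cond)
    return gain

docs = [{"engine", "noise"}, {"engine", "fail"},
        {"delay", "gate"}, {"delay", "noise"}]
labels = ["part", "part", "service", "service"]
ig_engine = information_gain(docs, labels, "engine")
ig_noise = information_gain(docs, labels, "noise")
```

In this toy data "engine" separates the two classes perfectly, while "noise" appears equally in both and would be dropped from the dictionary.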
34Keyword set for our Dataset
- We can create different keywords for different
types of problems
- E.g. Part Problem keywords
- Emergency Landing
- Landing Gear
- Noises
- Engine
- Wings
- Pressure Failure
- Service problem, Time delay problems
- Delay
- Time
- Emergency Landing
- Personnel Problems keywords
- Security Personnel
- Illegal
- Fight
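A sketch of how such keyword lists might be matched against a report. The class names and keyword sets are adapted from the lists above. Plain substring matching is a simplifying assumption, so "time" would also match inside "sometimes".

```python
KEYWORD_SETS = {
    "part problem": {"emergency landing", "landing gear", "noise",
                     "engine", "wings", "pressure failure"},
    "service/time delay problem": {"delay", "time", "emergency landing"},
    "personnel problem": {"security personnel", "illegal", "fight"},
}

def classify_by_keywords(text, keyword_sets=KEYWORD_SETS):
    """Return every class whose keyword list has the most hits in the text."""
    text = text.lower()
    scores = {label: sum(1 for kw in kws if kw in text)
              for label, kws in keyword_sets.items()}
    best = max(scores.values())
    # A report can tie between classes (keyword cross listing).
    return sorted(label for label, s in scores.items() if s == best and s > 0)

single = classify_by_keywords("Loud noise from the engine after takeoff")
tied = classify_by_keywords("Emergency landing")
```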
35How to use Classification
- Use appropriate algorithm
- Nearest Neighbor Classifier
- Take unknown document and plot against known
documents
- Compute the distance to the nearest neighbors, based
on k number of neighbors
- Based on how many of each class there are among them,
give new document a class based on neighbors
- Decision trees
- You create master collection of words in a
document.
- Next, you create the tree based on presence of
keywords (e.g. This document does not have the
word sport in it, nor football, basketball, etc.)
- Based on that decision, you continue down the
tree, making decisions based on the word of the
node
36Application of Algorithm applied to our Data Set
- An FP tree can be used to split on keywords list
- Example: Personnel Problem class label
- Keywords: Fight, Cabin Crew Assistance, etc.
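A toy illustration of walking a tree that splits on keyword presence. The tree below is hand-built for the example and is an assumption; a real tree would be learned from labeled reports. Each node is (keyword, branch if present, branch if absent), and leaves are class labels.

```python
# Hypothetical hand-built tree; not learned from the dataset.
TREE = ("fight", "personnel problem",
        ("engine", "part problem",
         ("delay", "service problem", "unknown")))

def classify_with_tree(tree, words):
    """Follow the present/absent branch at each node until a leaf is reached."""
    while not isinstance(tree, str):
        keyword, if_present, if_absent = tree
        tree = if_present if keyword in words else if_absent
    return tree

label = classify_with_tree(TREE, {"passenger", "fight", "cabin"})
```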
37Application of Algorithm applied to our Data Set
- A Nearest Neighbor classifier can be used to plot messages
against each other with a k variable
- Example
- Based on keywords, we compute the distance from our record
to the already classified records; the majority vote among
its nearest neighbors is 3 votes for equipment problem
- (picture from Imad Rahal Slides)
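A minimal nearest-neighbor sketch, assuming each report has already been turned into a vector of keyword counts. The three dimensions (engine, delay, fight), the training vectors and their labels are invented. `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (vector, label); majority vote among the k nearest."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0]  # (winning label, number of votes)

# Keyword-count vectors over (engine, delay, fight).
training = [((2, 0, 0), "equipment"), ((1, 0, 0), "equipment"),
            ((3, 1, 0), "equipment"), ((0, 2, 0), "service"),
            ((0, 3, 0), "service"), ((0, 0, 2), "personnel")]
label, votes = knn_classify((2, 1, 0), training, k=3)
```

With k = 3 the three closest records are all equipment problems, so the new record gets that label.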
38Issues with classification
- Problems with classification
- Overfitting
- Underfitting
- Keyword Cross listing
39Benefits of Classification
- The information that can be gleaned from
Classification is directly applicable to airlines
- The Classes established can assist with
situation deployment, or continuing customer care
40Information Extraction
- Extract meaningful information from text
- Identify and classify elements
- Sam went bowling with Tommy in St. Cloud at
9:00pm
- People: Sam and Tommy
- Place: St. Cloud
- Time: 9:00pm
- Basis for many text mining technologies
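A deliberately simple sketch of this kind of extraction, assuming small lookup lists of known names (a gazetteer) and a regular expression for clock times. Both lists are hypothetical. Real systems use far richer models.

```python
import re

KNOWN_PEOPLE = ("Sam", "Tommy")   # hypothetical lookup list
KNOWN_PLACES = ("St. Cloud",)     # hypothetical lookup list
TIME_PATTERN = re.compile(r"\b\d{1,2}:?\d{2}\s?(?:am|pm)\b", re.IGNORECASE)

def extract_entities(sentence):
    """Tag known people and places, and anything that looks like a time."""
    def found(names):
        return [n for n in names
                if re.search(r"\b" + re.escape(n) + r"\b", sentence)]
    return {"people": found(KNOWN_PEOPLE),
            "place": found(KNOWN_PLACES),
            "time": TIME_PATTERN.findall(sentence)}

info = extract_entities("Sam went bowling with Tommy in St. Cloud at 9:00pm")
```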
41Application of Information Extraction
- JUST PRIOR TO rotate A DEER RAN ONTO THE runway.I
rotate AND hear A SOUND AND feel AS IF WE MIGHT
HAVE HIT THE DEER.THE GEAR retract normal.I
decide TO CONTINUE TO sfo airport figure THAT IF
WE HAD BLOWN A TIRE OR sustain DAMAGE TO THE GEAR
ETC THAT IT WOULD BE BETTER TO LAND AT sfo
airport.
- Place: sfo, runway
- Thing: deer, sound, damage
- Plane element: gear, tire
- Actions: ran, rotate, hit, blown, land
42More In-depth Extraction
- Base weight of word on
- Number of documents it appears in
- Number of times it appears in a document
- High weight
- Appears many times in a document
- Does not appear in many documents
- Low weight to words appearing in many documents.
- Potentially identify important topics in aviation
dataset.
- Landing gear
- Flaps
- Fog
- Deer
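The weighting described above matches the usual TF-IDF scheme. A minimal sketch follows, assuming tokenized reports and the plain tf * log(N / df) variant; other variants exist. The sample reports are invented.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n = len(tokenized_docs)
    # df: number of documents each term appears in.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return [{term: tf * math.log(n / df[term])
             for term, tf in Counter(doc).items()}
            for doc in tokenized_docs]

reports = [["deer", "runway", "deer"], ["runway", "fog"], ["runway", "flaps"]]
weights = tf_idf(reports)
```

"runway" occurs in every report and gets weight 0, while "deer" occurs twice in a single report and gets the highest weight.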
43Summarization
- Goal: Reduce size and detail of document while
retaining main points.
- Software lacks human ability to understand
concepts and explain them.
- Solution
- Sentence extraction based on
- Weight
- Key phrases
- Headings
- Problem: Must still be evaluated by a human.
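A small sketch of sentence extraction by weight only, assuming a sentence's weight is the sum of the document-wide frequencies of its non-stop words. Key phrases and headings are not used here, and the stop-word list is illustrative. Longer sentences are favored by this scoring.

```python
import re
from collections import Counter

STOP = frozenset({"a", "the", "we", "on", "was", "onto", "during"})

def summarize(text, n_sentences=1, stop_words=STOP):
    """Keep the n highest-weighted sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def words(sentence):
        return [w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in stop_words]

    freq = Counter(w for s in sentences for w in words(s))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in words(sentences[i])))
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))

summary = summarize("A deer ran onto the runway. "
                    "We hit the deer on the runway during takeoff. "
                    "The weather was clear.")
```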
44Categorization
- Treat input as a bag of words
- Count words as they appear.
- Counts are used to identify main topics.
- Use a thesaurus to identify relationships.
- Rank documents based upon frequency of words
pertaining to a topic.
- Leads to organization based upon problem area,
place, malfunction, people, etc.
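A bag-of-words sketch of ranking documents for one topic. The topic word set stands in for thesaurus entries and is an assumption, as are the sample reports.

```python
from collections import Counter

def rank_by_topic(docs, topic_words):
    """Return indices of matching documents, most topic-word hits first."""
    scores = []
    for idx, doc in enumerate(docs):
        bag = Counter(doc.lower().split())  # word order is ignored
        scores.append((sum(bag[w] for w in topic_words), idx))
    return [idx for score, idx in sorted(scores, key=lambda p: -p[0])
            if score > 0]

docs = ["brake fail brake system", "deer on runway", "brake problem on ramp"]
ranking = rank_by_topic(docs, {"brake", "brakes", "gear"})
```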
45Concept Linkage
- Link related documents
- Find links between topics
- Useful in biomedicine
- Find links between symptoms, diseases and
treatments
- Useful in aviation safety reports
- Find links between symptoms and problems.
46Clustering
- Goal
- 1. Documents in a cluster are more similar to one
another.
- 2. Documents in separate clusters are less
similar.
- Creates vectors for documents based upon how they
fit into different categories
- Weights are given for how well documents fit into
a cluster.
- Similar documents can then be found based upon
their proximities.
- www.clusty.com
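A sketch of the proximity measure such clustering can build on, assuming documents are represented as word-count vectors and compared by cosine similarity. This is only the similarity step, not a full clustering algorithm. The sample reports are invented.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(query, docs):
    """Index of the document vector closest to the query vector."""
    return max(range(len(docs)), key=lambda i: cosine_similarity(query, docs[i]))

vectors = [Counter("landing gear stuck".split()),
           Counter("fog delay".split())]
query = Counter("landing gear fail".split())
closest = most_similar(query, vectors)
```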
47Conclusion
- Text mining mines natural language text
- It requires preprocessing to eliminate worthless
words
- Our project uses classification on Aviation Safety
Reports
- Identify problems and compare to SIAM competition
results
48Resources
- Fan, Weiguo, Linda Wallace, Stephanie Rich, and
Zhongju Zhang. "Tapping the Power of Text
Mining." Communications of the ACM 49.9 (2006):
76-82.
- Arora, Ritu, and Purushotham Bangalore. "Text
Mining: Classification & Clustering of Articles
Related to Sports." ACM-SE 43: Proceedings of the
43rd Annual Southeast Regional Conference.
Kennesaw, Georgia.
- Hotho, Andreas, Andreas Nürnberger, and Gerhard
Paaß. "A Brief Survey of Text Mining." LDV Forum
- GLDV Journal for Computational Linguistics and
Language Technology 20.1 (2005): 19-62.
- Lau, Raymond Y.K. "Context-Sensitive Text Mining
and Belief Revision for Intelligent Information
Retrieval on the Web." Centre for Information
Technology Innovation, Faculty of Information
Technology, Queensland University of Technology,
GPO Box 2434, Brisbane, Qld 4001, Australia.
- Larsen, Bjornar, and Chinatsu Aone. "Fast and
Effective Text Mining Using Linear-Time Document
Clustering." KDD '99: Proceedings of the Fifth
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. San Diego, California,
United States.
- Tan, P.N., M. Steinbach, and V. Kumar.
Introduction to Data Mining. ISBN 0-321-32136-7.