Text Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Text Mining

Provided by: mikee94

Transcript and Presenter's Notes

1
Text Mining

Mike Evans, Doug Svendsen, Stephanie Huls, Matt
Lietzke

2
Outline

Text mining mines natural language text
Preprocessing makes natural text minable
Many techniques used in text mining
We will use classification
SIAM Competition held in Twin Cities
We will create a program and compare results

3
A Little Bit About Text Mining

The non-trivial extraction of previously
unknown, interesting facts from a collection of
texts.
Key element
Linking together of extracted information
Text Mining vs. Data Mining
Natural language text vs. structured database of
facts

4
Origination

1950s
Attempts to understand and model the information
processing capabilities of the human brain
Original approach
analyzed a natural language text at the level of
individual sentences
Objective
create a semantic representation of a sentence in
the form of structured relations between
important words comprising this sentence

5
To solve objective

Pre-developed linguistic molds were tried with
the sentence and its components
Match
corresponding semantic construction was
associated with the sentence
Proved to be a good first guidance for
understanding the meaning of a text

6
Problems

Too many different pre-developed molds were
needed to build a set for analyzing different
types of sentences
List of exceptional constructions in this
approach quickly grows prohibitively large
Works well only for a limited subset of natural
language texts

7
Not Information Extraction

Information extraction is
extracting names, addresses, etc.
Text mining is
Finding new pieces of data
Turn information extraction into text mining
Find relationships between extracted data
Human analysis
Ex. Wireless Technology

8
Applications

Biosciences
Links in different subsets of literature to form
hypothesis
Ex. Don Swanson
Genomics
Ex. Proteins
Hospital charts
Improve patient outcomes
Shorten hospital stays

9
Limitations/Problems

Programs cannot fully interpret text like the
human mind
Needed information is not in textual format
Conversations
Radio shows
Television
Noise
Spelling errors
Abbreviations
Acronyms

10
Text Characteristics Lead to Problems

Dimensionality
Each word/phrase is considered a dimension
Dependency
Relevant information in form of complex
conjunction of words/phrases
Informality
Emails: "R u available?"

11
Ambiguity

Word ambiguity
Pronouns
He/she
Synonyms
Buy/purchase
Multiple meanings
Bat mammal/baseball bat
Semantic ambiguity
The chicken is ready to eat
Police squad help dog bite victim
We saw the Eiffel Tower flying from London to
Paris
The police were ordered to stop drinking after
midnight

12
Historical Text Mining

Most text mining tools focus on present-day
English
Language of text depends upon
Where and when it was created
Broken text flows

13
Goals

Improved document classification
Automatic semantic annotation of documents
Improved search by semantics and concepts
Improved clustering of documents by concept
Summarization

14
Problems with Text Mining

What it is not
Data retrieval
Computational linguistics
Computers do not understand natural speech and
text
Most writing consists of
Non-technical words
Slang
Abbreviations

15
Problems with Text Mining

Lots of trivial words in text
Common words unique to subject
Not helpful
Verbs, Nouns not in simple forms
Hard to put value on words
Not all problems need to be fixed

16
Preprocessing

Get rid of useless words (conjunctions, articles,
prepositions)
Turn all words into basic form (just take stems,
not prefix or suffix)
Makes it language specific
Remove words that are common among all records
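
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing that was actually applied to the dataset: the stop list, the suffix rules in `naive_stem`, and the two sample records are all assumptions made up for the example.

```python
import re

# Hypothetical stop list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "on", "in", "to", "from", "was"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(records):
    """Tokenize, drop stop words, stem, then drop terms present in every record."""
    tokenized = []
    for text in records:
        words = re.findall(r"[a-z]+", text.lower())
        tokenized.append([naive_stem(w) for w in words if w not in STOP_WORDS])
    if len(tokenized) > 1:
        common = set.intersection(*(set(t) for t in tokenized))
    else:
        common = set()
    return [[w for w in t if w not in common] for t in tokenized]

records = [
    "The aircraft was landing on the runway",
    "The aircraft brakes failed",
]
cleaned = preprocess(records)
print(cleaned)
```

Because "aircraft" occurs in both sample records it is dropped as uninformative, which mirrors the last bullet; the suffix rules are English-specific, which mirrors the "language specific" caveat.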

17
Preprocessing

Filtering
Remove predefined words
Dictionary of bad words
Remove extremely common words
Lemmatization
Change verbs to infinitive
Nouns to singular
Difficult and time consuming

18
Preprocessing

Stemming
Simpler version of Lemmatization
Tries to make basic words
Ex. Takes "s" and "ing" off words
Index Term Selection
Selects words based on entropy
Frequent words have low entropy
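
One way to score terms for index term selection is sketched below. The slides do not give the formula, so this assumes one common formulation: a weight of 1 plus the normalized sum of P(d,t) log2 P(d,t) over documents, where P(d,t) is the share of term t's occurrences that fall in document d. Under that assumption, a term spread evenly over all documents scores near 0 and a term concentrated in one document scores 1. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def term_weight(term, docs):
    """Entropy-based weight in [0, 1]; docs is a list of token lists."""
    counts = [Counter(d)[term] for d in docs]
    total = sum(counts)
    if total == 0 or len(docs) < 2:
        return 0.0
    # Sum of P(d,t) * log2 P(d,t) over documents that contain the term.
    s = sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1.0 + s / math.log2(len(docs))

docs = [
    ["aircraft", "gear"],
    ["aircraft", "fog"],
    ["aircraft", "deer"],
    ["aircraft", "gear"],
]
for term in ("aircraft", "gear", "fog"):
    print(term, term_weight(term, docs))
```

Terms with the lowest scores would be the first candidates to drop from the index.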

19
Preprocessing

Other advanced methods commonly used
Part-of-speech tagging
Tags words as noun, verb, etc
Text chunking
Evaluates chunks of sentences
Parsing
Accounts for nearby words in sentences by parsing
into a tree

20
Competition

SIAM Text Mining Competition
SIAM: Society for Industrial and Applied
Mathematics
Conference in Twin Cities April 28th
Competition already done
SIAM Provides
Preprocessed Text Dataset
Program evaluator
Winners' results

21
Competition Goal

Dataset
Aviation Safety Reports
No labels given
Problem
Document Classification
Determine problem(s) in document
What kind of problem
Report confidence/precision

22
Project Goal

Use provided dataset
Possibly try more preprocessing
Try competition
Use classification algorithms
Need to identify problems
Need to classify documents with problem(s)
Compare results

23
Dataset

Aviation Safety Reports
Already preprocessed
21,519 reports with 1 report per record
All reports in one file
Standard text mining format
Average document is over a paragraph long

24
Dataset Example

1AFTER takeoff ON runway _ A loudnoise WAS hear
come FROM FRONT AREA OF aircraft.FOR A WHILE I
AND CREW THOUGHT IT WAS THE AIR drive generate
THAT deploy FROM right NOSE OF aircraft.UPON
FURTHER troubleshoot FOUND THAT THE AIR drive
generate COULD NOT HAVE deploy DUE TO ABSENCE OF
icon AND message ON THE
engineindicationandcrewalertingsystem system.WE immediate return TO THE
airport FOR AN UNEVENTFUL land.FURTHER examine AT
THE GATE show THE OXYGEN accesspanel pop OPEN
AFTER takeoff cause THE NOISE.PRIOR TO flight THE
normalpreflight show NO AJAR OR OPEN panel ON THE
aircraft.moderateturbulence WAS encounter AFTER
takeoffdue TO STRONG crosswind AND
lowlevelwindshearadvisories IN EFFECT.
2taxi OFF THE parkingramp THE brake system fail
TO STOP THE aircraft.LATER determine TO BE A BAD
TRUNION SWITCH IN THE right maingear NEITHER
pilot HAD ani control OVER THE aircraft speed AND
DUE TO frequencycongestion WE COULD NOT ALERT
ground OF OUR problem.BECAUSE OF THIS WE WERE
UNABLE TO HOLD SHORT OF THE control PORTION OF
THE airport.THE INCURSION ON THE taxiway DID NOT
PUT US IN DANGER OF collide WE DID BLOCK AN
intersect AFTER WE coast TO A STOP.ground WAS
immediate notify AND COMPANY WAS call TO GET A
TUG AND BRING US BACK TO THE RAMP.I HAVE NEVER
see train TO DEAL WITH brakefailure ON THE ground
IN AN aircraft BUT I SURE WOULD LIKE TO.

25
Problems with Dataset

Words run together - generalaviation
Possibly introduced by SIAM for competition
Noise
Label suggestions
Too common for typos
Abbreviations
Missing spaces
Overall, text mining is flexible

26
Preprocessing Our Dataset

PLADS SIAM
Performed stemming and acronym expansion
Removed non-informative terms
Place names, etc
With our goal, place names are not necessary
Additional preprocessing
Fix missing spaces after periods
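
The spacing fix can be sketched with a single regular expression. This is an assumption about how such a fix might look, not the project's actual code; `fix_spacing` is a hypothetical helper, and the sample string is taken from the dataset example shown earlier.

```python
import re

def fix_spacing(record):
    """Insert a space after any period that is directly followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", record)

sample = "FROM FRONT AREA OF aircraft.FOR A WHILE I AND CREW THOUGHT"
fixed = fix_spacing(sample)
print(fixed)
```

Periods at the end of a record, or already followed by a space, are left alone because the lookahead only matches a letter.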

27
Pre- and Post-Preprocessing

Before preprocessing
After takeoff on runway Zeta a loud noise was
heard coming from the front area of hangar Alpha.
After preprocessing
1AFTER takeoff ON runway _ A loudnoise WAS hear
come FROM FRONT AREA OF hangar _.

28
Classification

Classification is used to generate class labels
For text mining, it is used to classify documents
For our dataset, we could classify the type of
problem that submission was about

29
Our Data Set and Classification

Some of the classification we could use
Service problems, Time delay problems
Part Problems
Personnel Problems
Etc.
Part of the dataset we are working with is to
determine all of the class labels

30
Using ARM to generate Keywords

Using Apriori, we can generate our keywords from
our dataset
Modify the algorithm to work with highly
preprocessed data.
Filter our data for frequent items (keywords)
(augmented with exclusion list)
Generate frequent itemsets and rules.
Use the generated rules to draw out related
keywords and their frequency
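
A compact sketch of Apriori over keyword sets follows. It is a textbook-style version, not the modified algorithm the slide refers to; the transactions, the support threshold of 2, and the confidence threshold of 0.8 are made-up values for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(items): support_count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples above the threshold."""
    out = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]
                if conf >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

# Hypothetical keyword sets, one per report.
reports = [
    {"noise", "engine", "takeoff"},
    {"noise", "engine"},
    {"runway", "gear"},
    {"runway", "gear", "noise"},
    {"engine", "takeoff"},
]
frequent = apriori(reports, min_support=2)
found = rules(frequent, min_confidence=0.8)
print(found)
```

On this toy input the rule "runway implies gear" holds with confidence 1.0, the same shape of relationship the project reports from the real data.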

31
ARM Applied to our data set

Relationships
Noise implies Engine
Runway implies Landing Gear

32
How to use Classification

Steps to proper classification use
Define Keywords List
Use Information Gain Equation
This equation determines how effective a word is
based on frequency in known documents
Use Match Files Technique
This technique takes a list of words that the user
has supplied, or that is based on desired search
terms or thesaurus and dictionary entries
Compare the list against the data to refine keywords

33
Information Gain Equation Explained

\[
IG(t_j) = \sum_{c=1}^{2} p(L_c) \log_2 \frac{1}{p(L_c)}
 - \sum_{m=0}^{1} p(t_j = m) \sum_{c=1}^{2} p(L_c \mid t_j = m) \log_2 \frac{1}{p(L_c \mid t_j = m)}
\]

Here p(L_c) is the fraction of training documents
with class L_c (c = 1, 2), p(t_j = 1) and p(t_j = 0)
are the fractions of documents with / without term
t_j, and p(L_c | t_j = m) is the conditional
probability of class L_c if term t_j is contained in
the document (m = 1) or is missing (m = 0). It
measures how useful t_j is for predicting L_1 from an
information-theoretic point of view. We may determine
IG(t_j) for all terms and remove those with very low
information gain from the dictionary
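
As a hedged illustration, information gain for a term can be computed as the entropy of the class labels minus the expected entropy after splitting the documents on whether the term is present. The helper names, documents, and labels below are invented for the example.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def information_gain(docs, labels, term):
    """docs: list of token sets; labels: parallel list of class labels."""
    n = len(docs)
    classes = sorted(set(labels))
    prior = entropy([labels.count(c) / n for c in classes])
    conditional = 0.0
    for present in (True, False):
        subset = [lab for d, lab in zip(docs, labels) if (term in d) == present]
        if not subset:
            continue
        p_m = len(subset) / n  # fraction of documents with / without the term
        conditional += p_m * entropy([subset.count(c) / len(subset) for c in classes])
    return prior - conditional

docs = [{"engine", "noise"}, {"engine", "fail"}, {"delay", "gate"}, {"delay", "time"}]
labels = ["part", "part", "service", "service"]
for term in ("engine", "noise", "gate"):
    print(term, round(information_gain(docs, labels, term), 3))
```

Here "engine" separates the two classes perfectly and gets the maximum gain of 1 bit, while "noise" appears in only one document and scores much lower.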

34
Keyword set for our Dataset

We can create different keywords for different
types of problems
E.g. Part Problem keywords
Emergency Landing
Landing Gear
Noises
Engine
Wings
Pressure Failure
Service problem, Time delay problems
Delay
Time
Emergency Landing
Personnel Problems keywords
Security Personnel
Illegal
Fight
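
A keyword set like this can drive a very simple labeler that assigns every class whose keywords appear in a report. The sketch below simplifies the multi-word keywords on the slide to single words, and both the `KEYWORDS` table and the sample sentence are assumptions for illustration.

```python
# Hypothetical single-word keyword lists, loosely based on the slide.
KEYWORDS = {
    "part problem": {"gear", "noise", "engine", "wing", "pressure"},
    "service problem": {"delay", "time"},
    "personnel problem": {"security", "illegal", "fight"},
}

def label_report(text, keywords=KEYWORDS):
    """Return every class with at least one keyword hit, best match first."""
    words = set(text.lower().replace(".", " ").split())
    hits = {label: len(words & kws) for label, kws in keywords.items()}
    return sorted((label for label, n in hits.items() if n > 0),
                  key=lambda label: -hits[label])

print(label_report("THE engine made NOISE AFTER A delay"))
```

A report can receive several labels, which matches the competition goal of finding the problem(s) in each document.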

35
How to use Classification

Use appropriate algorithm
Nearest Neighbor Classifier
Take the unknown document and plot it against known
documents
Compute the distance to each known document and
keep the k nearest neighbors
Count how many neighbors belong to each class and
give the new document the majority class
Decision trees
You create master collection of words in a
document.
Next, you create the tree based on presence of
keywords (e.g. This document does not have the
word sport in it, nor football, basketball, etc.)
Based on that decision, you continue down the
tree, making decisions based on the word of the
node

36
Application of Algorithm applied to our Data Set

An FP tree can be used to split on the keyword list
Example: Personnel Problem class label
Keywords: Fight, Cabin Crew Assistance, etc.

37
Application of Algorithm applied to our Data Set

A nearest neighbor classifier can be used to compare
messages against each other, with k as a parameter
Example
Based on keywords, we find the k records closest to
our record and take a majority vote of their classes;
the result is 3 votes for equipment problem
(picture from Imad Rahal Slides)
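
A minimal nearest neighbor sketch over keyword sets is shown below. It assumes Jaccard distance as the distance measure, which the slides do not specify, and the labeled records are invented.

```python
from collections import Counter

def jaccard_distance(a, b):
    """1 minus the share of keywords that two sets have in common."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def knn_classify(query, labeled_docs, k=3):
    """labeled_docs: list of (keyword_set, label). Returns (label, votes)."""
    nearest = sorted(labeled_docs,
                     key=lambda item: jaccard_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0]

# Hypothetical labeled records.
known = [
    ({"engine", "noise"}, "equipment"),
    ({"gear", "noise"}, "equipment"),
    ({"brake", "fail"}, "equipment"),
    ({"delay", "gate"}, "service"),
    ({"fight", "security"}, "personnel"),
]
result = knn_classify({"engine", "noise", "fail"}, known, k=3)
print(result)
```

With k = 3 the three closest records are all equipment problems, so the query is labeled by a 3-vote majority.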

38
Issues with classification

Problems with classification
Over fitting
Under fitting
Keyword Cross listing

39
Benefits of Classification

The information that can be gleaned from
classification is directly applicable to airlines
The classes established can assist with
situation deployment or continuing customer care

40
Information Extraction

Extract meaningful information from text
Identify and classify elements
Sam went bowling with Tommy in St. Cloud at
9:00pm
People: Sam and Tommy
Place: St. Cloud
Time: 9:00pm
Basis for many text mining technologies

41
Application of Information Extraction

JUST PRIOR TO rotate A DEER RAN ONTO THE runway.I
rotate AND hear A SOUND AND feel AS IF WE MIGHT
HAVE HIT THE DEER.THE GEAR retract normal.I
decide TO CONTINUE TO sfo airport figure THAT IF
WE HAD BLOWN A TIRE OR sustain DAMAGE TO THE GEAR
ETC THAT IT WOULD BE BETTER TO LAND AT sfo
airport.
Place: slo, runway
Thing: deer, sound, damage
Plane element: gear, tire
Actions: ran, rotate, hit, blown, land

42
More In-depth Extraction

Base weight of word on
Number of documents it appears in
Number of times it appears in a document
High weight
Appears many times in a document
Does not appear in many documents
Low weight to words appearing in many documents.
Potentially identify important topics in aviation
dataset.
Landing gear
Flaps
Fog
Deer
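
The weighting described here matches the usual TF-IDF idea: term frequency in the document times the log of the inverse document frequency. The sketch below uses that standard form as an assumption, since the slides do not state a formula, and the three token lists are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["aircraft", "deer", "deer", "runway"],
    ["aircraft", "fog", "runway"],
    ["aircraft", "flaps"],
]
weights = tf_idf(docs)
print(weights[0])
```

"aircraft" appears in every document and gets weight 0, while "deer" appears twice in one document only and gets the highest weight there.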

43
Summarization

Goal: Reduce size and detail of document while
retaining main points.
Software lacks human ability to understand
concepts and explain them.
Solution
Sentence extraction based on
Weight
Key phrases
Headings
Problem: Must still be evaluated by a human.
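
Sentence extraction by weight can be sketched as follows: score each sentence by the average frequency of its non-stop words across the whole text and keep the top scorers in their original order. The stop list, the scoring rule, and the sample report are assumptions; key phrases and headings are not modeled here.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration.
STOP_WORDS = frozenset({"the", "a", "to", "of", "and", "was", "on"})

def tokenize(sentence):
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in top]

report = ("The brake system failed on the ramp. The crew called a tug. "
          "The brake failure blocked the ramp.")
summary = summarize(report, n_sentences=2)
print(summary)
```

The output is still only a selection of existing sentences, so a human reader has to judge whether it captures the main points.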

44
Categorization

Treat input as a bag of words
Count words as they appear.
Counts are used to identify main topics.
Use a thesaurus to identify relationships.
Rank documents based upon frequency of words
pertaining to a topic.
Lead to organization based upon problem area,
place, malfunction, people, etc.
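
A bag-of-words categorizer along these lines might look like the sketch below. The tiny thesaurus that maps related words to one topic is hypothetical, as are the sample reports.

```python
from collections import Counter

# Hypothetical mini-thesaurus mapping related words to a single topic.
THESAURUS = {
    "brake": "brakes",
    "brakefailure": "brakes",
    "gear": "landing gear",
    "maingear": "landing gear",
    "tire": "landing gear",
}

def topic_counts(text):
    """Count how often each topic is mentioned, ignoring word order."""
    words = text.lower().replace(".", " ").split()
    return Counter(THESAURUS[w] for w in words if w in THESAURUS)

def rank_by_topic(docs, topic):
    """Return document indices, most mentions of the topic first."""
    return sorted(range(len(docs)),
                  key=lambda i: topic_counts(docs[i])[topic], reverse=True)

docs = [
    "THE brake system fail TO STOP THE aircraft",
    "THE GEAR retract normal",
    "brakefailure ON THE ground AFTER brake check",
]
print(rank_by_topic(docs, "brakes"))
```

Grouping documents by their top-ranked topic would give the kind of organization by problem area that the last bullet describes.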

45
Concept Linkage

Link related documents
Find links between topics
Useful in biomedicine
Find links between symptoms, diseases and
treatments
Useful in aviation safety reports
Find links between symptoms and problems.

46
Clustering

Goal
1. Documents in a cluster are more similar to one
another.
2. Documents in separate clusters are less
similar.
Creates vectors for documents based upon how they
fit into different categories
Weights are given for how well documents fit into
a cluster.
Similar documents can then be found based upon
their proximities.
www.clusty.com
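
Proximity between document vectors is often measured with cosine similarity. The sketch below assumes plain word-count vectors and cosine similarity, neither of which the slides specify; a full clustering algorithm would build on a measure like this.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (token lists in)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical tokenized reports.
doc_a = ["gear", "tire", "gear"]
doc_b = ["gear", "tire"]
doc_c = ["fog", "delay"]
print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```

Documents a and b would land in the same cluster under almost any threshold, while c shares no words with them and has similarity 0.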

47
Conclusion

Text mining mines natural language text
It requires preprocessing to eliminate worthless
words
Our project uses classification on Aviation Safety
Reports
Identify problems and compare to SIAM competition
results

48
Resources

Fan, Weiguo, Linda Wallace, Stephanie Rich, and
Zhongju Zhang. "Tapping the Power of Text
Mining." Communications of the ACM 49.9 (2006):
76-82.
Arora, Ritu, and Purushotham Bangalore. "Text
Mining: Classification & Clustering of Articles
Related to Sports." ACM-SE 43: Proceedings of the
43rd Annual Southeast Regional Conference.
Kennesaw, Georgia.
Hotho, Andreas, Andreas Nürnberger, and Gerhard
Paaß. "A Brief Survey of Text Mining." LDV Forum
- GLDV Journal for Computational Linguistics and
Language Technology 20.1 (2005): 19-62.
Lau, Raymond Y.K. "Context-sensitive text mining
and belief revision for intelligent information
retrieval on the web." Centre for Information
Technology Innovation, Faculty of Information
Technology, Queensland University of Technology,
GPO Box 2434, Brisbane, Qld 4001, Australia.
Larsen, Bjornar, and Chinatsu Aone. "Fast and
Effective Text Mining using Linear-Time Document
Clustering." KDD '99: Proceedings of the Fifth
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. San Diego, California,
United States.
Tan, PN, M Steinbach, and V Kumar. Introduction
to Data Mining (ISBN 0-321-32136-7).