Text Mining - PowerPoint PPT Presentation

Title: Text Mining
Slides: 49
Provided by: mikee94

Transcript and Presenter's Notes


1
Text Mining
  • Mike Evans, Doug Svendsen, Stephanie Huls, Matt
    Lietzke

2
Outline
  • Text mining mines natural language text
  • Preprocessing makes natural text minable
  • Many techniques used in text mining
  • We will use classification
  • SIAM Competition held in Twin Cities
  • We will create a program and compare results

3
A Little Bit About Text Mining
  • The non-trivial extraction of previously
    unknown, interesting facts from a collection of
    texts.
  • Key element: linking together of extracted
    information
  • Text mining vs. data mining: natural language
    text vs. structured database of facts

4
Origination
  • 1950s: attempts to understand and model the
    information processing capabilities of the human
    brain
  • Original approach: analyze a natural language
    text at the level of individual sentences
  • Objective: create a semantic representation of a
    sentence in the form of structured relations
    between the important words comprising that
    sentence

5
Solving the Objective
  • Pre-developed linguistic molds were tried against
    the sentence and its components
  • On a match, the corresponding semantic
    construction was associated with the sentence
  • Proved to be a good first guide to
    understanding the meaning of a text

6
Problems
  • Too many different pre-developed molds were
    needed to build a set for analyzing different
    types of sentences
  • List of exceptional constructions in this
    approach quickly grows prohibitively large
  • Works well only for a limited subset of natural
    language texts

7
Not Information Extraction
  • Information extraction is extracting names,
    addresses, etc.
  • Text mining is finding new pieces of data
  • To turn information extraction into text mining,
    find relationships between extracted data
  • Human analysis
  • Ex. Wireless Technology

8
Applications
  • Biosciences: links in different subsets of
    literature to form hypotheses (ex. Don Swanson)
  • Genomics (ex. proteins)
  • Hospital charts: improve patient outcomes,
    shorten hospital stays

9
Limitations/Problems
  • Programs cannot fully interpret text like the
    human mind
  • Needed information is not in textual format:
    conversations, radio shows, television
  • Noise: spelling errors, abbreviations, acronyms

10
Text Characteristics Lead to Problems
  • Dimensionality: each word/phrase is considered a
    dimension
  • Dependency: relevant information takes the form
    of a complex conjunction of words/phrases
  • Informality: emails, e.g. "R u available?"

11
Ambiguity
  • Word ambiguity
  • Pronouns: he/she
  • Synonyms: buy/purchase
  • Multiple meanings: bat (mammal) vs. baseball bat
  • Semantic ambiguity
  • The chicken is ready to eat
  • Police squad help dog bite victim
  • We saw the Eiffel Tower flying from London to
    Paris
  • The police were ordered to stop drinking after
    midnight

12
Historical Text Mining
  • Most text mining tools focus on present-day
    English
  • Language of text depends upon where and when it
    was created
  • Broken text flows

13
Goals
  • Improved document classification
  • Automatic semantic annotation of documents
  • Improved search by semantics and concepts
  • Improved clustering of documents by concept
  • Summarization

14
Problems with Text Mining
  • What it is not: data retrieval, computational
    linguistics
  • Computers do not understand natural speech and
    text
  • Most writing consists of non-technical words,
    slang, abbreviations

15
Problems with Text Mining
  • Lots of trivial words in text
  • Common words unique to the subject are not
    helpful
  • Verbs and nouns are not in their simple forms
  • Hard to put a value on words
  • Not all problems need to be fixed

16
Preprocessing
  • Get rid of useless words (conjunctions, articles,
    prepositions)
  • Turn all words into their basic form (keep just
    the stem, not the prefix or suffix)
  • This makes preprocessing language specific
  • Remove words that are common among all records

17
Preprocessing
  • Filtering
  • Remove predefined words
  • Dictionary of bad words
  • Remove extremely common words
  • Lemmatization
  • Change verbs to the infinitive
  • Change nouns to the singular
  • Difficult and time consuming

18
Preprocessing
  • Stemming
  • Simpler version of lemmatization
  • Tries to reduce words to their basic form
  • Ex. takes "s" and "ing" off words
  • Index term selection
  • Selects words based on entropy
  • Frequent words have low entropy
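
The filtering and stemming steps described on the preprocessing slides can be sketched roughly as follows. This is only an illustration: the stop word list and the suffix rules are assumptions made for the example, not the preprocessing that SIAM applied to the competition dataset.

```python
# Hypothetical stop word list; a real system would use a much larger dictionary.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "on", "to", "from", "was"}


def strip_suffix(word):
    """Naive stemmer: remove a trailing "ing" or "s" (not a true lemmatizer)."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem what is left."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    kept = [t for t in tokens if t and t not in STOP_WORDS]
    return [strip_suffix(t) for t in kept]


print(preprocess("The brakes failed to stop the landing aircraft."))
```

Suffix stripping this crude leaves forms such as "failed" untouched, which is one reason lemmatization is described as the more difficult and more thorough option.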

19
Preprocessing
  • Other advanced methods commonly used
  • Part-of-speech tagging
  • Tags words as noun, verb, etc
  • Text chunking
  • Evaluates chunks of sentences
  • Parsing
  • Accounts for nearby words in sentences by parsing
    into a tree

20
Competition
  • SIAM Text Mining Competition
  • SIAM: Society for Industrial and Applied
    Mathematics
  • Conference in Twin Cities April 28th
  • Competition already done
  • SIAM Provides
  • Preprocessed Text Dataset
  • Program evaluator
  • Winners' results

21
Competition Goal
  • Dataset
  • Aviation Safety Reports
  • No labels given
  • Problem
  • Document Classification
  • Determine problem(s) in document
  • What kind of problem
  • Report confidence/precision

22
Project Goal
  • Use provided dataset
  • Possibly try more preprocessing
  • Try competition
  • Use classification algorithms
  • Need to identify problems
  • Need to classify documents with problem(s)
  • Compare results

23
Dataset
  • Aviation Safety Reports
  • Already preprocessed
  • 21,519 reports with 1 report per record
  • All reports in one file
  • Standard text mining format
  • Average document is over a paragraph long

24
Dataset Example
  • 1AFTER takeoff ON runway _ A loudnoise WAS hear
    come FROM FRONT AREA OF aircraft.FOR A WHILE I
    AND CREW THOUGHT IT WAS THE AIR drive generate
    THAT deploy FROM right NOSE OF aircraft.UPON
    FURTHER troubleshoot FOUND THAT THE AIR drive
    generate COULD NOT HAVE deploy DUE TO ABSENCE OF
    icon AND message ON THE engineindicationandcrewale
    rtingsystem system.WE immediate return TO THE
    airport FOR AN UNEVENTFUL land.FURTHER examine AT
    THE GATE show THE OXYGEN accesspanel pop OPEN
    AFTER takeoff cause THE NOISE.PRIOR TO flight THE
    normalpreflight show NO AJAR OR OPEN panel ON THE
    aircraft.moderateturbulence WAS encounter AFTER
    takeoffdue TO STRONG crosswind AND
    lowlevelwindshearadvisories IN EFFECT.
  • 2taxi OFF THE parkingramp THE brake system fail
    TO STOP THE aircraft.LATER determine TO BE A BAD
    TRUNION SWITCH IN THE right maingear NEITHER
    pilot HAD ani control OVER THE aircraft speed AND
    DUE TO frequencycongestion WE COULD NOT ALERT
    ground OF OUR problem.BECAUSE OF THIS WE WERE
    UNABLE TO HOLD SHORT OF THE control PORTION OF
    THE airport.THE INCURSION ON THE taxiway DID NOT
    PUT US IN DANGER OF collide WE DID BLOCK AN
    intersect AFTER WE coast TO A STOP.ground WAS
    immediate notify AND COMPANY WAS call TO GET A
    TUG AND BRING US BACK TO THE RAMP.I HAVE NEVER
    see train TO DEAL WITH brakefailure ON THE ground
    IN AN aircraft BUT I SURE WOULD LIKE TO.

25
Problems with Dataset
  • Words run together - generalaviation
  • Possibly introduced by SIAM for competition
  • Noise
  • Label suggestions
  • Too common for typos
  • Abbreviations
  • Missing spaces
  • Overall, text mining is flexible

26
Preprocessing Our Dataset
  • PLADS SIAM
  • Performed stemming and acronym expansion
  • Removed non-informative terms
  • Place names, etc
  • With our goal, place names are not necessary
  • Additional preprocessing
  • Fix missing spaces after periods

27
Pre- and Post-Preprocessing
  • Before preprocessing
  • After takeoff on runway Zeta a loud noise was
    heard coming from the front area of hangar Alpha.
  • After preprocessing
  • 1AFTER takeoff ON runway _ A loudnoise WAS hear
    come FROM FRONT AREA OF hangar _.

28
Classification
  • Classification is used to generate class labels
  • For text mining, it is used to classify documents
  • For our dataset, we could classify the type of
    problem that each report is about

29
Our Data Set and Classification
  • Some of the classes we could use
  • Service problems, time delay problems
  • Part problems
  • Personnel problems
  • Etc.
  • Part of our work with the dataset is to
    determine all of the class labels

30
Using ARM (Association Rule Mining) to Generate Keywords
  • Using Apriori, we can generate our keywords from
    our dataset
  • Modify the algorithm to work with highly
    preprocessed data.
  • Filter our data for frequent items (keywords),
    augmented with an exclusion list
  • Generate frequent itemsets and rules.
  • Use the generated rules to draw out related
    keywords and their frequency
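
A minimal sketch of Apriori-style keyword mining is shown below, assuming each report has already been reduced to a set of keywords. The function names, the toy reports, and the support and confidence thresholds are illustrative assumptions, not the project's actual implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples from the itemsets."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Toy keyword sets standing in for preprocessed reports (made up for this sketch).
reports = [
    {"noise", "engine", "takeoff"},
    {"noise", "engine"},
    {"runway", "gear", "land"},
    {"runway", "gear"},
    {"noise", "engine", "runway"},
]
itemsets = frequent_itemsets(reports, min_support=2)
for antecedent, consequent, confidence in association_rules(itemsets, 0.9):
    print(sorted(antecedent), "implies", sorted(consequent), confidence)
```

An exclusion list, as mentioned in the bullets, could be applied by removing the excluded words from each report's keyword set before calling `frequent_itemsets`.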

31
ARM Applied to Our Data Set
  • Relationships
  • Noise implies Engine
  • Runway implies Landing Gear

32
How to use Classification
  • Steps to proper classification use
  • Define Keywords List
  • Use Information Gain Equation
  • This equation determines how effective a word is
    based on its frequency in known documents
  • Use Match Files Technique
  • This technique takes a list of words that the
    user has supplied, or that is based on desired
    search terms or thesaurus and dictionary entries
  • Compare the list against the data to redefine
    the keywords

33
Information Gain Equation Explained
  • Here $p(L_c)$ is the fraction of training documents
    with class $L_c$ ($c = 1, 2$), $p(t_j = 1)$ and
    $p(t_j = 0)$ are the fractions of documents with /
    without term $t_j$, and $p(L_c \mid t_j = m)$ is the
    conditional probability of class $L_c$ if term
    $t_j$ is contained in the document ($m = 1$) or is
    missing ($m = 0$). $IG(t_j)$ measures how useful
    $t_j$ is for predicting $L_1$ from an
    information-theoretic point of view. We may
    determine $IG(t_j)$ for all terms and remove those
    with very low information gain from the dictionary.
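
The equation this slide explains is not reproduced in the transcript. A standard form of information gain that matches the description, offered here as an assumption and writing the slide's symbols as $L_c$ and $t_j = m$, is:

$$IG(t_j) = \sum_{c=1}^{2} p(L_c)\,\log_2\frac{1}{p(L_c)} \;-\; \sum_{m=0}^{1} p(t_j = m) \sum_{c=1}^{2} p(L_c \mid t_j = m)\,\log_2\frac{1}{p(L_c \mid t_j = m)}$$

The sketch below computes this quantity for a small labeled collection. The function names and the toy documents are hypothetical.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


def information_gain(docs, labels, term):
    """docs: list of sets of terms; labels: class label of each document."""
    n = len(docs)
    classes = sorted(set(labels))
    # First sum: entropy of the class distribution before looking at the term.
    prior = entropy([labels.count(c) / n for c in classes])
    # Second sum: class entropy within "term present" and "term missing",
    # each weighted by the fraction of documents in that group.
    conditional = 0.0
    for present in (True, False):
        subset = [lab for doc, lab in zip(docs, labels) if (term in doc) == present]
        if subset:
            weight = len(subset) / n
            conditional += weight * entropy(
                [subset.count(c) / len(subset) for c in classes])
    return prior - conditional


# Toy training documents (made up for this sketch).
docs = [{"engine", "noise"}, {"engine"}, {"delay"}, {"delay", "time"}]
labels = ["part", "part", "service", "service"]
for term in ("engine", "noise", "weather"):
    print(term, round(information_gain(docs, labels, term), 3))
```

A term that splits the classes perfectly scores highest, while a term that never appears (or appears everywhere) scores zero and would be a candidate for removal from the dictionary.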

34
Keyword set for our Dataset
  • We can create different keywords for different
    types of problems
  • E.g. Part Problem keywords: Emergency Landing,
    Landing Gear, Noises, Engine, Wings, Pressure
    Failure
  • Service problem and time delay problem keywords:
    Delay, Time, Emergency Landing
  • Personnel Problem keywords: Security Personnel,
    Illegal, Fight

35
How to use Classification
  • Use appropriate algorithm
  • Nearest Neighbor Classifier
  • Take the unknown document and plot it against
    known documents
  • Compute the distance to the k nearest neighbors
  • Give the new document the class held by the
    majority of those neighbors
  • Decision trees
  • Create a master collection of the words in a
    document.
  • Next, create the tree based on the presence of
    keywords (e.g. this document does not contain the
    word "sport", nor "football", "basketball", etc.)
  • Based on that decision, continue down the
    tree, making a decision at each node based on
    that node's word
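
The nearest neighbor steps above can be sketched as follows, assuming each document is represented as a dictionary of keyword counts and that Euclidean distance is used. Both choices, along with the toy records and class names, are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def distance(a, b):
    """Euclidean distance between two sparse keyword-count vectors."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def knn_classify(unknown, labeled, k=3):
    """Label `unknown` by majority vote among its k nearest labeled documents."""
    nearest = sorted(labeled, key=lambda pair: distance(unknown, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled records (made up for this sketch).
labeled = [
    ({"engine": 2, "noise": 1}, "part"),
    ({"gear": 1, "engine": 1}, "part"),
    ({"delay": 2, "time": 1}, "service"),
    ({"delay": 1}, "service"),
    ({"noise": 1, "gear": 1}, "part"),
]
print(knn_classify({"engine": 1, "noise": 1}, labeled, k=3))
```

The value of k trades off sensitivity to noise (small k) against blurring of class boundaries (large k); an odd k avoids ties when there are two classes.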

36
Application of Algorithm to Our Data Set
  • An FP tree can be used to split on the keywords
    list
  • Example: Personnel Problem class label
  • Keywords: Fight, Cabin Crew Assistance, etc.

37
Application of Algorithm to Our Data Set
  • A nearest neighbor classifier can be used to plot
    records against each other with a variable k
  • Example
  • Based on keywords, we compute the distance from
    our record to the classified records; the
    majority vote among the nearest neighbors is
    3 votes for equipment problem
  • (picture from Imad Rahal Slides)

38
Issues with classification
  • Problems with classification
  • Overfitting
  • Underfitting
  • Keyword cross-listing

39
Benefits of Classification
  • The information that can be gleaned from
    classification is directly applicable to airlines
  • The classes established can assist with
    situation deployment or continuing customer care

40
Information Extraction
  • Extract meaningful information from text
  • Identify and classify elements
  • Sam went bowling with Tommy in St. Cloud at
    9:00pm
  • People: Sam and Tommy
  • Place: St. Cloud
  • Time: 9:00pm
  • Basis for many text mining technologies
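
As a rough illustration of identifying and classifying elements, the sketch below uses small lookup lists (gazetteers) for people and places and a regular expression for clock times. The lists and the time format (written as "9:00pm") are assumptions; practical information extraction systems rely on far richer linguistic analysis.

```python
import re

# Hypothetical gazetteers; real systems use large name and place dictionaries.
PEOPLE = {"Sam", "Tommy"}
PLACES = {"St. Cloud"}
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE)


def extract(sentence):
    """Return the people, places, and times recognized in a sentence."""
    return {
        "people": [p for p in sorted(PEOPLE)
                   if re.search(r"\b" + re.escape(p) + r"\b", sentence)],
        "places": [p for p in sorted(PLACES) if p in sentence],
        "times": TIME_PATTERN.findall(sentence),
    }


print(extract("Sam went bowling with Tommy in St. Cloud at 9:00pm"))
```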

41
Application of Information Extraction
  • JUST PRIOR TO rotate A DEER RAN ONTO THE runway.I
    rotate AND hear A SOUND AND feel AS IF WE MIGHT
    HAVE HIT THE DEER.THE GEAR retract normal.I
    decide TO CONTINUE TO sfo airport figure THAT IF
    WE HAD BLOWN A TIRE OR sustain DAMAGE TO THE GEAR
    ETC THAT IT WOULD BE BETTER TO LAND AT sfo
    airport.
  • Place: sfo, runway
  • Thing: deer, sound, damage
  • Plane element: gear, tire
  • Actions: ran, rotate, hit, blown, land

42
More In-depth Extraction
  • Base weight of word on
  • Number of documents it appears in
  • Number of times it appears in a document
  • High weight
  • Appears many times in a document
  • Does not appear in many documents
  • Low weight to words appearing in many documents.
  • Potentially identify important topics in aviation
    dataset.
  • Landing gear
  • Flaps
  • Fog
  • Deer
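
The weighting described on this slide corresponds to what is commonly called TF-IDF: term frequency within a document multiplied by the logarithm of the inverse document frequency. The sketch below uses that common formulation as an assumption, since the slide does not give an exact formula; the toy reports are made up.

```python
from collections import Counter
from math import log


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({t: c * log(n / doc_freq[t]) for t, c in counts.items()})
    return weights


reports = [
    ["deer", "runway", "deer", "aircraft"],
    ["fog", "runway", "aircraft"],
    ["flaps", "aircraft", "land"],
]
for term, weight in sorted(tf_idf(reports)[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

In the first toy report, "deer" appears twice and in no other report, so it gets the highest weight, while "aircraft" appears in every report and gets a weight of zero.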

43
Summarization
  • Goal: reduce the size and detail of a document
    while retaining its main points.
  • Software lacks the human ability to understand
    concepts and explain them.
  • Solution: sentence extraction based on weight,
    key phrases, headings
  • Problem: must still be evaluated by a human.
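
Sentence extraction by weight can be sketched as below, assuming a sentence's weight is the average corpus frequency of its words. That scoring rule is an assumption chosen for brevity; key phrases and headings, which the slide also lists, are not modeled here.

```python
from collections import Counter


def summarize(sentences, n=1, stop_words=frozenset()):
    """Return the n highest-weighted sentences, kept in their original order."""
    def tokens(sentence):
        return [w for w in sentence.lower().split() if w not in stop_words]

    freq = Counter(w for s in sentences for w in tokens(s))

    def score(sentence):
        words = tokens(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(ranked)]


# Toy sentences in the style of the preprocessed reports (made up for this sketch).
report = [
    "brake system fail on taxi",
    "brake fail cause aircraft to block taxiway",
    "company call a tug",
]
print(summarize(report, n=1))
```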

44
Categorization
  • Treat input as a bag of words
  • Count words as they appear.
  • Counts are used to identify main topics.
  • Use a thesaurus to identify relationships.
  • Rank documents based upon frequency of words
    pertaining to a topic.
  • Leads to organization based upon problem area,
    place, malfunction, people, etc.
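
A bag-of-words categorizer along these lines might look like the sketch below. The thesaurus that maps topics to related words, the topic names, and the toy documents are all hypothetical.

```python
from collections import Counter

# Hypothetical thesaurus: topic -> words considered related to it.
TOPIC_TERMS = {
    "landing gear": {"gear", "tire", "brake"},
    "weather": {"fog", "turbulence", "crosswind"},
}


def topic_scores(text):
    """Count how often words related to each topic occur in the text."""
    counts = Counter(text.lower().split())
    return {topic: sum(counts[w] for w in words)
            for topic, words in TOPIC_TERMS.items()}


def rank_documents(documents, topic):
    """Order documents by how frequently they mention words for a topic."""
    return sorted(documents, key=lambda d: topic_scores(d)[topic], reverse=True)


documents = [
    "fog and crosswind on approach",
    "gear retract normal tire blown gear",
]
print(topic_scores(documents[1]))
print(rank_documents(documents, "landing gear")[0])
```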

45
Concept Linkage
  • Link related documents
  • Find links between topics
  • Useful in biomedicine
  • Find links between symptoms, diseases and
    treatments
  • Useful in aviation safety reports
  • Find links between symptoms and problems.

46
Clustering
  • Goal
  • 1. Documents in a cluster are more similar to one
    another.
  • 2. Documents in separate clusters are less
    similar.
  • Creates vectors for documents based upon how they
    fit into different categories
  • Weights are given for how well documents fit into
    a cluster.
  • Similar documents can then be found based upon
    their proximities.
  • www.clusty.com
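
The idea of finding similar documents by the proximity of their vectors can be sketched with cosine similarity, a common choice for text vectors. The slide does not name a specific measure, so this is an assumption, as are the document names and vectors.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(target, others):
    """Return the name of the document vector closest to `target`."""
    return max(others, key=lambda name: cosine(target, others[name]))


# Toy document vectors (made up for this sketch).
vectors = {
    "report_b": {"gear": 1, "tire": 1},
    "report_c": {"fog": 3},
}
print(most_similar({"gear": 2, "tire": 1}, vectors))
```

A clustering algorithm such as k-means would apply a proximity measure like this repeatedly, assigning each document to the cluster whose center it is closest to.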

47
Conclusion
  • Text mining mines natural language text
  • It requires preprocessing to eliminate worthless
    words
  • Our project uses classification on Aviation
    Safety Reports
  • Identify problems and compare to SIAM competition
    results

48
Resources
  • Fan, Weiguo, Linda Wallace, Stephanie Rich, and
    Zhongju Zhang. "Tapping the Power of Text
    Mining." Communications of the ACM 49.9 (2006)
    76-82.
  • Arora, Ritu, and Purushotham Bangalore. "Text
    Mining: Classification & Clustering of Articles
    Related to Sports." ACM-SE 43 Proceedings of the
    43rd Annual Southeast Regional Conference.
    Kennesaw, Georgia, .
  • Hotho, Andreas, Andreas Nürnberger, and Gerhard
    Paaß. "A Brief Survey of Text Mining." LDV Forum
    - GLDV Journal for Computational Linguistics and
    Language Technology 20.1 (2005) 19-62.
  • Raymond Y.K. Lau. Context-sensitive text mining
    and belief revision for intelligent information
    retrieval on the web. Centre for Information
    Technology Innovation, Faculty of Information
    Technology, Queensland University of Technology,
    GPO Box 2434, Brisbane, Qld 4001, Australia
  • Larsen, Bjornar, and Chinatsu Aone. "Fast and
    Effective Text Mining using Linear-Time Document
    Clustering." KDD '99 Proceedings of the Fifth
    ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining. San Diego, California,
    United States, .
  • Introduction to Data Mining by PN Tan, M
    Steinbach and V Kumar (ISBN 0-321-32136-7)