Title: Text Mining
1. Text Mining
2. Text Mining Definition
- Many definitions in the literature
- The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
- An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
- What is previously unknown information?
- Strict definition
- Information that not even the writer knows
- Lenient definition
- Rediscovering the information that the author encoded in the text
3. Text Mining Views from T2K and ThemeWeaver
4. Text Mining: ThemeScape and ThemeRiver
- Visualizing relationships between documents
- Images from Pacific Northwest Laboratory
5. Text Characteristics (1)
- Large textual databases
- Enormous wealth of textual information on the Web
- Publications are electronic
- High dimensionality
- Consider each word/phrase as a dimension
- Noisy data
- Spelling mistakes
- Abbreviations
- Acronyms
- Text data is very dynamic
- Web pages are constantly being generated (and removed)
- Web pages are generated from database queries
- Text is often not well structured
- Email/chat rooms
- "r u available ?"
- "Hey whazzzzzz up"
- Speech
6. Text Characteristics (2)
- Dependency
- Relevant information is a complex conjunction of words/phrases
- Order of words matters in a query
- hot dog stand in the amusement park
- hot amusement stand in the dog park
- Ambiguity
- Word ambiguity
- Pronouns (he, she)
- Synonyms (buy, purchase)
- Words with multiple meanings (bat: related to baseball or a mammal)
- Semantic ambiguity
- The king saw the rabbit with his glasses. (multiple meanings)
- Authority of the source
- IBM is more likely to be an authoritative source than my distant second cousin
7. Text Mining Process
- Text Preprocessing
- Syntactic/semantic text analysis
- Part-of-Speech (POS) tagging
- Feature Generation
- Bag of words
- Feature Selection
- Simple counting
- Statistics
- Selection based on POS
- Text/Data Mining
- Classification (supervised learning)
- Clustering (unsupervised learning)
- Information extraction
- Analyzing Results
8. Text Preprocessing: Syntactic/Semantic Text Analysis
- Part-of-Speech (POS) tagging
- Find the corresponding POS for each word
- e.g., John (noun) gave (verb) the (det) ball (noun)
- Word sense disambiguation
- Context-based or proximity-based
- Very accurate
- Parsing
- Generates a parse tree (graph) for each sentence
- Each sentence is a stand-alone graph
9. Feature Generation: Bag of Words
- A text document is represented by the words it contains (and their occurrences)
- e.g., "Lord of the rings" → {the, Lord, rings, of}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
- Stemming
- Reduces dimensionality
- Identifies a word by its root
- e.g., flying, flew → fly
- Stop words
- Identify the most common words that are unlikely to help with text mining
- e.g., the, a, an, you
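The pipeline above (tokenize, drop stop words, stem, count) can be sketched in a few lines. The stop-word list and suffix-stripping rules below are toy placeholders, not those of a real stemmer such as Porter's:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you", "of"}  # toy list for illustration

def crude_stem(word):
    # Toy suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def bag_of_words(text):
    # Lowercase, drop stop words, stem, then count occurrences.
    tokens = [t.lower() for t in text.split()]
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

print(bag_of_words("the Lord of the rings"))  # Counter({'lord': 1, 'ring': 1})
```

Note how order is discarded: only the multiset of (stemmed) words survives.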
10. Feature Selection
- Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Irrelevant features
- Not all features help!
- Remove features that occur in only a few documents
- Remove features that occur in too many documents
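Both pruning rules can be implemented from document frequencies alone. This sketch (the function name and thresholds are illustrative, not from T2K) keeps a term only if it appears in at least `min_df` documents and in no more than a fraction `max_df_ratio` of them:

```python
def filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms appearing in >= min_df documents and in
    at most max_df_ratio of all documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [["mono", "list"], ["mono", "spam"], ["jini", "list"], ["jini"]]
print(filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.75))
```

Here "spam" is dropped for being too rare (one document); a term appearing in every document would be dropped for being too common.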
11. Text Mining: General Application Areas
- Information Retrieval
- Indexing and retrieval of textual documents
- Finding a set of (ranked) documents that are relevant to the query
- Information Extraction
- Extraction of partial knowledge from the text
- Web Mining
- Indexing and retrieval of textual documents and extraction of partial knowledge using the Web
- Classification
- Predict a class for each text document
- Clustering
- Generating collections of similar text documents
12. Text Mining Applications
- Email: spam filtering
- News feeds: discover what is interesting
- Medical: identify relationships and link information from different medical fields
- Homeland security
- Marketing: discover distinct groups of potential buyers and make suggestions for other products
- Industry: identifying groups of competitors' web pages
- Job seeking: identify parameters in searching for jobs
13. Text Mining: Supervised vs. Unsupervised Learning
- Supervised learning (classification)
- Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- Split into training data and test data for the model-building process
- New data are classified based on the model built with the training data
- Unsupervised learning (clustering)
- Class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
14. Text Mining: Classification Definition
- Given: a collection of labeled records
- Each record contains a set of features (attributes) and the true class (label)
- Create a training set to build the model
- Create a testing set to test the model
- Find: a model for the class as a function of the values of the features
- Goal: assign a class (as accurately as possible) to previously unseen records
- Evaluation: what is good classification?
- Correct classification
- The known label of a test example is identical to the class predicted by the model
- Accuracy ratio
- Percentage of test-set examples that are correctly classified by the model
- A distance measure between classes can be used
- e.g., classifying a football document as a basketball document is not as bad as classifying it as crime
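Both evaluation measures mentioned above are a few lines of code; the class labels below are invented for illustration:

```python
from collections import Counter

def accuracy(true_labels, predicted):
    """Fraction of test examples whose predicted class matches the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

def confusion_matrix(true_labels, predicted):
    """Counts of (true class, predicted class) pairs."""
    return Counter(zip(true_labels, predicted))

true = ["football", "football", "basketball", "crime"]
pred = ["football", "basketball", "basketball", "crime"]
print(accuracy(true, pred))                                      # 0.75
print(confusion_matrix(true, pred)[("football", "basketball")])  # 1
```

The confusion matrix shows which classes get mixed up, which is where a class-distance measure (football vs. basketball vs. crime) would come into play.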
15. Text Mining: Clustering Definition
- Given: a set of documents and a similarity measure among documents
- Find: clusters such that
- Documents in one cluster are more similar to one another
- Documents in separate clusters are less similar to one another
- Goal
- Finding a correct set of documents
- Similarity measures
- Euclidean distance if attributes are continuous
- Other problem-specific measures
- e.g., how many words these documents have in common
- Evaluation: what is good clustering?
- Produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
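The "how many words in common" measure can be made precise as, for example, Jaccard similarity over word sets (one common choice; the slides do not prescribe a specific formula):

```python
def jaccard_similarity(doc_a, doc_b):
    """Shared distinct words divided by total distinct words (0.0 to 1.0)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard_similarity("hot dog stand", "hot dog park"))  # 0.5
print(jaccard_similarity("hot dog stand", "quarterly earnings report"))  # 0.0
```

Documents in the same cluster should score high under the chosen measure; documents in different clusters should score low.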
16. Classification Techniques
- Bayesian classification
- Decision trees
- Neural networks
- Instance-Based Methods
- Support Vector Machines
17. What is Information Extraction?
Advisory Programmer - Oracle (Austin, TX)
Response Code 1008-0074-97-iexc-jcn
Responsibilities This is an exciting opportunity
with Siemens Wireless Terminals a start-up
venture fully capitalized by a Global Leader in
Advanced Technologies. Qualified candidates will
Responsible for assisting with requirements
definition, analysis, design and implementation
that meet objectives, codes difficult and
sophisticated routines . Develops project plans,
schedules and cost data. Develop test plans and
implement physical design of databases. Develop
shell scripts for administrative and background
tasks, stored procedures and triggers. Using
Oracles Designer 2000, assist with Data Model
maintenance and assist with applications
development using Oracle Forms. Qualifications
BSCS, BSMIS or closely related field or related
equivalent knowledge normally obtained through
technical education programs. 5-8 years of
professional experience in development, system
design analysis, programming, installation using
Oracle development
- Given
- Source of textual documents
- Well-defined, limited query (text based)
- Find
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format
Example from Dan Roth's web page
18. T2K and Using GATE
- Load the GATE_IE_Viz itinerary to see information extraction in action
19. Information Extraction from Streaming Text
- Information extraction
- The process of using advanced automated machine-learning approaches
- to identify entities in text documents
- and to extract this information along with the relationships these entities may have in the text documents
- This project demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.
20. Bayesian Classification
- Idea: assign to example X the class label C such that P(C|X) is maximal
- Computes the distribution of an input associated with each class; for example, given the variable X with value xi, the probability of it being in class A may be greater than the probability of it being in class B
- Mathematically speaking: if the class-conditional densities P(X|C) and the priors P(xi) and P(cj) are known, then the classifier assigns class cj to datum xi if cj has the highest posterior probability given the data: P(cj|xi) = P(xi|cj) P(cj) / P(xi)
21. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
- Prior knowledge: can be combined with observed data
- Standard
- Provides a standard of optimal decision making against which other methods can be measured
- In a simpler form, provides a baseline against which other methods can be measured
22. Naïve Bayesian Classification
- Naïve assumption
- Feature independence
- P(xi|C) is estimated as the relative frequency of examples having value xi as a feature in class C
- Computationally easy!
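Under the independence assumption the classifier reduces to counting. Below is a minimal multinomial Naïve Bayes sketch with Laplace (add-one) smoothing, working in log space to avoid underflow; the class names and documents are invented for illustration, and this is not T2K's implementation:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Minimal multinomial Naive Bayes with Laplace smoothing (a sketch)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        # Prior P(c): fraction of training documents with label c.
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        # Per-class word counts for estimating P(w | c).
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.word_counts[c].values())
            # log P(c) + sum of log P(w | c), with add-one smoothing.
            score = math.log(self.priors[c])
            for w in doc:
                score += math.log((self.word_counts[c][w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_posterior)

nb = NaiveBayesText().fit(
    [["buy", "pills", "now"], ["meeting", "agenda", "today"]],
    ["spam", "work"])
print(nb.predict(["buy", "now"]))  # spam
```

Training and prediction are just counting and a few multiplications per word, which is what makes the method so cheap.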
23. Classification by Decision Tree
- Decision tree
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or a class distribution
- Decision tree generation consists of two phases
- Tree construction
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of a decision tree
- Classifying an unknown example
- Test the attributes of the example against the decision tree
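As a sketch of the "test attributes against the tree" step, a tree can be stored as nested dicts; the attributes and class labels below are invented for illustration:

```python
# A decision tree stored as nested dicts: an internal node tests an
# attribute, branches map test outcomes to subtrees, leaves are labels.
tree = {"attribute": "contains_free",
        "branches": {True: "spam",
                     False: {"attribute": "contains_meeting",
                             "branches": {True: "work", False: "other"}}}}

def classify(node, example):
    # Follow the branch matching the example's attribute value
    # until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(classify(tree, {"contains_free": False, "contains_meeting": True}))  # work
```

Tree construction (choosing which attribute to test at each node) and pruning are the expensive phases; classification itself is a single root-to-leaf walk.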
24. Text Mining in D2K
- Email classification
- Naïve Bayesian
25. Email Classification
- Input
- Multiple mailboxes, where each mailbox represents a class
- Output
- Results of the model on the testing set
- A model that classifies future email
26. Mailbox Files
- MONO.mbx
- Mono Developer Discussion List
- mono-list_at_lists.ximian.com
- 216 messages
- SPAM.mbx
- Spam Mailbox
- 100 messages
- JINI.mbx
- JINI-Users mail list
- JINI-USERS_at_JAVA.SUN.COM
- 104 messages
27. Opening the Itinerary
- Click on the Itinerary pane in the Resource Panel
- Expand the T2K directory with a single click
- Double-click on EmailClassification-T2K
28. D2K: A Few Features
- Properties indicate that a module has settings that can be changed before execution
- Indicated by a P in the lower-left corner of a module
- e.g., filename, maximum iterations, etc.
- Resource Manager
- Loads data that is accessible by all modules
29. EmailClassification Itinerary
- Uses D2K's Resource Manager to store data that will serve as a dictionary
- Contextual rule file
- Lexical rule file
- Stop words
- Lexicon
30. EmailClassification Itinerary (2)
- Load the mailbox data
- Input File Name
- Specify a directory by changing the module's properties
- ReadFileNames
- Sends each filename in this directory as output
- MBX Email Parser
- Parses the mailbox files
- Email → Document
- Converts the email document to the standard document object
- Flags control whether to include sender/receiver info
31. EmailClassification Itinerary (3)
- Pre-process text data
- Tokenizer
- Forms word tokens for each word or symbol
- Brill Pre-Tagger
- Assigns a part-of-speech tag to each token
- Can be used without the following two modules
- Brill Lexical Tagger
- Adjusts tags based on lexical rules
- Lexical must precede Contextual
- Brill Contextual Tagger
- Adjusts tags based on contextual rules
32. EmailClassification Itinerary (4)
- More text pre-processing
- Filter Stop Words
- Removes stop words
- Stemmer
- Transforms words into their word stem
- Removes plurals, etc.
- Select Tokens By Speech Tag
- Removes tokens that do not match the speech tags of interest
- Document → TermList
- Counts the frequency of terms in the document
- Adjusts counts for title weighting
33. EmailClassification Itinerary (5)
- Creation of a sparse table for learning
- Add Series of Ints
- Counts all values it receives
- Outputs the sum
- TermLists → SparseTable
- Creates a sparse table to be used for mining
- Term counts across documents are sparse
- Conserves memory and usage
- Feature Filter
- Eliminates terms that occur in only one document
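Why sparse storage matters: with thousands of terms in the vocabulary, each document contains only a handful, so storing just the nonzero counts avoids a mostly-zero dense matrix. A minimal sketch (one dict per document; this mirrors the idea, not T2K's actual data structure):

```python
from collections import Counter

def term_lists_to_sparse(term_lists):
    """Map each document's term list to a {column_index: count} dict,
    storing only nonzero entries."""
    vocab = sorted({t for tl in term_lists for t in tl})
    col = {t: i for i, t in enumerate(vocab)}  # term -> column index
    rows = [{col[t]: c for t, c in Counter(tl).items()} for tl in term_lists]
    return rows, vocab

rows, vocab = term_lists_to_sparse([["jini", "java", "jini"], ["spam"]])
print(vocab)  # ['java', 'jini', 'spam']
print(rows)   # [{1: 2, 0: 1}, {2: 1}]
```

A dense representation of the same data would store a full row of vocabulary-length zeros per document.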
34. EmailClassification Itinerary (6)
- Selecting input and output attributes for classification
- Choose Attributes
- Select the input attributes
- Select the output attribute: classification_DOCPROP
35. EmailClassification Itinerary (7)
- Setting testing and training sets
- Simple Train Test
- Property window to set train and test percentages
36. EmailClassification Itinerary (8)
- Model building and testing
- Naïve Bayes Text Model
- Builds a Naïve Bayesian classification model
- Model Predict
- Applies the model to the testing data
37. EmailClassification Itinerary (9)
- Results of the model on the testing data
- Prediction Table Report
- Shows the classification error
- Shows the confusion matrix
38. EmailClassification Itinerary (10)
- Table Viewer
- Shows the original data
- Shows the predicted column
39. Execute the Itinerary
- Check the properties of the Input File Name modules
- Click the Run button
40. Results
- Results of the model on the testing data
- Prediction Table Report
- Shows the classification error
- Shows the confusion matrix
- The original table had 6,718 attributes
- After filtering, the table had 2,830 attributes
41. Other Scenarios
- Change the weight for words in titles
- Change whether or not sender/receiver info is included
- Take the filter module out
- What happens to the accuracy of the model?
- What happens to performance?
42. Scenario: Verbs Only
- The original table had 1,020 attributes
- After filtering, the table had 636 attributes
43. T2K in Review
- Review T2K modules
- Review T2K itineraries
44. The ALG Team
- Staff
- Bernie Acs
- Loretta Auvil
- David Clutter
- Vered Goren
- Eugene Grois
- Luigi Marini
- Robert McGrath
- Chris Navarro
- Greg Pape
- Barry Sanders
- Andrew Shirk
- David Tcheng
- Michael Welge
- Students
- Chen Chen
- Hong Cheng
- Yaniv Eytani
- Fang Guo
- Govind Kabra
- Chao Liu
- Haitao Mo
- Xuanhui Wang
- Qian Yang
- Feida Zhu