Title: Text-Mining: analysis of text data
1Text-Mining: analysis of text data
- Dunja Mladenic
- J. Stefan Institute, Ljubljana, Slovenia and
Carnegie Mellon University, USA
- http://www-ai.ijs.si/DunjaMladenic/
- http://www.cs.cmu.edu/dunja/
2Web user profiling
- imagine the user browsing the Web, most of the
time by clicking hyperlinks
- goal: provide help by highlighting the hyperlinks
the user is likely to click (we assume that the
user clicks on interesting hyperlinks)
- induce a profile for each user separately
- the profile can be used to predict clicking on
hyperlinks (in our case), to collect interesting
Web-pages, to compare different users and share
knowledge between them (collaborative agents)
3Structure of the personal browsing assistant -
Personal WebWatcher
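[Diagram: User, Personal WebWatcher as a proxy (adviser), The Web, and User profile. The proxy passes the requested URL to the Web, receives the page, and returns a modified page to the user.]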
4Personal WebWatcher in action (1996)
Highlight interesting hyperlinks
5Data Pyramid
- Wisdom: knowledge plus experience
- Knowledge: information plus rules
- Information: data plus context
- Data
6What is Data Mining?
- Data mining (knowledge discovery in databases -
KDD, business intelligence)
- finding interesting (non-trivial, hidden,
previously unknown and potentially useful)
regularities in large datasets
- Say something interesting about the data.
- Describe this data.
7Data Mining: Potential usage
- Market analysis
- Risk analysis
- Fraud detection
- Text Mining
- Web Mining
- ...
8Why text analysis?
- The amount of text data on electronic media is
growing daily
- e-mail, business documents, the Web, organized
databases of documents, ...
- There is a lot of information contained in the
text
- Available methods and approaches make it possible
to solve interesting and non-trivial problems
9Problem description (I)
- Text information filtering
- Help with browsing the Web
- Generation and analysis of user profiles
- Automatic document categorization and keyword
assignment to documents
- Document clustering
- Document visualization
- Document authorship detection
- Document copying identification
- Language identification in text
10Document categorization
[Diagram: labeled documents are used to build a Document Classifier; given an unlabeled document, the classifier outputs a document category (label).]
11Yahoo! page for one category
12Automatic document categorization
- Problem: given is a set of content categories
filled with documents.
- The goal is to automatically insert a new
document (assign one or more relevant categories
to a new document).
- Content categories can be structured (e.g., Yahoo,
Medline) or unstructured (e.g., Reuters)
- The problem is similar to assigning keywords to
documents
13Document to categorize: CFP for CoNLL-2000
14Some predicted categories
15Our approach to document categorization
- Data is obtained from an existing collection of
manually categorized documents, where the
content categories are structured
- Using Text Mining methods, we constructed a model
that captures the manual work of editors
- The model is used to automatically assign content
categories and the corresponding keywords to new,
previously unseen documents
16System architecture
[Diagram: labeled documents (from the Yahoo! hierarchy on the Web) go through feature construction into vectors of n-grams, then through subproblem definition, feature selection and classifier construction to produce the Document Classifier; an unlabeled document goes in and a document category (label) comes out.]
17Summary of experiments and results
- learning from categorization hierarchy:
considering only promising categories during the
classification (5-15% of categories)
- extended document representation: new
features for sequences of two words
- feature subset selection: Odds ratio using 50-100
best features (0.2-5%)
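The two ideas above, features for sequences of two words and Odds ratio feature selection, can be sketched roughly as follows. This is a minimal illustration and not the system's actual code: the function names, the whitespace tokenization and the 0.5 smoothing constant are all assumptions made here.

```python
import math
from collections import Counter

def word_and_pair_features(text):
    """Single words plus sequences of two adjacent words (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    pairs = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return set(tokens) | set(pairs)

def odds_ratio_scores(pos_docs, neg_docs):
    """Score each feature by log odds ratio between the positive and negative class."""
    pos_sets = [word_and_pair_features(d) for d in pos_docs]
    neg_sets = [word_and_pair_features(d) for d in neg_docs]
    pos_df = Counter(f for s in pos_sets for f in s)  # document frequency in positives
    neg_df = Counter(f for s in neg_sets for f in s)  # document frequency in negatives
    scores = {}
    for f in set(pos_df) | set(neg_df):
        # Smoothing (an assumption) keeps both probabilities strictly inside (0, 1).
        p_pos = (pos_df[f] + 0.5) / (len(pos_sets) + 1.0)
        p_neg = (neg_df[f] + 0.5) / (len(neg_sets) + 1.0)
        scores[f] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def select_features(scores, k):
    """Keep the k best-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Features that occur mostly in the positive category get a large positive score, so keeping only the top-scoring ones yields a small feature set per category.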
18- More can be found at our project page:
- www.cs.cmu.edu/TextLearning/pww/yplanet.html
19Document authorship detection
- Problem: based on a database of documents and
authors, assign the most probable author to a new
document
- Solution is based on the fact that each author
uses a characteristic frequency distribution over
words and phrases
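A minimal sketch of this idea, assuming word frequencies only (no phrases) and cosine similarity as the comparison measure; both choices, and all names below, are illustrative assumptions rather than the method used in the talk.

```python
import math
from collections import Counter

def profile(texts):
    """Relative word frequencies over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def most_probable_author(document, docs_by_author):
    """Return the author whose word profile is closest to the document's profile."""
    doc_profile = profile([document])
    return max(docs_by_author,
               key=lambda a: cosine(doc_profile, profile(docs_by_author[a])))
```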
20Document copying identification
- Problem: predict the probability that a given
document was copied (partially or completely)
from some other document(s) from our database
- Algorithm uses complex indexing methods on
(different length) parts of documents and
compares them against the given document
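The slide does not spell out the indexing method. As a rough stand-in, the sketch below indexes fixed-length word sequences ("shingles") and reports, for each database document, the fraction of the given document's shingles it shares. The shingle size of 3 and every name here are assumptions for illustration.

```python
from collections import Counter

def shingles(text, size=3):
    """Set of all sequences of `size` consecutive words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

def build_index(database, size=3):
    """Map each shingle to the ids of the database documents containing it."""
    index = {}
    for doc_id, text in database.items():
        for sh in shingles(text, size):
            index.setdefault(sh, set()).add(doc_id)
    return index

def copied_fraction(document, index, size=3):
    """For each database document, the share of the given document's shingles found in it."""
    doc_shingles = shingles(document, size)
    hits = Counter()
    for sh in doc_shingles:
        for doc_id in index.get(sh, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(doc_shingles) for doc_id, n in hits.items()}
```

A high fraction for some database document suggests that a large part of the given document was copied from it; using several shingle sizes would approximate the "different length" parts mentioned above.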
21Natural language identification
- Text data analysis systems commonly use some
natural language dependent methods
- Need for identification of the natural language
the document is written in
- Problem: for a given text, identify the natural
language it is written in, selecting among the
predefined languages
22Algorithm for natural language identification
- Basic algorithms are simple: for each language
build a characteristic frequency table of pairs
and triples of letters that can be simply used to
identify a document language (TextCat, a publicly
available system, covers 60 languages)
- Problem is with short documents: in this case we
can use mechanisms for language dependent
stop-words detection (stop-words are frequent in
all languages)
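The basic algorithm can be sketched as below: one frequency table of letter pairs and triples per language, and the document is assigned to the language with the most similar table. Cosine similarity is an assumption chosen here for brevity; this is not TextCat's code, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngram_table(text):
    """Relative frequencies of letter pairs and triples in a text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    counts = Counter(text[i:i + n] for n in (2, 3)
                     for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency tables."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def identify_language(text, tables):
    """Pick the predefined language whose table is closest to the text's table."""
    doc_table = ngram_table(text)
    return max(tables, key=lambda lang: cosine(doc_table, tables[lang]))
```

In practice each table would be built from a large sample of text in that language; the tiny samples used to try this out are only enough to show the mechanics.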
23Problem description (II)
- Topic identification and tracking in time series
of documents
- Document indexing based on content and not only
keywords
- Content segmentation of text
- Document summarization
- Link analysis
- Information extraction
24Topic identification and tracking in time series
of documents
- Problem: given is a time-sequence of documents
(news); based on this document sequence we want
to
- identify documents that introduce a new topic
- from the sequence of new documents identify
documents about existing topics and connect them
into a topic sequence
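One simple way to approach both sub-tasks is a single pass over the sequence: a document that is not similar enough to any existing topic starts a new topic, otherwise it is appended to the closest one. The sketch below assumes bag-of-words vectors, cosine similarity and a threshold of 0.3; these are illustrative choices and not the method from the talk.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def track_topics(documents, threshold=0.3):
    """Group document indices into topic sequences, in order of arrival."""
    topics = []     # each topic is a list of document indices
    centroids = []  # summed word counts of the documents in each topic
    for i, text in enumerate(documents):
        vector = Counter(text.lower().split())
        sims = [cosine(vector, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            topics.append([i])            # document introduces a new topic
            centroids.append(Counter(vector))
        else:
            topics[best].append(i)        # document continues an existing topic
            centroids[best].update(vector)
    return topics
```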
25Text segmentation based on content
- Problem: divide text that has no given structure
(content table, paragraphs, etc.) into segments
with similar content
- Example applications:
- topic tracking in news (spoken news)
- identification of topics in large, unstructured
text databases
26Algorithm for text segmentation
- Algorithm
- Divide text into sentences
- Represent each sentence with words and phrases it
contains
- Calculate similarity between the pairs of
sentences
- Find a segmentation (sequence of delimiters), so
that the similarity between the sentences inside
the same segment is maximized and the similarity
between the segments is minimized
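The steps above can be sketched in a simplified form. Instead of searching for the globally best sequence of delimiters, this version takes a greedy shortcut: it places a delimiter wherever two adjacent sentences fall below a similarity threshold. The shortcut, the regular-expression sentence splitter, the words-only representation and the 0.1 threshold are all assumptions.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def segment(text, threshold=0.1):
    """Split text into segments of consecutive, mutually similar sentences."""
    # Step 1: divide text into sentences (naive split after ., ! or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 2: represent each sentence by the words it contains.
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    segments = []
    for i, sentence in enumerate(sentences):
        # Steps 3-4: compare neighbours; low similarity marks a delimiter.
        if i == 0 or cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append([sentence])
        else:
            segments[-1].append(sentence)
    return segments
```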
27Text Summarization
- Task: given a text document, create a summary
reflecting the document's contents
- Three main phases:
- Analyzing the source text
- Determining its important points
- Synthesizing an appropriate output
- Most methods adopt a linear weighting model: each
text unit (sentence) is assessed by
- Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
- output consists of the topmost text units
(sentences)
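A minimal sketch of such a linear weighting model is given below. How each of the four terms is computed is not specified on the slide, so every scoring rule here is an assumption: the first sentence gets the location bonus, the cue phrase list is hypothetical, the statistics term is a normalized average word frequency, and additional presence is taken to mean overlap with caller-supplied terms (for example title words).

```python
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion", "importantly")  # hypothetical list

def summarize(text, extra_terms=(), top=2):
    """Return the `top` highest-weighted sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws if len(w) > 3)  # skip short words
    max_freq = max(freq.values(), default=1)
    extra = {t.lower() for t in extra_terms}
    scored = []
    for i, (sentence, ws) in enumerate(zip(sentences, words)):
        location = 1.0 if i == 0 else 0.0
        cue = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        statistics = sum(freq[w] for w in ws) / (max_freq * max(len(ws), 1))
        presence = 1.0 if extra & set(ws) else 0.0
        scored.append((location + cue + statistics + presence, i))
    best = sorted(scored, reverse=True)[:top]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```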
28Information extraction
- Collect a set of Home pages from the Web and
build a soft database of people (name, address,
coworkers, research areas and publications,
biography...)
- Collect electronic seminar announcements and
extract location (room number), start and end
time, name of the speaker
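For the seminar announcement example, a hand-written baseline might look like the sketch below. It assumes announcements with labelled lines such as "Speaker:" and "Place:" and a time range like "3:30 - 5:00"; that format, the field names and the patterns are all hypothetical, and real systems for this task typically learn extraction rules from examples instead.

```python
import re

def extract_seminar(announcement):
    """Extract speaker, location, start and end time from a labelled announcement."""
    fields = {}
    speaker = re.search(r"^(?:speaker|who):\s*(.+)$", announcement, re.I | re.M)
    place = re.search(r"^(?:place|where|room):\s*(.+)$", announcement, re.I | re.M)
    times = re.search(r"(\d{1,2}:\d{2})\s*(?:-|to)\s*(\d{1,2}:\d{2})", announcement)
    if speaker:
        fields["speaker"] = speaker.group(1).strip()
    if place:
        fields["location"] = place.group(1).strip()
    if times:
        fields["start_time"], fields["end_time"] = times.group(1), times.group(2)
    return fields
```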
29Where are we now?
- Growing interest and need for handling large
collections of text
- The area has been present in Slovenia for over
5 years with strong international connections
- joint R&D project with Microsoft Research,
European and American research institutions,
cooperation with Boeing
- Organization of international events focused on
Text Mining (ICML-99, KDD-2000, ICDM-2001)
30Instead of conclusions...
- Text Mining enables solving some problems that
are often not expected to be addressed by
computers
- document authorship detection, identification of
related content or finding interesting people,
document segmentation and organization, automatic
collection of officer names for the selected
sector companies, finding experts in some area,
who is involved with whom (discovering social
networks), ...
31- To find more information check
- <http://www-personal.umich.edu/wfan/text_mining.html>
- <http://ai.about.com/library/weekly/aa102899.htm>
- <http://extractor.iit.nrc.ca/bibliographies/ml-applied-to-ir.html>
- <http://www.content-analysis.de/>
- get research papers at <http://www.researchindex.com>
- KDD-2000 Text Mining Workshop <http://www.cs.cmu.edu/dunja/WshKDD2000.html>
- ECAI-2000 ML for Information Extraction <http://www.dcs.shef.ac.uk/fabio/ecai-workshop.html>
- PRICAI-2000 Text and Web Mining Workshop <http://textmining.krdl.org.sg/cfp.html>
- IJCAI-2001 Adaptive Text Extraction and Mining Workshop <http://www.smi.ucd.ie/ATEM2001/>, Text Learning Beyond Supervision <http://www.cs.cmu.edu/mccallum/textbeyond/>
- ICDM-2001 Text Mining Workshop <http://www-ai.ijs.si/DunjaMladenic/TextDM01/>
- ECML/PKDD-2001 Text Mining tutorial <http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps>
32Link Analysis
- Mechanisms for detecting which vertices in the
graph (pages on the web) are more important on
the basis of link structure
- HITS algorithm (Hubs & Authorities) (Kleinberg
1998)
- PageRank (Page 1999) weighting (used by Google to
better rank good pages)
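The hubs and authorities iteration can be sketched as follows, assuming the graph is given as a list of directed (source, target) links. The fixed iteration count and the function name are illustrative choices. A vertex is a good authority if good hubs link to it, and a good hub if it links to good authorities.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) weight dictionaries for a directed link graph."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    def normalized(weights):
        # Scale to unit Euclidean length so the weights stay bounded.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        return {n: v / norm for n, v in weights.items()}

    for _ in range(iterations):
        # Authority weight: sum of hub weights of the vertices linking to it.
        auth = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            auth[target] += hub[source]
        auth = normalized(auth)
        # Hub weight: sum of authority weights of the vertices it links to.
        hub = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            hub[source] += auth[target]
        hub = normalized(hub)
    return hub, auth
```

Sorting the vertices by authority weight gives the importance ranking; the same procedure applies to any link graph, not only Web pages.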
33Link analysis on Amazon data
- We downloaded product pages from the Amazon.com
web site
- products are connected with the cross-sell relation
(customers who bought this product also bought
the following products)
- 130,000 books and 32,000 music CDs connected into
a graph
- Question: which products (books or CDs) are the
most important?
- we used the HITS algorithm to calculate the weights
- Harry Potter and the Beatles won the test.
34Popular books
- Harry Potter and the Goblet of Fire (Book 4) J K
Rowling, Mary Grandpre
- The Beatles Anthology The Beatles, Paul
McCartney, George Harrison, Ringo Starr, Lennon,
John Lennon
- Prodigal Summer Barbara Kingsolver
- Harry Potter and the Sorcerer's Stone (Book 1) J
K Rowling
- The Mark The Beast Rules the World (Left Behind
8) Tim LaHaye, Jerry B Jenkins
- Harry Potter and the Chamber of Secrets (Book 2)
J K Rowling
- Harry Potter and the Prisoner of Azkaban (Book
3) J K Rowling, Mary Grandpre
- The Sibley Guide to Birds (Audubon Society Nature
Guides Ser.) David Allen Sibley
- ...
35Popular CDs
- The Beatles
- A Day Without Rain Enya
- Lovers Rock Sade
- All That You Can't Leave Behind U2
- Riding With The King Eric Clapton, BB King
- Black and Blue Backstreet Boys
- Sailing To Philadelphia Mark Knopfler
- You're The One Paul Simon
- Kid A Radiohead
- Music Madonna
- Red Dirt Girl Emmylou Harris
- Renee Fleming
- ...