Text-Mining: analysis of text data

Slides: 36

Provided by: jiaw187

1
Text-Mining: analysis of text data

Dunja Mladenic
J.Stefan Institute, Ljubljana, Slovenia and
Carnegie Mellon University, USA
http://www-ai.ijs.si/DunjaMladenic/
http://www.cs.cmu.edu/dunja/

2
Web user profiling

Imagine the user browsing the Web, most of the time by clicking hyperlinks.
Goal: provide help by highlighting the hyperlinks the user is likely to click (we assume that the user clicks on interesting hyperlinks).
Induce a profile for each user separately.
The profile can be used to predict clicking on hyperlinks (in our case), to collect interesting Web pages, and to compare different users and share knowledge between them (collaborative agents).

3
Structure of the personal browsing assistant - Personal WebWatcher
[Diagram. Labels: User, URL, Personal WebWatcher proxy (adviser), The Web, page, modified page, User profile]
4
Personal WebWatcher in action (1996)
Highlight interesting hyperlinks
5
Data Pyramid
Wisdom: Knowledge plus experience
Knowledge: Information plus rules
Information: Data plus context
Data
6
What is Data Mining?

Data mining (knowledge discovery in databases - KDD, business intelligence): finding interesting (non-trivial, hidden, previously unknown and potentially useful) regularities in large datasets.
Say something interesting about the data.
Describe this data.

7
Data Mining: Potential usage

Market analysis
Risk analysis
Fraud detection
Text Mining
Web Mining
...

8
Why text analysis?

The amount of text data on electronic media is growing daily: e-mail, business documents, the Web, organized databases of documents, ...
There is a lot of information contained in the text.
Available methods and approaches make it possible to solve interesting and non-trivial problems.

9
Problem description (I)

Text information filtering
Help with browsing the Web
Generation and analysis of user profiles
Automatic document categorization and keyword
assignment to documents
Document clustering
Document visualization
Document authorship detection
Document copying identification
Language identification in text

10
Document categorization
[Diagram: labeled documents are used to build the Document Classifier; an unlabeled document (???) goes in and a document category (label) comes out]
11
Yahoo! page for one category
12
Automatic document categorization

Problem: given a set of content categories filled with documents, the goal is to automatically insert a new document (assign one or more relevant categories to a new document).
Content categories can be structured (e.g., Yahoo, Medline) or unstructured (e.g., Reuters).
The problem is similar to assigning keywords to documents.

13
Document to categorize: CFP for CoNLL-2000
14
Some predicted categories
15
Our approach to document categorization

Data is obtained from an existing collection of manually categorized documents, where the content categories used are structured.
Using Text Mining methods, we constructed a model that captures the manual work of editors.
The model is used to automatically assign content categories and the corresponding keywords to new, previously unseen documents.

16
System architecture
[Diagram. Training: labeled documents (from the Yahoo! hierarchy on the Web) -> feature construction (vectors of n-grams) -> subproblem definition, feature selection, classifier construction -> Document Classifier. Use: unlabeled document (??) -> Document Classifier -> document category (label)]
17
Summary of experiments and results

learning from the categorization hierarchy
considering only promising categories during the classification (5-15% of categories)
extended document representation: new features for sequences of two words
feature subset selection: Odds ratio, using the 50-100 best features (0.2-5%)
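
The two-word features and Odds ratio scoring mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the whitespace tokenizer, the document-level probability estimates and the 0.5 smoothing constant are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_features(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) occurring in text."""
    words = text.lower().split()
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats


def odds_ratio_scores(pos_docs, neg_docs, smoothing=0.5):
    """Score each feature by the log odds ratio of occurring in positive vs. negative documents."""
    pos_counts = Counter(f for d in pos_docs for f in ngram_features(d))
    neg_counts = Counter(f for d in neg_docs for f in ngram_features(d))
    scores = {}
    for feat in set(pos_counts) | set(neg_counts):
        # Smoothed estimates of P(feature | class), kept strictly inside (0, 1).
        p_pos = (pos_counts[feat] + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (neg_counts[feat] + smoothing) / (len(neg_docs) + 2 * smoothing)
        scores[feat] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_features(pos_docs, neg_docs, k):
    """Keep the k features with the highest odds ratio score."""
    scores = odds_ratio_scores(pos_docs, neg_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Features that are frequent in the category and rare outside it get the highest scores; features equally likely in both get a score near zero.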

More can be found at our project page: www.cs.cmu.edu/TextLearning/pww/yplanet.html

19
Document authorship detection

Problem: based on a database of documents and authors, assign the most probable author to a new document.
The solution is based on the fact that each author uses a characteristic frequency distribution over words and phrases.
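
One way to turn per-author word frequencies into a "most probable author" decision is sketched below. The slides do not say which model was used; this example assumes a multinomial Naive Bayes scorer with Laplace smoothing, equal prior probability for every author, single words only (no phrases), and hypothetical function names.

```python
import math
from collections import Counter


def train_author_models(docs_by_author):
    """Build a word-frequency table per author from {author: [texts]}."""
    return {author: Counter(w for t in texts for w in t.lower().split())
            for author, texts in docs_by_author.items()}


def most_probable_author(models, text):
    """Return the author whose smoothed word distribution gives text the highest log-likelihood."""
    vocab = set().union(*models.values())
    words = text.lower().split()
    best_author, best_score = None, -math.inf
    for author, counts in models.items():
        total = sum(counts.values())
        # Laplace smoothing so that unseen words do not zero out the likelihood.
        score = sum(math.log((counts[w] + 1) / (total + len(vocab) + 1)) for w in words)
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```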

20
Document copying identification

Problem: predict the probability that a given document was copied (partially or completely) from some other document(s) in our database.
The algorithm uses complex indexing methods on parts of documents (of different lengths) and compares them against the given document.
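
The indexing method itself is not described in the slides. As a stand-in, the sketch below indexes word sequences ("shingles") of two different lengths and reports, for each database document, the fraction of the query's shingles it shares. The shingle sizes, function names and the overlap score are assumptions, and the score is an overlap ratio rather than a calibrated probability.

```python
from collections import defaultdict


def shingles(text, sizes=(3, 5)):
    """Return the set of word sequences of each length in sizes."""
    words = text.lower().split()
    return {tuple(words[i:i + n])
            for n in sizes
            for i in range(len(words) - n + 1)}


def build_index(database):
    """Map each shingle to the ids of the database documents containing it."""
    index = defaultdict(set)
    for doc_id, text in database.items():
        for sh in shingles(text):
            index[sh].add(doc_id)
    return index


def copy_scores(index, text):
    """For each database document, the fraction of the query's shingles it shares."""
    query = shingles(text)
    hits = defaultdict(int)
    for sh in query:
        for doc_id in index.get(sh, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(query) for doc_id, n in hits.items()}
```

A score of 1.0 means every shingle of the query appears in that database document; documents sharing nothing with the query are left out of the result.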

21
Natural language identification

Text data analysis systems commonly use some natural-language-dependent methods.
This creates a need to identify the natural language a document is written in.
Problem: for a given text, identify the natural language it is written in, selecting among the predefined languages.

22
Algorithm for natural language identification

Basic algorithms are simple: for each language, build a characteristic frequency table of pairs and triples of letters, which can then be used to identify a document's language (TextCat, a publicly available system, covers 60 languages).
The problem is with short documents - in this case we can use mechanisms for language-dependent stop-word detection (stop-words are frequent in all languages).
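
The frequency-table idea can be sketched in a few lines. This is a simplification and not TextCat's method: it compares raw letter-pair and letter-triple counts with cosine similarity, and the function names and the tiny training samples in the usage note are made up for illustration.

```python
import math
from collections import Counter


def ngram_profile(text, sizes=(2, 3)):
    """Count letter pairs and triples in text (single spaces kept as word boundaries)."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for n in sizes for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two count tables."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def identify_language(profiles, text):
    """Return the language whose profile is most similar to the profile of text."""
    doc = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(profiles[lang], doc))
```

In use, `profiles` would be built once per language from a large sample, for example `{"en": ngram_profile(english_sample), "de": ngram_profile(german_sample)}`, where the sample variables are placeholders.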

23
Problem description (II)

Topic identification and tracking in time series
of documents
Document indexing based on content and not only
keywords
Content segmentation of text
Document summarization
Link analysis
Information extraction

24
Topic identification and tracking in time series
of documents

Problem: given a time sequence of documents (news), we want to
identify the document that introduces a new topic
from the sequence of new documents, identify documents about existing topics and connect them into a topic sequence
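
A minimal single-pass sketch of both subtasks is given below. It assumes bag-of-words vectors, cosine similarity against the summed word counts of each topic, and an arbitrary threshold of 0.3; none of these choices comes from the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count tables."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def track_topics(documents, threshold=0.3):
    """Assign each document, in arrival order, to an existing topic or start a new one.

    Returns a list of topics; each topic is the list of document indices in its sequence.
    """
    topics = []      # document-index sequence per topic
    centroids = []   # summed word counts per topic
    for i, text in enumerate(documents):
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))
            topics[best].append(i)
            centroids[best].update(vec)
        else:
            # No sufficiently similar topic: this document introduces a new one.
            topics.append([i])
            centroids.append(vec)
    return topics
```

The first index in each returned sequence is the document that introduced the topic.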

25
Text segmentation based on content

Problem: divide text that has no given structure (table of contents, paragraphs, etc.) into segments with similar content.
Example applications:
topic tracking in news (spoken news)
identification of topics in large, unstructured text databases

26
Algorithm for text segmentation

Algorithm:
Divide the text into sentences.
Represent each sentence with the words and phrases it contains.
Calculate the similarity between pairs of sentences.
Find a segmentation (sequence of delimiters) such that the similarity between sentences inside the same segment is maximized and the similarity between segments is minimized.
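
The four steps can be sketched as below. The last step is simplified: instead of searching over all segmentations, this example cuts at the gaps where sentences on either side are least similar. The regular-expression sentence splitter, single-word representation, window size and function names are all assumptions.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count tables."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(text, n_segments=2, window=2):
    """Split text into n_segments runs of sentences, cutting at the least-similar gaps."""
    # Step 1: divide the text into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 2: represent each sentence by the words it contains.
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Step 3: score each gap by the mean similarity of sentence pairs that straddle it.
    gap_scores = []
    for gap in range(1, len(sentences)):
        left = vectors[max(0, gap - window):gap]
        right = vectors[gap:gap + window]
        pairs = [cosine(a, b) for a in left for b in right]
        gap_scores.append((sum(pairs) / len(pairs), gap))
    # Step 4 (greedy simplification): place delimiters at the lowest-scoring gaps.
    cuts = sorted(gap for _, gap in sorted(gap_scores)[:n_segments - 1])
    bounds = [0] + cuts + [len(sentences)]
    return [sentences[a:b] for a, b in zip(bounds, bounds[1:])]
```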

27
Text Summarization

Task: given a text document, create a summary reflecting the document's contents.
Three main phases:
Analyzing the source text
Determining its important points
Synthesizing an appropriate output
Most methods adopt a linear weighting model; each text unit (sentence) is assessed by
Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
The output consists of the topmost text units (sentences).
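
A linear weighting model of this kind can be sketched as follows. The four component scores here are illustrative guesses at what each term might measure (position in the text, a small hand-picked cue-phrase list, average term frequency, and overlap with the title); the slides give only the names of the terms, not their definitions.

```python
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion", "importantly")  # illustrative list only


def tokenize(text):
    """Lowercase word tokens of text."""
    return re.findall(r"\w+", text.lower())


def summarize(sentences, title="", n=2):
    """Return the n highest-weighted sentences, in their original order."""
    term_freq = Counter(w for s in sentences for w in tokenize(s))
    title_words = set(tokenize(title))

    def weight(i, sentence):
        words = tokenize(sentence)
        location = 1.0 / (i + 1)  # earlier sentences score higher
        cue = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        # Average document-wide frequency of the sentence's words, scaled by document length.
        statistics = sum(term_freq[w] for w in words) / (len(words) * len(sentences)) if words else 0.0
        # Share of title words that also appear in the sentence.
        additional = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        return location + cue + statistics + additional

    ranked = sorted(range(len(sentences)), key=lambda i: weight(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```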

28
Information extraction

Collect a set of home pages from the Web and build a soft database of people (name, address, coworkers, research areas and publications, biography, ...).
Collect electronic seminar announcements and extract the location (room number), start and end time, and name of the speaker.
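
To make the seminar-announcement task concrete, here is a small sketch that fills the same fields with hand-written patterns. It is only a stand-in for the extraction methods the slides refer to: the field labels ("Speaker:", "Time:", "Place:") and the time format are assumptions about how an announcement might look.

```python
import re

TIME = r"\d{1,2}:\d{2}\s*(?:am|pm)?"
PATTERNS = {
    "start_time": re.compile(rf"(?:Time|When):\s*({TIME})", re.I),
    "end_time": re.compile(rf"(?:Time|When):\s*{TIME}\s*(?:-|to)\s*({TIME})", re.I),
    "location": re.compile(r"(?:Place|Where|Room):\s*(.+)", re.I),
    "speaker": re.compile(r"(?:Speaker|Who):\s*(.+)", re.I),
}


def extract_seminar(announcement):
    """Return a dict of the fields found; fields that do not match are omitted."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(announcement)
        if match:
            record[field] = match.group(1).strip()
    return record
```

Announcements written as free text would not match these patterns, which is the case that motivates learning extraction rules from examples instead.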

29
Where are we now?

Growing interest in and need for handling large collections of text
The area has been present in Slovenia for over 5 years, with strong international connections:
joint R&D project with Microsoft Research, European and American research institutions, cooperation with Boeing
Organization of international events focused on Text Mining (ICML-99, KDD-2000, ICDM-2001)

30
Instead of conclusions...

Text Mining enables solving some problems that are often not expected to be addressed by computers:
document authorship detection, identification of related content or finding interesting people, document segmentation and organization, automatic collection of officer names for the selected sector companies, finding experts in some area, who is involved with whom (discovering social networks), ...

To find more information check:
http://www-personal.umich.edu/wfan/text_mining.html
http://ai.about.com/library/weekly/aa102899.htm
http://extractor.iit.nrc.ca/bibliographies/ml-applied-to-ir.html
http://www.content-analysis.de/
Get research papers at http://www.researchindex.com
KDD-2000 Text Mining Workshop: http://www.cs.cmu.edu/dunja/WshKDD2000.html
ECAI-2000 ML for Information Extraction: http://www.dcs.shef.ac.uk/fabio/ecai-workshop.html
PRICAI-2000 Text and Web Mining Workshop: http://textmining.krdl.org.sg/cfp.html
IJCAI-2001 Adaptive Text Extraction and Mining Workshop: http://www.smi.ucd.ie/ATEM2001/
Text Learning Beyond Supervision: http://www.cs.cmu.edu/mccallum/textbeyond/
ICDM-2001 Text Mining Workshop: http://www-ai.ijs.si/DunjaMladenic/TextDM01/
ECML/PKDD-2001 Text Mining tutorial: http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps

32
Link Analysis

Mechanisms for detecting which vertices in the graph (pages on the Web) are more important on the basis of link structure
HITS algorithm (Hubs & Authorities) (Kleinberg 1998)
PageRank (Page 1999) weighting (used by Google to better rank good pages)
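
The hub and authority scores can be computed by simple repeated updates, sketched below. The edge-list input format, the fixed iteration count and the function name are choices made for this example.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph given as a list of (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the vertices linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the vertices linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

A vertex pointed to by many good hubs ends up with a high authority score, and a vertex pointing to many good authorities ends up with a high hub score.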

33
Link analysis on Amazon data

We downloaded product pages from the Amazon.com web site.
Products are connected by the cross-sell relation (customers who bought this product also bought the following products).
130,000 books and 32,000 music CDs are connected into a graph.
Question: which products (books or CDs) are the most important?
We used the HITS algorithm to calculate the weights.
Harry Potter and the Beatles won the test.

34
Popular books

Harry Potter and the Goblet of Fire (Book 4): J K Rowling, Mary Grandpre
The Beatles Anthology: The Beatles, Paul McCartney, George Harrison, Ringo Starr, Lennon, John Lennon
Prodigal Summer: Barbara Kingsolver
Harry Potter and the Sorcerer's Stone (Book 1): J K Rowling
The Mark: The Beast Rules the World (Left Behind 8): Tim LaHaye, Jerry B Jenkins
Harry Potter and the Chamber of Secrets (Book 2): J K Rowling
Harry Potter and the Prisoner of Azkaban (Book 3): J K Rowling, Mary Grandpre
The Sibley Guide to Birds (Audubon Society Nature Guides Ser.): David Allen Sibley
....