Text-Mining: analysis of text data - PowerPoint PPT Presentation

1
Text-Mining: analysis of text data
  • Dunja Mladenic
  • J.Stefan Institute, Ljubljana, Slovenia and
    Carnegie Mellon University, USA
  • http://www-ai.ijs.si/DunjaMladenic/
  • http://www.cs.cmu.edu/dunja/

2
Web user profiling
  • imagine the user browsing the Web, most of the
    time by clicking hyperlinks
  • goal: provide help by highlighting the hyperlinks
    the user is likely to click (we assume that the
    user clicks on interesting hyperlinks)
  • induce a profile for each user separately
  • the profile can be used to predict clicking on
    hyperlinks (in our case), to collect interesting
    Web-pages, to compare different users and share
    knowledge between them (collaborative agents)

3
Structure of the personal browsing assistant -
Personal WebWatcher
[Diagram labels: User, URL, proxy (adviser), The Web,
page, User profile, modified page, Personal WebWatcher]
4
Personal WebWatcher in action (1996)
Highlight interesting hyperlinks
5
Data Pyramid
  • Wisdom: knowledge plus experience
  • Knowledge: information plus rules
  • Information: data plus context
  • Data
6
What is Data Mining?
  • Data mining (knowledge discovery in databases -
    KDD, business intelligence)
  • finding interesting (non-trivial, hidden,
    previously unknown and potentially useful)
    regularities in large datasets
  • Say something interesting about the data.
  • Describe this data.

7
Data Mining: Potential usage
  • Market analysis
  • Risk analysis
  • Fraud detection
  • Text Mining
  • Web Mining
  • ...

8
Why text analysis?
  • The amount of text data on electronic media is
    growing daily
  • e-mail, business documents, the Web, organized
    databases of documents,...
  • There is a lot of information contained in the
    text
  • Available methods and approaches make it possible
    to solve interesting and non-trivial problems

9
Problem description (I)
  • Text information filtering
  • Help with browsing the Web
  • Generation and analysis of user profiles
  • Automatic document categorization and keyword
    assignment to documents
  • Document clustering
  • Document visualization
  • Document authorship detection
  • Document copying identification
  • Language identification in text

10
Document categorization
[Diagram labels: labeled documents, Document Classifier,
unlabeled document, document category (label)]
11
Yahoo! page for one category
12
Automatic document categorization
  • Problem: given a set of content categories
    filled with documents.
  • The goal is to automatically insert a new
    document (assign one or more relevant categories
    to a new document).
  • Content categories can be structured (e.g., Yahoo,
    Medline) or unstructured (e.g., Reuters)
  • The problem is similar to assigning keywords to
    documents

13
Document to categorize: CFP for CoNLL-2000
14
Some predicted categories
15
Our approach to document categorization
  • Data is obtained from an existing collection of
    manually categorized documents whose content
    categories are structured
  • Using Text Mining methods, we constructed a model
    that captures the manual work of editors
  • The model is used to automatically assign content
    categories and the corresponding keywords to new,
    previously unseen documents

16
System architecture
[Diagram labels: Web, labeled documents (from Yahoo!
hierarchy), Feature construction, vectors of n-grams,
Subproblem definition, Feature selection, Classifier
construction, Document Classifier, unlabeled document,
document category (label)]
17
Summary of experiments and results
  • learning from categorization hierarchy:
    considering only promising categories during the
    classification (5-15% of categories)
  • extended document representation: new
    features for sequences of two words
  • feature subset selection: Odds ratio using 50-100
    best features (0.2-5%)
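
The two ideas on this slide, features for two-word sequences and Odds ratio scoring, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system described in the slides: the function names, the smoothing constant `eps`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def features(text):
    """Return the set of single words and two-word sequences in a text."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return set(words) | set(pairs)

def odds_ratio_scores(pos_docs, neg_docs, eps=0.01):
    """Score features by the log odds of appearing in positive vs. negative documents.

    eps is an assumed smoothing constant that keeps probabilities away from 0 and 1.
    """
    pos_counts = Counter(f for d in pos_docs for f in features(d))
    neg_counts = Counter(f for d in neg_docs for f in features(d))
    scores = {}
    for f in set(pos_counts) | set(neg_counts):
        p_pos = min(max(pos_counts[f] / len(pos_docs), eps), 1 - eps)
        p_neg = min(max(neg_counts[f] / len(neg_docs), eps), 1 - eps)
        scores[f] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores

def select_top(scores, k):
    """Keep the k best-scoring features (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ranked[:k]]

pos = ["machine learning conference", "machine learning workshop"]
neg = ["football match report", "football conference report"]
scores = odds_ratio_scores(pos, neg)
top = select_top(scores, 3)
```

Features that occur mostly in the positive category get large positive scores, features spread evenly over both get scores near zero, so keeping only the top few discards most of the vocabulary.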

18
  • More can be found at our project page
  • www.cs.cmu.edu/TextLearning/pww/yplanet.html

19
Document authorship detection
  • Problem: based on a database of documents and
    authors, assign the most probable author to a new
    document
  • Solution is based on the fact that each author
    uses a characteristic frequency distribution over
    words and phrases

20
Document copying identification
  • Problem: predict the probability that a given
    document was copied (partially or completely)
    from some other document(s) from our database
  • Algorithm uses complex indexing methods on
    (different length) parts of documents and
    compares them against the given document
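
The slide does not say which indexing methods are used, so the following is only a simplified stand-in: it indexes fixed-length word sequences ("shingles", an assumed technique) from the database and reports what fraction of a new document's shingles already occur there. All names are hypothetical, and the result is a crude overlap score rather than a calibrated probability.

```python
def shingles(text, n=3):
    """Set of all n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(database, n=3):
    """Map each shingle to the set of document ids that contain it."""
    index = {}
    for doc_id, text in database.items():
        for s in shingles(text, n):
            index.setdefault(s, set()).add(doc_id)
    return index

def copied_fraction(document, index, n=3):
    """Fraction of the document's shingles that also occur in the database."""
    doc_shingles = shingles(document, n)
    if not doc_shingles:
        return 0.0
    found = sum(1 for s in doc_shingles if s in index)
    return found / len(doc_shingles)

database = {"a": "the quick brown fox jumps over the lazy dog"}
index = build_index(database)
partial = copied_fraction("the quick brown fox sleeps all day", index)
```

Running the same comparison with several values of `n` would be one way to approximate the "different length" parts mentioned above.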

21
Natural language identification
  • Text data analysis systems commonly use some
    natural language dependent methods
  • Need to identify the natural language the
    document is written in
  • Problem: for a given text, identify the natural
    language it is written in, selecting among the
    predefined languages

22
Algorithm for natural language identification
  • Basic algorithms are simple: for each language,
    build a characteristic frequency table of pairs
    and triples of letters that can be used to
    identify a document's language (TextCat, a publicly
    available system, covers 60 languages)
  • Problem is with short documents - in this case we
    can use mechanisms for language dependent
    stop-words detection (stop-words are frequent in
    all languages)
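
A minimal sketch of the frequency-table idea, assuming tiny hand-written training texts and a simple overlap measure between profiles. It does not reproduce TextCat; the function names and the sample sentences are hypothetical, and real profiles would be built from far more text.

```python
from collections import Counter

def ngram_profile(text, sizes=(2, 3)):
    """Relative frequencies of letter pairs and triples in a text."""
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n] for n in sizes for i in range(len(text) - n + 1)
    )
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def similarity(p, q):
    """Overlap of two frequency profiles (shared probability mass)."""
    return sum(min(v, q.get(g, 0.0)) for g, v in p.items())

def identify_language(text, profiles):
    """Pick the predefined language whose profile overlaps most with the text."""
    doc = ngram_profile(text)
    return max(profiles, key=lambda lang: similarity(doc, profiles[lang]))

profiles = {
    "english": ngram_profile(
        "the quick brown fox jumps over the lazy dog and the cat "
        "sat on the mat with the other animals"
    ),
    "german": ngram_profile(
        "der schnelle braune fuchs springt ueber den faulen hund und "
        "die katze sitzt auf der matte mit den anderen tieren"
    ),
}
```

With so little text in a short document the profile is sparse, which is why the slide suggests falling back on language-dependent stop-words in that case.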

23
Problem description (II)
  • Topic identification and tracking in time series
    of documents
  • Document indexing based on content and not only
    keywords
  • Content segmentation of text
  • Document summarization
  • Link analysis
  • Information extraction

24
Topic identification and tracking in time series
of documents
  • Problem: given a time-sequence of documents
    (news), we want to
  • identify the document that introduces a new topic
  • from the sequence of new documents, identify
    documents about existing topics and connect them
    into a topic sequence

25
Text segmentation based on content
  • Problem: divide text that has no given structure
    (content table, paragraphs, etc.) into segments
    with similar content
  • Example applications
  • topic tracking in news (spoken news)
  • identification of topics in large, unstructured
    text databases

26
Algorithm for text segmentation
  • Algorithm
  • Divide text into sentences
  • Represent each sentence with words and phrases it
    contains
  • Calculate similarity between the pairs of
    sentences
  • Find a segmentation (sequence of delimiters) such
    that the similarity between sentences inside the
    same segment is maximized and the similarity
    between segments is minimized
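
The steps above can be sketched as follows. This is a simplified illustration, not the algorithm from the slides: it represents sentences by single words only, uses cosine similarity (an assumption), and replaces the search for an optimal segmentation with a greedy rule that cuts at the weakest links between adjacent sentences. All names are hypothetical.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Divide text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def bag_of_words(sentence):
    """Represent a sentence by the words it contains."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def segment(text, num_segments):
    """Greedy segmentation: cut where adjacent sentences are least similar."""
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    links = [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
    weakest = sorted(range(len(links)), key=lambda i: links[i])
    cuts = sorted(weakest[:num_segments - 1])
    segments, start = [], 0
    for c in cuts:
        segments.append(sentences[start:c + 1])
        start = c + 1
    segments.append(sentences[start:])
    return segments

text = (
    "The stock market fell today. Investors sold stock as the market fell. "
    "The football team won the match. Fans cheered the football team."
)
parts = segment(text, 2)
```

Here the delimiter lands between the second and third sentence, where the only shared word is "the".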

27
Text Summarization
  • Task: given a text document, create a summary
    reflecting the document's contents
  • Three main phases
  • Analyzing the source text
  • Determining its important points
  • Synthesizing an appropriate output
  • Most methods adopt a linear weighting model: each
    text unit (sentence) is assessed by
  • Weight(U) = LocationInText(U) + CuePhrase(U) +
    Statistics(U) + AdditionalPresence(U)
  • the output consists of the top-weighted text units
    (sentences)
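
A minimal sketch of such a linear weighting model. The four component scores below are illustrative assumptions (first-sentence bonus, a short hand-picked cue phrase list, average relative word frequency, overlap with the title); the slides do not define them, and all names are hypothetical.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary", "importantly")  # assumed list

def words(s):
    return re.findall(r"[a-z]+", s.lower())

def summarize(sentences, title, top_n=2):
    """Return the top_n highest-weighted sentences in their original order."""
    freq = Counter(w for s in sentences for w in words(s))
    total = sum(freq.values())
    title_words = set(words(title))

    def weight(i, s):
        ws = words(s)
        location = 1.0 if i == 0 else 0.0
        cue = 1.0 if any(c in s.lower() for c in CUE_PHRASES) else 0.0
        statistics = sum(freq[w] for w in ws) / (len(ws) * total) if ws else 0.0
        presence = (
            len(title_words & set(ws)) / len(title_words) if title_words else 0.0
        )
        # Weight(U) as the plain sum of the four components
        return location + cue + statistics + presence

    ranked = sorted(range(len(sentences)), key=lambda i: -weight(i, sentences[i]))
    return [sentences[i] for i in sorted(ranked[:top_n])]

doc = [
    "Text mining analyses text data.",
    "The weather was nice.",
    "In conclusion, text mining is useful.",
    "Lunch was served at noon.",
]
summary = summarize(doc, "Text mining")
```

In practice the components are usually given tuned coefficients rather than equal weight.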

28
Information extraction
  • Collect a set of Home pages from the Web and
    build a soft database of people (name, address,
    coworkers, research areas and publications,
    biography...)
  • Collect electronic seminar announcements and
    extract location (room number), start and end
    time, name of the speaker

29
Where are we now?
  • Growing interest and need for handling large
    collections of text
  • The area has been present in Slovenia for over 5
    years, with strong international connections
  • joint R&D project with Microsoft Research,
    European and American research institutions,
    cooperation with Boeing
  • Organization of international events focused on
    Text Mining (ICML-99, KDD-2000, ICDM-2001)

30
Instead of conclusions...
  • Text Mining enables solving some problems that
    are often not expected to be addressed by
    computers
  • document authorship detection, identification of
    related content or finding interesting people,
    document segmentation and organization, automatic
    collection of officer names for the selected
    sector companies, finding experts in some area,
    who is involved with whom (discovering social
    networks), ...

31
  • To find more information check:
  • <http://www-personal.umich.edu/wfan/text_mining.html>
  • <http://ai.about.com/library/weekly/aa102899.htm>
  • <http://extractor.iit.nrc.ca/bibliographies/ml-applied-to-ir.html>
  • <http://www.content-analysis.de/>
  • get research papers at <http://www.researchindex.com>
  • KDD-2000 Text Mining Workshop
    <http://www.cs.cmu.edu/dunja/WshKDD2000.html>
  • ECAI-2000 ML for Information Extraction
    <http://www.dcs.shef.ac.uk/fabio/ecai-workshop.html>
  • PRICAI-2000 Text and Web Mining Workshop
    <http://textmining.krdl.org.sg/cfp.html>
  • IJCAI-2001 Adaptive Text Extraction and Mining
    Workshop <http://www.smi.ucd.ie/ATEM2001/>, Text
    Learning Beyond Supervision
    <http://www.cs.cmu.edu/mccallum/textbeyond/>
  • ICDM-2001 Text Mining Workshop
    <http://www-ai.ijs.si/DunjaMladenic/TextDM01/>
  • ECML/PKDD-2001 Text Mining tutorial
    <http://www-ai.ijs.si/DunjaMladenic/TextDM01/Tutorial.ps>

32
Link Analysis
  • Mechanisms for detecting which vertices in the
    graph (pages on the web) are more important on
    the basis of link structure
  • HITS algorithm (Hubs & Authorities) (Kleinberg
    1998)
  • PageRank (Page 1999) weighting (used by Google to
    better rank good pages)
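
A small sketch of the hubs-and-authorities iteration on a toy graph, to make the idea concrete. The edge list, the function name and the fixed iteration count are assumptions for illustration; a production version would test for convergence instead.

```python
import math

def hits(edges, iterations=50):
    """Hub and authority scores for a directed graph of (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # A good hub points to good authorities.
        new_hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = new_hub
        # Normalize both score vectors so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
```

In this toy graph "c" ends up as the strongest authority because two pages link to it, and "a" as the strongest hub because it links to both authorities.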

33
Link analysis on Amazon data
  • We downloaded product pages from the Amazon.com
    web site
  • products are connected by a cross-sell relation
    (customers who bought this product also bought
    the following products)
  • 130,000 books and 32,000 music CDs connected into
    a graph
  • Question: which products (books or CDs) are the
    most important?
  • we used the HITS algorithm to calculate the weights
  • Harry Potter and the Beatles won the test.

34
Popular books
  • Harry Potter and the Goblet of Fire (Book 4): J K
    Rowling, Mary Grandpre
  • The Beatles Anthology: The Beatles, Paul
    McCartney, George Harrison, Ringo Starr, Lennon,
    John Lennon
  • Prodigal Summer: Barbara Kingsolver
  • Harry Potter and the Sorcerer's Stone (Book 1): J
    K Rowling
  • The Mark: The Beast Rules the World (Left Behind
    8): Tim LaHaye, Jerry B Jenkins
  • Harry Potter and the Chamber of Secrets (Book 2):
    J K Rowling
  • Harry Potter and the Prisoner of Azkaban (Book
    3): J K Rowling, Mary Grandpre
  • The Sibley Guide to Birds (Audubon Society Nature
    Guides Ser.): David Allen Sibley
  • ....

35
Popular CDs
  • The Beatles
  • A Day Without Rain: Enya
  • Lovers Rock: Sade
  • All That You Can't Leave Behind: U2
  • Riding With The King: Eric Clapton, BB King
  • Black and Blue: Backstreet Boys
  • Sailing To Philadelphia: Mark Knopfler
  • You're The One: Paul Simon
  • Kid A: Radiohead
  • Music: Madonna
  • Red Dirt Girl: Emmylou Harris
  • Renee Fleming
  • ...