The many dimensions of Information Mining

1
The many dimensions of Information Mining
  • Searching textual information sources
  • Thimal Jayasooriya
  • thimal_at_cs.york.ac.uk

2
Introduction
  • Exponentially increasing amounts of material are
    available
  • Finding and making sense of this material is
    potentially useful, but difficult with present
    search technology.
  • Information mining includes information
    extraction, information retrieval, natural
    language processing and document summarization
    [Dixon1997].
  • Many diverse research areas and conferences cover
    the topic.
  • TREC (Text Retrieval Conference)
  • MUC (Message Understanding Conference)
  • ACM SIGIR (ACM Special Interest Group on
    Information Retrieval)

3
Introduction - continued
  • Some problems which search can currently solve:
  • Retrieving a document with known attributes. A
    conference paper can be found by using the author
    name and the conference title.
  • Retrieving documents which have been manually
    annotated or reviewed as relevant.
  • Retrieving documents which contain exact matches
    for specific words or phrases.

4
Introduction - continued
  • And some problems that aren't so easy:
  • Finding the most relevant information for a
    particular topic
  • Finding out opinions (positive, negative or
    neutral)
  • Inferring the bias of a writer based on his
    articles
  • Cross-referencing domain knowledge with another
    field
  • Finding other relevant resources or similar
    documents
  • [Yang2002, vanRijke2003]

5
Why isn't search easier?
  • Unstructured and freeform nature of text
  • Not always possible to distinguish content from
    fluff.
  • The content refreshing problem for web search
  • Content continually changes and leaves no time
    to discover new material.
  • Ambiguity
  • Lack of semantic knowledge
  • Difficult to automatically discern the sense
    of a sentence.
  • Lack of domain knowledge

6
Why isn't search easier? - continued
  • [Manning2000] defines several steps that are
    required to disambiguate a corpus of text:
  • Distinguish word sense
  • Define the word category
  • Decide the syntactic structure
  • Resolve semantic ambiguity and scope

7
Ambiguity in text
Time flies like the wind; fruit flies like a banana.

8
The search procedure
  • Gather data sources (corpora of text)
  • Filter and index content
  • End user queries the indices
  • Retrieve documents matching the query
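The four steps above can be sketched with a small inverted index. This is a minimal illustration, not the presenter's implementation: the corpus, the `tokenize`, `build_index` and `search` names, and the choice of boolean AND matching are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Filter step: lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Index step: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query step: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Gather step: a hypothetical three-document corpus, for illustration only.
corpus = {
    "d1": "Time flies like the wind",
    "d2": "Fruit flies like a banana",
    "d3": "Information retrieval and text mining",
}
index = build_index(corpus)
print(sorted(search(index, "flies like")))
```

Note that this kind of index only answers exact word matches; it has no notion of word sense, which is the limitation the earlier slides describe.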

9
Using metadata effectively
  • Most search techniques rely on metadata to
    increase relevance.
  • For instance, frequently occurring words are more
    likely to be the topic of a particular paper
    [vanRijsbergen1979, Manning2000].
  • Other forms of metadata can include formatting
    and markup. Text in header tags or in bold, for
    example, may be more relevant.
  • HTML and some other document types can link to
    other pages. This was used by [Brin1998] to
    calculate PageRank. The higher the PageRank,
    the higher the assumed relevance.
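As a rough illustration of link-based ranking, the sketch below approximates PageRank by power iteration over a tiny hypothetical link graph. The function name, the graph and the iteration count are assumptions for the example; the damping factor of 0.85 is the value commonly quoted for PageRank, and production systems differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: pages a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ends up ranked highest because two pages link to it, and b lowest because nothing links to it.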

10
Using metadata effectively - continued
  • Hawking [Hawking2000] extends this concept to
    searching email archives.
  • Users could constrain searches by metadata, as
    well as by content.
  • For example, search for all emails mentioning
    SIGIR that have acm.org in the From field.
  • A precursor to these papers is seen in
    InfoHarness [Shklar1995], which proposes a generic
    framework for organizing any sort of metadata.
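A search constrained by both content and metadata might look like the sketch below, which uses Python's standard `email` package. It is not the system described in [Hawking2000]; the `search_mail` function, its parameters and the sample messages are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def search_mail(raw_messages, text, from_domain=None):
    """Return messages whose subject or body contains `text` and, if
    `from_domain` is given, whose From address is at that exact domain."""
    hits = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        # Metadata constraint: compare the domain of the From address.
        _, address = parseaddr(msg.get("From", ""))
        if from_domain and not address.lower().endswith("@" + from_domain.lower()):
            continue
        # Content constraint: plain substring match on subject and body.
        body = "" if msg.is_multipart() else msg.get_payload()
        haystack = (msg.get("Subject", "") + "\n" + body).lower()
        if text.lower() in haystack:
            hits.append(msg)
    return hits


# Hypothetical archive of three plain-text messages.
MESSAGES = [
    "From: Alice <alice@acm.org>\nSubject: SIGIR deadline\n\nPapers are due soon.\n",
    "From: bob@example.com\nSubject: Travel\n\nSee you at SIGIR.\n",
    "From: carol@acm.org\nSubject: Membership\n\nRenewal notice.\n",
]

for msg in search_mail(MESSAGES, "SIGIR", from_domain="acm.org"):
    print(msg["Subject"])
```

The domain check here is deliberately strict, so mail from a subdomain would not match; a real archive search would need a policy for that.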

11
Data mining
  • Fayyad et al. note that data mining techniques
    can be adopted for finding useful patterns in
    information [Fayyad1996].
  • The use of KDD (Knowledge Discovery in Databases)
    and techniques to analyse interesting patterns
    can also be adopted for text search.
  • A related concept is data organization in
    warehouses and decision support systems.
  • Data is organized into dimensions, and each
    dimension is analysed using a special algorithm.
  • Some analysis techniques include LSI (Latent
    Semantic Indexing), SVD (Singular Value
    Decomposition) and CFA (Correspondence Factorial
    Analysis) [Mothe2000].
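For reference, the usual textbook formulation of LSI in terms of SVD is given below. The symbols are not taken from the slides and are assumptions for this note: A is the term-document matrix (m terms by n documents), k is the number of retained singular values, q is a query vector in term space, and d̂_j is the j-th row of V_k.

```latex
\begin{align}
  A &= U \Sigma V^{\top}
      && \text{full SVD of the term-document matrix} \\
  A_k &= U_k \Sigma_k V_k^{\top}, \quad k \ll \min(m, n)
      && \text{rank-}k\text{ approximation used by LSI} \\
  \hat{q} &= \Sigma_k^{-1} U_k^{\top} q
      && \text{query folded into the reduced space} \\
  \operatorname{sim}(\hat{q}, \hat{d}_j) &=
      \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
      && \text{cosine similarity to document } j
\end{align}
```

Truncating to k dimensions is what lets documents match a query even when they share no exact terms with it.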

12
Document dimensions
  • In data warehousing terminology, a dimension is "a
    structure which categorizes data in order to
    allow users to answer questions".
  • Mothe employs dimensions to cluster and find
    similar documents in astronomical data
    [Mothe2000].
  • Mothe mainly uses five elements of a document for
    similarity detection:
  • The title of the document
  • The author name
  • The journal name
  • The date published
  • The text of the document
  • Each dimension is then analysed using Singular
    Value Decomposition or its derivatives (such as
    LSI or CFA).
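The five dimensions can be pictured as fields of a record that are compared one at a time. The sketch below is only an illustration of that idea: the `Document` class and function names are hypothetical, and simple word overlap (Jaccard similarity) stands in for the SVD-based analysis that [Mothe2000] actually uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    """One field per dimension listed above."""
    title: str
    author: str
    journal: str
    published: date
    text: str


def jaccard(a, b):
    """Word-overlap similarity between two strings, in the range [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def dimension_similarity(x, y):
    """Compare two documents dimension by dimension."""
    return {
        "title": jaccard(x.title, y.title),
        "author": 1.0 if x.author == y.author else 0.0,
        "journal": 1.0 if x.journal == y.journal else 0.0,
        "published": 1.0 if x.published.year == y.published.year else 0.0,
        "text": jaccard(x.text, y.text),
    }


# Two made-up records, for illustration only.
doc_a = Document("Galaxy clusters survey", "A. Author", "Journal X",
                 date(1999, 1, 1), "we survey galaxy clusters")
doc_b = Document("Galaxy survey methods", "B. Author", "Journal X",
                 date(1999, 6, 1), "methods for a galaxy survey")
print(dimension_similarity(doc_a, doc_b))
```

Keeping the scores separate per dimension, rather than summing them, is what allows a user to ask for documents that are similar in one respect (same journal) but not another (different author).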

13
Document dimensions - continued
  • Limitations of Mothe's dimensioning for plain
    text search:
  • Treats the entire textual content of a document
    as a single dimension.
  • Has a single set of dimensions for all documents,
    regardless of type. Only five dimensions were
    discovered and used.
  • Merely focuses on plotting of dimensions over
    time.

14
Enhancements to dimensions
  • The accessibility dimension [Roellke2001]
    addresses some of the limitations in Mothe's
    model.
  • The content of a document is mapped to a tree
    structure.
  • Each segment of text is analysed for its
    relationship to the given search topic.
  • A segment of text is broken up into sub-contexts
    and super-contexts.
  • Each search is modelled by combining term
    frequency (tf), inverse document frequency (idf)
    and the accessibility dimension (ac).
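The slide does not say how tf, idf and ac are combined. The sketch below assumes, purely for illustration, that the three are multiplied and that ac is a weight between 0 and 1 attached to a segment of the document tree. The function name and the sample segments are hypothetical, and the actual model in [Roellke2001] may differ.

```python
import math


def tf_idf_ac(term, segment, all_segments, accessibility):
    """Score `term` in one segment as tf * idf * ac (assumed combination).

    `segment` is a list of tokens, `all_segments` a list of such lists,
    and `accessibility` an assumed weight in [0, 1] for this segment.
    """
    # Term frequency: share of the segment's tokens that are `term`.
    tf = segment.count(term) / len(segment) if segment else 0.0
    # Inverse document frequency over the segments of the collection.
    df = sum(1 for other in all_segments if term in other)
    idf = math.log(len(all_segments) / df) if df else 0.0
    return tf * idf * accessibility


# Three made-up segments of a document tree.
segments = [
    ["search", "text", "search"],
    ["text", "mining"],
    ["opinion", "mining"],
]
score = tf_idf_ac("search", segments[0], segments, accessibility=0.5)
print(round(score, 4))
```

With this form, a deeply nested segment can be given a lower ac so that a match there counts for less than the same match near the root of the tree.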

15
Proposed enhancements to dimensions
  • Entities (names, places and objects)
  • Dates (signifying timelines)
  • Opinions (positive, negative)
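To make the three proposed dimensions concrete, the sketch below pulls a rough version of each out of a sentence. It is a deliberately naive illustration, not the proposed system: the word lists, the ISO date pattern and the capitalisation heuristic for entities are all assumptions, and real entity or opinion extraction needs far more than this.

```python
import re

# Tiny hypothetical opinion lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_dimensions(text):
    """Return naive entity, date and opinion dimensions for `text`."""
    tokens = text.split()
    entities, score = [], 0
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?()\"'")
        # Entity guess: a capitalised word that does not start a sentence.
        starts_sentence = i == 0 or tokens[i - 1][-1] in ".!?"
        if word[:1].isupper() and word[1:].islower() and not starts_sentence:
            entities.append(word)
        # Opinion guess: count hits against the two lexicons.
        if word.lower() in POSITIVE:
            score += 1
        elif word.lower() in NEGATIVE:
            score -= 1
    opinion = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"entities": entities, "dates": ISO_DATE.findall(text), "opinion": opinion}


sample = ("The review appeared on 2003-05-12. "
          "Critics in London called the film excellent.")
print(extract_dimensions(sample))
```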

16
Potential uses of document dimensions
  • Document dimensions can be applied to a number of
    natural language processing problems.
  • Opinion classification
  • Opinion classification is a mining technique used
    to discover branding and product information in
    public media.
  • It traditionally requires human analysis to
    determine the subject area and the bias for a
    particular area.
  • Document clustering
  • Document clustering is used to group similar
    documents.
  • It can be employed before search to reduce the
    searchable document set [vanRijsbergen1979], and
    also as a means of organizing result documents.
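Document clustering can be illustrated with a very small single-pass scheme: each document joins the first cluster whose seed document is similar enough, otherwise it starts a new cluster. The function names, the threshold of 0.3 and the sample documents are assumptions for this sketch; it is one simple clustering method among many, not the one used in the cited work.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: returns lists of document positions."""
    clusters = []  # each entry is (seed vector, member positions)
    for position, text in enumerate(docs):
        vector = Counter(text.lower().split())
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(position)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append((vector, [position]))
    return [members for _, members in clusters]


# Made-up documents, for illustration only.
docs = [
    "stock market prices rise",
    "stock prices fall on market fears",
    "football match result",
    "football cup match",
]
print(cluster(docs))
```

Used before search, the query would then be compared against cluster seeds first and only the members of matching clusters searched in full.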

17
Opinion classification
  • The majority of attempts employ machine learning
    techniques.
  • A definitive work is Das and Chen [Das2001],
    which classifies investor sentiment for certain
    stocks.
  • Finn et al. [Finn2002] followed up with product
    classification for entertainment (movies, theatre
    and football).
  • Multiple techniques (Part of Speech tagging, Bag
    of Words and Text statistics) were evaluated.
  • They observed that a machine learning approach had
    reduced domain transferability compared to
    linguistic techniques.
  • Dini [Dini2002] attempted a hybrid technique,
    combining tokenization of words with machine
    learning.
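As an example of the Bag of Words approach with machine learning, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing. It does not reproduce any of the cited systems: the class name, the training sentences and the choice of Naive Bayes are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesOpinion:
    """Multinomial Naive Bayes over bag-of-words features."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training reviews, for illustration only.
classifier = NaiveBayesOpinion()
classifier.train("a great and enjoyable film", "positive")
classifier.train("great acting and a good plot", "positive")
classifier.train("a dull and boring film", "negative")
classifier.train("bad acting and a boring plot", "negative")
print(classifier.classify("great plot good acting"))
```

Because the model only knows the words it was trained on, a classifier built from film reviews has little to say about, for instance, stock commentary, which is one way to read the transferability problem noted above.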

18
Opinion classification - continued
  • A conclusion of both Dini and Dave et al.
    [Dave2003] is that statistical methods are
    unlikely to give more than 90% accuracy at any
    time.
  • They conclude that combining non-statistical
    natural language techniques and semantic
    understanding might provide better results.

19
Conclusion and implementation
  • Retrieving relevant, precise information from a
    corpus of text is a hard AI problem.
  • Increasing the amount of metadata stored about a
    corpus is another means of increasing search
    relevance.
  • Document dimensions aim to integrate existing NLP
    techniques into a higher-level metadata
    framework.
  • An existing search product (Jakarta Lucene 1.3)
    will be extended and subclassed to provide
    dimension-based functionality.