The many dimensions of Information Mining - PowerPoint PPT Presentation

Slides: 20

Provided by: thimaljay

Transcript and Presenter's Notes

Title: The many dimensions of Information Mining

1
The many dimensions of Information Mining

Searching textual information sources
Thimal Jayasooriya
thimal_at_cs.york.ac.uk

2
Introduction

Exponentially increasing amounts of material are
available
Finding and making sense of this material is
potentially useful, but difficult with present
search technology.
Information mining includes information
extraction, information retrieval, natural
language processing and document summarization.
[Dixon1997]
Many diverse research areas and conferences cover
the topic.
TREC (Text Retrieval Conference)
MUC (Message Understanding Conference)
SIGIR-ACM (ACM Special Interest Group on
Information Retrieval)

3
Introduction - continued

Some problems which search can currently solve
Retrieving a document with known attributes. A
conference paper can be found by using the author
name and the conference title.
Retrieving documents which have been manually
annotated or reviewed as relevant.
Retrieving documents which contain exact matches
for specific words or phrases.

4
Introduction - continued

And some problems that aren't so easy
Finding the most relevant information for a
particular topic
Finding out opinions (positive, negative or
neutral)
Inferring the bias of a writer based on his
articles
Cross referencing domain knowledge with another
field
Finding other relevant resources or similar
documents
[Yang2002, vanRijke2003]

5
Why isn't search easier?

Unstructured and freeform nature of text
Not always possible to distinguish content from
fluff.
The content refreshing problem for web search
Content continually changes and leaves no time
to discover new material.
Ambiguity
Lack of semantic knowledge
Difficult to automatically discern the sense
of a sentence.
Lack of domain knowledge

6
Why isn't search easier - continued

[Manning2000] defines several steps that are
required to disambiguate a corpus of text
Distinguish word sense
Define the word category
Decide the syntactic structure
Resolve semantic ambiguity and scope

7
Ambiguity in text

Time flies like the wind; fruit flies like a banana.

8
The search procedure

Gather data sources (corpora of text)
Filter and index content
End user queries the indices
Documents matching the query are retrieved
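
The four steps above can be sketched with a tiny inverted index. This is a minimal illustration, not the method of any particular search product: the stop-word list, the sample corpus and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list used for the "filter" step.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "is", "in"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Step 1: gather the data source (a made-up three-document corpus).
corpus = {
    "d1": "Information retrieval and the evaluation of search engines",
    "d2": "Mining investor sentiment in text",
    "d3": "Text retrieval with an inverted index",
}
# Step 2: filter and index the content.
index = build_index(corpus)
# Steps 3 and 4: the user queries the index and matching documents come back.
print(sorted(search(index, "text retrieval")))
```

Exact-match retrieval of this kind is what slide 3 lists as a solved problem; nothing here ranks the results by relevance.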
9
Using metadata effectively

Most search techniques rely on metadata to
increase relevance.
For instance, frequently occurring words are more
likely to be the topic of a particular paper.
[vanRijsbergen1979, Manning2000]
Other forms of metadata can include formatting
and markup. Header tags and bold text, for
example, may be more relevant.
HTML and some other document types can link to
other pages. Brin and Page [Brin1998] used this
link structure to calculate PageRank. The higher a
page's PageRank, the more relevant it is assumed to be.
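
As a rough illustration of link-based ranking, the sketch below runs a simple power iteration over a made-up four-page link graph. It uses a common normalized variant of the PageRank recurrence (ranks sum to 1, dangling pages spread their rank evenly), which is an assumption of this example rather than a reproduction of [Brin1998]. The page names and the iteration count are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" has the most incoming links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph the page with the most incoming links ends up with the highest score, and the page nobody links to ends up with the lowest.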

10
Using metadata effectively - continued

Hawking [Hawking2000] extends this concept to
searching email archives.
Users could constrain searches by metadata, as
well as by content.
For example, search for all emails mentioning SIGIR
with acm.org in the From field.
A precursor to these papers is
InfoHarness [Shklar1995], which proposes a generic
framework for organizing any sort of metadata.
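
Constraining a search by metadata as well as by content can be sketched as follows. This is only an illustration of the idea: the Email record, the search_emails function and the sample messages are hypothetical, and the sketch does not reproduce the query syntax of [Hawking2000].

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def search_emails(emails, content_term, from_domain=None):
    """Return emails whose subject or body contains content_term.

    If from_domain is given, also require the sender address to end
    with "@" + from_domain (a metadata constraint on the From field).
    """
    term = content_term.lower()
    hits = []
    for email in emails:
        text = f"{email.subject} {email.body}".lower()
        if term not in text:
            continue
        if from_domain is not None and not email.sender.lower().endswith("@" + from_domain.lower()):
            continue
        hits.append(email)
    return hits


# Made-up archive for the example.
emails = [
    Email("chair@acm.org", "SIGIR call for papers", "Submissions are open."),
    Email("alice@example.com", "Re: SIGIR travel", "See you at SIGIR."),
    Email("news@acm.org", "Membership renewal", "Your membership expires soon."),
]

content_only = search_emails(emails, "SIGIR")
constrained = search_emails(emails, "SIGIR", from_domain="acm.org")
print(len(content_only), len(constrained))
```

The content-only query returns both SIGIR messages, while the added From constraint narrows the result to the one sent from acm.org.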

11
Data mining

Fayyad et al. note that data mining techniques
can be adopted for finding useful patterns in
information. [Fayyad1996]
The use of KDD (Knowledge Discovery in Databases)
and techniques to analyse interesting patterns
can also be adopted for text search.
A related concept is data organization in
warehouses and decision support systems.
Data is organized into dimensions, and each
dimension is analysed using a special algorithm.
Some analysis techniques include LSI (Latent
Semantic Indexing), SVD (Singular Value
Decomposition) and CFA (Correspondence Factorial
Analysis). [Mothe2000]
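
For reference, the standard textbook formulation of LSI in terms of the SVD is summarized below. The symbols (A, U, Sigma, V, k, q) are the usual ones from the LSI literature and are not taken from the slides; this is a sketch of the general technique, not of the specific analysis in [Mothe2000].

```latex
% A is the m x n term-document matrix (m terms, n documents).
\begin{align}
A &= U \Sigma V^{T}
  && \text{full singular value decomposition} \\
A_k &= U_k \Sigma_k V_k^{T}, \quad k \ll \min(m, n)
  && \text{keep only the $k$ largest singular values} \\
\hat{q} &= \Sigma_k^{-1} U_k^{T} q
  && \text{fold a query vector $q$ into the reduced space} \\
\mathrm{sim}(q, d_j) &= \frac{\hat{q} \cdot v_j}{\lVert \hat{q} \rVert \, \lVert v_j \rVert}
  && \text{$v_j$ is the $j$-th row of $V_k$, i.e. document $j$}
\end{align}
```

Documents and queries are compared in the reduced k-dimensional space, so two texts can be judged similar even when they share few exact terms.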

12
Document dimensions

In data warehousing terminology, a dimension is "a
structure which categorizes data in order to
allow users to answer questions".
Mothe employs dimensions to cluster and find
similar documents in astronomical data.
[Mothe2000]
Mothe mainly uses five elements of a document for
similarity detection. They are:
The title of the document
The author name
The journal name
The date published
The text of the document
Each dimension is then analysed using Singular
Value Decomposition or its derivatives (such as
LSI or CFA).

13
Document dimensions - continued

Limitations of Mothe's dimensioning for plain
text search
Treats the entire textual content of a document
as a single dimension.
Has a single set of dimensions for all documents,
regardless of type. Only five dimensions were
discovered and used.
Merely focuses on plotting of dimensions over
time.

14
Enhancements to dimensions

The accessibility dimension [Roellke2001]
addresses some of the limitations in Mothe's
model.
The content of a document is mapped to a tree
structure.
Each segment of text is analysed for its
relationship to the given search topic.
A segment of text is broken up into sub-contexts
and super-contexts.
Each search is modelled using term
frequency (tf), inverse document frequency
(idf) and the accessibility dimension (ac).
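
The slides do not give the formula that combines tf, idf and ac, so the sketch below makes a loud assumption: it treats ac as a per-segment weight between 0 and 1 and simply multiplies it into an ordinary tf-idf score. The function name, the weighting scheme and the sample segments are hypothetical and should not be read as the model in [Roellke2001].

```python
import math


def tf_idf_ac(term, segment, segments, accessibility):
    """Score one text segment for one term.

    tf  = relative frequency of the term in the segment
    idf = log(number of segments / number of segments containing the term)
    ac  = ASSUMED multiplicative weight in [0, 1] for this segment
    """
    tokens = segment.lower().split()
    tf = tokens.count(term) / len(tokens) if tokens else 0.0
    containing = sum(1 for s in segments if term in s.lower().split())
    idf = math.log(len(segments) / containing) if containing else 0.0
    return tf * idf * accessibility


# Made-up segments standing in for nodes of a document tree.
segments = [
    "lucene search index",
    "opinion mining",
    "search relevance ranking search",
]

shallow = tf_idf_ac("search", segments[2], segments, accessibility=1.0)
deep = tf_idf_ac("search", segments[2], segments, accessibility=0.5)
print(round(shallow, 3), round(deep, 3))
```

Under this assumption, the same segment scores half as much when its accessibility weight is halved, which is one simple way for a segment's position in the document tree to influence ranking.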

15
Proposed enhancements to dimensions

Entities (names, places and objects)
Dates (signifying timelines)
Opinions (positive, negative)

16
Potential uses of document dimensions

Document dimensions can be applied to a number of
natural language processing problems.
Opinion classification
Opinion classification is a mining technique used
to discover branding and product information in
public media.
Traditionally, it requires human analysis to
determine subject area and bias for a particular
area.
Document clustering
Document clustering is used to group similar
documents.
It can be employed before search to reduce the
searchable document set [vanRijsbergen1979], and
also as a means of organizing result documents.

17
Opinion classification

The majority of attempts employ machine learning
techniques.
A definitive work is Das and Chen [Das2001],
which classifies investor sentiment for certain
stocks.
Finn et al. [Finn2002] followed up with product
classification for entertainment (movies, theatre
and football).
Multiple techniques (part-of-speech tagging, bag
of words and text statistics) were evaluated.
They observed that a machine learning approach had
reduced domain transferability compared to
linguistic techniques.
Dini [Dini2002] attempted a hybrid technique,
combining tokenization of words with machine
learning.
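
To make the bag-of-words machine learning approach concrete, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is chosen for the illustration only; the slides do not say which learners the cited papers used. The training sentences and labels are made up.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set invented for this example.
examples = [
    ("great film with a strong cast", "positive"),
    ("enjoyable and strong performance", "positive"),
    ("dull plot and weak acting", "negative"),
    ("weak script and a dull ending", "negative"),
]
model = train(examples)
print(classify("strong and enjoyable", model))
print(classify("dull and weak", model))
```

Because the model only knows the words it was trained on, it illustrates the limited domain transferability noted above: a classifier trained on film reviews has little to say about stock commentary.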

18
Opinion classification - continued

A conclusion of both Dini and Dave et al.
[Dave2003] is that statistical methods are
unlikely to give greater than 90% accuracy at any
time.
They conclude that combining non-statistical
natural language techniques and semantic
understanding might provide better results.

19
Conclusion and implementation

Retrieving relevant, precise information from a
corpus of text is a hard AI problem.
Increasing the amount of metadata stored about a
corpus is another means of increasing search
relevance.
Document dimensions aim to integrate existing NLP
techniques into a higher level metadata
framework.
An existing search product (Jakarta Lucene 1.3)
will be extended and subclassed to provide
dimension based functionality.