Title: The many dimensions of Information Mining
1The many dimensions of Information Mining
- Searching textual information sources
- Thimal Jayasooriya
- thimal_at_cs.york.ac.uk
2Introduction
- Exponentially increasing amounts of material are
available.
- Finding and making sense of this material is
potentially useful, but difficult with present
search technology.
- Information mining includes information
extraction, information retrieval, natural
language processing and document summarization.
Dixon1997
- Many diverse research areas and conferences cover
the topic.
- TREC (Text Retrieval Conference)
- MUC (Message Understanding Conference)
- SIGIR-ACM (ACM Special Interest Group on
Information Retrieval)
3Introduction - continued
- Some problems which search can currently solve:
- Retrieving a document with known attributes. A
conference paper can be found by using the author
name and the conference title.
- Retrieving documents which have been manually
annotated or reviewed as relevant.
- Retrieving documents which contain exact matches
for specific words or phrases.
4Introduction - continued
- And some problems that aren't so easy:
- Finding the most relevant information for a
particular topic
- Finding out opinions (positive, negative or
neutral)
- Inferring the bias of a writer based on his
articles
- Cross referencing domain knowledge with another
field
- Finding other relevant resources or similar
documents
- Yang2002, vanRijke2003
5Why isn't search easier?
- Unstructured and freeform nature of text
- Not always possible to distinguish content from
fluff.
- The content refreshing problem for web search
- Content continually changes and leaves no time
to discover new material.
- Ambiguity
- Lack of semantic knowledge
- Difficult to automatically discern the sense
of a sentence.
- Lack of domain knowledge
6Why isn't search easier - continued
- Manning2000 defines several steps that are
required to disambiguate a corpus of text:
- Distinguish word sense
- Define the word category
- Decide the syntactic structure
- Resolve semantic ambiguity and scope
7Ambiguity in text
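Time flies like the wind; fruit flies like a banana.
("flies" is a verb in the first clause and a noun in the second; "like" switches from preposition to verb.)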
8The search procedure
Gather data source (corpora of text)
Filter and index content
End user queries indices
Retrieves documents matching query
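
The four steps above can be sketched with a small inverted index. This is only an illustration under simple assumptions: the stop-word list, the sample corpus and the function names are hypothetical, and "matching" here means that a document contains every query term.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list used for the "filter" step.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(corpus):
    """Index step: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Query step: return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Gather step: a made-up three-document corpus.
corpus = {
    "d1": "Information retrieval and text mining",
    "d2": "Mining opinions in public media",
    "d3": "The retrieval of email archives",
}
index = build_index(corpus)
print(sorted(search(index, "text mining")))
```

Real engines add ranking on top of this boolean match; the sketch stops at retrieval.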
9Using metadata effectively
- Most search techniques rely on metadata to
increase relevance.
- For instance, frequently occurring words are more
likely to be the topic of a particular paper.
vanRijsbergen1979, Manning2000
- Other forms of metadata can include formatting
and markup. Header tags and bold text, for
example, may be more relevant.
- HTML and some other document types can link to
other pages. This was used by Brin1998 to
calculate PageRank. The higher the PageRank,
the higher the relevance.
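
The link-based idea can be illustrated with a short power-iteration sketch. It is not the implementation from Brin1998: the normalised form (ranks sum to 1), the damping value of 0.85, the handling of pages without outgoing links and the three-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed rule: a page with no outgoing links spreads
                # its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a and b both link to c, c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c is linked from two pages and ends up ranked highest; page b has no incoming links and ends up lowest.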
10Using metadata effectively - continued
- Hawking Hawking2000 extends this concept to
searching email archives.
- Users could constrain searches by metadata, as
well as by content.
- Search for all emails with SIGIR facm.org
(with acm.org in the From field)
- A precursor to these papers is seen in
InfoHarness Shklar1995, which proposes a generic
framework for organizing any sort of metadata.
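
A metadata-constrained search of this kind might look like the sketch below. It does not reproduce the query syntax or system described in Hawking2000; the `Email` fields, the `search_emails` function and the sample addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Email:
    sender: str   # contents of the From field
    subject: str
    body: str

def search_emails(emails, text, from_contains=None):
    """Return emails whose subject or body contains `text`.

    If `from_contains` is given, the From field must contain it too.
    Matching is case-insensitive substring matching.
    """
    text = text.lower()
    hits = []
    for mail in emails:
        if from_contains and from_contains.lower() not in mail.sender.lower():
            continue  # metadata constraint not satisfied
        if text in mail.subject.lower() or text in mail.body.lower():
            hits.append(mail)
    return hits

# Made-up archive for illustration.
archive = [
    Email("announce@acm.org", "SIGIR call for papers", "Submissions are open."),
    Email("friend@example.com", "Going to SIGIR?", "See you there."),
    Email("news@acm.org", "Membership renewal", "Your membership expires soon."),
]
hits = search_emails(archive, "SIGIR", from_contains="acm.org")
print([mail.subject for mail in hits])
```

Without the `from_contains` constraint the same query also returns the message from the second sender, which is the difference the metadata filter is meant to make.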
11Data mining
- Fayyad et al. note that data mining techniques
can be adopted for finding useful patterns in
information. Fayyad1996
- The use of KDD (Knowledge Discovery in Databases)
and techniques to analyse interesting patterns
can also be adopted for text search.
- A related concept is data organization in
warehouses and decision support systems.
- Data is organized into dimensions, and each
dimension is analysed using a special algorithm.
- Some analysis techniques include LSI (Latent
Semantic Indexing), SVD (Singular Value
Decomposition) and CFA (Correspondence Factorial
Analysis). Mothe2000
12Document dimensions
- In data warehousing terminology, a dimension is "a
structure which categorizes data in order to
allow users to answer questions".
- Mothe employs dimensions to cluster and find
similar documents in astronomical data.
Mothe2000
- Mothe mainly uses five elements of a document for
similarity detection:
- The title of the document
- The author name
- The journal name
- The date published
- The text of the document
- Each dimension is then analysed using Singular
Value Decomposition or its derivatives (such as
LSI or CFA).
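
As a reminder of what an SVD-based analysis of one dimension computes, the standard LSI formulation is sketched below. The notation is an assumption made for this note and is not taken from Mothe2000: $A$ is a term-by-document matrix for the dimension, $k$ is the number of singular values kept, and $\hat{d}_i$ is the $i$-th row of $V_k \Sigma_k$, the reduced representation of document $i$.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T},
\qquad
\mathrm{sim}(d_i, d_j) =
  \frac{\hat{d}_i \cdot \hat{d}_j}
       {\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Documents whose reduced vectors have a high cosine similarity are treated as similar along that dimension.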
13Document dimensions - continued
- Limitations of Mothe's dimensioning for plain
text search:
- Treats the entire textual content of a document
as a single dimension.
- Has a single set of dimensions for all documents,
regardless of type. Only five dimensions were
discovered and used.
- Focuses only on plotting dimensions over
time.
14Enhancements to dimensions
- The accessibility dimension Roellke2001
addresses some of the limitations in Mothe's
model.
- The content of a document is mapped to a tree
structure.
- Each segment of text is analysed for its
relationship to the given search topic.
- A segment of text is broken up into sub contexts
and super contexts.
- Each search is modelled by combining term
frequency (tf), inverse document frequency
(idf) and the accessibility dimension (ac).
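
One way to picture how tf, idf and ac could interact is the sketch below. The slide does not give the exact formula from Roellke2001, so the plain product tf * idf * ac, the logarithmic idf, the `tf_idf_ac` function and the accessibility values are all assumptions chosen for illustration.

```python
import math

def tf_idf_ac(term, segment, segments, ac):
    """Score one text segment for one term as tf * idf * ac.

    The product form is an assumption, not the published model.
    `ac` is a hypothetical accessibility weight in (0, 1].
    """
    tokens = segment.lower().split()
    if not tokens:
        return 0.0
    tf = tokens.count(term) / len(tokens)
    # Document frequency: number of segments that contain the term.
    df = sum(1 for s in segments if term in s.lower().split())
    if df == 0:
        return 0.0
    idf = math.log(len(segments) / df)
    return tf * idf * ac

# Made-up segments standing in for nodes of a document tree.
segments = [
    "lucene indexes text",
    "opinions about text search",
    "dates and timelines",
]
# Same segment scored as a top-level context and as a less accessible sub context.
score_top = tf_idf_ac("lucene", segments[0], segments, ac=1.0)
score_nested = tf_idf_ac("lucene", segments[0], segments, ac=0.5)
print(score_top, score_nested)
```

Under this assumed form, a segment that is harder to reach from the root of the tree contributes proportionally less to the score.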
15Proposed enhancements to dimensions
Entities (names, places and objects)
Dates (signifying timelines)
Opinions (positive, negative)
16Potential uses of document dimensions
- Document dimensions can be applied to a number of
natural language processing problems.
- Opinion classification
- Opinion classification is a mining technique used
to discover branding and product information in
public media.
- Traditionally requires human analysis to
determine subject area and bias for a particular
area.
- Document clustering
- Document clustering is used to group similar
documents.
- It can be employed before search to reduce the
searchable document set vanRijsbergen1979, and
also as a means of organizing result documents.
17Opinion classification
- The majority of attempts employ machine learning
techniques.
- A definitive work is Das and Chen Das2001,
which classifies investor sentiment for certain
stocks.
- Finn et al. Finn2002 followed up with product
classification for entertainment (movies, theatre
and football).
- Multiple techniques (Part of Speech tagging, Bag
of Words and Text statistics) were evaluated.
- They observed that a machine learning approach had
reduced domain transferability compared to
linguistic techniques.
- Dini Dini2002 attempted a hybrid technique,
combining tokenization of words with machine
learning.
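
To make the bag-of-words, machine learning style of opinion classification concrete, here is a minimal Naive Bayes sketch. It is not the method of Das2001, Finn2002 or Dini2002; the classifier choice, the Laplace smoothing, the function names and the four training sentences are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples is a list of (text, label) pairs.

    Returns per-label word counts and per-label document counts.
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability under Naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data.
training = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("boring film hated it", "negative"),
    ("terrible acting dull story", "negative"),
]
model = train(training)
print(classify("great story", *model))
```

Because the model only counts words seen in training, a classifier trained on film reviews would know nothing about the vocabulary of, say, stock commentary, which is one way to read the domain transferability concern raised above.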
18Opinion classification - continued
- A conclusion of both Dini and Dave et al.
Dave2003 is that statistical methods are
unlikely to give greater than 90% accuracy at any
time.
- They conclude that combining non-statistical
natural language techniques and semantic
understanding might provide better results.
19Conclusion and implementation
- Retrieving relevant, precise information from a
corpus of text is a hard AI problem.
- Increasing the amount of metadata stored about a
corpus is another means of increasing search
relevance.
- Document dimensions aim to integrate existing NLP
techniques into a higher level metadata
framework.
- An existing search product (Jakarta Lucene 1.3)
will be extended and subclassed to provide
dimension based functionality.