Title: Mining Keywords from Large Topic Taxonomies
1. Mining Keywords from Large Topic Taxonomies
- Dissertation Presentation
12 November 2004
Christiana Christophi
Supervisor: Dr. Marios Dikaiakos
2. Outline
- Goal
- Motivation
- Related Work and Technology
- System Description
- Evaluation
- Conclusions and Future Work
3. Goal
- This dissertation researches the area of Information Retrieval and Web Mining in order to create a system that dynamically enriches a large topic taxonomy with a set of related keywords at each topic node.
4. What is a taxonomy?
- A structure that organizes large text databases hierarchically by topic to help searching, browsing and filtering.
- Many corpora are organized into taxonomies, like
- internet directories (Yahoo)
- digital libraries (Library of Congress Catalogue)
- patent databases (IBM patent database)
5. Motivation
- There is no publicly available general-purpose keyword taxonomy.
- Construction of hierarchical keyword taxonomies requires continuous and intensive human effort, and the result has restricted size and subject coverage.
- Our approach: exploit an existing taxonomy and use the contents of the pages within it to produce a general-purpose keyword catalog.
- Classifying documents is a slow process.
- A classification task requires an example document set for each topic and then performs classification based on all the content of the example set.
- Our approach: classify a document based on the keywords of each topic node.
- Drawbacks of keyword-based searches.
- Search engines simply return results that contain the user's query terms.
- Our approach: the catalog could help build a taxonomy classifier.
6. Synonymy and Polysemy
- Synonymy
- the same concept can be expressed using different sets of terms
- e.g. classification / categorization / taxonomy
- negatively affects recall
- Polysemy
- identical terms can be used in very different semantic contexts
- e.g. jaguar
- negatively affects precision
7. Precision and Recall
- Precision
- Represents the percentage of retrieved documents that are relevant to the query.
- Recall
- Represents the percentage of relevant documents that were retrieved.
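In the usual set notation (standard IR definitions, stated here for completeness):

Precision = |relevant ∩ retrieved| / |retrieved|
Recall = |relevant ∩ retrieved| / |relevant|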
8. Related Work
- Extraction of keywords for classification or link analysis.
- The corpus of documents that was processed was relatively small.
- WebKB: a knowledge base consisting of only 8,282 pages describing university people, courses and research projects.
- Hoover's data set, consisting of corporate information on 4,285 companies.
- Datasets are created by extracting sub-trees from Web directories like Yahoo! and the ODP hierarchy (aka Dmoz).
- e.g. the study of topic-based properties of the Web by Chakrabarti uses a 482-topic Dmoz collection as an initial set.
- The ACCIO system by Davidov et al. in "Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory", which performs automatic acquisition of labeled datasets for text categorization.
9. Related Work
- Hierarchical classification by Chakrabarti
- An effort to select features for text documents.
- Two models of documents:
- Bernoulli: the number of documents a term appears in
- Binary: a term either appears or not in a document
- Focus is on classification performance rather than better keyword extraction.
10. Information Retrieval
- Information Retrieval (IR) constructs an index for a given corpus and responds to queries by retrieving all the relevant corpus objects and as few non-relevant objects as possible.
- index a collection of documents (access efficiency)
- given a user's query:
- rank documents by importance (accuracy)
- determine a maximally related subset of documents
11. Indexing
- An index structure is a hash-indexed or B-tree-indexed table.
- This table consists of a set of records, each containing two fields: ID and posting_list.
- Objects from the corpus point to a list of lexical items.
- Inverted index
- effective for very large collections of documents
- associates lexical items to their occurrences in the collection
- A dictionary example:
- each key is a term t ∈ V, where V is the vocabulary
- the associated value p(t) points to a posting list
12. Index (figure)
13. Inverted Index (figure)
14. IR Models
- Statistical: use statistical measures to rank the documents that best satisfy the query terms.
- Boolean model
- Documents are represented by a set of index terms
- Each term is viewed as a Boolean variable (e.g. TRUE = term is present)
- Vector space model (VSM)
- Semantic: use syntactic and semantic analysis to simulate human understanding of text.
- Natural Language Processing (NLP)
- Match the semantic content of query and documents.
- Latent Semantic Indexing (LSI)
- Reduces the space so that each dimension in the reduced space tends to represent terms that are more likely to co-occur in the collection.
15. Statistical IR Models
- Theoretic models
- Boolean model
- Documents are represented by a set of index terms
- Each term is viewed as a Boolean variable (e.g. TRUE = term is present)
- A user query is a Boolean expression; the documents retrieved are those that satisfy the expression.
- Terms are combined using Boolean operators like AND, OR and NOT.
- No ranking of the results is possible: a document either satisfies the query or not.
16. Vector Space Model (VSM)
- Documents are represented as vectors in a vector space, where each unique term represents one dimension of that space.
- A vector-space representation of m text documents is as follows:
- Let n denote the number of terms in the vocabulary.
- Consider an n x m matrix A, whose entry a_ij indicates either the presence or absence of term i in document j.
- The presence of a word can be denoted either by a 1 or by a numeric weight.
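A small illustration of the n x m term-document matrix with binary entries (toy vocabulary and documents, assuming numpy; not from the dissertation):

import numpy as np

# Toy vocabulary (n terms) and corpus (m documents); illustrative only.
vocabulary = ["keyword", "taxonomy", "topic", "web"]
documents = ["web taxonomy of topic nodes", "keyword extraction for web pages"]

# n x m matrix A: a_ij = 1 if term i is present in document j (binary VSM).
A = np.zeros((len(vocabulary), len(documents)))
for j, doc in enumerate(documents):
    tokens = set(doc.split())
    for i, term in enumerate(vocabulary):
        A[i, j] = 1.0 if term in tokens else 0.0

print(A)   # rows = terms, columns = documents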
17. Semantic IR Models
- Natural Language Processing (NLP)
- Match the semantic content of query and documents.
- Indexing phrases (improves precision), thesaurus groups (improve recall)
- Latent Semantic Indexing (LSI)
- A vector information retrieval method which, in combination with singular value decomposition (SVD), is used for dimensionality reduction.
- Each dimension in the reduced space tends to represent terms that are more likely to co-occur in the collection.
- LSI considers documents that have many words in common as semantically close, and documents with few words in common as semantically distant.
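A compact LSI sketch via truncated SVD (a numpy-based illustration, not the dissertation's implementation; A is a term-document matrix as in the VSM example, and lsi_project is a hypothetical helper name):

import numpy as np

def lsi_project(A, k):
    """Project an n x m term-document matrix A into a k-dimensional
    latent space using a truncated SVD: A ~ U_k * S_k * Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Columns of doc_coords are the documents in the reduced space;
    # terms that co-occur tend to load on the same latent dimensions.
    doc_coords = np.diag(s_k) @ Vt_k
    return U_k, doc_coords

# Example: reduce a random 6-term x 4-document matrix to 2 dimensions.
A = np.random.rand(6, 4)
term_coords, doc_coords = lsi_project(A, k=2)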
18. Term Weighting
- A term that appears many times within a document is likely to be more important than a term that appears only once.
- A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
- Large documents must be treated equally to smaller documents.
19. Term Weighting
- VSM weighting
- Term Frequency Element (local weight of term i in document j): L_ij
- Collection Frequency Element (global weight of term i): G_i
- Normalization Length Element (normalization factor for document j): N_j

w_ij = L_ij * G_i * N_j (illustrated below)
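A compact sketch of this three-factor weighting, assuming common choices for each factor (L = 1 + log tf, G = IDF, N = cosine normalization); the dissertation's exact variants may differ:

import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = L_ij * G_i * N_j with L = 1 + log(tf), G = idf, N = cosine norm."""
    m = len(docs)
    tfs = [Counter(doc.split()) for doc in docs]
    df = Counter(t for tf in tfs for t in tf)            # document frequency
    idf = {t: math.log(m / df[t]) for t in df}           # global weight G_i
    weights = []
    for tf in tfs:
        w = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}   # L_ij * G_i
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0      # 1 / N_j
        weights.append({t: v / norm for t, v in w.items()})
        # each document vector now lies on the unit Euclidean sphere
    return weights

print(tfidf_weights(["topic taxonomy keywords", "web keywords keywords"]))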
20. Local Weight (formula figure)
21. Global Weight (formula figure)
22. Normalization Length (formula figure)
23. Web Mining
- Web Mining: the use of Data Mining techniques to automatically discover and extract information from WWW documents and services.
- The Web provides a fertile ground for data mining: an immense and dynamic collection of pages with countless hyperlinks and huge volumes of access and usage information.
- Searching, comprehending, and using the semistructured information stored on the Web poses a significant challenge because:
- Web page complexity exceeds the complexity of traditional text collections.
- The Web constitutes a highly dynamic information source.
- The Web serves a broad spectrum of user communities.
- Only a small portion of Web pages contain truly relevant or useful information.
- Data mining supplements keyword-based indexing, which is the cornerstone of Web search engines.
- Classification maps data into predefined groups or classes.
24. Classification schemes
- Nearest Neighbour (NN)
- Index the training documents.
- Given a document d, fetch the k training documents that are closest to d.
- The class that occurs most often among them is assigned to d.
25. Classification schemes
- Naïve Bayes (NB)
- Naïve assumption: attribute independence
- P(E_1, ..., E_k | C) = P(E_1 | C) * ... * P(E_k | C)
- C: class value of the instance
- E: evidence (the instance's attributes)
- Support Vector Machines (SVM)
- Construct a direct function from term space to the class variable.
- The ability to learn can be independent of the dimensionality of the feature space.
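A toy Naïve Bayes sketch for categorical attributes under the independence assumption above (Laplace smoothing and the helper names train_nb/classify_nb are illustrative, not from the dissertation):

from collections import Counter, defaultdict
import math

def train_nb(instances):
    """instances: list of (attribute_tuple, class_label).
    Collect class priors and per-class value counts for each attribute position."""
    priors = Counter(label for _, label in instances)
    counts = defaultdict(Counter)        # (class_label, attribute_index) -> value counts
    for attrs, label in instances:
        for i, value in enumerate(attrs):
            counts[(label, i)][value] += 1
    return priors, counts, len(instances)

def classify_nb(attrs, priors, counts, n):
    """Pick the class maximizing log P(C) + sum_i log P(E_i | C), with smoothing."""
    def log_posterior(label):
        score = math.log(priors[label] / n)
        for i, value in enumerate(attrs):
            c = counts[(label, i)]
            score += math.log((c[value] + 1) / (sum(c.values()) + len(c) + 1))
        return score
    return max(priors, key=log_posterior)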
26. Classification evaluation
- Hold-out estimate
- The training set and the testing set used are mutually independent.
- Repeatedly resample the dataset to calculate hold-out estimates by randomly reordering it and partitioning it into a training set (2/3) and a test set (1/3).
- Compute the average and standard deviation of the accuracy.
- Cross-validation (see the sketch after this list)
- Randomly reorder the data set and then split it into n folds of equal size.
- In each iteration, 1 fold is used for testing and n-1 folds for training.
- Accuracy is averaged over the test results of all folds.
- The folds can be random, or modified so that the class distribution in each fold reflects that of the complete dataset (stratified cross-validation).
- Leave-one-out (LOO) cross-validation
- The number of folds n equals the number of examples.
- Less reliable results.
- Used with small datasets.
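A plain (non-stratified) n-fold cross-validation sketch, as a reference point for the procedure above (the train_fn/predict_fn hooks are hypothetical, standing in for any classifier):

import random

def cross_validate(instances, train_fn, predict_fn, n_folds=10, seed=0):
    """instances: list of (features, label). Returns mean accuracy over the folds."""
    data = instances[:]
    random.Random(seed).shuffle(data)                      # randomly reorder
    folds = [data[i::n_folds] for i in range(n_folds)]     # n folds of ~equal size
    accuracies = []
    for i in range(n_folds):
        test = folds[i]                                    # 1 fold for testing
        train = [x for j, f in enumerate(folds) if j != i for x in f]  # n-1 for training
        model = train_fn(train)
        correct = sum(predict_fn(model, feats) == label for feats, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds                       # accuracy averaged over folds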
27. System Description
- Focus: enrich a topic taxonomy with good-quality keywords.
- Extract URLs from the ODP and download the pages.
- Process these pages by extracting keywords.
- Assess the keywords gathered for each document and coalesce them into keywords for the category to which the Web page belongs in the topic taxonomy.
28. Design Issues
- Scalable: process a large corpus of documents.
- Flexible: incorporate more documents and additively process them without requiring a reprocessing of the whole document body.
- Highly customizable: use different system parameters.
- Modular: different algorithms can be used, plug-and-play.
29. The taxonomy used
- Acquired from the Open Directory Project (ODP).
- Indexes over 4 million sites catalogued in 590,000 categories.
- Initially named Gnuhoo and then NewHoo, by Skrenta and Truel (1998).
- Within 2 weeks: 200 editors, 27,000 sites and 2,000 categories.
- Within 5 weeks: acquired by Netcenter.
- Basic idea: as the Internet expands, the number of Internet users also grows, and these users can each help organise a small portion of the Web.
- Available as a Resource Description Framework (RDF) dump.
30. System Architecture
- The system comprises several modules.
- Each module takes as input the results of the previous module in line and outputs an Index.
- Each module of our system is composed of:
- a Worker class, which performs the processing tasks for each step,
- an Index Creator class that coordinates Workers by delegating tasks,
- an Indexer class that outputs the results of each module to a file in binary form,
- an Index Reader class that reloads the Index from the file back into memory.
- The whole procedure is split into two phases:
- Document Keyword Extraction; result: a catalog with a set of keywords for each document of the Information Repository,
- Topic Keyword Extraction; result: a keyword catalog for each topic in the hierarchical taxonomy.
31. System Architecture (diagram)
32. Document Keyword Extraction (diagram)
33. Preprocessing
- Extracts a list of URLs from the ODP taxonomy down to a specified depth.
- Downloads the Web pages.
- Assigns a unique number (docID) to each page.
34. Forward Index Creator
- Processes Web pages.
- Parses each file and extracts keywords.
- Ignores words in a stop word list of 336 common English words.
- Performs stemming.
- Calculates local weights.
- Produces a Forward Index and a Topics Index.
35. Stemming
- Reduce all morphological variants of a word to a single index term.
- Porter Stemming Algorithm
- Five steps of reduction rules, applied to the word sequentially; within a step, the rule chosen is the one that applies to the longest suffix. Examples of the trimming performed are:
- processes → process
- policies → polici
- emotional → emotion
- e.g. if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix IZE
- BINARIZATION → BINARIZE
36. Local Weight
- Calculate the local weight of each word in each document.
- Consider the HTML tag within which a word appears. Group tags into 7 classes. Assign each class an importance factor called CI (Class Importance).
- Calculate the frequency of the word in each class and produce a Term Frequency Vector (TFV).
- Local weight L_ij is computed from the TFV and the class-importance factors (a sketch follows below).
37. Forward Index Creator (diagram: Web documents → Forward Index Creator → Forward Index + Topics Index)
38. Lexicon Creator
- Processes the Forward Indexes.
- For each word, it calculates the global weight.
- Produces a Lexicon Index.
- The global weight is the Inverse Document Frequency (IDF).
- IDF scales down words that appear in many documents.
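A one-function illustration of the usual IDF form (assuming the common log(N/df) variant; the dissertation's exact formula may differ):

import math

def idf(total_documents, document_frequency):
    """Global weight G_i: scales down terms that occur in many documents."""
    return math.log(total_documents / document_frequency)

print(idf(1_000_000, 10))       # rare word   -> high global weight
print(idf(1_000_000, 900_000))  # common word -> weight close to 0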
39. Lexicon Creator (diagram: Forward Indexes → Lexicon Creator → Lexicon Index)
40. Normalization Index Creator
- Processes the Forward Indexes and the Lexicon.
- For each document, calculates the normalization length.
- Produces a Normalization Index.
- Cosine normalization compensates for document length.
- This data transformation essentially projects each vector onto the unit Euclidean sphere.
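A small sketch of the cosine normalization factor N_j and the resulting unit-length vector (illustrative only):

import math

def normalization_factor(weights):
    """N_j = 1 / ||d_j||: multiplying each term weight by N_j projects the
    document vector onto the unit Euclidean sphere."""
    length = math.sqrt(sum(w * w for w in weights))
    return 1.0 / length if length else 0.0

doc_weights = [0.5, 1.2, 0.3]
n_j = normalization_factor(doc_weights)
print([w * n_j for w in doc_weights])   # normalized vector with Euclidean length 1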
41Normalization Index Creator
FwIndexes
Normalization Index Creator
Lexicon
Normalization Index
42. Weight Calculator
- Processes the Forward Indexes, the Lexicon and the Normalization Indexes.
- Calculates the final weight of each word in each document: w_ij = L_ij * G_i * N_j.
43. Weight Calculator (diagram: Forward Indexes + Lexicon + Normalization Indexes → Weight Calculator → weighted Forward Index)
44. Topic Keyword Extraction (diagram)
45. Forward Index and Topics Index Sorter
- Must match each document's set of keywords to a topic's set of keywords.
- Must replace each docID with the corresponding topicID.
- The indexes have to be sorted to do this efficiently.
- The volume of the indexes to be processed is large; the data cannot fit into main memory.
- External sorting is used.
46. External Sorting
- Repetitively load into memory a segment of the file, sort it with an algorithm like Quicksort, and write the sorted data back to the file.
- Merge the sorted blocks with MergeSort by making several passes through the file, creating consecutively larger sorted blocks until the whole file is sorted.
- The most common scheme is "k-way merging" (sketched below).
A 4-way merge on 16 runs
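A compact sketch of the sort-then-k-way-merge idea, with in-memory lists standing in for the sorted run files (heapq.merge performs the k-way merge; run_size and the function name are illustrative):

import heapq
from itertools import islice

def external_sort(records, run_size=1000):
    """Sort a record stream that does not fit in memory: sort fixed-size runs
    individually, then k-way merge the sorted runs."""
    records = iter(records)
    runs = []
    while True:
        run = sorted(islice(records, run_size))   # load one segment and sort it
        if not run:
            break
        runs.append(run)                          # each run would be a file on disk
    return heapq.merge(*runs)                     # k-way merge of all sorted runs

print(list(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3)))  # [1, 2, 3, 5, 7, 8, 9]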
47. Topics ID Replacer
- Replaces each docID in the sorted Forward Index with the corresponding topicID.
48. Topics ID Replacer (diagram: TopicsID Replacer → Topics Keywords Index)
49. Topics Keywords Index Sorter
- Sorts the Topics Keywords Index by topicID.
50. Topics Records Merger
- Processes the Topics Keywords Index.
- Coalesces multiple records with the same (topicID, wordID) by calculating the mean value of the weights:

topicID, wordID, (weight_1, ..., weight_n)
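A sketch of this coalescing step over a sorted stream of (topicID, wordID, weight) records (itertools.groupby stands in for the sequential scan of the sorted index; the function name is illustrative):

from itertools import groupby
from statistics import mean

def merge_topic_records(sorted_records):
    """sorted_records: iterable of (topicID, wordID, weight), sorted by
    (topicID, wordID). Yields one record per key with the mean weight."""
    for (topic_id, word_id), group in groupby(sorted_records, key=lambda r: (r[0], r[1])):
        yield topic_id, word_id, mean(w for _, _, w in group)

records = [(7, 42, 0.25), (7, 42, 0.75), (7, 99, 0.5)]
print(list(merge_topic_records(records)))   # [(7, 42, 0.5), (7, 99, 0.5)]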
51. Evaluating Results
- Extracted URLs from the ODP whose topics are up to depth 6:
- Root-Topic / Sub-topic1 / Sub-topic2 / Sub-topic3 / Sub-topic4 / Sub-topic5 / Sub-topic6
- Extracted 1,153,086 such URLs.
- Downloaded those pages using the eRACE crawler. The crawler managed to download 1,046,021 webpages.
- Of these pages, 257,978 were under the topic Top/World (the majority are non-English).
- For the processing of these pages, we used a shared (non-exclusive) Solaris machine with 4 CPUs and 8 GB of memory. We executed the Java program with 4 GB of memory.
52-54. Experiments and Results (result charts)
55. Experiments and Results
- In order to evaluate the keywords produced, we try to classify documents based on them.
- The tool used was Weka.
- Weka is a collection of machine learning algorithms for data mining tasks.
- It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
- Open source software issued under the GNU General Public License.
- Input data are represented in ARFF (Attribute-Relation File Format).
- An ARFF file includes a collection of instances that share a set of common characteristics (features).
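A tiny illustration of what such an ARFF file looks like, written from Python (the attribute and topic names and the weights are made up, not taken from the experiments):

# Each @attribute is a keyword-weight feature; the last attribute is the topic
# (class) label; each @data row is one document instance.
arff_example = """@relation topic_keywords
@attribute kw_actor numeric
@attribute kw_film numeric
@attribute topic {Arts_Movies_Awards, Arts_Movies_Genres}
@data
0.42, 0.00, Arts_Movies_Awards
0.00, 0.31, Arts_Movies_Genres
"""
with open("experiment.arff", "w") as f:
    f.write(arff_example)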
56. Experiment 1
- Select only the web documents located under the topic /Top/Arts/Movies/.
- This reduces the dataset to 28,102 webpages, 5,756 unique topics and 123,253 wordIDs.
- A minimum threshold of 0.01 was set on the weight of a word.
- This reduced the keyword set to 19,137.
- Since this is still a big number of features, we selected the 40 most important keywords.
- For the majority of topics there were only a few webpages as samples.
- We selected topics for which there were at least 10 webpages.
- This reduced the set of topics from 5,756 to 93 and the number of instances (webpages) in the dataset from 28,102 to 2,432.
57. Experiment 1
- Topic appearance counts before selection (chart)
58. Experiment 1 Classification Results (results figure)
59. Experiment 2
- The dataset includes documents that belong to topics of depth 2 in the taxonomy.
- The result is a dataset of 3,939 documents, 194 topics and 16,373 words.
- We produced an ARFF file where each record represents a document and states the weight of each of the 16,373 keywords in the particular document, as well as the topic to which the document belongs.
60. Experiment 2 Classification Results (results figure)
61. Experiment 2 (NN, k=5)
- TP Rate: the fraction of positive instances predicted as positive.
- FP Rate: the fraction of negative instances predicted as positive.
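In the usual confusion-matrix notation (TP, FP, TN, FN):

TP Rate = TP / (TP + FN)
FP Rate = FP / (FP + TN)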
62. Experiment 2 (NN, k=1)
63. Conclusions
- We have proposed a system that takes advantage of the content of Web pages, using well-known IR techniques to extract keywords for enriching a large-scale taxonomy.
- The challenge was to build a scalable, flexible and highly customizable system.
- We have tested the system on a corpus of over 1,000,000 documents with satisfactory performance and speed.
- The algorithms are designed to avoid large in-memory processing.
- The memory required is proportional to the size of the lexicon (number of words) and not to the number of documents.
- The system is able to incorporate more documents and additively process them without requiring a reprocessing of the whole document corpus.
- It is customizable and modular enough to use different parameters and IR algorithms. This gives the ability to compare the results of different techniques or to better tune parameter values.
- The results presented indicate the good quality of the keywords produced, although there seems to be room for improvement.
64. Future Work
- We believe there is a lot of work that can be done in improving and exploiting this tool.
- Reduce the processing time by improving the algorithms used.
- Better management of Input/Output tasks, e.g. the deployment of Java's memory-mapped I/O functionality or the use of lower-level languages to perform such tasks.
- Recognize the language of each Internet page, and therefore use the appropriate stemmer and stop word list for each one.
- Extract the meta-data assigned by the volunteer editors from the ODP dump.
- We must investigate the optimal set of attributes that can describe a given topic and maximize the quality of the application, as well as which parameters can affect this decision.
- The large database of keywords for each category can be exploited in different ways:
- Help automatically catalog documents into a topic hierarchy, while also complementing the documents with metadata.
- A useful tool for statistically analyzing large hierarchies like ODP and Yahoo!.
- It can be used to assign a topic to user queries and thus filter out irrelevant topics.
65. Future Work
- A more sophisticated version of the tool could also exploit a thesaurus like WordNet. This could reinforce the semantic meaning of the results.
- We could study methodologies for diffusing keywords up or down the hierarchy.
- We will need visualization tools for the presentation of the results.
- The system could also be provided as an Internet service where a user could query the keywords of a particular topic or sub-tree. The results could be presented using some visualization technique.
66. THANK YOU FOR YOUR ATTENTION.
- QUESTIONS?