Title: Mining Keywords from Large Topic Taxonomies
1. Mining Keywords from Large Topic Taxonomies
- Dissertation Presentation
12 November 2004
Christiana Christophi
Supervisor: Dr. Marios Dikaiakos
2. Outline
- Goal
- Motivation
- Related Work and Technology
- System Description
- Evaluation
- Conclusions and Future Work
3. Goal
- This dissertation researches the area of Information Retrieval and Web Mining in order to create a system that dynamically enriches a large topic taxonomy with a set of related keywords at each topic node.
4. What is a taxonomy?
- A structure that organizes large text databases hierarchically by topic to help searching, browsing and filtering.
- Many corpora are organized into taxonomies, like
- internet directories (Yahoo)
- digital libraries (Library of Congress Catalogue)
- patent databases (IBM patent database)
5. Motivation
- There is no publicly available general-purpose keyword taxonomy.
- Construction of hierarchical keyword taxonomies requires continuous and intensive human effort, and the result has restricted size and subject coverage.
- Our approach: exploit an existing taxonomy and use the contents of the pages within it to produce a general-purpose keyword catalog.
- Classifying documents is a slow process.
- A classification task requires an example document set for each topic and then performs classification based on all the content of the example set.
- Our approach: classify a document based on the keywords of each topic node.
- Drawbacks of keyword-based searches.
- Search engines simply return results that contain the user's query terms.
- Our approach: the catalog could help build a taxonomy classifier.
6. Synonymy and Polysemy
- Synonymy
- the same concept can be expressed using different sets of terms
- e.g. classification / categorization / taxonomy
- negatively affects recall
- Polysemy
- identical terms can be used in very different semantic contexts
- e.g. jaguar
- negatively affects precision
7. Precision and Recall
- Precision
- Represents the percentage of retrieved documents that are relevant to the query.
- Recall
- Represents the percentage of relevant documents that were retrieved.
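In the usual set notation (standard IR definitions, stated here for completeness):

Precision = |relevant ∩ retrieved| / |retrieved|
Recall = |relevant ∩ retrieved| / |relevant|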
8. Related Work
- Extraction of keywords for classification or link analysis.
- The corpus of documents that was processed was relatively small.
- WebKB: a knowledge base consisting of only 8,282 pages describing university people, courses and research projects.
- Hoover's data set, consisting of corporate information on 4,285 companies.
- Datasets are created by extracting sub-trees from Web directories like Yahoo! and the ODP hierarchy (aka Dmoz).
- e.g. the study of topic-based properties of the Web by Chakrabarti uses a 482-topic Dmoz collection as an initial set.
- The ACCIO system by Davidov et al. in "Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory", which performs automatic acquisition of labeled datasets for text categorization.
9. Related Work
- Hierarchical classification by Chakrabarti
- An effort to select features for text documents.
- Two models of documents:
- Bernoulli: the number of documents a term appears in
- Binary: a term either appears or not in a document
- Focus is on classification performance rather than better keyword extraction.
10. Information Retrieval
- Information Retrieval (IR) constructs an index for a given corpus and responds to queries by retrieving all the relevant corpus objects and as few non-relevant objects as possible.
- index a collection of documents (access efficiency)
- given a user's query:
- rank documents by importance (accuracy)
- determine a maximally related subset of documents
11. Indexing
- An index structure is a hash-indexed or B-tree-indexed table.
- This table consists of a set of records, each containing two fields: ID and posting_list.
- Objects from the corpus point to a list of lexical items.
- Inverted index
- effective for very large collections of documents
- associates lexical items to their occurrences in the collection
- A dictionary example:
- each key is a term t ∈ V, where V is the vocabulary
- the associated value p(t) points to a posting list
12. Index (figure)
13. Inverted Index (figure)
14. IR Models
- Statistical: use statistical measures to rank the documents that best satisfy the query terms.
- Boolean model
- Documents are represented by a set of index terms
- Each term is viewed as a Boolean variable (e.g. TRUE = term is present)
- Vector space model (VSM)
- Semantic: use syntactic and semantic analysis to simulate human understanding of text.
- Natural Language Processing (NLP)
- Match the semantic content of query and documents.
- Latent Semantic Indexing (LSI)
- Reduces the space so that each dimension in the reduced space tends to represent terms that are more likely to co-occur in the collection.
15. Statistical IR Models
- Theoretic models
- Boolean model
- Documents are represented by a set of index terms
- Each term is viewed as a Boolean variable (e.g. TRUE = term is present)
- A user query is a Boolean expression; the documents retrieved are those that satisfy the expression.
- Terms are combined using Boolean operators like AND, OR and NOT.
- No ranking of the results is possible: a document either satisfies the query or not.
16. Vector Space Model (VSM)
- Documents are represented as vectors in a vector space, where each unique term represents one dimension of that space.
- A vector-space representation of m text documents is as follows:
- Let n denote the number of terms in the vocabulary.
- Consider an n x m matrix A, whose entry a_ij indicates either the presence or absence of term i in document j.
- The presence of a word can be denoted either by a 1 or by a numeric weight.
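A small illustration of the n x m term-document matrix with binary entries (toy vocabulary and documents, assuming numpy; not from the dissertation):

import numpy as np

# Toy vocabulary (n terms) and corpus (m documents); illustrative only.
vocabulary = ["keyword", "taxonomy", "topic", "web"]
documents = ["web taxonomy of topic nodes", "keyword extraction for web pages"]

# n x m matrix A: a_ij = 1 if term i is present in document j (binary VSM).
A = np.zeros((len(vocabulary), len(documents)))
for j, doc in enumerate(documents):
    tokens = set(doc.split())
    for i, term in enumerate(vocabulary):
        A[i, j] = 1.0 if term in tokens else 0.0

print(A)   # rows = terms, columns = documents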
17. Semantic IR Models
- Natural Language Processing (NLP)
- Match the semantic content of query and documents.
- Indexing phrases (improves precision), thesaurus groups (improve recall)
- Latent Semantic Indexing (LSI)
- A vector information retrieval method which, in combination with singular value decomposition (SVD), is used for dimensionality reduction.
- Each dimension in the reduced space tends to represent terms that are more likely to co-occur in the collection.
- LSI considers documents that have many words in common as semantically close, and documents with few words in common as semantically distant.
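A compact LSI sketch via truncated SVD (a numpy-based illustration, not the dissertation's implementation; A is a term-document matrix as in the VSM example, and lsi_project is a hypothetical helper name):

import numpy as np

def lsi_project(A, k):
    """Project an n x m term-document matrix A into a k-dimensional
    latent space using a truncated SVD: A ~ U_k * S_k * Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Columns of doc_coords are the documents in the reduced space;
    # terms that co-occur tend to load on the same latent dimensions.
    doc_coords = np.diag(s_k) @ Vt_k
    return U_k, doc_coords

# Example: reduce a random 6-term x 4-document matrix to 2 dimensions.
A = np.random.rand(6, 4)
term_coords, doc_coords = lsi_project(A, k=2)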
18. Term Weighting
- A term that appears many times within a document is likely to be more important than a term that appears only once.
- A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
- Large documents must be treated equally to smaller documents.
19. Term Weighting
- VSM weighting
- Term Frequency Element (local weight of term i in document j): L_ij
- Collection Frequency Element (global weight of term i): G_i
- Normalization Length Element (normalization factor for document j): N_j

w_ij = L_ij * G_i * N_j (illustrated below)
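A compact sketch of this three-factor weighting, assuming common choices for each factor (L = 1 + log tf, G = IDF, N = cosine normalization); the dissertation's exact variants may differ:

import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = L_ij * G_i * N_j with L = 1 + log(tf), G = idf, N = cosine norm."""
    m = len(docs)
    tfs = [Counter(doc.split()) for doc in docs]
    df = Counter(t for tf in tfs for t in tf)            # document frequency
    idf = {t: math.log(m / df[t]) for t in df}           # global weight G_i
    weights = []
    for tf in tfs:
        w = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}   # L_ij * G_i
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0      # 1 / N_j
        weights.append({t: v / norm for t, v in w.items()})
        # each document vector now lies on the unit Euclidean sphere
    return weights

print(tfidf_weights(["topic taxonomy keywords", "web keywords keywords"]))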
20. Local Weight (formula figure)
21. Global Weight (formula figure)
22. Normalization Length (formula figure)
23. Web Mining
- Web Mining: the use of Data Mining techniques to automatically discover and extract information from WWW documents and services.
- The Web provides a fertile ground for data mining: an immense and dynamic collection of pages with countless hyperlinks and huge volumes of access and usage information.
- Searching, comprehending, and using the semistructured information stored on the Web poses a significant challenge because:
- Web page complexity exceeds the complexity of traditional text collections.
- The Web constitutes a highly dynamic information source.
- The Web serves a broad spectrum of user communities.
- Only a small portion of Web pages contain truly relevant or useful information.
- Data mining supplements keyword-based indexing, which is the cornerstone of Web search engines.
- Classification maps data into predefined groups or classes.
24. Classification schemes
- Nearest Neighbour (NN)
- Index the training documents.
- Given a document d, fetch the k training documents that are closest to d.
- The class that occurs most often among them is assigned to d.
25. Classification schemes
- Naïve Bayes (NB)
- Naïve assumption: attribute independence
- P(E_1, ..., E_k | C) = P(E_1 | C) * ... * P(E_k | C)
- C: class value of the instance
- E: evidence (the instance's attributes)
- Support Vector Machines (SVM)
- Construct a direct function from term space to the class variable.
- The ability to learn can be independent of the dimensionality of the feature space.
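A toy Naïve Bayes sketch for categorical attributes under the independence assumption above (Laplace smoothing and the helper names train_nb/classify_nb are illustrative, not from the dissertation):

from collections import Counter, defaultdict
import math

def train_nb(instances):
    """instances: list of (attribute_tuple, class_label).
    Collect class priors and per-class value counts for each attribute position."""
    priors = Counter(label for _, label in instances)
    counts = defaultdict(Counter)        # (class_label, attribute_index) -> value counts
    for attrs, label in instances:
        for i, value in enumerate(attrs):
            counts[(label, i)][value] += 1
    return priors, counts, len(instances)

def classify_nb(attrs, priors, counts, n):
    """Pick the class maximizing log P(C) + sum_i log P(E_i | C), with smoothing."""
    def log_posterior(label):
        score = math.log(priors[label] / n)
        for i, value in enumerate(attrs):
            c = counts[(label, i)]
            score += math.log((c[value] + 1) / (sum(c.values()) + len(c) + 1))
        return score
    return max(priors, key=log_posterior)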
26. Classification evaluation
- Hold-out estimate
- The training set and the testing set used are mutually independent.
- Repeatedly resample the dataset to calculate hold-out estimates by randomly reordering it and partitioning it into a training set (2/3) and a test set (1/3).
- Compute the average and standard deviation of the accuracy.
- Cross-validation (see the sketch after this list)
- Randomly reorder the data set and then split it into n folds of equal size.
- In each iteration, 1 fold is used for testing and n-1 folds for training.
- Accuracy is averaged over the test results of all folds.
- The folds can be random, or modified so that the class distribution in each fold reflects that of the complete dataset (stratified cross-validation).
- Leave-one-out (LOO) cross-validation
- The number of folds n equals the number of examples.
- Less reliable results.
- Used with small datasets.
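A plain (non-stratified) n-fold cross-validation sketch, as a reference point for the procedure above (the train_fn/predict_fn hooks are hypothetical, standing in for any classifier):

import random

def cross_validate(instances, train_fn, predict_fn, n_folds=10, seed=0):
    """instances: list of (features, label). Returns mean accuracy over the folds."""
    data = instances[:]
    random.Random(seed).shuffle(data)                      # randomly reorder
    folds = [data[i::n_folds] for i in range(n_folds)]     # n folds of ~equal size
    accuracies = []
    for i in range(n_folds):
        test = folds[i]                                    # 1 fold for testing
        train = [x for j, f in enumerate(folds) if j != i for x in f]  # n-1 for training
        model = train_fn(train)
        correct = sum(predict_fn(model, feats) == label for feats, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds                       # accuracy averaged over folds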
27. System Description
- Focus: enrich a topic taxonomy with good-quality keywords.
- Extract URLs from the ODP and download the pages.
- Process these pages by extracting keywords.
- Assess the keywords gathered for each document and coalesce them into keywords for the category to which the Web page belongs in the topic taxonomy.
28. Design Issues
- Scalable: process a large corpus of documents.
- Flexible: incorporate more documents and additively process them without requiring a reprocessing of the whole document body.
- Highly customizable: use different system parameters.
- Modular: different algorithms can be used, plug-and-play.
29. The taxonomy used
- Acquired from the Open Directory Project (ODP).
- Indexes over 4 million sites catalogued in 590,000 categories.
- Initially named Gnuhoo and then NewHoo, by Skrenta and Truel (1998).
- Within 2 weeks: 200 editors, 27,000 sites and 2,000 categories.
- Within 5 weeks: acquired by Netcenter.
- Basic idea: as the Internet expands, the number of Internet users also grows, and these users can each help organise a small portion of the Web.
- Available as a Resource Description Framework (RDF) dump.
30. System Architecture
- The system comprises several modules.
- Each module takes as input the results of the previous module in line and outputs an Index.
- Each module of our system is composed of:
- a Worker class, which performs the processing tasks for each step,
- an Index Creator class that coordinates Workers by delegating tasks,
- an Indexer class that outputs the results of each module to a file in binary form,
- an Index Reader class that reloads the Index from the file back into memory.
- The whole procedure is split into two phases:
- Document Keyword Extraction; result: a catalog with a set of keywords for each document of the Information Repository,
- Topic Keyword Extraction; result: a keyword catalog for each topic in the hierarchical taxonomy.
31. System Architecture (diagram)
32. Document Keyword Extraction (diagram)
33. Preprocessing
- Extracts a list of URLs from the ODP taxonomy down to a specified depth.
- Downloads the Web pages.
- Assigns a unique number (docID) to each page.
34. Forward Index Creator
- Processes Web pages.
- Parses each file and extracts keywords.
- Ignores words in a stop word list of 336 common English words.
- Performs stemming.
- Calculates local weights.
- Produces a Forward Index and a Topics Index.
35. Stemming
- Reduce all morphological variants of a word to a single index term.
- Porter Stemming Algorithm
- Five steps of reduction rules, applied to the word sequentially; within a step, the rule chosen is the one that applies to the longest suffix. Examples of the trimming performed are:
- processes → process
- policies → polici
- emotional → emotion
- e.g. if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix IZE
- BINARIZATION → BINARIZE
36. Local Weight
- Calculate the local weight of each word in each document.
- Consider the HTML tag within which a word appears. Group tags into 7 classes. Assign each class an importance factor called CI (Class Importance).
- Calculate the frequency of the word in each class and produce a Term Frequency Vector (TFV).
- Local weight L_ij is computed from the TFV and the class-importance factors (a sketch follows below).
37. Forward Index Creator (diagram: Web documents → Forward Index Creator → Forward Index + Topics Index)
38. Lexicon Creator
- Processes the Forward Indexes.
- For each word, it calculates the global weight.
- Produces a Lexicon Index.
- The global weight is the Inverse Document Frequency (IDF).
- IDF scales down words that appear in many documents.
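A one-function illustration of the usual IDF form (assuming the common log(N/df) variant; the dissertation's exact formula may differ):

import math

def idf(total_documents, document_frequency):
    """Global weight G_i: scales down terms that occur in many documents."""
    return math.log(total_documents / document_frequency)

print(idf(1_000_000, 10))       # rare word   -> high global weight
print(idf(1_000_000, 900_000))  # common word -> weight close to 0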
39. Lexicon Creator (diagram: Forward Indexes → Lexicon Creator → Lexicon Index)
40. Normalization Index Creator
- Processes the Forward Indexes and the Lexicon.
- For each document, calculates the normalization length.
- Produces a Normalization Index.
- Cosine normalization compensates for document length.
- This data transformation essentially projects each vector onto the unit Euclidean sphere.
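A small sketch of the cosine normalization factor N_j and the resulting unit-length vector (illustrative only):

import math

def normalization_factor(weights):
    """N_j = 1 / ||d_j||: multiplying each term weight by N_j projects the
    document vector onto the unit Euclidean sphere."""
    length = math.sqrt(sum(w * w for w in weights))
    return 1.0 / length if length else 0.0

doc_weights = [0.5, 1.2, 0.3]
n_j = normalization_factor(doc_weights)
print([w * n_j for w in doc_weights])   # normalized vector with Euclidean length 1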
41Normalization Index Creator
FwIndexes
Normalization Index Creator
Lexicon
Normalization Index
42. Weight Calculator
- Processes the Forward Indexes, the Lexicon and the Normalization Indexes.
- Calculates the final weight of each word in each document: w_ij = L_ij * G_i * N_j.
43. Weight Calculator (diagram: Forward Indexes + Lexicon + Normalization Indexes → Weight Calculator → weighted Forward Index)
44. Topic Keyword Extraction (diagram)
45. Forward Index and Topics Index Sorter
- Must match each document's set of keywords to a topic's set of keywords.
- Must replace each docID with the corresponding topicID.
- The indexes have to be sorted to do this efficiently.
- The volume of the indexes to be processed is large; the data cannot fit into main memory.
- External sorting is used.
46. External Sorting
- Repetitively load into memory a segment of the file, sort it with an algorithm like Quicksort, and write the sorted data back to the file.
- Merge the sorted blocks with MergeSort by making several passes through the file, creating consecutively larger sorted blocks until the whole file is sorted.
- The most common scheme is "k-way merging" (sketched below).
A 4-way merge on 16 runs
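A compact sketch of the sort-then-k-way-merge idea, with in-memory lists standing in for the sorted run files (heapq.merge performs the k-way merge; run_size and the function name are illustrative):

import heapq
from itertools import islice

def external_sort(records, run_size=1000):
    """Sort a record stream that does not fit in memory: sort fixed-size runs
    individually, then k-way merge the sorted runs."""
    records = iter(records)
    runs = []
    while True:
        run = sorted(islice(records, run_size))   # load one segment and sort it
        if not run:
            break
        runs.append(run)                          # each run would be a file on disk
    return heapq.merge(*runs)                     # k-way merge of all sorted runs

print(list(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3)))  # [1, 2, 3, 5, 7, 8, 9]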
47. Topics ID Replacer
- Replaces each docID in the sorted Forward Index with the corresponding topicID.
48. Topics ID Replacer (diagram: TopicsID Replacer → Topics Keywords Index)
49. Topics Keywords Index Sorter
- Sorts the Topics Keywords Index by topicID.
50. Topics Records Merger
- Processes the Topics Keywords Index.
- Coalesces multiple records with the same (topicID, wordID) by calculating the mean value of the weights:

topicID, wordID, (weight_1, ..., weight_n)
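A sketch of this coalescing step over a sorted stream of (topicID, wordID, weight) records (itertools.groupby stands in for the sequential scan of the sorted index; the function name is illustrative):

from itertools import groupby
from statistics import mean

def merge_topic_records(sorted_records):
    """sorted_records: iterable of (topicID, wordID, weight), sorted by
    (topicID, wordID). Yields one record per key with the mean weight."""
    for (topic_id, word_id), group in groupby(sorted_records, key=lambda r: (r[0], r[1])):
        yield topic_id, word_id, mean(w for _, _, w in group)

records = [(7, 42, 0.25), (7, 42, 0.75), (7, 99, 0.5)]
print(list(merge_topic_records(records)))   # [(7, 42, 0.5), (7, 99, 0.5)]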
51. Evaluating Results
- Extracted URLs from the ODP whose topics are up to depth 6:
- Root-Topic / Sub-topic1 / Sub-topic2 / Sub-topic3 / Sub-topic4 / Sub-topic5 / Sub-topic6
- Extracted 1,153,086 such URLs.
- Downloaded those pages using the eRACE crawler. The crawler managed to download 1,046,021 webpages.
- Of these pages, 257,978 were under the topic Top/World (the majority are non-English).
- For the processing of these pages, we used a shared (non-exclusive) Solaris machine with 4 CPUs and 8 GB of memory. We executed the Java program with 4 GB of memory.
52-54. Experiments and Results (result charts)
55. Experiments and Results
- In order to evaluate the keywords produced, we try to classify documents based on them.
- The tool used was Weka.
- Weka is a collection of machine learning algorithms for data mining tasks.
- It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
- Open source software issued under the GNU General Public License.
- Input data are represented in ARFF (Attribute-Relation File Format).
- An ARFF file includes a collection of instances that share a set of common characteristics (features).
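A tiny illustration of what such an ARFF file looks like, written from Python (the attribute and topic names and the weights are made up, not taken from the experiments):

# Each @attribute is a keyword-weight feature; the last attribute is the topic
# (class) label; each @data row is one document instance.
arff_example = """@relation topic_keywords
@attribute kw_actor numeric
@attribute kw_film numeric
@attribute topic {Arts_Movies_Awards, Arts_Movies_Genres}
@data
0.42, 0.00, Arts_Movies_Awards
0.00, 0.31, Arts_Movies_Genres
"""
with open("experiment.arff", "w") as f:
    f.write(arff_example)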
56. Experiment 1
- Select only the web documents located under the topic /Top/Arts/Movies/.
- This reduces the dataset to 28,102 webpages, 5,756 unique topics and 123,253 wordIDs.
- A minimum threshold of 0.01 was set on the weight of a word.
- This reduced the keyword set to 19,137.
- Since this is still a big number of features, we selected the 40 most important keywords.
- For the majority of topics there were only a few webpages as samples.
- We selected topics for which there were at least 10 webpages.
- This reduced the set of topics from 5,756 to 93 and the number of instances (webpages) in the dataset from 28,102 to 2,432.
57. Experiment 1
- Topic appearance counts before selection (chart)
58. Experiment 1 Classification Results (results figure)
59. Experiment 2
- The dataset includes documents that belong to topics of depth 2 in the taxonomy.
- The result is a dataset of 3,939 documents, 194 topics and 16,373 words.
- We produced an ARFF file where each record represents a document and states the weight of each of the 16,373 keywords in the particular document, as well as the topic to which the document belongs.
60. Experiment 2 Classification Results (results figure)
61. Experiment 2 (NN, k=5)
- TP Rate: the fraction of positive instances predicted as positive.
- FP Rate: the fraction of negative instances predicted as positive.
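In the usual confusion-matrix notation (TP, FP, TN, FN):

TP Rate = TP / (TP + FN)
FP Rate = FP / (FP + TN)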
62. Experiment 2 (NN, k=1)
63. Conclusions
- We have proposed a system that takes advantage of the content of Web pages, using well-known IR techniques to extract keywords for enriching a large-scale taxonomy.
- The challenge was to build a scalable, flexible and highly customizable system.
- We have tested the system on a corpus of over 1,000,000 documents with satisfactory performance and speed.
- The algorithms are designed to avoid large in-memory processing.
- The memory required is proportional to the size of the lexicon (number of words) and not to the number of documents.
- The system is able to incorporate more documents and additively process them without requiring a reprocessing of the whole document corpus.
- It is customizable and modular enough to use different parameters and IR algorithms. This gives the ability to compare the results of different techniques or to better tune parameter values.
- The results presented indicate the good quality of the keywords produced, although there seems to be room for improvement.
64. Future Work
- We believe there is a lot of work that can be done in improving and exploiting this tool.
- Reduce the processing time by improving the algorithms used.
- Better management of Input/Output tasks, e.g. the deployment of Java's memory-mapped I/O functionality or the use of lower-level languages to perform such tasks.
- Recognize the language of each Internet page, and therefore use the appropriate stemmer and stop word list for each one.
- Extract the meta-data assigned by the volunteer editors from the ODP dump.
- We must investigate the optimal set of attributes that can describe a given topic and maximize the quality of the application, as well as which parameters can affect this decision.
- The large database of keywords for each category can be exploited in different ways:
- Help automatically catalog documents into a topic hierarchy, while also complementing the documents with metadata.
- A useful tool for statistically analyzing large hierarchies like ODP and Yahoo!.
- It can be used to assign a topic to user queries and thus filter out irrelevant topics.
65. Future Work
- A more sophisticated version of the tool could also exploit a thesaurus like WordNet. This could reinforce the semantic meaning of the results.
- We could study methodologies for diffusing keywords up or down the hierarchy.
- We will need visualization tools for the presentation of the results.
- The system could also be provided as an Internet service where a user could query the keywords of a particular topic or sub-tree. The results could be presented using some visualization technique.
66. THANK YOU FOR YOUR ATTENTION.
- QUESTIONS?