1
Mining Keywords from Large Topic Taxonomies
  • Dissertation Presentation

12 November 2004
Christiana Christophi
Supervisor: Dr. Marios Dikaiakos
2
Outline
  • Goal
  • Motivation
  • Related Work and Technology
  • System Description
  • Evaluation
  • Conclusions and Future Work

3
Goal
  • This dissertation researches the area of
    Information Retrieval and Web Mining in order to
    create a system that dynamically enriches a large
    topic taxonomy with a set of related keywords at
    each topic node.

4
What is a taxonomy?
  • A structure that organizes large text databases
    hierarchically by topic to help searching,
    browsing and filtering.
  • Many corpora are organized into taxonomies, such as:
  • internet directories (Yahoo)
  • digital libraries (Library of Congress Catalogue)
  • patent databases (IBM patent database)

5
Motivation
  • There is no publicly available general-purpose keyword taxonomy.
  • Constructing hierarchical keyword taxonomies requires continuous, intensive human effort, and the result has restricted size and subject coverage.
  • Our approach: Exploit an existing taxonomy and then use the contents of the pages within it to produce a general-purpose keyword catalog.
  • Classifying documents is a slow process.
  • A classification task requires an example document set for each topic; classification is then performed based on all the content of the example set.
  • Our approach: Classify the document based on the keywords of each topic node.
  • Keyword-based searches have drawbacks.
  • Search engines simply return results that contain the user's query terms.
  • Our approach: The catalog could help build a taxonomy classifier.

6
Synonymy and Polysemy
  • Synonymy
  • the same concept can be expressed using different
    sets of terms
  • e.g., classification, categorization, taxonomy
  • negatively affects recall
  • Polysemy
  • identical terms can be used in very different
    semantic contexts
  • e.g. jaguar
  • negatively affects precision

7
Precision and Recall
  • Precision
  • The percentage of retrieved documents that are relevant to the query.
  • Recall
  • The percentage of relevant documents that were retrieved.

8
Related Work
  • Extraction of keywords for classification or link
    analysis.
  • The corpus of documents that was processed was relatively small:
  • the WebKB knowledge base, which consists of only 8,282 pages describing university people, courses and research projects;
  • a Hoover's dataset consisting of corporate information on 4,285 companies.
  • Datasets are created by extracting sub-trees from Web directories like Yahoo! and the ODP hierarchy (aka Dmoz).
  • e.g., the study of topic-based properties of the Web by Chakrabarti uses a 482-topic Dmoz collection as its initial set.
  • The ACCIO system by Davidov et al. ("Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory"), which performs automatic acquisition of labeled datasets for text categorization.

9
Related Work
  • Hierarchical classification by Chakrabarti
  • Effort to select features for text documents.
  • Two models of documents:
  • Bernoulli: the number of documents a term appears in
  • Binary: a term either appears in a document or it does not
  • Focus on classification performance rather than
    better keyword extraction.

10
Information Retrieval
  • Information Retrieval (IR) constructs an index
    for a given corpus and responds to queries by
    retrieving all the relevant corpus objects and as
    few non-relevant objects as possible.
  • index a collection of documents (access
    efficiency)
  • given a user's query:
  • rank documents by importance (accuracy)
  • determine a maximally related subset of documents

11
Indexing
  • An index structure is a hash-indexed or B-tree-indexed table.
  • This table consists of a set of records, each containing two fields: ID and posting_list.
  • Objects from the corpus point to a list of lexical items.
  • Inverted index
  • effective for very large collections of documents
  • associates lexical items with their occurrences in the collection
  • A dictionary example (see the sketch below):
  • each key is a term t ∈ V, where V is the vocabulary
  • the associated value p(t) points to a posting list
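A minimal Java sketch of this dictionary-style inverted index; the class and method names are illustrative and not taken from the system's actual code:

```java
import java.util.*;

// Minimal inverted index: maps each term of the vocabulary V
// to its posting list, i.e. the IDs of the documents containing it.
public class InvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Add one document: record every term occurrence under its docID.
    public void addDocument(int docId, List<String> terms) {
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(docId);
        }
    }

    // p(t): the posting list associated with term t (empty if unseen).
    public List<Integer> postingList(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, Arrays.asList("taxonomy", "keyword"));
        idx.addDocument(2, Arrays.asList("keyword", "mining"));
        System.out.println(idx.postingList("keyword")); // [1, 2]
    }
}
```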

12
Index
13
Inverted Index
14
IR Models
  • Statistical: use statistical measures to rank the documents that best satisfy the query terms.
  • Boolean model
  • Documents are represented by a set of index terms.
  • Each term is viewed as a Boolean variable (e.g., TRUE = term is present).
  • Vector space model (VSM)
  • Semantic: use syntactic and semantic analysis to simulate human understanding of text.
  • Natural Language Processing (NLP)
  • Match the semantic content of queries and documents.
  • Latent Semantic Indexing (LSI)
  • Reduces the space so that each dimension in the reduced space tends to represent those terms that are more likely to co-occur in the collection.

15
Statistical IR Models
  • Set-theoretic models: the Boolean model
  • Documents are represented by a set of index terms.
  • Each term is viewed as a Boolean variable (e.g., TRUE = term is present).
  • The user query is a Boolean expression; the documents retrieved are those that satisfy it.
  • Terms are combined using Boolean operators such as AND, OR and NOT.
  • No ranking of the results is possible: a document either satisfies the query or it does not.

16
Vector space model (VSM)
  • Documents are represented by a set of vectors in
    a vector space, where each unique term represents
    one dimension of that space.
  • A vector-space representation of m text documents is as follows:
  • Let n denote the number of terms in the vocabulary.
  • Consider an n × m matrix A, whose entry a_ij indicates either the presence or absence of term i in document j.
  • The presence of a word can be denoted either by a 1 or by a numeric weight.
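For example, with n = 3 terms and m = 2 documents, a binary term-document matrix A might look like this (a_ij = 1 when term i appears in document j):

```latex
A =
\begin{pmatrix}
1 & 0 \\
1 & 1 \\
0 & 1
\end{pmatrix}
```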

17
Semantic IR Models
  • Natural Language Processing (NLP)
  • Match the semantic content of queries and documents.
  • Indexing phrases (improves precision) and thesaurus groups (improve recall).
  • Latent Semantic Indexing (LSI)
  • A vector-based information retrieval method that uses singular value decomposition (SVD) for dimensionality reduction.
  • Each dimension in the reduced space tends to represent those terms that are more likely to co-occur in the collection.
  • LSI treats documents that have many words in common as semantically close, and documents with few words in common as semantically distant.

18
Term Weighting
  • A term that appears many times within a document
    is likely to be more important than a term that
    appears only once
  • A term that occurs in a few documents is likely
    to be a better discriminator than a term that
    appears in most or all documents
  • Long documents must be treated on an equal footing with shorter documents.

19
Term Weighting
  • VSM weighting (see the sketch below)
  • Term Frequency Element (local weight of term i in document j): L_ij
  • Collection Frequency Element (global weight of term i): G_i
  • Normalization Length Element (normalization factor for document j): N_j

w_ij = L_ij × G_i × N_j
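A short Java sketch of this three-factor weighting. The slide fixes only the general form w_ij = L_ij · G_i · N_j; the concrete choices below (raw term frequency, IDF, cosine normalization) are common defaults used purely for illustration, not necessarily the exact variants used in the system:

```java
// Illustrative three-factor term weight: w_ij = L_ij * G_i * N_j.
public class TermWeighting {

    // Local weight L_ij: here simply the raw frequency of term i in document j.
    static double localWeight(int termFreq) {
        return termFreq;
    }

    // Global weight G_i: inverse document frequency,
    // log(N / df_i) with N documents and df_i documents containing term i.
    static double globalWeight(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Normalization factor N_j: cosine normalization,
    // 1 / sqrt(sum over terms of (L_ij * G_i)^2) for document j.
    static double normalization(double[] unnormalizedWeights) {
        double sumSquares = 0.0;
        for (double w : unnormalizedWeights) sumSquares += w * w;
        return 1.0 / Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        // One document with two terms: term frequencies and document frequencies.
        int numDocs = 1000;
        int[] tf = {3, 1};
        int[] df = {10, 500};

        double[] lw = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            lw[i] = localWeight(tf[i]) * globalWeight(numDocs, df[i]);
        }
        double n = normalization(lw);
        for (int i = 0; i < lw.length; i++) {
            System.out.printf("w_%d = %.4f%n", i, lw[i] * n);
        }
    }
}
```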
20
Local Weight
21
Global Weight
22
Normalization length
23
Web Mining
  • Web Mining: the use of data mining techniques to automatically discover and extract information from WWW documents and services.
  • The Web provides fertile ground for data mining: an immense and dynamic collection of pages with countless hyperlinks and huge volumes of access and usage information.
  • Searching, comprehending, and using the semistructured information stored on the Web poses a significant challenge because:
  • Web page complexity exceeds the complexity of traditional text collections.
  • The Web constitutes a highly dynamic information source.
  • The Web serves a broad spectrum of user communities.
  • Only a small portion of Web pages contains truly relevant or useful information.
  • Data mining supplements keyword-based indexing, which is the cornerstone of Web search engines.
  • Classification maps data into predefined groups or classes.

24
Classification schemes
  • Nearest Neighbour (NN)
  • Index the documents.
  • Given a document d, fetch the k training documents that are closest to d.
  • The class that occurs most often among them is assigned to d (see the sketch below).
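A compact Java sketch of the final voting step, assuming the k closest training documents have already been retrieved; the names are illustrative:

```java
import java.util.*;

public class NearestNeighbour {
    // Given the classes of the k training documents closest to d,
    // return the class that occurs most often among them.
    static String classify(List<String> kNearestClasses) {
        Map<String, Integer> votes = new HashMap<>();
        for (String c : kNearestClasses) {
            votes.merge(c, 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Topic labels of the k = 5 neighbours of document d.
        List<String> neighbours =
                Arrays.asList("Arts", "Arts", "Sports", "Arts", "Sports");
        System.out.println(classify(neighbours)); // Arts
    }
}
```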

25
Classification schemes
  • Naïve Bayes (NB)
  • Naïve assumption: attribute independence (see the sketch after this list)
  • P(E1, ..., Ek | C) = P(E1 | C) × ... × P(Ek | C)
  • C = class value of the instance
  • E = evidence (instance)
  • Support Vector Machines (SVM)
  • Construct a direct function from term space to
    the class variable.
  • Ability to learn can be independent of the
    dimensionality of the feature space.
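A toy Java sketch of naïve-Bayes scoring under the independence assumption listed above, computed in log space; the probability values are illustrative placeholders:

```java
import java.util.*;

public class NaiveBayesScore {
    // log P(C) + sum_k log P(E_k | C), using the attribute-independence
    // assumption P(E_1,...,E_k | C) = P(E_1 | C) * ... * P(E_k | C).
    static double logScore(double classPrior,
                           Map<String, Double> termLikelihoods,
                           List<String> evidence) {
        double score = Math.log(classPrior);
        for (String e : evidence) {
            // Small default probability for unseen evidence (crude smoothing).
            score += Math.log(termLikelihoods.getOrDefault(e, 1e-6));
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> likelihoods = new HashMap<>();
        likelihoods.put("movie", 0.05);
        likelihoods.put("actor", 0.02);
        double s = logScore(0.3, likelihoods, Arrays.asList("movie", "actor"));
        System.out.println("log score = " + s);
        // The class with the highest score is assigned to the instance.
    }
}
```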

26
Classification evaluation
  • Hold-out estimate
  • The training set and testing set used are mutually independent.
  • Repeatedly resample the dataset to calculate hold-out estimates by randomly reordering it and partitioning it into a training set (2/3) and a test set (1/3).
  • Compute the average and standard deviation of the accuracy.
  • Cross-validation (see the sketch below)
  • Randomly reorder the dataset and then split it into n folds of equal size.
  • In each iteration, 1 fold is used for testing and n-1 folds for training.
  • Accuracy is averaged over the test results of all folds.
  • The folds can be random, or constructed so that the class distribution in each fold reflects that of the complete dataset (stratified cross-validation).
  • Leave-one-out (LOO) cross-validation
  • n = the number of examples; each test fold contains a single example.
  • Less reliable results.
  • Used with small datasets.
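A short Java sketch of the random (non-stratified) n-fold split used in cross-validation; the helper names are illustrative:

```java
import java.util.*;

public class CrossValidationSplit {
    // Randomly reorder the instance indices and split them into n folds
    // of (nearly) equal size; fold i is the test set of iteration i.
    static List<List<Integer>> folds(int numInstances, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < n; f++) result.add(new ArrayList<>());
        for (int i = 0; i < indices.size(); i++) {
            result.get(i % n).add(indices.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 instances, 3 folds: each fold is used once for testing,
        // the remaining folds for training; accuracy is then averaged.
        System.out.println(folds(10, 3, 42L));
    }
}
```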

27
System Description
  • Focus: enrich a topic taxonomy with good-quality keywords.
  • Extract URLs from the ODP and download the pages.
  • Process these pages by extracting keywords.
  • Assess the keywords gathered for each document
    and coalesce them into keywords for the category
    to which the Web page belongs in the topic
    taxonomy.

28
Design Issues
  • Scalable: process a large corpus of documents.
  • Flexible: incorporate more documents and additively process them without requiring a reprocessing of the whole document body.
  • Highly customizable: use different system parameters.
  • Modular: different algorithms can be used plug-and-play.

29
The taxonomy used
  • Acquired from the Open Directory Project (ODP).
  • Indexed over 4 million sites catalogued in 590,000 categories.
  • Initially named Gnuhoo and then NewHoo, by Skrenta and Truel (1998).
  • Within 2 weeks: 200 editors, 27,000 sites and 2,000 categories.
  • Within 5 weeks: acquired by Netcenter.
  • Basic idea: as the Internet expands, the number of Internet users also grows, and these users can each help organise a small portion of the Web.
  • Available through a Resource Description Framework (RDF) dump.

30
System Architecture
  • The system comprises several modules.
  • Each module takes as input the results of the previous module in line and outputs an Index.
  • Each module of our system is composed of:
  • a Worker class, which performs the processing tasks for each step,
  • an Index Creator class, which coordinates Workers by delegating tasks,
  • an Indexer class, which outputs the results of each module to a file in binary form,
  • an Index Reader class, which reloads the Index from the file back into memory.
  • The whole procedure is split into two phases:
  • Document Keyword Extraction, whose result is a catalog with a set of keywords for each document of the Information Repository,
  • Topic Keyword Extraction, whose result is a keyword catalog for each topic in the hierarchical taxonomy.

31
System Architecture
32
Document Keyword Extraction
33
Preprocessing
  • Extracts a list of URLs from the ODP taxonomy, down to a specified depth.
  • Downloads the Web pages.
  • Assigns a unique number (docID) to each page.

34
Forward Index Creator
  • Processes Web pages.
  • Parses each file and extracts keywords.
  • Ignores words in a stop word list of 336 common English words.
  • Performs stemming.
  • Calculates the local weight.
  • Produces a Forward Index and a Topics Index.

35
Stemming
  • Reduce all morphological variants of a word to a
    single index term
  • Porter Stemming Algorithm
  • Five rules for reduction, applied to the word sequentially; the rule chosen is the one matching the longest suffix. Examples of the trimming performed:
  • processes → process
  • policies → polici
  • emotional → emotion
  • e.g., if the suffix is IZATION and the prefix contains at least one vowel followed by a consonant, replace the suffix with IZE (see the sketch below):
  • BINARIZATION → BINARIZE
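A minimal Java sketch of just the suffix rule quoted above (IZATION → IZE when the remaining stem contains a vowel followed by a consonant); this illustrates a single rule, not a complete Porter stemmer:

```java
public class SuffixRuleExample {
    // Apply the rule: if the word ends in IZATION and the prefix (stem)
    // contains at least one vowel followed by a consonant, replace the
    // suffix with IZE.
    static String applyIzationRule(String word) {
        String w = word.toUpperCase();
        if (w.endsWith("IZATION")) {
            String stem = w.substring(0, w.length() - "IZATION".length());
            if (stem.matches(".*[AEIOU][^AEIOU].*")) {
                return stem + "IZE";
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(applyIzationRule("BINARIZATION")); // BINARIZE
    }
}
```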

36
Local Weight
  • Calculate local weight for each word in each
    document.
  • Consider the HTML tag within which a word appears. Group tags into 7 classes and assign each class an importance factor called CI (Class Importance).
  • Calculate the word's frequency in each class and produce its Term Frequency Vector (TFV).
  • The result is the local weight L_ij (see the sketch below).
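The slide does not give the exact way the Term Frequency Vector and the Class Importance factors are combined; the sketch below assumes a simple CI-weighted sum of the per-class frequencies, purely as an illustration and not necessarily the system's actual formula:

```java
public class LocalWeightSketch {
    // Hypothetical combination (an assumption, not the documented formula):
    // L_ij = sum over the 7 HTML tag classes c of CI[c] * TFV[c],
    // where TFV[c] is the term's frequency within tags of class c.
    static double localWeight(double[] classImportance, int[] termFreqVector) {
        double l = 0.0;
        for (int c = 0; c < classImportance.length; c++) {
            l += classImportance[c] * termFreqVector[c];
        }
        return l;
    }

    public static void main(String[] args) {
        // Illustrative CI values for 7 tag classes (e.g. title > heading > body).
        double[] ci  = {5.0, 3.0, 2.0, 1.5, 1.2, 1.0, 0.5};
        int[]    tfv = {1,   0,   2,   0,   0,   4,   0};
        System.out.println(localWeight(ci, tfv));
    }
}
```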

37
Forward Index Creator
[Diagram: Web documents → Forward Index Creator → Forward Index and Topics Index]
38
Lexicon Creator
  • Processes Forward Indexes.
  • For each word, it calculates the global weight.
  • Produces a Lexicon Index.
  • The global weight is the Inverse Document Frequency (IDF); see the sketch below.
  • IDF scales down words that appear in many documents.
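A one-method Java sketch of the global weight using the standard IDF formula; the slide names IDF but does not show which exact variant is used:

```java
public class GlobalWeight {
    // Standard inverse document frequency: scales down words that
    // appear in many of the N documents (docFreq = document frequency).
    static double idf(int totalDocs, int docFreq) {
        return Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        System.out.println(idf(1_000_000, 10));      // rare word: high weight
        System.out.println(idf(1_000_000, 900_000)); // common word: low weight
    }
}
```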

39
Lexicon Creator
[Diagram: Forward Indexes → Lexicon Creator → Lexicon Index]
40
Normalization Index Creator
  • Processes Forward Indexes and Lexicons.
  • For each document, calculate the normalization
    length.
  • Produces a Normalization Index
  • Cosine normalization compensates for document length.
  • This data transformation essentially projects each document vector onto the unit Euclidean sphere (see the formula below).
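Written out, and assuming the unnormalized weights are the products L_ij · G_i, the cosine normalization factor for document j is:

```latex
N_j = \frac{1}{\sqrt{\sum_{i} \left( L_{ij}\, G_i \right)^2}}
```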

41
Normalization Index Creator
[Diagram: Forward Indexes and Lexicon → Normalization Index Creator → Normalization Index]
42
Weight Calculator
  • Processes Forward Indexes, Lexicon and
    Normalization Indexes.
  • Calculate the weight of each word in each document

43
Weight Calculator
[Diagram: Forward Indexes, Lexicon and Normalization Indexes → Weight Calculator → weighted Forward Index]
44
Topic Keyword Extraction
45
Forward Index and Topics Index Sorter
  • Must match each document's set of keywords to its topic's set of keywords.
  • Must replace each docID with the corresponding topicID.
  • The indexes have to be sorted to do this efficiently.
  • The volume of the indexes to be processed is large.
  • The data cannot fit into main memory.
  • Hence, external sorting.

46
External Sorting
  • Repeatedly load a segment of the file into memory, sort it with an algorithm like Quicksort, and write the sorted data back to the file.
  • Merge the sorted blocks with MergeSort by making several passes through the file, creating successively larger sorted blocks until the whole file is sorted.
  • The most common scheme is k-way merging (see the sketch below).
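A small Java sketch of the k-way merge step over already-sorted runs, using a priority queue keyed on the head of each run; the runs are kept in memory here for brevity, whereas the real setting would stream sorted blocks from disk:

```java
import java.util.*;

public class KWayMerge {
    // Merge k sorted runs into one sorted output using a min-heap
    // keyed on the current head element of each run.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, runIndex, positionInRun}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[]{runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]);
            int next = e[2] + 1;
            if (next < runs.get(e[1]).size()) {
                heap.add(new int[]{runs.get(e[1]).get(next), e[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 5, 9),
                Arrays.asList(2, 6),
                Arrays.asList(3, 4, 8));
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 8, 9]
    }
}
```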

A 4-way merge on 16 runs
47
Topics ID Replacer
  • Replace each docID in the sorted Forward Index
    with the corresponding topicID

48
Topics ID Replacer
[Diagram: sorted Forward Index → Topics ID Replacer → Topics Keywords Index]
49
Topics Keywords Index Sorter
  • Sort the Topics Keyword Index by topicID

50
Topics Records Merger
  • Process the Topics Keyword Index
  • Coalesce multiple records with the same (topicID, wordID) pair by calculating the mean value of their weights (see the sketch below).

(topicID, wordID, (weight_1, ..., weight_n))
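A short Java sketch of this merge, assuming the records arrive sorted by (topicID, wordID) so that equal keys are adjacent; the record type is illustrative:

```java
import java.util.*;

public class TopicsRecordMerger {
    // One record of the sorted Topics Keywords Index.
    record Rec(int topicId, int wordId, double weight) {}

    // Coalesce adjacent records sharing (topicID, wordID) into a single
    // record whose weight is the mean of the merged weights.
    static List<Rec> merge(List<Rec> sorted) {
        List<Rec> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            Rec first = sorted.get(i);
            double sum = 0.0;
            int count = 0;
            while (i < sorted.size()
                    && sorted.get(i).topicId() == first.topicId()
                    && sorted.get(i).wordId() == first.wordId()) {
                sum += sorted.get(i).weight();
                count++;
                i++;
            }
            out.add(new Rec(first.topicId(), first.wordId(), sum / count));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> recs = List.of(
                new Rec(7, 42, 0.2), new Rec(7, 42, 0.4), new Rec(7, 99, 0.3));
        System.out.println(merge(recs)); // (7, 42) merged to mean weight 0.3
    }
}
```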
51
Evaluating Results
  • Extracted URLs from ODP whose topics are up to
    depth 6.
  • Root-Topic/ Sub-topic1/ Sub-topic2 /
    Sub-topic3 / Sub-topic4 / Sub-topic5 /
    Sub-topic6
  • Extracted 1,153,086 such URLs.
  • Downloaded those pages using the eRACE crawler.
    The crawler managed to download 1,046,021
    webpages.
  • Of these pages, 257,978 were under the topic Top/World (the majority of which are non-English).
  • For the processing of these pages, we used a non-exclusive Solaris machine with 4 CPUs and 8 GB of memory. We executed the Java program using 4 GB of memory.

52
Experiments and Results
53
Experiments and Results
54
Experiments and Results
55
Experiments and Results
  • In order to evaluate the keywords produced, we try to classify documents based on them.
  • The tool used was Weka.
  • Weka is a collection of machine learning algorithms for data mining tasks.
  • It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
  • Open-source software issued under the GNU General Public License.
  • Input data are represented in ARFF (Attribute-Relation File Format); see the example below.
  • An ARFF file includes a collection of instances that share a set of common characteristics (features).
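For illustration, a tiny ARFF file in the shape described above; the relation name, attribute names and topic values are hypothetical and not taken from the actual experiment files. Each data row is one instance: the keyword weights for a document plus its topic.

```arff
% Hypothetical example of the ARFF layout used for the experiments.
@relation topic_keywords

@attribute keyword_movie numeric
@attribute keyword_actor numeric
@attribute topic {Arts_Movies_Reviews, Arts_Movies_Awards}

@data
0.13, 0.02, Arts_Movies_Reviews
0.00, 0.27, Arts_Movies_Awards
```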

56
Experiment 1
  • Select only web documents located under the topic /Top/Arts/Movies/.
  • → This reduces the dataset to 28,102 webpages with 5,756 unique topics and 123,253 wordIDs.
  • We set the minimum threshold for the weight of a word to 0.01.
  • This reduced the keyword set to 19,137.
  • Still too large a number to use as features, so we selected the 40 most important keywords.
  • For the majority of topics, only a few webpages were available as samples.
  • We selected topics for which there were at least 10 webpages.
  • This reduced the set of topics from 5,756 to 93 and the number of instances (webpages) in the dataset from 28,102 to 2,432.

57
Experiment 1
  • Topic Appearance before selection

58
Experiment 1 Classification Results
59
Experiment 2
  • Dataset includes documents that belong to topics
    of depth 2 in the taxonomy.
  • The result is a dataset of 3,939 documents, 194 topics and 16,373 words.
  • We produced an ARFF file, where each record represents a document and states the weight for each of the 16,373 keywords in that document, as well as the topic to which the document belongs.

60
Experiment 2 Classification Results
61
Experiment 2 (NN, k=5)
  • TP Rate: the fraction of positive instances predicted as positive.
  • FP Rate: the fraction of negative instances predicted as positive.
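In terms of the standard confusion-matrix counts (TP, FP, TN, FN), these rates are:

```latex
\mathrm{TP\;Rate} = \frac{TP}{TP + FN}, \qquad
\mathrm{FP\;Rate} = \frac{FP}{FP + TN}
```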

62
Experiment 2 (NN, k=1)
  • TP Rate: the fraction of positive instances predicted as positive.
  • FP Rate: the fraction of negative instances predicted as positive.

63
Conclusions
  • We have proposed a system that takes advantage of
    the content of Web pages by using well-known IR
    techniques to extract keywords for enriching a
    large-scale taxonomy.
  • The challenge was to build a scalable, flexible
    and highly customizable system.
  • We have tested the system with a corpus of over
    1,000,000 documents with satisfactory performance
    and speed.
  • The algorithms are designed to eliminate large in-memory processing.
  • Memory usage is proportional to the size of the lexicon (number of words) and not to the number of documents.
  • The system is able to incorporate more documents and additively process them without requiring a reprocessing of the whole document corpus.
  • It is customizable and modular enough to use different parameters and IR algorithms. This gives the ability to compare the results of different techniques or to better tune parameter values.
  • The results presented indicate the good quality of the keywords produced, although there seems to be room for improvement.

64
Future Work
  • We believe that there is a lot of work that can
    be done in improving and exploiting this tool.
  • Reduce the processing time by improving the
    algorithms used.
  • Better management of Input/Output tasks, e.g., using the memory-mapped I/O functionality of Java or employing lower-level languages to perform such tasks.
  • Recognize the language of each Internet page, and then use the appropriate stemmer and stop word list for each one.
  • Extract the meta-data assigned by the volunteer editors from the ODP dump.
  • We must investigate the optimal set of attributes that can describe a given topic and maximize the quality of the application, as well as which parameters can affect this decision.
  • The large database of keywords for each category can be exploited in different ways.
  • It can help automatically catalog documents into a topic hierarchy, while the keywords can complement the documents with metadata.
  • Useful tool for statistically analyzing large
    hierarchies like ODP and Yahoo.
  • It can be used to assign a topic to user queries
    and thus filter out irrelevant topics.

65
Future Work
  • A more sophisticated version of the tool could also exploit a thesaurus such as WordNet, which could reinforce the semantic meaning of the results.
  • We could study methodologies of diffusing
    keywords up or down the hierarchy.
  • We will need some visualization tools for the
    presentation of the results.
  • The system could also be provided as an Internet
    service where a user could query the keywords of
    a particular topic or sub-tree. The results could
    be presented using some visualization technique.

66
  • THANK YOU FOR YOUR ATTENTION.
  • QUESTIONS?