Pattern Analysis - PowerPoint PPT Presentation

Provided by: Mohame95
1
Pattern Analysis and Machine Intelligence Research Group
University of Waterloo
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
T L: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
2
Knowledge Extraction and LO Mining
  • GOAL
  • Develop data mining and knowledge extraction
    techniques and tools for learning object
    repositories.
  • These tools can provide context and facilitate
    interactions, efficient organization, efficient
    delivery, navigation and retrieval.

3
Theme Overview
Knowledge Extraction
  • From Text: Syntactic (Keyword, Keyphrase-based),
    Semantic (Concept-based)
  • From Images: Image Features, Shape Features
  • From Text and Images: Describing Images with Text,
    Enriching Text with Images
Tagging and Organizing
  • Classification (MCS, Data Partitioning,
    Imbalanced Classes)
  • Clustering (Parallel/Distributed Clustering,
    Cluster Aggregation)
Matching and Ranking
  • LO Similarity and Ranking
  • Association Rules / Social Networks
  • Reinforcement Learning
  • Specialized / Personalized Search
4
Types of Data in LORNET
[Diagram: data sources in LORNET and the types of data they hold]
  • LCMS and TELOS (Course, Module, Lesson, LO):
    Subject Matter: Text, Images, Flash, Applets,
    Metadata, Interaction Logs
  • Discussion Board (Board, Thread, Post):
    Discussions: Text, Interaction Logs
  • LOR (Metadata Record): LO Descriptors: Metadata
  • Semantic Layer (Resource): Resources: Metadata,
    Semantic References
5
LO Mining Scenarios
Task by environment (Knowledge Extraction, Tagging / Organizing, Matching / Ranking)
TELOS
  • Knowledge Extraction: Ontology Construction
  • Tagging / Organizing: Grouping Components
  • Matching / Ranking: Finding and Ranking Components
E-Learning Design Environment (LMS)
  • Knowledge Extraction: Extracting LO Summary,
    Extracting LO Concepts, Extracting Image Description
  • Tagging / Organizing: Grouping LOs
  • Matching / Ranking: Finding Similar LOs, Ranking LOs
Learning Object Content MS (LCMS)
  • Knowledge Extraction: Summarizing Documents,
    Extracting Concepts from Documents
  • Tagging / Organizing: Grouping Documents,
    Tagging Documents
  • Matching / Ranking: Finding Similar Topics,
    Finding Similar Profiles, Building Social Networks,
    Detecting Plagiarism
LO Repository
  • Knowledge Extraction: Extracting Metadata,
    Extracting Ontologies
  • Tagging / Organizing: Classifying LOs,
    Building LO Clusters
  • Matching / Ranking: Detecting Duplicate LOs,
    Ranking LOs, Metadata Matching
6
LO Mining and Knowledge Extraction
7
Projects Overview
  • Information Extraction: Analyzing content to
    extract relevant information (Keyword Extraction,
    Summarization, Concept Extraction, Social Network
    Analysis)
  • Categorization: Organizing LOs according to their
    content. Classification (Traditional, MCS,
    Imbalanced) and Clustering (Traditional, Ensembles,
    Distributed)
  • Personalization: Providing user-specific results.
    Reinforcement Learning (Traditional,
    Opposition-based)
  • Image Mining: Describing and finding relevant
    images. CBIR (Traditional, Fusion-based)
  • Integration and Applications: Software Components,
    In Progress, Publications, Theme and Industry
    Collaboration
8
Information Extraction: Summarization
LO Content Package Summarization
  • Learning objects stored in IMS content packages
    are loaded and parsed. Textual content files are
    extracted for analysis.
  • Statistical term weighting and sentence ranking
    are performed on each document and on the whole
    collection.
  • The top-ranked sentences are extracted for each
    document.
  • Planned functionality: summarization of whole
    modules or lessons (as opposed to single
    documents).
  • Benefits
  • Provide a summarized overview of learning objects
    for quick browsing and access to learning
    material.
  • Scenarios
  • Learning Management Systems can call the
    summarization component to produce summaries for
    content packages.

Data courtesy of the University of Saskatchewan
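The term weighting and sentence ranking steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the Theme 4 component: the TF-IDF weighting, the naive sentence splitter and the function names are all assumptions, and IMS content package parsing is left out.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption made for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(documents, top_n=1):
    """Return the top_n highest-scoring sentences of each document, in original order."""
    doc_tokens = [set(tokenize(d)) for d in documents]
    n_docs = len(documents)
    # Collection-level weight: smoothed inverse document frequency.
    df = Counter(t for toks in doc_tokens for t in toks)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    summaries = []
    for doc in documents:
        tf = Counter(tokenize(doc))  # document-level term frequency
        scored = []
        for pos, sent in enumerate(split_sentences(doc)):
            toks = tokenize(sent)
            # Average TF-IDF weight, so long sentences are not favoured by length alone.
            score = sum(tf[t] * idf[t] for t in toks) / len(toks) if toks else 0.0
            scored.append((score, pos, sent))
        best = sorted(scored, key=lambda x: (-x[0], x[1]))[:top_n]
        summaries.append([s for _, _, s in sorted(best, key=lambda x: x[1])])
    return summaries


docs = [
    "Clustering groups documents. Clustering uses clustering distance. See you.",
    "See you tomorrow.",
]
print(summarize(docs, top_n=1))
```

Averaging the weights per sentence is one of several possible ranking choices; summing them instead would favour longer sentences.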
9
Information Extraction: Concept Extraction
Concept-Based Statistical Analyser

F-measure of Hierarchical Clustering
Dataset   Single-Term   Concept-based   Improvement (%)
Reuters   0.723         0.925           27.94
ACM       0.697         0.918           31.70
Brown     0.581         0.906           55.93

Entropy of Hierarchical Clustering (lower is better)
Dataset   Single-Term   Concept-based   Improvement (%)
Reuters   0.251         0.012           -95.21
ACM       0.317         0.043           -86.43
Brown     0.385         0.018           -95.32

Conceptual Ontological Graph (COG) Ranking

Precision of Search
Dataset   Single-Term   Concept-based   Improvement (%)
Cran      0.536         0.901           68.09
Reuters   0.591         0.897           51.77

Recall of Search Result
Dataset   Single-Term   Concept-based   Improvement (%)
Cran      0.486         0.827           70.16
Reuters   0.452         0.841           86.06
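The Improvement column appears to be the relative change of the concept-based score over the single-term score, in percent. The slide does not state this, so the formula below is an assumption. Recomputing from the rounded scores matches the reported figures to within about 0.01.

```python
def relative_improvement(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (single-term score, concept-based score, reported improvement), copied from the tables.
reported = {
    "F-measure / Reuters": (0.723, 0.925, 27.94),
    "Entropy / Reuters": (0.251, 0.012, -95.21),
    "Precision / Cran": (0.536, 0.901, 68.09),
    "Recall / Reuters": (0.452, 0.841, 86.06),
}

for name, (base, new, rep) in reported.items():
    print(f"{name}: recomputed {relative_improvement(base, new):.2f}, reported {rep:.2f}")
```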
10
Information Extraction: Keyword Extraction
  • Semantic Keyword Extraction
  • Tasks
  • Developing tools and techniques to extract
    semantic keywords toward facilitating metadata
    generation
  • Developing algorithms to enrich metadata (tags)
    which can be applied in index-based multimedia
    retrieval
  • Progress
  • Proposed a new information theoretic inclusion
    index to measure the asymmetric dependency
    between terms (and concepts), which can be used
    in term selection (keyword extraction) and
    taxonomy extraction (pseudo ontology)
  • Makrehchi, M. and Kamel, ICDM07, WI 07
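The slide does not define the proposed inclusion index, so the sketch below substitutes a standard asymmetric, information-theoretic measure, the uncertainty coefficient U(b|a) = I(a;b) / H(b), purely to illustrate what "asymmetric dependency between terms" can look like. It is not the authors' index, and the function names and toy documents are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def uncertainty_coefficient(docs, a, b):
    """U(b|a): share of the uncertainty about b's presence removed by knowing a's presence."""
    n = len(docs)
    joint = {}
    for d in docs:
        key = (a in d, b in d)
        joint[key] = joint.get(key, 0) + 1
    p_joint = {k: v / n for k, v in joint.items()}
    p_a = {x: sum(p for (ka, _), p in p_joint.items() if ka == x) for x in (True, False)}
    p_b = {y: sum(p for (_, kb), p in p_joint.items() if kb == y) for y in (True, False)}
    h_b = entropy(p_b.values())
    if h_b == 0:
        return 0.0
    # Mutual information between the two presence indicators.
    mi = sum(p * math.log2(p / (p_a[ka] * p_b[kb])) for (ka, kb), p in p_joint.items())
    return mi / h_b


# Each document is represented as a set of terms (toy data).
docs = [{"animal", "dog"}, {"animal", "cat"}, {"plant"}, {"plant", "tree"}]
print(uncertainty_coefficient(docs, "animal", "dog"))
print(uncertainty_coefficient(docs, "dog", "animal"))
```

The two calls return different values because the mutual information is symmetric while the normalising entropy is not, which is what makes the measure directional.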

11
Information Extraction: Keyword Extraction
  • Rule base size shows quick initial growth,
    followed by slow and irregular growth and rule
    elimination
  • Learns 20 rules from the first 50 training rules
  • Learns 13 additional rules from the next 220
    training rules

Rule-based Keyword Extraction
  • Learn rules to find keywords in English sentences
  • Rules represent sentence fragments
  • Specific enough for reliable keyword extraction
  • General enough to be applied to unseen sentences
  • Rule generalization
  • Begin with an exact sentence fragment
  • Merge with another by moving different words to
    the lowest common level in the part-of-speech
    hierarchy
  • Keep the merged rule if it does not reduce the
    precision and recall of keyword extraction; keep
    the original rules otherwise
  • Keyword extraction
  • Find sequence of rules that best cover an unseen
    sentence
  • Extract keywords according to rules
  • Both precision and recall values increase during
    training
  • Precision (blue) increases 10%
  • Recall (red) shows slight upward trend
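The rule generalization step (moving differing words to the lowest common level in the part-of-speech hierarchy) can be sketched as below. The toy hierarchy, the tag names and the equal-length restriction are assumptions for illustration. The check that a merged rule does not reduce precision and recall is omitted.

```python
# Hypothetical toy hierarchy: each node maps to its parent; "ANY" is the root.
PARENT = {
    "cat": "NOUN-SG", "dog": "NOUN-SG", "cats": "NOUN-PL",
    "NOUN-SG": "NOUN", "NOUN-PL": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
    "the": "DET", "a": "DET",
    "NOUN": "ANY", "VERB": "ANY", "DET": "ANY",
}


def ancestors(node):
    """The node itself followed by its chain of parents up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def lowest_common_level(x, y):
    """Most specific node that covers both x and y."""
    covers_y = set(ancestors(y))
    for node in ancestors(x):
        if node in covers_y:
            return node
    return "ANY"  # words outside the toy hierarchy only share the root


def merge_rules(rule_a, rule_b):
    """Merge two equal-length fragments position by position; None if lengths differ."""
    if len(rule_a) != len(rule_b):
        return None
    return [lowest_common_level(x, y) for x, y in zip(rule_a, rule_b)]


print(merge_rules(["the", "cat", "runs"], ["a", "dog", "sleeps"]))
print(merge_rules(["the", "cat", "runs"], ["the", "cats", "runs"]))
```

Identical words are kept as they are, so a merged rule stays as specific as the two fragments allow.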

12
Categorization: Ensemble-based Clustering
  • Consensus Clustering
  • Categorization of learning objects using proposed
    consensus clustering algorithms.
  • The goal of consensus clustering is to find a
    clustering of the data objects that optimally
    summarizes an ensemble of multiple clusterings.
  • Consensus clustering can offer several advantages
    over a single data clustering, such as the
    improvement of clustering accuracy, enhancing the
    scalability of clustering algorithms to large
    volumes of data objects, and enhancing the
    robustness by reducing the sensitivity to outlier
    data objects or noisy attributes.
  • Tasks
  • Development of techniques for producing ensembles
    of multiple data clusterings where diverse
    information about the structure of the data is
    likely to occur.
  • Development of consensus algorithms to aggregate
    the individual clusterings.
  • Development of solutions for the cluster
    symbolic-label matching problem.
  • Empirical analysis on real-world data and
    validation of proposed method.
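One common way to aggregate an ensemble without solving the label matching problem is a co-association vote: two objects end up together if most base clusterings put them together. The sketch below uses that generic technique with a union-find merge. It is not one of the consensus algorithms proposed by the theme, and the threshold is an assumed parameter.

```python
from itertools import combinations


def consensus_clusters(ensemble, threshold=0.5):
    """ensemble: list of label lists, one per base clustering. Returns consensus labels."""
    n = len(ensemble[0])
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Fraction of base clusterings that place objects i and j in the same cluster.
        together = sum(labels[i] == labels[j] for labels in ensemble)
        if together / len(ensemble) > threshold:
            parent[find(i)] = find(j)

    # Relabel the merged groups 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


ensemble = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],  # same kind of grouping, different symbolic labels
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(ensemble))
```

Because only pairwise agreement is counted, the arbitrary label names used by each base clustering never need to be matched.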

13
Categorization using cluster ensemble
Dataset                     Samples   Attributes   Classes   K-means mean error rate (%)   Ensembles mean error rate (%)
Synthetic1                  1000      8            5         17.41                         0
Yahoo! (text)               2340      1458         6         38.23                         16.24
Texture (image)             5500      40           11        37.99                         11.54
Optical Digit Recognition   500       64           10        27.31                         16.40
14
Categorization: Distributed Clustering
Hierarchical P2P Document Clustering
  • Peer nodes are arranged into groups called
    neighborhoods.
  • Multiple neighborhoods are formed at each level
    of the hierarchy.
  • The size of each neighborhood is determined
    through a network partitioning factor.
  • Each neighborhood has a designated supernode.
  • Supernodes of level h form the neighborhoods for
    level h+1.
  • Clustering is done within neighborhood
    boundaries, then is merged up the hierarchy
    through the supernodes.
  • Benefits
  • Significant speedup over centralized clustering
    and flat peer-to-peer clustering.
  • Multiple levels of clusters.
  • Distributed summarization of clusters using
    CorePhrase keyphrase extraction.
  • Scenarios

HP2PC Architecture
HP2PC Example: 3-level network, 16 nodes
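The way neighborhoods and supernodes stack into a hierarchy can be sketched as follows. The fixed neighborhood size stands in for the network partitioning factor, whose exact definition is not given on the slide, and choosing the first peer of each neighborhood as supernode is an assumption. Level counts may therefore differ from the slide's figure.

```python
def build_hierarchy(nodes, neighborhood_size):
    """Return a list of levels; each level is a list of neighborhoods (lists of node ids)."""
    if neighborhood_size < 2:
        raise ValueError("neighborhood_size must be at least 2")
    levels = []
    current = list(nodes)
    while True:
        # Partition the peers of this level into consecutive neighborhoods.
        neighborhoods = [current[i:i + neighborhood_size]
                         for i in range(0, len(current), neighborhood_size)]
        levels.append(neighborhoods)
        if len(neighborhoods) == 1:
            break
        # Assumption: the first peer of each neighborhood acts as its supernode,
        # and the supernodes become the peers of the next level up.
        current = [nb[0] for nb in neighborhoods]
    return levels


for height, level in enumerate(build_hierarchy(range(16), 4)):
    print(height, level)
```

Clustering would run inside each neighborhood and the results would then be merged upward through the supernodes; that merging step is not shown here.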
15
Categorization: Multiple Classifier Systems
  • Progress
  • Proposed a set of evaluation measures to select
    sub-optimal training partitions for training
    classifier ensembles.
  • Proposed an ensemble training algorithm called
    Clustering, De-clustering, and Selection (CDS).
  • Proposed and optimized a cooperative training
    algorithm called Cooperative Clustering,
    De-clustering, and Selection (CO-CDS).
  • Investigated the applications of proposed
    training methods (CDS and CO-CDS) on LO
    classification.
  • Tasks
  • To investigate various aspects of cooperation in
    Multiple Classifier Systems (Classifier
    Ensembles)
  • To develop evaluation measures in order to
    estimate various types of cooperation in the
    system
  • To gain insight into the impact of changes in the
    cooperative components with respect to system
    performance using the proposed evaluation
    measures
  • To apply these findings to optimize existing
    ensemble methods
  • To apply these findings to develop novel ensemble
    methods with the goal of improving classification
    accuracy and reducing computational complexity

16
Categorization: Imbalanced Class Distribution
  • Objective
  • Advance classification of multi-class imbalanced
    data
  • Tasks
  • To develop cost-sensitive boosting algorithm
    AdaC2.M1
  • To improve the identification performance on the
    important classes
  • To balance classification performance among
    several classes

17
Categorization: Imbalanced Class Distribution

Class Distribution
Ind.   Size   Dist. (%)
C1     49     7.84
C2     288    46.08
C3     288    46.08

Performance of Base Classification and AdaBoost
                 C4.5               HPWR (Od3)
Class   Meas.    Base    AdaBoost   Base    AdaBoost
C1      R        0       5.11       10.70   44.06
C1      P        N/A     6.5        11.82   32.89
C1      F        N/A     5.84       10.83   35.84
C2      R        73.21   92.28      88.31   87.43
C2      P        69.53   88.75      86.79   91.99
C2      F        72.29   90.38      87.43   89.64
C3      R        67.94   91.36      87.63   88.42
C3      P        73.89   87.88      87.07   89.91
C3      F        71.91   89.42      86.99   89.03
G-measure        0       11.46      33.32   68.50

Balanced performance among classes - Evaluated by G-mean
                 C4.5                          HPWR (Od3)
Class   Meas.    Base    AdaBoost   AdaC2.M1   Base    AdaBoost   AdaC2.M1
C1      R        0       5.11       77.58      10.70   44.06      65.72
C1      P        N/A     6.50       14.12      11.82   32.89      30.83
C2      R        73.21   92.28      64.73      88.31   87.43      83.12
C2      P        69.53   88.75      97.24      86.79   91.99      91.38
C3      R        67.94   91.36      65.23      87.63   88.42      83.95
C3      P        73.89   87.88      93.22      87.07   89.91      90.81
G-mean           0       11.46      68.42      33.32   68.50      76.08
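G-mean is commonly defined as the geometric mean of the per-class recalls, so a single class with zero recall drives the score to zero. The slide does not spell out its definition, so that reading is an assumption. Recomputing it from the averaged recalls in the table does not reproduce the reported figures exactly (for C4.5 with AdaC2.M1 it gives roughly 68.9 rather than 68.42), so the reported values were presumably computed differently, for example per run before averaging.

```python
import math


def g_mean(recalls):
    """Geometric mean of per-class recalls given as fractions in [0, 1]."""
    return math.prod(recalls) ** (1.0 / len(recalls))


# A classifier that ignores the minority class scores zero no matter
# how well it does on the majority classes.
print(g_mean([0.0, 0.7321, 0.6794]))
# Trading some majority-class recall for minority-class recall raises the score.
print(g_mean([0.7758, 0.6473, 0.6523]))
```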
18
Personalization
  • Opposition-based Reinforcement Learning (ORL) for
    Personalizing Image Search
  • Developing a reliable technique to assist users
    and to facilitate and enhance the learning process
  • The personalized ORL tool helps the user find the
    desired images among the search results
  • The tool gathers the images returned by a search
    and selects a sample of them
  • By presenting the sample and interacting with the
    user, it learns the user's preferences
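The idea behind opposition-based reinforcement learning is to update the value of the opposite action alongside the action actually taken, so that each round of user feedback teaches the agent twice. The sketch below is a simplified tabular Q-learning variant. The negated reward for the opposite action, the shared next state, and the state and action names are assumptions made for illustration, not the tool's actual design.

```python
def opposition_q_update(q, state, action, reward, next_state, opposite,
                        alpha=0.5, gamma=0.9):
    """One Q-learning step plus an extra update for the opposite action."""
    best_next = max(q[next_state].values())
    # Standard Q-learning update for the action that was taken.
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    # Opposition step (assumption): the opposite action would have earned the
    # negated reward and led to the same next state.
    opp = opposite[action]
    q[state][opp] += alpha * (-reward + gamma * best_next - q[state][opp])
    return q


# Hypothetical single-state example: the user liked the sampled image.
q = {"sample": {"show_similar": 0.0, "show_different": 0.0}}
opposite = {"show_similar": "show_different", "show_different": "show_similar"}
opposition_q_update(q, "sample", "show_similar", 1.0, "sample", opposite)
print(q)
```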

19
Personalization
20
Image Mining: CBIR
  • Content-based image retrieval
  • Build an IR system that can retrieve images based
    on Textual Cues, Image content, NL Queries

Image Retrieval Tool Set
  • Queries: Query Image (QI), Query Text (QT), Query
    Document
  • Sources: Images, Rich Documents
  • Results: Documents that contain QI, Images that
    contain QT, Images that match QI, NL Description
    of Image, Automated image tagging

21
Illustrative Example
[Figure: retrieval results for IZM, FD, MTAR and the proposed approach; accuracy labels 55, 70, 95 and 60, in the order extracted]
22
Experimental Results (Contd)
The Performance of the proposed approach
23
Integration and Applications
  • Progress
  • Finished core parts of the common data mining
    framework.
  • Built components and services from theme
    researchers' work around the data mining
    framework.
  • Provided documentation for the data mining
    framework and software components.
  • Launched a web site to host components and
    documentation from Theme 4:
    http://pami.uwaterloo.ca/projects/lornet/software/

24
Integration and Applications
  • Progress
  • Core parts of the common data mining framework
    are available, including
  • Vector and matrix manipulation.
  • Document parsing and tokenization.
  • Statistical term and sentence analysis.
  • Similarity calculation using multiple distance
    functions.
  • IMS Content Package compliant parser.
  • Components and tools built around the common data
    mining framework
  • Metadata extraction from single documents
    supports Dublin Core encoding.
  • Document similarity calculation using cosine
    similarity.
  • Single document and content package
    summarization.
  • Building of standard text datasets from large
    document collections.
  • Integration with TELOS
  • Developed C TELOS connector for integrating
    Theme 4 components.
  • Worked on component manifest specification with
    Theme 6.
  • Provided metadata extraction as part of a
    complete scenario for TELOS components
    integration.
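Document similarity using cosine similarity, as listed among the components above, can be sketched with plain term-frequency vectors. The whitespace tokenizer and the function name are assumptions; the framework's own parser and weighting are not described on the slide.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Only terms shared by both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


print(cosine_similarity("data mining for learning objects", "mining learning objects"))
print(cosine_similarity("data mining", "image retrieval"))
```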

25
Industry Collaboration
  • Pattern Discovery Software (PDS) provided data
    mining software tools for use by researchers.
  • Vestech provided opportunities for researchers to
    work on speech technologies.
  • Desire2Learn opened job opportunities for LORNET
    researchers.

26
Software Components
Overview of Components
  • General Tools
  • C Connector for TELOS
  • Common Data Mining Framework
  • Standard Text Mining Tools
  • Metadata Extractor
  • Document Summarizer
  • Content Package Summarizer
  • Document Similarity
  • LO Recommender
  • Metadata Harvester
  • Keyword Extractor
  • Taxonomy Extractor
  • Metadata Enrichment Tools
  • Concept-based and Semantic Text Mining Tools
  • Metadata Extractor
  • LO Search Engine
  • Document Similarity
  • User-centric Tools
  • Personalized Search Engine
  • Social Network Learner
  • Image Mining Tools
  • Content-based Image Search
  • Personalized Image Search
  • Consensus-based Fusion for Image Retrieval
  • Legend: Integrated, Ready, In Progress, Year 5

Scenarios for Use of Software Components
(Environment, Data Types, Tasks)
TELOS
  • Data Types: Metadata, Ontology
  • Tasks: Ontology construction and unification,
    Finding relations between components, Ranking
    components, Grouping components, Tagging
    components

Learning Object Repository
  • Data Types: Metadata, Structured Text,
    Categorical
  • Tasks: Automatic metadata extraction, LO
    automatic classification, LO organization through
    clustering, Multiple organization strategies
    through cluster ensembles

e-Learning Environment
  • Data Types: Structured Text, Images, Object
    Relationships, Context
  • Tasks: Extracting concepts from LO, Summarizing
    Documents, Grouping LOs, Tagging LOs, Discovering
    Similar Topics, Discovering Similar Peers,
    Building Social Networks, Detecting Plagiarism,
    LO recommendation using similarity ranking,
    Personalization / Specialization through
    reinforcement learning

27
Publications
Project                                          Papers (accepted / published)   Papers (submitted / in prep)   Theses (completed / in progress)
4.1 Information Extraction from Text             11                              7                              3/2
4.2 Semantic Knowledge Synthesis from Text       10                              4                              4/1
4.3 Knowledge Discovery through Categorization   12                              10                             4/1
4.4 Knowledge from Interaction                   8                               3                              1/2
4.5 Knowledge from Image Mining                  10                              3                              2/1
Total                                            51                              27                             14/7 (21)
28
Theme 4 Team
Leader: M. Kamel
  • PIs
  • Dr. Basir
  • Dr. Karray
  • Dr. Tizhoosh
  • Assoc. PIs
  • Wong
  • DiMarco
  • Researchers
  • H. Ayad
  • R. Kashef
  • A. Ghazel
  • Dr. Makrehchi
  • M. Shokri
  • S. Hassan
  • A. Farahat
  • Dr. R. Khoury
  • Industry Collaboration
  • PDS
  • Vestech
  • Desire2Learn
  • Graduated
  • R. Khoury, PhD 07
  • L. Chen, PhD 07
  • M. Makrehchi, PhD 07
  • K. Hammouda, PhD 07
  • R. Dara, PhD 07
  • Y. Sun, PhD 07
  • K. Shaban, PhD 06
  • Y. Sun, PhD 06
  • M. Hussin, PhD 05
  • Jan Bakus, PhD 05
  • A. Adegorite, MASc 04
  • A. Khandani, MASc 05
  • S. Podder, MASc 04
  • Funding
  • CRC/CFI/OIT
  • NSERC
  • PAMI Lab

29
Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo, Canada
www.pami.uwaterloo.ca
  • www.pami.uwaterloo.ca/projects/lornet/software/
  • www.pami.uwaterloo.ca/kamel.html

publications