Title: Pattern Analysis
1 Pattern Analysis and Machine Intelligence Research Group, UNIVERSITY OF WATERLOO
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
TL: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
2 Knowledge Extraction and LO Mining
- GOAL
- Develop data mining and knowledge extraction techniques and tools for learning object repositories.
- These tools can provide context and facilitate interactions, efficient organization, efficient delivery, navigation and retrieval.
3 Theme Overview
Knowledge Extraction
- From Text: Syntactic (Keyword, Keyphrase-based); Semantic (Concept-based)
- From Images: Image Features, Shape Features
- From Text and Images: Describing Images with Text; Enriching Text with Images
Tagging and Organizing
- Classification (MCS, Data Partitioning, Imbalanced Classes)
- Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
Matching and Ranking
- LO Similarity and Ranking
- Association Rules / Social Networks
- Reinforcement Learning
- Specialized / Personalized Search
4 Types of Data in LORNET
[Diagram: LORNET systems and the data each one holds]
- LCMS (Course, Module, Lesson, LO): Subject Matter: Text, Images, Flash, Applets, Metadata, Interaction Logs
- TELOS (Resource): Resources: Metadata, Semantic References
- Semantic Layer
- Discussion Board (Board, Thread, Post): Discussions: Text, Interaction Logs
- LOR (Metadata Record): LO Descriptors: Metadata
5 LO Mining Scenarios
Task Environment | Knowledge Extraction | Tagging / Organizing | Matching / Ranking
TELOS | Ontology Construction | Grouping Components | Finding / Ranking Components
E-Learning Design Environment (LMS) | Extracting LO Summary; Extracting LO Concepts; Extracting Image Description | Grouping LOs | Finding Similar LOs; Ranking LOs
Learning Object Content MS (LCMS) | Summarizing Documents; Extracting Concepts from Documents | Grouping Documents; Tagging Documents | Finding Similar Topics; Finding Similar Profiles; Building Social Networks; Detecting Plagiarism
LO Repository | Extracting Metadata; Extracting Ontologies | Classifying LOs; Building LO Clusters | Detecting Duplicate LOs; Ranking LOs; Metadata Matching
6 LO Mining and Knowledge Extraction
7 Projects Overview
Information Extraction: Analyzing content to extract relevant information
- Keyword Extraction
- Summarization
- Concept Extraction
- Social Network Analysis
Categorization: Organizing LOs according to their content
- Classification: Traditional, MCS, Imbalanced
- Clustering: Traditional, Ensembles, Distributed
Personalization: Providing user-specific results
- Reinforcement Learning: Traditional, Opposition-based
Image Mining: Describing and finding relevant images
- CBIR: Traditional, Fusion-based
Integration and Applications
Software Components
In Progress
Publications
Theme and Industry Collaboration
8 Information Extraction: Summarization
LO Content Package Summarization
- Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
- Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
- Top relevant sentences are extracted for each document.
- Planned functionality: Summarization of whole modules or lessons (as opposed to single documents).
- Benefits
- Provide summarized overview of learning objects for quick browsing and access to learning material.
- Scenarios
- Learning Management Systems can call the summarization component to produce summaries for content packages.
Data courtesy of the University of Saskatchewan
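The term weighting and sentence ranking steps above can be sketched as follows. This is a minimal illustration, not the Theme 4 component: it assumes TF-IDF weights computed over the collection and scores each sentence by the mean weight of its terms. The function names (`idf_table`, `summarize`) and the naive tokenizer are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased alphabetic tokens; a real system would also drop stop words.
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def idf_table(documents):
    # Inverse document frequency computed over the whole collection.
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def summarize(document, idf, top_k=2):
    # Score each sentence by the mean TF-IDF weight of its terms,
    # then return the top_k sentences in their original order.
    sentences = split_sentences(document)
    tf = Counter(tokenize(document))

    def score(sentence):
        terms = tokenize(sentence)
        if not terms:
            return 0.0
        return sum(tf[t] * idf.get(t, 1.0) for t in terms) / len(terms)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]

docs = [
    "Clustering groups similar documents. It is unsupervised. "
    "Clustering of documents uses similarity.",
    "The weather is nice today.",
]
summary = summarize(docs[0], idf_table(docs), top_k=2)
```

Summarizing a whole module or lesson, listed above as planned functionality, would amount to ranking sentences across several documents instead of one.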
9 Information Extraction: Concept Extraction
Concept-Based Statistical Analyser
F-measure of Hierarchical Clustering
Dataset | Single-Term | Concept-based | Improvement (%)
Reuters | 0.723 | 0.925 | 27.94
ACM | 0.697 | 0.918 | 31.70
Brown | 0.581 | 0.906 | 55.93
Entropy of Hierarchical Clustering
Dataset | Single-Term | Concept-based | Improvement (%)
Reuters | 0.251 | 0.012 | -95.21
ACM | 0.317 | 0.043 | -86.43
Brown | 0.385 | 0.018 | -95.32
Conceptual Ontological Graph (COG) Ranking
Precision of Search
Dataset | Single-Term | Concept-based | Improvement (%)
Cran | 0.536 | 0.901 | 68.09
Reuters | 0.591 | 0.897 | 51.77
Recall of Search Result
Dataset | Single-Term | Concept-based | Improvement (%)
Cran | 0.486 | 0.827 | 70.16
Reuters | 0.452 | 0.841 | 86.06
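The Improvement column appears to be the relative change from the single-term score to the concept-based score, in percent. A short check under that assumption (the function name is illustrative) gives values within about 0.01 of the figures listed; for entropy a negative change is the desirable direction.

```python
def relative_improvement(baseline, improved):
    # Relative change from baseline to improved, expressed in percent.
    return 100.0 * (improved - baseline) / baseline

# (single-term, concept-based) F-measure pairs copied from the table above.
f_measure = {
    "Reuters": (0.723, 0.925),
    "ACM": (0.697, 0.918),
    "Brown": (0.581, 0.906),
}
for name, (single_term, concept_based) in f_measure.items():
    print(name, round(relative_improvement(single_term, concept_based), 2))
```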
10 Information Extraction: Keyword Extraction
- Semantic Keyword Extraction
- Tasks
- Developing tools and techniques to extract semantic keywords toward facilitating metadata generation
- Developing algorithms to enrich metadata (tags) which can be applied in index-based multimedia retrieval
- Progress
- Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo ontology)
- Makrehchi, M. and Kamel, ICDM 07, WI 07
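The slides do not give the formula of the proposed inclusion index. As a stand-in only, the sketch below uses Theil's uncertainty coefficient, a well-known information-theoretic measure that is likewise asymmetric: U(b|a) is the fraction of the uncertainty about term b's presence that is removed by knowing whether term a is present. The toy documents and function names are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(counts, total):
    # Shannon entropy (bits) of a distribution given as raw counts.
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def uncertainty_coefficient(docs, a, b):
    # U(b|a) = I(a; b) / H(b), computed on term presence per document.
    # NOTE: a stand-in measure, not the index from the cited papers.
    n = len(docs)
    h_a = entropy(Counter(a in d for d in docs).values(), n)
    h_b = entropy(Counter(b in d for d in docs).values(), n)
    h_ab = entropy(Counter((a in d, b in d) for d in docs).values(), n)
    if h_b == 0:
        return 0.0
    return (h_a + h_b - h_ab) / h_b

# Toy collection: each document is a set of terms.
docs = [
    {"animal", "dog"}, {"animal", "dog"}, {"animal", "cat"}, {"animal"},
    {"plant"}, {"plant", "tree"}, {"mineral"}, {"plant"},
]
animal_to_dog = uncertainty_coefficient(docs, "animal", "dog")
dog_to_animal = uncertainty_coefficient(docs, "dog", "animal")
```

The two directions give different values, and that asymmetry is the property term selection and taxonomy extraction can exploit.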
11 Information Extraction: Keyword Extraction
Rule-based Keyword Extraction
- Learn rules to find keywords in English sentences
- Rules represent sentence fragments
- Specific enough for reliable keyword extraction
- General enough to be applied to unseen sentences
- Rule generalization
- Begin with an exact sentence fragment
- Merge with another by moving different words to the lowest common level in the part-of-speech hierarchy
- Keep merged rule if it does not reduce precision and recall of keyword extraction; keep original rules otherwise
- Keyword extraction
- Find sequence of rules that best cover an unseen sentence
- Extract keywords according to rules
- Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
- Learns 20 rules from the first 50 training rules
- Learns 13 additional rules from the next 220 training rules
- Both precision and recall values increase during training
- Precision (blue) increases 10
- Recall (red) shows slight upward trend
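The merge step of rule generalization can be sketched as below. The part-of-speech hierarchy here is a tiny hypothetical one, and rules are modeled simply as equal-length token tuples; the actual rule format and hierarchy are not given in the slides. The precision/recall check that decides whether a merged rule is kept is omitted.

```python
# Hypothetical part-of-speech hierarchy: child -> parent.
PARENT = {
    "cat": "NOUN", "dog": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
    "NOUN": "WORD", "VERB": "WORD",
    "WORD": None,
}

def ancestors(token):
    # The token itself followed by its chain of parents up to the root.
    chain = []
    while token is not None:
        chain.append(token)
        token = PARENT.get(token)
    return chain

def lowest_common(a, b):
    # Lowest node of the hierarchy that covers both tokens.
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return "WORD"  # unrelated tokens fall back to the root

def merge_rules(rule_a, rule_b):
    # Merge two fragments position by position; identical words stay as-is,
    # differing words move up to their lowest common level.
    if len(rule_a) != len(rule_b):
        return None
    return tuple(lowest_common(x, y) for x, y in zip(rule_a, rule_b))

merged = merge_rules(("the", "cat", "runs"), ("the", "dog", "runs"))
```

Here `merged` keeps "the" and "runs" and generalizes the middle slot to NOUN, so the rule also covers unseen sentences with other nouns in that position.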
12 Categorization: Ensemble-based Clustering
- Consensus Clustering
- Categorization of learning objects using proposed consensus clustering algorithms.
- The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
- Consensus clustering can offer several advantages over a single data clustering, such as improving clustering accuracy, enhancing the scalability of clustering algorithms to large volumes of data objects, and enhancing robustness by reducing the sensitivity to outlier data objects or noisy attributes.
- Tasks
- Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.
- Development of consensus algorithms to aggregate the individual clusterings.
- Development of solutions for the cluster symbolic-label matching problem.
- Empirical analysis on real-world data and validation of the proposed method.
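One common way to aggregate an ensemble, shown here only as an illustration and not as the consensus algorithms proposed in this project, is a co-association matrix: count how often each pair of objects shares a cluster, then group pairs that agree in a majority of clusterings. Because only co-membership is used, the symbolic-label matching problem does not arise. The threshold value and function names are assumptions.

```python
from itertools import combinations

def co_association(clusterings):
    # Fraction of clusterings in which objects i and j share a cluster.
    n = len(clusterings[0])
    co = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / len(clusterings)
    return co

def consensus(clusterings, threshold=0.5):
    # Link objects whose co-association exceeds the threshold and return
    # the connected components as consensus labels (union-find).
    co = co_association(clusterings)
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition, different label names
    [0, 0, 1, 1, 2, 2],  # a noisier clustering
]
result = consensus(ensemble)
```

The first two clusterings use opposite label names for the same partition, yet the consensus recovers the two groups despite the third, noisier clustering.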
13 Categorization using cluster ensemble
Dataset | Samples | Attributes | Classes | Mean Error Rate in K-means (%) | Mean Error Rate in Ensembles (%)
Synthetic1 | 1000 | 8 | 5 | 17.41 | 0
Yahoo! (text) | 2340 | 1458 | 6 | 38.23 | 16.24
Texture (image) | 5500 | 40 | 11 | 37.99 | 11.54
Optical Digit Recognition | 500 | 64 | 10 | 27.31 | 16.40
14 Categorization: Distributed Clustering
Hierarchical P2P Document Clustering
- Peer nodes are arranged into groups called neighborhoods.
- Multiple neighborhoods are formed at each level of the hierarchy.
- The size of each neighborhood is determined through a network partitioning factor.
- Each neighborhood has a designated supernode.
- Supernodes of level h form the neighborhoods for level h+1.
- Clustering is done within neighborhood boundaries, then is merged up the hierarchy through the supernodes.
- Benefits
- Significant speedup over centralized clustering and flat peer-to-peer clustering.
- Multiple levels of clusters.
- Distributed summarization of clusters using CorePhrase keyphrase extraction.
- Scenarios
HP2PC Architecture
HP2PC Example: 3-level network, 16 nodes
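How neighborhoods and supernodes stack into a hierarchy can be sketched as follows. This is a structural illustration only: it assumes a fixed `neighborhood_size` in place of the network partitioning factor (whose exact definition is not given in the slides) and picks the first node of each neighborhood as its supernode. It does not perform any clustering or merging.

```python
def build_hierarchy(nodes, neighborhood_size):
    # Returns a list of levels; each level is a list of neighborhoods,
    # and each neighborhood is a list of node ids.
    if neighborhood_size < 2:
        raise ValueError("neighborhood_size must be at least 2")
    current = list(nodes)
    if not current:
        raise ValueError("at least one node is required")
    levels = []
    while True:
        neighborhoods = [current[i:i + neighborhood_size]
                         for i in range(0, len(current), neighborhood_size)]
        levels.append(neighborhoods)
        if len(neighborhoods) == 1:
            break
        # Assumption: the first node of each neighborhood is its supernode;
        # the supernodes of this level form the neighborhoods of the next.
        current = [group[0] for group in neighborhoods]
    return levels

levels = build_hierarchy(range(16), neighborhood_size=4)
```

With 16 nodes and neighborhoods of 4, level 0 holds four neighborhoods and their four supernodes form a single neighborhood at the next level. Clustering results would flow upward along the same path.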
15 Categorization: Multiple Classifier Systems
- Progress
- Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
- Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
- Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
- Investigated the applications of proposed training methods (CDS and CO-CDS) on LO classification.
- Tasks
- To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)
- To develop evaluation measures in order to estimate various types of cooperation in the system
- To gain insight into the impact of changes in the cooperative components with respect to system performance using the proposed evaluation measures
- To apply these findings to optimize existing ensemble methods
- To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computational complexity
16 Categorization: Imbalanced Class Distribution
- Objective
- Advance classification of multi-class imbalanced data
- Tasks
- To develop cost-sensitive boosting algorithm AdaC2.M1
- To improve the identification performance on the important classes
- To balance classification performance among several classes
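The slides do not spell out AdaC2.M1. The sketch below shows one round of a cost-sensitive weight update in the style of AdaC2, where each sample's misclassification cost multiplies its weight outside the exponent; treat the exact form as an assumption rather than the authors' algorithm. The function name and the toy costs are hypothetical.

```python
import math

def cost_sensitive_round(weights, costs, correct):
    # weights: current sample weights (summing to 1)
    # costs:   per-sample misclassification costs (higher = more important)
    # correct: whether the current weak classifier got each sample right
    good = sum(c * w for w, c, ok in zip(weights, costs, correct) if ok)
    bad = sum(c * w for w, c, ok in zip(weights, costs, correct) if not ok)
    if good == 0 or bad == 0:
        raise ValueError("need both correct and incorrect samples")
    # Classifier weight from the cost-weighted correct/incorrect mass.
    alpha = 0.5 * math.log(good / bad)
    # Raise the weight of misclassified samples, lower the rest,
    # scaling every sample by its cost, then renormalize.
    raw = [c * w * math.exp(-alpha if ok else alpha)
           for w, c, ok in zip(weights, costs, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

# Four samples; the last one belongs to an important class (cost 2)
# and is the only one misclassified in this round.
alpha, new_weights = cost_sensitive_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    costs=[1.0, 1.0, 1.0, 2.0],
    correct=[True, True, True, False],
)
```

The costly misclassified sample ends the round with a larger share of the weight than plain AdaBoost would give it, which pushes later weak classifiers toward the important class.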
17 Categorization: Imbalanced Class Distribution
Class Distribution
Class | Size | Dist. (%)
C1 | 49 | 7.84
C2 | 288 | 46.08
C3 | 288 | 46.08
Performance of Base Classification and AdaBoost (R: recall, P: precision, F: F-measure)
Class | Meas. | C4.5 Base | C4.5 AdaBoost | HPWR (Od3) Base | HPWR (Od3) AdaBoost
C1 | R | 0 | 5.11 | 10.70 | 44.06
C1 | P | N/A | 6.5 | 11.82 | 32.89
C1 | F | N/A | 5.84 | 10.83 | 35.84
C2 | R | 73.21 | 92.28 | 88.31 | 87.43
C2 | P | 69.53 | 88.75 | 86.79 | 91.99
C2 | F | 72.29 | 90.38 | 87.43 | 89.64
C3 | R | 67.94 | 91.36 | 87.63 | 88.42
C3 | P | 73.89 | 87.88 | 87.07 | 89.91
C3 | F | 71.91 | 89.42 | 86.99 | 89.03
G-measure | | 0 | 11.46 | 33.32 | 68.50
Balanced performance among classes, evaluated by G-mean
Class | Meas. | C4.5 Base | C4.5 AdaBoost | C4.5 AdaC2.M1 | HPWR (Od3) Base | HPWR (Od3) AdaBoost | HPWR (Od3) AdaC2.M1
C1 | R | 0 | 5.11 | 77.58 | 10.70 | 44.06 | 65.72
C1 | P | N/A | 6.50 | 14.12 | 11.82 | 32.89 | 30.83
C2 | R | 73.21 | 92.28 | 64.73 | 88.31 | 87.43 | 83.12
C2 | P | 69.53 | 88.75 | 97.24 | 86.79 | 91.99 | 91.38
C3 | R | 67.94 | 91.36 | 65.23 | 87.63 | 88.42 | 83.95
C3 | P | 73.89 | 87.88 | 93.22 | 87.07 | 89.91 | 90.81
G-mean | | 0 | 11.46 | 68.42 | 33.32 | 68.50 | 76.08
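G-mean is commonly defined as the geometric mean of the per-class recall values, so a single class with zero recall drives it to zero (as in the C4.5 Base column). A minimal sketch under that definition follows. Note that applying it to the listed recalls does not reproduce the non-zero table entries exactly (about 68.9 versus 68.42 for C4.5 with AdaC2.M1), so the slide's figures were presumably aggregated differently, for instance per run.

```python
import math

def g_mean(recalls):
    # Geometric mean of per-class recall values (any consistent scale).
    if not recalls:
        raise ValueError("at least one recall value is required")
    if any(r <= 0 for r in recalls):
        return 0.0  # one completely missed class zeroes the measure
    return math.exp(sum(math.log(r) for r in recalls) / len(recalls))

base_c45 = g_mean([0, 73.21, 67.94])         # C4.5 Base recalls
adac2_c45 = g_mean([77.58, 64.73, 65.23])    # C4.5 AdaC2.M1 recalls
```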
18 Personalization
- Opposition-based Reinforcement Learning for Personalizing Image Search
- Developing a reliable technique to assist users and to facilitate and enhance the learning process
- The personalized ORL tool helps the user find the images he or she wants among the search results
- The personalized tool gathers images from the search results and selects a sample of them
- By interacting with the user and presenting the sample, it learns the user's preferences
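The idea behind opposition-based reinforcement learning can be sketched with a stateless value update: besides updating the action that was taken, the agent also updates the opposite action as if it had earned the opposite reward, so it learns from each user interaction twice. The image categories, the opposite-action mapping and the reward scheme below are assumptions for illustration; the tool's actual states, actions and rewards are not given in the slides.

```python
def orl_update(q, action, reward, opposite, learning_rate=0.5):
    # Usual update for the action actually shown to the user.
    q[action] += learning_rate * (reward - q[action])
    # Opposition-based extra update: assume the opposite action
    # would have received the opposite reward.
    opp = opposite[action]
    q[opp] += learning_rate * (-reward - q[opp])
    return q

# Hypothetical setup: two kinds of images the tool can show.
q_values = {"photo": 0.0, "diagram": 0.0}
opposite = {"photo": "diagram", "diagram": "photo"}

# The user rejects a sample of photos (reward -1).
orl_update(q_values, "photo", reward=-1.0, opposite=opposite)
preferred = max(q_values, key=q_values.get)
```

After a single rejection the agent already prefers diagrams, even though no diagram has been shown yet.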
19 Personalization
20 Image Mining: CBIR
- Content-based image retrieval
- Build an IR system that can retrieve images based on textual cues, image content, and NL queries
Image Retrieval Tool Set
- Query Image (QI)
- Query Text (QT)
- Query Document
[Diagram labels: images, Rich Documents]
21 Illustrative Example
[Figure labels: IZM, FD, MTAR, The proposed approach; accuracy values 55, 70, 95, 60]
22 Experimental Results (Contd)
The performance of the proposed approach
23 Integration and Applications
- Progress
- Finished core parts of the common data mining framework.
- Built components and services from Theme researchers' work around the data mining framework.
- Provided documentation for the data mining framework and software components.
- Launched web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/
24 Integration and Applications
- Progress
- Core parts of the common data mining framework are available, including:
- Vector and matrix manipulation.
- Document parsing and tokenization.
- Statistical term and sentence analysis.
- Similarity calculation using multiple distance functions.
- IMS Content Package compliant parser.
- Components and tools built around the common data mining framework:
- Metadata extraction from single documents; supports Dublin Core encoding.
- Document similarity calculation using cosine similarity.
- Single document and content package summarization.
- Building of standard text datasets from large document collections.
- Integration with TELOS
- Developed C TELOS connector for integrating Theme 4 components.
- Worked on component manifest specification with Theme 6.
- Provided metadata extraction as part of a complete scenario for TELOS components integration.
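Document similarity using cosine similarity can be sketched as below, assuming raw term-frequency vectors and whitespace tokenization. The framework's own parser, term weighting and API are not shown in the slides, so the function here is purely illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Term-frequency vectors from lowercased, whitespace-split tokens.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over shared terms, divided by the vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

score = cosine_similarity("data mining tools", "data mining")
```

Identical documents score 1 and documents with no shared terms score 0, which makes the measure usable for ranking similar LOs.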
25 Industry Collaboration
- Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
- Vestech provided opportunities for researchers to work on speech technologies.
- Desire2Learn opened job opportunities for LORNET researchers.
26 Software Components
Overview of Components
- General Tools
- C Connector for TELOS
- Common Data Mining Framework
- Standard Text Mining Tools
- Metadata Extractor
- Document Summarizer
- Content Package Summarizer
- Document Similarity
- LO Recommender
- Metadata Harvester
- Keyword Extractor
- Taxonomy Extractor
- Metadata Enrichment Tools
- Concept-based and Semantic Text Mining Tools
- Metadata Extractor
- LO Search Engine
- Document Similarity
- User-centric Tools
- Personalized Search Engine
- Social Network Learner
- Image Mining Tools
- Content-based Image Search
- Personalized Image Search
- Consensus-based Fusion for Image Retrieval
- Legend: Integrated, Ready, In Progress, Year 5
Scenarios for Use of Software Components
Environment: TELOS
- Tasks: Ontology construction and unification; Finding relations between components; Ranking components; Grouping components; Tagging components
Environment: Learning Object Repository
- Data Types: Metadata; Structured Text; Categorical
- Tasks: Automatic metadata extraction; LO automatic classification; LO organization through clustering; Multiple organization strategies through cluster ensembles
Environment: e-Learning Environment
- Data Types: Structured Text; Images; Object Relationships; Context
- Tasks: Extracting concepts from LO; Summarizing Documents; Grouping LOs; Tagging LOs; Discovering Similar Topics; Discovering Similar Peers; Building Social Networks; Detecting Plagiarism; LO recommendation using similarity ranking; Personalization / Specialization through reinforcement learning
27 Publications
Project | Papers (accepted / published) | Papers (submitted / in prep) | Theses (completed / in progress)
4.1 Information Extraction from Text | 11 | 7 | 3/2
4.2 Semantic Knowledge Synthesis from Text | 10 | 4 | 4/1
4.3 Knowledge Discovery through Categorization | 12 | 10 | 4/1
4.4 Knowledge from Interaction | 8 | 3 | 1/2
4.5 Knowledge from Image Mining | 10 | 3 | 2/1
Total | 51 | 27 | 14/7 (21 in total)
28 Theme 4 Team
Leader: M. Kamel
- PIs
- Dr. Basir
- Dr. Tizhoosh
- Dr. Karray
- Assoc. PIs (Wong, DiMarco)
- Researchers
- H. Ayad
- R. Kashef
- A. Ghazel
- Dr. Makrehchi
- M. Shokri
- S. Hassan
- A. Farahat
- Dr. R. Khoury
- Funding
- CRC/CFI/OIT
- NSERC
- PAMI Lab
- PDS
- Vestech
- Desire2Learn
- Graduated
- R. Khoury, PhD 07
- L. Chen, PhD 07
- M. Makrehchi, PhD 07
- K. Hammouda, PhD 07
- R. Dara, PhD 07
- Y. Sun, PhD 07
- K. Shaban, PhD 06
- Y. Sun, PhD 06
- M. Hussin, PhD 05
- Jan Bakus, PhD 05
- A. Adegorite, MASc 04
- A. Khandani, MASc 05
- S. Podder, MASc 04
29 Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering, University of Waterloo, Canada
www.pami.uwaterloo.ca
- www.pami.uwaterloo.ca/projects/lornet/software/
- www.pami.uwaterloo.ca/kamel.html (publications)