Title: Xavier Polanco
1Textual Information Clustering and Visualization
for Knowledge Discovery and Management
- Xavier Polanco
- URI-INIST-CNRS
2Introduction
- We are concerned with the design and development of computer-based information analysis tools
- Cluster analysis, computational linguistics, and artificial intelligence techniques are combined
3On the technology side
- A computer-based information analysis system is
- an integrated environment that assists a user
- in carrying out the complex process of converting information from textual data sources into knowledge
4Information Analysis System
[Diagram labels: French or English text-data; Lexicons or terminological resources; Term Extraction and Indexation; Dataset or Corpus; Bibliometric statistics; Clustering and Mapping; DBMS-R; WWW Server; SDOC, HENOCH, NEURODOC, MIRIAD, ILC; Mac, PC, WS]
5Home Pages
Intranet
Extranet
6Plan
- Text Mining
- Cluster Analysis
- Visualization or Mapping
- Knowledge Discovery
- Knowledge Management
7Textual Information
- A large amount of information is available in textual form in databases and online sources
- In this context, manual analysis and effective extraction of useful information are not possible
- Automatic tools are therefore needed for analyzing large textual collections
8Text Mining
- Text mining consists of extracting information from hidden patterns in large text-data collections
- The results can be important both
- for the analysis of the collection, and
- for providing intelligent navigation and browsing methods
9Process
- The text mining process can be organized roughly into five major steps
- Data Selection
- Term Extraction and Filtering
- Data Clustering and Classification
- Mapping or Visualization
- Result Interpretation
- The process is iterative and interactive
10Natural Language Processing
- Experience shows that a linguistic engineering approach ensures higher performance of the data mining algorithms
- Part-of-speech tagging (tagging texts) and lemmatization are generally accepted tasks
11The approach
- Our approach to text mining is based on extracting meaningful terms from documents
- In this presentation, the focus is on the term extraction process, and
- on the need to organize the generated terms in a taxonomy
12The main tasks
- Term extraction or acquisition
- Indexation
- Human control and screening
- Indexing quality control
- Index screening → clustering phase
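The extraction and indexation tasks above can be sketched in a few lines. This is only an illustration, not the system's actual method: the lexicon, the sample sentence, and the function names `tokenize` and `index_document` are hypothetical, and matching is by longest exact lexicon term rather than by part-of-speech patterns or lemmatized forms.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def index_document(text, lexicon, max_len=3):
    """Return the lexicon terms found in text, preferring the longest match."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "apple juice" wins over "apple".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                match = (candidate, n)
                break
        if match:
            found.append(match[0])
            i += match[1]
        else:
            i += 1
    return found

# Hypothetical lexicon and text, chosen only for illustration.
lexicon = {"apple juice", "chromatography", "apple"}
terms = index_document("Chromatography of apple juice and apple pulp.", lexicon)
print(terms)
```

The resulting term list is what a human screener would then review before the clustering phase.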
13Language Engineering
[Diagram labels: Natural Language Engineering System; Lexicons; Text-DB; Indexed Corpus; Lexicons Management and Linguistic Processing of Texts (part-of-speech tagging, lemmatization, and indexation)]
14Variation
15Taxonomy
- A taxonomic structure should improve text mining
- Considering the clustering techniques that might be used in text mining, one must be mindful that more taxonomic classifying capabilities could be incorporated into text mining
- A taxonomic classifying capability might also facilitate cluster interpretation by giving the user some kind of rules
16Clustering
- Clustering is a descriptive task where one seeks to identify a finite set of categories
- Clustering is used to segment a database into subsets or clusters
- Clustering means finding the clusters themselves from a given set of data
17Clustering Process
[Diagram: a data matrix D(n,p) is processed with similarity measures s(x,y) or dissimilarity measures d(x,y) and a clustering algorithm, yielding a cluster matrix C(m,p)]
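As a concrete instance of s(x,y) and d(x,y), the sketch below uses the Jaccard coefficient on binary keyword vectors. The slide does not say which measure is used, so Jaccard is an assumption made here for illustration, and the three vectors are example data.

```python
def jaccard_similarity(x, y):
    """s(x, y): shared keywords divided by keywords present in either vector."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Two empty vectors are treated as identical (a convention, not a rule).
    return both / either if either else 1.0

def jaccard_dissimilarity(x, y):
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - jaccard_similarity(x, y)

# Illustrative binary document vectors over six keywords.
doc_a = [1, 0, 1, 0, 1, 1]
doc_b = [0, 1, 0, 1, 0, 0]
doc_c = [1, 0, 0, 1, 0, 1]

s_ac = jaccard_similarity(doc_a, doc_c)     # 2 shared / 5 present
s_ab = jaccard_similarity(doc_a, doc_b)     # no shared keyword
d_bc = jaccard_dissimilarity(doc_b, doc_c)  # 1 - 1/4
print(s_ac, s_ab, d_bc)
```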
18Documents × Keywords
     KW1  KW2  KW3  KW4  KW5  KW6
D1    1    0    1    0    1    1
D2    1    0    1    0    1    1
D3    0    1    0    1    0    0
D4    1    0    0    1    0    1
Di × KWj ∈ {1, 0} or Di × KWj ∈ {1, 2, ..., n}
C1 = (D1, D2; KW1, KW3, KW5, KW6)
C2 = (D4; KW1, KW4, KW6)
C3 = (D3; KW2, KW4)
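The three clusters listed for this matrix can be reproduced by grouping documents that share exactly the same keyword profile. The slide does not state which algorithm produced C1 to C3, so this grouping rule is an assumption that merely happens to give the same result on this small example.

```python
from collections import defaultdict

keywords = ["KW1", "KW2", "KW3", "KW4", "KW5", "KW6"]
matrix = {
    "D1": [1, 0, 1, 0, 1, 1],
    "D2": [1, 0, 1, 0, 1, 1],
    "D3": [0, 1, 0, 1, 0, 0],
    "D4": [1, 0, 0, 1, 0, 1],
}

# Group documents whose rows are identical (assumed rule, see lead-in).
groups = defaultdict(list)
for doc, row in matrix.items():
    profile = tuple(kw for kw, value in zip(keywords, row) if value)
    groups[profile].append(doc)

clusters = [(docs, list(profile)) for profile, docs in groups.items()]
for docs, profile in clusters:
    print(docs, profile)
```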
19Clustering Algorithms
- Major families of clustering methods
- Sequential algorithms
- Hierarchical algorithms
- Agglomerative algorithms
- Divisive algorithms
- Fuzzy clustering algorithms
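To make one of these families concrete, here is a minimal sketch of an agglomerative (bottom-up hierarchical) algorithm. The single-link merging rule, the Jaccard distance, the stopping threshold, and the four example documents are all assumptions chosen for illustration; the slides do not specify them.

```python
def jaccard_distance(x, y):
    """1 minus the share of keywords common to both binary vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either if either else 0.0

def agglomerative(points, distance, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[name] for name in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Illustrative binary document vectors over six keywords.
points = {
    "D1": [1, 0, 1, 0, 1, 1],
    "D2": [1, 0, 1, 0, 1, 1],
    "D3": [0, 1, 0, 1, 0, 0],
    "D4": [1, 0, 0, 1, 0, 1],
}
tight = agglomerative(points, jaccard_distance, threshold=0.5)
loose = agglomerative(points, jaccard_distance, threshold=0.7)
print(tight)
print(loose)
```

Raising the threshold lets more distant documents merge, which is the usual way to move up the hierarchy.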
20Information Analysis Process
- The text-data information analysis is divided into two phases
- Cluster generation
- Map display of clusters
- A hypertext user interface enables the analyst to explore and interpret results
21Example
[Figure labels: Antibiotic Resistance; Data: 2 DB, 4025 documents (1998-1999); Medicine; Molecular Biology; 30; Clusters; Map; Hypertext]
22Information Visualization
- Definition: The use of computer-supported, interactive, visual representation of abstract data to amplify the acquisition or use of knowledge (Card et al., 1999)
- Visual artifacts aid human thought
- The progress of civilization can be read in the invention of visual artifacts, from writing to mathematics, to maps, to diagrams, to visual computing
23Process
- Raw Data → Data Tables
- Data Tables → Clustering
- Clustering → Visual Structures (Map)
- Visual Structures → Views
24Visual Structures
- Data Tables are mapped to Visual Structures, which augment a spatial substrate with marks and graphical properties to encode information
- A Graphic Representation is said to be expressive if all and only the data in the Data Table are also represented in the Visual Structure
- A Graphic Representation is said to be more effective if it is faster to interpret
25Map Display
- We are concerned with map display of the clusters
- A problem of particular interest is how to visualize a data set with many variables
- Multivariate data are clustered, and
- Clusters are mapped
26Mapping tools
- For mapping, we use the following techniques
- Density and Centrality Diagrams
- Principal Component Analysis (PCA)
- Multi-Layer Perceptrons (MLP)
- Self-Organizing Maps (SOM)
- Multi-SOMs
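As an illustration of one technique from this list, the sketch below projects multivariate rows onto their first two principal components to obtain 2D map coordinates. It is not the system's implementation: computing the components by power iteration with deflation is a choice made here to stay within the standard library, and the input rows are example data.

```python
def pca_2d(rows, iterations=200):
    """Project rows onto their first two principal components."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Covariance matrix of the centered variables (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]

    def leading_eigenvector(matrix):
        # Power iteration from a fixed, non-uniform, normalized start vector.
        v = [float(k + 1) for k in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        return v

    components = []
    for _ in range(2):
        v = leading_eigenvector(cov)
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                         for a in range(p))
        components.append(v)
        # Deflation: subtract the component just found before the next pass.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(p)]
               for a in range(p)]
    return [[sum(c[j] * comp[j] for j in range(p)) for comp in components]
            for c in centered]

# Illustrative binary document vectors over six keywords.
rows = [
    [1, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
]
coords = pca_2d(rows)
for point in coords:
    print([round(value, 3) for value in point])
```

Identical rows land on the same map position, and the sign of each axis is arbitrary.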
27Multi-Layer Perceptron 1
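[Figure: multi-layer perceptron with inputs x1, ..., xi, ..., xp, a two-unit hidden layer, and outputs s1, ..., sk, ..., sp; weight matrices Wc(p,2) with entries Wcij and Ws(2,p) with entries Wsjk. Map labels: ISEs-x2, prion, proteins, scrapie, human disease, spongiform encephalopathy, mankind, CJD]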
28Multi-Layer Perceptron 2
[Figure: map labels: protein, infection resistance, Agrobacterium, plasmids]
29Multi-SOM Platform
30Multi-Self-Organizing Map Display
Maps associated with 5 viewpoints
- Map 1: Plants
- Map 2: Plant Parts
- Map 3: Pathogen Agents
- Map 4: Genetic Techniques
- Map 5: Patenting Firms
[Figure: use of the inter-map communication mechanism, with the Rice area activated; maps 1, 2, 4, and 5 are labeled]
31Knowledge Discovery
- KD is informally defined as the extraction of useful knowledge from databases or large amounts of data
- One of the most important research topics in KD is rule discovery or extraction
- The discovered knowledge is usually expressed in the form of if-then rules
32Association Rules
- Association rules can be seen as one of the key tasks of KDD
- The intuitive meaning of an association rule X → Y, where X and Y are keywords or descriptors, is that a document set containing keyword X is likely to also contain keyword Y
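"Likely" is usually quantified with support and confidence. These two measures are not named on the slide, so treating them as the intended ones is an assumption; the four-document corpus below is invented for illustration.

```python
def support(documents, items):
    """Fraction of documents that contain every keyword in items."""
    items = set(items)
    return sum(1 for doc in documents if items <= doc) / len(documents)

def confidence(documents, x, y):
    """Share of the documents containing x that also contain y.

    Assumes at least one document contains x (otherwise division by zero).
    """
    return support(documents, set(x) | set(y)) / support(documents, x)

# Hypothetical keyword sets, one per document.
documents = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography", "sugar"},
    {"apple juice", "sugar"},
    {"wine", "chromatography"},
]
rule_support = support(documents, {"apple juice", "chromatography"})
rule_confidence = confidence(documents, {"apple juice"}, {"chromatography"})
print(rule_support, rule_confidence)
```

A rule is typically kept only when both values exceed thresholds chosen by the analyst.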
33Example
- In a given food-industry corpus
- 98% of the documents concerned with apple juice relate it to the chromatography analytic technique
- X → Y: apple juice → chromatography
34The Galois Lattice
- Our current research includes an approach based on the lattice structure to discover concepts and rules relating the objects (documents) and their properties (keywords)
- The Galois lattice approach is also known as conceptual clustering
35The concept lattice
Given the context (D1, T1) where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}
Table: The input relation R (documents × keywords)
R    t1  t2  t3  t4  t5  t6
d1    1   0   1   0   1   1
d2    1   0   1   0   1   1
d3    0   1   0   1   0   0
d4    1   0   0   1   0   1
Hasse Diagram
C1 = (D1, Ø)
C2 = ({d1, d2, d4}, {t1, t6})
C3 = ({d3, d4}, {t4})
C4 = ({d1, d2}, {t1, t3, t5, t6})
C5 = ({d4}, {t1, t4, t6})
C6 = ({d3}, {t2, t4})
C7 = (Ø, T1)
The formal concept C4 has two own terms {t3, t5} and two inherited terms {t1, t6}
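The seven formal concepts can be recomputed from the relation R by closing every subset of documents. This brute-force enumeration is only a sketch suitable for a four-document context; it is not presented as the algorithm used in the project, and the function names `intent` and `extent` are chosen here for illustration.

```python
from itertools import combinations

# The input relation R, written as the set of terms of each document.
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
all_terms = set().union(*context.values())

def intent(docs):
    """Terms shared by every document in docs (all terms if docs is empty)."""
    result = set(all_terms)
    for d in docs:
        result &= context[d]
    return result

def extent(terms):
    """Documents that carry every term in terms."""
    return {d for d, ts in context.items() if terms <= ts}

concepts = set()
for size in range(len(context) + 1):
    for docs in combinations(sorted(context), size):
        # Closing any document subset yields a formal concept (extent, intent).
        shared = intent(docs)
        concepts.add((frozenset(extent(shared)), frozenset(shared)))

# Print from the top of the lattice (largest extent) to the bottom.
for ext, itt in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(ext), sorted(itt))
```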
36Association Rules Extraction
- The formal concept C4 makes the following rules possible
- R1: t3 → t1 ∧ t6
- R2: t5 → t1 ∧ t6
- R3: t3 ↔ t5
- Interpretation of R1 and R2: the use of terms t3 or t5 is always associated with that of terms t1 and t6
- The rule R3 expresses mutual equivalence of the terms t3 and t5: all the documents which have the term t3 also have the term t5, and conversely
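These rules can be checked directly against the document-keyword relation. The sketch below tests whether an implication holds in every document; the helper name `holds` is hypothetical, and the context repeats the four documents of the lattice example.

```python
# Terms of each document in the lattice example.
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}

def holds(context, premise, conclusion):
    """True if every document with all premise terms has all conclusion terms."""
    return all(conclusion <= terms
               for terms in context.values() if premise <= terms)

r1 = holds(context, {"t3"}, {"t1", "t6"})
r2 = holds(context, {"t5"}, {"t1", "t6"})
# R3 is an equivalence, so both directions must hold.
r3 = holds(context, {"t3"}, {"t5"}) and holds(context, {"t5"}, {"t3"})
# The converse of R1 fails: d4 has t1 and t6 but not t3.
converse_r1 = holds(context, {"t1", "t6"}, {"t3"})
print(r1, r2, r3, converse_r1)
```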
37Summary
Text Mining
Clustering
Mapping
Knowledge Discovery
38Knowledge Management
- A knowledge management system is concerned with the identification, acquisition, development, diffusion, use, and preservation of the enterprise's knowledge
39KM Objectives
- Using advanced technology
- For facilitating creation, access, and reuse of knowledge
- For converting knowledge from the sources accessible to an organization and connecting people with that knowledge
40Project
- Adding to the information analysis system a formalized operator for processing together
- The knowledge that is extracted from databases
- The knowledge that the experts produce when they analyze the clusters, maps, concepts and rules
41We have reached our last subject, but not the end!
42Merci
Gracias
Obrigado
Thanks
Xavier Polanco