1. Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems".
2. A search engine with SOM-based document set representation
3. Map visualizations in 3D (BEATCA)
4.
- The preparation of documents is done by an indexer, which turns a document into a vector-space model representation.
- The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.
- Subsequently, dictionary optimization is performed: extreme-entropy and extremely frequent terms are excluded.
- The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.
- The best map (with respect to some similarity measure) is used by the query processor in response to the user's query.
5. Document model in search engines
- In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains.
[Figure: the example sentences "My dog likes this food" and "When walking, I take some food" drawn as vectors in a space with axes dog, food, walk]
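As a quick illustration of the vector model, the sketch below (not from the slides; the toy tokenizer and prefix matching stand in for the real indexer) turns the two example sentences into count vectors over the terms dog, food, walk.

```python
# A minimal sketch of the vector-space model for the figure's two sentences.
from collections import Counter

terms = ["dog", "food", "walk"]          # dimensions of the document space
docs = ["My dog likes this food",
        "When walking, I take some food"]

def to_vector(text, terms):
    # Crude prefix matching stands in for real stemming/indexing.
    counts = Counter(w.lower() for w in text.split())
    return [sum(c for w, c in counts.items() if w.startswith(t)) for t in terms]

for d in docs:
    print(to_vector(d, terms), "<-", d)
# [1, 1, 0] <- My dog likes this food
# [0, 1, 1] <- When walking, I take some food
```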
6. Clustering document vectors
[Figure: document vectors x in document space (left) assigned to cells of a 2D map (right); a strong change of position is marked with a thick arrow]
An important difference from general clustering: not only does each cluster contain similar documents, but neighboring clusters are also similar to each other.
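This neighborhood property comes from the self-organizing map (SOM) update rule; the sketch below illustrates the general SOM idea (toy data, Gaussian neighborhood, assumed learning-rate schedule), not BEATCA's actual map creator.

```python
# Minimal SOM sketch: nearby map cells are pulled toward the same documents,
# which is why adjacent clusters on the map end up similar.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((200, 3))              # toy document vectors (term weights)
grid = rng.random((5, 5, 3))             # 5x5 map of cell reference vectors

for step in range(1000):
    x = docs[rng.integers(len(docs))]
    # find the best-matching cell for this document
    d = np.linalg.norm(grid - x, axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # neighborhood-weighted update: cells near the winner move toward x too
    ii, jj = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
    dist2 = (ii - bi) ** 2 + (jj - bj) ** 2
    lr = 0.5 * (1 - step / 1000)         # decaying learning rate (assumed)
    h = np.exp(-dist2 / 2.0)[..., None]  # Gaussian neighborhood kernel
    grid += lr * h * (x - grid)
```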
7. Our problem
- Instability
- Pre-defined major themes needed
Our approach:
- Find a coarse clustering into a few themes
8. Bayesian Networks in Document Clustering
- SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
- Latent Semantic Indexing based methods appear to be promising for this purpose.
- One of them, PLSA, has been empirically investigated.
- A modification to the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
9. A Bayesian Network
- Represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph: P(X1, ..., Xn) = ∏i P(Xi | parents(Xi)).
- High compression, simplification of reasoning.
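A toy illustration of this factorization (all numbers assumed) for a small network Z → T1, Z → T2:

```python
# Joint probability as a product of child-given-parent conditionals.
p_z = {0: 0.4, 1: 0.6}                              # P(Z)
p_t1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(T1 | Z)
p_t2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}   # P(T2 | Z)

def joint(z, t1, t2):
    # P(Z, T1, T2) = P(Z) * P(T1 | Z) * P(T2 | Z) along the DAG
    return p_z[z] * p_t1[z][t1] * p_t2[z][t2]

# The full joint over 3 binary variables needs 7 free numbers;
# the network stores only 1 + 2 + 2 = 5 (the "compression").
print(joint(1, 0, 1))   # 0.6 * 0.2 * 0.5 = 0.06
```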
10. BN applications in text processing
- Document classification
- Document clustering
- Query expansion
- ...
11. Hidden variable approaches
- PLSA (Probabilistic Latent Semantic Analysis)
- PHITS (Probabilistic Hyperlink Analysis)
- Combined PLSA/PHITS
- Assumption of a hidden variable expressing the topic of the document.
- The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
12. PLSA - concept
[Figure: Bayesian network with a hidden variable Z pointing to the document node D and to the term nodes T1, T2, ..., Tn]
- Let N be the term-document matrix of word counts, i.e., Nij denotes how often a term (single word or phrase) ti occurs in document dj.
- PLSA performs a probabilistic decomposition into factors zk (1 ≤ k ≤ K): P(ti | dj) = Σk P(ti | zk) P(zk | dj), with non-negative probabilities and two sets of normalization constraints:
- Σi P(ti | zk) = 1 for all k, and
- Σk P(zk | dj) = 1 for all j.
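The decomposition and its constraints can be checked numerically; a small sketch with assumed random factor matrices:

```python
# P(t_i | d_j) = sum_k P(t_i | z_k) P(z_k | d_j) with both constraints.
import numpy as np

K, n_terms, n_docs = 2, 4, 3
rng = np.random.default_rng(1)

P_t_given_z = rng.random((n_terms, K))
P_t_given_z /= P_t_given_z.sum(axis=0)   # sum_i P(t_i | z_k) = 1 for all k
P_z_given_d = rng.random((K, n_docs))
P_z_given_d /= P_z_given_d.sum(axis=0)   # sum_k P(z_k | d_j) = 1 for all j

P_t_given_d = P_t_given_z @ P_z_given_d  # the factored decomposition
print(P_t_given_d.sum(axis=0))           # each column sums to 1: valid P(t|d)
```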
13. PLSA - concept
- PLSA aims at maximizing L = Σi,j Nij log Σk P(ti | zk) P(zk | dj).
- Factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
- The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
- Different factors usually capture distinct "topics" of a document collection.
- By clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
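Dominant-factor clustering is just an argmax over P(zk | dj); a tiny sketch with assumed numbers:

```python
# Assign each document d_j to argmax_k P(z_k | d_j).
import numpy as np

P_z_given_d = np.array([[0.8, 0.1, 0.4],   # P(z_1 | d_j) for j = 1..3
                        [0.2, 0.9, 0.6]])  # P(z_2 | d_j)
clusters = P_z_given_d.argmax(axis=0)      # dominant factor per document
print(clusters)                            # [0 1 1]
```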
14. EM algorithm - step 0
Z randomly initialized
15. EM algorithm - step 1
BN trained
16. EM algorithm - step 2
Z sampled for each record according to the probability distribution P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...
Z sampled from the BN
GOTO step 1 until convergence (Z assignment stable)
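Putting steps 0-2 together, a minimal sketch of this sampling-style EM loop, assuming a Naive Bayes network (Z pointing at each term), binary term occurrences, and Laplace smoothing; these modelling details are assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 6))         # documents x binary term flags
K = 3
z = rng.integers(0, K, size=len(X))           # step 0: Z randomly initialized

for _ in range(50):
    # step 1: train the BN, i.e. estimate P(Z) and P(T_i=1 | Z) by counting
    pz = np.array([(z == k).mean() for k in range(K)])
    pt = np.array([(X[z == k].sum(axis=0) + 1) / ((z == k).sum() + 2)
                   for k in range(K)])        # Laplace-smoothed
    # step 2: sample Z for each record from P(Z | T1=t1, ..., Tn=tn)
    like = np.array([pz[k] * np.prod(np.where(X, pt[k], 1 - pt[k]), axis=1)
                     for k in range(K)]).T    # unnormalized posterior
    post = like / like.sum(axis=1, keepdims=True)
    new_z = np.array([rng.choice(K, p=p) for p in post])
    if np.array_equal(new_z, z):              # until Z assignment stable
        break
    z = new_z
```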
17. The problem
- Too high a number of adjustable variables
- Pre-defined clusters not identified
- Long computation times
- Instability
18. Solution
Our suggestion:
- Use the Naive Bayes "sharp" version: each document is assigned to the most probable class.
We were successful:
- Up to five classes well clustered
- High speed (with 20,000 documents)
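The sharp variant replaces the sampling in step 2 with a hard assignment; a compact sketch (toy posterior numbers assumed):

```python
# "Sharp" Naive Bayes step: take the most probable class instead of sampling.
import numpy as np

post = np.array([[0.7, 0.2, 0.1],   # P(Z=k | document), toy numbers
                 [0.1, 0.3, 0.6]])
z = post.argmax(axis=1)             # hard ("sharp") assignment
print(z)                            # [0 2]
```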
19. Next step
- Naive Bayes assumes document and term independence.
- What if they are in fact dependent?
Our solution:
- The TAN approach
- First we create a BN of terms/documents
- Then we assume there is a hidden variable
- Promising results; a deeper study is needed
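One plausible reading of the term-TAN construction (the Chow-Liu-style maximum-spanning-tree step below is an assumption; the slides only say a BN of terms is created first): link terms by a tree of high mutual-information arcs, then let the hidden variable Z point at every term.

```python
# Sketch: maximum-spanning tree over terms by pairwise mutual information.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))        # documents x binary term flags

def mutual_info(a, b):
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            pab = np.mean((a == va) & (b == vb))
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

# Kruskal-style greedy tree: strongest-dependence edges first, skip cycles
edges = sorted(combinations(range(X.shape[1]), 2),
               key=lambda e: -mutual_info(X[:, e[0]], X[:, e[1]]))
parent = list(range(X.shape[1]))             # union-find forest
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

tree = []
for i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:                             # keep edge only if no cycle
        parent[ri] = rj
        tree.append((i, j))

print(tree)  # term-term arcs; Z would additionally point at every term
```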
20. PLSA - a model with term TAN
[Figure: hidden variable Z with arcs to document nodes D1, D2, ..., Dk and to term nodes T1, ..., T6, which are additionally linked by tree (TAN) arcs]
21. PLSA - a model with document TAN
[Figure: hidden variable Z with arcs to term nodes T1, T2, ..., Ti; in this variant the document nodes are linked by tree (TAN) arcs]