1. Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems".
2. A search engine with SOM-based document set representation
3. Map visualizations in 3D (BEATCA)
4.
- The preparation of documents is done by an indexer, which turns a document into a vector-space model representation.
- The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.
- Subsequently, dictionary optimization is performed: extreme-entropy and extremely frequent terms are excluded.
- The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.
- The best map (with respect to some similarity measure) is used by the query processor in response to the user's query.
5. Document model in search engines
- In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains.
[Figure: the example sentences "My dog likes this food" and "When walking, I take some food" drawn as vectors in a space with axes dog, food, walk]
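As a quick illustration of the vector model, the sketch below (not from the slides; the toy tokenizer and prefix matching stand in for the real indexer) turns the two example sentences into count vectors over the terms dog, food, walk.

```python
# A minimal sketch of the vector-space model for the figure's two sentences.
from collections import Counter

terms = ["dog", "food", "walk"]          # dimensions of the document space
docs = ["My dog likes this food",
        "When walking, I take some food"]

def to_vector(text, terms):
    # Crude prefix matching stands in for real stemming/indexing.
    counts = Counter(w.lower() for w in text.split())
    return [sum(c for w, c in counts.items() if w.startswith(t)) for t in terms]

for d in docs:
    print(to_vector(d, terms), "<-", d)
# [1, 1, 0] <- My dog likes this food
# [0, 1, 1] <- When walking, I take some food
```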
6. Clustering document vectors
[Figure: document vectors x in document space (left) assigned to cells of a 2D map (right); a strong change of position is marked with a thick arrow]
An important difference from general clustering: not only does each cluster contain similar documents, but neighboring clusters are also similar to each other.
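This neighborhood property comes from the self-organizing map (SOM) update rule; the sketch below illustrates the general SOM idea (toy data, Gaussian neighborhood, assumed learning-rate schedule), not BEATCA's actual map creator.

```python
# Minimal SOM sketch: nearby map cells are pulled toward the same documents,
# which is why adjacent clusters on the map end up similar.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((200, 3))              # toy document vectors (term weights)
grid = rng.random((5, 5, 3))             # 5x5 map of cell reference vectors

for step in range(1000):
    x = docs[rng.integers(len(docs))]
    # find the best-matching cell for this document
    d = np.linalg.norm(grid - x, axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # neighborhood-weighted update: cells near the winner move toward x too
    ii, jj = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
    dist2 = (ii - bi) ** 2 + (jj - bj) ** 2
    lr = 0.5 * (1 - step / 1000)         # decaying learning rate (assumed)
    h = np.exp(-dist2 / 2.0)[..., None]  # Gaussian neighborhood kernel
    grid += lr * h * (x - grid)
```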
7. Our problem
- Instability
- Pre-defined major themes needed
Our approach:
- Find a coarse clustering into a few themes
8. Bayesian Networks in Document Clustering
- SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
- Latent Semantic Indexing based methods appear to be promising for this purpose.
- One of them, PLSA, has been empirically investigated.
- A modification to the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
9. A Bayesian Network
- Represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph: P(X1, ..., Xn) = ∏i P(Xi | parents(Xi)).
- High compression, simplification of reasoning.
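A toy illustration of this factorization (all numbers assumed) for a small network Z → T1, Z → T2:

```python
# Joint probability as a product of child-given-parent conditionals.
p_z = {0: 0.4, 1: 0.6}                              # P(Z)
p_t1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(T1 | Z)
p_t2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}   # P(T2 | Z)

def joint(z, t1, t2):
    # P(Z, T1, T2) = P(Z) * P(T1 | Z) * P(T2 | Z) along the DAG
    return p_z[z] * p_t1[z][t1] * p_t2[z][t2]

# The full joint over 3 binary variables needs 7 free numbers;
# the network stores only 1 + 2 + 2 = 5 (the "compression").
print(joint(1, 0, 1))   # 0.6 * 0.2 * 0.5 = 0.06
```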
10. BN applications in text processing
- Document classification
- Document clustering
- Query expansion
- ...
11. Hidden variable approaches
- PLSA (Probabilistic Latent Semantic Analysis)
- PHITS (Probabilistic Hyperlink Analysis)
- Combined PLSA/PHITS
- Assumption of a hidden variable expressing the topic of the document.
- The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
12. PLSA - concept
[Figure: Bayesian network with a hidden variable Z pointing to the document node D and to the term nodes T1, T2, ..., Tn]
- Let N be the term-document matrix of word counts, i.e., Nij denotes how often a term (single word or phrase) ti occurs in document dj.
- PLSA performs a probabilistic decomposition into factors zk (1 ≤ k ≤ K): P(ti | dj) = Σk P(ti | zk) P(zk | dj), with non-negative probabilities and two sets of normalization constraints:
- Σi P(ti | zk) = 1 for all k, and
- Σk P(zk | dj) = 1 for all j.
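The decomposition and its constraints can be checked numerically; a small sketch with assumed random factor matrices:

```python
# P(t_i | d_j) = sum_k P(t_i | z_k) P(z_k | d_j) with both constraints.
import numpy as np

K, n_terms, n_docs = 2, 4, 3
rng = np.random.default_rng(1)

P_t_given_z = rng.random((n_terms, K))
P_t_given_z /= P_t_given_z.sum(axis=0)   # sum_i P(t_i | z_k) = 1 for all k
P_z_given_d = rng.random((K, n_docs))
P_z_given_d /= P_z_given_d.sum(axis=0)   # sum_k P(z_k | d_j) = 1 for all j

P_t_given_d = P_t_given_z @ P_z_given_d  # the factored decomposition
print(P_t_given_d.sum(axis=0))           # each column sums to 1: valid P(t|d)
```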
13. PLSA - concept
- PLSA aims at maximizing L = Σi,j Nij log Σk P(ti | zk) P(zk | dj).
- Factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
- The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
- Different factors usually capture distinct "topics" of a document collection.
- By clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
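Dominant-factor clustering is just an argmax over P(zk | dj); a tiny sketch with assumed numbers:

```python
# Assign each document d_j to argmax_k P(z_k | d_j).
import numpy as np

P_z_given_d = np.array([[0.8, 0.1, 0.4],   # P(z_1 | d_j) for j = 1..3
                        [0.2, 0.9, 0.6]])  # P(z_2 | d_j)
clusters = P_z_given_d.argmax(axis=0)      # dominant factor per document
print(clusters)                            # [0 1 1]
```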
14. EM algorithm - step 0
Z randomly initialized
15. EM algorithm - step 1
BN trained
16. EM algorithm - step 2
Z sampled for each record according to the probability distribution P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...
Z sampled from the BN
GOTO step 1 until convergence (Z assignment stable)
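Putting steps 0-2 together, a minimal sketch of this sampling-style EM loop, assuming a Naive Bayes network (Z pointing at each term), binary term occurrences, and Laplace smoothing; these modelling details are assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 6))         # documents x binary term flags
K = 3
z = rng.integers(0, K, size=len(X))           # step 0: Z randomly initialized

for _ in range(50):
    # step 1: train the BN, i.e. estimate P(Z) and P(T_i=1 | Z) by counting
    pz = np.array([(z == k).mean() for k in range(K)])
    pt = np.array([(X[z == k].sum(axis=0) + 1) / ((z == k).sum() + 2)
                   for k in range(K)])        # Laplace-smoothed
    # step 2: sample Z for each record from P(Z | T1=t1, ..., Tn=tn)
    like = np.array([pz[k] * np.prod(np.where(X, pt[k], 1 - pt[k]), axis=1)
                     for k in range(K)]).T    # unnormalized posterior
    post = like / like.sum(axis=1, keepdims=True)
    new_z = np.array([rng.choice(K, p=p) for p in post])
    if np.array_equal(new_z, z):              # until Z assignment stable
        break
    z = new_z
```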
17. The problem
- Too high a number of adjustable variables
- Pre-defined clusters not identified
- Long computation times
- Instability
18. Solution
Our suggestion:
- Use the Naive Bayes "sharp" version: each document is assigned to the most probable class.
We were successful:
- Up to five classes well clustered
- High speed (with 20,000 documents)
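The sharp variant replaces the sampling in step 2 with a hard assignment; a compact sketch (toy posterior numbers assumed):

```python
# "Sharp" Naive Bayes step: take the most probable class instead of sampling.
import numpy as np

post = np.array([[0.7, 0.2, 0.1],   # P(Z=k | document), toy numbers
                 [0.1, 0.3, 0.6]])
z = post.argmax(axis=1)             # hard ("sharp") assignment
print(z)                            # [0 2]
```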
19. Next step
- Naive Bayes assumes document and term independence.
- What if they are in fact dependent?
Our solution:
- The TAN approach
- First we create a BN of terms/documents
- Then we assume there is a hidden variable
- Promising results; a deeper study is needed
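One plausible reading of the term-TAN construction (the Chow-Liu-style maximum-spanning-tree step below is an assumption; the slides only say a BN of terms is created first): link terms by a tree of high mutual-information arcs, then let the hidden variable Z point at every term.

```python
# Sketch: maximum-spanning tree over terms by pairwise mutual information.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))        # documents x binary term flags

def mutual_info(a, b):
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            pab = np.mean((a == va) & (b == vb))
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

# Kruskal-style greedy tree: strongest-dependence edges first, skip cycles
edges = sorted(combinations(range(X.shape[1]), 2),
               key=lambda e: -mutual_info(X[:, e[0]], X[:, e[1]]))
parent = list(range(X.shape[1]))             # union-find forest
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

tree = []
for i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:                             # keep edge only if no cycle
        parent[ri] = rj
        tree.append((i, j))

print(tree)  # term-term arcs; Z would additionally point at every term
```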
20. PLSA - a model with term TAN
[Figure: hidden variable Z with arcs to document nodes D1, D2, ..., Dk and to term nodes T1, ..., T6, which are additionally linked by tree (TAN) arcs]
21. PLSA - a model with document TAN
[Figure: hidden variable Z with arcs to term nodes T1, T2, ..., Ti; in this variant the document nodes are linked by tree (TAN) arcs]