Title: Contributions to MiningMart
1Contributions to MiningMart
- Petr Berka
- Laboratory for Intelligent Systems
- University of Economics, Prague
- berka_at_vse.cz
2University of Economics, Prague
- LISp - Laboratory for Intelligent Systems
- SALOME - Laboratory for Multidisciplinary
Approaches to Decision-making Support in
Economics and Management
3LISp research
- probabilistic methods
- decomposable probability models and Bayesian networks
- symbolic ML methods
- 4FT association rules and decision rules
- logical calculi for knowledge discovery in databases
4LISp activities
- Organized conferences
- ECML97, PKDD99
- Organized workshops
- Discovery Challenge (PKDD99, PKDD2000, PKDD2001), WUPES97, WUPES2000
- International Projects
- MLNet, Sol-Eu-Net, EUNITE, MUM, MGT
- KDNet
5SALOME research
- Quantitative and AI (pattern recognition, fuzzy,
neural nets) approaches to support of decision
making in economics and management
6SALOME activities
- Organized workshops
- STIPR97, MME99
- International Projects
- Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge
7LISp software
- LISp-Miner (data mining system)
- DataSource (for data manipulation)
- 4FT Miner (4FT association rules) and
- KEX (decision rules)
- experimental software for building graphical models
- preprocessing procedures
- related to KEX
- based on information theoretic approach
8LISp-Miner procedures
- DataSource
- creating new (virtual) attributes using SQL
- equidistant and equifrequent discretization
- grouping attribute values
- computing attribute-value frequencies
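
The two discretization modes and the frequency count listed above can be sketched in Python. This is only an illustration, not the DataSource implementation: the function names and the sample salaries are hypothetical, and the handling of values that fall exactly on a cut point (via `bisect_right`) is an assumption.

```python
from bisect import bisect_right
from collections import Counter

def equidistant_cuts(values, bins):
    """Cut points splitting [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def equifrequent_cuts(values, bins):
    """Cut points placing roughly the same number of values in each interval."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]

def discretize(value, cuts):
    """0-based index of the interval that `value` falls into."""
    return bisect_right(cuts, value)

# Hypothetical salaries (in thousands), skewed by two large values.
salaries = [8, 9, 10, 11, 12, 13, 14, 28, 40]

eq_dist = equidistant_cuts(salaries, 4)
eq_freq = equifrequent_cuts(salaries, 3)

# Attribute-value frequencies of the derived (discretized) attribute.
dist_counts = Counter(discretize(v, eq_dist) for v in salaries)
freq_counts = Counter(discretize(v, eq_freq) for v in salaries)
```

On skewed data such as this, equal-width intervals leave most values in the first interval, while equal-frequency intervals spread them evenly.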
9LISp-Miner procedures
- 4FT-Miner (GUHA procedure)
- 4FT association rules in the form
- Ant ≈ Suc / Cond
- KEX
- weighted decision rules in the form
- Ant ⇒ C (weight)
10 4FT-Miner basic idea
- Generate a (potential) rule, e.g.
- COLOUR(red) ∧ SIZE(small) ≈0.9,20 TEMP(high)
- AGE(21-30) ∧ SALARY(low) ≈0.85,15 PAYMENTS(High) ∧ LOAN(bad)
- Verify the rule using a four-fold table
11 4FT-Miner
Data matrix:
CLIENTS                             LOANS
Id      Age  Sex  Salary  District  Amount  Payment  Months  Quality
1       45   F    28 000  Prague    48 000  1 000    48      good
...     ...  ...  ...     ...       ...     ...      ...     ...
70 000  18   M    12 000  Brno      36 000  2 000    18      bad

Problem: Are there segments of clients SC and segments of loans SL such that
- to be in SC is at 90% equivalent to having a loan from SL, and
- there are at least 100 such clients?

"Ant is at 90% equivalent to Suc":
Ant ≈0.90,100 Suc is true iff a/(a+b+c) ≥ 0.9 ∧ a ≥ 100

Four-fold table:
        Suc   ¬Suc
Ant     a     b
¬Ant    c     d

- a: number of objects satisfying Ant and Suc
- b: number of objects satisfying Ant and not satisfying Suc
- c: number of objects not satisfying Ant and satisfying Suc
- d: number of objects satisfying neither Ant nor Suc
12 4FT-Miner
- Input
- Data matrix
- quantifier ≈0.90,100
- Derived attributes for SC (possible Ant): Age (7 values), Sex (2 values), Salary (3 values), District (77 values)
- Derived attributes for SL (possible Suc): Amount (6 values), Duration (5 values), Quality (2 values)
- Output
- All Ant ≈0.90,100 Suc true in the data matrix
- (5 equivalences from about 5 million possible relations)
- an example
- Age(20-30) ∧ Sex(F) ∧ Salary(low) ∧ District(Prague) ≈0.90,100 Amount<20,50) ∧ Quality(Bad)

Four-fold table for this example:
        Suc   ¬Suc
Ant     950   30
¬Ant    20    69000

a/(a+b+c) = 950/(950+30+20) = 0.95 ≥ 0.9 ∧ a = 950 ≥ 100
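
The verification step can be sketched in Python: count the four frequencies a, b, c, d over the data matrix, then test the condition a/(a+b+c) ≥ p and a ≥ Base. This is a minimal sketch, not the 4FT-Miner implementation; the function names, the use of Python callables for Ant and Suc, and the four toy rows are assumptions made for illustration.

```python
def four_fold_table(rows, ant, suc):
    """Count (a, b, c, d) for predicates `ant` and `suc` over `rows`."""
    a = b = c = d = 0
    for row in rows:
        in_ant, in_suc = ant(row), suc(row)
        if in_ant and in_suc:
            a += 1
        elif in_ant:
            b += 1
        elif in_suc:
            c += 1
        else:
            d += 1
    return a, b, c, d

def quantifier_holds(a, b, c, d, p=0.9, base=100):
    """True iff a/(a+b+c) >= p and a >= base (d does not enter the test)."""
    denom = a + b + c
    return denom > 0 and a / denom >= p and a >= base

# Frequencies from the example four-fold table on the slide.
example_ok = quantifier_holds(950, 30, 20, 69000)

# Hypothetical toy rows, only to show how the table is counted.
toy_rows = [
    {"age": 25, "quality": "bad"},
    {"age": 25, "quality": "good"},
    {"age": 45, "quality": "bad"},
    {"age": 45, "quality": "good"},
]
toy_table = four_fold_table(
    toy_rows,
    ant=lambda r: r["age"] < 30,
    suc=lambda r: r["quality"] == "bad",
)
```

With the slide's frequencies the ratio is 950/1000 = 0.95, so the rule is accepted; the toy table has one row in each cell and fails both conditions.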
13KEX basic idea
- Generate a (potential) rule, e.g.
- YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
- If the rule refines the current set of rules
- (validity a/(a+b) differs from the weight inferred during consultation)
- add it to the rule base with the proper weight
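
The refinement test can be illustrated with a small Python sketch. Several things here are assumptions not stated on the slide: the PROSPECTOR-style pseudo-Bayesian combination function x ⊕ y = xy / (xy + (1-x)(1-y)), the fixed `tolerance` used in place of whatever significance test the system applies, and all function names.

```python
def combine(x, y):
    """Assumed pseudo-Bayesian combination of two weights in (0, 1)."""
    return x * y / (x * y + (1 - x) * (1 - y))

def refines(a, b, inferred, tolerance=0.05):
    """Does validity a/(a+b) differ from the weight inferred so far?

    `tolerance` is a hypothetical stand-in for a statistical test.
    """
    validity = a / (a + b)
    return abs(validity - inferred) > tolerance

def weight_for(validity, inferred):
    """Weight w such that combine(w, inferred) equals `validity`."""
    return combine(validity, 1 - inferred)

# Hypothetical candidate rule: a = 30, b = 10, so validity is 0.75,
# while the current rule base infers 0.6 for the same antecedent.
a, b, inferred = 30, 10, 0.6
if refines(a, b, inferred):
    new_weight = weight_for(a / (a + b), inferred)
```

Under these assumptions the new rule gets a weight of about 0.667, which combined with the inferred 0.6 gives back the observed validity of 0.75.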
14KEX - classification
15KEX - learning
16LISp-Miner architecture
[Architecture diagram; labelled components: MetaData (ODBC ACCESS), Data (ODBC ACCESS), LM Windows, Results]
17Preprocessing (LISp)
- KEX-oriented
- (fuzzy) discretization, grouping of values
- computing the amount of noise in data
- random sampling, balancing of data
- handling missing values
- Information theory
- attribute selection
- attribute grouping
18 fuzzy discretization
19 amount of noise
- Amount of noise: 20%
- max. possible accuracy: 80%
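
One way to read the relation between noise and maximum accuracy is sketched below in Python. It assumes (the slide does not say so) that noise means contradictory examples: identical attribute values with different classes. For each distinct attribute vector only the majority class can be predicted correctly, so the remaining examples bound the achievable accuracy. The function name and the data are hypothetical.

```python
from collections import Counter, defaultdict

def noise_amount(examples):
    """Share of examples no classifier can get right on this data.

    `examples` is a list of (attribute_tuple, class_label) pairs.
    """
    by_attrs = defaultdict(Counter)
    for attrs, label in examples:
        by_attrs[attrs][label] += 1
    # Within each group of identical attribute vectors, at most the
    # majority class can be classified correctly.
    correct = sum(max(counts.values()) for counts in by_attrs.values())
    return (len(examples) - correct) / len(examples)

# Hypothetical data: two attribute vectors, each with one contradiction.
examples = (
    [(("low", "F"), "bad")] * 4 + [(("low", "F"), "good")] * 1
    + [(("high", "M"), "good")] * 4 + [(("high", "M"), "bad")] * 1
)
noise = noise_amount(examples)
max_accuracy = 1 - noise
```

Here 2 of 10 examples contradict the majority of their group, giving a noise amount of 20% and a maximum possible accuracy of 80%, the same pair of numbers as on the slide.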
20 data sampling
- random split into training and testing set
- select random stratified sample
- balance unbalanced classes
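
The sampling operations above can be sketched with the Python standard library. This is an illustrative sketch, not the LISp preprocessing code: the function names, the 30% test share, and the choice of undersampling as the balancing method are assumptions.

```python
import random
from collections import defaultdict

def group_by_class(examples, label_of):
    by_class = defaultdict(list)
    for ex in examples:
        by_class[label_of(ex)].append(ex)
    return by_class

def stratified_split(examples, label_of, test_share=0.3, seed=0):
    """Random train/test split that keeps class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_class(examples, label_of).values():
        members = members[:]          # do not shuffle the caller's list
        rng.shuffle(members)
        cut = round(len(members) * test_share)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test

def balance_by_undersampling(examples, label_of, seed=0):
    """Sample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = group_by_class(examples, label_of)
    size = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, size))
    return balanced

# Hypothetical unbalanced data: 90 good loans, 10 bad loans.
data = [("good", i) for i in range(90)] + [("bad", i) for i in range(10)]
train, test = stratified_split(data, label_of=lambda ex: ex[0])
balanced = balance_by_undersampling(data, label_of=lambda ex: ex[0])
```

Oversampling the minority class would be an equally valid reading of "balance"; undersampling is used here only because it is the shorter sketch.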
21 handling missing values
- remove example
- substitute missing with new value
- substitute missing with majority value
- proportional substitution
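
The four strategies can be sketched for a single categorical attribute in Python. The function names, the `"unknown"` placeholder value, and the reading of "proportional substitution" as drawing a replacement at random with probability proportional to the observed value frequencies are assumptions, not taken from the slides.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing value

def remove_examples(rows, attr):
    return [r for r in rows if r[attr] is not MISSING]

def substitute_new_value(rows, attr, new_value="unknown"):
    return [dict(r, **{attr: new_value}) if r[attr] is MISSING else r
            for r in rows]

def substitute_majority(rows, attr):
    counts = Counter(r[attr] for r in rows if r[attr] is not MISSING)
    majority = counts.most_common(1)[0][0]
    return [dict(r, **{attr: majority}) if r[attr] is MISSING else r
            for r in rows]

def substitute_proportionally(rows, attr, seed=0):
    rng = random.Random(seed)
    counts = Counter(r[attr] for r in rows if r[attr] is not MISSING)
    values, weights = zip(*counts.items())
    # Draw each replacement with probability proportional to its frequency.
    return [dict(r, **{attr: rng.choices(values, weights)[0]})
            if r[attr] is MISSING else r
            for r in rows]

# Hypothetical rows with one missing value.
rows = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": MISSING}]
removed = remove_examples(rows, "sex")
as_new = substitute_new_value(rows, "sex")
as_majority = substitute_majority(rows, "sex")
as_proportional = substitute_proportionally(rows, "sex")
```

Each function returns a new list and leaves the input rows unchanged, so the strategies can be compared on the same data.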
22 information theory
- Attribute selection - based on mutual information
- Attribute grouping - based on information content
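
Attribute selection by mutual information can be sketched as follows. The formula is the standard definition I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ), estimated from counts; the function name, the column names, and the data are hypothetical, and the slides do not say which estimator the LISp procedures use.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (nxy / n) * log2(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in pxy.items()
    )

# Hypothetical columns: Salary determines Quality, Sex is unrelated to it.
columns = {
    "Sex": ["F", "M", "F", "M"],
    "Salary": ["low", "low", "high", "high"],
}
quality = ["bad", "bad", "good", "good"]

mi_salary = mutual_information(columns["Salary"], quality)
mi_sex = mutual_information(columns["Sex"], quality)

# Rank attributes from most to least informative about the class.
ranking = sorted(
    columns,
    key=lambda name: mutual_information(columns[name], quality),
    reverse=True,
)
```

In this toy case Salary carries one full bit of information about Quality and Sex carries none, so Salary is ranked first.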
23Preprocessing architecture
[Architecture diagram; labelled components: Input data (ASCII), procedure, Output data (ASCII), Data (ASCII), procedure, Results]
24SALOME software
- Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition)
- feature selection
- approximation-based modeling
- classification
- a consulting system helping to choose the
most suitable method is being developed
25Search strategies for FS
- Search for a subset maximizing a criterion function (distance, divergence)
- with a priori information
- exhaustive search
- branch and bound based algorithms
- floating search algorithms
- without a priori information
- approximation method
- divergence method
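
The simplest strategy in the list, exhaustive search, can be sketched in Python: evaluate the criterion function on every subset of the requested size and keep the best. This is not the Feature Selection Toolbox code. The criterion below is a made-up stand-in for a distance or divergence measure, and the feature names and scores are hypothetical.

```python
from itertools import combinations

def exhaustive_search(features, subset_size, criterion):
    """Return the subset of `subset_size` features maximizing `criterion`."""
    return max(combinations(features, subset_size), key=criterion)

# Hypothetical criterion: sum of individual scores, minus a penalty
# when two redundant features are selected together.
scores = {"age": 0.5, "salary": 0.6, "district": 0.3}
redundant = {frozenset({"age", "salary"}): 0.4}

def criterion(subset):
    total = sum(scores[f] for f in subset)
    for pair, penalty in redundant.items():
        if pair <= set(subset):
            total -= penalty
    return total

best = exhaustive_search(["age", "salary", "district"], 2, criterion)
```

The two individually best features are not the best pair here because they are redundant, which is why subset search is needed at all. Exhaustive search grows combinatorially with the number of features, which motivates the branch and bound and floating search algorithms named above.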
26FST architecture
[Architecture diagram; labelled components: Data (ASCII), FST Windows, Results]
27References
- LISp-Miner
- Berka, P., Ivanek, J. Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In (Bergadano, deRaedt eds.) Proc. ECML'94, Springer 1994, 339-342.
- Berka, P., Rauch, J. Data Mining using GUHA and KEX. In (Callaos, Yang, Aguilar eds.) 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol. 2, 238-244.
- Rauch, J. Classes of Four Fold Table Quantifiers. In (Zytkow, Quafafou eds.) Principles of Data Mining and Knowledge Discovery. Springer 1998, 203-211.
28References
- Preprocessing
- Bruha, I., Berka, P. Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In Szczepaniak, Lisboa, Kacprzyk (eds.) Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.
- Pudil, P., Novovicová, J. Novel Methods for Subset Selection with Respect to Problem Knowledge. IEEE Transactions on Intelligent Systems - Special Issue on Feature Transformation and Subset Selection, 1998, 66-74.
- Zvarova, J., Studeny, M. Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), no. 1-2, pp. 65-74.