Title: Decision Making with Uncertainty and Data Mining
1Decision Making with Uncertainty and Data Mining
- David L. Olson, University of Nebraska
- Desheng Wu, University of Science and Technology of China
- ADMA05, Wuhan, China, 22-24 July 2005
2Decision Making under Uncertainty
- Uncertainty exists in data
- Imprecise data
- Missing data
- Human subjectivity
- Fuzzy set theory
- A means to reflect uncertainty
- Grey related analysis (interval vague)
- A type of fuzzy set data
3Monte Carlo Simulation
- Analytic models preferred
- But simulation needed if
- High levels of uncertainty make analytic models too messy to calculate
- High levels of complexity make analytic models intractable
4Fuzzy Simulation
- Fuzzy input often expressed in trapezoidal form
- Minimum, range of most likely, maximum
- Triangular, interval special cases
- Can be analyzed through Monte Carlo
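A minimal sketch of how one trapezoidal input (minimum, range of most likely, maximum) might be sampled inside a Monte Carlo run. It treats the membership function as an unnormalized probability density, which is an assumption rather than something the slides state, and the function name `sample_trapezoid` is hypothetical.

```python
import random

def sample_trapezoid(a, b, c, d, rng=random):
    """Draw one value from a trapezoidal density on [a, d] with plateau [b, c].

    Assumes a <= b <= c <= d and a < d. Triangular (b == c) and
    interval (a == b, c == d) inputs are handled as special cases.
    """
    # Areas of the rising ramp, the plateau, and the falling ramp (height 1).
    left, mid, right = (b - a) / 2.0, (c - b), (d - c) / 2.0
    total = left + mid + right
    u = rng.random() * total
    if u < left:
        # Rising ramp: inverse CDF of a linearly increasing density.
        return a + (b - a) * (u / left) ** 0.5
    if u < left + mid:
        # Plateau: uniform between b and c.
        return b + (u - left)
    if right == 0:
        # No falling ramp (interval case): the upper bound is the only option.
        return d
    # Falling ramp: mirror image of the rising ramp.
    return d - (d - c) * ((total - u) / right) ** 0.5
```

Calling `sample_trapezoid(0.4, 0.5, 0.7, 0.8)` repeatedly gives values between 0.4 and 0.8, concentrated on the 0.5 to 0.7 plateau.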
5Fuzzy Distribution Forms
6Grey Related Analysis
- Deng 1982
- Means to incorporate uncertainty
- Incomplete or unknown elements
- Interval numbers
- Standardize through norms
- Transform index values through product operations
- Minimize distance to ideal, max from nadir
- Simple, practical
- Does not require large sample sizes; nonparametric
7Demonstration MCDM
- MultiCriteria Decision Making
- Modern decision making complex
- Need to balance tradeoffs among conflicting criteria (attributes, objectives, goals)
- Fuzzy MCDM
- Alternative scores on each criterion uncertain
- Measures of weights vary across group members
8Implementations of Fuzzy Multiattribute Idea
- Fuzzy theory
- Dubois & Prade 1980
- Rough sets
- Pawlak 1982
- Grey sets
- Interval analysis: Moore 1966, 1979
- Deng 1982
- Vague sets: Gau & Buehrer 1993
- Probability theory
- Pearl 1988
9PROMETHEE
- J.P. Brans, P. Vincke, B. Mareschal (Belgium)
- basically a workable ELECTRE
- PROMETHEE I partial order
- PROMETHEE II full ranking
- GAIA graphical (concordance analysis)
10criteria scales
- I: 0 if indifferent or worse, 1 if better
- II: 0 if not better by parameter q, 1 if better by more than q
- III: d is the degree to which one alternative is better than another
- 0 if not better by parameter q
- d/p if between q and p, 1 if d > p
- IV: step; 0 if d < q, 0.5 if q < d < p, 1 if d > p
- V: slope
- VI: normal
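The six criterion types can be sketched as one preference function. This is a minimal illustration following the commonly cited textbook forms of the PROMETHEE preference functions, which may differ in detail from the wording on the slide (for example, type III is coded here as the plain V-shape d/p with no indifference threshold). The function name `preference` and its parameter names are hypothetical.

```python
import math

def preference(d, kind, q=0.0, p=1.0, sigma=1.0):
    """Preference degree in [0, 1] for a difference d = f(a) - f(b).

    kind is the criterion type "I".."VI"; q is the indifference threshold,
    p the strict-preference threshold, sigma the spread of the normal form.
    """
    if kind == "I":      # usual: any improvement counts fully
        return 1.0 if d > 0 else 0.0
    if kind == "II":     # quasi: the difference must exceed q
        return 1.0 if d > q else 0.0
    if kind == "III":    # V-shape: linear up to p
        return 0.0 if d <= 0 else min(d / p, 1.0)
    if kind == "IV":     # step (level)
        if d <= q:
            return 0.0
        return 0.5 if d <= p else 1.0
    if kind == "V":      # slope between q and p
        return 0.0 if d <= q else min((d - q) / (p - q), 1.0)
    if kind == "VI":     # normal (Gaussian)
        return 0.0 if d <= 0 else 1.0 - math.exp(-d * d / (2.0 * sigma * sigma))
    raise ValueError(f"unknown criterion type: {kind}")
```

For instance, with q = 0.1 and p = 0.5, a difference of 0.3 gives 0.5 under both type IV (middle step) and type V (halfway up the slope).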
11Promethee Criteria
- II INTERVAL
- III TRIANGULAR
- V TRAPEZOIDAL
- Promethee doesn't use a value function
- But demonstrates the incorporation of fuzzy input
into MCDM
12Demo Model
- Group Decision
- Conservative, Liberal, Business
- Energy Options
- S1 Nuclear
- S2 Coal
- S3 Conservation
- S4 Import
- Criteria
- C1 Cost (minimize)
- C2 Pollution (minimize)
- C3 Risk of catastrophe (minimize)
- C4 Energy Independence (maximize)
13Weights for each group member: Trapezoidal (grey related)
              C1 Cost                   C2 Pollution              C3 Risk                   C4 Independence
Conservative  (0.4, 0.5, 0.7, 0.8)      (0, 0, 0.05, 0.1)         (0, 0.03, 0.05, 0.15)     (0.05, 0.1, 0.15, 0.25)
Liberal       (0.05, 0.1, 0.15, 0.2)    (0.2, 0.4, 0.5, 0.6)      (0.2, 0.3, 0.4, 0.6)      (0.03, 0.05, 0.1, 0.15)
Business      (0.25, 0.27, 0.29, 0.3)   (0.12, 0.15, 0.2, 0.25)   (0.16, 0.2, 0.25, 0.3)    (0.25, 0.3, 0.35, 0.4)
14Cost Scores for each group member: Trapezoidal
              S11 Nuclear               S12 Coal                  S13 Conserve              S14 Import
Conservative  (0, 0.05, 0.1, 0.2)       (0.3, 0.4, 0.5, 0.7)      (0.6, 0.75, 0.85, 0.9)    (0.6, 0.7, 0.75, 0.8)
Liberal       (0.3, 0.5, 0.6, 0.8)      (0.5, 0.6, 0.7, 0.9)      (0.6, 0.7, 0.85, 0.95)    (0.6, 0.7, 0.8, 0.9)
Business      (0.4, 0.5, 0.6, 0.7)      (0.7, 0.75, 0.85, 0.9)    (0.8, 0.9, 0.95, 1.0)     (0.75, 0.8, 0.85, 0.9)
15Method: Wu, Olson, Liang
- Use grey related analysis
- Inputs are uncertain
- Use alpha-cut method to convert trapezoidal into interval
- Simulate
- Very complex preference model
- Know distribution of uncertainty
- Possibility that different alternatives may turn
out to be preferred
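A small sketch of the alpha-cut step, assuming the usual definition for a trapezoidal fuzzy number (a, b, c, d): the cut at level alpha is the interval of values whose membership is at least alpha. The function name `alpha_cut` is hypothetical, and the slides do not say which alpha level was used.

```python
def alpha_cut(trapezoid, alpha):
    """Return the (low, high) interval of a trapezoidal fuzzy number at level alpha.

    trapezoid is (a, b, c, d) with a <= b <= c <= d; alpha lies in [0, 1].
    alpha = 0 gives the full support [a, d]; alpha = 1 gives the plateau [b, c].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a, b, c, d = trapezoid
    return (a + alpha * (b - a), d - alpha * (d - c))

# Conservative member's weight on C1 Cost, taken from the weights table.
low, high = alpha_cut((0.4, 0.5, 0.7, 0.8), 0.5)
```

At alpha = 0.5 this weight becomes approximately the interval 0.45 to 0.75, halfway up each side of the trapezoid.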
16Simulation Output
              Nuclear  Coal  Conserve  Import
Conservative  0        0     0.99      0.01
Liberal       0.07     0.34  0.39      0.20
Business      0.24     0.36  0.19      0.21
Consensus     0.02     0.18  0.47      0.07
17Monte Carlo Simulation of Grey Related Data
- Given interval data
- Draw uniform random number
- Assume the value at that proportion of the way from minimum to maximum
- Do this for every interval number
- These become crisp numbers for this sample
- Calculate outcomes
- Value
- Get probabilistic picture of outcomes in complex
system involving uncertainty (grey related
intervals)
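The steps above can be sketched as a short loop. This is an illustration only: the value model is a plain weighted sum used as a placeholder, not the grey related model itself, and the names `draw_crisp` and `probability_of_best` are hypothetical.

```python
import random
from collections import Counter

def draw_crisp(intervals, rng):
    """Replace each (low, high) interval with low + u * (high - low), u ~ U(0, 1)."""
    return [lo + rng.random() * (hi - lo) for lo, hi in intervals]

def probability_of_best(scores, weights, runs=1000, seed=0):
    """Share of runs in which each alternative has the highest value.

    scores:  {alternative: [(low, high) per criterion]}
    weights: [(low, high) per criterion]
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(runs):
        w = draw_crisp(weights, rng)  # one crisp weight vector per run
        # Placeholder value model: weighted sum of the sampled crisp scores.
        value = {
            name: sum(wi * si for wi, si in zip(w, draw_crisp(iv, rng)))
            for name, iv in scores.items()
        }
        wins[max(value, key=value.get)] += 1
    return {name: wins[name] / runs for name in scores}
```

The returned shares are the "probabilistic picture": an alternative that is best in 410 of 1,000 runs gets 0.41.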
18Demonstration: Olson & Wu
- Hiring decision
- Multiple criteria, Six applicants
- Criteria
- C1 Experience in business
- C2 Experience in function
- C3 Education
- C4 Leadership
- C5 Adaptability
- C6 Age
- C7 Aptitude for Teamwork
19Alternative Performance Matrix
          C1-bus    C2-funct  C3-educ   C4-lead   C5-adapt  C6-age    C7-team
Antonio   .65-.85   .75-.95   .25-.45   .45-.85   .05-.45   .45-.75   .75-1.0
Fabio     .25-.45   .05-.25   .65-.85   .30-.65   .30-.75   .05-.25   .05-.45
Alberto   .45-.65   .20-.80   .65-.85   .50-.80   .35-.90   .20-.45   .75-1.0
Fernando  .85-1.0   .35-.75   .65-.85   .15-.65   .30-.70   .45-.80   .35-.70
Isabel    .50-.95   .65-.95   .45-.65   .65-.95   .05-.50   .45-.80   .50-.90
Rafaela   .65-.85   .15-.35   .45-.65   .25-.75   .05-.45   .45-.80   .10-.55
20Grey Related Weights
Criteria Weights
C1 Experience-Business 0.20-0.35
C2 Experience-Job Function 0.30-0.55
C3 Educational Background 0.05-0.30
C4 Leadership Capacity 0.25-0.50
C5 Adaptability 0.15-0.45
C6 Age 0.05-0.30
C7 Aptitude for Teamwork 0.25-0.55
21Grey Related data
- Weights interval
- Scores interval
- Used Grey Related model to identify best for each simulation run
- Best average weighted distance to reference point
- Reflect both min to ideal, max from nadir
- Ran 1,000 replications for each of 10 seeds
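The slides do not give the exact grey related formula, so the sketch below swaps in a simpler TOPSIS-style relative closeness to illustrate the idea of scoring by weighted distance to an ideal point and from a nadir point. The function name `closeness` is hypothetical, and all criteria are assumed to be larger-is-better.

```python
def closeness(scores, weights):
    """Relative closeness in [0, 1] for each alternative (1 = at the ideal).

    scores:  {alternative: [crisp score per criterion]}
    weights: [crisp weight per criterion]
    """
    columns = list(zip(*scores.values()))
    ideal = [max(col) for col in columns]   # best observed value per criterion
    nadir = [min(col) for col in columns]   # worst observed value per criterion
    result = {}
    for name, row in scores.items():
        # Weighted city-block distances to the two reference points.
        d_ideal = sum(w * (i - s) for w, i, s in zip(weights, ideal, row))
        d_nadir = sum(w * (s - n) for w, s, n in zip(weights, row, nadir))
        total = d_ideal + d_nadir
        # If every alternative is identical, no one is closer than another.
        result[name] = d_nadir / total if total else 0.5
    return result
```

In a simulation, `scores` and `weights` would be the crisp values drawn for one run, and the alternative with the largest closeness would be counted as that run's winner.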
22Probabilities of Best
                  Antonio  Fabio  Alberto  Fernando  Isabel  Rafaela
Crisp Grey        -        -      -        -         X       -
Interval     avg  0.358    0      0.189    0.047     0.410   0
             min  0.336    0      0.168    0.040     0.384   0
             max  0.393    0      0.210    0.053     0.429   0
Trapezoidal  avg  0.354    0      0.189    0.044     0.409   0
             min  0.328    0      0.171    0.035     0.382   0
             max  0.381    0      0.206    0.051     0.424   0
23Implications
- Crisp Grey Related
- Isabel is the best choice
- Antonio very close
- Alberto, Fernando not far back
- SIMULATION
- Isabel's probability of being best is 0.41
- Antonio 0.35, Alberto 0.19, Fernando 0.05
- Fabio and Rafaela never won
- Simulation provides better picture
24Simulation of Grey Related Data in Data Mining
- Decision tree analysis (PolyAnalyst)
- Real credit card data
- 1,000 observations (900 train 100 test)
- 140 default, 860 no problem
- 65 available explanatory variables (used 26)
- Due to imbalance, initial models degenerate
- Called all test cases OK
- Differential cost models also degenerate
- Called all test cases default
25Fuzzified Data
- Of 26 explanatory variables
- 5 binary
- 1 categorical
- 20 continuous
- Fuzzified into 3 categories each
- Case by case, roughly equally sized categories
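One simple way to obtain three roughly equally sized categories is rank-based tercile binning, sketched below. This is an assumed approach for illustration; the slides say the cut points were chosen case by case, and the function name `fuzzify_terciles` is hypothetical.

```python
def fuzzify_terciles(values):
    """Map each value to 'low', 'mid', or 'high' using rank-based terciles.

    Categories are as equal in size as the count allows. Tied values may
    fall on either side of a boundary because only the rank is used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices, smallest first
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = ("low", "mid", "high")[3 * rank // n]
    return labels

labels = fuzzify_terciles([5, 1, 9, 3, 7, 11])
# labels == ['mid', 'low', 'high', 'low', 'mid', 'high']
```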
26Decision Tree Models
- Minimum support minimum of 1
- PolyAnalyst allowed
- Optimistic split of criteria
- Pessimistic split of criteria
- Different decision tree model each run
27Continuous Data Output
- Varied degree of perturbation (uncertainty)
- Continuous Data
- Many models overlapping
- Three unique decision trees
- Used a total of 8 explanatory variables
- Categorical Data
- Four unique decision trees
- Used a total of 7 explanatory variables
28Continuous Model 1
- Bal/Pay ratio < 6.44 NO
- Bal/Pay ratio >= 6.44
  - Utilization < 1.54 Default
  - Utilization >= 1.54
    - AvgPayment < 3.91 NO
    - AvgPayment >= 3.91 Default
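Continuous Model 1 can be written as a small rule function. This sketch assumes each branch that has lost its comparison sign in the slide text is the "greater than or equal" complement of the branch before it; the function and argument names are hypothetical.

```python
def continuous_model_1(bal_pay_ratio, utilization, avg_payment):
    """Classify one account as 'NO' (no problem) or 'Default'."""
    if bal_pay_ratio < 6.44:
        return "NO"
    # Bal/Pay ratio >= 6.44 (assumed complement branch)
    if utilization < 1.54:
        return "Default"
    # Utilization >= 1.54 (assumed complement branch)
    return "NO" if avg_payment < 3.91 else "Default"
```

For example, an account with a Bal/Pay ratio of 7.0, utilization of 2.0, and average payment of 3.0 follows the last branch and is classified "NO".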
29Continuous Model 2
- Bal/Pay ratio < 6.44 NO
- Bal/Pay ratio >= 6.44 Default
30Continuous Model 3
- Bal/Pay ratio < 6.44 NO
- Bal/Pay ratio >= 6.44
  - Utilization < 1.54 Default
  - Utilization >= 1.54
    - AvgRevolvePay < 2.28 Default
    - AvgRevolvePay >= 2.28 NO
31Categorical Model 1
- Bal/Pay ratio high
  - CreditLine high
    - CalcIntRate I mid NO
    - CalcIntRate I NOT mid Default
  - CreditLine NOT high Default
- Bal/Pay ratio NOT high NO
32Categorical Model 2
- Bal/Pay ratio high
  - CreditLine low
    - ChangeLine mid
      - PurchBal low Default
      - PurchBal NOT low NO
    - ChangeLine low NO
    - ChangeLine high Default
  - CreditLine high
    - CalcIntRate I mid NO
    - CalcIntRate I NOT mid Default
  - CreditLine mid Default
- Bal/Pay ratio NOT high NO
33Categorical Model 3
- Bal/Pay ratio high Default
- Bal/Pay ratio NOT high NO
34Categorical Model 4
- Bal/Pay ratio high
  - CreditLine low
    - ChangeLine mid
      - PurchBal low Default
      - PurchBal NOT low NO
    - ChangeLine low NO
      - Residence 0 Default
      - Residence 1 or 2 NO
    - ChangeLine high Default
  - CreditLine high
    - CalcIntRate I mid NO
    - CalcIntRate I NOT mid Default
  - CreditLine mid Default
- Bal/Pay ratio NOT high NO
35Continuous 1 Coincidence matrix
          Model 0  Model 1  Total
Actual 0  43       16       59
Actual 1  14       27       41
Total     57       43       0.70 (accuracy)
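The 0.70 figure is the share of test cases on the diagonal of the coincidence matrix. A short check, using the four cell counts from the slide (the variable names are hypothetical):

```python
# Cell counts keyed by (actual class, model class), taken from the matrix.
matrix = {(0, 0): 43, (0, 1): 16, (1, 0): 14, (1, 1): 27}

total = sum(matrix.values())                 # number of test cases
correct = matrix[(0, 0)] + matrix[(1, 1)]    # cases where model agrees with actual
accuracy = correct / total                   # (43 + 27) / 100 = 0.70

# Row and column totals reproduce the margins shown in the matrix.
actual_totals = {a: matrix[(a, 0)] + matrix[(a, 1)] for a in (0, 1)}
model_totals = {m: matrix[(0, m)] + matrix[(1, m)] for m in (0, 1)}
```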
36Simulation Output: Continuous 1 (Crystal Ball test set accuracy)
37Continuous 1
- Simulation accuracy of 100 observations, 1000 simulation runs
- perturbation [-0.25, 0.25]: 0.67-0.73
- perturbation [-0.50, 0.50]: 0.65-0.74
- perturbation [-1, 1]: 0.62-0.75
- perturbation [-2, 2]: 0.58-0.74
- perturbation [-3, 3]: 0.57-0.74
- perturbation [-4, 4]: 0.56-0.75
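A rough sketch of how such accuracy ranges could be produced: perturb each input by a uniform draw, reclassify, and record the accuracy of each run. Several things here are assumptions, not details from the slides: the perturbation is additive in the variable's own units, the single-split tree of Continuous Model 2 stands in for the model (with the second branch read as "greater than or equal"), and the test cases are invented purely for illustration. The original study used Crystal Ball rather than code like this.

```python
import random

def model(bal_pay_ratio):
    """Continuous Model 2 as a single split: 1 = Default, 0 = NO."""
    return 1 if bal_pay_ratio >= 6.44 else 0

def accuracy_range(cases, delta, runs=1000, seed=0):
    """Return (min, max) accuracy over runs with inputs perturbed by U(-delta, delta).

    cases is a list of (bal_pay_ratio, actual_label) pairs.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        hits = sum(model(x + rng.uniform(-delta, delta)) == y for x, y in cases)
        accuracies.append(hits / len(cases))
    return min(accuracies), max(accuracies)

# Hypothetical test cases, invented for illustration only.
cases = [(2.0, 0), (5.9, 0), (6.2, 1), (6.8, 1), (7.5, 0), (12.0, 1)]
crisp = accuracy_range(cases, 0.0, runs=10)   # no perturbation
wide = accuracy_range(cases, 1.0, runs=200)   # perturbation [-1, 1]
```

With no perturbation the minimum and maximum coincide at the crisp accuracy; as `delta` grows, cases near the 6.44 split start to flip and the range widens.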
38Mean Model Accuracy Measured on Test Set
      Crisp  0.25   0.50   1.00   2.00   3.00   4.00
Con1  0.70   0.70   0.70   0.68   0.67   0.66   0.65
Con2  0.67   0.67   0.67   0.67   0.67   0.66   0.66
Con3  0.71   0.71   0.70   0.69   0.67   0.67   0.66
CON   0.693  0.693  0.690  0.680  0.670  0.667  0.657
Cat1  0.70   0.70   0.68   0.67   0.66   0.66   0.65
Cat2  0.70   0.70   0.70   0.69   0.68   0.67   0.67
Cat3  0.70   0.70   0.70   0.69   0.69   0.68   0.67
Cat4  0.70   0.70   0.70   0.69   0.68   0.67   0.67
CAT   0.700  0.700  0.700  0.688  0.678  0.670  0.665
39Inferences
- Continuous models declined in accuracy more than categorical
- Categorizing data is one basic form of fuzziness
40Applying to Data Mining
- Easiest way to apply fuzzy concepts to data mining
- CATEGORIZE DATA
- Simulation a way to deal with fuzzy data
- Application of Simulation to fuzzy data mining not as simple
- Large scale data sets
- Create additional columns
- Still very promising research area
41Conclusions
- Interesting research directions
- Simulation in data mining
- Fuzzy data is probabilistic, so simulation seems appropriate
- Simulation involves a lot more work than closed form (CRISP) simplifications
- Group preference aggregation
- Fuzzy data may be fuzzy due to different group member opinions
- Interesting ways to aggregate