Title: Decision Trees
1 Decision Trees
Klassifikations- und Clustering-Methoden für die Computerlinguistik
Sabine Schulte im Walde, Irene Cramer, Stefan Schacht
Universität des Saarlandes, Winter 2004/2005
2 Outline
- Example
- What are decision trees?
- Some characteristics
- A (tentative) definition
- How to build them?
- Lots of questions
- Discussion
- Advantages and disadvantages
- When should we use them?
3 Illustration: Classification Example
Remember the example at the blackboard.
4 Discussion: Illustration Results
- Let's gather some characteristics of our decision tree
- binary decision questions (yes/no questions)
- not necessarily balanced
- arbitrary tree depth, but dependent on the desired granularity
- features and classes are fixed
- annotated data
- nominal and ordinal features
- What questions arose?
- a feature like size cannot be answered with yes/no
- the order of the questions matters; depending on it, the tree may be unbalanced
5 Illustration Results
- Let's gather some characteristics of our decision tree
- annotated data at hand
- look for clever features (knowledge about features)
- at each node the tree splits the data into subsets → decide whether to grow the tree further or to stop
- set of rules → one rule at each node
- binary → thus the answer is yes or no at each step
- nominal features, but real-valued ones are also possible
- tree ↔ rule set
- What questions arose?
- overfitting is possible if the impurity at each node is driven to 0
- when to prune?
6 Our First Definition
- A decision tree is a graph
- It consists of nodes, edges and leaves
- nodes → questions about features
- edges → possible values of a feature
- leaves → class labels
- Path from root to leaf → conjunction of questions (rules)
- A decision tree is learned by splitting the source data into subsets based on features/rules (how, we will see later on)
- This process is repeated recursively until splitting is either not feasible or a single classification can be applied to each element of the derived subset (a minimal sketch of these structures follows below)
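As a purely illustrative sketch of this definition (the class and function names and the toy feature questions are assumptions, not taken from the slides), a binary decision tree and the root-to-leaf classification path could look as follows in Python:

from typing import Callable

class Leaf:
    """A leaf carries a class label."""
    def __init__(self, label: str):
        self.label = label

class DecisionNode:
    """An inner node asks a yes/no question about the features;
    its two edges lead to the 'yes' and 'no' subtrees."""
    def __init__(self, question: Callable[[dict], bool], yes_branch, no_branch):
        self.question = question
        self.yes_branch = yes_branch
        self.no_branch = no_branch

def classify(node, features: dict) -> str:
    """Follow the path from the root to a leaf; the conjunction of the
    questions answered on the way is the rule that fires for this object."""
    while isinstance(node, DecisionNode):
        node = node.yes_branch if node.question(features) else node.no_branch
    return node.label

# Hypothetical example with two nominal features
tree = DecisionNode(lambda f: f["size"] == "large",
                    Leaf("class A"),
                    DecisionNode(lambda f: f["colour"] == "red",
                                 Leaf("class B"),
                                 Leaf("class C")))
print(classify(tree, {"size": "small", "colour": "red"}))  # -> class B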
7 Building Decision Trees
- We face a number of questions while building/using decision trees
- Should we only allow binary questions? Why?
- Which features (properties) should we use? Thus, what questions should we ask?
- Under what circumstances is a node a leaf?
- How large should our tree become?
- How should the category labels be assigned?
- What should we do with corrupted data?
8 Only Binary Questions?
Taken from the web: http://www.smartdraw.com/resources/examples/business/images/decision_tree_diagram.gif
9 Only Binary Questions?
Taken from the web: http://www.cs.cf.ac.uk/Dave/AI2/dectree.gif
10 Only Binary Questions?
- Branching factor: how many outgoing edges does a node have?
- Binary → branching factor 2
- All decision trees can be converted into binary ones
- Binary trees are very expressive
- Binary decision trees are simpler to train
- With a binary tree, 2^n classifications are possible (n is the number of features)
11 What Questions Should We Ask?
- Try to follow Ockham's Razor → prefer the simplest model, i.e. prefer those features/questions that lead to a simple tree (not very helpful by itself?)
12 What Questions Should We Ask?
- Measure impurity at each split
- Impurity i(N)
- metaphorically speaking, shows how many different classes we have at each node
- best would be just one class → leaf
- Some impurity measures:
- Entropy impurity
- Gini impurity
- Misclassification impurity
13 What Questions Should We Ask?
- Entropy impurity
- Gini impurity
- Misclassification impurity
- where P(ω_j) is the fraction of patterns at node N that are in class ω_j
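For reference, the standard definitions of these three measures (in the notation of Duda, Hart and Stork 2000) are:

i_{entropy}(N) = -\sum_j P(\omega_j)\,\log_2 P(\omega_j)

i_{Gini}(N) = \sum_{i \neq j} P(\omega_i)\,P(\omega_j) = \tfrac{1}{2}\Bigl[1 - \sum_j P(\omega_j)^2\Bigr]

i_{misclassification}(N) = 1 - \max_j P(\omega_j)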
14 Illustration
Scanned from Pattern Classification by Duda, Hart, and Stork
15 What Questions Should We Ask?
- Calculate the best question/rule at a node via the drop in impurity Δi(N) (see below)
- where N_L and N_R are the left and right descendent nodes, i(N_L) and i(N_R) are their impurities, and P_L is the fraction of patterns at node N that will go to N_L when this question is used
- Δi(N) should be as high as possible
- Most common: entropy impurity
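The impurity drop referred to above is, in the notation of Duda, Hart and Stork (2000):

\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)

A small sketch of this criterion with entropy impurity (the function names and the toy labels are illustrative assumptions, not from the slides):

from collections import Counter
from math import log2

def entropy_impurity(labels):
    """i(N) = -sum_j P(omega_j) * log2 P(omega_j) over the labels at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def impurity_drop(parent, left, right):
    """Delta i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    p_left = len(left) / len(parent)
    return (entropy_impurity(parent)
            - p_left * entropy_impurity(left)
            - (1 - p_left) * entropy_impurity(right))

# The question that yields the highest drop is the one we should ask.
parent = ["verb", "verb", "noun", "noun", "noun"]
print(impurity_drop(parent, ["verb", "verb"], ["noun", "noun", "noun"]))  # ~0.971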
16 What Questions Should We Ask?
- Additional information about questions
- monothetic (one feature per question) vs. polythetic (several features combined in one question)
- we now understand why binary trees are simpler
- Keep in mind: a local optimum isn't necessarily a global one!
17 When to Declare a Node a Leaf?
- On the one hand ... on the other:
- if i(N) is near 0 → (possible) overfitting
- if the tree is too small → (highly) erroneous classification
- 2 solutions:
- stop before i(N) = 0 → how to decide when?
- pruning → how?
18 When to Declare a Node a Leaf?
- When to stop growing?
- Cross-validation
- split the training data into two subsets
- train with the bigger set
- validate with the smaller one
- Δi(N) < threshold (see the sketch after this list)
- may yield an unbalanced tree
- what threshold is reasonable?
- P(N_L), P(N_R) < threshold
- reasonable thresholds: 5% or 10% of the data
- advantage: good partition where the data density is high
- is Δi(N) significantly different from 0?
- hypothesis testing
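One possible reading of the two threshold rules above, as a sketch (the concrete defaults only echo the slide's suggested values; the function name and signature are assumptions):

def should_stop(delta_i, n_left, n_right, n_total,
                min_drop=0.01, min_fraction=0.05):
    """Stop splitting a node if the impurity drop Delta i(N) is below a
    threshold, or if one of the children would receive less than e.g. 5%
    of the training data."""
    too_small_drop = delta_i < min_drop
    too_few_patterns = min(n_left, n_right) / n_total < min_fraction
    return too_small_drop or too_few_patterns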
19 Large Tree vs. Small Tree?
- Tree too large? Prune!
- first grow the tree fully, then cut
- cut those nodes/leaves where i(N) is very small
- avoids the horizon effect
- Tree too large? Merge branches or rules!
20 Which Category Label to Assign to a Leaf?
- If i(N) = 0, then the category label is the class of all objects at the leaf
- If i(N) > 0, then the category label is the class of most objects (majority vote, sketch below)
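A minimal sketch of this majority-vote assignment (the helper name is an assumption); it also covers the pure case i(N) = 0, where all objects share one class:

from collections import Counter

def leaf_label(labels):
    """Return the class of most objects at the node."""
    return Counter(labels).most_common(1)[0][0]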
21 Discussion: What We Have Learned So Far
- characteristics of decision trees
- decision questions posed
- difference between entropy impurity and Gini impurity?
- the question of the optimal tree is not settled (NP-complete problem)
22 Examples
Scanned from Pattern Classification by Duda, Hart, and Stork
23 Examples
Scanned from Pattern Classification by Duda, Hart, and Stork
24 Examples
Scanned from Pattern Classification by Duda, Hart, and Stork
25 What to Do with Corrupted Data?
- Missing attributes
- during classification:
- look for surrogate questions
- use a virtual value
- during training:
- calculate the impurity on the basis of the attributes at hand
- dirty solution: don't consider data with missing attributes
26 Some Terminology
- CART (classification and regression trees)
- general framework → can be instantiated in many ways
- see the questions on the previous slide
- ID3
- for unordered nominal attributes (real-valued variables → intervals)
- seldom binary
- the algorithm continues until a node is pure or no more variables are left
- no pruning
- C4.5
- refinement of ID3 (in various respects, e.g. real-valued variables, pruning, etc.)
27 Advantages and Disadvantages
- Advantages of decision trees
- non-metric data (nominal features) → yes/no questions
- easily interpretable for humans
- the information in the tree can be converted into rules
- expert knowledge can be included
- Disadvantages of decision trees
- the deduced rules can be very complex
- the decision tree could be suboptimal (cf. cross-validation, overfitting)
- annotated data is needed
28 Discussion: When Could We Use Decision Trees?
- Named entity recognition
- Verb classification
- Polysemy
- Spam filtering
- whenever the features are nominal
- POS tagging
29 Literature
- Richard O. Duda, Peter E. Hart and David G. Stork (2000): Pattern Classification. John Wiley & Sons, New York.
- Tom M. Mitchell (1997): Machine Learning. McGraw-Hill, Boston.
- www.wikipedia.org