Title: CSCI 548/B480: Introduction to Bioinformatics Fall 2002
1 CSCI 548/B480 Introduction to Bioinformatics, Fall 2002
Topic 5: Machine Intelligence - Learning and Evolution
- Dr. Jeffrey Huang, Assistant Professor
- Department of Computer and Information Science, IUPUI
- E-mail: huang@cs.iupui.edu
2 Machine Intelligence
- Machine Learning
- The subfield of AI concerned with intelligent systems that learn.
- The computational study of algorithms that improve performance based on experience.
- The attempt to build intelligent entities
- We must first understand intelligent entities
- Computational Brain
- Mathematics
- Philosophy staked out most of the ideas of AI, but to make it a formal science, mathematical formalization is needed in
- Computation
- Logic
- Probability
3 Behavior-Based AI vs. Knowledge-Based
- Definitions of Machine Learning
- Reasoning
- The effort to make computers think and solve problems
- The study of mental faculties through the use of computational models
- Behavior
- Make machines perform human actions that require intelligence
- Seeks to explain intelligent behavior in terms of computational processes
- Agents
4 Operational Agents
- Operational Views of Intelligence
- The ability to perform intellectual tasks
- Prove theorems, play chess, solve puzzles
- Focus on what goes on "between the ears"
- Emphasize the ability to build and effectively use mental models
- The ability to perform intellectually challenging real-world tasks
- Medical diagnosis, tax advising, financial investing
- Introduce new issues such as critical interactions with the world, model grounding, and uncertainty
- The ability to survive, adapt, and function in a constantly changing world
- Autonomous agents
- Vision, locomotion, and manipulation; many I/O issues
- Self-assessment, learning, curiosity, etc.
5 Building Intelligent Artifacts
- Symbolic approaches
- Construct goal-oriented symbol manipulation systems
- Focus on high-end abstract thinking
- Non-symbolic approaches
- Build performance-oriented systems
- Focus on behavior
- Both are needed, in tightly coupled form
- Building such systems is difficult
- Growing need to automate this process
- Good approach: Evolutionary Algorithms
6
- Behavior-Based AI
- Behavior-Based AI vs. Knowledge-Based
- "Situated" in the environment
- Multiple competencies ("routines")
- Autonomy
- Adaptation and Competition
- Artificial Life (A-Life)
- Agents: Reactive Behavior
- Abstracting the logical principles of living organisms
- Collective Behavior: Competition and Cooperation
7 Classification vs. Prediction
- Classification
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
8 Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
- Estimate the accuracy of the model
- The known label of each test sample is compared with the classified result from the model
- Accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set, otherwise over-fitting will occur
9 Classification Process
Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Use the Model in Prediction
(Jeff, Professor, 2)
Tenured?
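A minimal sketch of using the learned rule above to classify the unseen tuple (Jeff, Professor, 2); the function name and input format are illustrative, not from the slide:

```python
# Learned model from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Model usage in prediction on an unknown sample
print(predict_tenured("Professor", 2))   # -> "yes" (the rank condition fires, despite years = 2)
```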
10 Supervised vs. Unsupervised Learning
- Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
- Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
11 Classification and Prediction
- Data Preparation
- Data cleaning
- Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
- Remove irrelevant or redundant attributes
- Data transformation
- Generalize and/or normalize data
- Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
- time to construct the model
- time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules
- decision tree size
- compactness of classification rules
12 From Learning to Evolutionary Algorithms
- Optimization
- Accomplishing an abstract task = solving a problem
- searching through a space of potential solutions
- finding the best solution
- that is, an optimization process
- Classical exhaustive methods?
- Large space? Special machine learning techniques are needed
- Evolutionary Algorithms
- Stochastic algorithms
- whose search methods model natural phenomena
- Genetic inheritance
- Darwinian strife for survival
13
- "The metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The knowledge that each species has gained is embodied in the makeup of the chromosomes of its members."
- L. Davis and M. Steenstrup, Genetic Algorithms and Simulated Annealing, pp. 1-11, Morgan Kaufmann, 1987
14 The Essence: Components
- Genetic representation for potential solutions to the problem
- A way to create an initial population of potential solutions
- An evaluation function that plays the role of the environment, rating solutions in terms of their fitness
- i.e., the use of fitness to determine survival and reproductive rates
- Genetic operators that alter the composition of children
15 Evolutionary Algorithm Search Procedure
16 Historical Background
- Three paradigms emerged in the 1960s
- Genetic Algorithms
- Introduced by Holland (Univ. of Michigan), continued by De Jong (GMU)
- Envisioned for a broad range of adaptive systems
- Evolution Strategies
- Introduced by Rechenberg
- Focused on real-valued parameter optimization
- Evolutionary Programming
- Introduced by Fogel and Koza
- Applied to AI and machine learning problems
- Today
- Wide variety of evolutionary algorithms
- Applied to many areas of science and engineering
17 Examples of Evolutionary AI
- Parameter Tuning
- Pervasiveness of parameterized models
- Complex behavioral changes due to non-linear interactions
- Examples
- Weights of an artificial neural network
- Parameters of a heuristic evaluation function
- Parameters of a rule induction system
- Parameters of membership functions
- Goal: evolve a useful set of discrete/continuous parameters over time
18
- Evolving Structure
- Effect behavioral change via more complex structures
- Examples
- Selecting/constructing the topology of ANNs
- Selecting/constructing feature sets
- Selecting/constructing plans/scenarios
- Selecting/constructing membership functions
- Goal: evolve useful structures over time
- Evolving Programs
- Goal: acquire new behaviors and adapt existing ones
- Examples
- Acquire/adapt behavioral rule sets
- Acquire/adapt arm/joint control programs
- Acquire/adapt task-oriented programming code
19 How Does a Genetic Algorithm Work?
- A simple example of function optimization
- Find max f(x) = x^2, for x in [0, 4]
- Representation
- Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet
- Phenotype: the expressed traits of an individual
- With a precision for x in [0, 4] of 10^-4, 14 bits are needed
- 8,000 ~ 2^13 < 10,000 <= 2^14 ~ 16,000
- Simple fixed-length binary encoding
- Assign 0.0 to the string 00 0000 0000 0000
- Assign 0.0 + bin2dec(binary string) * 4/(2^14 - 1) to the string 00 0000 0000 0001, and so on
- Phenotype 4.0 corresponds to genotype 11 1111 1111 1111
20 Genotype-to-phenotype mapping (slide figure):
genotype 00000000000000 -> phenotype 0.0
genotype 00000000000001 -> phenotype 4/(2^14 - 1)
genotype 11111111111111 -> phenotype 4.0
- Initial population
- Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits
- All 14 bits of each chromosome are initialized randomly
- Evaluation function
- The evaluation function eval for binary vectors v is equal to the function f
- eval(v) = f(x)
- e.g., eval(v1) = f(x1) = fitness_1
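A minimal Python sketch of the representation and evaluation described above; the decode formula bin2dec(v) * 4/(2^14 - 1) and f(x) = x^2 come from the slide, while the function and variable names are illustrative:

```python
import random

N_BITS = 14                       # chromosome length from the slide
X_MIN, X_MAX = 0.0, 4.0           # search interval [0, 4]

def decode(chromosome):
    """Genotype -> phenotype: map a 14-bit vector to x = bin2dec(v) * 4/(2^14 - 1)."""
    value = int("".join(map(str, chromosome)), 2)
    return X_MIN + value * (X_MAX - X_MIN) / (2 ** N_BITS - 1)

def evaluate(chromosome):
    """eval(v) = f(x) = x^2 is used directly as the fitness."""
    x = decode(chromosome)
    return x * x

# Initial population: pop_size chromosomes, each a randomly initialized 14-bit vector
pop_size = 24
population = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
print(max(evaluate(v) for v in population))   # best fitness in the initial generation
```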
21
- Parameters
- pop_size = 24
- Probability of crossover (Xover), pc = 0.6
- Probability of mutation, pm = 0.01
- Recombination using genetic operators
- Crossover (with probability pc)
- v1 = 01111100010011 -> v1' = 01110101011100
- v2 = 00010101011100 -> v2' = 00011100010011
- Mutation (with probability pm)
- v2' = 00011100010011 -> v2'' = 00011110010011
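A sketch of the two genetic operators, assuming one-point crossover and bit-flip mutation (the slide's crossover example corresponds to a cut after the fourth bit):

```python
import random

def crossover(v1, v2, pc=0.6):
    """One-point crossover: with probability pc, swap the tails of two parents."""
    if random.random() < pc:
        point = random.randint(1, len(v1) - 1)
        return v1[:point] + v2[point:], v2[:point] + v1[point:]
    return v1[:], v2[:]

def mutate(v, pm=0.01):
    """Bit-flip mutation: each bit is flipped independently with probability pm."""
    return [1 - bit if random.random() < pm else bit for bit in v]

# Reproducing the slide's crossover example with a cut point of 4:
v1 = [int(b) for b in "01111100010011"]
v2 = [int(b) for b in "00010101011100"]
c1, c2 = v1[:4] + v2[4:], v2[:4] + v1[4:]
print("".join(map(str, c1)), "".join(map(str, c2)))   # 01110101011100 00011100010011
```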
22
- Selection of M(t) from M(t-1) using a roulette wheel
- Total fitness of the population: F = sum over i of fitness(vi)
- Probability of selection prob_i for each chromosome vi: prob_i = fitness(vi) / F
- Cumulative probability: q_i = prob_1 + ... + prob_i
- Generate random numbers r_j from [0, 1], where j = 1..pop_size
- Select chromosome vi such that q_(i-1) < r_j <= q_i
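A minimal sketch of the roulette-wheel selection steps above (a linear scan over the cumulative probabilities is used for clarity):

```python
import random

def roulette_wheel_select(population, fitnesses):
    """Select pop_size chromosomes with probability proportional to fitness."""
    total = sum(fitnesses)                         # total fitness of the population
    cumulative = []                                # cumulative probabilities q_i
    running = 0.0
    for f in fitnesses:
        running += f / total                       # prob_i = fitness_i / total
        cumulative.append(running)
    selected = []
    for _ in population:                           # spin the wheel pop_size times
        r = random.random()                        # r_j drawn from [0, 1]
        for i, q_i in enumerate(cumulative):
            if r <= q_i:                           # q_(i-1) < r_j <= q_i
                selected.append(population[i])
                break
    return selected
```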
24 Homing to the Optimal Solution
25 Best-so-far Curve
26 Optimal Feature Subset
- Search for Subsets of Discriminatory Features
- A combinatorial optimization problem
- Two general approaches to identifying optimal subsets of features
- Abstract measures of important properties of good feature sets
- Orthogonality (e.g., PCA), information content, low variance
- A less expensive process
- Falls to suboptimal performance if the abstract measures do not correlate well with actual performance
- Building a classifier from the feature subset and evaluating its performance on actual classification tasks (see the sketch after this list)
- Better classification performance
- The cost of building and testing classifiers prohibits any kind of systematic evaluation of feature subsets
- Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search
- 2^N possible candidate subsets of N features
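A minimal sketch of the second (wrapper) approach: a candidate subset is a 0/1 mask over the N features, and its fitness is the accuracy of a classifier trained on just those features. The classifier builder and data are placeholders; searching such masks with a GA is one way to avoid enumerating all 2^N subsets:

```python
import random

def subset_fitness(mask, X_train, y_train, X_test, y_test, build_classifier):
    """Score one candidate feature subset (a 0/1 mask over the N features) by test accuracy."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    project = lambda X: [[row[j] for j in cols] for row in X]
    classify = build_classifier(project(X_train), y_train)   # placeholder: returns a predict function
    predictions = [classify(x) for x in project(X_test)]
    return sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)

def random_mask(n_features):
    """One candidate out of the 2^N possible feature subsets."""
    return [random.randint(0, 1) for _ in range(n_features)]
```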
27 Inductive Learning
- Learning From Examples
- Decision Tree (DT)
- Information Theory (IT)
- Question: what are the BEST attributes (features) for building the decision tree?
- Answer: the BEST attribute is the one that is MOST informative, i.e., for which ambiguity/uncertainty is least
- Solution: measure (information) content using the expected amount of information provided by the attribute
28 Classification by Decision Tree Induction
- Decision tree
- A flow-chart-like tree structure
- Internal nodes denote a test on an attribute
- Branches represent outcomes of the test
- Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
- Tree construction
- At the start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of the decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
Exs. Class Size Color Surface
1 A Small Yellow Smooth
2 A Medium Red Smooth
3 A Medium Red Smooth
4 A Big Red Rough
5 B Medium Yellow Smooth
6 B Medium Yellow Smooth
29
- Entropy
- Define an entropy function H such that H = - sum_i p_i log2(p_i)
- where p_i is the probability associated with the ith class
- For a feature, the entropy is calculated for each value.
- The sum of the entropies weighted by the probability of each value is the entropy for that feature
- Example: toss a fair coin: P(heads) = P(tails) = 0.5, so H = 1 bit
- If the coin is not fair, e.g., P(heads) = 0.99, then H is close to zero (about 0.08 bit)
- So, by tossing the coin you get very little (extra) information (that you didn't expect)
30
- In general, if you have p positive examples and n negative examples,
  H(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
- For p = n, H = 1
- i.e., originally there is the most uncertainty about the eventual outcome (picking up an example) and the most to gain by picking the example.
31 Decision Tree Induction
- Basic algorithm (a greedy algorithm; a code sketch follows this list)
- The tree is constructed in a top-down recursive divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning
- Majority voting is employed for classifying the leaf
- There are no samples left
32 Algorithm
- Select a random subset W (called the window) from the training set T
- Build a DT for the current W
- Select the best feature, which minimizes the entropy H (or maximizes the gain)
- Categorize the training instances (examples) into subsets by this feature
- Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied
- Scan the entire training set for exceptions to the DT
- If exceptions are found, insert some of them into W and repeat from step 2
33
- Information Gain
- The information gain from testing attribute A is defined as the difference between the original information requirement and the new requirement:
  Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)
  where Remainder(A) = sum over values v of ((p_v + n_v)/(p + n)) * H(p_v/(p_v + n_v), n_v/(p_v + n_v))
- Note that Remainder(A) is a weighted (by attribute values) entropy function
- Maximizing Gain(A) is equivalent to minimizing Remainder(A), and then A is the most informative attribute (question) to ask
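A sketch of this computation, reusing the entropy helper above; choose_best_attribute can be plugged into the tree-building sketch from slide 31 (the positive_class default "A" is illustrative, matching the A/B classes of the example table):

```python
def information_gain(examples, attribute, positive_class="A"):
    """Gain(A) = entropy at the parent node minus the value-weighted entropy after splitting."""
    def counts(exs):
        p = sum(1 for _, c in exs if c == positive_class)
        return p, len(exs) - p
    p, n = counts(examples)
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, c) for x, c in examples if x[attribute] == value]
        p_v, n_v = counts(subset)
        remainder += (p_v + n_v) / (p + n) * entropy(p_v, n_v)
    return entropy(p, n) - remainder

def choose_best_attribute(examples, attributes):
    """Maximize Gain(A), i.e., minimize Remainder(A)."""
    return max(attributes, key=lambda a: information_gain(examples, a))
```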
34 The ID3 Algorithm and Quinlan's C4.5
- C4.5
- Tutorial: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
- Matlab program: http://www.cs.wisc.edu/olvi/uwmp/msmt.html
- See5 / C5.0
- Tutorial: http://borba.ncc.up.pt/niaad/Software/c50/c50manual.html
- Software for Win2000: http://www.rulequest.com/download.html
35 Exs. Class Size Color Surface
1 A Small Yellow Smooth
2 A Medium Red Smooth
3 A Medium Red Smooth
4 A Big Red Rough
5 B Medium Yellow Smooth
6 B Medium Yellow Smooth
36
- Noise and Overfitting
- Question: what about two or more examples with the same description but different classifications?
- Answer: each leaf node reports either the MAJORITY classification or the relative frequencies
- Question: what about irrelevant attributes (noise and overfitting)?
- Answer: tree pruning
- Solution: an information gain close to zero is a good clue to irrelevance; compare the actual numbers of (+) and (-) examples in each subset i, p_i and n_i, with the expected numbers p^_i and n^_i assuming true irrelevance:
  p^_i = p * (p_i + n_i)/(p + n),  n^_i = n * (p_i + n_i)/(p + n)
- where p and n are the total numbers of positive and negative examples to start with.
- Total deviation (for testing statistical significance):
  D = sum over i of [(p_i - p^_i)^2 / p^_i + (n_i - n^_i)^2 / n^_i]
- Under the null hypothesis, D follows a chi-squared distribution
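A sketch of computing the total deviation D for one attribute split; subset_counts lists (p_i, n_i) per attribute value, and the result would be compared against a chi-squared threshold from a statistics table or library:

```python
def total_deviation(subset_counts, p, n):
    """D = sum over subsets of squared deviations from the counts expected under irrelevance."""
    D = 0.0
    for p_i, n_i in subset_counts:
        expected_p = p * (p_i + n_i) / (p + n)    # expected positives if the attribute is irrelevant
        expected_n = n * (p_i + n_i) / (p + n)    # expected negatives
        if expected_p > 0:
            D += (p_i - expected_p) ** 2 / expected_p
        if expected_n > 0:
            D += (n_i - expected_n) ** 2 / expected_n
    return D
```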
37 Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules (a code sketch follows this list)
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
- IF age <= 30 AND student = 'no' THEN buys_computer = 'no'
- IF age <= 30 AND student = 'yes' THEN buys_computer = 'yes'
- IF age = 31..40 THEN buys_computer = 'yes'
- IF age > 40 AND credit_rating = 'excellent' THEN buys_computer = 'yes'
- IF age > 40 AND credit_rating = 'fair' THEN buys_computer = 'no'
38 Decision Tree
- Avoiding Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- It is difficult to choose an appropriate threshold
- Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
39
- Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation (a code sketch follows this list)
- Use all the data for training
- but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle
- halting growth of the tree when the encoding is minimized
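A minimal sketch of the k-fold option; build_model and accuracy are placeholders for whatever tree learner and scoring function are being compared:

```python
def k_fold_cross_validation(examples, k, build_model, accuracy):
    """Average held-out accuracy over k folds; each fold serves once as the test set."""
    folds = [examples[i::k] for i in range(k)]     # simple interleaved split into k folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = build_model(train)
        scores.append(accuracy(model, test))
    return sum(scores) / k
```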
40 Decision Tree
- Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals (a code sketch follows this list)
- Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values
- Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
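A sketch of one simple discretization scheme (equal-width binning) for turning a continuous attribute into a discrete-valued one; C4.5 itself instead searches for binary split thresholds on the attribute, so this is only an illustration:

```python
def equal_width_bins(values, n_bins):
    """Partition a continuous attribute's range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0              # guard against a constant attribute
    edges = [lo + i * width for i in range(1, n_bins)]
    def interval(v):
        for i, edge in enumerate(edges):
            if v <= edge:
                return f"bin{i}"
        return f"bin{n_bins - 1}"
    return [interval(v) for v in values]

print(equal_width_bins([2, 5, 18, 33, 41, 60], 3))   # e.g. ages -> ['bin0', 'bin0', 'bin0', 'bin1', 'bin2', 'bin2']
```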