Title: Data Mining
1. Data Mining / Machine Learning: Introduction
- Intelligent Systems Lab.
- Soongsil University
Thanks to Raymond J. Mooney at the University of Texas at Austin, and Isabelle Guyon
2. Artificial Intelligence (AI): Research Areas
(Diagram: Artificial Intelligence viewed along three axes)
- Research: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
- Application: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Processing, Expert Systems
- Paradigm: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)
3. Artificial Intelligence (AI): Paradigms
4. What is Machine Learning?
(Diagram: TRAINING DATA → Trained machine; Query → Trained machine → Answer)
5. Definition of Learning
- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
(Diagram: Experience E and Task T go into a Learning Program, which produces a Learned Program; the Learned Program's Performance P is measured on the task.)
6. What is Learning?
- Herbert Simon: "Learning is any process by which a system improves performance from experience."
7. Machine Learning
- Supervised Learning
- Estimate an unknown mapping from known input-output pairs
- Learn f_w from a training set D = {(x, y)} s.t. f_w(x) ≈ y
- Classification: y is discrete
- Regression: y is continuous
- Unsupervised Learning
- Only input values are provided
- Learn f_w from D = {x}
- Clustering
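A minimal sketch of the three settings above, using scikit-learn on synthetic data (the dataset, models, and hyperparameters here are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # inputs x

# Supervised, discrete y -> classification
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

# Supervised, continuous y -> regression
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

# Unsupervised, only x -> clustering
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(clf.predict(X[:3]), reg.predict(X[:3]), clusters[:3])
```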
8. Why Machine Learning?
- Recent progress in algorithms and theory
- Growing flood of online data
- Computational power is available
- Knowledge engineering bottleneck: develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task
- Budding industry
9. Niches Using Machine Learning
- Data mining from large databases
- Market basket analysis (e.g., diapers and beer)
- Medical records → medical knowledge
- Software applications we can't program by hand
- Autonomous driving
- Speech recognition
- Self-customizing programs for individual users
- Spam mail filter
- Personalized tutoring
- Newsreader that learns user interests
10. Trends Leading to the Data Flood
- More data is generated:
- Bank, telecom, other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce
11. Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
- Storage and analysis are a big problem
- AT&T handles billions of calls per day
- So much data that it cannot all be stored; analysis has to be done on the fly, on streaming data
12. Largest Databases in 2007
- Commercial databases:
- AT&T: 312 TB
- World Data Centre for Climate: 220 TB
- YouTube: 45 TB of videos
- Amazon: 42 TB (250,000 full textbooks)
- Central Intelligence Agency (CIA): ?
13. Data Growth
In 2 years, the size of the largest database
TRIPLED!
14. Machine Learning / Data Mining Application Areas
- Science
- Astronomy, bioinformatics, drug discovery, ...
- Business
- CRM (Customer Relationship Management), fraud detection, e-commerce, manufacturing, sports/entertainment, telecom, targeted marketing, health care, ...
- Web
- Search engines, advertising, web and text mining, ...
- Government
- Surveillance, crime detection, profiling tax cheaters, ...
15. Data Mining for Customer Modeling
- Customer Tasks
- attrition prediction
- targeted marketing
- cross-sell, customer acquisition
- credit-risk
- fraud detection
- Industries
- banking, telecom, retail sales,
16. Customer Attrition: Case Study
- Situation: The attrition rate for mobile phone customers is around 25-30% a year!
- With this in mind, what is our task?
- Assume we have customer information for the past N months.
17. Customer Attrition: Case Study
- Task:
- Predict who is likely to attrite next month.
- Estimate customer value and determine the cost-effective offer to make to this customer.
18. Customer Attrition: Results
- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with a high propensity to accept the offer
- Reduced the attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
- (Reported in 2003)
19. Assessing Credit Risk: Case Study
- Situation: A person applies for a loan.
- Task: Should a bank approve the loan?
- Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. The bank's best customers are in the middle.
20. Credit Risk: Results
- Banks develop credit models using a variety of machine learning methods.
- Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan.
- Widely deployed in many countries
21. Successful e-commerce: Case Study
- Task: Recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought:
- Customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- The recommendation program is quite successful
22. Security and Fraud Detection: Case Study
- Credit card fraud detection
- Detection of money laundering
- FAIS (US Treasury)
- Securities fraud
- NASDAQ KDD system
- Phone fraud
- AT&T, Bell Atlantic, British Telecom/MCI
- Bio-terrorism detection at the Salt Lake Olympics 2002
23. Example Problem: Handwritten Digit Recognition
Handcrafted rules will result in a large number of rules and exceptions. Better to have a machine that learns from a large training set. Wide variability of the same numeral.
24. Chess Game
In 1997, Deep Blue (IBM) beat Garry Kasparov. It raised IBM's stock value by $18 billion that year.
25. Some Successful Applications of Machine Learning
- Learning to drive an autonomous vehicle
- Train computer-controlled vehicles to steer correctly
- Drive at 70 mph for 90 miles on public highways
- Associate steering commands with image sequences
- 1200 computer-generated images as training examples
- Half-hour training
- Additional information from the previous image indicates the darkness or lightness of the road
26. Some Successful Applications of Machine Learning
- Learning to recognize spoken words
- Speech recognition/synthesis
- Natural language understanding/generation
- Machine translation
27. Example 1: Visual Object Categorization
- A classification problem: predict category y based on image x.
- Little chance to hand-craft a solution without learning.
- Applications: robotics, HCI, web search (a real image Google)
28. Face Recognition - 1
Given multiple angles/views of a person, learn to identify them. Learn to distinguish male from female faces.
29. Face Recognition - 2
Learn to recognize emotions and gestures [Li, Ye, Kambhametta, 2003]
30. Robot
Sony AIBO robot: available on June 1, 1999. Weight: 1.6 kg. Adaptive learning and growth capabilities. Simulates emotions such as happiness and anger.
31. Robot
Honda ASIMO (Advanced Step in Innovative Mobility): born on 31 October 2001. Height: 120 cm, Weight: 52 kg.
http://blog.makezine.com/archive/2009/08/asimo_avoids_moving_obstacles.html?CMP=OTC-0D6B48984890
32. Biomedical / Biometrics
- Medicine
- Screening
- Diagnosis and prognosis
- Drug discovery
- Security
- Face recognition
- Signature / fingerprint
- DNA fingerprinting
33. Computer / Internet
- Computer interfaces
- Troubleshooting wizards
- Handwriting and speech
- Brain waves
- Internet
- Spam filtering
- Text categorization
- Text translation
- Recommendation
34. Classification
- Assign an object/event to one of a given finite set of categories.
- Medical diagnosis
- Credit card applications or transactions
- Fraud detection in e-commerce
- Worm detection in network packets
- Spam filtering in email
- Recommended articles in a newspaper
- Recommended books, movies, music, or jokes
- Financial investments
- DNA sequences
- Spoken words
- Handwritten letters
- Astronomical images
35. Problem Solving / Planning / Control
- Performing actions in an environment in order to achieve a goal.
- Solving calculus problems
- Playing checkers, chess, or backgammon
- Driving a car or a jeep
- Flying a plane, helicopter, or rocket
- Controlling an elevator
- Controlling a character in a video game
- Controlling a mobile robot
36. Applications
37. Disciplines Related to Machine Learning
- Artificial intelligence
- Learning symbolic representations of concepts; machine learning as a search problem; learning as a way to improve problem solving; using prior knowledge together with training data
- Bayesian methods
- Bayes' theorem as the basis for calculating probabilities of hypotheses; the naive Bayes classifier; estimating the values of unobserved variables
- Computational complexity theory
- Theoretical bounds on the inherent complexity of different learning tasks, measured in terms of computational effort, number of training examples, number of mistakes, etc.
- Control theory
- Procedures that learn to control processes in order to optimize predefined objectives and to predict the next state of the process being controlled
38. Disciplines Related to Machine Learning (2)
- Information theory
- Measures of entropy and information content; Minimum Description Length approaches; relationship between optimal codes and optimal training sequences
- Philosophy
- Occam's razor (the simplest hypothesis is the best); justification for generalizing beyond observed data
- Psychology and neurobiology
- Neural network models
- Statistics
- Characterization of the errors made when estimating the accuracy of a hypothesis from a limited sample of data; confidence intervals; statistical tests
39. Definition of Learning
- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
40. Example: Checkers
Task T: playing checkers.
Performance measure P: % of games won.
Training experience E: practice games by playing against itself.
41. Example: Recognizing Handwritten Letters
Task T: recognizing and classifying handwritten words within images.
Performance measure P: % of words correctly classified.
Training experience E: a database of handwritten words with given classifications.
42. Example: Robot Driving
Task T: driving on a public four-lane highway using vision sensors.
Performance measure P: average distance traveled before an error (as judged by a human overseer).
Training experience E: a sequence of images and steering commands recorded while observing a human driver.
43. Designing a Learning System
Task T: playing checkers.
Performance measure P: % of games won.
Training experience E: practice games by playing against itself.
What does this mean, and what can we learn from it?
44. Measuring Performance
- Classification Accuracy
- Solution correctness
- Solution quality (length, efficiency)
- Speed of performance
45. Designing a Learning System
- 1. Choose the training experience.
- 2. Choose exactly what is to be learned, i.e., the target function.
- 3. Choose how to represent the target function.
- 4. Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience → Learner → Knowledge → Performance Element)
46. Designing a Learning System: 1. Choosing the Training Experience
- Key attributes:
- Direct vs. indirect feedback
- Direct feedback: checkers board states and the correct moves
- Indirect feedback: move sequences and final outcomes
- Credit assignment problem
- Degree of control over the sequence of training examples
- Does the learner select its own training examples, or does a teacher provide them?
- Distribution of examples
- How similar is the distribution of training examples to that of test examples?
- Learning is most reliable when the training examples follow a distribution similar to that of future test examples
- In practice this may not hold (e.g., the checkers world champion may play board positions the self-trained program has never encountered)
47. Training vs. Test Distribution
- Generally assume that the training and test examples are independently drawn from the same overall distribution of data.
- IID: independently and identically distributed
- If examples are not independent, collective classification is required
- (e.g., data from communication networks, financial transaction networks, or social networks)
- If the test distribution is different, transfer learning is required, that is, achieving cumulative learning
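A small sketch of the IID assumption in practice: train and test sets come from one random split of the same data (array names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X.sum(axis=1) > 0).astype(int)   # labels from the same underlying process

# Random split: both subsets are drawn IID from the same distribution,
# so test accuracy estimates performance on future data from that distribution.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# If X_test instead came from a different source or time period, this
# assumption breaks and transfer learning would be needed.
```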
48. Transfer Learning
- Transfer learning is what happens when someone finds it much easier to learn to play chess having already learned to play checkers
- or to recognize tables having already learned to recognize chairs
- or to learn Spanish having already learned Italian.
- Achieving significant levels of transfer learning across tasks -- that is, achieving cumulative learning -- is perhaps the central problem facing machine learning.
49. Training Experience
- Direct experience: given sample input and output pairs for a useful target function.
- Checker boards labeled with the correct move, e.g., extracted from records of expert play
- Indirect experience: given feedback which is not direct I/O pairs for a useful target function.
- Potentially arbitrary sequences of game moves and their final game results.
- Credit/Blame Assignment Problem: how to assign credit/blame to individual moves given only indirect feedback?
50. Source of Training Data
- Random examples provided outside of the learner's control
- Negative examples available, or only positive?
- Good training examples selected by a benevolent teacher
- "Near miss" examples
- The learner can query an oracle about the class of an unlabeled example in the environment
- The learner can construct an arbitrary example and query an oracle for its label
- The learner can design and run experiments directly in the environment without any human guidance
51. Designing a Learning System
- 1. Choose the training experience.
- 2. Choose exactly what is to be learned, i.e., the target function.
- 3. Choose how to represent the target function.
- 4. Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience → Learner → Knowledge → Performance Element)
52. Designing a Learning System: 2. Choosing a Target Function
- What type of knowledge will be learned, and how will the performance program use it?
- In the checkers example, the program needs to learn how to choose the best move from among the legal moves for any given board state.
- Could learn a function:
- 1. ChooseMove: B → M (choose a move directly)
- Or
- 2. Evaluation function, V: B → R
- This function assigns a numerical score to any given board state.
- Given V, the best move is selected by evaluating the successor states and choosing the one with the highest value.
- This target function is easier to learn from the kind of indirect experience available here.
53. Designing a Learning System: 2. Choosing the Target Function
- A function that chooses the best move M for any board state B
- ChooseMove: B → M
- Difficult to learn
- It is useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function.
- An evaluation function that assigns a numerical score to any B
- V: B → R
54. The Start of the Learning Work
- Instead of learning ChooseMove, we establish a value function
- Target function: V: B → R
- that maps any legal board state in B to some real value in R.
- Starting from the current position, assign each position a score so that the move leading to the best-scoring position can be chosen.
- 1. If b is a final board state that is won, then V(b) = 100.
- 2. If b is a final board state that is lost, then V(b) = -100.
- 3. If b is a final board state that is drawn, then V(b) = 0.
- 4. If b is not a final board state, then V(b) = ...
55. The Start of the Learning Work
- Instead of learning ChooseMove, we establish a value function
- Target function: V: B → R
- that maps any legal board state in B to some real value in R.
- Starting from the current position, assign each position a score so that the move leading to the best-scoring position can be chosen.
- 1. If b is a final board state that is won, then V(b) = 100.
- 2. If b is a final board state that is lost, then V(b) = -100.
- 3. If b is a final board state that is drawn, then V(b) = 0.
- 4. If b is not a final board state, then V(b) = V(b'),
- where b' is the best final board state that can be achieved starting from b
- (assuming both players play optimally until the end of the game).
- Unfortunately, this did not take us any further!
56. Approximating V(b)
- Computing V(b) is intractable since it involves searching the complete exponential game tree.
- Therefore, this definition is said to be non-operational.
- An operational definition can be computed in reasonable (polynomial) time.
- Need to learn an operational approximation to the ideal evaluation function.
57. Designing a Learning System
- 1. Choose the training experience.
- 2. Choose exactly what is to be learned, i.e., the target function.
- 3. Choose how to represent the target function.
- 4. Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience → Learner → Knowledge → Performance Element)
58. 3. Choosing a Representation for the Target Function
- Describing the function:
- Tables
- Rules
- Polynomial functions
- Neural nets
- Trade-off in choice:
- Expressive power
- Size of training data
- A more expressive representation can represent a closer approximation to the ideal target function.
- However, the more expressive the representation, the more training data is required to reliably choose among the hypotheses it can represent.
59. Approximate Representation
w1 - w6: weights on the board features x1 - x6 (e.g., piece counts)
60. Linear Function for Representing V(b)
- Use a linear approximation of the evaluation function:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
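A minimal sketch of this linear evaluation function (the feature values and weights below are illustrative):

```python
import numpy as np

def v_hat(board_features: np.ndarray, weights: np.ndarray) -> float:
    """Linear evaluation: V^(b) = w0 + w1*x1 + ... + w6*x6."""
    return float(weights[0] + np.dot(weights[1:], board_features))

x = np.array([3, 0, 1, 0, 0, 0], dtype=float)          # x1..x6 for some board state b
w = np.array([0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5])   # w0..w6
print(v_hat(x, w))                                      # V^(b) for this board
```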
61. Designing a Learning System
- 1. Choose the training experience.
- 2. Choose exactly what is to be learned, i.e., the target function.
- 3. Choose how to represent the target function.
- 4. Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience → Learner → Knowledge → Performance Element)
62. 4. Choosing a Function Approximation Algorithm
- A training example is represented as an ordered pair <b, Vtrain(b)>
- b: board state
- Vtrain(b): training value for b
- Instance: black has won the game
- <<x1=3, x2=0, x3=1, x4=0, x5=0, x6=0>, +100>
- (x2 = 0 indicates that white has no remaining pieces.)
- Estimating training values for intermediate board states:
- Vtrain(b) ← V̂(Successor(b))
- V̂: the current approximation to V (i.e., the learned function, the hypothesis)
- Successor(b): the next board state, i.e., the b+1 state at which it is again the program's turn to move
- That is, the training value for the current board state b is estimated from the learned function's value for the later state (b+1).
63. Designing a Learning System: Estimating Training Values
64. Temporal Difference Learning
- Estimate training values for intermediate (non-terminal) board positions by the estimated value of their successor in an actual game trace:
- Vtrain(b) ← V̂(Successor(b))
- where Successor(b) is the next board position at which it is the program's turn to move in actual play.
- Values towards the end of the game are initially more accurate, and continued training slowly backs up accurate values to earlier board positions (see the sketch below).
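A sketch of this estimation step over one recorded game, reusing the v_hat helper from the linear-function sketch above (the trace format and final reward values are illustrative assumptions):

```python
def training_values(game_trace, weights, final_value):
    """Build <b, V_train(b)> pairs from the program's board states in one game.

    game_trace: feature vectors for successive states where the program moves.
    final_value: +100 (win), -100 (loss), or 0 (draw) for the terminal state.
    """
    examples = []
    for i, b in enumerate(game_trace):
        if i + 1 < len(game_trace):
            v_train = v_hat(game_trace[i + 1], weights)  # V_train(b) <- V^(Successor(b))
        else:
            v_train = final_value                         # terminal state: true outcome
        examples.append((b, v_train))
    return examples
```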
65. How to learn?
66. How to learn?
67. How to change the weights?
68. How to change the weights?
69. Obtaining Training Values
- Direct supervision may be available for the target function.
- With indirect feedback, training values can be estimated using temporal difference learning (used in reinforcement learning, where supervision is a delayed reward).
70. Learning Algorithm
- Uses training values for the target function to induce a hypothesized definition that fits these examples and hopefully generalizes to unseen examples.
- In statistics, learning to approximate a continuous function is called regression.
- Attempts to minimize some measure of error (a loss function), such as the mean squared error shown below.
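Following Mitchell's checkers formulation, the error to minimize is the squared error between the training values and the values predicted by the current hypothesis:

```latex
E \;\equiv\; \sum_{\langle b,\, V_{\mathrm{train}}(b)\rangle \,\in\, \text{training examples}} \bigl(V_{\mathrm{train}}(b) - \hat{V}(b)\bigr)^{2}
```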
71. The LMS (Least Mean Squares) Weight Update Rule
- A gradient-descent argument on the squared error leads to the following update rule.
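In Mitchell's formulation the rule is: for each training example <b, Vtrain(b)>, compute V̂(b) with the current weights, then update each weight as wi ← wi + η (Vtrain(b) − V̂(b)) xi, where η is a small learning rate (e.g., 0.1). A minimal sketch in Python (variable names are illustrative):

```python
import numpy as np

def lms_update(weights, board_features, v_train, lr=0.1):
    """One LMS step: w_i <- w_i + lr * (V_train(b) - V^(b)) * x_i."""
    x = np.concatenate(([1.0], board_features))     # x_0 = 1 pairs with w_0
    error = v_train - float(np.dot(weights, x))     # V_train(b) - V^(b)
    return weights + lr * error * x

# Hypothetical example: start from zero weights, one training example
w = np.zeros(7)
b = np.array([3, 0, 1, 0, 0, 0], dtype=float)       # x1..x6 for a board state
w = lms_update(w, b, v_train=100.0)
```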
72. LMS Discussion
- Intuitively, LMS executes the following rules:
- If the output for an example is correct (the error is zero), no weights are changed.
- If the output is too low (the error is positive), the weights are increased in proportion to the values of the corresponding features; this raises the output and reduces the error.
- If the output is too high (the error is negative), the weights are decreased in proportion to the values of the corresponding features; this lowers the output and reduces the error.
- Under the proper weak assumptions, LMS can be proven to eventually converge to a set of weights that minimizes the mean squared error.
73. Lessons Learned about Learning
- Learning =
- acquiring an approximation of an unknown target function from direct or indirect experience.
- Function approximation =
- searching a space of hypotheses for the hypothesis that best fits the training data.
- Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.
74. Various Function Representations
- Numerical functions
- Linear regression
- Neural networks
- Support vector machines
- Symbolic functions
- Decision trees
- Rules in propositional logic
- Rules in first-order predicate logic
- Instance-based functions
- Nearest-neighbor
- Case-based
- Probabilistic Graphical Models
- Naïve Bayes
- Bayesian networks
- Hidden-Markov Models (HMMs)
- Probabilistic Context Free Grammars (PCFGs)
- Markov networks
75. Various Search Algorithms
- Gradient descent
- Perceptron
- Backpropagation
- Dynamic Programming
- HMM Learning
- Probabilistic Context-Free Grammars (PCFGs) Learning
- Divide and Conquer
- Decision tree induction
- Rule learning
- Evolutionary Computation
- Genetic Algorithms (GAs)
- Genetic Programming (GP)
- Neuro-evolution
76. Evaluation of Learning Systems
- Experimental
- Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets (see the sketch after this list).
- Gather data on their performance, e.g., test accuracy, training time, testing time.
- Analyze differences for statistical significance.
- Theoretical
- Analyze algorithms mathematically and prove theorems about their:
- Computational complexity
- Ability to fit training data
- Sample complexity (number of training examples needed to learn an accurate function)
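A minimal sketch of such an experimental comparison (the models, data, and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # synthetic benchmark task

for model in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold test accuracy
    print(type(model).__name__, scores.mean().round(3), scores.std().round(3))

# A paired statistical test on the per-fold scores (e.g., a paired t-test)
# then checks whether the difference between the two methods is significant.
```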
77. Core Parts of Machine Learning
(Diagram: the four generic modules, with an initial game board and the game history as inputs)
Many machine learning systems can be usefully characterized in terms of these four generic modules.
78. Four Components of a Learning System (1)
- Performance system
- Solves the given performance task
- Uses the learned target function
- New problem → trace of its solution
- Critic
- Outputs a set of training examples of the target function
79. Four Components of a Learning System (2)
- Generalizer
- Input: training examples
- Output: hypothesis (estimate of the target function)
- Generalizes from the specific training examples
- Hypothesizes a general function
- Experiment generator
- Input: current hypothesis
- Output: a new problem
- Picks a new practice problem that maximizes the learning rate (see the sketch of the full loop below)
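A sketch of how these four modules fit together in one training loop (the function signatures are illustrative placeholders, not the slides' actual checkers code):

```python
def run_learning_loop(n_games, hypothesis,
                      experiment_generator, performance_system, critic, generalizer):
    """Iterate the four generic modules of a learning system."""
    for _ in range(n_games):
        problem = experiment_generator(hypothesis)        # e.g., an initial game board
        trace = performance_system(problem, hypothesis)   # solve it using the learned V^
        examples = critic(trace, hypothesis)              # training examples <b, V_train(b)>
        hypothesis = generalizer(examples)                # new estimate of the target function
    return hypothesis
```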
80. History of Machine Learning
- 1950s
- Samuel's checker player
- Selfridge's Pandemonium
- 1960s
- Neural networks: Perceptron
- Pattern recognition
- Learning in the limit theory
- Minsky and Papert prove limitations of Perceptron
- 1970s
- Symbolic concept induction
- Winston's arch learner
- Expert systems and the knowledge acquisition bottleneck
- Quinlan's ID3
- Michalski's AQ and soybean diagnosis
- Scientific discovery with BACON
- Mathematical discovery with AM
81. History of Machine Learning (cont.)
- 1980s
- Advanced decision tree and rule learning
- Explanation-based Learning (EBL)
- Learning and planning and problem solving
- Utility problem
- Analogy
- Cognitive architectures
- Resurgence of neural networks (connectionism, backpropagation)
- Valiant's PAC Learning Theory
- Focus on experimental methodology
- 1990s
- Data mining
- Adaptive software agents and web applications
- Text learning
- Reinforcement learning (RL)
- Inductive Logic Programming (ILP)
- Ensembles: Bagging, Boosting, and Stacking
- Bayes Net learning
82. History of Machine Learning (cont.)
- 2000s
- Support vector machines
- Kernel methods
- Graphical models
- Statistical relational learning
- Transfer learning
- Sequence labeling
- Collective classification and structured outputs
- Computer Systems Applications
- Compilers
- Debugging
- Graphics
- Security (intrusion, virus, and worm detection)
- E-mail management
- Personalized assistants that learn
- Learning in robotics and vision
83. Reminder
- Learning as search in a space of possible hypotheses
- Learning methods are characterized by their search strategies and by the underlying structure of the search spaces.
84. Issues in Machine Learning
- What algorithms exist for learning general target functions from specific training examples?
- How much training data is sufficient?
- When and how can prior knowledge held by the learner guide the process of generalizing from examples?
- What is the best strategy for choosing the next training experience, and how does this choice alter the complexity of the learning problem?
- What is the best way to reduce the learning task to one or more function approximation problems?
- How can the learner automatically alter its representation to improve its ability to represent and learn the target function?