CSE 591: Machine Learning and Applications

1
CSE 591: Machine Learning and Applications

Jieping Ye
Department of Computer Science and Engineering
Arizona State University

2
Brief Introduction

Dr. Jieping Ye
Assistant Professor at CSE Dept.
Affiliated with the Center for Evolutionary
Functional Genomics at the Biodesign Institute
Research interests: machine learning, data mining,
and their applications to bioinformatics
Dimensionality reduction
Semi-supervised learning
Kernel learning
Biological image analysis

3
Outline of lecture

Course information
Project
Introduction to ML
Course schedule
Survey

4
Course Information

Instructor: Dr. Jieping Ye
Office: BY 568
Phone: 727-7451
Email: jieping.ye_at_asu.edu
Web: http://www.public.asu.edu/jye02/CLASSES/Spring-2007/
Time: TTh 4:40 pm -- 5:55 pm
Office hours: TTh 10:00 am -- 11:45 am
Location: BYAC 270
TA: Jianhui Chen
Office hours: 3:30 pm -- 4:30 pm, Th

5
Course information (Cont'd)

Prerequisite: Basics of linear algebra and
algorithm design and analysis.
Course textbook: No textbook is required. (Papers
and other materials are available at the class
web page.)
Objective: An in-depth understanding of some of
the important machine learning methods and their
applications in bioinformatics and other domains.
Topics: Clustering, regression, classification,
semi-supervised learning, feature reduction,
manifold learning, ranking, and kernel learning.

6
Reference books

Pattern Classification. Duda et al., 2000.
The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Hastie et
al., 2001.
Kernel Methods in Computational Biology.
Schölkopf et al., editors, 2004.
Kernel Methods for Pattern Analysis. Shawe-Taylor
and Cristianini, 2004.
Introduction to Data Mining. Tan et al., 2005.

7
Grading

Homework (3): 30%
Project: 40%. Two to three students form a group
to carry out a small research project.
A survey of the state of the art in an area related
to this course.
Machine learning techniques for specific
applications.
A comparative study of several well-known
algorithms.
Design of a novel algorithm related to this
course.
Exam (1): 20%. There will be one open-book exam
on 3/22/07.
Class participation: 10%. Students are required
to attend the lecture and participate in the
class discussion.
A: 90-100, A-: 85-89, B+: 80-84, B: 70-79, C: 60-70

8
Project

Project proposal is due on 2/08/07
One half to one page
Topics, references, and plan
The intermediate project report is due on 4/05/07
Five to ten pages
The final project report is due on 4/26/07
Fifteen to twenty pages
Project presentation
About 5 minutes

9
Programming languages

Matlab
Tutorials
http://www.math.ufl.edu/help/matlab-tutorial/
http://www.math.mtu.edu/msgocken/intro/node1.html
R (Statistics)
http://www.r-project.org/
Or other languages

10
What is machine learning?

Machine learning is the study of computer systems
that improve their performance through
experience.
Learn existing and known structures and rules.
Discover new findings and structures.
Face recognition
Bioinformatics
Supervised learning vs. unsupervised learning
Semi-supervised learning

11
Machine learning versus data mining

A lot of common topics
Clustering
Classification
Many others
Different focuses
ML focuses more on theory (statistics)
DM focuses more on applications

12
Clustering

Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
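
The definition above can be illustrated with k-means, one common clustering algorithm. The slide does not name a specific method, so the choice of k-means, the toy points, the naive initialization, and the function name `kmeans` are all assumptions made for illustration. A minimal sketch in plain Python:

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters with Lloyd's algorithm (illustrative sketch)."""
    # Naive initialization (an assumption, to keep the run deterministic):
    # use the first k points as the starting centers.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Toy data (made up): two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centers = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other: objects within a group are close to one another and far from objects in the other group.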

13
Applications of Cluster Analysis

Understanding
Group genes and proteins that have similar
functionality, or group stocks with similar price
fluctuations
Summarization
Reduce the size of large data sets

[Figure: Clustering precipitation in Australia]

14
Classification: Definition

Given a collection of records (training set)
Each record contains a set of attributes, one of
the attributes is the class.
Find a model for class attribute as a function
of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is divided
into training and test sets, with training set
used to build the model and test set used to
validate it.
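
The train/test procedure described above can be sketched as follows. The nearest-centroid classifier, the synthetic records, the 70/30 split, and all function names are assumptions chosen to keep the example short; the slide does not prescribe a particular model.

```python
import math
import random


def train_nearest_centroid(records):
    """Build a model: the mean attribute vector (centroid) of each class."""
    by_class = {}
    for attributes, label in records:
        by_class.setdefault(label, []).append(attributes)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in by_class.items()
    }


def predict(model, attributes):
    """Assign the class whose centroid is closest to the record."""
    return min(model, key=lambda label: math.dist(attributes, model[label]))


def accuracy(model, records):
    """Fraction of records whose predicted class matches the true class."""
    hits = sum(predict(model, attrs) == label for attrs, label in records)
    return hits / len(records)


# Made-up data set: two numeric attributes plus a class label per record.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "neg") for _ in range(50)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), "pos") for _ in range(50)]

# Divide the data: 70% to build the model, 30% held out to validate it.
rng.shuffle(data)
split = int(0.7 * len(data))
training_set, test_set = data[:split], data[split:]

model = train_nearest_centroid(training_set)
test_accuracy = accuracy(model, test_set)
print(f"test accuracy: {test_accuracy:.2f}")
```

The model never sees the test records during training, so the reported accuracy estimates how well previously unseen records would be classified.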

15
Classification: Example

[Figure: a training set whose columns are labeled categorical, categorical, continuous, and class is fed to a "Learn Classifier" step]

16
Classification: Application

Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach
Use credit card transactions and the information
about the account holder as attributes.
When does a customer buy, what do they buy, how
often do they pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing
credit card transactions on an account.

17
Character Recognition

Given a digit representation, what is its class?
AT&T has used
Neural Networks
Support Vector Machines
Error rates: 1.4%
Inputs are 28x28 greyscale images.

18
Other applications

Face recognition
Protein function prediction
Cancer detection
Document categorization

19
Data representation

Traditional algorithms work on vectors.
Images can be represented as matrices or vectors.
Abstract data
Graphs
Sequences
3D structures
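
As a small illustration of the matrix and vector views of an image, the sketch below flattens a made-up 3x4 greyscale image into a vector and restores it. The pixel values and variable names are assumptions for illustration only.

```python
# A made-up 3x4 greyscale "image": a matrix of pixel intensities (0-255).
image = [
    [0, 50, 100, 150],
    [10, 60, 110, 160],
    [20, 70, 120, 170],
]

rows, cols = len(image), len(image[0])

# Vector form: concatenate the rows (row-major order).
vector = [pixel for row in image for pixel in row]

# The matrix can be recovered from the vector if the shape is known.
restored = [vector[r * cols:(r + 1) * cols] for r in range(rows)]

print(len(vector))        # 12 = 3 * 4
print(restored == image)  # True
```

By the same construction, a 28x28 greyscale digit image becomes a vector with 784 entries, which is the form traditional vector-based algorithms expect.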

20
Kernel Methods: Basic ideas

21
Applications in bioinformatics

Protein sequence
Protein structure

22
Data integration

mRNA expression data
hydrophobicity data
protein-protein interaction data
sequence data (gene, protein)
Genome-wide data

23
Curse of dimensionality

Large sample size is required for
high-dimensional data.
Query accuracy and efficiency degrade rapidly as
the dimension increases.
Strategies
Feature reduction
Feature selection
Manifold learning
Kernel learning
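
One commonly cited reason that query accuracy degrades in high dimensions is that distances concentrate: the nearest and farthest points from a query become almost equally far away. The sketch below measures this on uniformly random points; the uniform data, the sample size, and the function name `nearest_to_farthest_ratio` are assumptions made for illustration.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


# A ratio near 0 means the nearest neighbor is clearly distinguishable;
# a ratio near 1 means all points look about equally far from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

The ratio should grow toward 1 as the dimension increases, which is one motivation for reducing the dimension before running nearest-neighbor style queries.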

24
Manifold learning

A manifold is a topological space which is
locally Euclidean.

25
Intuition: how does your brain store these
pictures?

26
Model selection

Choose the best model from a set of different
models to fit to the data
Support Vector Machines (SVM), Linear
Discriminant Analysis (LDA)
Models are specified by certain parameters.
How to choose the best parameters?
Cross-validation (leave-one-out, k-fold CV)
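
Parameter selection by k-fold cross-validation can be sketched as below. A k-nearest-neighbor classifier stands in for the SVM and LDA models named on the slide, purely to keep the example self-contained; the synthetic data, the candidate values, and the function names are likewise assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training records nearest to x."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k_neighbors, n_folds=5):
    """Average held-out accuracy over n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        hits = sum(knn_predict(train, x, k_neighbors) == y for x, y in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


# Made-up two-class data with overlapping classes.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(60)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(60)]
rng.shuffle(data)

# Model selection: keep the parameter value with the best CV accuracy.
candidates = [1, 3, 5, 9, 15]
scores = {k: cross_val_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Setting `n_folds` to `len(data)` turns the same routine into leave-one-out cross-validation, since each fold then holds exactly one record.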

27
Machine learning applications

Bioinformatics: Huge amount of biological data
from the Human Genome Project and the Human
Proteomics Initiative.
Goal: Understanding of biological systems at the
molecular level from diverse sources of
biological data.
Challenge: Scalability, multiple sources,
abstract data.
Applications: Microarray data analysis, protein
classification, mass spectrometry data analysis,
protein-protein interaction.
Others: Computer vision, information retrieval,
image processing, text mining, web mining, etc.