Title: CSE 591: Machine learning and Applications
1CSE 591 Machine learning and Applications
- Jieping Ye
- Department of Computer Science and Engineering
- Arizona State University
2Brief Introduction
- Dr. Jieping Ye
- Assistant Professor at CSE Dept.
- Affiliated with the Center for Evolutionary Functional Genomics at the Biodesign Institute
- Research interests: machine learning, data mining, and their applications to bioinformatics
- Dimensionality reduction
- Semi-supervised learning
- Kernel learning
- Biological image analysis
3Outline of lecture
- Course information
- Project
- Introduction to ML
- Course schedule
- Survey
4Course Information
- Instructor: Dr. Jieping Ye
- Office: BY 568
- Phone: 727-7451
- Email: jieping.ye_at_asu.edu
- Web: http://www.public.asu.edu/jye02/CLASSES/Spring-2007/
- Time: TTh 4:40am-5:55pm
- Office hours: TTh 10:00 am -- 11:45 am
- Location: BYAC 270
- TA: Jianhui Chen
- Office hours: 3:30 pm - 4:30 pm, Th
5Course information (Contd)
- Prerequisite: Basics of linear algebra, algorithm design and analysis.
- Course textbook: No textbook is required. (Papers and other materials are available at the class web page.)
- Objective: An in-depth understanding of some of the important machine learning methods and their applications in bioinformatics and other domains.
- Topics: Clustering, regression, classification, semi-supervised learning, feature reduction, manifold learning, ranking, and kernel learning.
6Reference books
- Pattern Classification. Duda, et al., 2000.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001.
- Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004.
- Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004.
- Introduction to Data Mining. Tan, et al., 2005.
7Grading
- Homework (3): 30%
- Project: 40%. Two to three students form a group to carry out a small research project.
- A survey of the state of the art in an area related to this course
- Machine learning techniques for specific applications
- A comparative study of several well-known algorithms
- Design of a novel algorithm related to this course
- Exam (1): 20%. There will be one open-book exam on 3/22/07.
- Class participation: 10%. Students are required to attend the lecture and participate in the class discussion.
- A: 90-100, A-: 85-89, B+: 80-84, B: 70-79, C: 60-70
8Project
- Project proposal is due on 2/08/07
- One half to one page
- Topics, references, and plan
- The intermediate project report is due on 4/05/07
- Five to ten pages
- The final project report is due on 4/26/07
- Fifteen to twenty pages
- Project presentation
- About 5 minutes
9Programming languages
- Matlab
- Tutorials
- http://www.math.ufl.edu/help/matlab-tutorial/
- http://www.math.mtu.edu/msgocken/intro/node1.html
- R (Statistics)
- http://www.r-project.org/
- Or other languages
10What is machine learning?
- Machine learning is the study of computer systems that improve their performance through experience.
- Learn existing and known structures and rules.
- Discover new findings and structures.
- Face recognition
- Bioinformatics
- Supervised learning vs. unsupervised learning
- Semi-supervised learning
11Machine learning versus data mining
- A lot of common topics
- Clustering
- Classification
- Many others
- Different focuses
- ML focuses more on theory (statistics)
- DM focuses more on applications
12Clustering
- Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
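
The grouping idea above can be sketched with k-means, one common clustering algorithm (chosen here for illustration; the slide does not name a specific method). The toy points and the hand-picked starting centroids are assumptions made for this example only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Basic k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(initial_centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the group of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a group ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy data (made up): two visibly separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Points that end up with the same label are close to one another and far from the other group, which is the property the definition asks for. The result of k-means depends on the starting centroids, so practical use typically tries several initializations.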
13Applications of Cluster Analysis
- Understanding
- Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
- Reduce the size of large data sets
Figure: Clustering precipitation in Australia
14Classification Definition
- Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
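
The train/test procedure above can be sketched as follows. The 1-nearest-neighbor rule stands in for "a model" purely for illustration, and the records, the class names "A" and "B", and the 75/25 split are assumptions of this example.

```python
def nearest_neighbor_predict(train, x):
    """Return the class of the training record whose attributes are closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(train, key=lambda record: squared_distance(record[0], x))
    return label


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for x, y in test if nearest_neighbor_predict(train, x) == y)
    return correct / len(test)


# Toy records (made up): (attributes, class).
records = [
    ((1.0, 1.1), "A"), ((5.0, 4.8), "B"), ((0.8, 1.3), "A"), ((5.2, 5.1), "B"),
    ((1.2, 0.9), "A"), ((4.9, 5.3), "B"), ((1.1, 1.0), "A"), ((5.1, 4.9), "B"),
]

# Divide the data: the first 75% builds the model, the rest validates it.
split = int(0.75 * len(records))
train, test = records[:split], records[split:]

test_accuracy = accuracy(train, test)
print(test_accuracy)
```

The test records play the role of previously unseen records: they are never consulted when predicting, only when scoring.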
15Classification Example
Figure: Training Set (attribute types: categorical, categorical, continuous, class) -> Learn Classifier
16Classification Application
- Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions.
- Approach:
- Use credit card transactions and the information on its account-holder as attributes.
- When does a customer buy, what does he buy, how often he pays on time, etc.
- Label past transactions as fraud or fair transactions. This forms the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing credit card transactions on an account.
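
As a rough sketch of the approach above, the snippet below learns a one-attribute threshold rule (a "decision stump") from labeled past transactions and then applies it to new ones. The single attribute (transaction amount), the toy values, and the helper names `fit_stump` and `flag` are all hypothetical; a real system would use many attributes and a richer model.

```python
def fit_stump(values, is_fraud):
    """Pick the threshold t that maximizes training accuracy of 'fraud if value > t'."""
    best_threshold, best_correct = None, -1
    for t in sorted(set(values)):
        correct = sum((v > t) == label for v, label in zip(values, is_fraud))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


# Hypothetical labeled history: (amount, labeled as fraud?). The label is the class attribute.
past = [(12.5, False), (40.0, False), (980.0, True),
        (15.0, False), (1500.0, True), (60.0, False)]
amounts = [amount for amount, _ in past]
labels = [label for _, label in past]

# Learn a model for the class of the transactions.
threshold = fit_stump(amounts, labels)


def flag(amount):
    """Apply the learned rule to a newly observed transaction."""
    return amount > threshold


print(threshold, flag(2000.0), flag(20.0))
```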
17Character Recognition
- Given a digit representation.
- What is its class?
- AT&T have used
- Neural Networks
- Support Vector Machines
- Error rates: 1.4%
- Inputs are 28x28 greyscale images.
18Other applications
- Face recognition
- Protein function prediction
- Cancer detection
- Document categorization
19Data representation
- Traditional algorithms work on vectors.
- Images can be represented as matrices or vectors.
- Abstract data
- Graphs
- Sequences
- 3D structures
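
A minimal sketch of the matrix-to-vector representation mentioned above: the rows of an image are concatenated into one flat vector that a vector-based algorithm can consume. The pixel values are made up; the 28x28 size matches the digit images from the character recognition slide.

```python
def flatten(image):
    """Concatenate the rows of a matrix (a list of rows) into one flat vector."""
    return [pixel for row in image for pixel in row]


# A made-up 3x3 greyscale image showing a plus sign.
tiny = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
tiny_vector = flatten(tiny)

# A blank 28x28 image becomes a vector with 28 * 28 = 784 entries.
blank_digit = [[0] * 28 for _ in range(28)]
digit_vector = flatten(blank_digit)

print(tiny_vector)
print(len(digit_vector))
```

Flattening discards the explicit row and column neighborhood of each pixel, which is one reason graphs, sequences, and 3D structures need other representations.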
20Kernel Methods: Basic ideas
21Applications in bioinformatics
- Protein sequence
- Protein structure
22Data integration
mRNA expression data
hydrophobicity data
protein-protein interaction data
sequence data (gene, protein)
Genome-wide data
23Curse of dimensionality
- Large sample size is required for high-dimensional data.
- Query accuracy and efficiency degrade rapidly as the dimension increases.
- Strategies
- Feature reduction
- Feature selection
- Manifold learning
- Kernel learning
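
One way to see why query accuracy degrades with dimension is to measure how much farther the farthest random point is than the nearest one. The sketch below assumes points drawn uniformly from the unit cube and a fixed random seed; the function name `relative_contrast` and the sample sizes are choices made for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(distances) - min(distances)) / min(distances)


contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(dim, round(value, 3))
```

As the dimension grows, the contrast shrinks toward zero: the nearest and farthest points become almost equally distant, so a nearest-neighbor query carries less information. This motivates the reduction strategies listed above.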
24Manifold learning
- A manifold is a topological space which is
locally Euclidean.
25Intuition: how does your brain store these pictures?
26Model selection
- Choose the best model from a set of different models to fit to the data
- Support Vector Machines (SVM), Linear Discriminant Analysis (LDA)
- Models are specified by certain parameters.
- How to choose the best parameters?
- Cross-validation (leave one out, k-fold CV)
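
A minimal sketch of k-fold cross-validation used to choose a parameter. A k-nearest-neighbor classifier on one-dimensional data stands in for the model (the slide mentions SVM and LDA; k-NN is swapped in here only to keep the example short), and the toy data, including one deliberately mislabeled point at 2.2, is made up.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training records closest to x."""
    nearest = sorted(train, key=lambda record: abs(record[0] - x))[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cv_accuracy(data, n_neighbors, k=5):
    """Hold out each fold in turn, train on the rest, and pool the accuracy."""
    correct = 0
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [record for i, record in enumerate(data) if i not in held_out]
        correct += sum(
            knn_predict(train, data[i][0], n_neighbors) == data[i][1] for i in fold
        )
    return correct / len(data)


# Toy data (made up): (feature, class); the point at 2.2 carries a noisy label.
data = [(1.0, 0), (6.0, 1), (1.5, 0), (6.5, 1), (2.0, 0),
        (7.0, 1), (2.2, 1), (7.5, 1), (2.5, 0), (8.0, 1)]

candidates = [1, 3, 5]  # odd values avoid tied votes with two classes
scores = {c: cv_accuracy(data, c) for c in candidates}
best = max(candidates, key=lambda c: scores[c])
print(scores, best)
```

Calling `cv_accuracy(data, n_neighbors, k=len(data))` gives the leave-one-out variant, since every fold then holds exactly one record.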
27Machine learning applications
- Bioinformatics: Huge amount of biological data from the human genome project and human proteomics initiative.
- Goal: Understanding of biological systems at the molecular level from diverse sources of biological data.
- Challenge: Scalability, multiple sources, abstract data.
- Applications: Microarray data analysis, protein classification, mass spectrometry data analysis, protein-protein interaction.
- Others: Computer vision, information retrieval, image processing, text mining, web mining, etc.
28Course schedule
29Survey
- Why are you taking this course?
- What would you like to gain from this course?
- What topics are you most interested in learning about from this course?
- Any other suggestions?
30Next class
- Topics
- Basics of linear algebra
- Basics of probability
- Readings (available at the class webpage)
- Mini tutorial on the Singular Value Decomposition