Title: Microarray Data Analysis and Machine Learning
1. Microarray Data Analysis and Machine Learning
- Introduction to some R code for cross-validation
- Machine learning functions in R
- Survival analysis
Genome Analysis feedback
- Internal repeats in Alu elements
- Are there Alu repeats in protein-coding sequences?
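2. Introduction to the Exercise (part 2): Data flow
- The complete data set (data + labels) is split into a training set (data + labels) and a test set
- The test set is separated into test set data and test set true labels
- Feature selection on the training set gives training set 2 (data + labels); the same selected features are extracted from the test set data (test set 2, selected data)
- Predictor training on training set 2 gives the predictor (model)
- The predictor applied to test set 2 gives the test set predicted labels
- Evaluation compares the predicted labels with the true labels:
          true
  Pred    A     B
  ------------------
  A       TA    FA
  B       FB    TB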
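The evaluation step can be sketched in R with `table()`. The two label vectors below are made-up toy values, not results from the exercise, and the variable names are chosen only for this illustration.

```r
# Toy labels (hypothetical, for illustration only)
true.labels <- factor(c("A", "A", "A", "B", "B", "B", "B", "A"), levels = c("A", "B"))
pred.labels <- factor(c("A", "A", "B", "B", "B", "A", "B", "A"), levels = c("A", "B"))

# Rows = predicted label, columns = true label, as in the table on the slide
conf.tab <- table(Pred = pred.labels, true = true.labels)
print(conf.tab)

TA <- conf.tab["A", "A"]  # true A, predicted A
FA <- conf.tab["A", "B"]  # true B, predicted A
FB <- conf.tab["B", "A"]  # true A, predicted B
TB <- conf.tab["B", "B"]  # true B, predicted B

# Fraction of test samples whose predicted label equals the true label
accuracy <- (TA + TB) / sum(conf.tab)
print(accuracy)
```

Keeping both factors on the same `levels` guarantees a square table even when one class is never predicted.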
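3. R code for cross-validation
# randomize samples for cross validation
s <- sample(1:79)
# 1st fold: test samples 1:15, training samples 16:79
train.ind <- 16:79
test.ind <- 1:15
# feature selection: one t-test per row (gene), training samples only
t.test.res <- NULL
total.row <- nrow(all2.mat[, s[train.ind]])
for (i in 1:total.row) {
  # the class labels must be subset with the same columns as the data
  bcr <- all2.mat[i, s[train.ind]][all2.class[s[train.ind]] == "BCR/ABL"]
  neg <- all2.mat[i, s[train.ind]][all2.class[s[train.ind]] == "NEG"]
  res <- t.test(bcr, neg)$p.value
  t.test.res <- c(t.test.res, res)
}
# indices of the 100 rows with the smallest p-values
top.ind <- order(t.test.res)[1:100]
# extract data and labels for predictor training
train <- t(all2.mat[top.ind, s[train.ind]])
test <- t(all2.mat[top.ind, s[test.ind]])
train.class <- all2.class[s[train.ind]]
test.class <- all2.class[s[test.ind]]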
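The slide shows only the first fold. A minimal sketch of repeating the same steps for every fold follows. It runs on simulated data: `expr.mat` and `cls` are hypothetical stand-ins for `all2.mat` and `all2.class`, and the group sizes, the number of genes, and the choice of 5 folds and 10 selected features are assumptions made for the example.

```r
set.seed(1)
# Simulated stand-in data (hypothetical): 200 genes x 79 samples, two classes
n.genes <- 200
n.samples <- 79
expr.mat <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
cls <- factor(rep(c("BCR/ABL", "NEG"), times = c(40, 39)))
# Make the first 10 genes informative by shifting one class
expr.mat[1:10, cls == "BCR/ABL"] <- expr.mat[1:10, cls == "BCR/ABL"] + 2

s <- sample(n.samples)                       # random sample order
fold.id <- rep(1:5, length.out = n.samples)  # fold of each shuffled position

selected <- vector("list", 5)
for (f in 1:5) {
  test.ind <- which(fold.id == f)
  train.ind <- which(fold.id != f)
  train.cols <- s[train.ind]
  # Feature selection is repeated inside each fold, on training samples only
  p.val <- apply(expr.mat[, train.cols], 1, function(x) {
    t.test(x[cls[train.cols] == "BCR/ABL"], x[cls[train.cols] == "NEG"])$p.value
  })
  selected[[f]] <- order(p.val)[1:10]        # 10 smallest p-values in this fold
}
print(selected)
```

Selecting features inside the loop keeps the test samples of each fold out of the selection step, which is the point of the data flow on the previous slide.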
4. R function knn
Script:
knn.res <- knn(train, test, train.class, k = 3)
(train and test already contain only the features selected by top.ind, so they are passed without further column indexing.)
- Usage
  - knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
- Arguments
  - train: matrix or data frame of training set cases.
  - test: matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
  - cl: factor of true classifications of training set
  - k: number of neighbours considered.
  - l: minimum vote for definite decision, otherwise 'doubt'. (More precisely, less than 'k-l' dissenting votes are allowed, even if 'k' is increased by ties.)
  - prob: If this is true, the proportion of the votes for the winning class are returned as attribute 'prob'.
  - use.all: controls handling of ties. If true, all distances equal to the 'k'th largest are included. If false, a random selection of distances equal to the 'k'th is chosen to use exactly 'k' neighbours.
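A small usage sketch of `knn()` from the `class` package, showing the `prob` attribute described above. The data are simulated and the names `train.x`, `train.cl`, and `test.x` are hypothetical; they are not the objects of the exercise.

```r
library(class)  # provides knn()

set.seed(2)
# Simulated two-class data (hypothetical): 2 features per case
train.x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                 matrix(rnorm(20, mean = 4), ncol = 2))
train.cl <- factor(rep(c("A", "B"), each = 10))
test.x <- rbind(matrix(rnorm(6, mean = 0), ncol = 2),
                matrix(rnorm(6, mean = 4), ncol = 2))

# Classify each test case by majority vote among its 3 nearest training cases
pred <- knn(train.x, test.x, cl = train.cl, k = 3, prob = TRUE)
vote.share <- attr(pred, "prob")  # proportion of votes for the winning class
print(pred)
print(vote.share)
```

Unlike the other classifiers in these slides, `knn()` has no separate training step: the call returns the predicted labels for the test cases directly.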
5. R function lda
Script:
lda.res <- lda(train, train.class)
pred <- predict(lda.res, test)
- Usage
  - lda(x, grouping, ..., subset, na.action)
- Arguments
  - x: (required if no formula is given as the principal argument.) a matrix or data frame or Matrix containing the explanatory variables.
  - grouping: (required if no formula principal argument is given.) a factor specifying the class for each observation.
  - (many more parameters)
- Value
  - If 'CV = TRUE' the return value ...
  - Otherwise it is an object of class '"lda"' containing the following components ...
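A short sketch of how the result of `predict()` on an `lda` fit can be inspected, using `lda()` from the `MASS` package. The data are simulated and all variable names are hypothetical.

```r
library(MASS)  # provides lda()

set.seed(3)
# Simulated two-class data (hypothetical): 2 features per case
train.x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 3), ncol = 2))
train.cl <- factor(rep(c("A", "B"), each = 20))
test.x <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
                matrix(rnorm(10, mean = 3), ncol = 2))

fit <- lda(train.x, grouping = train.cl)
pred <- predict(fit, test.x)

print(pred$class)      # predicted class labels (factor)
print(pred$posterior)  # posterior probabilities, one column per class
```

Note that `predict()` returns a list here, so the predicted labels to compare with the true test labels are in `pred$class`.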
6. R function svm
Script:
model <- svm(train, train.class, kernel = "radial", degree = 3, type = "C-classification")
pred <- predict(model, test)
- Usage
  - Default S3 method:
    svm(x, y = NULL, scale = TRUE, type = NULL, kernel = "radial", degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x), ...
  - (many more parameters)
- Arguments
  - x: a data matrix, a vector, or a sparse matrix (object of class 'Matrix' provided by the 'Matrix' package, or of class 'matrix.csr' as provided by the package 'SparseM').
  - y: a response vector with one label for each row/component of 'x'. Can be either a factor (for classification tasks) or a numeric vector (for regression).
  - type: 'svm' can be used as a classification machine, as a regression machine, or for novelty detection. Depending on whether 'y' ...
  - kernel: the kernel used in training and predicting.
    - linear: u'v
    - polynomial: (gamma*u'v + coef0)^degree
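The kernel formulas can be written out as plain R functions to see what they compute for two feature vectors. This is only an illustration in base R, not how the `svm` function evaluates kernels internally. The function names are hypothetical, and the radial formula, exp(-gamma*|u-v|^2), is taken from the e1071 documentation rather than from the slide.

```r
# Kernel functions for two numeric vectors u and v (illustrative helpers)
linear.kernel <- function(u, v) sum(u * v)              # u'v
poly.kernel <- function(u, v, gamma, coef0, degree) {
  (gamma * sum(u * v) + coef0)^degree                   # (gamma*u'v + coef0)^degree
}
radial.kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))                          # exp(-gamma*|u-v|^2)
}

u <- c(1, 2, 3)
v <- c(0, 1, 2)

k.lin <- linear.kernel(u, v)                                     # 0 + 2 + 6 = 8
k.pol <- poly.kernel(u, v, gamma = 0.5, coef0 = 1, degree = 3)   # (0.5*8 + 1)^3 = 125
k.rbf <- radial.kernel(u, v, gamma = 1/3)                        # exp(-(1/3)*3) = exp(-1)
print(c(k.lin, k.pol, k.rbf))
```

The `degree` argument enters only the polynomial formula, so it has no effect when the radial kernel is selected.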
7. R function nnet
Script:
net.res <- nnet(train, train.class, size = 5, decay = 5e-4, maxit = 200)
pred <- predict(net.res, test)
- Usage
  - nnet(x, y, weights, size, Wts, mask, ...
  - (many more arguments)
- Arguments
  - x: matrix or data frame of 'x' values for examples.
  - y: matrix or data frame of target values for examples.
  - weights: (case) weights for each example; if missing defaults to 1.
  - size: number of units in the hidden layer.
  - (many more arguments)
- Details
  - Optimization is done via the BFGS method of 'optim'.
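Since the Arguments list describes `y` as a matrix or data frame of target values, one common way to pass class labels is as a 0/1 indicator matrix built with `class.ind()` from the same package. The sketch below assumes that approach, together with `softmax = TRUE`; it uses simulated data and hypothetical variable names, and it is not necessarily how the exercise script was run.

```r
library(nnet)  # provides nnet() and class.ind()

set.seed(4)
# Simulated two-class data (hypothetical): 2 features per case
train.x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 3), ncol = 2))
train.cl <- factor(rep(c("A", "B"), each = 20))
test.x <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
                matrix(rnorm(10, mean = 3), ncol = 2))

targets <- class.ind(train.cl)  # indicator matrix, one column per class
fit <- nnet(train.x, targets, size = 5, decay = 5e-4, maxit = 200,
            softmax = TRUE, trace = FALSE)

prob <- predict(fit, test.x)                 # one row of class probabilities per case
pred.cl <- colnames(targets)[max.col(prob)]  # most probable class per case
print(pred.cl)
```

Because the starting weights are random, results vary between runs unless a seed is set as above.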
8. Survival (time-to-event) Data
- Data structure: for each patient (sample)
  - time of surveillance (follow-up) after the initial event (e.g. surgery) at t = 0
  - event status (e.g. death from disease)
- Censored data: patients with no event at the end of the follow-up period
- Survival function: probability Prob(t) that the event happens at a time T > t.
- The survival function is visualized by a so-called Kaplan-Meier plot, see below
Figure (Kaplan-Meier plot): Empirical survival function for two patient groups predicted to have long and short survival time (good and poor prognosis). The # of survivors is divided by the total # of patients followed up till time t. Ticks represent censored patients.
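The empirical survival function can be sketched in base R with the standard Kaplan-Meier (product-limit) estimate: at each event time, the current survival estimate is multiplied by the fraction of patients still under follow-up who do not have the event. The function name `km.estimate` and the toy follow-up data are hypothetical; in practice `survfit()` from the `survival` package would normally be used.

```r
# Kaplan-Meier estimate; status: 1 = event observed, 0 = censored
km.estimate <- function(time, status) {
  event.times <- sort(unique(time[status == 1]))
  surv <- numeric(length(event.times))
  s <- 1
  for (j in seq_along(event.times)) {
    tj <- event.times[j]
    n.risk <- sum(time >= tj)                 # patients still followed up at tj
    n.event <- sum(time == tj & status == 1)  # events observed at tj
    s <- s * (1 - n.event / n.risk)
    surv[j] <- s
  }
  data.frame(time = event.times, surv = surv)
}

# Toy data (hypothetical): 5 patients, 2 of them censored
time <- c(1, 2, 3, 4, 5)
status <- c(1, 0, 1, 1, 0)
km <- km.estimate(time, status)
print(km)
# surv = 0.8 at t = 1, 0.8 * 2/3 at t = 3, 0.8 * 2/3 * 1/2 at t = 4
```

Censored patients never lower the curve themselves; they only reduce the number at risk for later event times.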
9. Alu elements, repeat structure and dot matrix appearance
68.5% identity in 146 nt overlap; score: 368; E(10,000): 2.3e-23

Ruler marks 10 to 60 (upper sequence), 140 to 190 (lower sequence):
embAL  GCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGAGGAT
embAL  GCCGGGCGTGGTGGCGCGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGGAT

Ruler marks 70 to 90 (upper sequence), 200 to 250 (lower sequence):
embAL  TGCTTGAGCCCAGGAGTTCGAGAC-------------------------------CAGCC
embAL  CGCTTGAGCCCAGGAGTTCGAGGCTGCAGTGAGCTATGATCGCGCCACTGCACTCCAGCC

Ruler marks 100 to 110 (upper sequence), 260 to 280 (lower sequence):
embAL  TGGGCAACATAGCGAGACCCCGTCTC
embAL  TGGGCGACAGAGCGAGACCCTGTCTC
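The identity figure can be checked by counting identical columns in the alignment above. The sketch copies the two aligned sequences from the slide, with the 31-column gap written via `strrep()`, and counts gap columns as non-identical, which is an assumption about how the reported percentage was computed.

```r
# Aligned sequences copied from the slide ("-" marks gap columns)
upper <- paste0(
  "GCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGAGGAT",
  "TGCTTGAGCCCAGGAGTTCGAGAC", strrep("-", 31), "CAGCC",
  "TGGGCAACATAGCGAGACCCCGTCTC")
lower <- paste0(
  "GCCGGGCGTGGTGGCGCGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGGAT",
  "CGCTTGAGCCCAGGAGTTCGAGGCTGCAGTGAGCTATGATCGCGCCACTGCACTCCAGCC",
  "TGGGCGACAGAGCGAGACCCTGTCTC")

u <- strsplit(upper, "")[[1]]
l <- strsplit(lower, "")[[1]]
stopifnot(length(u) == length(l))   # both rows must span the same columns

n.columns <- length(u)              # alignment length including gaps
n.identical <- sum(u == l)          # columns with the same nucleotide
pct.identity <- 100 * n.identical / n.columns
print(c(n.columns, n.identical, round(pct.identity, 1)))
```

With these sequences the count is 100 identical columns out of 146, which rounds to the 68.5% shown on the slide.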
10. Alu elements in human protein sequences
Sequences producing significant alignments:                      Score (Bits)  E Value
ref|XP_001714899.1|  PREDICTED: hCG_1742852 [Homo sapiens]              68.9     1e-23
ref|XP_001714805.1|  PREDICTED: hCG_1742852 [Homo sapiens]              68.9     1e-23
ref|NP_689672.3|     hypothetical protein LOC146556 isoform 1 [Ho...    77.0     1e-22
ref|NP_872601.1|     tetratricopeptide repeat protein isoform 1 ...     67.4     2e-22
ref|XP_001717065.1|  PREDICTED: hCG_1742852 [Homo sapiens] >re...       66.2     2e-22
ref|NP_001158312.1|  LYR motif containing 4 isoform 2 [Homo sa...       87.8     2e-18
ref|NP_001158011.1|  disrupted in schizophrenia 1 isoform c [H...       54.3     5e-16
ref|XP_001719619.1|  PREDICTED: hypothetical protein LOC100128...       77.0     3e-15
ref|NP_001136036.1|  cyclic nucleotide gated channel alpha 1 i...       76.6     4e-15

>ref|NP_872601.1| tetratricopeptide repeat protein isoform 1 [Homo sapiens]
Length=1079

Score = 67.4 bits (163), Expect(2) = 2e-22
Identities = 31/50 (62%), Positives = 35/50 (70%), Gaps = 0/50 (0%)

Query  258   QAGVQWRDHSSLQPRTPGLKRSSCLSLPSSWDYRRAPPRPANFCIFCRDG  109
              AG QW D SSLQP  PG KR S LSLP SW YR  P  P NFCIF   G
Sbjct  995   RAGMQWCDLSSLQPPPPGFKRFSHLSLPNSWNYRHLPSCPTNFCIFVETG  1044

Query  131   FVFFVETGSRYVAQAGLELLGSSNPPASASQSAGITGVSHRAR  3
             F  FVETG   V QA LELL S    ASASQSAGITGVSH AR
Sbjct  1037  FCIFVETGFHHVGQACLELLTSGGLLASASQSAGITGVSHHAR  1079
11. Alu elements in human protein sequences?
RefSeq protein ID: NP_872601.1
tetratricopeptide repeat protein isoform 1 [Homo sapiens], Length=1079
Biological function, molecular accident, or genome annotation error?