Microarray Data Analysis and Machine Learning

Slides: 12
Provided by: isrecI
1
Microarray Data Analysis and Machine Learning
  • Introduction to some R code for cross-validation
  • Machine learning functions in R
  • Survival analysis

Genome Analysis feedback
  • Internal repeats in Alu elements
  • Are there Alu repeats in protein-coding sequences?

2
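Introduction to the Exercise (part 2): Data flow
  • The complete data set (data + labels) is split into a training set (data + labels) and a test set.
  • The test set is separated into test set data and test set true labels.
  • Feature selection on the training set yields training set 2 (data + labels) and test set 2 (selected data).
  • Predictor training on training set 2 yields the predictor (model).
  • The predictor applied to test set 2 yields the test set predicted labels.
  • Evaluation compares the predicted labels with the true labels:

              true
  Pred     A     B
  ----------------
  A        TA    FA
  B        FB    TB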
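The evaluation step can be sketched in R as a cross-tabulation of predicted against true labels. The two label vectors below are made-up illustration data, not results from the exercise, and the names `true.lab`, `pred.lab` and `conf.tab` are assumptions.

```r
# Hypothetical true and predicted labels for 10 test samples (illustration only)
true.lab <- factor(c("A", "A", "A", "A", "B", "B", "B", "B", "B", "A"),
                   levels = c("A", "B"))
pred.lab <- factor(c("A", "A", "B", "A", "B", "B", "A", "B", "B", "A"),
                   levels = c("A", "B"))

# Rows: predicted label, columns: true label
conf.tab <- table(Pred = pred.lab, true = true.lab)
print(conf.tab)

# Fraction of correctly classified samples: (TA + TB) / total
accuracy <- sum(diag(conf.tab)) / sum(conf.tab)
print(accuracy)
```

The diagonal holds TA and TB; the off-diagonal cells hold FA (predicted A, truly B) and FB (predicted B, truly A).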
3
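R code for cross-validation
# randomize samples for cross-validation
s <- sample(1:79)
# 1st fold: samples 1:15 for testing, 16:79 for training
train.ind <- 16:79
test.ind <- 1:15
# feature selection: t-test per gene, using training samples only
# (train.mat and train.lab are helper variables)
train.mat <- all2.mat[, s[train.ind]]
train.lab <- all2.class[s[train.ind]]
t.test.res <- NULL
total.row <- nrow(train.mat)
for (i in 1:total.row) {
  bcr <- train.mat[i, train.lab == "BCR/ABL"]
  neg <- train.mat[i, train.lab == "NEG"]
  res <- t.test(bcr, neg)$p.value
  t.test.res <- c(t.test.res, res)
}
# indices of the 100 genes with the smallest p-values
top.ind <- order(t.test.res)[1:100]
# extract data and labels for predictor training
train <- t(all2.mat[top.ind, s[train.ind]])
test <- t(all2.mat[top.ind, s[test.ind]])
train.class <- all2.class[s[train.ind]]
test.class <- all2.class[s[test.ind]]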
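The slide shows only the first fold. A minimal sketch of how the same idea might be extended to all folds is given below. It runs on simulated data, because `all2.mat` and `all2.class` are not available here; the matrix size, the number of folds, the number of selected genes and the helper `select.top` are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-ins (assumptions): 200 genes x 79 samples, two classes
n.genes <- 200
n.samp <- 79
mat <- matrix(rnorm(n.genes * n.samp), nrow = n.genes)
cls <- factor(rep(c("BCR/ABL", "NEG"), c(37, 42)))
# Make the first 10 genes differ between the classes
mat[1:10, cls == "BCR/ABL"] <- mat[1:10, cls == "BCR/ABL"] + 2

# Randomize the samples, then assign consecutive positions to folds
s <- sample(n.samp)
n.folds <- 5
fold.id <- cut(seq_len(n.samp), breaks = n.folds, labels = FALSE)

# Hypothetical helper: rank genes by t-test p-value and keep the best n.top
select.top <- function(x, y, n.top) {
  p <- apply(x, 1, function(row) {
    t.test(row[y == "BCR/ABL"], row[y == "NEG"])$p.value
  })
  order(p)[seq_len(n.top)]
}

# Feature selection is repeated inside every fold, on training samples only
folds <- lapply(seq_len(n.folds), function(f) {
  test.ind <- s[fold.id == f]
  train.ind <- s[fold.id != f]
  top.ind <- select.top(mat[, train.ind], cls[train.ind], 20)
  list(train.ind = train.ind, test.ind = test.ind, top.ind = top.ind)
})
```

Selecting features inside each fold keeps the test samples out of the gene ranking, which is the point of the data flow on the previous slide.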
4
R function knn
Script:
library(class)
knn.res <- knn(train, test, train.class, k = 3)
  • Usage
  • knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
  • Arguments
  • train: matrix or data frame of training set cases.
  • test: matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
  • cl: factor of true classifications of training set.
  • k: number of neighbours considered.
  • l: minimum vote for definite decision, otherwise 'doubt'. (More precisely, less than 'k-l' dissenting votes are allowed, even if 'k' is increased by ties.)
  • prob: If this is true, the proportion of the votes for the winning class are returned as attribute 'prob'.
  • use.all: controls handling of ties. If true, all distances equal to the 'k'th largest are included. If false, a random selection of distances equal to the 'k'th is chosen to use exactly 'k' neighbours.
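A small self-contained sketch of a `knn` call is given below. It uses the built-in `iris` data as a stand-in for the expression matrix, so the data set, the 100/50 split and the variable names are assumptions made for illustration.

```r
library(class)  # provides knn()

set.seed(42)
idx <- sample(nrow(iris))
train.ind <- idx[1:100]
test.ind <- idx[101:150]

# Samples in rows, features in columns, as knn() expects
train <- iris[train.ind, 1:4]
test <- iris[test.ind, 1:4]
train.class <- iris$Species[train.ind]
test.class <- iris$Species[test.ind]

# prob = TRUE attaches the winning vote proportion as attribute 'prob'
knn.res <- knn(train, test, train.class, k = 3, prob = TRUE)

# Compare predicted and true labels
conf.tab <- table(Pred = knn.res, true = test.class)
accuracy <- sum(diag(conf.tab)) / sum(conf.tab)
print(conf.tab)
```

`knn` has no separate training step: the training cases themselves are the model, so the call returns the predicted test labels directly.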

5
R function lda
Script:
library(MASS)
lda.res <- lda(train, train.class)
pred <- predict(lda.res, test)
  • Usage
  • lda(x, grouping, ..., subset, na.action)
  • Arguments
  • x: (required if no formula is given as the principal argument.) a matrix or data frame or Matrix containing the explanatory variables.
  • grouping: (required if no formula principal argument is given.) a factor specifying the class for each observation.
  • (many more parameters)
  • Value
  • If 'CV = TRUE' the return value ...
  • Otherwise it is an object of class '"lda"' containing the following components ...
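A short sketch of the `lda` / `predict` pair on simulated two-class data follows. The data, the class labels and the train/test split are assumptions for illustration; only the two function calls mirror the script on the slide.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- 60
cls <- factor(rep(c("BCR/ABL", "NEG"), each = n / 2))

# 5 simulated features; the first two separate the classes
x <- matrix(rnorm(n * 5), nrow = n)
x[cls == "BCR/ABL", 1:2] <- x[cls == "BCR/ABL", 1:2] + 3

# Every third sample goes to the test set
test.ind <- seq(1, n, by = 3)
train <- x[-test.ind, ]
test <- x[test.ind, ]
train.class <- cls[-test.ind]
test.class <- cls[test.ind]

lda.res <- lda(train, train.class)
pred <- predict(lda.res, test)

# pred is a list: $class (predicted labels), $posterior (class
# probabilities) and $x (discriminant scores)
accuracy <- mean(pred$class == test.class)
print(table(Pred = pred$class, true = test.class))
```

Unlike `knn`, `predict` on an `lda` object returns a list, so the predicted labels have to be taken from `pred$class`.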

6
R function svm
Script:
library(e1071)
model <- svm(train, train.class, kernel = "radial", degree = 3, type = "C-classification")
pred <- predict(model, test)
  • Usage
  • ## Default S3 method:
  • svm(x, y = NULL, scale = TRUE, type = NULL, kernel = "radial", degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x), ...
  • (many more parameters)
  • Arguments
  • x: a data matrix, a vector, or a sparse matrix (object of class 'Matrix' provided by the 'Matrix' package, or of class 'matrix.csr' as provided by the package 'SparseM').
  • y: a response vector with one label for each row/component of 'x'. Can be either a factor (for classification tasks) or a numeric vector (for regression).
  • type: 'svm' can be used as a classification machine, as a regression machine, or for novelty detection. Depending on whether 'y' ...
  • kernel: the kernel used in training and predicting.
  • linear: u'*v
  • polynomial: (gamma*u'*v + coef0)^degree
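The two kernel formulas can be written out directly in base R. This is only a sketch of the formulas themselves, not of how `svm` evaluates them internally; the function names and the parameter values are assumptions chosen for illustration.

```r
# Linear kernel: inner product u'v
linear.kernel <- function(u, v) sum(u * v)

# Polynomial kernel: (gamma * u'v + coef0)^degree
poly.kernel <- function(u, v, gamma, coef0, degree) {
  (gamma * sum(u * v) + coef0)^degree
}

u <- c(1, 2)
v <- c(3, 4)

k.lin <- linear.kernel(u, v)                                     # 1*3 + 2*4 = 11
k.poly <- poly.kernel(u, v, gamma = 0.5, coef0 = 1, degree = 3)  # (0.5*11 + 1)^3 = 274.625
print(c(linear = k.lin, polynomial = k.poly))
```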

7
R function nnet
Script:
library(nnet)
# y must be a matrix of target values, so the class factor is converted to indicator columns
net.res <- nnet(train, class.ind(train.class), size = 5, decay = 5e-4, maxit = 200)
pred <- predict(net.res, test)
  • Usage
  • nnet(x, y, weights, size, Wts, mask, ...
  • (many more arguments)
  • Arguments
  • x: matrix or data frame of 'x' values for examples.
  • y: matrix or data frame of target values for examples.
  • weights: (case) weights for each example - if missing defaults to 1.
  • size: number of units in the hidden layer.
  • (many more arguments)
  • Details
  • Optimization is done via the BFGS method of 'optim'.
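A minimal sketch of a complete `nnet` run is shown below, again with `iris` standing in for the expression data. The data set, the split, and the use of `softmax = TRUE` and `trace = FALSE` are assumptions for illustration and are not taken from the slide.

```r
library(nnet)  # provides nnet() and class.ind()

set.seed(7)
idx <- sample(nrow(iris))
train.ind <- idx[1:100]
test.ind <- idx[101:150]

train <- as.matrix(iris[train.ind, 1:4])
test <- as.matrix(iris[test.ind, 1:4])

# One indicator (0/1) column per class, as target matrix for the network
targets <- class.ind(iris$Species[train.ind])

net.res <- nnet(train, targets, size = 5, decay = 5e-4, maxit = 200,
                softmax = TRUE, trace = FALSE)

# One row per test sample, one column of class probabilities per class
prob <- predict(net.res, test)
pred <- colnames(prob)[max.col(prob)]
print(table(Pred = pred, true = iris$Species[test.ind]))
```

The starting weights are random, so results can differ between runs unless the seed is fixed.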

8
Survival (time-to-event) Data
  • Data structure: for each patient (sample)
  • time of surveillance (follow-up) after the initial event (e.g. surgery) at t0
  • event status (e.g. death from disease)
  • Censored data: patients with no event at the end of the follow-up period
  • Survival function: S(t) = Prob(T > t), the probability that the event happens at a time T later than t.
  • The survival function is visualized by a so-called Kaplan-Meier plot, see below

Empirical survival function for two patient groups predicted to have long and short survival time (good and poor prognosis). The number of survivors is divided by the total number of patients followed up until time t. Ticks represent censored patients.
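The empirical survival function can be sketched in base R with the standard Kaplan-Meier (product-limit) estimator. The follow-up times and event indicators below are made-up toy data, and `km.estimate` is a hypothetical helper, not a function from the slides.

```r
# Kaplan-Meier estimate; status: 1 = event observed, 0 = censored
km.estimate <- function(time, status) {
  event.times <- sort(unique(time[status == 1]))
  # Patients still under observation just before each event time
  n.risk <- sapply(event.times, function(t) sum(time >= t))
  # Events observed at each event time
  n.event <- sapply(event.times, function(t) sum(time == t & status == 1))
  data.frame(time = event.times, n.risk = n.risk, n.event = n.event,
             surv = cumprod(1 - n.event / n.risk))
}

# Toy data: follow-up time and event status for 8 patients
time <- c(2, 3, 3, 5, 6, 8, 9, 12)
status <- c(1, 1, 0, 1, 0, 1, 0, 0)

km <- km.estimate(time, status)
print(km)
```

Censored patients leave the risk set without lowering the curve, which is why they appear only as ticks in a Kaplan-Meier plot.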
9
Alu elements, repeat structure and dot matrix appearance
68.5% identity in 146 nt overlap; score: 368; E(10,000): 2.3e-23
(position rulers: upper sequence 10 to 110, lower sequence 140 to 280)

embAL  GCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGAGGAT
embAL  GCCGGGCGTGGTGGCGCGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGGAT

embAL  TGCTTGAGCCCAGGAGTTCGAGAC-------------------------------CAGCC
embAL  CGCTTGAGCCCAGGAGTTCGAGGCTGCAGTGAGCTATGATCGCGCCACTGCACTCCAGCC

embAL  TGGGCAACATAGCGAGACCCCGTCTC
embAL  TGGGCGACAGAGCGAGACCCTGTCTC
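The reported identity can be recomputed from the two aligned sequences. The sketch below counts identical columns over the full 146 nt overlap, with gap columns counting as mismatches; the variable names are assumptions.

```r
# Aligned sequences from the slide; "-" marks the 31-column gap
top <- paste0(
  "GCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGAGGAT",
  "TGCTTGAGCCCAGGAGTTCGAGAC", strrep("-", 31), "CAGCC",
  "TGGGCAACATAGCGAGACCCCGTCTC")
bot <- paste0(
  "GCCGGGCGTGGTGGCGCGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGGAT",
  "CGCTTGAGCCCAGGAGTTCGAGGCTGCAGTGAGCTATGATCGCGCCACTGCACTCCAGCC",
  "TGGGCGACAGAGCGAGACCCTGTCTC")

a <- strsplit(top, "")[[1]]
b <- strsplit(bot, "")[[1]]
stopifnot(length(a) == length(b))

# Identical, non-gap columns
matches <- sum(a == b & a != "-")
identity <- 100 * matches / length(a)
print(round(identity, 1))  # 68.5
```

This gives 100 identical columns out of 146, which matches the 68.5% on the slide.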
10
Alu elements in human protein sequences
Sequences producing significant alignments:                          Score (Bits)  E Value
ref|XP_001714899.1|  PREDICTED: hCG_1742852 [Homo sapiens]               68.9      1e-23
ref|XP_001714805.1|  PREDICTED: hCG_1742852 [Homo sapiens]               68.9      1e-23
ref|NP_689672.3|  hypothetical protein LOC146556 isoform 1 [Ho...        77.0      1e-22
ref|NP_872601.1|  tetratricopeptide repeat protein isoform 1 ...         67.4      2e-22
ref|XP_001717065.1|  PREDICTED: hCG_1742852 [Homo sapiens] >re...        66.2      2e-22
ref|NP_001158312.1|  LYR motif containing 4 isoform 2 [Homo sa...        87.8      2e-18
ref|NP_001158011.1|  disrupted in schizophrenia 1 isoform c [H...        54.3      5e-16
ref|XP_001719619.1|  PREDICTED: hypothetical protein LOC100128...        77.0      3e-15
ref|NP_001136036.1|  cyclic nucleotide gated channel alpha 1 i...        76.6      4e-15

>ref|NP_872601.1| tetratricopeptide repeat protein isoform 1 [Homo sapiens]
Length = 1079
Score = 67.4 bits (163), Expect(2) = 2e-22
Identities = 31/50 (62%), Positives = 35/50 (70%), Gaps = 0/50 (0%)

Query  258   QAGVQWRDHSSLQPRTPGLKRSSCLSLPSSWDYRRAPPRPANFCIFCRDG  109
             +AG+QW D SSLQP  PG KR S LSLP+SW+YR  P  P NFCIF   G
Sbjct  995   RAGMQWCDLSSLQPPPPGFKRFSHLSLPNSWNYRHLPSCPTNFCIFVETG  1044

Query  131   FVFFVETGSRYVAQAGLELLGSSNPPASASQSAGITGVSHRAR  3
             F  FVETG  +V QA LELL S    ASASQSAGITGVSH AR
Sbjct  1037  FCIFVETGFHHVGQACLELLTSGGLLASASQSAGITGVSHHAR  1079
11
Alu elements in human protein sequences?
RefSeq protein ID: NP_872601.1, tetratricopeptide repeat protein isoform 1 [Homo sapiens], Length = 1079
Biological function, molecular accident, or genome annotation error?