More On Preprocessing - PowerPoint PPT Presentation

Provided by: javierc1
1
More On Preprocessing Javier Cabrera
2
Outline
  1. Transform the data into a scale suitable for
    analysis.
  2. Remove the effects of systematic and obfuscating
    sources of variation.
  3. Identify discrepant observations.

3
Outline
  • Preprocessing => quality of downstream analyses
  • 1. Log transformation, X → log(X)
  • The variation of logged intensities may be less
    dependent on magnitude.
  • Taking logs reduces the skewness of highly skewed
    distributions.
  • Taking logs improves variance estimation.
  • 2. Other transformations
  • Power transformations (X → X^b for some b = 1/2,
    1/3 or other)
  • Amaratunga and Cabrera (2000), Tusher et al.
    (2001)
  • 3. Variance-stabilizing transformations
  • X → log(X + c): symmetrizes the spot intensity
    data and stabilizes their variances.
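The effect of these transformations on skewness can be checked numerically. The sketch below is illustrative only: the intensity values are made up, and the helper names (`skewness`, `log_transform`, `power_transform`, `started_log`) and the offset `c = 20` are assumptions, not taken from the slides.

```python
import math

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values):
    """X -> log(X); requires strictly positive values."""
    return [math.log(v) for v in values]

def power_transform(values, b):
    """X -> X**b, e.g. b = 1/2 or 1/3."""
    return [v ** b for v in values]

def started_log(values, c):
    """X -> log(X + c); the offset c is a hypothetical tuning constant."""
    return [math.log(v + c) for v in values]

# Made-up, strongly right-skewed spot intensities (illustration only).
intensities = [50, 80, 120, 200, 350, 700, 1500, 4000, 12000]

raw_skew = skewness(intensities)
log_skew = skewness(log_transform(intensities))
cube_root_skew = skewness(power_transform(intensities, 1 / 3))
shifted_log_skew = skewness(started_log(intensities, 20))

print(raw_skew, cube_root_skew, log_skew, shifted_log_skew)
```

On this toy vector the log removes the most skewness and the cube root sits in between, which matches the ordering one would expect for right-skewed intensities; real array data may behave differently.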

4
Transformations
  • 4. Rocke and Durbin (2001): arrays with replicate
    spots.
  • Analogy: models used for estimating the concentration
    of an analyte, X = α + μ e^η + ε
  • α = mean background, μ = true expression level, η
    and ε = normally distributed errors (variances σ_η², σ_ε²)
  • 5. Durbin et al. (2002): generalized log
    transformation
  • - α, σ_η² and σ_ε² must be estimated.
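The formula for the generalized log did not survive in this transcript. A form commonly given for it is glog(x) = ln((x - alpha) + sqrt((x - alpha)^2 + c)), where `alpha` is the background offset and `c` is a constant derived from the two error variances. The sketch below assumes that form; the function name `glog` and the parameter values are illustrative, and in practice the parameters would have to be estimated from the data as the slide notes.

```python
import math

def glog(x, alpha, c):
    """Generalized log (assumed form): ln(z + sqrt(z**2 + c)), z = x - alpha.

    alpha: background offset (assumed already estimated).
    c:     positive constant tied to the error variances (assumed already
           estimated); it keeps the argument of the log positive for all x.
    """
    z = x - alpha
    return math.log(z + math.sqrt(z * z + c))

# Hypothetical parameter values, for illustration only.
alpha, c = 0.0, 100.0

# Defined even at or below background, unlike the plain log ...
low = glog(-5.0, alpha, c)
# ... and close to log(2 * z) for intensities far above background.
high = glog(1e6, alpha, c)

print(low, high, math.log(2e6))
```

The appeal of this family is that it behaves like a log at high intensity, where multiplicative error dominates, and roughly linearly near background, where additive error dominates.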

5
Power Transformations
  • a, b must be estimated.
  • Three criteria:
  • Equal variances: CV(gene variances)
  • Low skewness: mean(skewness)
  • No mean-variance correlation: correlation
    between mean and variance
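The three criteria can be computed for any candidate pair (a, b). The sketch below assumes the power family has the form (X - a)^b, which is what the panel titles in the examples suggest but is not stated on this slide; the function names and the toy matrix are hypothetical, and how the three numbers are combined into one objective is left open.

```python
import math
from statistics import mean, pstdev, pvariance

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def power_transform_criteria(matrix, a, b):
    """Evaluate (X - a)**b on a genes-by-arrays matrix.

    Returns (CV of gene variances, mean gene skewness,
             correlation between gene means and gene variances).
    Requires X > a everywhere and non-constant rows.
    """
    transformed = [[(v - a) ** b for v in row] for row in matrix]
    gene_means = [mean(row) for row in transformed]
    gene_vars = [pvariance(row) for row in transformed]
    gene_skews = [skewness(row) for row in transformed]
    cv_of_variances = pstdev(gene_vars) / mean(gene_vars)
    return cv_of_variances, mean(gene_skews), pearson(gene_means, gene_vars)

# Made-up matrix: 3 genes x 3 arrays, spread proportional to the mean.
genes = [[10.0, 12.0, 14.0], [100.0, 120.0, 140.0], [1000.0, 1200.0, 1400.0]]

cv_raw, skew_raw, corr_raw = power_transform_criteria(genes, 0.0, 1.0)
cv_pow, skew_pow, corr_pow = power_transform_criteria(genes, 0.0, 0.1)
print(cv_raw, cv_pow)
```

A search over a grid of (a, b) values that looks for small values of all three criteria would be one way to pick the transformation; the slides do not say which optimizer was used.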

6
Example 1: Tissue Data
Tissue data: 3 treatments (A, B, C) applied to mice tissue.
Arrays: Treatment A 11, Treatment B 11, Treatment C 19
Genes: 3487 genes.
Gene expression matrix X, Dim(X) = 100 x 41

     treatA.1  treatA.2  treatA.3  treatA.4  treatA.5  treatA.6  treatA.7  treatA.8  treatA.9 treatA.10 treatA.11 treatB.12 treatB.13
  1     3.706     3.900     3.877     3.769     3.654     3.805     3.661     3.878     4.213     3.989     3.877     3.797     3.743
  2     3.762     4.034     4.402     3.912     3.889     3.988     4.280     3.901     4.385     3.835     4.051     4.583     4.973
  3     4.140     4.114     4.182     4.200     4.117     4.029     4.200     4.137     4.344     4.122     3.989     4.273     4.368
  4     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.621     4.181     3.555     3.555     3.555     3.571
  5     4.228     4.152     3.828     4.216     3.889     3.923     3.912     4.102     4.273     3.858     4.031     4.144     3.976
  6     6.622     6.749     6.625     6.883     6.865     6.335     6.241     6.201     5.895     6.548     6.577     6.298     6.546
  7     7.322     7.437     7.523     7.267     7.586     7.562     7.238     7.294     6.812     7.557     7.370     7.497     6.834
  8     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.591     4.165     3.555     3.555     3.555     3.571
  9     4.756     4.605     4.935     4.295     4.510     4.571     4.396     4.804     4.639     5.239     4.402     4.502     4.248
 10     4.468     4.306     4.483     4.396     4.432     4.008     4.475     4.357     4.344     4.208     4.147     4.227     4.436
...
7

(Figure panels: Raw Data, Equal 75pctl; Power Trans (X-3.60)-0.4; Quantile Normalized; Log Transformed)
8
Gene selection for classification
- Left panel: PC2 vs PC1 plot, log transformation
- Right panel: PC2 vs PC1 plot, power transformation

9
Example 2: Khan et al. (2001)
4 types of small round blue cell tumors (SRBC):
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Training set: 63 (23 EWS, 20 RMS, 12 NB, 8 BL)
Testing set: 25 (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other)
Genes: of 6567 initial genes, 2308 genes were selected because
they met a minimal level of expression.
Subset A
Cells: 23 EWS and 20 RMS from the training set.
Genes: the 100 most significant genes after performing a t-test.
Gene expression matrix X, Dim(X) = 100 x 43

     EWS.T1  EWS.T2  EWS.T3  EWS.T4  EWS.T6  EWS.T7  EWS.T9 EWS.T11 EWS.T12 EWS.T13 EWS.T14 EWS.T15 EWS.T19  EWS.C8  EWS.C3  EWS.C2  EWS.C4  EWS.C6  EWS.C9
  1   3.203   1.655   3.278   1.006   2.710   2.059   1.848   2.714   2.356   1.929   3.616   2.151   2.312   1.069   0.919   0.925   2.626   1.079   1.099
  2   0.068   0.071   0.116   0.191   0.237   0.082   0.123   0.180   0.079   0.252   0.106   0.097   0.160   0.197   0.192   0.089   0.092   0.178   0.166
  3   1.046   1.041   0.893   0.430   0.369   0.902   0.998   0.496   0.761   0.574   0.583   0.499   0.579   1.681   0.786   1.511   1.869   2.346   2.019
...
10

(Figure panels: Raw Data, Equal 75pctl; Power Trans -(X-0.66)-0.04; Quantile Normalized; Log Transformed)
11
Judging the success of a normalization
  • Y_g1 and Y_g2: values for gene g on arrays 1 and 2.
  • Successful workflow => arrays are monotonically
    related to each other.
  • Pearson's correlation coefficient measures
    linearity rather than agreement.
  • Concordance correlation coefficient
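The slide's formula for the concordance correlation coefficient is not in the transcript. The sketch below assumes the standard (Lin's) definition, 2·s12 / (s1² + s2² + (mean1 - mean2)²), with population moments; the function names and the two toy arrays are illustrative. It shows why Pearson's coefficient is a poor measure of agreement: a constant shift between arrays leaves Pearson at 1 but lowers the concordance.

```python
import math

def _moments(y1, y2):
    """Means, variances and covariance (population versions) of two arrays."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return m1, m2, s11, s22, s12

def pearson_correlation(y1, y2):
    """Measures linear association only."""
    _, _, s11, s22, s12 = _moments(y1, y2)
    return s12 / math.sqrt(s11 * s22)

def concordance_correlation(y1, y2):
    """Assumed standard definition; penalizes shifts in location and scale."""
    m1, m2, s11, s22, s12 = _moments(y1, y2)
    return 2 * s12 / (s11 + s22 + (m1 - m2) ** 2)

# Made-up log intensities; array_2 is array_1 shifted by a constant.
array_1 = [4.1, 5.3, 6.8, 7.2, 9.0]
array_2 = [v + 1.5 for v in array_1]

r = pearson_correlation(array_1, array_2)
ccc = concordance_correlation(array_1, array_2)
print(r, ccc)
```

Here the two arrays are perfectly linearly related, yet no gene has the same value on both, and only the concordance coefficient reflects that.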

12
Judging the success of a normalization
  • Y_g1 and Y_g2: values for gene g on arrays 1 and 2.
  • Successful workflow => arrays are monotonically
    related to each other.
  • Spearman's rank correlation coefficient
  • R_gi is the rank of Y_gi when the Y_gi are ranked
    from 1 to G.
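A minimal sketch of the rank-based coefficient follows. It ranks each array from 1 to G and applies Pearson's formula to the ranks; ties receive average ranks, which is a common convention and an assumption here, since the slide does not address ties. Function names and data are illustrative.

```python
import math

def average_ranks(values):
    """Ranks from 1 to G; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_correlation(y1, y2):
    """Pearson's coefficient computed on the ranks of the data."""
    return pearson_correlation(average_ranks(y1), average_ranks(y2))

# Made-up arrays related monotonically but not linearly.
array_1 = [0.5, 1.0, 2.0, 3.0, 5.0, 8.0]
array_2 = [math.exp(v) for v in array_1]

rho = spearman_correlation(array_1, array_2)
r = pearson_correlation(array_1, array_2)
print(rho, r)
```

Because the relationship is monotone, the rank correlation is 1 even though the linear correlation is well below 1.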

13
Concordance Map
Image Plot of Concordance Correlations
        X44    X45    X46    X47    X48    X49    X50
X44   1.000  0.703  0.622  0.706  0.674  0.746  0.694
X45   0.703  1.000  0.702  0.679  0.784  0.710  0.788
X46   0.622  0.702  1.000  0.791  0.683  0.562  0.776
X47   0.706  0.679  0.791  1.000  0.691  0.607  0.760
X48   0.674  0.784  0.683  0.691  1.000  0.770  0.832
X49   0.746  0.710  0.562  0.607  0.770  1.000  0.727
X50   0.694  0.788  0.776  0.760  0.832  0.727  1.000
14
Concordance Map
Image Plot of Concordance Correlations
        X44    X45    X46    X47    X48    X49    X50
X44   1.000  0.756  0.622  0.700  0.695  0.813  0.698
X45   0.756  1.000  0.813  0.722  0.793  0.710  0.803
X46   0.622  0.813  1.000  0.789  0.753  0.655  0.826
X47   0.700  0.722  0.789  1.000  0.714  0.663  0.763
X48   0.695  0.793  0.753  0.714  1.000  0.779  0.834
X49   0.813  0.710  0.655  0.663  0.779  1.000  0.742
X50   0.698  0.803  0.826  0.763  0.834  0.742  1.000
15
Linear correlation
(Figure panels: Standard Normal; t distribution, df = 6; t distribution, df = 2)
16
correlation
  1. If the distributional properties of the values
    change substantially during a normalization
    (e.g., the skewness is decreased), the
    concordance correlation coefficients
    may increase, but the improvement may be only
    artificial.
  2. For microarrays that have been normalized by
    equating all the quantiles, the concordance
    correlation coefficient will be equal to
    Pearson's correlation coefficient. This is
    because, after such a normalization, the
    quantiles of both samples are identical and,
    therefore, both means are equal and both
    variances are equal too.
  3. Spearman's rank correlation coefficient is equal
    to (a) Pearson's correlation coefficient
    calculated on the ranks of the data and (b) the
    concordance correlation coefficient calculated on
    the ranks of the data.
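Points 2 and 3 can be checked numerically. The sketch below uses a simple quantile normalization (replace each array's sorted values by the across-array mean of the sorted values) and assumes there are no ties; the function names and the two toy arrays are illustrative, and the concordance coefficient is taken in its standard form.

```python
import math

def _moments(y1, y2):
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return m1, m2, s11, s22, s12

def pearson_correlation(y1, y2):
    _, _, s11, s22, s12 = _moments(y1, y2)
    return s12 / math.sqrt(s11 * s22)

def concordance_correlation(y1, y2):
    m1, m2, s11, s22, s12 = _moments(y1, y2)
    return 2 * s12 / (s11 + s22 + (m1 - m2) ** 2)

def ranks_no_ties(values):
    """Ranks from 1 to G, assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def quantile_normalize(arrays):
    """Give every array the same set of values: the mean sorted profile."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    # Each value is replaced by the reference value at its own rank.
    return [[reference[r - 1] for r in ranks_no_ties(a)] for a in arrays]

# Made-up arrays with different location and spread, no ties.
array_1 = [2.1, 8.4, 3.3, 15.0, 5.7, 40.2]
array_2 = [1.0, 3.9, 4.4, 6.1, 2.5, 9.8]

q1, q2 = quantile_normalize([array_1, array_2])
r1, r2 = ranks_no_ties(array_1), ranks_no_ties(array_2)

gap_before = abs(concordance_correlation(array_1, array_2)
                 - pearson_correlation(array_1, array_2))
gap_after = abs(concordance_correlation(q1, q2) - pearson_correlation(q1, q2))
gap_ranks = abs(concordance_correlation(r1, r2) - pearson_correlation(r1, r2))
print(gap_before, gap_after, gap_ranks)
```

After normalization both arrays are permutations of the same values, so the mean and variance terms in the concordance formula cancel and it reduces to Pearson's coefficient; the same argument applies to untied ranks, which are permutations of 1 to G.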

17
(No Transcript)