COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis - PowerPoint PPT Presentation

1
COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis
  • Müge Erdogmus
  • Zeynep Isik

2
DISEASE: EMPHYSEMA
  • Emphysema is a lung disease that belongs to the
    group of diseases called chronic obstructive
    pulmonary disease (COPD).
  • The ability of the lungs to expel air is
    diminished in patients with emphysema.
  • The lungs lose their elasticity; thus, they
    become less contractile.
  • In emphysema, the lung tissues that support the
    physical shape and function of the lungs are
    damaged.

3
EMPHYSEMA
  • The lung tissue around the smaller airways
    (bronchioles) and the alveoli -> targets of the
    destruction.
  • Normally the lungs are very elastic and spongy,
    but not in emphysema!

4
Causes of Emphysema
  • Alpha-1-antitrypsin deficiency
  • Cigarette smoking
  • 1) It damages the lung tissue in various ways:
    the cells in the airway that are responsible for
    clearing mucus and other secretions are impaired
    by cigarette smoke.
  • 2) Enhanced mucus secretion -> rich source of
    food for bacteria, while the immune cells are
    weakened by cigarette smoke in their fight
    against infection.
  • Destructive enzymes from the immune cells ->
    loss of proteins associated with elasticity.

5
The Microarray Experiment
  • The gene expression dataset is composed of 30
    samples retrieved from NCBI's Gene Expression
    Omnibus.
  • The RNA transcripts used for measuring the
    expression signals are taken from Homo sapiens.
  • 18 slides -> severely emphysematous tissue
    removed at LVRS (lung volume reduction surgery)
  • 12 slides -> normal or mildly emphysematous lung
    tissue taken from smokers with nodules suspicious
    for lung cancer.
  • More than 33,000 of the best-characterized human
    genes are represented in the dataset, which
    includes 1,000,000 unique oligonucleotide
    features.

6
Data Analysis
  • Read XLS Files
  • Normalization
  • Outlier Removal
  • Check Normal Dist.
  • Hypothesis Testing
  • PCA
  • Correlation Based Feature Reduction
  • Classification
7
Data Normalization
  • Why do we normalize data?
  • Scale data for reasonable comparisons
  • Map the expression values of each probe to the
    [0, 1] range
  • Do not disturb the underlying distribution of the
    data
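The mapping to the [0, 1] range can be illustrated with a minimal sketch of min-max scaling. This is an assumption about how the scaling was done (the slides do not name the exact formula), and the function name and the sample values are hypothetical. A linear rescaling like this preserves the ordering and the shape of each probe's distribution.

```python
def min_max_normalize(values):
    """Linearly map one probe's expression values onto the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant probe has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical expression values of one probe across five samples.
probe = [120.0, 80.0, 200.0, 140.0, 80.0]
scaled = min_max_normalize(probe)
print(scaled)
```

Each probe is scaled independently, so probes with very different raw intensities become comparable.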

8
Outlier Removal
  • Why do we need to remove outliers?
  • They can distort the mean of the data -> wrong
    clusterings, wrong significance values
  • What is an outlier?
  • Expression values that are three or more standard
    deviations away from the mean are outliers
  • Replace outliers with the mean of remaining
    expression values for the probe
  • Detected outliers in 5331 probes
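The three-sigma rule above can be sketched as follows. This is only an illustration under stated assumptions: the population standard deviation is used (the slides do not say which estimator was applied), and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def replace_outliers(values, n_sigma=3.0):
    """Replace values >= n_sigma standard deviations from the mean
    with the mean of the remaining (non-outlier) values."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return list(values), 0
    flags = [abs(v - mu) >= n_sigma * sigma for v in values]
    kept = [v for v, is_out in zip(values, flags) if not is_out]
    if not any(flags) or not kept:
        return list(values), 0
    fill = mean(kept)
    cleaned = [fill if is_out else v for v, is_out in zip(values, flags)]
    return cleaned, sum(flags)


# Hypothetical probe: 29 ordinary values and one extreme value.
probe = [1.0] * 29 + [10.0]
cleaned, n_replaced = replace_outliers(probe)
print(n_replaced, cleaned[-1])
```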

9
Clustering Before Feature Reduction
  • Clustering on samples using all existing features
    to see how bad the situation is
  • K-means clustering with Euclidean distance
  • For selection of initial k centers
  • KCENTRES algorithm
  • select k center objects from the distance matrix
    such that the distance between the most distant
    object and the center that is closest to that
    object is minimized.
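The criterion described above (minimize the largest distance from any object to its nearest center) can be approximated greedily. The sketch below uses farthest-first traversal, a standard heuristic for this minimax objective; it is not claimed to be what the KCENTRES routine does internally, and the function names and points are hypothetical.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]


def greedy_k_centers(dist, k, first=0):
    """Pick k center indices by farthest-first traversal.

    Returns the centers and the resulting radius, i.e. the distance
    between the most distant object and its closest center.
    """
    centers = [first]
    nearest = list(dist[first])  # distance of each object to its closest center
    while len(centers) < k:
        farthest = max(range(len(dist)), key=lambda i: nearest[i])
        centers.append(farthest)
        nearest = [min(nearest[i], dist[farthest][i]) for i in range(len(dist))]
    return centers, max(nearest)


# Hypothetical samples forming two well separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)]
centers, radius = greedy_k_centers(distance_matrix(points), k=2)
print(centers, radius)
```

The chosen centers can then seed k-means instead of random starting points.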

10
Clustering Before Feature Reduction
11
Clustering Before Feature Reduction
  • Really low clustering performance
  • Data set is noisy
  • Reduce features so as to keep only the probes
    that are really significant
  • Differentially expressed among classes
  • ...can be used to differentiate between the two
    classes of samples

12
Searching for Normally Distributed Probes
  • Why do we need to identify which probes are
    normally distributed and which ones are not?
  • Apply different tests of significance
  • Lilliefors test (95% confidence level)
  • Tests the normality of the distribution by
    examining the signal intensity data for a probe
  • Modified version of the Kolmogorov-Smirnov test
  • No need to specify the parameters of the
    underlying distribution of the data
  • estimates the parameters (mean and variance) of
    the underlying distribution from the data
  • 14787 probes are normally distributed
  • 7428 probes are not normally distributed
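A minimal sketch of the Lilliefors statistic follows: it is the Kolmogorov-Smirnov distance between the empirical distribution and a normal distribution whose mean and standard deviation are estimated from the same sample. The critical value used here (about 0.161 for n = 30 at the 5% level) is quoted from memory of Lilliefors's published table and should be treated as an assumption to verify; the function names and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(values):
    """KS distance to a normal with mean and stdev estimated from the sample."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    d = 0.0
    for i, x in enumerate(sorted(values)):
        f = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Assumed critical value for n = 30, alpha = 0.05 (verify against a table).
CRITICAL_N30 = 0.161

# Hypothetical probes: one close to normal, one strongly right-skewed.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)]
skewed = [math.exp(2.0 * z) for z in normal_like]

d_normal = lilliefors_statistic(normal_like)
d_skewed = lilliefors_statistic(skewed)
print(d_normal <= CRITICAL_N30, d_skewed <= CRITICAL_N30)
```

A probe whose statistic exceeds the critical value would be routed to the non-parametric test.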

13
Tests of Significance
  • For probes that have normal distribution
  • T-test or Z-test
  • 30 samples -> t-test (95% confidence level)
  • For probes that do not have normal distribution
  • Non-parametric test
  • Wilcoxon rank-sum test (95% confidence level)
  • ... sorts all intensity values for a probe
  • ... gives each intensity value a rank
  • ... sums up the ranks for the signal values for
    both classes
  • ... compares the sums to decide whether the two
    samples come from the same distribution.
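The four steps listed for the rank-sum test can be sketched directly. This version uses the large-sample normal approximation without tie or continuity corrections, which is an assumption made for brevity; the function names and intensity values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sum_test(class_a, class_b):
    """Return (rank sum of class_a, z score, two-sided p-value)."""
    n1, n2 = len(class_a), len(class_b)
    ranks = average_ranks(list(class_a) + list(class_b))
    w = sum(ranks[:n1])
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return w, z, p


# Hypothetical normalized intensities of one probe: 18 severe, 12 control.
severe = [0.60 + 0.01 * i for i in range(18)]
control = [0.30 + 0.01 * i for i in range(12)]
w, z, p = rank_sum_test(severe, control)
print(w, p < 0.05)
```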

14
Tests of Significance
  • 2339 of 22215 probes are differentially expressed
  • eliminated 19876 probes

15
Clustering After Feature Reduction
  • Clustering on samples using only differentially
    expressed features to see whether feature
    reduction proved to be useful
  • K-means clustering with the same procedure

16
Clustering After Feature Reduction
17
Further Feature Reduction
  • 2339 features is a high number
  • Try to reduce number of features while preserving
    clustering accuracy
  • Two methods
  • Correlation based feature reduction
  • Principal component analysis

18
Correlation Based Feature Reduction
  • Extract uncorrelated features
  • Cluster uncorrelated features
  • K-means
  • SOM
  • Prior to clustering find value of k from
    hierarchical clustering
  • Different distance metrics
  • Different hierarchical clustering methods
  • From each cluster of each clustering select
    certain genes
  • These genes will be used for classification

19
Correlation Based Feature Reduction
  • Extract uncorrelated features
  • Find correlation matrix
  • Keep one of the highly correlated features and
    remove the others.
  • Highly correlated: correlation > 85%
  • Keep the feature that is closest to all other
    features
  • 1689 probes are left in our feature set
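A simplified sketch of the correlation filter is shown below. It keeps the first probe seen from each highly correlated group rather than the probe "closest to all other features", and it thresholds the absolute Pearson correlation at 0.85; both choices are assumptions made to keep the example short. The probe names and profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(probes, threshold=0.85):
    """Greedily keep probes that are not highly correlated with any kept probe."""
    kept = []
    for name, profile in probes.items():
        if all(abs(pearson(profile, probes[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical expression profiles across five samples.
probes = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "p2": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost a multiple of p1
    "p3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(probes)
print(kept)
```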

20
Correlation Based Feature Reduction
  • Prior to clustering find value of k from
    hierarchical clustering
  • Different distance metrics
  • Manhattan
  • Euclidean
  • Mahalanobis
  • Chebyshev
  • Correlation coefficients
  • Different hierarchical clustering methods
  • Complete linkage (max distance clustering)
  • Average linkage (average distance clustering)

21
Correlation Based Feature Reduction
Complete Linkage with Manhattan Distance
22
Correlation Based Feature Reduction
Average Linkage with Manhattan Distance
23
Correlation Based Feature Reduction
Complete Linkage with Euclidean Distance
24
Correlation Based Feature Reduction
Average Linkage with Euclidean Distance
25
Correlation Based Feature Reduction
Complete Linkage with Chebyshev Distance
26
Correlation Based Feature Reduction
Average Linkage with Chebyshev Distance
27
Correlation Based Feature Reduction
Complete Linkage with Correlation Coeff.
28
Correlation Based Feature Reduction
Average Linkage with Correlation Coeff.
29
Correlation Based Feature Reduction
  • The complete linkage method is more successful
    at separating the clusters
  • The probes that are very similar to each other
    are put into the same clusters; the ones that are
    very different are put into different clusters
  • We focused on the trees that are formed using
    complete linkage with Euclidean distance and
    correlation coefficients as distance metrics
  • At which level should we cut the trees?
  • Examine the trees

30
Correlation Based Feature Reduction
31
Correlation Based Feature Reduction
32
Correlation Based Feature Reduction
  • Cut from a level above the lower bound lines
  • have a small number of clusters
  • insufficient to explain the closeness of probes
  • Cut from a level below the upper bound lines
  • have a high number of clusters
  • forcing the clustering algorithm to divide
    clusters that consist of very similar samples
  • Lower bound: 40 clusters
  • Upper bound: 95 clusters

33
Correlation Based Feature Reduction
  • To find the optimal k value
  • run k-means several times for each k value
    between 40 and 95
  • keep the k that produces the clustering of
    highest quality
  • High quality = small intra-cluster distance
  • k is found to be 80
  • store the clustering result that is produced when
    k = 80
  • run SOM clustering with a 9x9 map and store the
    clustering result

34
Correlation Based Feature Reduction
  • How do we select the signature probes from the
    clusters?
  • select the ones that are most significant
  • How many probes do we select from each cluster?
  • select the n most significant probes from each
    cluster, where n is dependent on the quality
    of the cluster
  • Quality of cluster = intra-cluster distance
  • Quality value is high -> intra-cluster similarity
    is low -> cluster is loose
  • Quality value is low -> intra-cluster similarity
    is high -> cluster is tight
  • Take more probes from clusters that are loose in
    order to represent those clusters better
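The selection rule above can be sketched as a quota per cluster followed by ranking on significance. The slides do not give the exact relation between n and cluster quality, so the proportional allocation used here is purely an assumption, and all names, distances and p-values are hypothetical.

```python
def probes_per_cluster(intra_distance, total):
    """Split a budget of `total` probes in proportion to cluster looseness.

    Assumption: the quota grows linearly with intra-cluster distance,
    with at least one probe taken from every cluster.
    """
    overall = sum(intra_distance.values())
    return {c: max(1, round(total * d / overall)) for c, d in intra_distance.items()}


def select_signature(clusters, p_values, quota):
    """From each cluster take the quota[c] probes with the smallest p-values."""
    selected = []
    for c, members in clusters.items():
        ranked = sorted(members, key=lambda probe: p_values[probe])
        selected.extend(ranked[:quota[c]])
    return selected


# Hypothetical clusters, intra-cluster distances and p-values.
clusters = {"tight": ["a", "b"], "loose": ["c", "d", "e", "f"]}
intra_distance = {"tight": 1.0, "loose": 3.0}
p_values = {"a": 0.01, "b": 0.001, "c": 0.04, "d": 0.002, "e": 0.03, "f": 0.01}

quota = probes_per_cluster(intra_distance, total=4)
signature = select_signature(clusters, p_values, quota)
print(quota, signature)
```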

35
Correlation Based Feature Reduction
  • From clusters formed by k-means
  • 144 probes
  • From clusters formed by SOM
  • 141 probes
  • We formed another set of 88 probes that appear
    in both the k-means and the SOM selections
    (named the common probes)

36
Principal Component Analysis
  • The set of statistically significant probes is
    given directly to prtools' pca function
  • 99%, 90% and 85% data preservation
  • Number of resulting principal components

Percentage   99%   90%   85%
# of PCs     29    22    20
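How a preservation percentage translates into a number of components can be sketched as follows, assuming that "data preservation" means the fraction of total variance retained. The deck itself used the prtools pca function; this standalone Python function and its variance values are hypothetical and only illustrate the criterion.

```python
def components_needed(variances, fraction):
    """Smallest number of leading components whose variance share >= fraction."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, v in enumerate(ordered, start=1):
        running += v
        if running / total >= fraction:
            return count
    return len(ordered)


# Hypothetical per-component variances (eigenvalues of the covariance matrix).
variances = [5.0, 2.5, 1.2, 0.7, 0.4, 0.2]
counts = {f: components_needed(variances, f) for f in (0.99, 0.90, 0.85)}
print(counts)
```

Lower preservation thresholds need fewer components, matching the trend in the table (29, 22 and 20).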
37
Classification
  • Feature sets from feature reduction based on
    correlation that can be used by classifiers
  • K-means probe set
  • SOM probe set
  • Common probe set
  • Algorithms
  • Linear classifier
  • Support vector machine
  • 1-nearest neighbor classifier
  • 3-nearest neighbor classifier

38
Classification
  • Only 30 samples in our data set -> k-fold cross
    validation
  • To reduce the bias caused by random selection of
    samples for training and testing sets:
  • Perform 100 repeated classifications for each
    classifier
  • Report the average classification error
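The evaluation protocol (k-fold cross validation repeated 100 times, average error reported) can be sketched with a 1-nearest neighbor classifier. The number of folds is not stated on the slides, so k = 5 is an assumption, and the two-feature samples are hypothetical.

```python
import math
import random


def nearest_neighbor_predict(train, x):
    """Label of the training sample closest to x (1-nearest neighbor)."""
    return min(train, key=lambda item: math.dist(item[0], x))[1]


def k_fold_error(samples, k, rng):
    """Error rate of one k-fold cross validation run on shuffled samples."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    wrong = 0
    for i, test_fold in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        wrong += sum(nearest_neighbor_predict(train, x) != y for x, y in test_fold)
    return wrong / len(samples)


def repeated_cv_error(samples, k=5, repeats=100, seed=0):
    """Average error over `repeats` independent k-fold runs."""
    rng = random.Random(seed)
    return sum(k_fold_error(samples, k, rng) for _ in range(repeats)) / repeats


# Hypothetical data: 18 emphysema and 12 control samples, two features each.
data_rng = random.Random(1)
samples = [((0.8 + data_rng.uniform(-0.1, 0.1), 0.2 + data_rng.uniform(-0.1, 0.1)),
            "emphysema") for _ in range(18)]
samples += [((0.2 + data_rng.uniform(-0.1, 0.1), 0.8 + data_rng.uniform(-0.1, 0.1)),
             "control") for _ in range(12)]

error = repeated_cv_error(samples)
print(error)
```

Averaging over many random splits makes the reported error less dependent on any single partition of the 30 samples.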

39
Classification (k-means probe set)
40
Classification (k-means probe set)
41
Classification (SOM probe set)
42
Classification (common probe set)
43
Classification
  • Three sets of principal components from feature
    reduction with principal component analysis can
    be used by classifiers
  • Algorithms
  • Linear classifier
  • Support vector machine
  • 1-nearest neighbor classifier
  • 3-nearest neighbor classifier

44
Classification (PCA 99%)
45
Classification (PCA 90%)
46
Classification (PCA 85%)
47
Classification
  • In all cases, support vector machines provide
    the best classification results

48
Classification
  • Support vector machines that use the principal
    component sets perform worse than the ones that
    use the probe sets produced by the correlation
    based feature reduction method
  • Support vector classifiers that use the k-means,
    SOM and common probe sets perform more or less
    similarly
  • classify samples with 99% accuracy on average
  • Use the set of common probes as signature genes
  • the aim is to reduce the number of features
    without sacrificing classification performance

49
Final feature reduction
  • 88 features is still a high number
  • Use Fisher's linear discriminant to further
    reduce the number of features
  • The resulting signature gene set consists of 26
    probes
  • Classification performance even improved when
    the number of probes was reduced
  • Able to classify the 30 samples with 99.7-100%
    accuracy on average
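The slides do not say how Fisher's linear discriminant was applied to pick the 26 probes. One common approach, shown here purely as an assumption, is to rank each feature by the Fisher criterion (squared difference of class means divided by the sum of class variances) and keep the top-ranked ones. All names and values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(a, b):
    """Fisher criterion of one feature for two classes."""
    spread = pvariance(a) + pvariance(b)
    if spread == 0:
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) ** 2 / spread


def top_features(class_a, class_b, n):
    """Names of the n features with the highest Fisher score."""
    scores = {name: fisher_score(class_a[name], class_b[name]) for name in class_a}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical normalized expression values per probe and class.
emphysema = {"g1": [0.9, 0.8, 1.0], "g2": [0.5, 0.4, 0.6], "g3": [0.7, 0.2, 0.9]}
control = {"g1": [0.1, 0.2, 0.0], "g2": [0.5, 0.6, 0.4], "g3": [0.3, 0.8, 0.1]}

best = top_features(emphysema, control, n=2)
print(best)
```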

50
Final classification
51
Signature Genes
  • PTGDS: prostaglandin D2 synthase 21kDa (brain)
  • Key enzyme for the generation of prostanoids in
    the immune system.
  • Prostaglandins -> various influences throughout
    the body, such as roles in inflammation and
    smooth muscle contraction.
  • The expression of PTGDS is enhanced in patients
    with emphysema compared to the control patients.

52
Signature Genes
  • There are also reports in the literature that
    support the rise of prostaglandins as potent
    proinflammatory mediators in response to various
    stimuli, including allergic airway inflammation.
  • Thus, considering that the immune response is
    stimulated in emphysema due to bacterial
    infection, a rise in the expression of
    prostaglandins is expected because of their role
    in inflammation.

53
Signature Genes
  • BIRC4: Baculoviral IAP repeat-containing 4
  • The protein encoded by this gene belongs to a
    family of proteins that block apoptosis.
  • According to the literature, 15-deoxy-
    delta(12,14)-prostaglandin J2 leads to the
    reduction of BIRC4.
  • The expression results indicate that BIRC4 is
    decreased in patients with emphysema compared to
    the patients without emphysema.
  • Speculation: in the control samples, the cancer
    tissue may influence the expression of BIRC4, so
    its expression may appear high.

54
Signature Genes
  • The spastin gene has a key role in cytoskeletal
    rearrangement and dynamics.
  • The expression of spastin is decreased in
    patients with emphysema compared to patients
    without the disease.
  • Considering that macrophages release substances
    that damage the proteins responsible for the
    contraction and expansion of the lung, a decrease
    in the level of spastin is highly expected.

55
Signature Genes
  • FGF18: fibroblast growth factor 18
  • According to literature data (Decreased Lung
    Fibroblast Growth Factor 18 and Elastin in Human
    Congenital Diaphragmatic Hernia and Animal
    Models), FGF18 reduction has been associated
    with a decrease in elastin expression and
    impaired elastin deposition.
  • Therefore, the decrease of FGF18 seen in
    patients with emphysema is an expected finding
    that is consistent with the loss of elasticity
    in the lungs, specifically in the alveoli.

56
Signature Genes
  • TCF12: transcription factor 12 (HTF4, helix-
    loop-helix transcription factors 4)
  • Important regulator of gene expression during
    lymphocyte development
  • The encoded protein is expressed in many
    tissues, among them skeletal muscle, thymus, and
    B- and T-cells
  • The expression of this gene is enhanced in
    patients with emphysema compared to the patients
    without emphysema.
  • This is consistent with emphysema pathogenesis,
    given the large-scale infection associated with
    cigarette smoking.

57
Signature Genes
  • CSNK1A1: casein kinase 1, alpha 1
  • Catalytic unit -> involved in the transduction
    of the Wnt signal.
  • According to previous studies in the literature,
    an inhibitor of Wnt signalling (secreted
    frizzled-related protein) is expressed in lung
    tissue with emphysema but not in healthy lung
    tissue.
  • The reduction in the expression of this gene in
    patients with emphysema is expected and agrees
    with the literature data.
  • Diminishing of CSNK1A1 -> prevention of Wnt
    signalling -> rise in matrix proteases that are
    able to degrade all types of extracellular matrix
    proteins

58
Signature Genes
  • PPIB: peptidylprolyl isomerase B (cyclophilin B)
  • These proteins are found in biological fluids in
    response to inflammatory stimuli or oxidative
    stress.
  • Cyclophilin B has a role in T-lymphocyte
    function and recruitment.
  • Therefore, the increase in this gene transcript
    in patients with emphysema, compared to the
    patients who do not have the disease, is expected
    due to the immune response associated with the
    bacterial infection.

59
Signature Genes
  • C1R: complement component 1, r subcomponent
  • The complement system is an early-acting
    mechanism that begins and propagates the
    inflammatory response in the fight against
    microbial infections.
  • Therefore, since C1R is a component of this
    system, the increase of this transcript in
    tissues of patients with emphysema compared to
    the patients without emphysema is expected,
    considering the elevated levels of infection in
    emphysema patients.

60
Signature Genes
  • FLNB: filamin B, beta (actin binding protein 278)
  • The protein is responsible for connecting the
    cell membrane components to the actin
    cytoskeleton. It also anchors various
    transmembrane proteins to the actin cytoskeleton.
  • The transcript level of this gene appears to be
    enhanced in tissues of patients with emphysema
    compared to the tissues of patients without
    emphysema.
  • This increase may compensate for the connective
    tissues that are damaged by the proteases
    released from immune cells.

61
Signature Genes
  • CST1: cystatin SN
  • These proteins participate in mechanisms that
    guard the lungs against injury or inflammation.
  • As expected, the transcript level for this gene
    is enhanced in tissues of patients with emphysema
    compared to the tissues of patients without
    emphysema.

62
THANK YOU
63
REFERENCES
  • http://www.emedicinehealth.com/emphysema/article_em.htmEmphysema20Overview
  • Effect of thymoquinone on cyclooxygenase
    expression and prostaglandin production in a
    mouse model of allergic airway inflammation.
    Immunology Letters, Volume 106, Issue 1, 15 July
    2006, Pages 72-81.
  • Decreased Lung Fibroblast Growth Factor 18 and
    Elastin in Human Congenital Diaphragmatic Hernia
    and Animal Models. Olivier Boucherat, Alexandra
    Benachi, Anne-Marie Barlier-Mur, Montoya, Jelena
    Martinovic, Bernard Thébaud, Bernadette
    Chailley-; American Journal of Respiratory and
    Critical Care Medicine, Vol 175, pp. 1066-1077
    (2007).
  • http://www.abcam.com/index.html?datasheet33581
  • en.wikipedia.org/wiki/Alveoli

64
REFERENCES
  • The basic helix-loop-helix transcription factor
    HEBAlt is expressed in pro-T cells and enhances
    the generation of T cell precursors. Wang D,
    Claus CL, Vaccarelli G, Braunstein M, Schmitt TM,
    Zúñiga-Pflücker JC, Rothenberg EV, Anderson MK.
    J Immunol. 2006 Jul 1;177(1):109-19.
  • Activation of an Embryonic Gene Product in
    Pulmonary Emphysema: Identification of the
    Secreted Frizzled-Related Protein. K. Imai, DMD,
    PhD, and J. D'Armiento, MD, PhD. Chest.
    2000;117:229S.
  • Regulated expression of the interstitial
    collagenase matrix metalloproteinase-1 (MMP-1)
    in the lung by MAP kinase and Wnt signaling.
    Becky A Mercer, COLUMBIA UNIVERSITY.
  • http://en.wikipedia.org/wiki/Matrix_metalloproteinase
  • Interaction with glycosaminoglycans is required
    for cyclophilin B to trigger integrin-mediated
    adhesion of peripheral blood T lymphocytes to
    extracellular matrix. PNAS, March 5, 2002,
    vol. 99, no. 5, 2714-2719.

65
REFERENCES
  • Methods for treating conditions associated with
    masp-2 dependent complement activation. U.S.
    Provisional Application No. 60/578,847, filed
    Jun. 10, 2004.
  • http://biomed.ngic.re.kr/cgibin/cards/carddisp.pl?geneFLNBsearchneurodegenerative20or20senilepubmed114