Title: COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis
1COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis
- Müge Erdogmus
- Zeynep Isik
2DISEASE EMPHYSEMA
- Emphysema is a lung disease that belongs to the group of diseases called chronic obstructive pulmonary disease.
- In patients with emphysema, the ability of the lungs to expel air is diminished.
- The lungs lose their elasticity and thus become less contractile.
- In emphysema, the lung tissues that are responsible for supporting the physical shape and function of the lungs are damaged.
3EMPHYSEMA
- The lung tissue around the smaller airways (bronchioles) and the alveoli -> targets of the destruction.
- Normally the lungs are very elastic and spongy, but not in emphysema!
4Causes of Emphysema
- Alpha-1-antitrypsin deficiency
- Cigarette smoking
- 1) It damages the lung tissue in various ways -> the cells in the airway that are responsible for clearing mucus and other secretions are affected by cigarette smoking.
- 2) Enhanced mucus secretion -> a rich source of food for bacteria, while the immune cells are impaired by cigarette smoke in their fight against infection.
- Destructive enzymes from the immune cells -> loss of proteins associated with elasticity.
5The Microarray Experiment
- The gene expression dataset is composed of 30 samples retrieved from NCBI's Gene Expression Omnibus.
- The RNA transcripts used for measuring the expression signals are taken from Homo sapiens.
- 18 slides -> severely emphysematous tissue removed at LVRS
- 12 slides -> normal or mildly emphysematous lung tissue taken from smokers with nodules suspicious for lung cancer.
- More than 33,000 of the best-characterized human genes are represented in the dataset, which includes 1,000,000 unique oligonucleotide features.
6Data Analysis
[Flowchart] Read XLS Files -> Normalization -> Outlier Removal -> Check Normal Dist. -> Hypothesis Testing -> PCA | Correlation Based -> Classification
7Data Normalization
- Why do we normalize data?
- Scale data for reasonable comparisons
- Map the expression values of each probe into the [0, 1] range
- Do not disturb the underlying distribution of the data
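The mapping into [0, 1] can be sketched as a per-probe min-max rescaling. This is a minimal illustration, not the project's actual code: the function name and the example values are hypothetical, and mapping a constant probe to 0.0 is an assumed convention. Because the rescaling is linear, the shape of each probe's distribution is preserved.

```python
def min_max_normalize(values):
    """Rescale one probe's expression values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant probe is mapped to 0.0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical expression values of one probe across five samples.
probe = [120.0, 340.0, 230.0, 560.0, 120.0]
print(min_max_normalize(probe))
```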
8Outlier Removal
- Why do we need to remove outliers?
- They can distort the mean of the data -> wrong clusterings, wrong significance values
- What is an outlier?
- Expression values that are three or more standard deviations away from the mean are outliers
- Replace outliers with the mean of the remaining expression values for the probe
- Detected outliers in 5331 probes
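The three-standard-deviation rule and the replacement step can be sketched as follows. This is an illustrative sketch under assumptions: the function name is hypothetical, and using the sample standard deviation (rather than the population one) is an assumption, since the slides do not say which was used.

```python
from statistics import mean, stdev


def replace_outliers(values, n_sd=3.0):
    """Replace values at least n_sd standard deviations from the mean.

    Returns the cleaned list and the number of replaced values.
    """
    m, s = mean(values), stdev(values)  # sample standard deviation (assumption)
    if s == 0:
        return list(values), 0
    is_outlier = [abs(v - m) >= n_sd * s for v in values]
    inliers = [v for v, out in zip(values, is_outlier) if not out]
    fill = mean(inliers)  # mean of the remaining (non-outlier) values
    cleaned = [fill if out else v for v, out in zip(values, is_outlier)]
    return cleaned, sum(is_outlier)


# Hypothetical probe: 29 ordinary values and one extreme measurement.
cleaned, n_replaced = replace_outliers([10.0] * 29 + [100.0])
print(n_replaced, cleaned[-1])
```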
9Clustering Before Feature Reduction
- Clustering on samples using all existing features to see how bad the situation is
- K-means clustering with Euclidean distance
- For selection of the initial k centers
- KCENTRES algorithm
- Select k center objects from the distance matrix such that the distance between the most distant object and the center that is closest to that object is minimized.
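The k-center objective described above (minimize the largest distance from any object to its closest center) can be illustrated with the greedy farthest-first heuristic. Note that this is a swapped-in, standard approximation and not necessarily what the KCENTRES routine does internally; the function names and the toy points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k, first=0):
    """Greedy heuristic for the k-center objective.

    Returns the indices of the chosen centers and the resulting radius,
    i.e. the largest distance from any point to its closest center.
    """
    centers = [first]
    # dist_to_center[i] = distance from point i to its closest chosen center
    dist_to_center = [euclidean(p, points[first]) for p in points]
    while len(centers) < k:
        # The next center is the point that is currently worst served.
        nxt = max(range(len(points)), key=lambda i: dist_to_center[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            dist_to_center[i] = min(dist_to_center[i], euclidean(p, points[nxt]))
    return centers, max(dist_to_center)


toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(farthest_first_centers(toy, k=2))
```

The selected objects can then serve as the initial centers for k-means.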
10Clustering Before Feature Reduction
11Clustering Before Feature Reduction
- Really low clustering performance
- Data set is noisy
- Reduce features so as to keep only the probes that are really significant
- Differentially expressed among classes
- ...can be used to differentiate between the two classes of samples
12Searching for Normally Distributed Probes
- Why do we need to identify which probes are normally distributed and which ones are not?
- Apply different tests of significance
- Lilliefors test (95% confidence, i.e. 5% significance level)
- Tests the normality of the distribution by examining the signal intensity data for a probe
- Modified version of the Kolmogorov-Smirnov test
- No need to specify the parameters of the underlying distribution of the data
- Approximates the underlying distribution
- 14787 probes have normal distribution
- 7428 probes do not have normal distribution
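The idea behind the Lilliefors test can be sketched with the standard library only: compute the Kolmogorov-Smirnov statistic against a normal distribution whose mean and standard deviation are estimated from the data, then judge it against the null distribution of that statistic. In this sketch the null distribution is approximated by Monte Carlo simulation instead of the tabulated critical values; that choice, the function names, and the simulation settings are assumptions, not the project's implementation. The probe is assumed to be non-constant.

```python
import random
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(values):
    """KS distance between the empirical CDF and a normal CDF fitted to the data."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))  # parameters estimated from data
    d = 0.0
    for i, x in enumerate(sorted(values)):
        f = fitted.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def lilliefors_p_value(values, n_sim=2000, seed=0):
    """Monte Carlo p-value: share of simulated normal samples with a larger statistic."""
    rng = random.Random(seed)
    n = len(values)
    observed = lilliefors_statistic(values)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if lilliefors_statistic(sample) >= observed:
            hits += 1
    return hits / n_sim


# Hypothetical, clearly bimodal probe measured on 30 samples.
bimodal = [1.0] * 15 + [2.0] * 15
print("reject normality at 5%:", lilliefors_p_value(bimodal) < 0.05)
```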
13Tests of Significance
- For probes that have normal distribution
- T-test or Z-test
- 30 samples -> t-test (95% confidence)
- For probes that do not have normal distribution
- Non-parametric test
- Wilcoxon rank-sum test (95% confidence)
- ... sorts all intensity values for a probe
- ... gives each intensity value a rank
- ... sums up the ranks of the signal values for both classes
- ... compares the sums to decide whether the two samples come from the same distribution.
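The rank-sum steps listed above can be written out directly. This is a minimal sketch that uses the large-sample normal approximation for the p-value and no tie correction of the variance; both simplifications, as well as the function names and example values, are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Return (rank sum of a, z score, two-sided p-value)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))  # rank the pooled values
    w = sum(ranks[:n1])                       # rank sum of the first class
    mu = n1 * (n1 + n2 + 1) / 2               # expected rank sum under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p


# Hypothetical intensities of one probe in the two classes.
w, z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(w, p < 0.05)
```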
14Tests of Significance
- 2339 of 22215 probes are differentially expressed
- eliminated 19876 probes
15Clustering Before Feature Reduction
- Clustering on samples using only differentially expressed features to see whether feature reduction proved to be useful
- K-means clustering with the same procedure
16Clustering Before Feature Reduction
17Further Feature Reduction
- 2339 features is a high number
- Try to reduce the number of features while preserving clustering accuracy
- Two methods
- Correlation based feature reduction
- Principal component analysis
18Correlation Based Feature Reduction
- Extract uncorrelated features
- Cluster uncorrelated features
- K-means
- SOM
- Prior to clustering, find the value of k from hierarchical clustering
- Different distance metrics
- Different hierarchical clustering methods
- From each cluster of each clustering, select certain genes
- These genes will be used for classification
19Correlation Based Feature Reduction
- Extract uncorrelated features
- Find the correlation matrix
- Keep one of the highly correlated features and remove the others.
- Highly correlated: > 85%
- Keep the feature that is closest to all other features
- 1689 probes are left in our feature set
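One way to realize this filter is a greedy pass over the correlation matrix. The sketch below is an interpretation, not the project's code: it groups each remaining feature with everything whose absolute Pearson correlation to it exceeds the threshold, and keeps the group member with the highest total correlation to the rest of the group as the "closest" feature. The grouping rule, the function names, and the toy data are assumptions; features are assumed to be non-constant.

```python
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_correlated(features, threshold=0.85):
    """Return the sorted indices of the features that are kept."""
    m = len(features)
    corr = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            corr[i][j] = corr[j][i] = abs(pearson(features[i], features[j]))
    remaining = list(range(m))
    kept = []
    while remaining:
        seed = remaining[0]
        # All remaining features highly correlated with the seed (seed included).
        group = [j for j in remaining if corr[seed][j] > threshold]
        # Representative: the member most correlated with the rest of its group.
        rep = max(group, key=lambda i: sum(corr[i][j] for j in group))
        kept.append(rep)
        remaining = [j for j in remaining if j not in group]
    return sorted(kept)


# Hypothetical probes: the first two are nearly proportional, the third is not.
probes = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5], [5, 1, 4, 2, 3]]
print(drop_correlated(probes))
```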
20Correlation Based Feature Reduction
- Prior to clustering, find the value of k from hierarchical clustering
- Different distance metrics
- Manhattan
- Euclidean
- Mahalanobis
- Chebyshev
- Correlation coefficients
- Different hierarchical clustering methods
- Complete linkage (max distance clustering)
- Average linkage (average distance clustering)
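For reference, most of the listed distance metrics and both linkage rules fit in a few lines. This is an illustrative sketch with hypothetical function names; Mahalanobis distance is left out because it additionally needs the inverse covariance matrix of the data, and "correlation coefficients" is interpreted here as the distance 1 - r, which is an assumption.

```python
from math import sqrt


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def correlation_distance(a, b):
    """1 - Pearson correlation (assumed interpretation of the correlation metric)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return 1.0 - sab / sqrt(saa * sbb)


def complete_linkage(c1, c2, dist):
    """Cluster distance = largest pairwise distance between members."""
    return max(dist(p, q) for p in c1 for q in c2)


def average_linkage(c1, c2, dist):
    """Cluster distance = mean pairwise distance between members."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))


left, right = [(0, 0), (1, 0)], [(3, 0), (5, 0)]
print(complete_linkage(left, right, euclidean), average_linkage(left, right, euclidean))
```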
21Correlation Based Feature Reduction
Complete Linkage with Manhattan Distance
22Correlation Based Feature Reduction
Average Linkage with Manhattan Distance
23Correlation Based Feature Reduction
Complete Linkage with Euclidean Distance
24Correlation Based Feature Reduction
Average Linkage with Euclidean Distance
25Correlation Based Feature Reduction
Complete Linkage with Chebyshev Distance
26Correlation Based Feature Reduction
Average Linkage with Chebyshev Distance
27Correlation Based Feature Reduction
Complete Linkage with Correlation Coeff.
28Correlation Based Feature Reduction
Average Linkage with Correlation Coeff.
29Correlation Based Feature Reduction
- The complete linkage method is more successful in separating the clusters
- The probes that are very similar to each other are put into the same clusters; the ones that are very different are put into different clusters
- Focused on the trees that are formed using complete linkage with Euclidean distance and correlation coefficients as distance metrics
- At which level should we cut the trees?
- Examine the trees
30Correlation Based Feature Reduction
31Correlation Based Feature Reduction
32Correlation Based Feature Reduction
- Cut from a level above the lower bound lines
- have a small number of clusters
- insufficient to explain the closeness of probes.
- Cut from a level below the upper bound lines
- have a high number of clusters
- forcing the clustering algorithm to divide clusters that consist of very similar samples
- Lower bound: 40 clusters
- Upper bound: 95 clusters
33Correlation Based Feature Reduction
- To find the optimal k value
- run several k-means algorithms for each k value between 40 and 95
- keep the k that produces the clustering of highest quality
- High quality = small intra-cluster distance
- k is found to be 80
- store the clustering result that is produced when k = 80
- run SOM clustering with a 9x9 map and store the clustering result
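The search over k can be sketched as several randomly initialized k-means runs per candidate k, scored by mean intra-cluster distance. This is a minimal sketch with hypothetical names and toy data, not the project's procedure. One caveat worth keeping in mind: intra-cluster distance generally shrinks as k grows, so how the final k was picked within the 40 to 95 range is not fully specified by the slides.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, rng, n_iter=100):
    """Plain Lloyd iterations starting from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def mean_intra_cluster_distance(points, labels, centers):
    return sum(sq_dist(p, centers[l]) ** 0.5
               for p, l in zip(points, labels)) / len(points)


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means several times and keep the tightest clustering."""
    rng = random.Random(seed)
    runs = [k_means(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: mean_intra_cluster_distance(points, *run))


# Toy one-dimensional data; the real search would scan k from 40 to 95.
toy = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
for k in (2, 3):
    labels, centers = best_of_restarts(toy, k)
    print(k, mean_intra_cluster_distance(toy, labels, centers))
```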
34Correlation Based Feature Reduction
- How do we select the signature probes from the clusters?
- select the ones that are most significant
- How many probes do we select from each cluster?
- select the n most significant probes from each cluster, where n is dependent on the quality of the cluster
- Quality of cluster = intra-cluster distance
- Quality value is high -> intra-cluster similarity is low -> cluster is loose
- Quality value is low -> intra-cluster similarity is high -> cluster is tight
- Take more probes from clusters that are loose in order to represent those clusters better
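The selection rule can be sketched as follows. The slides do not give the exact mapping from cluster quality to n, so the linear rule used here (between `min_n` and `max_n` probes, scaled by the cluster's intra-cluster distance relative to the loosest cluster) is purely a hypothetical stand-in, as are all names and values.

```python
def select_signature_probes(clusters, p_values, spread, min_n=1, max_n=3):
    """Pick the n most significant probes per cluster; looser clusters get a larger n.

    clusters: cluster id -> list of probe ids
    p_values: probe id -> p-value from the significance test
    spread:   cluster id -> intra-cluster distance (the "quality value")
    """
    max_spread = max(spread.values())
    selected = {}
    for cid, probes in clusters.items():
        frac = spread[cid] / max_spread if max_spread > 0 else 0.0
        n = min_n + round((max_n - min_n) * frac)  # hypothetical rule
        ranked = sorted(probes, key=lambda p: p_values[p])  # most significant first
        selected[cid] = ranked[:n]
    return selected


clusters = {"tight": ["a", "b", "c"], "loose": ["d", "e", "f"]}
p_values = {"a": 0.01, "b": 0.001, "c": 0.04, "d": 0.03, "e": 0.002, "f": 0.02}
spread = {"tight": 0.4, "loose": 4.0}
print(select_signature_probes(clusters, p_values, spread))
```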
35Correlation Based Feature Reduction
- From clusters formed by k-means
- 144 probes
- From clusters formed by SOM
- 141 probes
- We formed another set of 88 probes that are found both among the probes selected from k-means clustering and among those selected from SOM clustering (named common probes)
36Principal Component Analysis
- The set of statistically significant probes is given directly to PRTools' pca function
- 99%, 90% and 85% data preservation
- Number of resulting principal components

Percentage   99%   90%   85%
# of PCs     29    22    20
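How a preservation percentage translates into a number of components can be sketched from the eigenvalues of the covariance matrix: components are added in order of decreasing variance until the requested share of total variance is reached. The helper name and the eigenvalues below are hypothetical; the actual values for this data set are not given in the slides.

```python
def components_needed(eigenvalues, fraction):
    """Smallest number of leading components whose variance reaches `fraction` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for n, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running >= fraction * total:
            return n
    return len(eigenvalues)


# Hypothetical eigenvalue spectrum.
spectrum = [5.0, 3.0, 1.0, 0.5, 0.5]
for fraction in (0.99, 0.85):
    print(fraction, components_needed(spectrum, fraction))
```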
37Classification
- Feature sets from correlation-based feature reduction that can be used by classifiers
- K-means probe set
- SOM probe set
- Common probe set
- Algorithms
- Linear classifier
- Support vector machine
- 1-nearest neighbor classifier
- 3-nearest neighbor classifier
38Classification
- 30 samples in our data set -> k-fold cross validation
- Bias caused by random selection of samples for training and testing sets
- Repeatedly perform 100 classifications for each classifier
- Report the average classification error
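The evaluation scheme (repeat a randomly partitioned k-fold cross validation 100 times and average the error) can be sketched with a 1-nearest neighbor classifier. The number of folds is not stated in the slides, so `k=5` is an assumption; the function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_predict(train_x, train_y, x):
    """1-NN: return the label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]


def k_fold_error(xs, ys, k, rng):
    """Error rate of one k-fold cross validation with a random partition."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        errors += sum(nearest_neighbor_predict(tx, ty, xs[i]) != ys[i] for i in fold)
    return errors / len(xs)


def repeated_cv_error(xs, ys, k=5, repeats=100, seed=0):
    """Average the k-fold error over many random partitions."""
    rng = random.Random(seed)
    return sum(k_fold_error(xs, ys, k, rng) for _ in range(repeats)) / repeats


# Toy data: 15 samples per class, well separated on a single feature.
xs = [(float(i % 3),) for i in range(15)] + [(10.0 + i % 3,) for i in range(15)]
ys = [0] * 15 + [1] * 15
print(repeated_cv_error(xs, ys))
```

Averaging over repeated random partitions reduces the dependence of the reported error on any single train/test split.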
39Classification (k-means probe set)
40Classification (k-means probe set)
41Classification (SOM probe set)
42Classification (common probe set)
43Classification
- Three sets of principal components from feature reduction with principal component analysis can be used by classifiers
- Algorithms
- Linear classifier
- Support vector machine
- 1-nearest neighbor classifier
- 3-nearest neighbor classifier
44Classification (PCA 99)
45Classification (PCA 90)
46Classification (PCA 85)
47Classification
- In all cases, support vector machines provide the best classification results
48Classification
- The performance of support vector machines that use the principal component sets is worse than that of the ones that use the probe sets produced by the correlation-based feature reduction method
- The performance of support vector classifiers that use the k-means, SOM and common probe sets is more or less similar
- They classify samples with 99% accuracy on average
- Use the set of common probes as signature genes
- The aim is to reduce the number of features without sacrificing classification performance
49Final feature reduction
- 88 features is still a high number
- Use Fisher's linear discriminant to further reduce the number of features
- The resulting signature gene set consists of 26 probes
- Classification performance even improved when the number of probes was reduced
- Able to classify the 30 samples with 99.7-100% accuracy on average
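The slides do not say exactly how Fisher's linear discriminant was used to pick the 26 probes. One common way to apply Fisher's criterion to feature reduction is to score each feature separately by the squared difference of the class means divided by the sum of the class variances, and keep the top-scoring features. The sketch below shows that per-feature variant as an assumption; the names and the toy data are hypothetical.

```python
from statistics import mean, variance


def fisher_scores(class_a, class_b):
    """Per-feature Fisher criterion: (mean_a - mean_b)^2 / (var_a + var_b)."""
    scores = []
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        between = (mean(col_a) - mean(col_b)) ** 2
        within = variance(col_a) + variance(col_b)
        if within > 0:
            scores.append(between / within)
        else:
            # Degenerate case: no within-class spread at all.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_features(class_a, class_b, n):
    """Indices of the n features with the highest Fisher score."""
    scores = fisher_scores(class_a, class_b)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Rows are samples, columns are probes; only the first probe separates the classes.
emphysema = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]]
control = [[3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
print(top_features(emphysema, control, n=1))
```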
50Final classification
51Signature Genes
- PTGDS: prostaglandin D2 synthase 21kDa (brain)
- Key enzyme for the generation of prostanoids in the immune system.
- Prostaglandins -> various influences throughout the body, such as roles in inflammation and smooth muscle contraction.
- The expression of PTGDS is enhanced in patients with emphysema compared to the control patients.
52Signature Genes
- There are also cases in the literature that support the rise of prostaglandins as potent proinflammatory mediators in response to various stimuli, including allergic airway inflammation.
- Thus, considering that the immune response is stimulated in emphysema due to bacterial infection, a rise in the expression of prostaglandins is expected because of their role in inflammation.
53Signature Genes
- BIRC4: baculoviral IAP repeat-containing 4
- The protein encoded by this gene belongs to a family of proteins that block apoptosis.
- According to the literature, 15-deoxy-delta(12,14)-prostaglandin J2 leads to the reduction of BIRC4.
- The expression results indicate that a decrease in BIRC4 is observed in patients with emphysema compared to the patients without emphysema.
- Speculation -> in the control samples, the cancer tissues may influence the expression of BIRC4, so its expression may be measured as high.
54Signature Genes
- The spastin gene has a key role in cytoskeletal rearrangement and dynamics.
- The expression of spastin is decreased in patients with emphysema compared to the patients who do not have it.
- Considering that macrophages release substances that damage the proteins responsible for the contraction and expansion of the lung, a decrease in the level of spastin is highly expected.
55Signature Genes
- FGF18: fibroblast growth factor 18
- According to the literature (Decreased Lung Fibroblast Growth Factor 18 and Elastin in Human Congenital Diaphragmatic Hernia and Animal Models), FGF18 reduction has been associated with a decrease in elastin expression and impaired elastin deposition.
- Therefore, the decrease of FGF18 seen in patients with emphysema is an expected finding, which is correlated with the loss of elasticity in the lungs, specifically in the alveoli.
56Signature Genes
- TCF12: transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
- Important regulator of gene expression during lymphocyte development
- The encoded protein is expressed in many tissues, among them skeletal muscle, thymus, B- and T-cells
- The expression of this gene is enhanced in patients with emphysema compared to the patients without emphysema.
- This is correlated with emphysema pathogenesis due to the large-scale infection associated with cigarette smoking.
57Signature Genes
- CSNK1A1: casein kinase 1, alpha 1
- Catalytic unit -> involved in the transduction of the Wnt signal.
- According to previous studies in the literature, an inhibitor of Wnt signalling (secreted frizzled-related protein) is expressed in lung tissue with emphysema but not in healthy lung tissue.
- The reduction in the expression of this gene in patients with emphysema is expected and agrees with the literature data.
- Diminishing of CSNK1A1 -> prevention of Wnt signalling -> rise in matrix proteases that are able to degrade all types of extracellular matrix proteins
58Signature Genes
- PPIB: peptidylprolyl isomerase B (cyclophilin B)
- These proteins are found in biological fluids in response to inflammatory stimuli or oxidative stress.
- Cyclophilin B has a role in T-lymphocyte function and recruitment.
- Therefore, the increase in this gene transcript in patients with emphysema compared to the patients who do not have the disease is expected, due to the immune response associated with the bacterial infection.
59Signature Genes
- C1R: complement component 1, r subcomponent
- In order to begin and propagate the inflammatory response, an early-acting mechanism, the complement system, fights against microbial infections.
- Therefore, since C1R is a component of this system, the increase of its transcript in tissues of patients with emphysema compared to the patients without emphysema is expected, considering the elevated levels of infection in emphysema patients.
60Signature Genes
- FLNB: filamin B, beta (actin binding protein 278)
- The protein is responsible for connecting cell membrane components to the actin cytoskeleton. It also anchors various transmembrane proteins to the actin cytoskeleton.
- The transcript level of this gene appears to be enhanced in tissues of patients with emphysema compared to the tissues of patients without emphysema.
- This increase may be compensation for the connective tissues that are damaged by the proteases released from immune cells.
61Signature Genes
- CST1: cystatin SN
- These proteins participate in mechanisms that guard the lungs against injury or inflammation.
- As expected, the transcript level of this gene is enhanced in tissues of patients with emphysema compared to the tissues of patients without emphysema.
62THANK YOU
63REFERENCES
- http://www.emedicinehealth.com/emphysema/article_em.htmEmphysema20Overview
- Effect of thymoquinone on cyclooxygenase expression and prostaglandin production in a mouse model of allergic airway inflammation. Immunology Letters, Volume 106, Issue 1, 15 July 2006, Pages 72-81.
- Decreased Lung Fibroblast Growth Factor 18 and Elastin in Human Congenital Diaphragmatic Hernia and Animal Models. Olivier Boucherat, Alexandra Benachi, Anne-Marie Barlier-Mur, Montoya, Jelena Martinovic, Bernard Thébaud, Bernadette Chailley-. American Journal of Respiratory and Critical Care Medicine, Vol 175, pp. 1066-1077 (2007).
- http://www.abcam.com/index.html?datasheet33581
- en.wikipedia.org/wiki/Alveoli
64REFERENCES
- The basic helix-loop-helix transcription factor HEBAlt is expressed in pro-T cells and enhances the generation of T cell precursors. Wang D, Claus CL, Vaccarelli G, Braunstein M, Schmitt TM, Zúñiga-Pflücker JC, Rothenberg EV, Anderson MK. J Immunol. 2006 Jul 1;177(1):109-19.
- Activation of an Embryonic Gene Product in Pulmonary Emphysema: Identification of the Secreted Frizzled-Related Protein. K. Imai, DMD, PhD, and J. D'Armiento, MD, PhD. Chest. 2000;117:229S.
- Regulated expression of the interstitial collagenase matrix metalloproteinase-1 (MMP-1) in the lung by MAP kinase and Wnt signaling. Becky A Mercer, COLUMBIA UNIVERSITY
- http://en.wikipedia.org/wiki/Matrix_metalloproteinase
- Interaction with glycosaminoglycans is required for cyclophilin B to trigger integrin-mediated adhesion of peripheral blood T lymphocytes to extracellular matrix. PNAS, March 5, 2002, vol. 99, no. 5, 2714-2719.
65REFERENCES
- Methods for treating conditions associated with masp-2 dependent complement activation. U.S. Provisional Application No. 60/578,847, filed Jun. 10, 2004
- http://biomed.ngic.re.kr/cgibin/cards/carddisp.pl?geneFLNBsearchneurodegenerative20or20senilepubmed114