Title: Final Project of Team 1
1Final Project of Team 1
- To build a statistical tool web-site for advanced
Micro-Array data analysis
???? ??? Masuya ??? ??? ???
2Version context
- This version of the power-point is written as a
guide for HW5 of the bio-informatics course.
- It will also be included as part of TEAM 1's
final project output.
3Document Outline
- A.Introduction
- A.1)Purpose
- A.2)Process flow
- A.3)Our website
- A.4)Resources
- B.Process flow
- B.1)Statistic Significant Test -SAM
- B.2)Datasets preparation and adjustment
- B.3)Clustering data
4A.Introduction
5A.1.Purpose
- Perform statistical analysis to find the
discriminator markers. (Question 1)
- Cluster analysis to distinguish tumor from
non-tumor. (Question 2)
- GO analysis of discriminator markers.
6Introduction of Micro-Array
- An array is an orderly arrangement of samples. It
provides a medium for matching known and unknown
DNA samples based on base-pairing rules and
automating the process of identifying the
unknowns.
- The sample spot sizes in a microarray are typically
less than 200 microns in diameter, and these
arrays usually contain thousands of spots.
Microarrays require specialized robotics and
imaging equipment that generally are not
commercially available as a complete system.
7Introduction of Micro-Array(cont.)
- Types of microarray
- Format I: probe cDNA (500~5,000 bases long) is
immobilized to a solid surface such as glass
using robot spotting and exposed to a set of
targets either separately or in a mixture. This
method, "traditionally" called DNA microarray, is
widely considered to have been developed at
Stanford University. A recent article by R. Ekins
and F.W. Chu (Microarrays: their origins and
applications. Trends in Biotechnology, 1999, 17,
217-218) seems to provide some generally
forgotten facts.
8Introduction of Micro-Array(cont.)
- Format II: an array of oligonucleotide (20~80-mer
oligos) or peptide nucleic acid (PNA) probes is
synthesized either in situ (on-chip) or by
conventional synthesis followed by on-chip
immobilization. The array is exposed to labeled
sample DNA, hybridized, and the
identity/abundance of complementary sequences is
determined. This method, "historically" called
DNA chips, was developed at Affymetrix, Inc.,
which sells its photolithographically fabricated
products under the GeneChip trademark. Many
companies are manufacturing oligonucleotide-based
chips using alternative in-situ synthesis or
deposition technologies.
9A.2.Micro-Array data analysis process flow
10Micro-Array data analysis process flow
(Statistic sub-flow)
11A.3.Our Website
- http://linhots.no-ip.com/biap
What can you do from it?
- You can download all datasets from our web-site.
- You can download all the power-points from our
web-site.
- You can find all statistics background
information from our website.
- You can do some data adjustments on our website.
12A.4.Resource
- Shi, L. 1998. DNA Microarray (Genome Chip) --
Monitoring the Genome on a Chip.
http://www.gene-chips.com/
- Leung, Y. F. and D. Cavalieri. 2003.
Fundamentals of cDNA microarray data analysis.
Trends Genet. 19: 649-659.
13Process flow: B.1.Significance test
14Our approach
- We use SAM as the significance analysis tool.
- It is a client-side application embedded in
Microsoft Excel.
- You can download all the information from our
website (including the SAM.zip software).
- Or you can go to
http://www-stat.stanford.edu/~tibs/SAM/index.html
to download SAM.zip.
15Is Your Raw data Normalized?
- SAM cannot perform normalization.
- So if the data is not normalized:
- Feed it to the clustering tool.
- In the Adjust data tab, normalize it.
- Save the data.
- Now the data is ready to be fed into SAM.
16Where
http://rana.lbl.gov/
- We can download the executable software, source
code, and user manual from this website.
171.Data Preparation
The input data file may also look like the one
shown below. The yellow regions in the diagram
are optional.
- A minimal Cluster input data file would look
like the one shown below.
18- Cluster will inform you about the loaded data
19- Here normalization is done
201
21- Note
- For the latest version of the clustering tool,
after pressing the Apply button, the data should
be saved using
- File -> Save as -> filename
22Data is Normalized!
- The newly saved file contains the normalized data.
- It can now be used directly in the SAM tool.
23System Requirements
SAM
- Approach using Internet Explorer
- The latest Microsoft Java Virtual Machine, if
running on XP.
- The Microsoft Data Access Components.
- Microsoft Excel 97 or higher.
- If SAM still cannot be installed or run on XP,
use Windows 2000 or Windows 98.
24Installation.3-1
SAM
- Download the file after registration from
http://www-stat.stanford.edu/~tibs/SAM/index.html
- After registration you will get an email with a
password in it. Then go to
25Installation.3-2
SAM
After decompression, double-click on Setup.
26Installation.3-3
SAM
- Start Excel and click on the Tools menu. Choose
Addins and click on Browse. Select the directory
where the setup process installed SAM (C:\Program
Files\SAMVB, if you chose the defaults) and click
on the Addin subdirectory. Double-click on the
SAM file. The SAM addin will be loaded and the
box next to the phrase "Significance Analysis for
Microarrays" will be checked. Once you click OK,
you should see two buttons on your Excel toolbar
named SAM and SAM Plot Control.
27Data source format
SAM
1. Normalized data is needed.
2. SAM can use data in multiple sheets.
28Data format transformation from HCC1648
SAM
- The dataset consists of HCC cells and normal
cells; the samples are not paired.
- Some data is missing.
- If there is no data in sheet 2 or 3, delete the
empty sheets first.
- Select all the data.
- Press the SAM button.
- So the data type is: two classes, unpaired, with
missing data.
Original
Transformed
29SAM
Parameters input
- Choose your data type.
- Check whether your data is logged or not.
- Here we use the clone ID.
- Missing values are imputed by one of two methods;
choose one and set how many neighbors are used to
calculate each missing value. If there are
missing values, SAM generates a new data sheet in
the final output, which you can use for
clustering analysis.
- You can also set another random seed yourself.
- Then click OK.
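The neighbor-based option for missing values can be pictured with a small sketch. This is only an illustration under stated assumptions, not SAM's own code: the function name `knn_impute`, the distance (root mean squared difference over the columns two genes share), and the plain average of the neighbors are all choices made for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the average of the k nearest rows (genes).

    Assumption of this sketch: distance is the root mean squared
    difference over the columns that both rows have observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have column j observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

data = [
    [1.0, 2.0, None],   # gene with one missing value
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
]
print(knn_impute(data, k=2))
```

The missing value of the first gene is taken from the two genes whose other measurements are closest to it, so the distant fourth gene does not influence the result.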
3
4
5
30SAM
Number of significant genes listed; it updates
automatically when you change the delta value or
the fold change
Number of falsely predicted genes
31You can also enter the delta value manually after
checking the List Delta Table, or just move the
slider
32- The number of significant genes and the FDR
both depend on the delta and fold change values
that you choose.
33When we choose a fold change of 2 and a delta
value of 2.65, 102 genes are listed as
significantly changed.
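To make the two thresholds more concrete, here is a minimal sketch of the per-gene quantities behind them for two unpaired classes. It assumes the usual SAM relative difference d = (mean1 - mean2) / (s + s0) with a pooled standard error s, and it assumes unlogged intensities for the fold change. The function names and the value of `s0` are illustrative only. The permutation step that SAM uses to compare observed and expected d values against delta is left out.

```python
import math

def relative_difference(group1, group2, s0=0.0):
    """SAM-style relative difference d = (mean1 - mean2) / (s + s0)."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Pooled sum of squared deviations around each class mean.
    pooled = (sum((x - mean1) ** 2 for x in group1)
              + sum((x - mean2) ** 2 for x in group2))
    s = math.sqrt((1 / n1 + 1 / n2) * pooled / (n1 + n2 - 2))
    return (mean1 - mean2) / (s + s0)

def fold_change(group1, group2):
    """Ratio of class means; assumes unlogged, positive intensities."""
    return (sum(group1) / len(group1)) / (sum(group2) / len(group2))

tumor = [8.0, 9.0, 10.0]    # made-up values for one gene
normal = [3.0, 4.0, 5.0]
d = relative_difference(tumor, normal, s0=0.5)
fc = fold_change(tumor, normal)
print(round(d, 3), round(fc, 2))
# A gene would pass a fold change cut-off of 2 when fc >= 2 or fc <= 1/2.
print(fc >= 2 or fc <= 0.5)
```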
34Fold change
Entering this address gives you the gene
information shown in the next slide
35(No Transcript)
36Process flow: B.2.Data-adjustment
37Why we need it
- In the process flow, each software's I/O data
format may be different.
- We provide 2 on-line process tools.
- You can reach these 2 tools from our website.
1
2
381.Adjust the HCC dataset: Separate the NAME column
1. Upload the HCC dataset
2. Set up parameters (or use the default values)
3. Click the button and get the result
392.Adjust the HCC dataset: Filter with Statistic
result
1. Upload the dataset
2. Feed in the Statistic result (gene list string)
3. Click the button and get the result
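The filtering step can be sketched as follows. This is a rough illustration, not the code behind the website: it assumes a tab-delimited dataset whose first column holds the gene ID and a comma-separated gene list string, and the function name `filter_by_gene_list` is made up for this example.

```python
def filter_by_gene_list(lines, gene_list, sep=","):
    """Keep the header line plus the rows whose first column is in gene_list."""
    wanted = {g.strip() for g in gene_list.split(sep) if g.strip()}
    header, *rows = lines
    kept = [row for row in rows if row.split("\t")[0] in wanted]
    return [header] + kept

dataset = [
    "ID\tExp1\tExp2",
    "G1\t0.5\t1.2",
    "G2\t-0.3\t0.1",
    "G3\t2.0\t1.8",
]
# Gene list string as it might be copied from the statistics result.
for line in filter_by_gene_list(dataset, "G1, G3"):
    print(line)
```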
40Process flow: B.3.Clustering
41Where
http://rana.lbl.gov/
- We can download the executable software, source
code, and user manual from this website.
42Analyze Process Flow
Raw data collection
Data preparation
Data cleaning
Significant test by SAM
Normalization
Data-mining
Clustering
Visualization
View the result
431.Data Preparation
- Format your data as follows.
Simple form
Complete form
The difference is whether we want to take weight
and order into account
44Import data into the software
- Cluster will inform us about the loaded data
1
2
3
45 More on filter tab
462.Clustering
- The tool provides us with three methods of
clustering:
- Hierarchical
- K-means
- Self organizing map ( SOM )
472-1.Hierarchical Clustering
- The hierarchical clustering tab allows us to
perform hierarchical clustering on the datasets.
482.2 Partitioned Clustering
- The parameters controlling K-Means clustering
are:
- The number of clusters ( K )
- The maximum number of cycles.
K-means may be faster!
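The two parameters can be seen at work in a minimal K-means sketch. It is plain Python written for this guide, not the Cluster tool's implementation. The names `kmeans`, `k`, `max_cycles`, and `seed` are illustrative, and squared Euclidean distance is an assumption of the example.

```python
import random

def kmeans(points, k, max_cycles=100, seed=0):
    """Assign each point to one of k clusters; stop early once stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k random points
    assignment = None
    for _ in range(max_cycles):
        # Assignment step: nearest center by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:     # nothing moved, so we are done
            break
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centers = kmeans(points, k=2, max_cycles=50)
print(assignment)
print(centers)
```

Each cycle costs only one pass over the points, which is one reason K-means can be faster than building a full hierarchy.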
492.3 SOM Clustering
- Here we have considered the one-dimensional
approach of SOM
503.Result Viewer
We need a viewing tool
Micro-array Raw data
.CDT and .GTR files
Cluster
Tree-View
Data cleaning Data-mining
Clustering result Visualization
513.Result Tree-viewer
- Use the mouse click/drag/find
- Use the keyboard up/down/left/right
2
1
3
52Appendix
531What Is Good Clustering?
- A good clustering method will produce high
quality clusters with
- high intra-class similarity
- low inter-class similarity
- The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation.
- The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
542. This is Micro-Array Data-Set
Exp 1
Exp 2
Exp 3
Exp 4
Exp 5
Exp 6
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
553. Take Each row(gene) as a line
Log2(cy5/cy3)
56Then we can measure similarity: Correlation
Correlation = 0.97
574. Finally we can do clustering
585. Hierarchical Clustering
Hierarchical clustering: define the similarity
metric, choose whether to cluster genes, and
whether to calculate weights.
596. Hierarchical Clustering
- Define Similarity metric
- Center or not
- Absolute or not
- Linear or not
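The "center or not" and "absolute or not" choices can be illustrated with one small function. This is a sketch written for this guide: the name `correlation` and its flags are assumptions, and the "linear or not" choice is not covered here.

```python
import math

def correlation(x, y, centered=True, absolute=False):
    """Correlation between two gene profiles.

    centered=True subtracts each profile's mean first (Pearson
    correlation); centered=False uses the raw values.
    absolute=True treats opposite patterns as similar.
    """
    if centered:
        mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
        x = [v - mean_x for v in x]
        y = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    r = numerator / denominator
    return abs(r) if absolute else r

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [-2.0, -4.0, -6.0, -8.0]    # mirror image of gene_a
gene_c = [11.0, 12.0, 13.0, 14.0]    # same shape as gene_a, shifted up

print(correlation(gene_a, gene_b))                  # opposite pattern
print(correlation(gene_a, gene_b, absolute=True))   # counted as similar
print(correlation(gene_a, gene_c))                  # shift is ignored
print(correlation(gene_a, gene_c, centered=False))  # shift lowers the score
```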
607. Hierarchical Clustering
618. Hierarchical Clustering
629. Hierarchical Clustering
- During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined.
- The distance or similarity between clusters must
be calculated.
- The rules that govern this calculation are
linkage methods.
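Three common linkage rules can be sketched as below. The function names and the use of Euclidean distance are assumptions made for this example; the clustering tool may offer other metrics and rules.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(cluster1, cluster2, linkage="average"):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [euclidean(a, b) for a in cluster1 for b in cluster2]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]
for rule in ("single", "average", "complete"):
    print(rule, cluster_distance(left, right, rule))
```

At every step of the hierarchy the two clusters with the smallest such distance are joined, so the choice of rule changes the shape of the resulting tree.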
6310. Partition Clustering
- K-means, which the professor has taught us.
- K-Means/K-Medoid
- K-Medoid chooses an actual data point near the
center as the cluster representative, instead of
the computed central point.
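The difference between the two representatives can be shown on a single cluster. The sketch below uses one common definition of the medoid (the member with the smallest total distance to all other members); the function names are illustrative, and `math.dist` needs Python 3.8 or later.

```python
import math

def centroid(points):
    """K-means representative: the mean point, usually not a data point."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def medoid(points):
    """K-medoid representative: the member closest to all other members."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))   # pulled toward the outlier at x = 14
print(medoid(cluster))     # stays on a real data point
```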
64WE SEPARATE them!!!!!
6511. Ordering of operations
- 1.log transform of all the values
- 2.Mean centre rows
- 3.Median centre rows.
- 4.Normalize rows
- 5.Mean centre columns
- 6.Median centre columns
- 7.Normalize columns
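The order matters because each operation works on the output of the previous one. A minimal sketch of steps 1, 2, 4 and 5 follows. It assumes that "normalize" means scaling a row so that the sum of its squared values is 1, which is how we understand the clustering tool's Adjust data tab. The function names are made up for this example, and the median variants are omitted.

```python
import math

def log_transform(matrix):
    """Step 1: log2 of every value (values must be positive)."""
    return [[math.log2(v) for v in row] for row in matrix]

def mean_center_rows(matrix):
    """Step 2: subtract each row's mean from its values."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        out.append([v - mean for v in row])
    return out

def normalize_rows(matrix):
    """Step 4: scale each row so its sum of squares is 1 (assumed meaning)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else list(row))
    return out

def transpose(matrix):
    """Swap rows and columns so row functions can serve for columns."""
    return [list(col) for col in zip(*matrix)]

ratios = [[1.0, 2.0, 4.0], [8.0, 2.0, 0.5]]   # made-up cy5/cy3 ratios
adjusted = normalize_rows(mean_center_rows(log_transform(ratios)))
# Step 5: column operations reuse the row functions on the transpose.
column_centered = transpose(mean_center_rows(transpose(adjusted)))
print(adjusted)
print(column_centered)
```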