Final Project of Team 1

Slides: 66

Provided by: TsaoWe

1
Final Project of Team 1

To build a statistical tool web-site for advanced
Micro-Array data analysis

???? ??? Masuya ??? ??? ???
2
Version context

This version of the PowerPoint is written as a
guide for HW5 of the bioinformatics course.
It will also be included as part of Team 1's
final project output.

3
Document Outline

A.Introduction
A.1)Purpose
A.2)Process flow
A.3)Our website
A.4)Resources
B.Process flow
B.1)Statistical significance test (SAM)
B.2)Datasets preparation and adjustment
B.3)Clustering data

4
A.Introduction
5
A.1.Purpose

Perform statistical analysis to identify the
discriminator markers. (Question 1)
Perform cluster analysis to distinguish tumor
from non-tumor samples. (Question 2)
Perform GO analysis of the discriminator markers.

6
Introduction of Micro-Array

An array is an orderly arrangement of samples. It
provides a medium for matching known and unknown
DNA samples based on base-pairing rules and
automating the process of identifying the
unknowns.
The sample spot sizes in a microarray are
typically less than 200 microns in diameter, and
these arrays usually contain thousands of spots.
Microarrays require specialized robotics and
imaging equipment that generally are not
commercially available as a complete system.

7
Introduction of Micro-Array(cont.)

Types of microarray
Format I: probe cDNA (500 to 5,000 bases long) is
immobilized on a solid surface such as glass
using robot spotting and exposed to a set of
targets either separately or in a mixture. This
method, "traditionally" called DNA microarray, is
widely considered to have been developed at
Stanford University. A recent article by R. Ekins
and F.W. Chu (Microarrays: their origins and
applications. Trends in Biotechnology, 1999, 17,
217-218) seems to provide some generally
forgotten facts.

8
Introduction of Micro-Array(cont.)

Format II: an array of oligonucleotide (20 to
80-mer oligos) or peptide nucleic acid (PNA)
probes is synthesized either in situ (on-chip) or
by conventional synthesis followed by on-chip
immobilization. The array is exposed to labeled
sample DNA and hybridized, and the
identity/abundance of complementary sequences is
determined. This method, "historically" called
DNA chips, was developed at Affymetrix, Inc.,
which sells its photolithographically fabricated
products under the GeneChip trademark. Many
companies manufacture oligonucleotide-based chips
using alternative in-situ synthesis or deposition
technologies.

9
A.2.Micro-Array data analysis process flow
10
Micro-Array data analysis process flow
(Statistic sub-flow)
11
A.3.Our Website

http://linhots.no-ip.com/biap

What can you do there?

You can download all the datasets from our
website.
You can download all the PowerPoint files from
our website.
You can find all the statistics background
information on our website.
You can make some data adjustments on our
website.

12
A.4.Resources

Shi, L. 1998. DNA Microarray (Genome Chip):
Monitoring the Genome on a Chip.
http://www.gene-chips.com/
Leung, Y. F. and D. Cavalieri. 2003.
Fundamentals of cDNA microarray data analysis.
Trends Genet. 19: 649-59.

13
Process flow: B.1. Significance test
14
Our approach

We use SAM as the significance analysis tool.
It is a client-side application that runs as an
add-in inside Microsoft Excel.
You can download all the information from our
website (including the SAM.zip software).
Or you can go to
http://www-stat.stanford.edu/~tibs/SAM/index.html
to download SAM.zip.

15
Is Your Raw Data Normalized?

SAM cannot perform normalization.
So if the data is not normalized:
Load it into the clustering tool.
In the Adjust Data tab, normalize it.
Save the data.
Now the data is ready to be fed into SAM.
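
As a rough illustration of what normalizing one gene (row) can mean, the sketch below median-centers the row and then scales it to unit sum of squares. Which options are actually ticked in the Adjust Data tab is not stated on the slide, so this combination is an assumption; the function name and sample values are hypothetical.

```python
import math
from statistics import median

def normalize_row(values):
    """Median-center one gene's values, then scale to unit sum of squares.

    Missing values are represented by None and are left untouched.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    center = median(present)
    centered = [None if v is None else v - center for v in values]
    norm = math.sqrt(sum(c * c for c in centered if c is not None))
    if norm == 0:
        return centered  # constant row: nothing to scale
    return [None if c is None else c / norm for c in centered]

row = [1.0, 2.0, None, 3.0]  # hypothetical log2 ratios for one gene
normalized = normalize_row(row)
```

After this step the non-missing values of the row have median 0 and their squares sum to 1.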

16
Where?
http://rana.lbl.gov/

We can download the executable software, source
code, and user manual from this website.

17
1. Data Preparation
The input data file may also look like the one
shown below. The yellow regions in the diagram
are optional.

A minimal Cluster input file looks like the one
shown below.

Cluster will inform you about the loaded data.

Normalization is done here.

20
21

Note
In the latest version of the clustering tool,
after pressing the Apply button, save the data
using
File -> Save As -> filename

22
Data is Normalized!

The newly saved file contains the normalized
data. It can now be used directly in SAM.

23
System Requirements
SAM

Access using Internet Explorer.
The latest Microsoft Java Virtual Machine, if
running on Windows XP.
The Microsoft Data Access Components.
Microsoft Excel 97 or higher.
If SAM still cannot be installed or run on
Windows XP, use Windows 2000 or Windows 98.

24
Installation (1 of 3)
SAM

Download the file after registering at
http://www-stat.stanford.edu/~tibs/SAM/index.html
After registration you will receive an email
containing a password. Then go to

25
Installation (2 of 3)
SAM
After decompressing the file, double-click Setup.
26
Installation (3 of 3)
SAM

Start Excel and open the Tools menu. Choose
Add-Ins and click Browse. Select the directory
where the setup process installed SAM (C:\Program
Files\SAMVB, if you chose the defaults) and open
the Addin subdirectory. Double-click the SAM
file. The SAM add-in will be loaded and the box
next to "Significance Analysis for Microarrays"
will be checked. Once you click OK, you should
see two buttons on your Excel toolbar, named SAM
and SAM Plot Control.

27
Data source format
SAM
1. Normalized data is required.
2. SAM can use data in multiple sheets.
28
Data format transformation from HCC1648
SAM

The dataset contains HCC cells and normal cells;
the samples are not paired.
Some values are missing.
If sheet 2 or sheet 3 contains no data, delete
the empty sheets first.
Select all the data.
Click the SAM button.
-> Two classes, unpaired, with missing data

Original
Transformed
29
SAM
Parameters input

Choose your data type.
Indicate whether your data is log-transformed.
Here we use the clone ID.
Missing values are imputed by one of two methods;
choose one, and set how many neighbors are used
to estimate each missing value. If there were
missing values, SAM adds a new data sheet with
the imputed data to the final output; you can use
it for clustering analysis.
You can set a different random seed yourself.
Then click OK.
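
The neighbor-based option for missing values can be pictured with a small K-nearest-neighbor sketch: each missing entry is replaced by the average of that column over the k most similar genes. This is a simplified illustration, not SAM's actual code; the distance definition, the function name knn_impute and the sample data are assumptions.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest rows.

    Distance between two rows is the root-mean-square difference over
    the columns where both rows have a value.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that have a value in column j can be neighbors.
            candidates = [(distance(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

genes = [                    # hypothetical expression rows
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(genes, k=2)
```

Here the missing value in the first gene is filled from the two genes with the most similar profile, and the distant fourth gene is ignored.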

30
SAM
Number of significant genes listed; it updates
automatically when you change the delta value or
the fold change.
Number of false positives (falsely called genes)
31
You can also enter the delta value manually after
checking the List Delta Table, or simply move the
slider.
32

The number of significant genes and the FDR
depend on the delta and fold change that you
choose.

33
When we choose a fold change of 2 and a delta
value of 2.65, 102 genes are listed as
significantly changed.
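
To make the two thresholds more concrete, here is a minimal sketch of the per-gene score that SAM ranks genes by, the relative difference d = (mean1 - mean2) / (s + s0), together with a fold-change filter on log2 data. It does not reproduce the permutation step that turns a delta value into a gene list; the function names, the s0 value and the sample numbers are hypothetical.

```python
import math
from statistics import mean

def sam_score(group_a, group_b, s0=0.0):
    """Relative difference d = (mean_a - mean_b) / (s + s0).

    s is the pooled standard error of the difference; s0 is a small
    positive constant that damps genes with very low variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    m_a, m_b = mean(group_a), mean(group_b)
    ss = (sum((x - m_a) ** 2 for x in group_a)
          + sum((x - m_b) ** 2 for x in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (m_a - m_b) / (s + s0)

def passes_fold_change(group_a, group_b, fold=2.0):
    """True if the mean log2 values differ by at least log2(fold)."""
    return abs(mean(group_a) - mean(group_b)) >= math.log2(fold)

tumor = [2.0, 2.5, 3.0]   # hypothetical log2 ratios for one gene
normal = [0.5, 1.0, 1.5]
d = sam_score(tumor, normal, s0=0.1)
keep = passes_fold_change(tumor, normal, fold=2.0)
```

A gene is reported only if it passes both filters, which is why raising either the delta or the fold change shortens the list.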
34
Fold change
Enter this address to get the gene information
shown on the next slide.
35
(No Transcript)
36
Process flow: B.2. Data adjustment
37
Why do we need it?

In the process flow, each program's input/output
data format may be different.
We provide 2 online processing tools.
You can reach both tools from our website.

38
1. Adjust the HCC dataset: separate the NAME
column
1. Upload the HCC dataset
2. Set up the parameters (or use the default
values)
3. Click the button and get the result
39
2. Adjust the HCC dataset: filter with the
statistical result
1. Upload the dataset
2. Feed in the statistical result (gene list
string)
3. Click the button and get the result
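
The filtering step can be sketched offline as follows: keep only the rows whose identifier appears in the gene list produced by the significance test. The file layout (tab-delimited, ID in the first column), the comma-separated gene list and the function name are assumptions; the online tool's actual format is not described on the slide.

```python
import csv
import io

def filter_by_gene_list(dataset_text, genelist, id_column=0):
    """Keep the header row plus the rows whose ID appears in genelist.

    dataset_text is tab-delimited text; genelist is a comma-separated
    string of identifiers.
    """
    wanted = {g.strip() for g in genelist.split(",") if g.strip()}
    rows = list(csv.reader(io.StringIO(dataset_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    kept = [r for r in body if r and r[id_column] in wanted]
    return [header] + kept

# Hypothetical three-gene dataset and gene list.
data = "ID\tExp1\tExp2\nG1\t0.5\t1.2\nG2\t-0.3\t0.1\nG3\t2.0\t1.8\n"
result = filter_by_gene_list(data, "G1, G3")
```

The filtered rows can then be written back to a tab-delimited file and loaded into the clustering tool.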
40
Process flow: B.3. Clustering
41
Where?
http://rana.lbl.gov/

We can download the executable software, source
code, and user manual from this website.

42
Analysis Process Flow
Raw data collection
Data preparation
Data cleaning
Significance test by SAM
Normalization
Data mining
Clustering
Visualization
View the result
43
1. Data Preparation

Format your data as follows.

Simple form
Complete form
The difference is whether we want to take weight
and order into account.
44
Import data into the software

Cluster will inform you about the loaded data.

45
More on filter tab
46
2.Clustering

The tool provides us with three methods of
clustering
Hierarchical
K-means
Self-organizing map (SOM)

47
2.1 Hierarchical Clustering

The hierarchical clustering tab lets you perform
hierarchical clustering on the datasets.

48
2.2 Partitioned Clustering

The parameters controlling K-Means clustering
are
The number of clusters ( K )
The maximum number of cycles.

K-means may be faster!
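
The two parameters map directly onto a plain K-means loop, sketched below: k sets how many centers are kept, and max_cycles caps the number of assign-and-update rounds. This is a generic illustration, not the clustering tool's implementation; taking the first k points as starting centers is a simplification made to keep the example deterministic, and the sample profiles are hypothetical.

```python
import math

def kmeans(points, k, max_cycles=100):
    """Plain K-means; stops when assignments settle or max_cycles is hit."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [None] * len(points)
    for _ in range(max_cycles):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

profiles = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centers = kmeans(profiles, k=2, max_cycles=10)
```

Each cycle costs roughly (number of genes) x K distance computations, which is one reason K-means can be quicker than building a full hierarchy on large datasets.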
49
2.3 SOM Clustering

Here we consider the one-dimensional form of SOM.
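
A one-dimensional SOM arranges its nodes along a line; each input pulls its best-matching node, and to a lesser degree that node's neighbors on the line, toward itself. The sketch below is a minimal generic version under stated assumptions (linear decay of learning rate and radius, Gaussian neighborhood, nodes initialized from evenly spaced data points); it is not the tool's implementation, and all names and values are hypothetical.

```python
import math

def best_node(nodes, x):
    """Index of the node closest to x (the best-matching unit)."""
    return min(range(len(nodes)), key=lambda i: math.dist(x, nodes[i]))

def train_som_1d(data, n_nodes, epochs=50, lr0=0.5, radius0=1.0):
    """Train a 1-D self-organizing map and return the node vectors."""
    dim = len(data[0])
    # Initialize nodes from evenly spaced data points.
    nodes = [list(data[(i * len(data)) // n_nodes]) for i in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs       # decays from 1 toward 0
        lr = lr0 * frac
        radius = radius0 * frac
        for x in data:
            best = best_node(nodes, x)
            for i in range(n_nodes):
                # Neighbors on the line are pulled less than the winner.
                influence = math.exp(-((i - best) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    nodes[i][d] += lr * influence * (x[d] - nodes[i][d])
    return nodes

samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
nodes = train_som_1d(samples, n_nodes=2)
```

After training, each gene is assigned to the cluster of its best-matching node, for example best_node(nodes, samples[0]).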

50
3.Result Viewer
We need a viewing tool
Micro-array Raw data
.CDT and .GTR files
Cluster
Tree-View
Data cleaning Data-mining
Clustering result Visualization
51
3.Result Tree-viewer

Use the mouse: click/drag/find
Use the keyboard: up/down/left/right

52
Appendix
53
1. What Is Good Clustering?

A good clustering method will produce high
quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation.
The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
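
One simple way to check the two criteria numerically is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below uses Euclidean distance as an inverse stand-in for similarity, which is an assumption; the function name and the toy data are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Mean pairwise distance within clusters and between clusters.

    Assumes at least one same-cluster pair and one cross-cluster pair.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
```

A good clustering gives a small intra-cluster distance (high similarity) relative to the inter-cluster distance.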

54
2. This is a Micro-Array Data-Set
(Figure: expression matrix with columns Exp 1 to
Exp 6 and rows Gene 1 to Gene 6)
55
3. Take each row (gene) as a line
Log2(cy5/cy3)
56
Then we can compute a similarity: correlation
Correlation = 0.97
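
As a small worked illustration, the sketch below converts raw Cy5/Cy3 intensities into log2 ratios and computes the Pearson correlation between two gene profiles. The intensities are made up for the example and do not reproduce the 0.97 shown on the slide; the function names are hypothetical.

```python
import math

def log_ratios(cy5, cy3):
    """log2(Cy5/Cy3) for each experiment."""
    return [math.log2(r / g) for r, g in zip(cy5, cy3)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical intensities for two genes across four experiments.
gene_a = log_ratios([200, 400, 800, 100], [100, 100, 100, 100])
gene_b = log_ratios([400, 800, 1600, 200], [100, 100, 100, 100])
r = pearson(gene_a, gene_b)
```

The two profiles rise and fall together and differ only by a constant offset, so their correlation is at the maximum of 1.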
57
4. Finally we can cluster the genes
58
5. Hierarchical Clustering
Hierarchical clustering: define the similarity.
Cluster the genes? And should the weights be
calculated?
59
6. Hierarchical Clustering

Define Similarity metric
Center or not
Absolute or not
Linear or not
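
The first two choices can be shown with one function: "centered" subtracts each profile's mean before correlating (Pearson), "uncentered" does not, and "absolute" ignores the sign so that strongly anti-correlated genes count as similar. This is a generic sketch; the parameter names and data are hypothetical, and the "linear or not" choice is not covered here.

```python
import math

def correlation(x, y, centered=True, absolute=False):
    """Centered (Pearson) or uncentered correlation, optionally absolute."""
    if centered:
        mx, my = sum(x) / len(x), sum(y) / len(y)
    else:
        mx = my = 0.0  # uncentered: treat the reference level as zero
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    r = num / den
    return abs(r) if absolute else r

x = [1.0, 2.0, 3.0]  # hypothetical profiles
y = [3.0, 2.0, 1.0]
r_centered = correlation(x, y)                      # opposite trends
r_uncentered = correlation(x, y, centered=False)    # both above zero
r_absolute = correlation(x, y, absolute=True)
```

The same pair of genes looks dissimilar, moderately similar or identical depending on the metric, so the choice should reflect the biological question.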

60
7. Hierarchical Clustering

Center or not

Absolute or not

61
8. Hierarchical Clustering

Linear or not

62
9. Hierarchical Clustering

During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined.
The distance or similarity between clusters must
be calculated.
The rules that govern this calculation are
linkage methods.
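
The three most common linkage rules differ only in how they summarize the pairwise distances between the members of two clusters. The sketch below is a generic illustration using Euclidean distance; the function name and sample clusters are hypothetical.

```python
import math

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

a = [[0.0, 0.0], [0.0, 1.0]]
b = [[3.0, 0.0], [4.0, 0.0]]
single = linkage_distance(a, b, "single")
complete = linkage_distance(a, b, "complete")
average = linkage_distance(a, b, "average")
```

At each step of building the hierarchy, the pair of clusters with the smallest linkage distance is joined.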

63
10. Partition Clustering