Final Project of Team 1 - PowerPoint PPT Presentation

Provided by: TsaoWe
Transcript and Presenter's Notes

Title: Final Project of Team 1


1
Final Project of Team 1
  • To build a statistical tool web-site for advanced
    Micro-Array data analysis

???? ??? Masuya ??? ??? ???
2
Version context
  • This version of the PowerPoint is written as a guide
    for HW5 of the bioinformatics course.
  • It will also be included as part of Team 1's final
    project output.

3
Document Outline
  • A.Introduction
  • A.1)Purpose
  • A.2)Process flow
  • A.3)Our website
  • A.4)Resources
  • B.Process flow
  • B.1)Statistical significance test - SAM
  • B.2)Datasets preparation and adjustment
  • B.3)Clustering data

4
A.Introduction
5
A.1.Purpose
  • Perform statistical analysis to find the
    discriminator markers. (Question 1)
  • Use cluster analysis to distinguish tumor from
    non-tumor samples. (Question 2)
  • Perform GO (Gene Ontology) analysis of the
    discriminator markers.

6
Introduction of Micro-Array
  • An array is an orderly arrangement of samples. It
    provides a medium for matching known and unknown
    DNA samples based on base-pairing rules and
    automating the process of identifying the
    unknowns.
  • The sample spot sizes in a microarray are typically
    less than 200 microns in diameter, and these
    arrays usually contain thousands of spots.
    Microarrays require specialized robotics and
    imaging equipment that generally are not
    commercially available as a complete system.

7
Introduction of Micro-Array(cont.)
  • Types of microarray
  • Format I: probe cDNA (500-5,000 bases long) is
    immobilized on a solid surface such as glass
    using robot spotting and exposed to a set of
    targets either separately or in a mixture. This
    method, "traditionally" called DNA microarray, is
    widely considered to have been developed at
    Stanford University. A recent article by R. Ekins
    and F.W. Chu (Microarrays: their origins and
    applications. Trends in Biotechnology, 1999, 17,
    217-218) seems to provide some generally
    forgotten facts.

8
Introduction of Micro-Array(cont.)
  • Format II: an array of oligonucleotide (20- to
    80-mer oligos) or peptide nucleic acid (PNA)
    probes is synthesized either in situ (on-chip) or
    by conventional synthesis followed by on-chip
    immobilization. The array is exposed to labeled
    sample DNA, hybridized, and the
    identity/abundance of complementary sequences is
    determined. This method, "historically" called
    DNA chips, was developed at Affymetrix, Inc.,
    which sells its photolithographically fabricated
    products under the GeneChip trademark. Many
    companies are manufacturing oligonucleotide-based
    chips using alternative in-situ synthesis or
    deposition technologies.

9
A.2.Micro-Array data analysis process flow
10
Micro-Array data analysis process flow
(Statistic sub-flow)
11
A.3.Our Website
  • http://linhots.no-ip.com/biap

What can you do on it?
  • You can download all the datasets from our website.
  • You can download all the PowerPoint files from our
    website.
  • You can find all the statistics background
    information on our website.
  • You can make some data adjustments on our website.

12
A.4.Resources
  • Shi, L. 1998. DNA Microarray (Genome Chip):
    Monitoring the Genome on a Chip.
    http://www.gene-chips.com/
  • Leung, Y. F. and D. Cavalieri. 2003.
    Fundamentals of cDNA microarray data analysis.
    Trends Genet. 19: 649-59.

13
Process flow: B.1.Significance test
14
Our approach
  • We use SAM as the significance-testing tool.
  • It is a client-side application that runs as an
    add-in inside Microsoft Excel.
  • You can download everything you need from our
    website (including the SAM.zip software).
  • Or you can go to
    http://www-stat.stanford.edu/~tibs/SAM/index.html
    to download SAM.zip.

15
Is Your Raw Data Normalized?
  • SAM cannot perform normalization.
  • So if the data is not normalized:
  • Load it into the clustering tool.
  • In the Adjust Data tab, normalize it.
  • Save the data.
  • Now the data is ready to be fed into SAM.

16
Where
http://rana.lbl.gov/
  • We can download the executable software, source
    code, and user manual from this website.

17
1.Data Preparation
  • A minimal Cluster input data file would
    look like the one shown below.
  • The input data file may also look like the one
    shown below. The yellow regions in the diagram
    are optional.

18
  • Cluster will inform you about the loaded data

19
  • Here normalization is done

20
21
  • Note
  • In the latest version of the clustering tool,
    after pressing the Apply button, the data should
    be saved using
  • File -> Save As -> filename

22
Data is Normalized!
  • The newly saved file contains the normalized data.
  • It can now be loaded directly into the SAM tool.

23
System Requirements
SAM
  • Approach using Internet Explorer
  • The latest Microsoft Java Virtual Machine, if
    running on XP.
  • The Microsoft Data Access Components.
  • Microsoft Excel 97 or higher.
  • If SAM still cannot be installed or run on your
    XP machine, use Windows 2000 or Windows 98.

24
Installation.3-1
SAM
  • Download the file after registration from
    http://www-stat.stanford.edu/~tibs/SAM/index.html
  • After registration you will get an email with a
    password in it. Then go to

25
Installation.3-2
SAM
After decompressing the archive, double-click on Setup.
26
Installation.3-3
SAM
  • Start Excel and click on the Tools menu. Choose
    Add-Ins and click on Browse. Select the directory
    where the setup process installed SAM (C:\Program
    Files\SAMVB, if you chose the defaults) and click
    on the Addin subdirectory. Double-click on
    the SAM file. The SAM add-in will be loaded and
    the box next to the phrase Significance Analysis
    for Microarrays will be checked. Once you click
    OK, you should see two buttons on your Excel
    toolbar named SAM and SAM Plot Control.

27
Data source format
SAM
  • 1. Normalized data is required.
  • 2. SAM can use data in multiple sheets.
28
Data format transformation from HCC1648
SAM
  • This dataset contains HCC cells and normal
    cells; the data are not paired.
  • Some data are missing.
  • If sheet 2 or 3 contains no data, delete the
    empty sheet first.
  • Select all the data.
  • Press the SAM button.
  • -> Two classes, unpaired, with missing data

Original
Transformed
29
SAM
Parameters input
  • Choose your data type.
  • Indicate whether your data is logged or not.
  • Here we use the clone ID.
  • Missing values are filled in by one of two
    methods; choose one and set how many neighbors
    are used to compute each missing value. If there
    were missing values, SAM generates a new data
    sheet in the final output, which you can use for
    clustering analysis.
  • You can enter a different random seed yourself.
  • Then click OK.
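
The neighbor-based way of filling in missing values can be sketched roughly as follows. This is only a minimal illustration of k-nearest-neighbor imputation, not SAM's actual implementation: the function name `knn_impute`, the use of `None` for a missing value, and the distance rule are all our own assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration; SAM's own imputer may differ.
    """

    def distance(a, b):
        # Root-mean-square difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # A neighbor is only useful if it has a value in column j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for d, v in candidates[:k] if d != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


genes = [
    [1.0, 2.0, None],   # gene with one missing measurement
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],  # very different gene, should not be used when k=2
]
print(knn_impute(genes, k=2)[0])
```

A larger number of neighbors gives a smoother estimate but pulls in genes that resemble the incomplete one less closely.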

30
SAM
The number of significant genes is listed here; when
you change the delta value or the fold change, it
updates automatically.
Predicted number of false positives
31
You can also enter the delta value manually after
checking the Delta table, or just move the
slider.
32
  • The number of significant genes and the FDR
    (false discovery rate) depend on the delta and
    fold change that you choose.

33
When we choose a fold change of 2 and a delta value
of 2.65, 102 genes are listed as significantly
changed.
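
To give a feel for the numbers behind this list, here is a rough sketch of the per-gene score used by SAM for two unpaired classes (the "relative difference" of Tusher et al.) together with a fold-change check. It is an illustration only: the function names and the `s0` default are our assumptions, log2-scale input is assumed, and the permutation step that turns delta into a list of significant genes and an FDR is not shown.

```python
import math


def relative_difference(group_a, group_b, s0=0.0):
    """d = (mean_a - mean_b) / (s + s0), with s the pooled standard error.

    s0 is the small positive constant SAM adds to stabilize the score;
    here it is just a parameter with an assumed default of 0.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    squares = sum((x - mean_a) ** 2 for x in group_a)
    squares += sum((x - mean_b) ** 2 for x in group_b)
    s = math.sqrt((1 / n_a + 1 / n_b) * squares / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + s0)


def fold_change(group_a, group_b):
    """Ratio of the two class means for log2-scale data."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return 2 ** (mean_a - mean_b)


tumor = [3.0, 3.2, 2.8]    # made-up log2 ratios for one gene
normal = [1.0, 1.2, 0.8]
print(relative_difference(tumor, normal), fold_change(tumor, normal))
```

A gene passes the fold-change filter of 2 when `fold_change` is at least 2 or at most 0.5.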
34
Fold change
Entering this address gives you the information for
this gene, shown on the next slide.
35
(No Transcript)
36
Process flow: B.2.Data adjustment
37
Why we need it
  • In the process flow, each software's I/O data
    format may be different.
  • We provide 2 online processing tools.
  • You can reach these 2 tools from our website.

38
1.Adjust the HCC dataset: Separate the NAME column
1. Upload the HCC dataset
2. Set up the parameters (or use the default values)
3. Click the button and get the result
39
2.Adjust the HCC dataset: Filter with the statistical
result
1. Upload the dataset
2. Feed in the statistical result (gene list string)
3. Click the button and get the result
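
The filtering step can be pictured with a small sketch like the one below. It is not the code behind our online tool; it assumes a tab-delimited dataset with a header line, gene IDs in the first column, and a comma-separated gene list string, and the function name `filter_by_gene_list` is made up for this example.

```python
def filter_by_gene_list(lines, gene_list, id_column=0):
    """Keep the header plus every row whose ID appears in the gene list string."""
    wanted = {gene.strip() for gene in gene_list.split(",") if gene.strip()}
    header, *body = lines
    kept = [header]
    for line in body:
        fields = line.split("\t")
        if fields[id_column].strip() in wanted:
            kept.append(line)
    return kept


dataset = [
    "ID\tExp 1\tExp 2",
    "g1\t0.1\t0.2",
    "g2\t1.5\t1.7",
    "g3\t-2.0\t-1.9",
]
# Gene list string as it might be copied from the significance test output.
for row in filter_by_gene_list(dataset, "g1, g3"):
    print(row)
```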
40
Process flow: B.3.Clustering
41
Where
http://rana.lbl.gov/
  • We can download the executable software, source
    code, and user manual from this website.

42
Analyze Process Flow
Raw data collection
Data preparation
Data cleaning
Significant test by SAM
Normalization
Data-mining
Clustering
Visualization
View the result
43
1.Data Preparation
  • Format your data as follows.

Simple form
Complete form
The difference is whether we want to take weight and
order into account.
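
As a rough picture of the two forms, the sketch below writes a tab-delimited table with one row per gene and one column per experiment, and optionally adds weights. The labels `YORF`, `GWEIGHT` and `EWEIGHT` are what we recall from the Cluster manual and should be treated as assumptions to check against it; the function name is hypothetical, and the order rows and columns are left out.

```python
def format_cluster_table(gene_ids, experiments, values,
                         gene_weights=None, exp_weights=None):
    """Build tab-delimited text: one row per gene, one column per experiment."""
    extra = ["GWEIGHT"] if gene_weights is not None else []
    lines = ["\t".join(["YORF"] + extra + list(experiments))]
    if exp_weights is not None:
        # Optional row of per-experiment weights; the extra columns stay blank.
        blanks = [""] * len(extra)
        lines.append("\t".join(["EWEIGHT"] + blanks + [str(w) for w in exp_weights]))
    for i, (gene_id, row) in enumerate(zip(gene_ids, values)):
        weight = [str(gene_weights[i])] if gene_weights is not None else []
        # Missing measurements (None) are written as empty cells.
        cells = ["" if v is None else str(v) for v in row]
        lines.append("\t".join([gene_id] + weight + cells))
    return "\n".join(lines) + "\n"


# Simple form: IDs and data only.
print(format_cluster_table(["Gene 1", "Gene 2"], ["Exp 1", "Exp 2"],
                           [[0.5, -1.2], [None, 2.0]]))
# Complete form: the same data with gene and experiment weights.
print(format_cluster_table(["Gene 1", "Gene 2"], ["Exp 1", "Exp 2"],
                           [[0.5, -1.2], [None, 2.0]],
                           gene_weights=[1, 2], exp_weights=[1, 1]))
```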
44
Import data into the software
  • Cluster will inform us about the loaded data

45
More on filter tab
46
2.Clustering
  • The tool provides us with three methods of
    clustering
  • Hierarchical
  • K-means
  • Self organizing map ( SOM )

47
2.1 Hierarchical Clustering
  • The hierarchical clustering tab lets us perform
    hierarchical clustering on the datasets.

48
2.2 Partitioned Clustering
  • The parameters controlling K-Means clustering
    are
  • The number of clusters ( K )
  • The maximum number of cycles.

K-means may be faster!
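
The two parameters above map directly onto a basic K-means loop. The sketch below is our own minimal version for illustration, not the clustering tool's code; the function name, the random choice of starting centers, and the `seed` argument are assumptions.

```python
import random


def k_means(points, k, max_cycles=100, seed=0):
    """Basic K-means: k is the number of clusters, max_cycles caps the passes."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_cycles):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the clustering has converged
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(data, k=2)
print(labels, centers)
```

Each cycle only compares every point with K centers, which is why K-means can be faster than building a full hierarchy.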
49
2.3 SOM Clustering
  • Here we consider the one-dimensional form
    of SOM.
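
A one-dimensional SOM places its nodes along a line, so that neighboring nodes end up representing similar expression profiles. The sketch below is a simplified version under our own assumptions (linear decay of the learning rate and neighborhood radius, Gaussian neighborhood, hypothetical parameter names); the clustering tool's exact update rule may differ.

```python
import math
import random


def train_som_1d(rows, n_nodes, n_iterations=1000, initial_rate=0.5, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per node."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(rows)) for _ in range(n_nodes)]
    initial_radius = max(n_nodes / 2, 1.0)
    for t in range(n_iterations):
        remaining = 1 - t / n_iterations           # shrinks from 1 towards 0
        rate = initial_rate * remaining
        radius = max(initial_radius * remaining, 1e-9)
        row = rng.choice(rows)
        # Best matching node: the one whose weights are closest to this row.
        best = min(range(n_nodes),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(row, nodes[i])))
        for i, node in enumerate(nodes):
            # Nodes near the winner on the line are pulled more strongly.
            influence = math.exp(-((i - best) ** 2) / (2 * radius ** 2))
            step = rate * influence
            nodes[i] = [w + step * (x - w) for w, x in zip(node, row)]
    return nodes


profiles = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
for node in train_som_1d(profiles, n_nodes=3, n_iterations=200):
    print(node)
```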

50
3.Result Viewer
We need a viewing tool
  • Micro-array raw data -> Cluster (data cleaning,
    data mining)
  • .CDT and .GTR files -> Tree-View (clustering
    result visualization)
51
3.Result Tree-viewer
  • Use the mouse: click/drag/find
  • Use the keyboard: up/down/left/right

52
Appendix
53
1. What Is Good Clustering?
  • A good clustering method will produce high
    quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its implementation.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns.

54
2. This is a Micro-Array Dataset
(Table: rows Gene 1 to Gene 6, columns Exp 1 to Exp 6)
55
3. Take each row (gene) as a line
Log2(cy5/cy3)
56
Then we can measure similarity by correlation
Correlation = 0.97
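
A correlation like the one on this slide can be computed from two gene rows with the standard Pearson formula. The sketch below is a plain illustration; the function name and the example values are made up and are not the genes shown in the slide.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long rows of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Raises ZeroDivisionError if either row is completely flat.
    return covariance / (spread_x * spread_y)


gene_a = [0.2, 1.1, 1.9, 3.2, 2.4, 0.7]   # made-up Log2(cy5/cy3) values
gene_b = [0.1, 0.9, 2.2, 2.9, 2.6, 0.5]
print(pearson(gene_a, gene_b))
```

Values close to 1 mean the two lines rise and fall together; values close to -1 mean they move in opposite directions.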
57
4. Finally, we can do the clustering
58
5. Hierarchical Clustering
Hierarchical clustering: define the similarity.
Do we cluster the genes? And do we calculate the
weights?
59
6. Hierarchical Clustering
  • Define Similarity metric
  • Center or not
  • Absolute or not
  • Linear or not

60
7. Hierarchical Clustering
  • Center or not
  • Absolute or not

61
8. Hierarchical Clustering
  • Linear or not
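
The three choices can be read as switches on one similarity function. The sketch below is our interpretation, not the tool's code: "center or not" decides whether the row mean is subtracted first, "absolute or not" decides whether the sign is dropped, and "linear or not" is taken here to mean using the values themselves versus their ranks. All names are hypothetical.

```python
import math


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            result[i] = average
        start = end + 1
    return result


def similarity(x, y, centered=True, absolute=False, ranked=False):
    """Correlation-style similarity with the three on/off choices."""
    if ranked:
        x, y = ranks(x), ranks(y)          # non-linear: compare orderings only
    if centered:
        mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
        x = [a - mean_x for a in x]
        y = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    r = dot / norm
    return abs(r) if absolute else r


print(similarity([1, 2, 3], [2, 3, 4]))                   # centered
print(similarity([1, 2, 3], [2, 3, 4], centered=False))   # uncentered
print(similarity([1, 2, 3], [3, 2, 1], absolute=True))    # sign ignored
print(similarity([1, 2, 3, 4], [1, 8, 27, 64], ranked=True))
```

Uncentered similarity treats two rows with the same shape but different offsets as less alike, while the absolute form groups genes that move in exactly opposite directions.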

62
9. Hierarchical Clustering
  • During construction of the hierarchy, decisions
    must be made to determine which clusters should
    be joined.
  • The distance or similarity between clusters must
    be calculated.
  • The rules that govern this calculation are
    linkage methods.
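
Three common linkage rules can be written in a few lines. This is a generic sketch with made-up names, using Euclidean distance between points; the clustering tool may offer other rules and other distance measures.

```python
import math


def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters of points under a given linkage rule."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


left = [(0.0,), (1.0,)]
right = [(3.0,), (5.0,)]
for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(left, right, rule))
```

At each step of building the hierarchy, the two clusters with the smallest such distance are the ones joined.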

63
10. Partition Clustering
  • K-means, as the professor taught us.
  • K-means/K-medoid
  • K-medoid represents each cluster by an actual
    data point near the center, instead of the
    computed central point.
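
The difference between the two representatives is easy to see on a small example. In this sketch (names are our own) the medoid is taken to be the member with the smallest total distance to the other members, which is the usual definition; the mean is computed for comparison.

```python
import math


def mean_point(points):
    """Central point used by K-means; usually not one of the data points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def medoid(points):
    """Member with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(mean_point(cluster))  # pulled towards the outlier at (10, 0)
print(medoid(cluster))      # stays on a real data point
```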

64
WE SEPARATE them!!!!!
65
11. Ordering of operations
  • 1.log transform of all the values
  • 2.Mean centre rows
  • 3.Median centre rows.
  • 4.Normalize rows
  • 5.Mean centre columns
  • 6.Median centre columns
  • 7.Normalize columns
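
The order matters because each step works on the output of the previous one. The sketch below chains the row operations in the listed order and reuses them for columns by transposing. The helper names are ours, and "normalize" is assumed here to mean scaling each row so that its sum of squares is 1, which should be checked against the tool's manual.

```python
import math
from statistics import median


def log2_transform(rows):
    return [[math.log2(v) for v in row] for row in rows]


def center_rows(rows, use_median=False):
    """Subtract the row mean (or median) from every value in the row."""
    centered = []
    for row in rows:
        middle = median(row) if use_median else sum(row) / len(row)
        centered.append([v - middle for v in row])
    return centered


def normalize_rows(rows):
    """Scale each row to unit sum of squares; all-zero rows are left alone."""
    result = []
    for row in rows:
        length = math.sqrt(sum(v * v for v in row))
        result.append([v / length for v in row] if length else list(row))
    return result


def transpose(rows):
    return [list(col) for col in zip(*rows)]


ratios = [[1.0, 2.0, 4.0], [8.0, 8.0, 8.0]]       # made-up cy5/cy3 ratios
step1 = log2_transform(ratios)                    # 1. log transform
step2 = center_rows(step1)                        # 2. mean centre rows
step4 = normalize_rows(step2)                     # 4. normalize rows
step5 = transpose(center_rows(transpose(step4)))  # 5. mean centre columns
print(step5)
```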