Title: Overview of Affy Analysis Pipeline
1Overview of Affy Analysis Pipeline
3- Main page for starting an analysis run
- Three main stop/start points
- Group files
- Normalize data
- Analyze data
Choose Project
Start New Analysis
Previous analysis sessions
Click to view files within a Group
4- Group CEL files from one or multiple projects
View files already selected
Select projects to choose Affy CEL files from
View all CEL files from selected Projects
5- Once the files are selected, group the files into Sample Groups
- You can add or delete sample groups
- If you order the samples, your data will be reported in the same order
- Select a reference sample. This sample will be compared to all the other sample groups during the analysis phase
6Make sure all CEL files are in the correct Sample Group
7- Start the normalization run
- Data is analyzed using Affy-specific Bioconductor libraries
- justRMA or justGCRMA are the main methods
Select processing method
8- The actual processing is moved off to the Batch Scheduler, so large runs should not be a problem
- After the analysis is complete, follow the link to view the normalized data file and to start differential expression analysis
9- After normalization a single file is produced: four annotation columns plus one column of log 2 expression values for each CEL file analyzed.
10- To view previous normalization runs, click the Normalization tab
Click to view details
11- Start differential expression analysis
- Two choices
- MEV: MultiExperiment Viewer from TIGR
- Multtest: collection of algorithms from Bioconductor
Multtest
MEV
12- Start MEV via Java Web Start
- Data is pre-processed and a link is produced to start MEV
13- Start SAM analysis or run t-test
- Must have at least 2 replicates for each sample
group
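The replicate requirement above can be checked before a run is submitted. The following is a minimal sketch, not the pipeline's own code: the function name, the dictionary layout, and the CEL file names are all hypothetical.

```python
def groups_lacking_replicates(sample_groups, minimum=2):
    """Return the names of sample groups with fewer than `minimum` CEL files.

    sample_groups maps a group name to the list of CEL files assigned to it
    (a hypothetical in-memory layout, chosen only for illustration).
    """
    return sorted(
        name for name, cel_files in sample_groups.items()
        if len(cel_files) < minimum
    )


# Hypothetical grouping: group C has a single CEL file.
example_groups = {
    "A": ["a_rep1.CEL", "a_rep2.CEL"],
    "B": ["b_rep1.CEL", "b_rep2.CEL", "b_rep3.CEL"],
    "C": ["c_rep1.CEL"],
}
too_small = groups_lacking_replicates(example_groups)
print(too_small)
```

An empty result means every sample group has enough replicates for SAM or a t-test.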
14SAM analysis
- SAM analysis description from Tusher et al. [1]:
"SAM identifies genes with statistically significant changes in expression by assimilating a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements. The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set."
[1] Tusher et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21. Epub 2001 Apr 17. Erratum in: Proc Natl Acad Sci U S A 2001 Aug 28;98(18):10515.
- The analysis run will automatically compare each sample group to the reference sample
- Example: four sample groups A, B, C, D, with A as the reference, would produce 3 comparisons: A_vs_B, A_vs_C, A_vs_D
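The reference-versus-group pairing in the example can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and only the `A_vs_B` naming pattern is taken from the slide.

```python
def reference_comparisons(sample_groups, reference):
    """Pair the reference sample group with every other sample group.

    The groups are kept in the order given, matching the note that data
    is reported in the order the samples were arranged.
    """
    if reference not in sample_groups:
        raise ValueError(f"reference group {reference!r} is not a sample group")
    return [
        f"{reference}_vs_{group}"
        for group in sample_groups
        if group != reference
    ]


comparisons = reference_comparisons(["A", "B", "C", "D"], reference="A")
print(comparisons)
```

With N sample groups this always gives N - 1 comparisons, one per non-reference group.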
15Calculating the FDR
- A calculation run is performed for each condition
- At each step the Delta cutoff is plugged into an equation that gives back a list of significant genes and the FDR.
- The program keeps track of which genes are significant at each step and what the FDR is for that group of genes.
- At low Delta cutoffs many genes will be returned, but they will have a high FDR.
- At higher Delta cutoffs a smaller list of genes should be returned, with much lower FDRs.
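The bookkeeping across Delta cutoffs can be sketched as below: for every gene, keep the lowest FDR of any significant list it appeared in. This is a minimal sketch, not the pipeline's implementation. The function name and input layout are hypothetical, the SAM equation itself is not reproduced, and the numbers mirror the three example gene tables in this deck.

```python
def lowest_fdr_per_gene(steps):
    """Track the lowest FDR at which each gene was called significant.

    steps is an iterable of (fdr, genes) pairs, one per Delta cutoff,
    where genes is the significant list returned at that cutoff.
    """
    best = {}
    for fdr, genes in steps:
        for gene in genes:
            if gene not in best or fdr < best[gene]:
                best[gene] = fdr
    return best


def genes_at_or_below(best, max_fdr):
    """Query the tracked results for genes with an FDR of at most max_fdr."""
    return [gene for gene, fdr in best.items() if fdr <= max_fdr]


# Three cutoffs, from lowest Delta (many genes, high FDR) to highest.
steps = [
    (50, [f"Gene_{i}" for i in range(1, 25)]),
    (10, [f"Gene_{i}" for i in range(1, 12)]),
    (5, [f"Gene_{i}" for i in range(1, 4)]),
]
best = lowest_fdr_per_gene(steps)
print(best["Gene_1"], best["Gene_4"], best["Gene_12"])
print(len(genes_at_or_below(best, 10)))
```

Because only the minimum is kept, the order in which the cutoffs are processed does not change the result.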
16Gene Name FDR
Gene_1 50
Gene_2 50
Gene_3 50
Gene_4 50
Gene_5 50
Gene_6 50
Gene_7 50
Gene_8 50
Gene_9 50
Gene_10 50
Gene_11 50
Gene_12 50
Gene_13 50
Gene_14 50
Gene_15 50
Gene_16 50
Gene_17 50
Gene_18 50
Gene_19 50
Gene_20 50
Gene_21 50
Gene_22 50
Gene_23 50
Gene_24 50
Genes Returned at First Delta Cutoff
17Gene Name FDR
Gene_1 10
Gene_2 10
Gene_3 10
Gene_4 10
Gene_5 10
Gene_6 10
Gene_7 10
Gene_8 10
Gene_9 10
Gene_10 10
Gene_11 10
Gene_12 50
Gene_13 50
Gene_14 50
Gene_15 50
Gene_16 50
Gene_17 50
Gene_18 50
Gene_19 50
Gene_20 50
Gene_21 50
Gene_22 50
Gene_23 50
Gene_24 50
Genes Returned at Second Delta Cutoff
18Gene Name FDR
Gene_1 5
Gene_2 5
Gene_3 5
Gene_4 10
Gene_5 10
Gene_6 10
Gene_7 10
Gene_8 10
Gene_9 10
Gene_10 10
Gene_11 10
Gene_12 50
Gene_13 50
Gene_14 50
Gene_15 50
Gene_16 50
Gene_17 50
Gene_18 50
Gene_19 50
Gene_20 50
Gene_21 50
Gene_22 50
Gene_23 50
Gene_24 50
Genes Returned at Last Delta Cutoff
19- For each delta cutoff a list of genes is returned. Record the lowest FDR a gene is found at.
- Rank the output according to the FDRs. The data can then be queried on the FDR, which will return a population of genes with a known FDR
- For each condition 2-3 data files will be created
- HTML file: contains a list of genes with the lowest FDRs, links to external annotation, and a false color representation of the log 2 expression data.
- Text file with all the ratio data. Columns include:
- Probe_set_id
- Gene_Symbol, Gene_Title, Unigene, LocusLink, Public_ID
- FDR
- SAM_ratio
- mu_X
- mu_Y
- Log_2_Ratio
- Log_10_Ratio
- All log 2 expression values for the CEL files used in the analysis
- Text file of updated canonical names
- This file is produced if the data is uploaded into the Get Expression table in SBEAMS
- Tries to turn all the canonical names into RefSeq protein accession numbers or a LocusLink ID; if neither exists, it keeps the DNA accession number provided by Affymetrix
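The canonical-name fallback can be sketched as a simple preference chain. This is an assumption-laden illustration: the function name and the placeholder identifiers are hypothetical, and preferring the RefSeq protein accession over the LocusLink ID when both exist is assumed from the order in which the slide lists them.

```python
def canonical_name(refseq_protein, locuslink_id, dna_accession):
    """Pick the canonical name for one probe set.

    Assumed preference order: RefSeq protein accession, then LocusLink ID,
    then the DNA accession supplied with the Affymetrix annotation.
    Missing identifiers are passed as None or an empty string.
    """
    for candidate in (refseq_protein, locuslink_id):
        if candidate:
            return candidate
    return dna_accession


# Placeholder identifiers, not real accessions.
print(canonical_name("REFSEQ_PROTEIN_X", "LOCUSLINK_X", "DNA_ACC_X"))
print(canonical_name(None, "LOCUSLINK_X", "DNA_ACC_X"))
print(canonical_name(None, "", "DNA_ACC_X"))
```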
20Launch the data into Excel
View the web page directly
21Example of the text output: all 45,000 rows for mouse
Example of HTML output of the top genes
Results: genes found with an FDR below 6%
Sample Groups: Cast_none_Clean_Brain_vs_SJL_4wks_Infected_Brain
Number of differentially expressed genes: 18
Number of false positives: 1
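The false positive count in this example is consistent with multiplying the size of the gene list by its FDR. The slide does not say how the page derives the number, so the sketch below is only one plausible reading, and the function name is hypothetical.

```python
def expected_false_positives(n_genes, fdr_percent):
    """Estimate how many genes in a list are false positives.

    Assumes the FDR is a percentage that applies to the whole list.
    """
    return n_genes * fdr_percent / 100.0


# 18 genes reported at an FDR cutoff of 6 percent.
estimate = expected_false_positives(18, 6)
print(estimate, round(estimate))
```

Since the cutoff is an upper bound on the list's FDR, the product is an upper estimate. Here it rounds to the single false positive reported above.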
22Add data to Get Expression Table
23Click button to upload data
24Warning if Condition already exists in SBEAMS
Click Checkbox to ignore warning
25Viewing expression data in Cytoscape
- Wanted to add expression data to Cytoscape with a minimal amount of effort.
- Different protein networks could be added as needed
- Also would like to look at the data in a graphical format to see which differentially expressed genes are shared between different conditions
- Making a Cytoscape Expression Network
- Query the data from the Get Expression page. The following options must be selected:
- Data Columns to Display: Log 10 Ratio, False Discovery Rate
- Display Options: Show all conditions, Pivot Conditions as columns
- Behind the scenes the program will sort each condition by the FDR and take the top 100 genes. If you select a False Discovery Constraint (which is recommended), the gene must also fall below the cutoff
- Each condition will be a large diamond-shaped node and each gene a smaller circle. An edge is drawn between a condition and a gene if the gene meets the above criteria.
- Once this is done, a web page is presented with 4 Java Web Start links
- We are utilizing the Gaggle version of Cytoscape from Paul S.
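The edge selection described for the expression network can be sketched as follows. This is an illustration rather than the program's own code. The function name, the input layout, and the example values are hypothetical, and "fall below the cutoff" is read here as strictly less than.

```python
def build_expression_network(fdr_by_condition, top_n=100, max_fdr=None):
    """Return (condition, gene, fdr) edges for a condition-gene network.

    fdr_by_condition maps each condition to a {gene: fdr} dictionary.
    For each condition the genes are sorted by FDR and the top_n kept;
    if max_fdr is given, a gene must also have an FDR below it.
    """
    edges = []
    for condition, gene_fdr in fdr_by_condition.items():
        # Sort by FDR, breaking ties by gene name for a stable order.
        ranked = sorted(gene_fdr.items(), key=lambda item: (item[1], item[0]))
        for gene, fdr in ranked[:top_n]:
            if max_fdr is not None and fdr >= max_fdr:
                continue
            edges.append((condition, gene, fdr))
    return edges


# Small hypothetical query result with two conditions.
query_result = {
    "A_vs_B": {"Gene_1": 5, "Gene_2": 10, "Gene_3": 50},
    "A_vs_C": {"Gene_1": 10, "Gene_4": 5},
}
edges = build_expression_network(query_result, top_n=2, max_fdr=20)

# Genes connected to more than one condition are shared between conditions.
genes_a = {gene for cond, gene, _ in edges if cond == "A_vs_B"}
genes_c = {gene for cond, gene, _ in edges if cond == "A_vs_C"}
shared = genes_a & genes_c
print(edges)
print(shared)
```

A gene node with edges to several condition nodes is what makes shared differentially expressed genes visible in the graph.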
26Select Project
Select Conditions
Select required display fields: Log10 ratio and False Discovery Rate
Select required query options: Show all Conditions and Pivot Conditions
27Number of genes selected
Click to start Gaggle version of Cytoscape
29- The Gaggle Boss window manages all the programs
that can talk to one another
30- Launching Cytoscape loads the expression network
- Each condition is a large diamond-shaped node
- Genes are circle nodes
- Edges indicate that the gene is expressed within the condition it is connected to
- Node colors are mapped to the Log10 expression ratio
- Condition A_vs_B
- Green: overexpression in sample A
- Red: overexpression in sample B
- Edge colors are mapped to the significance value (false discovery rate in this example)
31- The data matrix browser brings in all the expression data
- Currently loads the significance values and Log 10 ratios for each condition
Click the folder first
Both the Log10 ratio and p-value will be loaded.
Click the load data icon second
p-value is currently used to hold any significance measurement
32- All the expression data found in the SQL query will be loaded
- Remember that even if only one condition fell below the significance cutoff for a particular gene, the data for all conditions will be shown
- Only the top 100 genes (currently) will be shown on the Cytoscape map
33Select a condition to map the expression data
onto the Cytoscape map
Run a movie to alternate between the different conditions' expression values
34Broadcast the selected genes to all the programs
listening on the Gaggle
Select Genes of interest
35- Six genes were broadcast from the Cytoscape window to the DMV
- To view the expression profile of the selected genes across all conditions, click the Graph button
Number of genes currently selected
36- View the Graph with or without the condition names
38Select a single gene from the profile
Broadcast the single selection
39To find genes with a similar expression profile
use the correlation finder
Select the threshold and click Select in browser
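The idea behind a correlation finder can be sketched as below. The slides do not say which similarity measure the tool uses, so Pearson correlation is an assumption here, and the function names and example profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def similar_genes(profiles, query, threshold):
    """Genes whose profile correlates with the query gene at or above threshold."""
    target = profiles[query]
    return [
        gene for gene, profile in profiles.items()
        if gene != query and pearson(target, profile) >= threshold
    ]


# Hypothetical Log10 ratios across three conditions.
profiles = {
    "Gene_1": [-1.0, 0.0, 1.0],
    "Gene_2": [-2.0, 0.0, 2.0],   # same shape as Gene_1, larger amplitude
    "Gene_3": [1.0, 0.0, -1.0],   # mirror image of Gene_1
    "Gene_4": [0.5, 0.5, 0.5],    # flat profile
}
matches = similar_genes(profiles, "Gene_1", threshold=0.9)
print(matches)
```

Raising the threshold narrows the selection to profiles that follow the query gene more closely.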
41Use the Volcano plot to get an overview of the data (significance value vs. expression value)
Select the condition to graph
42Select genes of interest and broadcast to the
other programs
43Highlight genes by their GO annotation
Click the annotation type to view
Start Annotation viewer
44Choose a GO annotation level
Click the Button
45Select the GO term of interest. The genes will
be highlighted in the Cytoscape Window.
46Make a custom view to display data in Cytoscape
in a spreadsheet