Title: Computing with Whole Genomes
1Computing with Whole Genomes
Stuart M. Brown Research Computing, NYU School of
Medicine
2Genome Sequencing
- The ability to sequence entire genomes has created a huge demand for bioinformatics
- Simple data management for the sequencing projects
- Genome assembly
- Annotation
- Public access to the data
- New types of whole genome analyses
3- Genome sequencing factories churn out raw sequence data at an ever-increasing rate
- Fewer scientists are involved in generating data and more are involved in data analysis
4Sequence Pipeline
- Laboratory Information Management: track samples, store raw data
- Assemble fragments
- Track orientation and distance for paired reads from libraries of known sized clones
- Find genes
- Gene prediction algorithms
- Map known genes and cDNAs
- Annotation and public access to data
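
The paired-read bookkeeping in the pipeline above can be sketched as a simple consistency check. This is only an illustration, not any sequencing center's actual software: the `PlacedRead` type, the `pair_is_consistent` function, and the insert size and tolerance values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost contig coordinate of the read (0-based)
    length: int    # read length in bases
    forward: bool  # True if the read aligns to the forward strand


def pair_is_consistent(a, b, insert_size, tolerance):
    """Return True if two mates face each other and span roughly one clone insert.

    insert_size and tolerance describe the library of known sized clones
    (hypothetical values supplied by the caller).
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Reads from opposite ends of one clone should point inward:
    # the left read on the forward strand, the right read on the reverse strand.
    if not (left.forward and not right.forward):
        return False
    # Distance from the start of the left read to the end of the right read.
    span = (right.start + right.length) - left.start
    return abs(span - insert_size) <= tolerance


# Example: a 2 kb clone library with a 300-base tolerance (assumed numbers).
read_1 = PlacedRead(start=100, length=500, forward=True)
read_2 = PlacedRead(start=1700, length=500, forward=False)
consistent = pair_is_consistent(read_1, read_2, insert_size=2000, tolerance=300)
```

Pairs that fail this kind of check are one signal an assembler can use to flag a possible misassembly.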
5Raw Genome Data
6Finding genes in genome sequence is not easy
- About 1% of human DNA encodes functional genes.
- Genes are interspersed among long stretches of non-coding DNA.
- Repeats, pseudo-genes, and introns confound matters
7- The next step is obviously to locate all of the
genes and describe their functions. This will
probably take another 15-20 years!
8UCSC
9Gene Prediction Works Poorly
- Algorithms are not accurate
- non-consensus splice sites
- where is the true first 5' exon?
- cDNA data is incomplete and confusing
- truncated cDNA sequences
- real alternative splicing
- Pseudo-genes and true gene duplication vs. mistakes in the genome assembly
10(No Transcript)
11Ensembl at EBI/EMBL
12(No Transcript)
13Integrate With other Genetic Datasets
- Cytogenetic and molecular markers
- (STS, microsatellites, radiation hybrids)
- Known mutations
- OMIM for humans
- Huge collection of mouse genetic data
- Nearly complete collection of yeast mutants
- SNPs
- Gene Expression
14(No Transcript)
15(No Transcript)
16 SNPs are Very Common
- SNPs are very common in the human population.
- Between any two people, there is an average of one SNP every 1250 bases.
- Most of these have no phenotypic effect
- Venter et al. estimate that only <1% of all human SNPs impact protein function (non-coding regions)
- Selection against mis-sense mutations
- Some are alleles of genes.
17Genome Sequencing finds SNPs
- The Human Genome Project involves sequencing DNA cloned from a number of different people.
- The Celera sequence comes from 5 people
- Even in a library made from one person's DNA, the homologous chromosomes have SNPs
- This inevitably leads to the discovery of SNPs: any single base sequence difference
- These SNPs can be valuable as the basis for diagnostic tests
18(No Transcript)
19The SNP Consortium is an unlikely alliance of
pharmaceutical and computer companies managed by
Lincoln Stein at Cold Spring Harbor Lab.
The SNP Consortium Ltd. is a non-profit
foundation organized for the purpose of
providing public genomic data. Its mission is to
develop up to 300,000 SNPs distributed evenly
throughout the human genome and to make the
information related to these SNPs available to
the public without intellectual property
restrictions. The project started in April 1999
and is anticipated to continue until the end of
2001.
The current release (Jan 2001) consists of
856,666 SNPs, all of which have been anchored to
the human genome by "in silico" mapping to the
genomic working draft (UCSC Golden Path).
20 We describe a map of 1.42 million single
nucleotide polymorphisms (SNPs) distributed
throughout the human genome, providing an average
density on available sequence of one SNP every
1.9 kilobases. These SNPs were primarily
discovered by two projects: The SNP Consortium
and the analysis of clone overlaps by the
International Human Genome Sequencing Consortium.
The map integrates all publicly available SNPs
with described genes and other genomic features.
We estimate that 60,000 SNPs fall within exon
(coding and untranslated regions), and 85% of
exons are within 5 kb of the nearest SNP.
Nucleotide diversity varies greatly across the
genome, in a manner broadly consistent with a
standard population genetic model of human
history. This high-density SNP map provides a
public resource for defining haplotype variation
across the genome, and should help to identify
biomedically important genes for diagnosis and
therapy.
21Search for SNPs in your gene
- an average density of one SNP every 1.9 kilobases
- But that does not guarantee a SNP in your favorite gene!
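
A rough way to see why an average density does not guarantee a SNP in any particular gene is to treat SNP positions as a Poisson process. That uniformity is an assumption made here for illustration only (the map's authors report that nucleotide diversity varies greatly across the genome), and the function names are hypothetical.

```python
import math

# Average spacing quoted above: one SNP every 1.9 kilobases.
SNP_SPACING_BP = 1900


def expected_snps(gene_length_bp, spacing_bp=SNP_SPACING_BP):
    """Expected number of mapped SNPs in a region of the given length."""
    return gene_length_bp / spacing_bp


def prob_no_snp(gene_length_bp, spacing_bp=SNP_SPACING_BP):
    """Chance of finding zero SNPs, ASSUMING SNPs fall uniformly at random.

    Under a Poisson model, P(0 events) = exp(-expected count).
    """
    return math.exp(-expected_snps(gene_length_bp, spacing_bp))


# Compare a short gene with a long one (lengths chosen for illustration).
for length_bp in (1900, 10000, 50000):
    print(length_bp, round(expected_snps(length_bp), 2), round(prob_no_snp(length_bp), 4))
```

Under this simplified model, a region of 1.9 kb has about a 37% chance of containing no mapped SNP at all, which is why a direct search is still needed.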
22GenBank has a dbSNP
- As of Apr 19, 2001, dbSNP has submissions for 2,842,021 SNPs
- It is possible to search dbSNP by BLAST comparisons to a target sequence
23 >gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%), Strand = Plus / Plus
Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60
Query: 1549 cactaaggacttaaaataaaa 1569
Sbjct: 61   cactaaggacttaaaataaaa 81
If a matching SNP is found, then it can
be directly located on the Genome map
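
The alignment above can be checked position by position to see where the query and the dbSNP record differ. The sketch below is not part of BLAST or dbSNP; `compare_alignment` is a hypothetical helper, and the sequences are transcribed from the alignment shown on this slide. It uses the standard IUPAC ambiguity codes (for example `y` stands for `c` or `t`) to tell whether a difference is compatible with the SNP's listed alleles.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

# Sequences from the slide, written in 10-base chunks for easy checking.
QUERY = ("ccctcttccc" "tgacctccca" "actctaaagc" "caagcacttt" "atatttttct"
         "cttagatatt" "cactaaggac" "ttaaaataaa" "a")
SBJCT = ("ccctcttccc" "tgacctccca" "actctaaagc" "caagcacttt" "atattttcct"
         "yttagatatt" "cactaaggac" "ttaaaataaa" "a")


def compare_alignment(query, sbjct, query_start=1, sbjct_start=1):
    """Count identical columns and list the differing ones (ungapped alignment).

    Each difference is (query_pos, sbjct_pos, query_base, sbjct_base, compatible),
    where compatible is True if the query base is one of the alleles implied by
    an ambiguity code in the subject.
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have equal length")
    identities = 0
    differences = []
    for offset, (q, s) in enumerate(zip(query.lower(), sbjct.lower())):
        if q == s:
            identities += 1
        else:
            compatible = q in IUPAC.get(s, "")
            differences.append((query_start + offset, sbjct_start + offset, q, s, compatible))
    return identities, differences


identities, differences = compare_alignment(QUERY, SBJCT, query_start=1489, sbjct_start=1)
```

For these sequences the helper reports 79 identical columns out of 81, matching the BLAST summary. The difference at subject position 51 (`c` in the query against `y` in the record) lines up with `allelePos=51` in the header; the other difference, at subject position 48, is a plain mismatch.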
24Gene Expression Profiling
- Sequence bulk cDNAs from different tissues
- NCBI CGAP website allows "digital differential display"
- SAGE (sequence short tags from cDNAs)
- Microarrays
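
Comparing cDNA counts between tissues comes down to asking whether a gene's share of sequences in one library differs from its share in another by more than chance. One common way to ask that is Fisher's exact test on a 2x2 table; whether the CGAP tool uses exactly this calculation is an assumption here, and the function name and counts below are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a = cDNAs of the gene in library 1, b = all other cDNAs in library 1,
    c = cDNAs of the gene in library 2, d = all other cDNAs in library 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of seeing x gene cDNAs in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 20 of 10,000 cDNAs in tissue A versus 2 of 10,000 in tissue B.
p_value = fisher_exact_two_sided(20, 9980, 2, 9998)
```

A small p-value suggests the gene is represented differently in the two libraries, though library construction biases can produce the same pattern.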
25Digital Differential Display
26(No Transcript)
27cDNA spotted microarrays
28Link Gene Expression to Genome Sequence
- Identify promoter and 5' sequence for a group of co-expressed genes.
- Scan for known transcription factor binding sites.
- Predict new regulatory sites based on common sequence elements.
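
Scanning upstream sequences for a known binding site can be sketched by turning a consensus written in IUPAC codes into a regular expression. This is a minimal illustration under simplifying assumptions: it searches the forward strand only, it treats the site as an exact consensus rather than a weight matrix, and the gene names, sequences, and the `TATAWAW` motif used below are illustrative inputs rather than data from the slides.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile a consensus such as 'TATAWAW' into a pattern that finds overlapping hits."""
    body = "".join(IUPAC_TO_REGEX[base] for base in consensus.upper())
    # A lookahead matches without consuming text, so overlapping sites are all reported.
    return re.compile("(?=" + body + ")")


def scan_promoters(promoters, consensus):
    """Return {gene name: [0-based start positions of the site]} for each promoter."""
    pattern = consensus_to_regex(consensus)
    return {
        name: [match.start() for match in pattern.finditer(sequence.upper())]
        for name, sequence in promoters.items()
    }


# Hypothetical upstream sequences for two co-expressed genes.
promoters = {"geneA": "GGCTATAAAAGGC", "geneB": "GGGCGCGCC"}
hits = scan_promoters(promoters, "TATAWAW")
```

Genes in a co-expressed group that share hits for the same site are candidates for common regulation; a fuller analysis would also scan the reverse complement.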
29Whole Genome Comparisons
- Comparative Genomics
- Use mouse homologs to find human genes
- cDNAs
- Chromosome scanning for conserved regions
- Synteny
- Use knockouts to define function
- Deep homology
- Metabolic reconstruction
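
Chromosome scanning for conserved regions can be illustrated with a sliding window over two sequences that have already been aligned. The sketch below assumes the alignment was produced elsewhere (gaps written as `-`), and the function name, window size, and threshold are hypothetical choices, not values from the slides.

```python
def conserved_windows(aligned_a, aligned_b, window=50, min_identity=0.8):
    """Return (start, identity) for every window whose identity meets the threshold.

    aligned_a and aligned_b are two rows of a pairwise alignment of equal length.
    A column counts as a match only if both bases agree and neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        int(x == y and x != "-")
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    ]
    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy example: the first 8 columns are identical, the last 4 differ.
human = "ACGTACGTAAAA"
mouse = "ACGTACGTTTTT"
windows = conserved_windows(human, mouse, window=4, min_identity=1.0)
```

Runs of adjacent high-identity windows between human and mouse sequence are the kind of signal that can point to exons or regulatory elements missed by gene prediction.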
30(No Transcript)
31Metabolic Reconstruction
- If we know the genome sequence, and we know the metabolic pathways
- Then we should be able to map genes to the pathways in every organism
- WIT2 (What is There) is an attempt to do this
- http://wit.mcs.anl.gov/WIT2/
- How can organisms lack genes that are essential in related groups?
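
The mapping of genes onto pathways described above can be sketched as a lookup that reports which steps of each pathway have an annotated gene and which are missing. This is not how WIT2 is implemented; the function name, pathway names, step labels, and gene identifiers are all made-up placeholders.

```python
def reconstruct(pathways, gene_functions):
    """Report, for each pathway, which steps the genome covers and which it lacks.

    pathways:       {pathway name: ordered list of enzyme/step labels}
    gene_functions: {gene id: enzyme/step label assigned by annotation}
    """
    genes_by_step = {}
    for gene, step in gene_functions.items():
        genes_by_step.setdefault(step, []).append(gene)

    report = {}
    for name, steps in pathways.items():
        found = {step: sorted(genes_by_step[step]) for step in steps if step in genes_by_step}
        missing = [step for step in steps if step not in genes_by_step]
        report[name] = {"found": found, "missing": missing, "complete": not missing}
    return report


# Placeholder data: two pathways and the annotated genes of one organism.
pathways = {
    "pathway_1": ["enzyme_A", "enzyme_B", "enzyme_C"],
    "pathway_2": ["enzyme_C", "enzyme_D"],
}
gene_functions = {"gene_001": "enzyme_A", "gene_002": "enzyme_C", "gene_003": "enzyme_C"}
report = reconstruct(pathways, gene_functions)
```

A step listed as missing may reflect a genuinely absent function, an unrecognized gene, or an alternative enzyme, which is the puzzle raised in the last bullet.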
32EMP Database
Enzymes and Metabolic Pathways database
(EMP) http://emp.mcs.anl.gov/
33(No Transcript)
34 2-Oxobutanoate--Isoleucine, 2-Oxoglutarate_Anabolism (NADPH,_NADH)
35(No Transcript)
36Clusters of Orthologous Groups (COGs)
- COGs were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages.
- Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
37A simple COG with two yeast paralogs. YPL040c is
the yeast mitochondrial isoleucyl-tRNA
synthetase; the bacterial orthologs and that from
M. jannaschii are the BeTs (genome-specific best hits) for this yeast
protein, but the reverse is true only of the
bacterial proteins. For YBL076c (yeast
cytoplasmic isoleucyl-tRNA synthetase), the M.
jannaschii ortholog is a symmetrical BeT, whereas
the bacterial genes are asymmetrical.
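
The distinction between symmetrical and asymmetrical BeTs in this caption can be sketched with a small table of best hits. The code is an illustration, not the COG construction procedure itself: `classify_bets` is a hypothetical helper, `bact_IleRS` and `mj_IleRS` are placeholder names standing in for the bacterial and M. jannaschii proteins, and the best-hit table is filled in to mirror the relationships the caption describes.

```python
def classify_bets(best_hit, genome_of):
    """Label each best hit as symmetrical or asymmetrical.

    best_hit:  {(query protein, target genome): best-hit protein in that genome}
    genome_of: {protein: genome it belongs to}
    A hit is symmetrical when the hit protein's best hit back in the
    query's genome is the query itself.
    """
    labels = {}
    for (query, _target_genome), hit in best_hit.items():
        reverse = best_hit.get((hit, genome_of[query]))
        labels[(query, hit)] = "symmetrical" if reverse == query else "asymmetrical"
    return labels


# Placeholder proteins; only the two yeast names come from the caption.
genome_of = {
    "YPL040c": "yeast",
    "YBL076c": "yeast",
    "bact_IleRS": "bacteria",
    "mj_IleRS": "M. jannaschii",
}
best_hit = {
    ("YPL040c", "bacteria"): "bact_IleRS",
    ("YPL040c", "M. jannaschii"): "mj_IleRS",
    ("YBL076c", "bacteria"): "bact_IleRS",
    ("YBL076c", "M. jannaschii"): "mj_IleRS",
    ("bact_IleRS", "yeast"): "YPL040c",   # bacterial protein points back to the mitochondrial form
    ("mj_IleRS", "yeast"): "YBL076c",     # archaeal protein points back to the cytoplasmic form
}
labels = classify_bets(best_hit, genome_of)
```

With this table, the YPL040c to bacterial hit and the YBL076c to M. jannaschii hit come out symmetrical, while the other two yeast hits are asymmetrical, matching the caption.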
38(No Transcript)
39Proteomics
- Identify all of the proteins in an organism
- Potentially many more than genes due to alternative splicing and post-translational modifications
- Quantitate in different cell types and in response to metabolic/environmental factors
- Protein-protein interactions
40Protein-Protein Interactions
- Metabolic and regulatory pathways
- Transcription factors
- Co-expression
- Biochemical data
- crosslinking
- yeast 2-hybrid
- affinity tagging
- Useful feedback to genome annotation/protein
function and gene expression
41BIND - The Biomolecular Interaction Network
Database
42(No Transcript)