Computing with Whole Genomes

1
Computing with Whole Genomes
Stuart M. Brown, Research Computing, NYU School of Medicine
2
Genome Sequencing

The ability to sequence entire genomes has created a huge demand for bioinformatics:
Simple data management for the sequencing projects
Genome assembly
Annotation
Public access to the data
New types of whole genome analyses

Genome sequencing factories churn out raw sequence data at an ever-increasing rate
Fewer scientists are involved in generating data and more are involved in data analysis

4
Sequence Pipeline

Laboratory Information Management: track samples, store raw data
Assemble fragments
Track orientation and distance for paired reads from libraries of clones of known size
Find genes
Gene prediction algorithms
Map known genes and cDNAs
Annotation and public access to data
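
The paired-read bookkeeping in the assembly step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Placement` record, the `mates_consistent` function, and the 20% size tolerance are hypothetical names and values, not taken from any real assembler. It also assumes that mates are read inward from opposite ends of a clone.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Where one read landed in an assembled contig (hypothetical record)."""
    contig: str
    start: int   # leftmost coordinate of the read on the contig
    end: int     # rightmost coordinate of the read on the contig
    strand: str  # "+" if the read matches the contig, "-" if reverse-complemented


def mates_consistent(a, b, insert_size, tolerance=0.2):
    """Return True if two mates from one clone agree with the library insert size.

    Assumes inward-facing mates: the leftmost read must be on "+" and the
    rightmost on "-". The tolerance is an illustrative fraction of insert_size.
    """
    if a.contig != b.contig:
        return False  # distance cannot be judged across contigs
    left, right = sorted((a, b), key=lambda p: p.start)
    if (left.strand, right.strand) != ("+", "-"):
        return False  # wrong relative orientation
    span = right.end - left.start  # outer distance covered by the clone
    return abs(span - insert_size) <= tolerance * insert_size


fwd = Placement("contig1", 100, 600, "+")
rev = Placement("contig1", 1700, 2200, "-")
print(mates_consistent(fwd, rev, insert_size=2000))
```

Pairs that fail this kind of check are one signal that fragments may have been joined incorrectly.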

5
Raw Genome Data
6
Finding genes in genome sequence is not easy

About 1% of human DNA encodes functional genes.
Genes are interspersed among long stretches of non-coding DNA.
Repeats, pseudo-genes, and introns confound matters.

The next step is obviously to locate all of the
genes and describe their functions. This will
probably take another 15-20 years!

8
UCSC
9
Gene Prediction Works Poorly

Algorithms are not accurate
non-consensus splice sites
where is the true first 5' exon?
cDNA data is incomplete and confusing
truncated cDNA sequences
real alternative splicing
Pseudo-genes and true gene duplication
vs.
Mistakes in the genome assembly

11
Ensembl at EBI/EMBL
13
Integrate with Other Genetic Datasets

Cytogenetic and molecular markers
(STS, microsatellites, radiation hybrids)
Known mutations
OMIM for humans
Huge collection of mouse genetic data
Nearly complete collection of yeast mutants
SNPs
Gene Expression

16
SNPs are Very Common

SNPs are very common in the human population.
Between any two people, there is an average of
one SNP every 1250 bases.
Most of these have no phenotypic effect
Venter et al. estimate that less than 1% of all human SNPs impact protein function (most lie in non-coding regions)
Selection against missense mutations
Some are alleles of genes.
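
As a rough worked example of what "one SNP every 1250 bases" implies, the arithmetic can be written out as below. The genome size of 3.2 billion bases is an assumption added here for illustration and is not stated on the slide.

```python
GENOME_BP = 3.2e9      # assumed haploid genome size; not stated on the slide
SNP_SPACING_BP = 1250  # one SNP every 1250 bases, from the slide


def expected_differences(genome_bp=GENOME_BP, spacing_bp=SNP_SPACING_BP):
    """Expected number of single-base differences between two genomes."""
    return genome_bp / spacing_bp


print(f"{expected_differences():,.0f} expected differences")
```

Under that assumption, two people would differ at roughly 2.5 million positions.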

17
Genome Sequencing finds SNPs

The Human Genome Project involves sequencing DNA
cloned from a number of different people.
The Celera sequence comes from 5 people
Even in a library made from one person's
DNA, the homologous chromosomes have SNPs
This inevitably leads to the discovery of SNPs -
any single base sequence difference
These SNPs can be valuable as the basis for
diagnostic tests

19
The SNP Consortium is an unlikely alliance of
pharmaceutical and computer companies managed by
Lincoln Stein at Cold Spring Harbor Lab.
The SNP Consortium Ltd. is a non-profit
foundation organized for the purpose of
providing public genomic data. Its mission is to
develop up to 300,000 SNPs distributed evenly
throughout the human genome and to make the
information related to these SNPs available to
the public without intellectual property
restrictions. The project started in April 1999
and is anticipated to continue until the end of
2001.
The current release (Jan 2001) consists of
856,666 SNPs, all of which have been anchored to
the human genome by "in silico" mapping to the
genomic working draft (UCSC Golden Path).
20
We describe a map of 1.42 million single
nucleotide polymorphisms (SNPs) distributed
throughout the human genome, providing an average
density on available sequence of one SNP every
1.9 kilobases. These SNPs were primarily
discovered by two projects: The SNP Consortium
and the analysis of clone overlaps by the
International Human Genome Sequencing Consortium.
The map integrates all publicly available SNPs
with described genes and other genomic features.
We estimate that 60,000 SNPs fall within exon
(coding and untranslated regions), and 85% of
exons are within 5 kb of the nearest SNP.
Nucleotide diversity varies greatly across the
genome, in a manner broadly consistent with a
standard population genetic model of human
history. This high-density SNP map provides a
public resource for defining haplotype variation
across the genome, and should help to identify
biomedically important genes for diagnosis and
therapy.
21
Search for SNPs in your gene

an average density of one SNP every 1.9
kilobases
But that does not guarantee a SNP in your
favorite gene!
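
To put a number on that caveat, here is a minimal sketch. It assumes, purely for illustration, that SNPs are scattered uniformly and independently along the genome (a Poisson model). Real SNP density varies across the genome, so treat the results as order-of-magnitude guides only. The function name `prob_no_snp` is hypothetical.

```python
import math

SNP_SPACING_BP = 1900  # one SNP every 1.9 kb, the average density quoted above


def prob_no_snp(region_bp, spacing_bp=SNP_SPACING_BP):
    """Probability that a region holds zero mapped SNPs under a uniform Poisson model."""
    expected = region_bp / spacing_bp  # expected SNP count in the region
    return math.exp(-expected)


for length in (500, 1900, 10_000):
    print(f"{length:>6} bp: P(no SNP) = {prob_no_snp(length):.3f}")
```

Under this model a 1.9 kb region has about a 37% chance of containing no mapped SNP at all, which is why a short gene can easily be missed.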

22
GenBank has a dbSNP

As of Apr 19, 2001, dbSNP has submissions for
2,842,021 SNPs
It is possible to search dbSNP by BLAST
comparisons to a target sequence

23
>gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%), Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
Sbjct: 61   cactaaggacttaaaataaaa 81
If a matching SNP is found, then it can
be directly located on the Genome map
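
The alignment above can be checked by hand or with a short script. The sketch below copies the two aligned sequences from the BLAST report and lists the columns where they differ, flagging subject bases that are IUPAC ambiguity codes (such as `y` for C or T) consistent with the query base. The function name `compare` and the tuple layout are illustrative choices, and the code assumes an ungapped alignment.

```python
# IUPAC two-base ambiguity codes, lowercase to match the alignment text
IUPAC = {
    "r": {"a", "g"}, "y": {"c", "t"}, "s": {"c", "g"},
    "w": {"a", "t"}, "k": {"g", "t"}, "m": {"a", "c"},
}

# Aligned sequences copied from the BLAST report above
QUERY = (
    "ccctcttccctgacctcccaactctaaagccaagcactttatatttttct"
    "cttagatatt"
    "cactaaggacttaaaataaaa"
)
SBJCT = (
    "ccctcttccctgacctcccaactctaaagccaagcactttatattttcc"
    "tyttagatatt"
    "cactaaggacttaaaataaaa"
)


def compare(query, sbjct, query_start=1, sbjct_start=1):
    """List alignment columns where the two sequences differ.

    Each entry is (query_pos, sbjct_pos, query_base, sbjct_base, compatible);
    compatible is True when the subject base is an ambiguity code that
    includes the query base.
    """
    if len(query) != len(sbjct):
        raise ValueError("this sketch expects an ungapped alignment")
    diffs = []
    for i, (q, s) in enumerate(zip(query, sbjct)):
        if q != s:
            compatible = q in IUPAC.get(s, set())
            diffs.append((query_start + i, sbjct_start + i, q, s, compatible))
    return diffs


diffs = compare(QUERY, SBJCT, query_start=1489)
print(f"Identities = {len(QUERY) - len(diffs)}/{len(QUERY)}")
for d in diffs:
    print(d)
```

The two differing columns fall at subject positions 48 and 51. Position 51 is the `y` that marks the SNP itself, matching `allelePos=51` in the record header.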
24
Gene Expression Profiling

Sequence bulk cDNAs from different tissues
NCBI CGAP website allows "digital differential
display"
SAGE (sequence short tags from cDNAs)
Microarrays

25
Digital Differential Display
27
cDNA spotted microarrays
28
Link Gene Expression to Genome Sequence

Identify promoter and 5' sequence for a group of
co-expressed genes.
Scan for known transcription factor binding
sites.
Predict new regulatory sites based on common
sequence elements.
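
The scan for known binding sites can be sketched as a pattern search over the upstream sequences of a co-expressed group. Everything specific below is a placeholder: the gene names, the toy sequences, and the motif `TGACGY` are invented for illustration. The sketch searches the forward strand only, and a real analysis would also scan the reverse complement.

```python
import re

# IUPAC nucleotide codes translated to regular-expression character classes
IUPAC_RE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC_RE[base] for base in motif.upper())
    return re.compile(f"(?=({body}))")


def scan(upstream, motif):
    """Map each gene to the 0-based start positions of motif hits in its upstream sequence."""
    pattern = motif_to_regex(motif)
    return {
        gene: [m.start() for m in pattern.finditer(seq.upper())]
        for gene, seq in upstream.items()
    }


# Hypothetical upstream sequences for three co-expressed genes
upstream = {
    "geneA": "TTGACGTCATTT",
    "geneB": "CCCTGACGTAAG",
    "geneC": "AAAAAAAAAAAA",
}
print(scan(upstream, "TGACGY"))
```

Genes that share a hit become candidates for common regulation, while genes with none (like the toy `geneC`) may be regulated through a different element.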

29
Whole Genome Comparisons

Comparative Genomics
Use mouse homologs to find human genes
cDNAs
Chromosome scanning for conserved regions
Synteny
Use knockouts to define function
Deep homology
Metabolic reconstruction

31
Metabolic Reconstruction

If we know the genome sequence, and we know the
metabolic pathways
Then we should be able to map genes to the
pathways in every organism
WIT2 (What is There) is an attempt to do this
http://wit.mcs.anl.gov/WIT2/
How can organisms lack genes that are essential
in related groups?
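
The idea of mapping genes to pathways, and spotting the gaps, can be sketched as a set comparison. This is not how WIT2 is implemented. The pathway names, enzyme labels, and the `missing_steps` function are hypothetical placeholders used only to show the logic.

```python
def missing_steps(pathways, annotated):
    """For each pathway, list the required enzyme functions absent from the annotation.

    pathways maps a pathway name to the enzyme functions it requires;
    annotated is the set of functions assigned to genes in one genome.
    """
    return {
        name: sorted(set(required) - annotated)
        for name, required in pathways.items()
    }


# Hypothetical reference pathways and one organism's annotated functions
pathways = {
    "pathway_1": ["enzyme_A", "enzyme_B", "enzyme_C"],
    "pathway_2": ["enzyme_C", "enzyme_D"],
}
annotated = {"enzyme_A", "enzyme_C", "enzyme_D"}

print(missing_steps(pathways, annotated))
```

A reported gap (here `enzyme_B` in `pathway_1`) is exactly the kind of case the closing question raises: the gene may be truly absent, replaced by an unrelated enzyme, or simply missed by annotation.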

32
EMP Database
Enzymes and Metabolic Pathways database
(EMP) http://emp.mcs.anl.gov/
34
2-Oxobutanoate--Isoleucine, 2-Oxoglutarate_Anabolism (NADPH,_NADH)
36
Clusters of Orthologous Groups (COGs)

COGs were delineated by comparing protein
sequences encoded in 43 complete genomes,
representing 30 major phylogenetic lineages.
Each COG consists of individual proteins or
groups of paralogs from at least 3 lineages and
thus corresponds to an ancient conserved domain.

37
A simple COG with two yeast paralogs. YPL040c is
the yeast mitochondrial isoleucyl-tRNA
synthetase; the bacterial orthologs and that from
M. jannaschii are the BeTs (best hits) for this
yeast protein, but the reverse is true only of the
bacterial proteins. For YBL076c (yeast
cytoplasmic isoleucyl-tRNA synthetase), the
M. jannaschii ortholog is a symmetrical BeT, whereas
the bacterial genes are asymmetrical.
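
The distinction between symmetrical and asymmetrical BeTs can be made concrete with a small sketch. The protein identifiers below are placeholders that loosely mirror the pattern described above (two yeast paralogs, one bacterial protein, one archaeal protein). They are not real best-hit data, and the function covers only the symmetry check, not the full COG construction.

```python
def classify_best_hits(best_hit, genome_of):
    """Label each best hit as symmetrical (True) or asymmetrical (False).

    best_hit maps (protein, target_genome) to that protein's best hit in the
    target genome; genome_of maps each protein to its own genome. A hit is
    symmetrical when the hit's best hit back in the first genome is the
    original protein.
    """
    labels = {}
    for (protein, _target), hit in best_hit.items():
        reverse = best_hit.get((hit, genome_of[protein]))
        labels[(protein, hit)] = (reverse == protein)
    return labels


# Hypothetical proteins: y1 and y2 are yeast paralogs
genome_of = {"y1": "yeast", "y2": "yeast", "b1": "bacterium", "m1": "archaeon"}
best_hit = {
    ("y1", "bacterium"): "b1",
    ("y1", "archaeon"): "m1",
    ("y2", "bacterium"): "b1",
    ("y2", "archaeon"): "m1",
    ("b1", "yeast"): "y1",  # the bacterial protein points back to y1
    ("m1", "yeast"): "y2",  # the archaeal protein points back to y2
}

for pair, symmetric in classify_best_hits(best_hit, genome_of).items():
    print(pair, "symmetrical" if symmetric else "asymmetrical")
```

In this toy table `y1` and `b1` are each other's best hits, while the hit from `y1` to `m1` is not returned, which is the asymmetry the figure caption describes.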
39
Proteomics

Identify all of the proteins in an organism
Potentially many more than genes due to
alternative splicing and post-translational
modifications
Quantitate in different cell types and in
response to metabolic/environmental factors
Protein-protein interactions

40
Protein-Protein Interactions