Computing with Whole Genomes

1
Computing with Whole Genomes
Stuart M. Brown, Research Computing, NYU School of
Medicine
2
Genome Sequencing
  • The ability to sequence entire genomes has
    created a huge demand for bioinformatics
  • Simple data management for the sequencing
    projects
  • Genome assembly
  • Annotation
  • Public access to the data
  • New types of whole genome analyses

3
  • Genome sequencing factories churn out raw
    sequence data at an ever increasing rate
  • Fewer scientists are involved in generating data
    and more are involved in data analysis

4
Sequence Pipeline
  • Laboratory Information Management - track
    samples, store raw data
  • Assemble fragments
  • Track orientation and distance for paired reads
    from libraries of known sized clones
  • Find genes
  • Gene prediction algorithms
  • Map known genes and cDNAs
  • Annotation and public access to data

5
Raw Genome Data
6
Finding genes in genome sequence is not easy
  • About 1% of human DNA encodes functional genes.
  • Genes are interspersed among long stretches of
    non-coding DNA.
  • Repeats, pseudo-genes, and introns confound
    matters
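
To make the difficulty concrete, here is a minimal sketch of a naive open reading frame (ORF) scan. The function name find_orfs, the min_codons threshold and the toy sequence are illustrative assumptions, not part of the original slides. The scan looks only at the forward strand and knows nothing about introns, repeats or pseudo-genes, which is why this kind of approach works poorly on genomes where coding sequence is rare and interrupted.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive. min_codons counts the
    start and stop codons and is an arbitrary illustrative threshold.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the first stop
    return sorted(orfs)

toy = "ccatggcttaaggg"   # made-up sequence containing one tiny ORF
print(find_orfs(toy))
```

A real gene finder has to go much further than this: both strands, splice sites, and statistical models of coding versus non-coding sequence.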

7
  • The next step is obviously to locate all of the
    genes and describe their functions. This will
    probably take another 15-20 years!

8
UCSC
9
Gene Prediction Works Poorly
  • Algorithms are not accurate
  • non-consensus splice sites
  • where is the true first 5' exon?
  • cDNA data is incomplete and confusing
  • truncated cDNA sequences
  • real alternative splicing
  • Pseudo-genes and true gene duplication vs.
    mistakes in the genome assembly

10
(No Transcript)
11
Ensembl at EBI/EMBL
12
(No Transcript)
13
Integrate With other Genetic Datasets
  • Cytogenetic and molecular markers (STS,
    microsatellites, radiation hybrids)
  • Known mutations
  • OMIM for humans
  • Huge collection of mouse genetic data
  • Nearly complete collection of yeast mutants
  • SNPs
  • Gene Expression

14
(No Transcript)
15
(No Transcript)
16
SNPs are Very Common
  • SNPs are very common in the human population.
  • Between any two people, there is an average of
    one SNP every 1250 bases.
  • Most of these have no phenotypic effect
  • Venter et al. estimate that only <1% of all human
    SNPs impact protein function (non-coding regions)
  • Selection against mis-sense mutations
  • Some are alleles of genes.
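
As a rough worked example of the numbers above: the genome size used here (about 3 billion bases) is an assumption added for illustration, and applying the <1% figure to the differences between two individuals is only a back-of-the-envelope approximation.

```python
GENOME_BASES = 3_000_000_000   # assumed approximate haploid human genome size
BASES_PER_SNP = 1250           # from the slide: one SNP every 1250 bases
FUNCTIONAL_FRACTION = 0.01     # upper bound from the slide (<1%)

def expected_snps(genome_bases, bases_per_snp):
    """Expected number of SNP differences at a uniform average density."""
    return genome_bases / bases_per_snp

pairwise = expected_snps(GENOME_BASES, BASES_PER_SNP)
functional_upper = pairwise * FUNCTIONAL_FRACTION

print(f"Expected differences between two people: {pairwise:,.0f}")
print(f"Upper bound on those affecting proteins:  {functional_upper:,.0f}")
```

Under these assumptions two people differ at about 2.4 million positions, of which fewer than about 24,000 would be expected to affect protein function.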

17
Genome Sequencing finds SNPs
  • The Human Genome Project involves sequencing DNA
    cloned from a number of different people.
  • The Celera sequence comes from 5 people
  • Even in a library made from one person's
    DNA, the homologous chromosomes have SNPs
  • This inevitably leads to the discovery of SNPs -
    any single base sequence difference
  • These SNPs can be valuable as the basis for
    diagnostic tests

18
(No Transcript)
19
The SNP Consortium is an unlikely alliance of
pharmaceutical and computer companies managed by
Lincoln Stein at Cold Spring Harbor Lab.
The SNP Consortium Ltd. is a non-profit
foundation organized for the purpose of
providing public genomic data. Its mission is to
develop up to 300,000 SNPs distributed evenly
throughout the human genome and to make the
information related to these SNPs available to
the public without intellectual property
restrictions. The project started in April 1999
and is anticipated to continue until the end of
2001.
The current release (Jan 2001) consists of
856,666 SNPs, all of which have been anchored to
the human genome by "in silico" mapping to the
genomic working draft (UCSC Golden Path).
20
We describe a map of 1.42 million single
nucleotide polymorphisms (SNPs) distributed
throughout the human genome, providing an average
density on available sequence of one SNP every
1.9 kilobases. These SNPs were primarily
discovered by two projects: The SNP Consortium
and the analysis of clone overlaps by the
International Human Genome Sequencing Consortium.
The map integrates all publicly available SNPs
with described genes and other genomic features.
We estimate that 60,000 SNPs fall within exon
(coding and untranslated regions), and 85% of
exons are within 5 kb of the nearest SNP.
Nucleotide diversity varies greatly across the
genome, in a manner broadly consistent with a
standard population genetic model of human
history. This high-density SNP map provides a
public resource for defining haplotype variation
across the genome, and should help to identify
biomedically important genes for diagnosis and
therapy.
21
Search for SNPs in your gene
  • an average density of one SNP every 1.9
    kilobases
  • But that does not guarantee a SNP in your
    favorite gene!
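
A quick way to see why: the sketch below assumes SNPs are scattered independently and uniformly (a Poisson model), which is a simplification introduced here and not a claim from the slides; the map's own abstract notes that nucleotide diversity varies greatly across the genome. The function name and the example gene lengths are illustrative.

```python
import math

def prob_at_least_one_snp(length_bp, bases_per_snp=1900.0):
    """P(>= 1 mapped SNP) in a region, under an assumed Poisson model."""
    expected = length_bp / bases_per_snp      # mean number of SNPs in the region
    return 1.0 - math.exp(-expected)          # 1 - P(zero SNPs)

for length in (500, 1900, 10_000):            # illustrative region sizes in bp
    p = prob_at_least_one_snp(length)
    print(f"{length:>6} bp region: P(at least one SNP) = {p:.2f}")
```

Under this model a 1.9 kb region has only about a 63% chance of containing a mapped SNP, so short genes can easily have none.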

22
GenBank has a dbSNP
  • As of Apr 19, 2001, dbSNP has submissions for
    2,842,021 SNPs
  • It is possible to search dbSNP by BLAST
    comparisons to a target sequence

23
>gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%), Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
Sbjct: 61   cactaaggacttaaaataaaa 81
If a matching SNP is found, then it can
be directly located on the Genome map
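
The alignment above can be checked with a short sketch. The two sequences are copied from the BLAST hit; the helper names and the small IUPAC table are illustrative additions. The letter y in the dbSNP record is the IUPAC ambiguity code for c or t, and it sits at position 51 of the subject, which is the allele position named in the hit's title line.

```python
# Subset of the IUPAC nucleotide ambiguity codes (lower case, as in the hit).
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t",
         "r": "ag", "y": "ct", "n": "acgt"}

QUERY = ("ccctcttccctgacctcccaactctaaagccaagcactttatatttttct"
         "cttagatatt"
         "cactaaggacttaaaataaaa")
SUBJECT = ("ccctcttccctgacctcccaactctaaagccaagcactttatattttcc"
           "tyttagatatt"
           "cactaaggacttaaaataaaa")

def identities(query, subject):
    """Count positions where the two aligned sequences are identical."""
    return sum(q == s for q, s in zip(query, subject))

def snp_sites(query, subject):
    """Positions (1-based, subject coordinates) where the subject holds an
    ambiguity code that is compatible with the query base."""
    return [(i + 1, q, s)
            for i, (q, s) in enumerate(zip(query, subject))
            if s not in "acgt" and q in IUPAC.get(s, "")]

print(identities(QUERY, SUBJECT), "/", len(QUERY))
print(snp_sites(QUERY, SUBJECT))
```

The exact-match count reproduces the 79/81 reported by BLAST, and the only ambiguity code falls at subject position 51, where the query carries the c allele.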
24
Gene Expression Profiling
  • Sequence bulk cDNAs from different tissues
  • NCBI CGAP website allows "digital differential
    display"
  • SAGE (sequence short tags from cDNAs)
  • Microarrays

25
Digital Differential Display
26
(No Transcript)
27
cDNA spotted microarrays
28
Link Gene Expression to Genome Sequence
  • Identify promoter and 5' sequence for a group of
    co-expressed genes.
  • Scan for known transcription factor binding
    sites.
  • Predict new regulatory sites based on common
    sequence elements.
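
The second step, scanning for known binding sites, can be sketched as a consensus search over the upstream sequences of a co-expressed group. Everything concrete here is an assumption for illustration: the gene names and promoter sequences are made up, and the TATAWAW consensus stands in for whatever site is being looked for. Real scans usually use position weight matrices instead of a strict consensus.

```python
import re

# IUPAC codes translated to regular-expression character classes (subset).
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T",
               "W": "[AT]", "S": "[CG]", "R": "[AG]",
               "Y": "[CT]", "N": "[ACGT]"}

def scan_promoters(promoters, consensus):
    """Return {gene: [0-based start positions]} for a consensus site."""
    body = "".join(IUPAC_REGEX[base] for base in consensus.upper())
    # A lookahead lets overlapping occurrences all be reported.
    pattern = re.compile("(?=" + body + ")")
    return {gene: [m.start() for m in pattern.finditer(seq.upper())]
            for gene, seq in promoters.items()}

# Hypothetical upstream sequences for three co-expressed genes.
promoters = {
    "geneA": "GGCTATAAAAGGC",
    "geneB": "GGCCGCGCGCC",
    "geneC": "TTTATATATGG",
}
print(scan_promoters(promoters, "TATAWAW"))
```

Genes in the group that share a hit (geneA and geneC in this toy data) become candidates for common regulation.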

29
Whole Genome Comparisons
  • Comparative Genomics
  • Use mouse homologs to find human genes
  • cDNAs
  • Chromosome scanning for conserved regions
  • Synteny
  • Use knockouts to define function
  • Deep homology
  • Metabolic reconstruction

30
(No Transcript)
31
Metabolic Reconstruction
  • If we know the genome sequence, and we know the
    metabolic pathways
  • Then we should be able to map genes to the
    pathways in every organism
  • WIT2 (What is There) is an attempt to do this
  • http://wit.mcs.anl.gov/WIT2/
  • How can organisms lack genes that are essential
    in related groups?
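
The idea can be sketched as a lookup from annotated gene functions to pathway steps, reporting which steps have no gene assigned. This is a toy illustration only and does not reflect how WIT2 is implemented; the pathway, role names and gene names are hypothetical placeholders.

```python
def reconstruct(pathways, genome_annotations):
    """Map annotated genes onto pathway steps.

    pathways: {pathway name: [enzyme role, ...]}
    genome_annotations: {gene name: enzyme role}
    Returns, per pathway, the roles with assigned genes and the missing roles.
    """
    genes_by_role = {}
    for gene, role in genome_annotations.items():
        genes_by_role.setdefault(role, []).append(gene)

    report = {}
    for pathway, roles in pathways.items():
        found = {r: sorted(genes_by_role[r]) for r in roles if r in genes_by_role}
        missing = [r for r in roles if r not in genes_by_role]
        report[pathway] = {"found": found, "missing": missing}
    return report

# Hypothetical three-step pathway and a hypothetical annotated genome.
pathways = {"toy pathway": ["role_A", "role_B", "role_C"]}
annotations = {"gene1": "role_A", "gene2": "role_C", "gene3": "role_A"}

print(reconstruct(pathways, annotations))
```

A missing step such as role_B is where the last question on the slide arises: the gene may be truly absent, replaced by an unrelated enzyme, or simply not yet annotated.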

32
EMP Database
Enzymes and Metabolic Pathways database
(EMP) http://emp.mcs.anl.gov/
33
(No Transcript)
34
2-Oxobutanoate--Isoleucine, 2-Oxoglutarate_Anabolism
(NADPH,_NADH)
35
(No Transcript)
36
Clusters of Orthologous Groups (COGs)
  • COGs were delineated by comparing protein
    sequences encoded in 43 complete genomes,
    representing 30 major phylogenetic lineages.
  • Each COG consists of individual proteins or
    groups of paralogs from at least 3 lineages and
    thus corresponds to an ancient conserved domain.

37
A simple COG with two yeast paralogs. YPL040c is
the yeast mitochondrial isoleucyl-tRNA
synthetase; the bacterial orthologs and that from
M. jannaschii are the BeTs (best hits) for this
yeast protein, but the reverse is true only of
the bacterial proteins. For YBL076c (yeast
cytoplasmic isoleucyl-tRNA synthetase), the M.
jannaschii ortholog is a symmetrical BeT, whereas
the bacterial genes are asymmetrical.
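
The distinction between symmetrical and asymmetrical BeTs (a protein's best hit in another genome) can be sketched as follows. The similarity scores are invented purely so that the toy data mirror the relationships described above, and Ec_IleS and Mj_IleS are placeholder names for the bacterial and M. jannaschii proteins; none of these values come from the COG database.

```python
def best_hits(scores, genome_of):
    """Return {(protein, other genome): best-scoring protein in that genome}."""
    best = {}
    for (query, target), score in scores.items():
        if genome_of[query] == genome_of[target]:
            continue                                  # skip within-genome hits
        key = (query, genome_of[target])
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: hit for key, (hit, _) in best.items()}

def symmetrical_pairs(bets, genome_of):
    """Pairs of proteins that are each other's best hit."""
    pairs = set()
    for (query, _), target in bets.items():
        if bets.get((target, genome_of[query])) == query:
            pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)

genome_of = {"YPL040c": "yeast", "YBL076c": "yeast",
             "Ec_IleS": "bacterium", "Mj_IleS": "archaeon"}

# Hypothetical similarity scores, listed in both directions.
one_way = {("YPL040c", "Ec_IleS"): 900, ("YPL040c", "Mj_IleS"): 400,
           ("YBL076c", "Ec_IleS"): 350, ("YBL076c", "Mj_IleS"): 700,
           ("Ec_IleS", "Mj_IleS"): 380}
scores = dict(one_way)
scores.update({(t, q): s for (q, t), s in one_way.items()})

bets = best_hits(scores, genome_of)
print(symmetrical_pairs(bets, genome_of))
```

In this toy data the mitochondrial protein pairs symmetrically with the bacterial one and the cytoplasmic protein with the archaeal one, while the remaining yeast hits are one-directional, which is the pattern the slide describes.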
38
(No Transcript)
39
Proteomics
  • Identify all of the proteins in an organism
  • Potentially many more than genes due to
    alternative splicing and post-translational
    modifications
  • Quantitate in different cell types and in
    response to metabolic/environmental factors
  • Protein-protein interactions

40
Protein-Protein Interactions
  • Metabolic and regulatory pathways
  • Transcription factors
  • Co-expression
  • Biochemical data
  • crosslinking
  • yeast 2-hybrid
  • affinity tagging
  • Useful feedback to genome annotation/protein
    function and gene expression

41
BIND - The Biomolecular Interaction Network
Database
42
(No Transcript)