Title: Genomics
1Genomics
2The Human Genome Project
- Mapping and Sequencing the Genomes of Model Organisms
- Data Collection and Distribution
- Ethical, Legal, and Social Considerations
- Research Training
- Technology Development
- Technology Transfer
3A Few Genome Resources
- NCBI Genome Resources
- UCSC Human Genome Browser
- Ensembl Human Genome Server
4Genome Sequencing Progress
- NCBI Genome Sequence Repository
- All organisms
- Eukaryotic genomes
- Prokaryotic genomes
- Archaea genomes
- Viruses
5Genome Sequencing
From NCBI, 5/2001
6Human Genome Sequencing 2/11/2001
From NCBI
7Human Genome Progress 2/11/2001
From NCBI
8Microbial Genomes
- Published complete microbial genomes
- Microbial genomes and chromosomes in progress
9Genome Informatics
- Annotation and Analysis
- Data Handling
- Metabolic Reconstruction
- Comparative Genomics
- Functional Genomics
10Genome Project Organization
- Cloning
- Mapping
- Sequencing
- Annotation
- Analysis
11Cloning and Mapping
12Cloning
- Large
  - YACs: 1 Mb
  - BACs: 100 - 200 Kb
- Intermediate
  - Cosmids
  - Lambda clones
- Small
  - Plasmids, M13
13Mapping
- Establishment of Guideposts
- Aids in Assembly
- Error Checking
- Useful in mapping of genetic disorders
14Genetic Maps
- Cytogenetic markers
- Linkage maps
- Polymorphic loci screened by PCR to determine inheritance patterns
- Produce linkage map with nearby loci
15Physical Maps
- Radiation Hybrid/YACs/Cosmids
- Restriction Sites
- Sequence Tagged Sites
- 100 Kb resolution needed
- 30,000 STSs
- Expressed Sequence Tags
- Detection
- PCR
- Hybridization
- FISH
- Fluorescent in situ Hybridization
16Human Genome STS Mapping Strategy
- STS Content Mapping
- Screen YACs by PCR
- Radiation Hybrid Mapping
- Screen RH Cell lines by PCR
- Genetic Mapping
- PCR Screening of polymorphic loci
- Combine above to produce an integrated map
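The STS content step above rests on one idea: two clones that are PCR-positive for the same STS must overlap. A minimal sketch of that inference follows; the clone and marker names are hypothetical, and real screening data would also need handling for false positives and chimeric YACs.

```python
from itertools import combinations

def overlapping_clones(sts_hits):
    """Map each pair of clones to the set of STS markers they share.

    sts_hits: dict of clone name -> set of STS markers that were PCR-positive.
    Only pairs sharing at least one marker are returned.
    """
    overlaps = {}
    for a, b in combinations(sorted(sts_hits), 2):
        shared = sts_hits[a] & sts_hits[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps

# Hypothetical PCR screening results for three YACs.
hits = {
    "yac1": {"sts1", "sts2"},
    "yac2": {"sts2", "sts3"},
    "yac3": {"sts3", "sts4"},
}
print(overlapping_clones(hits))
```

In this toy data yac1 and yac3 share no marker, so they are linked only through yac2, which is how a contig of ordered clones is built up.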
17Mapping Resolution
- YAC mapping
  - 1 Mb
- Radiation hybrid mapping
  - 10 Mb
- Genetic map
  - 30 Mb
18GeneMap98
- Integrated Human Genetic Map
- Over 30,000 unique gene-based markers
- 100 Kb resolution
- http://www.ncbi.nlm.nih.gov/genemap98/
19Map Integration
20Human Chromosome 1 Genetic Map
21Human Chromosome 1 Combination Map
22Sequencing
23Sequencing Methods
- Random Shotgun
- Ordered Shotgun
- Directed
- Primer Walking
- Direct genomic sequencing
24Random Shotgun Sequencing
- Randomly shear or cut DNA into small pieces
- 2-4 Kb
- Clone into M13, pUC or some other sequencing vector
- Sequence the clones from both ends
- Rely on the computer to assemble the sequences into one (or as few as possible) contigs
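The last step, assembling reads into as few contigs as possible, can be illustrated with a toy greedy overlap-merge routine. This is only a sketch under strong assumptions (error-free reads, one strand, no repeats); it is not how phrap or the TIGR Assembler work internally, and the function names are made up for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def drop_contained(seqs):
    """Remove duplicates and any sequence fully contained in another."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = drop_contained(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])

# Three hypothetical reads drawn from one short made-up sequence.
reads = ["TTAGCCTAGGA", "ATGGCGTACG", "GTACGTTAGC"]
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_len` bases stay in separate contigs, which is the situation gap closure later has to resolve.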
25Shotgun Sequencing Statistics
- Lander and Waterman equation
- Poisson distribution
- $P_0 = e^{-m}$
- $P_0$ = probability that a base is not sequenced, where $m$ = sequence coverage
26H. influenzae Sequencing
- For 1X random sequence coverage (1.8 Mb)
- $P_0 = e^{-1} = 0.37$ (63% of the bases are sequenced)
- To get > 99% of the bases sequenced
- 5X coverage: 8.74 Mb of sequence
- $P_0 = e^{-5} = 0.0067$
- This coverage would leave approx. 128 gaps of about 100 bp in size
- From Science 269:496-512, 1995
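The figures on this slide can be checked with a few lines of Python. The sketch below evaluates the Poisson term e^(-m) for 1X and 5X coverage and estimates the number of gaps as (number of reads) x e^(-m). The read count is an assumption made for illustration: it takes an average read length of about 460 bp, so that 8.74 Mb corresponds to 19,000 reads.

```python
import math

def fraction_unsequenced(coverage):
    """P_0 = e^(-m): probability that a given base lies in no read at coverage m."""
    return math.exp(-coverage)

def expected_gaps(n_reads, coverage):
    """Expected number of gaps, n_reads * e^(-m), under the same Poisson model."""
    return n_reads * math.exp(-coverage)

# Assumed for illustration: 8.74 Mb of sequence in reads of about 460 bp.
n_reads = round(8.74e6 / 460)

for m in (1, 5):
    print(f"{m}X coverage: P_0 = {fraction_unsequenced(m):.4f}")
print(f"expected gaps at 5X: {expected_gaps(n_reads, 5):.0f}")
```

Under these assumptions the estimate comes out at about 128 gaps, in line with the number quoted on the slide.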
27Ordered Sequencing
- Generate a set of large sequence clones in lambda phage
- May be subcloned from YACs or BACs as necessary
- End sequence the lambda clones and order the clones to produce a map of the genome
- Choose a minimal tiling path of the genome from the ordered lambda clones
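Choosing a minimal tiling path is an interval-covering problem once the clones have map positions. One standard approach, sketched here, is a greedy scan: among the clones that start at or before the point already covered, take the one that reaches furthest. The clone names and coordinates are hypothetical, and a real project would also require a minimum overlap between neighbouring clones.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end].

    clones: list of (name, start, end) map positions.
    Raises ValueError if the clones leave part of the region uncovered.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Consider every clone that starts within the region covered so far.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path

# Hypothetical lambda clones with positions in kb.
clones = [("A", 0, 40), ("B", 10, 35), ("C", 30, 80),
          ("D", 35, 60), ("E", 75, 120), ("F", 78, 100)]
print(minimal_tiling_path(clones, 0, 120))
```

Only the clones on the returned path would then go forward to shearing, subcloning, and shotgun sequencing.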
28Ordered Sequencing...
- Shear and subclone the lambda inserts that comprise the minimal tiling set into sequencing vectors
- Shotgun sequence and assemble each of these lambda inserts individually
- Assemble all sequences into one, contiguous genome
29Directed Sequencing
- Process used for finishing following the shotgun sequencing phase
- Gap closure
- Use specific sequencing primers to extend appropriate clones into gap regions
- Use specific sequencing primers to sequence directly from genomic DNA
30Sequence Assembly
31Assembly of Shotgun Fragments
- For H. influenzae (TIGR) 1.8 Mb
- 24,304 Sequence fragments were generated for the random assembly phase
- 11,631,485 bases
- Generated 140 contigs
- Assembled using the TIGR Assembler
- 30 hours of cpu time
32phred/phrap/consed
- Widely used programs for sequence
- base calling (phred)
- assembly (phrap)
- editing (consed)
- Developed at the University of Washington
- Phil Green (phrap)
- Brent Ewing (phred)
- David Gordon (consed)
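phred is best known for attaching a quality value to each base call, conventionally defined as Q = -10 log10(P_error). The sketch below only shows that conversion; it does not call phred itself, and the function names are made up for this example.

```python
import math

def phred_quality(p_error):
    """Quality value Q for a base call with error probability p_error."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Error probability implied by a quality value Q."""
    return 10 ** (-q / 10)

for p in (0.1, 0.01, 0.001):
    print(f"P_error = {p}: Q = {phred_quality(p):.0f}")
```

On this scale, a base call with Q = 20 has an estimated 1 in 100 chance of being wrong, and Q = 30 corresponds to 1 in 1,000.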
33Genome Annotation and Analysis
34Sequence Annotation
- ORF identification
- Frameshift resolution
- Genome map construction
- Functional assignments
- Metabolic pathway assignment
- Metabolic pathway Reconstruction
- Comparative analysis
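The first item, ORF identification, can be sketched as a scan of all six reading frames for ATG...stop stretches above a length cutoff. This is a bare-bones illustration, not a gene finder: the cutoff is arbitrary, and the stop codon set assumes the standard genetic code. Mycoplasmas and ureaplasmas read TGA as tryptophan, so for those genomes TGA would have to be removed from `STOP_CODONS`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard code; drop TGA for mycoplasmas
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    start and end are 0-based, end-exclusive positions on the scanned strand
    (the reverse complement for strand "-"); end includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs

# Tiny made-up sequence; the low cutoff is only so the example finds something.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Frameshift resolution, the next item on the list, matters here because a single miscalled base can split one ORF into two shorter ones in different frames.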
36Annotation Tools
37MAGPIE
- Multipurpose Automated Genome Project Investigation Environment
- Terry Gaasterland et al.
- http://genomes.rockefeller.edu/magpie/magpie.html
- Automated
- Semi-automated analysis tool for microbial genome projects
38MAGPIE Example
39Non-Automated Analysis and Prediction
- The Ureaplasma urealyticum genome database
- Run analysis tool
- Parse results
- Dump results into the database
- View results
- Manually annotate
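The run, parse, dump, view, annotate cycle above can be sketched with Python's built-in sqlite3 module. Everything specific here is an assumption for illustration: the tab-delimited result format, the gene names, and the table layout are made up and do not describe the actual Ureaplasma database.

```python
import sqlite3

# Hypothetical tab-delimited output of an analysis tool: gene, best hit, E-value.
RAW_RESULTS = """\
geneA\tDNA polymerase III beta subunit\t1e-80
geneB\thypothetical protein\t2e-05
geneC\turease alpha subunit\t1e-150
"""

def parse_results(text):
    """Yield (gene, description, evalue) tuples from tab-delimited text."""
    for line in text.splitlines():
        if line.strip():
            gene, description, evalue = line.split("\t")
            yield gene, description, float(evalue)

def load_results(conn, rows):
    """Dump parsed results into a table, leaving room for a manual note."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(gene TEXT, description TEXT, evalue REAL, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO hits (gene, description, evalue) VALUES (?, ?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_results(conn, parse_results(RAW_RESULTS))

# Manual annotation step: a curator adds a note to one record.
conn.execute("UPDATE hits SET note = ? WHERE gene = ?", ("checked by curator", "geneC"))

# View results.
for row in conn.execute("SELECT gene, description, evalue, note FROM hits ORDER BY gene"):
    print(row)
```

Keeping the automated hits and the curator's note in separate columns makes it possible to rerun the analysis later without overwriting manual annotation.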
40Genomic Sequence Database
- Data Storage
- Sequence
- Gene Map
- Annotation
- User Interface
- Web browser
- Customizable
41The Ureaplasma urealyticum Genome Project
- Uu - 751,719 bp
- http://genome.microbio.uab.edu/uu/uugen.htm
- Web-based genome analysis tool
46Annotation Problems
- Problems with existing sequence databases
- Incomplete datasets
- Skewed datasets
- Incorrectly annotated records
- Annotations based on experimental vs. predicted data
- Nomenclature differences
- Transitive errors in gene function predictions
- Functional predictions for hypothetical genes
47Metabolic Pathway Reconstruction
48Metabolic Pathway Reconstruction
- Role assignment
- Extract metabolic pathways from genomes
- Navigation and analysis
- Pathway editing
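Extracting a pathway from a genome amounts to comparing the enzymes a pathway requires against the roles assigned during annotation. A minimal sketch follows, using the upper steps of glycolysis as the pathway definition. The annotated set is invented for illustration and is not the result for any real genome.

```python
def pathway_completeness(pathway_enzymes, annotated_enzymes):
    """Split a pathway's enzymes into those found and those missing."""
    present = [e for e in pathway_enzymes if e in annotated_enzymes]
    missing = [e for e in pathway_enzymes if e not in annotated_enzymes]
    return present, missing

# Pathway definition: enzymes in reaction order.
GLYCOLYSIS_UPPER = [
    "phosphoglucomutase",
    "phosphoglucose isomerase",
    "6-phosphofructokinase",
    "fructose bisphosphate aldolase",
    "glyceraldehyde-3-phosphate dehydrogenase",
    "phosphoglycerate kinase",
]

# Hypothetical set of roles assigned during annotation.
annotated = {
    "phosphoglucose isomerase",
    "6-phosphofructokinase",
    "glyceraldehyde-3-phosphate dehydrogenase",
    "phosphoglycerate kinase",
}

present, missing = pathway_completeness(GLYCOLYSIS_UPPER, annotated)
print(f"{len(present)}/{len(GLYCOLYSIS_UPPER)} enzymes present; missing: {missing}")
```

A missing enzyme is a prompt for a closer look rather than a conclusion: the gene may be absent, unrecognized among the hypothetical genes, or replaced by an alternative route.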
49Metabolic Assignments
- Amino acid Biosynthesis
- Biosynthesis of cofactors, prosthetic groups, and carriers
- Cell envelope
- Cellular processes
- Central intermediary metabolism
- Energy metabolism
- Fatty acid and phospholipid metabolism
- Purines, pyrimidines, nucleosides, and nucleotides
- Regulatory functions
- Replication
- Transcription
- Translation
- Transport and binding proteins
- Other categories, Unassigned
- Hypothetical
50Ureaplasma urealyticum Gene Map
[Figure: gene map of the genome drawn in 50,000 bp segments, from 1 to 751,719. Legend categories: Other, Cofactor Biosynthesis, Energy Metabolism, Replication, Cell envelope, Fatty Acid Metabolism, Transcription, RNA, Cellular processes, Hypothetical, Translation, Central Intermediary Metabolism, Nucleotide Metabolism, Transport, tRNA]
51Uu Genes and Mg Genes by Role
Role                            Uu Genes      % Mg Genes      %
Amino acid Biosynthesis                1    0.2        0    0.0
Biosynthesis of cofactors             10    1.7        7    1.5
Cell envelope                         19    3.1       26    5.4
Cellular processes                    13    2.1       15    3.1
Central intermediary metabolism       15    2.5        7    1.5
Energy metabolism                     23    3.8       30    6.3
Fatty acid - phospholipids             6    1.0        7    1.5
Hypothetical                         293   48.3      169   35.3
Other categories                       1    0.2        3    0.6
Purines, pyrimidines                  18    3.0       20    4.2
Regulatory functions                   4    0.7        4    0.8
Replication                           45    7.4       31    6.5
Transcription                         17    2.8       19    4.0
Translation                          100   16.5       99   20.7
Transport and binding proteins        37    6.1       35    7.3
Unassigned                             4    0.7        7    1.5
Total                                606  100.0      479  100.0
52EcoCyc
- Peter D. Karp, PhD
- SRI International
- Menlo Park, CA
- http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
53Pathway Reconstruction
Cell
Annotated Genome
Adapted from P. Karp, Pangea Systems
61Glycolysis in Uu?
glucose-1-phosphate
?
phosphoglucomutase
glucose-6-phosphate
phosphoglucose isomerase
fructose-6-phosphate
6-phosphofructokinase
fructose-1,6-bisphosphate
fructose bisphosphate aldolase
glyceraldehyde-3-phosphate
glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12)
glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.9)
3-phospho-D-glyceroyl-phosphate
phosphoglycerate kinase
pyruvate
3-phosphoglycerate
62Uu Energy Metabolism
- Glycolysis
- Missing several components
- Pentose-phosphate pathway
- Only 2/8 enzyme complexes present
- Proton motive force - ATP synthase complex
- Urease Gene Complex
- Biologically relevant
63Comparative Genomics
- What makes one organism different from all other organisms?
- Molecular Biology
- Physiology
- Pathogenesis
- Epidemiology
- Genetics
64Ortholog Comparisons
- Uu to Mg genes: 324
- 53% of Uu, 67% of Mg
- 71 hypothetical
- Mh to Mg genes: 314
- 41% of Mh, 57% of Mg
- 55 hypothetical (2 unique hypothetical)
- Mh to Uu genes: 330
- 47% of Uu, 43% of Mh
- 82 hypothetical (19 unique hypothetical)
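The slide does not say how the ortholog pairs were identified. One common approach, shown here purely as an illustration and not as the method behind these counts, is the reciprocal best hit: gene a in genome A and gene b in genome B are paired when each is the other's best match. The gene names are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    best_a_to_b: dict of gene in genome A -> its best-scoring match in genome B.
    best_b_to_a: the same in the opposite direction.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )

# Hypothetical best hits from two similarity searches.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a2", "b3": "a3"}

pairs = reciprocal_best_hits(a_to_b, b_to_a)
print(pairs)
print(f"{100 * len(pairs) / len(a_to_b):.0f}% of genome A genes have an ortholog")
```

Dividing the number of pairs by each genome's gene count gives the kind of per-genome percentage listed above.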
65M. genitalium - M. pneumoniae Gene Order
M. genitalium Gene Position
M. pneumoniae Gene Position
66M. genitalium - U. urealyticum Gene Order
M. genitalium Gene Position
U. urealyticum Gene Position
67Paralog Analysis
- Identification of conserved, paralogous groups
- All against All comparison
- Genes within one organism
- Identifies groups of related genes
- Primary sequence
- Structure
- Function
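Once the all-against-all comparison has produced a list of similar gene pairs, grouping them into paralogous families is a clustering step. The sketch below uses single-linkage clustering with a union-find structure, which is one simple choice among several; the gene names are hypothetical and the similarity cutoff used to call a pair is assumed to have been applied already.

```python
def cluster_genes(pairs, min_size=2):
    """Group genes into clusters connected by pairwise similarity.

    pairs: iterable of (gene, gene) hits from an all-against-all comparison.
    Returns clusters of at least min_size genes, largest first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    kept = [c for c in clusters.values() if len(c) >= min_size]
    return sorted(kept, key=lambda c: (-len(c), sorted(c)))

# Hypothetical similar pairs within one genome.
hits = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
print(cluster_genes(hits))
print(cluster_genes(hits, min_size=4))
```

Single linkage can chain loosely related genes into one family through shared domains, so large clusters deserve a manual look.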
68Uu Paralogous Clusters > 3
- 4 tRNA synthetase
- 4 Translation factors
- 4 Hypothetical membrane lipoprotein
- 5 ATP synthase alpha, beta chains
- 6 MBA
- 7 Hypothetical membrane lipoprotein
- 8 Hypothetical
- 10 Iron transporters
- 13 Transporters
69Functional Genomics
- Gene Expression
- Gene Regulation
- Genome-wide Mutagenesis
70Expression Arrays
- Cell growth in different environments
- Isolate cDNAs
- Measure expression using array technology
- Create database of expression information
- Display information in an easy-to-use format
- Show ratio of expression under different
conditions
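The ratio in the last step is usually shown on a log2 scale, so that a twofold increase and a twofold decrease are equally far from zero. A minimal sketch, with made-up gene names and signal values; the pseudocount is an assumption added here to avoid dividing by zero.

```python
import math

def log2_ratios(condition_a, condition_b, pseudocount=1.0):
    """log2(B / A) for every gene measured under both conditions."""
    shared = condition_a.keys() & condition_b.keys()
    return {
        gene: math.log2(
            (condition_b[gene] + pseudocount) / (condition_a[gene] + pseudocount)
        )
        for gene in shared
    }

# Hypothetical array signal for two growth conditions.
rich_medium = {"g1": 99.0, "g2": 49.0, "g3": 199.0}
minimal_medium = {"g1": 399.0, "g2": 49.0, "g3": 49.0}

for gene, ratio in sorted(log2_ratios(rich_medium, minimal_medium).items()):
    print(gene, round(ratio, 2))
```

Positive values mean higher expression in the second condition, negative values lower, and zero no change.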
71Putting it all together
72From F. Blattner, U. Wisc.
73Chromosome Views
- Ensembl view
- UC Santa Cruz view
- NCBI View
77A Final Caveat
- The difficulty of identifying genes in anonymous vertebrate sequences
- Claverie JM, Poirot O, Lopez F
- Comput Chem 1997;21(4):203-14
78The identification of genes in newly determined
vertebrate genomic sequences can range from a
trivial to an impossible task. In a statistical
preamble, we show how "insignificant" are the
individual features on which gene identification
can be rigorously based: promoter signals, splice
sites, open reading frames, etc. The practical
identification of genes is thus ultimately a
tributary of their resemblance to those already
present in sequence databases, or incorporated
into training sets. The inherent conservatism of
the currently popular methods (database
similarity search, GRAIL) will greatly limit our
capacity for making unexpected biological
discoveries from increasingly abundant genomic
data. Beyond a very limited subset of trivial
cases, the automated interpretation (i.e. without
experimental validation) of genomic data, is
still a myth. On the other hand, characterizing
the 60,000 to 100,000 genes thought to be hidden
in the human genome by the mean of individual
experiments is not feasible. Thus, it appears
that our only hope of turning genome data into
genome information must rely on drastic
progresses in the way we identify and analyze
genes in silico.
79Only One Final Word of Wisdom...
- ...although the computer is a wonderful helpmate
for the sequence searcher and comparer,
biochemists and molecular biologists must guard
against the blind acceptance of any algorithmic
output: given the choice, think like a biologist
and not a statistician.
- Russell F. Doolittle, 1990