Title: Human Genome Lecture
1Human Genome Lecture
- Historical aspects of the HGP
- EST sequencing
- Finding new genes faster than ever
- Using 3' ESTs to generate human gene maps
- First comprehensive genome-wide human gene maps
- Sequence of human genome
- Complex genomic regions and sequence limitations
2Key pre-HGP scientific advances
- Structure of DNA determined (1953)
- Watson and Crick
- Recombinant DNA created (1972)
- P. Berg; Cohen and Boyer
- Methods for DNA sequencing developed (1977)
- Maxam and Gilbert; F. Sanger
- PCR invented (1985)
- K. Mullis
- Automated DNA sequencer developed (1986)
- L. Hood
3Obstacles to formation of the HGP
- 1) Financial/political: "Big biology is bad biology"
- -departure from cottage industry culture of biology
- -devoid of hypothesis-driven research
- -what will it cost?
- -will it take away from other programs?
- 2) Why sequence the "junk"?
- -protein-coding regions make up <1.5% of the genome
- -waste of time/money to sequence repetitive, hard-to-sequence regions
- 3) It is impossible to do
- -mid-1980s
- -primitive sequencing capabilities (500 bp/day/lab)
- -primitive computer capabilities/bioinformatics resources
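A back-of-the-envelope sketch of why the project looked impossible at 500 bp/day/lab. It assumes the roughly 3 billion base genome size quoted elsewhere in this lecture and a single pass over the genome with no redundancy. The lab count in the example is hypothetical.

```python
GENOME_BP = 3_000_000_000   # ~3 billion bases (assumed genome size)
BP_PER_DAY_PER_LAB = 500    # mid-1980s throughput quoted on the slide


def lab_years(genome_bp=GENOME_BP, rate=BP_PER_DAY_PER_LAB, coverage=1.0):
    """Lab-years needed to read the genome `coverage` times at `rate` bp/day/lab."""
    lab_days = genome_bp * coverage / rate
    return lab_days / 365


def calendar_years(n_labs, **kwargs):
    """Calendar years if the work is split evenly across `n_labs` labs."""
    return lab_years(**kwargs) / n_labs


print(round(lab_years()))              # about 16,438 lab-years for a single pass
print(round(calendar_years(1000), 1))  # about 16.4 years with 1,000 (hypothetical) labs
```

Any realistic redundancy (setting `coverage` above 1) scales these figures up proportionally.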
4Significance of the HGP
- "The book of life", "The grail of human biology", "Code of codes"
- The instructions to create a human being
- The genome is a product of evolution
- - molecular replicator (DNA) + heritable variation + time + changing environment = genome
- - record of the evolutionary history of our species
- Comparative genomics: the genes that make us human
- The genome: unparalleled system of information storage
- - 70 trillion cells in human body
- - each cell stores 3 billion units of information
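The storage figures above can be turned into rough numbers. This sketch takes the slide's values at face value (70 trillion cells, 3 billion units per cell) and assumes each unit is one of four bases, so it can be encoded in 2 bits. The 2-bit encoding is an assumption for illustration, not something the lecture states.

```python
CELLS_IN_BODY = 70_000_000_000_000   # 70 trillion, as quoted on the slide
UNITS_PER_CELL = 3_000_000_000       # 3 billion bases, as quoted on the slide
BITS_PER_BASE = 2                    # assumption: 4 possible bases -> 2 bits

bytes_per_genome = UNITS_PER_CELL * BITS_PER_BASE // 8
total_bytes = bytes_per_genome * CELLS_IN_BODY

print(bytes_per_genome / 1e6, "MB per genome copy")
print(total_bytes / 1e21, "ZB across all cells (highly redundant copies)")
```

Nearly all of that total is the same sequence repeated in every cell, so the non-redundant content stays at the per-genome figure.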
5Significance of the HGP (cont)
- Biology in the 21st century
- - equivalent of learning to read a new language
- The genome as dynamic, not static
- - perspective on past/future of the species
- Implications for health and disease
- -Genetic disease gene discovery: single-gene diseases, multifactorial diseases
- -DNA-based diagnostics
- -New drug targets
- -Gene therapy implications
- -Therapeutic uses vs. enhancements
- Accumulation of a molecular "parts list" of human physiology and anatomy
- - Lander: Periodic Table of the Elements analogy
7Genomics Timelines
8Rapid Gene Identification and Mapping: ESTs and Gene-based STSs
- Single-pass sequencing of randomly selected cDNA clones
- Obtain sequences from 5' and 3' ends of cDNA inserts
- Rapidly and cheaply identify human genes
- Alzheimer's gene discovered by EST database search
- 3' UT (untranslated) sequence ideal for STS development and PCR-based gene mapping
- Readily scaled up for development of most comprehensive human gene maps (Science 1996, 1998)
13One gene one STS
- Gene-based STSs as the basis for a human gene map
- Berry et al., Nature Genetics 1995
- ESTablishing a human transcript map
- Boguski and Schuler, Nature Genetics 1995
15Boguski and Schuler, Nat. Genet. 1995
18Size and gene content of the 24 human
chromosomes. A, Size of each human chromosome,
in millions of base pairs (1 million base pairs =
1 Mb). Chromosomes are ordered left to right by
size. B, Number of genes identified on each human
chromosome. Chromosomes are ordered left to right
by gene content. (Based on www.ensembl.org, v36.)
20Genomic sequencing vs EST sequencing
- EST (single-pass cDNA) sequencing
- - very fast but not error-free (e.g. 99% accuracy)
- - very rapid gene identification (reliance on mRNA)
- - cDNA abundance influences coverage
- - some genes will be missed
- - normalized cDNA libraries improve coverage
- - provides a gene expression profile
- Genomic sequencing
- - pre-2001: much slower method for gene finding
- - must do gene ID by computer prediction
- - will generate complete gene and genome information, e.g. introns, regulatory regions, intergenic regions, repeats, etc.
- - more expensive way to ID genes
- - independent of gene expression level concerns
- - highly accurate when complete
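A minimal sketch of why cDNA abundance influences EST coverage and why normalization helps. It assumes each EST is an independent random draw from the library, so a transcript making up a fraction f of the clones is missed after N ESTs with probability (1 - f)^N. The two libraries below are hypothetical, and "normalized" is idealized as perfectly equal abundance.

```python
def prob_missed(fraction, n_ests):
    """Chance that a transcript at this library fraction is never sampled."""
    return (1.0 - fraction) ** n_ests


def expected_genes_found(fractions, n_ests):
    """Expected number of distinct genes seen at least once in n_ests draws."""
    return sum(1.0 - prob_missed(f, n_ests) for f in fractions)


# Hypothetical skewed library: 10 abundant genes take half of all clones,
# and 990 rare genes share the other half.
skewed = [0.05] * 10 + [0.5 / 990] * 990

# Hypothetical idealized normalized library: all 1,000 genes equally represented.
normalized = [1.0 / 1000] * 1000

n = 2000  # number of single-pass ESTs sequenced
print(expected_genes_found(skewed, n))
print(expected_genes_found(normalized, n))
```

With the same sequencing effort, the normalized library is expected to reveal more distinct genes, because fewer reads are spent re-sampling the abundant transcripts.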
26Significant findings arising from analysis of the
draft sequence of the human genome
- The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.
- There appear to be about 30,000-40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.
- The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
- Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.
- Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.
- The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.
- Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.
- The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
- Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.
- Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.
- More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.
27Patterns of intrachromosomal and interchromosomal
duplication in the human genome
Bailey et al., Science, 2002
28Distribution of >50 kb gaps in HapMap phase 1 - CEU
[Figure: HapMap phase 1 chromosome lengths, marking each >50 kb gap between SNPs (excluding centromere gaps and heterochromatin). Source: T. Hudson]
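A minimal sketch of how gaps like those in the figure could be found, assuming SNP coordinates for one chromosome are available as a plain list of base-pair positions. The positions below are made up. The sketch does not mask centromeres or heterochromatin, which the figure excludes.

```python
def find_gaps(positions, min_gap=50_000):
    """Return (left, right) pairs of adjacent SNPs more than min_gap bp apart."""
    ordered = sorted(positions)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > min_gap]


# Hypothetical SNP positions (bp) on a single chromosome.
snps = [100, 20_000, 90_000, 95_000, 300_000]
for left, right in find_gaps(snps):
    print(f"gap of {right - left} bp between {left} and {right}")
```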
29Bailey et al., Science, 2002
30Genome Structural Variation
- Broadest sense: all changes in the genome not due to single base-pair substitutions
- Copy number variations (CNVs)
- CNV loci may cover 12% of genome
- Insertions/Deletions (indels)
- e.g. Repeats: STRs, VNTRs
- Inversions
- Duplications and translocations
31Limitations of Genome Sequencing
- Next-gen sequencers are short-read
- Repeated/duplicated sequences often can't be positioned
- Segmental duplications make up 5% of genome
- >95% identity, >20 kb
- Smaller-size, highly duplicated sequence families exist
- Complex, duplication-rich regions
- >200 gaps (>50 kb each) in human genome
- Difficult to accurately assemble
- Linked to many human diseases
- Linked to evolutionary adaptation
- Location of "missing heritability" of GWAS?
- Are critical regions of the genome being missed/ignored?
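A toy illustration of why short reads from duplicated sequence often cannot be positioned. It uses exact string matching on a made-up reference containing two identical copies of a repeat. Real aligners tolerate mismatches and score alignments, so this is a simplification rather than how any particular mapper works.

```python
def placements(reference, read):
    """Return every start position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


# Made-up reference: unique flanks around two identical copies of a repeat.
repeat = "GATTACAGATTC"
reference = "TTGACCA" + repeat + "CCGTAAG" + repeat + "AGGCTTA"

short_read = "TTACAGAT"       # lies entirely inside the repeat
long_read = reference[4:21]   # spans the first copy plus unique flank on both sides

print(placements(reference, short_read))  # two equally good positions
print(placements(reference, long_read))   # one position
```

The short read fits both copies equally well, so its origin is ambiguous. Only a read long enough to reach unique flanking sequence can be anchored to a single copy.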
32Limitations of next-generation genome sequence assembly
Can Alkan, Saba Sajjadian and Evan E. Eichler
Nature Methods, Volume 8, Pages 61-65 (2011). DOI: 10.1038/nmeth.1527
Published online 21 November 2010
Abstract: High-throughput sequencing
technologies promise to transform the fields of
genetics and comparative biology by delivering
tens of thousands of genomes in the near future.
Although it is feasible to construct de novo
genome assemblies in a few months, there has been
relatively little attention to what is lost by
sole application of short sequence reads. We
compared the recent de novo assemblies using the
short oligonucleotide analysis package (SOAP),
generated from the genomes of a Han Chinese
individual and a Yoruban individual, to
experimentally validated genomic features. We
found that de novo assemblies were 16.2% shorter
than the reference genome and that 420.2 megabase
pairs of common repeats and 99.1% of validated
duplicated sequences were missing from the
genome. Consequently, over 2,377 coding exons
were completely missing. We conclude that
high-quality sequencing approaches must be
considered in conjunction with high-throughput
sequencing for comparative genomics analyses and
studies of genome evolution.