Title: The Human Genome
1The Human Genome: What's in it? How do we know?
Gary Benson
Department of Computer Science, Department of Biology,
Program in Bioinformatics
Boston University
2Outline of Talk
- Protein Genes
- SNPs
- Haplotypes
- Finding a Disease Locus
3Size of the Genomes
bacteria
yeast
round worm
fruit fly
flowering plant
4The Human Genome
5What the letters stand for
- DNA has four chemical subunits, called nucleotide
bases, abbreviated A, C, G, T.
- GATTACA
http://en.wikipedia.org/wiki/Nucleotide
6What's in the Genome?
- Chromosomes: 23 pairs
- Genes
- Protein genes
- RNA genes
- MicroRNA genes
- Repeats
- Tandem repeats
- Inverted repeats
- Transposons
- Segmental duplications
- Regulatory regions
- Promoters
- Transcription factor binding sites
7Protein Genes
- A protein gene contains the genetic code for a
protein. The production of protein involves
transcription (copying DNA to RNA) and
translation (using RNA code to produce a
protein).
http://www.slic2.wsu.edu:82/hurlbert/micro101/images/TransTranscrip.gif
8Transcription
Translation
http://nobelprize.org/medicine/educational/dna/a/translation/polysome_em.html
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/M/Miller_Beatty3.jpg
9Finding Protein Genes
- Before the sequencing of genomes, protein genes
were found experimentally. Now, new genes are
predicted computationally using a gene model.
14Building a Gene Model
- Gene models for prediction are based on the
structure of genes in DNA and their messenger
RNAs (mRNAs). This includes exons, introns,
promoters, and the polyadenylation signal.
http://xray.bmc.uu.se/Courses/Bke2/Exercises/Exercise_answers/pre_mRNA_processing.gif
15Exons
- In this example, EXONS are uppercase and introns
are lowercase. Exons contain the code for a
protein; introns interrupt the exons. Before
translation, introns are removed from the
messenger RNA.
- DNA
- ACTGCTACAGtctattgaGAACAACATAGtcacgaacttaacgtgcaGTTTAACAGCACGtctcgaagggca
- RNA (before removal of introns)
- ACUGCUACAGucuauugaGAACAACAUAGucacgaacuuaacgugcaGUUUAACAGCACGucucgaagggca
- RNA (after removal of introns)
- ACUGCUACAGGAACAACAUAGGUUUAACAGCACG
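The two steps on this slide can be reproduced with a short Python sketch. It relies on the slide's own convention that lowercase letters mark the parts to be removed; the function names `transcribe` and `splice` are illustrative only, and real splicing is guided by sequence signals rather than letter case.

```python
def transcribe(dna: str) -> str:
    """Copy DNA to RNA by replacing T with U, preserving the slide's letter case."""
    return dna.replace("T", "U").replace("t", "u")


def splice(pre_mrna: str) -> str:
    """Drop every lowercase letter (the slide's marking for non-exon sequence)."""
    return "".join(base for base in pre_mrna if base.isupper())


dna = "ACTGCTACAGtctattgaGAACAACATAGtcacgaacttaacgtgcaGTTTAACAGCACGtctcgaagggca"
pre_mrna = transcribe(dna)   # RNA before removal of introns
mrna = splice(pre_mrna)      # RNA after removal of introns
print(pre_mrna)
print(mrna)
```

Running it prints the same two RNA strings shown on the slide.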
16Finding Exons
- The sequence of an exon contains codons. Each
codon is a triplet of nucleotides which codes for
a single amino acid. Amino acids are the building
blocks of a protein.
http://en.wikipedia.org/wiki/Genetic_code
17Genetic Code
- Each of the 64 codons either specifies one of
twenty amino acids or is a stop codon. The three
stop codons (UAA, UAG, UGA) specify the end of
translation.
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/code.gif
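As a rough illustration of how the code table is used, the sketch below builds the standard genetic code as a Python dictionary and translates an RNA string codon by codon until it reaches a stop codon. The names `CODON_TABLE` and `translate` and the example sequence are made up for this illustration; one-letter amino acid abbreviations are used, with `*` standing for a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order at each codon position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate codons from the first base onward, stopping at the first stop codon."""
    rna = rna.upper()
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical sequence: Ala, Thr, Glu, Leu, Arg, Ser, then a stop codon (UAG).
example = "GCUACGGAGCUUCGGAGCUAGGGG"
print(translate(example))  # ATELRS
```

Translation here always starts at the first base; choosing the correct starting position (the reading frame) is a separate problem.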
18Open Reading Frame (ORF)
- An open reading frame (ORF) is a sequence of
codons that does not contain a stop codon.
alanine threonine glutamic acid leucine arginine
serine STOP!
http://en.wikipedia.org/wiki/Genetic_code
19Finding Exons
- Sequence
- acggacucuagccuaaugugacgacugacauagguaaauucgcuc
- Even though this sequence contains stop codons,
they are not present in all reading frames.
- frame 1
- acg gac ucu agc cua aug uga cga cug aca uag gua aau ucg cuc
- frame 2
- a cgg acu cua gcc uaa ugu gac gac uga cau agg uaa auu cgc uc
- frame 3
- ac gga cuc uag ccu aau gug acg acu gac aua ggu aaa uuc gcu c
- Very short ORFs are unlikely.
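The frame-by-frame reading above can be checked with a small Python sketch. The helper names `stop_positions` and `longest_open_run` are illustrative; frames are numbered 0, 1, 2 in the code, corresponding to frames 1, 2, 3 on the slide, and positions are 0-based offsets into the sequence.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def stop_positions(rna: str, frame: int) -> list:
    """Return the 0-based start position of every stop codon in the given frame."""
    rna = rna.upper()
    return [i for i in range(frame, len(rna) - 2, 3) if rna[i:i + 3] in STOP_CODONS]


def longest_open_run(rna: str, frame: int) -> int:
    """Return the length, in codons, of the longest stop-free stretch in the frame."""
    rna = rna.upper()
    best = current = 0
    for i in range(frame, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            current = 0
        else:
            current += 1
            best = max(best, current)
    return best


sequence = "acggacucuagccuaaugugacgacugacauagguaaauucgcuc"
for frame in range(3):
    print(frame + 1, stop_positions(sequence, frame), longest_open_run(sequence, frame))
```

For this sequence every frame has at least one stop codon, but a stop in one frame is never a stop in another, and frame 3 has the longest stop-free stretch (11 codons).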
20Finding Introns
- Introns usually start with GT (GU in the RNA)
and end with AG.
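A minimal sketch of how this rule might be used to list candidate introns in an RNA sequence: every stretch that begins with GU and ends with AG is a candidate. The function names, the `min_length` parameter, and the example sequence are all hypothetical; real gene finders score much richer splice-site signals.

```python
def candidate_introns(rna: str, min_length: int = 4) -> list:
    """Return (start, end) pairs, end exclusive, for stretches that start with GU and end with AG."""
    rna = rna.upper()
    starts = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    ends = [j + 2 for j in range(len(rna) - 1) if rna[j:j + 2] == "AG"]
    return [(s, e) for s in starts for e in ends if e - s >= min_length]


def remove_intron(rna: str, start: int, end: int) -> str:
    """Join the sequence on either side of the candidate intron."""
    return rna[:start] + rna[end:]


# Hypothetical pre-mRNA: exon AUGGCC, intron GUAAGUCCAG, exon GAUUAA.
pre_mrna = "AUGGCCGUAAGUCCAGGAUUAA"
print(candidate_introns(pre_mrna))
print(remove_intron(pre_mrna, 6, 16))
```

Even this short sequence yields three candidates, which is why the GU/AG rule alone is not enough and has to be combined with other evidence such as open reading frames.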
21Finding Exons
- Sequence
- acggacucuagccuaaugugacgacugacauagguaaauucgcuc
- A gene can contain open reading frames connected
across stop codons by an intron.
- frame 1
- acg gac ucu agc cua aug uga cga cug aca uag gua aau ucg cuc
- frame 3
- ac gga cuc uag ccu aau gug acg acu gac aua ggu aaa uuc gcu c
22How many genes are there?
- Estimates
- pre-2000: 100,000, based on estimates of the
number of genes required to account for human
complexity
- 2001: 30,000 to 40,000, based on the first draft
of the human genome
- 2003: 23,000 to 24,500, based on gene prediction
computer programs
- Why so low?
- alternative splicing of exons
- complex regulatory mechanisms
- inability to predict genes which are unlike
those seen before
http://www.ornl.gov/sci/techresources/Human_Genome/faq/genenumber.shtml
23RNA Genes
- RNA genes do not code for proteins. Instead, the
RNA molecule itself is functional in the cell.
- Examples include:
- Ribosomal RNA: these molecules form the major
component of the protein-building machinery
- Transfer RNA: works with ribosomal RNA to insert
the correct amino acids into growing proteins
- MicroRNA: a newly discovered class of RNA which
helps regulate gene expression.
24Ribosome
http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/RNA/images/fig_rna12.jpg
26RNA Genes
- MicroRNAs are short and show little or no
conservation of sequence.
- Unlike protein genes, RNA genes do not contain
codons or open reading frames. But they do
contain inverted repeats.
27Inverted Repeats (IRs)
- RNA
- G A C U U G A
U C A A G U C
reversed
complemented
Two patterns, one the reverse complement of the
other
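The relationship between the two patterns can be checked with a few lines of Python. This is only a sketch for exact inverted repeats; the names `reverse_complement` and `is_inverted_repeat` are illustrative, and approximate repeats with mismatches would need an alignment-based method instead.

```python
# RNA base pairing: A with U, C with G.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Complement every base, then reverse the result."""
    return rna.upper().translate(RNA_COMPLEMENT)[::-1]


def is_inverted_repeat(left_arm: str, right_arm: str) -> bool:
    """True when the right arm is exactly the reverse complement of the left arm."""
    return reverse_complement(left_arm) == right_arm.upper()


print(reverse_complement("GACUUGA"))             # UCAAGUC
print(is_inverted_repeat("GACUUGA", "UCAAGUC"))  # True
```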
28IR Nomenclature
RNA G A C U U G A
U C A A G U C
Right arm
Left arm
Spacer
29Stem-Loop Structure
Structure forms by pairing of complementary bases
30MicroRNA
- MicroRNAs come from a precursor that contains a
stem-loop.
http://www.ma.uni-heidelberg.de/apps/zmf/argonaute/interface/mirna.jpeg
31Detection of Approximate Inverted Repeats
- Human Chr. 3 173,291,101
- AAGACTTGAA CAACTTTTAA ACATAAGATC AATTATTTCA
AGTAGATTCC CTTTTTCATT CACAATCACA TTCTCACAGA
CACAGTCCCA GTTTCTACCT GACTGAGATG CAGTAAGGAA
TCTGATTATA ACACTCATTG ATTATAACAC TCATTGAATT
TATGGATTCC TTACTGCATC TCATTCAGGT AGAAAAAGGG
ACTGTGTCTG TGAGAATGTG ATTGTGAATG AAAAAGATGG
AATATGTGTA TTTTTGAGTG TCTATGGAAG AGCTTCTGAC
AAGAGAGAGG AAGATTAGGT AAAATGAAAT ATCGCCGTCG
GCATTTCCCC CTACGT
Arms are 72 nt long; the spacer is 42 bp long.
33The Problem: Find the Inverted Repeat
- Human Chr. 3 173,291,101
- AAGACTTGAA CAACTTTTAA ACATAAGATC AATTATTTCA
AGTAGATTCC CTTTTTCATT CACAATCACA TTCTCACAGA
CACAGTCCCA GTTTCTACCT GACTGAGATG CAGTAAGGAA
TCTGATTATA ACACTCATTG ATTATAACAC TCATTGAATT
TATGGATTCC TTACTGCATC TCATTCAGGT AGAAAAAGGG
ACTGTGTCTG TGAGAATGTG ATTGTGAATG AAAAAGATGG
AATATGTGTA TTTTTGAGTG TCTATGGAAG AGCTTCTGAC
AAGAGAGAGG AAGATTAGGT AAAATGAAAT ATCGCCGTCG
GCATTTCCCC CTACGT
34Single Nucleotide Polymorphisms (SNPs)
- A SNP is a single position in the genome (a
locus) that is not the same in all people. Some
people have one nucleotide and other people have
a different nucleotide. Differences in the
population at a single locus are called
polymorphisms and the individual types are called
alleles.
- SNPs are found experimentally.
a c g t t a t t
a c a t t c c t
SNPs
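For two aligned sequences like the pair on this slide, the differing positions can be listed with a one-line comparison. The sketch below is only illustrative (the name `differing_positions` is made up, and positions are 0-based); in practice a SNP is established from many individuals, not from a single pair.

```python
def differing_positions(seq_a: str, seq_b: str) -> list:
    """Return the 0-based positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


first = "acgttatt"
second = "acattcct"
print(differing_positions(first, second))  # [2, 5, 6]
```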
35Haplotypes
- A haplotype is a collection of SNP alleles on a
single chromosome in an individual.
- Shown are SNPs on two chromosomes in each
individual.
a c g t t c a t
a c a t t c a t
t c g t t c a t
a c a g a t a t
a c a t t c c t
a t a g t c c a
a c a g t c c a
a c a t t c c t
t c a t t c a t
a c a t t c a a
36Haplotypes
- A haplotype is a collection of SNP alleles on a
single chromosome in an individual.
- Homozygous (same alleles)
a c g t t c a t
a c a t t c a t
t c g t t c a t
a c a g a t a t
a c a t t c c t
a t a g t c c a
a c a g t c c a
a c a t t c c t
t c a t t c a t
a c a t t c a a
37Haplotypes
- A haplotype is a collection of SNP alleles on a
single chromosome in an individual.
- Heterozygous (different alleles)
a c g t t c a t
a c a t t c a t
t c g t t c a t
a c a g a t a t
a c a t t c c t
a t a g t c c a
a c a g t c c a
a c a t t c c t
t c a t t c a t
a c a t t c a a
38Haplotypes
- A haplotype is a collection of SNP alleles on a
single chromosome in an individual.
- Rare alleles
a c g t t c a t
a c a t t c a t
t c g t t c a t
a c a g a t a t
a c a t t c c t
a t a g t c c a
a c a g a c c a
a c a t t c c t
t c a t t c a t
a c a t t c a a
39Haplotypes
- A haplotype is a collection of SNP alleles on a
single chromosome in an individual.
- Strong linkage (usually occur together)
a c g t t c a t
a c a t t c a t
t c g t t c a t
a c a g a t a t
a c a t t c c t
a t a g t c c a
a c a g t c c a
a c a t t c c t
t c a t t c a t
a c a t t c a a
40Linkage Analysis
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
recombination and inheritance
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad
a c a g t c c a
a c a g a c a t
child
41Linkage Analysis
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad
a c a g t c c a
a c a g a c a t
recombination in the mother's chromosomes
child
42Linkage Analysis
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad
a c a g t c c a
a c a g a c a t
recombination in the father's chromosomes
child
43Linkage Analysis
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad
a c a g t c c a
a c a g a c a t
two to three crossovers per chromosome per
generation
child
44Linkage Analysis
- Key point: Alleles that are physically close
together tend to be inherited together because
the chance of a crossover between them is small.
They exhibit strong linkage.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad
a c a g t c c a
a c a g a c a t
child
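The family shown on these slides can be used to see where a crossover must have happened. The sketch below assumes a single crossover per inherited chromosome and assumes, from the allele patterns, that the child's first chromosome comes from the mother and the second from the father; the function name `crossover_points` is illustrative.

```python
def crossover_points(first: str, second: str, child: str) -> list:
    """Return every k such that child == first[:k] + second[k:], i.e. a single
    crossover between SNP k-1 and SNP k that switches from `first` to `second`."""
    return [k for k in range(1, len(child)) if first[:k] + second[k:] == child]


mom = ("acattcat", "atagtcca")
dad = ("acagatat", "tcattcat")
child = ("acagtcca", "acagacat")

print(crossover_points(mom[0], mom[1], child[0]))  # [2, 3]
print(crossover_points(dad[0], dad[1], child[1]))  # [5]
```

The mother's crossover can only be placed to within two adjacent intervals because her two chromosomes carry the same allele at the third SNP. SNPs on the same side of a crossover are inherited together, which is the linkage described in the key point.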
45Finding an Unknown Disease Locus
- The genomic locations of the loci responsible
for many diseases are unknown. SNPs and haplotypes
are being used to search for disease loci using
linkage analysis.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
child has disease
46Linkage Analysis: Dominant Model
- Assume the disease is caused by a dominant
allele, meaning one copy is enough to cause the
disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
SNP alleles in father that are not in mother
child has disease
47Linkage Analysis: Dominant Model
- Assume the disease is caused by a dominant
allele, meaning one copy is enough to cause the
disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
SNP allele in child, inherited from father with
disease
child has disease
48Linkage Analysis: Dominant Model
- Assume the disease is caused by a dominant
allele, meaning one copy is enough to cause the
disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
SNP allele and disease are linked, indicating a
possible disease locus.
child has disease
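The dominant-model reasoning on the last three slides can be sketched in Python using the family's haplotypes as printed. The function name is hypothetical, positions are 0-based, and a single family is far too little evidence for a real study; the sketch only shows the filtering logic.

```python
def father_only_alleles_in_child(mom, dad, child) -> list:
    """Return (position, allele) pairs where the affected father carries an allele
    the mother lacks and the affected child has inherited that allele."""
    hits = []
    for pos in range(len(child[0])):
        mom_alleles = {hap[pos] for hap in mom}
        dad_alleles = {hap[pos] for hap in dad}
        child_alleles = {hap[pos] for hap in child}
        for allele in sorted((dad_alleles - mom_alleles) & child_alleles):
            hits.append((pos, allele))
    return hits


mom = ("acattcat", "atagtcca")    # unaffected
dad = ("acagatat", "tcattcat")    # has disease
child = ("acagtcca", "acagacat")  # has disease

print(father_only_alleles_in_child(mom, dad, child))  # [(4, 'a')]
```

The father has alleles missing from the mother at three SNPs, but only the `a` at the fifth SNP is passed to the affected child, so that SNP marks the candidate region.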
49Linkage Analysis: Recessive Model
- Assume the disease is caused by a recessive
allele, meaning two copies are required to cause
the disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
homozygous SNP alleles in father that are
heterozygous in mother
child has disease
50Linkage Analysis: Recessive Model
- Assume the disease is caused by a recessive
allele, meaning two copies are required to cause
the disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
homozygous SNP allele in child, identical to
father's
child has disease
51Linkage Analysis: Recessive Model
- Assume the disease is caused by a recessive
allele, meaning two copies are required to cause
the disease.
a c a t t c a t
a t a g t c c a
a c a g a t a t
t c a t t c a t
mom
dad has disease
a c a g t c c a
a c a g a c a t
SNP allele and disease are linked, indicating a
possible disease locus.
child has disease
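The recessive-model filter can be sketched the same way, again using the haplotypes as printed on the slides. The function name `recessive_candidates` is hypothetical and positions are 0-based.

```python
def recessive_candidates(mom, dad, child) -> list:
    """Return (position, allele) pairs where the affected father is homozygous,
    the unaffected mother is heterozygous, and the affected child is homozygous
    for the father's allele."""
    hits = []
    for pos in range(len(child[0])):
        mom_alleles = {hap[pos] for hap in mom}
        dad_alleles = {hap[pos] for hap in dad}
        child_alleles = {hap[pos] for hap in child}
        if len(dad_alleles) == 1 and len(mom_alleles) == 2 and child_alleles == dad_alleles:
            hits.append((pos, next(iter(dad_alleles))))
    return hits


mom = ("acattcat", "atagtcca")    # unaffected
dad = ("acagatat", "tcattcat")    # has disease
child = ("acagtcca", "acagacat")  # has disease

print(recessive_candidates(mom, dad, child))  # [(1, 'c')]
```

Three SNPs pass the first filter (father homozygous, mother heterozygous), but only at the second SNP is the child also homozygous for the father's allele.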
53BMI = weight/height^2 in kg/m^2; BMI > 25 is
overweight, BMI > 30 is obese
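As a quick worked example of the formula, the sketch below computes BMI and applies the two cutoffs exactly as stated on the slide (strictly greater than 25 and 30). The function names and the label "not overweight" are chosen here for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def classify(bmi_value: float) -> str:
    """Apply the slide's cutoffs: above 30 is obese, above 25 is overweight."""
    if bmi_value > 30:
        return "obese"
    if bmi_value > 25:
        return "overweight"
    return "not overweight"


value = bmi(80, 1.75)  # 80 / 3.0625, about 26.1
print(round(value, 1), classify(value))
```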
54Other Differences: Microdeletions
- A microdeletion is the loss of a small piece of
DNA, perhaps as small as 1000 bases. These
pieces can contain genes, parts of genes or
regulatory regions.
a t g t t t
a c a c t c c t
a c a t t c c t
g c g c a t
microdeletions
55Other Differences: Microdeletions
- A microdeletion is the loss of a small piece of
DNA, perhaps as small as 1000 bases. These
pieces can contain genes, parts of genes or
regulatory regions.
heterozygous
a t g t t t
a c a c t c c t
a c a t t c c t
g c g c a t
56Other Differences: Microdeletions
- A microdeletion is the loss of a small piece of
DNA, perhaps as small as 1000 bases. These
pieces can contain genes, parts of genes or
regulatory regions.
homozygous
a t g t t t
a c a c t c c t
a c a t t c c t
g c g c a t
57Other Differences: Microdeletions
- A microdeletion is the loss of a small piece of
DNA, perhaps as small as 1000 bases. These
pieces can contain genes, parts of genes or
regulatory regions.
miscalled homozygous
a t g t t t
a c a c t c c t
a c a t t c c t
g c g c a t
58Apparent Inheritance Inconsistency
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a t a a g a a c
c c c a c
a c a t c c a c
c c c t c c a c
mom
dad
c c c a c
a c a t c c a c
child
59Apparent Inheritance Inconsistency
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a t a a g a a c
c c c a c
a c a t c c a c
c c c t c c a c
mom
dad
c c c a c
a c a t c c a c
a a t t ? a t by Mendelian inheritance
child
60Apparent Inheritance Inconsistency
- SNPs and haplotypes are used to identify regions
of the genome that cause disease. The technique
is called linkage analysis and evidence of a
connection is called linkage disequilibrium (LD).
a t a a g a a c
c c c a c
a c a t c c a c
c c c t c c a c
mom
dad
c c c a c
a c a t c c a c
cluster of inconsistencies suggests a
microdeletion.
child
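The idea of spotting a microdeletion from a cluster of inheritance inconsistencies can be sketched as follows. The genotype data here is hypothetical, not the data on the slides: it imagines a mother with a deletion on one chromosome whose remaining `a` alleles are miscalled as homozygous `aa`, and a child who inherited the deleted chromosome. The function names are also illustrative.

```python
def inconsistent_positions(mom, dad, child) -> list:
    """Return positions where the child's called genotype cannot be formed by
    taking one allele from the mother and one from the father."""
    bad = []
    for pos, (m, d, c) in enumerate(zip(mom, dad, child)):
        a, b = c
        if not ((a in m and b in d) or (b in m and a in d)):
            bad.append(pos)
    return bad


def clusters(positions) -> list:
    """Group consecutive positions into (first, last) runs."""
    runs = []
    for pos in positions:
        if runs and pos == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return runs


# Hypothetical called genotypes, one two-letter string per SNP.
mom = ["ac", "aa", "aa", "aa", "ct"]    # SNPs 1-3: really a/deleted, miscalled aa
dad = ["cc", "cc", "cc", "cc", "tt"]
child = ["ac", "cc", "cc", "cc", "ct"]  # SNPs 1-3: really c/deleted, miscalled cc

bad = inconsistent_positions(mom, dad, child)
print(bad, clusters(bad))  # [1, 2, 3] [(1, 3)]
```

A single inconsistency is more likely a genotyping error; a run of adjacent inconsistencies is what points to a deleted segment.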
61Microdeletions
- Hundreds of microdeletion haplotypes have been
discovered recently. They may be a major
contributor to human differences and disease.
62Resources
- UCSC Human Genome Browser
- http://genome.ucsc.edu/cgi-bin/hgGateway
- National Center for Biotechnology Information
(NCBI)
- http://www.ncbi.nlm.nih.gov/
- PubMed
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed