1
Basic Molecular Biology
Many slides by Omkar Deshpande
2
Overview
  • Structures of biomolecules
  • Central Dogma of Molecular Biology
  • Overview of this course
  • Genome Sequencing

3
Human Genome Program, U.S. Department of Energy,
Genomics and Its Impact on Medicine and Society:
A 2001 Primer, 2001
5
Watson and Crick
7
Macromolecule (Polymer)     Monomer
DNA                         Deoxyribonucleotides (dNTP)
RNA                         Ribonucleotides (NTP)
Protein or Polypeptide      Amino Acid
8
Nucleic acids (DNA and RNA)
  • Form the genetic material of all living
    organisms.
  • Found mainly in the nucleus of a cell (hence
    nucleic)
  • Contain phosphoric acid as a component (hence
    acid)
  • They are made up of nucleotides.

9
Nucleotides
10
DNA
A T G C
11
The gene and the genome
  • Genome: the entire DNA sequence within the
    nucleus.
  • The information in the genome is used for protein
    synthesis
  • A gene is a length of DNA that codes for a
    (single) protein.

12
How big are genomes?
Organism                             Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                 3 billion             20,000
Laboratory mouse (M. musculus)       2.6 billion           20,000
Mustard weed (A. thaliana)           100 million           18,000
Roundworm (C. elegans)               97 million            16,000
Fruit fly (D. melanogaster)          137 million           12,000
Yeast (S. cerevisiae)                12.1 million          5,000
Bacterium (E. coli)                  4.6 million           3,200
Human immunodeficiency virus (HIV)   9,700                 9
13
Repeats
  • The DNA is full of repetitive elements (those
    that occur over and over and over)
  • There are several types of repeats, including
    SINEs and LINEs (Short and Long Interspersed
    Elements; 1 million copies of ALU alone) and
    low-complexity elements.
  • Their function is poorly understood, but they
    make problems more difficult.

14
Central dogma
[Diagram (zoomed in): DNA is transcribed into RNA (mRNA, tRNA, rRNA, snRNA); mRNA is translated into a polypeptide]
15
Transcription
  • The DNA is contained in the nucleus of the cell.
  • A stretch of it unwinds there, and its message
    (or sequence) is copied onto a molecule of mRNA.
  • The mRNA then exits from the cell nucleus.

16
DNA: A T G C
RNA: A U G C (T → U)
17
More complexity
  • The RNA message is sometimes edited.
  • Exons are nucleotide segments whose codons will
    be expressed.
  • Introns are intervening segments (genetic
    gibberish) that are snipped out.
  • Exons are spliced together to form mRNA.

18
Splicing
  • frgjjthissentencehjfmkcontainsjunkelm
  • thissentencecontainsjunk
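
The sentence analogy above can be reproduced in a few lines. This is only a toy sketch: the function name `splice` and the hand-picked intron coordinates are assumptions made for illustration, since a real cell recognizes splice sites from sequence signals rather than from given positions.

```python
def splice(pre_mrna, introns):
    """Remove the given half-open (start, end) intervals and join the rest."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])  # keep the exon before this intron
        pos = end                          # skip over the intron itself
    exons.append(pre_mrna[pos:])           # keep whatever follows the last intron
    return "".join(exons)


message = "frgjjthissentencehjfmkcontainsjunkelm"
# Hypothetical intron coordinates, chosen by hand for this toy string.
introns = [(0, 5), (17, 22), (34, 37)]
print(splice(message, introns))  # thissentencecontainsjunk
```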

19
Key player: RNA polymerase
  • It is the enzyme that brings about transcription
    by going down the line, pairing mRNA nucleotides
    with their DNA counterparts.
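
The pairing rule can be sketched as a lookup table. This is a simplified illustration, not a model of the enzyme: the name `transcribe` is an assumption, and the template strand is assumed to be written 3' to 5' so that the resulting mRNA reads 5' to 3'.

```python
# Base-pairing rule used during transcription: each template DNA base
# is matched with its complementary RNA base (A pairs with U in RNA).
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Walk down the template strand and emit the paired mRNA bases."""
    return "".join(PAIRING[base] for base in template_3_to_5)


# Hypothetical template strand, written 3' -> 5'.
print(transcribe("TACGGCCCTCATATC"))  # AUGCCGGGAGUAUAG
```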

20
Promoters
  • Promoters are sequences in the DNA just upstream
    of transcripts that define the sites of
    initiation.
  • The role of the promoter is to attract RNA
    polymerase to the correct start site so
    transcription can be initiated.

[Diagram: DNA drawn 5' to 3', with the promoter at the 5' (upstream) end]
22
Transcription: key steps
  • Initiation
  • Elongation
  • Termination

[Diagram labels: DNA, RNA]
27
Genes can be switched on/off
  • An adult multicellular organism contains a
    wide variety of cell types, e.g.,
    muscle, nerve and blood cells.
  • The different cell types contain the same DNA
    though.
  • This differentiation arises because different
    cell types express different genes.
  • Promoters are one type of gene regulator.

28
Transcription (recap)
  • The DNA is contained in the nucleus of the cell.
  • A stretch of it unwinds there, and its message
    (or sequence) is copied onto a molecule of mRNA.
  • The mRNA then exits from the cell nucleus.
  • Its destination is a molecular workbench in the
    cytoplasm, a structure called a ribosome.

29
Translation
  • How do I interpret the information carried by
    mRNA to the Ribosome?
  • Think of the sequence as a sequence of
    triplets.
  • Think of AUGCCGGGAGUAUAG as AUG-CCG-GGA-GUA-UAG.
  • Each triplet (codon) maps to an amino acid.
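
The triplet reading can be sketched as follows. The table is built from the standard genetic code in its usual compact one-letter form (`*` marks a stop codon). The function names are assumptions for illustration, and the sketch simply reads from the first base without searching for a start codon.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(mrna):
    """Split an mRNA string into consecutive triplets, dropping leftovers."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def translate(mrna):
    """Map each codon to an amino acid until a stop codon is reached."""
    protein = []
    for codon in codons(mrna):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print("-".join(codons("AUGCCGGGAGUAUAG")))  # AUG-CCG-GGA-GUA-UAG
print(translate("AUGCCGGGAGUAUAG"))         # MPGV
```

Here M, P, G and V are the one-letter codes for methionine, proline, glycine and valine; the final UAG is a stop codon.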

30
The Genetic Code
  • f: codon → amino acid
  • 1968 Nobel Prize in Medicine: Nirenberg and
    Khorana
  • Important: the genetic code is universal!
  • It is also redundant / degenerate.

31
The Genetic Code
32
Proteins
  • Composed of a chain of amino acids.

           R        (side chain: 20 possible groups)
           |
    H2N -- C -- COOH
           |
           H
33
Proteins
           R                    R
           |                    |
    H2N -- C -- COOH     H2N -- C -- COOH
           |                    |
           H                    H
34
Dipeptide
           R    O           R
           |    ||          |
    H2N -- C -- C -- NH -- C -- COOH
           |                |
           H                H

The C -- NH link in the middle is a peptide bond.
35
Protein structure
  • Linear sequence of amino acids folds to form a
    complex 3-D structure.
  • The structure of a protein is intimately
    connected to its function.
  • The 3-D shape of proteins gives them
    their working ability: the ability to bind
    with other molecules.

36
Our course (2417)
Part 1, DNA Assembly, Evolution, Alignment
Part 2, Genes Prediction, Regulation
[Diagram: central dogma (DNA, transcription, mRNA / rRNA / snRNA, translation, polypeptide)]
37
DNA Sequencing
Some slides shamelessly stolen from Serafim
Batzoglou
38
DNA sequencing
  • How we obtain the sequence of nucleotides of a
    species

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
39
Which representative of the species?
  • Which human?
  • Answer one:
  • Answer two: it doesn't matter
  • Polymorphism rate: number of letter changes
    between two different members of a species
  • Humans: 1/1,000 to 1/10,000
  • Other organisms have much higher polymorphism
    rates

40
Why humans are so similar
  • A small population that interbred reduced the
    genetic variation
  • Out of Africa: 40,000 years ago
41
Migration of human variation
  • http://info.med.yale.edu/genetics/kkidd/point.html

44
DNA Sequencing
  • Goal: find the complete sequence of A, C, G, T's in
    DNA
  • Challenge: there is no machine that takes long DNA as an
    input, and gives the complete sequence as output
  • Can only sequence 500 letters at a time

45
DNA sequencing: vectors
[Diagram: DNA is sheared ("Shake") into DNA fragments; each fragment is inserted at a known location (restriction site) of a vector]
Vector: circular genome (bacterium, plasmid)


46
DNA sequencing: gel electrophoresis
  1. Start at primer (restriction site)
  2. Grow DNA chain
  3. Include dideoxynucleoside (modified a, c, g, t)
  4. Stops reaction at all possible points
  5. Separate products by length, using gel
    electrophoresis
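
The logic of the steps above can be imitated with strings. This is a heavily simplified sketch: it ignores strand complementarity, dyes and noise, and the function names are assumptions for illustration. Each fragment stands for a copy that stopped where a dideoxy base was incorporated, and sorting by length plays the role of the gel.

```python
import random


def chain_termination_fragments(sequence):
    """One terminated copy for every possible stopping point."""
    return [sequence[:i] for i in range(1, len(sequence) + 1)]


def read_gel(fragments):
    """Order fragments by length and read off the terminating base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


sequence = "ACGTGACTGAGGACCGTG"
fragments = chain_termination_fragments(sequence)
random.Random(0).shuffle(fragments)  # the reaction tube holds them in no particular order
print(read_gel(fragments) == sequence)  # True
```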

47
Electrophoresis diagrams
48
Challenging to read answer
50
Reading an electropherogram
  • Filtering
  • Smoothing
  • Correction for length compressions
  • A method for calling the letters: PHRED
  • PHRED: PHil's Read EDitor (by Phil Green)
  • Based on dynamic programming
  • Several better methods exist, but labs are
    reluctant to change

51
Output of PHRED: a read
  • A read: 500-700 nucleotides
  • A C G A A T C A G A
  • 16 18 21 23 25 15 28 30 32 21
  • Quality scores: -10 × log10(Prob(Error))
  • Reads can be obtained from the leftmost and rightmost
    ends of the insert
  • Double-barreled sequencing:
    both leftmost and rightmost ends are sequenced
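
The quality-score formula can be inverted to recover the error probability of each base call. A minimal sketch, using the ten bases and scores from the slide; the function names are assumptions for illustration.

```python
import math


def phred_quality(p_error):
    """Quality score Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)


def error_probability(quality):
    """Inverse of the formula above: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


bases = "ACGAATCAGA"
qualities = [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]

for base, q in zip(bases, qualities):
    print(base, q, f"{error_probability(q):.4f}")

# Expected number of wrong calls among these ten bases.
expected_errors = sum(error_probability(q) for q in qualities)
print(f"{expected_errors:.3f}")
```

A score of 20 corresponds to a 1 in 100 chance that the call is wrong, and a score of 30 to 1 in 1,000.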

52
Method to sequence longer regions
[Diagram: a genomic segment is cut many times at random (Shotgun); two reads of 500 bp each are obtained from each segment]
53
Reconstructing The Sequence
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
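
The overlap-and-extend idea can be sketched with a greedy merge. This is a toy illustration under strong assumptions (error-free reads, no repeats, all reads from the same strand), not the method used by any particular assembler; the names `overlap` and `greedy_assemble` are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len letters


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):        # compare every ordered pair of reads
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping toy reads taken from ACGTGACTGAGGACCGTG.
reads = ["GAGGACCGTG", "ACGTGACTGA", "GACTGAGGAC"]
print(greedy_assemble(reads))  # ['ACGTGACTGAGGACCGTG']
```

Comparing every pair of reads is quadratic in the number of reads, and a repeated region would make unrelated reads look as if they overlap, which is why real assemblers need more than this.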
54
Definition of Coverage
  • Length of genomic segment: L
  • Number of reads: n
  • Length of each read: l
  • Definition: Coverage C = n l / L
  • How much coverage is enough?
  • Lander-Waterman model: assuming uniform distribution of reads, C = 10
    results in 1 gapped region per 1,000,000 nucleotides
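
The coverage figure can be checked numerically. The sketch below uses a simplified form of the Lander-Waterman estimate, in which the expected number of gaps is n × e^(-C); it ignores the minimum overlap needed to detect that two reads join, so treat it as an approximation. The function names and the example numbers (20,000 reads of 500 letters over 1,000,000 nucleotides) are assumptions chosen to give C = 10.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads, read_len, genome_len):
    """Simplified Lander-Waterman estimate: n * exp(-C) gaps."""
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)


n, l, L = 20_000, 500, 1_000_000
print(coverage(n, l, L))                 # 10.0
print(round(expected_gaps(n, l, L), 3))  # 0.908
```

With these numbers the estimate comes to roughly one gap per million nucleotides, in line with the figure quoted on the slide.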

55
Challenges with Fragment Assembly
  • Sequencing errors: 1-2% of bases are wrong
  • Repeats
  • Computation: O(N^2), where N is the number of reads

[Diagram: false overlap due to repeat]
56
Repeats
  • Bacterial genomes: 5%
  • Mammals: 50%
  • Repeat types:
  • Low-Complexity DNA (e.g. ATATATATACATA)
  • Microsatellite repeats: (a1...ak)^N where k ≈ 3-6
    (e.g. CAGCAGTAGCAGCACCAG)
  • Transposons
  • SINE (Short Interspersed Nuclear Elements),
    e.g., ALU: 300 bp long, 10^6 copies
  • LINE (Long Interspersed Nuclear Elements):
    4,000 bp long, 200,000 copies
  • LTR retroposons (Long Terminal Repeats (700 bp)
    at each end): cousins of HIV
  • Gene Families: genes duplicate, then diverge
    (paralogs)
  • Recent duplications: 100,000 bp long, very similar
    copies

57
Hierarchical Sequencing
58
Hierarchical Sequencing Strategy
genome
  1. Obtain a large collection of BAC clones
  2. Map them onto the genome (Physical Mapping)
  3. Select a minimum tiling path
  4. Sequence each clone in the path with shotgun
  5. Assemble
  6. Put everything together
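
Step 3 can be sketched as a classic greedy interval cover. This is an illustration under a strong assumption: each clone is given exact start and end coordinates on the genome, whereas a real physical map only provides relative order and approximate overlaps. The function name and the clone names are hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones whose intervals cover [region_start, region_end].

    clones maps a clone name to its (start, end) position.
    Returns a list of clone names, or None if the clones leave a gap.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best_name, best_end = None, covered
        # Among clones starting inside the covered part, take the one reaching furthest.
        while i < len(ordered) and ordered[i][1][0] <= covered:
            name, (start, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            return None  # no clone extends the covered region: there is a gap
        path.append(best_name)
        covered = best_end
    return path


# Hypothetical BAC clones with made-up coordinates.
clones = {
    "BAC1": (0, 150),
    "BAC2": (50, 120),
    "BAC3": (100, 300),
    "BAC4": (140, 260),
    "BAC5": (250, 400),
}
print(minimum_tiling_path(clones, 0, 400))  # ['BAC1', 'BAC3', 'BAC5']
```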

59
Methods of physical mapping
  • Goal: make a map of the locations of each clone
    relative to one another
  • Use the map to select a minimal set of clones to
    sequence
  • Methods: hybridization, digestion

60
1. Hybridization
[Diagram: probes p1 ... pn]
  • Short words, the probes, attach to complementary
    words
  • Construct many probes
  • Treat each BAC with all probes
  • Record which ones attach to it
  • Same words attaching to BACs X, Y → overlap

61
2. Digestion
  • Restriction enzymes cut DNA where specific words
    appear
  • Cut each clone separately with an enzyme
  • Run fragments on a gel and measure length
  • Clones Ca, Cb have fragments of length li, lj,
    lk → overlap
  • Double digestion:
    cut with enzyme A, enzyme B, then enzymes A + B

62
Whole-Genome Shotgun Sequencing
63
Whole Genome Shotgun Sequencing
[Diagram: the genome is cloned into plasmids (2-10 Kbp) and cosmids (40 Kbp); forward-reverse paired reads of 500 bp each are taken from the two ends of each insert, a known distance apart]
64
History of DNA Sequencing
Adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998)
[Timeline figure; year ticks 1870, 1940, 1953, 1965, 1970, 1977, 1980, 1986, 1990, 2002, 2009; second axis: efficiency (bp/person/year)]
  • 1870: Miescher discovers DNA
  • 1940: Avery proposes DNA as genetic material
  • 1953: Watson & Crick, double helix structure of DNA
  • 1965: Holley sequences yeast tRNA-Ala
  • 1970: Wu sequences λ cohesive end DNA
  • 1977: Sanger, dideoxy chain termination; Gilbert, chemical degradation
  • 1980: Messing, M13 cloning
  • 1986: Hood et al., partial automation
  • 1990: Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
  • 2009: Next generation sequencing; improved enzymes and chemistry; new image processing
  • Efficiency axis values (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
65
Read length and throughput
[Figure: bases per machine run (axis ticks 1 Mb, 10 Mb, 100 Mb, 1 Gb) plotted against read length (axis ticks 10 bp, 100 bp, 1,000 bp)]
NGS slides courtesy of Gabor Marth
66
Sequencing chemistries
DNA ligation
DNA base extension
Church, 2005
67
Massively parallel sequencing
Church, 2005
68
Features of NGS data
  • Short sequence reads: 100-200 bp for 454 (Roche);
    35-120 bp for Solexa (Illumina) and SOLiD (AB)
  • Huge amount of sequence per run: gigabases per run
  • Huge number of reads per run: up to billions
  • Higher error (compared with Sanger)
  • Different error profile

69
Current and future application areas
70
What can we use them for?
                              SANGER            454               Solexa         AB SOLiD
De novo assembly              Mammal (3×10^9)   Bacteria, Yeast   Bacteria       Bacteria?
SNP Discovery                 Yes               Yes               90% of human   90% of human
Larger events                 Yes               Yes               Yes            Yes
Transcript profiling (rare)   No                Maybe             Yes            Yes
72
Computer scientists vs Biologists
  • Nothing is ever completely true or false in
    Biology.
  • Everything is either true or false in computer
    science.

73
Next Gen Raw Data
  • Machine Readouts are different
  • Read length, accuracy, and error profiles are
    variable.
  • All parameters change rapidly as machine
    hardware, chemistry, optics, and noise filtering
    improve

75
Fundamental informatics challenges
76
Informatics challenges (contd)
77
AB SOLiD System: dibase sequencing
2-base, 4-color: 16 probe combinations
  • 4 dyes to encode 16 2-base combinations
  • Detecting a single color indicates 4 combinations
    and eliminates 12
  • Each color reflects position, not the base call
  • Each base is interrogated by two probes
  • Dual interrogation eases discrimination of
    errors (random or systematic) vs. SNPs (true
    polymorphisms)

78
Converting colors into letters
4 Possible Sequences
  • The decoding matrix allows a sequence of
    transitions to be converted to a base sequence,
    as long as one of the bases is known.
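
The decoding step can be sketched in code. This is a sketch that assumes the commonly described two-bit scheme (A=0, C=1, G=2, T=3, with the color of a dinucleotide being the XOR of its two base codes); the actual dye assignments are not modeled, and the function names are made up for illustration.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def encode(sequence):
    """One color (0-3) per pair of adjacent bases."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]


def decode(first_base, colors):
    """Rebuild the bases from the colors, given one known starting base."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ color])
    return "".join(sequence)


colors = encode("ATGGCA")
print(colors)               # [3, 1, 0, 3, 1]
print(decode("A", colors))  # ATGGCA

# Without a known base, the same colors fit four different sequences.
print([decode(base, colors) for base in BASES])
```

Each color is shared by 4 of the 16 possible base pairs, so the colors alone leave four candidate sequences; fixing a single base selects one of them.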

79
SOLiD error checking code
80
Comparison of the technologies
              SANGER     454           Solexa       AB SOLiD
Output        Sequence   Flowgram      Sequence     Colors
Read Length   500-700    250-500       35-70        35-50
Error rate    2%         3% (indels)   1%           4% or 0.06%
Mb per run    0.8        20            10000        20000
Cost per Mb   1000       50            0.15         0.05
Paired?       Yes        Sort of       Yes (<1k)    Yes (<10k)