Title: STBC2023
1STBC2023 Introduction to
Bioinformatics: Analyses and Predictive Methods
Using Nucleotide Sequences
- M. Firdaus Raih
- Room 1166, Bangunan Sains Biologi
- Office Hours: Wednesdays
- Phone: 0389215961, Email: firdaus_at_mfrlab.org
Ver. 23-01-09-1
2Guide
- This is an electronic self-study and self-assessment module based on the lectures covering Topic 4, Analysis at Nucleotide Level, of the STBC2023 Introduction to Bioinformatics course.
- To navigate this module, use the buttons provided, mostly in the bottom right-hand corner of the page or, in some slides, the bottom left-hand corner. The Home icon button returns to the slide with the key questions which this course material tries to answer. Several pages have hyperlinks which jump either to specific slides or away from this module via the default web browser. To return, simply click back on this file. Clicking outside the buttons will advance the slides in normal PowerPoint slideshow order instead of navigating to the intended pages.
- Practicals and self-assessment questions to gauge your comprehension of a given practical session are also provided throughout. Please attempt the practicals and the questions on your own before resorting to the solutions or answers provided.
3Pre-session Questions
- What are nucleic acids?
- What types of nucleic acids are there?
- What functions do nucleic acids have?
- What sort of information do nucleotide sequences carry?
- What can be done with DNA sequences?
- What can be done with RNA sequences?
- Is molecular structure important for RNA sequences?
- What is a sequence alignment?
- What is the relationship of an alignment with regard to biological function?
- Is extracting the encoded information for protein synthesis the only sequence analysis which can be done?
4Learning objectives
- Know the basic chemistry of nucleic acids and be able to understand their diverse functions.
- Be able to list, in general terms, potential analyses for nucleic acid sequence data and the applications of those analyses, based on an understanding of the functions of nucleic acids.
- Be able to formulate a strategy and present the processes involved in the analysis of nucleic acid sequence data.
- Be able to comprehend the basic concepts of sequence alignments in general, and of aligning nucleic acids specifically, as well as the relationship between an alignment and a sequence's biological function.
5Nucleic Acids Chemistry and Molecular Structure
8Nucleic Acids Chemistry and Molecular Structure
- What are nucleic acids?
- Nucleic acids: polymers of nucleotides → 2 types
- DNA: deoxyribonucleic acid
- RNA: ribonucleic acid
- What is a nucleotide?
- Nucleotide = nucleoside + 1 phosphate group
- Nucleoside = nitrogenous base + sugar (ribose in RNA, deoxyribose in DNA)
9Nucleic Acids Chemistry and Molecular Structure
What is the basic difference between RNA and DNA
(in terms of chemistry)?
[Figure labels: RNA, DNA]
11Nucleic Acids Chemistry and Molecular Structure
How can the nucleotide polymer be represented?
Hydrogen bonded base interactions and base
stacking interactions result in stable structures
of DNA / RNA.
Seq 1. ACTG
Seq 2. TGAC
What can be done with such sequence data? How is
the analysis related to biological function?
12Nucleic Acids Biological Functions
- What is/are the function(s) of DNA?
- What is/are the function(s) of RNA?
13Nucleic Acids Biological Functions
- What is the function of DNA?
- Storage of genetic information
- Proteins such as transcription factors also interact directly with DNA as part of regulatory pathways
- Total genetic content of an organism = genome
- Genes are part of genomes
- So what is a gene?
14Nucleic Acids Biological Functions
- What is the function of DNA?
- Storage of hereditary information in genes.
- What is a gene?
While sequencing of the human genome surprised us
with how many protein-coding genes there are, it
did not fundamentally change our perspective on
what a gene is. In contrast, the complex patterns
of dispersed regulation and pervasive
transcription uncovered by the ENCODE project,
together with non-genic conservation and the
abundance of noncoding RNA genes, have challenged
the notion of the gene. To illustrate this, we
review the evolution of operational definitions
of a gene over the past century--from the
abstract elements of heredity of Mendel and
Morgan to the present-day ORFs enumerated in the
sequence databanks. We then summarize the current
ENCODE findings and provide a computational
metaphor for the complexity. Finally, we propose
a tentative update to the definition of a gene: A
gene is a union of genomic sequences encoding a
coherent set of potentially overlapping
functional products. Our definition side-steps
the complexities of regulation and transcription
by removing the former altogether from the
definition and arguing that final, functional
gene products (rather than intermediate
transcripts) should be used to group together
entities associated with a single gene. It also
manifests how integral the concept of biological
function is in defining genes.
15Nucleic Acids Biological Functions
- What are the functions of RNA?
16Nucleic Acids Biological Functions
- What are the functions of RNA?
- Information storage and transfer
- Genomes of RNA viruses
- mRNA
- Protein synthesis
- tRNA
- Peptidyl transferase
- Catalysis
- ribozymes
- Regulatory
- Small ncRNAs / microRNAs
- Riboswitches
- Also see the RNA World hypothesis; the term was coined by Walter Gilbert (1986, Nature)
17DNA (Genes) From Sequence to Function
- How does a gene sequence correlate to biological
function?
18DNA (Genes) From Sequence to Function
- How does a gene sequence correlate to biological function?
- Let's first look at:
- Information about the amino acid sequence is contained within the nucleic acid sequence.
- Is that the only analysis that can be done for DNA sequences?
- What other analyses, if any, can be done for DNA sequences?
19Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- Genome projects: DNA sequencing data need to be assembled into complete genomes.
- Genes need to be identified / predicted.
- Comparisons of specific nucleotide-level variations.
- Identification and analysis of specific nucleotide sequence-level motifs and patterns.
- Identification and analysis of polymorphisms.
20Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- Genome projects: DNA sequencing data need to be assembled into complete genomes.
- Genome sequencing generates fragments of sequence.
- These fragments need to be assembled into genes, chromosomes and finally the complete genome.
- Assembly is done by searching for contiguous sequences (contigs).
- Contigs are found by aligning the short DNA sequences to one another and finding where they overlap.
- More on this topic will be covered in the Genomics course in Year 3.
- After the genome is assembled, the genes need to be identified.
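The overlap idea behind contig assembly can be sketched in a few lines of Python. This is only an illustration, assuming error-free reads from the same strand; the function names `overlap` and `merge_reads`, the `min_len` parameter and the two reads are hypothetical, and real assemblers are far more elaborate.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_len=3):
    """Join two reads into one contig if they overlap, otherwise return None."""
    size = overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Two made-up reads that share the bases "GGTGA"
read_a = "CGTAATCGGTGA"
read_b = "GGTGATCGGC"
contig = merge_reads(read_a, read_b)
print(contig)  # CGTAATCGGTGATCGGC
```

The `min_len` threshold is there because very short overlaps are likely to occur by chance.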
21Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- From sequence data, genes need to be predicted.
- Several methods are available for gene prediction:
- Searching by signal: analysis of sequence signals which specify a gene.
- Searching by content: analysis of regions showing compositional bias that has been correlated to coding regions.
- Homology-based prediction: comparison against known gene sequences (involves sequence alignments).
- Comparative gene prediction: comparing sequences of interest against anonymous genomic sequences (involves sequence alignments).
- The prediction of eukaryotic genes from genomic DNA data is appreciably more difficult than that of prokaryotic genes. Why?
22Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- From sequence data, genes need to be predicted.
- Several methods are available for gene prediction:
- Searching by signal: analysis of sequence signals which specify a gene.
- Searching by content: analysis of regions showing compositional bias that has been correlated to coding regions.
- Homology-based prediction: comparison against known gene sequences (involves sequence alignments).
- Comparative gene prediction: comparing sequences of interest against anonymous genomic sequences (involves sequence alignments).
- For this session, we will focus on methods which involve sequence alignments.
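As a small illustration of "searching by signal", the sketch below scans the three forward reading frames for the simplest gene signals: an ATG start codon followed by an in-frame stop codon. It assumes an intron-free (prokaryote-like) sequence; the function name `find_orfs`, the `min_codons` threshold and the example sequence are hypothetical, and real gene finders use many more signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for open reading frames on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


example = "CCATGGCTAAAGGTTGACC"  # made-up sequence
print(find_orfs(example))  # [(2, 17, 2)]
```

The reverse strand is not scanned here; a fuller sketch would repeat the search on the reverse complement.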
23Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- Comparisons of specific nucleotide-level variations.
- Enable differentiation at the individual level or between close relationships, i.e. between strains of the same species.
- Phylogenetic analysis (discussed by Dr. Khairina).
24Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- Identification and analysis of specific nucleotide sequence-level motifs and patterns.
- This will be discussed further in the following lecture.
- Examples:
- PCR primer design
- Searching / mapping restriction sites
- Go to the corresponding BLAST exercise NOW
- or proceed to the next slide.
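Searching for a restriction site is a plain pattern search, which the following sketch illustrates. It uses the EcoRI recognition sequence GAATTC, which is palindromic, so searching one strand is enough; the function name `find_sites` and the example sequence are made up for illustration.

```python
def find_sites(dna, site):
    """Return 0-based start positions of every occurrence of `site` in `dna`."""
    dna, site = dna.upper(), site.upper()
    positions = []
    pos = dna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = dna.find(site, pos + 1)
    return positions


ECORI = "GAATTC"  # recognition sequence of EcoRI
example = "TTGAATTCAGGCTGAATTCCA"  # made-up sequence
sites = find_sites(example, ECORI)
print(sites)           # [2, 13]
# A linear molecule cut at n sites gives n + 1 fragments
print(len(sites) + 1)  # 3
```

For a non-palindromic recognition sequence, the reverse complement of the site would also have to be searched.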
25Potential Analyses for DNA Sequences
- What can be done with DNA sequences?
- Identification and analysis of polymorphisms.
- This will be discussed further in the following lecture.
- Examples:
- SNPs: single nucleotide polymorphisms (more on SNPs)
- Go to the corresponding BLAST exercise NOW
- or proceed to the next slide.
26Sequence Alignments
- What is a sequence alignment?
- A way of arranging sequences so that their regions of similarity line up.
- Examples
- Gaps (-) are inserted to optimize alignments.
- They represent indel (insertion or deletion) mutations.
- Short sequences are easy to align manually. But what about longer sequences? How can those be aligned? To understand this further, let's look at a method with which we can visualize and track the alignment. This method is called a dot plot.
27Sequence Alignments
- What is a dot plot?
- A plot where two sequences are written along the top row and leftmost column of a two-dimensional matrix, and a dot is placed at every point where the characters in the corresponding row and column match.
- Parts of the two sequences where the match is continuous can be traced as a diagonal line → a region where the sequences are aligned.
- A sequence can be plotted against itself; regions that share significant similarities will appear as lines off the main diagonal. This can occur when a protein consists of multiple similar structural domains.
- A dot plot only marks identical characters, so it is not able to account for the divergence or substitutions/mutations which we know can occur.
28Sequence Alignments
- Dot plot for two DNA sequences
- Complete the dot plot for the two DNA sequences provided below. An example can be seen below.
- Seq1: CGATCGCGTAATCGGTGATCGGC
- Seq2: CGGTATCGGTGATCGATCGCA
- Questions
- Which stretch of these sequences can best be aligned to each other? (Answer)
- Can this alignment be extended? (Answer)
- Can you identify a repetitive sequence of 4 bases which keeps occurring in both sequences? (Answer)
29Sequence Alignments
- Dot plot for two DNA sequences
- Questions
- 1. Which stretch of these sequences can best be aligned to each other?
- Answer: The longest continuous diagonal line from your dot plot, ATCGGTGATCG.
- 2. Can this alignment be extended?
- Answer: Yes, it can be extended as shown below. 2 nucleotides are not aligned and may possibly be substitutions.
- 3. Can you identify a repetitive sequence of 4 bases which keeps occurring in both sequences?
- Answer: ATCG; this can be deduced from the repeating short diagonal lines.
- 4. What can you attribute all the other plotted dots to?
- Answer: They are the result of random sequence similarities.
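The manual dot plot for the two practical sequences can be cross-checked with a short script. The sketch below is one possible way to do it, not part of the original practical: `dot_plot` builds the matrix of matches and `longest_diagonal` (a hypothetical helper name) tracks the longest unbroken diagonal run.

```python
def dot_plot(seq_a, seq_b):
    """Matrix with one row per base of seq_b and one column per base of seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def longest_diagonal(seq_a, seq_b):
    """Longest unbroken diagonal run of dots, returned as the matching substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(seq_a) + 1)
    for b in seq_b:
        curr = [0] * (len(seq_a) + 1)
        for j, a in enumerate(seq_a, start=1):
            if a == b:
                # Extend the diagonal that ended one row up and one column left
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], j
        prev = curr
    return seq_a[best_end - best_len:best_end]


seq1 = "CGATCGCGTAATCGGTGATCGGC"
seq2 = "CGGTATCGGTGATCGATCGCA"

# Print the plot: Seq1 runs across the columns, Seq2 down the rows
for base, row in zip(seq2, dot_plot(seq1, seq2)):
    print(base, "".join("*" if hit else "." for hit in row))

print(longest_diagonal(seq1, seq2))  # ATCGGTGATCG
```

The printed diagonal agrees with the answer to question 1; the scattered single stars correspond to the random similarities of question 4.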
30Computational Sequence Alignments
- We've looked at manual alignments for short sequences and at the dot plot. However, manual alignments cannot be done for lengthy and highly variable sequences, so computer-aided alignments are needed for these.
- How can computer-aided alignments be done?
- Computer-aided alignment relies on a class of algorithms called dynamic programming algorithms.
- Two common dynamic programming algorithms approach alignment differently:
- 1. Local alignments: Smith-Waterman algorithm
- 2. Global alignments: Needleman-Wunsch algorithm
31Computational Sequence Alignments
- What is the difference between global and local alignments?
- Local alignments (Smith-Waterman algorithm): find the best-matching subregions of the two sequences.
- Global alignments (Needleman-Wunsch algorithm): align the two sequences over their entire lengths.
- The Smith-Waterman algorithm is currently the most used because real biological sequences are usually similar in localized portions and not over their entire lengths.
- Examples:
- Genes from different organisms with similar exons but different intron structures
- Proteins that share only certain domains
- Alignments can have gaps, which represent mutations. The ability to add gaps is required as sequences diverge.
- So how do we know that an alignment is meaningful?
32Computational Sequence Alignments
- How do we know that an alignment is meaningful?
- Insertions and deletions are slow evolutionary processes. The addition of gaps MUST therefore be controlled, otherwise a large proportion of matches could be obtained simply by inserting large numbers of gaps.
- Gap penalties are given to control the addition of gaps. The penalty system can be constant or proportional (to the length of the gap).
- Scores are given for matches, while penalties are given for the addition of gaps.
- The alignment algorithm then searches for the alignment with the best score.
- Like the dot plot, a simple system such as the above does not seem to fully consider divergence (i.e. point mutations); only deletions and insertions seem to be considered.
- How can we get around this problem?
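The scoring scheme on this slide (scores for matches, penalties for gaps) can be sketched as a small dynamic programming routine. This is a minimal illustration rather than the implementation used by any particular program: the function name `alignment_score` and the values `match=1`, `mismatch=-1` and `gap=-2` are assumptions chosen for the example, the gap penalty is proportional (each gap position costs the same), and only the best score is returned, not the alignment itself.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(
                table[i - 1][j - 1] + pair,  # align the two bases
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                score = max(score, 0)        # a local alignment may restart anywhere
            table[i][j] = score
            best = max(best, score)
    return best if local else table[-1][-1]


print(alignment_score("ACGT", "AGT"))                      # 1: three matches, one gap
print(alignment_score("TTACGTTT", "GGACGGG", local=True))  # 3: the shared "ACG"
```

The only differences between the two modes are the initial border values, the floor at zero and where the final score is read from.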
33Computational Sequence Alignments
- How do we know that an alignment is meaningful? (cont.)
- Point mutations can result in a change (substitution) as opposed to a deletion or insertion.
- A matrix called a substitution matrix can be used to model the possible changes and provide quantitative values for changes arising from point mutations.
- The values for substitution can take into consideration similarity, such as physico-chemical properties for amino acids or transition mutations for nucleic acids.
- But there is still a probability that a search result is random, especially for large databases.
- How can we be certain the alignment achieved is the expected result?
[Figure: amino acid substitution matrix example]
[Figure: nucleic acid substitution matrix example]
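A nucleotide substitution matrix of the kind described above can be sketched as follows. Transitions (purine to purine, or pyrimidine to pyrimidine) are penalized less than transversions; the numeric values and the names `substitution_score`, `MATRIX` and `score_ungapped` are purely illustrative assumptions, not the values of any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=2, transition=-1, transversion=-2):
    """Score one aligned pair of bases."""
    if x == y:
        return match
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by base pairs
MATRIX = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}


def score_ungapped(seq_a, seq_b):
    """Sum the matrix values over two equal-length, already aligned sequences."""
    return sum(MATRIX[pair] for pair in zip(seq_a, seq_b))


print(score_ungapped("ACGT", "ACAT"))  # 5: three matches and one G/A transition
```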
34Computational Sequence Alignments
- How do we know that an alignment is meaningful? (cont.)
- How can we be certain the alignment achieved is the expected result?
- The alignments produced are statistically evaluated.
- As an example, for the BLAST program, a value called the Expectation (E) value is given.
- The E value is the number of different alignments with scores equivalent to or better than S (the score of the alignment) that are expected to occur in a database search by chance.
- The lower the E value, the more significant the score.
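For reference, the E value is usually written using the Karlin-Altschul statistics on which BLAST is based. The symbols below do not appear in the slides and are introduced here only for illustration: $m$ is the query length, $n$ is the total length of the database, and $K$ and $\lambda$ are constants that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

Read this way, E falls exponentially as the score S rises and grows in proportion to the size of the search space, which is why the same alignment score is less significant in a larger database.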
35Sequence Alignments
- What is the rationale in doing an alignment?
- Proteins perform most cellular functions.
- The structure of a protein is an important determinant of its function.
- If proteins share a similar structure, then they may also share a similar function.
- We know that sequences with 30% similarity share a similar fold (Chothia & Lesk, 1986).
36Sequence Alignments
- What is the rationale of doing an alignment?
- If proteins share a similar function, then they may also share a similar structure.
37Sequence Alignments
- What is the rationale in doing an alignment?
- If proteins share a similar structure, then they may also share a similar sequence.
- But our interest here is NUCLEIC ACID sequences.
- So what is the relevance?
39Sequence Database Searching
- What is a sequence database?
- What are we searching for, and how do we search
for something in sequence databases?
40Sequence Database Searching
- What is a sequence database?
- A collection of biological macromolecular sequences.
- The sequences can be organized by organism, protein family, source etc.
- Has been covered in Topic 2. Example: NCBI GenBank.
- What are we searching for, and how do we search for something in sequence databases?
41Sequence Database Searching
- What is a sequence database?
- A collection of biological macromolecular sequences.
- The sequences can be organized by organism, protein family, source etc.
- Has been covered in Topic 2. Example: NCBI GenBank.
- What are we searching for, and how do we search for something in sequence databases?
- We are searching for sequence similarity.
- We can search for sequence similarity by comparing an input (query) sequence against sequences in the database.
- This comparison is done by aligning the query sequence to the database sequences → one tool we can use is BLAST.
- How is this alignment relevant biologically?
42Sequence Database Searching
- What is BLAST?
- Basic Local Alignment Search Tool
- Implements heuristics to approximate the Smith-Waterman algorithm and search for high-scoring alignments.
- The alignment scores are then statistically evaluated; one example is the E value discussed previously.
- BLAST is actually a family of programs.
44BLAST
45BLAST
(1) Select the BLAST program
(2) Input the sequence (query)
(3) Choose the database to search
(4) Choose optional parameters
Then click BLAST
46BLAST
47BLAST
- Is that it? YES and NO. Let's look at some considerations and strategies for BLAST searching.
48BLAST
- Some considerations and strategies
- Input sequence and search database: what is it that you're really interested in? Finding similarity alone or identifying homologs? Finding homologs only, or perhaps trying to find out if genes with similar sequences encode proteins with available structures? The answers to these types of questions influence the type of search program you should use and the database to search in.
- Protein vs. nucleotide?
49BLAST
- Some considerations and strategies
- Are you interested in something quite specific?
-
50BLAST
- Some considerations and strategies
- Did you forget to turn something on/off?
- Sequence filters: Low-complexity regions have
fewer sequence characters in them because of
repeats of the same sequence character or
pattern. These sequences produce artificially
high-scoring alignments that do not accurately
convey sequence relationships in sequence
similarity searches. Regions of low complexity or
repetitive sequences may be readily visualized in
a dot matrix analysis of a sequence against
itself. Low-complexity regions with a repeat
occurrence of the same residue can appear on the
matrix as horizontal and vertical rows of dots
representing repeated matches of one residue
position in one copy of the sequence against a
series of the same residue in the second copy.
Repeats of a sequence pattern appear in the same
matrix as short diagonals of identity that are
offset from the main diagonal. Such sequences
should be excluded from sequence similarity
searches.
51BLAST
- Some considerations and strategies
- Did you forget to turn something on/off?
- Options and parameter settings
-
52Output of BLAST Searches
- What are the components of a BLAST search output?
- Example: blastn vs blastx (GenBank AF390557)
[Figure: blastn and blastx output. This section: overview of the output alignments]
53Output of BLAST Searches
- What are the components of a BLAST search output?
- Example: blastn vs blastx (GenBank AF390557)
[Figure: blastx and blastn output. This section: list of hits (alignments)]
54Output of BLAST Searches
- What are the components of a BLAST search output?
- Example: blastn vs blastx (GenBank AF390557)
[Figure: blastx and blastn output. This section: the alignments]
55Output of BLAST Searches
- To be a significant match, a database sequence that is listed in the program output should have a small E (expect value) and a reasonable alignment with the query sequence (or translations of protein-encoding DNA sequences should have these same features).
- The E of the alignment score between the sequences gives the statistical chance that an unrelated sequence in the database or a random sequence could have achieved such a score with the query sequence, given as many sequences as there are in the database. The smaller the E, the more significant the alignment. A cutoff value in the range of 0.01-0.05 may be used (Pearson 1996). In genome comparisons, a more stringent cutoff score (10^-100 to 10^-20) may be used to find sequences that align very well with the query sequence. However, the alignment should also be examined for absence of repeats of the same residue or residue pattern because these patterns tend to give false high alignment scores.
- Filtering of low-complexity regions from the query sequence in a database search helps to reduce the number of false positives. The alignment should also be examined for reasonable amino acid substitutions and for the appearance of a believable alignment.
- To gain further confidence that the alignment between the query and database sequences is significant, either the query sequence or the matched database sequence may be shuffled many times, and each random sequence may be realigned with the other unshuffled sequence to obtain a score distribution for a set of unrelated sequences. This distribution may then be used to evaluate the significance of the true alignment score.
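The shuffling test described in the last point can be sketched as follows. This is a toy version under stated assumptions: `identity_score` (number of identical positions without gaps) stands in for a real alignment score, and the names `shuffle_test`, `trials` and `seed` are hypothetical.

```python
import random


def identity_score(seq_a, seq_b):
    """Toy score: number of identical positions in an ungapped comparison."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def shuffle_test(query, subject, score_fn=identity_score, trials=1000, seed=0):
    """Return the real score and the fraction of shuffled queries scoring at least as well."""
    rng = random.Random(seed)
    real = score_fn(query, subject)
    letters = list(query)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        if score_fn("".join(letters), subject) >= real:
            hits += 1
    return real, hits / trials


print(shuffle_test("ATCGGTGATCG", "ATCGGTGATCG"))
print(shuffle_test("AAAA", "AAAA", trials=10))  # (4, 1.0)
```

The second call shows why low-complexity regions mislead: every shuffle of a single-residue repeat scores as well as the original, so the high score carries no information.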
56BLAST
- Carrying out a BLAST search
- Select and copy the sequence from the GenBank database here.
- Go to the BLAST page and carry out database searches using the above sequence.
- First carry out a search against a nucleotide database.
- Which BLAST programs can you use? Name two possibilities. (Answer)
- Next carry out a search against a protein database.
- Which BLAST program should you use? (Answer)
- (i) Can you further narrow down the search? (ii) Also, if for example you were to search for genes which code for proteins that have representative 3D structures, how would you conduct such a search? (Answer)
57BLAST
- Answers to questions on carrying out a BLAST search
- First carry out a search against a nucleotide database.
- Which BLAST programs can you use? Name two possibilities.
- Answer: blastn and tblastx. tblastn is not a correct answer because it uses a protein query; although the database searched is a nucleotide database, the input sequence AF390557 is a DNA sequence.
- Next carry out a search against a protein database.
- Which BLAST program should you use?
- Answer: blastx
- (i) Can you further narrow down the search? (ii) Also, if for example you were to search for genes which code for proteins that have representative 3D structures, how would you conduct such a search?
- Answer: (i) Yes, searches returning a very large number of hits can still be narrowed down. A carefully annotated protein sequence database (e.g., PIR, SwissProt) will provide a more manageable output list of matched sequences, and these proteins have probably been observed in the laboratory, i.e., the genes do produce a protein product in cells. However, investigators may also wish to expand the search to include predicted genes from gene annotations of genomic sequences that are frequently entered into the DNA sequence translation databases (e.g., DNA sequences in the GenBank DNA sequence databases automatically translated into protein sequences and placed in the GenPept protein sequence database). To compare a protein or predicted protein sequence to EST sequences, the ESTs should be translated into all six possible reading frames. (ii) Such a search can be carried out by choosing PDB as the database option. This will limit the blastx search to only protein sequences which have known 3D structures in the PDB.
58BLAST
- Carrying out a BLAST search
- Retrieve the sequence provided and use it for your BLAST search.
- See the GenBank page here for the sequence. Change the format of the view to FASTA by selecting FASTA from the dropdown menu marked Display (see here). Use this sequence for a BLAST search.
- Questions
- Identify the sequence which is used. What is this DNA usually used for? (Answer)
- Search for suitable primers to use for PCR. Which program can you use? (Answer)
- Identify restriction sites which can be found on this DNA. How many fragments will a digestion with the restriction enzyme BsaI generate? In order to answer this question, you will need to draw on any general web skills you already have to find the appropriate resources. BLAST is not the tool to use in such a case. (Answer)
59BLAST
- Carrying out a BLAST search
- Questions
- Identify the sequence which is used. What is this DNA usually used for?
- Answer: pBR322 plasmid. It is used as a cloning vector for protein (IG-lambda) expression.
- Search for suitable primers to use for PCR. Which program can you use? What is the largest product size from a possible primer pair found using a default search?
- Answer: The Primer-BLAST program can be used. The largest possible product is 986 bp.
- Identify restriction sites which can be found on this DNA. How many fragments will a digestion with the restriction enzyme BsaI generate?
- Answer: One tool which can be used is NEBcutter. Cutting the pBR322 sequence with BsaI will generate 3 fragments of DNA due to cleavage at 2 sites in the sequence.
60SNPs
- A SNP (pronounced "snip") is a DNA sequence variation which occurs when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). SNPs comprise the largest known class of human genetic variation.
- SNPs may occur
- within coding sequences of genes,
- in non-coding regions of genes, or
- in the intergenic regions between genes.
- SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code (refer to the codon table discussed earlier); such changes result in silent (synonymous) mutations.
- Non-synonymous changes can result in
- a missense change → a different amino acid is coded
- a nonsense change → a premature STOP codon
- Why are SNPs important? If the changes result in non-functional gene products or no gene products, a diseased state may be the end result.
- How can we find SNPs? Of the methods for discovering SNPs in sequence data, the easiest and most used is to align two sequences from the DNA of two individuals and look for high-quality sequence differences.
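The three outcomes of a coding-sequence SNP (synonymous, missense, nonsense) can be illustrated with a short script that compares two aligned coding sequences codon by codon. It is a sketch under stated assumptions: both sequences are the same length, already aligned, in frame, and differ by substitutions only; the function name `classify_snps` and the two example sequences are made up.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code; "*" marks a stop codon
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snps(reference, variant):
    """Return (codon number, reference codon, variant codon, kind) for each changed codon."""
    results = []
    for i in range(0, len(reference) - 2, 3):
        ref_codon, var_codon = reference[i:i + 3], variant[i:i + 3]
        if ref_codon == var_codon:
            continue
        ref_aa, var_aa = CODON_TABLE[ref_codon], CODON_TABLE[var_codon]
        if ref_aa == var_aa:
            kind = "synonymous"   # silent: same amino acid
        elif var_aa == "*":
            kind = "nonsense"     # premature STOP codon
        else:
            kind = "missense"     # different amino acid
        results.append((i // 3 + 1, ref_codon, var_codon, kind))
    return results


reference = "ATGGCTAAATGGGAA"  # M A K W E
variant   = "ATGGCCAAATGAGAT"  # three single-base differences
for change in classify_snps(reference, variant):
    print(change)
```

With these inputs the script reports a synonymous change in codon 2 (GCT to GCC), a nonsense change in codon 4 (TGG to TGA) and a missense change in codon 5 (GAA to GAT).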
61BLAST
- Carrying out a BLAST search
- Select and copy the sequence from this link.
- Go to the BLAST page and carry out a search for SNPs on the above sequence.
- Observe the output. How is it different from previous BLAST searches you have carried out? Correlate the output to what you know about SNPs.
62Ribonucleic Acids
- RNA molecules play crucial roles in molecular biology.
- Known functions include
- Information storage
- Catalysis
- Regulatory roles
- Protein synthesis
- Diversity of functions is associated with the RNA World hypothesis
- Potential applications
- Molecular scaffolding (nanotechnology)
- Drug targets (riboswitches/ribosomes)
- RNA interference (RNAi)
The Economist, June 16th-22nd 2007
63RNA From Sequence to Function
- What is a crucial determinant of functionality
for functional RNAs?
64RNA From Sequence to Function
- What is a crucial determinant of functionality for functional RNAs?
- For functional RNAs, as for proteins, the 3D structure is crucial for biological function.
65RNA Structure
- What are the major factors involved in stabilizing the structure of RNA?
- Base stacking and hydrogen bonding contribute to the stabilization of nucleic acid (RNA) structure.
- RNA bases can form hydrogen bonds with each other, resulting in
- complementary pairings in the canonical Watson-Crick interactions
- non-canonical interactions
- Hydrogen-bonded base interactions are therefore crucial elements of a nucleic acid's 3D structure.
66RNA Base Interactions
32 pairs
eg. Purine-pyrimidine base pairs (10)
after I. Tinoco, Jr. In Appendix 1 of The RNA
World (R. F. Gesteland, J. F. Atkins, Eds.),
Cold Spring Harbor Laboratory Press, 1993, pp.
603-607.
67RNA Structure
- Base stacking and hydrogen bonding contribute to the stabilization of nucleic acid (RNA) structure.
- RNA bases can form hydrogen bonds with each other, resulting in
- complementary pairings in the canonical Watson-Crick interactions
- non-canonical interactions
- Hydrogen-bonded base interactions are therefore crucial elements of a nucleic acid's 3D structure.
- 3 levels of RNA structure
- Primary sequence, secondary structure, tertiary structure.
Canonical: related to the Arabic word "Qanun" (itself from the Greek "kanon"), which in the context here is better rendered as "rule" than by its literal meaning of "law".
68RNA Structure
- How do we get from sequence to structure?
- How can we predict the structure of RNA?
-
-
69RNA Structure
- How do we get from sequence to structure?
- Complex (non-helical) RNA structures are not easy to predict. Reliable structural information is sourced from X-ray crystal structures.
- Commonly, only the secondary-structure-level interactions are predicted, to give some insight into what the functional structure may look like.
- However, such methods lack the detail which an actual structure model is able to give, such as the exact orientation of bases and the specific atomic interactions which are occurring.
- Such interaction data is important because we know that RNA bases can be involved in non-canonical interactions which are different from the canonical Watson-Crick interactions.
- How can we predict the secondary structure of RNA?
- Several programs which calculate the thermodynamics of folding (energies of the base interactions) can be used.
- One such program is mfold by Michael Zuker.
- Assessment of reliability can be done using multiple alignments and comparisons to other predictions and known structures.
70RNA Secondary Structure Prediction the mfold
program
- Predicting the secondary structure of non-coding RNA
- Copy the sequence here as input for the mfold program. All other parameters can be left at default settings.
- Questions
- How many paired bases are you able to observe in the predicted structure? (Answer)
- How many bases are unpaired? (Answer)
- Name the two types of structures where these unpaired bases can be found. What type of secondary structure do you think can be observed for regions with canonical Watson-Crick base pairing? (Answer)
- Are you able to observe any base pairings which are non-canonical (non-Watson-Crick)? If yes, how many? (Answer)
- Having answered the previous two questions, are you really able to differentiate a canonical from a non-canonical pairing from the secondary structure diagram alone? (Answer)
71RNA Secondary Structure Prediction the mfold
program
- Predicting the secondary structure of non-coding RNA
- How many paired bases are you able to observe in the predicted structure?
- Answer: 29 pairs, 58 paired bases.
- How many bases are unpaired?
- Answer: 27
- Name the two types of structures where these unpaired bases can be found. What type of secondary structure do you think can be observed for regions with canonical Watson-Crick base pairing?
- Answer: Unpaired bases are found in bulges and loops. Regions with canonical pairings as in Watson-Crick are most likely helical.
- Are you able to observe any base pairings which are non-canonical (non-Watson-Crick)? If yes, how many?
- Answer: 4
- Having answered the previous two questions, are you really able to differentiate a canonical from a non-canonical pairing from the secondary structure diagram alone?
- Answer: No, not really. Although a GU base pair is obviously non-canonical, GC and AU base pairs which may possibly be non-canonical cannot be determined from the secondary structure alone.
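The counting done by hand in this practical can be sketched in code if the predicted structure is available as a dot-bracket string, where matching brackets mark paired bases and dots mark unpaired ones. That input format is an assumption made for this example, as are the function names and the made-up hairpin; the sequence used in the practical is not reproduced here.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def base_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string such as '((..))'."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def summarize(sequence, structure):
    """Count Watson-Crick pairs, G-U pairs, other pairs and unpaired bases."""
    counts = {"watson_crick": 0, "gu": 0, "other": 0}
    pairs = base_pairs(structure)
    for i, j in pairs:
        pair = (sequence[i], sequence[j])
        if pair in WATSON_CRICK:
            counts["watson_crick"] += 1
        elif pair in WOBBLE:
            counts["gu"] += 1
        else:
            counts["other"] += 1
    counts["unpaired"] = len(sequence) - 2 * len(pairs)
    return counts


# Made-up hairpin: a 4-pair stem closing a 4-base loop
sequence = "GGCUGAAAGGUC"
structure = "((((....))))"
print(summarize(sequence, structure))
```

For this hairpin the summary is 2 Watson-Crick pairs, 2 G-U pairs and 4 unpaired bases. As the last answer points out, the letters alone cannot show whether a GC or AU pair actually adopts a non-Watson-Crick geometry, so the "watson_crick" count is only what the sequence suggests.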
72Analyses for RNA sequence data
- Is predicting the secondary structure the only analysis we can do for RNA sequence data?
73Analyses for RNA sequence data
- Is predicting the secondary structure the only analysis we can do for RNA sequence data?
- NO.
- Genomic data can be analysed for the presence of the numerous types of known non-coding or functional RNA, as well as for possibly novel or yet to be discovered functional RNA sequences.
- This is appreciably more difficult than the problem of predicting genes. Why?
- Currently there are no widely used or general-use methods.
- Such investigations are still highly exploratory and currently remain in the domain of experts in the field.
74Post-session Questions
- What are nucleic acids?
- What types of nucleic acids are there?
- What functions do nucleic acids have?
- What sort of information do nucleotide sequences carry?
- What can be done with DNA sequences?
- What can be done with RNA sequences?
- Is molecular structure important for RNA sequences?
- What is a sequence alignment?
- What is the relationship of an alignment with regard to biological function?
- Is extracting the encoded information for protein synthesis the only sequence analysis which can be done?
75Self Study and Self Assessment
- The self-study module for this series of lectures on analyses of nucleotide sequences is available for download from SPIN. The format of the file (this file) is PowerPoint show (.pps).
- The self-assessment quiz is accessible from within the SPIN interface.
- Both these materials are for self-assessment and self-study use and do NOT contribute to your final grades for this course.
- Also explore the references and texts listed in the course information file and reading list.
- Explore resources made available via the self-study material.
76Further Reading
- Recommended Textbook (Lesk, 2nd Ed.)
- Basics: Chapter 1
- Pages 1-59
- Sequence alignments: Chapter 5, Chapter 1
- Pages 242-270
- Pages 21-59
- Other Textbooks
- Baxevanis & Ouellette, 3rd edition
- Chapters 5-7
- Pevsner