Title: Bioinformatics: overview
1Bioinformatics overview
- Handling a computer
- Opening and saving of files
- Starting programs
- Navigating the WWW
- FTP
- Browsing
- Sequence data
- Primary data
- Sequence formats
2Bioinformatics overview (2)
- Databases
- Entrez
- SRS
- Manipulation of DNA sequences
- Restriction analysis
- in silico cloning
- Translation of nt sequence into protein
- PCR
- Primer design
3Bioinformatics overview (3)
- Comparison of two sequences
- Dot matrix
- Pairwise alignment
- Multiple alignments
- Database searches for similar sequences
- FASTA
- BLAST
4Bioinformatics overview (4)
- Sequence annotation
- Intron/Exon prediction
- Identification of conserved motifs
- Identification of regulatory sequences
- Organismal databases
- D. melanogaster
- A. thaliana
- Expression profiling (chips)
5Copy and Paste
To transfer text or sequence files into programs, use copy/paste
- Ctrl+C to copy, Ctrl+V to paste
6Start a program
- Double click on the program you wish to open
- Microsoft Word can be found under
- Start -> Programs -> Microsoft Word
7Create a Folder
- Start -> Programs -> Windows Explorer (click)
- Desktop (click)
- File -> New -> Folder (click)
- The new folder will appear on the screen
- Rename "New Folder" to EDV
- Click once on the icon and a second time on the text field to activate the editor, then type EDV
- Save all your future files into this folder
8Navigating the Internet
- File Transfer Protocol (FTP)
- Allows a person or computer to retrieve files from, and send files to, another computer. Only copies are moved; the original file remains untouched.
- The network terminal protocol (TELNET)
- Allows a user to log in to any other computer on the network, turning the local computer into a terminal.
9Navigating the Internet
- Every computer on the internet has its own unique IP address
- e.g. 193.171.103.86
- Because these numbers are not intuitive, they are often converted into a name
- i122server.vu-wien.ac.at addresses the same computer
- Subdirectories on this computer can be specified
- i122server.vu-wien.ac.at/edv/start.html lies in a folder that has been prepared for this course
10Navigating the Internet
- For internet access you need either a modem or a direct line
- Once you are connected to one server, you can access the full internet
- For most purposes an internet browser is sufficient
- Internet Explorer
- Netscape Navigator
- Omniweb
11Getting started
- Open your web browser
- Type in the address http://i122server.vu-wien.ac.at/edv/start.html
- Press return
- You could make a bookmark of this page
12Links
- Rather than typing a new address each time, it is possible to click on specially marked text or symbols
- After a single click you will connect to the address
- To view the address before connecting, simply move your mouse over the link
13Assignment 1
- Open Netscape Communicator and go to http://i122server.vu-wien.ac.at/edv/start.html
- Create a new folder on the desktop
- Open Microsoft Word and type in a random DNA sequence; save this sequence as a text-only file in your folder.
- Software: Windows Explorer and Word
14Sequence data
- Automated DNA sequencing relies heavily on the support of computer algorithms
- Data collection
15Sequence data
- The use of 4 different dyes requires intensive computer calculations to extract sequence information
- Electropherograms
16Sequence formats
- While electropherograms are useful during the sequencing project, sequences are stored as text after its completion.
- Plain text contains only the sequence information
- FASTA
17Other sequence formats
18Manipulation of DNA Sequences I
Restriction endonucleases
- Sticky ends: XhoI (c/tcgag), PstI (ctgca/g), ...
- Blunt ends: SmaI (ccc/ggg), DraI (ttt/aaa), ...
- Rare cutters: large recognition sites
- Frequent cutters: small recognition sites
- Multiple cloning site
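The recognition sites listed above can be located in a sequence with a plain string search. The following is a minimal sketch, not the method used by any particular tool: the function name `find_cut_sites` and the test sequence are made up for illustration, and only the four enzymes named on the slide are included. Because all four sites are palindromic, scanning one strand is sufficient.

```python
# Recognition sites copied from the slide; "/" marks the cut position.
ENZYMES = {
    "XhoI": "c/tcgag",
    "PstI": "ctgca/g",
    "SmaI": "ccc/ggg",
    "DraI": "ttt/aaa",
}

def find_cut_sites(sequence, enzymes=ENZYMES):
    """Return {enzyme: [index of the first base after each cut]} (0-based)."""
    seq = sequence.lower()
    hits = {}
    for name, pattern in enzymes.items():
        site = pattern.replace("/", "")   # recognition sequence without the cut mark
        offset = pattern.index("/")       # number of bases before the cut
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + offset)
            start = seq.find(site, start + 1)  # allow overlapping sites
        hits[name] = positions
    return hits

# Hypothetical test sequence containing one XhoI and one SmaI site.
print(find_cut_sites("aactcgagttcccgggaa"))
```

An empty list means the enzyme does not cut, which is often what one wants for an enzyme used in a cloning strategy.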
19Manipulation of DNA Sequences II
20Assignment 2
1. Open the file pSKII.doc and try to find the sequences for the sequencing primers M13-forward (5' gtaaaacgacggccagt 3') and M13-reverse (5' ggaaacagctatgaccatg 3') as well as the RNA polymerase promoters T3 (5' aattaaccctcactaaaggg 3') and T7 (5' gtaatacgactcactatagggc 3').
2. Which primers are homologous to the single-stranded SKII sequence and which are complementary?
- Software: Word, JaMBW (Reverse, Complement, Inverse)
- Download: pSKII.doc
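Whether a primer reads the same as the given strand or is complementary to it can be checked by searching for the primer and for its reverse complement. This is a small sketch under the assumption that the vector sequence is available as a plain string; `primer_orientation` and the toy strands are hypothetical names and data, not part of the course material.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq):
    """Complement every base, then reverse, so the result again reads 5' -> 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def primer_orientation(primer, strand):
    """Classify a primer relative to one strand of the vector."""
    p, s = primer.lower(), strand.lower()
    if p in s:
        return "identical"        # primer reads the same as the given strand
    if reverse_complement(p) in s:
        return "complementary"    # primer anneals to the given strand
    return "absent"

m13_forward = "gtaaaacgacggccagt"
print(reverse_complement(m13_forward))
# Toy strands, made up for illustration only.
print(primer_orientation(m13_forward, "ccc" + m13_forward + "ggg"))
print(primer_orientation(m13_forward, "ccc" + reverse_complement(m13_forward)))
```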
21Characteristics of cloning vectors I
- Multiple cloning site
- Region for universal (M13 /-) sequencing primers
- RNA polymerase (T7, T3, SP6) promoters
- Genes for selection
22Watson-Crick DNA strands
The upper strand of the dsDNA is called W (Watson), for forward, and the lower strand C (Crick), for reverse.
- The C strand is the reverse complement of the W strand
- C is antisense to W
23DNA and RNA Polymerases
- DNA polymerases need short primers to start DNA synthesis
- RNA polymerases need short promoters
- Polymerases synthesize DNA/RNA only in the 5' -> 3' direction
- If an open reading frame (ORF) is encoded by the W strand, the C strand codes for the antisense gene
- The C strand can also code for ORFs; then the W strand codes for the antisense gene
27Assignment 3
You received a cDNA clone and the sequence of the insert (prc1edvkurs.doc) from your colleague. He told you that the start codon is the "atg" at position 79. For synthesis of an antisense RNA to be used as a Northern blot probe, you have to subclone the insert into another vector. The vector you have in the lab is Bluescript (pSKII.doc). Bluescript contains a multiple cloning site flanked by sequences for the sequencing primers M13-forward and M13-reverse and the RNA polymerase promoters T3 and T7.
a. Find the multiple cloning site in the vector.
b. Find the best cloning strategy using only one restriction enzyme.
c. Use directed cloning to ensure that all clones can be used to produce an antisense probe with the RNA polymerase T3.
d. Define a strategy to modify the clone of 3c for use with T7 RNA polymerase. Which enzymes would you use?
- Software: Word and Webcutter
- Download: prc1edvkurs.doc
28Manipulation of DNA Sequences III
Polymerase chain reaction (PCR): http://bibiserv.techfak.uni-bielefeld.de/sadr/pcrtutor.html
29Manipulation of DNA Sequences IV
Primer design
- Size: between 19 and 25 bases
- Melting temperature: between 48 °C and 60 °C
- $T_m(\text{forward}) \approx T_m(\text{reverse})$
- $T_m = 2(A + T) + 4(G + C)$
- Minimum of G/Cs: 9 - 11 (40 - 50 %)
- Distance between primer pairs: 10 bp - 40 kb
- Annealing sites: unique (3' end)
- Avoid: mispriming, primer-primer interaction, hairpin structures
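The melting-temperature rule on this slide (2 degrees per A or T, 4 degrees per G or C) is easy to evaluate by hand or in a few lines of code. The sketch below applies it to the M13-forward primer from Assignment 2; the function names are illustrative, and the rule is only a rough estimate intended for short oligonucleotides.

```python
def melting_temperature(primer):
    """Tm estimate in degrees C: 2 x (A + T) + 4 x (G + C)."""
    p = primer.lower()
    at = p.count("a") + p.count("t")
    gc = p.count("g") + p.count("c")
    return 2 * at + 4 * gc

def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    p = primer.lower()
    return 100.0 * (p.count("g") + p.count("c")) / len(p)

m13_forward = "gtaaaacgacggccagt"
print(melting_temperature(m13_forward))   # 8 A/T and 9 G/C bases
print(round(gc_percent(m13_forward), 1))
```

Comparing the values of a forward and a reverse primer in this way gives a quick check that the two melting temperatures are similar.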
32Assignment 4
- Microsatellites are highly polymorphic markers, which are extensively used for paternity testing, genome walking, provenance studies and analysis of population structures.
- They consist of tandemly repeated simple sequences of di-, tri- and tetranucleotides such as (AT)n, (CT)n, (CA)n, (GA)n, (GT)n or (CCT)n, ...
- Their length variation results from DNA slippage, a mechanism that increases and decreases their repeat number. The repeats are flanked by unique sequences, which allow specific primers to be designed for the amplification of the microsatellite.
- Please design primer pairs for the amplification of a microsatellite using the following criteria
- product length 100 - 300 bp
- annealing temperature higher than 55 °C
- primer length between 20 and 24 bp
- Software: Word and Primer3
- Download: microsatellite.doc
33The importance of centralized databanks
34EMBL Databank
35EMBL SRS
36Entrez
- is a search and retrieval system that integrates
information from databases at NCBI
38PopSet-prealigned multiple data sets
39Taxonomy Browser
40Online Mendelian Inheritance in Man
- This database is a catalog of human genes and
genetic disorders. The database contains textual
information and references. It also contains
copious links to MEDLINE and sequence records in
Entrez
41Objectives
- What is the function of this gene?
- Do other genes have this functional motif?
- Can I predict the higher order structure of this protein?
- Is this gene a member of a known gene family?
- Do other organisms have this gene?
42General Database Search Issues
- Search using the amino acid sequence if possible
- Why? Protein evolution is slower than DNA sequence evolution
- Ask the program to translate your query sequence in all 6 possible reading frames.
- Statistical theory is based on unrealistic assumptions: consider searches as exploratory analyses.
43Similarity Search Jargon
A similarity search of a database is performed by
aligning a query sequence to each sequence in the
database. If good matches are found, the search
returns a list of HSPs (High-scoring Segment Pairs).
44Alignment Jargon
Evolutionarily related sequences differ from one another because of several processes
- Substitutions
- Insertions
- Deletions
(Figure: an ancestor sequence and the observed sequences derived from it)
45Alignment Jargon
Substitution (between A and G)
GCG
ACG
46Alignment Jargon
Insertion (of T)
ATCG
A-CG
- 0 mismatches
- 3 matches
- 1 gap
47Alignment Jargon
Deletion (of T)
ATCG
A-CG
- 0 mismatches
- 3 matches
- 1 gap
48Alignment Jargon
Results of insertion and deletion events can be indistinguishable. Indel = INsertion or DELetion
49Sequence Alignment
- Sequence alignment is simply the optimal assignment of substitution and indel events to a pair of sequences.
- Global alignment: align entire sequences
- Local alignment: find the best matching regions of sequences
50Alignment of pairs of sequences
- Dot matrix analysis
- Dynamic programming
- Word (k-tuple) methods
51Dot matrix
- Sequence A is compared against B
- Matching bases are marked on an A x B grid
54Dot matrix
- The background can be adjusted by changing the window size
- Example: phage lambda and P22 repressor proteins
- Settings shown: 1/1, 7/11, 15/23
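A dot matrix with an adjustable window can be sketched in a few lines. The code below is an illustration rather than the algorithm of Dotlet or any other program: a dot is placed at (i, j) when at least `stringency` of the `window` positions starting at A[i] and B[j] match. Reading a setting such as 7/11 as "7 matches within a window of 11" is an assumption about the slide's notation.

```python
def dot_matrix(a, b, window=1, stringency=1):
    """Return the set of (i, j) window start positions that pass the filter."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots

def render(a, b, dots):
    """Draw the grid as text: sequence A across the top, B down the side."""
    lines = ["  " + a]
    for j, residue in enumerate(b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

seq_a = "GATTACAGATTACA"   # made-up sequence with an internal repeat
print(render(seq_a, seq_a, dot_matrix(seq_a, seq_a, window=3, stringency=3)))
```

With window 1 every chance identity produces a dot; larger windows suppress this background and leave the diagonals that indicate real similarity or repeats.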
55Dot matrix
- Search for conserved regions and domains
- Identify repeated nucleic acid and protein domains
- Determine introns and exons
- Find inverted repeats and stem-loop structures
- Regions of low complexity
- Frameshifts
57Assignments 5 and 6
- You isolated a cDNA clone (PlecDNA.doc) and you would like to know how many introns are in the gene. Fortunately you are working with a fully sequenced organism, so it is easy to retrieve the full genomic region (Plegenomic.doc).
- a) How many introns does the gene contain?
- b) What are the sequences (10 bp) around introns 2 and 3 and the corresponding exon borders?
- Software: Word and Dotlet
- Download: PlecDNA.doc, Plegenomic.doc
58Assignment 6
- The previous analysis showed that some useful interpretations of DNA sequences can be made with the dot matrix program. You have recently isolated a genomic fragment (Test.doc) and, encouraged by the former results, you analyze it with the dot matrix program.
- How can you explain the pattern you see in the dot matrix?
- Delete an internal portion of the sequence and compare the full versus the deleted sequence.
- What is the pattern on the dot matrix?
- Software: Word and Dotlet
- Download: Test.doc
59Dynamic programming
- The dynamic programming algorithm provides a reliable computational method for aligning sequences
- The method has been proven mathematically to yield the optimal alignment (note: there may be more than a single optimal alignment)
- Both local and global alignments can be produced
60Problem of alignment
- Roughly n x m comparisons need to be made for two sequences of length n and m.
- If the alignment is to include gaps of any length at any position in either sequence, the number of comparisons that must be made becomes astronomical
- Dynamic programming is a method of sequence alignment that can take gaps into account but requires only a moderate number of comparisons
61The algorithm
Sequences: VDSLCY (rows) and VDSCY (columns)

Scoring matrix
     V   D   S   C   Y
V    4  -3  -2  -1  -1
D   -3   6   0  -3  -3
S   -2   0   4  -1  -2
L    1  -4   0  -2  -1
C   -1  -3  -1   9  -2
Y   -1  -3  -2  -2   7

Alignment and column scores
V  V     4
D  D     6
S  S     4
L  -   -11
C  C     9
Y  Y     7
Total   19
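The example on this slide can be reproduced with a small global alignment by dynamic programming. The sketch below uses the substitution scores from the slide and assumes a fixed penalty of -11 for every gapped position, which is the value shown for the gap column; it is not meant as the exact procedure of any alignment program. With these scores the column values 4 + 6 + 4 - 11 + 9 + 7 add up to 19.

```python
# Substitution scores copied from the slide (rows: V D S L C Y, columns: V D S C Y).
ROWS, COLS = "VDSLCY", "VDSCY"
TABLE = [
    [4, -3, -2, -1, -1],
    [-3, 6, 0, -3, -3],
    [-2, 0, 4, -1, -2],
    [1, -4, 0, -2, -1],
    [-1, -3, -1, 9, -2],
    [-1, -3, -2, -2, 7],
]
SCORE = {(r, c): TABLE[i][j] for i, r in enumerate(ROWS) for j, c in enumerate(COLS)}
GAP = -11  # assumed penalty per gapped position

def global_align(a, b):
    """Return (score, aligned a, aligned b) for the best global alignment."""
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * GAP
    for j in range(1, m + 1):
        f[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(f[i - 1][j - 1] + SCORE[a[i - 1], b[j - 1]],
                          f[i - 1][j] + GAP,
                          f[i][j - 1] + GAP)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + SCORE[a[i - 1], b[j - 1]]:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + GAP:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return f[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(global_align("VDSLCY", "VDSCY"))
```

The table `f` has (n + 1) x (m + 1) cells, which is the "moderate number of comparisons" that makes the approach practical.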
62A sub-optimal alignment
Sequences and scoring matrix as on slide 61 (VDSLCY against VDSCY)

Alignment and column scores
V  V     4
D  D     6
S  S     4
L  C    -2
C  Y    -2
Y  -   -11
Total   -1
63Measuring Alignment Quality(subjective criteria)
- Good alignments should have
- many exact matches
- few mismatches
- many of the mismatches should be similar residues
- few gaps
64Measuring Alignment Quality(objective criteria)
- What is the expected number of HSPs with a score of at least S?
- $E = K m n e^{-\lambda S}$
- K: constant dependent on the nucleotide frequencies
- m, n: lengths of the sequences
- $\lambda = \log_e(1/p)$, where p is the probability of a match of identical bases (1/4 for equal base frequencies)
65Measuring Alignment Quality(objective criteria)
Bit scores: Raw scores have little meaning without detailed knowledge of the scoring system used. By normalizing a raw score S using $S' = (\lambda S - \ln K) / \ln 2$, one attains a bit score S'.
67Measuring Alignment Quality(objective criteria)
Bit scores: The E value corresponding to a given bit score S' is $E = m n 2^{-S'}$. Bit scores subsume the statistical essence of the scoring system; hence, to calculate significance one needs to know only the size of the search space.
68Measuring Alignment Quality(objective criteria)
- Significance of an HSP score
- $P(S > x) = 1 - \exp(-K m n e^{-\lambda x})$
- $P(S > x) = 1 - \exp(-E)$
- m, n: effective lengths of the query and databank sequences
- E: number of expected HSPs with a score of at least x
69Measuring Alignment Quality(objective criteria)
- Significance
- Some programs provide E-values rather than P-values, as E is easier to understand, e.g. E-values of 5 and 10 correspond to P-values of 0.993 and 0.99995
- P-values and E-values are related: e.g. if one expects to find 3 HSPs with a score > S, the probability of finding at least one is 0.95
- When E < 0.01, P-values and E-values are nearly identical
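The relations between raw score, bit score, E-value and P-value can be checked numerically. The following sketch uses the standard Karlin-Altschul expressions; the values chosen for K, lambda, m and n in the example call are arbitrary illustrations, not parameters of any real search.

```python
import math

def expected_hsps(raw_score, m, n, k, lam):
    """E = K m n exp(-lambda S): expected number of HSPs scoring at least S."""
    return k * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, k, lam):
    """S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def evalue_from_bits(bits, m, n):
    """E = m n 2^(-S'): only the search space size is needed."""
    return m * n * 2 ** (-bits)

def p_value(e):
    """P = 1 - exp(-E): probability of at least one HSP with such a score."""
    return 1 - math.exp(-e)

# Hypothetical parameters, for illustration only.
K, LAM, M, N = 0.13, 0.32, 300, 1_000_000
e = expected_hsps(50, M, N, K, LAM)
print(e, evalue_from_bits(bit_score(50, K, LAM), M, N))  # the two routes agree
for e_val in (0.001, 3, 5, 10):
    print(e_val, round(p_value(e_val), 5))
```

For small E the exponential is nearly linear, which is why P and E are almost equal below about 0.01.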
70Scoring matrices
- Rationale
- Certain aa replacements occur often in a protein. Because proteins still function despite these changes, the substituted aa are compatible with structure and function. Other substitutions are rare.
- A scoring matrix accounts for these differences
71Scoring matrices
- Dayhoff, 1978
- PAM (point accepted mutation) matrices
- Henikoff & Henikoff, 1992
- BLOSUM (blocks amino acid substitution matrices)
72PAM matrices
- This family of matrices lists the likelihood of change from one aa to another in homologous proteins during evolution
- Each matrix gives the changes expected for a given period of evolutionary time
- Assumptions
- Each change in the current aa is independent of previous mutation events at that site.
- aa changes observed in short evolutionary times can be extrapolated to longer periods
73PAM matrices
- aa substitutions that occur in a group of evolving proteins were estimated.
- Because these changes are observed in closely related proteins, they represent aa substitutions that do not change the function of the protein -> accepted mutations
- 1572 changes in 71 groups of protein sequences were observed
- The number of changes at each aa was counted on a phylogenetic tree
- and divided by the exposure to mutation (aa frequency x number of aa in that group) -> PAM1
- Asn, Ser, Asp, Glu (highly mutable); Cys, Trp (least mutable)
74PAM matrices
- The PAM1 matrix gives the probability of a single change
- To obtain PAM matrices for N mutations, the PAM1 matrix is multiplied by itself N times
- PAM250 represents a level of 250 % change (corresponds to 20 % similarity)
- Computer simulations have shown that PAM250 provides a better-scoring alignment than lower-numbered PAMs for distantly related (14-27 % similarity) proteins.
75PAM log odds score
- PAM matrices are usually converted into log-odds matrices
- The odds are the ratio of the hypothesis that the change represents an authentic evolutionary variation to the hypothesis that the change occurred because of random sequence variation (no biological significance)
- Example: Phe -> Tyr
- Phe-Tyr score in PAM250: 0.15
- Frequency of Phe in the data: 0.04
- Log-odds score: $10 \times \log_{10}(0.15 / 0.04) = 5.7$
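The Phe/Tyr example can be recomputed directly. This is a short sketch of the conversion with the scale factor 10 and base-10 logarithm; the function name `log_odds` is illustrative.

```python
import math

def log_odds(change_probability, background_frequency, scale=10):
    """Scaled log10 of the ratio (observed change) / (change expected by chance)."""
    return scale * math.log10(change_probability / background_frequency)

# Values from the slide: Phe-Tyr score in PAM250 and frequency of Phe in the data.
print(round(log_odds(0.15, 0.04), 1))
```

A positive value means the replacement is seen more often than expected by chance; a negative value means it is seen less often.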
76PAM250
77BLOSUM matrices
- 500 families of related proteins
- Search for ungapped aa blocks that were present
78Gap scores
- The cost of introducing a gap must be higher than the cost of extending it
- $W_x = g + r x$
- g: gap opening penalty
- x: length of the gap
- r: gap extension penalty
80Assignment 7
- You have obtained a peptide sequence (ASFPCLNGGTCNDQVNGYVCVCAQDTSVSTCET) and would like to find its position in the full-length protein.
- Software: Word and Blast2 Sequences
- Download: UEGF1.doc
81Multiple alignments
- Problem
- Alignment of
- two sequences (length N): $N^2$ comparisons
- 300 aa: $9 \times 10^4$
- three sequences (length N): $N^3$ comparisons
- 300 aa: $2.7 \times 10^7$
- -> exact multiple alignments are not feasible for most data sets; heuristic methods are required
82Progressive methods for multiple alignment
- PILEUP
- Part of the GCG package
- CLUSTALW
- Available as local programs (Mac, PC, Unix)
- Could be also run on remote computers
83Progressive alignment algorithm
- Produce a global pairwise alignment for all pairs of sequences
- Full dynamic programming
- K-tuple approach, similar to FASTA
- Calculate the pairwise alignment scores
- Build a tree based on the genetic distances derived from the alignment scores (NJ)
- Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree
84Progressive alignment weighting
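- Problem: similar sequences will bias the alignment
- Solution: weighting of sequences based on alignment scores
- Example tree: A (branch 0.2) and B (branch 0.1) share an internal branch of 0.3; C has a branch of 0.5
- Weight of A = 0.2 + 0.3/2 = 0.35
- Weight of B = 0.1 + 0.3/2 = 0.25
- Weight of C = 0.5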
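The weights in this example follow a simple rule: each sequence receives its own branch length plus an equal share of every branch it has in common with other sequences. The sketch below applies that rule to the three-sequence tree; the nested-tuple tree encoding and the function names are assumptions made for illustration, and the topology is read off the numbers on the slide.

```python
def count_leaves(node):
    """A node is (branch_length, name) for a leaf or (branch_length, [children])."""
    _, payload = node
    if isinstance(payload, str):
        return 1
    return sum(count_leaves(child) for child in payload)

def leaf_weights(node, inherited=0.0):
    """Share each branch length equally among the sequences below it."""
    length, payload = node
    share = inherited + length / count_leaves(node)
    if isinstance(payload, str):
        return {payload: share}
    weights = {}
    for child in payload:
        weights.update(leaf_weights(child, share))
    return weights

# Tree from the example: A and B share a branch of 0.3, C branches off alone.
TREE = (0.0, [(0.3, [(0.2, "A"), (0.1, "B")]), (0.5, "C")])
print(leaf_weights(TREE))
```

Closely related sequences such as A and B split their common branch, so together they do not dominate the alignment.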
85Progressive alignment problems
- Dependence on the initial pairwise alignments
- No problem for closely related sequences
- The more diverged the sequences are, the more problematic the alignment is
- Choice of suitable scoring matrices and gap penalties that apply to the entire set of sequences
- -> Bayesian methods such as hidden Markov models (HMMs) may be preferable for distantly related sequences
86Single sequence queries
- Rationale: a single sequence is searched against a database to identify the sequences that are most similar
- Identification of a related gene in another organism
- Identification of a related gene in the same organism
- Similarity may provide clues about function
87Data banks
- Genomic sequences
- Complete genomes
- cDNA/proteins
- ESTs (expressed sequence tags)
88FASTA BLAST rationale
- Main idea: good alignments are expected to share several aa. Hence, consecutive shared aa (words, k-tuples) can serve as an indicator of quality.
- Observation: HSPs of interest are usually longer than a single word, so look for multiple hits on the same diagonal, separated by a short distance
89FASTA
- FASTA3 is the latest version, with increased ability to detect distantly related sequences
- Input
- k: size of matching sequence patterns or words, called k-tuples
- Similarity matrix
- Compares the query sequence pairwise with each sequence in the database
90FASTA hashing algorithm
- Search for k consecutive matches
- Use a precompiled table that lists where in the database each possible word occurs
- Generating the table takes time of order L (size of the databank)
- Using it takes time of order N (size of the query sequence)
91FASTA hashing algorithm
92FASTA algorithm
- Hashing: build a library of words of k consecutive residues and search the database as represented by this library
- Note: it is not the database that is searched, but the library
- DNA: k = 4-6; protein: k = 1-2
- Longer words result in a faster, but less sensitive search
- Joining: matches within a certain distance of each other are joined, along with the region between them, into a longer matching region without gaps.
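The hashing step can be illustrated with a lookup table that maps every k-tuple of a database sequence to the positions where it occurs. The sketch below is a simplified illustration, not the FASTA implementation: word hits are grouped by diagonal (subject position minus query position), so that several hits on the same diagonal point to a candidate region. Function names and the example sequences are made up.

```python
from collections import defaultdict

def build_word_table(sequence, k):
    """Map each k-tuple to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table

def diagonal_hits(query, subject, k):
    """Count matching k-tuples per diagonal (subject position - query position)."""
    table = build_word_table(subject, k)      # built once, cost grows with len(subject)
    diagonals = defaultdict(int)
    for q in range(len(query) - k + 1):       # lookup cost grows with len(query)
        for s in table.get(query[q:q + k], ()):
            diagonals[s - q] += 1
    return dict(diagonals)

print(diagonal_hits("ACGT", "TTACGT", 2))
```

Raising k makes the table lookups rarer and the search faster, at the price of missing regions that share no exact word of that length.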
93FASTA algorithm
- Filtering: the 10 best matching regions are rescored using a scoring matrix (BLOSUM or PAM)
- Ends of the regions are trimmed to remove residues not contributing to the score
- The best scoring region is reported as INIT1
- Joining: regions that are near enough are joined. The score of this larger region, including penalties for the gaps needed to join the initial regions, is reported as INITN.
- Joining distance for proteins: k = 1: 32; k = 2: 16
94FASTA algorithm
- Later versions of FASTA include an optimization step
- When INITN reaches a certain threshold, the score of the region is recalculated to produce an OPT score by performing a full local alignment using dynamic programming.
- This procedure increases sensitivity but decreases selectivity
95Limitations of FASTA
- FASTA can miss significant similarity because
- For proteins, similar sequences do not have to share identical residues
- Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a k-tuple size of 1 because no amino acid matches
- For nucleic acids, due to codon wobble, DNA sequences may look like XXyXXyXXy, where the Xs are conserved and the ys are not
96BLAST (1)Basic Local Alignment Search Tool
- Filter: low-complexity regions are removed
- Divide the query sequence into words (sliding by 1 position)
- Include imperfection: based on a scoring matrix, similar words that produce a score higher than T are assembled into a list
- This step is included to permit imperfect matches between subject and query sequence
- Usually about 50 entries per word (rather than 20 x 20 x 20 = 8000)
97BLAST (2)Basic Local Alignment Search Tool
- Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T
- Key concept: this seems similar to FASTA, but we are searching for words that score above T rather than words that match exactly
98BLAST (3)Basic Local Alignment Search Tool
- Each database entry is scanned for a match to one of the list entries
- Use the short matched regions (x) lying on the same diagonal and within distance A as starting points for a longer ungapped alignment between words
99BLAST (4)Basic Local Alignment Search Tool
- Extension of the alignment from the matching words in each direction along the sequences. Extension continues as long as the score increases. The extension is stopped when the accumulated score stops increasing and has just begun to fall a small amount below the best score found for a shorter extension.
- The segment obtained is called a high-scoring segment pair (HSP)
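The extension step can be sketched as follows. This is a simplified illustration, not BLAST's actual code: it scores with a flat match/mismatch scheme instead of a substitution matrix, and the parameter `max_drop` (how far the running score may fall below the best score before extension stops) as well as all function names are assumptions.

```python
def _extend(pairs, match, mismatch, max_drop):
    """Walk along residue pairs; return (best score gain, number of pairs kept)."""
    score = best = best_len = 0
    for length, (x, y) in enumerate(pairs, start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, length
        elif best - score > max_drop:
            break                      # score fell too far below the best: stop
    return best, best_len

def extend_seed(query, subject, q, s, w, match=1, mismatch=-1, max_drop=2):
    """Extend a word hit of length w at query[q], subject[s] in both directions.

    Returns (score, start, end) with the HSP covering query[start:end].
    """
    seed = sum(match if query[q + i] == subject[s + i] else mismatch for i in range(w))
    right, r_len = _extend(zip(query[q + w:], subject[s + w:]), match, mismatch, max_drop)
    left, l_len = _extend(zip(reversed(query[:q]), reversed(subject[:s])),
                          match, mismatch, max_drop)
    return seed + left + right, q - l_len, q + w + r_len

# Made-up sequences sharing the core ACGTACG; the seed is the word GTA at position 4.
print(extend_seed("TTACGTACGAA", "GGACGTACGCC", 4, 4, 3))
```

Only the part of each extension that reached the best score is kept, so trailing mismatches are trimmed from the reported segment.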
100BLAST (5)Basic Local Alignment Search Tool
- Determine whether the HSP has a score larger than a cutoff score S
- S is determined by examining the range of scores found by comparing random sequences and by choosing a value that is significantly greater
- Determine the significance of each HSP score
- $P(S > x) = 1 - \exp(-K m n e^{-\lambda x})$
- $P(S > x) = 1 - \exp(-E)$
- m, n: effective lengths of the query and databank sequences
- E: number of expected HSPs with a score of at least x
101BLAST (6)Basic Local Alignment Search Tool
- Significance
- BLAST provides E-values rather than P-values, as E is easier to understand, e.g. E-values of 5 and 10 correspond to P-values of 0.993 and 0.99995
- P-values and E-values are related: e.g. if one expects to find 3 HSPs with a score > S, the probability of finding at least one is 0.95
- When E < 0.01, P-values and E-values are nearly identical
102Selecting the BLAST program
103FASTA-BLAST comparison
104Significance of database searches (1)
- All previous theory referred to the comparison of two sequences. How should one consider the entire set of sequences?
- 1. Significance is independent of the length of a sequence -> multiply the pairwise significance by the number of sequence entries (FASTA)
- 2. Significance depends on length, as long sequences are composed of multiple distinct domains -> treat the entire database as a single sequence for the calculation of significance
105Significance of database searches (2)
- Until now, only ungapped alignments were considered.
- Computational experiments and analytical results suggest that the same theory can be applied to gapped alignments
- For ungapped alignments the statistical parameters ($\lambda$, K) can be calculated using analytic formulas
- For gapped alignments these parameters must be estimated from a large-scale comparison of random sequences
106Significance of database searches (3)
- Gapped alignments
- FASTA: local alignment scores are produced for the comparison of the query with every databank sequence. Most of these scores involve unrelated sequences, so they can be used to estimate $\lambda$ and K. Problem: scores from pairs of related sequences should be excluded
- BLAST: $\lambda$ and K are estimated for a selected set of substitution matrices and gap costs. The estimation could be done with real sequences, but has instead relied on random sequences
107Hidden Markov Model (HMM)
- HMMs offer a more systematic approach to estimating model parameters
- HMMs can be compared to a kind of dynamic statistical profile
- Like an ordinary profile, an HMM is built by analyzing the distribution of aa in a training set of related proteins
- The topology of an HMM can be visualized as a finite state machine
108Hidden Markov Model (HMM)
(Figure: HMM topology with Begin and End states and rows of Delete, Insert and Match states; the match states shown emit A, C, C, G)
Movement from state n to n + 1 occurs with a certain transition probability
109Hidden Markov Model (HMM)
- More than one path leads to the same result
(Figure: HMM topology with Begin and End states and rows of Delete, Insert and Match states; the match states shown emit A, C, C, G)
Movement from state n to n + 1 occurs with a certain transition probability
110Hidden Markov Model (HMM)
- The log probability of a given sequence along a path is obtained as the sum of log_e(transition probabilities)
- It is called a hidden Markov model because the path is hidden
- Transition probabilities are obtained by training on a set of sequences
- Initialization with estimated transition probabilities
- All possible paths generating a given sequence are visited in proportion to the estimated transition probabilities
- Counting the number of times a given transition was visited during the above step provides improved transition probabilities
- The Viterbi algorithm is used on a trained HMM to determine the best path
- The Viterbi algorithm is similar to dynamic programming
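The Viterbi step can be sketched for a small, generic HMM. The example below is deliberately simpler than the profile HMM topology in the figures: it uses two hypothetical states ("AT" for AT-rich, "GC" for GC-rich) with made-up probabilities, and sums log probabilities along each path. It is meant only to show the dynamic-programming character of the algorithm.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    # Log probability of the best path ending in each state after the first symbol.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []
    for obs in observations[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor: sum of logs instead of a product of probabilities.
            prev, score = max(((p, scores[-1][p] + math.log(trans_p[p][s]))
                               for p in states), key=lambda item: item[1])
            row[s] = score + math.log(emit_p[s][obs])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(back):          # follow the stored pointers backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[-1][last]

# Hypothetical two-state model, for illustration only.
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {"AT": {"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1},
        "GC": {"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4}}
print(viterbi("aattggcc", STATES, START, TRANS, EMIT))
```

As in sequence alignment, each cell keeps only the best way of reaching it, so the number of operations grows with sequence length times the square of the number of states rather than with the number of possible paths.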
111Hidden Markov Model (HMM)
- HMM is a general technique that can be applied to many different questions
- Multiple sequence alignment
- Identification of conserved domains
- Gene prediction
- Protein secondary structure prediction
112Single aa sequence query programs
- Sequence similarity with query sequence
- FASTA, BLAST
- Alignment search with profile (scoring matrix with gap penalties)
- PROFILESEARCH
- Search with position-specific scoring matrix (PSSM) representing ungapped sequence alignment (BLOCK)
- MAST
- Iterative alignment search for similar sequences that starts with the query sequence, builds a gapped multiple alignment, and then uses this to augment the search
- PSI-BLAST
- Search query sequence for patterns representative of protein families
- PROSITE, INTERPRO, PFAM, CDD/IMPALA
117Comparison of EMBL NCBI
119Assignments 8 to 10
- You have isolated a number of proteins by their interaction with a protein known to interact with RING finger proteins. By sequencing the proteins you obtained
- from human cell lines: msvdmnsqgsdsneedydpnceeeeeeeeddpgdie
- from C. elegans: mnsddeiymegsasseddmddeclsd and mddedmsctsgddyagygdedyyneadv
- from Drosophila melanogaster: mdsdndndfcdnvdsgnvssgddgdddfg and mdsdiemdmesdndgeydddydyyntgedcd
- from Saccharomyces cerevisiae: mssgtendqfysfdesdsssielyeshntseftihglv
- from Arabidopsis thaliana: mdnnsvigsevdaeadesyvnaaledgqtgkks and mddyfsaeeeacyyssdqdsldgidneeselqpl
- a. Find the complete protein sequences for every given peptide and align the sequences to find out about their overall homology.
- b. Are there RING finger motifs in your proteins and, if yes, how many and where?
- c. RING finger proteins share a common protein motif of C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C.
- d. Are there other remarkable protein motifs?
- Software: Word, BLAST, FastA and ClustalW
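The motif given in part c can be translated into a regular expression, reading X2 as "any two residues" and X9-29 as "between 9 and 29 residues". The sketch below is one possible translation, not the pattern used by PROSITE or any other database; the test protein is synthetic and made up purely to exercise the pattern.

```python
import re

# C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C
RING = re.compile(r"C.{2}C.{9,29}C.{1,3}H.{2,3}[CH].{2}C.{4,48}C.{2}C")

def find_ring_fingers(protein):
    """Return (0-based start, matched segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in RING.finditer(protein.upper())]

# Synthetic sequence built to satisfy the minimum spacings of the motif.
synthetic = ("MA" + "CAAC" + "A" * 9 + "C" + "A" + "H" + "AA" + "C"
             + "AA" + "C" + "AAAA" + "C" + "AA" + "C" + "KK")
for start, segment in find_ring_fingers(synthetic):
    print(start, len(segment), segment)
```

Because the spacers are variable, a match only shows that the cysteine/histidine spacing is compatible with a RING finger; the hits still need to be checked against a domain database.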
120Assignment 9
- You received a manuscript submitted for publication. The authors claim that they have discovered a gene involved in abnormal muscle growth in salmon (hs, heavy salmon). You should decide whether the paper should be published.
- b. What gene is it? Is it really a novel gene?
- c. Do you support the authors' claim that this is a salmon gene?
- d. Could the authors' claim be true?
- Software: Word, FastA, BLAST, PubMed
- Download: hs_gene.doc
121Assignment 10
- Inspired by the manuscript you reviewed, you decide to look for the gene in whales.
- a. Make a sequence alignment to design primers for cross-species amplification
- b. Design primers that have a fair chance of amplifying the gene from whales
- c. You know that human contamination is a problem in your lab. What would you do to minimize the risk of human contamination?
- Software: Word, BLAST, FastA, ClustalW
122Organismal databases
123Arabidopsis thaliana
124Drosophila BDGP (1)
125Drosophila BDGP (2)
126Drosophila Flybase
127Drosophila NCBI
128Assignments 11 to 13
- In Drosophila, microsatellites are very short. Try to find the longest dinucleotide microsatellite in D. melanogaster
- Software: FLYBASE, BDGP, BLAST
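Once a sequence has been retrieved, dinucleotide repeats can be located with a regular expression. The following sketch is an illustration only: it scans every start position with a lookahead, which is fine for single sequences but far too slow for a whole genome, and the function name and example sequence are made up.

```python
import re

# Lookahead so that overlapping candidates are all examined.
# Group 1: whole repeat; group 2: unit of two different bases; group 3: first base.
REPEAT = re.compile(r"(?=((([acgt])(?!\3)[acgt])\2+))")

def longest_dinucleotide_repeat(sequence):
    """Return (0-based start, repeat unit, number of units) or None."""
    best = None
    for m in REPEAT.finditer(sequence.lower()):
        start, repeat, unit = m.start(), m.group(1), m.group(2)
        if best is None or len(repeat) > best[2] * 2:
            best = (start, unit, len(repeat) // 2)
    return best

print(longest_dinucleotide_repeat("ttacacacacgg"))
```

The pattern requires the two bases of the unit to differ, so homopolymer runs such as "aaaa" are not reported as dinucleotide repeats.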
129Assignment 12
- ITS sequences are widely employed to reconstruct the phylogeny of closely related species. The major advantage of ITS sequences is that you can use primers (located in the 18S and 28S rDNA) which are conserved across many species. You have used these conserved primers to amplify the complete ITS region from oaks. The PCR products were cloned and sequenced. In the folder oaks you find the results of your experiment.
- Figure 1. Organization of the rDNA
- a. Make a contig of your sequences
- b. Define the boundaries of the genes with the spacers
- c. Verify that your sequences originate from oaks.
- Software: Word, JaMBW, ClustalW, BLAST, FastA
- Download: oak1, oak2, oak3, oak4, oak5
130Assignment 13
- You received one pair of microsatellite primers, ran a PCR and found a highly interesting pattern in one population (no variability). Inspired by this result, you would like to know more about the locus. Unfortunately, you found only the sequence of one of the primers (ttttgtcgttttcgttatg) and your friend has gone on a 6-month holiday. Fortunately, you are working with one of the best-studied organisms, Drosophila melanogaster, so you have every possibility to investigate!
- a. What is the repeat motif of your microsatellite?
- b. Which gene is in close proximity to the microsatellite?
- c. On which chromosome is the gene located?
- d. Determine the number of available transposon insertions in the gene
- e. Where in the gene are the transposons inserted?
- f. What would you do to obtain a fly stock with the gene deleted?
- Software: FastA, BLAST, FLYBASE, BDGP
131Gene prediction
132Gene prediction
- Goal: identify those regions that code for proteins
- Direct approach: look for stretches that can be interpreted as protein using the genetic code
- Statistical approaches: use other knowledge about likely coding regions
(Figure: gene structure with 5' UTR, exons, introns and 3' UTR)
133Gene prediction direct approach
- Genetic code
- The universal genetic code is common to nearly all organisms
- Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
- More than one tRNA may be present for a given codon, allowing more than one possible translation product
- Differences in genetic codes occur mainly in start and stop codons
- Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)
- Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)
134Gene prediction direct approach
- Reading frames
- Since nucleotide sequences are read three bases at a time, there are three possible frames in which a given nucleotide sequence can be read (in the forward direction)
- Taking the complement of the sequence and reading in the reverse direction gives a total of six reading frames
- Open reading frames are defined by a set of codons not interrupted by a stop codon
- Note: not all ORFs are actually used
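The six reading frames and the stop-free stretches within them can be enumerated with a short script. This is a minimal sketch using the definition on the slide (an ORF is a run of codons not interrupted by a stop codon) and the standard stop codons; it does not require a start codon, and the names `reading_frames`, `open_reading_frames` and `min_codons` are illustrative.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")
STOPS = {"taa", "tag", "tga"}

def reading_frames(sequence):
    """Yield (strand, offset, codons) for all six reading frames."""
    forward = sequence.lower()
    reverse = forward.translate(COMPLEMENT)[::-1]   # reverse complement
    for strand, s in (("+", forward), ("-", reverse)):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons

def open_reading_frames(sequence, min_codons=2):
    """Return (strand, offset, orf) for stop-free runs of at least min_codons codons."""
    orfs = []
    for strand, offset, codons in reading_frames(sequence):
        run = []
        for codon in codons + ["taa"]:   # sentinel stop closes the final run
            if codon in STOPS:
                if len(run) >= min_codons:
                    orfs.append((strand, offset, "".join(run)))
                run = []
            else:
                run.append(codon)
    return orfs

for frame in reading_frames("atgaaataa"):
    print(frame)
print(open_reading_frames("atgaaataa"))
```

In practice a minimum length far larger than two codons is used, since short stop-free stretches occur by chance in every frame.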
135Gene prediction: direct approach
- Statistical support by Fickett's statistic: codon usage bias
- Observation: in coding regions, every third base tends to be the same one much more often than expected by chance
- The reason for this is codon usage bias
- Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to conform to the preferred codon usage
- Non-coding regions, on the other hand, are under no such selective pressure and can drift
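The "every third base" observation can be turned into a simple number. The Python sketch below computes, for each base, how unevenly it is distributed over the three codon positions, as the ratio max/(min + 1) of its counts at positions 1, 2 and 3. This is a simplified position-asymmetry measure in the spirit of Fickett's statistic, not the full published method (which also uses base composition and empirically derived weights); the function name and the toy sequences are assumptions.

```python
def position_asymmetry(dna):
    """For each base, max/(min + 1) of its counts at codon positions 1-3."""
    dna = dna.upper()
    scores = {}
    for base in "ACGT":
        # dna[pos::3] selects every third base starting at codon position pos.
        counts = [dna[pos::3].count(base) for pos in range(3)]
        scores[base] = max(counts) / (min(counts) + 1)
    return scores


periodic = "GCA" * 30   # toy sequence with perfect three-base periodicity
uniform = "A" * 90      # toy sequence with no positional preference
print(position_asymmetry(periodic))
print(position_asymmetry(uniform))
```

High values indicate a strong preference of a base for one codon position, as expected in coding regions; values near or below 1 indicate no such preference.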
136Gene prediction: direct approach
- Statistical support by Fickett's statistic: codon usage bias
- Example: glycine codon frequencies
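Relative glycine codon frequencies can be counted from any coding sequence. The following Python sketch assumes the sequence is given in frame, starting at the first codon position; the function name and the short test sequence are made up for illustration and carry no biological meaning.

```python
from collections import Counter

GLYCINE_CODONS = ("GGA", "GGC", "GGG", "GGT")


def glycine_codon_frequencies(cds):
    """Relative usage of the four glycine codons in an in-frame sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in GLYCINE_CODONS)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in GLYCINE_CODONS}


# Toy sequence: ATG GGC GGC GGT GGA TAA
print(glycine_codon_frequencies("ATGGGCGGCGGTGGATAA"))
```

A strongly skewed distribution over synonymous codons in a candidate region is one indication of codon usage bias.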
137Gene prediction: direct approach
(Figure; only the label "exon" is recoverable)
138Gene prediction: direct approach
- Problem: the direct approach works well for prokaryotes but not for eukaryotes
- Codon usage bias is not constant across genes
- Introns in eukaryotes
139Gene prediction: statistical approach
- To discriminate between different regions of a gene, typical sequence elements are used as clues
- Content sensor: a region of residues with similar properties (introns, exons)
- Signal sensor: a specific signal sequence (may be a consensus)
(Figure: gene structure showing 5' UTR, exons, introns, 3' UTR)
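The difference between the two kinds of sensor can be shown with a minimal Python sketch. The signal sensor reports positions that match a consensus written in IUPAC codes; the content sensor summarizes a property of a whole region, here the GC fraction per window. The donor-site consensus GTRAGT used in the example is a commonly cited approximation of the start of an intron, and the function names, window size and test sequences are assumptions; real gene finders use probabilistic models rather than exact consensus matches.

```python
# Subset of the IUPAC nucleotide codes, enough for this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "M": "AC", "N": "ACGT"}


def signal_hits(dna, consensus):
    """Signal sensor: start positions where dna matches an IUPAC consensus."""
    dna = dna.upper()
    k = len(consensus)
    return [i for i in range(len(dna) - k + 1)
            if all(base in IUPAC[code]
                   for base, code in zip(dna[i:i + k], consensus))]


def gc_content_windows(dna, window):
    """Content sensor: GC fraction of each non-overlapping window."""
    dna = dna.upper()
    return [sum(base in "GC" for base in dna[i:i + window]) / window
            for i in range(0, len(dna) - window + 1, window)]


print(signal_hits("CCAGGTAAGTCC", "GTRAGT"))  # candidate donor site
print(gc_content_windows("GGCCAATT", 4))      # GC-rich, then AT-rich
```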
140Pre-mRNA splicing
141Gene Finding Software
- GENSCAN (HMM)
- HMMGENE (HMM)
- GENMARK (HMM)
- GRAIL (neural network)
142Evaluation of gene predictions
- One has to discriminate between
- True positives (TP)
- False positives (FP)
- False negatives (FN)
- Sensitivity = TP / (TP + FN)
- Specificity = TP / (TP + FP)
- GRAIL was evaluated on different human data sets
- Sensitivity 0.48-0.65, specificity 0.61-0.72
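The two measures can be computed at the nucleotide level once the true and the predicted coding positions are known. The Python sketch below uses the definitions given above; representing each annotation as a set of positions, and the example coordinates, are assumptions made for illustration.

```python
def evaluate_prediction(true_coding, predicted_coding):
    """Return (sensitivity, specificity) for two sets of coding positions."""
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fp = len(predicted_coding - true_coding)   # predicted coding, but not
    fn = len(true_coding - predicted_coding)   # coding, but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


true_exon = set(range(100, 200))       # hypothetical annotated exon
predicted_exon = set(range(150, 230))  # hypothetical prediction
print(evaluate_prediction(true_exon, predicted_exon))
```

In this example half of the true exon is recovered (sensitivity 0.5), and 50 of the 80 predicted positions are correct (specificity 0.625).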
143Promoter prediction
- Similar to gene prediction, known regulatory signals can be used to make predictions
- Algorithms
- Neural networks
- HMMs
147Analyzing Gene Expression (Microarray) Data
148Assignments 14 and 15
- You have transformed an Arabidopsis thaliana mutant with a genomic sequence (Annotierungssequenz.doc), and the presumed gene is sufficient to restore the function of the mutant gene.
- a. Find the coding sequence
- b. Find the poly(A) signal
- c. Where is the TATA box motif located?
- d. Locate the gene on the A. thaliana map
- e. Are cDNA clones available for this gene?
- f. Where is the gene expressed?
- g. Predict the protein sequence
- h. Does this protein share homologies with other proteins?
- i. Are there any related proteins in other plants/animals?
- j. Do these homologies indicate a possible function?
- k. Does the protein have any interesting domains?
- l. Is there a transmembrane domain?
- m. Predict the subcellular localization
- Software: Arabidopsis database TAIR, GENSCAN, Genfinder, MCB search, ExPASy, PLACE
- Download Annotierungssequenz.doc
149Assignment 15
- Based on sequence polymorphism data, your friend concluded that a given sequence has been the target of selection. He asked you for advice about the identified sequence. Characterize the sequence as thoroughly as possible, without relying on a single source of information.
- Download Unknown.doc
150Microarray Data
- A snapshot of how strongly a particular gene is being transcribed in a tissue
- Measured for tens of thousands of genes
- Using multiple tissues on a single array allows direct comparisons between tissues
151Objectives of Microarray Studies
- Gene discovery: which genes are affected by exposure to a treatment?
- "Hit it with a stick and see what happens"
- Disease diagnosis: given a profile of expression levels for many genes, can the unknown treatment be predicted?
- Tumor or disease classification
- Time course experiments allow the study of co-regulation of genes and the reconstruction of regulatory networks
- Pharmacogenomics
- The goal of pharmacogenomics is to find correlations between therapeutic responses to drugs and the genetic profiles of patients.
152Many computational and statistical problems
- Image analysis (spot identification, background, etc.)
- Data management and pipelining
- Normalization of data
- Clustering co-regulated genes
- Classifying tissue types
- Regulatory network inference
- Promoter identification (when combined with
genomic sequence data)
153Microarray Technology
- Spotted arrays
- Attach entire sequence of genes to the array
- Create cDNA from a tissue (expressed genes)
- Wash the pool of cDNAs over the array
- Complementary sequences bind
- Oligonucleotide arrays (Affymetrix chips)
- Attach short (25bp) oligos instead of entire genes
154GTTCGA.... The gene
GUUCGA.... mRNA
CAAGCT.... cDNA (via reverse transcription of the mRNA)
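The relationship between the three sequences can be reproduced with two string operations. In this Python sketch the gene is represented by its coding strand, and the cDNA is written base by base under the mRNA (that is, not reversed), as in the diagram; the function names are assumptions.

```python
def transcribe(coding_strand):
    """The mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def reverse_transcribe(mrna):
    """First-strand cDNA: the DNA complement of the mRNA, base by base."""
    return mrna.upper().translate(str.maketrans("ACGU", "TGCA"))


mrna = transcribe("GTTCGA")
print(mrna)                      # GUUCGA
print(reverse_transcribe(mrna))  # CAAGCT
```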
155Spotted arrays are usually treated with samples
from two different tissues, each labeled with a
different color of dye (Red and Green)
(Figure labels: highly expressed in tissue A; highly expressed in tissue B)
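Two-color spot intensities are commonly summarized as a log ratio, so that equal over- and under-expression give values of equal size and opposite sign. The Python sketch below assumes that tissue A was labeled red and tissue B green, and adds a small pseudocount to avoid division by zero; both choices, the function name and the intensity values are assumptions and not taken from the slides.

```python
import math


def log_ratio(red, green, pseudocount=1.0):
    """log2 of red over green intensity for one spot."""
    return math.log2((red + pseudocount) / (green + pseudocount))


print(log_ratio(799, 99))  # positive: higher in the red-labeled tissue
print(log_ratio(99, 799))  # negative: higher in the green-labeled tissue
print(log_ratio(50, 50))   # zero: equally expressed
```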
157The Data
158Goal: cluster genes that share a profile
(Figure axis label: Experiment)
159The approach is formally similar to
distance-based phylogenetic inference
- Compute a matrix of pairwise profile similarity scores between genes
- Use these scores in something like UPGMA
- Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868
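The two steps can be sketched in plain Python: Pearson correlation as the pairwise similarity score, followed by average-linkage merging, which is the clustering rule UPGMA uses (here applied to similarities instead of distances). This is a minimal illustration, not the procedure of the cited paper; the gene names and expression values are made up, and profiles are assumed to be non-constant so that the correlation is defined.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Repeatedly merge the two clusters with the highest mean similarity."""
    sim = {frozenset(pair): pearson(profiles[pair[0]], profiles[pair[1]])
           for pair in combinations(profiles, 2)}
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        def mean_sim(pair):
            a, b = pair
            total = sum(sim[frozenset((g, h))] for g in a for h in b)
            return total / (len(a) * len(b))
        a, b = max(combinations(clusters, 2), key=mean_sim)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],
            "g3": [4, 3, 2, 1], "g4": [8, 6, 4, 2.5]}
for a, b in average_linkage(profiles):
    print(a, "+", b)
```

The order of the merges defines the tree: genes with rising profiles (g1, g2) and genes with falling profiles (g3, g4) are joined first, and the two groups are joined last.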
161Clustering Techniques
- Bottom-up techniques
- Each gene starts in its own cluster, and genes are sequentially clustered in a hierarchical manner
- Top-down techniques
- Begin with an initial number of clusters and
initial positions for the cluster centers (e.g.,
averages). Genes are added to the clusters
according to an optimality criterion.
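A well-known example of the second kind of technique is k-means clustering, named here as an illustration and not necessarily the method the slides have in mind. The Python sketch below starts from given cluster centers, assigns each profile to the nearest center (the optimality criterion is squared Euclidean distance) and recomputes the centers as averages; the function names, the fixed number of iterations and the data are assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(profiles, initial_centers, iterations=10):
    """Assign profiles to the nearest center, then update the centers."""
    centers = [list(c) for c in initial_centers]  # do not modify the input
    assignment = []
    for _ in range(iterations):
        assignment = [min(range(len(centers)),
                          key=lambda k: squared_distance(p, centers[k]))
                      for p in profiles]
        for k in range(len(centers)):
            members = [p for p, a in zip(profiles, assignment) if a == k]
            if members:  # an empty cluster keeps its old center
                centers[k] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assignment, centers


profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, centers = k_means(profiles, [[0, 0], [10, 10]])
print(assignment)
print(centers)
```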
162Clustering Techniques
- Principal component techniques
- Identify groups of genes that are highly correlated with some underlying factor (principal component).
- Self-organizing maps
- Similar to top-down clustering, with restrictions placed on the dimensionality of the final result.
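The principal component idea can be sketched without any libraries. The Python example below treats each row as one gene and each column as one experiment, finds the first principal component of the covariance matrix by power iteration, and scores each gene along it. Power iteration, the start vector, the function names and the data are assumptions chosen to keep the example short; the start vector must not be orthogonal to the component, and in practice a linear algebra library would be used.

```python
from math import sqrt


def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=100):
    """Leading eigenvector of the covariance matrix, by power iteration."""
    centered = center(rows)
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def scores(rows, component):
    """Projection of each centered row onto the component."""
    return [sum(x * c for x, c in zip(row, component))
            for row in center(rows)]


genes = [[1, 2], [2, 4], [3, 6], [4, 8]]  # toy data: rows = genes
pc1 = first_principal_component(genes)
print(pc1)
print(scores(genes, pc1))
```

Genes with large positive or negative scores follow the component closely; genes with scores near zero are largely unrelated to it.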