Bioinformatics: overview

Slides: 163

Provided by: christians6

Transcript and Presenter's Notes

1
Bioinformatics overview

Handling a computer
Opening and saving of files
Starting programs
Navigating the WWW
FTP
Browsing
Sequence data
Primary data
Sequence formats

2
Bioinformatics overview (2)

Databases
Entrez
SRS
Manipulation of DNA sequences
Restriction analysis
in silico cloning
Translation of nt sequence into protein
PCR
Primer design

3
Bioinformatics overview (3)

Comparison of two sequences
Dot matrix
Pairwise alignment
Multiple alignments
Database searches for similar sequences
FASTA
BLAST

4
Bioinformatics overview (4)

Sequence annotation
Intron/Exon prediction
Identification of conserved motifs
Identification of regulatory sequences
Organismal databases
D. melanogaster
A. thaliana
Expression profiling (chips)

5
Copy and Paste
To transfer text/sequence files into programs use
copy/paste

Ctrl+C (Strg+C on German keyboards) for copy, Ctrl+V (Strg+V) for paste

6
Start a program

Double click on the program you wish to open
Microsoft Word can be found under
Start -> Programme (Programs) -> Microsoft Word

7
Create a Folder

Start - Programme (Programs) - Windows Explorer - click
Desktop - click
Datei (File) - Neu (New) - Ordner (Folder) - click
the new folder will appear on the screen
rename "Neuer Ordner" (New Folder) to EDV:
click once on the icon and a second time on the
text field to activate the editor, then type EDV
Save all your future files into this folder

8
Navigating the Internet

File transfer protocol (FTP)
Allows a person or computer to retrieve and send
files from/to another computer. Only copies are
moved; the original file remains untouched.
The network terminal protocol (TELNET)
allows a user to log in on any other computer on
the network, turning the local computer into a
terminal.

9
Navigating the Internet

Every computer in the internet has its own unique
IP address
e.g. 193.171.103.86
Because these numbers are not intuitive they are
often converted into a name
i122server.vu-wien.ac.at addresses the same
computer
Subdirectories on this computer can be specified
i122server.vu-wien.ac.at/edv/start.html is a
folder, which has been prepared for this course

10
Navigating the Internet

For internet access you need either a modem or a
direct line
Once you are connected with one server you can
access the full internet
For most purposes an internet browser is
sufficient
Internet Explorer
Netscape Navigator
Omniweb

11
Getting started

Open your web browser
Type in the address http://i122server.vu-wien.ac.at/edv/start.html

Press return
You could make a bookmark of this page

12
Links

Rather than typing a new address each time, it is
possible to click on specially marked text or
symbols
After a single click you will connect to the
address
To view the address before connecting, simply
move your mouse above the link

13
Assignment 1

Open Netscape Communicator
http://i122server.vu-wien.ac.at/edv/start.html
Create a new folder on the desktop
Open Microsoft Word and type in a random DNA
sequence, then save this sequence as a text-only file
in your folder.
Software: Windows Explorer and Word

14
Sequence data

Automated DNA sequencing heavily relies on the
support of computer algorithms
Data collection

15
Sequence data

The use of 4 different dyes requires intensive
computer calculations to extract sequence
information
Electropherograms

16
Sequence formats

While electropherograms are useful during the
sequencing project, after its completion
sequences are stored as text.
Plain text contains only the sequence
information
Fasta

17
Other sequence formats

GenBank

18
Manipulation of DNA Sequences I
Restriction endonucleases
Sticky ends: XhoI (c/tcgag), PstI (ctgca/g), ...
Blunt ends: SmaI (ccc/ggg), DraI (ttt/aaa), ...
Rare cutters: large recognition sites
Frequent cutters: small recognition sites
Multiple cloning site
19
Manipulation of DNA Sequences II
20
Assignment 2
1. Open the file pSKII.doc and try to find the
sequences for the sequencing primers M13-forward
(5' gtaaaacgacggccagt 3') and M13-reverse (5'
ggaaacagctatgaccatg 3') as well as the RNA
polymerase promoters T3 (5' aattaaccctcactaaaggg
3') and T7 (5' gtaatacgactcactatagggc 3').
2. Which primers are homologous to the single-stranded
SKII sequence and which are complementary?
Software: Word, JaMBW (Reverse, Complement, Inverse)
Download: pSKII.doc
21
Characteristics of cloning vectors I
Multiple cloning site
Region for universal (M13 +/-) sequencing primers
RNA polymerase (T7, T3, SP6) promoters
Genes for selection
22
Watson-Crick DNA strands
The upper strand of the dsDNA is called W
(Watson), the forward strand; the lower strand is called C
(Crick), the reverse strand.

The C strand is the reverse complement
of the W strand
C is antisense to W
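
A minimal Python sketch of how the C strand can be derived from the W strand. The function name is illustrative and not part of the course software; it assumes an unambiguous DNA sequence written 5' to 3'.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


m13_forward = "gtaaaacgacggccagt"
print(reverse_complement(m13_forward))
```

Applying the function twice returns the original sequence, which is a quick sanity check when deciding whether a primer matches the W or the C strand.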

23
DNA and RNA Polymerases
DNA polymerases need short primers to start DNA synthesis
RNA polymerases need short promoters
Polymerases synthesize DNA/RNA only in the 5' -> 3' direction
If an open reading frame (ORF) is encoded by the W strand, the C strand codes for the antisense gene
The C strand can also code for ORFs; then the W strand codes for the antisense gene
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
Assignment 3
You received a cDNA clone and the sequence of the
insert (prc1edvkurs.doc) from your colleague. He
told you that the start codon is the "atg" at
position 79. To synthesize an antisense RNA
for use as a Northern blot probe, you have to subclone
the insert into another vector. The vector you
have in the lab is Bluescript (pSKII.doc).
Bluescript contains a multiple cloning site
flanked by sequences for the sequencing primers
M13-forward and M13-reverse and the RNA
polymerase promoters T3 and T7.
a. Find the multiple cloning site in the vector.
b. Find the best cloning strategy using only one restriction enzyme.
c. Use directed cloning to ensure that all clones can be used to produce an antisense probe with T3 RNA polymerase.
d. Define a strategy to modify the clone from 3c for use with T7 RNA polymerase. Which enzymes would you use?
Software: Word and Webcutter
Download: prc1edvkurs.doc
28
Manipulation of DNA Sequences III
Polymerase chain reaction (PCR)
http://bibiserv.techfak.uni-bielefeld.de/sadr/pcrtutor.html
29
Manipulation of DNA Sequences IV
Primer design
Size: between 19 and 25 bases
Melting temperature: between 48 °C and 60 °C
Tm(forward) ≈ Tm(reverse)
Tm = 2 x (A + T) + 4 x (G + C)
Minimum number of G/Cs: 9 - 11 (40 - 50%)
Distance between primer pairs: 10 bp - 40 kb
Annealing sites: unique
3' end: avoid mispriming
Avoid primer-primer interaction and hairpin structures
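
The melting temperature rule on this slide, Tm = 2 x (A + T) + 4 x (G + C), can be sketched in a few lines of Python. The function names are illustrative; the rule is only a rough estimate for short primers and ignores salt and primer concentration.

```python
def wallace_tm(primer: str) -> int:
    """Estimate Tm in degrees Celsius as 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    seq = primer.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


m13_forward = "gtaaaacgacggccagt"
print(wallace_tm(m13_forward), round(gc_percent(m13_forward), 1))
```

For the M13-forward primer (6 A, 2 T, 5 G, 4 C) the estimate is 2 x 8 + 4 x 9 = 52 °C.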
30
(No Transcript)
31
(No Transcript)
32
Assignment 4

Microsatellites are highly polymorphic markers,
which are extensively used for paternity testing,
genome walking, provenance studies and analysis
of population structures.
They consist of tandemly repeated simple
sequences of di-, tri- and tetranucleotides such as
(AT)n, (CT)n, (CA)n, (GA)n, (GT)n or (CCT)n, ...
Their length variation results from DNA slippage,
a mechanism that increases and decreases their
repeat number. The repeats are flanked by unique
sequences, which allow the design of specific primers
for the amplification of the microsatellite.
Please design primer pairs for the amplification
of a microsatellite using the following criteria:
product length 100 - 300 bp
annealing temperature higher than 55 °C
primer length between 20 and 24 bp
Software: Word and Primer3
Download: microsatellite.doc
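
Before designing primers it helps to locate the repeat itself. The following is a small Python sketch, not part of the course software, that scans a sequence for perfect di-, tri- and tetranucleotide repeats with a regular expression. The function name and the default of five repeat units are assumptions chosen for illustration.

```python
import re


def find_microsatellites(seq: str, min_repeats: int = 5):
    """Return (start, motif, repeat count) for perfect tandem repeats.

    Motifs of 2 to 4 bases are considered; runs of a single base
    (e.g. AAAA...) are skipped.
    """
    # A 2-4 base motif followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) > 1:
            hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


print(find_microsatellites("GGTT" + "CA" * 6 + "TTGG"))
```

The reported start positions (0-based) show where the unique flanking sequences begin and end, which is where the primers have to be placed.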

33
The importance of centralized databanks
34
EMBL Databank
35
EMBL SRS
36
Entrez

is a search and retrieval system that integrates
information from databases at NCBI

37
(No Transcript)
38
PopSet-prealigned multiple data sets
39
Taxonomy Browser
40
Online Mendelian Inheritance in Man

This database is a catalog of human genes and
genetic disorders. The database contains textual
information and references. It also contains
copious links to MEDLINE and sequence records in
Entrez

41
Objectives

What is the function of this gene?
Do other genes have this functional motif?
Can I predict the higher order structure of this
protein?
Is this gene a member of a known gene family?
Do other organisms have this gene?

42
General Database Search Issues

Search using amino acid sequence if possible
Why? Protein evolution is slower than DNA
sequence evolution
Ask the program to translate your query sequence
in all 6 possible reading frames.
Statistical theory is based on unrealistic
assumptions; consider searches as exploratory
analyses.

43
Similarity Search Jargon
A similarity search of a database is performed by
aligning a query sequence to each sequence in the
database. If good matches are found, the search
returns a list of HSPs (High-scoring Segment
Pairs).
44
Alignment Jargon
ancestor
Evolutionarily related sequences differ from one
another because of several processes

Substitutions
Insertions
Deletions

Observed sequences
45
Alignment Jargon
GCG
ACG
Substitution: A -> G

1 mismatch
2 matches

46
Alignment Jargon
ATCG
A-CG
Insertion of T

0 mismatches
3 matches
1 gap

47
Alignment Jargon
Deletion
ATCG A-CG

0 mismatches
3 matches
1 gap

48
Alignment Jargon
Results of insertion and deletion events can be
indistinguishable.
Indel: INsertion or DELetion
49
Sequence Alignment

Sequence alignment is simply the optimal
assignment of substitution and indel events to a
pair of sequences.
Global alignment: align entire sequences
Local alignment: find best matching regions of
sequences

50
Alignment of pairs of sequences

Dot matrix analysis
Dynamic programming
Word (k-tuple) methods

51
Dot matrix

Sequence A is compared against B
Matching bases are marked on a AxB grid

53
Dot matrix

The background could be adjusted by changing the
window size
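
A minimal Python sketch of a dot matrix with an adjustable window. The function name and the text output with "*" and "." are illustrative choices; programs such as Dotlet draw the same information graphically. A dot is placed when at least min_matches of the window positions agree.

```python
def dot_matrix(a: str, b: str, window: int = 1, min_matches: int = 1):
    """Compare every window of sequence a with every window of sequence b.

    Returns one text row per window start in a; '*' marks a window pair
    with at least min_matches identical positions.
    """
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row += "*" if matches >= min_matches else "."
        rows.append(row)
    return rows


for line in dot_matrix("ACGTACGT", "ACGTTCGT", window=3, min_matches=3):
    print(line)
```

Larger windows with a high match requirement suppress the random background dots, so that only the diagonals of truly similar regions remain.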

54
Dot matrix

The background could be adjusted by changing the
window size
(phage lambda and P22 repressor proteins)
1/1 7/11 15/23

55
Dot matrix

Search for conserved regions and domains
Identify repeated nucleic acid and protein
domains
Determine introns and exons
Find inverted repeats and stem-loop structures
regions of low complexity
frameshifts

56
(No Transcript)
57
Assignments 5 and 6

You isolated a cDNA clone (PlecDNA.doc) and you
would like to know how many introns are in the
gene. Fortunately, you are working with a fully
sequenced organism, so it is easy to retrieve
the full genomic region (Plegenomic.doc).
a) How many introns does the gene contain?
b) What are the sequences (10 bp) around
introns 2 and 3 and the
corresponding exon borders?
Software: Word and Dotlet
Download: PlecDNA.doc, Plegenomic.doc

58
Assignment 6

The previous analysis showed that some useful
interpretations of DNA sequences can be made with
the dot matrix program. You have recently isolated
a genomic fragment (Test.doc) and are encouraged by
the earlier results to analyze it with the dot
matrix program.
How can you explain the pattern you see in the
dot matrix?
Delete an internal portion of the sequence and
compare the full versus the deleted sequence.
What is the pattern on the dot matrix?
Software: Word and Dotlet
Download: Test.doc

59
Dynamic programming

The dynamic programming algorithm provides a
reliable computation method for aligning
sequences
The method has been proven mathematically to
yield the optimal alignment (note there may be
more than a single optimal alignment)
Both local and global alignments can be produced

60
Problem of alignment

Roughly n x m comparisons need to be made for two
sequences of length n and m.
If the alignment is to include gaps of any length
at any position in either sequence, the number of
comparisons that must be made becomes
astronomical
Dynamic programming is a method of sequence
alignment that can take gaps into account but
requires only a moderate number of comparisons
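
A compact Python sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). It is an illustration only: it assumes simple match/mismatch scores and a fixed penalty per gap position instead of the substitution matrix used on the following slides, and the function name and default values are arbitrary.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned a, aligned b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("VDSLCY", "VDSCY"))
```

Filling the table takes roughly n x m steps, which is the "moderate number of comparisons" mentioned above, even though gaps of any length are allowed.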

61
The algorithm
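Sequences: VDSCY and VDSLCY
Substitution scores (rows: VDSLCY, columns: VDSCY)
      V    D    S    C    Y
V     4   -3   -2   -1   -1
D    -3    6    0   -3   -3
S    -2    0    4   -1   -2
L     1   -4    0   -2   -1
C    -1   -3   -1    9   -2
Y    -1   -3   -2   -2    7
Optimal alignment with the score of each column
V    D    S    L    C    Y
V    D    S    -    C    Y
4    6    4  -11    9    7
Total score: 4 + 6 + 4 - 11 + 9 + 7 = 19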
62
A sub-optimal alignment
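Sequences: VDSCY and VDSLCY
Substitution scores (rows: VDSLCY, columns: VDSCY)
      V    D    S    C    Y
V     4   -3   -2   -1   -1
D    -3    6    0   -3   -3
S    -2    0    4   -1   -2
L     1   -4    0   -2   -1
C    -1   -3   -1    9   -2
Y    -1   -3   -2   -2    7
Sub-optimal alignment with the score of each column
V    D    S    L    C    Y
V    D    S    C    Y    -
4    6    4   -2   -2  -11
Total score: 4 + 6 + 4 - 2 - 2 - 11 = -1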
63
Measuring Alignment Quality (subjective criteria)

Good alignments should have
many exact matches
few mismatches
many of the mismatches should be similar
residues
few gaps

64
Measuring Alignment Quality (objective criteria)

What is the expected number of HSPs with a score
of at least S?
$E = K m n\, e^{-\lambda S}$
K: constant dependent on the nucleotide frequencies
m, n: lengths of the sequences
$\lambda = \log_e(1/p)$, where p is the probability of a match of
identical bases (1/4 for equal base frequencies)

65
Measuring Alignment Quality (objective criteria)
Bit scores: Raw scores have little meaning without
detailed knowledge of the scoring system used. By
normalizing a raw score S using
$S' = (\lambda S - \ln K) / \ln 2$
one attains a bit score S'.
67
Measuring Alignment Quality (objective criteria)
Bit scores: The E-value corresponding to a given bit score S' is
$E = m n\, 2^{-S'}$
Bit scores subsume the statistical essence
of the scoring system; hence, to calculate
significance one needs to know only the size of
the search space.
68
Measuring Alignment Quality (objective criteria)

Significance of an HSP score
$P(S > x) = 1 - \exp(-K m n\, e^{-\lambda x})$
$P(S > x) = 1 - \exp(-E)$
m, n: effective lengths of query and databank
sequence
E: number of expected HSPs with score at least S

69
Measuring Alignment Quality (objective criteria)

Significance
Some programs provide E-values rather than
P-values, as E is easier to understand, e.g.
E-values of 5 and 10 correspond to P-values of 0.993
and 0.99995
The P-value follows from the E-value, e.g. if one
expects to find 3 HSPs with score > S, the
probability of finding at least one is 0.95
When E < 0.01, P-values and E-values are nearly
identical
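
The relation between raw scores, bit scores, E-values and P-values can be checked with a few lines of Python. This is a sketch under the standard Karlin-Altschul formulas, E = K m n exp(-lambda S), S' = (lambda S - ln K) / ln 2 and P = 1 - exp(-E); the function names and the example parameter values are illustrative assumptions, not values from the slides.

```python
import math


def expected_hsps(K: float, m: int, n: int, lam: float, S: float) -> float:
    """E-value: expected number of HSPs with score at least S."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S: float, lam: float, K: float) -> float:
    """Normalize a raw score S to a bit score S'."""
    return (lam * S - math.log(K)) / math.log(2)


def e_from_bits(bits: float, m: int, n: int) -> float:
    """E-value computed from a bit score and the search space size m*n."""
    return m * n * 2 ** (-bits)


def p_value(E: float) -> float:
    """Probability of finding at least one HSP with score at least S."""
    return 1 - math.exp(-E)


for e in (0.01, 3, 5, 10):
    print(e, round(p_value(e), 5))
```

The loop reproduces the numbers quoted above: E = 3 gives P of about 0.95, E = 5 about 0.993, and for E = 0.01 the P-value is almost equal to E.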

70
Scoring matrices

Rationale
certain aa replacements occur often in a protein.
Because proteins are functioning despite these
changes the substituted aa are compatible with
structure and function. Yet other substitutions
are rare.
A scoring matrix accounts for these
differences

71
Scoring matrices

Dayhoff, 1978
PAM (point accepted mutation) matrices
Henikoff & Henikoff, 1992
BLOSUM (blocks amino acid substitution
matrices)

72
PAM matrices

This family of matrices lists the likelihood of
change from one aa to another in homologous
proteins during evolution
Each matrix gives the changes expected for a
given period of evolutionary time
Assumption
Each change in the current aa is independent of
previous mutation events at that site.
aa changes observed in short evolutionary times
can be extrapolated to longer periods

73
PAM matrices

aa substitutions that occur in a group of
evolving proteins were estimated.
Because these changes are observed in closely
related proteins, they represent aa substitutions
that do not change the function of the protein ->
accepted mutations
1572 changes in 71 groups of protein sequences
were observed
The number of changes at each aa was counted on
a phylogenetic tree
and divided by the exposure to mutation (aa
frequency x number of aa in that group)
PAM1
Asn, Ser, Asp, Glu: highly mutable
Cys, Trp: least mutable

74
PAM matrices

The PAM1 matrix gives the probability of a
single change
To obtain PAM matrices for N mutations, the PAM1
matrix is multiplied by itself N times
PAM250 represents a level of 250% change
(corresponds to 20% similarity)
Computer simulations have shown that PAM250
provides a better scoring alignment than lower
numbered PAMs for distantly related (14-27% similarity)
proteins.

75
PAM log odds score

PAM matrices are usually converted into log odds
matrices
Odds score: the ratio of the hypothesis that the change
represents an authentic evolutionary variation to
the hypothesis that the change occurred because
of random sequence variation (no biological
significance)
Phe -> Tyr
Phe -> Tyr score in PAM250: 0.15
Frequency of Phe in data: 0.04
Log odds score: 10 x log10(0.15/0.04) = 5.7
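
The worked example on this slide can be reproduced in Python. The sketch assumes the usual convention of taking the base-10 logarithm of the odds ratio and multiplying by 10; the function name is illustrative.

```python
import math


def log_odds(p_change: float, background_freq: float, scale: float = 10.0) -> float:
    """Scaled log10 of (probability of the change / chance frequency)."""
    return scale * math.log10(p_change / background_freq)


# Phe -> Tyr: PAM250 value 0.15, frequency of Phe in the data 0.04
print(round(log_odds(0.15, 0.04), 1))
```

A positive score means the substitution is seen more often than expected by chance; a negative score means it is seen less often.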

76
PAM250
77
BLOSUM matrices

500 families of related proteins
Search for ungapped aa blocks that were present

78
Gap scores

The cost of introducing a gap must be higher than
the cost for extending it
$W_x = g + r x$
g gap opening penalty
x length of the gap
r gap extension penalty

79
(No Transcript)
80
Assignment 7

You have obtained a peptide sequence
(ASFPCLNGGTCNDQVNGYVCVCAQDTSVSTCET) and
would like to find its position in the full-length
protein.
Software: Word and Blast2 Sequences
Download: UEGF1.doc

81
Multiple alignments

Problem
Alignment of
two sequences (length N): N^2 comparisons
300 aa: 9 x 10^4
three sequences (length N): N^3 comparisons
300 aa: 2.7 x 10^7
-> exact multiple alignments are not feasible for
most data sets; heuristic methods are required

82
Progressive methods for multiple alignment

PILEUP
Part of the GCG package
CLUSTALW
Available as local programs (Mac, PC, Unix)
Could be also run on remote computers

83
Progressive alignment algorithm

Produce a global pairwise alignment for all pairs
of sequences
Full dynamic programming
K-tuple approach, similar to FASTA
Calculate the pairwise alignment scores
Build a tree based on the genetic distances
derived from the alignment scores (NJ)
Align the sequences sequentially, guided by the
phylogenetic relationships indicated by the tree

84
Progressive alignment weighting

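Problem: similar sequences will produce a bias in
the alignment
Solution: weighting of sequences based on
alignment scores

Example tree: A (branch length 0.2) and B (branch length 0.1) share an internal branch of length 0.3; C has branch length 0.5
Weight of A: 0.2 + 0.3/2 = 0.35
Weight of B: 0.1 + 0.3/2 = 0.25
Weight of C: 0.5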
85
Progressive alignment problems

Dependence on the initial pairwise alignments
No problem for closely related sequences
The more diverged the sequences are, the more
problematic is the alignment
Choice of suitable scoring matrices and gap
penalties that apply to the entire set of
sequences
-> Bayesian methods such as hidden Markov models
(HMMs) may be preferable for distantly related
sequences

86
Single sequence queries

Rationale a single sequence should be searched
against a database to identify those sequences,
which are most similar
Identification of a related gene in another
organism
Identification of a related gene in the same
organism
Similarity may provide clues about function

87
Data banks

Genomic sequences
Complete genomes
cDNA/proteins
ESTs (expressed sequence tags)

88
FASTA and BLAST rationale

Main idea: Good alignments are expected to share
several aa. Hence, consecutive shared aa (words,
k-tuples) could serve as an indicator of quality.
Observation: HSPs of interest are usually longer
than a single word, so look for multiple hits on
the same diagonal, separated by a short distance

89
FASTA

FASTA3 is the latest version with increased
ability to detect distantly related sequences
Input
k: size of matching sequence patterns or words,
called k-tuples
Similarity matrix
Compares query sequence pairwise with each
sequence in the database

90
FASTA hashing algorithm

Search for k consecutive matches
Use a precompiled table that lists where in the
database each possible word occurs
Generating the table takes time on the order of L (size
of databank)
Using it takes time on the order of N (size of query sequence)
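
The lookup-table idea can be sketched in Python as follows. This is an illustration of k-tuple hashing in general, not the actual FASTA implementation; the function names are assumptions. The table is built once from the database sequence, and each query word is then looked up directly instead of being compared with every database position.

```python
from collections import defaultdict


def build_word_table(seq: str, k: int):
    """Map every k-tuple (word) to the list of positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query: str, table, k: int):
    """Count word hits per diagonal (database position minus query position)."""
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        for d in table.get(query[q:q + k], []):
            hits[d - q] += 1
    return dict(hits)


table = build_word_table("ACGTACGT", 2)
print(diagonal_hits("CGTA", table, 2))
```

Diagonals with many hits mark regions where several consecutive words match, which are the candidates for the later rescoring steps.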

91
FASTA hashing algorithm

word size 1 aa

92
FASTA algorithm

Hashing: build a library of k consecutive
residues and search the database represented by
such a library
Note: the library is searched, not the database itself
DNA: k = 4-6; protein: k = 1-2
Longer words result in a faster but less
sensitive search
Joining: matches within a certain distance
of each other are joined, along with the region
between them, into a longer matching region
without gaps.

93
FASTA algorithm

Filtering: the 10 best matching regions are
rescored using a scoring matrix (BLOSUM or PAM)
Ends of the regions are trimmed to remove
residues not contributing to the score
The best scoring region, INIT1, is reported
Joining: regions that are near enough are joined.
The score of this larger region, including
penalties for gaps needed to join the initial
regions, is reported as INITN.
Distance for proteins: k = 1: 32; k = 2: 16

94
FASTA algorithm

Later versions of FASTA include an optimization
step
When INITN reaches a certain threshold, the score
of the region is recalculated to produce an OPT
score by performing a full local alignment using
dynamic programming.
This procedure increases sensitivity but
decreases selectivity

95
Limitations of FASTA

FASTA can miss significant similarity since
For proteins, similar sequences do not have to
share identical residues
Asp-Lys-Val is quite similar to Glu-Arg-Ile yet
it is missed even with k-tuple size of 1 since no
amino acid matches
For nucleic acids, due to codon wobble, DNA
sequences may look like XXyXXyXXy where Xs are
conserved and ys are not

96
BLAST (1): Basic Local Alignment Search Tool

Filter: low-complexity regions are removed
Divide query sequence into words (sliding by 1
position)
Include imperfection: based on a scoring matrix,
similar words which produce a score higher than T
are assembled into a list
This step is included to permit imperfect
matches between subject and query sequence
Usually about 50 entries per word (rather than
20 x 20 x 20 = 8000)

97
BLAST (2): Basic Local Alignment Search Tool

Approach: find segment pairs by first finding
word pairs that score above a threshold, i.e.,
find word pairs of fixed length w with a score of
at least T
Key concept: seems similar to FASTA, but we are
searching for words which score above T rather
than words that match exactly

98
BLAST (3): Basic Local Alignment Search Tool

Each database entry is scanned for a match to one
of the list entries
Use the short matched regions (x) lying on the
same diagonal and within distance A as starting
points for a longer ungapped alignment between
words

99
BLAST (4): Basic Local Alignment Search Tool

Extension of the alignment from the matching
words in each direction along the sequences.
Extension continues as long as the score
increases. The extension is stopped when the
accumulated score stops increasing and has just
begun to fall a small amount below the best score
found for a shorter extension.
The obtained segment is called a high-scoring
segment pair (HSP)
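
The extension step can be sketched in Python as an ungapped extension with a drop-off limit. This is a simplified illustration, not the BLAST source code: it assumes plain match/mismatch scores instead of a substitution matrix, and the function name, the parameter x_drop and all default values are hypothetical.

```python
def extend_hit(query, subject, q_start, s_start, word_len,
               match=1, mismatch=-1, x_drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (score, query start, query end) of the resulting segment,
    with the end position exclusive.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(q, s, step):
        best = total = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            total += pair_score(query[q], subject[s])
            length += 1
            if total > best:
                best, best_len = total, length
            elif best - total > x_drop:
                break  # score fell too far below the best one seen
            q += step
            s += step
        return best, best_len

    seed = sum(pair_score(query[q_start + k], subject[s_start + k])
               for k in range(word_len))
    right, right_len = extend(q_start + word_len, s_start + word_len, 1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return seed + left + right, q_start - left_len, q_start + word_len + right_len


print(extend_hit("TTACGTACGG", "CCACGTACAA", 3, 3, 3))
```

In the example the word CGT at position 3 is extended to the segment ACGTAC (query positions 2 to 7) with a score of 6; positions that only lower the score are trimmed off again.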

100
BLAST (5): Basic Local Alignment Search Tool

Determine whether the HSP has a score larger than
a cutoff score S
S is determined by examining the range of scores
found by comparing random sequences and by
choosing a value that is significantly greater
Determine significance of each HSP score
$P(S > x) = 1 - \exp(-K m n\, e^{-\lambda x})$
$P(S > x) = 1 - \exp(-E)$
m, n: effective lengths of query and databank
sequence
E: number of expected HSPs with score at least S

101
BLAST (6): Basic Local Alignment Search Tool

Significance
BLAST provides E-values rather than P-values, as
E is easier to understand, e.g. E-values of 5 and
10 correspond to P-values of 0.993 and 0.99995
The P-value follows from the E-value, e.g. if one
expects to find 3 HSPs with score > S, the
probability of finding at least one is 0.95
When E < 0.01, P-values and E-values are nearly
identical

102
Selecting the BLAST program
103
FASTA-BLAST comparison
104
Significance of database searches (1)

All previous theory referred to the comparison of
two sequences- how should one consider the entire
set of sequences?
1. Significance is independent of the length of a
sequence -> multiply pairwise significance by
number of sequence entries (FASTA)
2. Significance depends on length, as long
sequences are composed of multiple distinct
domains -> treat entire database as a single
sequence for calculation of significance

105
Significance of database searches (2)

Until now, only ungapped sequences were
considered.
Computational experiments and analytical results
suggest that the same theory could be applied to
gapped alignments
For ungapped alignments the statistical
parameters (λ, K) can be calculated using analytic
formulas
For gapped alignments these parameters must be
estimated from a large-scale comparison of
random sequences

106
Significance of database searches (3)

Gapped alignments
FASTA: local alignment scores are produced for
the comparison of the query with every databank
sequence. Most of these scores involve unrelated
sequences; they could therefore be used to
estimate λ and K. Problem: scores from pairs of
related sequences should be excluded
BLAST: λ and K are estimated for a selected set
of substitution matrices and gap costs. The
estimation could be done with real sequences, but
has instead relied on random sequences

107
Hidden Markov Model (HMM)

HMMs offer a more systematic approach to
estimating model parameters
HMMs could be compared to a kind of dynamic
statistical profile
Like an ordinary profile, it is built by
analyzing the distribution of aa in a training
set of related proteins
The topology of a HMM can be visualized as a
finite state machine

108
Hidden Markov Model (HMM)
[Figure: HMM with Begin and End nodes and rows of delete, insert and match states; residue labels A, C, C, G]
Movement from stage n to n+1 with a certain
transition probability
109
Hidden Markov Model (HMM)

More than one path leads to the same result

[Figure: HMM with Begin and End nodes and rows of delete, insert and match states; residue labels A, C, C, G]
Movement from stage n to n+1 with a certain
transition probability
110
Hidden Markov Model (HMM)

The probability of a given sequence is obtained
by the sum of log_e(transition probabilities)
It is called a hidden Markov model because the path is hidden
Transition probabilities are obtained by training
on a set of sequences
Initialization by estimated transition
probabilities
All possible paths generating a given sequence
are visited proportional to the estimated
transition probabilities
Counting the number of times a given transition
was visited during the above step provides
improved transition probabilities
The Viterbi algorithm is used on a trained HMM to
determine the best path
The Viterbi algorithm is similar to dynamic
programming
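
A small Python sketch of the Viterbi algorithm. It does not use the profile HMM with match, insert and delete states shown on the slides; instead it assumes a toy two-state model (AT-rich versus GC-rich regions) with made-up probabilities, so that the dynamic programming step stays easy to follow. All names and numbers are illustrative.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    # best[t][s] = log probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

path, log_p = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```

As in sequence alignment by dynamic programming, each cell only needs the best scores of the previous column, so the best path is found without enumerating all possible paths.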

111
Hidden Markov Model (HMM)

HMM is a general technique that can be applied to
many different questions
Multiple sequence alignment
Identification of conserved domains
Gene prediction
Protein secondary structure prediction

112
Single aa sequence query programs

Sequence similarity with query sequence
FASTA, BLAST
Alignment search with profile (scoring matrix
with gap penalties)
PROFILESEARCH
Search with position specific scoring matrix
(PSSM) representing ungapped sequence alignment
(BLOCK)
MAST
Iterative alignment search for similar sequences
that starts with query sequence, builds a gapped
multiple alignment, and then uses this to augment
the search
PSI-BLAST
Search query sequence for patterns representative
of protein families
PROSITE, INTERPRO, PFAM, CDD/IMPALA

113
(No Transcript)
114
(No Transcript)
115
(No Transcript)
116
(No Transcript)
117
Comparison of EMBL and NCBI
118
(No Transcript)
119
Assignments 8 to 10

You have isolated a number of proteins by their
interaction with a protein known
to interact with RING finger proteins. By
sequencing the proteins you got
from human cell lines: msvdmnsqgsdsneedydpnceeeeeeeeddpgdie
from C. elegans: mnsddeiymegsasseddmddeclsd and
mddedmsctsgddyagygdedyyneadv
from Drosophila melanogaster: mdsdndndfcdnvdsgnvssgddgdddfg and
mdsdiemdmesdndgeydddydyyntgedcd
from Saccharomyces cerevisiae: mssgtendqfysfdesdsssielyeshntseftihglv
from Arabidopsis thaliana: mdnnsvigsevdaeadesyvnaaledgqtgkks and
mddyfsaeeeacyyssdqdsldgidneeselqpl
a. Find the complete protein sequences for every
given peptide and align the sequences to find
out about their overall homology.
b. Are there RING finger motifs in your proteins
and, if yes, how many and where?
c. RING finger proteins share a common protein
motif of
C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C.
d. Are there other remarkable protein motifs?
Software: Word, BLAST, FastA and ClustalW
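
The RING finger consensus given in part c can be written as a regular expression. The Python sketch below is one possible translation (X stands for any residue, Xa-b for a to b residues, C/H for either cysteine or histidine); the function name is illustrative and the approach is a simple pattern scan, not a replacement for motif databases such as PROSITE.

```python
import re

# C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C
RING_FINGER = re.compile(
    r"C.{2}C.{9,29}C.{1,3}H.{2,3}[CH].{2}C.{4,48}C.{2}C"
)


def find_ring_fingers(protein: str):
    """Return (start position, matched segment) for each non-overlapping hit."""
    return [(m.start(), m.group(0))
            for m in RING_FINGER.finditer(protein.upper())]


# Synthetic test sequence built to satisfy the consensus exactly.
example = ("MK" + "CAAC" + "A" * 9 + "CAHAACAAC" + "A" * 4 + "CAAC" + "KK")
print(find_ring_fingers(example))
```

Start positions are 0-based. Because the spacers have variable length, a protein may contain several overlapping ways to satisfy the pattern; this scan reports only non-overlapping matches.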

120
Assignment 9

You received a manuscript submitted for
publication. The authors claim that they have
discovered a gene involved in abnormal muscle
growth in salmon (hs, heavy salmon). You should
decide if the paper should be published.
b. What gene is it? Is it really a novel gene?
c. Do you support the authors' claim that this
is a salmon gene?
d. Could the authors' claim be true?
Software: Word, FastA, BLAST, PubMed
Download: hs_gene.doc

121
Assignment 10

Inspired by the manuscripts you reviewed, you
decide to look for the gene in whales.
a. Make a sequence alignment to design primers
for cross species amplification
b. Design primers that have a fair chance to
amplify the gene from whales
c. You know that human contaminations are a
problem in your lab. What would you do to
minimize the risk of a human contamination?
Software Word, BLAST, FastA, ClustalW

122
Organismal databases
123
Arabidopsis thaliana
124
Drosophila BDGP (1)
125
Drosophila BDGP (2)
126
Drosophila Flybase
127
Drosophila NCBI
128
Assignments 11 to 13

In Drosophila microsatellites are very short. Try
to find the longest dinucleotide microsatellite
in D. melanogaster
Software: FLYBASE, BDGP, BLAST

129
Assignment 12

ITS sequences are widely employed to reconstruct
the phylogeny of closely related species. The
major advantage of ITS sequences is that you
can use primers (located in the 18S and 28S
rDNA) which are conserved across many species.
You have used these conserved primers to amplify
the complete ITS region from oaks. The PCR
products were cloned and sequenced. In the folder
oaks you find the results of your experiment.
Figure 1. Organization of the rDNA
a. Make a contig of your sequences
b. Define the boundaries of the genes with the
spacers
c. Verify that your sequences originate from
oaks.
Software: Word, JaMBW, ClustalW, BLAST, FastA
Download: oak1, oak2, oak3, oak4, oak5

130
Assignment 13

You received one pair of microsatellite primers,
ran a PCR and found a highly interesting pattern
in one population (no variability). Inspired by
this result, you would like to know more
about the locus. Unfortunately, you found only
the sequence of one of the primers
(ttttgtcgttttcgttatg) and your friend has gone
on a 6-month holiday. Fortunately, you are
working with one of the best studied organisms,
Drosophila melanogaster, so you have every
opportunity to investigate!
a. What is the repeat motif of your
microsatellite?
b. Which gene is in close proximity to the
microsatellite?
c. On which chromosome is the gene located?
d. Determine the number of available transposon
insertions in the gene
e. Where in the gene are the transposons
inserted?
f. What would you do to obtain a fly stock
with the gene deleted?
Software: FastA, BLAST, FLYBASE, BDGP

131
Gene prediction
132
Gene prediction

Goal: identify those regions that code for
proteins
Direct approach: Look for stretches that can be
interpreted as protein using the genetic code
Statistical approaches: Use other knowledge about
likely coding regions

5' UTR
Exons
Introns
3' UTR
133
Gene prediction direct approach

Genetic code
The universal genetic code is common to all
organisms
Prokaryotes, mitochondria and chloroplasts often
use slightly different genetic codes
More than one tRNA may be present for a given
codon, allowing more than one possible
translation product
Differences in genetic codes occur in start and
stop codons only
Alternate initiation codons: codons that encode
amino acids but can also be used to start
translation (GUG, UUG, AUA, UUA, CUG)
Suppressor tRNA codons: codons that normally stop
translation but are translated as amino acids
(UAG, UGA, UAA)

134
Gene prediction direct approach

Reading Frames
Since nucleotide sequences are read three bases
at a time, there are three possible frames in
which a given nucleotide sequence can be read
(in the forward direction)
Taking the complement of the sequence and reading
in the reverse direction gives a total of six
reading frames
Open reading frames are defined by a set of
codons not interrupted by a stop codon
Note not all ORFs are actually used
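
The six reading frames can be generated with a short Python sketch. It assumes the standard genetic code only (none of the alternative codes mentioned earlier) and an unambiguous DNA sequence; the function names and frame labels are illustrative. Stop codons are shown as "*", so an ORF appears as a stretch without "*".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(dna: str):
    """Return translations for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Frames -1 to -3 are read from the reverse complement, which corresponds to reading the C strand in its own 5' to 3' direction.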

135
Gene prediction: direct approach

Statistical support by Fickett's statistic:
codon usage bias
Observation: every third base tends to be the
same one much more often than expected by chance.
The reason for this is codon usage bias
Different levels of expression of different tRNAs
for a given amino acid lead to pressure on coding
regions to conform to the preferred codon usage
Non-coding regions, on the other hand, are under no
such selective pressure and can drift
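
The periodicity described above can be made concrete with a simplified position-asymmetry measure in the spirit of Fickett's statistic. This is only a sketch, not the full published statistic: for each base it compares the highest and lowest count over the three codon positions. The function name position_asymmetry is an assumption made for this example.

```python
from collections import Counter

def position_asymmetry(seq):
    """For each base, ratio of its highest to its lowest count over the
    three codon positions (1 is added to the denominator so that a count
    of zero does not cause a division by zero)."""
    seq = seq.upper()
    # counts[k] holds the base counts at codon position k (0, 1 or 2).
    counts = [Counter(seq[offset::3]) for offset in range(3)]
    return {
        base: max(c[base] for c in counts) / (min(c[base] for c in counts) + 1)
        for base in "ACGT"
    }

# Toy "coding-like" sequence: alanine codons GCN, so G and C are
# strongly tied to codon positions 1 and 2.
periodic = "GCAGCTGCCGCG" * 5
# Toy sequence in which every base is spread evenly over the positions.
uniform = "ACGT" * 15

print(position_asymmetry(periodic))
print(position_asymmetry(uniform))
```

Values well above 1 indicate that a base is concentrated at one codon position, as expected in coding regions; values near or below 1 indicate no positional preference.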

136
Gene prediction: direct approach

Statistical support by Fickett's statistic:
codon usage bias
Example: glycine codon frequencies

137
Gene prediction: direct approach
exon
138
Gene prediction: direct approach

Problem: the direct approach works well for
prokaryotes but not for eukaryotes
Codon usage bias is not constant across genes
Introns in eukaryotes

139
Gene prediction: statistical approach

To discriminate between different regions of a
gene, typical sequence elements are used as
clues
Content sensor: a region of residues with similar
properties (introns, exons)
Signal sensor: a specific signal sequence (may be
a consensus)

5' UTR
Exons
Introns
3' UTR
140
Pre-mRNA splicing
141
Gene Finding Software

GENSCAN, HMMGENE, GENMARK: HMMs
GRAIL: neural network
142
Evaluation of gene predictions

One has to discriminate between
True positives (TP)
False positives (FP)
False negatives (FN)
Sensitivity = TP / (TP + FN)
Specificity = TP / (TP + FP)
GRAIL was applied to different human data sets:
sensitivity 0.48-0.65, specificity 0.61-0.72
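
A small Python sketch of the two measures as defined on this slide, evaluated at the nucleotide level. It assumes that the true and the predicted coding positions are given as non-empty sets of sequence coordinates; the function name evaluate and the coordinates are made up for illustration.

```python
def evaluate(true_coding, predicted_coding):
    """Return (sensitivity, specificity) using the slide's definitions:
    sensitivity = TP / (TP + FN), specificity = TP / (TP + FP)."""
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, but missed
    fp = len(predicted_coding - true_coding)   # predicted, but not coding
    return tp / (tp + fn), tp / (tp + fp)

# Hypothetical example: a true exon at positions 100-199 and a
# predicted exon at positions 150-229.
true_exon = set(range(100, 200))
predicted_exon = set(range(150, 230))

sn, sp = evaluate(true_exon, predicted_exon)
print(sn, sp)  # 0.5 0.625
```

Here half of the true exon is recovered (sensitivity 0.5), and 50 of the 80 predicted positions are really coding (specificity 0.625).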

143
Promoter prediction

Similar to gene prediction, known regulatory
signals can be used to make predictions
Algorithms
Neural networks
HMMs

147
Analyzing Gene Expression (Microarray) Data
148
Assignments 14 and 15

You have transformed an Arabidopsis thaliana
mutant with a genomic sequence
(Annotierungssequenz.doc) and the presumed gene is
sufficient to restore the function of the mutant gene.
a. Find the coding sequence
b. Find the polyA signal
c. Where is the TATA box motif located?
d. Locate the gene on the A. thaliana map
e. Are cDNA clones available for this gene?
f. Where is the gene expressed?
g. Predict the protein sequence
h. Does this protein share homology with other
proteins?
i. Are there any related proteins in other
plants/animals?
j. Do these homologies indicate a possible
function?
k. Does the protein have any interesting domains?
l. Is there a transmembrane domain?
m. Predict the subcellular localization
Software: Arabidopsis database (TAIR),
GENSCAN, Genfinder, MCB search, ExPASy, PLACE
Download Annotierungssequenz.doc

149
Assignment 15

Based on sequence polymorphism data, your friend
concluded that a given sequence has been the
target of selection. He asked you for advice
about the identified sequence. Characterize the
sequence as thoroughly as possible, without
relying on a single source of information.
Download Unknown.doc

150
Microarray Data

A snapshot of how strongly a particular gene is
being transcribed in a tissue
Measured for tens of thousands of genes
Use of multiple tissues on a single array allows
for direct comparisons between tissues

151
Objectives of Microarray Studies

Gene discovery: which genes are affected when
exposed to a treatment?
"Hit it with a stick and see what happens"
Disease diagnosis: given a profile of expression
levels for many genes, can the unknown
treatment be predicted?
Tumor or disease classification
Time course experiments allow the study of
co-regulation of genes and the
reconstruction of regulatory networks
Pharmacogenomics
The goal of pharmacogenomics is to find
correlations between therapeutic responses to
drugs and the genetic profiles of patients.

152
Many computational and statistical problems

Image analysis (spot identification, background,
etc.)
Data management and pipelining
Normalization of data
Clustering co-regulated genes
Classifying tissue types
Regulatory network inference
Promoter identification (when combined with
genomic sequence data)

153
Microarray Technology

Spotted arrays
Attach entire sequence of genes to the array
Create cDNA from a tissue (expressed genes)
Wash the pool of cDNAs over the array
Complementary sequences bind
Oligonucleotide arrays (Affymetrix chips)
Attach short (25 bp) oligos instead of entire genes

154
GTTCGA.... The gene
GUUCGA.... mRNA
CAAGCT.... cDNA (from the mRNA via reverse transcription)
155
Spotted arrays are usually treated with samples
from two different tissues, each labeled with a
different color of dye (Red and Green)
Highly expressed in tissue A
Highly expressed in tissue B
157
The Data
158
Goal: cluster genes that share a profile
Experiment
159
The approach is formally similar to
distance-based phylogenetic inference

Compute a matrix of pairwise profile similarity
scores between genes
Use these scores in something like UPGMA
Eisen et al. 1998. Cluster analysis and display
of genome-wide expression patterns. PNAS
95:14863-14868
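
The two steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by Eisen et al.: it assumes 1 minus the Pearson correlation as the distance between two expression profiles and merges clusters by average linkage, which is the rule UPGMA uses. The gene names and expression values are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering on the distance 1 - r.
    profiles: dict mapping gene name -> list of expression values.
    Returns the merges in order as (cluster_a, cluster_b, distance)."""
    genes = list(profiles)
    # Step 1: matrix of pairwise distances between genes.
    dist = {(a, b): 1 - pearson(profiles[a], profiles[b])
            for a in genes for b in genes if a != b}

    def cluster_distance(c1, c2):
        # Average over all gene pairs drawn from the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    # Step 2: repeatedly merge the two closest clusters.
    clusters = [(g,) for g in genes]
    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_distance(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
        d, c1, c2 = min(pairs, key=lambda p: p[0])
        merges.append((c1, c2, d))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

profiles = {
    "g1": [1, 2, 3, 4],   # rising profile
    "g2": [2, 4, 6, 8],   # same shape as g1, higher level
    "g3": [4, 3, 2, 1],   # falling profile
    "g4": [8, 6, 4, 3],   # falling, similar to g3
}
for a, b, d in average_linkage(profiles):
    print(a, b, round(d, 3))
```

Because correlation ignores the absolute expression level, g1 and g2 are merged first although their values differ twofold; the rising and the falling pair are joined only in the last step.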

161
Clustering Techniques

Bottom-up techniques
Each gene starts in its own cluster, and genes
are sequentially clustered in a hierarchical
manner
Top-down techniques
Begin with an initial number of clusters and
initial positions for the cluster centers (e.g.,
averages). Genes are added to the clusters
according to an optimality criterion.
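
One common method that fits the description of a fixed number of clusters with initial centers is k-means, sketched below in plain Python. Treating it as the intended example is an assumption; the optimality criterion used here is the squared Euclidean distance to the nearest center, and the initial centers are supplied by the caller.

```python
def kmeans(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center
    to the average of its points; repeat a fixed number of times."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical expression profiles (two conditions per gene).
points = [[1, 1], [1, 2], [9, 9], [10, 9]]
centers, groups = kmeans(points, centers=[[0, 0], [10, 10]])
print(centers)  # [[1.0, 1.5], [9.5, 9.0]]
print(groups)   # [[[1, 1], [1, 2]], [[9, 9], [10, 9]]]
```

The result depends on the number of clusters and on the initial centers, so in practice several starting configurations are usually compared.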

162
Clustering Techniques

Principal component techniques
Identify groups of genes that are highly
correlated with some underlying factor
(principal component).
Self-organizing maps
Similar to Top-down clustering, with restrictions
placed on dimensionality of the final result.
