Computing Patterns in Biology

1
Computing Patterns in Biology

Stuart M. Brown
New York University School of Medicine

2
Why Compute Biological Patterns?

Because we can
(computer scientists love to find interesting
problems)
Patterns are beautiful
It's practical - helps with gene cloning
experiments, predicting functions of new proteins
Systems biology - figure out circuits of
regulation, predict outcome of changes, design
new biological systems

3
Overview

Restriction sites
Finding genes in DNA sequences
Regulatory sites in DNA
Protein signals (transport and processing)
Protein functional Motifs
Protein families
Protein 3-D structure

4
Restriction Sites

Bacteria make restriction enzymes that cut DNA
at specific sequences (4-8 base patterns)
Very simple to find these patterns - can even use
the Find function of your web browser or word
processor
Open any page of text and look for CAT
you now have a restriction site search program!
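
The "Find function" idea above can be sketched in a few lines of Python. This is only an illustration, not the method NEBcutter or any other tool uses: the function name find_sites and the three-enzyme table are assumptions made for the example (the recognition sequences themselves are the standard ones for EcoRI, BamHI and HindIII).

```python
import re

# Recognition sequences (5'->3') for three common enzymes.
# The table is a tiny illustrative subset, not a complete enzyme list.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

def find_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [0-based start positions]} for each recognition site."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        # A zero-width lookahead reports overlapping occurrences as well.
        hits[name] = [m.start() for m in re.finditer(f"(?={site})", dna)]
    return hits

# First line of the genomic sequence shown on slide 7.
demo = "GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAA"
print(find_sites(demo))
```

Because these three sites are palindromic, scanning one strand is enough; a non-palindromic site would also need a search on the reverse complement.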

5
NEBcutter2

http://tools.neb.com/NEBcutter2/

6
Finding Genes in Genomic DNA

Translate (in all 6 reading frames) and look for
similarity to known protein sequences
Look for long Open Reading Frames (ORFs) between
start and stop codons (start = ATG; stop = TAA,
TAG, TGA)
Look for known gene markers
TATA box, intron splice sites, etc.
Statistical methods (codon preference)
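
The ORF search described above (all 6 reading frames, ATG to the first in-frame stop) can be sketched as follows. This is a minimal illustration under simple assumptions, not how ORFfinder or GRAIL work: the names find_orfs and min_codons are invented for the example, and the input is assumed to contain only A, C, G and T.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end, n_codons) for each ORF found.

    An ORF here runs from an ATG to the first in-frame stop codon.
    start and end are 0-based offsets on the strand being scanned,
    and n_codons counts the ATG but not the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a new ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield strand, frame, start, i + 3, n_codons
                    start = None                   # close it at the stop

toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=3)))
```

The min_codons cutoff is what makes "long" ORFs stand out: short ATG-to-stop stretches occur by chance in any frame, so only ORFs above a length threshold are worth reporting.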

7
GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAA
AAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATA
AAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGG
CAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAA
TAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAA
ATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGC
AAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAAC
TCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGG
AATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAA
AACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAAT
GAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAA
AAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACT
GCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGA
TTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACT
ACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTT
GTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAAC
CAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATAT
ATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAA
CTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAAT
GGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATG
AGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAA
AGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTA
CACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGA
ACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAA
AAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCA
CCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAAT
ACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGG
TAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCA
CCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAA
TGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGG
AAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACC
TACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCAT
TGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAA
TCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCG
TGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGT
CAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATA
TTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTT
TGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGG
CATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCA
GCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTT
GCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGA
GGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCT
GAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAAC
ACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT
8
Intron/Exon structure

Gene finding programs work well in bacteria
But in eukaryotes, none of the gene prediction
programs does an adequate job of predicting
intron/exon boundaries
The only reasonable gene models are based on
alignment of cDNAs to genome sequence
Perhaps 50% of all human genes still do not have
a correct coding sequence defined
(transcription start, intron splice sites)

9
Gene Finding on the Web

GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
http://compbio.ornl.gov/grailexp
ORFfinder: NCBI
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
DNA translation: Univ. of Minnesota Med. School
http://alces.med.umn.edu/webtrans.html
GenLang
http://cbil.humgen.upenn.edu/sdong/genlang.html
BCM GeneFinder: Baylor College of Medicine,
Houston, TX
http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

10
Genomic Sequence

Once each gene is located on the chromosome, it
becomes possible to get upstream genomic sequence
This is where transcription factor (TF) binding
sites are located
promoters and enhancers
Search for known TF sites, and discover new ones
(among co-regulated genes)

11
Phage CRO repressor bound to DNA (Andrew Coulson &
Roger Sayle, with RasMol, Univ. of Edinburgh,
1993)
12
Many DNA Regulatory Sequences are Known

Databases of promoters, enhancers, etc.
TransFac: the Transcription Factor database
4342 entries w/ known protein binding and
transcriptional regulatory functions
Maintained by Gesellschaft für Biotechnologische
Forschung mbH (Braunschweig, Germany)
The Eukaryotic Promoter Database (EPD)
Bucher & Trifonov (1986) NAR 14: 10009-26
1314 entries taken directly from scientific
literature
Maintained by ISREC (Lausanne, Switzerland) as a
subset of the EMBL database

13
TF Binding sites lack information

Most TF binding sites are determined by just a
few base pairs (typically 6)
Sequence is variable (consensus)
This is not enough information for proteins to
locate unique promoters for each gene
TFs bind cooperatively and combinatorially
The key is the location of the sites in relation
to each other and to the transcription units of genes
Can use information from alignment of related
genes
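
The claim that a 6-base site carries too little information can be made concrete with a rough calculation. The sketch below assumes random DNA with equal base frequencies and a genome of about 3 billion bases (an assumed round figure for a human-sized genome, not a number from these slides); the function names are invented for the example.

```python
import math

def expected_chance_matches(site_length, genome_size, both_strands=True):
    """Expected occurrences of one fixed site in random, equal-base DNA."""
    p_match = 0.25 ** site_length              # chance of a match at one position
    positions = genome_size - site_length + 1  # possible start positions
    return p_match * positions * (2 if both_strands else 1)

def max_site_bits(site_length):
    """Upper bound on information in a site: 2 bits per fully specified base."""
    return 2.0 * site_length

def bits_to_address(genome_size):
    """Bits needed to single out one position in the genome."""
    return math.log2(genome_size)

GENOME = 3_000_000_000  # assumption: roughly human-sized genome
print(round(expected_chance_matches(6, GENOME)))
print(max_site_bits(6), round(bits_to_address(GENOME), 1))
```

A variable (consensus) site carries even fewer bits than this upper bound, which is why the spacing and combination of several sites has to supply the missing specificity.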

14
Sequence Logos
15
Tools to find TF sites in DNA

GCG FINDPATTERNS
with database file TFSITES.DAT
Macintosh (Signal Scan), PC/UNIX (Promoter Scan)
Dr. Dan S. Prestridge, Univ. of Minnesota

16
Websites for Promoter finding

Promoter Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov/molbio/proscan/
Promoter Scan II: Univ. of Minnesota and Axyx
Pharmaceuticals
http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
Signal Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
Transcription Element Search (TESS): Center for
Bioinformatics, Univ. of Pennsylvania
http://www.cbil.upenn.edu/tess/
Search TransFac at GBF with MatInspector,
PatSearch, and FunSiteP
http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
TargetFinder: Telethon Inst. of Genetics and
Medicine, Milan, Italy
http://hercules.tigem.it/TargetFinder.html

17
Protein Sequence
18
Protein Sequence Analysis

Molecular properties (pH, mol. wt., isoelectric
point, hydrophobicity)
Motifs (signal peptide, coiled-coil,
trans-membrane, etc.)
Protein Families
Secondary Structure (helix vs. beta-sheet)
3-D prediction, Threading

19
Chemical Properties of Proteins

Proteins are linear polymers of 20 amino acids
Chemical properties of the protein are determined
by its amino acids
Molecular wt., pH, isoelectric point are simple
calculations from amino acid composition
Hydrophobicity is a property of groups of amino
acids - best examined as a graph
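
Because hydrophobicity is a property of groups of residues, it is usually computed as a sliding-window average. The sketch below uses the Kyte-Doolittle hydropathy scale (positive values are hydrophobic) and the window of 19 quoted on the next slide; the function name hydropathy_profile is an assumption for this example, and the output is a list of numbers to feed to any plotting tool rather than a finished graph.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Return [(center_index, mean_hydropathy)] for each full window.

    center_index is the 0-based position of the middle residue.
    Only the 20 standard one-letter codes are accepted.
    """
    scores = [KD[aa] for aa in protein.upper()]
    half = window // 2
    profile = []
    for i in range(len(scores) - window + 1):
        mean = sum(scores[i:i + window]) / window
        profile.append((i + half, mean))
    return profile

# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "KKDEKR" + "LLIVALLIVALLIVALLIV" + "KRDEKK"
for center, mean in hydropathy_profile(toy):
    print(center, round(mean, 2))
```

A window near 19 residues is a common choice when looking for membrane-spanning segments, since that is roughly the length of a helix crossing a lipid bilayer; shorter windows give a noisier plot that tracks surface versus buried regions.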

20
Hydrophobicity Plot
P53_HUMAN (P04637) human cellular tumor antigen
p53: Kyte-Doolittle hydrophilicity, window = 19
21
Web Sites for Simple Protein Analysis

Protein Hydrophobicity Server: Bioinformatics
Unit, Weizmann Institute of Science, Israel
http://bioinformatics.weizmann.ac.il/hydroph/
SAPS - statistical analysis of protein sequences:
composition, charge, hydrophobic and
transmembrane segments, cysteine spacings,
repeats and periodicity
http://www.isrec.isb-sib.ch/software/SAPS_form.html

22
EMBOSS Protein Analysis Toolkit

plotorf: simple open reading frame finder
garnier: predicts 2ndary structure
charge: plot of protein charge
octanol: hydrophobicity plot
pepwindow: hydropathy plot
pepinfo: plots protein secondary structure and
hydrophobicity in parallel panels
tmap: predicts transmembrane regions
topo: draws a map of a transmembrane protein
pepwheel: shows protein sequence as a helical wheel
pepcoil: predicts coiled-coil domains
helixturnhelix: predicts helix-turn-helix domains

23
Simple Motifs

Common structural motifs
Membrane spanning
Signal peptide
Coiled coil
Helix-turn-helix

24
Super-secondary Structure

Common structural motifs
Membrane spanning (GCG TransMem)
Signal peptide (GCG SPScan)
Coiled coil (GCG CoilScan)
Helix-turn-helix (GCG HTHScan)

25
Web servers that predict these structures

Predict Protein server: EMBL Heidelberg
http://www.embl-heidelberg.de/predictprotein/
SOSUI: Tokyo Univ. of Ag. Tech., Japan
http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
TMpred (transmembrane prediction): ISREC (Swiss
Institute for Experimental Cancer Research)
http://www.isrec.isb-sib.ch/software/TMPRED_form.html
COILS (coiled coil prediction): ISREC
http://www.isrec.isb-sib.ch/software/COILS_form.html
SignalP (signal peptides): Tech. Univ. of Denmark
http://www.cbs.dtu.dk/services/SignalP/

26
Protein Domains/Motifs

Proteins are built out of functional units known
as domains (or motifs)
These domains have conserved sequences
Often much more similar than their respective
proteins
Exon splicing theory (W. Gilbert)
Exons correspond to folding domains which in
turn serve as functional units
Unrelated proteins may share a single similar
exon (e.g., ATPase or DNA binding function)

27
Motifs are built from Multiple Alignments
28
Protein Motif Databases

Known protein motifs have been collected in
databases
Best database is PROSITE
The Dictionary of Protein Sites and Patterns
maintained by Amos Bairoch, at the Univ. of
Geneva, Switzerland
contains a comprehensive list of documented
protein domains constructed by expert molecular
biologists
Alignments and patterns built by hand!

29
PROSITE is based on Patterns

Each domain is defined by a simple pattern
Patterns can have alternate amino acids in each
position and defined spaces, but no gaps
Pattern searching is by exact matching, so any
new variant will not be found (can allow
mismatches, but this weakens the algorithm)
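
Since a PROSITE pattern is essentially a restricted regular expression, the exact-matching search described above can be sketched by translating the pattern syntax into a Python regex. This is a simplified translator written for illustration (the function name prosite_to_regex is an assumption), not the code used by ScanProsite or MOTIFS. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: '-' separators, x (any residue), [ABC] (alternatives),
    {ABC} (excluded residues), (n) and (n,m) repeat counts, and the
    '<' / '>' anchors for the N- and C-terminus.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # Order matters: convert exclusions before introducing regex braces.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print([m.start() for m in re.finditer(regex, "AANGSAANPTA")])
```

The second asparagine in the toy sequence is followed by a proline, so it is rejected. A single substitution at a fixed position is enough to lose a match, which is the weakness of exact pattern matching noted above.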

31
Tools for PROSITE searches

Free Mac program: MacPattern
ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
Free PC program (DOS): PATMAT
ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
GCG provides the program MOTIFS
Also in virtually all commercial programs:
MacVector, OMIGA, LaserGene, etc.

32
Websites for PROSITE Searches

ScanProsite at ExPASy: Univ. of Geneva
http://expasy.hcuge.ch/sprot/scnpsit1.html
Network Protein Sequence Analysis: Institut de
Biologie et Chimie des Protéines, Lyon, France
http://pbil.ibcp.fr/NPSA/npsa_prosite.html
PPSRCH: EBI, Cambridge, UK
http://www2.ebi.ac.uk/ppsearch/

33
Profiles

Profiles are tables of amino acid frequencies at
each position in a motif
They are built from multiple alignments
PROSITE entries also contain profiles built from
an alignment of proteins that match the pattern
Profile searching is more sensitive than pattern
searching - uses an alignment algorithm, allows
gaps
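
A profile as described above can be sketched as a per-column table of residue scores built from an alignment. The version below is deliberately simplified and is not the PROSITE or EMBOSS profile method: it assumes an ungapped alignment, a uniform background, and a simple pseudocount, and it scores ungapped windows only (real profile searches use an alignment algorithm that allows gaps). The names build_profile and best_window and the toy sequences are invented for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    background = 1.0 / len(ALPHABET)          # assumption: uniform background
    total = n_seqs + pseudocount * len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def best_window(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    scored = [
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

toy_alignment = ["ACD", "ACE", "ACD"]          # hypothetical aligned motif
profile = build_profile(toy_alignment)
print(best_window(profile, "GGACEGG"))
```

Unlike a pattern, the profile still gives a graded score to a variant never seen in the alignment, which is where the extra sensitivity comes from.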

35
EMBOSS ProfileSearch

EMBOSS has a set of profile analysis tools.
Start with a multiple alignment
fuzzpro: protein pattern search
preg: regular expression search of a protein
sequence
prophecy: creates a profile
profit: scans a database with your profile
prophet: makes pairwise alignments between a
single sequence and a profile
patmatmotifs: scans a query protein with the
PROSITE motif database

36
Websites for Profile searching

PROSITE ProfileScan: ExPASy, Geneva
http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
BLOCKS (builds profiles from PROSITE entries and
adds all matching sequences in SwissProt): Fred
Hutchinson Cancer Research Center, Seattle,
Washington, USA
http://www.blocks.fhcrc.org/blocks_search.html
PRINTS (profiles built from automatic alignments
of OWL non-redundant protein databases)
http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

37
More Protein Motif Databases

PFAM (1344 protein family HMM profiles built by
hand): Washington Univ., St. Louis
http://pfam.wustl.edu/hmmsearch.shtml
ProDom (profiles built from PSI-BLAST automatic
multiple alignments of the SwissProt database):
INRA, Toulouse, France
http://www.toulouse.inra.fr/prodom/doc/blast_form.html
This is my favorite protein database - nicely
colored results

38
Sample ProDom Output
39
Hidden Markov Models

Hidden Markov Models (HMMs) are a more
sophisticated form of profile analysis.
In addition to amino acid frequencies at each
position, they model the probability of moving
between match, insert, and delete states from one
position to the next.
Pfam is built with HMMs.
EMBOSS HMM tools (HMMER)
ehmmBuild ehmmCalibrate
ehmmSearch ehmmPfam
ehmmAlign ehmmEmit
ehmmFetch ehmmIndex

40
Discovery of new Motifs

All of the tools discussed so far rely on a
database of existing domains/motifs
How to discover new motifs
Start with a set of related proteins
Make a multiple alignment
Build a pattern or profile
You will need access to a fairly powerful UNIX
computer to search databases with custom built
profiles or HMMs.
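
The "build a pattern" step above can be sketched for the simplest case. This assumes the multiple alignment has already been made by an external program and contains no gaps; the function name alignment_to_pattern and the toy sequences are invented for the example, and hand-curated PROSITE patterns involve far more expert judgment than this.

```python
def alignment_to_pattern(alignment):
    """Build a PROSITE-style pattern from an ungapped multiple alignment.

    Each column becomes either a single conserved residue or a
    bracketed set of the residues observed at that position.
    """
    elements = []
    for column in zip(*alignment):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                   # fully conserved
        else:
            elements.append("[" + "".join(residues) + "]") # observed variants
    return "-".join(elements)

# Hypothetical aligned segment from three related proteins.
print(alignment_to_pattern(["ACDK", "ACEK", "SCDK"]))
```

A pattern built this way matches every sequence it was built from but nothing else, so in practice highly variable columns are usually relaxed to x by hand before searching a database.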

41
Patterns in Unaligned Sequences

Sometimes sequences may share just a small common
region
e.g., transcription factors
MEME: San Diego Supercomputing Facility
http://www.sdsc.edu/MEME/meme/website/meme.html
EMBOSS also includes the MEME program

42
Protein 3-D Structure
43
Self-assembly

Proteins self-assemble in solution
All of the information necessary to determine the
complex 3-D structure is in the amino acid
sequences
Structure determines function
- lock & key model of enzyme function
Know the sequence, know the function?
Nearly infinite complexity

44
Structure prediction

Protein Structure prediction is the Holy Grail
of bioinformatics
Since structure = function, structure
prediction should allow protein design, design of
inhibitors, etc.
Huge amounts of genome data - what are the
functions of all of these proteins?

45
3-D Structure

Cannot be accurately predicted from sequence
alone (known as ab initio)
Levinthal's paradox: a 100 aa protein has 3^200
possible backbone configurations - many orders of
magnitude beyond the capacity of the fastest
computers
There are perhaps only a few hundred basic
structures, but we don't yet have this vocabulary
or the ability to recognize variants on a theme
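
The scale of Levinthal's estimate of 3^200 backbone configurations for a 100-residue protein is easy to check with a little arithmetic. The search rate of 10^15 configurations per second used below is an arbitrary, generous assumption for illustration, not a figure from these slides, and the function name is invented for the example.

```python
def levinthal_search_years(n_configurations, rate_per_second=10**15):
    """Years needed to enumerate every configuration at a fixed rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return n_configurations / rate_per_second / seconds_per_year

# 100 residues, 2 backbone angles each, 3 states per angle (Levinthal's figures)
configurations = 3 ** 200
print(f"{float(configurations):.2e} backbone configurations")
print(f"{levinthal_search_years(configurations):.1e} years to enumerate")
```

Even at that assumed rate the search time exceeds the age of the universe by dozens of orders of magnitude, so exhaustive enumeration is not an option and prediction methods have to rely on known structures instead.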

46
Secondary Structure

Protein secondary structure takes one of three
forms
Alpha helix
Beta pleated sheet
Turn
2ndary structure is predicted within a small
window
Many different algorithms, not highly accurate
Better predictions from a multiple alignment

47
Structure Prediction on the Web

Secondary Structural Content Prediction (SSCP):
EMBL, Heidelberg
http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
BCM Search Launcher, Protein Secondary Structure
Prediction: Baylor College of Medicine
http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
PREDATOR: EMBL, Heidelberg
http://www.embl-heidelberg.de/cgi/predator_serv.pl

48
Sample 2-D Structure Prediction
49
Threading Protein Structures

Best bet is to compare with similar sequences
that have known structures >> Threading
Only works for proteins with >25% sequence
similarity to a protein with known structure
Some websites offer quick approximations
Will improve as more 3-D structures are described

50
Websites for 3-D structure prediction

UCLA-DOE Protein Fold Recognition
http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
SwissModel: ExPASy, Univ. of Geneva
http://www.expasy.ch/swissmod/SWISS-MODEL.html
CPHmodels: Technical Univ. of Denmark
http://www.cbs.dtu.dk/services/CPHmodels/

51
View Known Protein Structures

GenBank includes a database of protein 3-D
structures and a free viewer Cn3D
GenBank database is derived from PDB (Protein
Data Bank)
primary repository of protein structure data
determined by X-ray crystallography and/or NMR
has its own data format and many free viewers
some are very sophisticated - can calculate
intermolecular distances

52
Cn3D

Cn3D is a helper application that allows you to
view three dimensional structures from NCBI's
Entrez database.
Cn3D runs on Windows, MacOS, and Unix/Linux.
Cn3D simultaneously displays structure, sequence,
and alignment. It also allows the user to set
display styles for features of interest.
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml

By being tightly coupled to the genomic and
literature databases, Cn3D is the ideal program
for viewing structures found in the NCBI
databases.
54
RasMol

RasMol is the simplest PDB viewer.
http://www.umass.edu/microbio/rasmol/
It can work together with a web browser to let
you view the structure of any sequence found with
Entrez that has a known 3-D structure.

55
Swiss PDB Viewer
56
Summary

Restriction sites are trivial to compute, but
very useful
Genomic DNA has genes and other information =>
transcription factors
Proteins have predictable 2ndary structures and
functional domains, but we generally can't predict
new 3-D structures
Can visualize and compare known structures
