Title: Patterns in Biological Sequences
1Patterns in Biological Sequences
- Stuart M. Brown
- New York University School of Medicine
2Overview
- DNA Structure and Function
- Regulatory Sites in DNA
- Finding Genes in DNA Sequences
- RNA Structure
- Protein Structure and Function
- Protein Motifs
3Sequence Analysis on the Web
- Can analyze sequence using a Unix mainframe, or with free tools on the Web
- Web tools are often best
- Available to everyone
- Constantly upgraded
- not always available and subject to random change
- Local Unix tools are better if you need to do big jobs (lots of sequences - scripts & pipelines)
4DNA Structure
- Primary: the sequence itself
- Secondary: double helix
- Tertiary: supercoiled, bent, etc.
- Quaternary: complexes with proteins
- Histones
- RNA Polymerase
- DNA binding proteins (transcription factors)
- Chromosome structure
- centromeres & telomeres
5Phage Cro repressor bound to DNA
(Image: Andrew Coulson & Roger Sayle with RasMol, Univ. of Edinburgh, 1993)
6DNA Information Content
- Just a 4 letter alphabet (GATC)
- Encodes proteins with 3 letter codons
- Punctuation determines transcription starts and stops
- Transcriptional regulation (promoters, enhancers, etc.)
- Determines its own replication
7Simple DNA Patterns
- Restriction enzyme cut sites
- 4, 6, or 8 bases long, inverted repeat
- Repeats (direct or inverted)
- Promoters (universal site recognized by RNA polymerase to start transcription)
- Transcription factor binding sites (unique to one gene or a group of co-regulated genes)
- often just 8-12 bases long
- generally located upstream from transcribed part of gene
- enhancers can be located anywhere within 10,000 bp of gene
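The inverted-repeat property of restriction sites can be checked directly: a site is an inverted repeat when it reads the same as its own reverse complement. The following is a minimal sketch of that test, not the method of any tool named in these slides; the function names and the example sequence are made up for illustration.

```python
# Reverse-complement palindromes, e.g. the EcoRI site GAATTC.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, lengths=(4, 6, 8)):
    """Yield (start, site) for every window equal to its own reverse complement.

    start is a 0-based index into seq; lengths follows the 4/6/8 base
    site sizes mentioned on the slide.
    """
    seq = seq.upper()
    for k in lengths:
        for i in range(len(seq) - k + 1):
            site = seq[i:i + k]
            if site == reverse_complement(site):
                yield i, site


if __name__ == "__main__":
    # Hypothetical test sequence containing an EcoRI and a BamHI site.
    for start, site in find_palindromes("TTGAATTCAAGGATCCTT"):
        print(start, site)
```

A real restriction map would compare against a table of known enzyme sites rather than report every palindromic window.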
8DNA Regulatory Sequences
- Databases of promoters, enhancers, etc.
- (DNA patterns)
- TransFac: the Transcription Factor database
- 4504 entries from 1078 eukaryotic genes
- maintained by GBF (Germany)
- http://transfac.gbf.de/TRANSFAC/
- The Eukaryotic Promoter Database (EPD)
- Bucher & Trifonov (1986) NAR 14: 10009-26
- 1314 entries taken directly from scientific literature
- maintained by ISREC (Switzerland)
- http://www.epd.isb-sib.ch/
9Tools to find patterns in DNA
- Signal Scan, Promoter Scan - Mac, Windows, Unix
- (Dr. Dan S. Prestridge, Univ. of Minnesota)
- EMBOSS tools - Unix
- tfscan: scans DNA sequences for transcription factors
- fuzznuc: nucleic acid pattern search
- fuzzpro: protein pattern search
- fuzztran: translate DNA -> protein, search for protein patterns
- restrict: finds restriction enzyme cleavage sites
- repeats (G. Benson) - tandem repeats
- palindrome - inverted repeats
- REPuter (whole genome repeat search) - Unix
10TF Binding sites lack information
DE  IFI-6-16 (interferon-induced gene 6-16); G000176.
SQ  gGGAAAaTGAAACT
SF  -127
ST  -89
BF  T00428; ISGF-3; Quality: 6; Species: human, Homo sapiens.
- Most TF binding sites are determined by just a few base pairs (typically 8-12)
- This is not enough information for proteins to locate unique promoters for each gene in a 3 billion base genome
- TFs bind cooperatively and combinatorially
- The key is in the location in relation to each other and to the transcription units of genes
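The "not enough information" point can be made concrete with a back-of-envelope calculation. The sketch below assumes a random genome with four equally frequent, independent bases and counts both strands; real genomes are not random, so the numbers are only an order-of-magnitude guide. The function name is illustrative.

```python
def expected_random_matches(site_length, genome_size=3_000_000_000, strands=2):
    """Expected chance occurrences of one exact site in a random genome.

    Assumes the four bases are equally frequent and independent,
    which real genomes only approximate.
    """
    positions = genome_size - site_length + 1
    return strands * positions / 4 ** site_length


if __name__ == "__main__":
    for k in (8, 10, 12):
        print(k, round(expected_random_matches(k)))
```

Under these assumptions an exact 8-base site is expected by chance roughly 90,000 times in a 3 billion base genome, and even a 12-base site a few hundred times, which is why a single site cannot identify a promoter on its own.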
11Websites for Promoter finding
- Promoter Scan: NIH Bioinformatics (BIMAS)
- http://bimas.dcrt.nih.gov/molbio/proscan/
- Promoter Scan II: Univ. of Minnesota & Axyx Pharmaceuticals
- http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
- Signal Scan: NIH Bioinformatics (BIMAS)
- http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
- Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania
- http://www.cbil.upenn.edu/tess/
- Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP
- http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
- TargetFinder: Telethon Inst. of Genetics and Medicine, Milan, Italy
- http://hercules.tigem.it/TargetFinder.html
12Finding Genes in Genomic DNA
- Translate (in all 6 reading frames) and look for similarity to known protein sequences
- Translate and look for long Open Reading Frames (ORFs) between start and stop codons
- Look for known gene markers
- TATA box, intron splice sites, etc.
- Statistical methods (codon preference)
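The "translate in all 6 reading frames and look for ORFs" step can be sketched in a few lines. This is an illustrative sketch using the standard genetic code, not the algorithm of ORFfinder or any other tool listed in this talk; the function names and the `min_codons` threshold are assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (frame_label, protein) for the 3 forward and 3 reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield f"{strand}{offset + 1}", translate(seq[offset:])


def find_orfs(dna, min_codons=3):
    """Return (frame, peptide) for each ATG...stop stretch of at least min_codons residues."""
    orfs = []
    for frame, protein in six_frames(dna):
        for segment in protein.split("*")[:-1]:  # the last piece has no stop codon
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_codons:
                orfs.append((frame, segment[start:]))
    return orfs


if __name__ == "__main__":
    print(find_orfs("ATGGCTAAATGA"))
```

For genomic DNA `min_codons` would be set much higher (for example 100), since short ORFs occur frequently by chance.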
13Gene Finding on the Web
- ORFfinder: NCBI
- http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
- http://compbio.ornl.gov/grailexp
- DNA translation: Univ. of Minnesota Med. School
- http://alces.med.umn.edu/webtrans.html
- GenLang
- http://cbil.humgen.upenn.edu/sdong/genlang.html
- BCM GeneFinder: Baylor College of Medicine, Houston, TX
- http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
- http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
14Genomic Sequence
- Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence
- This is where the transcription factor binding sites are located
- Search for known TF sites, and discover new ones (among co-regulated genes)
15Intron/Exon structure
- Gene finding programs work well in bacteria
- None of these gene prediction programs does an adequate job predicting intron/exon boundaries
- The only reasonable gene models are based on alignment of cDNAs to genome sequence
- Perhaps 50% of all human genes still do not have a correct coding sequence defined
16RNA Structure
- Similar to DNA - base pairing
- Smaller molecules, free to take on more complex shapes
- tRNA, ribozymes, self-splicing introns
17tRNA Structures
18RNA Information Content
- Primary structure (sequence) contains
- Information for 3-D self-assembly
- Genetic code for amino acids in protein
- Translation start and stop signals
- Intron splicing signals
- Controls for RNA stability and transcription level
19RNA Secondary Structure
- Rules for base pairing and free energy minimization are known
- Characteristic tRNA stem-loop structures
- Michael Zuker created the computer program FoldRNA
- UNIX/Mac/PC freeware, in commercial products, and on the Web
- Can predict many RNA secondary structures, not necessarily the optimal or true structure
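To show how base-pairing rules turn into a prediction, here is a sketch of the Nussinov dynamic-programming algorithm, which simply maximizes the number of nested base pairs. It is a deliberately simpler stand-in for free energy minimization: it is not Zuker's method and uses no thermodynamic parameters. The `min_loop` value and the inclusion of G-U wobble pairs are assumptions of this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov dynamic programming).

    min_loop is the minimum number of unpaired bases enclosed by a hairpin.
    """
    n = len(rna)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):        # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
```

Energy-based programs replace the "+1 per pair" score with stacking and loop energies, which is why they can rank alternative structures rather than just count pairs.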
20Protein Sequence Analysis
- Molecular properties (pH, mol. wt., isoelectric point, hydrophobicity)
- Secondary Structure
- Super-secondary (signal peptide, coiled-coil, trans-membrane, etc.)
- 3-D prediction, Threading
- Domains, motifs, etc.
21Self-assembly
- Proteins self-assemble in solution
- All of the information necessary to determine the complex 3-D structure is in the amino acid sequence
- Structure determines function
- lock & key model of enzyme function
- Know the sequence, know the function?
- Nearly infinite complexity
22Structure prediction
- Protein structure prediction is the Holy Grail of bioinformatics
- Since structure = function, structure prediction should allow protein design, design of inhibitors, etc.
- Huge amounts of genome data - what are the functions of all of these proteins?
23Chemical Properties of Proteins
- Proteins are linear polymers of 20 amino acids
- Chemical properties of the protein are determined by its amino acids
- Molecular wt., pH, isoelectric point are simple calculations from amino acid composition
- Hydrophobicity is a property of groups of amino acids - best examined as a graph
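As an example of a "simple calculation from amino acid composition", the sketch below estimates molecular weight by summing average residue masses and adding one water for the free termini. The mass table holds standard average residue masses rounded to two decimals; the function names are illustrative, and modifications such as disulfide bonds or glycosylation are ignored.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water), rounded.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water is added back for the N- and C-terminal ends


def composition(protein):
    """Count how many times each amino acid occurs."""
    return Counter(protein.upper())


def molecular_weight(protein):
    """Approximate molecular weight in daltons from composition alone."""
    counts = composition(protein)
    return sum(RESIDUE_MASS[aa] * n for aa, n in counts.items()) + WATER


if __name__ == "__main__":
    print(round(molecular_weight("GG"), 2))
```

Because only the composition matters, any rearrangement of the same residues gives the same result; hydrophobicity, by contrast, depends on where the residues sit along the chain.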
24Hydrophobicity Plot
[Figure: Kyte-Doolittle hydrophilicity plot, window = 19, of P53_HUMAN (P04637), human cellular tumor antigen p53]
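A plot like this comes from a sliding-window average over a per-residue scale. The sketch below uses the published Kyte-Doolittle hydropathy values, where positive numbers are hydrophobic; a hydrophilicity plot is the same curve with the sign flipped. The function name and the choice to report each average at the window centre are assumptions of this sketch.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Return (centre_position, mean_hydropathy) for every full window.

    Positions are 1-based; sequences shorter than the window give an
    empty list.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    half = window // 2
    profile = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
    demo = "KKDDEE" + "LIVLLAVIALLVIVLAAIL" + "RRKDEE"
    for position, mean in hydropathy_profile(demo):
        print(position, round(mean, 2))
```

A window of 19 is roughly the length of a membrane-spanning helix, so sustained hydrophobic peaks at that window size suggest candidate transmembrane segments.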
25Web Sites for Simple Protein Analysis
- Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science, Israel
- http://bioinformatics.weizmann.ac.il/hydroph/
- SAPS - statistical analysis of protein sequences: composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity
- http://www.isrec.isb-sib.ch/software/SAPS_form.html
26Secondary Structure
- Protein secondary structure takes one of three forms
- Alpha helix
- Beta pleated sheet
- Turn
- 2ndary structure is predicted within a small window
- Many different algorithms, not highly accurate
- Better predictions from a multiple alignment
27Structure Prediction on the Web
- Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg
- http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
- BCM Search Launcher: Protein Secondary Structure Prediction, Baylor College of Medicine
- http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
- PREDATOR: EMBL, Heidelberg
- http://www.embl-heidelberg.de/cgi/predator_serv.pl
- UCLA-DOE Protein Fold Recognition Server
- http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
28Sample Structure Prediction
29Super-secondary Structure
- Common structural motifs
- Membrane spanning (GCG TransMem)
- Signal peptide (GCG SPScan)
- Coiled coil (GCG CoilScan)
- Helix-turn-helix (GCG HTHScan)
30Web servers that predict these structures
- Predict Protein server: EMBL Heidelberg
- http://www.embl-heidelberg.de/predictprotein/
- SOSUI: Tokyo Univ. of Ag. Tech., Japan
- http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
- TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research)
- http://www.isrec.isb-sib.ch/software/TMPRED_form.html
- COILS (coiled coil prediction): ISREC
- http://www.isrec.isb-sib.ch/software/COILS_form.html
- SignalP (signal peptides): Tech. Univ. of Denmark
- http://www.cbs.dtu.dk/services/SignalP/
31 3-D Structure
- Cannot be accurately predicted from sequence alone (known as ab initio)
- Levinthal's paradox: a 100 aa protein has 3^200 possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers
- There are perhaps only a few hundred basic structures, but we don't yet have this vocabulary or the ability to recognize variants on a theme
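The scale of Levinthal's number is easy to check. The sketch below assumes 3 states for each of 2 backbone angles per residue, which gives the 3^200 figure for 100 residues, and an arbitrary, generous rate of 10^15 conformations examined per second; that rate is purely an assumption for illustration.

```python
def levinthal_estimate(residues=100, states_per_angle=3, angles_per_residue=2,
                       conformations_per_second=1e15):
    """Return (number_of_conformations, years_for_exhaustive_search).

    conformations_per_second is an assumed, deliberately optimistic
    evaluation rate, not a measured figure.
    """
    conformations = states_per_angle ** (angles_per_residue * residues)
    seconds = conformations / conformations_per_second
    years = seconds / (3600 * 24 * 365)
    return conformations, years


if __name__ == "__main__":
    conformations, years = levinthal_estimate()
    print(f"{conformations:.2e} conformations, {years:.1e} years")
```

3^200 is about 2.7 x 10^95, so even at this rate an exhaustive search would take vastly longer than the age of the universe; real proteins fold in seconds or less, which is the paradox.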
32Threading Protein Structures
- Best bet is to compare with similar sequences that have known structures: threading
- Only works for proteins with >25% sequence similarity to a protein with known structure
- Current state of the art requires many days of computing on a dedicated workstation
- Some websites offer quick approximations
- Will improve as more 3-D structures are described
- Another aspect of the Genome Project
33Predicted Structure
34Protein Data Bank
- There is a database of all known protein structures called the PDB (Protein Data Bank).
- These have been determined by X-ray crystallography and/or NMR.
- Anyone can download and view these structures with a PDB viewer program.
35RasMol
- RasMol is the simplest PDB viewer.
- http://www.umass.edu/microbio/rasmol/
- It can work together with a web browser to let
you view the structure of any sequence found with
Entrez that has a known 3-D structure.
36Websites for 3-D structure prediction
- UCLA-DOE Protein Fold Recognition
- http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
- SwissModel: ExPASy, Univ. of Geneva
- http://www.expasy.ch/swissmod/SWISS-MODEL.html
- CPHmodels: Technical Univ. of Denmark
- http://www.cbs.dtu.dk/services/CPHmodels/
37Searching for Patterns in Proteins
38Protein Domains/Motifs
- Proteins are built out of functional units known as domains (or motifs)
- These domains have conserved sequences
- Often much more similar than their respective proteins
- Exon shuffling theory (W. Gilbert)
- Exons correspond to folding domains which in turn serve as functional units
- Unrelated proteins may share a single similar exon (e.g., ATPase or DNA binding function)
39Protein Motif Databases
- Known protein motifs have been collected in databases
- Best database is PROSITE
- The Dictionary of Protein Sites and Patterns
- maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
- contains a comprehensive list of documented protein domains constructed by expert molecular biologists.
40PROSITE is based on Patterns
- Each domain is defined by a simple pattern
- Patterns can have alternate amino acids in each position and defined spaces, but no gaps
- Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
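Because PROSITE patterns are exact-match patterns, they map almost one-to-one onto regular expressions. The sketch below converts the common parts of the PROSITE syntax and scans a protein with Python's `re` module; it is not the code used by ScanProsite or any other tool, the function names are illustrative, and rarer syntax (such as an anchor inside square brackets) is not handled. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site; the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' to a regex string."""
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(3) = exactly 3 residues
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


def scan(pattern, protein):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = prosite_to_regex(pattern)
    lookahead = re.compile("(?=(" + regex + "))")        # lookahead keeps overlapping hits
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(scan("N-{P}-[ST]-{P}", "MNGSAANPTQ"))
```

The second asparagine in the example is followed by a proline, so it is rejected: exact matching is strict in just the way the slide describes.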
41Tools for PROSITE searches
- Free Mac program: MacPattern
- ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
- Free PC program (DOS): PATMAT
- ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
- GCG provides the program MOTIFS
- Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.
42Websites for PROSITE Searches
- ScanProsite at ExPASy: Univ. of Geneva
- http://expasy.hcuge.ch/sprot/scnpsit1.html
- Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France
- http://pbil.ibcp.fr/NPSA/npsa_prosite.html
- PPSRCH: EBI, Cambridge, UK
- http://www2.ebi.ac.uk/ppsearch/
43Profiles
- Profiles are tables of amino acid frequencies at each position in a motif
- They are built from multiple alignments
- PROSITE entries also contain profiles built from an alignment of proteins that match the pattern
- Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
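A profile can be sketched as one table of scores per alignment column. The code below builds log-odds scores from a small gap-free alignment and slides the profile along a protein. It is a simplified illustration: the uniform background, the pseudocount of 1, the function names, and the toy alignment are all assumptions, and unlike real profile searches it does not allow gaps.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column.

    Scores are log2(observed frequency / uniform background frequency).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                        for aa in ALPHABET})
    return profile


def score(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_match(profile, protein):
    """Return (score, 0-based start) of the best ungapped window."""
    width = len(profile)
    return max((score(profile, protein[i:i + width]), i)
               for i in range(len(protein) - width + 1))


if __name__ == "__main__":
    toy_alignment = ["GKSG", "GKTG", "GKSA"]   # hypothetical motif instances
    print(best_match(build_profile(toy_alignment), "MAAGKSGLL"))
```

Unlike an exact pattern, a variant never seen in the alignment still gets a score, just a lower one, which is where the extra sensitivity comes from.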
44Websites for Profile searching
- PROSITE ProfileScan: ExPASy, Geneva
- http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
- BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- http://www.blocks.fhcrc.org/blocks_search.html
- PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases)
- http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi
45More Protein Motif Databases
- PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis
- http://pfam.wustl.edu/hmmsearch.shtml
- ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France
- http://www.toulouse.inra.fr/prodom/doc/blast_form.html
- This is my favorite protein database - nicely colored results
46Sample ProDom Output
47Hidden Markov Models
- Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.
- In addition to amino acid frequencies at each position, they model the probability of moving from one position to the next, including insertions and deletions.
- Pfam is built with HMMs
- HMMER software - free for UNIX
48Discovery of new Motifs
- All of the tools discussed so far rely on a database of existing domains/motifs
- How to discover new motifs
- Start with a set of related proteins
- Make a multiple alignment
- Build a pattern or profile
- You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.
49Patterns in Unaligned Sequences
- Sometimes sequences may share just a small common region
- common signal peptide
- new transcription factors
- MEME: San Diego Supercomputer Center
- http://www.sdsc.edu/MEME/meme/website/meme.html
- GCG also includes the MEME program
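The idea of finding a small common region without an alignment can be illustrated by counting exact words shared across sequences. This sketch is a much cruder technique than MEME, which fits a statistical motif model by expectation maximization and tolerates mismatches; the function name, the word length `k`, and the example sequences are assumptions made for illustration.

```python
from collections import defaultdict


def shared_words(sequences, k=6, min_fraction=1.0):
    """Return k-mers found in at least min_fraction of the sequences.

    Words are sorted by the number of sequences containing them
    (most widespread first), then alphabetically.
    """
    seen_in = defaultdict(set)
    for index, seq in enumerate(sequences):
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seen_in[seq[i:i + k]].add(index)
    needed = min_fraction * len(sequences)
    hits = [(len(ids), word) for word, ids in seen_in.items() if len(ids) >= needed]
    return [word for count, word in sorted(hits, key=lambda hit: (-hit[0], hit[1]))]


if __name__ == "__main__":
    # Hypothetical upstream regions that all contain the word TGACTCA.
    upstream = ["AAATGACTCATTT", "CCTGACTCAGG", "GTTGACTCAAC"]
    print(shared_words(upstream, k=7))
```

Exact word counting misses motifs whose instances differ by even one base, which is the gap that probabilistic tools such as MEME are designed to fill.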
50Summary
- DNA has genes and other information
- Transcription factors
- RNA has predictable structures
- Proteins have predictable 2ndary structures and functional domains, but we generally can't predict new 3-D structures