Title: Sequence-Function Relationships
1Sequence-Function Relationships
- Stuart M. Brown
- New York University School of Medicine
2Overview
- DNA Structure and Function
- Regulatory Sites in DNA
- Finding Genes in DNA Sequences
- RNA Structure
- Protein Structure and Function
- Protein Motifs
3Sequence Analysis on the Web
- Can analyze sequence using a mainframe (GCG), on a Mac/PC (MacVector, OMIGA, LaserGene, etc.) or with free tools on the Web
- Web tools are often best
- Available to everyone
- Constantly upgraded
- But not always available and subject to random change
4DNA Structure
- Primary: the sequence itself
- Secondary: double helix
- Tertiary: supercoiled, bent, etc.
- Quaternary: complexes with proteins
- Histones
- RNA Polymerase
- DNA binding proteins (transcription factors)
- Chromosome structure
- centromeres & telomeres
5Phage CRO repressor bound to DNA
- Andrew Coulson & Roger Sayle with RasMol, Univ. of Edinburgh, 1993
6DNA Information Content
- Just a 4 letter alphabet (GATC)
- Encodes proteins with 3 letter codons
- Punctuation determines transcription starts and stops
- Transcriptional regulation (promoters, enhancers, etc.)
- Determines its own replication
7Many DNA Regulatory Sequences are Known
- Databases of promoters, enhancers, etc.
- TransFac: the Transcription Factor database
- 4342 entries w/ known protein binding and transcriptional regulatory functions
- Maintained by Gesellschaft für Biotechnologische Forschung mbH (Braunschweig, Germany)
- The Eukaryotic Promoter Database (EPD)
- Bucher & Trifonov (1986) NAR 14: 10009-26
- 1314 entries taken directly from scientific literature
- Maintained by ISREC (Lausanne, Switzerland) as a subset of the EMBL
8Tools to find TF sites in DNA
- GCG FINDPATTERNS with TFSITES.DAT
- Macintosh (Signal Scan), PC/UNIX (Promoter Scan)
- Dr. Dan S. Prestridge, Univ. of Minnesota
9TF Binding sites lack information
- Most TF binding sites are determined by just a few base pairs (typically 6)
- This is not enough information for proteins to locate unique promoters for each gene
- TFs bind cooperatively and combinatorially
- The key is the location of the sites in relation to each other and to the transcription units of genes
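A rough sketch of the arithmetic behind this point: how often a fixed 6 bp site is expected to turn up by chance. The 3 billion bp genome length and the equal, independent base frequencies are assumptions made for illustration; they are not figures from the slides.

```python
def expected_random_hits(site_length, genome_length, both_strands=True):
    """Expected number of chance matches to one fixed site of site_length bp.

    Assumes every base is independent and equally likely (A, C, G, T = 0.25),
    which real genomes only approximate.
    """
    p_match = 0.25 ** site_length              # chance of a match at one position
    positions = genome_length - site_length + 1
    strands = 2 if both_strands else 1         # a site can sit on either strand
    return p_match * positions * strands


# Assumed genome length of 3 billion bp (hypothetical round number).
hits = expected_random_hits(6, 3_000_000_000)
print(f"A 6 bp site is expected about {hits:,.0f} times by chance")
```

With these assumptions a 6 bp site occurs roughly once every 4 kb on each strand, which is why a single site cannot mark a unique promoter.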
10Websites for Promoter finding
- Promoter Scan: NIH Bioinformatics (BIMAS)
- http://bimas.dcrt.nih.gov/molbio/proscan/
- Promoter Scan II: Univ. of Minnesota & Axyx Pharmaceuticals
- http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
- Signal Scan: NIH Bioinformatics (BIMAS)
- http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
- Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania
- http://www.cbil.upenn.edu/tess/
- Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP
- http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
- TargetFinder: Telethon Inst. of Genetics and Medicine, Milan, Italy
- http://hercules.tigem.it/TargetFinder.html
11Finding Genes in Genomic DNA
- Translate (in all 6 reading frames) and look for similarity to known protein sequences
- Translate and look for long Open Reading Frames (ORFs) between start and stop codons
- Look for known gene markers
- TATA box, intron splice sites, etc.
- Statistical methods (codon preference)
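The ORF approach above can be sketched as a scan of all six reading frames for an ATG followed in frame by a stop codon. This is a minimal illustration, not the method used by any of the tools named in these slides; the function names and the `min_codons` cutoff are hypothetical, and it ignores introns, alternative start codons, and ambiguous bases.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """List ORFs (ATG ... stop) in all six reading frames.

    Each hit is (strand, frame, start, end, n_codons). Coordinates are
    0-based on the strand that was scanned, end is exclusive and includes
    the stop codon, and n_codons counts the ATG but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return orfs


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

Long ORFs found this way are only candidates; in genomes with introns the coding sequence is split across exons, so a simple scan like this is mainly useful for bacterial sequence or cDNA.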
12Gene Finding on the Web
- GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
- http://compbio.ornl.gov/grailexp
- ORFfinder: NCBI
- http://www.ncbi.nlm.nih.gov/gorf/gorf.html
- DNA translation: Univ. of Minnesota Med. School
- http://alces.med.umn.edu/webtrans.html
- GenLang
- http://cbil.humgen.upenn.edu/sdong/genlang.html
- BCM GeneFinder: Baylor College of Medicine, Houston, TX
- http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
- http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
13Genomic Sequence
- Once each gene is located on the chromosome, it becomes possible to get upstream genomic sequence
- This is where the transcription factor binding sites are located
- Search for known TF sites, and discover new ones (among co-regulated genes)
14Intron/Exon structure
- Gene finding programs work well in bacteria
- None of these gene prediction programs does an adequate job of predicting intron/exon boundaries
- The only reasonable gene models are based on alignment of cDNAs to genome sequence
- Perhaps 50% of all human genes still do not have a correct coding sequence defined
15RNA Structure
- Similar to DNA - base pairing
- Smaller molecules, free to take on more complex shapes
- tRNA, ribozymes, self-splicing introns
16tRNA Structures
17RNA Information Content
- Primary structure (sequence) contains
- Information for 3-D self-assembly
- Genetic code for amino acids in protein
- Translation start and stop signals
- Intron splicing signals
- Controls for RNA stability and transcription level
18RNA Secondary Structure
- Rules for base pairing and free energy minimization are known
- Characteristic tRNA stem-loop structures
- Michael Zuker created the computer program FoldRNA
- GCG, UNIX/Mac/PC freeware, in commercial products, and on the Web
- Can predict many RNA secondary structures, not necessarily the optimal or true structure
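To give a feel for how folding by dynamic programming works, here is a sketch of the simpler Nussinov-style approach, which maximizes the number of nested base pairs. It is deliberately not Zuker's free energy minimization: there are no stacking or loop energies, and the pairing rules (Watson-Crick plus G-U) and the `min_loop` hairpin limit are assumptions chosen for illustration.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    best[i][j] holds the best pair count for the subsequence rna[i..j].
    A pair (k, j) is allowed only if at least min_loop bases lie between
    k and j, which keeps hairpin loops from being impossibly tight.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j is unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with k
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

The same table-filling idea underlies energy-based folding, where the count of pairs is replaced by experimentally derived energy terms and the score is minimized instead of maximized.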
19Protein Sequence Analysis
- Molecular properties (pH, mol. wt., isoelectric point, hydrophobicity)
- Secondary Structure
- Super-secondary (signal peptide, coiled-coil, trans-membrane, etc.)
- 3-D prediction, Threading
- Domains, motifs, etc.
20Self-assembly
- Proteins self-assemble in solution
- All of the information necessary to determine the complex 3-D structure is in the amino acid sequence
- Structure determines function
- lock & key model of enzyme function
- Know the sequence, know the function?
- Nearly infinite complexity
21Structure prediction
- Protein structure prediction is the Holy Grail of bioinformatics
- Since structure = function, structure prediction should allow protein design, design of inhibitors, etc.
- Huge amounts of genome data - what are the functions of all of these proteins?
22Chemical Properties of Proteins
- Proteins are linear polymers of 20 amino acids
- Chemical properties of the protein are determined by its amino acids
- Molecular wt., pH, isoelectric point are simple calculations from amino acid composition
- Hydrophobicity is a property of groups of amino acids - best examined as a graph
23Hydrophobicity Plot
P53_HUMAN (P04637) human cellular tumor antigen p53: Kyte-Doolittle hydrophilicity, window = 19
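A plot like this one comes from a sliding-window average over per-residue scores. The sketch below uses the published Kyte-Doolittle hydropathy values (positive means hydrophobic) and the window of 19 named in the caption; the function name is hypothetical, and the input is assumed to contain only the 20 standard amino acid letters.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of every full window along the protein.

    Returns one value per window position (len(protein) - window + 1
    values), or an empty list if the protein is shorter than the window.
    Raises KeyError on non-standard residue letters.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]   # slide the window by one
        profile.append(total / window)
    return profile


# Toy peptide: a hydrophobic stretch flanked by charged residues.
peptide = "KKDE" + "LLIVALLVAILLVAGLIVL" + "RRDE"
print([round(x, 2) for x in hydropathy_profile(peptide)])
```

A wide window such as 19 is the usual choice when looking for membrane-spanning segments, since it is close to the number of residues needed to cross a lipid bilayer as a helix; shorter windows pick out surface-exposed regions instead.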
24Web Sites for Simple Protein Analysis
- Protein Hydrophobicity Server: Bioinformatics Unit, Weizmann Institute of Science, Israel
- http://bioinformatics.weizmann.ac.il/hydroph/
- SAPS - statistical analysis of protein sequences: composition, charge, hydrophobic and transmembrane segments, cysteine spacings, repeats and periodicity
- http://www.isrec.isb-sib.ch/software/SAPS_form.html
25Secondary Structure
- Protein secondary structure takes one of three forms
- Alpha helix
- Beta pleated sheet
- Turn
- 2ndary structure is predicted within a small window
- Many different algorithms, not highly accurate
- Better predictions from a multiple alignment
26GCG Protein Analysis Toolkit
- Isoelectric: plots aa charge as a function of pH
- PeptideStructure: secondary structure predictions
- PlotStructure: plots protein secondary structure
- PepPlot: plots protein secondary structure and hydrophobicity in parallel panels
- Moment: makes a contour plot of the helical hydrophobic moment
- HelicalWheel: plots a peptide sequence as a helical wheel to help you recognize alpha-helical regions.
27Structure Prediction on the Web
- Secondary Structural Content Prediction (SSCP): EMBL, Heidelberg
- http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
- BCM Search Launcher Protein Secondary Structure Prediction: Baylor College of Medicine
- http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
- PREDATOR: EMBL, Heidelberg
- http://www.embl-heidelberg.de/cgi/predator_serv.pl
- UCLA-DOE Protein Fold Recognition Server
- http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
28Sample Structure Prediction
29Super-secondary Structure
- Common structural motifs
- Membrane spanning (GCG TransMem)
- Signal peptide (GCG SPScan)
- Coiled coil (GCG CoilScan)
- Helix-turn-helix (GCG HTHScan)
30Web servers that predict these structures
- Predict Protein server: EMBL Heidelberg
- http://www.embl-heidelberg.de/predictprotein/
- SOSUI: Tokyo Univ. of Ag. Tech., Japan
- http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
- TMpred (transmembrane prediction): ISREC (Swiss Institute for Experimental Cancer Research)
- http://www.isrec.isb-sib.ch/software/TMPRED_form.html
- COILS (coiled coil prediction): ISREC
- http://www.isrec.isb-sib.ch/software/COILS_form.html
- SignalP (signal peptides): Tech. Univ. of Denmark
- http://www.cbs.dtu.dk/services/SignalP/
313-D Structure
- Cannot be accurately predicted from sequence alone (known as ab initio)
- Levinthal's paradox: a 100 aa protein has 3^200 possible backbone configurations - many orders of magnitude beyond the capacity of the fastest computers
- There are perhaps only a few hundred basic structures, but we don't yet have this vocabulary or the ability to recognize variants on a theme
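The scale of Levinthal's estimate is easy to check. The sketch takes the count as 3^200 (three states for each of roughly 200 backbone dihedral angles in a 100-residue protein); the sampling rate of 10^15 conformations per second is a purely hypothetical figure chosen to be generous.

```python
conformations = 3 ** 200            # Levinthal-style estimate for 100 residues
samples_per_second = 1e15           # assumed, deliberately optimistic rate
seconds_per_year = 60 * 60 * 24 * 365

years_needed = conformations / samples_per_second / seconds_per_year

print(f"conformations: about 10^{len(str(conformations)) - 1}")
print(f"years to enumerate them all: {years_needed:.1e}")
```

Even at that rate, exhaustive enumeration would take vastly longer than the age of the universe, so real proteins (and practical prediction methods) cannot be searching conformations at random.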
32Threading Protein Structures
- Best bet is to compare with similar sequences that have known structures >> Threading
- Only works for proteins with >25% sequence similarity to a protein with known structure
- Current state of the art requires many days of computing on a dedicated workstation
- Some websites offer quick approximations
- Will improve as more 3-D structures are described
- Another aspect of the Genome Project
33Predicted Structure
34Protein Data Bank
- There is a database of all known protein structures called the PDB.
- These have been determined by X-ray crystallography and/or NMR.
- Anyone can download and view these structures with a PDB viewer program.
35RasMol
- RasMol is the simplest PDB viewer.
- http://www.umass.edu/microbio/rasmol/
- It can work together with a web browser to let
you view the structure of any sequence found with
Entrez that has a known 3-D structure.
36Websites for 3-D structure prediction
- UCLA-DOE Protein Fold Recognition
- http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
- SwissModel: ExPASy, Univ. of Geneva
- http://www.expasy.ch/swissmod/SWISS-MODEL.html
- CPHmodels: Technical Univ. of Denmark
- http://www.cbs.dtu.dk/services/CPHmodels/
37Searching for Patterns in Proteins
38Protein Domains/Motifs
- Proteins are built out of functional units known as domains (or motifs)
- These domains have conserved sequences
- Often much more similar than their respective proteins
- Exon splicing theory (W. Gilbert)
- Exons correspond to folding domains which in turn serve as functional units
- Unrelated proteins may share a single similar exon (e.g., ATPase or DNA binding function)
39Protein Motif Databases
- Known protein motifs have been collected in databases
- Best database is PROSITE
- The Dictionary of Protein Sites and Patterns
- maintained by Amos Bairoch, at the Univ. of Geneva, Switzerland
- contains a comprehensive list of documented protein domains constructed by expert molecular biologists.
40PROSITE is based on Patterns
- Each domain is defined by a simple pattern
- Patterns can have alternate amino acids in each position and defined spaces, but no gaps
- Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
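Because a PROSITE pattern is exact matching with alternatives and fixed-range spacers, it maps almost directly onto a regular expression. The converter below is a sketch covering the common syntax elements (single residues, `x`, `[...]` alternatives, `{...}` exclusions, `(n)` and `(n,m)` repeats, and `<` / `>` terminal anchors); the function name is hypothetical and rarer constructs, such as an anchor inside square brackets, are not handled. The example is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re

# One pattern element: optional '<', a core, optional repeat count, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_term, core, low, high, c_term = match.groups()
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any amino acid except these
        else:
            regex = core                       # single residue or [alternatives]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_term else "") + regex + ("$" if c_term else ""))
    return "".join(parts)


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
print(glyco)
print([m.start() for m in re.finditer(glyco, "MKNGSAQLNPTV")])
```

Note that `re.finditer` reports non-overlapping matches only, so closely spaced sites can be missed; and as the slide says, any variant that falls outside the pattern is not found at all.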
41Tools for PROSITE searches
- Free Mac program: MacPattern
- ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
- Free PC program (DOS): PATMAT
- ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
- GCG provides the program MOTIFS
- Also in virtually all commercial programs: MacVector, OMIGA, LaserGene, etc.
42Websites for PROSITE Searches
- ScanProsite at ExPASy: Univ. of Geneva
- http://expasy.hcuge.ch/sprot/scnpsit1.html
- Network Protein Sequence Analysis: Institut de Biologie et Chimie des Protéines, Lyon, France
- http://pbil.ibcp.fr/NPSA/npsa_prosite.html
- PPSRCH: EBI, Cambridge, UK
- http://www2.ebi.ac.uk/ppsearch/
43Profiles
- Profiles are tables of amino acid frequencies at each position in a motif
- They are built from multiple alignments
- PROSITE entries also contain profiles built from an alignment of proteins that match the pattern
- Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
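A minimal sketch of the profile idea: count residues in each alignment column, turn the counts into log-odds scores, and slide the resulting table along a query. To stay short it scores ungapped windows only, so it omits the gapped alignment step that real profile searches use; the pseudocount and the uniform background frequency of 1/20 are assumptions, and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build a position-specific log-odds table from equal-length sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)        # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, protein):
    """Return (score, start) of the best ungapped window, or None if too short."""
    width = len(profile)
    hits = []
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        # Unknown letters contribute 0.0 (neither favoured nor penalised).
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        hits.append((score, start))
    return max(hits) if hits else None


profile = build_profile(["ACDE", "ACDE", "ACDF"])
print(best_hit(profile, "GGGACDFGGG"))
```

Unlike a pattern, the table still gives a graded score to a residue never seen in the alignment, which is where the extra sensitivity comes from.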
44GCG ProfileSearch
- GCG has a set of profile analysis tools.
- Start with a multiple alignment
- Create a profile with ProfileMake
- ProfileSearch scans a database with your profile
- ProfileSegments displays alignments between a profile and matching database sequences
- ProfileGap makes pairwise alignments between a single sequence and a profile
45Websites for Profile searching
- PROSITE ProfileScan: ExPASy, Geneva
- http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
- BLOCKS (builds profiles from PROSITE entries and adds all matching sequences in SwissProt): Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- http://www.blocks.fhcrc.org/blocks_search.html
- PRINTS (profiles built from automatic alignments of OWL non-redundant protein databases)
- http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi
46More Protein Motif Databases
- PFAM (1344 protein family HMM profiles built by hand): Washington Univ., St. Louis
- http://pfam.wustl.edu/hmmsearch.shtml
- ProDom (profiles built from PSI-BLAST automatic multiple alignments of the SwissProt database): INRA, Toulouse, France
- http://www.toulouse.inra.fr/prodom/doc/blast_form.html
- This is my favorite protein database - nicely colored results
47Hidden Markov Models
- Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.
- Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next.
- Pfam is built with HMMs.
- GCG version 10.2 (released March 2001) has added a bunch of HMM tools (and Pfam).
48Sample ProDom Output
49Discovery of new Motifs
- All of the tools discussed so far rely on a database of existing domains/motifs
- How to discover new motifs
- Start with a set of related proteins
- Make a multiple alignment
- Build a pattern or profile
- You will need access to a fairly powerful UNIX computer to search databases with custom built profiles or HMMs.
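The "build a pattern" step can be sketched as follows: given an ungapped block cut from a multiple alignment, keep conserved columns as residues, keep columns with a few alternatives as bracketed sets, and collapse variable columns into `x` spacers. The function name and the `max_alternatives` threshold are hypothetical, and real pattern curation involves expert judgment that this does not capture.

```python
def alignment_to_pattern(aligned, max_alternatives=3):
    """Derive a PROSITE-style pattern from an ungapped alignment block."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    if any("-" in seq for seq in aligned):
        raise ValueError("this sketch expects an ungapped block")

    def spacer(run):
        return "x" if run == 1 else f"x({run})"

    elements = []
    wildcard_run = 0                      # length of the current variable stretch
    for col in range(length):
        residues = sorted({seq[col] for seq in aligned})
        if len(residues) > max_alternatives:
            wildcard_run += 1             # too variable: becomes part of a spacer
            continue
        if wildcard_run:
            elements.append(spacer(wildcard_run))
            wildcard_run = 0
        if len(residues) == 1:
            elements.append(residues[0])
        else:
            elements.append("[" + "".join(residues) + "]")
    if wildcard_run:
        elements.append(spacer(wildcard_run))
    return "-".join(elements)


block = ["CAKHC", "CGRHC", "CSKWC"]
print(alignment_to_pattern(block, max_alternatives=2))
```

A pattern built from only a handful of sequences tends to be too strict or too loose, so it is worth checking how many known family members and how many unrelated proteins it matches before relying on it.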
50Patterns in Unaligned Sequences
- Sometimes sequences may share just a small common region
- common signal peptide
- new transcription factors
- MEME: San Diego Supercomputing Facility
- http://www.sdsc.edu/MEME/meme/website/meme.html
- GCG also includes the MEME program
51Summary
- DNA has genes and other information
- Transcription factors
- RNA has predictable structures
- Proteins have predictable 2ndary structures and functional domains, but new 3-D structures generally can't be predicted