Sequence Analysis (II) - PowerPoint PPT Presentation

1
Sequence Analysis (II)
  • Yuh-Shan Jou
  • jou_at_ibms.sinica.edu.tw
  • Institute of Biomedical Sciences, Academia Sinica

2
Other Areas to Cover
  • Genomic Data
  • Annotation
  • Common Domains prediction WWW
  • Other Useful Genome Browsers

3
(No Transcript)
4
But first, some vocabulary...
  • YACs: Yeast Artificial Chromosomes. Yeast linear vector to
    propagate large DNA inserts (100 kb to Mb). Uses yeast
    centromere and telomeres to propagate the insert as a
    chromosome.
  • BACs: Bacterial Artificial Chromosomes. E. coli circular
    plasmid designed to carry large inserts (100-300 kb), single
    copy (reduces occurrence of chimeric clones).
  • Cosmid: E. coli circular plasmid that holds 5 to 50 kb
    inserts, multicopy.
  • Plasmids: E. coli circular vectors designed to propagate DNA
    inserts (1 to 10,000 bp). Usually have an origin of
    replication and an antibiotic resistance marker (e.g. pUC8).
  • M13: E. coli phage adapted for DNA sequencing. Can clone
    small DNA inserts in the double-stranded plasmid version and
    convert to the single-stranded version for sequencing.
5
  • cDNA: complementary DNA. DNA synthesized from RNA using
    reverse transcriptase (an RNA-dependent DNA polymerase).
  • EST: Expressed Sequence Tag. Single-pass DNA sequence run of
    a cDNA insert in a plasmid, from one end.
  • ORF: Open Reading Frame. A region of at least 100 codons that
    is uninterrupted by stop codons and thus potentially encodes
    a protein.
  • SNP: Single Nucleotide Polymorphism. A single base that
    differs among members of a population. Can be detected by
    genotyping by PCR. Responsible for much trait diversity in
    populations (physical appearance, diseases, drug response).
  • Satellite Marker: short tandem repeat (e.g.
    CACACACACACACACACAC) with length polymorphisms in a
    population (e.g. 10 CAs vs 25). Can be detected by
    genotyping. Often used for screening affected populations
    for disease genes (LOD scores).
  • STS: Sequence-Tagged Site. Short (500 bp) segment of DNA of
    known sequence mapped to a location.
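The ORF definition above (a run of at least 100 codons with no stop codon) can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it assumes the standard stop codons TAA, TAG and TGA, scans only the three forward-strand frames, and, like the slide's definition, does not require a start codon. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def find_orfs(seq, min_codons=100):
    """Return (frame, start, end) for every stop-free run of at least
    min_codons codons on the forward strand (end is exclusive)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current run.
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # A run may also reach the end of the sequence without a stop.
        if (pos - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, pos))
    return orfs

toy = "ATG" + "GCT" * 4 + "TAA" + "GC"
print(find_orfs(toy, min_codons=5))
```

With the default threshold of 100 codons the toy sequence yields no ORF; lowering `min_codons` to 5 reports the stop-free runs in each frame.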
6
Ensembl: Database and Web Browser
Erin Pleasance, Canada's Michael Smith Genome Sciences Centre,
Vancouver
7
www.ensembl.org
8
What is Ensembl?
  • Joint project of EBI and Sanger
  • Automated annotation of eukaryotic genomes
  • Open source software
  • Relational database system
  • Web interface

The main aim of this campaign is to encourage
scientists across the world - in academia,
pharmaceutical companies, and the biotechnology
and computer industries - to use this free
information.
- Dr. Mike Dexter, Director of the Wellcome Trust
9
TPMD: http://tpmd.nhri.org.tw
Nucleic Acids Research, 2005, Vol. 33, Database issue,
D174-D177
10
(No Transcript)
11
Ensembl components
Search tools
Data
Chromosomes (ChromoView, KaryoView, CytoView, MapView)
SNPs and Haplotypes (SNPView, GeneSNPView, HaploView, LDView)
Sequence Similarity (BLAST, SSAHA)
Diseases (DiseaseView)
Genome Sequence (ContigView)
Genes (GeneView, TransView, ExonView, ProtView)
Markers (MarkerView)
Functions (GOView)
Text (TextView)
Other Annotations
Families (DomainView, FamilyView)
Anything (EnsMart)
Comparative Genomics (ContigView, MultiContigView, SyntenyView,
GeneView)
12
Example 1: Exploring Caspase-3
  • Aim: to demonstrate basic browsing and views
  • Caspase-3 is a gene involved in apoptosis (cell
    suicide)
  • We will look at
      • Gene annotation
      • SNPs
      • Orthologs and genome alignments
      • Alternative transcripts and EST genes

13
Example 1: Exploring Caspase-3
http://www.ensembl.org
14
Species-specific homepage
Site map
Statistics of current release
15
Finding the tool/view: Site Map
16
Text Search
Click Back to
Species-specific homepage
Gene
caspase-3
17
GeneView
ContigView
ExportView
SNPView
ProteinView
ExonView
TransView of transcript
18
GeneView
Orthologs predicted by sequence similarity and
synteny
GeneDAS: get data from external sources
19
GeneView
On the same page, information provided for each
transcript individually
Links to external databases
20
GeneView
21
GeneSNPView
22
Other SNP/Haplotype tools
  • SNPView
  • ProteinView (protein sequence with SNP markup)
  • LDView: view linkage disequilibrium (only limited
    regions)
  • HaploView: view haplotypes (only limited regions)

23
GeneView
Click Back to
24
ContigView
Chromosome and bands
Sequence contigs
25
ContigView: Detailed View
Genscan predictions
Targeted gene predictions (2 alternative
transcripts)
Gene annotations
EST genes
Other tracks: aligned sequences, etc.
26
ContigView
27
MultiContigView
DNA sequence homology
Rat ortholog
28
Other Comparative Genomics Tools
  • We saw gene orthology and DNA homology
  • Another view is SyntenyView
  • Comparative genomics is also accessible through EnsMart

29
Data Mining with EnsMart
  • Allows very fast, cross-data-source querying
  • Search for genes (features, sequences, etc.) or
    SNPs based on
      • Position, function, domains, similarity,
        expression, etc.
  • Accessible from the Ensembl website (MartView) as
    well as stand-alone
  • Extremely powerful for data mining

30
Example 2: EnsMart
  • A new disease locus has been mapped between
    markers D21S1991 and D21S171. It may be that the
    gene involved has already been identified as
    having a role in another disease. What candidates
    are in this region?
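Conceptually, the question above is a positional filter: keep every gene on chromosome 21 that overlaps the interval bounded by the two markers. The sketch below shows that logic only. It does not call EnsMart or BioMart, and every coordinate and gene name in it is an invented placeholder, not real Ensembl data.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int

def genes_between(genes, chrom, left_marker_pos, right_marker_pos):
    """Return genes on `chrom` that overlap the interval between two markers."""
    lo, hi = sorted((left_marker_pos, right_marker_pos))
    return [g for g in genes
            if g.chrom == chrom and g.end >= lo and g.start <= hi]

# Hypothetical positions standing in for D21S1991 and D21S171.
MARKER_A_POS = 1_000_000
MARKER_B_POS = 2_000_000

# Hypothetical gene records (names and coordinates are placeholders).
toy_genes = [
    Gene("geneA", "21", 900_000, 1_050_000),    # overlaps the left edge
    Gene("geneB", "21", 1_400_000, 1_450_000),  # fully inside
    Gene("geneC", "21", 2_500_000, 2_600_000),  # outside the interval
    Gene("geneD", "20", 1_400_000, 1_450_000),  # wrong chromosome
]

candidates = genes_between(toy_genes, "21", MARKER_A_POS, MARKER_B_POS)
print([g.name for g in candidates])
```

In the real tool the same filter is expressed through the region and marker fields of the filtering page; a disease-association filter would then narrow the candidates further.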

31
Example 2: EnsMart
  • EnsMart is based on BioMart
  • http://www.ensembl.org/Multi/martview
  • OR
  • http://www.ebi.ac.uk/BioMart/martview

32
EnsMart: Choosing your dataset
33
EnsMart: Filtering
21
D21S1991
D21S171
34
EnsMart: Output
Note: you can output different types of
information, e.g. sequences
35
EnsMart: Output
36
Sequence Similarity Searching
  • Use SSAHA for exact matches (fast)
  • Use BLAST for more distant similarity (slow)
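Exact-match searching is fast because the target sequence can be indexed once by short words (k-mers) and then queried by table lookup. The sketch below illustrates that general hashing idea only; it is a simplification written for this note, not the actual SSAHA implementation, and the function names are assumptions.

```python
from collections import defaultdict

def build_kmer_index(subject, k):
    """Map every k-mer in `subject` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index

def exact_hits(query, subject, index, k):
    """Return start positions in `subject` where the whole query matches exactly.

    The first k-mer of the query is looked up in the index (the seed),
    then each seed position is verified against the full query."""
    if len(query) < k:
        return []
    hits = []
    for pos in index.get(query[:k], []):
        if subject[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits

subject = "ACGTACGTTTACGTAC"
index = build_kmer_index(subject, k=4)
print(exact_hits("ACGTAC", subject, index, k=4))
```

The index is built once per genome and reused for every query, which is where the speed comes from. Tolerating mismatches and gaps, as BLAST does, needs extension and scoring steps that this sketch leaves out.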

37
EnsEMBL BLAST
38
The ideal annotation of Gene
All clones
All SNPs
Promoter(s)
Ideal Gene
All mRNAs
All proteins
  • All protein modifications
  • Ontologies
  • Interactions (complexes, pathways, networks)
  • Expression (where and when, and how much)
  • Evolutionary relationships

All structures
Lecture 5.1
39
gene number in the human genome
  • Consortium: 30,000-40,000 (2001)
  • Celera: 27,000-38,000 (2001)
  • Consortium + Celera: 50,000 (Hogenesch et al., 2001)
  • DB searches: 65,000-75,000 (Wright et al., 2001)
  • Human Genome Sciences: 90,000-120,000 (Haseltine,
    2001)
  • Consortium Build 34: 35,000-40,000 (April
    2003)
  • Consortium Build 35: 20,000-25,000 (Nature
    431:931, 2004)

40
Human Genome Project -- Why sequence junk?!
  • 90% of the human genome (3.3 x 10^9 bases) in finished
    status, i.e. 99% of euchromatin.
  • 45% of the genome is repeat sequences.
  • 5% of the genome encodes genes (1.5% is coding).
  • 35,000-40,000 genes with multiple splicing
    products per gene (build 34).
  • Finished in April 2003; single-chromosome papers
    were published one by one.
  • The entire human genome was finished again in Oct.
    2004.
  • Build 35 assembly with 2.85 billion nucleotides
    interrupted by only 341 gaps. It covers 99% of
    the euchromatic genome with an error rate of 1 /
    100,000 bases. The human genome seems to encode
    only 20,000-25,000 protein-coding genes (Nature
    431:931-945, 2004).
  • Cost of genome sequencing: on average US$1 per
    base.
  • US$3.3 billion to sequence the human
    genome.
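The figures on this slide can be checked with simple arithmetic. The snippet below takes the slide's numbers at face value (about 3.3 billion bases at roughly one US dollar per base, and a Build 35 error rate of 1 per 100,000 bases over 2.85 billion nucleotides); the variable names are chosen for this illustration.

```python
GENOME_BASES = 3_300_000_000      # approximate genome size from the slide
COST_PER_BASE_USD = 1             # average cost per base from the slide
BUILD35_BASES = 2_850_000_000     # Build 35 assembly length
BASES_PER_ERROR = 100_000         # error rate of 1 / 100,000 bases

# Total cost is genome size times cost per base.
total_cost_usd = GENOME_BASES * COST_PER_BASE_USD

# Expected number of erroneous bases left in the Build 35 assembly.
expected_errors = BUILD35_BASES // BASES_PER_ERROR

print(f"Total cost: US${total_cost_usd:,}")
print(f"Expected residual errors in Build 35: {expected_errors:,}")
```

So the per-base cost reproduces the quoted 3.3 billion dollar total, and even a "finished" assembly at this error rate would still carry on the order of tens of thousands of wrong bases.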

41
Ab initio gene identification
  • Goals
      • Identify coding exons
      • Seek gene structure information
      • Get a protein sequence for further analysis
  • Relevance
      • Characterization of anonymous genomic DNA
        sequences
      • Works on all DNA sequences

42
Gene Finding on the Web
  • GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
      • http://compbio.ornl.gov/grailexp
  • ORFfinder: NCBI
      • http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • DNA translation: Univ. of Minnesota Med. School
      • http://alces.med.umn.edu/webtrans.html
  • GenLang
      • http://cbil.humgen.upenn.edu/sdong/genlang.html
  • BCM GeneFinder: Baylor College of Medicine,
    Houston, TX
      • http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
      • http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

43
[Figure: from gene to mature mRNA]
DNA: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4
Transcription -> primary transcript (each intron runs from GU
to AG)
Splicing -> mature mRNA (with cap and polyA)
Translation
44
Relative Performance
                      Claverie 1997               Rogic 2000
                      Sn (%)   Sp (%)   Overall   Overall
  Individual Exons
    MZEF              78       86       0.79
    HEXON             71       65       0.64
    SorFind           42       47       0.62
    GRAIL II          51       57       0.47
  Gene Structure
    GENSCAN           78       81       0.86      0.91
    FGENES            73       78       0.74      0.83
    GRAIL II/Gap      51       52       0.66
    GeneParser        35       40       0.54
    HMMgene                                       0.91
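For readers unfamiliar with the Sn and Sp columns: in gene-prediction benchmarks, sensitivity is usually the fraction of true coding positions that were predicted, and "specificity" is usually the fraction of predicted positions that are truly coding. The slide does not define the terms, so the sketch below assumes that common convention. It works at the nucleotide level on sets of coordinates; the function name and the toy coordinates are assumptions.

```python
def sensitivity_specificity(true_positions, predicted_positions):
    """Nucleotide-level Sn and Sp for a gene prediction.

    Sn = TP / (TP + FN): share of real coding bases that were found.
    Sp = TP / (TP + FP): share of predicted coding bases that are real.
    """
    true_positions = set(true_positions)
    predicted_positions = set(predicted_positions)
    tp = len(true_positions & predicted_positions)
    fn = len(true_positions - predicted_positions)
    fp = len(predicted_positions - true_positions)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Toy example: a real exon at 100-199, a predicted exon at 150-249.
real_exon = range(100, 200)
predicted_exon = range(150, 250)
sn, sp = sensitivity_specificity(real_exon, predicted_exon)
print(f"Sn = {sn:.0%}, Sp = {sp:.0%}")
```

A predictor can trade one measure for the other (predicting everything as coding gives perfect Sn and poor Sp), which is why benchmark tables report both alongside a combined score.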

45
What works best when?
  • Genome survey (draft) data: expect only a single
    exon in any given stretch of contiguous sequence
      • BLASTN vs. dbEST (3' UTR)
      • BLASTX vs. nr (protein CDS)
  • Finished data: large contigs are available,
    providing context
      • GENSCAN
      • HMMgene

46
Things we are looking to annotate?
  • CDS
  • mRNA
  • Alternative RNA
  • Promoter and Poly-A Signal
  • Pseudogenes
  • ncRNA

47
Pseudogenes
  • As many as 20-30% of all genomic sequence
    predictions could be pseudogenes
  • Non-functional copy of a gene
  • Processed pseudogene
      • Retro-transposon derived
      • No 5' promoters
      • No introns
      • Often includes polyA tail
  • Non-processed pseudogene
      • Gene duplication derived
  • Both include events that make the gene
    non-functional
      • Frameshift
      • Stop codons
  • We assume pseudogenes have no function, but we
    really don't know!

48
Noncoding RNA (ncRNA)
  • ncRNAs represent 98% of all transcripts in a
    mammalian cell
  • ncRNAs have not been taken into account in gene
    counts
      • cDNA
      • ORF computational prediction
      • Comparative genomics looking at ORFs
  • ncRNA can be
  • Structural
  • Catalytic
  • Regulatory

49
Non-encoding genes
Noncoding RNA database: http://biobases.ibch.poznan.pl/ncRNA
The total number of ncRNAs is still unknown because
ncRNAs are difficult to predict from genome
sequences.
50
Noncoding RNA (ncRNA)
  • tRNA: transfer RNA, involved in translation
  • rRNA: ribosomal RNA, structural component of the
    ribosome, where translation takes place
  • snoRNA: small nucleolar RNA, functional/catalytic
    in RNA maturation
  • Antisense RNA: gene silencing

51
http://rfam.wustl.edu
52
http://rna.tbi.univie.ac.at/
53
Input (sequence only)
RNA or DNA parameters
Fold Algorithm
Target temperature
Advanced fold options
Output formats
Email (necessary for large sequences)
Link to your previous run
54
Output in bracket notation
Output - PostScript
55
(No Transcript)
56
Free energy (ΔG)
Entropy (ΔS)
Melting (de-hybridization) temperature
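These quantities are linked by the standard thermodynamic relation ΔG = ΔH - TΔS, where ΔH is the enthalpy change and ΔS the entropy change. The slide does not spell this out, so the sketch below is an added illustration. It assumes the simplest two-state, single-molecule case, in which the melting temperature is the temperature where ΔG = 0 (duplex melting also depends on strand concentration, which is ignored here). The numeric values are invented.

```python
KELVIN_OFFSET = 273.15

def free_energy(delta_h, delta_s, temp_kelvin):
    """Gibbs free energy: dG = dH - T * dS.

    delta_h in kcal/mol, delta_s in kcal/(mol*K), temperature in kelvin."""
    return delta_h - temp_kelvin * delta_s

def melting_temperature_kelvin(delta_h, delta_s):
    """Temperature at which dG = 0 for a two-state unimolecular fold."""
    return delta_h / delta_s

# Illustrative values only, not taken from any real structure.
delta_h = -50.0    # kcal/mol
delta_s = -0.150   # kcal/(mol*K)

dg_37 = free_energy(delta_h, delta_s, 37 + KELVIN_OFFSET)
tm_celsius = melting_temperature_kelvin(delta_h, delta_s) - KELVIN_OFFSET
print(f"dG at 37 C: {dg_37:.2f} kcal/mol")
print(f"Tm: {tm_celsius:.1f} C")
```

A negative ΔG at the target temperature means the folded state is favoured; the closer the target temperature is to the melting temperature, the closer ΔG gets to zero.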
57
RNAalifold: predicts consensus secondary
structures for sets of aligned RNA sequences (ClustalW
files). Information from the alignment:
  • Conserved nucleotide pairs are shown normally.
  • Pairs with consistent mutations, which support
    the structure, are marked by circles.
  • Pairs with inconsistent mutations are shown in
    two shades of gray.

58
Bracket notation
(((..((((...)))).)))

  • . - unpaired base
  • ( ) - base i pairs base j
  • { } - a weaker version of the above
  • | - a base that is mostly paired but has pairing
    partners both upstream and downstream
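Reading bracket notation by machine is a classic stack exercise: each "(" is pushed, and each ")" pairs with the most recently opened position. The sketch below handles only the three basic symbols ".", "(" and ")" and rejects anything else; the function name is chosen for this example.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), 0-based, encoded by a
    dot-bracket string made of '.', '(' and ')'."""
    stack = []
    pairs = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

pairs = parse_dot_bracket("(((..((((...)))).)))")
print(pairs)
# [(0, 19), (1, 18), (2, 17), (5, 15), (6, 14), (7, 13), (8, 12)]
```

For the example string, the outer stem pairs positions 0-2 with 17-19, and the inner hairpin pairs positions 5-8 with 12-15, leaving a three-base loop at 9-11.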
59
Graphical Representation Sequence Logo
  • Horizontal axis position of the base in the
    sequence.
  • Vertical axis amount of information.
  • Letter stack order indicates importance.
  • Letter height indicates frequency.
  • Consensus can be read across the top of the
    letter columns.

http://www-lmmb.ncifcrf.gov/toms/sequencelogo.html
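The "amount of information" on the vertical axis of a logo is usually measured in bits: for DNA, a column carries 2 bits minus its Shannon entropy, and each letter's height is its frequency times that total. The sketch below assumes this common definition and omits the small-sample correction that logo software typically applies; the function names and the toy alignment are made up for illustration.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content of one alignment column in bits:
    log2(alphabet_size) minus the Shannon entropy of the column."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(aligned_sites):
    """For each column, map each base to its letter height in bits
    (base frequency multiplied by the column's information content)."""
    heights = []
    for column in zip(*aligned_sites):
        info = column_information(column)
        n = len(column)
        heights.append({base: (count / n) * info
                        for base, count in Counter(column).items()})
    return heights

# Toy alignment of four binding sites, three columns wide.
sites = ["AAT", "AAC", "ACG", "ACA"]
for position, column_heights in enumerate(logo_heights(sites), start=1):
    print(position, column_heights)
```

A fully conserved column reaches the 2-bit maximum, a column with two equally frequent bases carries 1 bit, and a column with all four bases at equal frequency carries none, so it disappears from the logo.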
60
(No Transcript)
61
Tools on the Web for motifs
  • MEME: Multiple EM for Motif Elicitation
      • http://meme.sdsc.edu/meme/website/
  • metaMEME: uses an HMM method
      • http://meme.sdsc.edu/meme
  • MAST: Motif Alignment and Search Tool
      • http://meme.sdsc.edu/meme
  • TRANSFAC: database of eukaryotic cis-acting
    regulatory DNA elements and trans-acting factors
      • http://transfac.gbf.de/TRANSFAC/
  • eMotif: allows scanning, making and searching for
    motifs at the protein level
      • http://motif.stanford.edu/emotif/

62
Websites for Promoter finding
  • Promoter Scan: NIH Bioinformatics (BIMAS)
      • http://bimas.dcrt.nih.gov/molbio/proscan/
  • Promoter Scan II: Univ. of Minnesota and Axyx
    Pharmaceuticals
      • http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
  • Signal Scan: NIH Bioinformatics (BIMAS)
      • http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
  • Transcription Element Search (TESS): Center for
    Bioinformatics, Univ. of Pennsylvania
      • http://www.cbil.upenn.edu/tess/
  • Search TransFac at GBF with MatInspector,
    PatSearch, and FunSiteP
      • http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
  • TargetFinder: Telethon Inst. of Genetics and
    Medicine, Milan, Italy
      • http://hercules.tigem.it/TargetFinder.html

63
Transcriptional regulatory region
  • TFs play a significant role in differentiation in
    a number of cell types
  • The fact that 5% of the genes are predicted to
    encode transcription factors underscores the
    importance of transcriptional regulation in gene
    expression (Tupler et al. 2001, Nature
    409:832-833)
  • The combinatorial nature of transcriptional
    regulation and practically unlimited number of
    cellular conditions significantly complicate the
    experimental identification of TF binding sites
    on a genome scale
  • Understanding the transcriptional regulation is a
    major challenge
  • Computational approaches can identify potential
    regulatory elements and modules, and derive new,
    biologically relevant and testable hypotheses

64
Transcriptional regulatory module
  • cis-regulatory elements are sequence-specific
    regions that transcription factors bind

CGGTTAAG
GCTAACGC
AGGCTA
CGGTTAAG
  • TFs combinatorially associate with each other to
    form modules and regulate their target genes

GCTAAGCG
AGGCTA
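Locating candidate cis-regulatory elements can be sketched as a string search on both strands. The example below uses exact matching purely for illustration and reuses two of the slide's motif strings as stand-ins; real binding sites are degenerate and are normally modeled with position weight matrices. The promoter sequence, the site labels and the function names are all invented.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(promoter, motifs):
    """Return sorted (position, strand, name) hits for exact motif matches.

    Positions are 0-based on the forward strand; a '-' hit means the
    motif's reverse complement was found at that position. Palindromic
    motifs are reported once per strand."""
    hits = []
    for name, motif in motifs.items():
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            start = promoter.find(pattern)
            while start != -1:
                hits.append((start, strand, name))
                start = promoter.find(pattern, start + 1)
    return sorted(hits)

# Hypothetical promoter and labels; motif strings borrowed from the slide.
promoter = "TTCGGTTAAGAATAGCCTTT"
motifs = {"site1": "CGGTTAAG", "site2": "AGGCTA"}
print(find_sites(promoter, motifs))
# [(2, '+', 'site1'), (12, '-', 'site2')]
```

Two sites found close together, as here, are the kind of co-occurrence that module-finding methods look for when modeling combinatorial regulation.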
65
Mammalian Promoter Database (MPromDb)
(http://bioinformatics.med.ohio-state.edu)
66
MPromDb 1.0 (Mammalian Promoter Database)
(http://bioinformatics.med.ohio-state.edu)
Human, mouse, rat
Search by gene symbol, GenBank Acc. Num.,
Unigene/LocusLink ID, TF binding site name
Click here to search the database
67
MPromDb 1.0 (Mammalian Promoter Database)
(http://bioinformatics.med.ohio-state.edu)
68
MPromDb 1.0 (Mammalian Promoter Database)
(http://bioinformatics.med.ohio-state.edu)
BAX gene promoter with TF binding site
annotations, with supporting evidence from 3
PubMed records
69
MPromDb (Mammalian Promoter Database)
  • Promoter sequences with annotations of
    experimentally supported TF binding sites
  • Promoter sequences with annotations of
    computationally predicted TF binding sites
  • A platform for statistical analysis and pattern
    recognition, to predict TF binding sites in
    uncharacterized promoters, and model
    combinatorial association of TF binding sites
  • A platform for comparative genomics, to reveal
    conserved regions across genomes of different
    species
  • Identification of core promoters
  • Identification of all the human-mouse-rat
    homologous pairs
  • Modeling and identification of TF binding site
    modules

70
(No Transcript)
71
(No Transcript)
72
  • Specific databases of protein sequences
    and structures
      • Swissprot
      • PIR
      • TREMBL (translated from DNA)
      • PDB (three-dimensional structures)

73
Protein Structure
  • Primary: amino acid sequence
  • Secondary: alpha helices, beta sheets, loops
  • Tertiary: packing of secondary elements
  • Quaternary: packing of several polypeptide chains
74
Structure Prediction Motivation
  • Hundreds of thousands of gene sequences
    translated to proteins (GenBank, SW, PIR)
  • Only about 28,000 solved structures (PDB)
  • Goal: predict protein structure based
    on sequence information

75
Structure Prediction Motivation
  • Understand protein function
  • Locate binding sites
  • Broaden homology
  • Detect similar function where sequence differs
  • Explain disease
  • See effect of amino acid changes
  • Design suitable compensatory drugs

76
Prediction Approaches
  • Primary (sequence) to secondary structure
      • Sequence characteristics
  • Secondary to tertiary structure
      • Fold recognition
      • Threading against known structures
  • Primary to tertiary structure
      • Ab initio modelling

77
Secondary structure prediction
  • AGADIR - An algorithm to predict the helical content of peptides
  • APSSP - Advanced Protein Secondary Structure Prediction Server
  • GOR - Garnier et al., 1996
  • HNN - Hierarchical Neural Network method (Guermeur, 1997)
  • Jpred - A consensus method for protein secondary structure prediction at University of Dundee
  • JUFO - Protein secondary structure prediction from sequence (neural network)
  • nnPredict - University of California at San Francisco (UCSF)
  • PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
  • Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
  • PSA - BioMolecular Engineering Research Center (BMERC) / Boston
  • PSIpred - Various protein structure prediction methods at Brunel University
  • SOPMA - Geourjon and Deléage, 1995
  • SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
  • DLP - Domain linker prediction at RIKEN
78
http://searchlauncher.bcm.tmc.edu/
79
(No Transcript)
80
Multiple Sequence Alignment
81
ClustalW Algorithm
Progressive Sequence Alignment (Higgins and
Sharp 1988)
  • Compute pairwise alignments for all pairs of
    sequences.
  • Use the alignment scores to build a phylogenetic
    tree (the guide tree) such that
      • similar sequences are neighbors in the tree
      • distant sequences are distant from each other in
        the tree.
  • The sequences are progressively aligned
    according to the branching order in the guide
    tree.
  • http://www.ebi.ac.uk/clustalw/
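The first two steps above (all-against-all pairwise scores, then a tree that joins the most similar sequences first) can be sketched as follows. This is a teaching simplification, not ClustalW's code: it uses a plain Needleman-Wunsch score with made-up match, mismatch and gap values, and a simple average-linkage clustering in place of ClustalW's own tree-building method. The final progressive alignment step is left out.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def guide_order(seqs):
    """Return the merge order of a simple guide tree: at each step the two
    clusters with the highest average pairwise score are joined."""
    score = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}

    def average(c1, c2):
        total = sum(score[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
print(guide_order(toy))
```

For the toy input, sequences a and b are joined first because they differ at a single position, and c is added last. A progressive aligner would then align a with b, and align c against that pair.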

82
ClustalW Input
Alignment format
Fast alignment?
Fast alignment options
Scoring matrix
Gap scoring
Phylogenetic trees
Input sequences
83
ClustalW Output (1)
Input sequences
Pairwise alignment scores
Building alignment
Final score
84
ClustalW Output (2)
Sequence names
Sequence positions
Match strength in decreasing order: * : .
85
Phylogenetic Trees
  • Represent closeness between many entities
  • In our case, genomic or protein sequences

Unobserved commonality
Observed entity
86
Progressive Sequence Alignment (Protein
sequences example)
87
MSA Approaches
  • Progressive approach
      • CLUSTALW (CLUSTALX)
      • PILEUP
      • T-COFFEE
  • Iterative approach: repeatedly realign subsets
    of sequences
      • MultAlin, DiAlign
  • Statistical methods
      • Hidden Markov Models: SAM2K
  • Genetic algorithm
      • SAGA

88
Multiple Alignment tools on the Web (Some URLs)
  • EMBL-EBI
      • http://www.ebi.ac.uk/clustalw/
  • BCM Search Launcher: Multiple Alignment
      • http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
  • Multiple Sequence Alignment for Proteins (Wash.
    U. St. Louis)
      • http://www.ibc.wustl.edu/service/msa/

89
Editing Multiple Alignments
  • There are a variety of tools that can be used to
    modify a multiple alignment.
  • These programs can be very useful in formatting
    and annotating an alignment for publication.
  • An editor can also be used to make modifications
    by hand to improve biologically significant
    regions in a multiple alignment created by one of
    the automated alignment programs.

90
(No Transcript)
91
GCG alignment editors
  • Alignments produced with PILEUP (or CLUSTAL) can
    be adjusted with LINEUP.
  • Nicely shaded printouts can be produced with
    PRETTYBOX
  • GCG's SeqLab X-Windows interface has a superb
    multiple sequence editor - the best editor of any
    kind.

92
SeqVu
93
Editors on the Web
  • Check out CINEMA (Colour INteractive Editor for
    Multiple Alignments)
  • It is an editor created completely in JAVA (old
    browsers beware)
  • It includes a fully functional version of
    CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
Questions
  • Download a protein sequence and predict its domains,
    secondary structure and post-translational
    modifications.
  • Download all SARS virus genomes and perform a
    multiple sequence alignment (MSA).