COMPARATIVE GENOMICS: GENOME-WIDE ANALYSIS IN METAZOAN EUKARYOTES

1
COMPARATIVE GENOMICS: GENOME-WIDE ANALYSIS IN
METAZOAN EUKARYOTES
  • Mao-Feng Ger
  • 02/08/2006

2
  • A completely sequenced genome can be used for
    large-scale comparative analysis
  • Effective methods for handling this enormous
    amount of data are the objective
  • Main areas in comparative genomics:
      • Whole-genome alignment
      • Gene prediction
      • Regulatory-region prediction

3
  • Introduction
  • Whole-genome alignments
  • Gene prediction
  • Finding regulatory regions
  • conclusion

5
Sequenced genomes
  • In the NCBI Genome Project Database, as of 02/05,
    the number of genome projects:
      • Archaea: 51 (complete, draft assembly, in
        progress)
      • Bacteria: 821 (complete, draft assembly, in
        progress)
      • Eukaryota: 391 (complete, draft assembly, in
        progress, organelles)
  • Of the eukaryotes, 174 are metazoans.

6
Comparative genomics
  • Presumption: two genomes descend from a common
    ancestor, so every base pair reflects the
    ancestral genome plus the action of evolution
  • Evolution = mutation + selection
  • Mutation can be represented by a rate matrix
  • Selection:
      • Negative selection
      • Neutral evolution (no selection)
      • Positive selection

7
BLOSUM and PAM rate matrices
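  • PAM (Point Accepted Mutation)
      • Derived from a set of proteins that are at
        least 85% identical
      • The numeric suffix is the number of times the
        PAM1 matrix is multiplied by itself (PAM250 is
        PAM1 raised to the 250th power)
  • BLOSUM (BLOcks SUbstitution Matrix)
      • More empirical, and derived from a larger
        dataset
      • Constructed by extracting ungapped segments
        (blocks) from a set of aligned protein families
      • The numeric suffix x is the percent-identity
        threshold at which sequences within a block are
        clustered (BLOSUM62 clusters sequences that are
        at least 62% identical)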
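
To make the PAM suffix concrete, here is a minimal sketch of the "self multiplication" idea. It assumes a toy two-letter alphabet and an invented 1%-change matrix; TOY_PAM1 is hypothetical and is not Dayhoff's published data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical one-step matrix: each row holds the probabilities that a
# residue stays the same or changes after one unit of divergence.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

# The "250" suffix corresponds to 250 self-multiplications of the
# one-step matrix.
toy_pam250 = mat_pow(TOY_PAM1, 250)
```

Each row of the result still sums to 1, and the diagonal entries shrink towards 0.5 as the exponent grows, which is why high-numbered PAM matrices suit distant comparisons.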

8
Difficulties in aligning genomes
  • We know so little about evolutionary processes
    that we had better focus on functional sequences
  • Because of differences in genome size and in how
    complete the assemblies are, whole-genome
    alignment is difficult
  • More and more programs now deal with large-scale
    comparisons. Biologists need to understand these
    approaches.

9
  • Introduction
  • Whole-genome alignments
  • Gene prediction
  • Finding regulatory regions
  • conclusion

10
Precomputed alignments
  • Several groups have made large cross-species
    comparisons
  • UC Santa Cruz/PennState (translated BLAT or
    BLASTZ)
  • Berkeley Genome Pipeline (BLAT/AVID)
  • Ensembl (Phusion/Blastn)

11
Whole-genome alignment
12
Which genome to align
  • Sufficient similarity between genomes enables
    the easy identification of homologous regions
  • Example: DNA alignment between human and mouse
    revealed new genes and gene regulatory regions
  • Alignment between human and pufferfish, though
    less easy, is still feasible

13
Comparing genomes at protein level
  • Distantly related genomes can be hard to align
    at the nucleotide level
  • Comparing at the protein level may lose
    information that helps find new genes and
    regulatory sequences
  • It is better to start from closely related genomes

14
Alignment strategy
  • Dynamic programming makes alignment tractable as
    long as you follow a few rules
  • Needleman-Wunsch aligns sequences globally
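
As an illustration of global alignment by dynamic programming, here is a minimal Needleman-Wunsch sketch. The scoring values (match 1, mismatch -1, linear gap -2) are assumptions chosen for the example, not parameters from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Global alignment: leading gaps are penalised along both edges.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has (n+1) x (m+1) cells, so time and memory both grow with the product of the two sequence lengths.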

15
  • Smith-Waterman aligns sequences locally
      • No negative scores: every cell is at least 0
      • Traceback stops at a cell with score 0
  • Limitations of both algorithms:
      • Cannot handle rearrangements such as
        inversion, duplication, translocation
      • For long sequences (>10,000 bp), very
        expensive in time and memory usage
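
A minimal Smith-Waterman sketch showing the two local-alignment rules: scores are floored at 0, and the traceback runs from the best cell back to a 0. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, smith_waterman("TTTACGTAAA", "GGGACGTCCC") picks out the shared ACGT core and ignores the unrelated flanks.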

16
Seeding strategy
  • Correct alignments usually contain stretches of
    ungapped matches
  • So first find a set of ungapped matches (seeds)
  • Then extend gapped alignments outwards from the
    seed positions
  • This costs some sensitivity but saves time and
    memory
  • Seed models: consecutive seeds and weighted
    spaced seeds
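
The two seed models can be sketched as follows. This is an illustrative toy, assuming a hash index of the target and a made-up spaced pattern; in a pattern string, '1' marks a position that must match and '0' a position that is ignored.

```python
from collections import defaultdict


def consecutive_seeds(query, target, k):
    """Return (query_pos, target_pos) pairs where k consecutive bases match."""
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)
    return [(q, t)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]


def spaced_key(seq, start, pattern):
    """Collect only the bases under the '1' positions of the pattern."""
    return "".join(seq[start + i] for i, c in enumerate(pattern) if c == "1")


def spaced_seeds(query, target, pattern):
    """Return (query_pos, target_pos) pairs matching at every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[spaced_key(target, t, pattern)].append(t)
    return [(q, t)
            for q in range(len(query) - span + 1)
            for t in index.get(spaced_key(query, q, pattern), [])]
```

With query "ACGTACGT" and target "ACGAACGT" (one mismatch at position 3), the consecutive 4-mer model finds no seed on the main diagonal across the mismatch, whereas the hypothetical pattern "1101" still yields the hit (1, 1). Each hit would then serve as the starting point for extension.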

17
  • Simply put:
      • Seeding
      • Seeds are used as nucleation points for
        extension
      • Dynamic programming produces the gapped
        alignments
  • This review focuses on 4 whole-genome alignment
    methods:
      • BLASTZ
      • BLAT/AVID
      • BLAT/LAGAN
      • WABA

18
BLASTZ
  • Local aligners, such as BLASTZ and BLAT, are
    highly sensitive but less specific
  • BLASTZ applies several methods to increase
    sensitivity and specificity:
      • Seeding: instead of 11 consecutive matches,
        the newer BLASTZ uses a weighted spaced seed
        (12 matching positions in a window of 19,
        tolerating a transition at one of the 12)
      • Extend the seeds without gaps
      • Extend with gaps, down-weighting
        low-complexity matches first
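
A minimal sketch of a 12-of-19 seed test that tolerates one transition. The pattern string below is the 12-of-19 layout commonly cited for BLASTZ-style seeds and is treated here as an assumption, as are the function name and its defaults.

```python
# Assumed 12-of-19 layout: '1' = position must match, '0' = ignored.
SEED_12_OF_19 = "1110100110010101111"

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_seed_hit(a, b, pattern=SEED_12_OF_19, max_transitions=1):
    """True if two windows agree at the '1' positions of the pattern,
    allowing up to max_transitions transition substitutions there."""
    if len(a) != len(pattern) or len(b) != len(pattern):
        raise ValueError("windows must be as long as the seed pattern")
    transitions = 0
    for x, y, p in zip(a, b, pattern):
        if p != "1" or x == y:
            continue
        if (x, y) not in TRANSITIONS:
            return False  # a transversion at a required position
        transitions += 1
        if transitions > max_transitions:
            return False
    return True
```

Tolerating a transition raises sensitivity because transitions are the most frequent substitutions between related genomes, while the ignored positions let a seed survive scattered mismatches.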

20
  • For the mouse-human alignment, a specific
    scoring matrix derived from known mouse-human
    homologous regions was used
  • A post-processing step is needed to pick out the
    most significant orthologous match when a region
    has multiple matches
  • Overall, BLASTZ covered 98% of coding regions in
    the mouse and human genomes, indicating that it
    is highly sensitive for identifying
    well-conserved regions

21
BLAT
  • A local aligner
  • Untranslated mode: designed to align cDNA to
    genomic sequence; less effective below 90%
    identity
  • Translated mode: more effective for genome
    comparison. Masking repeats and low-complexity
    sequence makes the run faster and the output
    cleaner
  • Produces a set of ungapped alignments: fast, at
    the expense of overall sensitivity
  • Used in the human-pufferfish genome comparison

22
Global alignment
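  • 3 steps:
      • 1. Find the maximal repeated regions (maximal
        matches between the two sequences)
      • 2. Select anchors: clean matches first, then
        matches within repeats
      • Apply steps 1 and 2 recursively to the
        segments between anchors
      • 3. For the remaining segments: if <4 kb, use
        the Needleman-Wunsch algorithm; if >4 kb,
        report no significant alignment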
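
The recursive anchoring idea can be sketched as below. This is a simplified stand-in: it greedily takes the longest exact match as the anchor using a quadratic dynamic-programming search, rather than any particular tool's match finder or anchor-selection rules, and min_anchor is a hypothetical parameter.

```python
def longest_common_substring(a, b):
    """Return (a_start, b_start, length) of the longest exact match."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def anchor_chain(a, b, min_anchor=4, a_off=0, b_off=0):
    """Pick the longest match as an anchor, then recurse on both flanks.

    Returns anchors as (a_start, b_start, length) in left-to-right order.
    """
    a_start, b_start, length = longest_common_substring(a, b)
    if length < min_anchor:
        return []  # no usable anchor: this segment is left for step 3
    left = anchor_chain(a[:a_start], b[:b_start], min_anchor, a_off, b_off)
    right = anchor_chain(a[a_start + length:], b[b_start + length:],
                         min_anchor,
                         a_off + a_start + length,
                         b_off + b_start + length)
    return left + [(a_off + a_start, b_off + b_start, length)] + right
```

Because each recursion only looks at the sequence to the left or right of an anchor, the resulting anchors are always collinear, which is exactly why this strategy cannot represent inversions or translocations.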

24
AVID
  • Assumption: the sequences are strictly
    homologous, with no gene duplication, inversion
    or translocation
  • When applied to a whole genome, it needs a
    pre-processing step to identify syntenic regions

25
LAGAN
  • The advantage of LAGAN over AVID is that it can
    align larger sequences
  • Lower memory requirement
  • Different matching algorithm in step 1 (exact
    matches are not required)
  • In conjunction with BLAT, it has been applied to
    rat-human and rat-mouse comparisons

26
MLAGAN
  • An extension of LAGAN
  • Can do multiple alignment
  • Align closely related genomes first, then
    incorporate others in order of phylogenetic
    distance

27
WABA
  • Takes genetic-code degeneracy into account
  • The seeding step is nucleotide-based and uses a
    weighted spaced seed (6 of 8), which allows the
    third codon position to mismatch
  • No extension step; instead, compatible seeds are
    grouped to define homologous regions

28
Biological correctness
  • There is no single best way to do alignments
  • We do not understand evolution well enough to
    say which method is superior
  • Different algorithms are tuned to different
    genome comparisons (e.g. BLASTZ for human-mouse
    and WABA for C. elegans-C. briggsae)
  • Purposes differ:
      • Align as much as possible, regardless of
        selection (e.g. AVID, LAGAN, BLASTZ)
      • Identify conserved regions that are under
        selection (e.g. BLAT, WABA)

29
  • Most programs aim to maximize the number of
    homologous base pairs aligned, while biologists
    are interested in regions conserved because of
    their function
  • For example, in the mouse-human alignment, 40%
    is alignable, but only 6% is under selection
  • To make things worse, the substitution rate
    varies across the genome

30
  • Introduction
  • Whole-genome alignments
  • Gene prediction
  • Finding regulatory regions
  • conclusion

31
Defining gene structure
  • Still a challenge because of the poor
    signal-to-noise ratio
  • Comparison between closely related genomes can
    provide additional information (dual-genome gene
    predictors)
  • Different programs make different assumptions,
    so users need to know their strengths and
    limitations

32
Dual genome gene predictor
  • Designed to model a specific type of negative
    selection
  • Assumption: in an alignment, most differences
    are neutral, and regions with few mutations are
    conserved
  • Other information, such as splicing and the
    wobble effect, is combined to get a better model

33
  • Dual-genome predictors can be subdivided into 3
    classes:
      • Pair-HMMs take a mathematical approach to
        determine the gene structure and the alignment
        jointly
      • Informant approaches take the alignment as
        fixed and use it to improve gene prediction
      • Exon-finding approaches try to demarcate the
        exons without splicing them together

34
Pair-HMM
  • An HMM (hidden Markov model) can be used to
    predict gene structure in a single genome
  • A pair-HMM finds the most likely path to have
    generated both sequences, providing the
    alignment as well as the gene prediction
  • Contains two sets of orthologous genes
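
To show what "the most likely path" means, here is a minimal single-sequence Viterbi sketch. The two states and every probability below are invented for illustration (an AT-rich "noncoding" state and a GC-rich "coding" state); a pair-HMM extends the same recursion to states that emit aligned pairs of bases from two sequences.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen only to make the example readable.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def viterbi(seq):
    """Return the most likely state path for seq under the toy HMM."""
    if not seq:
        return []
    # Log-probabilities avoid underflow on long sequences.
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            col[s] = (v[-1][prev] + math.log(TRANS[prev][s])
                      + math.log(EMIT[s][base]))
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

On "AAAAAAGCGCGCGCAAAAAA" the path switches to "coding" for the GC-rich stretch and back again, because the emission gain over eight bases outweighs the cost of two state changes.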

35
  • Two pair-HMM approaches:
      • SLAM
      • DoubleScan
  • Both need parameters optimized for the specific
    species, and both need better efficiency
  • SLAM uses AVID for the rough alignment, while
    DoubleScan uses BLAST

36
Informant approaches
  • Use only one sequence to predict gene structure;
    the other sequence only provides additional
    information through its alignment
  • Can work not only with genome sequence but also
    with other inputs, such as unassembled reads
  • 3 methods are available: TwinScan, SGP-2,
    GenomeScan
  • Need precisely tuned parameters, so each has its
    own alignment method (often BLAST)

37
Exon prediction
  • A carefully parameterized TBLASTX method,
    designed to give specific exon predictions from
    Tetraodon
  • Sacrifices a certain amount of sensitivity for
    high specificity

38
Which method to use
  • A particularly successful approach combined
    informant methods with some simple criteria
  • It produces strong predictions in mammalian
    genomes

39
  • Introduction
  • Whole-genome alignments
  • Gene prediction
  • Finding regulatory regions
  • conclusion

40
Finding regulatory regions
  • Called phylogenetic footprinting (by analogy
    with DNase footprinting)
  • Functionally important regions accumulate fewer
    mutations
  • These cis-regulatory motifs can be determined
    by:
      • Finding common motifs in orthologous
        sequences
      • Aligning orthologous sequences first, then
        identifying common regions
  • Previously known motifs might help
41
Which region to use
  • 5' and 3' flanking regions as well as intronic
    sequences
  • Difficulties in finding regulatory regions:
      • The 5' end is often the least well-defined,
        so we need experimental evidence of promoters
      • Enhancers could be several kilobases away
  • In addition to experimental evidence, guessing
    and systematic comparison are needed to find
    potential cis-regulatory regions

42
Evolutionary issues
  • Two orthologous genes might have very different
    cis-regulatory elements, as can paralogous genes
  • How evolution affects cis-regulatory motifs is
    still poorly understood
  • Comparisons within mammals show a large amount
    of non-functional conservation, while in
    comparisons across vertebrates conservation is
    hard to detect

43
  • Neutral drift can destroy or create
    cis-regulatory sites at a certain rate
  • However, the expression pattern may change
    little or not at all
  • This raises the possibility of compensatory
    mutations
  • Recently, some researchers have tried to
    distinguish regulatory regions from neutrally
    evolving DNA using genome sequence alignments

44
Motif overrepresentation
  • Motif-finding programs do not consider the
    phylogenetic relationships between homologous
    sequences
  • This can be overcome by identifying DNA motifs
    that evolve at a slower rate than the
    surrounding sequences
  • All motif-finding techniques work better with
    more sequences
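
A deliberately naive sketch of motif overrepresentation: report the k-mers that occur in at least a given fraction of the input sequences. The function name and parameters are hypothetical, and, like the programs criticised above, it treats every sequence as independent and ignores phylogeny.

```python
from collections import Counter


def shared_kmers(sequences, k, min_fraction=1.0):
    """Return, sorted, the k-mers found in at least min_fraction of the
    sequences (each sequence counts a k-mer at most once)."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    needed = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)
```

For three toy upstream sequences "AAATATAAGG", "CCTATAACCC" and "GTATAAGTGT", the only 5-mer shared by all is TATAA. With closely related species many k-mers would be shared by descent alone, which is the weakness the phylogeny-aware methods address.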

45
Alignment for finding regulatory region
  • Align regions of homology in the non-coding
    sequence near the orthologous genes
  • A growing body of research shows that some
    cis-regulatory elements lie in non-conserved
    regions

46
  • Introduction
  • Whole-genome alignments
  • Gene prediction
  • Finding regulatory regions
  • conclusion

47
conclusion
  • As more genomes are sequenced, we can
    investigate the effects of evolution on specific
    regions across entire genomes
  • With precomputed data, users can focus on the
    biology
  • 3 advances still need to be made:
      • More genomes are needed to improve the power
      • Power can be improved by knowing how negative
        selection works under different functional
        constraints
      • Knowing more about positive selection