An Introduction to Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

An Introduction to Bioinformatics

Description:

1866 - Gregor Mendel's discovery of the laws of heredity ... Xenopus tropicalis. Gallus gallus. Mus musculus. Pan troglodytes. Homo sapiens ... – PowerPoint PPT presentation

Slides: 125
Provided by: Kui1

Transcript and Presenter's Notes

Title: An Introduction to Bioinformatics


1
An Introduction to Bioinformatics
  • Laboratory of Computational Molecular Biology
  • College of Life Sciences
  • Beijing Normal University

2
Image from http://microscopy.fsu.edu
3
The Timeline of Landmark Accomplishments in
Genetics and Genomics
  • 1866 - Gregor Mendel's discovery of the laws of
    heredity
  • 1944 - Recognition of DNA as the hereditary
    material
  • 1953 - Determination of DNA structure
  • 1963 - Elucidation of the genetic code
  • 1972/3 - Development of recombinant DNA
    technologies
  • Establishment of increasingly automatable methods
    for DNA sequencing (1975,1977,1986, 2001)
  • 1990 - Launch of the Human Genome Project
    (HGP)
  • 2001 - The IHGSC and Celera Genomics each
    reported draft sequences providing a first
    overall view of the human genome
  • 2005 - Human haplotype map (HapMap)

4
Genome Sequencing Technology
  • Traditional clone-by-clone approach
  • High-throughput whole genome shotgun (WGS)
    sequencing assembly
  • Advance in assembly algorithms
  • Challenges: repetitive sequences, diploid genomes,
    etc. make correct assembly difficult; more
    robust automated methods are therefore needed.

Next-generation sequencing outpaces expectations.
Nature Biotechnology 2007, Vol. 25: 149.
Several orders of magnitude more efficient than
Sanger capillary-array electrophoresis (CAE) machines.
1,000 genome program: 600 megabases per day -> 100
billion bases per day.
Targeting resequencing efforts aimed at finding
genetic variations and rare mutations that
contribute to complex diseases.
5
Why is There Bioinformatics?
  • High-throughput molecular biological technologies
  • Automated sequencers
  • Genomes, ESTs, SNPs, etc.
  • DNA array for large-scale gene expression
  • Proteomics platform
  • MS/MS, LC/MS/MS (liquid chromatography),
    protein-protein interaction experiments,
    genome-wide localization techniques
  • Massive datasets produced
  • GenBank (As of August 2006)
  • 65,369,091,950 bases in 61,132,599 sequence
    records in the traditional GenBank divisions
  • 80,369,977,826 bases in 17,960,667 sequence
    records in the WGS division.

6
Is Biology an Informational Science?
  • The HGP changed how we view and practice biology.
  • Biology is an informational science.
  • Digital genome
  • Environmental signals
  • Biology has become a cross-disciplinary science.

From U.S. Department of Energy Human Genome
Program. http://www.ornl.gov/hgmis
7
Bioinformatics as an intersecting discipline
Developing the high throughput technologies and
computational/mathematical tools required for
this new biology.
8
Why? Where? What? How?
  • Why: What are these huge datasets produced for?
    Biological background is needed.
  • Where: Raw data need to be stored; IT platforms
    are required.
  • What: Patterns in datasets that can be analyzed
    using computers. Various data models and their
    respective algorithms are needed.
  • How: Different resources need to be integrated.

9
What is Bioinformatics?
  • The field of biology specializing in developing
    hardware and software to store and analyze the
    huge amounts of data being generated by life
    scientists. (NIH)
  • More than 20 different definitions can be found
    from Google!

Bioinformatics applications: data analysis; data
integration; various molecular biology databases
10
Key Challenge of Bioinformatics
  • The world of biology is very different from what
    it was even ten years ago.
  • To bridge the considerable gap between technical
    data production and its use by scientists for
    biological discovery.

11
Main resources for Bioinformatics
  • Databases
  • GenBank, EMBL, DDBJ
  • UniProt/SwissProt, InterPro, Pfam, SCOP, PDB
  • Gene Ontology, KEGG
  • Algorithms & Applications
  • BLAST, FASTA, BLAT
  • ClustalW, HMMer
  • Phylip

12
(No Transcript)
13
1138 Prokaryotic Genome Sequencing Projects Selected:
Complete - 461, Draft assembly - 307, In Progress - 370
257 Eukaryotic Genome Sequencing Projects Selected:
Complete - 24, Draft assembly - 95, In Progress - 138
(NCBI, 2007-2-27,
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html)
14
NCBI's Tools for Data Mining
  • Nucleotide Sequence Analysis
  • Model Maker - allows you to view the evidence
    (mRNAs, ESTs, and gene predictions) that was
    aligned to assembled genomic sequence to build a
    gene model and to edit the model by selecting or
    removing putative exons.
  • Protein Sequence Analysis and Proteomics
  • CD Search - search the Conserved Domain Database
    with Reverse Position Specific BLAST.
  • Structures
  • Genome Analysis
  • Entrez Genomes - whole genomes of over 1000
    organisms. It provides graphical overviews of
    complete genomes/chromosomes and the ability to
    explore regions of interest in progressively
    greater detail.
  • Gene Expression
  • The Cancer Genome Anatomy Project - aims to
    decipher the molecular anatomy of cancer cells.
    CGAP develops profiles of cancer cells by
    comparing gene expression in normal,
    precancerous, and malignant cells from a wide
    variety of tissues.

15
What are the Hot Topics in Bioinformatics?
  • Comparative genomics
  • Genetic variation analysis
  • Microarray-based gene expression analysis
  • Systems biology

The idea that we can study the interactions of
all elements in a biological system and from
these come to understand its emergent properties.
Gene -> Genome: all of the genes in an organism
Transcript -> Transcriptome: expressed in a
particular biological process
Protein -> Proteome: the world of expressed proteins
Interaction -> Interactome: the interactions
between molecules
Genomics, Transcriptomics, Proteomics,
Metabolomics/Interactomics
16
System architecture (diagram)
Clients: browsers, light applications
Network: intranet and/or Internet
Servers: WWW servers (HTML, PERL/C/C++/Java),
database servers (MySQL), intensive computing
servers (bioinformatics applications, HPC with MPI)
17
The future of genomics research. Nature 422:
835-847 (2003).
18
  • Nothing in biology makes sense except in the
    light of evolution.
  • ------ Theodosius Dobzhansky

The evolutionary process finds a way to create
exceptions to every model we propose. --- Austin
L. Hughes
19
Biological Data Objects
Genomes & Protein sequences

20
Sequence assembly by the shotgun approach
Building up the master sequence directly from the
short sequences obtained from individual
sequencing experiments, simply by examining the
sequences for overlaps. This is called the
shotgun approach. It does not require any prior
knowledge of the genome and so can be carried out
in the absence of a genetic or physical map.

21
So what are you going to do with this?
How do you find where the genes are?
CGGTTGAAAGCGGTAGCGTCCATGCGTATTACTCTTGAGCGGTCGAACCT
TCTGAAATCGCTGAACCACGTCCACCGGGT CGTCGAGCGTCGCAACACG
ATCCCGATCCTGTCCAACGTTCTGCTGCGCGCCTCCGGCGCCAATCTGGA
CATGAAGGCGA CCGACCTCGATCTGGAAATCACCGAAGCGACCCCGGCC
ATGGTGGAGCAGGCTGGCGCCACCACCGTACCGGCACACCTG CTTTACG
AAATCGTGCGCAAGCTGCCGGATGGTTCCGAAGTGCTTCTGGCGACCAAC
CCGGACGGCTCCTCCATGACCGT TGCGTCCGGCCGCTCGAAATTCTCGC
TGCAATGCCTGCCGGAAGCGGATTTCCCTGACCTCACCGCCGGCACCTTC
AGCC ACACCTTCAAACTGAAGGCGGCCGATCTGAAGATGCTGATCGACC
GGACGCAGTTTGCGATTTCGACCGAAGAGACGCGT TATTACCTGAACGG
CATTTTCTTCCACACCATCGAAAGCAATGGCGAGCTGAAACTGCGCGCCG
TCGCCACCGACGGTCA CCGCCTTGCGCGTGCTGACGTCGATGCGCCCTC
CGGCTCCGAAGGCATGCCGGGCATCATCATTCCGCGCAAGACCGTCG GT
GAACTGCAGAAGCTGATGGACAATCCGGAACTGGAAGTCACAGTCGAAGT
CTCGGATGCGAAGATCCGCCTGGCCATC GGTTCCGTCGTTCTGACCTCG
AAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGG
CAACGACAA GGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTG
GACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCG TGAAGCTGG
CGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGA
AGTGCTACCGAAGAAGTGGCC GTTGGCTACGACAATGATTCGATGGAAA
TCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGC
GA AGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTC
GCGATACCGCCGGCGACGACGCACTCTATGTTC TGATGCCGATGCGCGT
TTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCT
TCATCGGCGTTTTT TGATTCGGCGAACAGGTGGCTCTACCCGTAACTGA
ATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG ATCA
ACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCA
TGGCACTCAACCTGAAGCAACGGCTT GAACAAAAATTTGAGGAAGAAAT
CCGCTTTTTCAAAGGTATGGTCAGCCAGCCGAAAAAAGTCGGCGCCATTG
TCCCGAC GGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTT
CCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAA GGAAATGCGCG
TCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCT
TCCGAGCGCGGCCGCGCCG TGAAGCTGGCGCTAACTGACGGCCAGTTGA
CGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCC
GTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATC
TCCTCGACATCACGTCGCAGCTCTCCGGCGA AGATGCGATTTTTCTGCT
GGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACG
CACTCTATGTTC TGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTC
AATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTT TGATTC
GGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTT
TGCCTTGTTTTTGCGCCAAATGGG ATCAACAGTACGTAACAATTTTTTG
ACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAAC
GGCTT
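One very rough first answer to "where are the genes?" is to scan for open reading frames (ORFs). The sketch below is only an illustration: the function name `find_orfs` and the `min_codons` cutoff are assumptions made here, it looks at the forward strand only, and real gene finders use far richer evidence (reverse strand, codon usage, homology, expression data).

```python
# Naive ORF scan: report ATG ... stop stretches in the three forward frames.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon.

    min_codons is an arbitrary illustrative cutoff on the number of codons
    before the stop codon.
    """
    dna = "".join(dna.split()).upper()  # drop spaces/newlines from pasted sequence
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG in this frame
    return orfs

toy = "CCATGAAACCCGGGTTTTAGCC"                 # made-up toy sequence
print(find_orfs(toy, min_codons=3))
```

Pasting the sequence from the slide into `find_orfs` would list candidate coding stretches, which would then still need checking against known proteins (for example with BLAST).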
22
Three Layers of Genome Annotation. From Stein,
L. 2001. Nature Reviews Genetics 2:493-503
23
  • Human Genome Project
  • 1996: first eukaryotic genome (Saccharomyces
    cerevisiae) sequenced and stored in NCBI's
    Genomes division. (Chromosome size: 270 kb -
    1500 kb)
  • 1997: Entrez Genomes was able to provide the
    first graphical views of genomic sequence data
    (genetic, physical and cytogenetic maps).
  • 2001.2: the working draft of the human genome
    (Lander et al., 2001). Chromosome size: 46 Mb -
    246 Mb.
  • NCBI created the first human Map Viewer. UCSC's
    human Genome Browser. EBI's Ensembl: a system to
    annotate the human genome sequence automatically,
    as well as to store and visualize the data.
  • Today, each site provides free access not only to
    human genome sequence data, but also other
    assembled genomic sequences.

24
  • A simple description of each browser
  • The backbone is an assembled genomic sequence.
  • UCSC (May 2000) and NCBI assembled BAC sequences
    into longer contigs.
  • NCBI assembly now is displayed by all three
    browsers (UCSC stopped producing assemblies in
    Dec 2001).
  • The sequence is known by its build number at
    NCBI, and by a date at UCSC. (March 2006, NCBI
    Build 36.1, hg18)
  • Each of the three genome browsers provides its
    own annotation of the common assembled sequence.
  • Location of genes, both known and predicted.
  • Different sources of mRNAs, different alignment
    and prediction methods.
  • ESTs and SNPs, the location of STS markers.
  • Homologous sequences from other organisms.

25
Depending on the state of the sequencing project,
genomic coordinates along the chromosome may
change dramatically from assembly to assembly.
26
DNA Sequence Databases
Data flow for new submissions and updates between
the three databases
27
The RefSeq project (NCBI)
  • Providing a reference sequence for each molecule
    in the central dogma (DNA, mRNA, and protein).
  • Distinct accession number series (2+6 format)
  • NT_123456 Genomic contig (DNA)
  • NM_123456 mRNA
  • NP_123456 Protein
  • XM_123456 Model mRNA
  • XP_123456 Model protein
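The prefix convention above can be checked mechanically. The following is a minimal sketch, assuming only the five prefixes listed on this slide and the two-letters-plus-six-digits layout it shows; `classify_refseq` is a name invented here, and real RefSeq accessions may also carry other prefixes, longer numbers, and a version suffix that this sketch does not handle.

```python
import re

# Prefix meanings copied from the slide; anything else is treated as unknown.
REFSEQ_PREFIXES = {
    "NT": "genomic contig (DNA)",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA",
    "XP": "model protein",
}

# Two uppercase letters, an underscore, then exactly six digits.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d{6})$")

def classify_refseq(accession):
    """Return the molecule type for an accession in this layout, or None."""
    m = _ACCESSION.match(accession.strip())
    if m is None:
        return None
    return REFSEQ_PREFIXES.get(m.group(1))

for acc in ("NM_123456", "XP_123456", "AB123456"):
    print(acc, "->", classify_refseq(acc))
```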

28
Protein Sequence Database (UniProt)
  • Swiss-Prot protein database (early 1980s, Amos
    Bairoch) and TrEMBL
  • Combining Swiss-Prot, TrEMBL, and PIR-PSD
    together.
  • Three major components
  • The UniProt Archive (UniParc), into which new and
    updated sequences are loaded
  • The UniProt Knowledgebase, with the goal of
    providing an expertly curated database.
  • The UniProt nonredundant reference database
    (UniRef)

29
The flow of data from primary sources into the
component databases of the Universal Protein
Resource
Complete coverage of all predicted coding
sequences from human, mouse, and rat.
30
(No Transcript)
31
MeSH Terms: NLM's Medical Subject Headings, a
controlled vocabulary of biomedical terms that is
used to describe the subject of each journal
article in MEDLINE. MeSH contains more than
23,000 terms and is updated annually to reflect
changes in medicine and medical terminology. MeSH
terms are arranged hierarchically by subject
categories with more specific terms arranged
beneath broader terms. PubMed allows you to view
this hierarchy and select terms for searching in
the MeSH Database.
32
Lincoln D. Stein
33
Sequence Databases Beyond NCBI
  • NCBI is the center of the sequence universe
  • There are many specialized sequence databases
    throughout the world that serve specific groups
    in the scientific community.

34
(No Transcript)
35
Biological Data Objects
SNPs & Human variants

36
Why study genetic variation?
Saccharomyces cerevisiae
Caenorhabditis elegans
Drosophila melanogaster
Takifugu rubripes
Xenopus tropicalis
Gallus gallus
Mus musculus
Pan troglodytes
Homo sapiens
(different population variations)
Population A
Population B
Population C
37
Feuk L. et al., 2006. Nature Reviews Genetics
7:85-97
Disease susceptibility
38
Feuk L. et al., 2006. Nature Reviews Genetics
7:85-97
Terms shown: pericentric and paracentric
(inversions), uniparental disomy
39
Science 315, 848-853 (2007)
40
  • With the advent of molecular biology, and DNA
    sequencing in particular, smaller and more
    abundant alterations were observed. Such
    differences include
  • Single Nucleotide Polymorphisms (SNPs)
  • various repetitive elements that involve
    relatively short DNA sequences, e.g., micro- and
    minisatellites
  • small (usually <1 kb) insertions, deletions,
    inversions and duplications.
  • It was presumed that these small-scale variants
    constitute most genetic variation; for example,
    estimates predict that there are at least 10
    million SNPs within the human population,
    averaging 1 every 300 nucleotides among the 3
    billion nucleotide base pairs that constitute the
    genome of an individual.

41
Genetic(genomic) variation
  • Mutations, fundamentally produced by errors in
    DNA replication, are the ultimate source of
    genetic variation in a population.
  • The frequency of alleles depends on the
    evolutionary forces that have been applied to the
    population (e.g., genetic drift, selection,
    migration) and on the age of the mutation itself.
  • When the mutant allele exceeds 1% representation
    in a population, we refer to the mutation as a
    polymorphism, with no implied association with
    phenotype.

42
  • Single nucleotide polymorphisms (SNPs): a
    polymorphism in which alleles are defined by
    single or few base changes; sometimes the term
    also covers deletions and insertions of one or
    a few bases.
  • Short tandem repeats (STRs): a number of
    repeats of short, 2- to 5-nt subsequences.
    Typically, they are scattered throughout the
    genome and are flanked by unique sequences that
    can be used to target a PCR amplification of the
    locus. Most STRs are variable in copy number,
    making them valuable genetic markers.
  • Seq1: GTCACTGACACACACAGTACGT
  • Seq2: GAC CTGACACACACACACACACACAGTACG
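The copy-number difference between two STR alleles can be read off by counting back-to-back copies of the repeat unit. This is a small sketch under stated assumptions: the helper name `longest_repeat_run` is invented here, the repeat unit ("AC") is supplied by the caller rather than discovered, and only exact, uninterrupted repeats are counted.

```python
import re

def longest_repeat_run(seq, unit):
    """Largest number of consecutive exact copies of `unit` found in `seq`."""
    seq = "".join(seq.split()).upper()          # tolerate spaces in pasted sequence
    pattern = f"(?:{re.escape(unit.upper())})+"  # one or more copies of the unit
    runs = re.findall(pattern, seq)
    return max((len(run) // len(unit) for run in runs), default=0)

seq1 = "GTCACTGACACACACAGTACGT"                  # Seq1 from the slide
print(longest_repeat_run(seq1, "AC"))            # copies of AC in the longest run
```

Applying the same call to a second allele and comparing the two counts gives the copy-number difference that makes STRs useful as markers.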

43
  • SNP Discovery methods
  • Pairwise sequence comparison: aligning two
    sequences from different individuals' DNA and
    looking for high-quality sequence differences.
  • Deep resequencing: using PCR resequencing of DNA
    samples to find SNPs in a well-defined region of
    the genome.
  • Separation methods: slab gel, capillary
    electrophoresis (CE), MS, DHPLC (denaturing
    high-performance liquid chromatography), etc.
  • Homogeneous assays: Taqman, Invader, etc.
  • Solid-phase assays: beads, DNA arrays (chips),
    etc.
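The pairwise-comparison idea in the list above can be sketched as follows. Everything specific here is an assumption for illustration: the two reads are taken to be already aligned and of equal length, qualities are Phred-like integers, and the cutoff of 20 and the name `candidate_snps` are invented for this example.

```python
def candidate_snps(seq_a, seq_b, qual_a, qual_b, min_qual=20):
    """Report substitution sites where both bases pass the quality cutoff.

    seq_a / seq_b: pre-aligned sequences of equal length ('-' marks a gap).
    qual_a / qual_b: per-base quality scores (assumed Phred-like).
    """
    if not (len(seq_a) == len(seq_b) == len(qual_a) == len(qual_b)):
        raise ValueError("sequences and quality lists must have equal length")
    hits = []
    for pos, (a, b, qa, qb) in enumerate(zip(seq_a, seq_b, qual_a, qual_b)):
        is_substitution = a != b and "-" not in (a, b)
        if is_substitution and min(qa, qb) >= min_qual:
            hits.append((pos, a, b))   # 0-based position and the two alleles
    return hits

read_a = "ACGTACGT"
read_b = "ACCTACAT"
qual_a = [30] * 8
qual_b = [30, 30, 30, 30, 30, 30, 10, 30]   # low-quality base at position 6
print(candidate_snps(read_a, read_b, qual_a, qual_b))
```

The low-quality mismatch is dropped, which is the point of looking only for high-quality differences.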

44
(No Transcript)
45
  • NCBI dbSNP content
  • Original submitter-supplied data (ss)
  • Integrated postsubmission content
  • refSNP clusters (rs)
  • Mapping results
  • Functional analysis
  • Average levels of diversity

46
Alleles define the class of the variation
  • SNP
  • in-del: DIPs (deletion/insertion polymorphisms);
    a deletion is represented by -
  • het: the variation has unknown sequence
    composition, but is observed to be heterozygous.
  • microsatellite: microsatellite / simple sequence
    repeat / STR
  • mixed; MNP: multiple nucleotide polymorphism (all
    alleles the same length, where length > 1).
  • named: allele sequences defined by a name tag,
    e.g., (Alu)/-
  • no variation: invariant region in the surveyed
    sequence.
  • The dbSNP database is released to the public in
    periodic builds that are synchronized with
    current genome assemblies.

47
Resource Integration
  • The clustered data from dbSNP provides a
    non-redundant set of variation from each organism
    in the database.
  • The non-redundant set of variations (refSNP
    cluster set) is annotated on reference genome
    sequence contigs, chromosomes, mRNAs, and proteins
    as part of the NCBI RefSeq project.
  • Summary properties are recomputed for each refSNP
    cluster and are used to build fresh indices for
    NCBI's Entrez query system and to update
    variation maps or tracks in genome browsers
    (e.g., Ensembl, UCSC Genome Browser).

48
Variation Function Class
  • Defined by the position of the variation relative
    to the structure of the aligned transcript (a
    variation may have several different functional
    relationships to a gene, and may have multiple,
    potentially different relationships to its local
    gene neighbors).
  • locus region: near a gene
  • mRNA UTR (untranslated region)
  • intron
  • splice site
  • coding region
  • synonymous
  • nonsynonymous

49
Genotyping
  • The process of determining the allele states of
    selected polymorphisms in selected groups of
    individuals.
  • How to genotype a known STR?
  • PCR primers are designed on either side of the
    STR.
  • The amplified product from a DNA sample is
    size-fractionated by electrophoresis to
    determine the number of copies present.
  • An individual who is homozygous for one of the
    allele states will show a single predominant
    band; those who are heterozygous will show two
    distinct bands.
  • For STR polymorphisms, there can be many
    different copy numbers present in the population,
    so the copy number(s) is important to record for
    each sample.

50
(No Transcript)
51
  • Uneven distribution
  • The distribution of SNPs departs significantly
    from what would be predicted under the standard
    population genetics models => SNP hot spots and
    cold spots.
  • What combination of historical, structural or
    selective pressures is responsible for this
    phenomenon?
  • How is it related to other factors, such as the
    distribution of genes and repetitive elements?

52
Haplotypes: A haplotype is defined as a
specific set of alleles (SNPs) observed on a
region (or the whole) of a given chromosome. When
comparing haplotypes from many individuals,
shared patterns are seen that occur with much
greater likelihood than would be estimated by
assuming each allele state was independent of the
others. This nonrandom association is called
linkage disequilibrium (LD).
53
  • How do SNPs shape haplotype structure?
  • SNPs in close physical proximity are often
    strongly correlated.
  • A shared haplotype segment is called a
    haplotype block.
  • There is substantial variation in block sizes
    across the genome.

Haplotype blocks: sizable regions over which
there is little evidence for historical
recombination and within which only a few common
haplotypes are observed.
54
  • These haplotype blocks are likely to result
    from the fact that recombination, that is, the
    re-shuffling of chromosome segments that occurs
    during formation of sex cells (meiosis),
    tends to occur in certain areas of the
    chromosomes more often than in others.
  • Thus, any single human chromosome is a mosaic of
    different haplotype blocks, where each block has
    its own pattern of variation. (5 kb-200 kb and
    4-5 common haplotypes)

Haplotype blocks tend to be shorter in Africa
than elsewhere.
The chromosomes of two hypothetical individuals
are shown. Each individual carries two copies of
each block (as humans carry two sets of
chromosomes).
55
  • Usually, block regions have only a few common
    haplotypes (each with a frequency of at least
    5%), which account for most of the variation
    from person to person in a population.
  • Those common haplotypes can be distinguished
    by a few tag SNPs, which represent most of the
    information on the pattern of genetic variation
    in the block region.

Three haplotypes are shown. The two SNPs in color
are sufficient to identify (tag) each of the
three haplotypes.
Theoretically, researchers could look for these
regions by genotyping 10 million SNPs. However,
the methods to do this are currently too
expensive. The HapMap will identify which
200,000 to 1 million tag SNPs provide almost as
much mapping information as the 10 million SNPs.
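The tagging idea on this slide (a few SNPs are enough to tell the common haplotypes apart) can be illustrated with a brute-force search. This is a toy sketch, not the HapMap procedure: the function name `minimal_tag_set` and the three example haplotypes are made up, the haplotypes are assumed to be distinct strings of equal length, and exhaustive search is only feasible for a handful of SNPs.

```python
from itertools import combinations

def minimal_tag_set(haplotypes):
    """Smallest tuple of SNP column indices whose alleles distinguish
    every haplotype; returns None if the haplotypes are not all distinct."""
    n_sites = len(haplotypes[0])
    for k in range(1, n_sites + 1):                      # try small sets first
        for cols in combinations(range(n_sites), k):
            signatures = {tuple(h[c] for c in cols) for h in haplotypes}
            if len(signatures) == len(haplotypes):       # every haplotype unique
                return cols
    return None

# Three hypothetical haplotypes over four SNP sites.
haps = ["AAGC", "ATGT", "GAGC"]
print(minimal_tag_set(haps))
```

Here no single site separates all three haplotypes, but two sites together do, which mirrors the two colored SNPs in the slide's figure.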
56
About the International HapMap Project
Five countries spent 138 million in total
  • Launched in October 2002.
  • The goal of the International HapMap Project is
    to develop a public haplotype map of the human
    genome, the HapMap, which will describe the
    common patterns of human DNA sequence variation.
  • The HapMap is expected to be a key resource for
    researchers to use to find genes affecting
    health, disease, and responses to drugs and
    environmental factors.

57
Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo
(JPT); Han Chinese in Beijing (CHB); CEPH samples
(Utah residents with ancestry from northern and
western Europe) (CEU)
HapMap data release 21, July 2006, on NCBI B35
assembly, dbSNP b125
58
How was the HapMap built?
  • Populations and samples
  • The DNA samples for the HapMap will come from a
    total of 270 people:
  • the Yoruba people in Ibadan, Nigeria (30
    both-parent-and-adult-child trios),
  • Japanese in Tokyo (45 unrelated individuals),
  • Han Chinese in Beijing (45 unrelated individuals),
  • the CEPH (30 trios).
  • These numbers of samples will allow the Project
    to find almost all haplotypes with frequencies
    of 5% or higher.
  • All of the new samples collected for the Project
    are being obtained with protocols approved by the
    appropriate ethics committees, after culturally
    appropriate processes of community engagement or
    public consultation and individual informed
    consent.

59
  • Genotype data
  • The project produces the genotypes of the 270
    individual samples and the frequencies of SNP
    alleles.

Affymetrix
Illumina
60
  • Haplotype phasing
  • Inferring the haplotype data from genotype data.

Genotype data -> statistical model -> haplotype data
  • Haplotype-block partition
  • The Project will use standard measures of SNP
    association, such as D' and r^2, partitioning
    the ENCODE region into block structure.
61
HapMap genotype data dump
62
HapMap file format (genotype)
63
HapMap genotype frequency data dump
  • refhom-gt refhom-freq refhom-count
  • het-gt het-freq het-count
  • otherhom-gt otherhom-freq otherhom-count
  • totalcount

64
HapMap allele frequency data dump
  • ref_allele ref-allele_freq ref-allele_count
  • other_allele other-freq other-allele_count
  • totalcount
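The allele-frequency fields follow directly from the genotype counts of the previous slide: each homozygote contributes two copies of one allele and each heterozygote one copy of each. A minimal sketch, assuming a biallelic SNP; the function name and the returned key names are chosen here to echo the column names and are not the actual HapMap dump parser.

```python
def allele_frequencies(refhom_count, het_count, otherhom_count):
    """Derive allele counts and frequencies from three genotype counts."""
    n_genotypes = refhom_count + het_count + otherhom_count
    if n_genotypes == 0:
        raise ValueError("no genotypes observed")
    total_alleles = 2 * n_genotypes              # two chromosomes per individual
    ref_count = 2 * refhom_count + het_count
    other_count = 2 * otherhom_count + het_count
    return {
        "ref_allele_count": ref_count,
        "ref_allele_freq": ref_count / total_alleles,
        "other_allele_count": other_count,
        "other_allele_freq": other_count / total_alleles,
        "totalcount": total_alleles,
    }

# Hypothetical SNP: 30 ref homozygotes, 20 heterozygotes, 10 other homozygotes.
print(allele_frequencies(30, 20, 10))
```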

65
Phased haplotypes
66
(No Transcript)
67
Haplotype-block partition: The Project will use
standard measures of SNP association, such as D'
and r^2, partitioning a region into block
structure.
As soon as these blocks are identified, only 3-4
SNPs within a block will be necessary to
characterize the haplotype of that entire segment
of the genome. These are called haplotype tag
SNPs. Across the entire genome, which may contain
up to 10 million common SNPs, it is anticipated
that only 500,000 tag SNPs (1/20 of the total)
will be necessary to identify more than 90% of
the haplotype blocks, greatly reducing the costs
of these studies.
68
(No Transcript)
69
(No Transcript)
70
(No Transcript)
71
36 adjacent SNPs in an ENCODE region (ENr131.2q37)
D' = 1 for all markers. In contrast, r^2 values
display a complex pattern, varying from 0.0003 to
1.0, with no relationship to physical
distance. Why?
Only seven haplotypes are observed (five seen
more than once) among the 120 parental CEU
chromosomes studied, reflecting shared ancestry
since their most recent common ancestor among
apparently unrelated individuals.
Simple LD pattern for the region: absence of
recombination
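One way to see why D' can equal 1 while r^2 is tiny is to compute both for a pair of SNPs with very different allele frequencies. The sketch below uses the standard two-locus definitions (D = p_AB - p_A * p_B, D' = |D| / D_max, r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))); the function name `ld_stats` and the example frequencies are illustrative assumptions, not HapMap data.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2.
    p_a, p_b: frequencies of allele A and allele B.
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# A rare allele (5%) that is only ever seen together with a common allele (50%):
# only three of the four possible haplotypes exist, so D' = 1, yet r^2 is small.
d, d_prime, r2 = ld_stats(p_ab=0.05, p_a=0.05, p_b=0.5)
print(round(d_prime, 3), round(r2, 3))
```

D' only asks whether all four haplotypes are present (evidence of recombination), whereas r^2 also depends on how well the allele frequencies match, so it need not track physical distance.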
72
Selection of tag SNPs for association studies
  • We refer to the set of SNPs genotyped in a
    disease study as tags. A given set of tags can be
    analysed for association with a phenotype using a
    variety of statistical methods which we term
    tests, based either on the genotypes of single
    SNPs or combinations of multiple SNPs.
  • The shared goal of all tag selection methods is
    to exploit redundancy among SNPs, maximizing
    efficiency in the laboratory while minimizing
    loss of information (this literature is extensive
    and varied, despite its youth).
  • Pairwise algorithm: SNPs are selected for
    genotyping until all common SNPs are highly
    correlated (r^2 > 0.8) with at least one tag.
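A greedy version of the pairwise approach can be sketched in a few lines. This is an illustration under assumptions, not the algorithm of any particular tool: the function name, the SNP names, and the r^2 matrix are made up, the matrix is assumed symmetric, and a greedy choice is not guaranteed to give the smallest possible tag set.

```python
def greedy_pairwise_tags(names, r2, threshold=0.8):
    """Pick tags until every SNP is itself a tag or has r^2 > threshold with one.

    names: list of SNP identifiers.
    r2: square list-of-lists, r2[i][j] = r^2 between SNP i and SNP j.
    """
    n = len(names)
    # covers[i]: SNPs captured if SNP i is genotyped (a SNP always covers itself).
    covers = [{j for j in range(n) if j == i or r2[i][j] > threshold}
              for i in range(n)]
    untagged = set(range(n))
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs;
        # ties go to the lowest index.
        best = max(range(n), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(names[best])
        untagged -= covers[best]
    return tags

names = ["rs1", "rs2", "rs3", "rs4"]          # hypothetical SNP names
r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.70, 0.10],
    [0.85, 0.70, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
print(greedy_pairwise_tags(names, r2))
```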

73
Resources for haplotype block analysis
(1). Software for Haplotype Inference
  • PHASE (http://www.stat.washington.edu/stephens/software.html)
  • Haplotyper (http://www.people.fas.harvard.edu/~junliu/Haplo/docMain.htm)
  • Haplore (http://bioinformatics.med.yale.edu/softwarelist.html)

(2). Software for Haplotype block partition and
Tag SNPs analysis
  • HapBlock (http://hto-b.usc.edu/msms/HapBlock/)
  • HaploBlockFinder (http://cgi.uc.edu/cgi-bin/kzhang/haploBlockFinder.cgi)
  • TagSNPs (http://www-rcf.usc.edu/~stram/tagSNPs.html)
  • Tagger (http://www.broad.mit.edu/mpg/tagger/)

(3). Haplotype Visualization Software
  • HaploView (http://www.broad.mit.edu/personal/jcbarret/haploview/)
  • Haplot (http://info.med.yale.edu/genetics/kkidd/programs.html)

74
Association Studies
  • One group of individuals is selected as cases
    and another as controls.
  • The case group consists of individuals who are
    diagnosed with some disease, react to some type
    of medicine, or are even especially healthy.
  • The control group consists of individuals who do
    not exhibit the feature selected for the case
    group.
  • For case-control studies, a selection of SNPs is
    genotyped in both the case and control groups,
    and those alleles that exhibit a higher incidence
    in the case group than in the control group are
    potential markers for the observed phenotype.
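The comparison of allele incidence between cases and controls is commonly tested with a chi-square test on a 2x2 table of allele counts. The sketch below is a minimal, standard-library illustration of that idea; the function name and the counts are made up, no continuity or multiple-testing correction is applied, and real studies use more careful statistics.

```python
import math

def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 degree of freedom) for a 2x2 allele-count table.

    case_a / case_b: counts of allele A / allele B among case chromosomes.
    ctrl_a / ctrl_b: the same counts among control chromosomes.
    Returns (chi2, p_value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("a row or column total is zero")
    chi2 = (n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
            / (row_case * row_ctrl * col_a * col_b))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: allele A on 60 of 100 case chromosomes, 40 of 100 controls.
chi2, p = allelic_chi_square(60, 40, 40, 60)
print(round(chi2, 2), round(p, 4))
```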

75
(No Transcript)
76
Nature, Vol. 445: 828, 2007
It is the largest GWA study so far, and tackles a
very common disease that is rising in prevalence
throughout the world.
The efforts to understand the interplay between
genetic and environmental risk factors in
generating the high frequency of the disease.
77
(No Transcript)
78
Other Applications
  • Evolution studies: the map may be biased because
    it prefers common SNPs to rare ones.
  • Examining genome architecture
  • Natural selection studies

79
Debate
  • Common mutations are behind most common diseases?
    (common diseases common mutations)
  • OR
  • Common diseases arise from combinations of rare
    mutations?

80
Major Analysis Tools

81
Outline
  • BLAST: Sequence Similarity Comparison
  • PSI-BLAST: Distant Similarity Comparison
  • ClustalW: Multiple Sequence Alignment
  • HMMER: More Sensitive Multiple Sequence
    Alignment
  • Phylip: Inferring Phylogenies

Comparison of protein and DNA sequence is one of
the foundations of bioinformatics.
82
  • Evolution
  • The theory of evolution is the foundation upon
    which all of modern biology is built.
  • From anatomy to behavior to genomics, the
    scientific method requires an appreciation of
    changes in organisms over time.
  • It is impossible to evaluate relationships among
    gene sequences without taking into consideration
    the way these sequences have been modified over
    time.
  • Relationships
  • Similarity searches and multiple alignments of
    sequences naturally lead to the question
  • How are these sequences related?
  • And more generally
  • How are the organisms from which these sequences
    come related?

83
BLAST
Basic Local Alignment Search Tool (BLAST) is the
tool most frequently used for calculating
sequence similarity.
84
Why BLAST?
  • Identify unknown sequences
  • Help gene/protein function and structure
    prediction: genes with similar sequences tend to
    share similar functions or structure.
  • Identify protein families: group related (paralog
    or ortholog) genes and their proteins into a
    family.
  • Prepare sequences for multiple alignments
  • And more

85
What is BLAST?
  • An Example of Sequence Comparison

86
Components of Sequence Alignment
  • Scoring function: a measure of similarity between
    elements (nucleotides, amino acids, gaps)
  • An algorithm for alignment
  • Confidence assessment of the alignment result.

87
Global vs. Local Alignment
  • Global alignment: the alignment of complete
    sequences
  • Good for comparing members of the same protein
    family
  • Needleman & Wunsch 1970. J Mol Biol 48:443
  • Local alignment: the alignment of segments of
    sequences
  • ignores areas that show little similarity
  • Smith & Waterman 1981. J Mol Biol 147:195
  • modified from the Needleman-Wunsch algorithm
  • can be done with heuristics (FASTA and BLAST)
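The local-alignment recurrence can be sketched compactly. The version below computes only the best Smith-Waterman score, using a linear gap penalty and simple match/mismatch values that are illustrative assumptions (real protein comparisons use a substitution matrix such as BLOSUM62 and affine gap costs).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Each cell holds the best score of an alignment ending at that pair of
    positions; clamping at 0 is what makes the alignment local.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0,                 # start a new local alignment
                         diag,              # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return best

# The shared segment ACG is found even though the flanks are unrelated.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

Dropping the clamp at 0 and scoring the final cell instead of the maximum would turn this into the global (Needleman-Wunsch) variant.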

88
BLAST Stages
  • Stage I
  • Find matching word pairs
  • Extend word pairs as much as possible, i.e., as
    long as the total weight increases
  • Result: High-scoring Segment Pairs (HSPs)
  • THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
  • INVIEIAMDEADMEATTNAMHEWASNINETEEN
  • Stage II
  • Try to connect HSPs by aligning the sequences in
    between them
  • THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
  • INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

89
BLAST Options
  • Composition-based statistics (Yes)
  • Sequence Complexity Filter (Yes)
  • Expect (E) value (10)
  • Word Size (3 or 11)
  • Substitution or Scoring Matrix (Blosum62)
  • Gap Insertion Penalty (11)
  • Gap Extension Penalty (1)

90
  • BLAST is a collection of programs with versions
    for query-to-database pairs such as
  • Query nucleotide <-> nucleotide DB: blastn
  • Query protein <-> protein DB <- nucleotide DB:
    tblastn
  • Query protein <-> protein DB: blastp
  • Query nucleotide -> query proteins <-> protein DB:
    blastx
  • Query nucleotide -> query protein <-> protein DB
    <- nucleotide DB: tblastx

91
(No Transcript)
92
BLAST programs (diagram): blastp, tblastn, blastn,
tblastx, blastx
93
BLAST Parameters
  • Identities - No. of exact residue matches
  • Positives - No. of identical and similar residue
    matches
  • Gaps - No. of gaps introduced
  • Score - Summed HSP score (S)
  • Bit Score - a normalized score (S')
  • Expect (E) - Expected number of chance HSP
    alignments
  • P - Probability of getting a score > X
  • T - Minimum word or k-tuple score (Threshold)

94
E-value
  • The probability that a variate would assume a
    value greater than or equal to the observed value
    strictly by chance: P(z >= zo)
  • If the E-value found for an alignment is low
    (<0.001), then the alignment is probably
    biologically meaningful.
  • Pre-compute the parameters based on a
    statistical model
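The relationship between raw score, bit score, E-value and P-value can be written out with the Karlin-Altschul formulas: S' = (lambda * S - ln K) / ln 2, E = K * m * n * exp(-lambda * S) = m * n * 2^(-S'), and P = 1 - exp(-E). The sketch below is a hedged illustration: the lambda and K values are assumptions of the kind precomputed for a given scoring system, the lengths are made up, and BLAST's length corrections are ignored.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized score S' from raw score S and the statistical parameters."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(e_value):
    """Probability of at least one chance HSP with such a score."""
    return 1.0 - math.exp(-e_value)

# Illustrative parameters only (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041
m, n = 250, 1_000_000_000        # query length and total database length

for s in (60, 100, 140):
    e = expect_value(s, m, n, LAMBDA, K)
    print(s, round(bit_score(s, LAMBDA, K), 1), f"{e:.2e}", f"{p_value(e):.2e}")
```

For small E the two quantities nearly coincide (P is approximately E), which is why a low E-value is read as evidence that the alignment is unlikely to be a chance hit.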

95
Low complexity issue
  • Watch out for
  • transmembrane or signal peptide regions
  • coiled-coil regions
  • short amino acid repeats (collagen, elastin)
  • homopolymeric repeats
  • BLAST uses SEG to mask low-complexity amino acid
    regions
  • BLAST uses DUST to mask low-complexity nucleotide
    regions
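The idea behind low-complexity masking can be illustrated with a sliding-window Shannon entropy filter. This is only a sketch of the principle and not the SEG or DUST algorithm; the window length, the entropy threshold and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)
```

A homopolymeric run such as a stretch of glutamines has entropy 0 and is masked completely, while a window of twelve different residues has entropy log2(12) and is left untouched.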

96
BLAST-related tools
  • BLAST2Sequences: finds local alignments between
    any two protein or nucleotide sequences.
  • MegaBLAST: a variation of BLASTN that has been
    optimized specifically for aligning either
    long or highly similar (> 95%) sequences and is the
    method of choice when looking for exact matches
    in nucleotide databases.
  • PSI-BLAST (position-specific iterated BLAST):
    particularly well suited for identifying
    distantly related proteins that may not have been
    found using the traditional BLASTP method.
  • BLAT (BLAST-Like Alignment Tool): similar to
    MegaBLAST in that it is designed to rapidly align
    longer nucleotide sequences having more than 95%
    similarity, but using a slightly different
    strategy than BLAST to achieve faster speed.

97
PSI-BLAST
Position-Specific Iterated (PSI)-BLAST is the
most sensitive BLAST program, making it useful
for finding very distantly related proteins
98
Distant similarity detection
  • Many functionally and evolutionarily important
    protein similarities are recognizable only
    through 3D structure comparison.
  • When such structures are not available, patterns
    of conservation identified from the alignment of
    related sequences can aid the recognition of
    distant similarities.
  • These conserved patterns are variously called
    motifs, profiles, position-specific score
    matrices, and Hidden Markov Models.
  • In essence, for each position in the derived
    pattern, every amino acid is assigned a score.
  • A highly conserved residue at a particular position
    is assigned a high positive score, and others are
    assigned high negative scores.
  • At weakly conserved positions, all residues
    receive scores near zero.
  • Position-specific scores can also be assigned to
    potential insertions and deletions.
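The scoring scheme described above can be sketched as a log-odds position-specific score matrix built from an ungapped-column view of an alignment. This is a minimal illustration under simplifying assumptions (uniform background frequencies, a flat pseudocount, no sequence weighting, no position-specific gap scores); it is not the procedure used by PSI-BLAST, and the function names are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplifying assumption
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of seq against the profile (no gaps)."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

With the alignment `["ACD", "ACE", "ACF"]`, the fully conserved first column gives A a clearly positive score and every other residue a negative one, while the variable third column gives D, E and F smaller positive scores.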

99
  • The power of profile methods can be further
    enhanced through iteration of the search
    procedure.
  • Position-specific scores improve the ability of
    successive BLAST iterations to detect remote
    homologs
  • Use PSI-BLAST when your standard protein-protein
    BLAST search either failed to find significant
    hits, or returned hits with descriptions such as
    "hypothetical protein" or "similar to..."

100
PSI-BLAST flow chart
  • Take a sequence
  • Search for similar sequences in a full sequence
    database
  • Sequences are multiply aligned:
    FGLGRT-I-T-YMTN
    -GLVRT-I---LGLE
    FGLLRT-I---YMTQ
  • Construct a profile, and represent conservation
    in each position numerically
  • A profile holds more information than a single
    sequence; use the profile to retrieve additional
    sequences
  • After several iterations of this procedure we
    have
  • Sequence information, incl. links to annotation
  • Several sets of multiple alignments.
  • Profiles, derived by us or by PSI-BLAST
  • Thresholding information (alignment statistics)

102
Multiple Alignment Methods
  • The most practical and widely used method in
    multiple sequence alignment is the hierarchical
    extension of pairwise alignment methods.
  • The principle is that a multiple alignment is
    achieved by successive application of pairwise
    methods.
  • Different algorithms for multiple alignment:
  • Carrillo-Lipman (MSA, DCA)
  • Segment based (Dialign)
  • Iterative (Profiles, HMMs)
  • Progressive (ClustalW, T-Coffee)
  • PSI-BLAST and iSCANPS

103
What is a multiple alignment?
  • An alignment that contains more than two
    sequences

VTISCTGSESNIGAG-NHVKWYQQLPG
VTISCTGTESNIGS--ITVNWYQQLPG
LRLSCSSSDFIFSS--YAMYWVRQAPG
LSLTCTVSETSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKEFYPSD--IAVEWWSNG--
104
Why it is important to accurately assess multiple
alignments
  • Natural extension of Pairwise Sequence Alignment
  • "Pairwise alignment whispers ... multiple
    alignment shouts out loud" (Hubbard et al., 1996)
  • Much more sensitive in detecting sequence
    relationships and patterns
  • Helps predict the secondary and tertiary
    structures of new sequences
  • Preliminary step in molecular evolution analysis
    using phylogenetic methods for constructing
    phylogenetic trees.
  • In order to characterize protein families,
    identify shared regions of homology in a multiple
    sequence alignment
  • This happens generally when a sequence search
    revealed homologies to several sequences.
  • Identify primers and probes to search for
    homologous sequences in other organisms

105
ClustalW for multiple alignment
  • ClustalW is a general-purpose multiple alignment
    program for DNA or proteins.
  • ClustalW was produced by Julie D. Thompson and Toby
    Gibson of the European Molecular Biology Laboratory,
    Germany, and Desmond Higgins of the European
    Bioinformatics Institute, Cambridge, UK.
  • Thompson, J. D., D. G. Higgins, et al. (1994).
    "Clustal-W - Improving the Sensitivity of
    Progressive Multiple Sequence Alignment through
    Sequence Weighting, Position-Specific Gap
    Penalties and Weight Matrix Choice." Nucleic
    Acids Research 22(22): 4673-4680.
  • ClustalW can create multiple alignments,
    manipulate existing alignments, do profile
    analysis and create phylogenetic trees.
  • The progressive algorithms in ClustalW are fast
    and reliable.

106
Tips for ClustalW
  • Provides global multiple sequence alignment.
  • Not constructed to perform local alignments.
  • No guarantee to find best alignment.
  • No scoring system.
  • The very first sequences to be aligned are the
    most closely related in the tree
  • if they align well, there will be few errors,
  • the more distantly related the more errors.
  • Choice of suitable scoring matrices and gap
    penalties
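The point that the most closely related sequences are aligned first can be illustrated by computing pairwise distances and picking the closest pair. This sketch assumes the sequences are already pairwise comparable (equal length, gaps as "-") and uses a plain fraction-of-differences distance; ClustalW itself derives its distances from pairwise alignments and builds a full guide tree, which is not reproduced here. The function names are made up for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length aligned sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_pair(seqs):
    """Return the names of the two least-distant sequences in a {name: sequence} dict."""
    return min(
        combinations(sorted(seqs), 2),
        key=lambda pair: p_distance(seqs[pair[0]], seqs[pair[1]]),
    )
```

In a progressive scheme, the pair returned by `closest_pair` would be aligned first, and more distant sequences or groups would be added afterwards.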

107
When to use?
  • for more closely related sequences and large
    number of sequences.
  • repeatedly realigns subgroups of sequences and then
    aligns these subgroups into a global alignment of
    all the sequences; the aim is to improve the overall
    alignment score.
  • selection of groups is based on the phylogenetic
    tree (separation of one or two sequences from the
    rest) similar to that of progressive alignment.

109
HMMER
HMMER is a freely distributable implementation of
profile Hidden Markov Models (HMMs) software for
protein sequence analysis.
110
  • Hidden Markov Models (HMMs)
  • HMMs are statistical models of the primary
    structure consensus of a sequence family.
  • HMM- or profile-based methods typically
    outperform pairwise methods in both alignment
    accuracy and database search sensitivity and
    specificity.
  • The advantage of HMMs is that HMMs have a formal
    probabilistic basis.

111
Why ?
  • Weight matrices do not deal with insertions and
    deletions.
  • In alignments, this is done in an ad-hoc manner
    by optimization of the two gap penalties for
    first gap and gap extension.
  • An HMM is a natural framework where
    insertions/deletions are dealt with explicitly.
  • Alignments can be used to construct hidden Markov
    models
  • These are basically statistical models of residue
    preferences and gap insertion/deletion penalties
    for a specific protein domain
  • The theory behind profile HMMs: R. Durbin, S.
    Eddy, A. Krogh, and G. Mitchison, Biological
    Sequence Analysis: Probabilistic Models of
    Proteins and Nucleic Acids, Cambridge University
    Press, 1998.
  • HMMER makes this easy.

112
HMMs are trained from a multiple alignment
114
PHYLIP
A Package of programs for inferring phylogenies
(evolutionary tree)
115
Phylogenetics
  • Evolutionary theory states that groups of similar
    organisms are descended from a common ancestor.
  • Phylogenetic systematics (cladistics) is a method
    of taxonomic classification of organisms based on
    their evolutionary history.
  • It was developed by Willi Hennig, a German
    entomologist, in 1950.
  • Major reasons to use phylogenetics:
  • Understand the lineage of different species
  • Organizing principle to sort species into a
    taxonomy
  • Understand how various functions evolved
  • Understand forces and constraints on evolution
  • Perform multiple sequence alignment
  • Predict gene function (phylogenetic footprint)

116
Species/Gene Trees
  • Species tree (how are my species related?)
  • contains only one representative from each
    species
  • when did speciation take place?
  • all nodes indicate speciation events
  • Gene tree (how are my genes related?)
  • often contains a number of genes from a single
    species
  • nodes relate either to speciation or gene
    duplication events
  • Your sequence data may not have the same
    phylogenetic history as the species from which
    they were isolated
  • Different genes evolve at different speeds, and
    there is always the possibility of horizontal
    gene transfer (HGT).

117
Species tree
118
Gene tree
119
Using DNA or protein sequences
  • > 70% identity: use DNA (takes more time).
  • < 70% identity: use the protein translation (if
    possible) and thread the DNA sequences back onto
    the protein alignment.
  • DNA is easier and more accurate, however protein
    is reasonably good.

120
What is Phylip?
  • A Package of programs for inferring phylogenies
    (evolutionary tree)
  • PHYLIP is the most widely-distributed phylogeny
    package, and competes with PAUP to be the one
    responsible for the largest number of published
    trees.
  • It is written and distributed by Joe Felsenstein
    and collaborators (some of the following is
    copied from the PHYLIP homepage).
  • Available free over the internet for
    Windows95/98/3.x/NT, DOS, PowerMac and 68k
    Macintosh systems.
  • PHYLIP has been in distribution since 1980, and
    has over 15,000 registered users.
  • Methods available in the package include
    parsimony, distance matrix, and likelihood
    methods.
  • Also includes bootstrapping and consensus trees

121
Phylip input
  • Multiple alignment in Phylip format
  • Carry out multiple alignment using clustalw
  • Save the output in phylip format (.phy)
  • Use .phy as an input to phylip
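For reference, a PHYLIP alignment file starts with a line giving the number of sequences and the alignment length, followed by one entry per sequence whose name occupies the first 10 characters. The sketch below writes that layout with each sequence on a single line. The function name is hypothetical; in practice the file would normally be produced by the alignment program itself, as described above.

```python
def to_phylip(seqs):
    """Format a {name: aligned sequence} dict as PHYLIP text (one line per sequence)."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(seqs)} {lengths.pop()}"]
    for name, seq in seqs.items():
        # names are truncated or space-padded to exactly 10 characters
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Because names are cut to 10 characters, long names that share a prefix would collide; renaming the sequences before export avoids that.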

122
Programs
  • Seqboot
  • Generates a set of bootstrap-resampled alignments
    for confidence determination.
  • PROTPARS
  • Estimates phylogenies from protein sequences
    (input using the standard one-letter code for
    amino acids) using the parsimony method, in a
    variant which counts only those nucleotide
    changes that change the amino acid, on the
    assumption that silent changes are more easily
    accomplished.
  • DNAPARS
  • Estimates phylogenies by the parsimony method
    using nucleic acid sequences. Allows use of the full
    IUB ambiguity codes, and estimates ancestral
    nucleotide states. Gaps treated as a fifth
    nucleotide state. It can also do transversion
    parsimony. Can cope with multifurcations,
    reconstruct ancestral states, use 0/1 character
    weights, and infer branch lengths.
  • NEIGHBOR
  • Estimate phylogenies from a distance matrix by
    the Neighbor-joining method or UPGMA (Average
    Linkage Clustering) method.
  • CONSENSE
  • Compute consensus tree from bootstrapped tree
    data.
  • See details and other programs:
  • http://evolution.genetics.washington.edu/phylip/doc/main.html#programs
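To make the distance-matrix idea behind NEIGHBOR concrete, here is a compact UPGMA (average linkage clustering) sketch. It is an independent illustration, not PHYLIP's code: it returns only the tree topology as nested tuples, omits branch lengths, and the function name and input layout (a list of names plus a square distance matrix) are assumptions made for the example.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA and return the tree topology as nested tuples."""
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    d = [list(row) for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # find the closest pair of clusters
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda pair: d[pair[0]][pair[1]],
        )
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # size-weighted average distance from the merged cluster to each remaining one
        new_row = [
            (d[i][k] * size_i + d[j][k] * size_j) / (size_i + size_j) for k in keep
        ]
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]
```

UPGMA assumes a constant rate of evolution across lineages; Neighbor-joining does not make that assumption and is not shown here.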

123
Steps
  • Bootstrap the sequence data
  • Generate Phylogenetic trees using Parsimony/ML/NJ
  • Consensus tree generation
  • Plot the tree
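The first step, bootstrapping the sequence data, means resampling alignment columns with replacement to produce replicate alignments of the same length. The sketch below illustrates that idea only; it is not Seqboot's code, and the function names, the default replicate count and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample the columns of a {name: aligned sequence} dict with replacement."""
    names = list(seqs)
    n_cols = len(seqs[names[0]])
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    # the same sampled columns are used for every sequence, so columns stay intact
    return {name: "".join(seqs[name][c] for c in columns) for name in names}


def bootstrap_replicates(seqs, n_replicates=100, seed=0):
    """Return a list of bootstrap-resampled alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(seqs, rng) for _ in range(n_replicates)]
```

A tree would then be built from each replicate, and the consensus step reports how often each grouping appears across the replicate trees.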
