Title: An Introduction to Bioinformatics
1An Introduction to Bioinformatics
- Laboratory of Computational Molecular Biology
- College of Life Sciences
- Beijing Normal University
2Image from http://microscopy.fsu.edu
3The Timeline of Landmark Accomplishments in Genetics and Genomics
- 1866 - Gregor Mendel's discovery of the laws of heredity
- 1944 - Recognition of DNA as the hereditary material
- 1953 - Determination of DNA structure
- 1963 - Elucidation of the genetic code
- 1972/3 - Development of recombinant DNA technologies
- Establishment of increasingly automatable methods for DNA sequencing (1975, 1977, 1986, 2001)
- 1990 - The stage for the Human Genome Project (HGP)
- 2001 - The IHGSC and Celera Genomics each reported draft sequences providing a first overall view of the human genome
- 2005 - Human haplotype map (HapMap)
4Genome Sequencing Technology
- Traditional clone-by-clone approach
- High-throughput whole genome shotgun (WGS) sequencing and assembly
- Advances in assembly algorithms
- Challenges: repetitive sequences, diploid genomes, etc. make correct assembly difficult, so more robust automated methods are needed.
Next-generation sequencing outpaces expectations. Nature Biotechnology 2007, Vol. 25: 149.
- Several orders of magnitude more efficient than Sanger capillary-array electrophoresis (CAE) machines.
- 1,000 genome program: 600 megabases per day -> 100 billion bases per day
- Targeting resequencing efforts aimed at finding genetic variations and rare mutations that contribute to complex diseases.
5Why is There Bioinformatics?
- High-throughput molecular biological technologies
- Automated sequencers
- Genomes, ESTs, SNPs, etc.
- DNA arrays for large-scale gene expression
- Proteomics platforms
- MS/MS, LC/MS/MS (liquid chromatography), protein-protein interaction experiments, genome-wide localization techniques
- Massive datasets produced
- GenBank (as of August 2006)
- 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions
- 80,369,977,826 bases in 17,960,667 sequence records in the WGS division.
6Is Biology an Informational Science?
- The HGP changed how we view and practice biology.
- Biology is an informational science.
- Digital genome
- Environmental signals
- Biology has become a cross-disciplinary science.
From U.S. Department of Energy Human Genome Program. http://www.ornl.gov/hgmis
7Bioinformatics as an intersecting discipline
Developing the high throughput technologies and
computational/mathematical tools required for
this new biology.
8Why? Where? What? How?
- Why: What are the ideas behind producing these huge datasets? Biological background is needed.
- Where: Raw data need to be stored; IT platforms are required.
- What: Patterns in datasets that can be analyzed using computers. Various data models and their respective algorithms are needed.
- How: Different resources need to be integrated.
9What is Bioinformatics?
- The field of biology specializing in developing hardware and software to store and analyze the huge amounts of data being generated by life scientists. (NIH)
- More than 20 different definitions can be found on Google!
Bioinformatics applications: data analysis, data integration, various molecular biology databases
10Key Challenge of Bioinformatics
- The world of biology is very different from what it was even ten years ago.
- To bridge the considerable gap between technical data production and its use by scientists for biological discovery.
11Main resources for Bioinformatics
- Databases
- GenBank, EMBL, DDBJ
- UniProt/SwissProt, InterPro, Pfam, SCOP, PDB
- Gene Ontology, KEGG
- Algorithms and Applications
- BLAST, FASTA, BLAT
- ClustalW, HMMer
- Phylip
13Prokaryotic Genome Sequencing Projects: 1138 (complete 461, draft assembly 307, in progress 370)
Eukaryotic Genome Sequencing Projects: 257 (complete 24, draft assembly 95, in progress 138)
(NCBI, 2007-2-27, http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html)
14NCBI's Tools for Data Mining
- Nucleotide Sequence Analysis
- Model Maker - allows you to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence to build a gene model and to edit the model by selecting or removing putative exons.
- Protein Sequence Analysis and Proteomics
- CD Search - search the Conserved Domain Database with Reverse Position Specific BLAST.
- Structures
- Genome Analysis
- Entrez Genomes - whole genomes of over 1000 organisms. It provides graphical overviews of complete genomes/chromosomes and the ability to explore regions of interest in progressively greater detail.
- Gene Expression
- The Cancer Genome Anatomy Project - aims to decipher the molecular anatomy of cancer cells. CGAP develops profiles of cancer cells by comparing gene expression in normal, precancerous, and malignant cells from a wide variety of tissues.
15What are the Hot Topics in Bioinformatics?
- Comparative genomics
- Genetic variation analysis
- Microarray-based gene expression analysis
- Systems biology
The idea that we can study the interactions of all elements in a biological system and from these come to understand its emergent properties.
- Gene -> Genome: all of the genes in an organism
- Transcript -> Transcriptome: transcripts expressed in a particular biological process
- Protein -> Proteome: the world of expressed proteins
- Interaction -> Interactome: the interactions between molecules
Genomics, Transcriptomics, Proteomics, Metabolomics/Interactomics
16[Figure: client-server architecture. Clients: browsers, light applications. Network: intranet and/or Internet. Servers: WWW servers, database servers, intensive computing servers. Technologies shown: HTML, Perl/C/C++/Java, MySQL, bioinformatics applications, HPC with MPI.]
17The future of genomics research. Nature 422: 835-847 (2003).
18- "Nothing in biology makes sense except in the light of evolution." (Theodosius Dobzhansky)
- "The evolutionary process finds a way to create exceptions to every model we propose." (Austin L. Hughes)
19Biological Data Objects
Genomes Protein sequences
20Sequence assembly by the shotgun approach
Building up the master sequence directly from the
short sequences obtained from individual
sequencing experiments, simply by examining the
sequences for overlaps. This is called the
shotgun approach. It does not require any prior
knowledge of the genome and so can be carried out
in the absence of a genetic or physical map.
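The overlap idea can be sketched in a few lines of Python. This is only a toy greedy merger under simplifying assumptions (error-free reads, one strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for illustration, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping 12-base reads, given out of order
reads = ["GTAGCGTCCATG", "CGGTTGAAAGCG", "AAAGCGGTAGCG"]
print(greedy_assemble(reads))
```

Repetitive sequence is exactly where this greedy strategy breaks down, which is why the challenges listed earlier call for more robust methods.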
21So what are you going to do with this?
How do you find where the genes are?
CGGTTGAAAGCGGTAGCGTCCATGCGTATTACTCTTGAGCGGTCGAACCT
TCTGAAATCGCTGAACCACGTCCACCGGGT CGTCGAGCGTCGCAACACG
ATCCCGATCCTGTCCAACGTTCTGCTGCGCGCCTCCGGCGCCAATCTGGA
CATGAAGGCGA CCGACCTCGATCTGGAAATCACCGAAGCGACCCCGGCC
ATGGTGGAGCAGGCTGGCGCCACCACCGTACCGGCACACCTG CTTTACG
AAATCGTGCGCAAGCTGCCGGATGGTTCCGAAGTGCTTCTGGCGACCAAC
CCGGACGGCTCCTCCATGACCGT TGCGTCCGGCCGCTCGAAATTCTCGC
TGCAATGCCTGCCGGAAGCGGATTTCCCTGACCTCACCGCCGGCACCTTC
AGCC ACACCTTCAAACTGAAGGCGGCCGATCTGAAGATGCTGATCGACC
GGACGCAGTTTGCGATTTCGACCGAAGAGACGCGT TATTACCTGAACGG
CATTTTCTTCCACACCATCGAAAGCAATGGCGAGCTGAAACTGCGCGCCG
TCGCCACCGACGGTCA CCGCCTTGCGCGTGCTGACGTCGATGCGCCCTC
CGGCTCCGAAGGCATGCCGGGCATCATCATTCCGCGCAAGACCGTCG GT
GAACTGCAGAAGCTGATGGACAATCCGGAACTGGAAGTCACAGTCGAAGT
CTCGGATGCGAAGATCCGCCTGGCCATC GGTTCCGTCGTTCTGACCTCG
AAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGG
CAACGACAA GGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTG
GACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCG TGAAGCTGG
CGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGA
AGTGCTACCGAAGAAGTGGCC GTTGGCTACGACAATGATTCGATGGAAA
TCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGC
GA AGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTC
GCGATACCGCCGGCGACGACGCACTCTATGTTC TGATGCCGATGCGCGT
TTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCT
TCATCGGCGTTTTT TGATTCGGCGAACAGGTGGCTCTACCCGTAACTGA
ATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG ATCA
ACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCA
TGGCACTCAACCTGAAGCAACGGCTT GAACAAAAATTTGAGGAAGAAAT
CCGCTTTTTCAAAGGTATGGTCAGCCAGCCGAAAAAAGTCGGCGCCATTG
TCCCGAC GGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTT
CCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAA GGAAATGCGCG
TCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCT
TCCGAGCGCGGCCGCGCCG TGAAGCTGGCGCTAACTGACGGCCAGTTGA
CGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCC
GTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATC
TCCTCGACATCACGTCGCAGCTCTCCGGCGA AGATGCGATTTTTCTGCT
GGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACG
CACTCTATGTTC TGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTC
AATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTT TGATTC
GGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTT
TGCCTTGTTTTTGCGCCAAATGGG ATCAACAGTACGTAACAATTTTTTG
ACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAAC
GGCTT
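One simple first pass at the question "where are the genes?" is to scan all six reading frames for open reading frames (ORFs). The sketch below is illustrative only: `find_orfs` and `min_codons` are hypothetical names, it assumes an ATG start and the three standard stop codons, and it ignores introns, so it is at best a rough fit for prokaryotic-style sequence.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Return (strand, start, end, n_codons) for every ATG...stop ORF.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. `n_codons` counts ATG but not the stop.
    """
    seq = re.sub(r"\s+", "", seq).upper()  # tolerate wrapped sequence text
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs


print(find_orfs("CCATGGCTGCTGCTGCTTAAGG", min_codons=3))
```

Real gene finders add statistical models of coding sequence, splice sites, and evidence from mRNAs and ESTs.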
22Three Layers of Genome Annotation. From Stein, L. 2001. Nature Reviews Genetics 2: 493-503
23- Human Genome Project
- 1996: first eukaryotic genome (Saccharomyces cerevisiae) sequenced and stored in NCBI's Genomes division. (Chromosome size: 270 kb - 1500 kb)
- 1997: Entrez Genomes was able to provide the first graphical views of genomic sequence data (genetic, physical and cytogenetic maps).
- 2001.2: the working draft of the human genome (Lander et al., 2001). Chromosome size: 46 Mb - 246 Mb.
- NCBI created the first human Map Viewer; UCSC's human Genome Browser; EBI's Ensembl, a system to annotate automatically the human genome sequence as well as to store and visualize the data.
- Today, each site provides free access not only to human genome sequence data, but also to other assembled genomic sequences.
24- A simple description of each browser
- The backbone is an assembled genomic sequence.
- UCSC (May 2000) and NCBI assembled BAC sequences into longer contigs.
- The NCBI assembly is now displayed by all three browsers (UCSC stopped producing assemblies in Dec 2001).
- The sequence is known by its build number at NCBI, and by a date at UCSC. (March 2006, NCBI Build 36.1, hg18)
- Each of the three genome browsers provides its own annotation of the common assembled sequence.
- Location of genes, both known and predicted.
- Different sources of mRNAs, different alignment and prediction methods.
- ESTs and SNPs, the location of STS markers.
- Homologous sequences from other organisms.
25Depending on the state of sequencing project,
genomic coordinates along the chromosome may
change dramatically from assembly to assembly.
26DNA Sequence Databases
Data flow for new submissions and updates between the three databases
27The RefSeq project (NCBI)
- Providing a reference sequence for each molecule in the central dogma (DNA, mRNA, and protein).
- Distinct accession number series (2+6 format)
- NT_123456 Genomic contig (DNA)
- NM_123456 mRNA
- NP_123456 Protein
- XM_123456 Model mRNA
- XP_123456 Model protein
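The prefix scheme above lends itself to a small lookup. The sketch below covers only the five prefixes listed on this slide; the function name `classify_refseq` is hypothetical, and accepting more than six digits and an optional `.version` suffix is an assumption added to be tolerant of longer accessions.

```python
import re

# Prefix meanings exactly as listed on the slide
REFSEQ_PREFIXES = {
    "NT": "Genomic contig (DNA)",
    "NM": "mRNA",
    "NP": "Protein",
    "XM": "Model mRNA",
    "XP": "Model protein",
}

# Two letters, underscore, six or more digits, optional ".version" (assumption)
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_refseq(accession: str):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ("NM_123456", "XP_123456.2", "AB_12"):
    print(acc, "->", classify_refseq(acc))
```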
28Protein Sequence Database (UniProt)
- Swiss-Prot protein database (early 1980s, Amos Bairoch) and TrEMBL
- Combining Swiss-Prot, TrEMBL, and PIR-PSD together.
- Three major components
- The UniProt Archive (UniParc), into which new and updated sequences are loaded
- The UniProt Knowledgebase, with the goal of providing an expertly curated database.
- The UniProt nonredundant reference database (UniRef)
29The flow of data from primary sources into the component databases of the Universal Protein Resource
Complete coverage of all predicted coding
sequences from human, mouse, and rat.
31MeSH Terms: NLM's Medical Subject Headings, a controlled vocabulary of biomedical terms that is used to describe the subject of each journal article in MEDLINE. MeSH contains more than 23,000 terms and is updated annually to reflect changes in medicine and medical terminology. MeSH terms are arranged hierarchically by subject categories, with more specific terms arranged beneath broader terms. PubMed allows you to view this hierarchy and select terms for searching in the MeSH Database.
32Lincoln D. Stein
33Sequence Databases Beyond NCBI
- NCBI is the center of the sequence universe
- There are many specialized sequence databases throughout the world that serve specific groups in the scientific community.
35Biological Data Objects
SNPs Human variants
36Why study genetic variation?
Saccharomyces cerevisiae
Caenorhabditis elegans
Drosophila melanogaster
Takifugu rubripes
Xenopus tropicalis
Gallus gallus
Mus musculus
Pan troglodytes
Homo sapiens
(different population variations)
Population A
Population B
Population C
37Feuk L. et al., 2006. Nature Reviews Genetics 7: 85-97
Disease susceptibility
38Feuk L. et al., 2006. Nature Reviews Genetics 7: 85-97
[Figure labels lost in extraction; recoverable terms: pericentric and paracentric inversions, uniparental disomy]
39Science 315, 848-853 (2007)
40- With the advent of molecular biology, and DNA sequencing in particular, smaller and more abundant alterations were observed. Such differences include:
- Single Nucleotide Polymorphisms (SNPs)
- various repetitive elements that involve relatively short DNA sequences, e.g., micro- and minisatellites
- small (usually <1 kb) insertions, deletions, inversions and duplications.
- It was presumed that these small-scale variants constitute most genetic variation. For example, estimates predict that there are at least 10 million SNPs within the human population, averaging 1 every 300 nucleotides among the 3 billion nucleotide base pairs that constitute the genome of an individual.
41Genetic (genomic) variation
- Mutations, which fundamentally are produced by errors in DNA replication, are the ultimate source of genetic variation in a population.
- The frequency of alleles depends on the evolutionary forces that have been applied to the population (e.g., genetic drift, selection, migration) and on the age of the mutation itself.
- When the mutant allele exceeds 1% representation in a population, we refer to the mutation as a polymorphism, with no implied association with phenotype.
42- Single nucleotide polymorphisms (SNPs): polymorphisms in which alleles are defined by single or few base changes; sometimes the term also covers deletions and insertions of one or a few bases.
- Short tandem repeats (STRs): a number of repeats of short, 2- to 5-nt subsequences. Typically, they are scattered throughout the genome and are flanked by unique sequences that can be used to target a PCR amplification of the locus. Most STRs are variable in copy number, making them valuable genetic markers.
- Seq1 GTCACTGACACACACAGTACGT
- Seq2 GAC CTGACACACACACACACACACAGTACG
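The copy-number difference between Seq1 and Seq2 can be counted programmatically. This is a minimal sketch: `longest_repeat_run` is a hypothetical helper, and the choice of "AC" as the repeat unit is an assumption read off the two example sequences.

```python
import re


def longest_repeat_run(seq: str, unit: str = "AC") -> int:
    """Number of copies in the longest uninterrupted run of `unit`."""
    seq = re.sub(r"\s+", "", seq).upper()  # ignore spaces in pasted sequence
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


seq1 = "GTCACTGACACACACAGTACGT"
seq2 = "GAC CTGACACACACACACACACACAGTACG"
print(longest_repeat_run(seq1), longest_repeat_run(seq2))
```

In the laboratory the same information is read from the size of the PCR product rather than from the sequence itself.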
43- SNP Discovery methods
- Pairwise sequence comparison: aligning two sequences from different individuals' DNA and looking for high-quality sequence differences.
- Deep resequencing: using PCR resequencing of DNA samples to find SNPs in a well-defined region of the genome.
- Separation methods: slab gel, capillary electrophoresis (CE), MS, DHPLC (denaturing high performance liquid chromatography), etc.
- Homogeneous assays: TaqMan, Invader, etc.
- Solid-phase assays: beads, DNA arrays (chips), etc.
45- NCBI dbSNP content
- Original submitter-supplied data (ss)
- Integrated postsubmission content
- refSNP clusters (rs)
- Mapping results
- Functional analysis
- Average levels of diversity
46Alleles define the class of the variation
- SNP
- in-del: DIPs, deletion represented by -
- het: variation has unknown sequence composition, but is observed to be heterozygous.
- microsatellite: microsatellite / simple sequence repeat / STR
- mixed
- MNP: multiple nucleotide polymorphism (all alleles same length, where length > 1).
- named: allele sequences defined by name tag, e.g., (Alu)/-
- no variation: invariant region in surveyed sequence.
- The dbSNP database is released to the public in periodic builds that are synchronized with current genome assemblies.
47Resource Integration
- The clustered data from dbSNP provide a non-redundant set of variations for each organism in the database.
- The non-redundant set of variations (refSNP cluster set) is annotated on reference genome sequence contigs, chromosomes, mRNAs, and proteins as part of the NCBI RefSeq project.
- Summary properties are recomputed for each refSNP cluster and are used to build fresh indices for NCBI's Entrez query system and to update variation maps or tracks in genome browsers (e.g., Ensembl, UCSC Genome Browser).
48Variation Function Class
- Defined by the position of the variation relative to the structure of the aligned transcript (a variation may have several different functional relationships to a gene, and multiple, potentially different relationships to its local gene neighbors).
- locus region: near a gene
- mRNA UTR (untranslated region)
- intron
- splice site
- coding region
- synonymous
- nonsynonymous
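For a variation inside a coding region, the synonymous/nonsynonymous call comes down to translating the codon before and after the change. The following is a sketch, not dbSNP's actual procedure: it assumes the standard genetic code, and the helper name `classify_coding_snp` and its arguments are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(codon: str, pos: int, alt: str) -> str:
    """Classify a single-base change at 0-based position `pos` of `codon`."""
    codon = codon.upper()
    mutant = codon[:pos] + alt.upper() + codon[pos + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if same else "nonsynonymous"


print(classify_coding_snp("GAA", 2, "G"))  # GAA and GAG both encode Glu
print(classify_coding_snp("GAG", 1, "T"))  # GAG (Glu) becomes GTG (Val)
```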
49Genotyping
- The process of determining the allele states of selected polymorphisms in selected groups of individuals.
- How to genotype a known STR?
- PCR primers are designed on either side of the STR.
- The amplified product from a DNA sample is size-fractionated by electrophoresis to determine the number of copies present.
- An individual who is homozygous for one of the allele states will show a single predominant band; those who are heterozygous will show two distinct bands.
- For STR polymorphisms, there can be many different copy numbers present in the population, so the copy number(s) is important to record for each sample.
51- Uneven distribution
- The distribution of SNPs departs significantly from what would be predicted under the standard population genetics models -> SNP hot spots and cold spots.
- What combination of historical, structural or selective pressures is responsible for this phenomenon?
- How is it related to other factors, such as the distribution of genes and repetitive elements?
52Haplotypes: A haplotype is defined as a specific set of alleles (SNPs) observed on a region (or the whole) of a given chromosome. When comparing haplotypes from many individuals, shared patterns are seen that occur with much greater likelihood than would be estimated by assuming each allele state was independent of the others. This nonrandom association is called linkage disequilibrium (LD).
53- How do SNPs shape haplotype structure?
- SNPs in close physical proximity are often strongly correlated.
- A shared haplotype segment is called a haplotype block.
- There is substantial variation in block sizes across the genome.
Haplotype blocks: sizable regions over which there is little evidence for historical recombination and within which only a few common haplotypes are observed.
54- These haplotype blocks are likely to result from the fact that recombination, that is, the re-shuffling of chromosome segments that occurs during formation of sex cells (meiosis), tends to occur in certain areas of the chromosomes more often than in others.
- Thus, any single human chromosome is a mosaic of different haplotype blocks, where each block has its own pattern of variation (5 kb-200 kb and 4-5 common haplotypes).
Haplotype blocks tend to be shorter in Africa than elsewhere.
The chromosomes of two hypothetical individuals are shown. Each individual carries two copies of each block (as humans carry two sets of chromosomes).
55- Usually, block regions have only a few common haplotypes (each with a frequency of at least 5%), which account for most of the variation from person to person in a population.
- Those common haplotypes can be distinguished by a few tag SNPs, which represent most of the information on the pattern of genetic variation in the block region.
Three haplotypes are shown. The two SNPs in color are sufficient to identify (tag) each of the three haplotypes.
Theoretically, researchers could look for these regions by genotyping 10 million SNPs. However, the methods to do this are currently too expensive. The HapMap will identify which 200,000 to 1 million tag SNPs provide almost as much mapping information as the 10 million SNPs.
56About the International HapMap Project
Five countries spent 138 million in total
- Launched in October 2002.
- The goal of the International HapMap Project is to develop a public haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation.
- The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors.
57Populations: Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo (JPT); Han Chinese in Beijing (CHB); CEPH, Utah residents with ancestry from northern and western Europe (CEU)
HapMap data release 21, July 2006, on NCBI B35
assembly, dbSNP b125
58How was the HapMap built?
- Populations and samples
- The DNA samples for the HapMap will come from a total of 270 people:
- the Yoruba people in Ibadan, Nigeria (30 both-parent-and-adult-child trios),
- Japanese in Tokyo (45 unrelated individuals),
- Han Chinese in Beijing (45 unrelated individuals),
- the CEPH (30 trios).
- These numbers of samples will allow the Project to find almost all haplotypes with frequencies of 5% or higher.
- All of the new samples collected for the Project are being obtained with protocols approved by the appropriate ethics committees, after culturally appropriate processes of community engagement or public consultation and individual informed consent.
59- Genotype data
- The project produces the genotypes of the 270 individual samples and the frequencies of SNP alleles.
Affymetrix
Illumina
60- Haplotype phasing
- Inferring the haplotype data from genotype data.
Genotype data -> statistical model -> haplotype data
- Haplotype-block partition
- The Project will use standard measures of SNP association, such as D' and r2, partitioning the ENCODE region into block structure.
61HapMap genotype data dump
62HapMap file format (genotype)
63HapMap genotype frequency data dump
- refhom-gt refhom-freq refhom-count
- het-gt het-freq het-count
- otherhom-gt otherhom-freq otherhom-count
- totalcount
64HapMap allele frequency data dump
- ref_allele ref-allele_freq ref-allele_count
- other_allele other-freq other-allele_count
- totalcount
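The allele-frequency columns follow directly from the genotype counts, since every individual carries two chromosomes. The sketch below shows that arithmetic; the function name and the returned dictionary keys are hypothetical and only loosely mirror the column names above, and reporting the total as a chromosome count is an assumption.

```python
def allele_frequencies(refhom_count: int, het_count: int, otherhom_count: int):
    """Derive allele counts and frequencies from three genotype counts."""
    individuals = refhom_count + het_count + otherhom_count
    if individuals == 0:
        raise ValueError("no genotypes observed")
    chromosomes = 2 * individuals  # diploid: two alleles per individual
    ref_count = 2 * refhom_count + het_count
    other_count = 2 * otherhom_count + het_count
    return {
        "ref_allele_count": ref_count,
        "ref_allele_freq": ref_count / chromosomes,
        "other_allele_count": other_count,
        "other_allele_freq": other_count / chromosomes,
        "total_chromosomes": chromosomes,
    }


freqs = allele_frequencies(refhom_count=30, het_count=20, otherhom_count=10)
print(freqs)
# A variant is conventionally called a polymorphism above 1% frequency
print(min(freqs["ref_allele_freq"], freqs["other_allele_freq"]) > 0.01)
```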
65Phased haplotypes
67Haplotype-block partition: The Project will use standard measures of SNP association, such as D' and r2, partitioning a region into block structure.
As soon as these blocks are identified, only 3-4 SNPs within a block will be necessary to characterize the haplotype of that entire segment of the genome. These are called haplotype tag SNPs. Across the entire genome, which may contain up to 10 million common SNPs, it is anticipated that only 500,000 tag SNPs (1/20 of the total) will be necessary to identify more than 90% of the haplotype blocks, greatly reducing the costs of these studies.
71Thirty-six adjacent SNPs in an ENCODE region (ENr131.2q37): D' = 1 for all markers. In contrast, r2 values display a complex pattern, varying from 0.0003 to 1.0, with no relationship to physical distance. Why?
Only seven haplotypes are observed (five seen more than once) among the 120 parental CEU chromosomes studied, reflecting shared ancestry since their most recent common ancestor among apparently unrelated individuals.
Simple LD pattern for the region: absence of recombination
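The "Why?" has a numerical answer: D' reaches 1 whenever at least one of the four possible two-SNP haplotypes is absent, whereas r2 is also sensitive to how well the allele frequencies match. The sketch below computes both from haplotype counts using the standard definitions; the function name `ld_stats`, the allele labels A/a and B/b, and the counts in the example are invented for illustration.

```python
def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, D', r2) for two biallelic SNPs from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    D = p_AB - p_A * p_B
    # D' rescales D by the largest value it could take at these frequencies
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


# 120 chromosomes, haplotype Ab never observed (no recombinant seen)
D, d_prime, r2 = ld_stats(n_AB=2, n_Ab=0, n_aB=58, n_ab=60)
print(round(d_prime, 3), round(r2, 4))
```

With a rare allele at one SNP and a common allele at the other, D' stays at 1 while r2 drops close to zero, which is the pattern described on this slide.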
72Selection of tag SNPs for association studies
- We refer to the set of SNPs genotyped in a disease study as tags. A given set of tags can be analysed for association with a phenotype using a variety of statistical methods which we term tests, based either on the genotypes of single SNPs or combinations of multiple SNPs.
- The shared goal of all tag selection methods is to exploit redundancy among SNPs, maximizing efficiency in the laboratory while minimizing loss of information (this literature is extensive and varied, despite its youth).
- Pairwise algorithm: SNPs are selected for genotyping until all common SNPs are highly correlated (r2 > 0.8) with at least one tag.
73Resources for haplotype block analysis
(1). Software for Haplotype Inference
- PHASE (http://www.stat.washington.edu/stephens/software.html)
- Haplotyper (http://www.people.fas.harvard.edu/junliu/Haplo/docMain.htm)
- Haplore (http://bioinformatics.med.yale.edu/softwarelist.html)
(2). Software for Haplotype block partition and Tag SNPs analysis
- HapBlock (http://hto-b.usc.edu/msms/HapBlock/)
- HaploBlockFinder (http://cgi.uc.edu/cgi-bin/kzhang/haploBlockFinder.cgi)
- TagSNPs (http://www-rcf.usc.edu/stram/tagSNPs.html)
- Tagger (http://www.broad.mit.edu/mpg/tagger/)
(3). Haplotype Visualization Software
- HaploView (http://www.broad.mit.edu/personal/jcbarret/haploview/)
- Haplot (http://info.med.yale.edu/genetics/kkidd/programs.html)
74Association Studies
- One group of individuals is selected as cases and another as controls.
- The case group consists of individuals who are diagnosed with some disease, react to some type of medicine, or are even especially healthy.
- The control group consists of individuals who do not exhibit the feature selected for the case group.
- For case-control studies, a selection of SNPs is genotyped in both the case and control groups, and those alleles that exhibit a higher incidence in the case group as opposed to the control group are potential markers for the observed phenotype.
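A basic way to test whether an allele is more frequent in cases than in controls is a chi-square test on the 2x2 table of allele counts. The sketch below uses the standard 2x2 shortcut formula with one degree of freedom; the function name and the counts are illustrative, and real studies would also correct for multiple testing and population structure.

```python
import math


def allele_chi_square(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Return (chi2, p_value, odds_ratio) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


chi2, p, odds = allele_chi_square(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
print(f"chi2={chi2:.2f} p={p:.4f} OR={odds:.2f}")
```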
76Nature 445: 828 (2007)
It is the largest GWA study so far, and tackles a very common disease that is rising in prevalence throughout the world.
The effort aims to understand the interplay between genetic and environmental risk factors in generating the high frequency of the disease.
78Other Applications
- Evolution studies: the map may be biased because it prefers common SNPs to rare ones.
- Examining genome architecture
- Natural selection studies
79Debate
- Are common mutations behind most common diseases? (common diseases, common mutations)
- OR
- Do common diseases arise from combinations of rare mutations?
80Major Analysis Tools
81Outline
- BLAST: Sequence Similarity Comparison
- PSI-BLAST: Distant Similarity Comparison
- ClustalW: Multiple Sequence Alignment
- HMMER: More Sensitive Multiple Sequence Alignment
- PHYLIP: Inferring Phylogenies
Comparison of protein and DNA sequences is one of the foundations of bioinformatics.
82- Evolution
- The theory of evolution is the foundation upon which all of modern biology is built.
- From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.
- It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time.
- Relationships
- Similarity searches and multiple alignments of sequences naturally lead to the question:
- How are these sequences related?
- And more generally:
- How are the organisms from which these sequences come related?
83BLAST
Basic Local Alignment Search Tool (BLAST) is the
tool most frequently used for calculating
sequence similarity.
84Why BLAST?
- Identify unknown sequences
- Help gene/protein function and structure prediction: genes with similar sequences tend to share similar functions or structure.
- Identify protein families: group related (paralogous or orthologous) genes and their proteins into a family.
- Prepare sequences for multiple alignments
- And more
85What is BLAST?
- An Example of Sequence Comparison
86Components of Sequence Alignment
- Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps)
- An algorithm for alignment
- Confidence assessment of the alignment result.
87Global vs. Local Alignment
- Global alignment: the alignment of complete sequences
- Good for comparing members of the same protein family
- Needleman & Wunsch 1970. J Mol Biol 48: 443
- Local alignment: the alignment of segments of sequences
- Ignores areas that show little similarity
- Smith & Waterman 1981. J Mol Biol 147: 195
- Modified from the Needleman-Wunsch algorithm
- Can be done with heuristics (FASTA and BLAST)
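The relationship between the two algorithms is easiest to see in code: the local version is the global recurrence with scores floored at zero and the best cell taken anywhere in the matrix. The sketch below returns scores only (no traceback) and assumes a simple match/mismatch scheme with a linear gap penalty; the parameter values and function name are illustrative, not those of any published tool.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Needleman-Wunsch score (local=False) or Smith-Waterman score (local=True)."""
    # first row: leading gaps cost nothing in a local alignment
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


x, y = "XXXXACGTYYYY", "ZZZZACGTWWWW"
print(alignment_score(x, y))              # global: flanking mismatches are charged
print(alignment_score(x, y, local=True))  # local: only the shared ACGT core counts
```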
88BLAST Stages
- Stage I
- Find matching word pairs
- Extend word pairs as much as possible, i.e., as long as the total weight increases
- Result: High-scoring Segment Pairs (HSPs)
- THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
- INVIEIAMDEADMEATTNAMHEWASNINETEEN
- Stage II
- Try to connect HSPs by aligning the sequences in between them
- THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
- INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
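Stage I can be illustrated with a much-simplified seed-and-extend routine. This sketch is not BLAST itself: it seeds on exact word matches only (BLAST also accepts similar "neighbourhood" words above a threshold T), scores with +1/-1 rather than a substitution matrix, and the names `find_seeds`, `extend_seed` and `x_drop` are illustrative.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, word_size: int = 3):
    """All (query_pos, subject_pos) pairs where an exact word is shared."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return [(qi, sj)
            for sj in range(len(subject) - word_size + 1)
            for qi in index.get(subject[sj:sj + word_size], [])]


def extend_seed(query, subject, qi, sj, word_size=3,
                match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of one seed in both directions.

    Extension stops once the running score falls `x_drop` below the best
    seen so far. Returns (score, query_start, subject_start, length).
    """
    def extend(q_pos, s_pos, step):
        score = best = best_len = n = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= x_drop:
                break
            q_pos += step
            s_pos += step
        return best, best_len

    right, r_len = extend(qi + word_size, sj + word_size, 1)
    left, l_len = extend(qi - 1, sj - 1, -1)
    score = word_size * match + left + right
    return score, qi - l_len, sj - l_len, l_len + word_size + r_len


query, subject = "AAAPATRICKGGG", "TTTPATRICKCCC"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))
```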
89BLAST Options
- Composition-based statistics (Yes)
- Sequence Complexity Filter (Yes)
- Expect (E) value (10)
- Word Size (3 or 11)
- Substitution or Scoring Matrix (Blosum62)
- Gap Insertion Penalty (11)
- Gap Extension Penalty (1)
90- BLAST is a collection of programs with versions for query-to-database pairs such as:
- blastn: nucleotide query vs. nucleotide DB
- blastp: protein query vs. protein DB
- blastx: nucleotide query, translated into proteins, vs. protein DB
- tblastn: protein query vs. nucleotide DB translated into proteins
- tblastx: nucleotide query, translated into proteins, vs. nucleotide DB translated into proteins
92[Figure: BLAST programs arranged by query type and database type; labels lost in extraction. Programs shown: blastp, tblastn, blastn, tblastx, blastx]
93BLAST Parameters
- Identities - No. of exact residue matches
- Positives - No. of identical and similar residue matches
- Gaps - No. of gaps introduced
- Score - Summed HSP score (S)
- Bit Score - a normalized score (S')
- Expect (E) - Expected number of chance HSP alignments
- P - Probability of getting a score > X
- T - Minimum word or k-tuple score (Threshold)
94E-value
- P-value: the probability that a variate would assume a value greater than or equal to the observed value strictly by chance, P(z >= z_o). The E-value is the corresponding expected number of chance alignments scoring at least as well.
- If the E-value found for an alignment is low (< 0.001), then the alignment is probably biologically meaningful.
- The parameters are pre-computed based on a statistical model.
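The statistical model is usually the Karlin-Altschul one, in which E = K m n exp(-lambda S) for a raw score S, query length m and database length n. The sketch below shows how raw score, bit score, E-value and P-value relate under that model; the lambda and K values in the example are illustrative assumptions (they depend on the scoring matrix and gap penalties), and the function names are hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_value(raw_score: float, m: int, n: int, lam: float, K: float) -> float:
    """Expected number of chance HSPs with score >= raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def p_value(e_value: float) -> float:
    """Probability of at least one such chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e_value)


lam, K = 0.267, 0.041      # assumed, illustrative parameter values
m, n = 300, 1_000_000      # query length, total database length
for s in (40, 60, 100):
    e = expect_value(s, m, n, lam, K)
    print(s, round(bit_score(s, lam, K), 1), f"{e:.3g}", f"{p_value(e):.3g}")
```

For small E the P-value and E-value are nearly identical, which is why the two are often used interchangeably for strong hits.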
95Low complexity issue
- Watch out for
- transmembrane or signal peptide regions
- coiled-coil regions
- short amino acid repeats (collagen, elastin)
- homopolymeric repeats
- BLAST uses SEG to mask amino acids
- BLAST uses DUST to mask bases
96BLAST-related tools
- BLAST2Sequences: finds local alignments between any two protein or nucleotide sequences.
- MegaBLAST: a variation of BLASTN that has been optimized specifically for aligning either long or highly similar (>95%) sequences and is the method of choice when looking for exact matches in nucleotide databases.
- PSI-BLAST (position-specific-iterated BLAST): particularly well suited for identifying distantly related proteins that may not have been found using the traditional BLASTP method.
- BLAT (BLAST-Like Alignment Tool): similar to MegaBLAST in that it is designed to rapidly align longer nucleotide sequences having more than 95% similarity, but using a slightly different strategy than BLAST to achieve faster speed.
97PSI-BLAST
Position-Specific Iterated (PSI)-BLAST is the
most sensitive BLAST program, making it useful
for finding very distantly related proteins
98Distant similarity detection
- Many functionally and evolutionarily important protein similarities are recognizable only through 3D structure comparison.
- When such structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities.
- These conserved patterns are variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models.
- In essence, for each position in the derived pattern, every amino acid is assigned a score.
- A highly conserved residue at a particular position is assigned a high positive score, and others are assigned high negative scores.
- At weakly conserved positions, all residues receive scores near zero.
- Position-specific scores can also be assigned to potential insertions and deletions.
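The scoring idea can be made concrete with a small position-specific score matrix built from aligned sequences. This is a deliberately naive sketch: it uses a uniform background frequency and a flat pseudocount, both of which are simplifying assumptions (PSI-BLAST uses sequence weighting and substitution-matrix-based pseudocounts), and it ignores gap scores. The three aligned fragments are shortened from the example alignment in the PSI-BLAST flow chart.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount: float = 1.0):
    """One {residue: log2-odds score} dictionary per alignment column."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq: str) -> float:
    """Sum of position-specific scores for an ungapped sequence."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, seq))


alignment = ["FGLGRT", "-GLVRT", "FGLLRT"]
pssm = build_pssm(alignment)
print(round(pssm[1]["G"], 2), round(pssm[3]["G"], 2), round(pssm[1]["A"], 2))
print(round(score_sequence(pssm, "FGLGRT"), 2))
```

The fully conserved second column scores its G more highly than the variable fourth column does, and unobserved residues score below zero.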
99- The power of profile methods can be further enhanced through iteration of the search procedure.
- Position-specific scores improve the ability of successive BLAST iterations to detect remote homologs.
- Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to..."
100PSI-BLAST flow chart
- Take a sequence.
- Search for similar sequences in a full sequence database.
- Sequences are multiply aligned:
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
- Construct a profile, and represent conservation in each position numerically.
- The profile holds more information than a single sequence; use the profile to retrieve additional sequences.
- After several iterations of this procedure we have:
- Sequence information, including links to annotation
- Several sets of multiple alignments
- Profiles, derived by us or by PSI-BLAST
- Thresholding information (alignment statistics)
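The loop in the flow chart can be imitated with a toy example. The sketch below is not PSI-BLAST: it assumes equal-length, ungapped sequences, a uniform background, a fixed score threshold, and made-up database entries. The names `profile_from` and `iterative_search` are hypothetical. It only shows how a profile built from early hits can pull in a sequence that the query alone would miss.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_from(seqs, pseudocount=1.0):
    """Log-odds profile from equal-length, ungapped sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def iterative_search(query, database, threshold, max_iterations=5):
    """Rebuild the profile from the current hits until no new hits appear."""
    members = [query]
    for _ in range(max_iterations):
        profile = profile_from(members)
        hits = [seq for seq in database
                if sum(col[res] for col, res in zip(profile, seq)) >= threshold]
        new_members = [query] + [h for h in hits if h != query]
        if set(new_members) == set(members):
            break  # converged: the hit list no longer changes
        members = new_members
    return members

# Hypothetical toy database; only the query comes from the slide.
query = "FGLGRT"
database = ["FGLLRT", "FGLLKT", "FALLKT", "WWWWWW"]
one_pass = iterative_search(query, database, threshold=3.0, max_iterations=1)
converged = iterative_search(query, database, threshold=3.0)
```

With a single pass, "FALLKT" scores below the threshold against the query alone; once the profile includes the first-round hits, it passes, while the unrelated "WWWWWW" never does.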
102Multiple Alignment Methods
- The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.
- The principle is that a multiple alignment is achieved by successive application of pairwise methods.
- Different algorithms for multiple alignment:
- Carrillo-Lipman (MSA, DCA)
- Segment based (Dialign)
- Iterative (Profiles, HMMs)
- Progressive (ClustalW, T-Coffee)
- PSI-BLAST and iSCANPS
103What is a multiple alignment?
- An alignment that contains more than two sequences
VTISCTGSESNIGAG-NHVKWYQQLPG
VTISCTGTESNIGS--ITVNWYQQLPG
LRLSCSSSDFIFSS--YAMYWVRQAPG
LSLTCTVSETSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKEFYPSD--IAVEWWSNG--
104Why it is important to accurately assess multiple
alignments
- Natural extension of pairwise sequence alignment.
- "Pairwise alignment whispers... multiple alignment shouts out loud" (Hubbard et al., 1996).
- Much more sensitive in detecting sequence relationships and patterns.
- Helps prediction of the secondary and tertiary structures of new sequences.
- Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
- In order to characterize protein families, identify shared regions of homology in a multiple sequence alignment.
- This generally happens when a sequence search has revealed homologies to several sequences.
- Identify primers and probes to search for homologous sequences in other organisms.
105ClustalW for multiple alignment
- ClustalW is a general-purpose multiple alignment program for DNA or proteins.
- ClustalW was produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.
- Algorithm: Thompson, J. D., D. G. Higgins, et al. (1994). "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice." Nucleic Acids Research 22(22): 4673-4680.
- ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
- The progressive algorithms found in ClustalW are fast and reliable.
106Tips for ClustalW
- Provides global multiple sequence alignment.
- Not constructed to perform local alignments.
- No guarantee to find the best alignment.
- No scoring system.
- The very first sequences to be aligned are the most closely related in the tree: if they align well, there will be few errors; the more distantly related the sequences, the more errors.
- Choose suitable scoring matrices and gap penalties.
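To illustrate why the most closely related sequences are aligned first, the sketch below picks the most similar pair from a set of sequences. It is a simplification: it assumes the sequences are already the same length (four rows are taken from the example alignment on the "What is a multiple alignment?" slide), whereas a progressive aligner would derive distances from pairwise alignments. The labels `s1` to `s4` and the function names are hypothetical.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def closest_pair(seqs):
    """Return the names of the two most similar sequences."""
    return max(combinations(sorted(seqs), 2),
               key=lambda pair: identity(seqs[pair[0]], seqs[pair[1]]))

seqs = {
    "s1": "VTISCTGSESNIGAG-NHVKWYQQLPG",
    "s2": "VTISCTGTESNIGS--ITVNWYQQLPG",
    "s3": "LRLSCSSSDFIFSS--YAMYWVRQAPG",
    "s4": "PEVTCVVVDVSHEDPQVKFNWYVDG--",
}
first_to_align = closest_pair(seqs)
```

Here `s1` and `s2` share 20 of 25 comparable positions, so they would be joined first; the more divergent rows are added later, where alignment errors are more likely.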
107When to use?
- For more closely related sequences and large numbers of sequences.
- Repeatedly realigns subgroups of sequences, then aligns these subgroups into a global alignment of all the sequences; the aim is to improve the overall alignment score.
- Selection of groups is based on the phylogenetic tree (separation of one or two sequences from the rest), similar to that of progressive alignment.
109HMMER
HMMER is a freely distributable implementation of profile Hidden Markov Model (HMM) software for protein sequence analysis.
110Hidden Markov Models (HMMs)
- HMMs are statistical models of the primary structure consensus of a sequence family.
- HMM- or profile-based methods typically outperform pairwise methods in both alignment accuracy and database search sensitivity and specificity.
- The advantage of HMMs is that they have a formal probabilistic basis.
111Why?
- Weight matrices do not deal with insertions and deletions.
- In alignments, this is done in an ad hoc manner by optimization of the two gap penalties for first gap and gap extension.
- An HMM is a natural framework in which insertions and deletions are dealt with explicitly.
- Alignments can be used to construct hidden Markov models.
- These are basically statistical models of residue preferences and gap insertion/deletion penalties for a specific protein domain.
- The theory behind profile HMMs: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
- HMMER makes this easy.
112HMMs are trained from a multiple alignment
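As a rough illustration of training from a multiple alignment, the sketch below marks mostly ungapped columns as match states and estimates their emission probabilities. It is deliberately incomplete and is not HMMER's procedure: transition probabilities are omitted, the 50% gap cutoff and the additive pseudocount are assumptions, and the function names are hypothetical. The toy alignment is the one shown on the PSI-BLAST flow chart slide.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (mostly residues)."""
    n = len(alignment)
    return [i for i, col in enumerate(zip(*alignment))
            if col.count("-") / n <= max_gap_fraction]

def emission_probabilities(alignment, pseudocount=1.0):
    """Residue probabilities for each match state, with pseudocounts."""
    columns = list(zip(*alignment))
    emissions = []
    for i in match_columns(alignment):
        counts = Counter(r for r in columns[i] if r != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append({aa: (counts[aa] + pseudocount) / total
                          for aa in AMINO_ACIDS})
    return emissions

alignment = [
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
states = match_columns(alignment)
emissions = emission_probabilities(alignment)
```

Columns that are mostly gaps (for example the lone T in the tenth column) are left out of the match states; in a full profile HMM they would be modelled by insert states.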
114PHYLIP
A package of programs for inferring phylogenies (evolutionary trees)
115Phylogenetics
- Evolutionary theory states that groups of similar organisms are descended from a common ancestor.
- Phylogenetic systematics (cladistics) is a method of taxonomic classification of organisms based on their evolutionary history.
- It was developed by Willi Hennig, a German entomologist, in 1950.
- Major reasons to use phylogenetics:
- Understand the lineage of different species
- Organizing principle to sort species into a taxonomy
- Understand how various functions evolved
- Understand forces and constraints on evolution
- Perform multiple sequence alignment
- Predict gene function (phylogenetic footprint)
116Species/Gene Trees
- Species tree (how are my species related?)
- contains only one representative from each species
- when did speciation take place?
- all nodes indicate speciation events
- Gene tree (how are my genes related?)
- often contains a number of genes from a single species
- nodes relate either to speciation or gene duplication events
- Your sequence data may not have the same phylogenetic history as the species from which they were isolated.
- Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (HGT).
117Species tree
118Gene tree
119Using DNA or protein sequences
- >70% identity: use DNA (takes more time).
- <70% identity: use the protein translation of the coding sequence (if possible) and thread the DNA sequences back onto the protein alignment.
- DNA is easier and more accurate; however, protein is reasonably good.
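Threading DNA back onto a protein alignment means replacing each aligned residue with its codon and each gap with three gap characters. A minimal sketch follows, assuming the coding sequence is in frame, has no stop codon, and corresponds exactly to the aligned protein; `thread_codons` is a hypothetical helper name.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto its gapped protein alignment row."""
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length must be 3 x number of residues")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    # Each residue consumes the next codon; each protein gap becomes "---".
    return "".join("---" if ch == "-" else next(codons) for ch in aligned_protein)

# Toy example: M (ATG), K (AAA), gap, L (CTG)
codon_row = thread_codons("MK-L", "ATGAAACTG")
```

The result is a codon-level DNA alignment that keeps the gap placement decided at the protein level.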
120What is Phylip?
- A package of programs for inferring phylogenies (evolutionary trees).
- PHYLIP is the most widely distributed phylogeny package, and competes with PAUP to be the one responsible for the largest number of published trees.
- It is written and distributed by Joe Felsenstein and collaborators (some of the following is copied from the PHYLIP homepage).
- Available free over the internet for Windows 95/98/3.x/NT, DOS, PowerMac and 68k Macintosh systems.
- PHYLIP has been in distribution since 1980, and has over 15,000 registered users.
- Methods available in the package include parsimony, distance matrix, and likelihood methods.
- It also includes bootstrapping and consensus trees.
121Phylip input
- Multiple alignment in PHYLIP format
- Carry out multiple alignment using ClustalW
- Save the output in PHYLIP format (.phy)
- Use the .phy file as input to PHYLIP
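For orientation, a sequential PHYLIP file starts with a line giving the number of sequences and the number of aligned characters, followed by one line per sequence whose name occupies exactly ten characters. The sketch below writes that layout from a dictionary; `to_phylip` is a hypothetical helper, and in practice the alignment program can save this format directly.

```python
def to_phylip(alignment):
    """Format {name: aligned_sequence} as sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        # Names are truncated or space-padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

phy_text = to_phylip({"seqA": "ACGT-A", "seqB": "ACGTTA"})
```

Note that truncating names to ten characters can create duplicates, so names should be made unique within the first ten characters before writing.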
122Programs
- SEQBOOT: generates a bootstrap set of resampled alignments for confidence determination.
- PROTPARS: estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.
- DNAPARS: estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state. It can also do transversion parsimony. Can cope with multifurcations, reconstruct ancestral states, use 0/1 character weights, and infer branch lengths.
- NEIGHBOR: estimates phylogenies from a distance matrix by the Neighbor-Joining method or the UPGMA (average linkage clustering) method.
- CONSENSE: computes a consensus tree from bootstrapped tree data.
- See details and other programs: http://evolution.genetics.washington.edu/phylip/doc/main.html#programs
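The bootstrap resampling that SEQBOOT performs can be pictured as drawing alignment columns with replacement. The following is a minimal sketch of that idea, not SEQBOOT's implementation; the function names, the seed parameter and the toy alignment are assumptions made for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """One bootstrap replicate: resample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate several replicates from a seeded random number generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

# Toy alignment (made up for the example)
aln = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "TCGAAC"}
replicates = bootstrap_replicates(aln, 100)
```

Each replicate has the same dimensions as the original alignment; a tree is built from every replicate, and the frequency with which a grouping recurs across those trees is used as its support value.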
123Steps
- Bootstrap the sequence data
- Generate Phylogenetic trees using Parsimony/ML/NJ
- Consensus tree generation
- Plot the tree