Variation - PowerPoint PPT Presentation

Provided by: BertOv6
1
Variation
2
Overview
  • Genomic Diversity (SNPs)
  • Variations in the Ensembl Browser
  • Human genome
  • HapMap
  • Gen2Phen and EGA
  • A bit about Functional Genomics

3
Genomic Diversity
  • SNPs (Single Nucleotide Polymorphisms)
  • base pair substitutions
  • InDels
  • insertion/deletion (frameshifts)
  • occur about once in every 300 bp (human)
  • 3 billion base pairs in mammalian genomes!

4
Single nucleotide polymorphisms (SNPs)
  • Polymorphism: a DNA variation in which each
    possible sequence is present in at least 1% of
    the population
  • Most polymorphisms (90%) take the form of SNPs:
    variations that involve just one nucleotide
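
The 1% threshold in the definition above can be written as a small check. This is a minimal sketch, assuming allele counts for a single site are already available; the function name, the example counts and the reading of the rule as "at least two alleles each reach 1%" are illustrative assumptions, not part of the original slides.

```python
def is_polymorphic(allele_counts, min_freq=0.01):
    """Return True if at least two alleles each reach min_freq in the sample.

    allele_counts: dict mapping allele (e.g. "A") to its observed count.
    min_freq: frequency threshold, 1% by default as on the slide.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return False
    # Alleles that are common enough to count towards a polymorphism.
    common = [a for a, n in allele_counts.items() if n / total >= min_freq]
    return len(common) >= 2


if __name__ == "__main__":
    # Hypothetical counts: the minor allele G is seen 20 times in 1000 chromosomes.
    print(is_polymorphic({"A": 980, "G": 20}))
    # Hypothetical counts: G is seen only 5 times in 1000 chromosomes.
    print(is_polymorphic({"A": 995, "G": 5}))
```

With these hypothetical counts the first site passes the threshold (2%) and the second does not (0.5%), so the second would be treated as a rare variant rather than a polymorphism.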

5
Origin of SNPs
Mutation in individuals
Selection of alleles
Increase of the allele to a substantial population frequency
Fixation of the allele in a population
SNP
Adapted from Bioinformatics for Geneticists, Eds Barnes and Gray
6
Functional Consequences
7
Studying variation: why?
  • SNPs can cause disease
  • (SNP in clotting factor IX codes for a stop
    codon: haemophilia)
  • SNPs can increase disease risk
  • (SNP in LDL receptor reduces efficiency: high
    cholesterol)
  • SNPs can affect drug response
  • (SNP in CYP2D6, a gene in the drug breakdown
    pathway in the liver, disrupts breakdown of
    debrisoquine, a treatment for high blood
    pressure.)

8
Studying variation: why?
  • Determine disease risk
  • Individualised medicine (pharmacogenomics)
  • Forensic studies
  • Biological markers
  • Hybridisation studies, marker-assisted breeding
  • Understanding Evolution

9
Practical Applications
10
SNPs in Ensembl
  • Most SNPs imported from dbSNP (rs IDs)
  • Imported data: alleles, flanking sequences,
    frequencies, ...
  • Calculated data: position, synonymous status,
    peptide shift, ...
  • For human also:
  • HGVbase
  • Affy GeneChip 100K and 500K Mapping Array
  • Affy Genome-Wide SNP array 6.0
  • Ensembl-called SNPs (from Celera reads and Jim
    Watson's and Craig Venter's genomes)
  • For mouse, rat, dog and chicken also:
  • Sanger- and Ensembl-called SNPs (other strains /
    breeds)
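
The "synonymous status" and "peptide shift" listed above as calculated data can be illustrated by translating the reference and variant codons and comparing the amino acids. This is a minimal sketch under stated assumptions: it uses the standard genetic code, takes whole codons as input, and the function name and output labels are hypothetical. It is not Ensembl's actual pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def coding_consequence(ref_codon, alt_codon):
    """Return (consequence, peptide_shift) for a single-codon change.

    peptide_shift is written as "ref/alt" amino acids, or a single letter
    when the amino acid does not change. "*" denotes a stop codon.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous", ref_aa
    shift = f"{ref_aa}/{alt_aa}"
    if alt_aa == "*":
        return "stop gained", shift
    if ref_aa == "*":
        return "stop lost", shift
    return "non-synonymous", shift


if __name__ == "__main__":
    # Illustrative codons only, not taken from any specific database record.
    for ref, alt in [("CTT", "CTC"), ("GAG", "GTG"), ("CGA", "TGA"), ("TAA", "CAA")]:
        print(ref, alt, coding_consequence(ref, alt))
```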

11
dbSNP
  • Central repository for simple genetic
    polymorphisms
  • single-base nucleotide substitutions
  • small-scale multi-base deletions or insertions
  • retroposable element insertions and
    microsatellite repeat variations
  • http://www.ncbi.nlm.nih.gov/SNP/
  • For human (dbSNP build 129):
  • 19,125,432 submissions (ss#'s)
  • 2,920,818 new RefSNPs (rs#'s)

12
SNPs in Ensembl - Types
  • Non-synonymous: In coding sequence, resulting in
    an aa change
  • Synonymous: In coding sequence, not resulting in
    an aa change
  • Frameshift: In coding sequence, resulting in a
    frameshift
  • Stop lost: In coding sequence, resulting in the
    loss of a stop codon
  • Stop gained: In coding sequence, resulting in the
    gain of a stop codon
  • Essential splice site: In the first 2 or the last
    2 basepairs of an intron
  • Splice site: 1-3 bps into an exon or 3-8 bps into
    an intron
  • Upstream: Within 5 kb upstream of the 5'-end of a
    transcript
  • Regulatory region: In regulatory region annotated
    by Ensembl
  • 5' UTR: In 5' UTR
  • Intronic: In intron
  • 3' UTR: In 3' UTR
  • Downstream: Within 5 kb downstream of the 3'-end
    of a transcript
  • Intergenic: More than 5 kb away from a transcript
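
The position-based types in the list above (upstream, downstream, intergenic, intronic and the two splice categories) follow from simple distance rules. This is a minimal sketch, assuming a single forward-strand transcript described only by its exon coordinates; the function name, the coordinates and the catch-all "exonic" label are hypothetical, and UTR, coding and regulatory-region types are deliberately left out.

```python
def location_class(pos, exons, flank=5000):
    """Classify a 1-based position against one forward-strand transcript.

    exons: sorted, non-overlapping list of (start, end), 1-based inclusive.
    flank: size of the upstream/downstream window (5 kb on the slide).
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]
    if pos < tx_start:
        return "upstream" if tx_start - pos <= flank else "intergenic"
    if pos > tx_end:
        return "downstream" if pos - tx_end <= flank else "intergenic"

    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            # 1-3 bp into an exon, counted from an edge that borders an intron.
            near_left = i > 0 and pos - start < 3
            near_right = i < len(exons) - 1 and end - pos < 3
            return "splice site" if near_left or near_right else "exonic"

    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < pos < right_start:
            # Distance into the intron from the nearer exon (1 = first intron base).
            dist = min(pos - left_end, right_start - pos)
            if dist <= 2:
                return "essential splice site"
            if dist <= 8:
                return "splice site"
            return "intronic"

    raise ValueError("exons must be sorted and non-overlapping")


if __name__ == "__main__":
    exons = [(1000, 1100), (1200, 1300)]  # hypothetical two-exon transcript
    for pos in (500, 1050, 1099, 1101, 1103, 1150, 2000, 7000):
        print(pos, location_class(pos, exons))
```

A reverse-strand transcript would need upstream and downstream swapped, which this sketch does not handle.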

13
SNPs in Ensembl - Species
14
Overview
  • Genomic Diversity (SNPs)
  • Variations in the Ensembl Browser
  • Human genome
  • HapMap
  • Gen2Phen and EGA
  • A bit about Functional Genomics

15
Focus on Human
  • Venter and Watson genomes
  • 1000 genomes project
  • HapMap

16
First diploid genomes for human
  • Craig Venter
  • Sequence analysis ongoing since 2003
  • Jim Watson
  • 454 technology (7.4x)
  • 100 million unpaired reads (25 billion bp)
  • 1,000,000

The Diploid Genome Sequence of an Individual Human. PLoS Biology 5(10): 2113-2144 (2007)
The Complete Genome of an Individual by Massively Parallel DNA Sequencing. Nature 452: 872-876 (2008)
Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry. Nature 456: 53-59 (2008)
The Diploid Genome Sequence of an Asian Individual. Nature 456: 60-65 (2008)
17
www.1000genomes.org
18
1000 Genomes
  • Delivering 20TB of sequence data
  • First Pilot. 60 HapMap samples sequenced (low
    coverage)
  • Second Pilot. Two trios of European and African
    descent (high coverage)
  • Third Pilot. Sequence 1,000 genes in 1,000
    individuals (high coverage)

19
1000 Genomes Browser
  • Main page
  • Built on Ensembl
  • Navigation on the left hand side
  • Options as drop down menus
  • Currently only includes human data
  • In the future comparative genomics information
    will be available
  • All pages link to Ensembl and UCSC

20
Spot the difference!
21
Reference Sequence
  • The Human Genome Project gave the average DNA
    sequence of a small number of people.
  • This helps us find out how a human develops and
    works
  • Does not show us the DNA differences between
    different humans
  • Does not reflect the major alleles

22
HapMap www.hapmap.org
  • A multi-country effort to identify and catalogue
    genetic similarities and differences in people.
  • Collaboration among scientists and funding
    agencies from Japan, the United Kingdom, Canada,
    China, Nigeria, and the United States.
  • All of the information generated by the project
    released into the public domain.

23
HapMap (phase I and II)
  • Samples from populations with African, Asian and
    European ancestry.
  • 270 DNA samples from 4 populations
  • 30 trios (two parents and an adult child) from
    the Yoruba people of Ibadan, Nigeria
  • 45 unrelated Japanese from the Tokyo area
  • 45 unrelated Han Chinese from Beijing
  • 30 trios from Utah with Northern and Western
    European ancestry (CEPH)

24
HapMap (phase III)
  • Genotypes from 1115 individuals from 11
    populations
  • ASW: African ancestry in Southwest USA (71)
  • CEU: Utah residents with Northern and Western
    European ancestry from the CEPH collection (162)
  • CHB: Han Chinese in Beijing, China (70)
  • CHD: Chinese in Metropolitan Denver, Colorado (70)
  • GIH: Gujarati Indians in Houston, Texas (83)
  • JPT: Japanese in Tokyo, Japan (82)
  • LWK: Luhya in Webuye, Kenya (83)
  • MEX: Mexican ancestry in Los Angeles, California
    (71)
  • MKK: Maasai in Kinyawa, Kenya (171)
  • TSI: Toscani in Italia (77)
  • YRI: Yoruba in Ibadan, Nigeria (163)

25
Haplotyping
  • A haplotype is a set of SNPs (on average 25 kb)
    found to be statistically associated on a single
    chromatid and which therefore tend to be
    inherited together over time.
  • Haplotyping involves grouping subjects by
    haplotypes.
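
Grouping subjects by haplotype, as described above, amounts to indexing individuals by the allele strings they carry. This is a minimal sketch, assuming each subject's two haplotypes are already phased and written as strings of alleles over the same SNPs; the subject names, haplotype strings and function name are hypothetical.

```python
from collections import defaultdict


def group_by_haplotype(subjects):
    """Map each haplotype string to the list of subjects carrying it.

    subjects: dict mapping subject name to a pair of phased haplotypes.
    A subject homozygous for a haplotype is listed twice under it.
    """
    groups = defaultdict(list)
    for name, haplotypes in subjects.items():
        for hap in haplotypes:
            groups[hap].append(name)
    return dict(groups)


if __name__ == "__main__":
    subjects = {
        "subject1": ("ACG", "ACG"),
        "subject2": ("ACG", "GTA"),
        "subject3": ("GTA", "GCA"),
    }
    for hap, carriers in sorted(group_by_haplotype(subjects).items()):
        print(hap, carriers)
```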

26
Linkage Disequilibrium
  • LD is the deviation from equilibrium, or random
    association.
  • (i.e. in a population, two alleles are always
    inherited together, though they should undergo
    recombination some of the time.)

27
Measures of LD
  • D = P(AB) - P(A)P(B)
  • D ranges from -0.25 to +0.25
  • D = 0 indicates linkage equilibrium
  • dependent on allele frequencies, therefore of
    little use
  • D' = D / maximum possible value
  • D' = 1 indicates perfect LD
  • estimates of D' strongly inflated in small
    samples
  • r^2 = D^2 / (P(A)P(B)P(a)P(b))
  • r^2 = 1 indicates perfect LD
  • measure of choice

High LD, or perfect LD, shows high association of
SNPs.
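
The three measures above can be computed directly from the haplotype frequency P(AB) and the allele frequencies P(A) and P(B). This is a minimal sketch for two biallelic loci, assuming the frequencies are already known; the function name is hypothetical, and D' is returned as an absolute value, using the usual convention that the maximum possible value of D depends on its sign.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying alleles A and B.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b
    # Largest |D| that these allele frequencies allow, given the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Alleles A and B always occur together, both at frequency 0.5.
    print(ld_measures(0.5, 0.5, 0.5))
    # Random association: P(AB) equals P(A)P(B).
    print(ld_measures(0.25, 0.5, 0.5))
    # A rare allele always found with a common one.
    print(ld_measures(0.1, 0.1, 0.5))
```

In the last case |D'| comes out at about 1 while r^2 is only about 0.11, which illustrates why the two measures are not interchangeable: r^2 reaches 1 only when the two alleles also have matching frequencies.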
28
Linkage Disequilibrium
LD values between two variants are displayed by
means of inverted coloured triangles going from
white (low LD) to red (high LD).
29
Tag SNPs define a haplotype
Adapted from Nature 426(6968): 789-796 (2003)
30
Tag SNPs
  • Tag SNPs define the minimum SNP set to identify
    a haplotype.
  • r^2 = 1 between 2 SNPs means one of them would be
    redundant in the haplotype.
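
The redundancy rule above suggests a simple way to pick tag SNPs: keep a SNP only if no SNP already kept is in perfect LD with it. This is a minimal greedy sketch, assuming pairwise r^2 values have already been computed; the SNP names, the r^2 values and the function name are hypothetical, and a greedy pass is not guaranteed to find the smallest possible tag set when the threshold is lowered below 1.

```python
def pick_tag_snps(snps, r2, threshold=1.0):
    """Greedily select tag SNPs from an ordered list of SNP names.

    r2: dict mapping frozenset({snp_a, snp_b}) to their pairwise r^2.
        Missing pairs are treated as r^2 = 0.
    threshold: a SNP is redundant if some chosen tag has r^2 >= threshold with it.
    """
    tags = []
    for snp in snps:
        redundant = any(
            r2.get(frozenset((snp, tag)), 0.0) >= threshold for tag in tags
        )
        if not redundant:
            tags.append(snp)
    return tags


if __name__ == "__main__":
    snps = ["snp1", "snp2", "snp3"]
    r2 = {
        frozenset(("snp1", "snp2")): 1.0,  # perfect LD: snp2 adds no information
        frozenset(("snp1", "snp3")): 0.2,
        frozenset(("snp2", "snp3")): 0.2,
    }
    print(pick_tag_snps(snps, r2))
```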

31
Locus specific databases (LSDB)
  • Databases that focus on one gene or one disease
  • e.g. p53, ABO, collagen
  • e.g. Albinism, cystic fibrosis, Alzheimer's
    disease
  • User communities
  • Research groups: disease and function driven
  • Clinicians: driven by genetic testing of
    patients

32
LSDBs
  • >700 on the Human Genome Variation Society website

33
LSDB examples
34
Why is it difficult to merge these data?
  • Historical reasons. LSDBs sometimes:
  • Use sequences which do not start at Methionine
  • Use transcript coordinates not genomic
  • Use a different transcript for reporting
    mutations
  • Regularly changes with new assemblies/gene builds
  • It may contain minor alleles or rare alleles
  • It may be inaccurate
  • Missing genes (e.g. no α-haemoglobin -
    Thalassaemia)
  • Mixture of sequences from different individuals

35
Ensembl and LRGs
  • Define an exchange format for LRGs with the NCBI
  • Create an LRG website
  • Create a pipeline for receiving the data and
    creating an LRG
  • Extend Ensembl (e!) databases to store LRGs
  • Develop an API to query LRGs and associated
    annotation
  • Consult with the LSDBs to develop useful
    visualisation tools
  • Build displays for LRG data and annotation

36
Why is this important for Ensembl?
  • Ensembl has traditionally focused on an
    infrastructure for molecular biologists
  • Needs to expand to provide support for more
    stable transcript sequences used for reporting
    mutations
  • It will give central databases access to patient
    variation, genotype, phenotype and disease data
  • This will improve our data resources

37
Advantages to LSDBs
  • LRGs in Ensembl give LSDBs access to:
  • Genome annotation (including comparative,
    functional genomics and variation data)
  • Data integration with other variation resources
    (dbSNP, EGA, 1000 Genomes, NHGRI GWA catalogue)
  • Sequence search and data mining tools
  • A Perl API to query the data
  • A genome browser website for visualisation in
    genomic context and local context
  • Promotes discoverability of LSDBs
  • Data is mapped from one assembly to the next

38
EGA - Repository for genotype data
  • www.ebi.ac.uk/ega/

39
Variations Team
Fiona Cunningham Yuan Chen Will McLaren
40
Functional Genomics
(Wikipedia) Functional genomics is a field of
molecular biology that attempts to make use of
the vast wealth of data produced by genomic
projects (such as genome sequencing projects)
to describe gene (and protein) functions and
interactions.
In Ensembl:
  • Regulatory build using ENCODE project information
  • Promoters and enhancers from cisRED and VISTA
  • FlyReg features (for Drosophila)
41
ENCODE
  • Encyclopedia Of DNA Elements
  • Where are the promoter, enhancer, and other
    regulatory regions of the human genome?
  • Pilot project showed: use chromatin accessibility
    and histone modification analysis to predict TSS

14 June 2007, Nature
42
cisRED
Sequence motifs determined by experimental and
prediction tools.
http://www.cisred.org/
VISTA Enhancer Set
Tissue-specific enhancers. Tested
experimentally.
Nucleic Acids Res. 2007 January; 35(Database issue): D88-D92.
43
How to get there?
44
Click on a Regulatory Feature
45
Region in Detail
46
BioMart
47
Functional Genomics Team
Ian Dunham Nathan Johnson Damian Keefe