From sequence data to genomic prediction - PowerPoint PPT Presentation

From sequence data to genomic prediction

Description:

Title: Why detect QTL? Author: bh18 Last modified by: Ben Hayes Created Date: 2/14/2003 4:27:29 AM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Slides: 57
Provided by: bh14
Transcript and Presenter's Notes

1
From sequence data to genomic prediction
2
Course overview
  • Day 1
  • Introduction
  • Generation, quality control, alignment of
    sequence data
  • Detection of variants, quality control and
    filtering
  • Day 2
  • Imputation from SNP array genotypes to sequence
    data
  • Day 3
  • Genome wide association studies with SNP array
    and sequence variant genotypes
  • Days 4 and 5
  • Genomic prediction with SNP array and sequence
    variant genotypes (BLUP and Bayesian methods)
  • Use of genomic selection in breeding programs

3
Imputation
  • Why impute?
  • Approaches for imputation
  • Factors affecting accuracy of imputation
  • Does imputation give you more power?
  • Imputation to whole genome sequence variant
    genotypes

4
Why impute?
  • Fill in missing genotypes from the lab
  • Merge data sets with genotypes on different
    arrays
  • E.g. Affy and Illumina data
  • Impute from low density to high density
  • 7K -> 50K (save $)
  • 50K -> 800K
  • Capture power of higher density?
  • Better persistence of accuracy
  • Sequence expensive, can we impute to full
    sequence data?

5
Core concept
  • Identity by state (IBS)
  • A pair of individuals have the same allele at a
    locus
  • Identity by descent (IBD)
  • A pair of individuals have the same alleles at a
    locus and it traces to a common ancestor
  • Imputation methods determine whether a chromosome
    segment is IBD

6
Causes of LD
  • A chunk of ancestral chromosome is conserved in
    the current population

Marker haplotype: 1 1 1 2
7
Core concept 2
  • Any individuals in a population may share a
    proportion of their genome identical by descent
    (IBD)
  • IBD segments are the same and have originated in
    a common ancestor
  • The closer the relationship the longer the IBD
    segments
  • Pedigree relationships

8
Several methods for imputation
  • Two main categories
  • Family based
  • Population based
  • Or combination of the two
  • Some of the most effective are Beagle (Browning
    and Browning, 2009), MACH (Li et al., 2010),
    Impute2 (Howie et al., 2009), AlphaPhase (Hickey
    et al., 2011)

10
Finding an IBD segment
Sire
Progeny
11
Sire
IBD segment
Progeny
12
Sire
Progeny
13
Several methods for imputation
  • Two main categories
  • Family based
  • Population based (exploits LD)
  • Or combination of the two
  • Some of the most effective are Beagle (Browning
    and Browning, 2009), MACH (Li et al., 2010),
    Impute2 (Howie et al., 2009), AlphaPhase (Hickey
    et al., 2011)

14
Population based imputation
  • Hidden Markov Models (HMM)
  • An HMM has hidden states
  • For target individuals, the hidden states are the map of
    which reference haplotypes have been inherited
  • The imputation problem is to derive genotype
    probabilities given the hidden states, sparse
    genotypes, recombination rates and other population
    parameters

15
Population based imputation
Reference population
Target population
Marchini J, Howie B. Genotype imputation for
genome-wide association studies. Nat Rev Genet.
2010;11:499-511.
16
Population based imputation
  • Consider three markers, 4 reference haplotypes
  • 0 1 1
  • 0 1 0
  • 1 0 1
  • 0 0 1
  • Imputation?
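
One way to work through the three-marker example is a small haploid HMM in the style of Li and Stephens, where the hidden state at each marker is the reference haplotype being copied. The sketch below is only an illustration: the function name, the `switch` and `error` values, and the target haplotype (alleles 0 and 1 observed at the first and third markers, middle marker missing) are assumptions, not taken from the slides or from any particular imputation program.

```python
import numpy as np

# Reference haplotypes from the slide (rows = haplotypes, columns = markers).
REF = np.array([[0, 1, 1],
                [0, 1, 0],
                [1, 0, 1],
                [0, 0, 1]])

def impute_haplotype(target, ref=REF, switch=0.1, error=0.01):
    """Posterior P(allele = 1) at each marker for one target haplotype.

    target: alleles 0/1, with None where the marker was not genotyped.
    switch: assumed probability of changing template between adjacent markers.
    error:  assumed probability that an observed allele differs from the template.
    """
    k, m = ref.shape
    # Emission probabilities: 1 for untyped markers, else match/mismatch.
    emit = np.ones((m, k))
    for j, allele in enumerate(target):
        if allele is not None:
            emit[j] = np.where(ref[:, j] == allele, 1 - error, error)
    # Transition: stay on the same template, or jump uniformly to any template.
    trans = (1 - switch) * np.eye(k) + switch / k
    fwd = np.zeros((m, k))
    bwd = np.ones((m, k))
    fwd[0] = emit[0] / k
    for j in range(1, m):                      # forward pass
        fwd[j] = emit[j] * (fwd[j - 1] @ trans)
    for j in range(m - 2, -1, -1):             # backward pass
        bwd[j] = trans @ (emit[j + 1] * bwd[j + 1])
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)    # posterior over templates per marker
    # P(allele 1) = posterior weight on templates that carry allele 1.
    return (post * ref.T).sum(axis=1)

probs = impute_haplotype([0, None, 1])
print(np.round(probs, 3))
```

With this target, reference haplotypes 0 1 1 and 0 0 1 match the observed alleles equally well but disagree at the middle marker, so the imputed probability there is 0.5. A reference panel this small cannot resolve the missing allele.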

17
Li and Stephens
18
Beagle
19
Imputation accuracy
  • Accuracy: correlation of real and imputed
    genotypes
  • Concordance: percentage (%) of genotypes called
    correctly
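
The two measures can be computed as in the following minimal sketch, assuming genotypes are coded 0/1/2 and that imputed genotypes are best-guess calls. The function name and the toy data are made up for illustration.

```python
from math import sqrt

def imputation_accuracy(true_genotypes, imputed_genotypes):
    """Return (correlation, concordance %) for genotypes coded 0/1/2.

    Note: the correlation is undefined if either vector has no variation.
    """
    n = len(true_genotypes)
    pairs = list(zip(true_genotypes, imputed_genotypes))
    mean_t = sum(true_genotypes) / n
    mean_i = sum(imputed_genotypes) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in pairs)
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_i = sum((i - mean_i) ** 2 for i in imputed_genotypes)
    accuracy = cov / sqrt(var_t * var_i)
    # Concordance: share of genotypes where the imputed call equals the real one.
    concordance = 100.0 * sum(t == i for t, i in pairs) / n
    return accuracy, concordance

# Toy data (hypothetical): one of eight genotypes is imputed incorrectly.
true_calls = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_calls = [0, 1, 2, 1, 0, 1, 1, 1]
accuracy, concordance = imputation_accuracy(true_calls, imputed_calls)
print(f"accuracy r = {accuracy:.3f}, concordance = {concordance:.1f}%")
```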

20
Imputation accuracy
  • Depends on
  • Size of reference set
  • bigger the better!
  • Density of markers
  • extent of LD, effective population size
  • Frequency of SNP alleles
  • Genetic relationship to reference

21
  • Table 6. Accuracy of imputation from BovineLD
    genotypes to BovineSNP50 genotypes for
    Australian, French, and North American breeds.

Boichard D, Chung H, Dassonneville R, David X, et
al. (2012) Design of a Bovine Low-Density SNP
Array Optimized for Imputation. PLoS ONE 7(3):
e34130. doi:10.1371/journal.pone.0034130
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0034130
22
Imputation accuracy
  • Density of markers (extent of LD)
  • In Holstein Dairy cattle
  • 3K -> 50K: accuracy 0.93
  • 7K -> 50K: accuracy 0.98

23
Illumina Bovine HD array
  • We genotyped
  • 898 Holstein heifers
  • 47 Holstein Key ancestor bulls
  • After (stringent) QC 634,307 SNPs

24
Imputation 50K -> 800K
  • Holsteins

25
Imputation accuracy
  • Rare alleles?

26
Imputation accuracy
  • Relationship to reference?

27
Imputation accuracy
  • Effect of map errors?

28
Why more power with imputation
  • High accuracies of imputation demonstrate that we
    can accurately infer the haplotypes of animals
    genotyped with e.g. 3K
  • But there is potentially a large number of haplotypes
  • With imputed data we can test a single SNP, using only 1
    degree of freedom rather than degrees of freedom that
    scale with the number of haplotypes

29
Why more power with imputation
  • Weigel et al. (2010)

30
Imputation
  • Why impute?
  • Approaches for imputation
  • Factors affecting accuracy of imputation
  • Does imputation give you more power?
  • Imputation to whole genome sequence variant
    genotypes

31
Which individuals to sequence?
  • Those which capture greatest genetic diversity?
  • Select set of individuals which are likely to
    capture highest proportion of unique chromosome
    segments

32
Which individuals to sequence?
  • Let total number of individuals in population be
    n, number of individuals that can be sequenced be
    m.
  • A = average relationship matrix among n
    individuals, from pedigree

33
  • An example A matrix..

Pedigree
Animal 6 is a half sib of animals 4 and 5
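
The pedigree figure itself is not in the transcript, so the sketch below uses a hypothetical six-animal pedigree chosen only so that animal 6 is a half sib of animals 4 and 5. It builds A with the standard tabular method; the function name and the pedigree coding (0 = unknown parent) are assumptions for illustration.

```python
def relationship_matrix(pedigree):
    """Numerator relationship matrix A by the tabular method.

    pedigree: list of (sire, dam) for animals 1..n, with 0 for an unknown
    parent. Parents must appear before their offspring.
    """
    n = len(pedigree)
    # Row and column 0 stand for the unknown parent and stay at zero.
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, (sire, dam) in enumerate(pedigree, start=1):
        for j in range(1, i):
            # Relationship with an older animal: mean of its relationship
            # with the two parents of animal i.
            A[i][j] = A[j][i] = 0.5 * (A[j][sire] + A[j][dam])
        # Diagonal: 1 + inbreeding (half the relationship between the parents).
        A[i][i] = 1.0 + 0.5 * A[sire][dam]
    return [row[1:] for row in A[1:]]

# Hypothetical pedigree: animals 1-3 are founders, 4 and 5 are full sibs
# (sire 1, dam 2), and 6 shares only sire 1 with them (dam 3).
pedigree = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 3)]
A = relationship_matrix(pedigree)
for row in A:
    print(" ".join(f"{value:4.2f}" for value in row))
```

In this example the relationship between animals 4 and 5 is 0.5 (full sibs) and between animal 6 and either of them is 0.25 (half sibs).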
34
Which individuals to sequence?
  • Let total number of individuals in population be
    n, number of individuals that can be sequenced be
    m.
  • A = average relationship matrix among n
    individuals, from pedigree
  • c is a vector of size n, which for each animal
    has the average relationship to the population
    (e.g. sum the elements of A down the column for
    individual i, then take the mean)

35
Which individuals to sequence?
  • If we choose a group of m animals for sequencing,
    how much of the diversity do they capture?
  • p_m = A_m^{-1} c_m
  • Where A_m is the sub-matrix of A for the m
    individuals, and c_m is the elements of the c
    vector for the m individuals
  • Proportion of diversity = p_m' 1_n
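
A minimal numerical sketch of this calculation follows. The example A matrix is hypothetical (three unrelated founders, two full sibs and one half sib), and the proportion is taken as the sum of the elements of the p vector, i.e. the vector of ones is assumed to match the length of that vector.

```python
import numpy as np

# Hypothetical A matrix: animals 1-3 founders, 4 and 5 full sibs (1 x 2),
# animal 6 a half sib of 4 and 5 (1 x 3).
A = np.array([
    [1.0, 0.0, 0.0, 0.50, 0.50, 0.50],
    [0.0, 1.0, 0.0, 0.50, 0.50, 0.00],
    [0.0, 0.0, 1.0, 0.00, 0.00, 0.50],
    [0.5, 0.5, 0.0, 1.00, 0.50, 0.25],
    [0.5, 0.5, 0.0, 0.50, 1.00, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.00],
])

def proportion_captured(A, chosen):
    """Return (p, proportion of diversity) for the chosen animals (0-based)."""
    c = A.mean(axis=0)                  # average relationship to the population
    idx = list(chosen)
    A_m = A[np.ix_(idx, idx)]           # sub-matrix for the chosen animals
    c_m = c[idx]
    p = np.linalg.solve(A_m, c_m)       # solves A_m p = c_m, i.e. p = A_m^{-1} c_m
    return p, float(p.sum())            # assumption: proportion = sum of p

p_founders, prop_founders = proportion_captured(A, [0, 1, 2])
p_sibs, prop_sibs = proportion_captured(A, [3, 4])
print(f"founders 1-3: {prop_founders:.3f}")
print(f"full sibs 4 and 5: {prop_sibs:.3f}")
```

In this toy pedigree the three founders together capture all of the diversity, while the two full sibs capture less because they largely duplicate each other.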

36
Which individuals to sequence?
  • Example

37
Which individuals to sequence?
  • Then choose the set of individuals to sequence (m)
    which maximises p_m' 1_n
  • Step wise regression
  • Find single individual with largest p_i, set c_i to
    zero, next largest p_i, set c_i to zero...
  • Genetic algorithm
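
The slide only outlines the step-wise search, so the sketch below swaps in a plain greedy forward selection: at each step it adds the animal that gives the largest proportion of diversity for the set chosen so far. This is one possible reading, not necessarily the procedure used in the course, and the example A matrix and function name are hypothetical.

```python
import numpy as np

# Hypothetical A matrix: animals 1-3 founders, 4 and 5 full sibs (1 x 2),
# animal 6 a half sib of 4 and 5 (1 x 3).
A = np.array([
    [1.0, 0.0, 0.0, 0.50, 0.50, 0.50],
    [0.0, 1.0, 0.0, 0.50, 0.50, 0.00],
    [0.0, 0.0, 1.0, 0.00, 0.00, 0.50],
    [0.5, 0.5, 0.0, 1.00, 0.50, 0.25],
    [0.5, 0.5, 0.0, 0.50, 1.00, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.00],
])

def greedy_select(A, m):
    """Greedily pick m animals (0-based indices) to maximise the sum of p."""
    c = A.mean(axis=0)                  # average relationship to the population
    chosen, best_score = [], 0.0
    for _ in range(m):
        best, best_score = None, -np.inf
        for i in range(len(c)):
            if i in chosen:
                continue
            idx = chosen + [i]
            # p = A_m^{-1} c_m for the candidate set; score is the sum of p.
            p = np.linalg.solve(A[np.ix_(idx, idx)], c[idx])
            if p.sum() > best_score:
                best, best_score = i, float(p.sum())
        chosen.append(best)
    return chosen, best_score

chosen, score = greedy_select(A, 2)
print("animals chosen:", [i + 1 for i in chosen], "score:", round(score, 3))
```

A greedy search like this is cheap but can miss the best set, which is presumably why the slide also lists a genetic algorithm as an alternative.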

38
Which individuals to sequence?
  • Then choose the set of individuals to sequence (m)
    which maximises p_m' 1_n
  • Step wise regression
  • Find single individual with largest p_i, set c_i to
    zero, next largest p_i, set c_i to zero...
  • Genetic algorithm
  • No A? Use G
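
The slide does not say how G is built. One common choice is the genomic relationship matrix G = ZZ' / (2 Σ p(1 − p)), with Z the genotype matrix centred on twice the allele frequency, and the sketch below assumes that form. Genotype coding (0/1/2), the toy data, and estimating allele frequencies from the sample itself are all assumptions.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """G = ZZ' / (2 * sum(p * (1 - p))) for genotypes coded 0/1/2.

    genotypes: animals in rows, SNPs in columns.
    Allele frequencies p are estimated from the same animals (an assumption).
    """
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0            # allele frequency per SNP
    Z = M - 2.0 * p                     # centre each SNP column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Toy genotypes (hypothetical): 4 animals, 5 SNPs.
genotypes = [
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [2, 0, 1, 1, 1],
    [1, 2, 0, 2, 1],
]
G = genomic_relationship_matrix(genotypes)
print(np.round(G, 2))
```

G can then take the place of A when choosing which animals to sequence, for example when pedigree records are missing.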

39
Which individuals to sequence?
  • Poll Dorset sheep

40
Imputation of full sequence data
  • Two groups of individuals
  • Sequenced individuals: reference population
  • Individuals genotyped on SNP array: target
    individuals

41
Imputation of full sequence data
  • Steps
  • Step 1. Find polymorphisms in sequence data
  • Step 2. Genotype all sequenced animals for
    polymorphisms (SNP, Indels)
  • Step 3. Phase genotypes (e.g. Beagle) in sequenced
    individuals, create reference file
  • Step 4. Impute all polymorphisms into
    individuals genotyped with SNP array

42
Imputation of full sequence data
  • Create BAM files
  • 1. Filter reads on quality score, trim ends
  • 2. Remove PCR duplicates
  • 3. Align with BWA
  • Variant calling: SAMtools mpileup, VCF file -> filter
    (number of forward/reverse reads of each allele, read
    depth, quality, number of variants in 5bp window)
  • Beagle phasing in reference: input genotype
    probabilities from Phred scores, QC with 800K
  • Reference file for imputation
  • Beagle imputation in target: SNP array data in
    target population
  • Genotype probabilities
  • Analysis: genome wide association, genomic
    selection
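
The step "genotype probabilities from Phred scores" can be illustrated as below, assuming the Phred scores are the Phred-scaled genotype likelihoods that variant callers write to the PL field of a VCF file, and assuming a flat prior over the three genotypes. The function name and the example values are hypothetical.

```python
def genotype_probabilities(phred_likelihoods):
    """Convert Phred-scaled genotype likelihoods to probabilities.

    phred_likelihoods: values for genotypes (0/0, 0/1, 1/1); a Phred value Q
    corresponds to a relative likelihood of 10 ** (-Q / 10).
    Assumes a flat prior, so probabilities are the normalised likelihoods.
    """
    likelihoods = [10 ** (-q / 10) for q in phred_likelihoods]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

# Hypothetical example: strong support for the homozygous reference genotype.
probabilities = genotype_probabilities([0, 30, 200])
print([round(value, 4) for value in probabilities])
```

At low sequencing depth the three likelihoods are often close together, which is why the pipeline passes probabilities rather than hard genotype calls to the phasing step.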
43
Imputation of full sequence data
  • How accurate?

44
1000 bull genomes Run 4.0
Breed/Cross                   Number
Holstein (Black and White)    288
Simmental (Dual and Beef)     216
Angus (Black and Red)         138
Jersey                        61
Brown Swiss                   59
Gelbvieh                      34
Charolais                     33
Hereford                      31
Limousin                      31
Guelph Composite              30
Beef Booster                  29
Alberta Composite             28
Montbeliarde                  28
Ayrshire Finnish              25
Normande                      24
Holstein (Red and White)      23
Swedish Red                   16
Danish Red                    15
Other Crosses                 11
Belgian Blue                  10
Piedmontese                   5
Eringer                       2
Galloway                      2
Unknown                       2
Scottish Highland             2
Pezzata Rossa Italiana        1
Romagnola                     1
Salers                        1
Tyrolean Grey                 1
Total                         1147
  • 1147 animals sequenced
  • 27 breeds
  • 20 Partners
  • Average 11X coverage

45
1000 bull genomes Run 4.0
  • 36.9 million filtered variants
  • 35.2 million SNP
  • 1.7 million INDEL

46
Imputation of full sequence data
  • Accuracy?
  • Chromosome 14
  • Remove 50 Holsteins, 20 Jerseys from data set
  • Reduce genotypes to 800K for these animals
  • Impute full sequence using rest of animals as
    reference

47
Imputation of full sequence data
48
Imputation of full sequence data
49
Imputation of full sequence data
50
Imputation of full sequence data
  • Why so difficult to impute rare mutations?
  • Examples: Complex Vertebral Malformation (CVM) and
    Bovine Leukocyte Adhesion Deficiency (BLAD)
  • All cases of CVM trace back to Ivanhoe Bell
  • BLAD traces to Osborndale Ivanhoe

51
Imputation of full sequence data
  • Why so difficult to impute rare mutations?

                                    BLAD              CVM
Location                            Chr1:145114963    Chr3:43412427
Frequency                           0.0014            0.0103
Bulls genotyped                     5987              5987
Imputed correctly                   5970              5836
Accuracy                            0.9972            0.9748
Carriers                            17                123
Carriers correctly imputed          13                5
Prop. carriers correctly imputed    0.765             0.041
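
The table's derived rows can be recomputed from its counts, which makes the contrast between overall accuracy and carrier accuracy explicit. The sketch below uses only the numbers in the table. The frequency calculation assumes every carrier is heterozygous, which is an assumption, although it does reproduce the frequencies shown.

```python
# Counts copied from the table above.
results = {
    "BLAD": {"bulls": 5987, "correct": 5970, "carriers": 17, "carriers_correct": 13},
    "CVM": {"bulls": 5987, "correct": 5836, "carriers": 123, "carriers_correct": 5},
}

def summarise(counts):
    """Return (allele frequency, overall accuracy, carrier accuracy)."""
    # Assumption: all carriers are heterozygous, so each carries one copy.
    frequency = counts["carriers"] / (2 * counts["bulls"])
    overall = counts["correct"] / counts["bulls"]
    carrier = counts["carriers_correct"] / counts["carriers"]
    return frequency, overall, carrier

for name, counts in results.items():
    frequency, overall, carrier = summarise(counts)
    print(f"{name}: frequency {frequency:.4f}, "
          f"overall accuracy {overall:.4f}, carrier accuracy {carrier:.3f}")
```

Because carriers are so rare, overall accuracy stays above 0.97 for CVM even though only about 4% of its carriers are imputed correctly. Overall accuracy on its own says little about how well a rare mutation is imputed.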
52
Imputation of full sequence data
  • Why so difficult to impute rare mutations?
  • The BLAD mutation is in a unique 250kb haplotype,
    which does not occur in any non-carriers
  • The CVM mutation is in a 250kb haplotype which
    occurs in many non-carriers, and also occurs in
    breeds without the mutation
  • Hypothesis: the BLAD mutation occurred on a rare
    haplotype, while CVM is a recent mutation that
    occurred on a common haplotype background

53
Imputation of full sequence data
  • Computationally efficient strategies
  • Beagle: run imputation in chromosome segments,
    say 5 Mb with 0.5 Mb overlap (to avoid edge
    effects)
  • FImpute: much faster than Beagle, used to impute
    32,500 animals from 800K to 16 million SNP!
  • Does not give probabilities
  • Beagle phasing + Minimac
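
Splitting a chromosome into overlapping segments can be sketched as follows, assuming "5 Mb with 0.5 Mb overlap" means that adjacent windows share 0.5 Mb. The function name, the zero-based coordinates and the example chromosome length are hypothetical, and the sketch does not call Beagle itself.

```python
def imputation_windows(chrom_length, window=5_000_000, overlap=500_000):
    """Return (start, end) base-pair windows covering a chromosome.

    Adjacent windows share `overlap` base pairs, so variants near a window
    edge can be taken from the neighbouring window where they are interior.
    """
    if chrom_length <= 0 or overlap >= window:
        raise ValueError("need chrom_length > 0 and overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, chrom_length)
        windows.append((start, end))
        if end >= chrom_length:
            break
        start += step
    return windows

# Hypothetical 12 Mb chromosome.
windows = imputation_windows(12_000_000)
for start, end in windows:
    print(f"{start:>10,} - {end:>10,}")
```

Each window can then be imputed as a separate job, and the results merged by keeping, for every variant in an overlap, the call from the window in which it lies further from the edge.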

56
Conclusion
  • Impute
  • to fill in missing genotypes
  • low density to high density to save $
  • Accuracy depends on size of reference, effective
    population size, relationship to reference,
    marker density
  • Imputation to sequence possible, relatively low
    accuracies for rare alleles
  • Use genotype probabilities from imputation in
    GWAS and genomic prediction