From sequence data to genomic prediction

Title: Why detect QTL? Author: bh18. Last modified by: Ben Hayes. Created: 2/14/2003 4:27:29 AM. Document presentation format: On-screen Show (4:3).

Slides: 57

Provided by: bh14

1
From sequence data to genomic prediction
2
Course overview

Day 1
Introduction
Generation, quality control, alignment of
sequence data
Detection of variants, quality control and
filtering
Day 2
Imputation from SNP array genotypes to sequence
data
Day 3
Genome wide association studies with SNP array
and sequence variant genotypes
Days 4 and 5
Genomic prediction with SNP array and sequence
variant genotypes (BLUP and Bayesian methods)
Use of genomic selection in breeding programs

3
Imputation

Why impute?
Approaches for imputation
Factors affecting accuracy of imputation
Does imputation give you more power?
Imputation to whole genome sequence variant
genotypes

4
Why impute?

Fill in missing genotypes from the lab
Merge data sets with genotypes on different
arrays
E.g. Affymetrix and Illumina data
Impute from low density to high density
7K -> 50K (saves genotyping cost)
50K -> 800K
Capture the power of higher density?
Better persistence of accuracy
Sequencing is expensive; can we impute to full
sequence data?

5
Core concept

Identity by state (IBS)
A pair of individuals have the same allele at a
locus
Identity by descent (IBD)
A pair of individuals have the same alleles at a
locus and it traces to a common ancestor
Imputation methods determine whether a chromosome
segment is IBD

6
Causes of LD

A chunk of ancestral chromosome is conserved in
the current population

[Figure: marker haplotype 1 1 1 2]
7
Core concept 2

Any two individuals in a population may share a
proportion of their genome identical by descent
(IBD)
IBD segments are identical and originated in
a common ancestor
The closer the relationship, the longer the IBD
segments
Pedigree relationships

8
Several methods for imputation

Two main categories
Family based
Population based
Or combination of the two
Some of the most effective are Beagle (Browning
and Browning, 2009), MACH (Li et al., 2010),
Impute2 (Howie et al., 2009) and AlphaPhase (Hickey
et al., 2011)

10
Finding an IBD segment
Sire
Progeny
11
Sire
IBD segment
Progeny
12
Sire
Progeny
13
Several methods for imputation

Two main categories
Family based
Population based (exploits LD)
Or combination of the two
Some of the most effective are Beagle (Browning
and Browning, 2009), MACH (Li et al., 2010),
Impute2 (Howie et al., 2009) and AlphaPhase (Hickey
et al., 2011)

14
Population based imputation

Hidden Markov Models (HMM)
An HMM has hidden states
For target individuals, these are a map of the
reference haplotypes that have been inherited
The imputation problem is to derive genotype
probabilities given the hidden states, sparse
genotypes, recombination rates and other population
parameters

15
Population based imputation
Reference population
Target population
Marchini J, Howie B. Genotype imputation for
genome-wide association studies. Nat Rev Genet.
2010; 11:499-511.
16
Population based imputation

Consider three markers, 4 reference haplotypes
0 1 1
0 1 0
1 0 1
0 0 1
Imputation?
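
A minimal sketch of how a population-based method could answer this question for the four reference haplotypes above. It is a toy haploid copying model in the spirit of Li and Stephens, not the algorithm used by Beagle or any other named program. The function name, the switch rate rho and the mismatch rate eps are illustrative assumptions. The hidden state at each marker is the reference haplotype being copied, and forward-backward gives the probability that the copied haplotype carries allele 1.

```python
# Reference haplotypes from the slide (rows = haplotypes, columns = markers).
REF = [
    (0, 1, 1),
    (0, 1, 0),
    (1, 0, 1),
    (0, 0, 1),
]


def impute_haplotype(ref, observed, rho=0.1, eps=0.01):
    """Return P(copied reference allele = 1) at each marker.

    observed: one target haplotype, alleles 0/1, None where untyped.
    rho: assumed probability of switching template between markers.
    eps: assumed probability that an observed allele mismatches its template.
    """
    k, n = len(ref), len(observed)

    def emit(h, j):
        if observed[j] is None:          # untyped marker carries no information
            return 1.0
        return 1.0 - eps if ref[h][j] == observed[j] else eps

    def trans(a, b):                     # stay on template, or jump uniformly
        return (1.0 - rho) * (1.0 if a == b else 0.0) + rho / k

    # Forward pass
    fwd = [[0.0] * k for _ in range(n)]
    for h in range(k):
        fwd[0][h] = emit(h, 0) / k
    for j in range(1, n):
        for h in range(k):
            fwd[j][h] = emit(h, j) * sum(fwd[j - 1][g] * trans(g, h) for g in range(k))

    # Backward pass
    bwd = [[1.0] * k for _ in range(n)]
    for j in range(n - 2, -1, -1):
        for h in range(k):
            bwd[j][h] = sum(trans(h, g) * emit(g, j + 1) * bwd[j + 1][g] for g in range(k))

    # Posterior over templates, collapsed to an allele probability per marker
    probs = []
    for j in range(n):
        post = [fwd[j][h] * bwd[j][h] for h in range(k)]
        carrying_1 = sum(p for p, hap in zip(post, ref) if hap[j] == 1)
        probs.append(carrying_1 / sum(post))
    return probs


# Target typed at markers 1 and 3 only; marker 2 is to be imputed.
ambiguous = impute_haplotype(REF, [0, None, 1])
clear = impute_haplotype(REF, [1, None, 1])
print("target 0 ? 1 -> P(allele 1 at marker 2) =", round(ambiguous[1], 3))
print("target 1 ? 1 -> P(allele 1 at marker 2) =", round(clear[1], 3))
```

With these reference haplotypes a target of 0 ? 1 matches both 0 1 1 and 0 0 1, so the middle marker stays uncertain, while 1 ? 1 matches only 1 0 1 and is imputed as allele 0 with high probability.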

17
Li and Stephens
18
Beagle
19
Imputation accuracy

Accuracy: correlation between real and imputed
genotypes
Concordance: percentage (%) of genotypes called
correctly
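
The two measures can be sketched as follows, assuming genotypes are coded as allele counts 0/1/2. The function names and the eight example genotypes are made up for illustration.

```python
from math import sqrt


def imputation_accuracy(true_g, imputed_g):
    """Pearson correlation between real and imputed genotypes."""
    n = len(true_g)
    mean_t = sum(true_g) / n
    mean_i = sum(imputed_g) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true_g, imputed_g))
    var_t = sum((t - mean_t) ** 2 for t in true_g)
    var_i = sum((i - mean_i) ** 2 for i in imputed_g)
    # Undefined (division by zero) if either vector has no variation.
    return cov / sqrt(var_t * var_i)


def concordance(true_g, imputed_g):
    """Percentage of genotypes called correctly."""
    correct = sum(1 for t, i in zip(true_g, imputed_g) if t == i)
    return 100.0 * correct / len(true_g)


# Hypothetical genotypes (allele counts) for eight animals at one SNP.
true_g = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_g = [0, 1, 2, 1, 0, 1, 1, 2]

r = imputation_accuracy(true_g, imputed_g)
pct = concordance(true_g, imputed_g)
print("accuracy (correlation):", r)
print("concordance (%):", pct)
```

The correlation can also be computed on imputed allele dosages rather than best-guess genotypes, and for rare alleles it is usually the more informative of the two measures, because concordance stays high even when every carrier is missed.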

20
Imputation accuracy

Depends on
Size of reference set
bigger the better!
Density of markers
extent of LD, effective population size
Frequency of SNP alleles
Genetic relationship to reference

Table 6. Accuracy of imputation from BovineLD
genotypes to BovineSNP50 genotypes for
Australian, French, and North American breeds.

Boichard D, Chung H, Dassonneville R, David X, et
al. (2012) Design of a Bovine Low-Density SNP
Array Optimized for Imputation. PLoS ONE 7(3):
e34130. doi:10.1371/journal.pone.0034130
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0034130
22
Imputation accuracy

Density of markers (extent of LD)
In Holstein dairy cattle
3K -> 50K: accuracy 0.93
7K -> 50K: accuracy 0.98

23
Illumina Bovine HD array

We genotyped
898 Holstein heifers
47 Holstein Key ancestor bulls
After (stringent) QC 634,307 SNPs

24
Imputation 50K -> 800K

Holsteins

25
Imputation accuracy

Rare alleles?

26
Imputation accuracy

Relationship to reference?

27
Imputation accuracy

Effect of map errors?

28
Why more power with imputation

High accuracies of imputation demonstrate that we
can accurately infer haplotypes of animals genotyped
with e.g. 3K
But there is potentially a large number of haplotypes
With imputed data we can test a single SNP, using only 1
degree of freedom, rather than as many as the number of
haplotypes

29
Why more power with imputation

Weigel et al. (2010)

30
Imputation

Why impute?
Approaches for imputation
Factors affecting accuracy of imputation
Does imputation give you more power?
Imputation to whole genome sequence variant
genotypes

31
Which individuals to sequence?

Those which capture greatest genetic diversity?
Select set of individuals which are likely to
capture highest proportion of unique chromosome
segments

32
Which individuals to sequence?

Let the total number of individuals in the population be
n, and the number of individuals that can be sequenced be
m.
A = average relationship matrix among the n
individuals, from pedigree

An example A matrix

Pedigree
Animal 6 is a half sib of 4 and 5
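
The slide's pedigree and matrix are not reproduced in the transcript. As a sketch, the standard tabular method builds A from a pedigree. The six-animal pedigree below is hypothetical, chosen only so that animal 6 is a half sib of 4 and 5, and the function name is illustrative.

```python
def relationship_matrix(pedigree):
    """Build A by the tabular method.

    pedigree: list of (sire, dam) per animal, in order 1..n, with 1-based
    parent ids, 0 for an unknown parent, and parents listed before offspring.
    """
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (sire, dam) in enumerate(pedigree):
        s, d = sire - 1, dam - 1          # 0-based indices, -1 means unknown
        # Diagonal: 1 + inbreeding = 1 + half the relationship between parents
        A[i][i] = 1.0 + (0.5 * A[s][d] if s >= 0 and d >= 0 else 0.0)
        # Off-diagonal: average of the relationships of j with i's parents
        for j in range(i):
            a = 0.0
            if s >= 0:
                a += 0.5 * A[j][s]
            if d >= 0:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
    return A


# Hypothetical pedigree: animals 1-3 are founders, 4 and 5 are full sibs
# (sire 1, dam 2), and 6 shares only sire 1 with them (dam 3).
PEDIGREE = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 3)]

A = relationship_matrix(PEDIGREE)
for row in A:
    print(" ".join(f"{x:5.2f}" for x in row))
```

In this example the half sibs (6 with 4, 6 with 5) get a relationship of 0.25 and the full sibs (4 with 5) get 0.5.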
34
Which individuals to sequence?

Let the total number of individuals in the population be
n, and the number of individuals that can be sequenced be
m.
A = average relationship matrix among the n
individuals, from pedigree
c is a vector of size n which holds, for each animal,
its average relationship to the population
(e.g. sum the elements of A down the column for
individual i and take the mean)

35
Which individuals to sequence?

If we choose a group of m animals for sequencing,
how much of the diversity do they capture?
p_m = A_m^{-1} c_m
where A_m is the sub-matrix of A for the m
individuals, and c_m contains the elements of the c
vector for the m individuals
Proportion of diversity captured = p_m'1_n (the sum of
the elements of p_m, with 1_n a vector of ones)
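
A small worked example of this calculation, as a sketch only. The matrix A is a hypothetical six-animal relationship matrix (animals 1-3 unrelated founders, 4 and 5 full sibs from 1 and 2, 6 from 1 and 3), and the helper names are illustrative. The linear system is solved directly instead of forming the inverse of A_m.

```python
# Hypothetical A for six animals (see the assumptions in the text above).
A = [
    [1.0, 0.0, 0.0, 0.50, 0.50, 0.50],
    [0.0, 1.0, 0.0, 0.50, 0.50, 0.00],
    [0.0, 0.0, 1.0, 0.00, 0.00, 0.50],
    [0.5, 0.5, 0.0, 1.00, 0.50, 0.25],
    [0.5, 0.5, 0.0, 0.50, 1.00, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.00],
]


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def contributions(A, chosen):
    """Return p_m = A_m^{-1} c_m for the chosen animals (0-based indices)."""
    n = len(A)
    c = [sum(A[i][j] for i in range(n)) / n for j in range(n)]  # column means
    A_m = [[A[i][j] for j in chosen] for i in chosen]
    c_m = [c[i] for i in chosen]
    return solve(A_m, c_m)


p_founders = contributions(A, [0, 1, 2])   # animals 1, 2, 3
p_pair = contributions(A, [3, 5])          # animals 4 and 6
print("founders 1-3 capture:", sum(p_founders))
print("animals 4 and 6 capture:", sum(p_pair))
```

In this toy population the three founders capture all of the diversity, while animals 4 and 6 together capture 70% of it.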

36
Which individuals to sequence?

Example

37
Which individuals to sequence?

Then choose the set of individuals to sequence (m)
which maximises p_m'1_n
Stepwise regression
Find the single individual with largest p_i, set c_i to
zero, then the next largest p_i, set c_i to zero, and so on
Genetic algorithm

38
Which individuals to sequence?

Then choose the set of individuals to sequence (m)
which maximises p_m'1_n
Stepwise regression
Find the single individual with largest p_i, set c_i to
zero, then the next largest p_i, set c_i to zero, and so on
Genetic algorithm
No A matrix? Use the genomic relationship matrix G
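
A sketch of a stepwise search, under stated assumptions. It is a plain forward greedy selection that, at each step, adds the animal giving the largest captured proportion. This is a stand-in for the stepwise procedure named on the slide, not a reproduction of it, and it is not guaranteed to find the best set. The matrix A is a hypothetical six-animal example and all function names are illustrative. A genomic relationship matrix G could be passed in place of A.

```python
# Hypothetical relationship matrix: animals 1-3 founders, 4 and 5 full sibs
# from 1 and 2, animal 6 from 1 and 3.
A = [
    [1.0, 0.0, 0.0, 0.50, 0.50, 0.50],
    [0.0, 1.0, 0.0, 0.50, 0.50, 0.00],
    [0.0, 0.0, 1.0, 0.00, 0.00, 0.50],
    [0.5, 0.5, 0.0, 1.00, 0.50, 0.25],
    [0.5, 0.5, 0.0, 0.50, 1.00, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.00],
]


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def diversity_captured(A, c, chosen):
    """Sum of the elements of p_m = A_m^{-1} c_m for the chosen animals."""
    A_m = [[A[i][j] for j in chosen] for i in chosen]
    c_m = [c[i] for i in chosen]
    return sum(solve(A_m, c_m))


def greedy_select(A, m):
    """Pick m animals one at a time, each time adding the best candidate."""
    n = len(A)
    c = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    chosen = []
    for _ in range(m):
        candidates = [i for i in range(n) if i not in chosen]
        best = max(candidates, key=lambda i: diversity_captured(A, c, chosen + [i]))
        chosen.append(best)
    return chosen, diversity_captured(A, c, chosen)


chosen, captured = greedy_select(A, 2)
print("animals to sequence:", [i + 1 for i in chosen])
print("proportion of diversity captured:", captured)
```

Here the search first takes animal 4 (highest average relationship to the population) and then animal 3, which is unrelated to animal 4 and so adds the most new diversity.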

39
Which individuals to sequence?

Poll Dorset sheep

40
Imputation of full sequence data

Two groups of individuals
Sequenced individuals = reference population
Individuals genotyped on SNP array = target
individuals

41
Imputation of full sequence data

Steps
Step 1. Find polymorphisms in sequence data
Step 2. Genotype all sequenced animals for
polymorphisms (SNPs, indels)
Step 3. Phase genotypes (e.g. Beagle) in sequenced
individuals, create reference file
Step 4. Impute all polymorphisms into
individuals genotyped with SNP array

42
Imputation of full sequence data
[Workflow diagram]
Create BAM files: 1. Filter reads on quality
score, trim ends 2. Remove PCR duplicates 3.
Align with BWA
-> BAM
Variant calling: SAMtools mpileup -> VCF file ->
filter (number of forward/reverse reads of each
allele, read depth, quality, number of
variants in a 5 bp window)
Beagle phasing in reference: input genotype
probabilities from Phred scores, QC with 800K
-> Reference file for imputation
Beagle imputation in target: SNP array data in
target population
-> Genotype probabilities
Analysis: genome-wide association, genomic
selection
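
The variant filtering step in this workflow can be illustrated with a sketch. The slide names the filter criteria but gives no thresholds, so every threshold below is a hypothetical placeholder. The record fields and function name are also illustrative and do not come from SAMtools or any other tool.

```python
def filter_variants(variants, min_qual=20.0, min_depth=10, max_depth=2000,
                    min_strand_reads=2, window=5, max_in_window=2):
    """Keep variants that pass per-site filters and are not in a dense cluster.

    variants: list of dicts for one chromosome with keys
      pos, qual, depth, alt_fwd, alt_rev (alt reads on each strand).
    All thresholds are hypothetical defaults, not values from the slides.
    """
    # Per-site filters: quality, read depth, and support on both strands
    passed = [
        v for v in variants
        if v["qual"] >= min_qual
        and min_depth <= v["depth"] <= max_depth
        and v["alt_fwd"] >= min_strand_reads
        and v["alt_rev"] >= min_strand_reads
    ]
    # Cluster filter: count variants (including this one) closer than `window` bp
    kept = []
    for v in passed:
        nearby = sum(1 for u in passed if abs(u["pos"] - v["pos"]) < window)
        if nearby <= max_in_window:
            kept.append(v)
    return kept


# Made-up records for illustration
variants = [
    {"pos": 100, "qual": 50.0, "depth": 30, "alt_fwd": 5, "alt_rev": 6},
    {"pos": 102, "qual": 55.0, "depth": 28, "alt_fwd": 4, "alt_rev": 7},
    {"pos": 103, "qual": 60.0, "depth": 31, "alt_fwd": 6, "alt_rev": 5},
    {"pos": 500, "qual": 45.0, "depth": 25, "alt_fwd": 7, "alt_rev": 4},
    {"pos": 900, "qual": 5.0, "depth": 30, "alt_fwd": 5, "alt_rev": 5},
    {"pos": 1200, "qual": 50.0, "depth": 30, "alt_fwd": 8, "alt_rev": 0},
]

kept = filter_variants(variants)
print("positions kept:", [v["pos"] for v in kept])
```

In this example the three variants packed into positions 100-103 are dropped as a cluster, the variant at 900 fails on quality, and the one at 1200 fails because the alternative allele is seen on one strand only.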
43
Imputation of full sequence data

How accurate?

44
1000 bull genomes Run 4.0
Breed/Cross Number
Holstein (Black and White) 288
Simmental (Dual and Beef) 216
Angus (Black and Red) 138
Jersey 61
Brown Swiss 59
Gelbvieh 34
Charolais 33
Hereford 31
Limousin 31
Guelph Composite 30
Beef Booster 29
Alberta Composite 28
Montbeliarde 28
AyrshireFinnish 25
Normande 24
Holstein (Red and White) 23
Swedish Red 16
Danish Red 15
Other Crosses 11
Belgian Blue 10
Piedmontese 5
Eringer 2
Galloway 2
Unknown 2
Scottish Highland 2
Pezzata Rossa Italiana 1
Romagnola 1
Salers 1
Tyrolean Grey 1
Total 1147

1147 animals sequenced
27 breeds
20 Partners
Average coverage 11X

45
1000 bull genomes Run 4.0

36.9 million filtered variants
35.2 million SNP
1.7 million INDEL

46
Imputation of full sequence data

Accuracy?
Chromosome 14
Remove 50 Holsteins, 20 Jerseys from data set
Reduce genotypes to 800K for these animals
Impute full sequence using rest of animals as
reference

47
Imputation of full sequence data
48
Imputation of full sequence data
49
Imputation of full sequence data
50
Imputation of full sequence data

Why so difficult to impute rare mutations?
Examples: Complex Vertebral Malformation (CVM) and
Bovine Leukocyte Adhesion Deficiency (BLAD)
All cases of CVM trace back to Ivanhoe Bell
BLAD traces to Osborndale Ivanhoe

51
Imputation of full sequence data

Why so difficult to impute rare mutations?

                                  BLAD            CVM
Location                          Chr1:145114963  Chr3:43412427
Frequency                         0.0014          0.0103
Bulls genotyped                   5987            5987
Imputed correctly                 5970            5836
Accuracy                          0.9972          0.9748
Carriers                          17              123
Carriers correctly imputed        13              5
Prop. carriers correctly imputed  0.765           0.041
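
The derived rows of this table can be recomputed from the counts, which shows why overall accuracy is misleading for rare mutations. This is a sketch using only the counts given above. Treating every carrier as heterozygous when computing allele frequency is an assumption, although it does reproduce the slide's frequencies.

```python
# Counts taken from the table above
results = {
    "BLAD": {"bulls": 5987, "correct": 5970, "carriers": 17, "carriers_correct": 13},
    "CVM": {"bulls": 5987, "correct": 5836, "carriers": 123, "carriers_correct": 5},
}

summary = {}
for name, r in results.items():
    summary[name] = {
        # Share of all bulls whose genotype was imputed correctly
        "accuracy": r["correct"] / r["bulls"],
        # Share of carriers whose genotype was imputed correctly
        "carrier_prop": r["carriers_correct"] / r["carriers"],
        # Assumes each carrier has one copy of the mutation
        "allele_freq": r["carriers"] / (2 * r["bulls"]),
    }

for name, s in summary.items():
    print(name,
          "accuracy", round(s["accuracy"], 4),
          "carriers correct", round(s["carrier_prop"], 3),
          "allele frequency", round(s["allele_freq"], 4))
```

Because non-carriers dominate the counts, CVM still scores above 97% overall even though only about 4% of its carriers are imputed correctly.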
52
Imputation of full sequence data

Why so difficult to impute rare mutations?
The BLAD mutation is in a unique 250 kb haplotype,
which does not occur in any non-carriers
The CVM mutation is in a 250 kb haplotype which
occurs in many non-carriers, and also occurs in
breeds without the mutation
Hypothesis: the BLAD mutation occurred on a rare
haplotype, while CVM is a recent mutation that
occurred on a common haplotype background

53
Imputation of full sequence data

Computationally efficient strategies
Beagle: run imputation in chromosome segments,
say 5 Mb with 0.5 Mb overlap (to avoid edge
effects)
FImpute: much faster than Beagle, used to impute
32,500 animals from 800K to 16 million SNP!
Does not give probabilities
Beagle phasing + Minimac
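
A sketch of how a chromosome could be cut into segments for this strategy. It assumes "5 Mb with 0.5 Mb overlap" means that consecutive 5 Mb windows share 0.5 Mb. The function name and the 12 Mb example length are illustrative, and how the overlapping results are merged afterwards is left open.

```python
def overlapping_windows(chrom_length, window=5_000_000, overlap=500_000):
    """Return 1-based inclusive (start, end) windows covering a chromosome.

    Consecutive windows share `overlap` base pairs; the last window is
    truncated at the chromosome end.
    """
    if chrom_length < 1:
        raise ValueError("chrom_length must be at least 1")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 1
    while True:
        end = min(start + window - 1, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break
        start += step
    return windows


# Example: a hypothetical 12 Mb chromosome
segments = overlapping_windows(12_000_000)
for start, end in segments:
    print(f"{start}-{end}")
```

Each segment can then be imputed as a separate job, and genotypes near the window edges can be taken from the neighbouring window where they sit further from an edge.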

56
Conclusion

Impute
to fill in missing genotypes
from low density to high density, to save genotyping cost
Accuracy depends on size of reference, effective
population size, relationship to reference,
marker density
Imputation to sequence is possible, but with relatively
low accuracies for rare alleles
Use genotype probabilities from imputation in
GWAS and genomic prediction