Title: Basic Model For Genetic Linkage Analysis Lecture
1. Basic Model For Genetic Linkage Analysis, Lecture 3
Prepared by Dan Geiger
5. Using the Maximum Likelihood Approach
The probability of pedigree data Pr(data | θ) is
a function of the known and unknown recombination
fractions, denoted collectively by θ. How can we
construct this likelihood function? The
maximum likelihood approach is to seek the value
of θ which maximizes the likelihood function
Pr(data | θ). This is the ML estimate.
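As a toy illustration of the maximum likelihood idea (not the pedigree likelihood itself), assume a hypothetical data set in which each meiosis can be scored directly as recombinant or not, so that Pr(data | θ) is proportional to θ^k (1 - θ)^(m - k). The function names and the grid search are assumptions made for this sketch.

```python
import math

def log_likelihood(theta, recombinants, meioses):
    """log Pr(data | theta) up to a constant, for directly scored meioses."""
    return (recombinants * math.log(theta)
            + (meioses - recombinants) * math.log(1.0 - theta))

def ml_estimate(recombinants, meioses, steps=500):
    """Grid search for the theta in (0, 0.5] that maximizes the likelihood."""
    # theta = 0.5 corresponds to unlinked loci, so the search stops there.
    grid = [0.5 * k / steps for k in range(1, steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, recombinants, meioses))

# Hypothetical data: 2 recombinants observed in 10 meioses.
theta_hat = ml_estimate(recombinants=2, meioses=10)
```

With these counts the maximum sits at the observed recombinant fraction, 2/10. When more than half of the meioses are recombinant, the estimate is capped at 0.5.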
6. Constructing the Likelihood Function
First, we need to determine the variables that
describe the problem. There are many possible
choices. Some variables we can observe and some
we cannot.
Lijm: Maternal allele at locus i of person j.
The values of this variable are the possible
alleles li at locus i.
Lijf: Paternal allele at locus i of person j.
The values of this variable are the possible
alleles li at locus i (same as for Lijm).
Xij: Unordered allele pair at locus i of person
j. The values are pairs of ith-locus alleles
(li, l'i).
As a starting point, we assume that the data
consists of an assignment to a subset of the
variables Xij. In other words, some (or all)
persons are genotyped at some (or all) loci.
7. What are the relationships among the variables for
a specific individual?
L11m: Maternal allele at locus 1 of person 1
L11f: Paternal allele at locus 1 of person 1
P(L11m = a) is the frequency of allele a. We use
lower case letters for states, writing, in short,
P(l11m).
X11: Unordered allele pair at locus 1 of person 1
(the data)
P(x11 | l11m, l11f) = 0 or 1, depending on
whether x11 is consistent with the pair (l11m, l11f)
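The local tables for a single founder can be sketched as follows. The function names and the dictionary of allele frequencies are assumptions made for illustration; the sketch only shows how the 0-or-1 consistency table combines with the allele frequencies.

```python
def genotype_indicator(x, lm, lf):
    """P(x | lm, lf): 1 if the unordered pair x equals {lm, lf}, else 0."""
    return 1.0 if sorted(x) == sorted((lm, lf)) else 0.0

def genotype_probability(x, freq):
    """P(x) for a founder: sum over ordered allele pairs of P(lm) P(lf) P(x | lm, lf)."""
    return sum(freq[lm] * freq[lf] * genotype_indicator(x, lm, lf)
               for lm in freq for lf in freq)

# Hypothetical allele frequencies at one locus.
freq = {"A": 0.7, "a": 0.3}
p_het = genotype_probability(("A", "a"), freq)   # both orderings are consistent
p_hom = genotype_probability(("a", "a"), freq)   # only one ordering is consistent
```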
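8. What are the relationships among the variables
across individuals?
[Figure: network over the variables of three persons. Mother (person 1): L11m, L11f, X11. Father (person 2): L12m, L12f, X12. Offspring (person 3): L13m, L13f, X13.]
P(l13m | l11m, l11f) = 1/2 if l13m = l11m
or l13m = l11f (and 1 if both hold, i.e. the mother is homozygous);
P(l13m | l11m, l11f) = 0 otherwise.
First attempt: correct but not efficient, as we
shall see.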
9. Probabilistic model for two loci
10. Adding a selector variable
[Figure: L11m, L11f and the selector S13m are the parents of L13m; X11 depends on L11m and L11f.]
S13m: Selector of maternal allele at locus 1 of person 3
P(s13m) = 1/2
L13m: Maternal allele at locus 1 of person 3
(offspring)
Selector variables Sijm are 0 or 1 depending on
which of the mother's two alleles is transmitted to
person j at locus i.
P(l13m | l11m, l11f, S13m = 0) = 1 if l13m = l11m
P(l13m | l11m, l11f, S13m = 1) = 1 if l13m = l11f
P(l13m | l11m, l11f, s13m) = 0 otherwise
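A minimal sketch of the selector table, with hypothetical function names. Summing the selector out with P(s) = 1/2 recovers the 1/2 table of the first attempt, which is one way to check that the two models agree.

```python
def transmit(l_child, l_mat, l_pat, s):
    """P(l_child | l_mat, l_pat, s): the selector s picks the parent's maternal
    allele (s = 0) or paternal allele (s = 1)."""
    source = l_mat if s == 0 else l_pat
    return 1.0 if l_child == source else 0.0

def transmit_marginal(l_child, l_mat, l_pat):
    """P(l_child | l_mat, l_pat) after summing out the selector, P(s) = 1/2."""
    return sum(0.5 * transmit(l_child, l_mat, l_pat, s) for s in (0, 1))
```

For a heterozygous parent each allele is passed on with probability 1/2; for a homozygous parent the two selector values lead to the same allele, so the marginal is 1.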
11. Probabilistic model for two loci
Model for locus 1
12. Probabilistic Model for Recombination
θ is the recombination fraction between loci 2 and 1.
13. Constructing the likelihood function I
Observed variable
All other variables are not-observed (hidden)
14. Constructing the likelihood function II
P(l11m, l11f, x11, l12m, l12f, x12, l13m, l13f, x13,
l21m, l21f, x21, l22m, l22f, x22, l23m, l23f, x23,
s13m, s13f, s23m, s23f | θ)
= product over all local probability tables
= P(l11m) P(l11f) P(x11 | l11m, l11f) ...
P(s13m) P(s13f) P(s23m | s13m, θ) P(s23f | s13f, θ)
Summing this product over the hidden variables gives
a function of the recombination
fraction. The ML estimate is the θ value that
maximizes this function.
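The product of local tables can be summed over the hidden variables directly for the three-person, two-locus example. The following is a sketch under stated assumptions: person 1 is the mother, person 2 the father, person 3 the offspring, the founders' alleles are independent draws from the allele frequencies, and all function names are hypothetical.

```python
from itertools import product

def consistent(x, lm, lf):
    """P(x | lm, lf) as an indicator; x is None for an untyped person."""
    return x is None or sorted(x) == sorted((lm, lf))

def locus_sum(selectors, data, freq):
    """Sum over founder alleles at one locus of priors times indicator terms.

    selectors = (s_m, s_f) of the offspring; data = (x_mother, x_father, x_child).
    """
    s_m, s_f = selectors
    total = 0.0
    for l1m, l1f, l2m, l2f in product(freq, repeat=4):
        l3m = l1f if s_m == 1 else l1m  # allele passed on by the mother
        l3f = l2f if s_f == 1 else l2m  # allele passed on by the father
        if (consistent(data[0], l1m, l1f) and consistent(data[1], l2m, l2f)
                and consistent(data[2], l3m, l3f)):
            total += freq[l1m] * freq[l1f] * freq[l2m] * freq[l2f]
    return total

def likelihood(theta, data1, data2, freq1, freq2):
    """Pr(data | theta) for two loci, summing over all selector values."""
    total = 0.0
    for s1 in product((0, 1), repeat=2):        # (s13m, s13f)
        for s2 in product((0, 1), repeat=2):    # (s23m, s23f)
            p_selectors = 0.25                  # P(s13m) P(s13f)
            for a, b in zip(s1, s2):            # P(s23m | s13m) P(s23f | s13f)
                p_selectors *= (1.0 - theta) if a == b else theta
            total += (p_selectors * locus_sum(s1, data1, freq1)
                      * locus_sum(s2, data2, freq2))
    return total
```

In this particular pedigree the value comes out the same for every θ, because a single child of two founders carries no information about phase. The sketch is only meant to show how the product of tables is summed; pedigrees with more meioses are needed before the function actually varies with θ.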
15. The Disease Locus I
[Figure: the locus 1 network for person 1 (L11m, L11f, X11) and the offspring (S13m, L13m), with a phenotype node Y11 that depends on X11.]
Phenotype variables Yij are 0 or 1 depending on
whether a phenotypic trait associated with locus
i of person j is observed, e.g., sick versus
healthy. For example, a model of a perfectly recessive
disease yields the penetrance probabilities
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
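The penetrance table of the perfectly recessive model can be written as a small function. The names, the string labels "sick" and "healthy", and the allele frequencies in the comment are assumptions for this sketch.

```python
def penetrance_recessive(y, genotype, disease_allele="a"):
    """P(y | genotype) for a perfectly recessive disease; y is "sick" or "healthy"."""
    is_sick = all(allele == disease_allele for allele in genotype)
    return 1.0 if (y == "sick") == is_sick else 0.0

def prevalence(freq, disease_allele="a"):
    """P(sick) for a founder, summing the penetrance over ordered allele pairs."""
    return sum(freq[lm] * freq[lf]
               * penetrance_recessive("sick", (lm, lf), disease_allele)
               for lm in freq for lf in freq)

# With a hypothetical disease-allele frequency of 0.1, only (a,a) is sick,
# so the prevalence among founders is 0.1 * 0.1.
```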
16. The Disease Locus II
Note that in this model we assume the
phenotype/disease depends only on the alleles of
one locus. Also we did not model levels of
sickness.
17. Introducing a tentative disease Locus
Marker locus
Disease locus: assume sick means xij = (a,a)
Phenotype nodes: Y21, Y22, Y23
The recombination fraction θ is unknown. Finding
it can help determine whether a gene causing the
disease lies in the vicinity of the marker locus.
18. Locus-by-Locus Summation order
Sum over locus i vars before summing over locus
i+1 vars. Sum over orange vars (Lijt) before
summing selector vars (Sijt). This order yields a
Hidden Markov Model (HMM).
19. Hidden Markov Models in General
[Figure: HMM chain] which depicts the factorization
P(s1,...,sm, r1,...,rm) = P(s1) P(r1 | s1) P(s2 | s1) P(r2 | s2) ... P(sm | sm-1) P(rm | sm)
Application in communication: the message sent is
(s1,...,sm) but we receive (r1,...,rm). Compute
the most likely message sent.
Application in speech recognition: the word said is
(s1,...,sm) but we recorded (r1,...,rm). Compute
the most likely word said.
Application in genetic linkage analysis: to be
discussed now.
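For linkage analysis the quantity of interest is the probability of the observations rather than the most likely hidden sequence. A minimal sketch of the standard forward recursion for a generic HMM is given below; the function name and the list-of-lists table layout are assumptions for illustration.

```python
def forward_likelihood(initial, transition, emission, observations):
    """P(r1, ..., rm) for an HMM, summing out hidden states left to right.

    initial[s]       = P(s1 = s)
    transition[s][t] = P(s_next = t | s_prev = s)
    emission[s][r]   = P(r | s)
    """
    states = range(len(initial))
    # alpha[s] = P(r1, ..., rk, s_k = s) after processing k observations.
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for r in observations[1:]:
        alpha = [emission[t][r] * sum(alpha[s] * transition[s][t] for s in states)
                 for t in states]
    return sum(alpha)

# Hypothetical two-state noisy channel.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.8, 0.2], [0.2, 0.8]]
p = forward_likelihood(initial, transition, emission, [0, 0, 1])
```

The cost is linear in the number of positions and quadratic in the number of hidden states, which is why the size of the state space matters in what follows.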
20. Hidden Markov Model in our case
[Figure: chain of hidden nodes S1, S2, S3, ..., Si-1, Si, Si+1, each with an observed node below it: X1, X2, X3, ..., Yi-1, Xi, Xi+1.]
The compounded variable Si = (Si,1,m, ..., Si,2n,f)
is called the inheritance vector. It has 2^(2n)
states, where n is the number of persons that have
parents in the pedigree (non-founders). The
compounded variable Xi = (Xi,1,m, ..., Xi,2n,f) is
the data regarding locus i. Similarly, for the
disease locus we use Yi. To specify the HMM we
need to write down the transition matrices from
Si-1 to Si and the matrices P(xi | Si). Note that
these quantities have already been implicitly
defined.
21. The transition matrix
Recall that each selector keeps its value from one locus to the
next with probability 1 - θ and switches with probability θ.
Note that θ depends on i, but this dependence
is omitted. In our example, where we have one
non-founder (n = 1), the transition probability
table size is 4 × 4 = 2^(2n) × 2^(2n), encoding four
options of recombination/non-recombination for
the two parental meioses.
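Assuming the 2n meioses recombine independently, the transition table over inheritance vectors is the Kronecker product of 2n copies of the 2 × 2 single-selector table. The helper names below are hypothetical; the sketch builds the table explicitly for small n.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def transition_matrix(theta, n):
    """2^(2n) x 2^(2n) transition table for n non-founders."""
    single = [[1.0 - theta, theta],
              [theta, 1.0 - theta]]   # one selector: stay or switch
    matrix = [[1.0]]
    for _ in range(2 * n):            # one factor per meiosis
        matrix = kron(matrix, single)
    return matrix

m = transition_matrix(0.1, 1)         # the 4 x 4 table of the example
```

The first row of `m` holds the four options for the two parental meioses: no recombination in either, recombination in exactly one (two entries), and recombination in both.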
22. Efficient Product
So, if we start with a matrix of size 2^(2n), we
will need 2^(2n) multiplications if we had matrix A
in hand. Continuing recursively, at most 2n
times, yields a complexity of O(2n · 2^(2n)), far less
than the O(2^(4n)) needed for regular multiplication. With
n = 10 non-founders, we drop from the non-feasible
region to the feasible one.
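One way to realize this saving is to apply the 2 × 2 selector table one meiosis at a time instead of forming the full transition table. The sketch below is an iterative variant of that idea, not necessarily the recursion on the slide; it assumes the Kronecker-product structure of independent meioses, and both function names are hypothetical. The naive version is included only for comparison.

```python
def fast_apply(v, theta, bits):
    """Multiply a vector of length 2**bits by the transition table,
    one selector bit at a time: O(bits * 2**bits) operations."""
    v = list(v)
    for b in range(bits):
        step = 1 << b
        out = [0.0] * len(v)
        for idx in range(len(v)):
            if (idx & step) == 0:
                lo, hi = v[idx], v[idx | step]   # states differing only in bit b
                out[idx] = (1.0 - theta) * lo + theta * hi
                out[idx | step] = theta * lo + (1.0 - theta) * hi
        v = out
    return v

def naive_apply(v, theta, bits):
    """Same product using every entry of the table: O(4**bits) operations."""
    size = 1 << bits
    result = []
    for t in range(size):
        total = 0.0
        for s in range(size):
            flips = bin(s ^ t).count("1")        # number of recombinations
            total += v[s] * theta ** flips * (1.0 - theta) ** (bits - flips)
        result.append(total)
    return result
```

Here `bits` plays the role of 2n. Because the single-selector table is symmetric, multiplying from the left or from the right gives the same result.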
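23. Probability of data in one locus given an
inheritance vector
P(x21, x22, x23 | s23m, s23f) =
sum over l21m, l21f, l22m, l22f, l23m, l23f of
P(l21m) P(l21f) P(l22m) P(l22f)
P(x21 | l21m, l21f) P(x22 | l22m, l22f) P(x23 | l23m, l23f)
P(l23m | l21m, l21f, s23m) P(l23f | l22m, l22f, s23f)
The five last terms are always zero-or-one,
namely, indicator functions.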
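24. Efficient computation
[Figure: model for locus 2. Founder alleles L21m, L21f (person 1, genotype X21) and L22m, L22f (person 2, genotype X22); selectors S23m = 1 and S23f = 0; offspring alleles L23m, L23f with genotype X23 = A1,A2.]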
Assume only individual 3 is genotyped. For the
inheritance vector (s23m, s23f) = (1, 0), the founder alleles
L21m and L22f are not restricted by the data,
while (L21f, L22m) have only two possible joint
assignments, (A1,A2) or (A2,A1):
p(x21, x22, x23 | s23m = 1, s23f = 0) = p(A1)p(A2) +
p(A2)p(A1)
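This value can be checked by brute force, summing the founder priors over every assignment that is consistent with the typed offspring. The function name and the allele frequencies are assumptions for this sketch.

```python
from itertools import product

def prob_data_given_vector(s_m, s_f, x3, freq):
    """P(x23 | s23m, s23f) when only person 3 is typed, by enumeration."""
    total = 0.0
    for l1m, l1f, l2m, l2f in product(freq, repeat=4):
        l3m = l1f if s_m == 1 else l1m   # selected allele of person 1
        l3f = l2f if s_f == 1 else l2m   # selected allele of person 2
        if sorted((l3m, l3f)) == sorted(x3):
            total += freq[l1m] * freq[l1f] * freq[l2m] * freq[l2f]
    return total

# Hypothetical frequencies; the unrestricted alleles sum out to 1.
freq = {"A1": 0.4, "A2": 0.6}
p = prob_data_given_vector(1, 0, ("A1", "A2"), freq)
```

The enumeration visits every founder assignment, which is exactly the work that the founder graph described next avoids.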
In general, every inheritance vector defines a
subgraph of the Bayesian network above. We build
a founder graph.
25. Efficient computation
[Figure: the model for locus 2 from the previous slide, with the subgraph selected by the inheritance vector drawn in black.]
In general, every inheritance vector defines a
subgraph as indicated by the black lines above.
Construct a founder graph whose vertices are the
founder variables and where there is an edge
between two vertices if they have a common typed
descendant. The label of an edge is the
constraint dictated by the common typed
descendant. Now find all consistent assignments
for every connected component.
[Figure: founder graph with vertices L21m, L21f, L22f, L22m and a single edge labeled A1,A2 between L21f and L22m.]
26. A Larger Example
[Figure: a descent graph and the corresponding founder graph (an example of a constraint
satisfaction graph) on nodes 1 to 8; the edge labels shown are a,b; a,b; b,d; a,c; a,b.]
Connect two nodes if they have a common typed
descendant. The constraint a,b means the
relation {(a,b), (b,a)}.
27. The Constraint Satisfaction Problem
The number of possible consistent alleles per
non-isolated node is 0, 1 or 2, namely, the
intersection of its adjacent edges' labels. For example, node
2 has all possible alleles, node 6 can only be b
because its value must lie in both {a,b} and {b,d}, and
node 3 can be assigned either a or b.
For each non-singleton connected component: start
with an arbitrary node and pick one of its values.
This dictates all other values in the component.
Repeat with the other value if it has one. So
each non-singleton component yields at most two
solutions. What is the special constraint problem
here?
28. Solution of the CSP
Since each non-singleton component yields at most
two solutions, the likelihood is simply a
product of sums, each of two terms at most. Each
component contributes one factor. Singleton
components contribute the factor 1. In our
example: 1 · [p(a)p(b)p(a) + p(b)p(a)p(b)] ·
p(d)p(b)p(a)p(c).
Complexity: building the founder graph takes
O(f2n). While solving general CSPs is
NP-hard, this one is graph coloring where domains are
often of size 2.
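The component-by-component procedure can be sketched as follows. The function name, the edge-list format, and the allele frequencies are assumptions. The wiring of the example graph is also hypothetical: it is chosen only to be consistent with the statements about nodes 2, 3 and 6 and with the product given above, since the figure itself is not in the transcript.

```python
from math import prod

def founder_graph_probability(nodes, edges, freq):
    """Product over connected components of the sum over their solutions.

    edges is a list of (u, v, (x, y)): nodes u and v must carry alleles
    x and y in some order. freq maps alleles to frequencies.
    """
    adjacency = {u: [] for u in nodes}
    for u, v, label in edges:
        adjacency[u].append((v, label))
        adjacency[v].append((u, label))

    def propagate(start, value):
        """Fix start = value and follow the constraints; None if inconsistent."""
        assignment = {start: value}
        stack = [start]
        while stack:
            u = stack.pop()
            for v, (x, y) in adjacency[u]:
                if assignment[u] not in (x, y):
                    return None
                forced = y if assignment[u] == x else x
                if v not in assignment:
                    assignment[v] = forced
                    stack.append(v)
                elif assignment[v] != forced:
                    return None
        return assignment

    result, seen = 1.0, set()
    for u in nodes:
        if u in seen or not adjacency[u]:
            continue  # isolated nodes sum over all alleles and contribute 1
        candidates = sorted(set(adjacency[u][0][1]))   # at most two values
        solutions = [s for s in (propagate(u, a) for a in candidates)
                     if s is not None]
        if not solutions:
            return 0.0
        seen.update(solutions[0])
        result *= sum(prod(freq[a] for a in s.values()) for s in solutions)
    return result

# Hypothetical wiring: node 2 isolated, {3, 4, 5} and {1, 6, 7, 8} connected.
nodes = range(1, 9)
edges = [(3, 4, ("a", "b")), (4, 5, ("a", "b")),
         (6, 8, ("a", "b")), (6, 7, ("b", "d")), (8, 1, ("a", "c"))]
freq = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
value = founder_graph_probability(nodes, edges, freq)
```

Each component is traversed at most twice, once per candidate value of its starting node, so the work is linear in the size of the founder graph.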