knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE)
The R package clipp
provides a fast and general implementation of the Elston-Stewart
algorithm.
The main function is pedigree_loglikelihood
, which calculates the
pedigree log-likelihood for almost any choice of genetic model. Helper functions
are provided that specify commonly used genetic models, though users
are free to define and use more advanced ones. Combining clipp
with
an optimisation function like mle
allows the user to perform
maximum-likelihood estimation of model parameters, as illustrated below.
clipp
also provides a function, genotype_probabilities
, that calculates
genotype probabilities for a target person within a family, given the family's
observed data.
The function pedigree_loglikelihood
calculates (the logarithm of) any pedigree
likelihood that can be written in Ott's form, as given on page 117 of (Lange, 2002).
In this formulation, the likelihood $L$ for a given family is
$$ L = \sum_{g_1} \dots \sum_{g_n} \prod_{i=1}^n P(x_i \mid g_i) P(g_i \mid g_{m(i)}, g_{f(i)}),$$
where: there are $n$ members of the family, who are labeled by identifiers
$i = 1, \dots, n$; $g_i$ and $x_i$ are the genotype and observed data for
person $i$, respectively; $P(x_i \mid g_i)$ is the conditional probability of
person $i$'s observed data give his or her genotype; $m(i)$ and $f(i)$ are the
identifiers of person $i$'s mother and father, respectively, or missing
if person $i$ is a founder (i.e. if person $i$ does not have parents in the pedigree);
and $P(g_i \mid g_{m(i)}, g_{f(i)})$ is either the population prevalence
of genotype $g_i$ (if person $i$ is a founder) or the conditional probability of
offspring genotype $g_i$ given parental genotypes $g_{m(i)}$ and
$g_{f(i)}$ (if person $i$ is a non-founder). Each sum is over all
possible genotypes at the genetic locus under consideration (or over all
possible multi-locus genotypes if we are modeling two or more genetic loci).
The terms in this likelihood correspond to the main arguments of the function
pedigree_loglikelihood
. The argument dat
specifies the family structure,
i.e. $m(i)$ and $f(i)$ for all family members $i$. The argument geno_freq
specifies the population genotype prevalences, i.e.
$P(g_i \mid g_{m(i)}, g_{f(i)})$ when person $i$ is a founder. The argument trans
specifies the transmission probabilities, i.e. $P(g_i \mid g_{m(i)}, g_{f(i)})$
when person $i$ is a non-founder. Lastly, the penetrance matrix penet
specifies
the relationship between the observed data and the genotypes, i.e. $P(x_i \mid g_i)$.
The transmission probabilities trans
and population genotype prevalences geno_freq
determine the joint probability for any combination of genotypes
within a family, so we say that trans
and geno_freq
define the genetic model
of the analysis. These terms are usually based on known genetic laws (such as
Mendel's laws) and user-specified genotype or allele frequencies.
clipp
provides helper functions to calculate trans
and geno_freq
for
common genetic models. For example, the following code specifies a genetic
model consisting of a single biallelic locus with a minor allele frequency
of 10%.
library(clipp) MAF <- 0.1 geno_freq <- geno_freq_monogenic(p_alleles = c(1 - MAF, MAF)) trans <- trans_monogenic(n_alleles = 2)
We can view a more user-friendly version of the genotype frequencies by using
the option annotate = TRUE
.
geno_freq_monogenic(p_alleles = c(1 - MAF, MAF), annotate = TRUE)
This shows that the alleles at the genetic locus have been given the default names
1
and 2
, and that the possible genotypes are 1/1
, 1/2
and 2/2
.
In this example, the population prevalence of the heterozygous genotype 1/2
is 18%.
The option annotate = TRUE
also reveals a more user-friendly version of the
transmission probabilities.
trans_monogenic(n_alleles = 2, annotate = TRUE)
Here, the rows correspond to the nine possible joint parental genotypes and the
last three columns correspond to the three possible offspring genotypes.
Each number is the conditional probability of the offspring genotype,
given the parental genotypes.
For example, row 3
says that when the mother has genotype 1/1
and the father has genotype 2/2
then the offspring will have genotype 1/2
(with probability 1
). The probabilities in this table are
given by Mendel's laws of genetics. Note that the probabilities in each row
sum to 1
, since the offspring must always have one of the three possible
genotypes, regardless of the parental genotypes.
Now that we've defined a genetic model, we can use the function pedigree_loglikelihood
to calculate the pedigree log-likelihoods of some sample families.
data("dat_small", "penet_small", "dat_large", "penet_large") head(dat_small) head(penet_small)
Each row of dat_small
corresponds to a person in one of 10 families.
Here, dat_small
specifies the family structure for these 10 families, meaning that
dat_small
specifies the mother and father of each person in each family.
Some people, such as person ora002
, have missing parental identifiers
because they are founders (i.e. their parents are not included in the pedigree).
Each person has either both or no parents included in their pedigree, i.e. if
the mother identifier is missing then so is the father identifier, and vice versa.
Also, each person who is mentioned as a parent has a corresponding row, e.g.
person ora009
is listed as the mother of person ora001
, so there must be
a row later in the pedigree with indiv
equal to ora009
.
The columns sex
, aff
, age
and geno
are ignored by the function
pedigree_loglikelihood
, though data like this will usually contribute to the
corresponding pedigree likelihood via the penetrance matrix.
We will soon give examples of calculating penetrance matrices from this sort of observed
data, but for now we simply use the sample penetrance matrix penet_small
.
If there are any identical twins or triplets in the family then we can specify
them using an optional argument monozyg
. For example, to indicate that
ora024
and ora027
are identical twins, and so are aey063
and aey064
,
then we can use the following as the monozyg
argument:
monozyg_small <- list(c("ora024", "ora027"), c("aey063", "aey064"))
We can now calculate the log-likelihoods of the 10 families.
pedigree_loglikelihood(dat_small, geno_freq, trans, penet_small, monozyg = monozyg_small, sum_loglik = FALSE, ncores = 2)
By default, pedigree_loglikelihood
will return the sum of the pedigree
log-likelihoods of all of the families. However, the option sum_loglik = FALSE
above instructs pedigree_loglikelihood
to return the separate
log-likelihoods, e.g.
the above output shows that family ora
has a log-likelihood of -224.0860
.
The option ncores = 2
instructs pedigree_loglikelihood
to perform this
calculation in parallel on two cores, with the log-likelihoods of five families
calculated on one core and the remaining five on another core.
The function pedigree_loglikelihood
can also handle very large families.
For example, the following code takes much less than one minute on a
standard desktop computer to calculate the log-likelihood of a family
with approximately 10,000 family members.
system.time(ll <- pedigree_loglikelihood(dat_large, geno_freq, trans, penet_large)) #> user system elapsed #> 10.64 0.15 10.83 ll #> [1] -18020.99
The argument penet
of pedigree_loglikelihood
specifies the penetrance matrix,
which usually depends on the observed data and some unknown parameters that we
would like to estimate. In this section, we give a simple example of such a
penetrance matrix and then use this to estimate its unknown parameters.
Let's take dat_small
as our sample data, and suppose
we're interested in estimating the log-odds of disease for the three possible
genotypes from previous sections. Here,
the log-odds of disease is log(p/(1-p))
when p
is the probability
of disease.
head(dat_small)
As a toy example, we ignore age and most other data, and take the observed data to
be just the disease affected status aff
. For simplicity, we also assume that
allele 1
is dominant to allele 2
, meaning that the log-odds of disease
are the same for people with genotypes 1/1
and 1/2
. The parameters that we
would like to estimate are therefore the log-odds of disease for genotypes
1/1
and 1/2
(combined) and for genotype 2/2
. We denote these parameters
by logodds1
and logodds2
, respectively.
A penetrance matrix penet
is designed to give a probabilistic connection
between a peron's genotype and his or her observed data. Each row of
penet
corresponds to a person and each column corresponds to one of the
possible genotypes. (Row i
of penet
and row i
of dat
should correspond
to the same person, and the order of the genotypes should be the same for penet
,
geno_freq
and trans
.) The (i, j)
^th^ entry of penet
gives the conditional
probability of person i
's observed data, given that he or she has the j
^th^
genotype. In our example, these conditional probabilities are determined
by the parameters logodds1
and logodds2
. For instance, if person i
has
genotype 1/1
then his or her probability of disease is
prob1 = 1/(1 + exp(-logodds1))
, so if person i
is affected
then the conditional probability of his or her observed data
is prob1
, and if person i
is unaffected then this
conditional probability is 1 - prob1
. Therefore, the (i, 1)
^th^ entry of
penet
, corresponding to person i
and the first genotype (i.e. 1/1
),
is prob1
if person i
is affected and 1 - prob1
if person i
is unaffected.
The following function uses this and similar reasoning to calculate row i
of penet
.
penet.fn <- function(i, logodds1, logodds2) { prob1 <- 1/(1 + exp(-logodds1)) prob2 <- 1/(1 + exp(-logodds2)) penet.i <- c(prob1, prob1, prob2) if (dat_small$aff[i] == 0) penet.i <- 1 - penet.i return(penet.i) }
We can now estimate our parameters of interest, logodds1
and logodds2
, using
maximum likelihood estimation, as implemented in the mle
function of the
stats4
package.
library(stats4) minusll <- function(logodds1 = 0, logodds2 = 0) { penet <- t(sapply(1:nrow(dat_small), penet.fn, logodds1, logodds2)) loglik <- pedigree_loglikelihood(dat_small, geno_freq, trans, penet, monozyg = monozyg_small, ncores = 2) return(-loglik) } minusll() #> [1] 705.6238 fit <- mle(minusll) summary(fit) #> Maximum likelihood estimation #> #> Call: #> mle(minuslogl = minusll) #> #> Coefficients: #> Estimate Std. Error #> logodds1 -1.359962 0.1054336 #> logodds2 -1.313467 6.8597364 #> #> -2 log L: 1030.901
The function minusll
above first calculates the penetrance matrix penet
corresponding to any given values of the parameters logodds1
and logodds2
,
and then uses this matrix and pedigree_loglikelihood
to calculate the
corresponding log-likelihood. The above code then uses mle
to perform
maximum likelihood estimation, giving estimates of -1.36
and -1.31
for logodds1
and logodds2
, respectively.
We will now extend the analysis of the previous section to include known genotypes.
The dataset dat_small
has a column geno
corresponding to known genotypes.
As is often the case with family data, many people in dat_small
do not have
measured genotypes, and so have a blank (""
) in the geno
column.
head(dat_small)
The standard method for incorporating known genotypes into the Elston-Stewart
algorithm is to first calculate the penetrance matrix penet
while ignoring all
genotype data, and then change penet
for people who have measured genotypes
according to the following simple rule. If person i
is known to have the
j
^th^ genotype then penet[i, j]
is unchanged but penet[i, k]
is set to 0
for all k != j
. See later for brief justification for this rule.
For example, to incorporate known genotypes into the analysis of the previous
section, we modify penet.fn
by adding the three lines of code below that are
marked with $\color{darkred}{\text{###}}$.
penet.fn <- function(i, logodds1, logodds2) { prob1 <- 1/(1 + exp(-logodds1)) prob2 <- 1/(1 + exp(-logodds2)) penet.i <- c(prob1, prob1, prob2) if (dat_small$aff[i] == 0) penet.i <- 1 - penet.i if (dat_small$geno[i] == "1/1") penet.i[-1] <- 0 ### if (dat_small$geno[i] == "1/2") penet.i[-2] <- 0 ### if (dat_small$geno[i] == "2/2") penet.i[-3] <- 0 ### return(penet.i) }
We can now use the same code as in the previous section to estimate the log-odds of disease, though our estimates are now based on the known genotypes as well as disease affected statuses.
minusll() #> [1] 788.5003 fit <- mle(minusll) summary(fit) #> Maximum likelihood estimation #> #> Call: #> mle(minuslogl = minusll) #> #> Coefficients: #> Estimate Std. Error #> logodds1 -1.357596 0.08405912 #> logodds2 -1.395321 0.61632248 #> #> -2 log L: 1196.65
The above method for incorporating known genotypes can be justified as follows.
Measured genotypes can differ from actual genotypes due to genotyping
errors, and even when genotyping errors are negligible, we can make a conceptual
distinction between the actual genotype $g_i$ of person i
, and any measured
genotype of person i
, which is part of his or her observed data $x_i$.
The (i, k)
^th^ component penet[i, k]
of the penetrance matrix is the
conditional probability of person i
's observed data $x_i$, given that
his or her actual genotype $g_i$ is the k
^th^ genotype (e.g. 2/2
is
the third genotype, in our running example). When genotyping errors
are negligible, the chance of observing a genotype different from the actual
genotype is 0
. So if person i
is observed to have the j
^th^ genotype then
penet[i, k]
is 0
for all k != j
. Using similar reasoning, non-negligible
genotyping errors and partial genotyping information can also be incorporated
in an analysis.
Given a genetic model and a penetrance matrix, clipp
can also use the
Elston-Stewart algorithm to calculate genotype probabilities for a person of
interest, using the function genotype_probabilities
. For example, we can
calculate the genotype probabilities for individual ora008
in the family ora
,
as follows.
head(dat_small) genotype_probabilities(target = "ora008", dat_small, geno_freq, trans, penet_small, monozyg_small)
This shows that person ora008
has a 19.8%
chance of having the first genotype
(1/1
), given the inputs (family structure, genetic model and penetrance matrix).
In this calculation, the penetrance matrix penet_small
depends on all of the
observed data. If we instead wanted genotype probabilities that are based only
on the family structure and observed genotypes then we could use the following code.
penet.fn <- function(i) { penet.i <- rep(1, 3) if (dat_small$geno[i] == "1/1") penet.i[-1] <- 0 if (dat_small$geno[i] == "1/2") penet.i[-2] <- 0 if (dat_small$geno[i] == "2/2") penet.i[-3] <- 0 return(penet.i) } penet <- t(sapply(1:nrow(dat_small), penet.fn)) genotype_probabilities(target = "ora008", dat_small, geno_freq, trans, penet, monozyg_small)
Many statistical analyses of family data can be performed by the
Elston-Stewart algorithm combined with maximum likelihood estimation,
for some choice of genetic model and penetrance matrix. We have given
simple examples of segregation analyses to estimate disease risk. These
can be modified slightly and combined with a genetic model based on phased
genotypes (as given by clipp
's functions geno_freq_phased
and trans_phased
)
to investigate parent-of-origin effects. With some ingenuity, the user can also
use clipp
to investigate other interesting genetic models, such as
linked loci and non-standard transmission probabilities, or whatever genetic
model can be imagined.
Ott's form for the pedigree likelihood assumes that the observed data of
different family members is conditionally independent, given their genotypes.
However, this assumption can be weakened for a genetic locus of interest, by
adding an unmeasured polygene into the genetic model
(using functions like trans_polygenic
and combine_loci
).
General references for the Elston-Stewart algorithm are (Elston & Stewart, 1971),
(Lange & Elston, 1975) and (Cannings et al., 1978).
Up to logarithmic factors, the complexity of the algorithm is O(n * m)
,
where n
is the number of people in the family and m
is the number of
possible genotypes.
Cannings C, Thompson E, Skolnick M. Probability functions on complex pedigrees. Advances in Applied Probability, 1978;10(1):26-61.
Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Hum Hered. 1971;21(6):523-542.
Lange K. Mathematical and Statistical Methods for Genetic Analysis (second edition). Springer, New York. 2002.
Lange K, Elston RC. Extensions to pedigree analysis I. Likehood calculations for simple and complex pedigrees. Hum Hered. 1975;25(2):95-105.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.