knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
LBL (Logistic and Quantitative Bayesian LASSO) (@biswas2012logistic, @wang2014famlbl, @zhou2019clbl) is a Bayesian genetic association test aimed at detecting association between rare haplotypes (which could be formed by common SNPs) and diseases. There are currently three software that handle different types of study designs for binary traits, and one software that is designed for quantitative traits while addressing for population stratification.
The three software that handle different types of study designs are: one for independent case-control data (LBL, @biswas2012logistic), one for case-parent triad (family trio) data (famLBL, @wang2014famlbl) and one for a combination of data from the two designs (cLBL, @zhou2019clbl). The software designed for quantitative traits uses principal components (PCs) to adjust for population stratification.
LBLs take genotype and phenotype data as input, and provide statistical inferences of the effect of each haplotype on the phenotype based on the Markov Chain Monte Carlo samples from the posterior distribution.
The rest of the vignette is structured as follows:
In this section, we provide a short description of the LBL methodology formulation. For binary traits, the likelihood portions are formulated with retrsopectively likelihoods, and are connected with disease model via a logistic link. For quantitative traits, we consider the prospective likelihood. The priors of LBL methods include a double exponential distribution on the haplotyic effect, penalizing the coefficients of non-effective haplotypes, so that effective haplotypes (such as the rare haplotypes with large size) will stand out in the association analysis. Monte Carlo Markov Chain algorithm (MCMC, @metropolis1953equation, @hastings1970monte, and @geman1984stochastic) is used to sample from the posterior distribution.
In the following, we first discuss the likelihoods for all LBL methods, and then the priors, computation, and the inferences based on posterior samples. The likelihood formulations of LBLs share some similarities. All possible haplotypes compatible with observed genotypes are obtained from hapassoc. The priors for the parameters are the same for all LBL methods. The posterior samples of each LBL can be obtained via MCMC, and posterior inferences can be carried out once the chain has converged. For a more detailed discussion of each method, see the corresponding papers.
Consider a case-control design with $n$ total individuals ($n_1$ cases and $n_2=n-n_1$ controls), who are allgenetically independent of each other and ethnically homogeneous. For each individual $i$, let $G_i$ be the observed genotype, $Z_i$ be the unobserved halotype pair for the $i$-th individual (which can be inferred from $G_i$), $Y_i$ be the binary case-control status of the $i$-th individual, then the complete retrospective likelihood is:
$$L_{cc}=\prod_{i=1}^{n_1}P(Z_i|Y_i=1,\Psi) \prod_{i=n_1+1}^{n}P(Z_i|Y_i=0,\Psi)$$
where $\phi$ is a collection of parameters, including individual haplotype effect ($\beta$), haplotype frequencies ($\mathbf{f}$) and other hyperparameters.
FamLBL has a similar formulation to LBL. For each family $j, j=1,2,\ldots,m$, consider a "matched pair" design where each (affected child, father, mother) trio is decomposed into a pair of haplotypes transmitted to the offspring ($Z_{jc}$), and a pair of haplotypes not transmitted to the offspring ($Z_{ju}$). The pair not transmitted can be considered as a pseudo control. Similarly, let $G_{jc}$ and $G_{ju}$ be the corresponding genotypes transmitted or not transmitted. Let $Y_{jc}$ denote the disease status of the offspring ($Y_{jc} = 1$). Then, the likelihood can be formulated as,
$$L_{t}=\prod_{j=1}^{m} P(Z_{jc}|Y_{jc}=1,\Psi) \times P(Z_{ju}|\Psi) $$
cLBL combines the independent case-control design and the case-parent triad design. With the same notations as in LBL and famLBL, the likelihood is the product of the two previous likelihoods:
$$L_{comb}=L_{cc}(\Psi) \times L_{t}(\Psi)= \prod_{i=1}^{n_1}P(Z_i|Y_i=1,\Psi) \prod_{i=n_1+1}^{n}P(Z_i|Y_i=0,\Psi) \prod_{j=1}^{m} \left{P(Z_{jc}|Y_{jc}=1,\Psi) \times P(Z_{ju}|\Psi) \right} $$
For individuals $i=1,2,\cdots,n$, let $Y_i$ be the quantitative trait value, $\pmb{C}i$ be a vector of non-genetic covariates (such as age and sex) and $\pmb{R}_i$ be a vector of PC scores of length $L$ to be added to the model. The PCs are calculated from a large quantity of null markers to represent the genetic ancestry background for each individual. We consider the following complete data likelihood, \begin{equation} \begin{split} L(\pmb{\Psi})&=\prod{i=1}^n P(Y_i,Z_i \mid \pmb{G}i,\pmb{C}_i,\pmb{R}_i, \pmb{\Psi})\ &=\prod{i=1}^n P(Y_i\mid Z_i,\pmb{C}_i,\pmb{R}_i,\pmb{\beta}) P(Z_i\mid \pmb{G}_i, \pmb{\zeta}), \end{split} \end{equation} where $\pmb{\Psi}=(\pmb{\beta},\pmb{\zeta})$ consists of the model parameters $\pmb{\beta}$ and parameters $\pmb{\zeta}$ for modeling haplotype frequencies to be explained in the next section.
The aforementioned set of parameters $\phi$ include the following parameters:
$\beta_l, l=1,2,\ldots,k-1$, the effect of haplotype $l$ compared to the baseline haplotype. This is the parameter of interests.
$\alpha$, the effect of the baseline haplotype.
$\lambda$, the hyperparameter controlling the degree of penalty for $\beta_l$.
$\mathbf{f} = (f_1,f_2,\ldots,f_k)$, the frequency distribution of all haplotypes present in the dataset.
$Z = (h_l,h_{l'})$, the unobserved haplotype pair an individual has.
$a_z$, the probability of an individual having haplotype pair $Z=(h_l,h_{l'})$.
$d$, the inbreeding coefficient. $d=0$ would indicate the population is in Hardy-Weinberg Equilibrium. $d > 0$ would indicate an inbreeding populating and $d < 0$ indicate an outbreeding population.
Next we detail the parameters.
We connect $\beta_l$'s with the likelihood through a logistic model. let $\theta$ be the odds of disease given a specific haplotype pair $Z$ (i.e., $\theta=P(Y=1 \mid Z) / P(Y=0 \mid Z)$), then, we model the log odds ratio $\theta$ as,
$$\log \theta = \alpha + X\beta $$
where $X$ is a row vector and each $X$ is the design vector associated with haplotype pair $Z$, and $\alpha$ is log odds of the pre-selected baseline haplotype.
It is worth noting that each $\beta_l$ measures the effect of haplotype $l$ in contrast of the baseline haplotype. Therefore, choosing different baseline haplotypes might result in different $\beta$ values. Only selecting a baseline that is not associated with the disease (i.e., $\alpha = 0$) will yield a correct interpretation. Choosing a haplotype that is associated with the disease might lead to loss of power in detecting other associated haplotypes and false positives. Therefore, one needs to take extra care when choosing the baseline haplotype. One way to avoid such scenarios is to use different baseline haplotypes. By default, the most frequent haplotype is chosen as the baseline.
Let $\mathbf{f}= (f_1, f_2, \ldots, f_k)$ denote frequency distribution of $k$ distinct haplotypes. And let $a_z(\mathbf{f},d)$, the frequency of an individual with a specific haplotype pair $Z=(h_l,h_{l'})$ be modelled as:
$$a_z(\mathbf{f},d) = \left{ \begin{array} {rr} f_l^2+df_l(1-f_l) & \mbox{if } h_l=h_{l'} \ 2(1-d)f_l f_{l'} & \mbox{if } h_l \neq h_{l'} \ \end{array} \right., $$
where $d$ is the within-populatin inbreeding coefficient. $d=0$ denotes Hardy-Weinberg Equilibrium, $d>0$ denotes excessive inbreeding and $d<0$ denotes outbreeding. This allows us the freedom away from the assumption of Hardy-Weinberg equilibrium, as the effect of inbreeding/outbreeding can be modeled with $d$. When $d=0$, the model is assuming HWE in the population.
The set of parameters $\Psi$ include the following parameters:
$\beta_l, l=1,2,\ldots,q$, the effect of haplotypes excluding the baseline, covariates and PCs. This is the parameter of interests.
$\alpha$, the effect of the baseline haplotype.
$\lambda$, the hyperparameter controlling the degree of penalty for $\beta_l$.
$\mathbf{f} = (f_1,f_2,\ldots,f_k)$, the frequency distribution of all haplotypes present in the dataset.
$Z = (h_l,h_{l'})$, the unobserved haplotype pair an individual has.
$d$, the inbreeding coefficient. $d=0$ would indicate the population is in Hardy-Weinberg Equilibrium. $d > 0$ would indicate an inbreeding populating and $d < 0$ indicate an outbreeding population.
$\sigma^2$, the random error term for the linear model.
Given $Z_i, \pmb{C}i, \pmb{R}_i$ and $\pmb{\beta}$, $Y_i$ is assumed to follow the normal distribution, \begin{equation} Y_i\mid Z_i,\pmb{C}_i,\pmb{R}_i,\pmb{\beta}\sim N(\pmb{X}_i^T\pmb{\beta},\sigma_e^2), \end{equation} where $\pmb{X}_i^T=(1,\pmb{x}{Hi}^T,\pmb{x}{Ci}^T,\pmb{x}{Ri}^T)$ is a row vector of model covariates. $\pmb{x}{Hi}^T=(x_1,\cdots,x{S-1})$ is a vector of haplotype copies, where $x_{s}$ denotes the number of copies for haplotype $h_l$. $\pmb{x}{Ci}^T$ and $\pmb{x}{Ri}^T$ are vectors for covariates and PCs, respectively. If discrete covariate exists, $\pmb{x}_{Ci}^T$ includes a collection of dummy variables corresponding to that discrete covariate. Suppose the length of $\pmb{X}_i^T$ is $q+1$.
For the second term $P(Z_i\mid \pmb{G}i, \pmb{\zeta})$, notice that $P(Z_i\mid \pmb{G}_i, \pmb{\zeta})=0$ if a haplotype pair is not compatible with the observed genotypes. Therefore, we only need to consider all possible haplotype pairs that are compatible with $\pmb{G}_i$. For those compatible haplotype pairs, $P(Z_i\mid \pmb{G}_i, \pmb{\zeta})$ does not depend on $\pmb{G}_i$ and is modeled by, \begin{equation} P(Z_i=h_l/h_l'\mid \pmb{\zeta})=\delta{ll'}df_l+(2-\delta_{ll'})(1-d)f_lf_{l'}, \end{equation} where $\pmb{\zeta}=(\pmb{f},d)$, $\pmb{f}$ is a vector of haplotype frequencies, $d\in (-1, 1)$ is the within-population inbreeding coefficient, and $\delta_{ll'}=1 (0)$ if $h_l=h_l' (h_l\ne h_l')$. Then the complete likelihood can be expressed as, \begin{equation} L(\pmb{\Psi})=\prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma_e^2}}\exp\Big{-\frac{(Y_i-\pmb{X}i^T\pmb{\beta})^2}{2\sigma_e^2}\Big} \big{\delta{ll'}df_l+(2-\delta_{ll'})(1-d)f_lf_{l'} \big}. \end{equation}
To penalize unassociated haplotype or PC effects and reduce dimension, double exponential (Laplace) distribution is used as the prior distribution for each $\beta_l$,
$$\pi(\beta_l \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda\mid\beta_l\mid\right)$$
The hyperparameter $\lambda$ controls the level of shrinkage. A larger value of $\lambda$ indicates more shrinkage.
Instead of picking a fixed $\lambda$, we let $\lambda$ follow a Gamma$(a,b)$ distribution with pdf
$$\pi(\lambda) = b^a\Gamma(a)^{-1} \lambda^{a-1} \exp{(-b\lambda)}$$
For the parameters involved in frequency calculation, we use Dirichlet(1,1,...,1) distribution as the prior distribution for haplotype frequency distribution parameters $\mathbf{f}$. The prior distribution for the inbreeding coefficient $d$ is set as unif$(\max_l \lbrace-f_l/(1-f_l)\rbrace,1)$.
For each individual $i$, we assign discrete uniform priors to all haplotypes compatible with the observed genotype. Therefore during each iteration, the haplotype will get updated according to the likelihood of each compatible haplotype pair.
For QBLstrat only, we consider the Inverse-Gamma$(a_e,b_e)$ prior for $\sigma^2$ with pdf $$\pi(\sigma^2)=\frac{b_e^{a_e}}{\Gamma(a_e)}\sigma^{2(a_e-1)}\exp(-\frac{b_e}{\sigma^2})$$
Once the Markov Chain has converged, one can carry out inference based on posterior samples. The package includes built-in functions for inference based on posterior samples of $\beta$, providing estimates for OR (binary traits only), CI and Bayes Factor.
Bayes Factor is defined as the ratio between posterior odds and prior odds.
Since the prior and posterior distirbutions for all $\beta_l$'s are both continuous, we cannot directly calculate the prior or posterior odds of $|\beta_l| = 0$. So, we opt to test $H_0: |\beta_l| \le \epsilon$ where $\epsilon$ is a pre-defined a small number. The odds is calculated as $P(|\beta| > \epsilon) / P(|\beta| \leq \epsilon)$ for both posterior and prior distributions. Then BF is the ratio between the two odds.
If all posterior $\mid\beta_l\mid$ exceed $\epsilon$, then we set BF = 999 for computational considerations.
We also provide an odds ratio (OR) estimate based on posterior sample mean for binary traits, and a 95% credible interval (CI) estimate based posterior samples for all types of traits.
All LBL algorithms take some common input (genotypes, phenotypes, starting parameters, etc). First we detail those parameters, and then we follow up with examples for all three algorithms with a simulated dataset.
LBL takes data in pedigree format, regardless of the type of the design. The objects should be either a matrix or a data frame, consisting of $n$ rows ($n$ = number of individuals) and $6 + 2\times p$ columns ($p$= number of SNPs). The first 6 columns of the data describe the pedigree relationship and the phenotype of the individual, and the last $2\times p$ columns describe the genotype information of the individual, with each marker taking up 2 columns. The genotype data can be either alphabetic or numeric.
The first 6 columns of the dataset should consist of:
More information about the format can be found here.
The LBL package includes two example datasets: fam
includes 250 case-parent trios, while cac
includes 250 independent cases and 250 independent controls. Both datasets consist of 5 no-recombining SNPs. Below is a look of the beginning of these datasets.
library(LBL) data(cac) data(fam) head(fam) head(cac)
Note that for case-control data, father ID and mother ID are both 0.
There are some other parameters that need to be specified for the MCMC algorithm. They are:
starting values: providing a starting values for the MCMC.
a, b: the hyperparameters for $\lambda$, which controls the shrinkage effect of $\beta$. See Priors section for details. Different values of $a$ and $b$ have some effect on the outcome, the details can be found in paper.
e: the number $\epsilon$ used as a cutoff as if $\beta$ can be treated as 0. The default is 0.1. See Inferences on Posterior Samples section for details.
The QBLstrat function is designed for rare haplotype effects estimation for quantitative traits. The data input should be either a matrix or data frame, consisting of $n$ rows ($n$ = number of individuals) and $1 + n.cov + 2 \times p$ (allelic format) or $1 + n.cov + p$ (genotypic format) columns ($n.cov$= number of covariates and PCs, $p$= number of SNPs). The first column is the trait, followed by the columns of covariates and PCs. If in allelic format, the other 2*p columns are allels with one column for each allele of the single-locus genotypes. If in genotypic format, the other p columns are genotypes with one column for each SNP.
The LBL package includes one example dataset: QBLstratData consisting of 5 PCs and 5 SNPs for 960 individuals. Below is a look of the beginning of these datasets.
library(LBL) data(QBLstratData) head(QBLstratData)
LBL is the original version of logistic Bayesian LASSO that detects association between common diseases and rare haplotypes. It analyzes independent case-control data. In the LBL package, the corresponding function is LBL.
The procedure below provides a simple example of running LBL on dataset cac. cac is a sample input composed of case-control data. Note that the data is in pedigree format where the first 6 columns are: family ID, individual ID, father ID, mother ID, sex, and phenotype. Since the cases and controls are required to be independent, the family IDs of the individuals are all different. The last $2 \times p$ columns represent the genotype information of the $p$ SNPs. In this example, $p=5$.
By default, the LBL function will return a list of haplotype names (haplotypes), haplotype frequencies (freq), odds ratios (OR), credible intervals of odds ratio (OR.CI), and Bayes factors (BF). For haplotypes and freq, the last value corresponds to the baseline haplotype whose OR, OR.CI, and BF cannot be calculated. If better output summary is preferred, the user can save the outcome list from LBL and call the print_LBL_summary function. Significant haplotypes will be indicated with *+ (risk) or *- (protective).
LBL can also return the entire posterior samples for all parameters. To acquire the entire samples, just set the summary parameter of LBL to be FALSE.
library(LBL) head(cac) set.seed(1) LBL.obj<-LBL(cac,burn.in = 40000,num.it = 70000,summary = T) LBL.obj print_LBL_summary(LBL.obj)
famLBL is the logistic Bayesian LASSO that uses case-parent triad (family trio) data to detect rare haplotype effects. In the LBL package, the corresponding function is famLBL.
The procedure below provides a simple example of running famLBL on dataset fam. fam is a sample input composed of case-parent triad data. Again, the data is in pedigree format where the first 6 columns are: family ID, individual ID, father ID, mother ID, sex, and phenotype. Since the data are of case-parent triad, every three individuals share the same family ID. Within the same family, the affected child's father ID will be the father's individual ID; the affected child's mother ID will be the mother's individual ID. Again, the last $2 \times p$ columns represent the genotype information of the $p$ SNPs. In this example, $p=5$.
By default, the famLBL function will return a list of haplotype names (haplotypes), haplotype frequencies (freq), odds ratios (OR), credible intervals of odds ratio (OR.CI), and Bayes factors (BF). For haplotypes and freq, the last value corresponds to the baseline haplotype whose OR, OR.CI, and BF cannot be calculated. If better output summary is preferred, the user can save the outcome list from famLBL and call the print_LBL_summary function. Significant haplotypes will be indicated with *+ (risk) or *- (protective).
famLBL can also return the entire posterior samples for all parameters. To acquire the entire samples, just set the summary parameter of famLBL to be FALSE.
library(LBL) head(fam) set.seed(1) famLBL.obj<-famLBL(fam,burn.in = 40000,num.it = 70000,summary = T) famLBL.obj print_LBL_summary(famLBL.obj)
cLBL is the latest logistic Bayesian LASSO that detects association between common diseases and rare haplotypes. It analyzes case-control and case-parent triad data simultaneously and thus take advantage of the larger sample size from the combined data. In the LBL package, the corresponding function is cLBL.
The procedure below provides a simple example of running cLBL on dataset cac and fam. The first and the second parameters required from cLBL are case-parent triad and case-control data, respectively. These two dataset should be both in pedigree format. The rest parameter settings of cLBL are similar to those of LBL and famLBL.
By default, the cLBL function will return a list of haplotype names (haplotypes), haplotype frequencies (freq), odds ratios (OR), credible intervals of odds ratio (OR.CI), and Bayes factors (BF). For haplotypes and freq, the last value corresponds to the baseline haplotype whose OR, OR.CI, and BF cannot be calculated. If better output summary is preferred, the user can save the outcome list from cLBL and call the print_LBL_summary function. Significant haplotypes will be indicated with *+ (risk) or *- (protective).
cLBL can also return the entire posterior samples for all parameters. To acquire the entire samples, just set the summary parameter of cLBL to be FALSE.
library(LBL) head(cac) head(fam) set.seed(1) cLBL.obj<-cLBL(fam,cac,burn.in = 40000,num.it = 70000,summary = T) cLBL.obj print_LBL_summary(cLBL.obj)
QBLstrat is the Quantitative Bayesian LASSO that detects rare haplotype effects with a quantitative trait while using PCs to adjust for population stratification. In the LBL package, the corresponding function is QBLstrat.
The procedure below provides a simple example of running QBLstrat on dataset QBLstratData. QBLstratData is a sample input consisting of a quantitative trait, 5 PC scores and phenotype for 960 individuals. Again, the last $2 \times p$ columns represent the genotype information of the $p$ SNPs. In this example, $p=5$.
The QBLstrat function will return a list of BF (BF), coefficient estimates (beta), credible intervals of coefficient estimates (CI.beta), credible interval of $\lambda$ (CI.lambda), credible interval of $d$ (CI.D) and haplotype frequencies (freq). For freq, the last value corresponds to the baseline haplotype whose effects are not of interest.
library(LBL) head(QBLstratData) set.seed(1) QBLstrat.obj<-out<-QBLstrat(QBLstratData, numSNPs=5, cov=5) QBLstrat.obj
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.