knitr::opts_chunk$set(echo = TRUE)
The purpose of this document is to develop the math needed to modify R/qtl2scan so that the kinship object can be decomposed once and used multiple times. The challenge is that the current setup decomposes the kinship matrix for each reduced set of data, chosen for instance to eliminate individuals with missing data for a phenotype, or to restrict to a subset of individuals to speed computation.
In summary, the basic model is
$$y \sim N(X\beta,V), \quad V =\sigma_g^2K + \sigma^2I = \sigma^2(\gamma K+I)~,$$
with eigen decomposition (SVD) of $K = UDU^{\text{T}}$. For subsets of individuals, $y_1$ and $X_1$ are the corresponding subsets of $y$ and of the rows of $X$, respectively, with $U_1$ the matching horizontal subset (rows) of the eigenvector matrix $U$, cutting against the grain of the eigenvectors for the full kinship matrix $K$. Rather than redo the SVD on the subset kinship matrix ($K_{11}$), we can do a Householder (QR) decomposition, $CU_1^{\text{T}} = FG=F_1G_1~.$ Premultiplying by $G_1^{-\text{T}}$, so that $y_1^* = G_1^{-\text{T}}y_1$ and $X_1^* = G_1^{-\text{T}}X_1$, reduces the problem to ordinary least squares: $$y_1^* \sim N(X_1^*\beta,\sigma^2 I_1)~.$$
Thus, to solve the sub-problem, we use the SVD for all the data, divide the eigenvector matrix $U$ horizontally, and do a QR decomposition of $CU_1^{\text{T}}$ to obtain $G_1^{-\text{T}}$ for premultiplying $y_1$ and $X_1$. Some care will need to be taken with the condition numbers of the matrices, but that is already incorporated into the R/qtl2scan routines.
The basic math for complete data is as follows.
$$y = X\beta + g + e~, \quad g\sim N(0,\sigma_g^2 K), \quad e\sim N(0,\sigma^2 I)$$
with $\mu_q=X\beta$ = QTL effects (and any other fixed effects), $g$ = polygenic effects (random), $e$ = unexplained variation (random), $K$ = kinship matrix and $I$ = identity matrix (1s on the diagonal, 0s off the diagonal). Put another way, the distribution of phenotypes is
$$y \sim N(X\beta,V), \quad V =\sigma_g^2K + \sigma^2I = \sigma^2(\gamma K+I)~.$$
with $\gamma = \sigma_g^2 / \sigma^2$. The MLEs are found by iterating (similar to the EM idea): get the MLE of $\beta$ given $V$, $\hat{\beta}=(X^{\text{T}}V^{-1}X)^{-1}X^{\text{T}}V^{-1}y$, then estimate $\sigma_g^2$ and $\sigma^2$ given $\hat{\beta}~.$
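For concreteness, here is a minimal sketch of the GLS step for a fixed $V$, using generic names (y, X, V); this is not the qtl2scan implementation.

gls_beta <- function(y, X, V) {
  # GLS estimate of beta for a fixed covariance matrix V
  Vinv <- solve(V)
  solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
}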
Rather than working with $V$ directly, the $K$ matrix is decomposed using SVD as
$$K = UDU^{\text{T}}$$
with $U$ orthonormal (the eigenvector columns are orthonormal, so $U^{\text{T}}U=UU^{\text{T}}=I$). The $D$ matrix is diagonal, with diagonal entries the eigenvalues $d_i$. We can write $V$ as
$$V = \sigma^2 (\gamma UDU^{\text{T}} +I)=\sigma^2 UC^2U^{\text{T}}~, \quad C^2=\gamma D+I~,$$
with $C$ being diagonal with entries $c_i=(1+\gamma d_i)^{1/2}~.$ We transform the problem by left-multiplying $y$ and $X$ by $C^{-1}U^{\text{T}}$ to yield an i.i.d. setting, $y^* \sim N(X^*\beta,\sigma^2I)~,$ with
$$y^*=C^{-1}U^{\text{T}}y, \quad X^*=C^{-1}U^{\text{T}}X~, \quad \sigma^2I=C^{-1}U^{\text{T}}VUC^{-1}~.$$
Then we solve the (unweighted) normal equations,
$$\hat{\beta}=(X^{*\text{T}}X^*)^{-1}X^{*\text{T}}y^* = (X^{\text{T}}UC^{-2}U^{\text{T}}X)^{-1}X^{\text{T}}UC^{-2}U^{\text{T}}y~,$$
with predicted values
$$\hat{y}=X\hat{\beta}=UCX^*(X^{*\text{T}}X^*)^{-1}X^{*\text{T}}y^* = X(X^{\text{T}}UC^{-2}U^{\text{T}}X)^{-1}X^{\text{T}}UC^{-2}U^{\text{T}}y~.$$
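As a compact sketch of this transformation and the OLS solve (base R only, not the package code), assuming $U$ and $D$ from the SVD of $K$, a value of gamma, and data y and X:

lmm_ols_full <- function(U, D, gamma, y, X) {
  C <- sqrt(1 + gamma * D)
  ystar <- (1 / C) * crossprod(U, y)   # C^{-1} U^T y
  Xstar <- (1 / C) * crossprod(U, X)   # C^{-1} U^T X
  betahat <- solve(crossprod(Xstar), crossprod(Xstar, ystar))
  list(beta = betahat, fitted = X %*% betahat)
}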
If some phenotypes are missing, or we compute with a subset of individuals, the problem needs to be refactored. Suppose the $n$-vector of phenotypes $y$ is subdivided as $y=(y_1,y_2)$, with key focus on the $n_1$ phenotypes in $y_1$. Rather than reduce the size of the problem and do another SVD, consider a diagonal matrix $B$ with $n_1$ 1s and $n_2=n-n_1$ 0s down the diagonal. In other words, with $I_1$ an identity matrix of size $n_1$ and blocking matrix $B = \begin{bmatrix} I_1 & 0 \\ 0 & 0 \end{bmatrix}~,$ $By = (y_1,0)$. Note that $B^{\text{T}} = B^2 =B$, but $B$ does not have an inverse. Formally, we write
$$B y \sim N(B X\beta,BVB), \quad BVB =\sigma^2(\gamma BKB + B)=\sigma^2BUC^2U^{\text{T}}B~.$$
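A toy check of these properties of $B$ (a made-up 3-by-3 example with $n_1=2$, $n_2=1$):

B <- diag(c(1, 1, 0))
all.equal(B, t(B))      # B^T = B
all.equal(B, B %*% B)   # B^2 = B
qr(B)$rank              # rank 2 < 3, so B has no inverse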
If we partition $K$ (and similarly $UC^2U^{\text{T}}$) into four parts, we have
$$K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}~, \quad BKB = \begin{bmatrix} K_{11} & 0 \\ 0 & 0 \end{bmatrix}$$ and $$UC^{2}U^{\text{T}} = \begin{bmatrix} U_1C^{2}U_1^{\text{T}} & U_1C^{2}U_2^{\text{T}} \\ U_2C^{2}U_1^{\text{T}} & U_2C^{2}U_2^{\text{T}} \end{bmatrix}~, \quad BVB = \sigma^2BUC^{2}U^{\text{T}}B = \sigma^2\begin{bmatrix} U_1C^{2}U_1^{\text{T}} & 0 \\ 0 & 0 \end{bmatrix}~.$$
That is, partition the transpose of the eigenvector matrix as $U^{\text{T}} = (U_1^{\text{T}}, U_2^{\text{T}})~.$ Note the relations: $I=U_1^{\text{T}}U_1+U_2^{\text{T}}U_2$, $U_1U_2^{\text{T}}=0$, $U_2U_1^{\text{T}}=0$, $U_1U_1^{\text{T}}=I_1$ and $U_2U_2^{\text{T}}=I_2$, with $I_j$ being an identity matrix of size $n_j$. The model part we are interested in is
$$y_1 \sim N(X_1\beta,V_{11}), \quad V_{11} =\sigma^2(\gamma K_{11} + I_1)=\sigma^2U_1C^2U_1^{\text{T}}~.$$
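The partition identities above are easy to check numerically with a small made-up orthogonal matrix (the names below are invented for illustration and are not reused later):

set.seed(1)
Utoy <- qr.Q(qr(matrix(rnorm(25), 5, 5)))   # a 5 x 5 orthogonal matrix
U1toy <- Utoy[1:3, ]                        # rows kept
U2toy <- Utoy[4:5, ]                        # rows dropped
max(abs(crossprod(U1toy) + crossprod(U2toy) - diag(5)))   # U1'U1 + U2'U2 = I
max(abs(U1toy %*% t(U2toy)))                              # U1 U2' = 0
max(abs(U1toy %*% t(U1toy) - diag(3)))                    # U1 U1' = I_1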
Rather than perform a new SVD on $K_{11}$, it is more efficient to do a QR decomposition: QR requires $O(n_1n^2-n^3/3)$ operations vs. $O(n_1n^2+n^3)$ for the SVD, so QR should be much quicker (see the Princeton CS notes).
Consider a Householder (QR) decomposition of $CU_1^{\text{T}} = FG=F_1G_1$ with $G_1$ upper triangular and $FF^{\text{T}}=F^{\text{T}}F=I$, $F=(F_1,F_2)$, $G^{\text{T}}=(G_1^{\text{T}}, 0)$. Also, $F_1F_1^{\text{T}}+F_2F_2^{\text{T}}=I$ and $F_1^{\text{T}}F_1=I_1$.
Hence, $V_{11} = \sigma^2G_1^{\text{T}}F_1^{\text{T}}F_1G_1 = \sigma^2G_1^{\text{T}}G_1$. Thus, the model we need to solve is $y_1^* \sim N(X_1^*\beta,\sigma^2I_1)$, with
$$y_1^* = G_1^{-\text{T}}y_1, \quad X_1^* = G_1^{-\text{T}}X_1$$ and the solution for $\beta$ is:
$$\hat{\beta} = (X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = (X_1^{\text{T}}G_1^{-1}G_1^{-\text{T}}X_1)^{-1}X_1^{\text{T}}G_1^{-1}G_1^{-\text{T}}y_1~.$$
That is, we need to take the current value of $\gamma$, do a Householder decomposition on the reduced matrix $CU_1^{\text{T}}$, and solve OLS.
Note that the OLS solution requires another QR decomposition:
$$X_1^* = G_1^{-\text{T}}X_1=QR=Q_1R_1~, \quad Q^{\text{T}}Q=I~, \quad Q_1^{\text{T}}Q_1=I_1~.$$ $$\hat{\beta} = (X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = R_1^{-1}Q_1^{\text{T}}G_1^{-\text{T}}y_1~,$$ and for prediction,
$$\hat{y}_1=X_1\hat{\beta}=G_1^{\text{T}}X_1^*\hat{\beta} = G_1^{\text{T}}X_1^*(X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = G_1^{\text{T}}Q_1Q_1^{\text{T}}G_1^{-\text{T}}y_1~.$$
To summarize, there is an SVD of $K=UDU^{\text{T}}$ done once for the project and stored. There are two QR decompositions for each subset of individuals: one of $CU_1^{\text{T}}=FG=F_1G_1$, with $C^2=\gamma D+I$ depending on $\gamma = \sigma_g^2 / \sigma^2$ and $U_1$ being the subset of rows of $U$ for the included individuals; and one of $G_1^{-\text{T}}X_1=QR=Q_1R_1$. Both QR decompositions need to be done many times.
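This recipe can be collected into a single helper. The sketch below is a proof of concept in base R only; the function name and argument layout are made up (rows denotes the indices of the retained individuals), and this is not qtl2scan code.

subset_lmm_ols <- function(U, D, gamma, rows, y1, X1) {
  C <- sqrt(1 + gamma * D)
  CU1t <- C * t(U[rows, , drop = FALSE])                 # C U_1^T (n x n_1)
  n1 <- length(rows)
  qr1 <- qr(CU1t)                                        # first QR: C U_1^T = F G
  G1 <- qr.qty(qr1, CU1t)[seq_len(n1), , drop = FALSE]   # upper-triangular G_1
  y1s <- backsolve(G1, y1, transpose = TRUE)             # y_1^* = G_1^{-T} y_1
  X1s <- backsolve(G1, X1, transpose = TRUE)             # X_1^* = G_1^{-T} X_1
  qr2 <- qr(X1s)                                         # second QR: X_1^* = Q R
  betahat <- qr.coef(qr2, y1s)                           # OLS estimate of beta
  yhat1 <- crossprod(G1, qr.fitted(qr2, y1s))            # X_1 betahat
  list(beta = betahat, fitted = yhat1)
}

With the objects built in the next section, subset_lmm_ols(U, D, gamma, ind2keep, dat1$y, as.matrix(dat1$x)) should agree with the subset estimates computed below.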
This section goes through, in gory detail, the proof-of-concept calculations with a basic model having kinship $K$ from about 500 mice (say, the LOCO kinship for chr 16), $X$ being one $N(0,1)$ predictor, and $\beta=\sigma=\sigma_g=1$. After going through the full-data case, verifying that SVD and QRD agree, we examine a subset of 250 mice and show that the SVD on $K_{11}$ and the QRD on $CU_1^{\text{T}}$ agree. Note that there are many "verify" steps here to ensure that matrices are what they are supposed to be, as it is easy (!) to make mistakes with this matrix algebra.
suppressPackageStartupMessages({
  library(qtl2geno)
  library(qtl2scan)
  library(qtl2ggplot)
})
datapath <- "~/Documents/Research/attie_alan/DO/data/DerivedData/"
kinship <- readRDS(file.path(datapath, "kinship.rds"))
kinship <- kinship[["16"]]
nind <- nrow(kinship)
system.time(Ke <- qtl2scan::decomp_kinship(kinship))
D <- Ke$values
U <- t(Ke$vectors)
Verify that $K=UDU^{\text{T}}$.
summary(c(kinship - U %*% diag(D) %*% t(U)))
Parameter setup.
gamma <- 1
sigma <- 1
beta <- 1
C <- sqrt(1 + gamma * D)
$$y = X\beta + g + e$$
X <- matrix(rnorm(nind), nind, 1)   # one N(0,1) predictor
X <- X - mean(X)                    # center the predictor
noise <- rnorm(nind)                # unexplained variation
poly <- sigma * sqrt(gamma) * U %*% (C * rnorm(nind))   # polygenic component
y <- X %*% beta + poly + noise
dat <- data.frame(y=y,x=X,g=poly)
Verify that $V=K+I=UC^2U^{\text{T}}$
summary(c(kinship + diag(rep(1,nind)) - U %*% diag(C^2) %*% t(U)))
Verify that $C^{-1}U^{\text{T}}VUC^{-1}=I$
summary(c(diag(rep(1,nind)) - (diag(1/C) %*% t(U) %*% (kinship + diag(rep(1,nind))) %*% U %*% diag(1/C))))
Transform problem for full data case. $C^{-1}U^{\text{T}}y \sim N(C^{-1}U^{\text{T}}X\beta,\sigma^2 I)$
dats <- as.data.frame(diag(1/C) %*% (t(U) %*% as.matrix(dat)))
(betahats <- broom::tidy(lm(y~-1+x, dats)))
Verify $\hat{\beta}$ for SVD:
sum((dats$x)*dats$y)/sum((dats$x)^2)
QR decomposition for the full-data case. We can do this, but would avoid it in practice. $FG=CU^{\text{T}}$
CUt <- diag(C) %*% t(U)
system.time(fg <- qr(CUt))
$G=F^{\text{T}}FG=F^{\text{T}}CU^{\text{T}}$
G <- qr.qty(fg, CUt)
# G[lower.tri(G)] <- 0 # force 0 lower triangle?
Verify that $FG=CU^{\text{T}}$
FG <- qr.qy(fg, G)
summary(c(FG - CUt))
Verify that $V=K+I=UC^2U^{\text{T}}=G^{\text{T}}F^{\text{T}}FG=G^{\text{T}}G$
summary(c(kinship + diag(rep(1,nind)) - (t(FG) %*% FG)))
summary(c(kinship + diag(rep(1,nind)) - (t(G) %*% G)))
Ginv <- backsolve(G, diag(dim(G)[1]))
Verify that $G^{-\text{T}}VG^{-1}=I$
summary(c(diag(rep(1,nind)) - (t(Ginv) %*% (kinship + diag(rep(1,nind)))) %*% Ginv))
Verify that $G^{-1}G=GG^{-1}=I$
summary(c(diag(rep(1,nind)) - Ginv %*% G))
summary(c(diag(rep(1,nind)) - G %*% Ginv))
Pre-multiply by $G^{-\text{T}}$.
datsfg <- as.data.frame(t(Ginv) %*% as.matrix(dat))
(betahatfg <- broom::tidy(lm(y~-1+x, datsfg)))
Show fit using SVD to verify they are the same.
betahats
Verify $\hat{\beta}$ for QRD:
sum((datsfg$x)*datsfg$y)/sum((datsfg$x)^2)
These last two fits agree exactly in terms of $\hat{\beta}$. Now, do they agree for predicted values?
Get predicted values $\hat{y}=X\hat{\beta}$
yhats <- X * betahats$estimate
yhatfg <- X * betahatfg$estimate
ggplot2::ggplot(data.frame(x=yhats,y=yhatfg), ggplot2::aes(x,y)) + ggplot2::geom_point() + ggplot2::geom_abline(intercept=0, slope=1)
ggplot2::ggplot(data.frame(x=yhats,y=yhatfg), ggplot2::aes(x,y=y-x)) + ggplot2::geom_point()
Let's try a subset.
nind1 <- 250
ind2keep <- seq(nind1)
dat1 <- dat[ind2keep,]
K11 <- kinship[ind2keep, ind2keep]
system.time(Ke1 <- qtl2scan::decomp_kinship(K11))
Clash of notation: we use subscript $11$ for the redone SVD on $K_{11}$.
D11 <- Ke1$values
U11 <- t(Ke1$vectors)
C11 <- sqrt(1 + gamma * D11)
Verify that $V_{11}=K_{11}+I_1=U_{11}C_{11}^2U_{11}^{\text{T}}$
summary(c(K11 + diag(rep(1,nind1)) - U11 %*% diag(C11^2) %*% t(U11)))
Verify that $C_{11}^{-1}U_{11}^{\text{T}}V_{11}U_{11}C_{11}^{-1}=I_1$
summary(c(diag(rep(1,nind1)) - (diag(1/C11) %*% t(U11) %*% (K11 + diag(rep(1,nind1))) %*% U11 %*% diag(1/C11))))
Transform the problem for the subset case. $C_{11}^{-1}U_{11}^{\text{T}}y_1 \sim N(C_{11}^{-1}U_{11}^{\text{T}}X_1\beta,\sigma^2 I_1)$
dats1 <- as.data.frame(diag(1/C11) %*% (t(U11) %*% as.matrix(dat1)))
(betahat1 <- broom::tidy(lm(y~-1+x, dats1)))
Now for QR approach: $CU_1^{\text{T}}=FG=F_1G_1$
CU1t <- diag(C) %*% t(U[ind2keep,])
system.time(fg1 <- qr(CU1t))
Verify that $FG=CU_1^{\text{T}}$
G <- qr.qty(fg1, CU1t)
G1 <- G[seq_len(nind1), ]   # top n_1 x n_1 block; rows below are (numerically) zero
FG <- qr.qy(fg1, G)
summary(c(FG - CU1t))
Verify that $V_{11}=K_{11}+I_1=U_1C^2U_1^{\text{T}}=G^{\text{T}}F^{\text{T}}FG=G_1^{\text{T}}G_1$
summary(c(K11 + diag(rep(1,nind1)) - (t(FG) %*% FG)))
summary(c(K11 + diag(rep(1,nind1)) - (t(G1) %*% G1)))
Ginv1 <- backsolve(G1, diag(dim(G1)[1]))
Verify that $G_1^{-\text{T}}V_{11}G_1^{-1}=I_1$
summary(c(diag(rep(1,nind1)) - (t(Ginv1) %*% (K11 + diag(rep(1,nind1)))) %*% Ginv1))
Verify that $G_1^{-1}G_1=G_1G_1^{-1}=I_1$
summary(c(diag(rep(1,nind1)) - Ginv1 %*% G1))
summary(c(diag(rep(1,nind1)) - G1 %*% Ginv1))
Pre-multiply by $G_1^{-\text{T}}$.
datsfg1 <- as.data.frame(t(Ginv1) %*% as.matrix(dat1))
(betahatfg1 <- broom::tidy(lm(y~-1+x, datsfg1)))
Show fit from SVD on $K_{11}$ to verify they are the same.
betahat1
The $\hat{\beta}$ estimates agree.
Thus, to solve the sub-problem, we use the SVD for all the data, divide the eigenvector matrix $U$ horizontally, and do a QR decomposition of $CU_1^{\text{T}}$ to obtain $G_1^{-\text{T}}$ for premultiplying $y_1$ and $X_1$. Some care will need to be taken with the condition numbers of the matrices, but that is already incorporated into the R/qtl2scan routines.
The R/qtl2scan code uses K for the kinship matrix and hsq for heritability ($\sigma_g^2/(\sigma_g^2+\sigma^2)$). The decomposed eigen object is Ke. Decomposition is done in qtl2scan::decomp_kinship(). Three un-exported (and one exported) routines call this: qtl2scan:::scan1_pg(), qtl2scan:::scan1coef_pg(), qtl2scan:::scan1blup_pg() and qtl2scan::est_herit(). They calculate heritability hsq with qtl2scan:::calc_hsq_clean() using the internal by_chr_func(), which is unexported and found in file scan1_pg.R. This is called with the eigen decomposition stored in the kinship object, as kinship$vectors and kinship$values. For scan1(), calculations are done in qtl2scan:::scan1_pg_clean() and flavors of scan_pg_onechr(), which are written in C++.
It should be possible to have one routine that calls qtl2scan::decomp_kinship() and, if needed, qtl2scan:::calc_hsq_clean(), to be used for the four instances identified above. We would also want to export this, or something similar, as a routine to precompute the SVD once for later use.
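As a hypothetical sketch only (the workflow below is not existing qtl2scan API; only decomp_kinship() is real), the precomputation could look like this:

# Precompute the kinship decomposition once; $values is D, $vectors is U^T
kinship_decomp <- qtl2scan::decomp_kinship(kinship)
# Downstream routines could then accept this object in place of the raw kinship
# matrix and skip the eigen decomposition, e.g. by checking for the $values and
# $vectors components before calling decomp_kinship() again.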