knitr::opts_chunk$set(echo = TRUE)

The purpose of this document is to develop the math to modify R/qtl2scan so that the kinship object can be decomposed once and used multiple times. The challenge is that the current setup decomposes kinship on each reduced set of data, chosen, for instance, to eliminate individuals with missing data for a phenotype, or to restrict to a subset of individuals to speed computation.

In summary, the basic model is

$$y \sim N(X\beta,V), \quad V =\sigma_g^2K + \sigma^2I = \sigma^2(\gamma K+I)~,$$

with eigen decomposition (SVD) of $K = UDU^{\text{T}}$. For subsets of individuals, $y_1$ and $X_1$ are row subsets of $y$ and $X$, respectively, with a corresponding row subset $U_1$ of the eigenvector matrix $U$ (cutting across the eigenvectors of the full kinship matrix $K$). Rather than redo the SVD on the subset kinship matrix ($K_{11}$), we can do a Householder (QR) decomposition: $CU_1^{\text{T}} = FG=F_1G_1~.$ Premultiplying by $G_1^{-\text{T}}$, so that $y_1^* = G_1^{-\text{T}}y_1$ and $X_1^* = G_1^{-\text{T}}X_1$, reduces the problem to ordinary least squares: $$y_1^* \sim N(X_1^*\beta,\sigma^2 I_1)$$

Thus, to solve the sub-problem, we use the SVD for all the data, take the row subset $U_1$ of the eigenvector matrix $U$, and do a QR decomposition of $CU_1^{\text{T}}$ to obtain $G_1^{-\text{T}}$ for premultiplying $y_1$ and $X_1$. Some care will need to be taken about the condition number of matrices, but that is already incorporated into R/qtl2scan routines.

Complete data problem

The basic math for complete data is as follows.

$$y = X\beta + g + e~, \quad g\sim N(0,\sigma_g^2 K), \quad e\sim N(0,\sigma^2 I)$$

with $\mu_q=X\beta$ the QTL effects (and any other fixed effects), $g$ the polygenic effects (random), $e$ the unexplained variation (random), $K$ the kinship matrix, and $I$ the identity matrix (1s on the diagonal, 0s off the diagonal). Put another way, the distribution of phenotypes is

$$y \sim N(X\beta,V), \quad V =\sigma_g^2K + \sigma^2I = \sigma^2(\gamma K+I)~.$$

with $\gamma = \sigma_g^2 / \sigma^2$. The MLEs are found by iterating (similar in spirit to EM): get the MLE of $\beta$ given $V$, $\hat{\beta}_q=(X^{\text{T}}V^{-1}X)^{-1}X^{\text{T}}V^{-1}y$, then estimate $\sigma_g^2$ and $\sigma^2$ given $\hat{\beta}~.$
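For illustration, one step of that iteration (the GLS estimate of $\beta$ for a given $V$) might look like the base-R sketch below; K, X, y, and current values gamma and sigma2 are assumed to be available, and the variance-component update itself is omitted. This is only a sketch, not the R/qtl2scan implementation.

# Generalized least squares estimate of beta given V = sigma^2 (gamma K + I).
gls_beta <- function(y, X, K, gamma, sigma2) {
  V <- sigma2 * (gamma * K + diag(nrow(K)))
  VinvX <- solve(V, X)                             # V^{-1} X
  solve(crossprod(X, VinvX), crossprod(VinvX, y))  # (X' V^{-1} X)^{-1} X' V^{-1} y
}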

SVD of $K$ in full data case

Rather than working with $V$ directly, the $K$ matrix is decomposed using SVD as

$$K = UDU^{\text{T}}$$

with $U$ orthogonal (the eigenvector columns are orthonormal; $U^{\text{T}}U=UU^{\text{T}}=I$). The $D$ matrix has 0 off-diagonal and diagonal entries being the eigenvalues $d_i$. We can write $V$ as

$$V = \sigma^2 (\gamma UDU^{\text{T}} +I)=\sigma^2 UC^2U^{\text{T}}~, \quad C^2=\gamma D+I~,$$

with $C$ being diagonal with entries $c_i=(1+\gamma d_i)^{1/2}~.$ We transform the problem by left-multiplying $y$ and $X$ by $C^{-1}U^{\text{T}}$ to yield an i.i.d. setting, $y^* \sim N(X^*\beta,\sigma^2I)~,$ with

$$y^*=C^{-1}U^{\text{T}}y, \quad X^*=C^{-1}U^{\text{T}}X~, \quad \sigma^2I=C^{-1}U^{\text{T}}VUC^{-1}~.$$

Then we solve the (unweighted) normal equations,

$$\hat{\beta}=(X^{*\text{T}}X^*)^{-1}X^{*\text{T}}y^* = (X^{\text{T}}UC^{-2}U^{\text{T}}X)^{-1}X^{\text{T}}UC^{-2}U^{\text{T}}y~,$$

with predicted values

$$\hat{y}=X\hat{\beta}=UCX^*(X^{*\text{T}}X^*)^{-1}X^{*\text{T}}y^* = X(X^{\text{T}}UC^{-2}U^{\text{T}}X)^{-1}X^{\text{T}}UC^{-2}U^{\text{T}}y~.$$
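A minimal base-R sketch of this full-data transformation and fit, using eigen() in place of qtl2scan::decomp_kinship() and assuming K, X, y, and gamma are already in the workspace:

ed <- eigen(K, symmetric = TRUE)   # K = U D U'
U  <- ed$vectors
D  <- ed$values
Cinv  <- 1 / sqrt(1 + gamma * D)   # diagonal of C^{-1}
ystar <- Cinv * crossprod(U, y)    # y* = C^{-1} U' y
Xstar <- Cinv * crossprod(U, X)    # X* = C^{-1} U' X
fit   <- lm.fit(Xstar, ystar)      # OLS on the transformed data gives betahat
yhat  <- X %*% fit$coefficients    # predicted values on the original scale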

QR decomposition for subsets

If some phenotypes are missing, or we compute with a subset of individuals, the problem needs to be refactored. Suppose the $n$-vector of phenotypes $y$ is subdivided as $y=(y_1,y_2)$, with the key focus on the $n_1$ phenotypes in $y_1$. Rather than reduce the size of the problem and do another SVD, consider a diagonal matrix $B$ with $n_1$ 1s and $n_2=n-n_1$ 0s down the diagonal. In other words, with $I_1$ an identity matrix of size $n_1$ and blocking matrix $B = \begin{bmatrix} I_1 & 0 \\ 0 & 0 \end{bmatrix}~,$ we have $By = (y_1,0)$. Note that $B^{\text{T}} = B^2 =B$, but $B$ does not have an inverse. Formally, we write

$$B y \sim N(B X\beta,BVB), \quad BVB =\sigma^2(\gamma BKB + B)=\sigma^2BUC^2U^{\text{T}}B~.$$
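A tiny numerical illustration of the blocking matrix (hypothetical sizes $n_1=3$ and $n_2=2$; this is only to fix ideas, not part of the computation):

n1 <- 3; n2 <- 2; n <- n1 + n2
B <- diag(c(rep(1, n1), rep(0, n2)))  # I_1 in the top-left block, 0 elsewhere
y <- rnorm(n)
all.equal(B %*% B, B)                 # B^2 = B (and B' = B, since B is diagonal)
c(B %*% y)                            # equals (y_1, 0): the last n_2 entries are zeroed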

If we partition $K$ (and similarly $UC^2U^{\text{T}}$) into four parts, we have

$$K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}~, \quad BKB = \begin{bmatrix} K_{11} & 0 \\ 0 & 0 \end{bmatrix}$$ and $$UC^{2}U^{\text{T}} = \begin{bmatrix} U_1C^{2}U_1^{\text{T}} & U_1C^{2}U_2^{\text{T}} \\ U_2C^{2}U_1^{\text{T}} & U_2C^{2}U_2^{\text{T}} \end{bmatrix}~, \quad BVB = \sigma^2BUC^{2}U^{\text{T}}B = \sigma^2\begin{bmatrix} U_1C^{2}U_1^{\text{T}} & 0 \\ 0 & 0 \end{bmatrix}~.$$

That is, partition the transpose of the eigenvector matrix as $U^{\text{T}} = (U_1^{\text{T}}, U_2^{\text{T}})~.$ Note the relations: $I=U_1^{\text{T}}U_1+U_2^{\text{T}}U_2$, $U_1U_2^{\text{T}}=0$, $U_2U_1^{\text{T}}=0$, $U_1U_1^{\text{T}}=I_1$ and $U_2U_2^{\text{T}}=I_2$, with $I_j$ being an identity matrix of size $n_j$. The model part we are interested in is

$$y_1 \sim N(X_1\beta,V_{11}), \quad V_{11} =\sigma^2(\gamma K_{11} + I_1)=\sigma^2U_1C^2U_1^{\text{T}}~.$$
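These orthogonality relations are easy to check numerically. A minimal sketch, assuming U holds the eigenvectors of $K$ in its columns and using a hypothetical subset size n1:

n1 <- 250                             # hypothetical subset size (n1 < nrow(U))
U1 <- U[seq_len(n1), , drop = FALSE]  # first n_1 rows of U
U2 <- U[-seq_len(n1), , drop = FALSE] # remaining n_2 rows
summary(c(crossprod(U1) + crossprod(U2) - diag(ncol(U))))  # U_1'U_1 + U_2'U_2 = I
summary(c(tcrossprod(U1) - diag(n1)))                      # U_1 U_1' = I_1
summary(c(tcrossprod(U1, U2)))                             # U_1 U_2' = 0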

Rather than perform a new SVD on $K_{11}$, it is more efficient to do a QR decomposition. The QR decomposition takes $O(n_1n^2-n^3/3)$ operations versus $O(n_1n^2+n^3)$ for SVD. That is, QR should be much quicker. See Princeton CS Notes.

Consider a Householder (QR) decomposition of $CU_1^{\text{T}} = FG=F_1G_1$ with $G_1$ upper triangular and $FF^{\text{T}}=F^{\text{T}}F=I$, $F=(F_1,F_2)$, $G^{\text{T}}=(G_1^{\text{T}}, 0)$. Also, $F_1F_1^{\text{T}}+F_2F_2^{\text{T}}=I$ and $F_1^{\text{T}}F_1=I_1$.

Hence, $V_{11} = \sigma^2G_1^{\text{T}}F_1^{\text{T}}F_1G_1 = \sigma^2G_1^{\text{T}}G_1$. Thus, the model we need to solve is $y_1^* \sim N(X_1^*\beta,\sigma^2I_1)$, with

$$y_1^* = G_1^{-\text{T}}y_1, \quad X_1^* = G_1^{-\text{T}}X_1$$ and the solution for $\beta$ is:

$$\hat{\beta} = (X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = (X_1^{\text{T}}G_1^{-1}G_1^{-\text{T}}X_1)^{-1}X_1^{\text{T}}G_1^{-1}G_1^{-\text{T}}y_1$$

That is, we need to take the current value of $\gamma$, do a Householder decomposition on the reduced matrix $CU_1^{\text{T}}$, and solve OLS.
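Concretely, this first QR step might look like the base-R sketch below, continuing from the snippet above (U1 is the row subset of $U$; D and gamma are assumed available; qr() performs a Householder QR). The verification lines mirror the proof-of-concept style used later.

C <- sqrt(1 + gamma * D)             # c_i = (1 + gamma d_i)^{1/2} from the stored SVD of K
CU1t <- C * t(U1)                    # C U_1', an n x n_1 matrix (C scales the rows)
qrCU1t <- qr(CU1t)                   # Householder QR: C U_1' = F G = F_1 G_1
G1 <- qr.R(qrCU1t)                   # upper-triangular n_1 x n_1 factor G_1
F1 <- qr.Q(qrCU1t)                   # first n_1 columns of the orthogonal factor F
summary(c(F1 %*% G1 - CU1t))                      # verify F_1 G_1 = C U_1'
summary(c(crossprod(G1) - U1 %*% (C^2 * t(U1))))  # verify G_1' G_1 = U_1 C^2 U_1' = V_11 / sigma^2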

QR solution for OLS

Note that the OLS solution requires another QR decomposition:

$$X_1^* = G_1^{-\text{T}}X_1=QR=Q_1R_1~, \quad Q^{\text{T}}Q=I~, \quad Q_1^{\text{T}}Q_1=I_1~.$$ $$\hat{\beta} = (X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = R_1^{-1}Q_1^{\text{T}}G_1^{-\text{T}}y_1~.$$ and for prediction,

$$\hat{y}_1=X_1\hat{\beta}=G_1^{\text{T}}X_1^*\hat{\beta} = G_1^{\text{T}}X_1^*(X_1^{*\text{T}}X_1^*)^{-1}X_1^{*\text{T}}y_1^* = G_1^{\text{T}}Q_1Q_1^{\text{T}}G_1^{-\text{T}}y_1~.$$

To summarize, there is an SVD of $K=UDU^{\text{T}}$ done once for the project, and stored. There are two QR decompositions for each subset of individuals: one of $CU_1^{\text{T}}=FG=F_1G_1$, with $C^2=\gamma D+I$ depending on $\gamma = \sigma_g^2 / \sigma^2$ and $U_1$ being the subset of rows of $U$ for the included individuals; and one of $G_1^{-\text{T}}X_1=QR=Q_1R_1$. Both QR decompositions need to be done many times.
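To make the decompose-once, reuse-per-subset pattern explicit, here is a hedged base-R sketch of how the pieces could be organized. The helpers kinship_decomp() and fit_subset() are hypothetical, not qtl2scan functions; the proof of concept below uses qtl2scan::decomp_kinship() instead.

# Done once per kinship matrix (e.g., once per LOCO chromosome) and stored.
kinship_decomp <- function(K) {
  ed <- eigen(K, symmetric = TRUE)
  list(U = ed$vectors, D = ed$values)  # K = U D U'
}

# Done for each subset of individuals and each trial value of gamma.
fit_subset <- function(decomp, ind, y, X, gamma) {
  C  <- sqrt(1 + gamma * decomp$D)
  U1 <- decomp$U[ind, , drop = FALSE]  # rows of U for the retained individuals
  G1 <- qr.R(qr(C * t(U1)))            # first QR: C U_1' = F_1 G_1
  y1s <- backsolve(G1, y[ind], transpose = TRUE)                  # y_1* = G_1^{-T} y_1
  X1s <- backsolve(G1, X[ind, , drop = FALSE], transpose = TRUE)  # X_1* = G_1^{-T} X_1
  qr.coef(qr(X1s), y1s)                # second QR: OLS on the transformed data
}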

Proof of Concept

This section goes through in gory detail the proof of concept calculations with a basic model having kinship $K$ from about 500 mice (say LOCO for chr 16), $X$ being one $N(0,1)$ predictor, and $\beta=\sigma = \sigma_g=1$. After going through the full data case, verifying that SVD and QRD agree, we examine a subset of 250 mice and show that SVD on $K_{11}$ and QRD on $CU_1^{\text{T}}$ agree. Note that there are many "verify" steps here to ensure that matrices are what they are supposed to be, as it is easy (!) to make mistakes with this matrix algebra.

suppressPackageStartupMessages({
  library(qtl2geno)
  library(qtl2scan)
  library(qtl2ggplot)
})
datapath <- "~/Documents/Research/attie_alan/DO/data/DerivedData/"
kinship <- readRDS(file.path(datapath, "kinship.rds"))
kinship <- kinship[["16"]]
nind <- nrow(kinship)
system.time(Ke <- qtl2scan::decomp_kinship(kinship))
D <- Ke$values
U <- t(Ke$vectors) # decomp_kinship() stores eigenvectors in rows; transpose so that K = U D U'

Verify that $K=UDU^{\text{T}}$.

summary(c(kinship - U %*% diag(D) %*% t(U)))

Parameter setup.

gamma <- 1
sigma <- 1
beta <- 1
C <- sqrt(1 + gamma * D)

$$y = X\beta + g + e$$

X <- matrix(rnorm(nind), nind, 1)
X <- X - mean(X)
noise <- rnorm(nind)
poly <- sigma * sqrt(gamma) * U %*% (sqrt(D) * rnorm(nind)) # g ~ N(0, sigma_g^2 K), using K = U D U'
y <- X %*% beta + poly + noise
dat <- data.frame(y=y,x=X,g=poly)

Verify that $V=K+I=UC^2U^{\text{T}}$

summary(c(kinship + diag(rep(1,nind)) - U %*% diag(C^2) %*% t(U)))

Verify that $C^{-1}U^{\text{T}}VUC^{-1}=I$

summary(c(diag(rep(1,nind)) - (diag(1/C) %*% t(U) %*% (kinship + diag(rep(1,nind))) %*% U %*% diag(1/C))))

Transform problem for full data case. $C^{-1}U^{\text{T}}y \sim N(C^{-1}U^{\text{T}}X\beta,\sigma^2 I)$

dats <- as.data.frame(diag(1/C) %*% (t(U) %*% as.matrix(dat)))
(betahats <- broom::tidy(lm(y~-1+x, dats)))

Verify $\hat{\beta}$ for SVD:

sum((dats$x)*dats$y)/sum((dats$x)^2)

QR decomp

QR decomposition for the full-data case. We can do this, but would avoid it in practice. $FG=CU^{\text{T}}$

CUt <- diag(C) %*% t(U)
system.time(fg <- qr(CUt))

$G=F^{\text{T}}FG=F^{\text{T}}CU^{\text{T}}$

G <- qr.qty(fg, CUt)
#G[lower.tri(G)] <- 0 # force 0 lower triangle?

Verify that $FG=CU^{\text{T}}$

FG <- qr.qy(fg, G)
summary(c(FG - CUt))

Verify that $V=K+I=UC^2U^{\text{T}}=G^{\text{T}}F^{\text{T}}FG=G^{\text{T}}G$

summary(c(kinship + diag(rep(1,nind)) - (t(FG) %*% FG)))
summary(c(kinship + diag(rep(1,nind)) - (t(G) %*% G)))
Ginv <- backsolve(G, diag(dim(G)[1])) # G^{-1} via triangular solve on the upper triangle of G

Verify that $G^{-\text{T}}VG^{-1}=I$

summary(c(diag(rep(1,nind)) - (t(Ginv) %*% (kinship + diag(rep(1,nind)))) %*% Ginv))

Verify that $G^{-1}G=GG^{-1}=I$

summary(c(diag(rep(1,nind)) - Ginv %*% G))
summary(c(diag(rep(1,nind)) - G %*% Ginv))

Pre-multiply by $G^{-\text{T}}$.

datsfg <- as.data.frame(t(Ginv) %*% as.matrix(dat))
(betahatfg <- broom::tidy(lm(y~-1+x, datsfg)))

Show fit using SVD to verify they are the same.

betahats

Verify $\hat{\beta}$ for QRD:

sum((datsfg$x)*datsfg$y)/sum((datsfg$x)^2)

These last two fits agree exactly in terms of $\hat{\beta}$. Now, do they agree for predicted values?

Get predicted values $\hat{y}=X\hat{\beta}$

yhats <- X * betahats$estimate
yhatfg <-X * betahatfg$estimate
ggplot2::ggplot(data.frame(x=yhats,y=yhatfg),
                ggplot2::aes(x,y)) +
  ggplot2::geom_point() +
  ggplot2::geom_abline(intercept=0, slope=1)
ggplot2::ggplot(data.frame(x=yhats,y=yhatfg),
                ggplot2::aes(x,y=y-x)) +
  ggplot2::geom_point()

Subset

Let's try a subset.

nind1 <- 250
ind2keep <- seq(nind1)
dat1 <- dat[ind2keep,]
K11 <- kinship[ind2keep, ind2keep]
system.time(Ke1 <- qtl2scan::decomp_kinship(K11))

There is a clash of notation here: we use subscript $11$ for the redone SVD on $K_{11}$.

D11 <- Ke1$values
U11 <- t(Ke1$vectors)
C11 <- sqrt(1 + gamma * D11)

Verify that $V_{11}=K_{11}+I_1=U_{11}C_{11}^2U_{11}^{\text{T}}$

summary(c(K11 + diag(rep(1,nind1)) - U11 %*% diag(C11^2) %*% t(U11)))

Verify that $C_{11}^{-1}U_{11}^{\text{T}}V_{11}U_{11}C_{11}^{-1}=I_1$

summary(c(diag(rep(1,nind1)) - 
            (diag(1/C11) %*% t(U11) %*% 
               (K11 + diag(rep(1,nind1))) %*% 
               U11 %*% diag(1/C11))))

Transform problem for the subset case. $C_{11}^{-1}U_{11}^{\text{T}}y_1 \sim N(C_{11}^{-1}U_{11}^{\text{T}}X_1\beta,\sigma^2 I_1)$

dats1 <- as.data.frame(diag(1/C11) %*% (t(U11) %*% as.matrix(dat1)))
(betahat1 <- broom::tidy(lm(y~-1+x, dats1)))

Now for QR approach: $CU_1^{\text{T}}=FG=F_1G_1$

CU1t <- diag(C) %*% t(U[ind2keep,])
system.time(fg1 <- qr(CU1t))

Verify that $FG=CU_1^{\text{T}}$

G <- qr.qty(fg1, CU1t)
G1 <- G[ind2keep, ind2keep] # upper-triangular n_1 x n_1 block G_1
FG <- qr.qy(fg1, G)         # reconstruct F G from the subset QR (fg1, not the full-data fg)
summary(c(FG - CU1t))

Verify that $V_{11}=K_{11}+I_1=U_1C^2U_1^{\text{T}}=G_1^{\text{T}}F_1^{\text{T}}F_1G_1=G_1^{\text{T}}G_1$

summary(c(K11 + diag(rep(1,nind1)) - (t(FG) %*% FG)))
summary(c(K11 + diag(rep(1,nind1)) - (t(G1) %*% G1)))
Ginv1 <- backsolve(G1, diag(dim(G1)[1]))

Verify that $G_1^{-\text{T}}V_{11}G_1^{-1}=I_1$

summary(c(diag(rep(1,nind1)) - (t(Ginv1) %*% (K11 + diag(rep(1,nind1)))) %*% Ginv1))

Verify that $G_1^{-1}G_1=G_1G_1^{-1}=I_1$

summary(c(diag(rep(1,nind1)) - Ginv1 %*% G1))
summary(c(diag(rep(1,nind1)) - G1 %*% Ginv1))

Pre-multiply by $G_1^{-\text{T}}$.

datsfg1 <- as.data.frame(t(Ginv1) %*% as.matrix(dat1))
(betahatfg1 <- broom::tidy(lm(y~-1+x, datsfg1)))

Show fit from SVD on $K_{11}$ to verify they are the same.

betahat1

The $\hat{\beta}$ estimates agree.

Implementation

Thus, to solve the sub-problem, we use the SVD for all the data, take the row subset $U_1$ of the eigenvector matrix $U$, and do a QR decomposition of $CU_1^{\text{T}}$ to obtain $G_1^{-\text{T}}$ for premultiplying $y_1$ and $X_1$. Some care will need to be taken about the condition number of matrices, but that is already incorporated into R/qtl2scan routines.

The R/qtl2scan code uses K for the kinship matrix and hsq for heritability ($\sigma_g^2/(\sigma_g^2+\sigma^2)$). The decomposed eigen object is Ke. Decomposition is done in qtl2scan::decomp_kinship(). Three un-exported routines and one exported routine call this: qtl2scan:::scan1_pg(), qtl2scan:::scan1coef_pg(), and qtl2scan:::scan1blup_pg(), plus qtl2scan::est_herit().

These routines calculate heritability hsq with qtl2scan:::calc_hsq_clean(), using an internal by_chr_func(), which is unexported and found in the file scan1_pg.R. It is called with the eigen decomposition stored in the kinship object, as kinship$vectors and kinship$values. For scan1(), calculations are done in qtl2scan:::scan1_pg_clean() and flavors of scan_pg_onechr(), which are written in C++.

It should be possible to have one routine that calls qtl2scan::decomp_kinship() and, if needed, qtl2scan:::calc_hsq_clean(), covering the four instances identified above. We would also want to export this, or something similar, as a routine to precompute the SVD once for later use.
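One possible shape for such a routine, as a hedged sketch only: the name precomp_kinship() is hypothetical, and only the exported qtl2scan::decomp_kinship() is called (the heritability step via qtl2scan:::calc_hsq_clean() is omitted here because its interface is not shown in this document).

# Hypothetical exported helper: decompose each kinship matrix once so that
# downstream scan routines can reuse the stored eigen decomposition.
precomp_kinship <- function(kinship) {
  if (is.matrix(kinship)) {
    qtl2scan::decomp_kinship(kinship)   # single kinship matrix
  } else if (is.list(kinship) && is.null(kinship$vectors)) {
    lapply(kinship, precomp_kinship)    # LOCO-style list of matrices, one per chromosome
  } else {
    kinship                             # already decomposed; return as-is
  }
}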


