library(knitr) opts_chunk$set(fig.width = 12)

This vignette illustrates the use of MTaSPUsSet test, an adaptive gene-multitrait association testing with GWAS summary statistics.

We first downloaded GWAS summary statistics of Genetic
Investigation of ANthropometric Traits (GIANT) consortium data. We get the genomic coordinates of SNPs using SnpTracker software (hg19 used), and then we mapped SNPs to Genes using MAGMA software (build 37 used). We will consider testing a gene named *SAMD11* for example.

Let's load *SAMD11* data first.

```
library(aSPU)
```

data(SAMD11) attach(SAMD11)

`ZsM`

and `PsM`

are Z-scores and P-values for GIANT Man data mapped on gene *SAMD11*. (We used MAGMA software to map rs ids to genes. )

round(ZsM,3) PsM

Both `ZsM`

and `PsM`

are `r dim(PsM)[1]`

by `r dim(PsM)[2]`

matrix. Row represent SNPs and column represent phenotypes. `r dim(PsM)[1]`

SNPs are mapped on gene *SAMD11* and `r dim(PsM)[2]`

number of phenotypes are considered in testing.

`corSNPM`

and `corPheM`

are correlation among SNPs and Phenotypes respectively for GIANT Man data.

round(corSNPM,2) round(corPheM,2)

`ZsF`

and `PsF`

are Z-scores and P-values for GIANT Woman data mapped on gene *SAMD11*.

round(ZsF,3) PsF

`corSNPF`

and `corPheF`

are correlation among SNPs and Phenotypes respectively for GIANT Woman data.

round(corSNPF,2) round(corPheF,2)

You can see that `corSNPM`

and `corSNPF`

are same. Yes, they are calculated from the reference population (here, we used 1000 genome CEU panel). They are same under H0 no matter what phenotype is.

We can perform MTaSPUsSet using following command.

(outFZ <- MTaSPUsSet(ZsF, corSNP=corSNPF, corPhe = corPheF, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE)) # available in R/aSPUc from my github #(outFZC <- MTaSPUsSetC(ZsF, corSNP=corSNPF, corPhe = corPheF, # pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))

To use Python version, first save files in `.txt`

format.

#write.table(ZsF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="ZsF.txt") #write.table(corPheF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corPheF.txt") #write.table(corSNPF, quote=FALSE, row.names=FALSE, col.names=FALSE, file="corSNPF.txt")

`MTaSPUsSet.py`

function in py/aSPU_py do the same job with `./MTaSPUsSet.py ZsF.txt corPheF.txt corSNPF.txt 100 outF.txt`

.

Next, We can use the P-values for MTaSPUsSet test using `MTaSPUsSet`

functions. Put P-value matrix `PsF`

and set `Ps=TRUE`

in the option.

(outFP <- MTaSPUsSet(PsF, corSNP=corSNPF, corPhe = corPheF, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE)) # available in R/aSPUc from my github #(outFPC <- MTaSPUsSetC(PsF, corSNP=corSNPF, corPhe = corPheF, # pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE))

Results are not much different.

Next, let's try MTaSPUsSet test using GIANT Man data with input of P-values and Z-scores.

(outMPC <- MTaSPUsSet(PsM, corSNP=corSNPM, corPhe = corPheM, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=TRUE)) (outMZC <- MTaSPUsSet(ZsM, corSNP=corSNPM, corPhe = corPheM, pow=c(1,2,4,8), pow2 = c(1,2,4,8), n.perm=100, Ps=FALSE))

This time we got smaller P-value using `ZsM`

input than `PsM`

input. Why is it so? This can be answered from `corSNPM`

, `corPheM`

and `ZsM`

.

round(ZsM,3) round(corSNPM,2) round(corPheM,2)

In `corSNPM`

, correlation between `rs4951864`

and others are negative. The second column of `ZsM`

looks a bit odd. It could be probable since the p-value is not very much significant and `rs11240777`

might be related a bit with `Height`

. However It might be related to the coding error as well. The original data set did not provide Z-scores. P-values for each SNP and beta estimate was provided in the download page of GIANT data. I transformed P-values to Z-scores by `absZs = qnorm(1 - (Ps)/2)`

and then multiplied the sign of beta estimates to recover the original Z-scores. Maybe I am wrong somewhere or provided beta estimates are not all correct. In any case, it would be safe to analyze GIANT data using P-values rather than using Z-scores.

plotG <- function(Ps, zlim = NULL, main = NULL, yt = NULL, title = "SNPs") { log10P <- -log(Ps,10) pos = 1:nrow(log10P) y = 1:ncol(log10P) log10P <- log10P val <- sqrt(seq(0, 1, len=251)) col <- rgb(1, rev(val), rev(val)) if(is.null(yt)) { yt = -length(pos)/15 } if(is.null(zlim)) { maxP <- max(log10P, na.rm=TRUE) zlim <- c(0, maxP) } image.plot(pos, y, log10P, xaxt="n", yaxt="n", ylab="", xlab="", zlim=zlim, col=col, mgp=c(2.6, 1, 0), bty="n", main = main ) title(xlab=title, mgp=c(1, 0, 0)) text(yt,1,"BMI", xpd = TRUE) text(yt,2,"Height", xpd = TRUE) text(yt,3,"HIP", xpd = TRUE) text(yt,4,"WC", xpd = TRUE) text(yt,5,"Weight", xpd = TRUE) text(yt,6,"WHR", xpd = TRUE) } plotLD <- function(ldmatrix, zlim = NULL, main = NULL, yt = NULL, title = "SNPs") { # log10P <- -log(Ps,10) pos = 1:nrow(ldmatrix) y = 1:ncol(ldmatrix) val <- sqrt(seq(0, 1, len=251)) col <- rgb(1, rev(val), rev(val)) if(is.null(yt)) { yt = -length(pos)/15 } if(is.null(zlim)) { maxP <- max(ldmatrix, na.rm=TRUE) zlim <- c(0, maxP) } image.plot(pos, y, ldmatrix, xaxt="n", yaxt="n", ylab="", xlab="", zlim=zlim, col=col, mgp=c(2.6, 1, 0), bty="n", main = main ) title(xlab=title, mgp=c(1, 0, 0)) title(ylab=title, mgp=c(1, 0, 0)) }

So far, we demostrated how we can use the software. In this section we will visualize some identified genes.

Gene *LCORL* and *RASA2* are two of identified genes by MGAS using their software.

data(someGs) par(mfrow = c(2,2)) plotG(someGs$LCORL[[1]], main = "LCORL (P-values)", zlim = c(0,18)) plotG(someGs$RASA2[[1]], main = "RASA2 (P-values)", zlim = c(0,12)) plotLD(abs(someGs$LCORL[[2]]), main = "LCORL (LDmatrix)") plotLD(abs(someGs$RASA2[[2]]), main = "RASA2 (LDmatrix)")

In the Figure, we draw image of $-\log_{10}$ transformed P-values for each SNP and trait.
Gene *LCORL* was the most significant gene. As we can see from the Figure, there are many very significant SNPs for trait Height in Gene *LCORL*. No doubt MGAS and MTaSPUsSet identified this gene.

MGAS use extended Simes procedure so it considers top few p-values in their algorithm. Thus, MGAS works well with sparse signal such as *RASA2*. Since the signal is sparse, MTaSPUsSet detect this gene with larger $\gamma_1$ and $\gamma_2$. With ($\gamma_1, \gamma_2$) = (1,8), (2,8), (4,8) and (8,8), MTaSPUsSet P-values were less than 1e-7 with $10^7$ permutations.

data(someGs) par(mfrow = c(2,2)) plotG(someGs$STK33[[1]], main = "STK33 (P-values)", zlim = c(0,12)) plotG(someGs$RPGRIP1L[[1]], main = "RPGRIP1L (P-values)", zlim = c(0,12)) plotLD(abs(someGs$STK33[[2]]), main = "STK33 (LDmatrix)") plotLD(abs(someGs$RPGRIP1L[[2]]), main = "RPGRIP1L (LDmatrix)")

MTaSPUsSet test have advantage in detecting genes loke *STK33* and *RPGRIP1L*. We can see that signals are widely spread. MTaSPUsSet test have higher power with smaller $\gamma_1$ and $\gamma_2$. As we can see from the figure, each P-values are not very big. These genes would be hard to detect using SNP based, or gene - single trait association testing. However, MTaSPUsSet test can detect these genes by aggregating signals for all traits and SNPs.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.