knitr::opts_chunk$set(echo=T, warning=F, message=F, cache=F, results='hold'); Plot = require(cowplot)&require(ggplot2)&require(reshape2);

- Setting
- Data
- Basic Association Test
- Inverse Normal Transformation
- Direct INT
- Indirect INT
- Omnibus INT
- Notes

Consider genetic association analysis with a continuous trait. If the residual distribution is asymmetric (skewed) or diffuse (kurtotic) relative to the normal distribution, then standard linear regression may fail to control the type I error in moderate samples. Even if the sample is sufficiently large for standard linear regression to provide valid inference, it is not fully efficient when the residual distribution is non-normal. Examples of traits that may exhibit non-normal residual distributions include body mass index, circulating metabolites, gene expression, polysomnography signals, and spirometry measurements. In such cases, the rank based inverse normal transformation (INT) has been used to counteract departures from normality. During INT, the sample measurements are first mapped to the probability scale, by replacing the observed values with fractional ranks, then transformed into Z-scores using the probit function. In the following example, a sample of size $n=1000$ is drawn from the $\chi_{1}^{2}$ distribution. After transformation, the empirical distribution of the measurements in is indistinguishable from standard normal.

library(RNOmni); # Sample from the chi-1 square distribution y = rchisq(n=1000,df=1); # Rank-normalize z = rankNorm(y);

# Data frame df = data.frame("Original"=y,"INT"=z); df = suppressMessages(reshape2::melt(df)); colnames(df) = c("Data","x"); q = ggplot(data=df); q = q+geom_density(aes(x=x,fill=Data),alpha=0.8); q = q+theme_bw()+scale_fill_manual(labels=c("Original","RankNorm"),values=c("coral","royalblue")); q = q+xlab("Measurement")+ylab("Density"); q = q+ggtitle(expression(chi[1]^2~Phenotype~Before~and~After~INT)); print(q);

Here, data are simulated for $n=10^{3}$ subjects. Genotypes are drawn for $10^{3}$ loci in linkage equilibrium with minor allele frequency between 0.05 and 0.50. The model matrix `X`

contains an intercept, four standard normal covariates `Z`

, and the first four genetic principal components. The intercept is set to one, and the remaining regression coefficients are simulated as random effects. The proportion of phenotypic variation explained by covariates is 20%, while the proportion of variation explained by principal components is 5%. Two phenotypes with additive residuals are simulated. The first `yn`

has standard normal residuals, while the second `yt`

has $t_{3}$ residuals, scaled to have unit variance.

set.seed(100); # Sample size n = 1e3; ## Simulate genotypes maf = runif(n=1e3,min=0.05,max=0.50); G = sapply(maf,function(x){rbinom(n=n,size=2,prob=x)}); storage.mode(G) = "numeric"; # Genetic principal components S = svd(scale(G))$u[,1:4]; S = scale(S); # Covariates Z = scale(matrix(rnorm(n*4),nrow=n)); # Overall design X = cbind(1,Z,S); # Coefficient b = c(1,rnorm(n=4,sd=1/sqrt(15)),rnorm(n=4,sd=1/sqrt(60))); # Linear predictor h = as.numeric(X%*%b); # Normal phenotype yn = h+rnorm(n); # T-3 phenotype yt = h+rt(n,df=3)/sqrt(3);

# Normal phenotype q = ggplot(data=data.frame("yn"=yn)); q = q+geom_density(aes(x=yn),alpha=0.8,fill="royalblue"); q = q+theme_bw()+xlab("Phenotype")+ylab("Marginal Density"); q1 = q+ggtitle("Normal Phenotype"); # Log-normal phenotype q = ggplot(data=data.frame("yt"=yt)); q = q+geom_density(aes(x=yt),alpha=0.8,fill="coral"); q = q+theme_bw()+xlab("Phenotype")+ylab("Marginal Density"); q2 = q+ggtitle("T3 Phenotype"); # Plot Q = plot_grid(q1,q2,nrow=1); print(Q);

The outcome `y`

is expected as a numeric vector. Genotypes `G`

are expected as a numeric matrix, with subjects are rows. If adjusting for covariates or population structure, `X`

is expected as a numeric matrix, which should contain an intercept. Factors and interactions should be expanded in advance, e.g. using `model.matrix`

. Missingness is not expected in either the outcome vector `y`

or the model matrix `X`

, however the genotype matrix `G`

may contain missingness. Observations with missing genotypes are excluded from association testing only at those loci where the genotype is missing.

The basic association test is linear regression of the (untransformed) phenotype on genotype and covariates. `BAT`

provides an efficient implementation using phenotype `y`

, genotypes `G`

, and model matrix `X`

. Standard output includes the score statistic, its standard error, the Z-score, and a $p$-value, with one row per column of `G`

. Setting `test="Wald"`

specifies a Wald test. The Wald test may provide more power, but is generally slower. Setting `simple=T`

returns the $p$-values only.

# Basic Association Test, Normal Phenotype Results1 = BAT(y=yn,G=G,X=X); cat("BAT Applied to Normal Phenotype\n"); round(head(Results1),digits=3); cat("\n"); # Basic Association Test, T3 Phenotype cat("BAT Applied to T3 Phenotype\n"); Results2 = BAT(y=yt,G=G,X=X); round(head(Results2),digits=3);

Suppose that a continuous measurement $u_{i}$ is observed for each of $n$ subjects. Let $\text{rank}(u_{i})$ denote the sample rank of $u_{i}$ when the measurements are placed in ascending order. The rank based inverse normal transformation (INT) is defined as:

$$ \text{INT}(u_{i}) = \Phi^{-1}\left[\frac{\text{rank}(u_{i})-k}{n-2k+1}\right] $$

Here $\Phi^{-1}$ is the probit function, and $k\in(0,1/2)$ is an adjustable offset. By default, the Blom offset of $k=3/8$ is adopted.

In direct INT (D-INT), the INT-transformed phenotype is regressed on genotype and covariates. D-INT is most powerful when the phenotype could have arisen from a rank-preserving transformation of a latent normal trait. `DINT`

implements the association test using phenotype `y`

, genotypes `G`

, and model matrix `X`

. Standard output includes the test statistic, its standard error, the Z-score, and a $p$-value, with one row per column of `G`

. Wald and score tests are available. Setting `simple=T`

returns the $p$-values only.

# Direct INT Test, Normal Phenotype cat("D-INT Applied to Normal Phenotype\n"); Results1 = DINT(y=yn,G=G,X=X); round(head(Results1),digits=3); cat("\n"); # Direct INT Test, T3 Phenotype cat("D-INT Applied to T3 Phenotype\n"); Results2 = DINT(y=yt,G=G,X=X); round(head(Results2),digits=3);

In indirect INT (I-INT), the phenotype and genotypes are first regressed on covariates to obtain residuals. The phenotypic residuals are rank normalized. Next, the INT-transformed phenotypic residuals are regressed on genotypic residuals. I-INT is most powerful when the phenotype is linear in covariates, but the residual distribution is skewed or kurtotic. `IINT`

implements the association test, using phenotype `y`

, genotypes `G`

, and model matrix `X`

. Standard output includes the test statistic, its standard error, the Z-score, and a $p$-value, with one row per column of `G`

. Setting `simple=T`

returns the $p$-values only.

# Indirect INT Test, Normal Phenotype cat("I-INT Applied to Normal Phenotype\n"); Results1 = IINT(y=yn,G=G,X=X); round(head(Results1),digits=3); cat("\n"); # Indirect INT Test, T3 Phenotype cat("I-INT Applied to T3 Phenotype\n"); Results2 = IINT(y=yt,G=G,X=X); round(head(Results2),digits=3);

Since neither D-INT nor I-INT is uniformly most powerful, the INT omnibus test (O-INT) adaptively combines them into a single robust and statistically powerful test. Internally, `OINT`

applies both D-INT and I-INT. The corresponding p-values are transformed into Cauchy random deviates, which are averaged to form the omnibus statistic. In general the omnibus p-value will be between the D-INT and I-INT p-values, but closer to smaller of these two. `OINT`

implements the omnibus test, using phenotype `y`

, genotypes `G`

, and model matrix `X`

. The standard output includes the $p$-values estimated by each of D-INT, I-INT, and O-INT. Setting `simple=T`

returns the O-INT $p$-values only.

# Omnibus INT Test, Normal Phenotype cat("O-INT Applied to Normal Phenotype\n"); Results1 = OINT(y=yn,G=G,X=X); round(head(Results1),digits=3); cat("\n"); # Omnibus INT Test, T3 Phenotype cat("O-INT Applied to T3 Phenotype\n"); Results2 = OINT(y=yt,G=G,X=X); round(head(Results2),digits=3);

All association tests have the option of being run in parallel. To do so, register a parallel backend, e.g. `doMC::registerDoMC(cores=4)`

, then specify the `parallel=T`

option.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.