In stephenbates19/digitaltwins: Causal inference for SNPs

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Summary

This document shows how to analyze genetic data with the Digital Twin Test (DTT) using the digitaltwins package.

library(digitaltwins)

Create a synthetic data set

First, we create a synthetic data set which has two parts:

A an external data set of unrelated individuals.
A data set of parent-offspring trios.

#simulation parmaters
n_ext <- 1000 #number of external observations
n_trio <- 250 #number of trios
p <- 500 #number of observed genetic variants
chrome_width <- 50 #length of chromosome
d <- rep(p / chrome_width, p) #genetic distance between sites

k <- 5 #number of causal variants

We will consider a single chromosome of length r chrome_width Mb (roughly the length of chromosome 22), and take r p genetic variants equally spaced across the chromosome. For simplicitly, we assume recombination occurs uniformly at random.

We begin with a matrix set of simulated haplotypes included in the digitaltwins package.

 data("haps_matrix")

External data

We take the first r 2*n_ext rows of haps_matrix as the haplotypes of the r n_ext unrelated individuals (each individual has two corresponding haplotypes).

external_haps <- haps_matrix[1:(2*n_ext), ]
dim(external_haps)

Trio data

Next, we use the remaining r 4 * n_trio rows as the haplotypes of the parents for the r n_trio trios.

parent_haps <- haps_matrix[(2*n_ext + 1):(2*n_ext + 4 * n_trio), ]
dim(parent_haps)

We create the table of ancestries for the offspring haplotypes, and simulate the offspring haplotypes with the digitaltwins::generate_offspring function.

anc <- matrix(1:(4*n_trio), nrow = 2*n_trio, byrow = TRUE) #index of ancestors of haplotypes
print(dim(anc))
head(anc)

set.seed(300)
# use the "generate_offspring" function for each row of the ancestry table
offspring_haps <- mapply(
  function(i, j) {digitaltwins::generate_offspring(parent_haps[i, ], 
                                                 parent_haps[i, ], 
                                                 d = d)},
  anc[, 1], anc[, 2])
dim(offspring_haps)

Response variable

Lastly, we will simulate a response variable from a sparse linear regression model with r k causal variants. The digitaltwins::haps_to_gen function converts the matrix of haplotypes to a matrix of genotypes by adding rows 1 and 2, rows 3 and 4, etc.

#create the regression coefficients
beta <- rep(0, p)
causal_variants <- sample(1:p, k)
beta[causal_variants] <- 1

#sample the response variable from a sparse linear model
Y_ext <- digitaltwins::haps_to_gen(external_haps) %*% beta + rnorm(n_ext)
Y_offspring <- digitaltwins::haps_to_gen(offspring_haps) %*% beta + rnorm(n_trio)

We now have a population of unrelated haplotypes and parent-offspring trios, and so we will next turn to the Digital Twin Test.

DTT Step 1: Modeling using the External Data

The first step of the Digital Twin Test is model fitting. The model fitting is done on the external data set of size r n_ext. This can be done using any fitting software that yields a linear predictor. We will use the glmnet package for sparse linear regression, tuning the model with cross-validation.

library(glmnet)

lasso_fit <- cv.glmnet(digitaltwins::haps_to_gen(external_haps), Y_ext)
beta_hat <- coef(lasso_fit)
length(beta_hat) #Fitted linear predictor. First entry is an intercept.

DTT Step 2: Inference using the Trio Data

Next, we perform inference using the trio data. While our test uses a linear predictor, we emphasize that the validity of the test does not rely on the correctness of our model whatsoever. A better model fit will lead to higher power, but inference remains valid no matter the quality of the model fit.

We will take a group that contains a causal variant.

test_region <- 1:100
sum(causal_variants %in% test_region) #number of causal variants in the test region

p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring, 
                      matrix(beta_hat, ncol = 1), 
                      group = test_region, d = d, family = "gaussian")
#null p-value
p_value

We find that the p-value for this region is significant.

If we instead test a region that does not contain a causal variant, we will find a p-value that is not siginificant.

test_region <- 400:500
sum(causal_variants %in% test_region) #number of causal variants in the test region

p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring, 
                      matrix(beta_hat, ncol = 1), 
                      group = test_region, d = d, family = "gaussian")

#non-null p-value
p_value