# Using SNPknock in R In SNPknock: Knockoffs for Hidden Markov Models and Genetic Data

This vignette illustrates the usage of the `SNPknock` package for creating knockoffs of discrete Markov chains and hidden Markov models [@sesia2019]. For simplicity, we will use synthetic data.

The `SNPknock` package also provides a simple interface to the genotype imputation software `fastPhase`, which can be used to fit hidden Markov models for genotype data. Since `fastPhase` is not available as an R package, this particular functionality of `SNPknock` cannot be demonstrated here. A tutorial showing how to use a combination of `SNPknock` and `fastPhase` to create knockoffs of genotype data can be found here.

## Knockoffs of discrete Markov chains

First, we verify that the `SNPknock` can be loaded.

```library(SNPknock)
```

We define a Markov chain model with 50 variables, each taking one of 5 possible values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries.

```p=50; # Number of variables in the model
K=5;  # Number of possible states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) {
Q[j,,] = Q[j,,] + diag(rep(1,K))
Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}
```

We can sample 100 independent observations of this Markov chain using the `SNPknock` package.

```set.seed(1234)
X = sampleDMC(pInit, Q, n=100)
print(X[1:5,1:10])
```

Above, each row of `X` contains an independent realization of the Markov chain.

A knockoff copy of `X` can be sampled as follows.

```Xk = knockoffDMC(X, pInit, Q)
print(Xk[1:5,1:10])
```

## Knockoffs of hidden Markov models

We define a hidden Markov chain model with 50 variables, each taking one of 3 possible values, and 50 latent variables distributed as a Markov chain, each taking one of 5 values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries. We also define an emission distributions over the 3 possible emission states, for each of the 50 latent variables and the 5 latent states.

```p=200; # Number of variables in the model
K=5;  # Number of possible states for each variable
M=3;  # Number of possible emission states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
rho = stats::runif(p-1, min = 1, max = 50)
for(j in 1:(p-1)) {
Q[j,,] = Q[j,,] + rho[j] * diag(rep(1,K))
Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}
# Create p emission matrices
pEmit = array(stats::runif(p*M*K),c(p,M,K))
for(j in 1:p) { pEmit[j,,] = sweep(pEmit[j,,],2,colSums(pEmit[j,,]),`/`) }
```

We can sample 100 independent observations of this hidden Markov model using the `SNPknock` package.

```set.seed(1234)
X = sampleHMM(pInit, Q, pEmit, n=100)
print(X[1:5,1:10])
```

Above, each row of `X` contains an independent realization of the hidden Markov model.

A knockoff copy of `X` can be sampled as follows.

```Xk = knockoffHMM(X, pInit, Q, pEmit)
print(Xk[1:5,1:10])
```

## See also

If you want to see how to use `SNPknock` to create knockoffs of genotype data, see the genotypes vignette.

If you want to learn about `SNPknock` for large genome-wide association studies [@sesia2019multi], see https://msesia.github.io/knockoffzoom/

## Try the SNPknock package in your browser

Any scripts or data that you put into this service are public.

SNPknock documentation built on May 18, 2019, 1:03 a.m.