Tutorial for GWAS with SLOPE
In geneSLOPE: Genome-Wide Association Study with SLOPE

How to do GWAS?
How changing parameters affects my analysis?
How this procedure works exactly?

How to do GWAS?

This tutorial will guide you on how to perform GWAS with SLOPE. Analysis consists of three simple steps.

Reading data

You need to provide paths to three files:

.fam file with information about observations including phenotype. By default this file is assumed to have six column, with last one containing phenotype. For details see documentation of function readPhenotype
.map file with mapping information about snps. This file is not required for subsequent analysis, but is highly recommended. Note that lack of mapping information will result in less informative plots and results summary.
.raw file with snps. We assume that snps were previously exported from PLINK with command
plink --file input --recodeAD --out output, where input is you name of .ped file.

library(geneSLOPE)
famFile <- system.file("extdata", "plinkPhenotypeExample.fam", package = "geneSLOPE")
mapFile <- system.file("extdata", "plinkMapExample.map", package = "geneSLOPE")
snpsFile <- system.file("extdata", "plinkDataExample.raw", package = "geneSLOPE")

phenotype <- read_phenotype(filename = famFile)

When you have phenotype you can move to reading snp data. Depending on data size reading SNPs may long time. As data is very large, snps are filtered with their marginal test p-value. All snps which p-values are larger than threshold $pValMax$ will be truncated. For details on how to choose $pValMax$ see How changing parameters affects my analysis?

screening.result <- screen_snps(snpsFile, mapFile, phenotype, pValMax = 0.05, 
                      chunkSize = 1e2, verbose=FALSE)

Parameter verbose=FALSE suppresses progress bar. Default value is TRUE.

User look into result of reading and screening dataset

summary(screening.result)

When data is successfully read, one can move to the second step of analysis.

Clumping highly correlated genes

Next step is clumping. Highly correlated snps will be clustered. For details see How this procedure works exactly? $rho$ controls number and size of clumps. For details see How changing parameters affects my analysis?

clumping.result <- clump_snps(screening.result, rho = 0.3, verbose = FALSE)

What is the result of clumping procedure?

summary(clumping.result)

We can also plot our results

plot(clumping.result)

If we are interested in specific chromosome we can "zoom it"

plot(clumping.result, chromosomeNumber = 1)

It is possible to identify interactively clump number that contains SNP of interest. The procedure is the following. First plot the whole genome, then run function \emph{identify_clump} and click on SNP of interest.

plot(clumping.result)
identify_clump(clumping.result)

Knowing clump number one can zoom into it.

plot(clumping.result, clumpNumber = 1)

Running SLOPE on result of clumping procedure

Last step of analysis is using SLOPE

slope.result <- select_snps(clumping.result, fdr=0.1)

As before one can plot and summarize results

summary(slope.result)
plot(slope.result)

Like with result of clumping, it is possible to identify interactively clump number which contains specific SNP selected by SLOPE. The procedure is the following. First plot the whole genome, then run function \emph{identify_clump} and click on SNP of interest.

plot(slope.result)
identify_clump(slope.result)

When clump is identified one can zoom into it

plot(slope.result, clumpNumber = 1)

It is easy to get information about selected SNPs. To get indices of columns in original SNP matrix they refer to use

slope.result$selectedSnpsNumbers

If .map file was given, then one can get more information about SNPs

slope.result$X_info[slope.result$selectedSnpsNumbers,]

For information about SNPs that are part of specific clump use

summary(slope.result, clumpNumber = 1)

How changing parameters affects my analysis?

There are three numerical parameters that influence result

$pValMax$ is the threshold p-value for marginal test. When data is loaded to R, initial screening of snps is performed. For every snp, test for slope coefficient in simple linear regression model $lm(phenotype \sim snp)$ is performed. All snps with p-value larger than pValMax are discarded. Setting this parameter to too large will significantly increase number of snps on which clumping procedure will be performed. This may cause two technical threats
- Computer might run out of RAM memory
- Clumping procedure might take a lot of time
$rho$ is threshold for correlation between snps in clumping procedure. For given snp, every that is correlated with it at least at level $rho$ will be clumped together. Setting this parameter too high (say 0.7) will cause SLOPE to work on highly correlated snps which might affect FDR control
$fdr$ false discover rate (FDR) for SLOPE procedure. The higher $fdr$ is, the more variables will be accepted to the model. Contrary, small $fdr$ yields more conservative models

How this procedure works exactly?

Clumping Procedure for SLOPE (CPS)

Input: $rho \in (0, 1)$;

for each SNPs calculate p-value for simple linear regression test, i.e. after assuming linear regression model with a single explanatory variable and testing if slope parameter is nonzero. Vector created from p-values gives hierarchy which is used in next steps;
select index, Idx, corresponding to the smallest p-value, create group of all SNPs correlated with SNPs Idx at least on level Rho (Pearson correlation);
define this group as clump and Idx as representative of clump. Exclude entire clump from remained SNPs;
repeat two previous steps until each SNP is assigned to some clump.