sp.gwas: Selection probabilities using generalized linear model with...

Description Usage Arguments Details Value Author(s) References Examples

View source: R/sp.gwas.R

Description

For analysis of high-dimensional genomic data, penalized regression can be a solution to accomodate correlations between predictors. Moreover, selection probabilities do not depend on tuning parameter selection so that it produces a stablity selection. Thresholds are also generated to control the false positives(errors). Thresholds vary with the expected number of false positives to be controlled by the user. Input data is the hapmap formatted SNP data and phenotype data corresponding to SNP data. Output includes files of three types: (1) Matched data files (genotype, numerical, snp info, and phenotype), (2) Results file (selection probabilities and thresholds), (3) Circular manhattan plot with (blue dotted) significant line corresponding to the largest value among user-defined false discoveries.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
sp.gwas(
  input.genotype = NULL,
  input.phenotype = NULL,
  input.type = c("object", "path"),
  input.y.col = 2,
  input.y.id.col = 1,
  input.y.norm = TRUE,
  impute = FALSE,
  impute.type = c("distribution", "mode"),
  qc = TRUE,
  qc.callrate.range = c(0, 1),
  qc.maf.range = c(0, 1),
  qc.HWE.range = c(0, 1),
  qc.hetero.range = c(0, 1),
  qc.remove.missingY = TRUE,
  pop = TRUE,
  pop.tsne = FALSE,
  pop.clustering = "kmeans",
  pop.gap = "firstSEmax",
  pop.gap.Kmax = NULL,
  pop.gap.nboot = 50,
  gwas.method = "lasso",
  gwas.family = "gaussian",
  gwas.alpha.seq = seq(0.1, 0.9, by = 0.2),
  gwas.nlambda = 10,
  gwas.lambda.min.quantile = 0.5,
  gwas.nrep = 100,
  gwas.psub = 0.5,
  gwas.seed = NULL,
  gwas.cpp = FALSE,
  threshold.false.discovery = c(1, 5, 10),
  threshold.perm = FALSE,
  threshold.nperm = 100,
  plot.display = FALSE,
  plot.manhattan.type = "c",
  plot.name = "",
  plot.ylim = NULL,
  plot.extension = "jpg",
  plot.dpi = 300,
  save.path = "./"
)

Arguments

input.genotype

Either R object or file path can be considered. A genotype data is not a data.frame but a matrix with dimension p by (n+11). It is formatted by hapmap which has (rs, allele, chr, pos) in the first four(1-4) columns, (strand, assembly, center, protLSID, assayLSID, panel, Qcode) in the following seven(5-11) columns. If NULL, user can choose a path in interactive use.

input.phenotype

Either R object or file path can be considered. A phenotype data is an n by p matrix. Since the first some columns can display attributes of the phenotypes, you should enter the arguments, y.col and y.id.col, which represent the columns of phenotypes to be analyzed and the column of sample ID. If NULL, user can choose a path in interactive use.

input.type

Default is "object". If input.type is "object", obejects of genotype/phenotype will be entered, and if "path", paths of genotype/phenotype will be enterd. If you want to use an object, you have to make sure that the class of each column of genotype data is equal to "character".

input.y.col

The columns of phenotypes. At most 4 phenotypes can be considered, because the plot of them will be fine. Default is 2.

input.y.id.col

The column of sample ID in the phenotype data file. Default is 1.

input.y.norm

If TRUE. phenotypes are converted to be normal-shape using box-cox transformation when all phenotypes are positive.

impute

TRUE or FALSE for whether imputation will be conducted.

impute.type

Two imputation methods are supported for (only) imputation=TRUE. Default is "distribution" which impute a genotype from allele distribution. The other is "mode" which indicates an imputation from the most frequent genotype.

qc

TRUE or FALSE for whether QC for SNPs will be conducted.

qc.callrate.range

A numeric vector indicating the range of non-missing proportion. Default is c(0, 1).

qc.maf.range

A numeric vector indicating the range of minor allele frequency (MAF) to be used. Default is c(0, 1).

qc.HWE.range

A numeric vector indicating the range of pvalue by Hardy-Weinberg Equillibrium to be used. Default is c(0, 1).

qc.hetero.range

A numeric vector indicating the range of heterozygosity values to be used, because, in some cases, heterozygosity higher than expected indicates the low quality variants or sample contamination. Default is c(0, 1).

qc.remove.missingY

If TRUE, the samples with missing values in phenotype data are removed. Accordingly, the corresponding genotype samples are also filtered out. Default is TRUE.

pop

Whether the population structure as a covariate into our model is incorporated or not. The estimation for subpopulation will be conducted using 'gap' statistic and the corresponding PCoA and tSNE plot will be provided.. Default is TRUE.

pop.tsne

Whether the tSNE plot would be displayed or not. Defaults is FALSE.

pop.clustering

The clustering method to classify the samples into subpopulations. Either "kmeans" or "pam" is available. Dafualt is "kmeans".

pop.gap

The method of calculating the gap statistics. Default is "firstSEmax".

pop.gap.Kmax

The maximum number of clusters to be taken into account. Default is 10% of the sample size.

pop.gap.nboot

The number of bootstraps when calculating the gap statistics. Default is 50.

gwas.method

A method of penalized regression. It includes "lasso" for the lasso and "enet" for the elastic-net.

gwas.family

A family of response variable(phenotype). It is "gaussian" for continuous response variable, "binomial" for binary, "poisson" for count, etc. Now you can use only the same family for the multi phenotypes. For more details, see the function(stats::glm). Default is "gaussian".

gwas.alpha.seq

A mixing proportion between lasso and ridge penalties. Default is from 0.1 to 0.9 by 0.2, i.e. elastic-net penalty. When you want to use the lasso penalty, set this value at 1.

gwas.nlambda

The length of lambda sequence. The larger n.lambda, the more detailed lambda sequence will be.

gwas.lambda.min.quantile

A range of lambda sequence. Default is 0.5 (median). If the range is so small that it can have many tied selection probabilities which is 1. To handle with this problem, you should increase the value of "lambda.min.quantile".

gwas.nrep

The number of iterations in resampling when calculating the selection probabilities.

gwas.psub

The subsampling proportion. For efficiency, default is 0.5.

gwas.seed

An integer value to reproduce the analysis result. Default is current date (e.g. 20201111).

gwas.cpp

Whether or not the code using Rcpp will be used.

threshold.false.discovery

The expected number of false discovery to be controlled. The larger it is, the higher threshold becomes. Default is c(1, 5, 10).

threshold.perm

Permutation-based empirical threshold can be provided for false.discovery if permutation is TRUE. Default is FALSE.

threshold.nperm

The number of permutation replicates. Note that the calculation time increases as nperm increases. Default is 100.

plot.display

Figures will be displayed or not

plot.manhattan.type

A type of manhattan plot to be drawn includes circular('c') and rectangular('m').

plot.name

A name of plot file.

plot.ylim

A range of the y-axis. If NULL, automatic range in the y-axis will be provided. For plot.ylim=c(0,1), the y-axis has a range of 0 and 1.

plot.extension

A type of plot file which includes "jpg", "pdf", "tiff", etc.

plot.dpi

A resolution of plot. If you want to get a high-resolution image, plot.dpi should be large.

save.path

A save.path which has all output files. If there exists save.path, sp.gwas will check if there is an output file. Note that if there is an output RData file in "save.path", sp.gwas will just load the output files(.RData) in there, thereby not providing the results for new "genotype" and "phenotype".

Details

The penalty function of elastic-net is defined as

α||β||_1+(1-α)||β||_2/2,

where α is a mixing proportion of ridge and the lasso, and β is regression coefficients. This penalty is equivalent to the Lasso penalty if alpha=1.

An algorithm of selection probabilities with elastic-net.

0 : Let us assume that a genomic data has n samples and p variables.
1 : For all Λ=(α, λ), where α in [0,1], λ>0.
2 : for k=1 to K do.
3 : ——- Subsample I_k with size [n/2].
4 : ——- Compute \hat{β}_j^{Λ}(I_k) with regularization model.
5 : end for.
6 : SP_j^Λ = \frac{1}{K}\#\{k≤ K: \hat{β}_j^Λ(I_k) \ne 0 \}.
7 : SP_j = \max_Λ SP_j^Λ, ~~j=1,\cdots, p.
8 : return SP=(SP_1, \cdots, SP_p).

Value

Histogram of original and transformed phenotypes

Histogram of phenotypes with p-value by Shapiro-test on the top right corner.

myDATA

A list of myX, myGD, myGM, myGT, myY, and myY.original(for "gaussian").

sp.res

A list of sp.df and threshold.

Circular Manhattan plot

Manhattan plot for the first phenotype is the innermost circle. Colors for chromosome is fixed, so that if you want to change colors, you would edit the R code of sp.manhattan function.

Note that, before your code, you have to specify the setseed value to get reproducible results, because it uses the resampling approach when calculating the selection probability.

The dataframe with new mean and sum columns

Author(s)

Kipoong Kim <kkp7700@gmail.com>

References

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320. Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. Kim, K., Koo, J., & Sun, H. (2020). An empirical threshold of selection probability for analysis of high-dimensional correlated data. Journal of Statistical Computation and Simulation, 1-12.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
if(FALSE){

# Note that you have to change the parameters used in this example when applying this code into your works.
genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")

# elastic-net model
sp.gwas(input.genotype = genotype,
        input.phenotype = phenotype,
        input.type = "object",
        input.y.col = 2:3,
        input.y.id.col = 1,
        input.y.norm = FALSE, # due to the negative values of phenotypes

        impute = TRUE,
        impute.type = "distribution", # based on the allele frequency distribution

        qc = TRUE,
        qc.callrate.range = c(0.9, 1),
        qc.maf.range = c(0.01, 1),
        qc.HWE.range = c(0, 1),
        qc.hetero.range = c(0, 1),
        qc.remove.missingY = TRUE,

        pop = TRUE,
        pop.tsne = FALSE,
        pop.clustering = "kmeans",
        pop.gap = "firstSEmax",
        pop.gap.Kmax = NULL,
        pop.gap.nboot = 50,

        gwas.method = "enet",
        gwas.family = "gaussian",
        gwas.alpha.seq = 0.1, # for a fixed alpha value
        gwas.nlambda = 10,
        gwas.lambda.min.quantile = 0.5,
        gwas.nrep = 10, # the number of resampling replicates
        gwas.psub = 0.5,
        gwas.seed = NULL,
        gwas.cpp = FALSE,

        threshold.false.discovery = c(1,5,10),
        threshold.perm = TRUE,
        threshold.nperm = 10,

        plot = FALSE,
        plot.manhattan.type = "c",
        plot.name = "",
        plot.ylim = NULL,
        plot.extension = "jpg",
        plot.dpi = 300,

        save.path = "./EXAMPLE_enet")
        
# Manhattan plot for the first phenotype (Y1)
results <- read.csv("./EXAMPLE_enet/[2]sp.results.csv")
class(results$chr) <- "numeric"
class(results$pos) <- "numeric"
thresholds <- read.csv("./EXAMPLE_enet/[2]sp.thresholds.csv")
threshold_Y1_FD1 <- subset( subset(thresholds, Method=="Theoretical"), FD==1)$Y1
threshold_Y1_FD10 <- subset( subset(thresholds, Method=="Theoretical"), FD==10)$Y1
highlight1 <- results$rs[results$Y1 > threshold_Y1_FD1]
highlight10 <- setdiff( results$rs[results$Y1 > threshold_Y1_FD10], highlight1 )

jpeg("./EXAMPLE_enet/Manhattan_from_qqman.jpeg", width=12, height=5, unit="in", res=600)
qqman_manhattan(results,
                chr="chr", bp="pos", snp="rs", p="Y1", logp=FALSE, 
                suggestiveline = threshold_Y1_FD1,
                genomewideline = threshold_Y1_FD10,
                highlight = list(highlight1, highlight10),
                col.highlight = c("blue", "red"),
                ylab="Selection probabilities", ylim=c(0,1))
dev.off()


# Lasso model with permuted threshold for multinomial phenotypes

genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")

phenotype[,2] <- gtools::quantcut(phenotype[,2], 3)
phenotype[,3] <- gtools::quantcut(phenotype[,3], 3)
phenotype[,4] <- gtools::quantcut(phenotype[,4], 3)


sp.gwas(input.genotype = genotype,
        input.phenotype = phenotype,
        input.type = "object",
        input.y.col = 2:4,
        input.y.id.col = 1,
        input.y.norm = FALSE, # due to the negative values of phenotypes
        
        impute = TRUE,
        impute.type = "distribution", # based on the allele frequency distribution
        
        qc = TRUE,
        qc.callrate.range = c(0.9, 1),
        qc.maf.range = c(0.01, 1),
        qc.HWE.range = c(0, 1),
        qc.hetero.range = c(0, 1),
        qc.remove.missingY = TRUE,
        
        pop = TRUE,
        pop.tsne = FALSE,
        pop.clustering = "kmeans",
        pop.gap = "firstSEmax",
        pop.gap.Kmax = NULL,
        pop.gap.nboot = 50,
        
        gwas.method = "lasso",
        gwas.family = "multinomial",
        gwas.alpha.seq = 1, # for a fixed alpha value
        gwas.nlambda = 10,
        gwas.lambda.min.quantile = 0.5,
        gwas.nrep = 10, # the number of resampling replicates
        gwas.psub = 0.5,
        gwas.seed = NULL,
        gwas.cpp = FALSE,
        
        threshold.false.discovery = c(1,5,10),
        threshold.perm = TRUE,
        threshold.nperm = 10,
        
        plot = FALSE,
        plot.manhattan.type = "c",
        plot.name = "",
        plot.ylim = NULL,
        plot.extension = "jpg",
        plot.dpi = 300,
        
        save.path = "./EXAMPLE_lasso_multinomial")


png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Theoretical", FD=c(1,5,10))
png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Permuted", FD=c(1,5,10))



}

statpng/sp.gwas documentation built on Dec. 17, 2020, 5:55 a.m.