Description Usage Arguments Details Value Author(s) References Examples
For analysis of high-dimensional genomic data, penalized regression can be a solution to accomodate correlations between predictors. Moreover, selection probabilities do not depend on tuning parameter selection so that it produces a stablity selection. Thresholds are also generated to control the false positives(errors). Thresholds vary with the expected number of false positives to be controlled by the user. Input data is the hapmap formatted SNP data and phenotype data corresponding to SNP data. Output includes files of three types: (1) Matched data files (genotype, numerical, snp info, and phenotype), (2) Results file (selection probabilities and thresholds), (3) Circular manhattan plot with (blue dotted) significant line corresponding to the largest value among user-defined false discoveries.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | sp.gwas(
input.genotype = NULL,
input.phenotype = NULL,
input.type = c("object", "path"),
input.y.col = 2,
input.y.id.col = 1,
input.y.norm = TRUE,
impute = FALSE,
impute.type = c("distribution", "mode"),
qc = TRUE,
qc.callrate.range = c(0, 1),
qc.maf.range = c(0, 1),
qc.HWE.range = c(0, 1),
qc.hetero.range = c(0, 1),
qc.remove.missingY = TRUE,
pop = TRUE,
pop.tsne = FALSE,
pop.clustering = "kmeans",
pop.gap = "firstSEmax",
pop.gap.Kmax = NULL,
pop.gap.nboot = 50,
gwas.method = "lasso",
gwas.family = "gaussian",
gwas.alpha.seq = seq(0.1, 0.9, by = 0.2),
gwas.nlambda = 10,
gwas.lambda.min.quantile = 0.5,
gwas.nrep = 100,
gwas.psub = 0.5,
gwas.seed = NULL,
gwas.cpp = FALSE,
threshold.false.discovery = c(1, 5, 10),
threshold.perm = FALSE,
threshold.nperm = 100,
plot.display = FALSE,
plot.manhattan.type = "c",
plot.name = "",
plot.ylim = NULL,
plot.extension = "jpg",
plot.dpi = 300,
save.path = "./"
)
|
input.genotype |
Either R object or file path can be considered. A genotype data is not a data.frame but a matrix with dimension |
input.phenotype |
Either R object or file path can be considered. A phenotype data is an |
input.type |
Default is "object". If |
input.y.col |
The columns of phenotypes. At most 4 phenotypes can be considered, because the plot of them will be fine. Default is 2. |
input.y.id.col |
The column of sample ID in the phenotype data file. Default is 1. |
input.y.norm |
If TRUE. phenotypes are converted to be normal-shape using box-cox transformation when all phenotypes are positive. |
impute |
TRUE or FALSE for whether imputation will be conducted. |
impute.type |
Two imputation methods are supported for (only) imputation=TRUE. Default is "distribution" which impute a genotype from allele distribution. The other is "mode" which indicates an imputation from the most frequent genotype. |
qc |
TRUE or FALSE for whether QC for SNPs will be conducted. |
qc.callrate.range |
A numeric vector indicating the range of non-missing proportion. Default is c(0, 1). |
qc.maf.range |
A numeric vector indicating the range of minor allele frequency (MAF) to be used. Default is c(0, 1). |
qc.HWE.range |
A numeric vector indicating the range of pvalue by Hardy-Weinberg Equillibrium to be used. Default is c(0, 1). |
qc.hetero.range |
A numeric vector indicating the range of heterozygosity values to be used, because, in some cases, heterozygosity higher than expected indicates the low quality variants or sample contamination. Default is c(0, 1). |
qc.remove.missingY |
If TRUE, the samples with missing values in phenotype data are removed. Accordingly, the corresponding genotype samples are also filtered out. Default is TRUE. |
pop |
Whether the population structure as a covariate into our model is incorporated or not. The estimation for subpopulation will be conducted using 'gap' statistic and the corresponding PCoA and tSNE plot will be provided.. Default is TRUE. |
pop.tsne |
Whether the tSNE plot would be displayed or not. Defaults is FALSE. |
pop.clustering |
The clustering method to classify the samples into subpopulations. Either "kmeans" or "pam" is available. Dafualt is "kmeans". |
pop.gap |
The method of calculating the gap statistics. Default is "firstSEmax". |
pop.gap.Kmax |
The maximum number of clusters to be taken into account. Default is 10% of the sample size. |
pop.gap.nboot |
The number of bootstraps when calculating the gap statistics. Default is 50. |
gwas.method |
A method of penalized regression. It includes "lasso" for the lasso and "enet" for the elastic-net. |
gwas.family |
A family of response variable(phenotype). It is "gaussian" for continuous response variable, "binomial" for binary, "poisson" for count, etc. Now you can use only the same family for the multi phenotypes. For more details, see the function( |
gwas.alpha.seq |
A mixing proportion between lasso and ridge penalties. Default is from 0.1 to 0.9 by 0.2, i.e. elastic-net penalty. When you want to use the lasso penalty, set this value at 1. |
gwas.nlambda |
The length of lambda sequence. The larger n.lambda, the more detailed lambda sequence will be. |
gwas.lambda.min.quantile |
A range of lambda sequence. Default is 0.5 (median). If the range is so small that it can have many tied selection probabilities which is 1. To handle with this problem, you should increase the value of "lambda.min.quantile". |
gwas.nrep |
The number of iterations in resampling when calculating the selection probabilities. |
gwas.psub |
The subsampling proportion. For efficiency, default is 0.5. |
gwas.seed |
An integer value to reproduce the analysis result. Default is current date (e.g. 20201111). |
gwas.cpp |
Whether or not the code using Rcpp will be used. |
threshold.false.discovery |
The expected number of false discovery to be controlled. The larger it is, the higher threshold becomes. Default is c(1, 5, 10). |
threshold.perm |
Permutation-based empirical threshold can be provided for |
threshold.nperm |
The number of permutation replicates. Note that the calculation time increases as |
plot.display |
Figures will be displayed or not |
plot.manhattan.type |
A type of manhattan plot to be drawn includes circular('c') and rectangular('m'). |
plot.name |
A name of plot file. |
plot.ylim |
A range of the y-axis. If NULL, automatic range in the y-axis will be provided. For plot.ylim=c(0,1), the y-axis has a range of 0 and 1. |
plot.extension |
A type of plot file which includes "jpg", "pdf", "tiff", etc. |
plot.dpi |
A resolution of plot. If you want to get a high-resolution image, plot.dpi should be large. |
save.path |
A save.path which has all output files. If there exists save.path, sp.gwas will check if there is an output file. Note that if there is an output RData file in "save.path", sp.gwas will just load the output files(.RData) in there, thereby not providing the results for new "genotype" and "phenotype". |
The penalty function of elastic-net
is defined as
α||β||_1+(1-α)||β||_2/2,
where α is a mixing proportion of ridge and the lasso, and β is regression coefficients. This penalty is equivalent to the Lasso penalty if alpha=1
.
An algorithm of selection probabilities with elastic-net.
0 : Let us assume that a genomic data has n
samples and p
variables.
1 : For all Λ=(α, λ), where α in [0,1], λ>0.
2 : for
k=1 to K do
.
3 : ——- Subsample I_k with size [n/2].
4 : ——- Compute \hat{β}_j^{Λ}(I_k) with regularization model.
5 : end for.
6 : SP_j^Λ = \frac{1}{K}\#\{k≤ K: \hat{β}_j^Λ(I_k) \ne 0 \}.
7 : SP_j = \max_Λ SP_j^Λ, ~~j=1,\cdots, p.
8 : return
SP
=(SP_1, \cdots, SP_p).
Histogram of original and transformed phenotypes |
Histogram of phenotypes with p-value by Shapiro-test on the top right corner. |
myDATA |
A list of myX, myGD, myGM, myGT, myY, and myY.original(for "gaussian"). |
sp.res |
A list of sp.df and threshold. |
Circular Manhattan plot |
Manhattan plot for the first phenotype is the innermost circle. Colors for chromosome is fixed, so that if you want to change colors, you would edit the R code of sp.manhattan function. |
Note that, before your code, you have to specify the setseed value to get reproducible results, because it uses the resampling approach when calculating the selection probability.
The dataframe with new mean and sum columns
Kipoong Kim <kkp7700@gmail.com>
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320. Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. Kim, K., Koo, J., & Sun, H. (2020). An empirical threshold of selection probability for analysis of high-dimensional correlated data. Journal of Statistical Computation and Simulation, 1-12.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 | if(FALSE){
# Note that you have to change the parameters used in this example when applying this code into your works.
genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")
# elastic-net model
sp.gwas(input.genotype = genotype,
input.phenotype = phenotype,
input.type = "object",
input.y.col = 2:3,
input.y.id.col = 1,
input.y.norm = FALSE, # due to the negative values of phenotypes
impute = TRUE,
impute.type = "distribution", # based on the allele frequency distribution
qc = TRUE,
qc.callrate.range = c(0.9, 1),
qc.maf.range = c(0.01, 1),
qc.HWE.range = c(0, 1),
qc.hetero.range = c(0, 1),
qc.remove.missingY = TRUE,
pop = TRUE,
pop.tsne = FALSE,
pop.clustering = "kmeans",
pop.gap = "firstSEmax",
pop.gap.Kmax = NULL,
pop.gap.nboot = 50,
gwas.method = "enet",
gwas.family = "gaussian",
gwas.alpha.seq = 0.1, # for a fixed alpha value
gwas.nlambda = 10,
gwas.lambda.min.quantile = 0.5,
gwas.nrep = 10, # the number of resampling replicates
gwas.psub = 0.5,
gwas.seed = NULL,
gwas.cpp = FALSE,
threshold.false.discovery = c(1,5,10),
threshold.perm = TRUE,
threshold.nperm = 10,
plot = FALSE,
plot.manhattan.type = "c",
plot.name = "",
plot.ylim = NULL,
plot.extension = "jpg",
plot.dpi = 300,
save.path = "./EXAMPLE_enet")
# Manhattan plot for the first phenotype (Y1)
results <- read.csv("./EXAMPLE_enet/[2]sp.results.csv")
class(results$chr) <- "numeric"
class(results$pos) <- "numeric"
thresholds <- read.csv("./EXAMPLE_enet/[2]sp.thresholds.csv")
threshold_Y1_FD1 <- subset( subset(thresholds, Method=="Theoretical"), FD==1)$Y1
threshold_Y1_FD10 <- subset( subset(thresholds, Method=="Theoretical"), FD==10)$Y1
highlight1 <- results$rs[results$Y1 > threshold_Y1_FD1]
highlight10 <- setdiff( results$rs[results$Y1 > threshold_Y1_FD10], highlight1 )
jpeg("./EXAMPLE_enet/Manhattan_from_qqman.jpeg", width=12, height=5, unit="in", res=600)
qqman_manhattan(results,
chr="chr", bp="pos", snp="rs", p="Y1", logp=FALSE,
suggestiveline = threshold_Y1_FD1,
genomewideline = threshold_Y1_FD10,
highlight = list(highlight1, highlight10),
col.highlight = c("blue", "red"),
ylab="Selection probabilities", ylim=c(0,1))
dev.off()
# Lasso model with permuted threshold for multinomial phenotypes
genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")
phenotype[,2] <- gtools::quantcut(phenotype[,2], 3)
phenotype[,3] <- gtools::quantcut(phenotype[,3], 3)
phenotype[,4] <- gtools::quantcut(phenotype[,4], 3)
sp.gwas(input.genotype = genotype,
input.phenotype = phenotype,
input.type = "object",
input.y.col = 2:4,
input.y.id.col = 1,
input.y.norm = FALSE, # due to the negative values of phenotypes
impute = TRUE,
impute.type = "distribution", # based on the allele frequency distribution
qc = TRUE,
qc.callrate.range = c(0.9, 1),
qc.maf.range = c(0.01, 1),
qc.HWE.range = c(0, 1),
qc.hetero.range = c(0, 1),
qc.remove.missingY = TRUE,
pop = TRUE,
pop.tsne = FALSE,
pop.clustering = "kmeans",
pop.gap = "firstSEmax",
pop.gap.Kmax = NULL,
pop.gap.nboot = 50,
gwas.method = "lasso",
gwas.family = "multinomial",
gwas.alpha.seq = 1, # for a fixed alpha value
gwas.nlambda = 10,
gwas.lambda.min.quantile = 0.5,
gwas.nrep = 10, # the number of resampling replicates
gwas.psub = 0.5,
gwas.seed = NULL,
gwas.cpp = FALSE,
threshold.false.discovery = c(1,5,10),
threshold.perm = TRUE,
threshold.nperm = 10,
plot = FALSE,
plot.manhattan.type = "c",
plot.name = "",
plot.ylim = NULL,
plot.extension = "jpg",
plot.dpi = 300,
save.path = "./EXAMPLE_lasso_multinomial")
png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Theoretical", FD=c(1,5,10))
png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Permuted", FD=c(1,5,10))
}
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.