Home

/

GitHub

/

statpng/sp.gwas

/

sp.gwas: Selection probabilities using generalized linear model with...

sp.gwas: Selection probabilities using generalized linear model with...
In statpng/sp.gwas: An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

Description Usage Arguments Details Value Author(s) References Examples

View source: R/sp.gwas.R

For analysis of high-dimensional genomic data, penalized regression can be a solution to accomodate correlations between predictors. Moreover, selection probabilities do not depend on tuning parameter selection so that it produces a stablity selection. Thresholds are also generated to control the false positives(errors). Thresholds vary with the expected number of false positives to be controlled by the user. Input data is the hapmap formatted SNP data and phenotype data corresponding to SNP data. Output includes files of three types: (1) Matched data files (genotype, numerical, snp info, and phenotype), (2) Results file (selection probabilities and thresholds), (3) Circular manhattan plot with (blue dotted) significant line corresponding to the largest value among user-defined false discoveries.

sp.gwas(
  input.genotype = NULL,
  input.phenotype = NULL,
  input.type = c("object", "path"),
  input.y.col = 2,
  input.y.id.col = 1,
  input.y.norm = TRUE,
  impute = FALSE,
  impute.type = c("distribution", "mode"),
  qc = TRUE,
  qc.callrate.range = c(0, 1),
  qc.maf.range = c(0, 1),
  qc.HWE.range = c(0, 1),
  qc.hetero.range = c(0, 1),
  qc.remove.missingY = TRUE,
  pop = TRUE,
  pop.tsne = FALSE,
  pop.clustering = "kmeans",
  pop.gap = "firstSEmax",
  pop.gap.Kmax = NULL,
  pop.gap.nboot = 50,
  gwas.method = "lasso",
  gwas.family = "gaussian",
  gwas.alpha.seq = seq(0.1, 0.9, by = 0.2),
  gwas.nlambda = 10,
  gwas.lambda.min.quantile = 0.5,
  gwas.nrep = 100,
  gwas.psub = 0.5,
  gwas.seed = NULL,
  gwas.cpp = FALSE,
  threshold.false.discovery = c(1, 5, 10),
  threshold.perm = FALSE,
  threshold.nperm = 100,
  plot.display = FALSE,
  plot.manhattan.type = "c",
  plot.name = "",
  plot.ylim = NULL,
  plot.extension = "jpg",
  plot.dpi = 300,
  save.path = "./"
)

`input.genotype`	Either R object or file path can be considered. A genotype data is not a data.frame but a matrix with dimension `p` by `(n+11)`. It is formatted by hapmap which has (rs, allele, chr, pos) in the first four(1-4) columns, (strand, assembly, center, protLSID, assayLSID, panel, Qcode) in the following seven(5-11) columns. If NULL, user can choose a path in interactive use.
`input.phenotype`	Either R object or file path can be considered. A phenotype data is an `n` by `p` matrix. Since the first some columns can display attributes of the phenotypes, you should enter the arguments, y.col and y.id.col, which represent the columns of phenotypes to be analyzed and the column of sample ID. If NULL, user can choose a path in interactive use.
`input.type`	Default is "object". If `input.type` is "object", obejects of genotype/phenotype will be entered, and if "path", paths of genotype/phenotype will be enterd. If you want to use an object, you have to make sure that the class of each column of genotype data is equal to "character".
`input.y.col`	The columns of phenotypes. At most 4 phenotypes can be considered, because the plot of them will be fine. Default is 2.
`input.y.id.col`	The column of sample ID in the phenotype data file. Default is 1.
`input.y.norm`	If TRUE. phenotypes are converted to be normal-shape using box-cox transformation when all phenotypes are positive.
`impute`	TRUE or FALSE for whether imputation will be conducted.
`impute.type`	Two imputation methods are supported for (only) imputation=TRUE. Default is "distribution" which impute a genotype from allele distribution. The other is "mode" which indicates an imputation from the most frequent genotype.
`qc`	TRUE or FALSE for whether QC for SNPs will be conducted.
`qc.callrate.range`	A numeric vector indicating the range of non-missing proportion. Default is c(0, 1).
`qc.maf.range`	A numeric vector indicating the range of minor allele frequency (MAF) to be used. Default is c(0, 1).
`qc.HWE.range`	A numeric vector indicating the range of pvalue by Hardy-Weinberg Equillibrium to be used. Default is c(0, 1).
`qc.hetero.range`	A numeric vector indicating the range of heterozygosity values to be used, because, in some cases, heterozygosity higher than expected indicates the low quality variants or sample contamination. Default is c(0, 1).
`qc.remove.missingY`	If TRUE, the samples with missing values in phenotype data are removed. Accordingly, the corresponding genotype samples are also filtered out. Default is TRUE.
`pop`	Whether the population structure as a covariate into our model is incorporated or not. The estimation for subpopulation will be conducted using 'gap' statistic and the corresponding PCoA and tSNE plot will be provided.. Default is TRUE.
`pop.tsne`	Whether the tSNE plot would be displayed or not. Defaults is FALSE.
`pop.clustering`	The clustering method to classify the samples into subpopulations. Either "kmeans" or "pam" is available. Dafualt is "kmeans".
`pop.gap`	The method of calculating the gap statistics. Default is "firstSEmax".
`pop.gap.Kmax`	The maximum number of clusters to be taken into account. Default is 10% of the sample size.
`pop.gap.nboot`	The number of bootstraps when calculating the gap statistics. Default is 50.
`gwas.method`	A method of penalized regression. It includes "lasso" for the lasso and "enet" for the elastic-net.
`gwas.family`	A family of response variable(phenotype). It is "gaussian" for continuous response variable, "binomial" for binary, "poisson" for count, etc. Now you can use only the same family for the multi phenotypes. For more details, see the function(`stats::glm`). Default is "gaussian".
`gwas.alpha.seq`	A mixing proportion between lasso and ridge penalties. Default is from 0.1 to 0.9 by 0.2, i.e. elastic-net penalty. When you want to use the lasso penalty, set this value at 1.
`gwas.nlambda`	The length of lambda sequence. The larger n.lambda, the more detailed lambda sequence will be.
`gwas.lambda.min.quantile`	A range of lambda sequence. Default is 0.5 (median). If the range is so small that it can have many tied selection probabilities which is 1. To handle with this problem, you should increase the value of "lambda.min.quantile".
`gwas.nrep`	The number of iterations in resampling when calculating the selection probabilities.
`gwas.psub`	The subsampling proportion. For efficiency, default is 0.5.
`gwas.seed`	An integer value to reproduce the analysis result. Default is current date (e.g. 20201111).
`gwas.cpp`	Whether or not the code using Rcpp will be used.
`threshold.false.discovery`	The expected number of false discovery to be controlled. The larger it is, the higher threshold becomes. Default is c(1, 5, 10).
`threshold.perm`	Permutation-based empirical threshold can be provided for `false.discovery` if permutation is TRUE. Default is FALSE.
`threshold.nperm`	The number of permutation replicates. Note that the calculation time increases as `nperm` increases. Default is 100.
`plot.display`	Figures will be displayed or not
`plot.manhattan.type`	A type of manhattan plot to be drawn includes circular('c') and rectangular('m').
`plot.name`	A name of plot file.
`plot.ylim`	A range of the y-axis. If NULL, automatic range in the y-axis will be provided. For plot.ylim=c(0,1), the y-axis has a range of 0 and 1.
`plot.extension`	A type of plot file which includes "jpg", "pdf", "tiff", etc.
`plot.dpi`	A resolution of plot. If you want to get a high-resolution image, plot.dpi should be large.
`save.path`	A save.path which has all output files. If there exists save.path, sp.gwas will check if there is an output file. Note that if there is an output RData file in "save.path", sp.gwas will just load the output files(.RData) in there, thereby not providing the results for new "genotype" and "phenotype".

The penalty function of elastic-net is defined as

α||β||_1+(1-α)||β||_2/2,

where α is a mixing proportion of ridge and the lasso, and β is regression coefficients. This penalty is equivalent to the Lasso penalty if alpha=1.

An algorithm of selection probabilities with elastic-net.

0 : Let us assume that a genomic data has n samples and p variables.
1 : For all Λ=(α, λ), where α in [0,1], λ>0.
2 : for k=1 to K do.
3 : ——- Subsample I_k with size [n/2].
4 : ——- Compute \hat{β}_j^{Λ}(I_k) with regularization model.
5 : end for.
6 : SP_j^Λ = \frac{1}{K}\#\{k≤ K: \hat{β}_j^Λ(I_k) \ne 0 \}.
7 : SP_j = \max_Λ SP_j^Λ, ~~j=1,\cdots, p.
8 : return SP=(SP_1, \cdots, SP_p).

`Histogram of original and transformed phenotypes`	Histogram of phenotypes with p-value by Shapiro-test on the top right corner.
`myDATA`	A list of myX, myGD, myGM, myGT, myY, and myY.original(for "gaussian").
`sp.res`	A list of sp.df and threshold.
`Circular Manhattan plot`	Manhattan plot for the first phenotype is the innermost circle. Colors for chromosome is fixed, so that if you want to change colors, you would edit the R code of sp.manhattan function.

Note that, before your code, you have to specify the setseed value to get reproducible results, because it uses the resampling approach when calculating the selection probability.

The dataframe with new mean and sum columns

Kipoong Kim <kkp7700@gmail.com>

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320. Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. Kim, K., Koo, J., & Sun, H. (2020). An empirical threshold of selection probability for analysis of high-dimensional correlated data. Journal of Statistical Computation and Simulation, 1-12.

if(FALSE){

# Note that you have to change the parameters used in this example when applying this code into your works.
genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")

# elastic-net model
sp.gwas(input.genotype = genotype,
        input.phenotype = phenotype,
        input.type = "object",
        input.y.col = 2:3,
        input.y.id.col = 1,
        input.y.norm = FALSE, # due to the negative values of phenotypes

        impute = TRUE,
        impute.type = "distribution", # based on the allele frequency distribution

        qc = TRUE,
        qc.callrate.range = c(0.9, 1),
        qc.maf.range = c(0.01, 1),
        qc.HWE.range = c(0, 1),
        qc.hetero.range = c(0, 1),
        qc.remove.missingY = TRUE,

        pop = TRUE,
        pop.tsne = FALSE,
        pop.clustering = "kmeans",
        pop.gap = "firstSEmax",
        pop.gap.Kmax = NULL,
        pop.gap.nboot = 50,

        gwas.method = "enet",
        gwas.family = "gaussian",
        gwas.alpha.seq = 0.1, # for a fixed alpha value
        gwas.nlambda = 10,
        gwas.lambda.min.quantile = 0.5,
        gwas.nrep = 10, # the number of resampling replicates
        gwas.psub = 0.5,
        gwas.seed = NULL,
        gwas.cpp = FALSE,

        threshold.false.discovery = c(1,5,10),
        threshold.perm = TRUE,
        threshold.nperm = 10,

        plot = FALSE,
        plot.manhattan.type = "c",
        plot.name = "",
        plot.ylim = NULL,
        plot.extension = "jpg",
        plot.dpi = 300,

        save.path = "./EXAMPLE_enet")
        
# Manhattan plot for the first phenotype (Y1)
results <- read.csv("./EXAMPLE_enet/[2]sp.results.csv")
class(results$chr) <- "numeric"
class(results$pos) <- "numeric"
thresholds <- read.csv("./EXAMPLE_enet/[2]sp.thresholds.csv")
threshold_Y1_FD1 <- subset( subset(thresholds, Method=="Theoretical"), FD==1)$Y1
threshold_Y1_FD10 <- subset( subset(thresholds, Method=="Theoretical"), FD==10)$Y1
highlight1 <- results$rs[results$Y1 > threshold_Y1_FD1]
highlight10 <- setdiff( results$rs[results$Y1 > threshold_Y1_FD10], highlight1 )

jpeg("./EXAMPLE_enet/Manhattan_from_qqman.jpeg", width=12, height=5, unit="in", res=600)
qqman_manhattan(results,
                chr="chr", bp="pos", snp="rs", p="Y1", logp=FALSE, 
                suggestiveline = threshold_Y1_FD1,
                genomewideline = threshold_Y1_FD10,
                highlight = list(highlight1, highlight10),
                col.highlight = c("blue", "red"),
                ylab="Selection probabilities", ylim=c(0,1))
dev.off()


# Lasso model with permuted threshold for multinomial phenotypes

genotype <- sp.gwas::genotype # load("genotype.rda")
phenotype <- sp.gwas::phenotype # load("phenotype.rda")

phenotype[,2] <- gtools::quantcut(phenotype[,2], 3)
phenotype[,3] <- gtools::quantcut(phenotype[,3], 3)
phenotype[,4] <- gtools::quantcut(phenotype[,4], 3)


sp.gwas(input.genotype = genotype,
        input.phenotype = phenotype,
        input.type = "object",
        input.y.col = 2:4,
        input.y.id.col = 1,
        input.y.norm = FALSE, # due to the negative values of phenotypes
        
        impute = TRUE,
        impute.type = "distribution", # based on the allele frequency distribution
        
        qc = TRUE,
        qc.callrate.range = c(0.9, 1),
        qc.maf.range = c(0.01, 1),
        qc.HWE.range = c(0, 1),
        qc.hetero.range = c(0, 1),
        qc.remove.missingY = TRUE,
        
        pop = TRUE,
        pop.tsne = FALSE,
        pop.clustering = "kmeans",
        pop.gap = "firstSEmax",
        pop.gap.Kmax = NULL,
        pop.gap.nboot = 50,
        
        gwas.method = "lasso",
        gwas.family = "multinomial",
        gwas.alpha.seq = 1, # for a fixed alpha value
        gwas.nlambda = 10,
        gwas.lambda.min.quantile = 0.5,
        gwas.nrep = 10, # the number of resampling replicates
        gwas.psub = 0.5,
        gwas.seed = NULL,
        gwas.cpp = FALSE,
        
        threshold.false.discovery = c(1,5,10),
        threshold.perm = TRUE,
        threshold.nperm = 10,
        
        plot = FALSE,
        plot.manhattan.type = "c",
        plot.name = "",
        plot.ylim = NULL,
        plot.extension = "jpg",
        plot.dpi = 300,
        
        save.path = "./EXAMPLE_lasso_multinomial")


png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Theoretical", FD=c(1,5,10))
png.manhattan_from_dir("./EXAMPLE_lasso_multinomial", "Permuted", FD=c(1,5,10))



}

statpng/sp.gwas documentation built on Dec. 17, 2020, 5:55 a.m.

statpng/sp.gwas index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

statpng/sp.gwas
An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

sp.gwas: Selection probabilities using generalized linear model with...
In statpng/sp.gwas: An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Related to sp.gwas in statpng/sp.gwas...

R Package Documentation

Browse R Packages

We want your feedback!

statpng/sp.gwas An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

sp.gwas: Selection probabilities using generalized linear model with... In statpng/sp.gwas: An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Related to sp.gwas in statpng/sp.gwas...

R Package Documentation

Browse R Packages

We want your feedback!

statpng/sp.gwas
An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study

sp.gwas: Selection probabilities using generalized linear model with...
In statpng/sp.gwas: An integrated tool for an analysis of high-dimensional genomic data in Genome-Wide Association Study