feature.selection: Feature selection


View source: R/FeatureSelection.R

Description

Logistic regression-based feature selection approach.

Usage

feature.selection(
  input.dir,
  output.dir,
  genotype,
  phenotype,
  covar.number = NULL,
  plink.path = NULL,
  topK = 10,
  P.value = NULL,
  candidate.SNPs = NULL,
  verbose = TRUE
)

Arguments

input.dir

[character] The full absolute path to the directory containing the training and test datasets. If input.dir is missing, the current working directory obtained by getwd() is used.

output.dir

[character] The full absolute path to the directory where the results will be written. If output.dir is missing, the current working directory obtained by getwd() is used.

genotype

[character] The prefix of PLINK binary files (bed/bim/fam).

phenotype

[character] A space- or tab-delimited file specifying an alternate phenotype for the logistic regression analysis, passed to PLINK with the "--pheno" flag. The file must have a header row. The first and second columns must be "FID" and "IID", the third column holds the case/control phenotype (1 = unaffected (control), 2 = affected (case)), and the remaining columns hold covariates. See the PLINK 1.9 documentation for details (https://www.cog-genomics.org/plink/1.9/).

plink.path

[character] The full absolute path to the PLINK executable file: path/to/plink.exe on Windows, or path/to/plink on Unix-like operating systems. If plink.path is NULL, the PLINK executable must be reachable through the system PATH environment variable.

topK

[numeric] The number of top-ranked (most significant) SNPs used to build a prediction model. For a fair comparison, the number of top-ranked SNPs from the entire sample (for the LR and PRS models) equals the size of the union of the SNPs selected from each stratum in PV. The default value is 10. This value is ignored when P.value or candidate.SNPs is not NULL.

P.value

[double] The genome-wide significance P-value threshold used to select the significant SNPs for building a prediction model. The default value is NULL, in which case topK or candidate.SNPs is used. This value is ignored when candidate.SNPs is not NULL. The P-value of each SNP is calculated by logistic regression analysis using PLINK 1.9 (via plink.lr).

candidate.SNPs

[vector] A character vector of SNP names specifying the candidate SNPs used to build a prediction model. When supplied, P.value and topK are ignored. The names should match the SNP names in the provided PLINK binary files. The default value is NULL.

verbose

[logical] If TRUE, the PLINK log, error, and warning information is printed to standard output. The default value is TRUE.
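
The file layout described for the phenotype argument can be sketched as follows. This is a minimal illustration, not part of the package: the sample IDs, the third column name (PHENO), the covariate names (AGE, SEX), and the output location in tempdir() are all assumptions made for the example.

```r
# Hypothetical example data: four samples, a case/control phenotype,
# and two assumed covariates. Only the FID/IID header names and the
# 1 = control / 2 = case coding come from the documentation.
pheno <- data.frame(
  FID   = c("F1", "F2", "F3", "F4"),
  IID   = c("I1", "I2", "I3", "I4"),
  PHENO = c(1, 2, 2, 1),      # 1 = unaffected (control), 2 = affected (case)
  AGE   = c(54, 61, 47, 58),  # assumed covariate
  SEX   = c(1, 2, 2, 1),      # assumed covariate
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no quotes, no row names.
pheno.file <- file.path(tempdir(), "train.phenotypes.txt")
write.table(pheno, file = pheno.file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

# Read the file back to confirm the column order survived the round trip.
check <- read.table(pheno.file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
```

Writing without quotes and row names keeps the file in the plain delimited form that PLINK's "--pheno" flag reads.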

Value

feature.selection returns a list containing the results of the logistic regression analysis derived from PLINK (via plink.lr) and the indices and names of the selected features.

lr.result

The output of plink.lr

index

A vector of indices of the selected features.

name

A vector of names of the selected features.
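
The argument descriptions imply a precedence among the selection options: candidate.SNPs overrides P.value, which overrides topK. The sketch below illustrates that rule on made-up association results. It is not the package's implementation: select.snps is a hypothetical helper, the SNP and P column names are assumed to follow PLINK's logistic regression output, and the strict "<" comparison against the threshold is an assumption.

```r
# Hypothetical helper illustrating the documented precedence:
# candidate.SNPs > P.value > topK.
select.snps <- function(lr, topK = 10, P.value = NULL, candidate.SNPs = NULL) {
  if (!is.null(candidate.SNPs)) {
    # Candidate list given: ignore P.value and topK.
    idx <- which(lr$SNP %in% candidate.SNPs)
  } else if (!is.null(P.value)) {
    # Threshold given: ignore topK (strict "<" is an assumption).
    idx <- which(lr$P < P.value)
  } else {
    # Fall back to the topK SNPs with the smallest P-values.
    idx <- head(order(lr$P), topK)
  }
  list(index = idx, name = lr$SNP[idx])
}

# Mock association results (made-up values, for illustration only).
lr <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  P   = c(0.20, 1e-9, 0.03, 1e-4, 0.50),
  stringsAsFactors = FALSE
)

by.topK      <- select.snps(lr, topK = 2)
by.threshold <- select.snps(lr, topK = 2, P.value = 0.05)
by.candidate <- select.snps(lr, P.value = 0.05,
                            candidate.SNPs = c("rs5", "rs1"))
```

With these mock values, by.topK picks rs2 and rs4, by.threshold picks rs2, rs3, and rs4 even though topK is 2, and by.candidate picks rs1 and rs5 regardless of the threshold.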

See Also

plink.lr

Examples

## Locate the example data shipped with the pv package
input.dir <- system.file("extdata", package = "pv")
output.dir <- system.file("extdata", package = "pv")
## Replace with the full path to your PLINK executable
path2plink <- "/path/to/plink"
## Not run: 
feature.selection.result <- feature.selection(
  input.dir = input.dir,
  output.dir = output.dir,
  genotype = "train",
  phenotype = "train.phenotypes.txt",
  covar.number = c(2, 3),
  plink.path = path2plink,
  topK = 10,
  verbose = TRUE
)

## End(Not run)
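
When plink.path is left NULL, PLINK has to be discoverable through the PATH environment variable. A quick way to check this from R, before calling feature.selection, might look like the sketch below. The executable name "plink" is an assumption; installations that name the binary differently need a different lookup.

```r
# Look up the PLINK executable on the system PATH.
# "plink" is the assumed executable name; adjust it for your installation.
plink.exe <- unname(Sys.which("plink"))

# Sys.which() typically returns "" when nothing is found.
plink.found <- !is.na(plink.exe) && nzchar(plink.exe)

if (plink.found) {
  message("PLINK found at: ", plink.exe)
} else {
  message("PLINK not found on PATH; supply its location via plink.path.")
}
```

If the lookup fails, passing the full path through plink.path, as in the example above, avoids relying on the environment.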

abnerzyx/pv documentation built on Feb. 27, 2022, 12:06 a.m.