stratification: Data stratification


View source: R/Stratification.R

Description

A function that stratifies samples into several strata using the top principal components.

Usage

stratification(
  input.dir,
  output.dir,
  train.genotype,
  test.genotype,
  stratum.count = 2,
  PCA.separate = FALSE,
  PCs.count = 10,
  plink.path = NULL,
  CS = FALSE,
  verbose = TRUE
)

Arguments

input.dir

[character] The full absolute path to the directory containing the training and test datasets. If input.dir is missing, the current working directory obtained by getwd() is used.

output.dir

[character] The full absolute path to the directory where the results will be written. If output.dir is missing, the current working directory obtained by getwd() is used.

train.genotype

[character] The prefix of PLINK binary files (bed/bim/fam) of the training dataset.

test.genotype

[character] The prefix of PLINK binary files (bed/bim/fam) of the test dataset.

stratum.count

[numeric] The number of strata. By default, each stratum contains N/stratum.count samples, where N is the total sample size. The default value is 2.

PCA.separate

[logical] If TRUE, the principal components are calculated from the training dataset and the test dataset is then projected onto those principal components. If FALSE, the principal components are calculated from the combined training and test datasets. The default value is FALSE.

PCs.count

[numeric] The number of top principal components to extract. The default value is 10.

plink.path

[character] The full absolute path to the PLINK executable file. On Windows this is path/to/plink.exe; on Unix-like operating systems it is path/to/plink. If plink.path is NULL, the location of the PLINK executable must be added to the system PATH environment variable.

CS

[logical] If TRUE, the softmax of the cosine similarity is used to calculate the probability that a sample belongs to each stratum. If FALSE, the probability is calculated from the squared distance of a subject to a cluster center, which empirically follows a chi-squared distribution. The default value is FALSE.

verbose

[logical] If TRUE, the PLINK log, error, and warning information is printed to standard output. The default value is TRUE.
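
The two behaviours selected by PCA.separate can be illustrated on a small simulated matrix. This is only a sketch: it uses base R's prcomp() as a stand-in, whereas the package runs its PCA on PLINK binary files, and the matrices train and test and the value k (playing the role of PCs.count) are invented for the illustration.

```r
# Simulated genotype-like matrices (rows = samples, columns = variants).
# These are toy data, not the datasets shipped with the package.
set.seed(1)
train <- matrix(rbinom(20 * 6, size = 2, prob = 0.3), nrow = 20)
test  <- matrix(rbinom(8 * 6,  size = 2, prob = 0.3), nrow = 8)
k <- 3  # plays the role of PCs.count

# PCA.separate = TRUE: fit the PCA on the training data only,
# then project the test samples onto the training components.
fit.train <- prcomp(train, center = TRUE, scale. = FALSE)
train.pcs.sep <- fit.train$x[, 1:k]
test.pcs.sep  <- predict(fit.train, newdata = test)[, 1:k]

# PCA.separate = FALSE: fit the PCA once on the combined data,
# then split the scores back into training and test rows.
fit.all <- prcomp(rbind(train, test), center = TRUE, scale. = FALSE)
train.pcs.joint <- fit.all$x[seq_len(nrow(train)), 1:k]
test.pcs.joint  <- fit.all$x[nrow(train) + seq_len(nrow(test)), 1:k]
```

In the separate case the test samples have no influence on the component directions, while in the joint case they do.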

Value

stratification returns a list containing the following components:

train.stratum

A vector containing the stratum number each training sample belongs to.

train.stratum.index

A list containing the index of training samples belonging to each stratum.

stratum.center

A vector containing the center of each stratum.

train.distance.to.center

A list containing the distance between the training samples and the center of each stratum.

test.distance.to.center

A list containing the distance between the test samples and the center of each stratum.

train.prob.to.g

A list containing the probability of the training samples having variable g under the hypothesis that the sample belongs to each stratum.

test.prob.to.g

A list containing the probability of the test samples having variable g under the hypothesis that the sample belongs to each stratum.

train.prob.to.stratum

A list containing the probability that the training samples belong to each stratum.

test.prob.to.stratum

A list containing the probability that the test samples belong to each stratum.
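
As a rough illustration of how membership probabilities could be derived when CS = TRUE, the sketch below applies a row-wise softmax to the cosine similarity between samples and stratum centers. It is not taken from the package source; the helpers cosine_similarity and softmax_rows and the toy coordinates pcs and centers are hypothetical.

```r
# Cosine similarity between each row of `x` (samples) and each row of
# `centers` (stratum centers), both given in principal-component space.
cosine_similarity <- function(x, centers) {
  x.norm <- x / sqrt(rowSums(x^2))
  c.norm <- centers / sqrt(rowSums(centers^2))
  x.norm %*% t(c.norm)
}

# Row-wise softmax; subtracting the row maximum keeps exp() stable.
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# Hypothetical coordinates: 3 samples and 2 stratum centers on 2 PCs.
pcs     <- rbind(c(1, 0), c(0, 2), c(1, 1))
centers <- rbind(c(1, 0), c(0, 1))

# One row per sample, one column per stratum; each row sums to 1.
prob <- softmax_rows(cosine_similarity(pcs, centers))
```

A sample that lies at the same angle from both centers, such as the third row here, receives equal probability for both strata.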

See Also

PCA

Examples

# Directory holding the example PLINK files shipped with the package
input.dir <- system.file("extdata", package = "pv")
output.dir <- system.file("extdata", package = "pv")
# Replace with the location of your PLINK executable
path2plink <- "/path/to/plink"
## Not run: 
stratification.result <- stratification(input.dir = input.dir,
                                        output.dir = output.dir,
                                        train.genotype = "train",
                                        test.genotype = "test",
                                        stratum.count = 2,
                                        PCA.separate = FALSE,
                                        PCs.count = 10,
                                        plink.path = path2plink,
                                        CS = FALSE,
                                        verbose = TRUE)

## End(Not run)
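
Because plink.path may be either an explicit path or NULL (in which case PLINK has to be reachable through the PATH environment variable), it can be convenient to check for the executable before calling stratification. The helper below is a hypothetical sketch and not part of the package; it relies only on base R's file.exists() and Sys.which().

```r
# Hypothetical helper (not part of the package): return a usable PLINK
# path, or stop with an informative message if none can be found.
resolve_plink <- function(plink.path = NULL) {
  if (!is.null(plink.path)) {
    # An explicit path was given: it must point to an existing file.
    if (!file.exists(plink.path)) {
      stop("PLINK executable not found at: ", plink.path)
    }
    return(plink.path)
  }
  # No path given: look the executable up on the system PATH.
  found <- unname(Sys.which("plink"))
  if (!nzchar(found)) {
    stop("PLINK is not on the system PATH; set plink.path explicitly.")
  }
  found
}
```

The returned value could then be passed as the plink.path argument.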

abnerzyx/pv documentation built on Feb. 27, 2022, 12:06 a.m.