prepareEnv: Prepare environmental input


View source: R/Preprocessing.R

Description

Writes a new environmental file that sambada can work with, after removing variables that are too strongly correlated. Also calculates population structure from a PCA in SNPRelate and adds it at the end of the environmental file.

Usage

prepareEnv(envFile, outputFile, maxCorr, idName, separator = " ",
  genoFile = NULL, numPc = 0.5, mafThresh = NULL,
  missingnessThresh = NULL, ldThresh = NULL, numPop = -1,
  clustMethod = "kmeans", includeCol = NULL, excludeCol = NULL,
  popStrCol = NULL, x, y, locationProj, interactiveChecks = FALSE,
  verbose = TRUE)

Arguments

envFile

char Name of the input environmental file (must be in active directory). Can be .csv or .shp

outputFile

char Name of the output file. Must have a .csv extension.

maxCorr

double A number between 0 and 1 specifying the maximum allowable correlation coefficient between environmental variables. If the correlation between two variables is above this value (in absolute value), one of them will be deleted (the variable kept is always the one that appears first in the environmental file)

idName

char Name of the ID column in the environmental file; its values must match the sample IDs of genoFile

separator

char If envFile is .csv, the separator character. If file created with create_env, separator is ' '

genoFile

char (optional) Name of the input genomic file (must be in active directory). If not null, population variable will be calculated from a PCA relying on the SNPRelate package. Can be .gds, .ped, .bed, .vcf. If different from .gds, a gds file (SNPRelate specific format) will be created

numPc

double If above 1, number of principal components to analyze. If between 0 and 1, automatic detection of the number of PCs: the program will find the first leap in the proportion of variance where the ratio (difference in variance between PC x and x+1)/(variance of PC x) is greater than numPc. If 0, PCA and population structure will not be computed: in that case, the genoFile will only be used to make the sample order in the envFile match the one of the genoFile (necessary for sambada's computation). Set it to 0 if genoFile is null

mafThresh

double A number between 0 and 1 specifying the Minor Allele Frequency (MAF) filtering when computing the PCA (if null, no filtering on MAF will be applied)

missingnessThresh

double A number between 0 and 1 specifying the missing rate filtering when computing the PCA (if null, no filtering on missing rate will be applied)

ldThresh

double A number between 0 and 1 specifying the linkage disequilibrium (LD) rate filtering before computing the PCA (if null no filtering on LD will be computed)

numPop

integer If not null, clustering based on the first numPc PCs will be computed to divide the samples into numPop populations. If -1, automatic detection of the number of clusters (elbow method if clustMethod = 'kmeans', maximise branch length if clustMethod = 'hclust'). If null, no clustering will be computed: if genoFile is set, principal component scores will be included as population information in the final file.

clustMethod

char One of 'kmeans' or 'hclust' for K-means and hierarchical clustering respectively. Default 'kmeans'

includeCol

character vector Columns in the environmental file to be considered as variables. If none specified, all numeric variables will be considered as env var except for the id

excludeCol

character vector Columns in the environmental file to exclude in the output (non-variable column). If none specified, all numeric variables will be considered as environmental variables except for the id

popStrCol

character vector Columns in the environmental file describing population structure (computed elsewhere). Those columns won't be excluded when correlated with environmental variables

x

character Name of the column corresponding to the x coordinate (or longitude if spherical coordinates). If not null, the x column won't be removed even if correlated with another variable. This parameter is also used to display the map of the population structure.

y

character Name of the column corresponding to the y coordinate (or latitude if spherical coordinates). If not null, the y column won't be removed even if correlated with another variable. This parameter is also used to display the map of the population structure.

locationProj

integer EPSG code of the projection of x-y coordinate

interactiveChecks

logical If TRUE, plots will be displayed showing the number of populations chosen and the correlation between variables, and the user can interactively change the chosen thresholds for maxCorr and numPop (optional, default value=FALSE)

verbose

boolean If TRUE, shows information about the progress of the process
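
The correlation filter described for maxCorr (drop one variable of any pair that is too correlated, keeping the one that appears first, and never dropping protected columns such as x, y or popStrCol) can be illustrated with a minimal base-R sketch. The helper name dropCorrelated and its protect argument are hypothetical and are not part of R.SamBada; the package's actual implementation may differ.

```r
# Hypothetical helper (not part of R.SamBada): keep variables in file order and
# drop any variable whose absolute correlation with an already-kept variable
# exceeds maxCorr. Columns listed in 'protect' are always kept.
dropCorrelated <- function(env, maxCorr, protect = character(0)) {
  corr <- abs(cor(env, use = "pairwise.complete.obs"))
  keep <- character(0)
  for (v in colnames(env)) {
    tooClose <- length(keep) > 0 && any(corr[v, keep] > maxCorr)
    if (v %in% protect || !tooClose) keep <- c(keep, v)
  }
  env[, keep, drop = FALSE]
}

# Toy data: tempCopy is a linear function of temp, rain is only weakly related
env <- data.frame(temp     = 1:10,
                  tempCopy = 2 * (1:10) + 1,
                  rain     = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

filtered  <- dropCorrelated(env, maxCorr = 0.8)
protected <- dropCorrelated(env, maxCorr = 0.8, protect = "tempCopy")
colnames(filtered)   # "temp" "rain"
colnames(protected)  # "temp" "tempCopy" "rain"
```

In this toy case tempCopy is removed because temp appears first; protecting it mimics how x, y and popStrCol columns are kept regardless of correlation.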

Details

The population structure is calculated as a PCA of all the SNPs that pass the filtering (MAF, LD, missingness). You can either use the scores of the first X components to evaluate the population structure (set 'numPop' to NULL), or compute a "membership coefficient" to a cluster of individuals based on the scores on the first X components. You can choose between two clustering algorithms (k-means or hierarchical clustering) with the 'clustMethod' argument. One option for deciding the number of PCs to keep is to detect a bump in the proportion of variance explained and keep all the PCs before the bump.
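
As a rough illustration of the two steps above, the following base-R sketch first picks the number of PCs from the "leap" rule described for numPc, then clusters toy PC scores with k-means. The function choosePcCount, the variance proportions and the score matrix are all made up for illustration; this is one reading of the rule (keep PCs 1 to x, where x is the first PC followed by a relative drop greater than numPc), not the package's actual code.

```r
# Hypothetical helper (not part of R.SamBada): given the proportion of variance
# explained by each PC, return the index x of the first PC for which
# (var[x] - var[x + 1]) / var[x] exceeds numPc. If no such leap exists,
# all PCs are kept (an assumption made for this sketch).
choosePcCount <- function(varProp, numPc) {
  relDrop <- -diff(varProp) / head(varProp, -1)
  leap <- which(relDrop > numPc)
  if (length(leap) == 0) length(varProp) else leap[1]
}

# Made-up variance proportions: the large drop happens after PC 2
varProp <- c(0.30, 0.28, 0.10, 0.09, 0.08)
nPc <- choosePcCount(varProp, numPc = 0.5)   # 2
choosePcCount(varProp, numPc = 0.05)         # 1

# Made-up PC scores for six individuals forming two well-separated groups
scores <- cbind(PC1 = c(-2.1, -1.9, -2.0, 2.0, 2.1, 1.9),
                PC2 = c( 0.1, -0.1,  0.0, 0.0, 0.1, -0.1))

# k-means on the first nPc components, with the number of populations fixed to 2
set.seed(42)
clusters <- kmeans(scores[, seq_len(nPc), drop = FALSE],
                   centers = 2, nstart = 10)$cluster
```

With real data the variance proportions would come from the PCA itself (for example `sdev^2 / sum(sdev^2)` from `prcomp`), and `hclust` followed by `cutree` could replace `kmeans` for hierarchical clustering.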

Value

None

Author(s)

Solange Duruz, Oliver Selmoni

Examples

################
# Run prepareEnv
################
#Without calculating population structure.
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     numPc=0, idName='short_name', x='longitude',y='latitude', locationProj=4326, 
     interactiveChecks = FALSE)

# While it is not mandatory to provide genoFile, it is recommended to define it so that IDs
# in environmental and genomic file are in the same order (genoFile is also needed to compute
# population structure)

# determine gdsFile according to OS
if(Sys.info()['sysname']=='Windows'){
  gdsFile="uganda-subset-mol_windows.gds"
} else {
  gdsFile="uganda-subset-mol_unix.gds"
}

#Calculating PCA-based population structure
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     idName='short_name', genoFile=system.file("extdata", gdsFile, package = "R.SamBada"),
     numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=NULL,
     x='longitude', y='latitude', locationProj=4326, interactiveChecks = TRUE)

#Calculating structure membership coefficient based on kmeans clustering
# (numPop=-1 requests automatic detection of the number of clusters)
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     idName='short_name', genoFile=system.file("extdata", gdsFile, package = "R.SamBada"),
     numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=-1,
     clustMethod='kmeans', x='longitude', y='latitude', locationProj=4326,
     interactiveChecks = TRUE)

SolangeD/R.SamBada documentation built on Dec. 25, 2021, 10:48 a.m.