prepareEnv: Prepare environmental input
In R.SamBada: Processing Pipeline for 'SamBada' from Pre- To Post-Processing

Writes a new environmental file that sambada can work with, after removing variables that are too highly correlated. Also calculates population structure from a PCA in SNPRelate and adds it at the end of the environmental file.

prepareEnv(
  envFile,
  outputFile,
  maxCorr,
  idName,
  separator = " ",
  genoFile = NULL,
  numPc = 0.5,
  mafThresh = NULL,
  missingnessThresh = NULL,
  ldThresh = NULL,
  numPop = -1,
  clustMethod = "kmeans",
  includeCol = NULL,
  excludeCol = NULL,
  popStrCol = NULL,
  x,
  y,
  locationProj,
  interactiveChecks = FALSE,
  verbose = TRUE
)

`envFile`	char Name of the input environmental file (must be in active directory). Can be .csv or .shp
`outputFile`	char Name of the output file. Must have a .csv extension.
`maxCorr`	double A number between 0 and 1 specifying the maximum allowable correlation coefficient between environmental variables. If the correlation is above this value (in absolute value), one of the two variables will be deleted (the variable kept will always be the one that appears first in the environmental file)
`idName`	char Name of the id in the environmental file matching the one of `genoFile`
`separator`	char If `envFile` is .csv, the separator character. If file created with create_env, separator is ' '
`genoFile`	char (optional) Name of the input genomic file (must be in active directory). If not null, population variable will be calculated from a PCA relying on the SNPRelate package. Can be .gds, .ped, .bed, .vcf. If different from .gds, a gds file (SNPRelate specific format) will be created
`numPc`	double If above 1, number of principal components to analyze. If between 0 and 1, automatic detection of the number of PCs (the program will find the first leap in the proportion of variance where the ratio (difference in variance between PC x and x+1)/(variance of PC x) is greater than `numPc`). If 0, PCA and population structure will not be computed: in that case, the `genoFile` will only be used to make the sample order in the `envFile` match the one of the `genoFile` (necessary for sambada's computation). Set it to 0 if `genoFile` is null
`mafThresh`	double A number between 0 and 1 specifying the Minor Allele Frequency (MAF) filtering when computing the PCA (if null, no filtering on MAF will be applied)
`missingnessThresh`	double A number between 0 and 1 specifying the missing rate filtering when computing the PCA (if null, no filtering on missing rate will be applied)
`ldThresh`	double A number between 0 and 1 specifying the linkage disequilibrium (LD) rate filtering before computing the PCA (if null, no filtering on LD will be applied)
`numPop`	integer If not null, clustering based on the first `numPc` PCs will be computed to divide the samples into `numPop` populations. If -1, automatic detection of the number of clusters (elbow method if `clustMethod` = 'kmeans', maximise branch length if `clustMethod` = 'hclust'). If null, no clustering will be computed: if `genoFile` is set, principal component scores will be included as population information in the final file.
`clustMethod`	char One of 'kmeans' or 'hclust' for K-means and hierarchical clustering respectively. Default 'kmeans'
`includeCol`	character vector Columns in the environmental file to be considered as variables. If none specified, all numeric variables will be considered as env var except for the id
`excludeCol`	character vector Columns in the environmental file to exclude in the output (non-variable column). If none specified, all numeric variables will be considered as environmental variables except for the id
`popStrCol`	character vector Columns in the environmental file describing population structure (computed elsewhere). Those columns won't be excluded when correlated with environmental variables
`x`	character Name of the column corresponding to the x coordinate (or longitude if spherical coordinates). If not null, the x column won't be removed even if correlated with other variables. This parameter is also used to display the map of the population structure.
`y`	character Name of the column corresponding to the y coordinate (or latitude if spherical coordinates). If not null, the y column won't be removed even if correlated with other variables. This parameter is also used to display the map of the population structure.
`locationProj`	integer EPSG code of the projection of x-y coordinate
`interactiveChecks`	logical If TRUE, plots will show up showing number of populations chosen, and correlation between variables and the user can interactively change the chosen threshold for `maxCorr` and `numPop` (optional, default value=FALSE)
`verbose`	logical If TRUE, show information about the progress of the process
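
The correlation filter described for `maxCorr` can be sketched in base R as follows. This only illustrates the documented rule (for any pair of variables whose absolute correlation exceeds the threshold, the one appearing later in the file is dropped); it is not the package's actual implementation. The helper `dropCorrelated`, its `protect` argument (standing in for columns such as `x`, `y` or `popStrCol` that are never removed) and the toy data are hypothetical names introduced here.

```r
# Hypothetical helper, NOT part of R.SamBada: greedy correlation filter.
# env     : data.frame containing only numeric environmental variables
# maxCorr : maximum allowed absolute correlation between two variables
# protect : names of columns that must never be dropped (assumption)
dropCorrelated <- function(env, maxCorr, protect = character(0)) {
  corMat  <- abs(cor(env, use = "pairwise.complete.obs"))
  vars    <- colnames(env)
  removed <- character(0)
  for (i in seq_along(vars)) {
    if (vars[i] %in% removed) next            # already dropped, cannot veto others
    for (j in seq_along(vars)) {
      if (j <= i) next                        # only look at later columns
      if (vars[j] %in% removed || vars[j] %in% protect) next
      if (!is.na(corMat[i, j]) && corMat[i, j] > maxCorr) {
        removed <- c(removed, vars[j])        # keep the first, drop the later one
      }
    }
  }
  env[, setdiff(vars, removed), drop = FALSE]
}

# Toy data: tempCopy is a linear function of temp, rain is weakly related
env <- data.frame(
  temp     = 1:10,
  tempCopy = 2 * (1:10) + 1,
  rain     = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
kept <- dropCorrelated(env, maxCorr = 0.8)
colnames(kept)
```

With this toy data, `tempCopy` is dropped because it is perfectly correlated with `temp`, which appears first. Passing `protect = "tempCopy"` would keep all three columns.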

The population structure is calculated as a PCA of all the SNPs that pass the filtering (MAF, LD, missingness). You can either use the scores of the first X components to evaluate the population structure (set `numPop` to NULL), or you can compute a "membership coefficient" to a cluster of individuals based on the scores on the first X components. You can choose between two clustering algorithms (k-means or hierarchical clustering, via the `clustMethod` argument). One option for deciding the number of PCs to keep is to detect a bump in the proportion of variance explained and keep all the PCs before the bump.
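
The automatic choice of the number of PCs (a `numPc` value between 0 and 1) can be illustrated with a small sketch. It applies the ratio given in the `numPc` argument description, (variance of PC x minus variance of PC x+1) / (variance of PC x), and returns the first x where the ratio exceeds the threshold. The helper `detectNumPc` is hypothetical and is not the package's code; keeping PCs 1 to x, and keeping all PCs when no leap is found, are assumptions made for this example. The input vector stands for the proportion of variance per component, for example the `varprop` element returned by SNPRelate's `snpgdsPCA`.

```r
# Hypothetical helper, NOT part of R.SamBada.
# varProp : proportion of variance explained by each PC, in decreasing order
# numPc   : threshold between 0 and 1 on the relative drop in variance
detectNumPc <- function(varProp, numPc) {
  stopifnot(numPc > 0, numPc < 1, length(varProp) >= 2)
  n <- length(varProp)
  # relative drop between consecutive components: (v[x] - v[x+1]) / v[x]
  relDrop <- (varProp[-n] - varProp[-1]) / varProp[-n]
  leap <- which(relDrop > numPc)
  if (length(leap) == 0) return(n)  # assumption: no leap found, keep every PC
  leap[1]                           # assumption: keep PCs 1..x before the drop
}

# Toy variance proportions: large drop after the second component
varProp <- c(0.30, 0.25, 0.05, 0.04, 0.03)
detectNumPc(varProp, numPc = 0.5)
```

Here the relative drops are roughly 0.17, 0.80, 0.20 and 0.25, so with a threshold of 0.5 the first leap is found after the second component and two PCs are kept.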

None

Solange Duruz, Oliver Selmoni

################
# Run prepareEnv
################
#Without calculating population structure.
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     numPc=0, idName='short_name', x='longitude',y='latitude', locationProj=4326, 
     interactiveChecks = FALSE)

# While it is not mandatory to provide genoFile, it is recommended to define it so that IDs 
# in the environmental and genomic files are in the same order (genoFile is also needed to
# compute population structure)

# determine gdsFile according to OS
if(Sys.info()['sysname']=='Windows'){
  gdsFile="uganda-subset-mol_windows.gds"
} else {
  gdsFile="uganda-subset-mol_unix.gds"
}

#Calculating PCA-based population structure
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     idName='short_name', genoFile=system.file("extdata", gdsFile, package = "R.SamBada"),
     numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=NULL,
     x='longitude', y='latitude', locationProj=4326, interactiveChecks = TRUE)

#Calculating structure membership coefficient based on kmeans clustering
prepareEnv(envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada"), 
     outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, 
     idName='short_name', genoFile=system.file("extdata", gdsFile, package = "R.SamBada"),
     numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=-1,
     clustMethod='kmeans', x='longitude', y='latitude', locationProj=4326,
     interactiveChecks = TRUE)