View source: R/module_GlobalSettings.R
vigilante is designed to work as a highly self-contained pipeline. To ensure that, several global settings must be specified by the user. In addition, to properly perform downstream analyses, vigilante needs to prepare all input data files in the working directory before continuing. This internal process includes several steps such as renaming data files, setting group-related parameters, setting the ENSEMBL reference, etc.
studyID
character, study or project name. It is used in multiple naming situations (such as on plots or in output file names), so it should be concise and meaningful (e.g. "ProstateCancer", or "PCa" for short; "TripleNegativeBreastCancer", or "TNBC" for short). Avoid special characters.
studyID_regex
regular expression; type "?regex" in the console for instructions on writing a valid regex. If multiple studies/projects are involved, "studyID_regex" should be a regular expression that captures all the input data files; otherwise set it to the same value as "studyID".
studyID_altered
logical, default FALSE. If multiple studies/projects are involved, set it to TRUE so that all of them are treated as a whole at a higher level (they still remain separate at their individual level). See the "Details" section for more about "studyID", "studyID_regex" and "studyID_altered".
speciesID
character, one of c("hg38", "hg19", "mm10"), default "hg38". This determines which genome reference build the downstream analysis methods use: "hg38" for GRCh38, "hg19" for GRCh37, "mm10" for GRCm38. Each vigilante run should set only one "speciesID" (genome reference build). If the input data files involve multiple genome reference builds, separate them into multiple vigilante runs, each using a single "speciesID".
fileNum, fileDNA_mafNum
integer. "fileNum" is the maximum total number of input data files per sample (e.g. if there are 50 samples, of which 35 have 5 files per sample, 12 have 7 files per sample and 3 have 6 files per sample, set "fileNum" to 7). Similarly, "fileDNA_mafNum" is the maximum total number of .maf files per sample (e.g. if .maf files from Strelka come in sets of 2 while .maf files from MuTect come as 1, set "fileDNA_mafNum" to 2).
clinicalFeature
logical, whether the input data have associated clinical information (e.g. Gleason score, Race, Ethnicity). If TRUE, additional columns for specifying clinical information are added to the vigilante-generated groupInfo.csv file, and the user needs to fill them in before downstream analyses can take this clinical information into consideration.
createOutputFolders
logical, whether to allow vigilante to create specific output folders as the default place for storing downstream analysis output files (e.g. plots, result tables). If TRUE, a set of output folders is created in the working directory under ./_VK/. Setting it to TRUE is recommended so that output files are better organized inside the "_VK" folder; if FALSE, the user needs to specify an output path each time an output file is generated.
prepareVdata
logical, whether to allow vigilante to prepare all input data files in the working directory. This internal process includes several steps such as renaming data files, setting group-related parameters, setting the ENSEMBL reference, etc. If TRUE, the user is asked to back up the input data files before continuing; if FALSE, vigilante stops the run and no input data files are affected.
addTCLabel
logical, whether to take the TC (Tumor/Control) label into consideration when capturing and renaming the input data files; see Details for more information.
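The "fileNum" and "fileDNA_mafNum" arguments described above are both per-sample maxima. The following is a minimal sketch of that arithmetic in base R, not part of vigilante itself. The file names and the assumption that the sample ID is the second underscore-separated field are hypothetical and used only for illustration.

```r
# Hypothetical per-sample file counts mirroring the example in the text:
# 35 samples with 5 files, 12 samples with 7 files, 3 samples with 6 files.
files_per_sample <- c(rep(5L, 35), rep(7L, 12), rep(6L, 3))

# "fileNum" is the largest number of files found for any single sample.
fileNum <- max(files_per_sample)

# The same idea applied to .maf files, counting files per sample from
# (hypothetical) file names. Assumption: the sample ID is the second
# underscore-separated field of the file name.
maf_files <- c(
  "KSCWUSCRF_S01_strelka_snv.maf",
  "KSCWUSCRF_S01_strelka_indel.maf",
  "KSCWUSCRF_S02_mutect.maf"
)
sample_ids <- vapply(strsplit(maf_files, "_", fixed = TRUE), `[`, character(1), 2)
fileDNA_mafNum <- max(table(sample_ids))

fileNum
fileDNA_mafNum
```

Both values are plain integers that can then be passed to the corresponding arguments.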
The workflow of vigilante is highly module-based. Modules are connected and there are certain settings that are shared across all modules. To ensure a successful and smooth run, these settings need to be properly specified by the user.
Take "studyID", "studyID_regex" and "studyID_altered" for example. Input data files generated by upstream tools often come with diverse naming conventions. The user may recognize those files easily, but vigilante cannot if there are no recognizable patterns.
To make input data files clear to vigilante, name them following a pattern such as "studyID_sampleID_(other descriptions).file extension". Here "studyID" is the name of the study or project. It is used in multiple naming situations (such as on plots or in output file names), so it should be concise and meaningful.
In the first demo example below, "studyID" is set to "KSCWUSCRF" (K - certain prefix, SCW - schwannoma, USC - University of Southern California, RF - certain suffix). Because it is a single study focusing on schwannoma and doesn't contain input data files from other studies (i.e. all input data files are named after "KSCWUSCRF" by upstream tools), "studyID_regex" is set to "[[:upper:]]{9}" (nine consecutive upper-case letters) to capture the study name pattern in the file names. To make it simpler, "studyID_regex" can also be set to "KSCWUSCRF", as only one study is involved. Moreover, "studyID_altered" is set to FALSE because, for a single study, it is not necessary to change the study name inherited from upstream tools. If the user does choose to alter the study name, set "studyID_altered" to TRUE, and "studyID" will then be used as the new study name.
Sometimes the working study isn't a single study but a combined study that contains input data files from different sub-studies. In that case, "studyID_altered" should be set to TRUE so that all the involved sub-studies are treated as a whole at a higher level (they still remain separate at their individual level), and "studyID_regex" should be a regular expression that captures the study name patterns in the file names across all involved sub-studies.
In the second demo example below, "studyID" is set to "KPROCOMBI" (K - certain prefix, PRO - prostate cancer, COMBI - combined study). Because the sub-studies of "KPROCOMBI" have names such as "KGARUSCAG", "KPINSKIJC", "KRPLUSCJC" etc., "studyID_regex" is set to "[[:upper:]]{9}" to capture them all.
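To see why a single pattern can cover several sub-studies, the quantified pattern used in the Examples can be tried on a few file names with base R. This is only a sketch: the file names are hypothetical, and anchoring the pattern at the start of the name is an assumption made here for illustration, not a statement about how vigilante applies "studyID_regex" internally.

```r
# Hypothetical file names following "studyID_sampleID_(description).extension",
# plus one unrelated file that should not be captured.
file_names <- c(
  "KGARUSCAG_S01_T1.maf",
  "KPINSKIJC_S02_C1.maf",
  "KRPLUSCJC_S03_T2.maf",
  "notes_from_lab.txt"
)
studyID_regex <- "[[:upper:]]{9}"  # nine consecutive upper-case letters

# Assumption: the study name is the leading field, followed by "_".
is_input <- grepl(paste0("^", studyID_regex, "_"), file_names)
file_names[is_input]

# Extract the sub-study names that the pattern matched.
sub_studies <- regmatches(
  file_names,
  regexpr(paste0("^", studyID_regex), file_names)
)
sub_studies
```

A literal pattern such as "KGARUSCAG" would match only the first file, which is why a combined study needs the more general expression.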
More about "addTCLabel": unlike in the first demo example, in the second demo example below the file names carry suffixes like "C1", "T1" or "T2" while the sample IDs are the same. To correctly capture and differentiate the different sections/parts of the same sample, "addTCLabel" needs to be set to TRUE.
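The effect of the TC suffix can be illustrated with a short base R sketch. The file names and the assumed layout "studyID_sampleID_TCLabel.extension" are hypothetical; how vigilante actually parses the label is not shown in this documentation.

```r
# Hypothetical files from one sample with control (C) and tumor (T) sections.
file_names <- c(
  "KGARUSCAG_S01_C1.maf",
  "KGARUSCAG_S01_T1.maf",
  "KGARUSCAG_S01_T2.maf"
)

# Drop the extension, then split on "_".
# Assumed layout: studyID_sampleID_TCLabel
parts <- strsplit(sub("\\.[^.]+$", "", file_names), "_", fixed = TRUE)
sample_id <- vapply(parts, `[`, character(1), 2)
tc_label  <- vapply(parts, `[`, character(1), 3)

# Without the label, all three files collapse onto one sample ID.
n_without_label <- length(unique(sample_id))

# With the label appended, each section stays distinct.
labelled_id <- paste(sample_id, tc_label, sep = "_")
is_tumor <- startsWith(tc_label, "T")

n_without_label
labelled_id
is_tumor
```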
Another very important part of vigilante is the "groupInfo.csv" file. This file contains the meta-information required by downstream analyses. By default, "groupInfo.csv" has five columns: "assayID", "Group", "MAF_group", "realID" and "aliasID". If "clinicalFeature" is set to TRUE, there are additional columns. Usually, the user should leave "assayID", "MAF_group" and "realID" unchanged, as they are auto-populated by vigilante and already in the right format. The "aliasID" column can be changed if the user wants to set specific names for the samples, but only after the first run or once the input files have already been moved into position; otherwise it should be left unchanged as well. The "Group" column (and any additional "CliFea" columns) is where the user should fill in values.
For example, if samples are divided into training, validation and testing groups, the "Group" column should be filled with "Training", "Validation" and "Testing" accordingly. Similarly, if additional clinical information is available, the user can specify it in the "CliFea" columns. Here "CliFea" is only a placeholder name, and the user should change the column name to reflect the clinical feature (e.g. rename it to "Race" and fill in race information like "White", "Black or African American", "Asian" etc.; or rename it to "Gleason Score" and fill in Gleason score values). Also, "CliFea" columns are not limited to two. The user can add more columns to the right following the above instructions.
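The editing steps described above can be sketched in base R on a small stand-in table. Everything below is illustrative: the ID values, the placeholder column names "CliFea1" and "CliFea2", and the temporary file are assumptions, and the real "groupInfo.csv" generated by vigilante may differ in its exact layout.

```r
# A stand-in for a vigilante-generated groupInfo.csv with clinicalFeature = TRUE.
# Values in "assayID", "MAF_group", "realID" and "aliasID" are made up here;
# in a real run they are auto-populated and should normally be left alone.
group_info <- data.frame(
  assayID   = c("A01", "A02", "A03"),
  Group     = NA_character_,
  MAF_group = c("M01", "M02", "M03"),
  realID    = c("S01", "S02", "S03"),
  aliasID   = c("S01", "S02", "S03"),
  CliFea1   = NA_character_,  # hypothetical placeholder column name
  CliFea2   = NA_character_,  # hypothetical placeholder column name
  stringsAsFactors = FALSE
)

# Fill in the "Group" column.
group_info$Group <- c("Training", "Validation", "Testing")

# Rename the placeholder columns to real clinical features and fill them in.
names(group_info)[names(group_info) == "CliFea1"] <- "Race"
names(group_info)[names(group_info) == "CliFea2"] <- "GleasonScore"
group_info$Race <- c("White", "Black or African American", "Asian")
group_info$GleasonScore <- c(7, 8, 6)

# Write to a temporary file so that no real groupInfo.csv is overwritten.
out_path <- tempfile(fileext = ".csv")
write.csv(group_info, out_path, row.names = FALSE)
read_back <- read.csv(out_path, stringsAsFactors = FALSE)
names(read_back)
```

The same edits can of course be made in a spreadsheet editor; the point is only which columns are filled in and which are left untouched.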
list. Because R CMD check discourages assignments to the global environment within functions, the user needs to explicitly assign the return value of this function to a global variable named "globalSettings_returnList"; this list contains the variables required for downstream analyses.
## Not run:
# single study of schwannoma
globalSettings_returnList = v_globalSettings(studyID = "KSCWUSCRF",
  studyID_regex = "[[:upper:]]{9}", studyID_altered = FALSE, speciesID =
  "hg19", fileNum = 7, fileDNA_mafNum = 2, clinicalFeature = TRUE,
  createOutputFolders = TRUE, prepareVdata = TRUE, addTCLabel = FALSE)

# combined study of prostate cancer
globalSettings_returnList = v_globalSettings(studyID = "KPROCOMBI",
  studyID_regex = "[[:upper:]]{9}", studyID_altered = TRUE, speciesID =
  "hg19", fileNum = 7, fileDNA_mafNum = 2, clinicalFeature = TRUE,
  createOutputFolders = TRUE, prepareVdata = TRUE, addTCLabel = TRUE)
## End(Not run)