run_pathwayanal: Basically the same as run_pathwayanal.R with an additional...
In ryanrsun/LungCancerAssoc: Pathway Analysis Using Summary Statistics

View source: R/leave_k_out.R View source: R/run_pathwayanal.R

Essentially the same as run_pathwayanal.R, with an additional outer loop that controls leaving out k genes.

High-level function to control pathway analysis.

run_pathwayanal(pathways_tab = NULL, pathways_tab_fname = NULL,
  gene_tab = NULL, gene_tab_fname = NULL, SS_file = NULL,
  SS_fname_root = NULL, evecs_tab = NULL, evecs_tab_fname = NULL,
  num_PCs_use = 5, pathways_per_job = 10, gene_buffer = 5000,
  threshold_1000G = 0.03, prune_factor = 0.5, prune_limit = 0.0625,
  snp_limit = 1000, hard_snp_limit = 1500, prune_to_start = TRUE,
  run_GHC = FALSE, out_name_root = "pathway_anal", refsnp_dir = NULL,
  input_dir = NULL, output_dir = NULL, Snum = 1, aID = 1,
  checkpoint = TRUE)
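
The job-control arguments in the signature above (`pathways_per_job`, `aID`, `Snum`, `out_name_root`) can be illustrated with a small sketch. The output file name pattern follows the argument documentation; the way `aID` selects rows of the pathway table is an assumption made here for illustration, and both helper functions (`job_rows`, `out_file_name`) are hypothetical and not part of the package.

```r
# Hypothetical helpers, not part of LungCancerAssoc.

# ASSUMPTION: job number aID handles a consecutive block of
# pathways_per_job rows of the pathway table.
job_rows <- function(aID, pathways_per_job = 10, n_pathways) {
  first <- (aID - 1) * pathways_per_job + 1
  last <- min(aID * pathways_per_job, n_pathways)
  if (first > n_pathways) integer(0) else seq(first, last)
}

# Output name pattern as documented: [out_name_root]_S[Snum]_[aID].txt
out_file_name <- function(out_name_root = "pathway_anal", Snum = 1, aID = 1) {
  paste0(out_name_root, "_S", Snum, "_", aID, ".txt")
}

rows_job3 <- job_rows(aID = 3, pathways_per_job = 10, n_pathways = 25)
name_job3 <- out_file_name(aID = 3)  # "pathway_anal_S1_3.txt"
```

Under this assumed scheme, a table of 25 pathways with the default `pathways_per_job = 10` would need three jobs, the last of which covers only the five remaining rows.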

`pathways_tab`	A data.frame of pathways defined by genes in the pathway. First column name should be 'Pathway_name', second should be 'Pathway_description', all others should be 'Gene1', 'Gene2', etc. Use NA to fill blanks.
`pathways_tab_fname`	The name of a file formatted in the manner described by pathways_tab. You only need to specify either pathways_tab or pathways_tab_fname.
`gene_tab`	A data.frame which defines the location of each gene in the genome. Should have column headings including at least 'Gene', 'CHR', 'Start', 'End'.
`gene_tab_fname`	The name of a file formatted in the manner described by gene_tab. You only need to specify either gene_tab or gene_tab_fname.
`SS_file`	A data.frame holding all the summary statistics. Should have column headings including at least 'RS' and 'P-value'.
`SS_fname_root`	The root name of a file formatted in the manner described by SS_file. If you use this option, it is assumed that you have separated your summary statistics by chromosome into files named [SS_fname_root][1].txt, [SS_fname_root][2].txt, etc. You only need to specify either SS_file or SS_fname_root.
`evecs_tab`	Data.frame of eigenvectors for correlation estimation. Should have column headings 'Subject', 'EV1', 'EV2', and so on.
`evecs_tab_fname`	The name of a file formatted in the manner described by evecs_tab. You only need to specify either evecs_tab or evecs_tab_fname.
`num_PCs_use`	Number of PCs to use.
`pathways_per_job`	How many pathways to test in one call of the run_pathwayanal() function. Only used if you also specify aID, which determines which part of pathways_tab to use.
`gene_buffer`	A buffer region added to the Start and End of each gene region to capture, for example, possible cis-eQTL effects.
`threshold_1000G`	The minimum MAF needed for a reference panel SNP before we trust it to be used in covariance estimation.
`prune_factor`	If the pathway has more than snp_limit SNPs, then multiply the current pruning level by this factor and rerun.
`prune_limit`	If the pruning factor is less than this amount, stop pruning and move on.
`snp_limit`	If the pathway has more than this many SNPs, rerun the function to prune more aggressively before testing. The recommended value is 1000; do not set it above 2000, or numerical stability will suffer greatly.
`hard_snp_limit`	If the pathway is still above snp_limit after pruning down to prune_limit, the SNP threshold can be raised slightly, up to this value, to see whether the calculation is stable enough.
`prune_to_start`	If TRUE, begin pruning at prune_factor; otherwise, do not prune on the first run.
`run_GHC`	Boolean. If TRUE, test with both GBJ and GHC; if FALSE, test with GBJ only.
`out_name_root`	Root of output filename. If Snum and aID are specified then output name will be [out_name_root]_S[Snum]_[aID].txt.
`refsnp_dir`	Directory holding reference panel genotypes.
`input_dir`	Directory holding summary statistics, pathway table, gene table, eigenvectors, PLINK binary.
`output_dir`	Directory to save output file.
`Snum`	Used in cluster job submission scripts to organize jobs.
`aID`	Used in cluster job submission scripts to organize jobs.
`checkpoint`	Boolean, if true, print out diagnostic messages.
`k_vec`	A vector holding all the values of k to try, e.g. k_vec = 0:5.
`gene_sig_tab`	Data.frame with two columns - 'Gene' and 'P_value'. Holds the significance of each single gene.
`gene_sig_tab_fname`	The name of a file formatted in the manner described by gene_sig_tab. You only need to specify one of gene_sig_tab or gene_sig_tab_fname.
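
As a minimal sketch of the input formats described above, the following builds toy data.frames with the documented column headings. All gene names, coordinates, SNP identifiers, and values are made up for illustration; only the column names come from the argument descriptions.

```r
# Toy inputs only: every value below is invented for illustration.

# One row per pathway; NA fills the blanks for shorter pathways.
pathways_tab <- data.frame(
  Pathway_name = c("toy_pathway_A", "toy_pathway_B"),
  Pathway_description = c("first toy pathway", "second toy pathway"),
  Gene1 = c("GENE_A", "GENE_C"),
  Gene2 = c("GENE_B", NA),
  stringsAsFactors = FALSE
)

# Genomic location of each gene.
gene_tab <- data.frame(
  Gene = c("GENE_A", "GENE_B", "GENE_C"),
  CHR = c(1, 1, 2),
  Start = c(100000, 250000, 500000),
  End = c(120000, 260000, 530000),
  stringsAsFactors = FALSE
)

# Summary statistics; check.names = FALSE keeps the hyphen in 'P-value'.
SS_file <- data.frame(
  RS = c("rs_toy_1", "rs_toy_2"),
  `P-value` = c(0.04, 0.51),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Eigenvectors for correlation estimation.
evecs_tab <- data.frame(
  Subject = c("S1", "S2", "S3"),
  EV1 = c(0.01, -0.02, 0.01),
  EV2 = c(0.03, 0.00, -0.03),
  stringsAsFactors = FALSE
)

# Single-gene significance and the values of k to try.
gene_sig_tab <- data.frame(
  Gene = c("GENE_A", "GENE_B", "GENE_C"),
  P_value = c(0.001, 0.2, 0.05),
  stringsAsFactors = FALSE
)
k_vec <- 0:5
```

Note that `data.frame()` would silently rename `P-value` to `P.value` without `check.names = FALSE`; the same applies when reading such a file with `read.table()`.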
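
The interaction of `prune_to_start`, `prune_factor`, `prune_limit`, `snp_limit`, and `hard_snp_limit` can be sketched as a simple loop. This is one reading of the argument descriptions, not the package's actual implementation: `pruning_schedule()` and `count_snps_at()` are hypothetical, a pruning level of 1 is assumed to mean "no pruning", and the three outcome labels are invented for the example.

```r
# Sketch of a pruning schedule; NOT the package's actual code.
# count_snps_at() is a hypothetical stand-in for "prune the pathway at
# this level and return the number of SNPs that remain".
pruning_schedule <- function(count_snps_at, prune_factor = 0.5,
                             prune_limit = 0.0625, snp_limit = 1000,
                             hard_snp_limit = 1500, prune_to_start = TRUE) {
  # ASSUMPTION: level 1 means no pruning on the first run.
  level <- if (prune_to_start) prune_factor else 1
  n_snps <- count_snps_at(level)

  # Tighten the pruning level while the pathway is too large and the
  # next level would not fall below prune_limit.
  while (n_snps > snp_limit && level * prune_factor >= prune_limit) {
    level <- level * prune_factor
    n_snps <- count_snps_at(level)
  }

  # Decide what to do once pruning has stopped.
  decision <- if (n_snps <= snp_limit) {
    "test"
  } else if (n_snps <= hard_snp_limit) {
    "test_with_raised_limit"
  } else {
    "skip"
  }
  list(level = level, n_snps = n_snps, decision = decision)
}

# Toy stand-in: pretend the SNP count scales linearly with the level.
toy_count <- function(level) round(4000 * level)
res <- pruning_schedule(toy_count)
```

With the defaults, the levels tried in this sketch are 0.5, 0.25, 0.125, and 0.0625. For the toy counter above, the loop stops at level 0.25 with 1000 SNPs remaining, which is within `snp_limit`.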