Alexandra Tidd October 28, 2021
TGP is a package to predict the target genes of fine-mapped variants of a trait.
predict_target_genes() is the master, user-facing function of this package. All other functions are helper functions called by predict_target_genes(). The functions are all documented, so once the package is loaded you can access the help pages for individual functions, which explain all the arguments.
?predict_target_genes()
This package uses both reference genomic annotation datasets and user-provided trait-specific datasets. Some data that accompany the package must be downloaded separately. Genomic coordinates use the hg19 build. BED handling follows the UCSC BED format convention: start positions are 0-based and end positions are exclusive (equivalently, 1-based), so a single-base variant has end = start + 1.
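As a minimal sketch of the BED coordinate convention, the snippet below converts a 1-based single-nucleotide position into a BED interval. The helper snv_to_bed() is hypothetical and not part of the package; the position used is the one implied by the first row of the example variants file.

```r
# Hypothetical helper (not part of TGP): convert 1-based SNV positions
# into BED intervals with a 0-based start and an exclusive end.
snv_to_bed <- function(chrom, pos, name) {
  data.frame(
    chrom = chrom,
    start = pos - 1L,  # 0-based start
    end   = pos,       # exclusive end, equal to the 1-based position
    name  = name,
    stringsAsFactors = FALSE
  )
}

bed <- snv_to_bed("chr1", 10551763L, "rs657244")
print(bed)
```

Under this convention every single-base variant occupies an interval of width one, so end - start is always 1.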
The reference panels and example trait data are saved locally...
package_data_dir=/working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/
reference_panels_dir=$package_data_dir/reference_panels/output/
traits_dir=$package_data_dir/traits/output/
Smaller generic reference datasets, including chromosome sizes, GENCODE annotations and REVEL annotations (ChrSizes, TSSs, exons, introns, promoters, missense, nonsense, splicesite), are stored internally as parsed objects in R/sysdata.rda. They are accessible to the package's functions once the package is loaded, but they are not exported, so they are not visible in the global environment. The raw data from which these objects are derived are in the reference panels data directory and the scripts to generate the output are in the reference panels code directory.
The weightings of annotations used to generate pair scores are stored as a TSV file in data/weights.tsv.
Larger cell-type-specific reference panels are stored externally as local files in the reference panels output directory. These are too large to upload to GitHub and would make the package too bulky, so they will be published in a directory to be downloaded alongside the package. The reproducible scripts to generate these files are in the reference panels code directory.
list.files("/working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/reference_panels/output/")
## [1] "DHSs" "expression" "GENCODE" "H3K27ac"
## [5] "HiChIP" "metadata.tsv" "REVEL" "TADs"
There is one required user-provided file for the predict_target_genes() function: the trait variants (the variants_file argument). Known genes for the trait (the known_genes_file argument) are needed only if do_performance = T. Example inputs can be found in the traits directory for several tested traits.
(cd /working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/traits/output/ ; ls -d */)
## BC_cS2GxMichailidou2017_assoc/
## BC_L2GxMichailidou2017_FM/
## BC_Michailidou2017_FM/
## BloodCancer_Law2017andWent2018_LD/
## CRC_Law2019_LD/
## EC_Wang2022_LD/
## EOC_Jones2020_FM/
## IBD_Huang2017_FM/
## IBD_Huang2017_LD/
## LC_McKay2017_LD/
## Lymphoma_Sud2018_LD/
## old/
## PrCa_Benafif2018_LD/
## PrCa_cS2GxSchumacher2018_assoc/
## PrCa_L2GxSchumacher2018_FM/
The variants file should be a BED file with metadata columns for the variant name and the credible set to which it belongs.
head /working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/traits/output/BC_Michailidou2017_FM/variants.bed
## chr1 10551762 10551763 rs657244 BCAC_FM_1.1
## chr1 10563363 10563364 rs202087283 BCAC_FM_1.1
## chr1 10564674 10564675 rs2847344 BCAC_FM_1.1
## chr1 10566521 10566522 rs617728 BCAC_FM_1.1
## chr1 10569000 10569000 rs60354536 BCAC_FM_1.1
## chr1 10569257 10569258 rs2480785 BCAC_FM_1.1
## chr1 10579544 10579545 rs1411402 BCAC_FM_1.1
## chr1 10580890 10580891 rs2483677 BCAC_FM_1.1
## chr1 10581050 10581051 rs2506885 BCAC_FM_1.1
## chr1 10581657 10581658 rs2056417 BCAC_FM_1.1
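Before a run, it can help to sanity-check the variants file. The sketch below assumes the file is tab-delimited with five unnamed columns in the order shown above; the column names and the helper find_bad_intervals() are hypothetical labels chosen for illustration. The first two rows are copied from the example file and the third is a deliberately malformed, made-up row.

```r
# Column names are assumed for illustration; BED files carry no header.
bed_cols <- c("chrom", "start", "end", "variant", "cs")

example_lines <- c(
  "chr1\t10551762\t10551763\trs657244\tBCAC_FM_1.1",
  "chr1\t10563363\t10563364\trs202087283\tBCAC_FM_1.1",
  "chr1\t100\t100\trsExample\tCS_example"  # made-up row with end == start
)

variants <- read.delim(
  text = example_lines,
  header = FALSE,
  col.names = bed_cols,
  stringsAsFactors = FALSE
)

# Hypothetical check: return the indices of rows whose interval is empty
# or reversed, which is invalid for a 0-based, end-exclusive BED record.
find_bad_intervals <- function(v) {
  which(v$end <= v$start)
}

bad <- find_bad_intervals(variants)
print(bad)
```

Rows flagged this way have an end that does not exceed the start and are worth correcting before they are passed as variants_file.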
The known genes file should be a text file with a single column of known gene symbols. These symbols must be GENCODE-compatible.
head /working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/traits/output/BC_Michailidou2017_FM/known_genes.txt
## AKT1
## ARID1A
## ATM
## BRCA1
## BRCA2
## CBFB
## CDH1
## CDKN1B
## CHEK2
## CTCF
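A known genes file can be tidied and compared against an annotation before use. The sketch below is illustrative only: read_known_genes() is a hypothetical helper, and gencode_symbols is a small stand-in vector, not the GENCODE annotation shipped with the package.

```r
# Hypothetical helper: read one gene symbol per line, trim whitespace,
# drop empty lines and remove duplicates.
read_known_genes <- function(con) {
  genes <- trimws(readLines(con))
  unique(genes[nzchar(genes)])
}

# In-memory stand-in for a known_genes.txt file
tc <- textConnection(c("AKT1", "ARID1A", "ATM ", "BRCA1", "BRCA1", ""))
known_genes <- read_known_genes(tc)
close(tc)

# Stand-in for the set of GENCODE gene symbols (assumption for this example)
gencode_symbols <- c("AKT1", "ARID1A", "ATM", "BRCA1", "BRCA2")

# Symbols that would need renaming to be GENCODE-compatible
unmatched <- setdiff(known_genes, gencode_symbols)
print(unmatched)
```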
To run predict_target_genes()...
package_data_dir <- "/working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/"
MA <- predict_target_genes(
  trait = "BC_Michailidou2017_FM",
  variants_file = paste0(package_data_dir, "traits/output/BC_Michailidou2017_FM/variants.bed"),
  known_genes_file = paste0(package_data_dir, "traits/output/BC_Michailidou2017_FM/known_genes.txt"),
  reference_panels_dir = paste0(package_data_dir, "reference_panels/output/")
)
Unless an out_dir argument is passed, the results will be saved to "out/${trait}/${celltypes}/". If do_timestamp = T, the run results will be saved to a timestamped subdirectory.
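The default output location can be pictured with the sketch below. It is not the package's implementation: build_out_dir() is a hypothetical function, and the timestamp format is an assumption, since the format the package uses is not stated here.

```r
# Hypothetical sketch of how the default output path could be assembled.
build_out_dir <- function(trait, celltypes, do_timestamp = FALSE,
                          now = Sys.time()) {
  out_dir <- file.path("out", trait, celltypes)
  if (do_timestamp) {
    # Assumed timestamp format, for illustration only
    out_dir <- file.path(out_dir, format(now, "%Y%m%d_%H%M%S"))
  }
  out_dir
}

plain_dir <- build_out_dir("BC_Michailidou2017_FM", "enriched_tissues")
stamped_dir <- build_out_dir(
  "BC_Michailidou2017_FM", "enriched_tissues",
  do_timestamp = TRUE,
  now = as.POSIXct("2021-10-28 12:00:00", tz = "UTC")
)
print(plain_dir)
print(stamped_dir)
```

Writing each run to its own timestamped subdirectory keeps earlier results from being overwritten when the same trait is rerun with different settings.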
If you are calling predict_target_genes() repeatedly in the same session, you can load the large reference objects H3K27ac and HiChIP into the global environment once, and then pass them to the function pre-loaded. This prevents redundant re-loading with each call to predict_target_genes().
# 1. load large reference panel objects
package_data_dir <- "/working/lab_jonathb/alexandT/tgp_paper/wrangle_package_data/"
HiChIP <- readRDS(paste0(package_data_dir, "reference_panels/output/HiChIP/HiChIP.rds"))
H3K27ac <- readRDS(paste0(package_data_dir, "reference_panels/output/H3K27ac/H3K27ac.rds"))
# 2. pass objects to predict_target_genes for a quicker runtime
MA <- predict_target_genes(
  trait = "BC_Michailidou2017_FM",
  variants_file = paste0(package_data_dir, "traits/output/BC_Michailidou2017_FM/variants.bed"),
  known_genes_file = paste0(package_data_dir, "traits/output/BC_Michailidou2017_FM/known_genes.txt"),
  reference_panels_dir = paste0(package_data_dir, "reference_panels/output/"),
  HiChIP = HiChIP,
  H3K27ac = H3K27ac
)
To simulate installing and loading the package during interactive development...
setwd("/working/lab_jonathb/alexandT/tgp/")
devtools::load_all()
This package was written using package development conventions from https://r-pkgs.org/.
All of the arguments passed to predict_target_genes() in a given run are written to arguments_for_predict_target_genes.R in the output directory. If you wish to restore the internal environment of a run in your global environment to run the R/predict_target_genes.R script...
args <- dget("out/BC_Michailidou2017_FM/enriched_tissues/arguments_for_predict_target_genes.R")
list2env(args, envir=.GlobalEnv)
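The restore step relies on base R's dput()/dget() round trip. The sketch below shows that mechanism in isolation, assuming the arguments file is a dput()-style deparsed list (the form that dget() reads). The argument values and the temporary file location are made up for the example, and the arguments are restored into a fresh environment rather than the global one to avoid overwriting existing objects.

```r
# Made-up arguments standing in for those recorded during a run
run_args <- list(
  trait = "BC_Michailidou2017_FM",
  do_performance = TRUE,
  do_timestamp = FALSE
)

# Write the list as R source, then read it back
args_file <- file.path(tempdir(), "arguments_for_predict_target_genes.R")
dput(run_args, file = args_file)
restored <- dget(args_file)

# Unpack the list into an environment, one object per argument
run_env <- list2env(restored, envir = new.env())
print(ls(run_env))
print(get("trait", envir = run_env))
```

Passing envir = .GlobalEnv instead, as in the call above, makes each argument available as a top-level object for stepping through the function body line by line.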
All functions in the package, unless generic, use the same object names as predict_target_genes(), so you can also run the internal code of the helper functions directly as long as you have already run the internal code of predict_target_genes() up to the point at which that helper function is called. This allows you to run the complete pipeline line-by-line for package debugging and development.