knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
This package requires cmdstanr, and so can't go on CRAN (nor Bioc I think). Install cmdstanr
first, then use it to install CmdStan
:
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos"))) library(cmdstanr) check_cmdstan_toolchain() install_cmdstan(cores = 2)
Note that C++ compilers are also required (check_cmdstan_toolchain()
will check for this). Installing CmdStan
can take around ~5 minutes, depending on how fast your system can compile it. You can set Sys.setenv(MAKEFLAGS = "-j4")
before the install command to make it compile in parallel if you wish.
Then use:
remotes::install_github("andrewGhazi/basehitmodel")
to install the remaining dependencies (listed in the DESCRIPTION
file) and basehitmodel
itself. You can add Ncpus=3
to that install command to make the dependency installation run in parallel, though you will probably need to restart your R session if you set the parallel makeflag above.
Installing the R packages will take a ~20 seconds if your OS gets precompiled binaries from CRAN, otherwise it will be ~2-5 minutes on a normal laptop.
Point model_proteins_separately
to a directory of mapped basehit counts (here I point it to the demo directory from this repo) and it will fit the ZINB model per-protein and save the results in the specified output directory (for this example I just tell it to output to a directory on my desktop). An optional cache directory can be used to save the pre-processing steps so that future model runs with different statistical parameters (e.g. priors) don't have to re-run the initial QC steps. Help documentation on the other arguments can be displayed with ?model_proteins_separately
.
res = model_proteins_separately(count_path = "~/basehitmodel/data-raw/demo", out_dir = "~/Desktop/basehit_tmp", cache_dir = "~/Desktop/basehit_tmp/cache", split_data_dir = "~/Desktop/basehit_tmp/splits/", ixn_prior_width = 0.15, algorithm = "variational", iter_sampling = 2000, iter_warmup = 1000, save_split = TRUE, save_fits = FALSE, save_summaries = TRUE, bead_binding_threshold = 1, save_bead_binders = TRUE, min_n_nz = 3, min_frac_nz = 0.5, verbose = TRUE, seed = 1234)
Using the example data in data-raw/demo
and this example command with variational inference and no parallelization, the full processing, modeling, and summarization pipeline took 5 minutes 25 seconds on an i7-1370P (a modern laptop processor).
The main function model_proteins_separately()
returns a data frame with one row for each protein:strain pair that passed the QC steps. It has columns of scores, hit-calls, concordances, type S error probabilities, several score quantiles, and a final column of notes that will indicate when a particular protein showed higher-than usual bead binding enrichment.
res[,1:6] protein strain ixn_score strong_hit weak_hit concordance 1: LTB AB26 0.6602471 FALSE FALSE 0.01159683 2: BTC AB26 0.6064522 FALSE FALSE 0.01648394 3: FGF19 EpiG7 0.5368637 FALSE FALSE 0.31062751 4: PDIA6 EpiG10 0.5218827 FALSE FALSE 0.74763547 5: SLC34A3 EpiD4 0.4239569 FALSE FALSE 0.50798559 ... --- 15695: CCL4 EpiB9 -0.2758499 FALSE FALSE 0.69314718 15696: MUCL3 AIEC -0.2903377 FALSE FALSE 1.08893603 15697: PDIA6 AB9 -0.3108420 FALSE FALSE 1.05451725 15698: MUCL3 AB9 -0.3143061 FALSE FALSE 1.09323906 15699: DEFB130B AB9 -0.3371342 FALSE FALSE 1.07972289
This data frame of results is also written to the output directory as ixn_scores.tsv
. The output directory also contains per-protein score summaries, warnings, other model parameter estimates, and various other diagnostic outputs in files with explanatory names.
The highest scoring interaction (first row in the data frame returned by model_proteins_separately()
and in ixn_scores.tsv
in the output) in this case is LTB:AB26 with a score of 0.66, though none of the interactions in the demo data subset pass the default hit-calling thresholds.
Using algorithm = "mcmc"
will give a more accurate approximation, but usually takes ~10x longer. Analyzing the full dataset
took about 6 days (still only a fraction of the time required to design and run the assay!).
To reproduce the estimates used in the paper, uncompress the 6 data files in data-raw/
(not demo/
) with xz -dk *.xz
, set the parallelization in R library(furrr); plan(multisession, workers = 12)
and use the following arguments to model_proteins_separately()
: iter_sampling = 2500, iter_warmup = 2000, algorithm = 'mcmc', seed = 123
. This should reproduce the estimates to within Monte Carlo error.
This package has been tested on R 4.1.0 and the following package versions (though anything from ~2020 forwards / not ancient should work):
|pkg |ver | |----------|------| |dplyr |1.0.7 | |data.table|1.14.2| |magrittr |2.0.2 | |openxlsx |4.2.4 | |tibble |3.1.6 | |tidyr |1.2.0 | |cmdstanr |0.5.1 | |readxl |1.3.1 | |furrr |0.2.3 | |stringr |1.4.0 | |ggplot2 |3.3.5 | |patchwork |1.1.1 | |forcats |0.5.1 | |purrr |0.3.4 | |posterior |1.2.0 | |rlang |1.0.0 | |stats |4.1.0 | |utils |4.1.0 | |readr |2.1.2 | |progressr |0.10.0|
Any OS that can install R and Stan should be sufficient. We have tested the package most thoroughly on CentOS Linux release 7.6.1810 (Core), as well as various recent versions of macOS and Windows 10. No non-standard hardware is required.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.