In andrewGhazi/basehitmodel: Runs the basehit ZINB model

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

Installation

This package requires cmdstanr, and so can't go on CRAN (nor Bioc I think). Install cmdstanr first, then use it to install CmdStan:

install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/",
                                       getOption("repos")))

library(cmdstanr)
check_cmdstan_toolchain()
install_cmdstan(cores = 2)

Note that C++ compilers are also required (check_cmdstan_toolchain() will check for this). Installing CmdStan can take around ~5 minutes, depending on how fast your system can compile it. You can set Sys.setenv(MAKEFLAGS = "-j4") before the install command to make it compile in parallel if you wish.

Then use:

remotes::install_github("andrewGhazi/basehitmodel")

to install the remaining dependencies (listed in the DESCRIPTION file) and basehitmodel itself. You can add Ncpus=3 to that install command to make the dependency installation run in parallel, though you will probably need to restart your R session if you set the parallel makeflag above.

Installing the R packages will take a ~20 seconds if your OS gets precompiled binaries from CRAN, otherwise it will be ~2-5 minutes on a normal laptop.

Example

Point model_proteins_separately to a directory of mapped basehit counts (here I point it to the demo directory from this repo) and it will fit the ZINB model per-protein and save the results in the specified output directory (for this example I just tell it to output to a directory on my desktop). An optional cache directory can be used to save the pre-processing steps so that future model runs with different statistical parameters (e.g. priors) don't have to re-run the initial QC steps. Help documentation on the other arguments can be displayed with ?model_proteins_separately.

res = model_proteins_separately(count_path = "~/basehitmodel/data-raw/demo",
                                out_dir = "~/Desktop/basehit_tmp",
                                cache_dir = "~/Desktop/basehit_tmp/cache",
                                split_data_dir = "~/Desktop/basehit_tmp/splits/",
                                ixn_prior_width = 0.15,
                                algorithm = "variational",
                                iter_sampling = 2000,
                                iter_warmup = 1000,
                                save_split = TRUE,
                                save_fits = FALSE,
                                save_summaries = TRUE,
                                bead_binding_threshold = 1,
                                save_bead_binders = TRUE,
                                min_n_nz = 3,
                                min_frac_nz = 0.5,
                                verbose = TRUE,
                                seed = 1234)

Using the example data in data-raw/demo and this example command with variational inference and no parallelization, the full processing, modeling, and summarization pipeline took 5 minutes 25 seconds on an i7-1370P (a modern laptop processor).

The main function model_proteins_separately() returns a data frame with one row for each protein:strain pair that passed the QC steps. It has columns of scores, hit-calls, concordances, type S error probabilities, several score quantiles, and a final column of notes that will indicate when a particular protein showed higher-than usual bead binding enrichment.

 res[,1:6]
        protein strain  ixn_score strong_hit weak_hit concordance
    1:      LTB   AB26  0.6602471      FALSE    FALSE  0.01159683
    2:      BTC   AB26  0.6064522      FALSE    FALSE  0.01648394
    3:    FGF19  EpiG7  0.5368637      FALSE    FALSE  0.31062751
    4:    PDIA6 EpiG10  0.5218827      FALSE    FALSE  0.74763547
    5:  SLC34A3  EpiD4  0.4239569      FALSE    FALSE  0.50798559 ...
   ---                                                           
15695:     CCL4  EpiB9 -0.2758499      FALSE    FALSE  0.69314718
15696:    MUCL3   AIEC -0.2903377      FALSE    FALSE  1.08893603
15697:    PDIA6    AB9 -0.3108420      FALSE    FALSE  1.05451725
15698:    MUCL3    AB9 -0.3143061      FALSE    FALSE  1.09323906
15699: DEFB130B    AB9 -0.3371342      FALSE    FALSE  1.07972289

This data frame of results is also written to the output directory as ixn_scores.tsv. The output directory also contains per-protein score summaries, warnings, other model parameter estimates, and various other diagnostic outputs in files with explanatory names.

The highest scoring interaction (first row in the data frame returned by model_proteins_separately() and in ixn_scores.tsv in the output) in this case is LTB:AB26 with a score of 0.66, though none of the interactions in the demo data subset pass the default hit-calling thresholds.

Using algorithm = "mcmc" will give a more accurate approximation, but usually takes ~10x longer. Analyzing the full dataset

with MCMC
parallelized over 12 cores
on an Intel(R) Xeon(R) CPU E5-2680 (released Q1 2012)
and 2000 warmup iterations and 2500 posterior draws

took about 6 days (still only a fraction of the time required to design and run the assay!).

Reproduction instructions

To reproduce the estimates used in the paper, uncompress the 6 data files in data-raw/ (not demo/) with xz -dk *.xz, set the parallelization in R library(furrr); plan(multisession, workers = 12) and use the following arguments to model_proteins_separately(): iter_sampling = 2500, iter_warmup = 2000, algorithm = 'mcmc', seed = 123. This should reproduce the estimates to within Monte Carlo error.

Tested Software Versions

This package has been tested on R 4.1.0 and the following package versions (though anything from ~2020 forwards / not ancient should work):

|pkg |ver | |----------|------| |dplyr |1.0.7 | |data.table|1.14.2| |magrittr |2.0.2 | |openxlsx |4.2.4 | |tibble |3.1.6 | |tidyr |1.2.0 | |cmdstanr |0.5.1 | |readxl |1.3.1 | |furrr |0.2.3 | |stringr |1.4.0 | |ggplot2 |3.3.5 | |patchwork |1.1.1 | |forcats |0.5.1 | |purrr |0.3.4 | |posterior |1.2.0 | |rlang |1.0.0 | |stats |4.1.0 | |utils |4.1.0 | |readr |2.1.2 | |progressr |0.10.0|

Any OS that can install R and Stan should be sufficient. We have tested the package most thoroughly on CentOS Linux release 7.6.1810 (Core), as well as various recent versions of macOS and Windows 10. No non-standard hardware is required.

andrewGhazi/basehitmodel documentation built on Oct. 22, 2023, 9:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com