In alex-kalinka-cruk/fgcQC: QC metric calculations for CRISPR screen data produced at the FGC

Summary

fgcQC is an R package that provides implementations of CRISPR QC metric calcluations pertaining to sequencing, laboratory, and analysis data - both standard and novel metrics are included. It is designed to ingest data produced by the AZ-CRUK CRISPR analysis pipeline.

Dependencies

R $\geq$ v3.6.1 (not tested on earlier versions).
CRAN R packages: dplyr, tibble, tidyr, magrittr, rlang, purrr, jsonlite, ineq, energy, PRROC, matrixcalc
Bioconductor R packages: DSS

Design

The main entry point is QC_fgc_crispr_data() and this function takes an analysis config JSON file, count data, bagel output, sequencing metrics (bcl2fastq), and a cleanr.tsv gRNA library file.
QC data is produced for a single comparison in the analysis config.
QC metrics are saved to a csv (non-confidential) file and an R object containing the intermediate data is saved to an rds file (confidential - to be used for troubleshooting or more in-depth analyses and plotting).
Each row of the output csv file corresponds to a single sample.
All the QC calculation functions are exported and available to be used stand-alone within the R package.
Lethality screens are handled by setting treatment file paths to NULL (see 'Usage' section below).

Assumptions

A qc section exists in the analysis config and contains a date_transduced variable for each sample.
Sample Labels from the analysis config do not contain any confidential information.
Index barcodes are unique within a single SLX_ID even if sequencing has occurred across more than one flowcell.
Processing individual analysis comparisons one at a time is feasible.
Non-targeting guides are named 'Control' in the gene column of the library file.
CRISPR-i and CRISPR-a screens will only return sequencing QC metrics.

Usage

In the below command, multiple bcl2fastq Stats.json files can be provided in a single comma-separated string. For 'lethality' screens, set the argument bagel_treat_plasmid to NULL (bagel_ctrl_plasmid may also be set to NULL in case Bagel was unable to produce any output).

qc <- QC_fgc_crispr_data(analysis_config = "path/to/analysis_config.json",
                         combined_counts = "path/to/combined_counts.csv",
                         bagel_ctrl_plasmid = "path/to/bagel_Control_vs_Plasmid.bf",
                         bagel_treat_plasmid = "path/to/bagel_Treatment_vs_Plasmid.bf",
                         bcl2fastq = "path/to/first/Stats.json,path/to/second/Stats.json",
                         library = "path/to/gRNA-library-file/cleanr.tsv",
                         comparison_name = "Single_Comparison_name",
                         output = "qc-out.csv",
                         output_R_object = "qc-out",
                         mock_missing_data = FALSE)

If mock_missing_data is set to TRUE (defaults to FALSE) then one or both of bcl2fastq and library can be set to NULL and the missing data will be mocked within the function and the mocked column data will be set to NA values in the output. When this is the case, a mock 'qc' section will also be included in the analysis config if it is missing.

A summary of the return object can be printed to the console:

qc
####  fgcQC summary  ####
SLX_ID: SLX-19037
Screen Type: n
Screen Goal: lethality
Number of Flowcells: 1
AUROC for pan-cancer Sanger (Control vs Plasmid): 0.996
AUROC for Hart essentials (Control vs Plasmid): 0.887
AUROC for Moderately negative (Control vs Plasmid): 0.932
AUROC for Weakly negative (Control vs Plasmid): 0.634

Unit Tests

All unit tests can be run from the root of the R package directory:

devtools::test("path/to/fgcQC")

QC metrics

QC metrics to the left of the SampleId column are experiment-wide, comparison-wide, or calculated for pairs of samples (e.g. control vs plasmid). QC metrics to the right of the SampleId column are sample-specific metrics. In the below tables, <comp> is one of ctrl_plasmid, treat_plasmid, or treat_ctrl, and <gene_set> is the name of a gene set, e.g. hart_essential.

Standard

|QC metric|Description| ----------|-----------| AUROC.<comp>.<gene_set>|Area Under the Receiver Operating Curve for Bagel Bayes Factors.| AUPrRc.<comp>.<gene_set>|Area Under the Precision-Recall curve for Bagel Bayes Factors.| Sensitivity_FDR_10pct.<comp>.<gene_set>|Sensitivity at 10% FDR for Bagel Bayes Factors.| NNMD.<comp>.<gene_set>|log2 fold change Null-Normalized Mean Difference relative to plasmid.| NNMD_robust.<comp>.<gene_set>|log2 fold change Null-Normalized Mean Difference relative to plasmid defined using the median and median absolute deviation.| repl_log2FC_pearson_corr.<comp>|Pearson correlation of log-fold change values for replicate pairs belonging to the same sample type. Only calculated for the first two replicates when there are >2 total replicates.| cas_activity|Cas-9 activity.| virus_batch|Batch ID of virus.| plasmid_batch|Batch ID of plasmid.| minimum_split_cell_number|Cell density when cell populations are split.| cell_population_doublings|Doubling rate of cell lines.| library_dna_yield_ng.ul|Amplified library DNA yield.| index_plate|The index plate ID for prepping the library.| index_plate_well_id|The well ID for the index plate.| percent_low_count_plasmid_gRNAs|Percentage of plasmid gRNAs with less than 30 reads.| percent_zero_plasmid_gRNAs|Percentage of plasmid gRNAs with zero-count reads.|

Novel

|QC metric|Description| ----------|-----------| GICC.<comp>.<gene_set>|Generalized Intraclass Correlation Coefficient of normalized counts - multivariate measure of reproducibility (fits a linear model and returns the proportion of the total variance attributable to between-sample variance). Values range between 0 and 1 - higher values indicate more reproducible data.| mahalanobis_dist_ratio.<comp>|Mahalanobis distance ratio of control or treatment samples relative to the plasmid for normalized counts. Lower values indicate the control or treatment samples cluster away from the plasmid. Values $\geq$ 1 indicate control or treatment samples are close to the plasmid and should be investigated.| dispersion_adj_gRNA.treat_ctrl|An empirical-Bayesian shrinkage estimator for the dispersion of gRNA counts - the mean across gRNAs is returned. Only calculated for Treatment-vs-Control.| distcorr_GC_content_counts|A 'distance correlation' (allows for non-linear, non-monotonic relationships) for gRNA GC content and normalized counts| distcorr_GC_content_logfc.<comp>|A 'distance correlation' (allows for non-linear, non-monotonic relationships) for gRNA GC content and log fold changes| gini_coefficient_counts|The Gini coefficient for count data. Values range between 0 and 1 with high values indicating most reads belong to a small number of gRNAs.| norm_counts_GCC_ratio|gRNAs with GCC in the 4 bases upstream of the PAM have been found to be inefficient. This metric captures the ratio of median normalized counts for gRNAs with GCC versus the median of all gRNAs.| norm_counts_TT_ratio|gRNAs with TT in the 4 bases upstream of the PAM have been found to be inefficient. This metric captures the ratio of median normalized counts for gRNAs with TT versus the median of all gRNAs.| log2FC_GCC_diff.<comp>|gRNAs with GCC in the 4 bases upstream of the PAM have been found to be inefficient. This metric captures the logFC difference for gRNAs with GCC versus all gRNAs.| log2FC_TT_diff.<comp>|gRNAs with TT in the 4 bases upstream of the PAM have been found to be inefficient. This metric captures the logFC difference for gRNAs with TT versus all gRNAs.| NNMD.<comp>.control_guides|NNMD for non-targeting control gRNAs. Expected to be > 0 since they do not incur the cost of dsDNA cuts.| NNMD_robust.<comp>.control_guides|NNMD_robust for non-targeting control gRNAs. Expected to be > 0 since they do not incur the cost of dsDNA cuts.|

Sequencing

When more than one flowcell has been used, the worst example of each flowcell-level QC metric is reported (min or max depending on the metric).

|QC metric|Description| ----------|-----------| ReadsPF_percent|The percentage of reads Passing Filter (PF) for the experiment as a whole. Should be high.| Non_Demultiplexed_Reads_percent|The percentage of non-demultiplexed reads for the experiment as a whole.| Sample_Representation|The percentage of reads assigned to each sample.| percent_reads_matching_gRNAs|The percentage of reads that match a gRNA sequence in each sample.| Q30_bases_percent|The percentage of reads with quality scores in excess of 30 for each sample. Should be high.| Average_base_quality|Average base quality score for each sample.| Index_OneBaseMismatch_percent|The percentage of 1-base index mismatches per sample. Should be low.|

Gene sets

pan_cancer_Sanger - the core-fitness genes defined by Sanger.
hart_essential - the Hart essentials defined in 2017.
moderately_negative - 245 genes defined as having logFC between -1.2 and -0.5 for $\geq$ 90% of the DepMap cell lines.
weakly_negative - 340 genes defined as having logFC between -0.8 and -0.2 for $\geq$ 90% of the DepMap cell lines.
hart_nonessential - the Hart non-essential set.

Bugs, Issues, or Requests

Please contact Alex Kalinka

alex-kalinka-cruk/fgcQC documentation built on June 23, 2020, 9:05 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

alex-kalinka-cruk/fgcQC
QC metric calculations for CRISPR screen data produced at the FGC

In alex-kalinka-cruk/fgcQC: QC metric calculations for CRISPR screen data produced at the FGC

Summary

Dependencies

Design

Assumptions

Usage

Unit Tests

QC metrics

Standard

Novel

Sequencing

Gene sets

Bugs, Issues, or Requests

R Package Documentation

Browse R Packages

We want your feedback!

alex-kalinka-cruk/fgcQC QC metric calculations for CRISPR screen data produced at the FGC

In alex-kalinka-cruk/fgcQC: QC metric calculations for CRISPR screen data produced at the FGC

Summary

Dependencies

Design

Assumptions

Usage

Unit Tests

QC metrics

Standard

Novel

Sequencing

Gene sets

Bugs, Issues, or Requests

R Package Documentation

Browse R Packages

We want your feedback!

alex-kalinka-cruk/fgcQC
QC metric calculations for CRISPR screen data produced at the FGC