Description Usage Arguments Details Results Additional fields See Also Examples
View source: R/genie_functions.R
For each replicate associated with an input region, deletion_analysis
identifies HDR or WT sequences, as well as deletion alleles, and returns
statistics that indicate for every allele whether the allele:WT ratio
differs in cDNA and gDNA. Statistics are also computed for deletion alleles
aggregated together.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | deletion_analysis(
regions,
replicates,
required_match_left = 10,
required_match_right = 10,
crispr_del_window = 100,
min_mapq = 0,
max_mismatch_frac = 0.05,
min_aligned_bases = 50,
exclude_multiple_deletions = FALSE,
exclude_nonspanning_reads = TRUE,
allele_profile = FALSE,
del_span_start = -20,
del_span_end = 20,
quiet = FALSE
)
|
regions |
A data.frame defining GenIE regions. |
replicates |
A data.frame defining GenIE replicates. |
required_match_left |
The length of sequence to the left of the HDR site that must exactly match to identify HDR or WT reads. |
required_match_right |
The length of sequence to the right of the HDR site that must exactly match to identify HDR or WT reads. |
crispr_del_window |
The window around the cut site within which any deletion is considered a CRISPR deletion. deletions that do not span the region [cut_site - crispr_del_window, cut_site + crispr_del_window] will be ignored; i.e. such reads can be considered HDR or WT. |
min_mapq |
The minimum mapping quality for reads to be included in the analysis. |
max_mismatch_frac |
The maximum fraction of mismatches a read can have and be included in the analysis. |
min_aligned_bases |
The minimum number of aligned bases within the region of interest for a read to be included in the analysis. |
exclude_multiple_deletions |
If TRUE, then reads with multiple deletions will be excluded from the analysis. |
exclude_nonspanning_reads |
If TRUE, then reads are excluded if their alignment does not overlap the region's highlight_site (or cut site if no highlight_site is specified) |
allele_profile |
If TRUE, then the result object will contain data.frames named site_profiles and mismatch_profiles, as detailed in the description below. |
del_span_start |
An integer that specifies the start of a window, relative to the region's highlight site, within which deletions are counted. |
del_span_end |
An integer that specifies the end of a window, relative to the region's highlight site, within which deletions are counted. |
quiet |
If TRUE, then no messages are printing during the analysis. |
For a deletion analysis, the regions parameter is a data.frame with a format as follows. All of the column names below must be specified.
name | sequence_name | start | end | highlight_site | cut_site | hdr_allele_profile | wt_allele_profile | ref_sequence |
MUL1_rs6700034 | MUL1 | 1 | 21 | 11 | 9 | ----------A---------- | ----------C---------- | ACCGCACCCCCCCGGCCTAAC |
name | A unique identifier for the region. |
sequence_name | The chromosome or amplicon sequence name. |
start | The start coordinate of the amplicon relative to the chromosome or amplicon reference. |
end | The end coordinate of the amplicon relative to the chromosome or amplicon reference (the end coordinate is included in the region). |
highlight_site | The relative position of the site of interest, usually the HDR SNP site. |
cut_site | The relative position of the cut site. |
ref_sequence | The reference sequence for the amplicon region, which must have length (end - start + 1). |
hdr_allele_profile | An allele profile describing the HDR allele. See details below. |
wt_allele_profile | An allele profile describing the WT allele. See details below. |
If multiple rows are defined for 'regions', then a separate analysis is run for each region, using matched replicates from 'replicates'.
The allele_profile columns indicate the positions in the amplicon sequence that must match a given nucleotide for a read to be considered either HDR or WT. This sequence must be the same length as the reference sequence, and all other positions should be "-". The total sequence region that must match is determined by both the positions of specified nucleotides and by the required_match_left and required_match_right parameters. These parameters give the length of sequence which must match the provided reference sequence to the left of the leftmost specified nucleotide, or to the right of the rightmost specified nucleotide.
The replicates parameter is a data.frame with a format as below.
name | replicate | type | bam |
MUL1_rs6700034 | c1.2 | cDNA | bam_amplicon/MUL1_rs6700034_cDNA_rep1_pcr2.sortedByCoord.bam |
MUL1_rs6700034 | c1.3 | cDNA | bam_amplicon/MUL1_rs6700034_cDNA_rep1_pcr3.sortedByCoord.bam |
name | Indicates the region that a given replicate corresponds with. All replicates matching the name in the regions table will be used. |
replicate | an ID for the replicate, which must be unique among replicates for the region. |
type | Must have the value "cDNA" or "gDNA", indicating whether a given replicate contains data for cDNA or gDNA. |
bam | the path (relative to the working directory) to a BAM file with sequencing reads for the replicate. |
Statistics can only be computed if there are at least 2 replicates of each type (cDNA and gDNA). Replicates are matched to the region based on the name column.
The returned object is a list, where each item is the result for one region. The result for a region (e.g. results[[1]]) is itself a list, with the following items:
region_stats | Main analysis output, with statistics indicating whether the HDR/WT levels differ in cDNA relative to gDNA. |
replicate_stats | A data.frame with a row for each replicate, which has counts of reads in different categories and some summary values. |
region | Details of the input region the result corresponds to. |
replicates | Details of the input replicates the result corresponds to. |
opts | A list containing the options that were given for the analysis. |
type | Has the value “deletion_analysis”, and is used by plotting functions that take a full del_result list as input. |
The main output of interest is the 'region_stats' field, which is a one-row data.frame with the following values:
name | Name of the region. |
hdr_rate_gDNA | Fraction of reads in gDNA identified as HDR. |
hdr_rate_cDNA | Fraction of reads in cDNA identified as HDR. |
wt_rate_gDNA | Fraction of reads in gDNA identified as WT |
wt_rate_cDNA | Fraction of reads in cDNA identified as WT. |
del_rate_gDNA | Fraction of reads in gDNA identified as having a CRISPR deletion. |
del_rate_cDNA | Fraction of reads in cDNA identified as having a CRISPR deletion. |
hdr_effect | HDR allele: Estimated effect size - the amount by which the HDR:WT ratio differs in cDNA relative to gDNA. |
hdr_effect_sd | HDR allele: Standard deviation of the estimated effect size. |
hdr_effect_confint_lo | HDR allele: Lower bound of the 95% confidence interval for the effect size. |
hdr_effect_confint_hi | HDR allele: Upper bound of the 95% confidence interval for the effect size. |
hdr_df_estimated | THDR allele: he degrees of freedom used in the unequal variances t-test, which is estimated from the data. |
hdr_pval | HDR allele: A p value from the unequal variants t-test. |
The fields above beginning with 'hdr_' give statistics relating to the HDR allele. There are 6 equivalent fields that begin with 'del_' which relate to all deletions. There are also 6 equivalent fields that begin with 'del_window_', which relate to deletions that are contained within a window around the 'highlight_site' (defined in the region input), the extent of which is determined by 'crispr_del_window' parameter.
The deletion_analysis result object additionally has the following fields:
replicate_qc | A data.frame of summary information for each replicate, with counts of reads in different categories, editing rates per replicate, and quality control summary information. |
replicate_alleles | A data.frame of summary information for each unique allele in each replicate. |
region_alleles | A data.frame of summary information for each unique allele averaged across all replicates. |
replicate_allele_fractions | A data.frame of summary information for the top 20 alleles across all replicates, which is used for replicate quality control. |
allele_effect | A data.frame of GenIE effect size estimates (difference in allele:WT ratio in cDNA vs. gDNA) for each unique allele with sufficient reads across replicates. |
site_profiles | A data.frame which indicates read counts for each combination of observed nucleotides at each specified position in the wt_allele_profile, separately for each replicate. For example, if 1 position is defined, then there will be up to 5 unique combinations in each replicate, accounting for A, C, G, T, or * (deletion). If 2 positions are defined, there are up to 25 combinations. This can be useful to check whether the fraction of WT or edited reads is as expected. In particular, if a heterozygous site is targeted, and only one of the haplotype alleles actually gets edited, then you expect depletion of only one of the nucleotides at the SNP site. |
mismatch_profiles | A data.frame which contains every unique read profile, including mismatches, for reads considered as WT or HDR. This can help to identify if there are an abundance of "HDR" or "WT" reads with mismatches at a particular position, or an excess of mismatches in general. |
replicate_list | Internal data for each replicate. Not likely to be of interest to the user. |
type | |
1 2 3 4 5 6 7 8 9 10 11 | # Note: to run an analysis you need BAM files from a GenIE experiment.
# An example is available for download using download_example().
download_example(dir = "~/genie_example", name = "MUL1")
# Data are downloaded and we can run an rgenie analysis
setwd("~/genie_example/MUL1/")
regions = readr::read_tsv("mul1.genie_regions.tsv")
replicates = readr::read_tsv("mul1.genie_replicates.tsv")
delresults = deletion_analysis(regions, replicates)
deletion_plots(delresults[[1]])
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.