grep_analysis: Grep-based GenIE analysis
In rgenie: Analysis of GenIE Experiments

Description Usage Arguments Details Results See Also Examples

For each replicate associated with an input region, grep_analysis searches for alleles matching the HDR or WT sequences, and returns statistics that indicate whether the HDR:WT ratio differs in cDNA and gDNA.

grep_analysis(
  regions,
  replicates,
  required_match_left = 10,
  required_match_right = 10,
  min_mapq = 0,
  quiet = FALSE
)

`regions`	A data.frame defining GenIE regions.
`replicates`	A data.frame defining GenIE replicates.
`required_match_left`	The length of sequence to the left of the HDR site that must exactly match to identify HDR or WT reads.
`required_match_right`	The length of sequence to the right of the HDR site that must exactly match to identify HDR or WT reads.
`min_mapq`	The minimum mapping quality for reads to be included in the analysis.
`quiet`	If TRUE, then no messages are printing during the analysis.

For a grep analysis, the regions parameter is a data.frame with a format as follows. All of the column names below must be specified.

name	sequence_name	start	end	highlight_site	cut_site	hdr_allele_profile	wt_allele_profile	ref_sequence
MUL1_rs6700034	MUL1	1	21	11	9	----------A----------	----------C----------	ACCGCACCCCCCCGGCCTAAC

name	A unique identifier for the region.
sequence_name	The chromosome or amplicon sequence name.
start	The start coordinate of the amplicon relative to the chromosome or amplicon reference.
end	The end coordinate of the amplicon relative to the chromosome or amplicon reference (the end coordinate is included in the region).
highlight_site	The relative position of the site of interest, usually the HDR SNP site.
cut_site	The relative position of the cut site.
ref_sequence	The reference sequence for the amplicon region, which must have length (end - start + 1).
hdr_allele_profile	An allele profile describing the HDR allele. See details below.
wt_allele_profile	An allele profile describing the WT allele. See details below.

If multiple rows are defined for 'regions', then a separate analysis is run for each region, using matched replicates from 'replicates'.

The allele_profile columns indicate the positions in the amplicon sequence that must match a given nucleotide for a read to be considered either HDR or WT. This sequence must be the same length as the reference sequence, and all other positions should be "-". The total sequence region that must match is determined by both the positions of specified nucleotides and by the required_match_left and required_match_right parameters. These parameters give the length of sequence which must match the provided reference sequence to the left of the leftmost specified nucleotide, or to the right of the rightmost specified nucleotide.

The replicates parameter is a data.frame with a format as below.

name	replicate	type	bam
MUL1_rs6700034	c1.2	cDNA	bam_amplicon/MUL1_rs6700034_cDNA_rep1_pcr2.sortedByCoord.bam
MUL1_rs6700034	c1.3	cDNA	bam_amplicon/MUL1_rs6700034_cDNA_rep1_pcr3.sortedByCoord.bam

name	Indicates the region that a given replicate corresponds with. All replicates matching the name in the regions table will be used.
replicate	an ID for the replicate, which must be unique among replicates for the region.
type	Must have the value "cDNA" or "gDNA", indicating whether a given replicate contains data for cDNA or gDNA.
bam	the path (relative to the working directory) to a BAM file with sequencing reads for the replicate.

Statistics can only be computed if there are at least 2 replicates of each type (cDNA and gDNA). Replicates are matched to the region based on the name column.

The returned object is a list, where each item is the result for one region. The result for a region (e.g. results[[1]]) is itself a list, with the following items:

region_stats	Main analysis output, with statistics indicating whether the HDR/WT levels differ in cDNA relative to gDNA.
replicate_stats	A data.frame with a row for each replicate, which has counts of reads in different categories and some summary values.
region	Details of the input region the result corresponds to.
replicates	Details of the input replicates the result corresponds to.
opts	A list containing the options that were given for the analysis.
type	Has the value “grep_analysis”, and is used by plotting functions that take a full grep_result list as input.

The main output of interest is the 'region_stats' field, which is a one-row data.frame with the following values:

name	Name of the region.
effect	Estimated effect size - the amount by which the HDR:WT ratio differs in cDNA relative to gDNA.
effect_sd	Standard deviation of the estimated effect size.
effect_confint_lo	Lower bound of the 95% confidence interval for the effect size.
effect_confint_hi	Upper bound of the 95% confidence interval for the effect size.
df_estimated	The degrees of freedom used in the unequal variances t-test, which is estimated from the data.
pval	A p value from the unequal variants t-test.

deletion_analysis

grep_summary_plot

# Note: to run an analysis you need BAM files from a GenIE experiment.
# An example is available for download using download_example().


download_example(dir = "~/genie_example", name = "MUL1")
# Data are downloaded and we can run an rgenie analysis
setwd("~/genie_example/MUL1/")
regions = readr::read_tsv("mul1.genie_regions.tsv")
replicates = readr::read_tsv("mul1.genie_replicates.tsv")
grep_results = grep_analysis(regions, replicates)
grep_summary_plot(grep_results[[1]])