bedtools_groupby: bedtools_groupby
In HelloRanges: Introduce *Ranges to bedtools users


Summarize a ranged dataset by group: the input is grouped by one or more columns, and summary operations such as sum or mean are applied to other columns within each group.

    bedtools_groupby(cmd = "--help")
    R_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",")
    do_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",")

`cmd`	String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing.
`i`	Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or a ranged data structure, such as a GRanges. Use `"stdin"` for input from another process (presumably while running via `Rscript`). For streaming from a subprocess, prefix the command string with “<”, e.g., `"<grep foo file.bed"`. Any streamed data is assumed to be in BED format.
`g`	Column index(es) for grouping the input. Columns may be comma-separated. By default, the grouping is by range.
`c`	Specify columns (by integer index) from the input file to operate upon (see `o` option, below). Multiple columns can be specified in a comma-delimited list.
`o`	Specify the operations (by name) that should be applied to the columns indicated in `c`. Multiple operations can be specified in a comma-delimited list. Recycling is used to align `c` and `o`. See the details for the available operations.
`delim`	Delimiter character used to collapse strings.
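
The recycling that aligns `c` and `o` can be pictured with ordinary R vector recycling. The sketch below is only an illustration under that assumption; `pair_ops` is a hypothetical helper, not a HelloRanges function, and the package's internal alignment may differ in detail.

```r
# Hypothetical helper (not part of HelloRanges): pair each column index in
# `c` with an operation name in `o`, recycling the shorter vector.
pair_ops <- function(c, o) {
    n <- max(length(c), length(o))
    data.frame(column = rep_len(c, n),
               op = rep_len(o, n),
               stringsAsFactors = FALSE)
}

# Two columns, one operation: "sum" is reused for both columns.
pairs <- pair_ops(c = c(5L, 6L), o = "sum")
print(pairs)
```

Under this reading, `-c 5,6 -o sum` would sum both columns, while `-c 5,6 -o sum,mean` would sum column 5 and average column 6.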

As with all commands, there are three interfaces to the groupby command:

bedtools_groupby: Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_groupby: Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_groupby: Evaluates the result of R_bedtools_groupby. Recommended only for demonstration and testing. It is best to integrate the compiled code into an R script, after studying it.

The workhorse for aggregation in R is aggregate and we have extended its interface to make it more convenient. See aggregate for details.
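
For readers unfamiliar with the idiom, the following sketch uses base R's `aggregate` on a plain data frame. It is only an analogy: the compiled HelloRanges code calls the extended `aggregate` methods on ranges objects, and the toy data and column names here are made up.

```r
# Toy data; the column names are made up for illustration.
df <- data.frame(
    chrom = c("chr1", "chr1", "chr1", "chr2"),
    ref   = c("A", "A", "C", "A"),
    qual  = c(10, 30, 50, 70),
    stringsAsFactors = FALSE
)

# Group by chrom and ref, then average qual within each group.
res <- aggregate(qual ~ chrom + ref, data = df, FUN = mean)
print(res)
```

The result has one row per group, with a column for each grouping variable and one for the summarized variable.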

The following operations are supported (with R translation):

sum: sum(X)
min: min(X)
max: max(X)
absmin: min(abs(X))
absmax: max(abs(X))
mean: mean(X)
median: median(X)
mode: distmode(X)
antimode: distmode(X, anti=TRUE)
collapse: unstrsplit(X, delim)
distinct: unstrsplit(unique(X), delim)
count: lengths(X)
count_distinct: lengths(unique(X))
sstdev: sd(X)

freq: table(X)
first: drop(heads(X, 1L))
last: drop(tails(X, 1L))
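
To make a few of these translations concrete, here is a minimal base R sketch that applies them group by group to a toy vector. It is an approximation under stated assumptions: in the compiled code `X` is a list-like object on which calls such as `sum(X)` and `lengths(X)` work across all groups at once, whereas this sketch loops over groups and uses `paste` in place of `unstrsplit`. The data and the `ops` list are made up for illustration, and `distmode` is not reproduced.

```r
# Toy values and a grouping key, both made up for illustration.
x   <- c(-3, 1, 4, 4, 4, -2)
key <- c("a", "a", "a", "b", "b", "b")
X   <- split(x, key)   # one numeric vector per group

delim <- ","
ops <- list(
    sum            = function(v) sum(v),
    absmin         = function(v) min(abs(v)),
    absmax         = function(v) max(abs(v)),
    collapse       = function(v) paste(v, collapse = delim),
    distinct       = function(v) paste(unique(v), collapse = delim),
    count          = function(v) length(v),
    count_distinct = function(v) length(unique(v)),
    sstdev         = function(v) sd(v)
)

# Apply every operation to every group: one row per group,
# one column per operation.
res <- do.call(rbind, lapply(X, function(v) {
    as.data.frame(lapply(ops, function(f) f(v)), stringsAsFactors = FALSE)
}))
print(res)
```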

For the sake of simplicity, and because the use cases are not clear, we do not support aggregation of every column. Here are some of the restrictions:

No support for the last column of GFF (the ragged list of attributes).
No support for the INFO, FORMAT and GENO fields of VCF.
No support for the FLAG field of BAM (bedtools does not support this either).

A language object containing the compiled R code, generally evaluating to a DataFrame, with a column for each grouping variable and each summarized variable. As a special case, if there are no grouping variables specified, then the grouping is by range, and an aggregated GRanges is returned.

We admit that using column subscripts for c makes code hard to read. All the more reason to just write R code.

Michael Lawrence

http://bedtools.readthedocs.io/en/latest/content/tools/groupby.html

aggregate-methods for general aggregation.

## Not run: 
setwd(system.file("unitTests", "data", "groupby", package="HelloRanges"))

## End(Not run)
## aggregation by range
bedtools_groupby("-i values3.header.bed -c 5")
## average variant qualities by chromosome and reference base
## Not run: 
indexTabix(bgzip("a_vcfSVtest.vcf", overwrite=TRUE), "vcf")

## End(Not run)
bedtools_groupby("-i a_vcfSVtest.vcf.bgz -g 1,4 -c 6 -o mean")