bedtools_groupby: bedtools_groupby

View source: R/groupby.R

bedtools_groupby    R Documentation

bedtools_groupby

Description

Summarize the columns of a set of ranges by group, applying one or more aggregation operations (such as sum, mean or collapse) to the selected columns within each group. By default, the grouping is by range.

Usage

    bedtools_groupby(cmd = "--help")
    R_bedtools_groupby(i, g = 1:3, c, o = "sum", delim = ",")
    do_bedtools_groupby(i, g = 1:3, c, o = "sum", delim = ",")

Arguments

cmd

String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing.

i

Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or a ranged data structure, such as a GRanges. Use "stdin" for input from another process (presumably while running via Rscript). For streaming from a subprocess, prefix the command string with “<”, e.g., "<grep foo file.bed". Any streamed data is assumed to be in BED format.

g

Column index(es) for grouping the input. Multiple columns may be given as a comma-separated list. By default (columns 1 to 3), the grouping is by range.

c

Specify columns (by integer index) from the input file to operate upon (see o option, below). Multiple columns can be specified in a comma-delimited list.

o

Specify the operations (by name) that should be applied to the columns indicated in c. Multiple operations can be specified in a comma-delimited list. Recycling is used to align c and o. See the details for the available operations.

delim

Delimiter character used to collapse strings.
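
The recycling rule mentioned for c and o can be illustrated with a small base R sketch. The helper align_ops is hypothetical and not part of HelloRanges; it only shows the general idea of pairing each column index with an operation name, and the package's actual alignment logic may differ.

```r
# Hypothetical helper (an assumption for illustration, not a HelloRanges function):
# pair each column index with an operation name, recycling the shorter vector.
align_ops <- function(cols, ops = "sum") {
  n <- max(length(cols), length(ops))
  data.frame(
    column = rep_len(cols, n),
    operation = rep_len(ops, n),
    stringsAsFactors = FALSE
  )
}

# One operation applied to two columns
align_ops(cols = c(5, 6), ops = "sum")

# Two operations applied to one column
align_ops(cols = 5, ops = c("min", "max"))
```

Under this reading, "-c 5,6 -o sum" would sum both columns, and "-c 5 -o min,max" would compute two summaries of the same column.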

Details

As with all commands, there are three interfaces to the groupby command:

bedtools_groupby

Parses the bedtools command line and compiles it to the equivalent R code.

R_bedtools_groupby

Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.

do_bedtools_groupby

Evaluates the result of R_bedtools_groupby. Recommended only for demonstration and testing. It is best to integrate the compiled code into an R script, after studying it.
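
The split between compiling code and evaluating it can be sketched in base R alone. The example below does not call HelloRanges; the toy table and the aggregate call are assumptions chosen to mimic a sum over column 4 grouped by range, and the code that the package actually generates will look different.

```r
# Toy BED-like table standing in for a file on disk (made-up values).
bed <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(0L, 0L, 10L),
  end   = c(10L, 10L, 20L),
  score = c(1, 2, 5),
  stringsAsFactors = FALSE
)

# "Compile": build an unevaluated call (a language object).
# This specific call is illustrative, not what HelloRanges emits.
code <- quote(aggregate(score ~ chrom + start + end, data = bed, FUN = sum))

# The language object can be printed, studied, and pasted into a script.
print(code)

# "Do": evaluating the language object runs the aggregation.
result <- eval(code)
result
```

Keeping the two steps separate is what allows the generated code to be reviewed and edited before it is run.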

The workhorse for aggregation in R is aggregate, and we have extended its interface to make it more convenient. See aggregate for details.

The following operations are supported (with R translation):

sum

sum(X)

min

min(X)

max

max(X)

absmin

min(abs(X))

absmax

max(abs(X))

mean

mean(X)

median

median(X)

mode

distmode(X)

antimode

distmode(X, anti=TRUE)

collapse

unstrsplit(X, delim)

distinct

unstrsplit(unique(X), delim)

count

lengths(X)

count_distinct

lengths(unique(X))

sstdev

sd(X)

freq

table(X)

first

drop(heads(X, 1L))

last

drop(tails(X, 1L))
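
To make a few of these translations concrete, here is a minimal base R sketch on a toy grouped table. It uses paste(..., collapse = delim) and length() as plain-vector stand-ins for unstrsplit() and lengths(), which in the generated code operate on list-like grouped data; the data and column names are made up.

```r
# Toy input: one grouping column, one numeric column, one label column.
x <- data.frame(
  group = c("g1", "g1", "g1", "g2"),
  value = c(-4, 2, 2, 7),
  label = c("a", "b", "b", "c"),
  stringsAsFactors = FALSE
)
delim <- ","

# Split the rows by group, then apply each operation within a group.
by_group <- split(x, x$group)

ops <- data.frame(
  group          = names(by_group),
  absmin         = vapply(by_group, function(d) min(abs(d$value)), numeric(1)),
  absmax         = vapply(by_group, function(d) max(abs(d$value)), numeric(1)),
  collapse       = vapply(by_group, function(d) paste(d$label, collapse = delim), character(1)),
  distinct       = vapply(by_group, function(d) paste(unique(d$label), collapse = delim), character(1)),
  count          = vapply(by_group, function(d) length(d$label), integer(1)),
  count_distinct = vapply(by_group, function(d) length(unique(d$label)), integer(1)),
  stringsAsFactors = FALSE
)
ops
```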

For the sake of simplicity, and because the use cases are not clear, we do not support aggregation of every column. Here are some of the restrictions:

  • No support for the last column of GFF (the ragged list of attributes).

  • No support for the INFO, FORMAT and GENO fields of VCF.

  • No support for the FLAG field of BAM (bedtools does not support this either).

Value

A language object containing the compiled R code, generally evaluating to a DataFrame, with a column for each grouping variable and each summarized variable. As a special case, if there are no grouping variables specified, then the grouping is by range, and an aggregated GRanges is returned.

Note

We admit that using column subscripts for c makes code hard to read. All the more reason to just write R code.
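
As a rough illustration of the readability point, the sketch below computes a mean of column 6 grouped by columns 1 and 4, first by position and then by name, using base R aggregate on a made-up VCF-like table. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL) and is not the code HelloRanges generates.

```r
# Made-up VCF-like fixed fields; values are for illustration only.
vcf <- data.frame(
  CHROM = c("chr1", "chr1", "chr1", "chr2"),
  POS   = c(100L, 200L, 300L, 50L),
  ID    = ".",
  REF   = c("A", "A", "G", "A"),
  ALT   = c("T", "C", "T", "G"),
  QUAL  = c(10, 30, 50, 40),
  stringsAsFactors = FALSE
)

# Positional form, mirroring "-g 1,4 -c 6 -o mean"
by_index <- aggregate(vcf[6], by = vcf[c(1, 4)], FUN = mean)

# Named form: the intent is visible without counting columns
by_name <- aggregate(QUAL ~ CHROM + REF, data = vcf, FUN = mean)

by_index
by_name
```

Both calls produce the same summary; the named form simply states which fields are grouped and which is averaged.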

Author(s)

Michael Lawrence

References

http://bedtools.readthedocs.io/en/latest/content/tools/groupby.html

See Also

aggregate-methods for general aggregation.

Examples

    ## Not run: 
    setwd(system.file("unitTests", "data", "groupby", package="HelloRanges"))
    ## End(Not run)

    ## aggregation by range
    bedtools_groupby("-i values3.header.bed -c 5")

    ## average variant qualities by chromosome and reference base
    ## Not run: 
    indexTabix(bgzip("a_vcfSVtest.vcf", overwrite=TRUE), "vcf")
    ## End(Not run)
    bedtools_groupby("-i a_vcfSVtest.vcf.bgz -g 1,4 -c 6 -o mean")

lawremi/HelloRanges documentation built on Oct. 29, 2023, 4:08 p.m.