bedtools_closest: bedtools_closest

View source: R/closest.R

bedtools_closest    R Documentation

bedtools_closest

Description

Finds the features in one dataset that are closest to those in another. Supports restriction by strand, upstream, downstream, and overlap. There are several methods for resolving ties. Optionally returns the distance.

Usage

bedtools_closest(cmd = "--help")
R_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE,
                   D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, 
                   id = FALSE, fu = FALSE, fd = FALSE,
                   t = c("all", "first", "last"), mdb = c("each", "all"), 
                   k = 1L, names = NULL, filenames = FALSE, N = FALSE)
do_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE,
                    D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, 
                    id = FALSE, fu = FALSE, fd = FALSE,
                    t = c("all", "first", "last"), mdb = c("each", "all"), 
                    k = 1L, names = NULL, filenames = FALSE, N = FALSE)

Arguments

cmd

String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing.

a

Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or a ranged data structure, such as a GRanges. Each feature in a is compared to b in search of nearest neighbors. Use "stdin" for input from another process (presumably while running via Rscript). For streaming from a subprocess, prefix the command string with “<”, e.g., "<grep foo file.bed". Any streamed data is assumed to be in BED format.

b

Like a, except supports multiple datasets, either as a vector/list or a comma-separated string. Also supports file glob patterns, i.e., strings containing the wildcard, “*”.

s

Require same strandedness. That is, find the closest feature in b that is on the same strand as the feature in a. By default, the closest feature is reported without respect to strand. Note that this is the exact opposite of the Bioconductor default, where strand is taken into account unless ignore.strand=TRUE.

S

Require opposite strandedness. That is, find the closest feature in b that is on the opposite strand from the feature in a. By default, the closest feature is reported without respect to strand.

d

In addition to the closest feature in b, report its distance to a as an extra column. The reported distance for overlapping features will be 0.

D

Like d, report the closest feature in b and its distance to a as an extra column. However, unlike d, D conveys a notion of upstream that is useful with other arguments. See details.

io

Ignore features in b that overlap a. That is, we want close, yet not touching features only.

iu

Ignore features in b that are upstream of features in a. This option requires D and follows its orientation rules for determining what is “upstream”.

id

Ignore features in b that are downstream of features in a. This option requires D and follows its orientation rules for determining what is “downstream”.

fu

Choose first from features in b that are upstream of features in a. This option requires D and follows its orientation rules for determining what is “upstream”.

fd

Choose first from features in b that are downstream of features in a. This option requires D and follows its orientation rules for determining what is “downstream”.

t

Specify how ties for the closest feature should be handled. A tie occurs when two features in b have exactly the same “closeness” to a. By default, all such features in b are reported. The options are “all”, “first” and “last”.

mdb

How multiple databases should be resolved, either “each” (find closest in each b dataset independently) or “all” (combine all b datasets prior to the search).

k

Not supported yet. Report the k closest hits. Default is 1. If t is "all", all ties will still be reported.

names

When using multiple databases, provide an alias for each to use instead of their integer index. If a single string, can be comma-separated.

filenames

When using multiple databases, use their complete filename instead of their integer index.

N

Not yet supported. Require that the query and the closest hit have different names. For BED, the 4th column is compared. Related use cases are often solved by passing a single argument to nearest.

Details

As with all commands, there are three interfaces to the closest command:

bedtools_closest

Parses the bedtools command line and compiles it to the equivalent R code.

R_bedtools_closest

Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.

do_bedtools_closest

Evaluates the result of R_bedtools_closest. Recommended only for demonstration and testing. It is best to integrate the compiled code into an R script, after studying it.
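
The relationship between the three interfaces can be sketched as follows. This is a minimal illustration that assumes the HelloRanges package is installed together with the unit test data used in the Examples section; the variable names (code_cli, code_r, ans, ans_do, old_wd) are illustrative only.

```r
library(HelloRanges)

## Assumes the unit test data shipped with HelloRanges is available.
old_wd <- setwd(system.file("unitTests", "data", "closest",
                            package = "HelloRanges"))

## Command line interface: returns the compiled R code, not the result
code_cli <- bedtools_closest("-a a.bed -b b.bed -d")
code_cli

## R interface: the same request expressed as R arguments
code_r <- R_bedtools_closest("a.bed", "b.bed", d = TRUE)

## Evaluate the compiled code to obtain the closest hits
ans <- eval(code_cli)

## do_ interface: compile and evaluate in one step
ans_do <- do_bedtools_closest("a.bed", "b.bed", d = TRUE)

## Restore the previous working directory
setwd(old_wd)
```

Printing code_cli before evaluating it is the point of the design: the generated code can be read, adapted, and pasted into a script instead of being run blindly.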

The generated code includes calls to utilities like nearest, precede and follow. nearest lacks the ability to restrict its search by direction/overlap, so some complex code is generated to support all of the argument combinations.
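
The difference between these utilities can be seen on a toy example. The sketch below assumes the GenomicRanges package is installed; the ranges are made-up data, not taken from the package's test files, and every range is placed on the + strand to keep the orientation unambiguous.

```r
library(GenomicRanges)

## Toy ranges (hypothetical data), all on the + strand
a <- GRanges("chr1", IRanges(start = c(10, 50), end = c(20, 60)),
             strand = "+")
b <- GRanges("chr1",
             IRanges(start = c(1, 30, 55, 100), end = c(5, 35, 58, 110)),
             strand = "+")

## Index of the nearest range in b for each range in a;
## an overlapping range counts as nearest
nearest(a, b)

## Index of the nearest range in b that each range in a precedes
## (overlapping ranges are excluded)
precede(a, b)

## Index of the nearest range in b that each range in a follows
## (overlapping ranges are excluded)
follow(a, b)

## Nearest hits together with a "distance" metadata column
distanceToNearest(a, b)
```

Because precede and follow skip overlapping ranges and search in one direction only, they can serve as building blocks for directional or non-overlapping searches that nearest alone cannot express.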

Arguments io, iu, id, fu, and fd require a notion of upstream/downstream to be defined via D, which accepts one of these values:

ref

Report distance with respect to the reference genome. B features with a lower (start, stop) are “upstream”.

a

Report distance with respect to A. When A is on the - strand, “upstream” means B has a higher (start,stop).

b

Report distance with respect to B. When B is on the - strand, “upstream” means A has a higher (start,stop).
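
The orientation rules above can be written down as a small base R predicate. This is only a sketch: is_upstream is a hypothetical helper, not part of HelloRanges or bedtools, it handles non-overlapping features only, and the + strand cases for "a" and "b" are assumptions inferred by symmetry from the - strand rules stated above.

```r
## Hypothetical helper, not part of HelloRanges or bedtools.
## Returns TRUE when the B feature counts as "upstream" under orientation D.
## Assumes the two features do not overlap.
is_upstream <- function(a_start, a_end, a_strand,
                        b_start, b_end, b_strand,
                        D = c("ref", "a", "b")) {
    D <- match.arg(D)
    b_lower <- b_end < a_start    # B lies entirely below A on the reference
    b_higher <- b_start > a_end   # B lies entirely above A on the reference
    switch(D,
           ## reference genome: lower coordinates are upstream
           ref = b_lower,
           ## relative to A: flipped when A is on the - strand
           a = if (a_strand == "-") b_higher else b_lower,
           ## relative to B: on the - strand, upstream means A is higher,
           ## i.e. B is lower; the + strand case is assumed by symmetry
           b = if (b_strand == "-") b_lower else b_higher)
}

## B lies below A on the reference genome
is_upstream(100, 200, "+", 10, 50, "+", D = "ref")
## A is on the - strand, so a B with higher coordinates is upstream
is_upstream(100, 200, "-", 300, 400, "+", D = "a")
```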

Value

A language object containing the compiled R code, evaluating to a Pairs object with the closest hits from a and b. If d is TRUE or D is not "none", the result has a metadata column called “distance”.

Author(s)

Michael Lawrence

References

http://bedtools.readthedocs.io/en/latest/content/tools/closest.html

See Also

nearest-methods for the various ways to find the nearest ranges.

Examples

## Not run: 
setwd(system.file("unitTests", "data", "closest", package="HelloRanges"))

## End(Not run)

## basic
bedtools_closest("-a a.bed -b b.bed -d")
## strand-specific
bedtools_closest("-a strand-test-a.bed -b strand-test-b.bed -s")
## break ties
bedtools_closest("-a close-a.bed -b close-b.bed -t first")
## multiple databases
bedtools_closest("-a mq1.bed -b mdb1.bed,mdb2.bed,mdb3.bed -names a,b,c")
## ignoring upstream
bedtools_closest("-a d.bed -b d_iu.bed -D ref -iu")

lawremi/HelloRanges documentation built on Oct. 29, 2023, 4:08 p.m.