gtxregion: Interface to define a genomic region

View source: R/gtxbase.R

Description

Unified interface to define a genomic region.

Usage

gtxregion(chrom, pos_start, pos_end, 
          hgncid, ensemblid, rs, pos, surround = 500000, 
          dbc = getOption("gtx.dbConnection", NULL))
gtxwhere(chrom, 
         pos, pos_ge, pos_le, 
         pos_end_ge, pos_start_le, 
         pos_start_ge, pos_end_le, 
         rs, hgncid, ensemblid,
         tablename)

Arguments

chrom

Character specifying chromosome

pos_start

Integer start position on chromosome

pos_end

Integer end position on chromosome

pos

Integer position on chromosome

hgncid

HGNC gene identifier

ensemblid

ENSEMBL gene identifier

rs

dbSNP rs identifier

surround

Distance around entity to include in region

pos_ge

Position greater-or-equal required

pos_le

Position less-or-equal required

pos_end_ge

End position greater-or-equal required

pos_start_le

Start position less-or-equal required

pos_start_ge

Start position greater-or-equal required

pos_end_le

End position less-or-equal required

tablename

Database table name

dbc

Database connection

Details

The gtxregion() function provides a unified interface for other functions to define a genomic region (or potentially for a user to invoke directly). For any valid combination of its optional arguments, it returns genomic coordinates (chromosome, start and end positions) as described below, using the database connection dbc to resolve any queries (such as the coordinates of a named gene).

When this functionality is accessed indirectly via higher-level functions (such as regionplot() and coloc()), it should be intuitive for most users and, if necessary, can be learned by example from the manual pages and vignettes for those functions. The one point to add is that the optional arguments are used according to a priority order, which is exactly the order of arguments in the function definition. For example, if chrom, pos_start, pos_end and hgncid are all provided, hgncid has lower priority and is ignored. Similarly, if hgncid and pos are provided, pos has lower priority and is ignored.

It is an intended design feature that pos and rs are lowest in the priority order. When used in conjunction with a higher-priority argument such as hgncid, a pos or rs argument does not affect the genomic region specified. This allows a function that wraps gtxregion() to use pos or rs for secondary purposes, such as highlighting a specific position or variant in a visual display. Thus, regionplot(..., pos = 1234567, surround = 500000) selects a 500kb region around position 1234567 and visually highlights any variant present at position 1234567, whereas regionplot(..., hgncid = 'ABC123', surround = 10000, pos = 1234567) selects a 10kb region around the ABC123 gene and visually highlights any variant present at position 1234567.

The remainder of this manual page is more technical documentation, intended for programmers writing new high level functions that will work alongside regionplot() and coloc(), and should be read in combination with the source code.

The gtxregion() function resolves its arguments to genomic coordinates as follows:

If the arguments chrom, pos_start and pos_end are all provided, these are checked for validity and used to directly specify the return value.

Otherwise, if the argument hgncid is provided, TABLE genes is queried (using dbc and gtxwhere) and a region spanning the gene(s) plus surrounding distance is returned. Otherwise, if the argument ensemblid (integer) is provided, TABLE genes is similarly queried.

Otherwise, if the arguments chrom and pos are both provided, these are checked for validity and used plus surrounding distance to directly specify the return value.

Otherwise, if the argument rs is provided, TABLE sites (sites_by_rs) is queried (using gtxwhere) and a region plus surrounding distance is returned.

The methods just described are implemented using if ... else if ... else if ... logic, so, for example, if a hgncid argument is provided then any ensemblid argument is ignored, and so on.
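The if ... else if chain can be sketched as follows. This is a minimal illustration, not the gtx source code: resolve_region_sketch() is a hypothetical name, lookup_gene and lookup_rs are hypothetical callbacks standing in for the database queries, and applying surround to each side of the entity is an assumption. The branch order follows the resolution steps listed above.

```r
# Hypothetical sketch of the priority logic; NOT the gtx implementation.
# lookup_gene / lookup_rs are assumed callbacks that stand in for database
# queries and return list(chrom, pos_start, pos_end).
resolve_region_sketch <- function(chrom, pos_start, pos_end,
                                  hgncid, ensemblid, pos, rs,
                                  surround = 500000,
                                  lookup_gene = NULL, lookup_rs = NULL) {
  if (!missing(chrom) && !missing(pos_start) && !missing(pos_end)) {
    # Highest priority: explicit coordinates, used as given (no surround).
    stopifnot(length(chrom) == 1, pos_start <= pos_end)
    region <- list(chrom = chrom, pos_start = pos_start, pos_end = pos_end)
  } else if (!missing(hgncid) || !missing(ensemblid)) {
    # Gene identifiers: hgncid is checked before ensemblid.
    gene <- if (!missing(hgncid)) lookup_gene(hgncid = hgncid)
            else lookup_gene(ensemblid = ensemblid)
    region <- list(chrom = gene$chrom,
                   pos_start = gene$pos_start - surround,  # assumption: surround
                   pos_end = gene$pos_end + surround)      # applies on each side
  } else if (!missing(chrom) && !missing(pos)) {
    stopifnot(length(chrom) == 1, length(pos) == 1)
    region <- list(chrom = chrom,
                   pos_start = pos - surround, pos_end = pos + surround)
  } else if (!missing(rs)) {
    site <- lookup_rs(rs = rs)
    region <- list(chrom = site$chrom,
                   pos_start = site$pos_start - surround,
                   pos_end = site$pos_end + surround)
  } else {
    stop("no valid combination of arguments")
  }
  # Same shape as the documented return value of gtxregion().
  list(chrom = as.character(region$chrom),
       pos_start = as.integer(region$pos_start),
       pos_end = as.integer(region$pos_end))
}

# Explicit coordinates win, so hgncid is ignored and no lookup is made.
str(resolve_region_sketch(chrom = 1, pos_start = 109616403,
                          pos_end = 109623689, hgncid = "ABC123"))
```

Because every branch ends in the same three-element list, a wrapping function can treat pos or rs as a display hint whenever a higher-priority branch has already fixed the region.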

The gtxwhere function provides a standardized and sanitized way to dynamically construct part of a SQL WHERE clause. This is best illustrated by the examples below. When more than one argument value is given, either as multiple values for a single argument or as values for more than one argument, the following logic seems most useful: multiple values for a single argument are combined or-wise, and multiple arguments are combined and-wise.
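The or-wise/and-wise rule can be illustrated with a small stand-alone helper. This is a sketch under assumptions: where_sketch() is a hypothetical function, it accepts only whole-number values, and the exact text that gtxwhere itself produces may differ.

```r
# Hypothetical helper, NOT gtxwhere: values within one argument are joined
# with OR, and the resulting per-argument groups are joined with AND.
where_sketch <- function(...) {
  args <- list(...)
  cols <- names(args)
  stopifnot(length(args) > 0, !is.null(cols),
            all(grepl("^[A-Za-z_][A-Za-z0-9_]*$", cols)))  # plain column names only
  groups <- vapply(seq_along(args), function(i) {
    vals <- args[[i]]
    # Accepting only whole numbers keeps arbitrary text out of the SQL.
    stopifnot(is.numeric(vals), length(vals) > 0,
              all(is.finite(vals)), all(vals == round(vals)))
    terms <- sprintf("%s=%.0f", cols[i], vals)
    paste0("(", paste(terms, collapse = " OR "), ")")
  }, character(1))
  paste(groups, collapse = " AND ")
}

where_sketch(chrom = 1, pos = c(109616403, 109623689))
# "(chrom=1) AND (pos=109616403 OR pos=109623689)"
```

Validating both the column names and the values before pasting is what makes a generated clause safe to embed in a larger query string.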

To use gtxwhere to select chromosome segments (such as genes or other entities, recombination rate segments, etc.) that wholly or partially overlap a query region, use pos_end_ge=query_start and pos_start_le=query_end. To select only chromosome segments that lie wholly within the query region, instead use pos_start_ge=query_start and pos_end_le=query_end.
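The difference between the two selections can be checked on a toy example in plain R, without a database. The segment names and coordinates below are made up for illustration; the two logical conditions mirror the argument pairs just described.

```r
# Made-up segments and query region, for illustration only.
segments <- data.frame(
  name      = c("left_overhang", "inside", "right_overhang", "outside"),
  pos_start = c(50L, 120L, 180L, 300L),
  pos_end   = c(110L, 150L, 250L, 400L),
  stringsAsFactors = FALSE
)
query_start <- 100L
query_end   <- 200L

# Whole or partial overlap: pos_end >= query_start AND pos_start <= query_end
overlapping <- segments$pos_end >= query_start & segments$pos_start <= query_end

# Wholly within the query: pos_start >= query_start AND pos_end <= query_end
contained <- segments$pos_start >= query_start & segments$pos_end <= query_end

segments$name[overlapping]  # "left_overhang" "inside" "right_overhang"
segments$name[contained]    # "inside"
```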

For identifiers that are represented (for efficiency) as integers in database tables but as strings in “user space”, gtxwhere is the layer at which string-to-integer checking and conversion should occur.
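As an illustration of such a checking-and-conversion step, the sketch below validates dbSNP rs identifiers and strips the prefix. It assumes that the integer stored in the database is simply the numeric part of the identifier; rs_to_number_sketch() is a hypothetical name and is not part of gtx.

```r
# Hypothetical conversion helper, NOT part of gtx.
# Assumption: the database stores the numeric part of the rs identifier.
rs_to_number_sketch <- function(rs) {
  stopifnot(is.character(rs), length(rs) > 0)
  ok <- grepl("^rs[1-9][0-9]*$", rs)
  if (!all(ok)) {
    stop("not a valid rs identifier: ", paste(rs[!ok], collapse = ", "))
  }
  # as.numeric rather than as.integer, because large rs numbers can exceed
  # R's 32-bit integer range.
  as.numeric(sub("^rs", "", rs))
}

rs_to_number_sketch("rs599839")
# 599839
```

Rejecting anything that does not match the expected pattern before conversion means that malformed or malicious strings never reach the SQL text.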

Value

gtxregion returns a named list with elements ‘chrom’ (character), ‘pos_start’ (integer) and ‘pos_end’ (integer).

gtxwhere returns a character string suitable for inclusion after the WHERE keyword in a SQL statement.

Author(s)

Toby Johnson Toby.x.Johnson@gsk.com

Examples

## Not run: 
  gtxregion(chrom = 1, pos_start = 109616403, pos_end = 109623689)
  # dies without an open ODBC connection

## End(Not run)
gtxwhere(rs = 'rs599839')
gtxwhere(chrom = 1, pos = c(109616403, 109623689))
gtxwhere(chrom = 1, pos_end_ge = 109616403, pos_start_le = 109623689)
## Not run: 
gtxwhere()

## End(Not run)

tobyjohnson/gtx documentation built on Aug. 30, 2019, 8:07 p.m.