multicoloc: Multiple colocalization analyses

Multiple colocalization analyses.

multicoloc(analysis1, analysis2,
           chrom, pos_start, pos_end, pos, 
           hgncid, ensemblid, rs, surround = 0,
           hard_clip = FALSE,
           style = 'heatplot',
           thresh_analysis = 0.1, thresh_entity = 0.1, 
           dbc = getOption("gtx.dbConnection", NULL))
multicoloc.data(analysis1, analysis2,
                chrom, pos_start, pos_end, pos, 
                hgncid, ensemblid, rs, surround = 0,
                hard_clip = FALSE, 
                dbc = getOption("gtx.dbConnection", NULL))

`analysis1`	The key value(s) for GWAS analysis/es to analyze
`analysis2`	The key value for the second GWAS analysis to analyze
`chrom`	Argument passed to `gtxregion()`
`pos_start`	Argument passed to `gtxregion()`
`pos_end`	Argument passed to `gtxregion()`
`pos`	Argument passed to `gtxregion()`
`hgncid`	Argument passed to `gtxregion()`
`ensemblid`	Argument passed to `gtxregion()`
`rs`	Argument passed to `gtxregion()`
`surround`	Argument passed to `gtxregion()`
`hard_clip`	Logical, see details
`style`	Character specifying plot style(s)
`thresh_analysis`	Probability threshold for inclusion in plots
`thresh_entity`	Probability threshold for inclusion in plots
`dbc`	Database connection

multicoloc() is an entry point for multiple colocalization analyses. It supports the most common use case, to colocalize association signals from one or more analyses of gene expression/protein levels (specified by analysis1), each of which includes association statistics for multiple entities (genes or proteins), against an association signal from a single analysis (typically a disease or clinical phenotype, specified by analysis2). For this use case, multicoloc() is typically more convenient and (much) faster than looping over multiple calls to coloc().

multicoloc() offers a choice of two algorithms for selecting the genomic region from which summary statistics are used for colocalization analyses, controlled by the argument hard_clip. The default, hard_clip=FALSE, uses the full set of available summary statistics for the entity/ies analyzed from each analysis included in analysis1. In this mode, the genomic range arguments chrom, pos_start, pos_end, hgncid etc. are only used (via gtxregion()) to determine the set of entities to be analyzed. Typically, this results in different entities being analyzed for colocalization using different (albeit overlapping) regions of the association signal from analysis2. The alternative, hard_clip=TRUE, uses only summary statistics within the genomic range specified (via the arguments passed to gtxregion()). In this mode, different entities will typically be analyzed for colocalization using the same or similar regions of the association signal from analysis2, depending on how the genomic range overlaps the summary statistics available for each entity. The exact algorithms used in each mode are detailed below. (A forthcoming release will allow the regions used to be visualized via the style argument.)
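As an illustration of the two modes, the sketch below makes the same call twice, changing only hard_clip. The analysis keys 'eqtl_tissue_a', 'eqtl_tissue_b' and 'gwas_disease' are hypothetical placeholders, the gene symbol and the surround value (assumed here to be in base pairs) are arbitrary choices, and the calls are guarded so that they only run when the gtx package is installed and a database connection has been set in the gtx.dbConnection option.

```r
# Hypothetical analysis keys; replace with keys that exist in your database.
expr_analyses <- c("eqtl_tissue_a", "eqtl_tissue_b")
disease_analysis <- "gwas_disease"

# Only attempt the calls if gtx is installed and a connection is configured.
have_gtx <- requireNamespace("gtx", quietly = TRUE) &&
  !is.null(getOption("gtx.dbConnection", NULL))

if (have_gtx) {
  # Default mode: the region only selects entities; each entity then uses
  # all of its available summary statistics.
  res_soft <- gtx::multicoloc(analysis1 = expr_analyses,
                              analysis2 = disease_analysis,
                              hgncid = "ABCA1", surround = 500000,
                              hard_clip = FALSE)

  # Hard clipping: only summary statistics inside the region are used.
  res_hard <- gtx::multicoloc(analysis1 = expr_analyses,
                              analysis2 = disease_analysis,
                              hgncid = "ABCA1", surround = 500000,
                              hard_clip = TRUE)
}
```

Comparing the two results for the same entities is one way to see how sensitive a colocalization is to the choice of region.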

When hard_clip=FALSE, the algorithm used by multicoloc() first determines a “seed region” using the genomic region arguments, as interpreted by gtxregion(). Next, a set of entities is determined from the summary statistics for all analyses included in analysis1, consisting of all entities with summary statistics overlapping this “seed region”. (A better implementation of overlap is forthcoming.) Finally, an “expanded region” is determined that includes all available summary statistics for all of these entities. This “expanded region” is then used for each colocalization analysis, for each entity within each analysis within analysis1, against analysis2.

Notes and Warnings: This algorithm only makes sense if the summary statistics are restricted to localized regions around each entity, such as cis-regions for eQTL analyses. Typically, different entities will be evaluated for colocalization using different regions of summary statistics for analysis2. Because the set of entities is determined by aggregating over all analyses in analysis1, unexpected results may be produced if a given entity has summary statistics at very different genomic positions in different analyses. A “seed region” specified using only the index variant from a GWAS signal (e.g. using pos or rs with the default surround=0) is not guaranteed to select all entities whose cis-regions span that single base pair “seed region”, because some entities may be missing summary statistics for the variant itself. [This last issue will be fixed in a forthcoming update.]
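The seed/expanded region logic described above can be sketched on toy data in base R. This is only an illustration of the description, not the code used by gtx: the data frame layout, the column names (analysis, entity, pos), and the treatment of “overlap” as “has at least one summary statistic inside the seed region” are all assumptions, and a single chromosome is implied.

```r
# Toy summary statistics (hypothetical layout; one row per variant).
stats <- data.frame(
  analysis = c("eqtl_a", "eqtl_a", "eqtl_a", "eqtl_a",
               "eqtl_b", "eqtl_b", "eqtl_a", "eqtl_a"),
  entity   = c("GENE1", "GENE1", "GENE2", "GENE2",
               "GENE2", "GENE2", "GENE3", "GENE3"),
  pos      = c(100, 400, 300, 700, 350, 900, 1200, 1500),
  stringsAsFactors = FALSE
)

# Seed region, as it might be resolved from the genomic region arguments.
seed_start <- 350
seed_end   <- 450

# Step 1: entities with any summary statistic inside the seed region,
# aggregated over all analyses.
in_seed  <- stats$pos >= seed_start & stats$pos <= seed_end
entities <- unique(stats$entity[in_seed])

# Step 2: expanded region covering all statistics for those entities.
selected <- stats[stats$entity %in% entities, ]
expanded <- range(selected$pos)

# Extent of the statistics available for each selected entity
# (rows are start and end, columns are entities).
per_entity <- sapply(split(selected$pos, selected$entity), range)

print(entities)
print(expanded)
print(per_entity)
```

In this toy example GENE2 is selected only because of its statistics in the second analysis, and GENE1 and GENE2 end up with different extents, which mirrors the warnings above about aggregating over analyses and about entities being evaluated over different regions.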

When hard_clip=TRUE, the algorithm used by multicoloc() is simply to select all summary statistics within the genomic region arguments, as interpreted by gtxregion(). The typical use case is to set this genomic region to the extent of the ‘significant’ part of the association signal for analysis2. The hard_clip=TRUE mode is (currently) not the default option, because in initial exploratory analyses it is unusual to specify this region precisely, and because we believe the number of ‘false positive’ colocalizations is reduced by including the whole cis-eQTL region (assuming that the strongest disease signal in the region ‘should’ be aligned with the strongest cis-eQTL signal).

Notes and Warnings: In general, a given entity may have summary statistics that only partially overlap the genomic region specified, which may have unexpected consequences. In a future release it will be possible to automatically subset to entities that overlap the genomic region specified by more than a chosen percentage. When using a hgncid or ensemblid gene identifier to specify the region from which to use summary statistics, the default surround=0 will not include the full cis-eQTL region.
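A toy base R sketch of hard clipping, and of why partial overlap matters, is given below. The data layout and column names are assumptions, and the “fraction of variants retained” with its 0.5 cutoff is a hypothetical diagnostic chosen for illustration; it is not a feature of multicoloc().

```r
# Toy summary statistics for two entities (hypothetical layout).
stats <- data.frame(
  entity = c("GENE1", "GENE1", "GENE1", "GENE2", "GENE2", "GENE2", "GENE2"),
  pos    = c(100, 400, 600, 500, 700, 900, 1100),
  stringsAsFactors = FALSE
)

# Genomic region, as it might be resolved from the region arguments.
region_start <- 450
region_end   <- 950

# Hard clip: keep only statistics inside the region.
clipped <- stats[stats$pos >= region_start & stats$pos <= region_end, ]

# Fraction of each entity's variants that survive the clip.
n_all  <- table(stats$entity)
n_kept <- table(factor(clipped$entity, levels = names(n_all)))
frac_kept <- as.numeric(n_kept) / as.numeric(n_all)
names(frac_kept) <- names(n_all)

# Hypothetical filter: entities retaining more than half of their variants.
keep <- names(frac_kept)[frac_kept > 0.5]

print(clipped)
print(frac_kept)
print(keep)
```

Here GENE1 retains only one of its three variants after clipping, so a colocalization result for it would rest on a small part of its association signal.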

In a future release the output of multicoloc() will be a long-format data frame with the full colocalization results (all priors, Bayes factors and posteriors, numbers of variants, and the minimum and maximum positions used).

multicoloc() returns a data frame containing the results of the colocalization analyses; see coloc.fast() for details. The plot is generated as a side effect.

Toby Johnson Toby.x.Johnson@gsk.com

tobyjohnson/gtx documentation built on Aug. 30, 2019, 8:07 p.m.