# xGR2xGenes: Function to define genes from an input list of genomic...

In hfang-bristol/XGR: Exploring Genomic Relations for Enhanced Interpretation Through Enrichment, Similarity, Network and Annotation Analysis

## Description

xGR2xGenes defines genes crosslinked to an input list of genomic regions (GR). It also requires crosslink info with a score quantifying the link of a GR to a gene. The built-in crosslink info currently supported covers enhancer genes, eQTL genes, conformation genes and nearby genes (purely). The user can instead supply custom crosslink info via 'crosslink.customised'; if so, it takes priority over the built-in data.

## Usage

    xGR2xGenes(data, format = c("chr:start-end", "data.frame", "bed", "GRanges"),
      build.conversion = c(NA, "hg38.to.hg19", "hg18.to.hg19"),
      crosslink = c("genehancer", "PCHiC_combined", "GTEx_V6p_combined", "nearby"),
      crosslink.customised = NULL, cdf.function = c("original", "empirical"),
      scoring = F, scoring.scheme = c("max", "sum", "sequential"),
      scoring.rescale = F, nearby.distance.max = 50000,
      nearby.decay.kernel = c("rapid", "slow", "linear", "constant"),
      nearby.decay.exponent = 2, verbose = T,
      RData.location = "http://galahad.well.ox.ac.uk/bigdata")

## Arguments

• data: input genomic regions (GR). If formatted as "chr:start-end" (see the 'format' parameter below), GR should be provided as a vector in the form 'chrN:start-end', where N is either 1-22 or X and start (or end) is a genomic position; for example, 'chr1:13-20'. If formatted as 'data.frame', the first three columns correspond to the chromosome (1st column), the starting chromosome position (2nd column) and the ending chromosome position (3rd column). The 'bed' (browser extensible data) format is the same as the 'data.frame' format except that the position is a 0-based offset from the chromosome position. If the genomic regions are single positions rather than ranges, the ending chromosome position (3rd column) may be omitted. The data can also be a 'GRanges' object (in this case, formatted as 'GRanges')

• format: the format of the input data. It can be one of "data.frame", "chr:start-end", "bed" or "GRanges"

• build.conversion: the conversion from one genome build to another. The conversions supported are "hg38.to.hg19" and "hg18.to.hg19". By default it is NA (no conversion needed)

• crosslink: the built-in crosslink info with a score quantifying the link of a GR to a gene. It can be one of 'genehancer' (enhancer genes; PMID:28605766), 'nearby' (nearby genes; if so, please also specify the relevant parameters 'nearby.distance.max', 'nearby.decay.kernel' and 'nearby.decay.exponent' below), 'PCHiC_combined' (conformation genes; PMID:27863249), 'GTEx_V6p_combined' (eQTL genes; PMID:29022597), 'eQTL_scRNAseq_combined' (eQTL genes; PMID:29610479), 'eQTL_jpRNAseq_combined' (eQTL genes; PMID:28553958), 'eQTL_ImmuneCells_combined' (eQTL genes; PMID:24604202,22446964,26151758,28248954,24013639)

• crosslink.customised: the crosslink info with a score quantifying the link of a GR to a gene. A user-input matrix or data frame with 4 columns: 1st column for genomic regions (formatted as "chr:start-end", genome build 19), 2nd column for genes, 3rd for the crosslink score (crosslinking a genomic region to a gene, such as -log10 significance level), and 4th for contexts (optional; if not provided, it will be added as 'C'). Alternatively, it can be a file containing these 4 columns. Required, otherwise it will return NULL

• cdf.function: a character specifying how to transform the input crosslink score. It can be one of 'original' (no transformation) and 'empirical' for the empirical Cumulative Distribution Function (cdf; the score is thereby converted into p-value-like values in [0,1])

• scoring: logical to indicate whether gene-level scoring will be further calculated. By default, it is set to false

• scoring.scheme: the method used to calculate seed gene scores under a set of GR. It can be one of "sum" for adding up, "max" for the maximum, and "sequential" for the sequential weighting. The sequential weighting is done via $\sum_{i=1} \frac{R_{i}}{i}$, where $R_{i}$ is the $i^{th}$ rank (in decreasing order)

• scoring.rescale: logical to indicate whether gene scores will be further rescaled into the [0,1] range. By default, it is set to false

• nearby.distance.max: the maximum distance between genes and GR. Only genes no farther than this distance will be considered as seed genes. This parameter influences the distance-component weights calculated for nearby GR per gene

• nearby.decay.kernel: a character specifying a decay kernel function. It can be one of 'slow' for slow decay, 'linear' for linear decay, and 'rapid' for rapid decay. If no distance weight is used, please select 'constant'

• nearby.decay.exponent: a numeric specifying a decay exponent. By default, it is set to 2

• verbose: logical to indicate whether messages will be displayed on the screen. By default, it is set to true for display

• RData.location: the characters specifying the location of built-in RData files. See xRDataLoader for details
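To make the input conventions for 'data' and 'format' more concrete, here is a minimal base-R sketch that splits 'chrN:start-end' strings into the three columns expected by the 'data.frame' format, and shifts a BED-style 0-based start to a 1-based position. The helper names `parse_regions` and `bed_to_one_based` are hypothetical and not part of XGR, and the start + 1 shift is the usual BED convention assumed here rather than a statement of how XGR converts internally.

```r
# Hypothetical helper (not part of XGR): split "chrN:start-end" strings
# into chromosome, start and end columns.
parse_regions <- function(x) {
  chr <- sub(":.*$", "", x)          # text before the colon
  pos <- sub("^[^:]*:", "", x)       # text after the colon, e.g. "13-20"
  start <- as.integer(sub("-.*$", "", pos))
  end <- as.integer(sub("^.*-", "", pos))
  data.frame(chr = chr, start = start, end = end, stringsAsFactors = FALSE)
}

# Hypothetical helper (not part of XGR): BED starts are 0-based, so add 1
# to the second column to obtain a 1-based start (assumed convention).
bed_to_one_based <- function(bed) {
  out <- bed
  out[[2]] <- out[[2]] + 1L
  out
}

regions <- parse_regions(c("chr1:13-20", "chrX:100-250"))

bed <- data.frame(chr = "chr1", start = 12L, end = 20L, stringsAsFactors = FALSE)
converted <- bed_to_one_based(bed)
```

Under these assumptions, the BED record `chr1 12 20` and the string 'chr1:13-20' describe the same region.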

## Value

If scoring is set to false, a data frame with the following columns:

• GR: genomic regions

• Gene: crosslinked genes

• Score: the original score between the gene and the GR (if cdf.function is 'original'); otherwise cdf (based on the whole crosslink inputs)

• Context: the context
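When cdf.function is 'empirical', the Score column holds values of the empirical cumulative distribution function computed over all crosslink scores. The following base-R sketch shows such a transformation with `stats::ecdf` on a made-up table (the regions, gene names and scores are invented for illustration); it conveys the idea only, and XGR's internal computation may differ in detail.

```r
# Toy crosslink table; all values are made up for illustration
crosslink <- data.frame(
  GR    = c("chr1:100-200", "chr1:100-200", "chr2:300-400", "chr3:500-600"),
  Gene  = c("GeneA", "GeneB", "GeneA", "GeneC"),
  Score = c(2.5, 0.8, 4.1, 1.3),
  stringsAsFactors = FALSE
)

# Build the empirical CDF from the whole set of scores,
# then evaluate it at each individual score
Fn <- stats::ecdf(crosslink$Score)
crosslink$Score_cdf <- Fn(crosslink$Score)
```

Each transformed value is the fraction of scores less than or equal to the original one, so the result always lies in [0,1].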

If scoring is set to true, a data frame with the following columns:

• Gene: crosslinked genes

• Score: gene score summarised over its list of crosslinked GR

• Pval: p-value-like significance level transformed from gene scores

• Context: the context
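The gene-level Score summarises, per gene, the scores of its crosslinked GR according to scoring.scheme. The base-R sketch below illustrates the three schemes on made-up data. It reads $R_{i}$ in the sequential formula as the $i$-th largest score of a gene, which is an interpretation of the argument description; `combine_scores` and `rescale01` are hypothetical helpers, not XGR functions, and the min-max rescaling is an assumed way of mapping scores into [0,1].

```r
# Toy GR-to-gene scores; all values are made up for illustration
df <- data.frame(
  Gene  = c("GeneA", "GeneA", "GeneA", "GeneB", "GeneB"),
  Score = c(0.9, 0.6, 0.3, 0.8, 0.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: combine the scores of one gene under a given scheme
combine_scores <- function(scores, scheme = c("max", "sum", "sequential")) {
  scheme <- match.arg(scheme)
  ranked <- sort(scores, decreasing = TRUE)  # R_1 >= R_2 >= ...
  switch(scheme,
    max        = max(ranked),
    sum        = sum(ranked),
    sequential = sum(ranked / seq_along(ranked))  # sum over i of R_i / i
  )
}

per_gene <- split(df$Score, df$Gene)
gene_max <- vapply(per_gene, combine_scores, numeric(1), scheme = "max")
gene_sum <- vapply(per_gene, combine_scores, numeric(1), scheme = "sum")
gene_seq <- vapply(per_gene, combine_scores, numeric(1), scheme = "sequential")

# Hypothetical helper: min-max rescaling into [0,1] (assumed form)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
gene_seq_rescaled <- rescale01(gene_seq)
```

For GeneA the sequential score is 0.9/1 + 0.6/2 + 0.3/3, so additional weaker links contribute progressively less than they would under "sum".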

## See Also

xRDataLoader, xGR

## Examples

    ## Not run:
    # Load the XGR package and specify the location of built-in data
    library(XGR)
    RData.location <- "http://galahad.well.ox.ac.uk/bigdata"

    # 1) provide the genomic regions
    ## load ImmunoBase
    ImmunoBase <- xRDataLoader(RData.customised='ImmunoBase',
    RData.location=RData.location)
    ## get lead SNPs reported in AS GWAS and their significance info (p-values)
    gr <- ImmunoBase$AS$variant
    names(gr) <- NULL
    dGR <- xGR(gr, format="GRanges")

    # 2) using built-in crosslink info
    ## enhancer genes
    df_xGenes <- xGR2xGenes(dGR, format="GRanges", crosslink="genehancer",
    RData.location=RData.location)
    ## conformation genes
    df_xGenes <- xGR2xGenes(dGR, format="GRanges", crosslink="PCHiC_combined",
    RData.location=RData.location)
    ## eQTL genes
    df_xGenes <- xGR2xGenes(dGR, format="GRanges", crosslink="GTEx_V6p_combined",
    RData.location=RData.location)
    ## nearby genes (50kb, decaying rapidly)
    df_xGenes <- xGR2xGenes(dGR, format="GRanges", crosslink="nearby",
    nearby.distance.max=50000, nearby.decay.kernel="rapid",
    RData.location=RData.location)

    # 3) advanced use
    # 3a) provide crosslink.customised
    ## illustration purpose only (see the content of 'crosslink.customised')
    df <- xGR2nGenes(dGR, format="GRanges", RData.location=RData.location)
    crosslink.customised <- data.frame(GR=df$GR, Gene=df$Gene, Score=df$Weight,
    Context=rep('C',nrow(df)), stringsAsFactors=F)
    #crosslink.customised <- data.frame(GR=df$GR, Gene=df$Gene, Score=df$Weight,
    #stringsAsFactors=F)
    # 3b) define crosslinking genes
    # without gene scoring
    df_xGenes <- xGR2xGenes(dGR, format="GRanges",
    crosslink.customised=crosslink.customised, RData.location=RData.location)
    # with gene scoring
    df_xGenes <- xGR2xGenes(dGR, format="GRanges",
    crosslink.customised=crosslink.customised, scoring=T, scoring.scheme="max",
    RData.location=RData.location)

    ## End(Not run)
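The 'crosslink.customised' input built in step 3a can also be assembled by hand. The sketch below uses invented regions, gene names and scores to show the four-column layout described under Arguments, filling in the Context column with 'C' when it is absent. The helper `add_default_context` is hypothetical and only mirrors the documented default; the final call is left commented out because it needs XGR, its built-in data and the `dGR` object from step 1.

```r
# Hand-built crosslink table with three columns; all values are made up
crosslink.customised <- data.frame(
  GR    = c("chr1:1000-2000", "chr1:1000-2000", "chr2:5000-6000"),
  Gene  = c("GeneA", "GeneB", "GeneC"),
  Score = c(3.2, 1.5, 2.7),   # e.g. -log10 significance level
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of XGR): add a Context column set to 'C'
# when only the first three columns are supplied
add_default_context <- function(x) {
  if (ncol(x) < 4) {
    x$Context <- rep("C", nrow(x))
  }
  x
}

crosslink.customised <- add_default_context(crosslink.customised)

# With XGR loaded and dGR defined as in step 1, the table could then be used as:
# df_xGenes <- xGR2xGenes(dGR, format="GRanges",
#   crosslink.customised=crosslink.customised, RData.location=RData.location)
```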