How to use gpart

Introduction

In this package "gpart", we provide a new SNP sequence partitioning method which partitions the whole SNP sequence based on not only LD block structures but also gene location information. The LD block construction for GPART is performed using Big-LD algorithm, with additional improvement from previous version reported in Kim et al.(2017). We also add a visualization tool to show the LD heatmap with the information of LD block boundaries and gene locations in the package.

gpart including 5 functions as follows.

Running gpart

Installation

Download and install the most current of "gpart".

library(gpart)
library(gpart)

Input data for gpart

To run the CLQD and BigLD function, you need two kind of input data, additive genotype data in matrix or data.frame (geno) and SNP information data(SNPinfo). Each column of geno includes the additive genotypes of individuls for each SNP, and each rwo of SNPinfo includes the chromosome name, rsID and bp of each SNP. The example data contains 286 individuals in EAS population in 1000 Genomes phase 1 data of chr21:16040232-16835352 (4000 SNPs).

data(geno)
data(SNPinfo)
data(geneinfo)
geno[1:5, 1:5]
head(SNPinfo)
head(geneinfo)

Quick Start

To construct LD blocks, use BigLD function.

# By default, BigLD uses LD measure r2.
BigLDres = BigLD(geno = geno, SNPinfo = SNPinfo)

To partition SNPs into blocks considering the LD block structure and gene regions, use GPART function. You can speicify the min and max size of blocks.

# load the gene information from the DB (via biomaRt)
GPARTres = GPART(geno = geno, SNPinfo = SNPinfo,
                 assembly = "GRCh37", minsize = 4, maxsize = 50)
# use inputted gene information
GPARTres = GPART(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, 
                 minsize = 4, maxsize = 50)

To visualize the BigLD or GPART results, use the LDblockHeatmap function.

# load the gene information from the DB (via biomaRt)
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, assembly = "GRCh37", 
                filename = "heatmap_example2", res = 150)
# use inputted gene information
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, 
                filename = "heatmap_example1", res = 150)

Functions

CLQD function.

CLQD is an algorithm for clustering the SNPs into LD bins which are high-correlated SNP groups.

Examples

clq1 = CLQD(geno[, 1:100], SNPinfo[1:100,])

Values

CLQD returns a vector of bin numbers of SNPs. NA means that the corresponding SNP is a singleton, i.e. that SNP is an LD bin of size 1.

# CLQ results for 300 SNPs
head(clq1, n = 20)
table(clq1)
# n of singletons
sum(is.na(clq1))

Details

CLQD(geno[, 1:100], SNPinfo[1:100,], LD = "Dprime")

The algorithm first constructs a graph in which the vertices are SNPs and each edge is an unordered pair of two SNPs. Each edge weight is given by LD value between two SNPs. CLQD uses $r^2$ or D' as the LD measurement. If you want to measure $r^2$, set LD = "r2"(default), otherwise set LD = "Dprime". When LD = "r2", we have to specify a threshold for $|r|$ by using the parameter CLQcut. The default is CLQcut = 0.5. For the D', we define valid edges following definition of 'strong LD' suggested by Gabriel et al., (2002).

CLQD(geno[, 1:100], SNPinfo[1:100,], CLQcut = 0.7, hrstType = "fast", 
     CLQmode = "maximal")

CLQ algorithm first detects maximal cliques in the constructed graph and then takes cliques which include SNPs not yet selected as LD bins greedily. There are two clique selection methods. If you want to give a priority to a clique of larger size, set as CLQmode = "maximal", or if you want to give a priority to a cluster of denser LD bins (that is, the average distance of SNPs in a bin is smaller), set as CLQmode = "density"(default).

For the worst case, memory and time to listing all maximal cliques could be exponentially increased. Although, we can find all maximal cliques in reasonable time and memory usage for the most graphs corresponding to the LD structures. However, there are SNP sequence regions in which the time and memory usage for detecting maximal cliques extremely increasing, and we suggested heuristic algorithm for these regions. There are two kind of heuristic algorithm, "fast" and "near-nonhrst"(default). You can choose which algorithm to be used by setting the option hrstType such as hrstType = "fast" or hrstType = "near-nonhrst". "fast" algorithm needs less memory usage and time than the "near-nonhrst" algorithm. The "near-nonhrst" algorithm needs more memory and time than "fast" algorithm, but it returns the results similar to results obtained by non-heuristic algorithm. Adopting the "near-nonhrst" algorithm should specify a heuristic parameter hrstParam that specifies how similar to the non-heuristic result. Entering a high value produces results similar to non-heuristic results. It is recommended to set at least 150, and the default value is 200. To use the non-heuristic algorithm, set as hrstType = "nonhrst".

If the distance of two consecutive SNPs in a clique is too far, we can divide the clique into two cliques. The parameter for the threshold is clstgap and the default is 40000(bp).

BigLD function.

Big-LD is an algorithm to construct LD blocks using the interval graph modeling. The detailed algorithm can be found in Kim et al.(2017). Big-LD have several distinctive features. Since the Big-LD algorithm designed based on the graph modeling, large and mosaic pattern of LD blocks which allow the “holes” can be detected by the algorithm. Moreover, the boundaries of LD blocks identified by Big-LD tend to agree better with the recombination hotspots than the above mentioned methods. It also was shown that the LD block boundaries found by this algorithm are more invariant for the changes in the marker density compared to existing methods. For the vast number of SNPs, execution of the algorithm was completed within a reasonable time.

Examples

# Use the options `geno` and `SNPinfo` to input additive genotype data and SNP information data respectively.
# use r2 measure (default)
res1 = BigLD(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,]) 
#use D' measure
res1_dp = BigLD(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,]
                ,LD = "Dprime") 

Values

BigLD returns a data frame which contains 7 information of each block (chromosome name, indices of the first SNP and last SNP, rsID of the first SNP and last SNP, bp of the first SNP and last SNP)

head(res1)

Details

# When you have to input files directly, use the parameter `genofile` (and `SNPinfofile`) instead of `geno` and `SNPinfo`.
res2 = BigLD(genofile = "geno.vcf")
res3 = BigLD(genofile = "geno.ped", SNPinfofile = "geno.map")
# change LD measure, hrstParam
res4 = BigLD(geno = geno, SNPinfo = SNPinfo, LD = "Dprime", hrstParam = 150)

The input consists of genotype and SNP information. The function also can directly load the data from the files of PLINK format or VCF (.ped, .map, .traw, .raw, .vcf). If the file sizes (.ped, .traw, .raw, .vcf) are larger than 10Gb, we strongly recommand to use geno and SNPinfo options after reformatting your data instead of loading large files file directly. As with CLQD function, you can choose an LD measure to use (LD), r2 threshold (CLQcut), heuristic algorithm (hrstType, hrstParam), CLQ methods (CLQmode), threshold for distance of consecutive SNPs in a clique (clstgap).

res5 = BigLD(geno = geno, SNPinfo = SNPinfo, MAFcut = 0.1, appendRare = TRUE)

BigLD algorithm construct LD blocks using the common variants after filtering rare variants by the option MAFcut(default 0.05). To include a rare variant in the LD block results, add the appendRare = TRUE option. Rare variants with maf $\leq$ MAFcut could be appended to a proper LD block already constructed by common variants.

We can force specified SNP to be the last SNP in a block using cutByForce option.

cutlist = rbind(c(21, "rs440600", 16051956), c(21, "rs9979041", 16055738))
res6 = BigLD(geno = geno[, 1:100], SNPinfo = SNPinfo[1:100,], cutByForce = cutlist)
cutlist = rbind(c(21, "rs440600", 16051956), c(21, "rs9979041", 16055738))
res6 = BigLD(geno = geno[, 1:100], SNPinfo = SNPinfo[1:100,], cutByForce = cutlist)
print(cutlist)
head(res6)

If you want to obtain BigLD results of some specific region of an input data, you can speicify the region to calculate by using the options chrN, startbp, and endbp.

res7 = BigLD(geno = geno, SNPinfo = SNPinfo, startbp = 16058400, endbp = 16076500)
res7 = BigLD(geno = geno, SNPinfo = SNPinfo, startbp = 16058400, endbp = 16076500)
res7

GPART function

GPART is a SNP sequence partitioning algorithm based on the Big-LD results. Big-LD algorithm might allow quite long LD blocks or very small LD blocks, also do not consider the gene region information during the LD block construction procedures. GPART could partition SNP sequences considering gene region information by merging the overlapping gene regions and LD blocks. To partition the SNP sequences into blocks satisfying the predefined size threshold, GPART split the big blocks or merging the small consecutive blocks.

For the function GPART, you must input gene information data.

data(geneinfo)
head(geneinfo)

The GPART functions uses BigLD results as an initial clustering results. Therefore you need to input the BigLD results using the option BigLDresult. In other ways, you can make that the GPART function automatically run the BigLD before the main GPART algorithm (set BigLDresult=NULL, default). For this case, you can set parameters (CLQmode, CLQcut) for BigLD additionally.

Examples

# gene based GPART using the pre-calculated BigLD result 'res1'
# The input data of GPART must be the same as the input data used to obtain 'res1'
# note that the `res1` is obtained by using the first 1000 SNPs in geno.
# default minsize = 4, maxsize = 50
Gres0 = GPART(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
              geneinfo = geneinfo,BigLDresult = res1, 
              minsize = 4, maxsize = 50)

Values

GPART() returns data frame which contains 9 information of each partition (chromosome, indices of the first SNP and last SNP in the inputted data, rsIDs of the first SNP and last SNP, basepair positions of the first SNP and last SNP, blocksize, Name of a block)

# results of gene-based GPART
head(Gres0)

< Naming Examples >

Details

As with BigLD, GPART can run the algorithm based on LD measure, D' by setting LD = "Dprime".

# gene based model
Gres1 = GPART(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo)
# gene based model using LD measure Dprime
Gres2 = GPART(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, LD = "Dprime")

There are three GPART algorithms:

  1. "Gene based". first merge the overlapping LD blocks and gene regions, split the big LD blocks and merged gene-block regions, and then merge small consecutive blocks such as singletons to satisfy the pre-defined min-size threshold (set GPARTmode="geneBased", default).

  2. "LDblock based only". This algorithm does not use any gene information. It partitions the SNP sequence by splitting or merging the LD blocks to satisfy the given size threshold (GPARTmode="LDblockBased", and Blockbasedmode="onlyBlocks", default for LD block based methods).

  3. "LDblock based - use gene info". This use gene information only for the merging consecutive small blocks procedure. If consecutive small blocks of which size are less than threshold do overlap with gene region, we merge them as large as possible (at most max-size threshold), otherwise, we merge them as small as possible. (set GPARTmode="LDblockBased", and Blockbasedmode="useGeneRegions")

# LD block based - use only LD block information
Gres3 = GPART(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, 
              GPARTmode = "LDblockBased", Blockbasedmode = "onlyBlocks")
# LD block based - use gene information to merge singletons
Gres4 = GPART(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, 
              GPARTmode = "LDblockBased", Blockbasedmode = "useGeneRegions")

You can load gene information from a text file using the option geneinfofile.

# you can load text file containing gene information.
Gres5 = GPART(geno = geno, SNPinfo = SNPinfo, geneinfofile = "geneinfo.txt")

You can also load the gene information data from "Ensembl" or "UCSC" database via internal sub-function of GPART. You need to set the database and assembly via the options of function, geneDB for database and assembly for assembly. Default for geneDB is "ensembl" and default for assembly is "GRCh38".

When you use "ensembl" data, you can set the version of the data by using the option ensbversion. By default, we use "hgnc symbol" for the gene name. You can change it by the option geneid depending on the gene DB you choose. If there are more than 2 gene information with the same gene name, we union the regions with the same name.

# use `geneDB` option instead of `geneinfo` or `geneinfofile`
# default geneDB is "ensembl" and  default assembly is "GRCh38"
Gres6 = GPART(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,],
              BigLDresult = res1, geneDB = "ensembl", assembly = "GRCh37")
Gres7 = GPART(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
              BigLDresult = res1, geneDB = "ucsc", assembly = "GRCh37" )

LDblockHeatmap

LDblockHeatmap function visualizes the BigLD (or GPART) results. The function can draw

If you want to show the Big-LD (or GPART) results, you need to input the Big-LD (or GPART) results. Otherwise the function first executes the Big-LD (or GPART) algorithm and then draw the figures with the obtained results as blocktype you specify.

Set the option filename and res to specify the file name and resolution of a plot respectively.

# First LDblockHeatmap calculate the LD block results using BigLD(), and then draw the LD heatmap, block boundaries and physical locations, and gene information
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo,geneinfo = geneinfo, 
                filename = "heatmap_all", res = 150)

In geno, there are 3340 SNPs with MAF>0.05. The "heatmap_all.png" shows the LDheatmap and BigLD results of the 3340SNPs.

knitr::include_graphics("heatmap_all.png")
# You can use the already obtained BigLD result using an option 'blockresult'.
# note that the `res1` is obtained by using the first 1000 SNPs in geno.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneinfo = geneinfo, blockresult = res1, 
               filename = "heatmap1", res = 200)
# For the obtained result 'res1', you can plot the only part of the result.
LDblockHeatmap(geno = geno[,300:800], SNPinfo = SNPinfo[300:800,], 
               geneinfo = geneinfo, blockresult = res1, 
               filename = "heatmap1_part", res = 200)
# If there is no inputted Big-LD result,
# the function will calculate the BigLD result first and then draw the figure using the result.
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, 
               geneinfo = geneinfo, filename = "heatmap2")
# you can save the output in tiff file
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, 
               geneinfo = geneinfo, filename = "heatmap3", type = "tif")

To draw heatmap, there are three kind of LD measures

Set LD = "Dprime" if you want to draw heatmap using D' measure. Set LD = "Dp-str" if you want to show only strong LD relations.

# the function first execute BigLD to obtain LDblock results, and then run GPART algorithm.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneinfo = geneinfo, LD = "Dprime", 
               filename = "heatmap_Dp", res = 200)
# or you can use the Big-LD results already obtained.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneinfo = geneinfo, LD = "Dp-str"
               , filename = "heatmap_Dp-str", res = 200)
knitr::include_graphics("heatmap_Dp-str.png")

You can speicify the region by using options chrN, startbp and endbp. If the data includes more than two chromosome, you need to specify the chromosome name to draw by using chrN

# the function first execute BigLD to obtain LDblock results, and then run GPART algorithm.
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo,
               chrN = 21, startbp = 16000000, endbp = 16200000, 
               filename = "heatmap_16mb-16.2mb")

The default block construction method is "Big-LD". When LDblockHeatmap function automatically run the BigLD, you can additionally specify the parameters, CLQmode, CLQcut.

when you set as LD = "Dprime" or LD = "Dp-str", the BigLD execute the algorithm using LD measure D-prime.

To draw the figures using the GPART results, you need to set the option blocktype = "gpart". If you do not input GPART results via blockresult, GPART first will be executed and gene-based algorithm will be applied.

# using the obatined GPART results, draw figure.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneinfo = geneinfo, blockresult = Gres0, 
               blocktype = "gpart", filename = "heatmap_gpart")
# or if you set the blocktype only, the function will execute the proper block construction algorithm.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneinfo = geneinfo, blocktype = "gpart", 
               filename = "heatmap_gpart2")

As with GPART, you can use the options geneinfofile, geneDB, assembly, and ensbversion.

# Show gene region information obtained from "ensembl" DB in assembly GRCh38.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneDB = "ensembl", assembly = "GRCh37",
               filename = "heatmap_enb")

You also can choose to not show the gene region information by setting as geneshow = FALSE

# Do not show the gene region information.
LDblockHeatmap(geno = geno[,1:1000], SNPinfo = SNPinfo[1:1000,], 
               geneshow = FALSE, filename = "heatmap_wogene")

To show LD bins obtained by CLQ algorithm (when the total number of SNPs <=200), use the option CLQshow = TRUE.

# using CLQshow = TRUE options to show LD bin results.
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo, 
               CLQshow = TRUE, startbp = 16300000, endbp = 16350000, 
               res=200, filename = "heatmap_clq")
knitr::include_graphics("heatmap_clq.png")

In the "heatmap_clq.png", there are 2 LD blocks. The first LD block contains 2 LD bins(colored by navy and blue), and the second LD block contains 7 LD bins. SNPs colored by the same color are strongly correlated, and pairwise |r| values are at least inputted CLQcut value. For the above figure, CLQcut was set to 0.5 by default.

You can choose not to draw block boundaries by using the option onlyHeatmap = TRUE. The function then ignores the entered block results or not pre-calculates the block results and draws pictures without block boundaries

# using "onlyHeatmap = TRUE"
LDblockHeatmap(geno = geno, SNPinfo = SNPinfo, geneinfo = geneinfo,
               onlyHeatmap = TRUE, filename = "heatmap_no_bound", res = 150)

convert2GRange

You can convert an output of BigLD or GPART in data.frame format to a GRanges object.

BigLD_grange = convert2GRange(res1)
BigLD_grange

References



Try the gpart package in your browser

Any scripts or data that you put into this service are public.

gpart documentation built on Nov. 8, 2020, 5:16 p.m.