LDnClusteringEL: Linkage disequilibrium (LD) network clustering, starting from...

View source: R/LDnClusteringEL.R

LDnClusteringEL R Documentation

Linkage disequilibrium (LD) network clustering, starting from edge lists of pairwise LD-values as input.

Description

Finds clusters of loci connected by high LD within non-overlapping windows and returns a summary file, the resulting clusters and the names of the rSNPs (previously MCL), to be used, e.g., in genome-wide association (GWA) or outlier analyses.

Usage

LDnClusteringEL(
  EL_path = "./LD_EL/",
  nSNPs = 1000,
  columns = c(1, 2, 4),
  out_folder = "LDnCl_out",
  min_LD = 0.7,
  plot.network = NULL,
  threshold_net = 0.9,
  to_do = NULL,
  cores = 1,
  min.cl.size = 2
)

Arguments

EL_path

Path to a folder containing only the relevant edge lists (one per chromosome/linkage group). For convenience the file name should be "name_of_chromosome"."el", e.g. chr1.el, 1.el, LG1.el, LG1_clean.el etc. Note that it is unnecessary to use a window size of more than ca. 100 SNPs when producing the edge list (specified in whatever software you are using); larger windows can lead to very long run times.

nSNPs

Desired number of SNPs per window.

columns

Indexes of the informative columns in the edge list. The first two indexes give the columns holding the names of locus 1 and locus 2 for each edge; the third gives the column that contains LD (r^2).

out_folder

Path to folder where output is produced (default is './LDnCl_out/'). Proceed to use Concat_files to concatenate these data into a single file.

min_LD

Minimum LD value by which at least one locus must be connected to all other loci in its clade for the recursive step to stop.

plot.network

File name for plotting network. If NULL (default) no network is plotted.

threshold_net

Threshold for edges when plotting network.

to_do

Vector with indexes for files in folder specified by EL_path that still need to be done. If NULL (default) all files will be processed.

cores

Number of cores; default is 1.

min.cl.size

If 1, singletons will also be retained, which may be necessary for data sets with weak LD structure (e.g. most loci are independent). This produces very large files, though, so typically 2 is used to exclude all singleton clusters.

Details

Uses single linkage clustering within non-overlapping windows of SNPs (within chromosomes) to find groups of correlated SNPs connected by high LD. This is done recursively within the single linkage clustering sub-trees (from root and up); as soon as a clade is reached where at least one locus is connected with all other loci within its clade above threshold min_LD, the algorithm stops. The default (0.7) produces clusters where the first PC typically explains >99 percent of the variation in each cluster.
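The stopping rule described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper has_hub_locus and the toy LD matrix are hypothetical, and the tree is built with stats::hclust on the distance 1 - r^2 as an assumption about how single linkage clustering might be set up.

```r
# Hypothetical helper (not part of LDna): TRUE if at least one locus in `loci`
# has LD >= min_LD with every other locus in `loci`.
# `ld` is a symmetric matrix of r^2 values with locus names as dimnames.
has_hub_locus <- function(ld, loci, min_LD = 0.7) {
  sub <- ld[loci, loci, drop = FALSE]
  diag(sub) <- 1  # a locus is trivially connected to itself
  any(apply(sub, 1, function(row) all(row >= min_LD)))
}

# Toy LD matrix for four loci (made-up values)
loci <- c("Chr1:10", "Chr1:20", "Chr1:30", "Chr1:40")
ld <- matrix(0.1, 4, 4, dimnames = list(loci, loci))
ld["Chr1:10", "Chr1:20"] <- ld["Chr1:20", "Chr1:10"] <- 0.9
ld["Chr1:20", "Chr1:30"] <- ld["Chr1:30", "Chr1:20"] <- 0.8
diag(ld) <- 1

# Single linkage tree on the distance 1 - r^2
tree <- hclust(as.dist(1 - ld), method = "single")

# Take the sub-tree (clade) that contains Chr1:20 after splitting at the root
groups <- cutree(tree, k = 2)
clade <- names(groups)[groups == groups["Chr1:20"]]

all_ok <- has_hub_locus(ld, loci)   # whole window: no locus reaches all others
sub_ok <- has_hub_locus(ld, clade)  # clade: Chr1:20 is linked to both others
```

In this toy case the whole window fails the criterion, so the search would continue into the sub-trees; the clade containing Chr1:20 passes, so the recursion would stop there.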

Uses edge lists of LD values as input. The informative columns are specified by columns, where the first and second elements specify the columns for locus names and the third specifies the column that contains LD (r^2) values.
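As a rough illustration of how the default column selection c(1, 2, 4) picks the informative columns out of an edge list, consider the following sketch. The toy edge list, its column names and the names from, to and r2 are invented for the example and say nothing about how the function handles the data internally.

```r
# Toy edge list: locus names in columns 1 and 2, an uninformative
# distance column in 3, and r^2 in column 4 (all values are made up)
el <- data.frame(
  locus1 = c("Chr1:10", "Chr1:10", "Chr1:20"),
  locus2 = c("Chr1:20", "Chr1:30", "Chr1:30"),
  dist   = c(10, 20, 10),
  r2     = c(0.9, 0.2, 0.8),
  stringsAsFactors = FALSE
)

columns <- c(1, 2, 4)  # same form as the `columns` argument

# Keep only the informative columns and give them illustrative names
edges <- el[, columns]
names(edges) <- c("from", "to", "r2")

# Edges at or above an LD threshold of 0.7
strong <- edges[edges$r2 >= 0.7, ]
```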

nSNPs determines the goal size for each window. Smaller windows give faster computation, but the smaller the window, the greater the risk of missing LD-clusters that span window breakpoints. However, if such a cluster is large, it will be split into two separate LD-clusters, and these will be correlated in subsequent LDna steps.
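To make the idea of non-overlapping windows concrete, here is a small sketch that assigns consecutive loci on one chromosome to windows of at most nSNPs loci. This is only one possible scheme; how LDnClusteringEL actually places breakpoints or treats the last, shorter window is not documented here, so the rule below is an assumption.

```r
n_loci <- 2500  # made-up number of loci on one chromosome
nSNPs  <- 1000  # goal window size, as in the default

# Assumption: loci are ordered by position and fill windows consecutively
window <- ceiling(seq_len(n_loci) / nSNPs)

# Number of loci per window
window_sizes <- table(window)
```

With these numbers the scheme gives two full windows and a third with the remaining 500 loci. A cluster whose loci straddle the boundary between loci 1000 and 1001 would be split across windows 1 and 2.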

If something goes wrong you can specify to_do with indexes for files in folder specified by EL_path that still need to be done.

Value

Returns a list of three objects which are saved as ".rds" files in the folder specified by out_folder. cluster_summary is a data frame that contains most of the relevant information for each cluster. clusters contains a list of locus names (chr:position), each entry corresponding to a row in cluster_summary. MCL is a vector with the name of the locus that best represents each LD-cluster in downstream analyses ('maximally connected SNP', MCL, aka rSNP).

Each .rds file contains information for a single chromosome only, but the files can be concatenated into a single file using Concat_files.

The columns in file cluster_summary are: 'Chr', 'Window', 'Pos', 'Min', 'Max', 'Range', 'nSNPs', 'Min_LD'

Chr

Chromosome or linkage group identifier

Window

Window identifier, recycled among chromosomes

Pos

Mean position of SNPs in a cluster

Min

Most downstream position of SNPs in a cluster

Max

Most upstream position of SNPs in a cluster

Range

Max-Min

nSNPs

Number of SNPs in the cluster

Min_LD

The minimum LD between the rSNP/MCL and all other loci in its cluster
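The positional columns follow directly from the locus names in a cluster. The sketch below derives them for one made-up cluster, assuming locus names of the form chr:position as described above; the variable names are illustrative and Window and Min_LD are left out because they depend on information not contained in the locus names.

```r
# One made-up cluster of locus names in "chr:position" format
cluster <- c("Chr1:1200", "Chr1:1500", "Chr1:2100")

# Split each name into chromosome and position
parts <- strsplit(cluster, ":", fixed = TRUE)
chr <- unique(vapply(parts, `[`, character(1), 1))
pos <- as.numeric(vapply(parts, `[`, character(1), 2))

# Positional summary columns as defined in the list above
summary_row <- data.frame(
  Chr   = chr,
  Pos   = mean(pos),            # mean position of SNPs in the cluster
  Min   = min(pos),
  Max   = max(pos),
  Range = max(pos) - min(pos),  # Max - Min
  nSNPs = length(pos)
)
```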

Author(s)

Petri Kemppainen petrikemppainen2@gmail.com, Zitong Li lizitong1985@gmail.com

References

Kemppainen, P., Knight, C. G., Sarma, D. K., Hlaing, T., Prakash, A., Maung Maung, Y. N., Walton, C. (2015). Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure. Molecular Ecology Resources, 15(5), 1031-1045. https://doi.org/10.1111/1755-0998.12369

Li, Z., Kemppainen, P., Rastas, P., Merila, J. Linkage disequilibrium clustering-based approach for association mapping with tightly linked genome-wide data. Accepted to Molecular Ecology Resources.

See Also

emmax_group

Examples

## Not run: 
## We will first create some example data to live in folder "LD_EL"
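library(LDna)
library(data.table) # provides as.data.table() and fwrite()
data("LDna")
## make directory for edge lists to live
dir.create("LD_EL", showWarnings = FALSE) # portable alternative to system("mkdir LD_EL")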
length(ELs) # edge lists for two chromosomes
## write them in LD_EL folder
# the locus names need to be "Chr:Pos"
tmp <- as.data.table(ELs[[1]])
tmp[,V1:=paste("Chr1",V1,sep=":")]
tmp[,V2:=paste("Chr1",V2,sep=":")]
ELs[[1]] <- tmp
tmp <- as.data.table(ELs[[2]])
tmp[,V1:=paste("Chr2",V1,sep=":")]
tmp[,V2:=paste("Chr2",V2,sep=":")]
ELs[[2]] <- tmp
## write the files to the EL folder
fwrite(ELs[[1]],file="LD_EL/Chr1.ld", row.names=FALSE,quote=FALSE)
fwrite(ELs[[2]],file="LD_EL/Chr2.ld", row.names=FALSE,quote=FALSE)

## run LD-network clustering (LD-network complexity reduction)
LDnClusteringEL(EL_path = "./LD_EL/",cores = 10, min.cl.size = 2) ## no singleton clusters are kept
## read in results
LDnC_res <- Concat_files("./LDnCl_out/")
cluster_summary<- as.data.table(LDnC_res$cluster_summary)
cluster_summary
cluster_summary[,hist(Min_LD)] ## distribution of minimum LD among any two loci within a cluster
cluster_summary[,table(nSNPs)]  ## distribution of cluster sizes
cluster_summary[,plot(nSNPs,Min_LD)]  ## larger clusters tend to have lower minimum LD; those large clusters are from inversions

LDnC_res$clusters[cluster_summary[,which.max(nSNPs)]] ## this is the cluster with the most loci (e.g. putative inversion); the name is the MCL/rSNP
LDnC_res$MCL[cluster_summary[,which.max(nSNPs)]] ## the MCL/rSNP, i.e. the SNP that has the highest median LD with all other loci in this cluster
## and can be used to "represent" (hence "rSNP") this cluster in downstream analyses.
## The alternative is to analyse the first PC as a form of "synthetic multilocus genotype"

## End(Not run)

petrikemppainen/LDna documentation built on April 14, 2024, 6:37 p.m.