View source: R/LDnClusteringEL.R
LDnClusteringEL | R Documentation |
Finds clusters of loci connected by high LD within non-overlapping windows and returns a summary file, the resulting clusters and the names of the rSNPs (previously MCL), to be used e.g. in Genome Wide Association (GWA) or outlier analyses.
LDnClusteringEL(
EL_path = "./LD_EL/",
nSNPs = 1000,
columns = c(1, 2, 4),
out_folder = "LDnCl_out",
min_LD = 0.7,
plot.network = NULL,
threshold_net = 0.9,
to_do = NULL,
cores = 1,
min.cl.size = 2
)
EL_path |
Path to a folder containing only the relevant edge lists (one per chromosome/linkage group). For convenience, file names should have the form "name_of_chromosome"."el", e.g. chr1.el, 1.el, LG1.el, LG1_clean.el. Note that it is unnecessary to use a window size of more than ca. 100 SNPs when producing the edge list (specified in whatever software you are using); larger windows can make the computation very slow. |
nSNPs |
Desired number of SNPs per window. |
columns |
Indexes of the columns to use. The first two must be the columns holding locus 1 and locus 2 of each edge, and the third the column that contains LD (r^2). |
out_folder |
Path to folder where output is produced (default is './LDnCl_out/'). Proceed to use |
min_LD |
Minimum LD value with which at least one locus must be connected to all other loci in its clade for the recursive step to stop. |
plot.network |
File name for plotting network. If |
threshold_net |
Threshold for edges when plotting network. |
to_do |
Vector with indexes for files in folder specified by |
cores |
Number of cores, default is 1 |
min.cl.size |
If 1, singletons are also retained, which may be necessary for data sets with weak LD structure (e.g. most loci are independent). This produces very large files, though, so typically 2 is used to exclude all singleton clusters. |
Uses single linkage clustering within non-overlapping windows of SNPs (within chromosomes) to find groups of correlated SNPs connected by high LD. This is done recursively within the single linkage clustering sub-trees (from the root and up): as soon as a clade is reached in which at least one locus is connected to all other loci in that clade above the threshold min_LD, the algorithm stops. The default (0.7) produces clusters where the first PC typically explains >99 percent of the variation in each cluster.
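The stopping rule described above can be illustrated with a minimal sketch in base R. This is not the package's implementation: the toy r^2 matrix, the helper names `clade_ok` and `find_clusters`, and the use of `>=` for "above threshold" are assumptions made for illustration only.

```r
# Toy symmetric r^2 matrix for five loci (made-up values).
loci <- paste0("Chr1:", c(100, 200, 300, 400, 500))
r2 <- matrix(0.1, 5, 5, dimnames = list(loci, loci))
r2[1:3, 1:3] <- 0.9   # loci 1-3 form a tight block
r2[4:5, 4:5] <- 0.8   # loci 4-5 form a second block
diag(r2) <- 1

# Hypothetical helper: TRUE if at least one locus is connected to all
# other loci in the clade at or above min_LD.
clade_ok <- function(r2_sub, min_LD) {
  if (nrow(r2_sub) < 2) return(FALSE)
  min_to_others <- sapply(seq_len(nrow(r2_sub)),
                          function(i) min(r2_sub[i, -i]))
  any(min_to_others >= min_LD)
}

# Hypothetical helper: walk the single linkage tree from the root towards
# the tips and stop descending as soon as a clade meets the criterion.
find_clusters <- function(node, r2, min_LD) {
  members <- labels(node)
  if (clade_ok(r2[members, members, drop = FALSE], min_LD)) {
    return(list(members))
  }
  if (is.leaf(node)) return(list())
  c(find_clusters(node[[1]], r2, min_LD),
    find_clusters(node[[2]], r2, min_LD))
}

# Single linkage clustering on 1 - r^2 as the distance.
tree <- as.dendrogram(hclust(as.dist(1 - r2), method = "single"))
clusters <- find_clusters(tree, r2, min_LD = 0.7)
clusters
```

In this toy case the root clade fails the criterion (between-block r^2 is 0.1), so the search descends one level and returns the two blocks as separate clusters.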
Uses edge lists of LD values as input. The informative columns are specified by columns, where the first and second entries specify the columns for locus names and the third the column that contains LD (r^2) values.
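As a small illustration of the column selection, the sketch below builds a made-up edge list and picks out the columns given by the default `c(1, 2, 4)`. The content of the unused third column (here a distance in base pairs) and the names `el_used`, `locus1`, `locus2` and `r2` are assumptions; the source only states which columns carry the locus names and r^2.

```r
# Hypothetical edge list: locus 1, locus 2, distance (bp), r^2.
el <- data.frame(
  V1 = c("Chr1:100", "Chr1:100", "Chr1:250"),
  V2 = c("Chr1:250", "Chr1:900", "Chr1:900"),
  V3 = c(150, 800, 650),
  V4 = c(0.95, 0.12, 0.81)
)

# Same form as the function's default: two locus columns, then the LD column.
columns <- c(1, 2, 4)
el_used <- el[, columns]
names(el_used) <- c("locus1", "locus2", "r2")
el_used
```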
nSNPs determines the target size of each window. Smaller windows give faster computation, but the smaller the window, the greater the risk of missing LD-clusters that span window breakpoints. However, if such a cluster is large, it will be split into two separate LD-clusters, and these will be correlated in subsequent LDna steps.
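To make the idea of non-overlapping windows concrete, here is a rough sketch that assigns sorted SNP positions to consecutive windows of at most `nSNPs` loci. The positions are made up, and the function itself may balance window sizes differently, since `nSNPs` is only a target size.

```r
# Made-up SNP positions on one chromosome, sorted.
positions <- sort(c(120, 340, 560, 780, 910, 1500, 1720, 2300, 2950, 3100))
nSNPs <- 4

# Consecutive, non-overlapping windows of at most nSNPs loci each.
window <- ceiling(seq_along(positions) / nSNPs)
split(positions, window)
```

A cluster containing both position 780 and position 910 would straddle the breakpoint between the first and second window here, which is the situation the paragraph above warns about.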
If something goes wrong, you can specify to_do with indexes for the files in the folder specified by EL_path that still need to be done.
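One way to work out such indexes is sketched below. It assumes that the indexes refer to the order returned by `list.files()` on the edge list folder, which the text above does not confirm, so check this against your own run. The demo folder, file names and the `done` vector are hypothetical.

```r
# Demo folder with three empty placeholder edge list files.
el_dir <- file.path(tempdir(), "LD_EL_demo")
dir.create(el_dir, showWarnings = FALSE)
file.create(file.path(el_dir, c("Chr1.el", "Chr2.el", "Chr3.el")))

files <- list.files(el_dir)
done <- c("Chr1")  # hypothetical: chromosomes that already finished

# Assumption: indexes follow the list.files() order of the folder.
to_do <- which(!sub("\\.el$", "", files) %in% done)
to_do
```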
Returns a list of three objects, which are saved as ".rds" files in the folder specified by out_folder.
cluster_summary is a data frame that contains most of the relevant information for each cluster.
clusters contains a list of locus names (chr:position), each entry corresponding to a row in cluster_summary.
MCL is a vector of locus names, each of which best represents its LD-cluster in downstream analyses ('maximally connected SNP', MCL, aka rSNP).
Each .rds file contains information for a single chromosome only, but the files can be concatenated into a single file using Concat_files.
The columns in cluster_summary are:
'Chr', 'Window', 'Pos', 'Min', 'Max', 'Range', 'nSNPs', 'Min_LD'
Chr |
Chromosome or linkage group identifier |
Window |
Window identifier, recycled among chromosomes |
Pos |
Mean position of SNPs in a cluster |
Min |
Most downstream position of SNPs in a cluster |
Max |
Most upstream position of SNPs in a cluster |
Range |
Max-Min |
nSNPs |
Number of SNPs in the cluster |
Min_LD |
The minimum LD between the rSNP/MCL and all other loci in its cluster |
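The column definitions above can be reproduced by hand for a single toy cluster, as in the sketch below. The locus names and r^2 values are made up, and the rSNP is picked as the locus with the highest median LD to the other loci in the cluster; the code is an illustration, not the package's own computation.

```r
# Toy cluster of three loci named "Chr:Pos", with made-up pairwise r^2.
cluster <- c("Chr1:1200", "Chr1:1500", "Chr1:2100")
r2 <- matrix(c(1.00, 0.90, 0.80,
               0.90, 1.00, 0.85,
               0.80, 0.85, 1.00),
             nrow = 3, dimnames = list(cluster, cluster))

pos <- as.numeric(sub(".*:", "", cluster))

# rSNP/MCL: locus with the highest median LD to the other loci.
med <- sapply(seq_along(cluster), function(i) median(r2[i, -i]))
rSNP <- cluster[which.max(med)]

summary_row <- data.frame(
  Chr    = sub(":.*", "", cluster[1]),
  Pos    = mean(pos),
  Min    = min(pos),
  Max    = max(pos),
  Range  = max(pos) - min(pos),
  nSNPs  = length(cluster),
  Min_LD = min(r2[rSNP, setdiff(cluster, rSNP)])
)
summary_row
```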
Petri Kemppainen petrikemppainen2@gmail.com, Zitong Li lizitong1985@gmail.com
Kemppainen, P., Knight, C. G., Sarma, D. K., Hlaing, T., Prakash, A., Maung Maung, Y. N., Walton, C. (2015). Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure. Molecular Ecology Resources, 15(5), 1031-1045. https://doi.org/10.1111/1755-0998.12369
Li, Z., Kemppainen, P., Rastas, P., Merila, J. Linkage disequilibrium clustering-based approach for association mapping with tightly linked genome-wide data. Accepted to Molecular Ecology Resources.
emmax_group
## Not run:
## We will first create some example data to live in folder "LD_EL"
library(LDna)
library(data.table) # provides as.data.table() and fwrite() used below
data("LDna")
## make directory for edge lists to live
dir.create("LD_EL", showWarnings = FALSE) # portable alternative to system("mkdir LD_EL")
length(ELs) # edge lists for two chromosomes
## write them in LD_EL folder
# the locus names need to be "Chr:Pos"
tmp <- as.data.table(ELs[[1]])
tmp[,V1:=paste("Chr1",V1,sep=":")]
tmp[,V2:=paste("Chr1",V2,sep=":")]
ELs[[1]] <- tmp
tmp <- as.data.table(ELs[[2]])
tmp[,V1:=paste("Chr2",V1,sep=":")]
tmp[,V2:=paste("Chr2",V2,sep=":")]
ELs[[2]] <- tmp
## write the files to the EL folder
fwrite(ELs[[1]],file="LD_EL/Chr1.ld", row.names=FALSE,quote=FALSE)
fwrite(ELs[[2]],file="LD_EL/Chr2.ld", row.names=FALSE,quote=FALSE)
## run LD-network clustering (LD-network complexity reduction)
LDnClusteringEL(EL_path = "./LD_EL/",cores = 10, min.cl.size = 2) ## no singleton clusters are kept
## read in results
LDnC_res <- Concat_files("./LDnCl_out/")
cluster_summary<- as.data.table(LDnC_res$cluster_summary)
cluster_summary
cluster_summary[,hist(Min_LD)] ## distribution of minimum LD among any two loci within a cluster
cluster_summary[,table(nSNPs)] ## distribution cluster sizes
cluster_summary[,plot(nSNPs,Min_LD)] ## larger clusters tend to have lower minimum LD, those large clusters are from inversions
LDnC_res$clusters[cluster_summary[,which.max(nSNPs)]] ## this is the cluster with the most loci (e.g. putative inversion); the name is the MCL/rSNP
LDnC_res$MCL[cluster_summary[,which.max(nSNPs)]] ## the MCL/rSNP, i.e. the SNP that has the highest median LD with all other loci in this cluster
## and can be used to "represent" (hence "rSNP") this cluster in downstream analyses.
## The alternative is to analyse the first PC as a form of "synthetic multilocus genotype"
## End(Not run)