filter_ld | R Documentation |
SNP short and long distance linkage disequilibrium pruning.
What sets radiator LD pruning apart is its set of arguments tailored to RADseq data:
Minimize short distance linkage disequilibrium (LD): 5 values are available for the filter.short.ld argument (see below).
Reduce long distance LD: long distance LD pruning is usually advised to avoid capturing the variance caused by LD in PCA. A filter.long.ld value between 0.7 and 0.9 is a good starting point. Ideally, visualize the LD before choosing a threshold.
Strategically, run the function with the filter.long.ld argument first, then refilter the data using the outlier statistic generated by the function (printed on the figure in the output) together with long.ld.missing = TRUE. This advanced argument chooses the best SNP based on missing data statistics instead of choosing one SNP randomly (see details).
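The advice to look at the LD distribution before choosing a threshold can be sketched in base R. The values below are simulated, the object names are hypothetical, and the outlier rule used here (Tukey's upper fence) is an assumption: it may not be the outlier statistic that radiator prints on its figure.

```r
# Simulated pairwise LD values (hypothetical, for illustration only).
set.seed(42)
ld.values <- c(rbeta(500, shape1 = 1, shape2 = 8), runif(20, min = 0.7, max = 1))

# Tukey's upper fence: Q3 + 1.5 * IQR.
# Assumption: this may differ from the outlier statistic reported by radiator.
q <- unname(quantile(ld.values, probs = c(0.25, 0.75)))
upper.fence <- q[2] + 1.5 * (q[2] - q[1])

# Proportion of pairs that a candidate threshold of 0.8 would flag.
prop.flagged <- mean(ld.values >= 0.8)

# Draw the distribution only when running interactively.
if (interactive()) boxplot(ld.values, horizontal = TRUE, xlab = "LD")
```

Comparing upper.fence with the 0.7 to 0.9 starting range gives a rough idea of whether that range is strict or lenient for a given dataset.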
This function is used internally in radiator and might be of interest for users.
filter_ld(
  data,
  interactive.filter = TRUE,
  filter.short.ld = "mac",
  filter.long.ld = NULL,
  parallel.core = parallel::detectCores() - 1,
  filename = NULL,
  verbose = TRUE,
  ...
)
data |
(4 options) A file or object generated by radiator:
How to get GDS and tidy data ?
Look into |
interactive.filter |
(optional, logical) Should the filtering session be interactive? Distribution figures are shown before the function asks for filtering thresholds.
Default: interactive.filter = TRUE. |
filter.short.ld |
(character) 5 options (default: filter.short.ld = "mac"). |
filter.long.ld |
(optional, double) The threshold used to prune SNPs based on long distance linkage disequilibrium. The threshold is applied to the absolute value of the LD measurement.
Default: filter.long.ld = NULL. |
parallel.core |
(optional) The number of cores used for parallel execution during import.
Default: parallel.core = parallel::detectCores() - 1. |
filename |
(optional, character) File name prefix for the files written in the working directory.
Default: filename = NULL. |
verbose |
(optional, logical) When |
... |
(optional) Advanced mode that allows passing further arguments to fine-tune the function. Also used for legacy arguments (see details or special section). |
The function requires SNPRelate (see example below on how to install).
Advanced mode, using dots-dots-dots
maf.data: (path) This argument is no longer supported. Dropping it costs a little time but makes sure the MAC/MAF fits the actual data.
long.ld.missing: (logical) With long.ld.missing = TRUE, the function first generates long distance LD values between markers along the same chromosome or scaffold with SNPRelate::snpgdsLDMat. Based on the LD threshold (filter.long.ld), SNPs in LD are then pruned based on missingness. E.g. if 4 SNPs are in LD, the 1 SNP selected in the end is based on genotyping rate/missingness. If this statistic is equal between the SNPs in LD, 1 SNP is chosen randomly.
Using missingness adds extra computational time. To speed up the analysis when missingness between markers is not an issue, use long.ld.missing = FALSE. The function will then use SNPRelate::snpgdsLDpruning to prune the dataset. SNPs in LD are selected randomly.
Default: long.ld.missing = FALSE.
ld.method: (optional, character) The values available are "composite" for the LD composite measure, "r" for the R coefficient (by EM algorithm assuming HWE; it can be negative), "r2" for r^2, "dprime" for D', and "corr" for the correlation coefficient. The corr and composite methods are equivalent when SNPs are coded based on the presence of the alternate allele (0, 1, 2).
Default: ld.method = "r2".
ld.figures: (logical) Generate long distance LD statistics and figures.
Default: ld.figures = TRUE.
path.folder: path used to write the output in a specific folder (used internally in radiator). If the supplied directory doesn't exist, it's created.
Default: path.folder = getwd().
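The idea behind long.ld.missing = TRUE can be illustrated with a small base R sketch. This is not radiator's or SNPRelate's implementation: the genotypes are made up, prune_by_missingness() is a hypothetical helper, the LD measure is a plain correlation of 0/1/2 coded genotypes (in the spirit of ld.method = "corr"), and ties in missingness are broken by marker order rather than randomly.

```r
# Toy genotypes coded as the number of alternate alleles (0, 1, 2); NA = missing.
geno <- cbind(
  snp1 = c(0, 1, 2, 1, 0, 2, 1, NA),
  snp2 = c(0, 1, 2, 1, 0, 2, NA, NA),
  snp3 = c(2, 1, 0, 1, 2, 0, 1, 2),
  snp4 = c(1, 2, 1, 0, 1, 1, 2, 0)
)

# Proportion of missing genotypes per SNP.
missing.prop <- colMeans(is.na(geno))

# Pairwise correlation between SNPs, ignoring missing genotypes pair by pair.
ld <- cor(geno, use = "pairwise.complete.obs")

# Hypothetical helper: greedy pruning that favours SNPs with less missing data.
prune_by_missingness <- function(ld, missing.prop, threshold) {
  keep <- character(0)
  # Visit SNPs from the lowest to the highest missingness (ties: marker order).
  for (snp in names(missing.prop)[order(missing.prop)]) {
    # Keep the SNP only if |LD| with every SNP already kept is below threshold.
    if (all(abs(ld[snp, keep]) < threshold)) keep <- c(keep, snp)
  }
  list(
    whitelist = sort(keep),
    blacklist = sort(setdiff(names(missing.prop), keep))
  )
}

res <- prune_by_missingness(ld, missing.prop, threshold = 0.8)
res$whitelist # "snp3" "snp4"
res$blacklist # "snp1" "snp2"
```

Here snp1, snp2 and snp3 are in complete LD (snp3 is negatively correlated with the other two, which is why the absolute value is compared with the threshold), and snp3 is retained because it has no missing genotypes.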
A list in the global environment, with these objects:
$ld.summary: tibble with the LD statistics used for the boxplot.
$ld.boxplot: box plot of LD values.
$whitelist.ld: whitelist of markers kept after filtering for LD. The argument filter.long.ld must be used to generate the whitelist.
$blacklist.ld: blacklist of markers pruned during the filtering for LD. The argument filter.long.ld must be used to generate the blacklist.
$data: the filtered tidy dataset.
$gds: the path to the GDS file.
Thierry Gosselin thierrygosselin@icloud.com
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 28: 3326-3328. doi:10.1093/bioinformatics/bts606
## Not run:
# To install SNPRelate:
install.packages("BiocManager")
BiocManager::install("SNPRelate")

require(SNPRelate)
library(radiator)
data <- radiator::read_vcf(data = "my.vcf", strata = "my.strata.tsv", verbose = TRUE)

# short distance LD, no long distance LD:
check.short.ld <- radiator::filter_ld(data = data, filter.short.ld = "mac")

# short distance LD and long distance LD:
pruned.ld <- radiator::filter_ld(
  data = data,
  filter.short.ld = "mac",
  filter.long.ld = 0.8
)

# short distance LD and long distance LD, incorporating missing data:
pruned.ld <- radiator::filter_ld(
  data = data, # a GDS object generated by radiator
  filter.short.ld = "mac",
  filter.long.ld = 0.8,
  long.ld.missing = TRUE
)
## End(Not run)