vcm: Make a variant call matrix

Description Usage Arguments Details

View source: R/vcm.r

Description

Create a Variant Call Matrix (vcm) by retaining the most frequent base of reads for each cell contained in a bam file and at the position of detected variants contained in the vcf file.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
vcm(
  bam,
  vcf,
  barcodes = NULL,
  output = NULL,
  min_coverage = 0,
  min_mapq = 60,
  nthreads = 16,
  remake = FALSE,
  message = FALSE,
  varchunk = NULL,
  chunksize = NULL,
  adaptive = FALSE,
  factor = NULL
)

Arguments

bam

an input bam file

vcf

an input vcf file

barcodes

optional a barcode file

output

**optional an output file. If not specified <bam>.vcm in a current directory is used.

min_coverage

optional a minimum coverage for a position to not be considered unknown data

min_mapq

optional minimum mapping quality

nthreads

optional a number of threads to run on

remake

optional remake the output if it already exists

message

optional Print a progress message.

varchunk

optional By default, the whole vcf file is processed at once. Alternatively, this parameter can be specified to iterate over the vcf file and read and process only a limited number of variants at once. This shouldl improve the performance with a huge amount of cells as it decrease the size of the text that is being written in a file. However, this can also decrease performance if a small number of variants take a long time to process as other processes would wait for these.

chunksize

optional Number of variants assigned to a process at once. The variants can be divided into equally sized chunks that are then distributed among processes. Ideally, if the time to process each variant is the same, a larger chunks should be assigned as this reduce multiprocessing overhead. As this is not guranteed for the scRNAseq, only a single variant is send to a process at once by default.

adaptive

optional Use an adaptive chunksize calculation instead of a fixed number. The adaptive approach calculates chunksize for each subset of variants that passed the filter. This guarantee that the data are aways divided among the processes equally.

factor

optional If an adaptive chunksize is chosen, the variants that passed the filter are divided into a factor\*nthreads of equally sized subsets. Larger factor trades a smaller overhead for a potential scheduling problems.

Details

This code is a thin wrapper over the Python script vcm found in the package's inst folder.

This script reads a vcf file and then distribute the variants among the processes. The varchunk, chunksize, adaptive and factor are parameters that govern how will this distribution take place (see description of these parameters for more detail). As scRNAseq are notoriously patchy, variants can take highly varied time dependent on a number of reads mapped to that particular place. For this reason, the default values should provide the best performance. However, there might be situations where their modification would be beneficial.

Mapping quality reported by various tools is highly inconsistent. STAR assign 255 to uniquely mapped reads and 0, 1 and 3 to reads that are mapped non-uniquelly. Other tools assign 255 to reads that are not mapped at all. This is what GATK assumes and when preprocessing, 255 is reasigned to 60, which is the default value for the min_mapq parameter.


bioDS/phyloRNA documentation built on Feb. 21, 2022, 3:28 p.m.