vcm: Make a variant call matrix
In bioDS/phyloRNA: Tools for phylogenetic analyses of scRNAseq data

Create a Variant Call Matrix (vcm) by retaining the most frequent base of reads for each cell contained in a bam file and at the position of detected variants contained in the vcf file.

vcm(
  bam,
  vcf,
  barcodes = NULL,
  output = NULL,
  min_coverage = 0,
  min_mapq = 60,
  nthreads = 16,
  remake = FALSE,
  message = FALSE,
  varchunk = NULL,
  chunksize = NULL,
  adaptive = FALSE,
  factor = NULL
)

`bam`	an input bam file
`vcf`	an input vcf file
`barcodes`	optional a barcode file
`output`	**optional an output file. If not specified `<bam>.vcm` in a current directory is used.
`min_coverage`	optional a minimum coverage for a position to not be considered unknown data
`min_mapq`	optional minimum mapping quality
`nthreads`	optional a number of threads to run on
`remake`	optional remake the output if it already exists
`message`	optional Print a progress message.
`varchunk`	optional By default, the whole vcf file is processed at once. Alternatively, this parameter can be specified to iterate over the vcf file and read and process only a limited number of variants at once. This shouldl improve the performance with a huge amount of cells as it decrease the size of the text that is being written in a file. However, this can also decrease performance if a small number of variants take a long time to process as other processes would wait for these.
`chunksize`	optional Number of variants assigned to a process at once. The variants can be divided into equally sized chunks that are then distributed among processes. Ideally, if the time to process each variant is the same, a larger chunks should be assigned as this reduce multiprocessing overhead. As this is not guranteed for the scRNAseq, only a single variant is send to a process at once by default.
`adaptive`	optional Use an adaptive chunksize calculation instead of a fixed number. The adaptive approach calculates chunksize for each subset of variants that passed the filter. This guarantee that the data are aways divided among the processes equally.
`factor`	optional If an adaptive chunksize is chosen, the variants that passed the filter are divided into a factor\*nthreads of equally sized subsets. Larger factor trades a smaller overhead for a potential scheduling problems.

This code is a thin wrapper over the Python script vcm found in the package's inst folder.

This script reads a vcf file and then distribute the variants among the processes. The varchunk, chunksize, adaptive and factor are parameters that govern how will this distribution take place (see description of these parameters for more detail). As scRNAseq are notoriously patchy, variants can take highly varied time dependent on a number of reads mapped to that particular place. For this reason, the default values should provide the best performance. However, there might be situations where their modification would be beneficial.

Mapping quality reported by various tools is highly inconsistent. STAR assign 255 to uniquely mapped reads and 0, 1 and 3 to reads that are mapped non-uniquelly. Other tools assign 255 to reads that are not mapped at all. This is what GATK assumes and when preprocessing, 255 is reasigned to 60, which is the default value for the min_mapq parameter.

bioDS/phyloRNA documentation built on Feb. 21, 2022, 3:28 p.m.