qtl Analysis

Description Usage Arguments What does this function do? What does this function not do? SNP marker IDs See Also Examples

View source: R/convert_vcf_to_genfile.R

Read SNP genotype data from one or more VCF infiles, and output these to a genfile that may be accepted as input by R/qtl.

convert_vcf_to_genfile(
  infiles,
  genfile,
  samples,
  founders,
  alleles = NULL,
  max.seqlength = NULL,
  na.string = c("-", "NA")
)

`infiles`	Input VCF file paths.
`genfile`	Output R/qtl genotype data file.
`samples`	Cross sample IDs.
`founders`	Founder sample IDs.
`alleles`	Optional mapping of founder IDs to allele symbols (e.g. `utl::mapping( c(DBVPG6044 = 'W', Y12 = 'S') )`). If this parameter is not specified, allele symbols are taken from the letters of the alphabet (i.e. `'A'`, `'B'` etc.).
`max.seqlength`	Optional parameter to indicate the maximum reference sequence length, which is used to determine the zero-padded width of genomic positions in SNP marker IDs. Without this information, SNP marker IDs may be formatted inconsistently in different datasets. For more details, see the SNP marker IDs section.
`na.string`	String to replace `NA` values.

Given one or more VCF infiles, along with a set of identifiers for samples and founders from a two-founder cross, this function first reads SNP allele and genotype calls. Then, for the set of SNPs that are common to samples and founders, where the base call of a sample allele matches that of a founder, the sample is assigned the corresponding founder allele. The resulting sample genotype is made up of the combination of its founder alleles. A SNP is retained only if each founder has a distinct homozygous genotype, and there are at least two distinct genotypes among the samples for that SNP. Finally, the set of SNP markers satisfying these conditions are output to genfile.

This function does not convert genotypes for a multi-founder cross.

Variants are taken at face value, and there is no consideration of variant or genotype quality. To exclude low-quality variants and genotypes, you may wish to perform a hard filter beforehand using a suitable VCF or BCF toolkit.

No attempt is made to collapse redundant markers. Depending on SNP density and linkage block size, the output genfile may contain groups of linked SNPs that could be collapsed to a single marker. If appropriate, this can be done with the data in the genfile.

While care has been taken to ensure that the output genfile accurately reflects the genotype data in the input VCF file(s), the genotype data is not validated with respect to a specific cross type or experimental design. In general, you should review the output file to ensure that the results are appropriate for your dataset.

Each SNP marker is assigned an identifier of the form 'chr11:002160000', where 'chr11' is the chromosome identifier taken from the VCF CHROM field, and '002160000' is a zero-padded number giving the position of the given SNP in the reference genome, taken from the VCF POS field.

The width to which the genomic position is padded is determined by the maximum sequence length. Ideally reference sequence lengths are already specified in contig fields in the header of the VCF. In that case the maximum sequence length is taken from the input VCF file, and there is no need to specify a max.seqlength parameter.

If the VCF header does not contain the necessary reference sequence length information, then the max.seqlength parameter should be set to the appropriate value for the given reference genome to ensure consistent formatting of SNP marker IDs across datasets.

Sequence length information may be obtained by using one of the ChromInfo functions provided by the GenomeInfoDb package. Alternatively, it may be obtained from a Picard Tools sequence dictionary file using the function load_seq_dict, or directly from a reference genome FASTA file using the Biostrings function fasta.seqlengths. For more, see the Examples section below.

Biostrings package

GenomeInfoDb package

Picard Tools documentation

R/qtl website

## Not run: 

# To get maximum sequence length from a sequence dictionary.
seqinfo <- utl::load_seq_dict('reference_genome.dict')
max.seqlength <- max(GenomeInfoDb::seqlengths(seqinfo))

# To get maximum sequence length from reference sequence FASTA file.
max.seqlength <- max(Biostrings::fasta.seqlengths('reference_genome.fasta'))

# Convert VCF to R/qtl genfile.
infiles <- c('samples.vcf', 'founders.vcf')
genfile <- 'geno.csv'
sample.ids <- utl::read_samples_from_vcf('samples.vcf')
founder.ids <- utl::read_samples_from_vcf('founders.vcf')
alleles <- utl::mapping(c(FOUNDER1='A', FOUNDER2='B'))
utl::convert_vcf_to_genfile(infiles, genfile, sample.ids, founder.ids,
                            alleles=alleles, max.seqlength=max.seqlength)


## End(Not run)