Description Usage Arguments What does this function do? What does this function not do? SNP marker IDs See Also Examples
View source: R/convert_vcf_to_genfile.R
Read SNP genotype data from one or more VCF infiles
, and output
these to a genfile
that may be accepted as input by R/qtl.
1 2 3 4 5 6 7 8 9 | convert_vcf_to_genfile(
infiles,
genfile,
samples,
founders,
alleles = NULL,
max.seqlength = NULL,
na.string = c("-", "NA")
)
|
infiles |
Input VCF file paths. |
genfile |
Output R/qtl genotype data file. |
samples |
Cross sample IDs. |
founders |
Founder sample IDs. |
alleles |
Optional mapping of founder IDs to allele symbols (e.g. |
max.seqlength |
Optional parameter to indicate the maximum reference sequence length, which is used to determine the zero-padded width of genomic positions in SNP marker IDs. Without this information, SNP marker IDs may be formatted inconsistently in different datasets. For more details, see the SNP marker IDs section. |
na.string |
String to replace |
Given one or more VCF infiles
, along with a set of identifiers for samples
and
founders
from a two-founder cross, this function first reads SNP allele and genotype
calls. Then, for the set of SNPs that are common to samples and founders, where the base
call of a sample allele matches that of a founder, the sample is assigned the corresponding
founder allele. The resulting sample genotype is made up of the combination of its founder
alleles. A SNP is retained only if each founder has a distinct homozygous genotype, and there
are at least two distinct genotypes among the samples for that SNP. Finally, the set of SNP
markers satisfying these conditions are output to genfile
.
This function does not convert genotypes for a multi-founder cross.
Variants are taken at face value, and there is no consideration of variant or genotype quality. To exclude low-quality variants and genotypes, you may wish to perform a hard filter beforehand using a suitable VCF or BCF toolkit.
No attempt is made to collapse redundant markers. Depending on SNP density and linkage block
size, the output genfile
may contain groups of linked SNPs that could be collapsed to
a single marker. If appropriate, this can be done with the data in the genfile
.
While care has been taken to ensure that the output genfile
accurately reflects
the genotype data in the input VCF file(s), the genotype data is not validated with
respect to a specific cross type or experimental design. In general, you should review
the output file to ensure that the results are appropriate for your dataset.
Each SNP marker is assigned an identifier of the form 'chr11:002160000'
, where
'chr11'
is the chromosome identifier taken from the VCF CHROM
field, and
'002160000'
is a zero-padded number giving the position of the given SNP in the
reference genome, taken from the VCF POS
field.
The width to which the genomic position is padded is determined by the maximum sequence
length. Ideally reference sequence lengths are already specified in contig fields in the
header of the VCF. In that case the maximum sequence length is taken from the input VCF
file, and there is no need to specify a max.seqlength
parameter.
If the VCF header does not contain the necessary reference sequence length information,
then the max.seqlength
parameter should be set to the appropriate value for the
given reference genome to ensure consistent formatting of SNP marker IDs across datasets.
Sequence length information may be obtained by using one of the ChromInfo
functions
provided by the GenomeInfoDb package. Alternatively, it may be obtained from a Picard
Tools sequence dictionary file using the function load_seq_dict
, or directly
from a reference genome FASTA file using the Biostrings function fasta.seqlengths
.
For more, see the Examples section below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ## Not run:
# To get maximum sequence length from a sequence dictionary.
seqinfo <- utl::load_seq_dict('reference_genome.dict')
max.seqlength <- max(GenomeInfoDb::seqlengths(seqinfo))
# To get maximum sequence length from reference sequence FASTA file.
max.seqlength <- max(Biostrings::fasta.seqlengths('reference_genome.fasta'))
# Convert VCF to R/qtl genfile.
infiles <- c('samples.vcf', 'founders.vcf')
genfile <- 'geno.csv'
sample.ids <- utl::read_samples_from_vcf('samples.vcf')
founder.ids <- utl::read_samples_from_vcf('founders.vcf')
alleles <- utl::mapping(c(FOUNDER1='A', FOUNDER2='B'))
utl::convert_vcf_to_genfile(infiles, genfile, sample.ids, founder.ids,
alleles=alleles, max.seqlength=max.seqlength)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.