MixviR_v3.5.0

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

MixviR is a package designed to aid in exploring and visualizing genomic and amino-acid level variation obtained from microbial high-throughput sequencing samples, including samples that contain mixtures of genotypes. The package was originally written to detect and estimate relative frequencies of SARS-CoV-2 lineages (variants) in samples obtained from environmental sources, but can be applied to any microbial taxon.

Quick Start

Inputs

Required

  1. Sample VCFs - one for each sample to be analyzed. Must contain "DP" and "AD" in FORMAT field. Can be "gz" compressed.
  2. Reference genome file (fasta; remove spaces and underscores '_' from chromosome names).
  3. File (bed format) defining translated regions of the genome. 6 columns: chromosome, feature start position, feature end position, feature name, score (not used - can enter "."), and strand. Column names should not be included. Ensure that chromosome names in this file match those in the fasta reference exactly.

     **Pre-constructed reference information is available for SARS-CoV-2 (Covid-19) and can be specified
      as an argument to call_mutations(). In that case, only the sample VCFs are required.

Optional

Lineage-Associated Mutations File

The lineage-associated mutations (lineage.muts) file (csv) provides mutations/amino acid substitutions associated with lineages of interest. Requires columns named "Gene", "Mutation", and "Lineage". Three additional columns can be included named "Sublineage" "Chr" and "Pos". If "Chr" and "Pos" are included, make sure the chromosome names match those in the reference and bed files (no spaces, remove underscores). The file (including all optional colummns) should look like...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig1.jpg")

Location/Date File

The location/date (dates) file (csv) associates dates and locations with specific samples for cases where samples are obtained at various time points at one or more locations. Contains columns "SAMP_NAME", "LOCATION", and "DATE". Dates are given in mmddyyyy format, and the file should look like...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig2.jpg")

Typical Workflow

Run call_mutations() to identify mutations in the sample(s). Required arguments:

Run estimate_lineages() to generate a table with summarized results for each sample. Required arguments:

Run explore_mutations() to interactively visualize results. Required arguments:

   Include one or both of the following if available:

Full Documentation

Inputs

Required

There are three required inputs for running MixviR in its most basic form...

Sample Data

Sample data are provided as variant call format (vcf) files - one for each sample to be analyzed. These vcf files are expected to contain the "DP" and "AD" flags in the FORMAT field. Illumina's DRAGEN software, bcftools (mpileup/call; Danecek et al 2021) and the GATK (DePristo et al 2011) offer some widely-used workflows that can generate these vcf files. Below is some example bash code I've used for generating these files (most arguments can be customized as needed - including the FORMAT/AD AND FORMAT/DP fields in the output is important, if not there by default)...

bcftools example

```{bash, eval = FALSE} [path to bcftools]/bcftools mpileup -f fasta reference -d 4000 -q 60 -Q 30 -L 4500 / --ff UNMAP,SECONDARY,QCFAIL / -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,INFO/AD,INFO/ADF,INFO/ADR / input.bam | [path to bcftools]/bcftools call -m -A -Ov -o out_temp.vcf

[path to bcftools]/bcftools norm out_temp.vcf -c w -f fasta reference / -m -both -Ov -o out.vcf

**Reference Files**

The other two necessary inputs define the reference information for the taxon of interest. They include...

1. A fasta-formatted reference genome file (remove spaces and underscores '_' from chromosome names).
2. An associated bed-formatted file that defines the translated regions of the genome (ORFs/genes). This bed file should be tab delimited with 6 columns: chromosome, feature start position, feature end position, feature name, score (not used - can enter "."), and strand (+/-). Column names should not be included. Ensure that the chromosome names in this file match those in the reference fasta file. An example of this file is...

```r
knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig3.jpg")

Pre-constructed reference information is available for SARS-CoV-2 (Covid-19), and can be utilized with the reference option to call_mutations(). In that case, only sample vcf files are necessary as user-provided inputs.

Optional

There are two optional input files...

Lineage-associated mutations

The lineage-associated mutations file (lineage.muts) provides mutations associated with lineages/groups of interest. It requires columns named "Gene", "Mutation", and "Lineage". Three additional columns can be optionally included: "Sublineage", "Chr", and "Pos". If including the "Sublineage" column, the name of the sublineage associated with a given mutation can be provided. Mutations associated with a main/primary lineage should have "NA" in this column. If the "Chr" and "Pos" columns are included, make sure the chromosome names match those in the reference and bed input files (including no spaces or underscores). The file (including all optional columns) should look like...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig1.jpg")

It's important that the syntax of the mutations in this file matches that used by MixviR - see details on naming conventions under the sections "SNP-Based Mutation Identification"," "Indel-Based Mutation Identification", and "Nonsense Mutation Identification" below. The "Chr" and "Pos" columns store the chromosome and associated genomic position giving rise to the mutation. These columns are needed if, as part of the results, you want the program to return the sequencing depth at the position when the mutation is not observed for a sample. This information could be relevant in determining whether a target mutation not observed in a sample is absent because it doesn't occur in the sample, or because the sequencing coverage at that position is insufficient. If these columns are included, make sure the chromosome names match those in the reference genome (see note on naming chromosomes in section Cautions/Important Considerations below).

Location/Date File

In cases where samples are taken from the same location at different time points, the temporal information can be included by providing a csv location/date file (dates) that associates the sample dates and locations with each unique sample name. The file should contain columns named "SAMP_NAME", "LOCATION", and "DATE", and should look like...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig2.jpg")

Primary Functions

call_mutations()

Overview

Methods

The call_mutations() function uses one or more vcf files as input, along with a MixviR reference object (data frame) and creates a data frame/table that stores all mutations identified in the sample(s), along with a customizable set of associated information about each mutation. This data frame can be written to a file, and/or saved as an object in the global environment. In either case it is used as input to the explore_mutations() or estimate_lineages() functions.

call_mutations() first obtains the MixviR reference object (Fig 6), which is created as part of the run if the fasta.genome and bed options are provided. Alternatively, if analyzing SARS-CoV-2, the reference option can be set to “Wuhan” and a pre-constructed reference will be used. In the case of overlapping genes, positions will be duplicated in this reference object, with a separate entry for each gene the nucleotide position is associated with.

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig6.jpg")

MixviR then reads in the set of files to be analyzed. These should be the only files stored in the directory given by the sample.dir option. In most cases, samples will be provided in variant call format (vcf). These vcf files need to include the DP and AD flags in the FORMAT field. Relevant information from each vcf file is extracted with functionality from vcfR (Knaus and Grünwald, 2017). If the write.all.targets option will be used to report sequencing depths for genomic positions associated with a priori-defined mutations that don’t occur in the sample, all positions should be included in the input vcf file(s). Otherwise, only variant positions need to be included.

MixviR loops over the set of input files, sequentially calling mutations from each and appending them to a master data frame that stores all mutations. The process of calling mutations for each sample involves several steps. The overall sequencing depths at each position in the input file are first added to the reference object (Fig. 7, column ‘DP’; note that all objects shown in Figs 7-10 are temporary objects created during a MixviR analysis and are not directly available to the user). Depths are ‘NA’ for any positions not in the vcf input file.

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig7.jpg")

Variable sites in the sample (sites with an entry in the ALT column of the input file) are then extracted, the frequency of the ALT allele (AF) is calculated for each, and the set of variants is filtered to retain those that exceed the value of min.alt.freq. This filtering step is intended to remove low-frequency sequencing noise and/or PCR artifacts from the data (see note on min.alt.freq in section Cautions/Important Considerations below). Deletions are represented as strings >1 bp in the REF column (Fig 8; position 11282 is a 9-bp deletion) and insertions are represented as strings >1 bp in the ALT column (Fig 8; 1 bp insertion at position 19983).

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig8.jpg")

Next, MixviR checks the set of sample variants to determine if there are any genomic positions where more than one unique variant occurs in the (potentially mixed) sample. If not, mutations are characterized in two steps, first by identifying those based on single nucleotide polymorphism (SNP) variation, and subsequently by identifying mutations arising from insertions or deletions. Each of these steps is described below. In the event one or more positions have multiple mutations, the duplicated sites are first removed and stored. Mutations are initially called from the first set of variants (all variants without duplicated positions and the first variant at each duplicated site), and the process is iterated on the stored set of duplicated variants until no duplicated sites remain.

SNP-Based Mutation Identification

For mutations based on SNP variation, the full set of variants identified in the sample is filtered to retain just SNP-based variants (single nucleotide changes, no indels). These are joined with the reference object, replacing the reference base with any alternate alleles observed in the sample to create a “sample genome” (in other words, the reference genome with SNP variants substituted in to their respective position). This updated sample genome is then translated using functionality from Biostrings (Pagès et al 2019) to get the sample amino acids. Mutations resulting in amino acid substitutions are retained and named in the form "S_D614G", which would indicate a substitution of ‘G’ for the original ‘D’ at amino acid position 614 of the ‘S’ gene. Subsequently, all SNP variants that don’t result in an amino acid change, including synonymous changes or variants outside of genes, are identified and named in the form Chr1_500A->T, which would represent a mutation from ‘A’ to ‘T’ at nucleotide position 500 of Chromosome 1. The example below (Fig 9) shows one SNP in a non-genic region, one synonymous and one non-synonymous SNP-based mutation in the ‘ORF1a’ gene, and one synonymous and one non-synonymous SNP-based mutation in the ‘S’ gene.

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig9.jpg")
Indel-Based Mutation Identification

Indel-based mutation identification is performed separately for insertions and deletions, and for each, separately for in-frame and frameshift indels (defined as indels for which the number of nucleotides gained or lost is an even multiple of 3 or not, respectively, regardless of whether the event occurred in a gene). MixviR uses the following rules for naming indels…

In-frame deletions

In-frame deletions (even multiples of 3bp) are indicated with ‘del’.

Examples: S_del144/144 (genic); ORF1a_del3675/3677 (genic); NC-045512.2_del23121/3bp (non-genic)

 

Out-of-frame deletions

Out-of-frame deletions (not even multiples of 3bp deleted) are indicated with ‘Fdel’.

Examples: ORF1a_Fdel2454/7bp (genic); NC-045512.2_del23121/2bp (non-genic)

In-frame insertions

In-frame insertions (even multiples of 3bp) are indicated with ‘ins’.

Examples: S_ins214/216 (genic); NC-045512.2_ins23121/3bp (non-genic)

Out-of-frame insertions

Examples: S_Fins649/1bp (genic); NC-045512.2_Fins23121/1bp (non-genic)

Out-of-frame insertions (not even multiples of 3bp deleted) are indicated with ‘Fins’.

Indel examples are given in Figure 10, which shows a one-bp deletion within ORF1a at position 4947 of chromosome NC045512.2, a deletion of amino acid positions 3674-3676 in the ORF1a gene, a 1bp insertion within amino acid position 2173 of ORF1b, and a deletion of amino acid position 212 of the S gene.

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig10.jpg")
Nonsense Mutation Identification

Nonsense mutations resulting in a premature stop codon are designated with an asterisk (i.e. ORF1a_R718*).

Output

The full set of mutations identified (SNP and indel-based) are joined to create a data frame that serves as the output returned by call_mutations(). Any of the following columns can be included in this data frame (defined by the out.cols option):

This final data frame can be assigned as an object in the global environment, and/or written to a file if write.mut.table is set to TRUE. By default, the columns included in the output are SAMP_NAME, CHR, POS, GENE, ALT_ID, AF, and DP. This data frame serves as required input to the explore_mutations() and estimate_lineages() functions, and must include at least the set of default columns to run either of these subsequent functions.

explore_mutations()

Summary

The explore_mutations() function uses the output from call_mutations() and opens an interactive RShiny dashboard that allows you to explore the data. The dashboard will have from 1 to 4 tabs at the top, depending on what combination of optional input files (if any) are provided...

Available Tabs

estimate_lineages()

Summary

The estimate_lineages() function is meant to output the data from the "Variants Present" plots of explore_mutations() in table form. You can choose to write data for all lineages analyzed or just the lineages identified as present in the sample (default) based on the presence.thresh threshold. The tables are printed to the screen and can also be written to a file. This function requires both the output from call_mutations() and a lineage.muts file.

Cautions/Important Considerations

It may be tempting to try to include mutation data for lots of closely-related lineages in the lineage/mutation (lineage.muts) file to try to distinguish among lineages at a very fine scale. One example of this in SARS-CoV-2 is the Delta variant, which as of the time this vignette was written had >100 sublineages defined that were designated AY.1, AY.2, AY.3, etc... Note that any mutations that show up more than once in this file are removed for the analysis, and so including many closely-related lineages/groups will likely result in having few or no mutations to use for the analysis.

Example Analyses

Example Data

MixviR comes with a set of example data files that can be used to test out the program. These include vcf files representing SARS-CoV-2 environmental samples obtained from a single location on three different dates, a "lineage.muts" file (example_lineage_muts.csv) containing a subset of four SARS-CoV-2 variants to evaluate, a "location/dates" file (example_location_date.csv), and a "samp_mutations.tsv" file that stores the output of a call_mutations() run for these three samples with default settings.

You can use the system.file() function to see the location of the raw files. For example, the path to the directory containing the example vcf files is given by system.file("extdata", "vcfs", package = "MixviR") and the path to the lineage.muts file by system.file("extdata", "vcfs", "example_lineage_muts.csv", package = "MixviR")

Example Analysis 1- No Target Mutations In call_mutations()

Step 1: call_mutations()

Your first step with MixviR will typically be to run the call_mutations() function, which identifies all the mutations in the input datasets (vcf files). We'll point this function to the directory storing the three example vcf files, and since these data are for SARS-CoV-2, we can use the pre-formatted "Wuhan" reference. These two pieces of information (location of input files and reference info) are all that's required to make it run. In this case though, we'll also clean up the sample names by trimming off all the text after the "_" in each input file name with the name.sep option. We'll assign the output to an object named samp_mutations.

samp_mutations <- call_mutations(sample.dir = system.file("extdata", "vcfs", 
                                                          package = "MixviR"), 
                                 reference = "Wuhan",
                                 name.sep = "_")

Running this creates a new object named samp_mutations, which stores all the mutations observed in each sample. In this case, there are a total of 316 mutations identified. The samp_mutations data frame looks like...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig11.jpg")

Most of the columns here are fairly self-explanatory. Each mutation identified is listed in the "ALT_ID" column. "AF" gives the estimated frequency of that mutation in the sample, which is simply the "ALT_COUNT"/"DP". The "POS" column provides the genomic position of the mutation along the chromosome. This data frame serves as input for the explore_mutations() function, which we'll run next.

Step 2: explore_mutations()

Basic Run

Running the explore_mutations() function opens an RShiny dashboard in a separate window. The samp_mutations object created with call_mutations() is required as input. The number of tabs available in the window depends on which optional files (if any) are passed to the function.

explore_mutations(muts.df = samp_mutations)
knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig12.jpg")

Here we ran explore_mutations() in its most basic form and got just a single tab named "View Mutations", which is just a slightly reformatted version of the samp_mutations data frame. There are options to select the sample you want to view (drop-down box in top left), and also to filter results by searching for specific text (i.e. a gene or mutation name). Each column is also sortable with the arrows at the top of the column (shift+click allows sorting by multiple columns).

With Dates

Now we'll add one option to explore_mutations(), passing it a location/dates file with the dates option...

explore_mutations(muts.df = samp_mutations,
                  dates = system.file("extdata", 
                                      "example_location_date.csv", 
                                      package = "MixviR"))
knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig13.jpg")

 

This time we have three tabs available in the RShiny window: New Mutations, Mutation Frequencies, and View Mutations.

New Mutations Tab
The New Mutations tab (Fig 13) allows you to select a date and view the set of mutations that were observed for the first time (across the entire dataset) on or after that date.

Mutation Frequencies Tab

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig14.jpg")

The Mutation Frequencies tab (Fig 14) allows you to enter one or more mutation names (comma separated if more than one), and the frequencies of these mutations are plotted for each date available in the dataset for the selected location. Multiple mutations are plotted as separate facets, while multiple locations can also be selected and are distinguished by color on each facet.

View Mutations Tab
The View Mutations tab here is the same as what we got in the basic run of explore_mutations() above (see Fig 12).

With Dates + Lineages

Providing a lineage/mutations file with the lineage.muts option in addition to a locations/dates file results in a fourth tab (and also some additional information included in the New Mutations and View Mutations tabs from above).

explore_mutations(muts.df = samp_mutations,
                  dates = system.file(muts.df = samp_mutations,
                                      "extdata", 
                                      "example_location_date.csv", 
                                      package = "MixviR"), 
                  lineage.muts = system.file("extdata", 
                                             "example_lineage_muts.csv", 
                                             package = "MixviR"))

The fourth tab is named "Variants Present", and provides two plots...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig15.jpg")

In this example, B.1.1.7 and P.1 have been selected for analysis. When a lineage is selected, MixviR filters the lineage/mutations file for mutations that are unique to that lineage. The number of these analyzed for each lineage is given as one of the entries in the tooltips on the top plot (see below for tooltip details) - in this case, there are 15 mutations unique to B.1.1.7 and 16 unique to P.1. MixviR then checks to see how many of these were detected (at any frequency > the min.alt.freq set in call_mutations()) for each sample and plots this proportion in the top plot. The red horizontal dashed line is a reference line showing the threshold proportion of mutations necessary to consider the lineage "present" in the sample. By default this is set to 0.5, but it can be adjusted with the slider on the left. For any lineages identified as present in a sample, the frequencies of the characteristic mutations present in the sample are summarized (mean by default) to estimate the proportion of that lineage in the sample, which is plotted in the bottom plot.

Notice in Figure 15 that B.1.1.7 on the 2021-05-02 date had just over 50% of the lineage-characteristic mutations in the sample. Based on the default threshold of 0.5, both B.1.1.7 and P.1 are called as present in that sample (and therefore plotted in the bottom plot). However, if the threshold is raised to 0.55 with the 'Presence Threshold' slider on the left, then only P.1 is called as present for the 2021-05-02 sample and represented in the lower plot...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig16.jpg")

 
Tooltips are also available if you hover over features on the plot...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig17.jpg")

 

 

 

Elements in the tooltip on the top plot (Fig 17)...

 
Tooltips on the bottom plot provide identities of the target mutations (Fig 18)...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig18.jpg")

Example Analysis 2 - Target Mutations Included In call_mutations()

This analysis is similar to the one above, but the write.all.targets option is used to report sequencing depths for all mutations associated with lineages of interest, even if the mutation isn't observed in the sample. This requires that the optional columns "Chr" and "Pos" are included in the lineage.muts file.

samp_mutations <- call_mutations(sample.dir = system.file("extdata", 
                                                          "vcfs", 
                                                          package = "MixviR"),
               reference = "Wuhan",
               name.sep = "_", 
               write.all.targets = TRUE, 
               lineage.muts = system.file("extdata", 
                                          "example_lineage_muts.csv", 
                                          package = "MixviR"))

In comparison to the call_mutations() run in example 1 above, which reported 316 mutations, this time 393 are reported. The extra mutations are those that were listed in the lineage.muts file but not identified in the sample. See rows 103-116 for Sample 1 in Fig 19, which all have an observed frequency of zero, though the sequencing depth associated with the genomic positions underlying the mutations is > 0...

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig19.jpg")

If target mutations were included in the call_mutations() run, this can be indicated when running explore_mutations() by setting all.target.muts to TRUE, and an additional piece of information (Proportion Exceeding Seq Depth Threshold) is added to the tooltips on the "Variants Present" plot (Fig 20). When mutations are not identified in a sample, this measure can help provide some insight into whether they are missing because they truly don't occur in the sample, or because the sequencing didn't adequately cover their associated genomic positions.

knitr::include_graphics("https://raw.githubusercontent.com/mikesovic/MixviR/main/MixviR_v3.5.0/inst/images/Fig20.jpg")

The individual mutations, along with their sequencing depth, can also be explored in the "View Mutations" tab.

Getting Help

While you're welcome to email the authors directly if you have questions about, or problems with MixviR that aren't addressed in this vignette, we encourage you to instead visit the MixviR Google Group, and post your issue if it hasn't yet been addressed there.

References

Danecek P, Bonfield JK, et al. Twelve years of SAMtools and BCFtools. Gigascience (2021) 10(2):giab008

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43:491-498.

Knaus BJ, Grünwald NJ (2017). “VCFR: a package to manipulate and visualize variant call format data in R.” Molecular Ecology Resources, 17(1), 44-53.

H. Pagès, P. Aboyoun, R. Gentleman and S. DebRoy (2019). Biostrings: Efficient manipulation of biological strings. R package version 2.52.0.

#Note: Example data are from...
#0PD00000 April 18, May 2, Aug 15


Try the MixviR package in your browser

Any scripts or data that you put into this service are public.

MixviR documentation built on Oct. 23, 2022, 1:09 a.m.