Getting Started With prewas

  collapse = TRUE,
  comment = "#>"

Introduction to prewas

prewas is a tool to standardize the pre-processing of your genomic data before performing a bacterial genome-wide association study (bGWAS). Currently, prewas can pre-process single nucleotide variants (SNVs) and small insertions and deletions (indels).

prewas creates a variant matrix (where each row is a variants, each column is a sample, and the entries are presence - 1 - or absence - 0 - of the variants) that can be used as input for bGWAS tools, such as hogwash or treeWAS.

When creating the binary variant matrix, prewas can perform 3 pre-processing steps including:

  1. dealing with multiallelic variants,

  2. (optional) dealing with variants in overlapping genes, and

  3. choosing a reference allele.

These 3 steps in the prewas workflow are shown below. (B) shows how prewas handles multiallelic sites, (C) shows options for choosing a reference allele and (D) shows how prewas can group variants into genes. A detailed workflow indicating required file inputs can be seen below in the "Detailed prewas workflow" section.

prewas workflow{width=70%}

1. Dealing with multiallelic variants

A multiallelic site is a site with more than two alleles present at a locus. Most bGWAS methods require a binary input (variants presence or absence) - so multiallelic sites don't fit into this mold.

One could consider the reference allele 0 and any alternative allele as 1. However, each alternative allele could confer a different functional impact on the expressed protein. Thus, lumping the alternative alleles into one category will take away the ability to study these functional differences.

prewas uses the multi-line format to represent mutliallelic variants. Below are examples of how a triallelic site (T, A, and G) can be represented in a single line of a variant matrix compared to multiple lines of a variant matrix.

Single-line format:

  1. T -> A, G

Multi-line format:

  1. T->A

  2. T->G

2. Dealing with variants in overlapping genes

A variants in an overlapping gene could have a different impact on those two genes. Therefore, when a gff is provided, prewas will output a binary variant matrix where a variants in x overlapping genes will be represented on x lines. When a gff is provided, prewas will also output a gene-based variant matrix indicating presence or absence of at least 1 variants in that gene, and variants in overlapping genes will be assigned to both genes.

3. Choosing a reference allele

A reference allele, the allele that will be denoted 0 in a binary matrix could be:

  1. the allele in a reference genome
  2. the major allele at each position
  3. the ancestral allele at each position

Choosing a reference allele can be particularly important when doing gene-based analysis and therefore aggregating variants by gene. We suggest referencing to the ancestral allele for evolutionary interpretability. In cases where ancestral reconstruction is not feasible (e.g. computational intensity) or low confidence, we suggest referencing to the major allele instead of referencing to an arbitrary reference genome because using the major allele leads to less masking of variation at the gene level.

Reference to the major allele to avoid masking variation at the gene level{width=70%}

When ancestral reconstruction is not used (anc = FALSE), prewas will use the major allele as the reference allele.


prewas can output matrices for use with both variants-based bGWAS and gene-based bGWAS.


At minimum, prewas requires a .vcf, but can also take in a phylogenetic tree, the name of the outgroup in that tree (if any), and a .gff for use with gene-based analysis.

Below, we'll explore some usage examples using toy data built into the package.

vcf = prewas::vcf
gff = prewas::gff
tree = prewas::tree
outgroup = prewas::outgroup

Minimal prewas run:


  1. required vcf file


results_minimal = prewas(dna = vcf, 
                         anc = FALSE)

Will output a list containing:


Maximal prewas run:


  1. required vcf file
  2. tree
  3. string of the outgroup
  4. gff file


results_maximal = prewas(dna = vcf, 
                         tree = tree, 
                         outgroup = outgroup, 
                         gff = gff,
                         anc = TRUE)

Will output:

head(results_maximal$bin_mat, 10)

No tree?

If you do not have a tree, a neighbor-joining tree will be generated and ancestral reconstruction can be conducted. We recommend using more sophisticated methods for tree building (e.g. maximum likelihood or Bayesian methods) but for an initial analysis, we have provided the option to create a neighbor-joining tree.

results_no_tree = prewas(dna = vcf, 
                         gff = gff,
                         anc = TRUE)

results_no_tree returns similar results to results_maximal but with one critical difference in the tree object: tree: this phylogenetic tree was built so that ancestral reconstruction could be performed for the variants referencing step. Note that this tree has one more tip than results_maximal$tree because an outgroup was provided with results_maximal which was dropped after being used to root that tree. For results_no_tree the generated tree was midpoint rooted & no outgroup was provided so no tips were dropped.


Want to filter variants by predicted functional impact prior to your gene-based analysis?

To filter variants by predicted functional impact prior to a gene-based analysis, you can provide prewas with a multivcf with SnpEff annotations. To merge individual SnpEff vcf files into a single multivcf, you can use bcftools merge.

# from the command line
bcftools merge -i ANN:join -m both -o out_multivcf.vcf -O *.vcf.gz
results_snpeff = prewas(dna = snpeff_vcf, anc = FALSE)

The default is to output the same variant binary matrix, and 5 different binary gene matrices: one for each snpeff impact (low, moderate, high, modifier), and one with all of the variants (all). There is also a custom output that will be NULL with the default options. If you would like to create a custom grouping (e.g. moderate and high), you can provide a vector to snpeff_grouping (e.g. c('MODERATE','HIGH')). In this case, the custom output will keep those specific variants types and no others when grouping by gene.

results_snpeff_custom = prewas(dna = snpeff_vcf, snpeff_grouping = c('MODERATE','HIGH'), anc = FALSE)

Want to group by minor allele?

If you want to group by minor allele to increase power by grouping rarer alleles together. This feature re-collapses multiallelic sites for each locus. This feature is analagous to grouping by gene, but now by site instead.

If you are additionally using SnpEff annotations, you can appropriately group variants with the indicated annotation by site.

results_ma = prewas(dna = vcf, grp_nonref = TRUE, anc = FALSE)

Detailed prewas workflow

A detailed visualization of the prewas workflow is shown below.

A detailed prewas workflow{width=50%}

Try the prewas package in your browser

Any scripts or data that you put into this service are public.

prewas documentation built on April 2, 2021, 5:06 p.m.