Home

/

CRAN

/

prewas

/

Getting Started With prewas
In prewas: Data Pre-Processing for Bacterial Genome-Wide Association Studies

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction to prewas

prewas is a tool to standardize the pre-processing of your genomic data before performing a bacterial genome-wide association study (bGWAS). Currently, prewas can pre-process single nucleotide variants (SNVs) and small insertions and deletions (indels).

prewas creates a variant matrix (where each row is a variants, each column is a sample, and the entries are presence - 1 - or absence - 0 - of the variants) that can be used as input for bGWAS tools, such as hogwash or treeWAS.

When creating the binary variant matrix, prewas can perform 3 pre-processing steps including:

dealing with multiallelic variants,
(optional) dealing with variants in overlapping genes, and
choosing a reference allele.

These 3 steps in the prewas workflow are shown below. (B) shows how prewas handles multiallelic sites, (C) shows options for choosing a reference allele and (D) shows how prewas can group variants into genes. A detailed workflow indicating required file inputs can be seen below in the "Detailed prewas workflow" section.

prewas workflow {width=70%}

1. Dealing with multiallelic variants

A multiallelic site is a site with more than two alleles present at a locus. Most bGWAS methods require a binary input (variants presence or absence) - so multiallelic sites don't fit into this mold.

One could consider the reference allele 0 and any alternative allele as 1. However, each alternative allele could confer a different functional impact on the expressed protein. Thus, lumping the alternative alleles into one category will take away the ability to study these functional differences.

prewas uses the multi-line format to represent mutliallelic variants. Below are examples of how a triallelic site (T, A, and G) can be represented in a single line of a variant matrix compared to multiple lines of a variant matrix.

Single-line format:

T -> A, G

Multi-line format:

T->A
T->G

2. Dealing with variants in overlapping genes

A variants in an overlapping gene could have a different impact on those two genes. Therefore, when a gff is provided, prewas will output a binary variant matrix where a variants in x overlapping genes will be represented on x lines. When a gff is provided, prewas will also output a gene-based variant matrix indicating presence or absence of at least 1 variants in that gene, and variants in overlapping genes will be assigned to both genes.

3. Choosing a reference allele

A reference allele, the allele that will be denoted 0 in a binary matrix could be:

the allele in a reference genome
the major allele at each position
the ancestral allele at each position

Choosing a reference allele can be particularly important when doing gene-based analysis and therefore aggregating variants by gene. We suggest referencing to the ancestral allele for evolutionary interpretability. In cases where ancestral reconstruction is not feasible (e.g. computational intensity) or low confidence, we suggest referencing to the major allele instead of referencing to an arbitrary reference genome because using the major allele leads to less masking of variation at the gene level.

Reference to the major allele to avoid masking variation at the gene level {width=70%}

When ancestral reconstruction is not used (anc = FALSE), prewas will use the major allele as the reference allele.

Outputs:

prewas can output matrices for use with both variants-based bGWAS and gene-based bGWAS.

Usage

At minimum, prewas requires a .vcf, but can also take in a phylogenetic tree, the name of the outgroup in that tree (if any), and a .gff for use with gene-based analysis.

Below, we'll explore some usage examples using toy data built into the package.

library(prewas)
vcf = prewas::vcf
gff = prewas::gff
tree = prewas::tree
outgroup = prewas::outgroup

Minimal prewas run:

Inputs:

required vcf file

Specifications:

No ancestral reconstruction

results_minimal = prewas(dna = vcf, 
                         anc = FALSE)

Will output a list containing:

allele_mat: An allele matrix, created from the vcf where each multiallelic site will be on its own line. The rowname will be the position of the variants in the vcf file. If the position is triallelic, there will be two rows containing the same information. The rows will be labeled "pos" and "pos.1". If the position is quadallelic, there will be three rows containing the same information. The rows will be labeled "pos", "pos.1", and "pos.2".
bin_mat: A binary matrix, the same dimensions as the allele matrix and with corresponding row names, where 0 is the reference allele and 1 indicates a variants. The reference allele is the major allele (because anc = FALSE).
ar_results: Will indicate the allele used as the reference allele (in this case, the major allele). The number of rows is the length of the number of rows of bin_mat.
dup: An index that identifies duplicated rows. If the index is unique (appears once), that means it is not a multiallelic site. If the index appears more than once, that means the row was replicated x times, where x is the number of alternative alleles.

table(results_minimal$dup)

gene_mat: NULL; Because no information is provided about genes (that is, a gff file), there will be no gene matrix generated. This also means variants that appear in more than 1 overlapping gene will not be split into multiple lines.
tree: NULL; Because ancestral reconstruction is not needed no tree was generated.

Maximal prewas run:

Inputs:

required vcf file
tree
string of the outgroup
gff file

Specifications:

Conduct ancestral reconstruction

results_maximal = prewas(dna = vcf, 
                         tree = tree, 
                         outgroup = outgroup, 
                         gff = gff,
                         anc = TRUE)

Will output:

allele_mat: Nearly identical to results_minimal, but now with the column corresponding to the outgroup removed.

dim(results_maximal$allele_mat)

head(results_maximal$allele_mat,10)

bin_mat: A binary matrix, where 0 is the reference allele and 1 indicates a variants. The reference allele is the ancestral allele (because anc = TRUE). The dimensions do not match the allele_mat, because variants in overlapping genes are represented on multiple lines. Position and locus tag name is provided in the rowname. Again, the column corresponding to the outgroup is removed.

dim(results_maximal$bin_mat)

head(results_maximal$allele_mat)

head(results_maximal$bin_mat, 10)

ar_results: Will indicate the allele used as the reference allele (in this case, the ancestral allele). Because ancestral reconstruction was used, the probability of the ancestral alleles is also reported. The number of rows is the same as the number of rows of allele_mat.

head(results_maximal$ar_results)

dup: Same format as results_minimal. Multiple indices indicates multiallelic site split, but not overlapping genes split.
gene_mat: Because a gff file was provided, a gene matrix is generated. Each row is a gene, not a variants.

head(results_maximal$gene_mat)

tree: Returns the phylo object that was provided after it has been rooted to the provided outgroup and the outgroup was removed.

results_maximal$tree

No tree?

If you do not have a tree, a neighbor-joining tree will be generated and ancestral reconstruction can be conducted. We recommend using more sophisticated methods for tree building (e.g. maximum likelihood or Bayesian methods) but for an initial analysis, we have provided the option to create a neighbor-joining tree.

results_no_tree = prewas(dna = vcf, 
                         gff = gff,
                         anc = TRUE)

results_no_tree returns similar results to results_maximal but with one critical difference in the tree object: tree: this phylogenetic tree was built so that ancestral reconstruction could be performed for the variants referencing step. Note that this tree has one more tip than results_maximal$tree because an outgroup was provided with results_maximal which was dropped after being used to root that tree. For results_no_tree the generated tree was midpoint rooted & no outgroup was provided so no tips were dropped.

results_no_tree$tree

Want to filter variants by predicted functional impact prior to your gene-based analysis?

To filter variants by predicted functional impact prior to a gene-based analysis, you can provide prewas with a multivcf with SnpEff annotations. To merge individual SnpEff vcf files into a single multivcf, you can use bcftools merge.

# from the command line
bcftools merge -i ANN:join -m both -o out_multivcf.vcf -O *.vcf.gz

results_snpeff = prewas(dna = snpeff_vcf, anc = FALSE)
names(results_snpeff$gene_mat)

The default is to output the same variant binary matrix, and 5 different binary gene matrices: one for each snpeff impact (low, moderate, high, modifier), and one with all of the variants (all). There is also a custom output that will be NULL with the default options. If you would like to create a custom grouping (e.g. moderate and high), you can provide a vector to snpeff_grouping (e.g. c('MODERATE','HIGH')). In this case, the custom output will keep those specific variants types and no others when grouping by gene.

results_snpeff_custom = prewas(dna = snpeff_vcf, snpeff_grouping = c('MODERATE','HIGH'), anc = FALSE)
names(results_snpeff_custom$gene_mat)
head(results_snpeff_custom$gene_mat$gene_mat_custom)

Want to group by minor allele?

If you want to group by minor allele to increase power by grouping rarer alleles together. This feature re-collapses multiallelic sites for each locus. This feature is analagous to grouping by gene, but now by site instead.

If you are additionally using SnpEff annotations, you can appropriately group variants with the indicated annotation by site.

results_ma = prewas(dna = vcf, grp_nonref = TRUE, anc = FALSE)

Detailed prewas workflow

A detailed visualization of the prewas workflow is shown below.

A detailed prewas workflow {width=50%}

Any scripts or data that you put into this service are public.

prewas documentation built on April 2, 2021, 5:06 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

prewas
Data Pre-Processing for Bacterial Genome-Wide Association Studies

Getting Started With prewas
In prewas: Data Pre-Processing for Bacterial Genome-Wide Association Studies

Introduction to prewas

1. Dealing with multiallelic variants

Single-line format:

Multi-line format:

2. Dealing with variants in overlapping genes

3. Choosing a reference allele

Outputs:

Usage

Minimal prewas run:

Inputs:

Specifications:

Will output a list containing:

Maximal prewas run:

Inputs:

Specifications:

Will output:

No tree?

Want to filter variants by predicted functional impact prior to your gene-based analysis?

Want to group by minor allele?

Detailed prewas workflow

Try the prewas package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

prewas Data Pre-Processing for Bacterial Genome-Wide Association Studies

Getting Started With prewas In prewas: Data Pre-Processing for Bacterial Genome-Wide Association Studies

Introduction to prewas

1. Dealing with multiallelic variants

Single-line format:

Multi-line format:

2. Dealing with variants in overlapping genes

3. Choosing a reference allele

Outputs:

Usage

Minimal prewas run:

Inputs:

Specifications:

Will output a list containing:

Maximal prewas run:

Inputs:

Specifications:

Will output:

No tree?

Want to filter variants by predicted functional impact prior to your gene-based analysis?

Want to group by minor allele?

Detailed prewas workflow

Try the prewas package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

prewas
Data Pre-Processing for Bacterial Genome-Wide Association Studies

Getting Started With prewas
In prewas: Data Pre-Processing for Bacterial Genome-Wide Association Studies