knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(MIPanalyzer)

This tutorial covers:

- Reading in data and understanding the mipanalyzer_biallelic format
- Filtering data based on coverage
- Filtering based on other criteria

Before doing anything, we will need to load some additional packages:

# load some packages
library(here)

We will be working from a vcf file stored in the inst/extdata folder of this package. Our first decision point is whether we want to read this vcf file in allowing for more than two alleles at any locus, or forcing the data to be bi-allelic. Many analyses assume bi-allelic SNPs, and this assumption also makes the data format more efficient, as we only need to encode the read counts and coverage as single numbers rather than storing multiple counts for each allele. Here we will assume bi-allelic data, but there will be a section below on the multi-allelic data format.

The vcf_to_mipanalyzer_biallelic() function can read in this file and convert it to our custom format. It accepts a path to a vcf file, or alternatively you can read the vcf in yourself and pass the resulting object to the function:

# define the path to your .vcf file
vcf_path <- here("inst/extdata", "DRC_pf3k_filtered.vcf")

# load as mipanalyzer_biallelic object
dat_biallelic <- vcf_to_mipanalyzer_biallelic(file = vcf_path)
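If you prefer the second option, a minimal sketch using the vcfR package might look like the following (the vcfR argument name is an assumption here; check ?vcf_to_mipanalyzer_biallelic for the exact interface):

# alternative: read the vcf in yourself, then convert
# NOTE: the 'vcfR' argument name is an assumption -- see the function help
my_vcf <- vcfR::read.vcfR(vcf_path)
dat_biallelic <- vcf_to_mipanalyzer_biallelic(vcfR = my_vcf)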

# take a peek
dat_biallelic

We can see that we are working with 113 samples and 2667 loci. This dataset is actually filtered down from Pf3k whole genome data using samples from the Democratic Republic of the Congo (DRC). Although this object is its own custom class, it is essentially a list:

class(dat_biallelic)

# look at sub-object names
names(dat_biallelic)

The coverage and counts objects are our main genetic data. coverage holds the total number of reads at each sample/locus, while counts holds the number of those reads supporting one of the two alleles (the count of the other allele can be recovered by subtracting from the coverage).

Both of these objects are matrices, with samples in rows and loci in columns:

dim(dat_biallelic$coverage)
dim(dat_biallelic$counts)
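Because the two matrices are aligned, the within-sample allele frequency at every sample/locus can be obtained by simple element-wise division. A minimal sketch (cells with zero coverage will produce NaN):

# within-sample frequency of the allele stored in counts
wsaf <- dat_biallelic$counts / dat_biallelic$coverage
wsaf[1:3, 1:3]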

The next object is the samples element. This is a data.frame containing the sample names from the vcf. You can add whatever columns you like to this data.frame, making it easy to keep track of meta-data alongside genetic data.

dim(dat_biallelic$samples)
head(dat_biallelic$samples)
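For example, we could tag every sample with its country of origin (a hypothetical column, purely for illustration):

# add a made-up meta-data column
dat_biallelic$samples$country <- "DRC"
head(dat_biallelic$samples)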

Next up is the loci element. This is a data.frame containing all the information in the "fix" element of the vcf:

dim(dat_biallelic$loci)
head(dat_biallelic$loci)
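These are the standard fixed fields of the vcf format, so we can pull out, for example, the chromosome, position and alleles of each locus:

# standard vcf fields for the first few loci
head(dat_biallelic$loci[, c("CHROM", "POS", "REF", "ALT")])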

Next is the filter_history element. This contains a running record of all filters that are applied to the data via the MIPanalyzer package. Currently, we have applied no filters and so what we see are the raw dimensions of the data. We will return to this object later:

dat_biallelic$filter_history

Finally, the vcfmeta element contains anything in the "meta" slot of the vcf. There may be a wide range of different information here, which is stored as a list. Here are just the first three elements in our example data:

dat_biallelic$vcfmeta[1:3]

In summary, the mipanalyzer_biallelic class is just a re-coding of the vcf. It exposes the coverage and counts data, while also providing convenient features for keeping track of meta-data and filters.

Filtering data based on coverage

We very often want to focus our attention on sites that have good coverage. Low coverage sites lead to uncertain estimates of within-sample allele frequencies, and may also be a sign of sequencing problems.

We could choose to filter by sample, or by locus. Loci that have systematically low coverage over many samples may indicate issues with sequencing. We can use the explore_filter_coverage_loci() function to explore this issue via the following steps:

1. Set a coverage threshold (min_coverage), e.g. 10x, and calculate the percentage of samples falling below this threshold at each locus.
2. Plot a histogram of this percent low coverage over all loci.
3. Set the maximum percent low coverage we are willing to accept (max_low_coverage), e.g. 25%. Any locus above this threshold will be dropped.

This gives an indication of how many loci we will lose if we apply this filter, and produces the following plot:

dat_biallelic |>
  explore_filter_coverage_loci(min_coverage = 10, max_low_coverage = 25)

We can see that we will lose 3.22% of loci if we apply this filter, which is acceptable. We can therefore go ahead and apply the filter:

dat_biallelic <- dat_biallelic |>
  filter_coverage_loci(min_coverage = 10, max_low_coverage = 25)
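We can sanity-check the result by hand: after filtering, no remaining locus should have more than 25% of samples below 10x coverage. A minimal sketch, assuming max_low_coverage is interpreted as a percentage of samples and missing coverage values are stored as NA:

# percent of samples below 10x coverage at each remaining locus
percent_low_loci <- colMeans(dat_biallelic$coverage < 10, na.rm = TRUE) * 100
max(percent_low_loci)  # should be no more than 25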

Next, we can perform the same task but looking at it from a sample perspective. This involves the following steps:

1. Using the same coverage threshold (min_coverage = 10), calculate the percentage of loci falling below this threshold for each sample.
2. Plot a histogram of this percent low coverage over all samples.
3. Set the maximum percent low coverage we are willing to accept (max_low_coverage). Any sample above this threshold will be dropped.

We obtain the following plot:

dat_biallelic |>
  explore_filter_coverage_samples(min_coverage = 10, max_low_coverage = 10)

Note from the shape of the histogram that we have good coverage for most samples, along with a few samples that perform poorly, with low coverage over many loci. We therefore set quite a stringent threshold of max_low_coverage = 10 to exclude these samples. We can see that we will lose 20.35% of samples; this is a substantial loss, but it leaves us with high-quality samples.

dat_biallelic <- dat_biallelic |>
  filter_coverage_samples(min_coverage = 10, max_low_coverage = 10)
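The same sanity check can be run from the sample side, under the same assumptions as above:

# percent of loci below 10x coverage within each remaining sample
percent_low_samples <- rowMeans(dat_biallelic$coverage < 10, na.rm = TRUE) * 100
max(percent_low_samples)  # should be no more than 10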

Every time we apply one of these filters, the change to data dimension is stored in the filter_history element along with the function call itself:

dat_biallelic$filter_history

We can see that we are left with 91 samples, down from 113 initially, and 2581 loci, down from 2667. Our proportion of missing data has dropped considerably through these filters.
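We can check the missingness claim directly, assuming missing calls are encoded as NA in the coverage matrix:

# overall proportion of missing data after filtering
mean(is.na(dat_biallelic$coverage))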

Filtering based on other criteria

Above, we filtered based on coverage, but we may also want to exclude loci/samples for other reasons. The filter_overcounts() function replaces any cell where the count exceeds the coverage with NA. By definition it should not be possible for a count to exceed the coverage, but this can come about due to upstream issues. Note that this filter does not drop any samples or loci; it just recodes some values as missing.

# deal with count>coverage issues
dat_biallelic <- dat_biallelic |>
  filter_overcounts()
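A quick check confirms the recoding worked:

# no remaining cell should have count greater than coverage
sum(dat_biallelic$counts > dat_biallelic$coverage, na.rm = TRUE)  # should be 0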

Now that we have discarded some samples, we may be left with some loci that are no longer variable in the samples that remain. We probably want to discard these as they do not contain useful information. We can do this via the filter_loci_invariant() function:

# drop loci that are no longer variable
dat_biallelic <- dat_biallelic |>
  filter_loci_invariant()
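To see what "invariant" means here, one natural definition is a locus where the within-sample allele frequency is 0 in every remaining sample, or 1 in every remaining sample. This is an assumption about the exact rule the function applies, but counting such loci gives a useful sketch:

# sketch: count loci showing no variation across the remaining samples
wsaf <- dat_biallelic$counts / dat_biallelic$coverage
n_invariant <- sum(apply(wsaf, 2, function(x) {
  x <- x[is.finite(x)]
  length(x) > 0 && (all(x == 0) || all(x == 1))
}))
n_invariant  # should be 0 after filtering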

Again, we can look at our filter history to see how these changes have impacted our data dimensions:

dat_biallelic$filter_history

