In PoonLab/highlineR: A Novel Visualization Tool to Assess Diversity in NGS Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/CONTRIBUTING-",
  out.width = "100%",
  dpi = 300,
  fig.align = "center"
)

Import

HighlineR accepts aligned nucleotide or amino acid NGS data in the FASTA format as input. Aligned sequences stored in FASTQ format can also be converted and read by the program. In theory, data generated using any NGS instrument and methodology can be input to highlineR as long as the sequence alignment is supplied in an accepted file format. There is no limit on the number of sequences that a file may contain nor on the number of files that may be input.

The import function accepts three types of input:

A path to a single NGS file,
A path to a directory containing NGS file(s), or
A vector of paths to NGS file(s).

For example, let's import a FASTA file to analyze using highlineR. (Data adapted from Ferrandiz-Rovira et. al. available here)

library(highlineR)
highlineR.session <- init_session() # create session
base <- system.file('extdata', package = "highlineR")
fn <- paste0(base, "/sample.aligned.fa")

# Default import
import_raw_seq(fn)

highlineR recognizes standard file extensions (.fasta, .fa, .fas, .fastq, .fq). If a nonstandard file extension is being used (eg .txt) then the file format can be specified as follows:

# Import NGS files with non-standard file extensions
import_raw_seq("/sample.txt", datatype = "fasta")

The file we are using contains nucleotide sequences. If the file of interest contains amino acid sequences instead then run:

# Import amino acid data
import_raw_seq("/sample.aligned.fa", seqtype = "amino acid")

The import_raw_seq() function creates an S3 object called a Data object with the following structure:

# Classes of Data objects
str(highlineR.session[[fn]])

# Attributes of Data objects
ls(highlineR.session[[fn]])

This means that the classes of a Data object provide information about the data type of the NGS file and the sequence type of the sequences contained in the file.

Data objects are comprised of the following attributes:

path is the absolute path to the NGS file represented by the object,
raw_seq is a list that will contain the parsed sequences of the NGS file,
compressed is an S3 object that inherits from R's base environment class and will contain the compressed sequences of the NGS file,
sample is a compressed environment that will contain randomly sampled sequences from the the compressed structure,
master is a character string representing the master sequence to which other sequences will be compared during plotting, and
seq_diff is a character matrix that will contain sequence differences in variant sequences compared to the master sequence.

The import_raw_seq() function also creates another S3 object called a session object that acts as a parent container to hold multiple Data objects for later plotting. This functionality allows multiple NGS files to be added to a single session by unique function calls for later batch processing.

In the above example, we first created an empty session using the init_session() function. This function returns the created session object which must be stored by the user for further use. In the above example, the session was named highlineR.session. This is the default name expected by the program. Following this, if the session argument is not provided, Data objects are added to the highlineR.session object by default.

To override this default behaviour, the session created using init_session() can be stored using a different variable name Data objects can be imported into that session, using the session argument. Alternatively, the result returned by the import_raw_seq() function can be stored without a direct call to init_session():

# Import into initialized session
my_session <- init_session()
import_raw_seq(fn, session = "my_session")

# Import into uninitialized session
my_session <- import_raw_seq(fn) # To create session using only one import call, the session argument is not needed
# To add more Data objects to the custom session, you must specify the session argument
import_raw_seq(paste0(base, "/sample.fa"), session = my_session)

Parsing

Following data import, single Data objects or complete session objects can be parsed as follows:

# Parse single Data object
# parse_raw_seq(highlineR.session$`./data/sample.aligned.fa`)

# Parse entire session
parse_raw_seq(highlineR.session)

The parse_raw_seq() function stream reads NGS files and populates the raw_seq lists of Data objects with (header, sequence) tuples.

# Strucure of the parsed file
head(highlineR.session[[fn]]$raw_seq, n=3)

Compression

After parsing, Data or session objects will be compressed into a format reminiscent of a dictionary structure of key:value pairs. In this structure, unique sequences serve as keys with read counts as values. Data or session objects can be compressed as follows:

# Compression
compress(highlineR.session, unique = TRUE)

The unique flag specifies that the NGS file that we are working with has been processed so that it only contains unique sequences with read counts stored at the end of the sequence labels. If read counts should be inferred from the number of distinct instances of sequences in the file, then this flag should be FALSE.

The compress() function populates the compressed environments of Data objects.

# Number of unique sequences in compressed structure
c_len <- length(highlineR.session[[fn]]$compressed)
c_len

# Most abundant sequences in compressed structure
sort(highlineR.session[[fn]]$compressed)[c_len:(c_len-2), , drop = F]

The N and M arguments to the compress() function are used to determine the behaviour of the sampling functions described below which populate the sample environments of Data objects.

Sampling

In an NGS experiment, if the number of input sequences is much greater than the number of templates (nucleic acids) available during the sequencing reaction, then the frequency distribution of variants produced may be skewed. In an attempt to accommodate for this phenomenon, highlineR utilizes a read_sample() function which randomly samples N reads with replacement, M times. The average of the resulting frequencies of variants present in at least 85% of the replicates is then used as the expected variant frequencies.

Sampling is performed automatically during compression. A dataset can be resampled without re-compressing using the resample() function.

# Sample created during compression
# Number of sequences in sample environment
s_len <- length(highlineR.session[[fn]]$sample)
s_len

# Most abundance sequences in sample environment
sort(highlineR.session[[fn]]$sample)[s_len:(s_len-2), , drop = F]

# Resampling
resample(highlineR.session[[fn]], 
         N = c_len * 0.75,
         M = 10)

# New number of sequences in sample environment
new_s_len <- length(highlineR.session[[fn]]$sample)
new_s_len

# Most abundance sequences in new sample environment
sort(highlineR.session[[fn]]$sample)[new_s_len:(new_s_len-2), , drop = F]

Plotting

The highlineR plot() function has several parameters:

x: A Data object or session object containing Data objects to plot.
mode: The desired mutation annotation for plotting. Options: "mismatch" (default), "svn" (Synonymous versus Non-Synonymous), "tvt" (Transition versus Transversion).
master: The sequence to which other sequences should be compared. By default, the most abundant sequence is selected.
sort_by: How should the sequences be ordered in a plot? Options: "similarity" (default), "frequency".
rf: Which reading frame should be used when determining Synonymous vs Non-Synonymous mutations? Options: 1 (default), 2, 3.
use_sample: Which environment should be plotted? If TRUE, then the sample environment is plotted. If FALSE, the complete compressed environment is plotted.

Using the default parameters, the most abundant sequence is selected as the master sequences. Variant sequences in the sample environment are compared to this master to identify their compositional differences. These differences are plotted such that mismatches are highlighted as coloured lines and the variants are ordered according to their similarity (number of mismatches) to the master.

# Plot session with variants sorted by similarity to master
plot(highlineR.session)

Variants can also be sorted by frequency rather than similarity:

# Plot session with variants sorted by relative abundance
plot(highlineR.session, sort_by = "frequency")

Finally, rather than simply plotting mismatches, mutations can be annotated as synonymous and non-synonymous mutations or as transitions and transversions as follows:

# Plot synonymous and non-synonymous mutations
plot(highlineR.session, mode = "svn")

# Plot transitions and transversions
plot(highlineR.session, mode = "tvt")