knitr::opts_chunk$set(echo = TRUE)
devtools::load_all(quiet = TRUE)

Introduction

CHIIMP (Computational, High-throughput Individual Identification through Microsatellite Profiling) is a program to analyze microsatellite (short tandem repeat) DNA sequence data, producing genotypes from raw data and automating some typical analysis tasks.

CHIIMP runs as a standalone tool, but is built as an R language package. All functionality of the standalone program can be accessed from functions within R, and the reporting and visualization functions are designed to integrate well with RStudio and R Markdown.

This document mostly focuses on CHIIMP as a standalone tool. For more information on the use of specific functions within R, also see the built-in package documentation.

Installation

First install R and RStudio, which will supply most software dependencies for CHIIMP. Once these are installed, follow the specific instructions below for your operating system. In all three cases CHIIMP performs an analysis when a configuration file is dragged and dropped onto the desktop icon; there is no interactive interface via the icon, though the R package can be used interactively. See the Usage section for more information.

Windows

On Windows, double-click the install_windows.cmd script. This will install the package and R dependencies, and create a desktop shortcut.

Mac OS

On Mac OS, right-click (control+click) the install_mac.command shell script, select "Open," and also click "Open" in the window that appears to confirm that you really do want to open it. (Apple has specific instructions about these security precautions here.) This will automatically install the package along with R dependencies and create a desktop alias.

If a window appears recommending installation of the Mac OS command-line developer tools, go ahead and install them. After that you'll probably need to re-run the CHIIMP installer to finish the install.

Linux

On Linux, run the install_linux.sh shell script to automatically install the package along with R dependencies. An icon for the program is created at $HOME/Desktop/CHIIMP.desktop. Specific usage of the desktop icon will depend on the desktop environment in use. (The CHIIMP.desktop text file references the installed chiimp executable, and supplies the config file as a command-line argument when dragged and dropped onto the icon.)

Input Data Organization

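The information CHIIMP uses during analysis is described in the subsections below: sequence files, dataset sample attributes, locus attributes, and, optionally, known individuals and named alleles.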

The spreadsheets are in comma-separated (CSV) format. Column names are important but not column order. Extra columns are imported as-is but are otherwise ignored.

Sequence Files

The sequence files must contain sequences that span complete microsatellites. No assembly is performed to handle fragments of microsatellites, and the lengths of sequences identified as alleles are reported as-is. (An implicit assumption throughout the analysis is that any candidate allele sequence begins and ends with conserved regions corresponding to the PCR primers used, and the forward primer sequence is one of the filtering criteria during analysis.)

Dataset Sample Attributes

The description of the samples to be analyzed can be provided in a spreadsheet, or automatically loaded from the data file names. An example spreadsheet:

| Filename           | Replicate   | Sample   | Locus   |
|:------------------:|:-----------:|:--------:|:-------:|
| 100-1-A.fastq.gz   | 1           | 100      | A       |
| 100-2-A.fastq.gz   | 2           | 100      | A       |
| 100-1-B.fastq.gz   | 1           | 100      | B       |
| 100-2-B.fastq.gz   | 2           | 100      | B       |
| 100-1-1.fastq.gz   | 1           | 100      | 1       |
| 100-2-1.fastq.gz   | 2           | 100      | 1       |
| 100-1-2.fastq.gz   | 1           | 100      | 2       |
| 100-2-2.fastq.gz   | 2           | 100      | 2       |
| 101-1-A.fastq.gz   | 1           | 101      | A       |
| 101-2-A.fastq.gz   | 2           | 101      | A       |
| 101-1-B.fastq.gz   | 1           | 101      | B       |
| 101-2-B.fastq.gz   | 2           | 101      | B       |
| 101-1-1.fastq.gz   | 1           | 101      | 1       |
| 101-2-1.fastq.gz   | 2           | 101      | 1       |
| 101-1-2.fastq.gz   | 1           | 101      | 2       |
| 101-2-2.fastq.gz   | 2           | 101      | 2       |

These columns are required for each entry:

For simple cases that have a one-to-one match between sequence files and sample/locus combinations, and with descriptive filenames following a consistent pattern, the dataset table can be created automatically at run-time. See the Usage section for more information.
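As a rough illustration of how such a table could be derived from file names, the sketch below parses names that follow the Sample-Replicate-Locus pattern seen in the example spreadsheet. The regular expression and the variable names are assumptions made for this example; it is not CHIIMP's own implementation.

```r
# Hypothetical file names following the Sample-Replicate-Locus pattern
# from the example table above.
filenames <- c("100-1-A.fastq.gz", "100-2-A.fastq.gz", "101-1-2.fastq.gz")

# Assumed pattern: three hyphen-separated fields, then the file extension.
pattern <- "^([^-]+)-([^-]+)-([^-.]+)\\.fastq\\.gz$"
fields <- regmatches(basename(filenames), regexec(pattern, basename(filenames)))
fields <- do.call(rbind, fields)  # columns: full match, then the three fields

dataset <- data.frame(
  Filename = filenames,
  Replicate = as.integer(fields[, 3]),
  Sample = fields[, 2],
  Locus = fields[, 4],
  stringsAsFactors = FALSE
)
print(dataset)
```

This only works when every file name matches the pattern, which is why a spreadsheet is the safer choice for less regular naming schemes.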

Locus Attributes

The description of the loci should be given in a spreadsheet with loci on rows and attributes on columns. For example:

| Locus   | LengthMin   | LengthMax   | LengthBuffer   | Motif   | Primer            | ReversePrimer    |
|:-------:| -----------:| -----------:| --------------:|:-------:|:-----------------:|:----------------:|
| A       | 131         | 179         | 20             | TAGA    | TATCACTGGTGT...   | CACAGTTGTGTG...  |
| B       | 194         | 235         | 20             | TAGA    | AGTCTCTCTTTC...   | TAGGAGCCTGTG...  |
| 1       | 232         | 270         | 20             | TATC    | ACAGTCAAGAAT...   | CTGTGGCTCAAA...  |
| 2       | 218         | 337         | 20             | TCCA    | TTGTCTCCCCAG...   | TCTGTCATAAAC...  |

These columns are required:

Known Individuals (Optional)

If a spreadsheet of genotypes for known individuals is supplied, the analysis can attempt to match samples with the known genotypes automatically. For example:

| Name      | Locus   | Allele1Seq       | Allele2Seq       |
|:---------:|:-------:|:----------------:|:----------------:|
| CH001     | A       | ATTATCACTGG...   | ATTATCACTGG...   |
| CH001     | B       | TCAGTCTCTCT...   |                  |
| CH001     | 1       | AGACAGTCAAG...   | AGACAGTCAAG...   |
| CH001     | 2       | CTTTGTCTCCC...   | CTTTGTCTCCC...   |
| CH002     | A       | ATTATCACTGG...   | ATTATCACTGG...   |
| CH002     | B       | TCAGTCTCTCT...   | TCAGTCTCTCT...   |
| CH002     | 1       | AGACAGTCAAG...   |                  |
| CH002     | 2       | CTTTGTCTCCC...   | CTTTGTCTCCC...   |

The order of the alleles given is not important, and homozygous individuals may have Allele2Seq either left blank or set to a copy of Allele1Seq. The sequences should include the conserved regions before and after the repeats, including the PCR primer regions described above.

Named Alleles (Optional)

If a spreadsheet of allele names and sequences is supplied, the analysis will use those names in summary tables in the output report. For example:

| Locus   | Name        | Seq              |
|:-------:|:-----------:|:----------------:|
| A       | 200-a       | ATTATCACTGG...   |
| A       | 180-a       | ATTATCACTGG...   |
| A       | 180-b       | ATTATCACTGG...   |
| B       | 300-a       | ATTATCACTGG...   |
| B       | 305-a       | ATTATCACTGG...   |
| B       | 290-a       | ATTATCACTGG...   |

The software will automatically create short allele names for any identified allele not listed in the allele spreadsheet (or for all alleles if no spreadsheet is given).

The automatic names are the sequence length and a sequence-specific suffix separated by a hyphen, for example, "180-fdd1c6" for a 180 bp sequence with no assigned name and particular sequence content. Any other 180 bp sequence would receive a different suffix when the name is assigned.
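The naming scheme can be sketched as follows. This document does not say how the suffix is derived, so the sketch assumes the first six hex characters of an MD5 checksum of the sequence; the function name make_allele_name is hypothetical.

```r
# Hypothetical helper: name a sequence as "<length>-<suffix>".
# ASSUMPTION: the suffix is the first six hex characters of an MD5 checksum
# of the sequence text. The real derivation may differ.
make_allele_name <- function(seq) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(seq), tmp)  # write the sequence with no trailing newline
  suffix <- substr(unname(tools::md5sum(tmp)), 1, 6)
  paste(nchar(seq), suffix, sep = "-")
}

seq_a <- strrep("TAGA", 45)                  # 180 bp
seq_b <- paste0(strrep("TAGA", 44), "TAGG")  # 180 bp, different content
make_allele_name(seq_a)
make_allele_name(seq_b)
```

Both names start with "180-" but carry different suffixes. With a short checksum prefix like this, collisions between distinct sequences are unlikely but not impossible.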

Algorithm

CHIIMP breaks the genotyping process into two parts. First a sample file is de-replicated and a table of unique sequences is created, with no filtering yet applied. Second the table is filtered to just candidate allele sequences, and up to two sequences are reported as the genotype. Both the per-sequence table and the final genotypes are saved in the final output, as spreadsheets in the processed-files directory and as the summary.csv spreadsheet.

Sample Processing

The table of unique sequences includes basic information for each case: sequence content, length, and read counts observed. These are the Seq, Count, and Length columns. The sequences are ordered by count with the most abundant first.
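A minimal sketch of the de-replication step in base R, using a handful of made-up reads. The column names follow the text above; everything else is illustrative rather than the package's own code.

```r
# Made-up reads standing in for the contents of one sequence file.
reads <- c("TAGATAGATAGA", "TAGATAGA", "TAGATAGATAGA",
           "TAGATAGATAGA", "TAGATAGA", "TAGANAGA")

# Count each unique sequence, then order with the most abundant first.
counts <- table(reads)
seq_table <- data.frame(
  Seq = names(counts),
  Count = as.integer(counts),
  Length = nchar(names(counts)),
  stringsAsFactors = FALSE
)
seq_table <- seq_table[order(-seq_table$Count), ]
rownames(seq_table) <- NULL
print(seq_table)
```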

Additional columns associate each sequence with a particular locus, using the locus attributes described above. First each locus' forward primer (and optionally reverse as well) is compared with the sequence and the matching locus name is stored in a MatchingLocus column. The sequence is then checked for several tandem repeats of the motif for that locus, and compared to the length range expected for that locus. TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns respectively. The Ambiguous column marks any sequences containing bases outside of A, C, T, and G (such as N).
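The sketch below shows one way these columns could be filled in. It assumes exact forward-primer matching at the start of each sequence, treats "several" repeats as three, and uses shortened, made-up locus attributes; none of these choices are confirmed by this document.

```r
# Made-up, shortened locus attributes (real primers and lengths are longer).
locus_attrs <- data.frame(
  Locus = c("A", "B"),
  LengthMin = c(20, 20),
  LengthMax = c(30, 30),
  Motif = c("TAGA", "TATC"),
  Primer = c("ACGT", "GGCC"),
  stringsAsFactors = FALSE
)
seqs <- c(
  "ACGTTAGATAGATAGATAGACCTT",  # locus A, in range, four repeats
  "GGCCTATCTATCTATCAA",        # locus B, three repeats, too short
  "ACGTTAGANAGATAGATAGACCTT",  # locus A primer, contains an N
  "TTTTTTTTTTTTTTTTTTTTTT"     # no primer match
)
min_repeats <- 3  # ASSUMPTION: what counts as "several" tandem repeats

# Index of the first locus whose forward primer starts the sequence, or NA.
idx <- vapply(seqs, function(s) {
  hit <- which(startsWith(s, locus_attrs$Primer))
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

has_motif <- function(s, motif) {
  if (is.na(motif)) return(NA)
  grepl(strrep(motif, min_repeats), s, fixed = TRUE)
}

result <- data.frame(
  Seq = seqs,
  MatchingLocus = locus_attrs$Locus[idx],
  MotifMatch = mapply(has_motif, seqs, locus_attrs$Motif[idx],
                      USE.NAMES = FALSE),
  LengthMatch = nchar(seqs) >= locus_attrs$LengthMin[idx] &
    nchar(seqs) <= locus_attrs$LengthMax[idx],
  Ambiguous = grepl("[^ACGT]", seqs),
  stringsAsFactors = FALSE
)
print(result)
```

Sequences with no matching primer get NA in the locus-dependent columns here, which is a choice made for this sketch.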

The primer matching recognizes IUPAC ambiguity codes in the primer sequences (e.g. an "N" signifies any of A, C, T, or G) but not in the read sequences. By default no mismatches are allowed, but a maximum number of mismatches per comparison can be defined in the configuration. The default behavior will not modify the reads based on matched primers, but several read-modifying actions are available via the primer_action configuration option:

These can also be specified separately for just the forward or just the reverse primer with primer_action_fwd and primer_action_rev, respectively.
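To illustrate the one-sided treatment of IUPAC codes described above, the sketch below expands a primer into a regular expression and tests it against the start of a read. It covers only the default zero-mismatch case, and the function names primer_to_regex and matches_primer are hypothetical.

```r
# Standard IUPAC nucleotide codes, expanded to regular expression classes.
iupac <- c(
  A = "A", C = "C", G = "G", T = "T",
  R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]", K = "[GT]", M = "[AC]",
  B = "[CGT]", D = "[AGT]", H = "[ACT]", V = "[ACG]", N = "[ACGT]"
)

# Hypothetical helper: turn a primer into a regex anchored at the read start.
primer_to_regex <- function(primer) {
  codes <- strsplit(toupper(primer), "")[[1]]
  paste0("^", paste(iupac[codes], collapse = ""))
}

# Ambiguity codes are expanded in the primer only; the read is taken literally.
matches_primer <- function(read, primer) {
  grepl(primer_to_regex(primer), read)
}

matches_primer("ACGTTAGA", "ACNT")  # TRUE: N in the primer matches G
matches_primer("ANGTTAGA", "ACGT")  # FALSE: N in the read is not a wildcard
```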

PCR artifacts can obscure real allele sequences with incorrect sequences. There are extra filters to attempt to remove these if possible or highlight cases that may require further attention.

The sample data tables include "Stutter" and "Artifact" columns to mark entries that look like possible polymerase stutter or other artifacts of another sequence present at higher counts. For cases of potential polymerase stutter, the higher-count sequence is one motif repeat longer. For insertion/deletion/substitution artifacts the higher-count sequence is within 1 bp of the same length. In both cases the supposed artifact sequence is marked only if its read count is at most 1/3 that of the higher-count sequence. (This represents a trade-off between sensitivity and specificity, since genuine allele sequences may differ in length by one or even zero repeats, and read counts for pairs of alleles in a given sample can vary considerably.) Both columns store the row number of the higher-count sequence that an artifact may have originated from, if one is found. Note that relative sequence lengths and counts determine the outcome here, since sequence content for the artifacts is largely indistinguishable from real allele sequences.

Lastly, two ratios are stored for each sequence: its read count relative to the total reads in the sample (the FractionOfTotal column) and relative to the reads with the same MatchingLocus value (the FractionOfLocus column).
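The length and count rules above can be sketched on a small made-up table. The helper flag_source is hypothetical, the motif length of 4 bp is an assumption for the example, and when several higher-count rows qualify the sketch simply reports the first.

```r
# Made-up per-sequence table, ordered by count as described earlier.
seq_table <- data.frame(
  Count = c(900, 600, 200, 150, 40),
  Length = c(184, 176, 180, 183, 177)
)
motif_len <- 4      # ASSUMPTION: a 4 bp motif such as TAGA
ratio_max <- 1 / 3  # count threshold relative to the higher-count sequence

# Hypothetical helper: for each row, find the first other row whose length
# difference satisfies length_ok() and whose count is at least three times
# higher. Returns that row number, or NA if there is none.
flag_source <- function(tbl, length_ok) {
  vapply(seq_len(nrow(tbl)), function(i) {
    hits <- which(
      length_ok(tbl$Length - tbl$Length[i]) &
        tbl$Count[i] <= ratio_max * tbl$Count
    )
    if (length(hits) > 0) hits[1] else NA_integer_
  }, integer(1))
}

# Stutter: the higher-count sequence is exactly one motif repeat longer.
seq_table$Stutter <- flag_source(seq_table, function(d) d == motif_len)
# Artifact: the higher-count sequence is within 1 bp of the same length.
seq_table$Artifact <- flag_source(seq_table, function(d) abs(d) <= 1)
print(seq_table)
```

Here row 3 is flagged as possible stutter of row 1, while rows 4 and 5 are flagged as possible artifacts of rows 1 and 2 respectively. Rows 1 and 2 stay unflagged even though they have similar-length neighbors, because no other sequence has at least three times their read count.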
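A short sketch of the two fractions on made-up counts follows. Leaving FractionOfLocus as NA for sequences that match no locus is an assumption of this example.

```r
# Made-up counts: two sequences matching locus A and one matching no locus.
seq_table <- data.frame(
  Count = c(60, 30, 10),
  MatchingLocus = c("A", "A", NA),
  stringsAsFactors = FALSE
)

# Fraction of all reads in the sample.
seq_table$FractionOfTotal <- seq_table$Count / sum(seq_table$Count)

# Fraction of the reads assigned to the same locus.
locus_totals <- tapply(seq_table$Count, seq_table$MatchingLocus, sum)
seq_table$FractionOfLocus <- seq_table$Count /
  as.numeric(locus_totals[seq_table$MatchingLocus])
print(seq_table)
```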

This is the analyze_seqs function in the R package.

Genotype Calling

In the previous stage every single unique sequence for each data file was described in a table, but no filtering or genotyping occurred. Now just one or two candidate allele sequences are extracted from each table and reported as the genotype.

First, the table rows are restricted to just those matching the expected locus' primer, motif, and length range (using the MatchingLocus, MotifMatch, and LengthMatch columns). If the resulting total read count is below a minimum value (set by the min_locus_reads setting), no genotyping will be attempted. Next, only those sequences accounting for at least a minimum fraction of the remaining reads are considered (set by the min_allele_abundance setting). The defaults for both settings are given in the full configuration options list at the end of this document. Sequences that are marked as potential stutter or other artifacts (via the Stutter and Artifact columns of the table) or contain ambiguous sequence content (via the Ambiguous column) are excluded next.

After these filters are applied, the top one or two remaining sequences are labeled as the alleles. (If only one sequence remains, the sample is labeled homozygous; if two or more, heterozygous.) The final details kept for each sample are:

These tasks (the filtering and categorizing of each sequence in the table and the short genotype summary) are the analyze_sample and summarize_sample functions in the R package.
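The filtering sequence described in this section can be sketched as a single function over a per-sequence table. The function call_genotype is hypothetical, and the threshold values passed to it below (500 reads, 0.05) are illustration values only, not confirmed defaults.

```r
# Hypothetical sketch of the genotype-calling filters, in the order given
# in the text. Returns zero, one, or two allele sequences.
call_genotype <- function(tbl, locus, min_locus_reads, min_allele_abundance) {
  # 1. Keep rows matching the expected locus' primer, motif, and length range.
  keep <- (tbl$MatchingLocus == locus & tbl$MotifMatch & tbl$LengthMatch) %in% TRUE
  tbl <- tbl[keep, ]
  # 2. Give up if too few reads remain.
  if (sum(tbl$Count) < min_locus_reads) return(character(0))
  # 3. Keep sequences above the minimum fraction of the remaining reads.
  tbl <- tbl[tbl$Count / sum(tbl$Count) >= min_allele_abundance, ]
  # 4. Drop possible stutter, other artifacts, and ambiguous sequences.
  tbl <- tbl[is.na(tbl$Stutter) & is.na(tbl$Artifact) & !tbl$Ambiguous, ]
  # 5. Report the top one or two sequences by count.
  tbl <- tbl[order(-tbl$Count), ]
  head(tbl$Seq, 2)
}

# Made-up table with short labels in place of real sequences.
seq_table <- data.frame(
  Seq = c("alleleX", "alleleY", "stutterX", "rare", "offtarget"),
  Count = c(500, 400, 120, 10, 300),
  MatchingLocus = c("A", "A", "A", "A", "B"),
  MotifMatch = TRUE,
  LengthMatch = TRUE,
  Ambiguous = FALSE,
  Stutter = c(NA, NA, 1L, NA, NA),
  Artifact = NA_integer_,
  stringsAsFactors = FALSE
)

alleles <- call_genotype(seq_table, "A",
                         min_locus_reads = 500, min_allele_abundance = 0.05)
alleles
```

In this example the off-target, low-abundance, and stutter rows are removed in turn, leaving "alleleX" and "alleleY" as a heterozygous call. Raising min_locus_reads above the 1030 reads that match locus A would return no alleles at all.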

Summary and Reporting

The genotype and details identified in the previous step for each sample are aggregated into a spreadsheet with a row for each sample. This summary spreadsheet and the more detailed per-file and per-sample tables are all saved in the final output.

For inter-sample comparisons, the alleles identified across samples for each locus are aligned to one another. The genotypes for each sample are clustered by number of matching alleles, showing similarity between samples. If a spreadsheet of known genotypes was given, the sample genotypes are also compared to the known genotypes, with any close matches reported. If a Name column was provided with the sample definition table as well as a known genotypes spreadsheet, the known-correct genotypes will be paired with applicable samples and a column tracking the result of the genotyping (Correct, Incorrect, Blank, or Dropped Allele) will be added. A single report document summarizes the genotyping and these other details. See the Output Data Organization section below for more information on the output.

These steps are handled by the full_analysis function in the R package.
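As a loose illustration of clustering genotypes by matching alleles, the sketch below counts mismatched alleles between pairs of samples and passes the result to hierarchical clustering. The distance definition, the sample data, and the function names are all assumptions made for this example and may not match how the package computes its comparisons.

```r
# Made-up genotypes: two alleles per locus, with short labels as alleles.
genotypes <- list(
  S1 = list(A = c("a1", "a2"), B = c("b1", "b1")),
  S2 = list(A = c("a1", "a2"), B = c("b1", "b2")),
  S3 = list(A = c("a3", "a4"), B = c("b3", "b3"))
)

# Mismatched alleles at one locus, ignoring allele order (0, 1, or 2).
locus_mismatches <- function(x, y) {
  2 - max(sum(x == y), sum(x == rev(y)))
}

# Total mismatched alleles across all loci for a pair of genotypes.
genotype_distance <- function(g1, g2) {
  sum(vapply(names(g1), function(locus) {
    locus_mismatches(g1[[locus]], g2[[locus]])
  }, numeric(1)))
}

ids <- names(genotypes)
d <- sapply(ids, function(i) {
  sapply(ids, function(j) genotype_distance(genotypes[[i]], genotypes[[j]]))
})
print(d)

# Cluster samples on the mismatch counts and cut into two groups.
clusters <- cutree(hclust(as.dist(d)), k = 2)
clusters
```

S1 and S2 differ by a single allele and end up in the same cluster, while S3 shares no alleles with either.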

Usage

CHIIMP takes a configuration file as input and saves all output to a folder. The configuration file points to all of the input data described above, and specifies options for the analysis and output. All options have defaults, so the file may be very brief or even empty. The file format is a simple two-column CSV spreadsheet, with a "Key" column storing the names of configuration options and a "Value" column storing each associated value. (The YAML file used in previous CHIIMP versions is still supported, but the new CSV format should be easier to manage for those unfamiliar with YAML.)

For example, a configuration file might have just two entries, showing the spreadsheets to use for the samples and loci to analyze:

Key,Value
dataset,samples.csv
locus_attrs,locus_attrs.csv

The configuration file can be dragged and dropped onto the desktop icon created during installation.

For command-line usage, the configuration file can be given as the first argument to the R script installed with the package. (The location of the script can be shown in R with system.file("bin", "chiimp", package="chiimp").) To run the same analysis within R, pass a list of configuration options to the chiimp::full_analysis() function.
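For use within R, the two-entry example above can be turned into a named list with base R alone, as sketched below. The package calls are left commented out so the sketch runs without CHIIMP or any input data, and whether full_analysis accepts a list built exactly this way is an assumption.

```r
# The example configuration from above, as CSV text.
config_text <- "Key,Value
dataset,samples.csv
locus_attrs,locus_attrs.csv"

# Read the two-column table and convert it to a named list of options.
config_table <- read.csv(text = config_text, stringsAsFactors = FALSE)
config <- as.list(setNames(config_table$Value, config_table$Key))
str(config)

# With the package installed and the input files in place:
# results <- chiimp::full_analysis(config)
# Location of the command-line script, as noted above:
# system.file("bin", "chiimp", package = "chiimp")
```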

Example Configuration File

The text in the example configuration file included here shows a slightly more complex case:

# Show example configuration file text
fp <- "inst/example_config.csv"
config_example <- load_config(fp)
kableExtra::kable_styling(knitr::kable(config_example, booktabs = TRUE, linesep = ""))

Common Options

Below is a list of a few commonly-customized options. (See also the end of this document for a full list with all default settings.) For more information on the format of the spreadsheets listed here, see the "Input Data Organization" section above.

Output Data Organization

At the end of an analysis CHIIMP creates a directory of files with all results.

An additional file will be created if fp_rds is defined in the output setting of the configuration. This file contains all analysis results in a single R object using R's native data serialization format for easy post-analysis in R if desired.
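Such a file can be read back with base R's readRDS. The sketch below uses a temporary file and a made-up placeholder object so that it runs on its own; the real path comes from the fp_rds setting, and the structure of the real results object is not described here.

```r
# Stand-in for the path configured via fp_rds (hypothetical file).
rds_path <- tempfile(fileext = ".rds")

# Placeholder object standing in for saved analysis results.
placeholder <- list(summary = data.frame(Sample = "100", Locus = "A"))
saveRDS(placeholder, rds_path)

# Loading the file restores the object as it was saved.
results <- readRDS(rds_path)
names(results)
```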

These directory and file names are customizable in the output section of the configuration.

Full Configuration Options List

Configuration options and their defaults for this version of CHIIMP:

cfg_defaults <- subset(CFG_DEFAULTS, select = -c(Parser, OldName))
colnames(cfg_defaults)[colnames(cfg_defaults) == "Value"] <- "Default Value"
kableExtra::landscape(kableExtra::kable_styling(
  knitr::kable(cfg_defaults,
  booktabs = TRUE, linesep = ""), latex_options = "scale_down"))


ShawHahnLab/chiimp documentation built on Aug. 20, 2023, 1:41 a.m.