In ShawHahnLab/microsat: Computational, High-throughput Individual Identification through Microsatellite Profiling

knitr::opts_chunk$set(echo = TRUE)
devtools::load_all(quiet = TRUE)

Introduction

CHIIMP (Computational, High-throughput Individual Identification through Microsatellite Profiling) is a program to analyze microsatellite (short tandem repeat) DNA sequence data, producing genotypes from raw data and automating some typical analysis tasks.

CHIIMP runs as a standalone tool, but is built as an R language package. All functionality of the standalone program can be accessed from functions within R, and the reporting and visualization functions are designed to integrate well with RStudio and R Markdown.

This document mostly focuses on CHIIMP as a standalone tool. For more information on the use of specific functions within R, also see the built-in package documentation.

Installation

First install R and RStudio, which will supply most software dependencies for CHIIMP. Once these are installed, follow the specific instructions below for your operating system. In all three cases CHIIMP performs an analysis when a configuration file is dragged and dropped onto the desktop icon; there is no interactive interface via the icon, though the R package can be used interactively. See the Usage section for more information.

Windows

On Windows, double-click the install_windows.cmd script. This will install the package and R dependencies, and create a desktop shortcut.

Mac OS

On Mac OS, right-click (control+click) the install_mac.command shell script, select "Open," and also click "Open" in the window that appears to confirm that really do want to open it. (Apple has specific instructions about these security precautions here.) This will automatically install the package along with R dependencies and create a desktop alias.

If a window appears recommending installation of the Mac OS command-line developer tools, go ahead and install them. After that you'll probably need to re-run the CHIIMP installer again to finish the install.

Linux

On Linux, run the install_linux.sh shell script to automatically install the package along with R dependencies. An icon for the program is created at $HOME/Desktop/CHIIMP.desktop. Specific usage of the desktop icon will depend on the desktop environment in use. (The CHIIMP.desktop text file references the installed chiimp executable, and supplies the config file as a command-line argument when dragged and dropped onto the icon.)

Input Data Organization

The information CHIIMP uses during analysis is:

Sequence files containing complete microsatellite sequences (FASTA or FASTQ, plain text or gzip-compressed)
Spreadsheet of dataset sample attributes
Spreadsheet of locus attributes
Spreadsheet of known individuals (optional)
Spreadsheet of named alleles (optional)

The spreadsheets are in comma-separated (CSV) format. Column names are important but not column order. Extra columns are imported as-is but are otherwise ignored.

Sequence Files

The sequence files must contain sequences that span complete microsatellites. No assembly is performed to handle fragments of microsatellites, and the lengths of sequences identified as alleles are reported as-is. (An implicit assumption throughout the analysis is that any candidate allele sequence begins and ends with conserved regions corresponding to the PCR primers used, and the forward primer sequence is one of the filtering criteria during analysis.)

Dataset Sample Attributes

The description of the samples to be analyzed can be provided in a spreadsheet, or automatically loaded from the data file names. An example spreadsheet:

| Filename | Replicate | Sample | Locus | |:------------------:|:-----------:|:--------:|:-------:| | 100-1-A.fastq.gz | 1 | 100 | A | | 100-2-A.fastq.gz | 2 | 100 | A | | 100-1-B.fastq.gz | 1 | 100 | B | | 100-2-B.fastq.gz | 2 | 100 | B | | 100-1-1.fastq.gz | 1 | 100 | 1 | | 100-2-1.fastq.gz | 2 | 100 | 1 | | 100-1-2.fastq.gz | 1 | 100 | 2 | | 100-2-2.fastq.gz | 2 | 100 | 2 | | 101-1-A.fastq.gz | 1 | 101 | A | | 101-2-A.fastq.gz | 2 | 101 | A | | 101-1-B.fastq.gz | 1 | 101 | B | | 101-2-B.fastq.gz | 2 | 101 | B | | 101-1-1.fastq.gz | 1 | 101 | 1 | | 101-2-1.fastq.gz | 2 | 101 | 1 | | 101-1-2.fastq.gz | 1 | 101 | 2 | | 101-2-2.fastq.gz | 2 | 101 | 2 |

These columns are required for each entry:

Filename: The name of the sequence file to analyze. Note that if samples were multiplexed on the sequencer by pooling PCR products for multiple loci, filenames can be repeated here; just vary the text in the Locus column across rows. The analysis will use the forward PCR primer (see Locus Attributes below) to select just the sequences matching each locus as needed.
Replicate: An identifier for a repeated case of the same biological sample. If not applicable, the column may be left blank, but is still required.
Sample: An identifier for a particular biological sample.
Locus: The identifier for the locus being genotyped with. These must match the identifiers in Locus Attributes below.

For simple cases that have a one-to-one match between sequence files and sample/locus combinations, and with descriptive filenames following a consistent pattern, the dataset table can be created automatically at run-time. See the Usage section for more information.

Locus Attributes

The description of the loci should be given in a spreadsheet with loci on rows and attributes on columns. For example:

| Locus | LengthMin | LengthMax | LengthBuffer | Motif | Primer | ReversePrimer | |:-------:| -----------:| -----------:| --------------:|:-------:|:-----------------:|:----------------:| | A | 131 | 179 | 20 | TAGA | TATCACTGGTGT... | CACAGTTGTGTG...| | B | 194 | 235 | 20 | TAGA | AGTCTCTCTTTC... | TAGGAGCCTGTG...| | 1 | 232 | 270 | 20 | TATC | ACAGTCAAGAAT... | CTGTGGCTCAAA...| | 2 | 218 | 337 | 20 | TCCA | TTGTCTCCCCAG... | TCTGTCATAAAC...|

These columns are required:

Locus: A short unique identifier.
LengthMin: The minimum expected sequence length in bases.
LengthMax: The maximum expected sequence length in bases.
LengthBuffer: An additional length below LengthMin or above LengthMax to accept for candidate allele sequences. (If the length range of alleles for a given locus is uncertain or unknown, this may be set very high to effectively disable the length range requirement.)
Motif: The short sequence repeating in tandem.
Primer: The forward PCR primer used in preparing the sequencing library. This is used as one of the checks for candidate allele sequences.
ReversePrimer: The reverse PCR primer used in preparing the sequencing library. This is not currently used unless use_reverse_primers is enabled in the configuration.

Known Individuals (Optional)

If a spreadsheet of genotypes for known individuals is supplied, the analysis can attempt to match samples with the known genotypes automatically. For example:

| Name | Locus | Allele1Seq | Allele2Seq | |:---------:|:-------:|:----------------:|:----------------:| | CH001 | A | ATTATCACTGG... | ATTATCACTGG... | | CH001 | B | TCAGTCTCTCT... | | | CH001 | 1 | AGACAGTCAAG... | AGACAGTCAAG... | | CH001 | 2 | CTTTGTCTCCC... | CTTTGTCTCCC... | | CH002 | A | ATTATCACTGG... | ATTATCACTGG... | | CH002 | B | TCAGTCTCTCT... | TCAGTCTCTCT... | | CH002 | 1 | AGACAGTCAAG... | | | CH002 | 2 | CTTTGTCTCCC... | CTTTGTCTCCC... |

The order of the alleles given is not important, and homozygous individuals may have Allele2Seq either left blank or set to a copy of Allele1Seq. The sequences should contain any conserved region before and after the repeats including that used for the PCR primers described above.

Named Alleles (Optional)

If a spreadsheet of allele names and sequences is supplied, the analysis will use those names in summary tables in the output report. For example:

| Locus | Name | Seq | |:-------:|:-----------:|:----------------:| | A | 200-a | ATTATCACTGG... | | A | 180-a | ATTATCACTGG... | | A | 180-b | ATTATCACTGG... | | B | 300-a | ATTATCACTGG... | | B | 305-a | ATTATCACTGG... | | B | 290-a | ATTATCACTGG... |

The software will automatically create short allele names for any identified allele not listed in the allele spreadsheet (or for all alleles if no spreadsheet is given).

The automatic names are the sequence length and a sequence-specific suffix separated by a hyphen, for example, "180-fdd1c6" for a 180 bp sequence with no assigned name and particular sequence content. Any other 180 bp sequence would receive a different suffix when the name is assigned.

Algorithm

CHIIMP breaks the genotyping process into two parts. First a sample file is de-replicated and a table of unique sequences is created, with no filtering yet applied. Second the table is filtered to just candidate allele sequences, and up to two sequences are reported as the genotype. Both the per-sequence table and the final genotypes are saved in the final output, as spreadsheets in the processed-files directory and as the summary.csv spreadsheet.

Sample Processing

The table of unique sequences includes basic information for each case: sequence content, length, and read counts observed. These are the Seq, Count, and Length columns. The sequences are ordered by count with the most abundant first.

Additional columns associate each sequence with a particular locus, using the locus attributes described above. First each locus' forward primer (and optionally reverse as well) is compared with the sequence and the matching locus name is stored in a MatchingLocus column. The sequence is then checked for several tandem repeats of the motif for that locus, and compared to the length range expected for that locus. TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns respectively. The Ambiguous column marks any sequences containing bases outside of A, C, T, and G (such as N).

The primer matching recognizes IUPAC ambiguity codes in the primer sequences (e.g. an "N" signifies any of A, C, T, or G) but not in the read sequences. By default no mismatches are allowed, but a maximum number of mismatches per comparison can be defined in the configuration. The default behavior will not modify the reads based on matched primers, but several read-modifying actions are available via the primer_action configuration option:

none: no change (the default). This is reasonable if adapter sequences have already been trimmed, but otherwise one of the below should be used.
keep: trim each reads to the primer but keep the primer region as-is. Note that this would not work will with primer sequences containing IUPAC ambiguity codes; in that situation use replace or remove to ensure that reads with equivalent primer regions are grouped together.
replace: trim each read and replace the matched primer region with the sequence from the locus attributes table.
remove: trim reads to exclude the primer regions.

These can also be specified for forward and reverse primers only with primer_action_fwd and primer_action_rev, respectively.

PCR artifacts can obscure real allele sequences with incorrect sequences. There are extra filters to attempt to remove these if possible or highlight cases that may require further attention.

The sample data tables include "Stutter" and "Artifact" columns to mark entries that look like possible polymerase stutter or other artifacts of another sequence present at higher counts. For cases of potential polymerase stutter, the higher-count sequence is one motif repeat longer. For insertion/deletion/substitution artifacts the higher-count sequence is within 1 bp of the same length. In both cases the supposed artifact sequence will be marked if the read counts are 1/3 or lower than the higher-count sequence. (This represents a trade-off in sensitivity and specificity since genuine allele sequences may differ in length by one or even zero repeats, and read counts for pairs of alleles in a given sample can vary considerably.) Both of these columns store row numbers for the higher-count sequence that an artifact may have originated in, if found. Note that relative sequence lengths and counts determine the outcome here, since sequence content for the artifacts is largely indistinguishable from real allele sequences.

Lastly, the ratio of read counts for each sequence to the total reads in the sample and the reads with the same MatchingLocus value is stored in FractionOfTotal and FractionOfLocus columns respectively.

This is the analyze_seqs function in the R package.

Genotype Calling

In the previous stage every single unique sequence for each data file was described in a table, but no filtering or genotyping occurred. Now just one or two candidate allele sequences are extracted from each table and reported as the genotype.

First, the table rows are restricted to just those matching the expected locus' primer, motif, and length range (using the MatchingLocus, MotifMatch, and LengthMatch columns). If the resulting total read count is below a minimum value (by default r cfg("min_locus_reads"), customizable via the min_locus_reads setting) no genotyping will be attempted. Next only those sequences accounting for at least a minimum fraction of the remaining reads are considered. (The default value is r cfg("min_allele_abundance"). This can be changed via the min_allele_abundance setting.) Sequences that are marked as potential stutter or other artifacts (via the Stutter and Artifact columns of the table) or contain ambiguous sequence content (via the Ambiguous column) are excluded next.

After these filters are applied, the top one or two remaining sequences are labeled as the alleles. (If only one sequence remains, the sample is labeled homozygous; if two or more, heterozygous.) The final details kept for each sample are:

the sequence content, length, and counts for the one or two alleles
the zygosity of the sample
whether the ambiguous-sequence filter removed a potential allele
whether the stutter and/or artifact filter removed a potential allele
The read counts of the entire sample before any filtering
The read counts of just those sequences matching the locus primer, motif, and length range

These tasks (the filtering and categorizing of each sequence in the table and the short genotype summary) are the analyze_sample and summarize_sample functions in the R package.

Summary and Reporting

The genotype and details identified in the previous step for each sample are aggregated into a spreadsheet with a row for each sample. This summary spreadsheet and the more detailed per-file and per-sample tables are all saved in the final output.

For inter-sample comparisons, the alleles identified across samples for each locus are aligned to one another. The genotypes for each sample are clustered by number of matching alleles, showing similarity between samples. If a spreadsheet of known genotypes was given, the sample genotypes are also compared to the known genotypes, with any close matches reported. If a Name column was provided with the sample definition table as well as a known genotypes spreadsheet, the known-correct genotypes will be paired with applicable samples and a column tracking the result of the genotyping (Correct, Incorrect, Blank, or Dropped Allele) will be added. A single report document summarizes the genotyping and these other details. See the Output Data Organization section below for more information on the output.

These steps are handled by the full_analysis function in the R package.

Usage

CHIIMP takes a configuration file as input and saves all output to a folder. The configuration file points to all of the input data described above, and specifies options for the analysis and output. All options have defaults, so the file may be very brief or even empty. The file format a simple two-column CSV spreadsheet, with a "Key" column storing the names of configuration options and a "Value" column storing each associated value. (The YAML file used in previous CHIIMP versions is still supported, but the new CSV format should be easier to to manage for those unfamiliar with YAML.)

For example, a configuration file might have just two entries, showing the spreadsheets to use for the samples and loci to analyze:

Key,Value
dataset,samples.csv
locus_attrs,locus_attrs.csv

The configuration file can be dragged and dropped onto the desktop icon created during installation.

For command-line usage, the configuration file can be given as the first argument to the R script installed with the package. (The location of the script can be shown in R with system.file("bin", "chiimp", package="chiimp").) To run the same analysis within R, pass a list of configuration options to the chiimp::full_analysis() function.

Example Configuration File

The text in the example configuration file included here shows a slightly more complex case:

# Show example configuration file text
fp <- "inst/example_config.csv"
config_example <- load_config(fp)
kableExtra::kable_styling(knitr::kable(config_example, booktabs = TRUE, linesep = ""))

Common Options

Below is a list of a few commonly-customized options. (See also the end of this document for a full list with all default settings.) For more information on the format of the spreadsheets listed here, see the "Input Data Organization" section above.

dataset: file path to table of sample attributes to use (rather than detecting sample attributes based on filenames)
locus_attrs: file path to locus attributes CSV file
allele_names: file path to known alleles CSV file
genotypes_known: file path to known genotypes CSV file
output_path: directory path for saving output data
min_motif_repeats: minimum number of perfect tandem motif repeats to accept a read as a match for an STR locus. For example, if the Motif specified for a locus (in the locus_attrs file) is ACTC, a setting of 3 here will require ACTCACTCACTC somewhere in each read.
max_stutter_ratio: when considering a second-most-abundant sequence as a possible allele, if it is one repeat shorter than the most abundant sequence, it must exceed this fraction of the primer sequence abundance
max_artifact_ratio: like max_stutter_ratio, but for secondary sequences with lengths within 1 bp of the primary sequence.
use_reverse_primers: should reverse primers be used for matching reads to loci, or just forward?
reverse_primer_r1: if using reverse primers, are the sequences given in the locus attributes table oriented in the direction of the forward reads?
primer_action: a key word for how to modify each read based on the matched primer. (See Sample Processing section above.)
min_allele_abundance: minimum fraction of locus-matched reads for a sequence to be considered a candidate allele
min_locus_reads: minimum number of locus-matched reads to attempt to call a genotype

Output Data Organization

A the end of an analysis CHIIMP creates a directory of files with all results.

summary.csv: spreadsheet of the called genotypes and additional attributes for each sample. Each sample is on a separate row, and each column corresponds to a separate attribute in the results. This includes all columns in the input dataset spreadsheet including locus, replicate, and sample identifiers, the sequences, sequence lengths, and counts of the identified allele(s), and several additional attributes.
processed-files: directory of spreadsheets for each input data file. Each spreadsheet contains one unique sequence per row with attributes on columns. At this stage no filtering for sample/locus-specific attributes has been applied. (This is particularly relevant for sequencer-multiplexed samples as one input data file may contain data for multiple samples.)
processed-samples: directory of spreadsheets for each sample. As for processed-files, each spreadsheet contains one unique sequence per row with attributes on columns. These represent the intermediate sample-specific data CHIIMP uses to call a genotype for each sample, and each spreadsheet here corresponds to a single row in the summary.csv file.
histograms: directory of counts-versus-length histograms for each sample. Counts are tallied on a by-sequence basis rather than by-length for alleles, so the bars for called alleles (in red) are generally shorter than the bars for unfiltered sequences (in black) or the matching-locus sequences (in pink).
allele-sequences: directory of FASTA files for each sample, giving just the sequence content also shown in summary.csv. (This is a convenience feature to make the called alleles easily usable in a standard format, but the same information is available in summary.csv.)
alignments: directory of FASTA files for each locus, giving a multiple alignment of all identified alleles per locus.
alignment-images: directory of visualization images of the per-locus alignments. These are also included in the report document.
report.html: report document summarizing the genotyping results, inter-sample comparisons, and (if known genotypes were provided), a comparison of samples with known individuals.

An additional file will be created if fp_rds is defined in the output setting of the configuration. This file contains all analysis results in a single R object using R's native data serialization format for easy post-analysis in R if desired.

These directory and file names are customizable in the output section of the configuration.

Full Configuration Options List

Configuration options and their defaults for CHIIMP version r devtools::as.package(".")$version:

cfg_defaults <- subset(CFG_DEFAULTS, select = -c(Parser, OldName))
colnames(cfg_defaults)[colnames(cfg_defaults) == "Value"] <- "Default Value"
kableExtra::landscape(kableExtra::kable_styling(
  knitr::kable(cfg_defaults,
  booktabs = TRUE, linesep = ""), latex_options = "scale_down"))

ShawHahnLab/microsat documentation built on Aug. 25, 2023, 11:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ShawHahnLab/microsat
Computational, High-throughput Individual Identification through Microsatellite Profiling

In ShawHahnLab/microsat: Computational, High-throughput Individual Identification through Microsatellite Profiling

Introduction

Installation

Windows

Mac OS

Linux

Input Data Organization

Sequence Files

Dataset Sample Attributes

Locus Attributes

Known Individuals (Optional)

Named Alleles (Optional)

Algorithm

Sample Processing

Genotype Calling

Summary and Reporting

Usage

Example Configuration File

Common Options

Output Data Organization

Full Configuration Options List

R Package Documentation

Browse R Packages

We want your feedback!

ShawHahnLab/microsat Computational, High-throughput Individual Identification through Microsatellite Profiling

In ShawHahnLab/microsat: Computational, High-throughput Individual Identification through Microsatellite Profiling

Introduction

Installation

Windows

Mac OS

Linux

Input Data Organization

Sequence Files

Dataset Sample Attributes

Locus Attributes

Known Individuals (Optional)

Named Alleles (Optional)

Algorithm

Sample Processing

Genotype Calling

Summary and Reporting

Usage

Example Configuration File

Common Options

Output Data Organization

Full Configuration Options List

R Package Documentation

Browse R Packages

We want your feedback!

ShawHahnLab/microsat
Computational, High-throughput Individual Identification through Microsatellite Profiling