knitr::opts_chunk$set(echo = TRUE) devtools::load_all(quiet = TRUE)
CHIIMP (Computational, High-throughput Individual Identification through Microsatellite Profiling) is a program to analyze microsatellite (short tandem repeat) DNA sequence data, producing genotypes from raw data and automating some typical analysis tasks.
CHIIMP runs as a standalone tool, but is built as an R language package. All functionality of the standalone program can be accessed from functions within R, and the reporting and visualization functions are designed to integrate well with RStudio and R Markdown.
This document mostly focuses on CHIIMP as a standalone tool. For more information on the use of specific functions within R, also see the built-in package documentation.
First install R and RStudio, which will supply most software dependencies for CHIIMP. Once these are installed, follow the specific instructions below for your operating system. In all three cases CHIIMP performs an analysis when a configuration file is dragged and dropped onto the desktop icon; there is no interactive interface via the icon, though the R package can be used interactively. See the Usage section for more information.
On Windows, double-click the install_windows.cmd
script. This will install
the package and R dependencies, and create a desktop shortcut.
On Mac OS, right-click (control+click) the install_mac.command
shell script,
select "Open," and also click "Open" in the window that appears to confirm that
really do want to open it. (Apple has specific instructions about these
security precautions here.)
This will automatically install the package along with R dependencies and
create a desktop alias.
If a window appears recommending installation of the Mac OS command-line developer tools, go ahead and install them. After that you'll probably need to re-run the CHIIMP installer again to finish the install.
On Linux, run the install_linux.sh
shell script to automatically install the
package along with R dependencies. An icon for the program is created at
$HOME/Desktop/CHIIMP.desktop
. Specific usage of the desktop icon will depend
on the desktop environment in use. (The CHIIMP.desktop
text file references
the installed chiimp executable, and supplies the config file as a command-line
argument when dragged and dropped onto the icon.)
The information CHIIMP uses during analysis is:
The spreadsheets are in comma-separated (CSV) format. Column names are important but not column order. Extra columns are imported as-is but are otherwise ignored.
The sequence files must contain sequences that span complete microsatellites. No assembly is performed to handle fragments of microsatellites, and the lengths of sequences identified as alleles are reported as-is. (An implicit assumption throughout the analysis is that any candidate allele sequence begins and ends with conserved regions corresponding to the PCR primers used, and the forward primer sequence is one of the filtering criteria during analysis.)
The description of the samples to be analyzed can be provided in a spreadsheet, or automatically loaded from the data file names. An example spreadsheet:
| Filename
| Replicate
| Sample
| Locus
|
|:------------------:|:-----------:|:--------:|:-------:|
| 100-1-A.fastq.gz
| 1
| 100
| A
|
| 100-2-A.fastq.gz
| 2
| 100
| A
|
| 100-1-B.fastq.gz
| 1
| 100
| B
|
| 100-2-B.fastq.gz
| 2
| 100
| B
|
| 100-1-1.fastq.gz
| 1
| 100
| 1
|
| 100-2-1.fastq.gz
| 2
| 100
| 1
|
| 100-1-2.fastq.gz
| 1
| 100
| 2
|
| 100-2-2.fastq.gz
| 2
| 100
| 2
|
| 101-1-A.fastq.gz
| 1
| 101
| A
|
| 101-2-A.fastq.gz
| 2
| 101
| A
|
| 101-1-B.fastq.gz
| 1
| 101
| B
|
| 101-2-B.fastq.gz
| 2
| 101
| B
|
| 101-1-1.fastq.gz
| 1
| 101
| 1
|
| 101-2-1.fastq.gz
| 2
| 101
| 1
|
| 101-1-2.fastq.gz
| 1
| 101
| 2
|
| 101-2-2.fastq.gz
| 2
| 101
| 2
|
These columns are required for each entry:
For simple cases that have a one-to-one match between sequence files and sample/locus combinations, and with descriptive filenames following a consistent pattern, the dataset table can be created automatically at run-time. See the Usage section for more information.
The description of the loci should be given in a spreadsheet with loci on rows and attributes on columns. For example:
| Locus
| LengthMin
| LengthMax
| LengthBuffer
| Motif
| Primer
| ReversePrimer
|
|:-------:| -----------:| -----------:| --------------:|:-------:|:-----------------:|:----------------:|
| A
| 131
| 179
| 20
| TAGA
| TATCACTGGTGT...
| CACAGTTGTGTG...
|
| B
| 194
| 235
| 20
| TAGA
| AGTCTCTCTTTC...
| TAGGAGCCTGTG...
|
| 1
| 232
| 270
| 20
| TATC
| ACAGTCAAGAAT...
| CTGTGGCTCAAA...
|
| 2
| 218
| 337
| 20
| TCCA
| TTGTCTCCCCAG...
| TCTGTCATAAAC...
|
These columns are required:
use_reverse_primers
is enabled
in the configuration.If a spreadsheet of genotypes for known individuals is supplied, the analysis can attempt to match samples with the known genotypes automatically. For example:
| Name
| Locus
| Allele1Seq
| Allele2Seq
|
|:---------:|:-------:|:----------------:|:----------------:|
| CH001
| A
| ATTATCACTGG...
| ATTATCACTGG...
|
| CH001
| B
| TCAGTCTCTCT...
| |
| CH001
| 1
| AGACAGTCAAG...
| AGACAGTCAAG...
|
| CH001
| 2
| CTTTGTCTCCC...
| CTTTGTCTCCC...
|
| CH002
| A
| ATTATCACTGG...
| ATTATCACTGG...
|
| CH002
| B
| TCAGTCTCTCT...
| TCAGTCTCTCT...
|
| CH002
| 1
| AGACAGTCAAG...
| |
| CH002
| 2
| CTTTGTCTCCC...
| CTTTGTCTCCC...
|
The order of the alleles given is not important, and homozygous individuals may have Allele2Seq either left blank or set to a copy of Allele1Seq. The sequences should contain any conserved region before and after the repeats including that used for the PCR primers described above.
If a spreadsheet of allele names and sequences is supplied, the analysis will use those names in summary tables in the output report. For example:
| Locus
| Name
| Seq
|
|:-------:|:-----------:|:----------------:|
| A
| 200-a
| ATTATCACTGG...
|
| A
| 180-a
| ATTATCACTGG...
|
| A
| 180-b
| ATTATCACTGG...
|
| B
| 300-a
| ATTATCACTGG...
|
| B
| 305-a
| ATTATCACTGG...
|
| B
| 290-a
| ATTATCACTGG...
|
The software will automatically create short allele names for any identified allele not listed in the allele spreadsheet (or for all alleles if no spreadsheet is given).
The automatic names are the sequence length and a sequence-specific suffix separated by a hyphen, for example, "180-fdd1c6" for a 180 bp sequence with no assigned name and particular sequence content. Any other 180 bp sequence would receive a different suffix when the name is assigned.
CHIIMP breaks the genotyping process into two parts. First a sample file is
de-replicated and a table of unique sequences is created, with no filtering yet
applied. Second the table is filtered to just candidate allele sequences, and
up to two sequences are reported as the genotype. Both the per-sequence table
and the final genotypes are saved in the final output, as spreadsheets in the
processed-files
directory and as the summary.csv
spreadsheet.
The table of unique sequences includes basic information for each case: sequence content, length, and read counts observed. These are the Seq, Count, and Length columns. The sequences are ordered by count with the most abundant first.
Additional columns associate each sequence with a particular locus, using the locus attributes described above. First each locus' forward primer (and optionally reverse as well) is compared with the sequence and the matching locus name is stored in a MatchingLocus column. The sequence is then checked for several tandem repeats of the motif for that locus, and compared to the length range expected for that locus. TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns respectively. The Ambiguous column marks any sequences containing bases outside of A, C, T, and G (such as N).
The primer matching recognizes IUPAC ambiguity codes in the primer sequences
(e.g. an "N" signifies any of A, C, T, or G) but not in the read sequences.
By default no mismatches are allowed, but a maximum number of mismatches per
comparison can be defined in the configuration. The default behavior will not
modify the reads based on matched primers, but several read-modifying actions
are available via the primer_action
configuration option:
none
: no change (the default). This is reasonable if adapter sequences
have already been trimmed, but otherwise one of the below should be used.keep
: trim each reads to the primer but keep the primer region as-is.
Note that this would not work will with primer sequences containing IUPAC
ambiguity codes; in that situation use replace
or remove
to ensure that
reads with equivalent primer regions are grouped together.replace
: trim each read and replace the matched primer region with the
sequence from the locus attributes table.remove
: trim reads to exclude the primer regions.These can also be specified for forward and reverse primers only with
primer_action_fwd
and primer_action_rev
, respectively.
PCR artifacts can obscure real allele sequences with incorrect sequences. There are extra filters to attempt to remove these if possible or highlight cases that may require further attention.
The sample data tables include "Stutter" and "Artifact" columns to mark entries that look like possible polymerase stutter or other artifacts of another sequence present at higher counts. For cases of potential polymerase stutter, the higher-count sequence is one motif repeat longer. For insertion/deletion/substitution artifacts the higher-count sequence is within 1 bp of the same length. In both cases the supposed artifact sequence will be marked if the read counts are 1/3 or lower than the higher-count sequence. (This represents a trade-off in sensitivity and specificity since genuine allele sequences may differ in length by one or even zero repeats, and read counts for pairs of alleles in a given sample can vary considerably.) Both of these columns store row numbers for the higher-count sequence that an artifact may have originated in, if found. Note that relative sequence lengths and counts determine the outcome here, since sequence content for the artifacts is largely indistinguishable from real allele sequences.
Lastly, the ratio of read counts for each sequence to the total reads in the sample and the reads with the same MatchingLocus value is stored in FractionOfTotal and FractionOfLocus columns respectively.
This is the analyze_seqs
function in the R package.
In the previous stage every single unique sequence for each data file was described in a table, but no filtering or genotyping occurred. Now just one or two candidate allele sequences are extracted from each table and reported as the genotype.
First, the table rows are restricted to just those matching the expected locus'
primer, motif, and length range (using the MatchingLocus, MotifMatch, and
LengthMatch columns). If the resulting total read count is below a minimum
value (by default r cfg("min_locus_reads")
, customizable via the
min_locus_reads
setting) no genotyping will be attempted. Next only those
sequences accounting for at least a minimum fraction of the remaining reads are
considered. (The default value is r cfg("min_allele_abundance")
. This can
be changed via the min_allele_abundance
setting.) Sequences that are marked
as potential stutter or other artifacts (via the Stutter and Artifact columns
of the table) or contain ambiguous sequence content (via the Ambiguous column)
are excluded next.
After these filters are applied, the top one or two remaining sequences are labeled as the alleles. (If only one sequence remains, the sample is labeled homozygous; if two or more, heterozygous.) The final details kept for each sample are:
These tasks (the filtering and categorizing of each sequence in the table and
the short genotype summary) are the analyze_sample
and summarize_sample
functions in the R package.
The genotype and details identified in the previous step for each sample are aggregated into a spreadsheet with a row for each sample. This summary spreadsheet and the more detailed per-file and per-sample tables are all saved in the final output.
For inter-sample comparisons, the alleles identified across samples for each locus are aligned to one another. The genotypes for each sample are clustered by number of matching alleles, showing similarity between samples. If a spreadsheet of known genotypes was given, the sample genotypes are also compared to the known genotypes, with any close matches reported. If a Name column was provided with the sample definition table as well as a known genotypes spreadsheet, the known-correct genotypes will be paired with applicable samples and a column tracking the result of the genotyping (Correct, Incorrect, Blank, or Dropped Allele) will be added. A single report document summarizes the genotyping and these other details. See the Output Data Organization section below for more information on the output.
These steps are handled by the full_analysis
function in the R package.
CHIIMP takes a configuration file as input and saves all output to a folder. The configuration file points to all of the input data described above, and specifies options for the analysis and output. All options have defaults, so the file may be very brief or even empty. The file format a simple two-column CSV spreadsheet, with a "Key" column storing the names of configuration options and a "Value" column storing each associated value. (The YAML file used in previous CHIIMP versions is still supported, but the new CSV format should be easier to to manage for those unfamiliar with YAML.)
For example, a configuration file might have just two entries, showing the spreadsheets to use for the samples and loci to analyze:
Key,Value dataset,samples.csv locus_attrs,locus_attrs.csv
The configuration file can be dragged and dropped onto the desktop icon created during installation.
For command-line usage, the configuration file can be given as the first
argument to the R script installed with the package. (The location of the
script can be shown in R with system.file("bin", "chiimp", package="chiimp")
.)
To run the same analysis within R, pass a list of configuration options to the
chiimp::full_analysis()
function.
The text in the example configuration file included here shows a slightly more complex case:
# Show example configuration file text fp <- "inst/example_config.csv" config_example <- load_config(fp) kableExtra::kable_styling(knitr::kable(config_example, booktabs = TRUE, linesep = ""))
Below is a list of a few commonly-customized options. (See also the end of this document for a full list with all default settings.) For more information on the format of the spreadsheets listed here, see the "Input Data Organization" section above.
dataset
: file path to table of sample attributes to use (rather than
detecting sample attributes based on filenames)locus_attrs
: file path to locus attributes CSV fileallele_names
: file path to known alleles CSV filegenotypes_known
: file path to known genotypes CSV fileoutput_path
: directory path for saving output datamin_motif_repeats
: minimum number of perfect tandem motif repeats to
accept a read as a match for an STR locus. For example, if the Motif
specified for a locus (in the locus_attrs
file) is ACTC
, a setting of 3
here will require ACTCACTCACTC
somewhere in each read.max_stutter_ratio
: when considering a second-most-abundant sequence as
a possible allele, if it is one repeat shorter than the most abundant
sequence, it must exceed this fraction of the primer sequence abundancemax_artifact_ratio
: like max_stutter_ratio
, but for secondary sequences
with lengths within 1 bp of the primary sequence.use_reverse_primers
: should reverse primers be used for matching reads to
loci, or just forward?reverse_primer_r1
: if using reverse primers, are the sequences given in
the locus attributes table oriented in the direction of the forward reads?primer_action
: a key word for how to modify each read based on the
matched primer. (See Sample Processing section above.)min_allele_abundance
: minimum fraction of locus-matched reads for a
sequence to be considered a candidate allelemin_locus_reads
: minimum number of locus-matched reads to attempt to call
a genotypeA the end of an analysis CHIIMP creates a directory of files with all results.
summary.csv
: spreadsheet of the called genotypes and additional attributes
for each sample. Each sample is on a separate row, and each column corresponds
to a separate attribute in the results. This includes all columns in the input
dataset spreadsheet including locus, replicate, and sample identifiers, the
sequences, sequence lengths, and counts of the identified allele(s), and
several additional attributes.processed-files
: directory of spreadsheets for each input data file. Each
spreadsheet contains one unique sequence per row with attributes on columns.
At this stage no filtering for sample/locus-specific attributes has been
applied. (This is particularly relevant for sequencer-multiplexed samples as
one input data file may contain data for multiple samples.)processed-samples
: directory of spreadsheets for each sample. As for
processed-files
, each spreadsheet contains one unique sequence per row with
attributes on columns. These represent the intermediate sample-specific data
CHIIMP uses to call a genotype for each sample, and each spreadsheet here
corresponds to a single row in the summary.csv
file.histograms
: directory of counts-versus-length histograms for each sample.
Counts are tallied on a by-sequence basis rather than by-length for alleles, so
the bars for called alleles (in red) are generally shorter than the bars for
unfiltered sequences (in black) or the matching-locus sequences (in pink).allele-sequences
: directory of FASTA files for each sample, giving just the
sequence content also shown in summary.csv
. (This is a convenience feature
to make the called alleles easily usable in a standard format, but the same
information is available in summary.csv
.)alignments
: directory of FASTA files for each locus, giving a multiple
alignment of all identified alleles per locus.alignment-images
: directory of visualization images of the per-locus
alignments. These are also included in the report document.report.html
: report document summarizing the genotyping results,
inter-sample comparisons, and (if known genotypes were provided), a comparison
of samples with known individuals.An additional file will be created if fp_rds
is defined in the output
setting of the configuration. This file contains all analysis results in a
single R object using R's native data serialization format for easy
post-analysis in R if desired.
These directory and file names are customizable in the output
section of the
configuration.
Configuration options and their defaults for CHIIMP version
r devtools::as.package(".")$version
:
cfg_defaults <- subset(CFG_DEFAULTS, select = -c(Parser, OldName)) colnames(cfg_defaults)[colnames(cfg_defaults) == "Value"] <- "Default Value" kableExtra::landscape(kableExtra::kable_styling( knitr::kable(cfg_defaults, booktabs = TRUE, linesep = ""), latex_options = "scale_down"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.