chiimp-package | R Documentation |
Computational, High-throughput Individual Identification through Microsatellite Profiling. For a conceptual overview see the latest user guide and additional documentation at https://shawhahnlab.github.io/chiimp/.
Starting from file inputs and producing file outputs, the overall workflow (handled by full_analysis as a configuration-driven wrapper for the entire process) is:
Load input data. The input spreadsheets are text files using comma-separated values (CSV).
Load data frame of sample information from a spreadsheet via load_dataset or directly from filenames via prepare_dataset.
Load data frame of locus attributes via load_locus_attrs.
Optionally, load data frame of names for allele sequences via load_allele_names.
Optionally, load data frame of known genotypes for named individuals via load_genotypes.
Analyze dataset via analyze_dataset.
Load each sequence data file into a character vector with load_seqs and process into a dereplicated data frame with analyze_seqs.
For each sample, filter the sequences from the relevant per-file data frame to just those matching the expected locus and identify possible alleles, via analyze_sample. (There may be a many-to-one relationship of samples to files, for example with sequencer multiplexing.)
Process each per-sample data frame into a summary list of attributes giving alleles identified and related information, via summarize_sample.
Organize analyze_dataset results into a list of per-file data frames, a list of per-sample data frames, and a single summary data frame across all samples.
Summarize results and add additional comparisons (cross-sample and to known-individual) via summarize_dataset.
Tabulate sequence counts per sample matching each locus' primer via tally_cts_per_locus.
Align identified alleles for each locus via align_alleles.
Create a sample-to-sample distance matrix of allele mismatches via make_dist_mat.
If genotypes for known individuals were provided, create a sample-to-known-individual distance matrix via make_dist_mat_known.
If identities of samples were provided, score genotyping success via match_known_genotypes and categorize_genotype_results.
Save analysis results to files. Spreadsheets are in CSV format for output
as well as input. Some output files are in FASTA format (alignments and
alleles) or are PNG images (alignment visualization and sequence count
histograms). If specified in the configuration, saveRDS is
called on the entire output as well, saving to results.rds
by default.
Create an HTML report document summarizing all results.
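As a rough illustration of the sample-to-sample distance step in the workflow above, the following base R sketch counts allele mismatches between every pair of samples. It is not the implementation behind make_dist_mat. The helper names (locus_mismatch, mismatch_dist_mat), the column layout (two allele columns per locus, suffixed _1 and _2), and the toy genotypes are all assumptions made for this example, and missing alleles are not handled.

```r
# Hypothetical helper (not part of chiimp): number of mismatched alleles at
# one locus between two unordered allele pairs, taking the better of the two
# possible pairings.
locus_mismatch <- function(a, b) {
  direct <- sum(a != b)
  swapped <- sum(a != rev(b))
  min(direct, swapped)
}

# Hypothetical helper (not part of chiimp): pairwise mismatch counts summed
# over all loci. Assumes columns named <locus>_1 and <locus>_2.
mismatch_dist_mat <- function(gt, loci) {
  n <- nrow(gt)
  d <- matrix(0L, n, n, dimnames = list(gt$Sample, gt$Sample))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      total <- 0L
      for (loc in loci) {
        cols <- paste0(loc, c("_1", "_2"))
        a <- unlist(gt[i, cols], use.names = FALSE)
        b <- unlist(gt[j, cols], use.names = FALSE)
        total <- total + locus_mismatch(a, b)
      }
      d[i, j] <- total
    }
  }
  d
}

# Toy genotypes for three samples at two made-up loci, A and B
genotypes <- data.frame(
  Sample = c("S1", "S2", "S3"),
  A_1 = c("162", "162", "170"),
  A_2 = c("178", "178", "178"),
  B_1 = c("220", "224", "220"),
  B_2 = c("224", "220", "236"),
  stringsAsFactors = FALSE
)

d <- mismatch_dist_mat(genotypes, loci = c("A", "B"))
print(d)
```

In this toy data S1 and S2 carry the same alleles in a different order, so their distance is zero, while S3 differs from each of them by one allele at each locus.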
For defaults used in the configuration, see CFG_DEFAULTS.
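To illustrate the general idea of user settings layered over defaults, here is a minimal base R sketch using utils::modifyList. It does not reproduce the contents of CFG_DEFAULTS or the way chiimp loads configuration files. Apart from the results.rds file name mentioned above, every option name and value here is hypothetical.

```r
# Hypothetical defaults; the real CFG_DEFAULTS has its own names and values.
defaults <- list(
  output = list(path = "results", rds = "results.rds"),
  ncores = 1
)

# A user configuration that overrides a single nested entry
user <- list(output = list(path = "run-01"))

# modifyList merges nested lists recursively, so defaults that the user
# leaves untouched are kept.
cfg <- utils::modifyList(defaults, user)
str(cfg)
```

Only the overridden entry changes; the remaining defaults carry through to the merged list.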
The workflow above outlines CHIIMP's behavior when called as a standalone program, where main loads a configuration file into global options in R and calls full_analysis. The public functions linked above can also be used independently; see the documentation and code examples for the individual functions for more information.
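To give a feel for what a step such as dereplication involves when working with sequences directly in R, the sketch below collapses a character vector of reads into a data frame with one row per unique sequence. It uses base R only and is not the code behind analyze_seqs. The function name dereplicate and the column names Seq, Count, and Length are assumptions chosen for this example.

```r
# Hypothetical helper (not part of chiimp): one row per unique sequence,
# with its read count and length, most abundant first.
dereplicate <- function(seqs) {
  counts <- table(seqs)
  out <- data.frame(
    Seq = names(counts),
    Count = as.integer(counts),
    Length = nchar(names(counts)),
    stringsAsFactors = FALSE
  )
  # Sort by decreasing count; ties are broken alphabetically for stability
  out <- out[order(-out$Count, out$Seq), ]
  rownames(out) <- NULL
  out
}

# Toy reads standing in for the contents of one sequence file
reads <- c("ACGTACGT", "ACGT", "ACGTACGT", "TTGA", "ACGTACGT", "ACGT")
derep <- dereplicate(reads)
print(derep)
```

Sorting by abundance puts likely allele sequences near the top of the table, which is convenient for any later per-locus filtering.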
The package structure of the source files, grouped by topic:
Main Interface:
chiimp.R
: Main entry point for command-line usage (main) and R usage
(full_analysis).
Data Analysis:
analyze_dataset.R
: High-level interface to analyze all samples
across a given dataset (analyze_dataset); used by full_analysis to
manage the main part of the processing.
summarize_dataset.R
: High-level interface to provide inter-sample
and inter-locus analyses (summarize_dataset); used by full_analysis
to manage the second stage of the processing.
analyze_seqs.R
: Low-level interface to convert raw sequence input
to a data frame of unique sequences (analyze_seqs); used
by analyze_dataset.
analyze_sample.R
: Low-level interface to extract per-locus
details from a data frame of unique sequences
(analyze_sample); used by analyze_dataset.
summarize_sample.R
: Low-level interface to condense each sample
data frame into a concise list of consistent attributes, suitable for
binding together across samples for a dataset (summarize_sample); used
by analyze_dataset.
categorize.R
: Low-level helper functions used by summarize_dataset
for samples with known identity.
Plotting and Reporting:
report.R
: Various plotting and summarizing functions used when
rendering a report in full_analysis.
histogram.R
: Sequence histogram plotting tools (histogram) as used
during full_analysis.
markdown.R
: Various helper functions for adding tables and plots
to an R Markdown report as used in full_analysis.
Utility Functions and Configuration:
configuration.R
: Configuration handling helper code and the default
configuration options (CFG_DEFAULTS) used by many chiimp functions.
io.R
: Various helper input/output functions for loading and
saving sequence data files, spreadsheets, and plots, used in multiple
parts of the package.
util.R
: Various helper functions used in multiple parts of the
package.
Maintainer: Jesse Connell ancon@upenn.edu