chiimp-package: CHIIMP

chiimp-package R Documentation

CHIIMP

Description

Computational, High-throughput Individual Identification through Microsatellite Profiling. For a conceptual overview see the latest user guide and additional documentation at https://shawhahnlab.github.io/chiimp/.

Details

Starting from file inputs and producing file outputs, the overall workflow (handled by full_analysis as a configuration-driven wrapper for the entire process) is:

  • Load input data. The input spreadsheets are text files using comma-separated values (CSV).

    • Load data frame of sample information from a spreadsheet via load_dataset or directly from filenames via prepare_dataset.

    • Load data frame of locus attributes via load_locus_attrs.

    • Optionally, load data frame of names for allele sequences via load_allele_names.

    • Optionally, load data frame of known genotypes for named individuals via load_genotypes.

  • Analyze dataset via analyze_dataset.

    • Load each sequence data file into a character vector with load_seqs and process into a dereplicated data frame with analyze_seqs.

    • For each sample, filter the sequences from the relevant per-file data frame to just those matching the expected locus and identify possible alleles, via analyze_sample. (There may be a many-to-one relationship of samples to files, for example with sequencer multiplexing.)

    • Process each per-sample data frame into a summary list of attributes giving the alleles identified and related information, via summarize_sample.

    • Organize analyze_dataset results into a list of per-file data frames, a list of per-sample data frames, and a single summary data frame across all samples.

  • Summarize results and add further comparisons (sample-to-sample and sample-to-known-individual) via summarize_dataset.

    • Tabulate sequence counts per sample matching each locus' primer via tally_cts_per_locus.

    • Align identified alleles for each locus via align_alleles.

    • Create a sample-to-sample distance matrix of allele mismatches via make_dist_mat.

    • If genotypes for known individuals were provided, create a sample-to-known-individual distance matrix via make_dist_mat_known.

    • If identities of samples were provided, score genotyping success via match_known_genotypes and categorize_genotype_results.

  • Save analysis results to files. Spreadsheets are in CSV format for output as well as input. Some output files are in FASTA format (alignments and alleles) or are PNG images (alignment visualization and sequence count histograms). If specified in the configuration, saveRDS is called on the entire output as well, saving to results.rds by default.

  • Create an HTML report document summarizing all results.
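Three of the steps above (dereplication, the allele-mismatch distance matrix, and the saveRDS output) can be sketched in base R. This is a toy illustration only, not CHIIMP's implementation: the helper names (dereplicate, call_alleles, allele_distance), the column names, the min_ratio threshold, and the reads themselves are all assumptions made for the example, and the real analyze_sample applies locus-specific filtering that is omitted here.

```r
# Toy reads for three hypothetical samples at one locus (not real data).
reads <- list(
  s1 = c("ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTTCGT", "GGGG"),
  s2 = c("ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTTCGT"),
  s3 = c("TTTTACGT", "TTTTACGT", "TTTTACGT")
)

# Dereplicate: one row per unique sequence with its count and length,
# sorted from most to least abundant (ties broken alphabetically).
dereplicate <- function(seqs) {
  tbl <- table(seqs)
  df <- data.frame(
    Seq = names(tbl),
    Count = as.integer(tbl),
    Length = nchar(names(tbl)),
    stringsAsFactors = FALSE
  )
  df <- df[order(-df$Count, df$Seq), ]
  rownames(df) <- NULL
  df
}

# Naive allele call (an assumption for this sketch): keep up to two
# sequences whose count is at least min_ratio times the top count.
call_alleles <- function(derep, min_ratio = 0.5) {
  keep <- derep$Count >= min_ratio * derep$Count[1]
  head(derep$Seq[keep], 2)
}

# Mismatch count between two genotypes at one locus. A single allele is
# treated as homozygous, and the better of the two pairings is used.
allele_distance <- function(a, b) {
  a <- rep_len(a, 2)
  b <- rep_len(b, 2)
  min(sum(a != b), sum(a != rev(b)))
}

alleles <- lapply(reads, function(seqs) call_alleles(dereplicate(seqs)))
ids <- names(alleles)

# Sample-to-sample distance matrix with sample names on both axes.
dist_mat <- sapply(ids, function(j) {
  sapply(ids, function(i) allele_distance(alleles[[i]], alleles[[j]]))
})

# Save the combined results to an RDS file and read them back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(alleles = alleles, dist_mat = dist_mat), rds_path)
restored <- readRDS(rds_path)
```

In this toy data, s1 and s2 share both alleles while s3 shares none with either, so the matrix separates s3 from the other two samples.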

For defaults used in the configuration, see CFG_DEFAULTS.
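As a general illustration of how user settings can be layered over defaults in R, the sketch below merges two nested lists with utils::modifyList. The list structure and key names are hypothetical and are not taken from CFG_DEFAULTS; only the results.rds filename comes from the text above.

```r
# Hypothetical defaults; these keys are illustrative, not CHIIMP's real options.
cfg_defaults <- list(
  output = list(path = "results", rds = "results.rds"),
  ncores = 1
)

# Settings a user might supply; anything omitted falls back to the default.
user_cfg <- list(output = list(path = "run1"), ncores = 4)

# modifyList merges nested lists recursively, so output$rds is preserved.
cfg <- utils::modifyList(cfg_defaults, user_cfg)
```

Here cfg$output$path and cfg$ncores take the user's values, while cfg$output$rds keeps its default.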

The workflow above outlines CHIIMP's behavior when called as a standalone program, where main loads a configuration file into global options in R and calls full_analysis. The public functions linked above can also be used independently; see the documentation and code examples for the individual functions for more information.
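The idea of keeping configuration in R's global options can be sketched with base R's options and getOption. The "demo." prefix, the setting names, and the get_setting helper are assumptions made for this example and do not reflect the option names CHIIMP actually uses.

```r
# Hypothetical option prefix, used here only to avoid clashing with real options.
prefix <- "demo."

# Store settings as global options, remembering the previous values.
old <- options(demo.output_path = "run1", demo.ncores = 4)

# Look up a setting, falling back to a default when it is unset.
get_setting <- function(key, default = NULL) {
  getOption(paste0(prefix, key), default = default)
}

path <- get_setting("output_path")
report <- get_setting("report_title", default = "Untitled")

# Restore the previous state so the demo options do not linger.
options(old)
```

Restoring the saved values afterwards keeps a session's options unchanged, which matters when such code runs inside a longer interactive session.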

The package structure of the source files, grouped by topic:

  • Main Interface:

    • chiimp.R: Main entry point for command-line usage (main) and R usage (full_analysis).

  • Data Analysis:

    • analyze_dataset.R: High-level interface to analyze all samples across a given dataset (analyze_dataset); used by full_analysis to manage the main part of the processing.

    • summarize_dataset.R: High-level interface to provide inter-sample and inter-locus analyses (summarize_dataset); used by full_analysis to manage the second stage of the processing.

    • analyze_seqs.R: Low-level interface to convert raw sequence input to a data frame of unique sequences (analyze_seqs); used by analyze_dataset.

    • analyze_sample.R: Low-level interface to extract per-locus details from a data frame of unique sequences (analyze_sample); used by analyze_dataset.

    • summarize_sample.R: Low-level interface to condense each sample data frame into a concise list of consistent attributes, suitable for binding together across samples for a dataset (summarize_sample); used by analyze_dataset.

    • categorize.R: Low-level helper functions used by summarize_dataset for samples with known identity.

  • Plotting and Reporting:

    • report.R: Various plotting and summarizing functions used when rendering a report in full_analysis.

    • histogram.R: Sequence histogram plotting tools (histogram) as used during full_analysis.

    • markdown.R: Various helper functions for adding tables and plots to an R Markdown report as used in full_analysis.

  • Utility Functions and Configuration:

    • configuration.R: Configuration handling helper code and the default configuration options (CFG_DEFAULTS) used by many chiimp functions.

    • io.R: Various helper input/output functions for loading and saving sequence data files, spreadsheets, and plots, used in multiple parts of the package.

    • util.R: Various helper functions used in multiple parts of the package.

Author(s)

Maintainer: Jesse Connell ancon@upenn.edu


ShawHahnLab/microsat documentation built on Aug. 25, 2023, 11:16 p.m.