inst/CharAnalysis_R_Translation_ProjectBrief.md

CharAnalysis R Translation — Project Brief

Prepared for Claude Cowork | April 2026 — Updated April 15, 2026 to reflect all-dataset validation results

1. Project Overview

CharAnalysis is a MATLAB program for reconstructing local fire histories from lake-sediment charcoal records. It implements a peak-detection workflow that decomposes a charcoal accumulation rate (CHAR) time series into low-frequency background and high-frequency peak components, identifies charcoal peaks that likely represent fire events, and summarizes fire-history metrics.

The program has been in use since the mid-2000s and has been used in dozens of published studies analyzing sediment-charcoal records worldwide. The GitHub repository is https://github.com/phiguera/CharAnalysis.

Version 2.0 (March 2026) is the current MATLAB codebase. It is a fully validated modernization of the original code, with no changes to analytical methods. It runs on MATLAB R2019a or higher with no toolbox dependencies. All MATLAB source files are in the CharAnalysis_2_0_MATLAB/ subfolder of the repository.

The goal of this project is to translate CharAnalysis Version 2.0 from MATLAB into R, producing a well-structured R package that is quantitatively equivalent to the MATLAB implementation on validated reference datasets.

R is the dominant language in the paleoecological community today. An R implementation will significantly broaden access to CharAnalysis for new users and make integration with downstream analyses much easier.

2. MATLAB Codebase Reference

The MATLAB codebase is organized as a set of .m function files called by a single entry point. The analytical call hierarchy is:

CharAnalysis.m            (entry point)
  ├── CharParameters.m    Read .csv / .xls input file
  ├── CharPretreatment.m  Interpolate, compute CHAR, transform
  ├── CharSmooth.m        Estimate Cbackground (5 methods)
  ├── CharThreshGlobal.m  Global threshold + noise PDF
  ├── CharThreshLocal.m   Local (sliding-window) threshold
  ├── CharPeakID.m        Peak flagging + minimum-count screen
  ├── CharPostProcess.m   Post-processing: FRIs, Weibull, SNI, GOF
  └── CharPlotResults.m   All output figures (Figures 3–9)

Supporting:
  charLowess.m            Base-MATLAB lowess implementation (no toolbox)
  CharValidateParams.m    Input validation
  GaussianMixture.m       EM algorithm entry point; sub-functions
                          (EStep, MStep, EMIterate, MDLReduceOrder,
                          ClusterNormalize, initMixture, SplitClasses,
                          GMClassLikelihood) are bundled as local
                          functions within this single file, not
                          separate .m files
  smoothFRI.m             Smoothed FRI and fire-frequency curves
  bkgCharSensitivity.m    Sensitivity to Cbackground window width
  CharPlotFig3–9.m        Individual figure functions (modular interface)

Key data structures

Data flows through three primary structs and several parameter structs:

| Struct | Contents | |--------|----------| | Charcoal | Raw and resampled time series: depths, ages, counts, volumes, concentrations, CHAR, smoothed CHAR, peak CHAR, identified peaks, magnitudes, fire frequencies | | CharThresh | Threshold values (pos/neg, 4 columns), noise PDF, SNI, GOF, minCountP | | Pretreatment | zoneDiv, yrInterp, transform | | Smoothing | method (1–5), yr (window width) | | PeakAnalysis | cPeak, threshType, threshMethod, threshValues[4], minCountP, peakFrequ, sensitivity | | Results | save, saveFigures, allFigures |

3. Reference Datasets

Five dataset/configuration combinations have been validated between MATLAB V1.1 and V2.0 and serve as the quantitative benchmark for the R translation. All datasets are included in the repository under DataTemplates_and_Examples/.

| Dataset | Configuration | Notes | |---------|---------------|-------| | Code Lake | Local threshold (default params) | Primary test case | | Code Lake | Global threshold | Tests the global threshold path | | Chickaree Lake | CH10 (non-default tolerances) | GMM variability near record gaps | | Thunder Lake | TL06 | — | | Raven Lake | RA07 | — |

These five combinations are the required validation targets for the R translation. At each phase, R outputs must be compared numerically against the corresponding MATLAB V2.0 CSV output files.

4. Analytical Workflow Summary

The five steps the program performs are listed here for reference when implementing R equivalents. See the User's Guide (CharAnalysis_UsersGuide.md) for the complete description.

  1. Interpolate — Resample the raw charcoal series (concentration, sediment accumulation rate) to equal time intervals using a proportion-weighted algorithm that preserves sediment structure. Compute CHAR (pieces cm⁻² yr⁻¹). Optionally apply log transformation.

  2. Smooth — Model low-frequency trends in CHAR (C_background) using one of five methods: lowess, robust lowess, moving average, moving median, or moving mode.

  3. Remove background — Compute the high-frequency component C_peak as either residuals (C_int − C_back) or ratios (C_int / C_back).

  4. Threshold — Define a threshold value t and flag samples where C_peak > t as candidate peaks. Threshold can be globally defined (whole record) or locally defined (sliding window). Two noise-distribution approaches: Gaussian assumption or Gaussian mixture model.

  5. Screen peaks — Apply a minimum-count test (two-sample Poisson test) to remove statistically insignificant peaks. Compute fire-history metrics: peak magnitudes, fire return intervals (FRIs), Weibull fits, bootstrapped CIs, smoothed fire frequency, SNI, and KS goodness-of-fit.

5. Phased Implementation Plan

The translation is organized into four sequential phases. Phases 1–3 are the core deliverables. Phase 4 is packaging and polish.

Phase 1 — Parameter Reading and Pretreatment

Goal: Read input files and reproduce the interpolated CHAR series exactly.

R functions to implement:

R packages needed: readxl or readr (input), base R (computation).

Quantitative validation targets (Phase 1):

Compare the following columns from the MATLAB V2.0 CSV output against R output on all five reference datasets. Tolerances are absolute differences:

| Column | Variable | Required tolerance | |--------|----------|--------------------| | cmTop_i | Interpolated depth | Exact match | | ageTop_i | Interpolated age | Exact match | | charCount_i | Interpolated count | < 1e-10 | | charVol_i | Interpolated volume | < 1e-10 | | charCon_i | Interpolated concentration | < 1e-10 | | charAcc_i | Interpolated CHAR | < 1e-10 |

Validation script (Phase 1):

# Example validation pattern — apply to each of the five reference datasets
matlab_out <- read.csv("CodeLake_V2_local_MATLAB.csv")
r_out      <- CharAnalysis("CodeLake_charParams.csv")

stopifnot(max(abs(r_out$ageTop_i - matlab_out$ageTop_i)) < 1e-10)
stopifnot(max(abs(r_out$charAcc_i - matlab_out$charAcc_i)) < 1e-10)

Phase 2 — Smoothing and Threshold Determination

Goal: Reproduce C_background and C_peak, and determine threshold values.

R functions to implement:

R packages needed: zoo, mclust, stats (base).

Lowess wrapper note: MATLAB's charLowess.m (V2.0) uses span as the number of points or as a fraction of data length. R's lowess() uses f as a fraction always. Write a thin wrapper char_lowess(y, span) that converts span to f before calling lowess(). Validate this wrapper carefully against MATLAB output on all five reference datasets before using it in char_smooth().

Quantitative validation targets (Phase 2):

| Column | Variable | Required tolerance | |--------|----------|--------------------| | charBkg | C_background | < 1e-8 | | charPeak | C_peak | < 1e-8 | | thresh1thresh3, threshFinalPos, threshFinalNeg | Threshold values | < 1e-6 | | SNI | Signal-to-noise index | < 1e-6 | | threshGOF | KS goodness-of-fit p-value | < 1e-4 |

For the Chickaree Lake (CH10) dataset with threshMethod = 3 (GMM), the allowable tolerance for threshFinalPos is relaxed to < 0.015 due to irreducible GMM variability near record gaps (documented in the V2.0 validation report). This relaxed tolerance must be explicitly noted in the Phase 2 validation output.

Phase 3 — Peak Identification and Fire-History Metrics

Goal: Reproduce all peak identification and post-processing outputs.

R functions to implement:

Weibull parameterization note: MATLAB uses parameter order (a = scale, b = shape); R's MASS::fitdistr() and pweibull() / dweibull() use (shape, scale). Swap consistently in all downstream calculations and document this explicitly in the code.

R packages needed: MASS, boot, stats (base).

Quantitative validation targets (Phase 3):

| Column | Variable | Required tolerance | |--------|----------|--------------------| | peaks1peaks3, peaksFinal | Peak binary flags | Exact match | | peaksInsig | Insignificant peaks | Exact match | | peakMag | Peak magnitude | < 1e-6 | | smPeakFrequ | Smoothed fire frequency | < 1e-6 | | smFRIs | Smoothed FRI curve | < 1e-6 | | mFRI | Mean FRI by zone | < 1e-3 | | WBLb, WBLc | Weibull parameters | < 0.5 (see note) |

Weibull CI note: Bootstrapped Weibull CIs (WBLb_uCI, WBLb_lCI, WBLc_uCI, WBLc_lCI) will differ between MATLAB and R due to random seed differences in bootstrap resampling. Validate point estimates (WBLb, WBLc) only; treat CI comparisons as qualitative.

Phase 4 — Output, Figures, and Packaging

Goal: Produce equivalent output figures and deliver a usable R package.

R functions to implement:

Package structure (as built):

R/
  CharParameters.R
  CharPretreatment.R
  CharSmooth.R
  CharThreshGlobal.R
  CharThreshLocal.R
  CharPeakID.R
  CharPostProcess.R
  CharWriteResults.R
  CharPlotResults.R       (Figs 1-8 + CharPlotAll wrapper; Figs 9-10 not implemented)
  CharAnalysis.R          (top-level pipeline wrapper)
  utils_lowess.R          (char_lowess helper)
  CharValidateParams.R
  utils_gaussian_mixture.R  (direct port of MATLAB EM; see Section 6.2)
tests/
  test_phase1.R
  test_phase2.R
  test_phase3.R
  test_phase4.R
  CO_charData.csv         (Code Lake reference input)
  CO_charParams.csv
  CO_charResults.csv      (MATLAB V2.0 reference output)
  CO_R_charResults.csv    (R output — generated April 14, 2026)
  CH10_charData.csv       (Chickaree Lake reference input)
  CH10_charParams.csv
  CH10_charResults.csv    (MATLAB V2.0 reference output)
  SI17_charData.csv       (Silver Lake)
  SI17_charParams.csv
  SI17_charResults.csv    (MATLAB V2.0 reference output)
  SI17_charResults.csv    (R output — generated April 15, 2026)
  RA07_charData.csv       (Raven Lake)
  RA07_charParams.csv
  RA07_charResults.csv    (MATLAB V2.0 reference output)
  RA07_charResults.csv    (R output — generated April 15, 2026)
vignettes/
  CharAnalysis_intro.Rmd
DESCRIPTION
NAMESPACE

R packages needed: ggplot2, patchwork, ggtext, MASS, zoo, stats (base).

Validation at Phase 4: Re-run the full Phase 1–3 validation suite through char_run() to confirm end-to-end equivalence has not regressed.

6. Known Translation Issues and Required Workarounds

These are the most important implementation decisions. Address each one explicitly and document the resolution in code comments.

6.1 Lowess span convention

MATLAB's charLowess.m (V2.0) accepts span as either a fraction of data length (if < 1) or a number of points (if ≥ 1). R's lowess() accepts f as a fraction only. Write a wrapper that converts: f = span / length(y) when span >= 1, and f = span when span < 1. Validate the wrapper on all five reference datasets before integrating it into char_smooth().

6.2 Gaussian mixture model — direct port of MATLAB EM

The MATLAB codebase bundles a custom EM implementation with MDL-based order selection (GaussianMixture.m, with sub-functions bundled as local functions within that file). Rather than replacing this with mclust, the R implementation directly ports the MATLAB EM algorithm in utils_gaussian_mixture.R. The port replicates MATLAB's first/last-point initialisation and uses the same loose convergence criterion (ε = 0.03 × log(N)) as the original Bowman CLUSTER EM. No mclust dependency is needed.

Despite this close replication, irreducible floating-point divergence between MATLAB and R runtimes causes the EM to converge along different trajectories on some sliding windows, particularly when the window is data-sparse or contains near-ties. Observed divergences across all validated datasets (as of April 15, 2026):

| Dataset | Config | charBkg max\|diff\| | Peaks: R vs MATLAB | Note | |---------|--------|--------------------|--------------------|------| | CO (Code Lake) | local, GMM, method 1 | ~5×10⁻⁶ | 39 vs 48 | GMM only; method 1 unaffected by smooth() | | CH10 (Chickaree) | local, GMM, method 2, gap | 0.267 | 59 vs 50 | charBkg diff + GMM; gap amplifies smooth() divergence | | SI17 (Silver Lake) | local, GMM, method 2 | 0.130 | 25 vs 31 | charBkg diff + GMM | | RA07 (Raven Lake) | local, GMM, method 2, no gap | <0.001 | 15 vs 17 | GMM only; no gap, method 2 essentially identical |

These differences cascade into all downstream outputs (thresholds, peak magnitude, fire frequency, FRIs, Weibull statistics) wherever the peak count differs. Full per-dataset phase-by-phase results are in z_Validation_report_R_vs_MATLAB_V_2.0.md.

The allowed tolerance for threshFinalPos on all GMM-threshold datasets is relaxed to ±0.10 for the purposes of this validation. Peak-count differences should be documented in release notes as a known implementation difference, not a bug.

6.3 Weibull parameterization order

MATLAB's wblfit() returns (a = scale, b = shape). R's MASS::fitdistr() and distribution functions use (shape, scale). This swap must be applied consistently everywhere Weibull parameters are computed or used downstream.

6.4 Minimum-count test: tcdf vs. pnorm

MATLAB's CharPeakID.m uses 1 - tcdf(d, 1e10) for the minimum-count p-value. With 10^10 degrees of freedom, the t-distribution is numerically identical to the standard normal. Use 1 - pnorm(d) in R. This is mathematically equivalent and requires no tolerance relaxation.

6.5 Bootstrap CIs: bootstrp vs. boot

MATLAB's bootstrp(1000, 'mean', FRI) maps to boot::boot(FRI, function(x, i) mean(x[i]), R = 1000)$t. Bootstrapped CIs will differ between implementations due to random seed differences. Validate point estimates only; treat CI differences as expected.

6.6 Weibull optimization failure for small zones

MASS::fitdistr(x, "weibull") uses optim() internally and can fail to converge ("optimization failed") when the data vector is small (fewer than ~10 points) and all values are unique — a situation that arises in zones with few fire events. The default starting values place the optimizer on a flat region of the log-likelihood surface.

Fix (implemented in CharPostProcess.R): On failure, retry with method-of-moments starting values:

sh    <- max(0.1, (mean(x) / sd(x))^1.086)   # MoM shape approximation
scale <- mean(x) / gamma(1 + 1/sh)
MASS::fitdistr(x, "weibull", start = list(shape = sh, scale = scale))

This was observed on the CO dataset zone 2 (6 FRIs, all unique bin centres). The retry succeeds and produces results consistent with MATLAB's wblfit().

6.8 Anderson-Darling test in smoothFRI

MATLAB calls a custom AnDarksamtest() bundled with the distribution. The R implementation uses ks.test() (base R) for two-sample comparisons between zones, which avoids an additional package dependency. If Anderson-Darling tests are needed in a future revision, kSamples::ad.test(list(x1, x2), method = "asymptotic") is the appropriate replacement.

7. Quantitative Validation Protocol

7.1 MATLAB reference output files

Before beginning R development, generate fresh MATLAB V2.0 CSV output for all five reference datasets and save them as the authoritative reference files. These files contain the ground truth for all numerical comparisons.

Suggested naming convention:

CodeLake_local_MATLAB_V2.csv
CodeLake_global_MATLAB_V2.csv
Chickaree_CH10_MATLAB_V2.csv
Thunder_TL06_MATLAB_V2.csv
Raven_RA07_MATLAB_V2.csv

7.2 Validation script structure

As of April 15, 2026, all four primary datasets have been validated through Phase 4. The validation script described below has not yet been written as a formal automated suite (validate_all.R). Full numerical results are in z_Validation_report_R_vs_MATLAB_V_2.0.md. Summary below:

All-dataset validation summary — April 15, 2026:

| Dataset | Phase 1 | charBkg max\|diff\| | Peaks R vs MATLAB | peaks Insig. | |---------|---------|-------------------|-------------------|-------------| | CO (local, method 1) | PASS (≤5×10⁻⁶) | ~5×10⁻⁶ | 39 vs 48 | exact match | | CH10 (local, method 2, gap) | PASS (≤5×10⁻⁶) | 0.267 | 59 vs 50 | exact match | | SI17 (local, method 2) | PASS (≤5×10⁻⁵) | 0.130 | 25 vs 31 | exact match | | RA07 (local, method 2) | PASS (≤5×10⁻⁵) | <0.001 | 15 vs 17 | max\|diff\|=1 |

R output CSVs saved in tests/: CO_R_charResults.csv, CH10_charResults.csv, SI17_charResults.csv, RA07_charResults.csv.

Create a single validation script (tests/validate_all.R) that runs all five reference datasets through the R implementation and compares against the MATLAB reference CSVs. The script should:

  1. Load the MATLAB reference CSV and the R output for each dataset.
  2. For each validated column, compute max(abs(r_out - matlab_ref)).
  3. Assert the value is within the required tolerance for that column and phase.
  4. Print a pass/fail summary table.

Example summary table format:

Dataset              Column         Max abs diff   Tolerance   Status
CodeLake (local)     charAcc_i      3.2e-13        1e-10       PASS
CodeLake (local)     charBkg        4.1e-9         1e-8        PASS
CodeLake (local)     threshFinalPos 2.3e-7         1e-6        PASS
CodeLake (local)     peaksFinal     0              exact       PASS
Chickaree (CH10)     threshFinalPos 0.0087         0.015       PASS
...

7.3 Regression test maintenance

Once Phase 3 validation passes, freeze the reference CSV files and the validation script as permanent regression tests. Any future change to the R code must pass the full validation suite before being merged.

8. Repository and Documentation

The R translation should be developed in the same repository, in a new subfolder CharAnalysis_R/, and initially on a dedicated branch (e.g. r-translation).

Documentation expectations

9. R Package Dependencies

| Package | Status | Purpose | |---------|--------|---------| | MASS | Required | fitdistr() for Weibull parameter estimation | | zoo | Required | rollmean, rollmedian, rollapply in CharSmooth.R | | stats | Required (base) | ks.test(), lowess(), qnorm, pnorm, dnorm | | graphics | Required (base) | Legacy plotting fallback | | utils | Required (base) | read.csv(), write.table() | | ggplot2 (≥ 3.4.0) | Suggested | All output figures via CharPlotResults.R | | patchwork | Suggested | Multi-panel figure layout | | ggtext | Suggested | HTML/markdown subscripts in figure panel titles | | readxl | Suggested | Read Excel parameter files (if .xls/.xlsx input) |

Packages listed in earlier drafts (mclust, boot, kSamples, openxlsx) are not used in the as-built implementation.

10. Summary of Deliverables by Phase

| Phase | Deliverable | Validation status | |-------|-------------|-------------------| | 1 | CharParameters(), CharPretreatment() | COMPLETE — all 4 datasets PASS (≤5×10⁻⁵, num2str artifact) | | 2 | CharSmooth(), CharThreshGlobal(), CharThreshLocal(), charLowess.R | COMPLETE — charBkg known smooth() divergence (method 2); all datasets validated (Section 6.2) | | 3 | CharPeakID(), CharPostProcess() | COMPLETE — all 4 datasets validated; Weibull fix applied (Section 6.6); zone stats consistent with peak-count differences | | 4 | CharWriteResults(), CharPlotResults.R (Figs 1–8; Figs 9–10 not implemented), CharAnalysis(), vignette, DESCRIPTION, NAMESPACE | COMPLETE — all 4 R output CSVs saved; vignette updated; validation report created. validate_all.R automated suite not yet written. |

This project was planned with the assistance of Claude, an AI assistant by Anthropic. All code will be reviewed and validated by Philip Higuera (philip.higuera@umontana.edu) against Version 2.0 MATLAB reference outputs before being merged to the main branch.



Try the CharAnalysis package in your browser

Any scripts or data that you put into this service are public.

CharAnalysis documentation built on May 3, 2026, 5:06 p.m.