generate_report: Generate a standardized report from the current session
In autocodebook: Automatic Codebook and Tracking for 'Spark' and 'dplyr' Pipelines

generate_report

R Documentation

Description

Produces an HTML report combining the eligibility flowchart, the codebook, and a per-variable inspection panel. Supports two inspection modes, `cross_sectional` and `longitudinal`, selected with `type` (see Details).

Usage

generate_report(
  data,
  type = c("cross_sectional", "longitudinal"),
  id_var = NULL,
  time_var = NULL,
  variables = NULL,
  labels = NULL,
  treat_as_categorical = NULL,
  output_html,
  output_dir = NULL,
  export_codebook_editable = TRUE,
  cache_data = TRUE,
  title = NULL,
  n_bins = 30,
  top_n_cat = 20
)

Arguments

`data`	A Spark DataFrame (tbl_spark) or local data frame.
`type`	One of `"cross_sectional"` or `"longitudinal"`.
`id_var`	Character. Name of the ID column. Required when `type = "longitudinal"`. For `"cross_sectional"`, used only to exclude the ID column from inspection. Default: NULL.
`time_var`	Character or NULL. Name of the time/wave column. Used in `longitudinal` to compute missingness-over-time. Default: NULL.
`variables`	Optional character vector. If provided, inspects only these variables. Default: NULL (all except id_var/time_var).
`labels`	Optional named list (variable -> label). If NULL, uses labels from the codebook when available.
`treat_as_categorical`	Character vector of variable names to treat as categorical even when their R class is numeric or integer. Useful for coded variables (e.g. `cod_sexo` stored as 1L/2L, `cod_raca` stored as integer). For these variables, the report uses bar charts and proportion-by-time stacked plots instead of histograms / median+IQR. Default: NULL.
`output_html`	File path for the HTML output. There is no default: the destination must be supplied explicitly (e.g. a file under `tempdir()` or a directory chosen by the user).
`output_dir`	Optional directory for ancillary files (codebook.xlsx, codebook.docx, etc.). If NULL, derived from output_html.
`export_codebook_editable`	Logical. Also export codebook as .docx and .xlsx in `output_dir`. Default: TRUE.
`cache_data`	Logical. If TRUE and `data` is a tbl_spark, persists the dataset once before the report aggregations, then releases it on exit. No-op for local data frames. Default: TRUE.
`title`	Optional title for the report.
`n_bins`	Number of bins for numeric histograms. Default: 30.
`top_n_cat`	Max categories shown in categorical plots. Default: 20.
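
As a rough illustration of why `treat_as_categorical` exists, the base-R sketch below shows that a coded variable and a genuinely numeric one can share the same R class, so class alone cannot tell them apart. The distinct-value threshold (`max_levels`) and the `candidates` helper variable are assumptions made for this example only; they are not part of the package.

```r
# Demo data: one coded variable and one true numeric variable.
set.seed(1)
df <- data.frame(
  cod_sexo = sample(c(1L, 2L), 50, replace = TRUE),
  idade    = sample(18:80, 50, replace = TRUE)
)

# Both columns are stored as integer, so a class-based rule would
# draw a histogram for cod_sexo as well.
col_classes <- vapply(df, class, character(1))
print(col_classes)

# Hypothetical heuristic (not from the package): flag integer columns
# with very few distinct values as candidates for treat_as_categorical.
max_levels <- 5L
n_distinct <- vapply(df, function(x) length(unique(x)), integer(1))
candidates <- names(n_distinct)[n_distinct <= max_levels]
print(candidates)
```

The resulting character vector could then be passed as `treat_as_categorical = candidates`, after a manual check that every flagged column really is a code.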

Details

cross_sectional: one plot per variable (histogram / bar / time).
longitudinal: three plots per variable (global distribution, intra-ID variation, missingness by time) plus a meta plot of observations per ID.
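
A longitudinal call might look like the sketch below. It only reuses arguments listed under Usage and the `cb_init()` call from the Examples section; the column name `wave`, the simulated data, and the output file name are assumptions for illustration. The report itself is rendered only when the package and its rendering dependencies are installed.

```r
# Simulated long-format data: 20 individuals observed in 3 waves.
set.seed(1)
ids <- sprintf("ID%03d", 1:20)
df_long <- expand.grid(id_indiv = ids, wave = 1:3, stringsAsFactors = FALSE)
df_long$idade <- sample(18:80, nrow(df_long), replace = TRUE)
# Introduce a few missing values so missingness by time is not empty.
df_long$idade[sample(nrow(df_long), 5)] <- NA

pkgs <- c("autocodebook", "rmarkdown", "knitr", "ggplot2", "patchwork", "scales")
can_render <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)) &&
  rmarkdown::pandoc_available()

if (can_render) {
  autocodebook::cb_init(id_col = "id_indiv")
  out_dir <- file.path(tempdir(), "autocodebook_report_long_demo")
  paths <- autocodebook::generate_report(
    df_long,
    type        = "longitudinal",
    id_var      = "id_indiv",   # required in longitudinal mode
    time_var    = "wave",       # hypothetical wave column
    output_html = file.path(out_dir, "report_long.html")
  )
  str(paths)                    # inspect the returned list of file paths
  unlink(out_dir, recursive = TRUE)
}
```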

All aggregations happen in Spark/dplyr; only small summaries are collected.
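
The "aggregate first, collect last" pattern can be sketched with plain dplyr verbs. This is not the package's internal code, only an illustration of the idea on a local data frame; the column `cod_raca` and the value of `top_n_cat` are example assumptions. With a `tbl_spark`, the same kind of pipeline is typically translated to SQL and executed remotely, so only the small summary table reaches R.

```r
library(dplyr)

set.seed(42)
df <- data.frame(cod_raca = sample(1:5, 200, replace = TRUE))
top_n_cat <- 3

top_counts <- df %>%
  count(cod_raca, sort = TRUE) %>%  # aggregation step (one row per category)
  head(top_n_cat) %>%               # keep only the most frequent categories
  collect()                         # bring the small summary into R

print(top_counts)
```

For a local data frame `collect()` simply returns the data unchanged, which lets the same pipeline serve both kinds of input.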

Value

Invisible list with paths to all generated files.

Examples


# Rendering the HTML report needs rmarkdown + pandoc and a few plotting
# packages (all in Suggests); it also takes more than 5 seconds, so the
# example is wrapped in \donttest and writes only to tempdir().
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    requireNamespace("knitr", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("patchwork", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {

  cb_init(id_col = "id_indiv")
  set.seed(1)  # make the simulated demo data reproducible
  df_baseline <- data.frame(
    id_indiv = sprintf("ID%03d", 1:50),
    cod_sexo = sample(c(1L, 2L), 50, replace = TRUE),
    idade    = sample(18:80, 50, replace = TRUE)
  )

  # Write to a dedicated subdir of tempdir() and clean everything up after:
  out_dir <- file.path(tempdir(), "autocodebook_report_demo")
  generate_report(df_baseline, type = "cross_sectional",
                  id_var = "id_indiv",
                  treat_as_categorical = "cod_sexo",
                  output_html = file.path(out_dir, "report_baseline.html"))
  unlink(out_dir, recursive = TRUE)
}

autocodebook documentation built on June 9, 2026, 1:09 a.m.