fastexplore: Explore and Summarize a Dataset Quickly
In fastml: Fast Machine Learning Model Training and Evaluation

fastexplore

R Documentation

Description

fastexplore runs a fast, comprehensive exploratory data analysis (EDA) workflow. It detects variable types, checks for missing and duplicated data, suggests potential ID columns, and produces a range of plots (histograms, boxplots, bar plots, scatterplots, and correlation heatmaps). Outlier detection, normality testing, and feature engineering are optional.

Usage

fastexplore(
  data,
  label = NULL,
  visualize = c("histogram", "boxplot", "barplot", "heatmap", "scatterplot"),
  save_results = TRUE,
  output_dir = NULL,
  sample_size = NULL,
  interactive = FALSE,
  corr_threshold = 0.9,
  auto_convert_numeric = TRUE,
  visualize_missing = TRUE,
  imputation_suggestions = FALSE,
  report_duplicate_details = TRUE,
  detect_near_duplicates = TRUE,
  auto_convert_dates = FALSE,
  feature_engineering = FALSE,
  outlier_method = c("iqr", "zscore", "dbscan", "lof"),
  run_distribution_checks = TRUE,
  normality_tests = c("shapiro"),
  pairwise_matrix = TRUE,
  max_scatter_cols = 5,
  grouped_plots = TRUE,
  use_upset_missing = TRUE
)
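
A minimal call might look like the sketch below. It assumes the fastml package is installed and uses the built-in `iris` data purely for illustration; the argument names come from the signature above, and the element names read from the result are those listed under Value.

```r
# Sketch only: runs the call when fastml is available, otherwise skips it.
res <- NULL
if (requireNamespace("fastml", quietly = TRUE)) {
  res <- fastml::fastexplore(
    iris,
    label = "Species",                      # target column for grouped plots
    visualize = c("histogram", "heatmap"),  # subset of the available plots
    save_results = FALSE,                   # do not write plots or a report
    outlier_method = "iqr"
  )
  # The result is returned silently, so print the parts of interest explicitly
  print(res$missing_data)
  print(res$high_corr_pairs)
}
```

Setting `save_results = FALSE` keeps the sketch from writing an `EDA_Results_` folder to disk.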

Arguments

`data`	A `data.frame`. The dataset to analyze.
`label`	A character string specifying the name of the target or label column (optional). If provided, certain grouped plots and class imbalance checks will be performed.
`visualize`	A character vector specifying which visualizations to produce. Possible values: `c("histogram", "boxplot", "barplot", "heatmap", "scatterplot")`.
`save_results`	Logical. If `TRUE`, saves plots and a rendered report (HTML) into a timestamped `EDA_Results_` folder inside `output_dir`.
`output_dir`	A character string specifying the directory in which results are saved (used only if `save_results = TRUE`). If `NULL` (the default), the current working directory is used.
`sample_size`	An integer giving the number of rows to sample at random for use in visualizations. If `NULL`, the entire dataset is used.
`interactive`	Logical. If `TRUE`, attempts to produce interactive Plotly heatmaps and other interactive elements. If required packages are not installed, falls back to static plots.
`corr_threshold`	Numeric. Threshold above which correlations (in absolute value) are flagged as high. Defaults to `0.9`.
`auto_convert_numeric`	Logical. If `TRUE`, automatically converts factor/character columns that look numeric (only digits, minus sign, or decimal point) to numeric.
`visualize_missing`	Logical. If `TRUE`, attempts to visualize missingness patterns (e.g., with an UpSet plot if UpSetR is available, or with VIM or naniar).
`imputation_suggestions`	Logical. If `TRUE`, prints simple text suggestions for imputation strategies.
`report_duplicate_details`	Logical. If `TRUE`, shows top duplicated rows and their frequency.
`detect_near_duplicates`	Logical. Placeholder for near-duplicate (fuzzy) detection. Currently not implemented.
`auto_convert_dates`	Logical. If `TRUE`, attempts to detect and convert date-like strings (`YYYY-MM-DD`) to `Date` format.
`feature_engineering`	Logical. If `TRUE`, automatically engineers derived features (day, month, year) from any date/time columns, and identifies potential ID columns.
`outlier_method`	A character vector naming the outlier detection method to apply. Possible values: `c("iqr", "zscore", "dbscan", "lof")`. Only the first match is currently used, although the function is designed to accommodate multiple methods.
`run_distribution_checks`	Logical. If `TRUE`, runs normality tests (e.g., Shapiro-Wilk) on numeric columns.
`normality_tests`	A character vector specifying which normality tests to run. Possible values include `"shapiro"` or `"ks"` (Kolmogorov-Smirnov). Only used if `run_distribution_checks = TRUE`.
`pairwise_matrix`	Logical. If `TRUE`, produces a scatterplot matrix (using GGally) for numeric columns.
`max_scatter_cols`	Integer. Maximum number of numeric columns to include in the pairwise matrix.
`grouped_plots`	Logical. If `TRUE`, produces histograms, violin plots, and density plots grouped by the label (if the label is a factor).
`use_upset_missing`	Logical. If `TRUE`, attempts to produce an UpSet plot for missing data if UpSetR is available.
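
To make the `auto_convert_numeric` description concrete, the following base R sketch shows one way a "looks numeric" check could work. The helper `looks_numeric` and its regular expression are hypothetical and written for illustration; they are not taken from the fastml source, which may apply a different rule.

```r
# Hypothetical helper, not part of fastml: TRUE when every non-missing value
# is made up of digits with an optional leading minus sign and decimal point.
looks_numeric <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x)]
  length(x) > 0 && all(grepl("^-?[0-9]*\\.?[0-9]+$", x))
}

df <- data.frame(
  price = c("10.5", "-3", "7"),
  code  = c("A1", "B2", "C3"),
  stringsAsFactors = FALSE
)

# Pick out character/factor columns that pass the check
to_convert <- vapply(df, function(col) {
  (is.character(col) || is.factor(col)) && looks_numeric(col)
}, logical(1))

# Go through as.character() first so factors convert by label, not by level code
df[to_convert] <- lapply(df[to_convert], function(col) as.numeric(as.character(col)))

str(df)
```

Converting through `as.character()` matters for factors, since calling `as.numeric()` on a factor directly returns its internal level codes.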

Details

This function automates many steps of EDA:

Automatically detects numeric vs. categorical variables.
Auto-converts columns that look numeric (and optionally date-like).
Summarizes data structure, missingness, duplication, and potential ID columns.
Computes correlation matrix and flags highly correlated pairs.
(Optional) Outlier detection using IQR, Z-score, DBSCAN, or LOF methods.
(Optional) Normality tests on numeric columns.
Saves all results and an R Markdown report if save_results = TRUE.
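
Three of the steps above (flagging highly correlated pairs, IQR-based outlier detection, and Shapiro-Wilk normality testing) can be sketched in base R as follows. This is an illustration of the general techniques on the built-in `mtcars` data, not the package's internal code; the helper `iqr_outliers` and the conventional 1.5 * IQR fence are assumptions.

```r
num <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]

# 1. Flag variable pairs whose absolute correlation exceeds a threshold.
#    upper.tri() keeps each pair once and drops the diagonal.
corr_threshold <- 0.9
cm <- cor(num, use = "pairwise.complete.obs")
idx <- which(abs(cm) > corr_threshold & upper.tri(cm), arr.ind = TRUE)
high_corr <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# 2. IQR rule (hypothetical helper): mark values lying more than
#    1.5 * IQR below the first quartile or above the third quartile.
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
outlier_share <- vapply(num, function(x) mean(iqr_outliers(x), na.rm = TRUE),
                        numeric(1))

# 3. Shapiro-Wilk normality test, one p-value per numeric column
shapiro_p <- vapply(num, function(x) shapiro.test(x)$p.value, numeric(1))

print(high_corr)
print(outlier_share)
print(shapiro_p)
```

Note that `shapiro.test()` accepts between 3 and 5000 non-missing values, so larger datasets would need to be sampled before testing.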

Value

A list, returned invisibly, containing:

data_overview - A basic overview (head, unique values, skim summary).
summary_stats - Summary statistics for numeric columns.
freq_tables - Frequency tables for factor columns.
missing_data - Missing data overview (count, percentage).
duplicated_rows - Count of duplicated rows.
class_imbalance - Class distribution if label is provided and is categorical.
correlation_matrix - The correlation matrix for numeric variables.
zero_variance_cols - Columns with near-zero variance.
potential_id_cols - Columns with unique values in every row.
date_time_cols - Columns recognized as date/time.
high_corr_pairs - Pairs of variables with correlation above corr_threshold.
outlier_method - The chosen method for outlier detection.
outlier_summary - Outlier proportions or metrics (if computed).
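
For orientation, the base R sketch below shows how summaries comparable to `missing_data`, `duplicated_rows`, and `potential_id_cols` could be computed by hand. The toy data frame and the exact definitions used here are assumptions made for illustration; the structures fastexplore actually returns may differ.

```r
df <- data.frame(
  id    = 1:6,
  group = c("a", "b", "b", NA, "a", "a"),
  score = c(1.5, 2.0, 2.0, NA, 3.1, 3.1)
)

# Missing data overview: count and percentage of NA values per column
missing_data <- data.frame(
  variable    = names(df),
  n_missing   = colSums(is.na(df)),
  pct_missing = 100 * colMeans(is.na(df)),
  row.names   = NULL
)

# Number of rows that repeat an earlier row; the id column is left out here
# because it would make every row distinct
duplicated_rows <- sum(duplicated(df[, c("group", "score")]))

# Potential ID columns: no missing values and a distinct value in every row
potential_id_cols <- names(df)[vapply(df, function(col) {
  !anyNA(col) && length(unique(col)) == nrow(df)
}, logical(1))]

print(missing_data)
print(duplicated_rows)
print(potential_id_cols)
```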