fastexplore | R Documentation |
fastexplore provides a fast and comprehensive exploratory data analysis (EDA) workflow.
It automatically detects variable types, checks for missing and duplicated data,
suggests potential ID columns, and produces a variety of plots (histograms, boxplots,
scatterplots, correlation heatmaps, etc.). It also includes optional outlier detection,
normality testing, and feature engineering.
fastexplore(
  data,
  label = NULL,
  visualize = c("histogram", "boxplot", "barplot", "heatmap", "scatterplot"),
  save_results = TRUE,
  output_dir = NULL,
  sample_size = NULL,
  interactive = FALSE,
  corr_threshold = 0.9,
  auto_convert_numeric = TRUE,
  visualize_missing = TRUE,
  imputation_suggestions = FALSE,
  report_duplicate_details = TRUE,
  detect_near_duplicates = TRUE,
  auto_convert_dates = FALSE,
  feature_engineering = FALSE,
  outlier_method = c("iqr", "zscore", "dbscan", "lof"),
  run_distribution_checks = TRUE,
  normality_tests = c("shapiro"),
  pairwise_matrix = TRUE,
  max_scatter_cols = 5,
  grouped_plots = TRUE,
  use_upset_missing = TRUE
)
data |
A |
label |
A character string specifying the name of the target or label column (optional). If provided, certain grouped plots and class imbalance checks will be performed. |
visualize |
A character vector specifying which visualizations to produce.
Possible values: |
save_results |
Logical. If |
output_dir |
A character string specifying the output directory for saving results
(if |
sample_size |
An integer specifying a random sample size for the data to be used in
visualizations. If |
interactive |
Logical. If |
corr_threshold |
Numeric. Threshold above which correlations (in absolute value)
are flagged as high. Defaults to |
auto_convert_numeric |
Logical. If |
visualize_missing |
Logical. If |
imputation_suggestions |
Logical. If |
report_duplicate_details |
Logical. If |
detect_near_duplicates |
Logical. Placeholder for near-duplicate (fuzzy) detection. Currently not implemented. |
auto_convert_dates |
Logical. If |
feature_engineering |
Logical. If |
outlier_method |
A character string indicating which outlier detection method(s) to apply.
One of |
run_distribution_checks |
Logical. If |
normality_tests |
A character vector specifying which normality tests to run.
Possible values include |
pairwise_matrix |
Logical. If |
max_scatter_cols |
Integer. Maximum number of numeric columns to include in the pairwise matrix. |
grouped_plots |
Logical. If |
use_upset_missing |
Logical. If |
This function automates many steps of EDA:
Automatically detects numeric vs. categorical variables.
Auto-converts columns that look numeric (and optionally date-like).
Summarizes data structure, missingness, duplication, and potential ID columns.
Computes correlation matrix and flags highly correlated pairs.
(Optional) Outlier detection using IQR, Z-score, DBSCAN, or LOF methods.
(Optional) Normality tests on numeric columns.
Saves all results and an R Markdown report if save_results = TRUE.
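Several of the steps listed above can be approximated in base R. The following is a minimal sketch on made-up data, not the code fastexplore itself runs: the toy data frame, the variable names, and the exact rules used (for example, treating a column as a potential ID when it has no missing values and no repeats, and using 1.5 * IQR fences) are assumptions made for illustration.

```r
# Toy data (hypothetical): one ID column, two perfectly correlated
# numeric columns, and one numeric column with missing values.
x <- c(rep(seq(-2, 2, length.out = 7), times = 7), 25)  # 25 is an injected extreme value
df <- data.frame(
  id = sprintf("row%02d", 1:50),
  x  = x,
  y  = 2 * x + 1,
  z  = rep(c(0.1, 0.5, 0.9, 0.3, 0.7), times = 10),
  stringsAsFactors = FALSE
)
df$z[c(3, 7)] <- NA

# Detect numeric columns.
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# Missingness: count and percentage per column.
missing_data <- data.frame(
  column   = names(df),
  n_miss   = colSums(is.na(df)),
  pct_miss = 100 * colMeans(is.na(df)),
  row.names = NULL
)

# Number of fully duplicated rows.
n_dup <- sum(duplicated(df))

# Potential ID columns: no missing values and a distinct value in every row.
is_id <- vapply(df, function(col) !anyNA(col) && anyDuplicated(col) == 0, logical(1))
potential_id_cols <- names(df)[is_id]

# Correlation matrix and pairs whose absolute correlation exceeds the threshold.
corr_threshold <- 0.9
cm  <- cor(df[num_cols], use = "pairwise.complete.obs")
idx <- which(abs(cm) > corr_threshold & upper.tri(cm), arr.ind = TRUE)
high_corr_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  cor  = cm[idx],
  stringsAsFactors = FALSE
)

# IQR rule: proportion of values outside Q1 - 1.5 * IQR or Q3 + 1.5 * IQR.
iqr_outlier_prop <- vapply(df[num_cols], function(col) {
  q <- quantile(col, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  mean(col < q[1] - fence | col > q[2] + fence, na.rm = TRUE)
}, numeric(1))

# Shapiro-Wilk normality test p-value for each numeric column.
shapiro_p <- vapply(df[num_cols], function(col) shapiro.test(col)$p.value, numeric(1))
```

On this toy data, id is the only potential ID column, the only flagged pair is x and y, and the injected value 25 is the only IQR outlier in x.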
A (silent) list containing:
data_overview - A basic overview (head, unique values, skim summary).
summary_stats - Summary statistics for numeric columns.
freq_tables - Frequency tables for factor columns.
missing_data - Missing data overview (count, percentage).
duplicated_rows - Count of duplicated rows.
class_imbalance - Class distribution if label is provided and is categorical.
correlation_matrix - The correlation matrix for numeric variables.
zero_variance_cols - Columns with near-zero variance.
potential_id_cols - Columns with unique values in every row.
date_time_cols - Columns recognized as date/time.
high_corr_pairs - Pairs of variables with correlation above corr_threshold.
outlier_method - The chosen method for outlier detection.
outlier_summary - Outlier proportions or metrics (if computed).
If save_results = TRUE, additional side effects include saving figures, a correlation heatmap,
and an R Markdown report in the specified directory.
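A call might look like the sketch below. It uses only argument names and returned element names documented on this page, with the built-in iris data set as a stand-in for real data. This page does not name the package that provides fastexplore, so the call is guarded and runs only if a function with that name is already available in the session. The specific argument values are illustrative assumptions, not recommended settings.

```r
# Arguments taken from the usage section above; the values are illustrative.
eda_args <- list(
  data            = iris,
  label           = "Species",              # categorical target column in iris
  visualize       = c("histogram", "boxplot"),
  save_results    = FALSE,                  # avoid writing figures or a report
  corr_threshold  = 0.9,
  outlier_method  = "iqr",
  normality_tests = "shapiro"
)

# Run only if fastexplore has been loaded from its package (assumption:
# the user has already attached whichever package provides it).
if (exists("fastexplore", mode = "function")) {
  res <- do.call(fastexplore, eda_args)

  # The result is returned invisibly, so assign it and inspect elements by name.
  print(names(res))
  print(res$missing_data)
  print(res$high_corr_pairs)
}
```

Because the list is returned silently, nothing is printed unless the result is assigned and its elements are inspected explicitly.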