EDAMatrix: An R6 class representing a matrix dataset.
In khughitt/eda: Exploratory Data Analysis in R

Description Usage Arguments Fields Methods Examples

EDAMatrix is a helper class for wrapping data matrices, with optional support for row and column metadata. Methods are provided for common exploratory data analysis summary statistics, transformations, and visualizations.

edm <- EDAMatrix$new(mat, row_mdata=row_mdata_df, row_color='some_var')
edm$summary()

edm$plot_pca()

edm$t()$subsample(100)$plot_heatmap()

dat: An m x n dataset.
row_mdata: A matrix or data frame with rows corresponding to the row names of dat.
col_mdata: A matrix or data frame with rows corresponding to the column names of dat.
row_names: Column name or number containing row identifiers. If set to rownames (default), row names will be used as identifiers.
col_names: Column name or number containing column identifiers. If set to colnames (default), column names will be used as identifiers.
row_mdata_rownames: Column name or number containing row metadata row identifiers. If set to rownames (default), row names will be used as identifiers.
col_mdata_rownames: Column name or number containing col metadata row identifiers. If set to rownames (default), row names will be used as identifiers.
row_color: Row metadata field to use for coloring rowwise plot elements.
row_shape: Row metadata field used to determine rowwise plot element shape.
row_label: Row metadata field to use when labeling plot points or other elements.
col_color: Column metadata field to use for coloring columnwise plot elements.
col_shape: Column metadata field used to determine columnwise plot element shape.
col_label: Column metadata field to use when labeling plot points or other elements.
color_pal: Color palette to use for relevant plotting methods (default: Set1).
title: Text to use as a title or subtitle for plots.
ggplot_theme: Default theme to use for ggplot2 plots (default: theme_bw).

dat: Underlying data matrix.
row_mdata: Data frame containing row metadata.
col_mdata: Data frame containing column metadata.

clear_cache(): Clears EDAMatrix cache.
clone(): Creates a copy of the EDAMatrix instance.
cluster_tsne(k=10, ...): Clusters rows in dataset using a combination of t-SNE and k-means clustering.
detect_col_outliers(num_sd=2, ctend='median', meas='pearson'): Measures average pairwise similarities between all columns in the dataset. Outliers are those columns whose average similarity to all other columns is more than num_sd standard deviations from the average of averages.
detect_row_outliers(num_sd=2, ctend='median', meas='pearson'): Measures average pairwise similarities between all rows in the dataset. Outliers are those rows whose average similarity to all other rows is more than num_sd standard deviations from the average of averages.
feature_cor(): Detects dependencies between column metadata entries (features) and dataset rows.
filter_col_outliers(num_sd=2, ctend='median', meas='pearson'): Removes column outliers from the dataset. See detect_col_outliers() for details of outlier detection approach.
filter_row_outliers(num_sd=2, ctend='median', meas='pearson'): Removes row outliers from the dataset. See detect_row_outliers() for details of outlier detection approach.
filter_cols(mask): Accepts a logical vector of length ncol(obj$dat) and returns a new EDAMatrix instance with only the columns associated with TRUE values in the mask.
filter_rows(mask): Accepts a logical vector of length nrow(obj$dat) and returns a new EDAMatrix instance with only the rows associated with TRUE values in the mask.
impute(method='knn'): Imputes missing values in the dataset and stores the result in-place. Currently only k-Nearest Neighbors (kNN) imputation is supported.
log(base=exp(1), offset=0): Log-transforms data.
log1p(): Log(x + 1)-transforms data.
pca(...): Performs principal component analysis (PCA) on the dataset and returns a new EDAMatrix instance of the projected data points. Any additional arguments specified are passed to the prcomp() function.
pca_feature_cor(meas='pearson', ...): Measures correlation between dataset features (column metadata fields) and dataset principal components.
plot_cor_heatmap(meas='pearson', interactive=TRUE, ...): Plots a correlation heatmap of the dataset.
plot_densities(color=NULL, title="", ...): Plots densities for each column in the dataset.
plot_feature_cor(meas='pearson', color_scale=c('green', 'red')): Creates a tile plot of projected data / feature correlations. See feature_cor() function.
plot_heatmap(interactive=TRUE, ...): Generates a heatmap plot of the dataset.
plot_pairwise_column_cors(color=NULL, title="", meas='pearson', mar=c(12,6,4,6)): Plots median pairwise column correlations for each variable (column) in the dataset.
plot_pca(pcx=1, pcy=2, scale=FALSE, color=NULL, shape=NULL, title=NULL, text_labels=FALSE, ...): Generates a two-dimensional PCA plot from the dataset.
plot_tsne(color=NULL, shape=NULL, title=NULL, text_labels=FALSE, ...): Generates a two-dimensional t-SNE plot from the dataset.
print(): Prints an overview of the object instance.
subsample(row_n=NULL, col_n=NULL, row_ratio=NULL, col_ratio=NULL): Subsamples dataset rows and/or columns.
summary(markdown=FALSE, num_digits=2): Summarizes overall characteristics of a dataset.
t(): Transposes dataset rows and columns.
tsne(...): Performs t-distributed stochastic neighbor embedding (t-SNE) on the dataset and returns a new EDAMatrix instance of the projected data points. Any additional arguments specified are passed to the Rtsne() function.
tsne_feature_cor(meas='pearson', ...): Measures correlation between dataset features (column metadata fields) and dataset t-SNE projected axes.
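
The outlier rule described for detect_col_outliers() can be illustrated with a short base-R sketch. This is not the package's implementation: the function name detect_col_outliers_sketch, the handling of ctend via match.fun(), and the simulated data are all assumptions made for illustration. It summarizes each column's similarity to every other column, then flags columns whose summary lies more than num_sd standard deviations from the average.

```r
# Base-R sketch of similarity-based column outlier detection.
# Hypothetical helper; internals are assumptions, not the eda package's code.
detect_col_outliers_sketch <- function(dat, num_sd = 2, ctend = "median",
                                       meas = "pearson") {
  # Pairwise similarity between all columns
  sim <- cor(dat, method = meas, use = "pairwise.complete.obs")

  # Ignore each column's similarity to itself
  diag(sim) <- NA

  # Summarize each column's similarity to all other columns
  # (assumption: ctend names the summary function, e.g. "median" or "mean")
  ctend_fn <- match.fun(ctend)
  col_sim <- apply(sim, 2, ctend_fn, na.rm = TRUE)

  # Flag columns more than num_sd standard deviations from the average
  cutoff <- num_sd * sd(col_sim)
  names(col_sim)[abs(col_sim - mean(col_sim)) > cutoff]
}

set.seed(1)
# Simulated data: ten strongly correlated columns plus one unrelated column
shared <- rnorm(200)
dat <- sapply(1:10, function(i) shared + rnorm(200, sd = 0.3))
dat <- cbind(dat, rnorm(200))
colnames(dat) <- c(paste0("s", 1:10), "noise")

outliers <- detect_col_outliers_sketch(dat)
print(outliers)
```

With only a handful of columns, a single deviating column cannot exceed two standard deviations under this rule, which is why the sketch uses eleven columns rather than the four measurement columns of iris.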

library('eda')

dat <- as.matrix(iris[,1:4])
row_mdata <- iris[,5,drop=FALSE]

edm <- EDAMatrix$new(dat, row_mdata=row_mdata, row_color='Species')

edm
edm$summary()

edm$plot_pca()
edm$log1p()$plot_cor_heatmap()
edm$subsample(100)$plot_tsne()
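
For readers without the package installed, the idea behind pca() and pca_feature_cor() can be approximated in base R on the same iris data. This is a sketch under stated assumptions, not the package's code: it treats rows as observations, encodes the Species factor as integer codes so that a Pearson correlation can be computed, and uses the names pca_fit, species_code, and pc_cor purely for illustration.

```r
# Base-R sketch: project the data with prcomp() and correlate a metadata
# variable with each principal component. Names here are hypothetical.
dat <- as.matrix(iris[, 1:4])

# prcomp() treats rows as observations; projected points are stored in $x
pca_fit <- prcomp(dat, center = TRUE, scale. = FALSE)

# Assumption: encode the factor as integer codes to allow a Pearson
# correlation; the package may handle categorical metadata differently
species_code <- as.numeric(iris$Species)

# One correlation per principal component (4 x 1 matrix)
pc_cor <- cor(pca_fit$x, species_code, method = "pearson")
print(round(pc_cor, 2))
```

A large absolute correlation for a component suggests that the metadata variable explains much of the variation captured along that axis.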