In UBC-MDS/datascience.eda.R: Common Functions for EDA

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

datascience.eda.R

This package includes functions assisting data scientists with various common tasks during the exploratory data analysis stage of a data science project. Its functions will help the data scientist to do preliminary analysis on common column types like numeric columns, categorical columns and text columns; it will also conduct several experimental clusterings on the dataset.

Our functions are tailored based on our own experience, there are also similar packages published, a few good ones worth mentioning:

Main functions

explore_numeric_columns: conducts common exploratory analysis on columns with numeric type: it generates a heatmap showing correlation coefficients (using pearson, kendall or spearman correlation on choice), histograms and SPLOM plots for all numeric columns or a list of columns specified by the user. This returns a list of plot objects so that the user can save and use them later on.
explore_categorical_columns: performs exploratory analysis on categorical features. It returns a list having a tibble with column names, corresponding unique categories, counts of null values, percentages of null values and most frequent categories along with bar plots of user provided categorical columns.
explore_text_columns: performs exploratory data analysis of text features. If text feature columns are not specified, the function will try to identify the text features. The function prints the summary statistics of character length and word count, and plots a histogram of its distributions. It also plots the word cloud of words (after removing stopwords) and bi-grams. Bar charts of top 10 words (after removing stopwords) and top 10 bi-grams will be plotted as well. This function returns a list of all the results.
explore_clustering: fits K-Means and DBSCAN clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a dictionary (list) with each key being name of the clustering algorithm and the value being a list of plots generated by the models.
explore_KMeans_clustering: fits K-Means clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a list of plots.
explore_DBSCAN_clustering: fits K-DBSCAN clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a list of plots.

Installation

You can install the development version of datascience.eda with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/datascience.eda.R")

Example

`explore_KMeans_clustering` and `explore_DBSCAN_clustering`

library(datascience.eda)
library(palmerpenguins)

# you can call each clustering algorithm separately 
explore_KMeans_clustering(penguins, centers = seq(3, 5))
explore_DBSCAN_clustering(penguins, eps = c(1), minPts = c(5))

# OR you can just call explore_clustering(penguins) to apply both KMeans and DBSCAN at once

`explore_text_columns`

library(sacred)
results <- explore_text_columns(apocrypha)

`explore_numeric_columns`

results <- explore_numeric_columns(penguins)

`explore_categorical_columns`

library(dplyr)
library(MASS)
df <- data.frame(lapply(survey[, c('Sex','Clap')], as.character),
                 stringsAsFactors=FALSE) %>% tibble()

results <- explore_categorical_columns(df, c('Sex','Clap'))
results[[1]] %>% knitr::kable()
results[[2]][[1]]
results[[2]][[2]]