knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
This package includes functions assisting data scientists with various common tasks during the exploratory data analysis stage of a data science project. Its functions will help the data scientist to do preliminary analysis on common column types like numeric columns, categorical columns and text columns; it will also conduct several experimental clusterings on the dataset.
Our functions are tailored based on our own experience, there are also similar packages published, a few good ones worth mentioning:
explore_numeric_columns
: conducts common exploratory analysis on columns with numeric type: it generates a heatmap showing correlation coefficients (using pearson
, kendall
or spearman
correlation on choice), histograms and SPLOM plots for all numeric columns or a list of columns specified by the user. This returns a list of plot objects so that the user can save and use them later on.
explore_categorical_columns
: performs exploratory analysis on categorical features. It returns a list having a tibble with column names, corresponding unique categories, counts of null values, percentages of null values and most frequent categories along with bar plots of user provided categorical columns.
explore_text_columns
: performs exploratory data analysis of text features. If text feature columns are not specified, the function will try to identify the text features. The function prints the summary statistics of character length and word count, and plots a histogram of its distributions. It also plots the word cloud of words (after removing stopwords) and bi-grams. Bar charts of top 10 words (after removing stopwords) and top 10 bi-grams will be plotted as well. This function returns a list of all the results.
explore_clustering
: fits K-Means and DBSCAN clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a dictionary (list) with each key being name of the clustering algorithm and the value being a list of plots generated by the models.
explore_KMeans_clustering
: fits K-Means clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a list of plots.
explore_DBSCAN_clustering
: fits K-DBSCAN clustering algorithms on the dataset and visualizes the clustering using scatterplots on PCA transformed data. It returns a list of plots.
You can install the development version of datascience.eda with:
# install.packages("devtools") devtools::install_github("UBC-MDS/datascience.eda.R")
explore_KMeans_clustering
and explore_DBSCAN_clustering
library(datascience.eda) library(palmerpenguins) # you can call each clustering algorithm separately explore_KMeans_clustering(penguins, centers = seq(3, 5)) explore_DBSCAN_clustering(penguins, eps = c(1), minPts = c(5)) # OR you can just call explore_clustering(penguins) to apply both KMeans and DBSCAN at once
explore_text_columns
library(sacred) results <- explore_text_columns(apocrypha)
explore_numeric_columns
results <- explore_numeric_columns(penguins)
explore_categorical_columns
library(dplyr) library(MASS) df <- data.frame(lapply(survey[, c('Sex','Clap')], as.character), stringsAsFactors=FALSE) %>% tibble() results <- explore_categorical_columns(df, c('Sex','Clap')) results[[1]] %>% knitr::kable() results[[2]][[1]] results[[2]][[2]]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.