This vignette explains how to use smallsets to build Smallset Timelines. A Smallset Timeline is a simple visualisation of data preprocessing decisions. More information on Smallset Timelines can be found in the Smallset Timeline paper and on YouTube.
A synthetic dataset, called s_data, is used throughout this vignette. It is also included in the smallsets package. It contains 100 observations and 8 variables (C1-C8). See ?s_data
for more information.
library(smallsets) head(s_data)
Run this block of code to build a Smallset Timeline. In RStudio, the figure will appear in the plots pane.
library(smallsets) set.seed(145) Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"))
Normally, you pass a character string to code
(e.g., "my_code.R" or "/.../.../my_code.R"). However, the script s_data_preprocess.R is included in smallsets as an example and needs to be called with system.file
.
Each Smallset Timeline is constructed from your dataset and R/R Markdown/Python/Jupyter Notebook data preprocessing script. Scripts must contain a series of smallsets comments with snapshot instructions. Your un-preprocessed dataset (data
) and commented preprocessing script (code
) are the only required inputs to Smallset_Timeline
.
If s_data_preprocess.R was located in your working directory, the code would look like this.
Smallset_Timeline(data = s_data, code = "s_data_preprocess.R")
The smallsets package currently supports data preprocessing workflows fitting the following description.
reticulate
can be found here.To make a Smallset Timeline with smallsets, you need to add structured comments with snapshot instructions to your preprocessing script. All smallsets comments follow the same formula.
# smallsets snap + snap-place + name-of-data-object + caption[caption-text]caption
Ex: # smallsets snap +4 mydata caption[I removed rows that had implausible values.]caption
The following section includes an example R preprocessing script with smallsets structured comments.
There are three options for this argument.
This refers to the data object. The name of the data object can change throughout the script. The snapshot is taken of the object specified in the comment.
This is the snapshot caption describing the preprocessing step.
This is the example R preprocessing script. It demonstrates how to add smallsets structured comments to a preprocessing script. Based on these comment snap-place arguments (empty, +2
, and +1
), snapshots will be taken after line 1, line 7, and line 12.
s_data_preprocess.R (Don't run this code block. It's an example preprocessing script.)
Alternatively, you could place smallsets comments as a block above the preprocessing code^1^, and specify in the snap-place argument the line of code after which you would like each snapshot to be taken. This comment set-up produces the same Smallset Timeline as the comment set-up in the R script above.
s_data_preprocess_block.R (Don't run this code block. It's an example preprocessing script.)
^1^ Note, though, that for preprocessing code in Python functions all comments must be after def
and before return
.
Smallset Timelines can be built for preprocessing code in R Markdown files. If you choose to include the Smallset Timeline as a figure within the R Markdown report itself, it works best to build the Smallset Timeline before the preprocessing code is executed, so that you don't have to reload your (un-preprocessed) data later to build the Smallset Timeline. You assign the Smallset Timeline figure to an object and hide that code with echo=FALSE
. You can then plot that object anywhere in the report.
```{R, eval=FALSE} Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.Rmd", package = "smallsets"))
To see the compiled R Markdown report, which includes a Smallset Timeline figure, run the following code. It will write a PDF titled s_data_preprocess.pdf to your working directory. ```{R, eval=FALSE} rmarkdown::render(system.file("s_data_preprocess.Rmd", package = "smallsets"), output_dir = getwd())
The example R Markdown file can be viewed here.
Python scripts can be passed to the R command Smallset_Timeline
. You will need to use a Python environment to do so (e.g., use_condaenv("r-reticulate")
).
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.py", package = "smallsets"))
Below is the script s_data_preprocess.py, which does the same thing as s_data_preprocess.R. The smallsets commenting system is the same in Python.
s_data_preprocess.py (Don't run this code block. It's an example preprocessing script.) ```{python, code = readLines(system.file("s_data_preprocess.py", package = "smallsets")), eval=FALSE}
## Jupyter Notebooks {#Jupyter_Notebooks} Jupyter Notebooks can also be passed to the R command `Smallset_Timeline`. And if you want to execute smallsets in a Jupyter Notebook, you can do so using [Rmagic](https://rpy2.github.io/doc/latest/html/interactive.html#rmagic). First, start Rmagic. ```{python, eval=FALSE} %load_ext rpy2.ipython
Then, run smallsets in a %%R
magic cell. If you had a dataset called my_data
and the preprocessing code (with smallsets structured comments) was in a Jupyter Notebook called my_notebook.ipynb
, it would look like the cell below. Note that this Notebook cell could be included in my_notebook.ipynb
itself. You may need to import your dataset into the %%R
magic cell with the -i flag, e.g., -i my_data
.
```{python, eval=FALSE}
%%R -w 1000 -h 500 -r 100
library("smallsets")
Smallset_Timeline(data = my_data, code = "my_notebook.ipynb")
## Smallset selection {#Smallset_selection} A Smallset is a small set of rows (5-15) from the original dataset containing instances of data preprocessing changes. For Smallset selection, there are two decisions to make: 1) how many rows (`rowCount`) and 2) which automated selection method to use (`rowSelect`). If `rowSelect = NULL` (the default setting), rows are selected through a simple random sample. The following code would randomly sample seven rows for the Smallset. ```r Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowCount = 7, rowSelect = NULL)
To use the other two selection methods, which are optimisation problems proposed here (in Section 5), you will need a Gurobi license as they rely on the Gurobi solver v9.1.2 (free academic licenses are available). The "Gurobi installation guide" in the prioritizr
package provides step-by-step instructions on installing Gurobi in R.
If rowSelect = 1
, the coverage problem is used to select rows. For each snapshot, it finds at least one example of a data change, if there is one. You can return the solution to the console with rowReturn = TRUE
.
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowCount = 5, rowSelect = 1, rowReturn = TRUE)
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowIDs = c("27", "42", "95", "96", "99"), rowReturn = T)
After the optimisation problem is solved once, the solution can be passed to rowIDs
to avoid having to re-solve it with each run of Smallset_Timeline
.
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowCount = 5, rowIDs = c("27", "42", "95", "96", "99"))
Here, the coverage solution misses a data edit example in the second snapshot, motivating use of the other selection method (rowSelect = 2
): the coverage + variety optimisation problem, which looks for rows affected by the preprocessing steps differently. The drawback of rowSelect = 2
is runtime for large datasets. One potential workaround to a long runtime is building a Smallset Timeline from a sample of the dataset. However, this should be done with caution.
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowSelect = 2, rowReturn = T)
Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), rowIDs = c("3", "32", "80", "97", "99"), rowReturn = T)
There are built-in options to customise a Smallset Timeline. The examples in this section highlight some of them. See ?Smallset_Timeline
or here for a full list of options.
Example 1
Differences: custom colour palette, data in the snapshots, highlighting missing data, no ghost data, font, colour of column names.
set.seed(145) Smallset_Timeline( data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), colours = list(added = "#FFC500", deleted = "#FF4040", edited = "#5BA2A6", unchanged = "#E6E3DF"), printedData = TRUE, truncateData = 4, missingDataTints = TRUE, ghostData = FALSE, font = "Georgia", sizing = sets_sizing(data = 2, captions = 3.5, columns = 3.5), labelling = sets_labelling(labelCol = "darker", labelColDif = 1), spacing = sets_spacing(captions = 3) )
Example 2
Differences: vertical alignment, use of the second built-in colour palette, colour of column names.
set.seed(145) Smallset_Timeline( data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), align = "vertical", colours = 2, spacing = sets_spacing(captions = 8, header = 2.5), labelling = sets_labelling(labelColDif = 1), sizing = sets_sizing(tiles = .4, captions = 3, columns = 3, legend = 12) )
Example 3
Differences: four snapshots, larger Smallset, use of the third built-in colour palette, two rows, rotated column names, font.
set.seed(145) Smallset_Timeline( data = s_data, code = system.file("s_data_preprocess_4.R", package = "smallsets"), rowCount = 8, colours = 3, ghostData = TRUE, missingDataTints = TRUE, font = "serif", spacing = sets_spacing( captions = 3, rows = 2, degree = 60, header = 1.5 ), sizing = sets_sizing( legend = 12, captions = 4, columns = 4 ) )
You can retrieve alternative text (alt text) for your Smallset Timeline. When altText = TRUE
, a draft of alt text is printed to the console. It can be copied from the console, revised for readability, and included with the figure.
set.seed(145) Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess.R", package = "smallsets"), altText = TRUE)
A resume marker is a vertical line between snapshots signalling that preprocessing stopped to move to the estimation or modelling task but then resumed to make additional dataset changes. It is added to a Smallset Timeline with a resume instruction in a structured comment.
In this example, data preprocessing is resumed to transform C9 into a categorical variable.
s_data_preprocess_resume.R (Don't run this code block. It's an example preprocessing script.)
set.seed(145) Smallset_Timeline(data = s_data, code = system.file("s_data_preprocess_resume.R", package = "smallsets"), sizing = sets_sizing( columns = 1.8, captions = 1.8, legend = 7, icons = .8 ), spacing = sets_spacing( captions = 3, degree = 45, header = 3.5, right = 2 ) )
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.