tableplot: Create a tableplot
In mtennekes/tabplot: Tableplot, a Visualization of Large Datasets


A tableplot is a visualisation of (large) multivariate datasets. Each column represents a variable and each row bin is an aggregate of a certain number of records. For numeric variables, a bar chart of the mean values is depicted. For categorical variables, a stacked bar chart is depicted of the proportions of categories. Missing values are taken into account. Also supports large ffdf datasets from the ff package. For a quick intro, see vignette("tabplot-vignette").

tableplot(
  dat,
  select,
  subset = NULL,
  sortCol = 1,
  decreasing = TRUE,
  nBins = 100,
  from = 0,
  to = 100,
  nCols = ncol(dat),
  sample = FALSE,
  sampleBinSize = 1000,
  scales = "auto",
  numMode = "mb-sdb-ml",
  max_levels = 50,
  pals = list("Set1", "Set2", "Set3", "Set4"),
  change_palette_type_at = 20,
  rev_legend = FALSE,
  colorNA = "#FF1414",
  colorNA_num = "gray75",
  numPals = "OrBu",
  limitsX = NULL,
  bias_brokenX = 0.8,
  IQR_bias = 5,
  select_string = NULL,
  subset_string = NULL,
  colNames = NULL,
  filter = NULL,
  plot = TRUE,
  ...
)

`dat`	a `data.frame`, an `ffdf` object, or an object created by `tablePrepare` (see details below). Required.
`select`	expression indicating the columns of `dat` that are visualized in the tableplot. Column indices are also supported. By default, all columns are visualized. Use `select_string` for character strings instead of expressions.
`subset`	logical expression indicating which rows to select in `dat` (as in `subset`). It is also possible to provide the name of a categorical variable: then, a tableplot for each category is generated. Use `subset_string` for character strings instead of expressions.
`sortCol`	column on which the dataset is sorted. It can be an index, an expression name, or a character string. Note: in case of ambiguity, the character string is used, as in this example: `Sepal.Width <- "Petal.Width"; tableplot(iris, sortCol=Sepal.Width)`.
`decreasing`	boolean that determines whether the dataset is sorted decreasingly (`TRUE`) or increasingly (`FALSE`).
`nBins`	number of row bins
`from`	percentage from which the sorted data is shown
`to`	percentage to which the sorted data is shown
`nCols`	the maximum number of columns per tableplot. If this number is smaller than the number of columns selected in `select`, multiple tableplots are generated, each of which contains the sorted column(s).
`sample`	boolean that determines whether to sample or use the whole data. Only useful when `tablePrepare` is used.
`sampleBinSize`	the number of sampled objects per bin, if `sample` is `TRUE`.
`scales`	determines the horizontal axes of the numeric variables in `select`. Options: "lin", "log", and "auto" for automatic detection. Either `scales` is a named vector, where the names correspond to numerical variable names, or `scales` is unnamed, where the values are applied to all numeric variables (recycled if necessary).
`numMode`	character value that determines how numeric values are plotted. The value consists of the following building blocks, which are concatenated with the "-" symbol. The default value is "mb-sdb-ml". Prior to version 1.2, "MB-ML" was the default value.
	`sdb`: sd bars between mean-sd and mean+sd are shown
	`sdl`: sd lines at mean-sd and mean+sd are shown
	`mb`: mean bars are shown
	`MB`: mean bars are shown, where the color of the bar indicates completeness, and positive mean values are blue and negative ones orange
	`ml`: mean lines are shown
	`ML`: mean lines are shown, where positive mean values are blue and negative ones orange
	`mean2`: mean values are shown
`max_levels`	maximum number of levels for categorical variables. Categorical variables with more levels will be rebinned into `max_levels` levels. Either a positive number or -1, which means that categorical variables are never rebinned.
`pals`	list of color palettes. Each list item is one of the following:
	a palette name of `tablePalettes`, optionally with the starting color between brackets
	a color vector
	If the list items are unnamed, they are applied to all selected categorical variables (recycled if necessary). The list items can be assigned to specific categorical variables by naming them accordingly.
`change_palette_type_at`	number at which the type of categorical palettes is changed. For categorical variables with fewer than `change_palette_type_at` levels, the palette is recycled if necessary. For categorical variables with `change_palette_type_at` levels or more, a new palette of interpolated colors is derived (like a rainbow palette).
`rev_legend`	logical value or vector that determines which legends are reversed. If a vector is provided, the names of the items should be the names of (a selection of) the categorical variables.
`colorNA`	color for missing values for categorical variables.
`colorNA_num`	color for missing values for numeric variables. It is used when all values in a bin are missing. If a part of the values are missing, a brighter color is used (see argument `numPals`).
`numPals`	vector of palette names that are used for numeric variables. These names are chosen from the diverging palette names in `tablePalettes`. Either `numPals` is a named vector, where the names correspond to the numerical variable names, or an unnamed vector (recycled if necessary). A "-" prefix in the name reverses the palette. When sd bars are shown (see the argument `numMode`), only the right-hand side of the palette is used, where brightness is used to differentiate between mean bar and sd bar. When sd bars are not shown (the default in versions before 1.2), the right-hand side of the palette is used for positive mean values, and the left-hand side for negative mean values. The brightness of the color is determined by the fraction of missing values.
`limitsX`	a list of vectors of length two, where each vector contains a lower and an upper limit value. Either the names of `limitsX` correspond to numerical variable names, or `limitsX` is an unnamed list (recycled if necessary).
`bias_brokenX`	parameter between 0 and 1 that determines when the x-axis of a numeric variable is broken. If the minimum value is at least `bias_brokenX` times the maximum value, the x-axis is broken. To turn off broken x-axes, set `bias_brokenX=1`.
`IQR_bias`	parameter that determines when a logarithmic scale is used when `scales` is set to "auto". The argument `IQR_bias` is multiplied by the interquartile range as a test.
`select_string`	character equivalent of the `select` argument (particularly useful for programming purposes)
`subset_string`	character equivalent of the `subset` argument (particularly useful for programming purposes)
`colNames`	deprecated; used in older versions of tabplot (prior to 0.12): use `select_string` instead
`filter`	deprecated; used in older versions of tabplot (prior to 0.12): use `subset_string` instead
`plot`	boolean, to plot or not to plot a tableplot
`...`	layout arguments, such as `fontsize` and `title`, are passed on to `plot`
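
The `select_string` and `subset_string` arguments are described above as the character equivalents of `select` and `subset` for programming purposes. A minimal sketch of that use, assuming `select_string` accepts a character vector of column names and `subset_string` a single string holding a logical expression; the variable names and the threshold below are illustrative only, and the `requireNamespace` guard is added so the snippet runs even when tabplot is not installed:

```r
# Build the character arguments from ordinary R values (base R only).
dat <- iris
num_cols <- names(dat)[vapply(dat, is.numeric, logical(1))]
select_cols <- c(num_cols, "Species")                  # columns to visualize
min_length <- 5                                        # illustrative threshold
subset_expr <- paste0("Sepal.Length >= ", min_length)  # "Sepal.Length >= 5"

# Assumption: select_string takes column names, subset_string an expression string.
if (requireNamespace("tabplot", quietly = TRUE)) {
  tabplot::tableplot(dat,
                     select_string = select_cols,
                     subset_string = subset_expr,
                     sortCol = "Sepal.Length")
}
```

Because both arguments are plain character values, they can be assembled inside functions or loops without non-standard evaluation.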

For large datasets, we recommend using tablePrepare, which does all the preprocessing needed to make any tableplot of the particular dataset. The resulting object of this function is passed on to tableplot (argument dat). Tableplotting is then very fast, and even faster with sampling enabled (sample=TRUE).
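
The workflow described here might look like the following sketch. It uses the diamonds data from ggplot2 as a stand-in for a large dataset; the `sampleBinSize` value of 500 is an arbitrary choice for illustration, and the `requireNamespace` guard is an addition so the snippet is skipped when the packages are not installed:

```r
if (requireNamespace("tabplot", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(tabplot)
  data(diamonds, package = "ggplot2")

  # Preprocess once; the prepared object can be reused for every tableplot.
  p <- tablePrepare(diamonds)

  # Use the whole data, sorted on price.
  tableplot(p, sortCol = price)

  # Same prepared object with sampling enabled, sorted on another column.
  tableplot(p, sortCol = carat, sample = TRUE, sampleBinSize = 500)
}
```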

A tabplot-object, returned silently. If multiple tableplots are generated (which can be done by either setting subset to a categorical column name, or by restricting the number of columns with nCols), then a list of tabplot-objects is silently returned.

In early development versions of tabplot (prior to version 1.0) it was possible to sort datasets on multiple columns. To increase the tableplot creation speed, this feature was dropped. For multiple sorting purposes, we recommend using the subset parameter instead.

Tennekes, M., Jonge, E. de, Daas, P.J.H. (2013) Visualizing and Inspecting Large Datasets with Tableplots, Journal of Data Science 11 (1), 43-58

itableplot

# load tabplot, and the diamonds dataset from ggplot2
library(tabplot)
library(ggplot2)
data(diamonds)

# default tableplot
tableplot(diamonds)

# prior to version 1.2, the mean values of numeric variables were displayed
# without standard deviation (see ?plot.tabplot):
tableplot(diamonds, numMode = "MB-ML")

# most expensive diamonds
tableplot(diamonds, 
		  select=c(carat, cut, color, clarity, price), 
		  sortCol=price, 
		  from=0, 
		  to=5)

# for large datasets, we recommend to preprocess the data with tablePrepare:
p <- tablePrepare(diamonds)

# specific subsetting
tableplot(p, subset=price < 5000 & cut=='Ideal')

# change palettes
tableplot(p, 
		  pals=list(cut="Set4", color="Paired", clarity=grey(seq(0, 1,length.out=7))),
		  numPals=c(carat="PRGn", price="BrBG"))

# create a tableplot for each cut category, and fix scale limits of carat, table, and price
tabs <- tableplot(p, subset=cut,
	limitsX=list(carat=c(0,4), table=c(55, 65), price=c(0, 20000)), plot=FALSE)
plot(tabs[[3]], title="Very good cut diamonds")