# Plotting frequencies using superb

In superb: Summary Plots with Adjusted Error Bars

```r
# Setup: load the packages and silence superb's feedback messages
library(superb)
library(ggplot2)
options(superb.feedback = 'none')
```


## How to plot frequencies along with confidence intervals using superb

Many studies collect data that are categorized according to one or more factors. For example, a sample of college students can be categorized by gender and by their plans once college is finished (go to university, get a job, etc.). Here, there are two "factors": gender and post-study plans. The measure for each participant is the "cell" in which the participant is categorized. Typically, the data are summarized as frequencies, that is, the count of participants in each combination of the factor levels. For two classification factors, the data are said to be two-way classification data, or to form a contingency table. Nothing prevents having more than two factors, e.g., three-way classification data.
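As a minimal sketch of how raw categorizations become frequencies, the following base R code tallies a small, made-up sample into a two-way contingency table and then into one row per cell. The data frame `students` and its contents are invented for illustration only.

```r
# Made-up raw observations, one row per student (illustration only)
students <- data.frame(
    gender = c("Boy", "Girl", "Girl", "Boy", "Girl", "Boy", "Girl", "Girl"),
    plan   = c("University", "Job", "University", "Job",
               "University", "University", "Job", "University")
)

# Two-way contingency table: count for each gender x plan combination
tab <- table(students$gender, students$plan, dnn = c("gender", "plan"))
tab

# Long format: one row per cell, with the frequency in column obsfreq
freq <- as.data.frame(tab, responseName = "obsfreq")
freq
```

The long format, with one row per cell and a column of counts, is the layout used for the example data set later in this document.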

Although frequencies are often given in a table, such tables provide very little insight into trends. It is far more advisable to illustrate the frequencies with a plot showing the count in each level of the factors. However, to be truly informative, such a plot should be accompanied by error bars such as confidence intervals. Herein, we show how this can be done.

We adopted an approach based on the pivot method developed by @cp34. This method is given in an analytic form in @lt96. Such confidence intervals are commonly non-symmetrical around the estimate; they are also exact or conservative, and in the latter case the interval tends to be too wide when the frequencies are small [@c90].

Given the total sample size $N$ across all the cells, each observed frequency $n$ in a given cell is used to get lower and upper confidence bounds around the proportion $\hat{\pi}=n/N$ with the formula:

$$
\hat{\pi}_{\,\text{low}}=\left( 1+\frac{N-n+1}{n \, F_{\alpha/2}(2n,\,2(N-n+1))} \right)^{-1} < \hat{\pi} < \left( 1+\frac{N-n}{(n+1) \, F_{1-\alpha/2}(2(n+1),\,2(N-n))} \right)^{-1} =\hat{\pi}_{\,\text{high}}
$$

in which $F_{q}(d_1, d_2)$ denotes the $100\,q$% quantile of an $F$ distribution with $d_1$ and $d_2$ degrees of freedom, and $1-\alpha$ is the desired coverage of the interval, often 95%. The interval

$$
\{ n_{\,\text{low}},\; n_{\,\text{high}} \} = N \, \times\, \{ \hat{\pi}_{\,\text{low}},\; \hat{\pi}_{\,\text{high}} \}
$$

is then used as a $100\,(1-\alpha)$% confidence interval of the observed frequency $n$, which can be used to compare one frequency to an expected or theoretical frequency. Such an unadjusted confidence interval is termed a stand-alone confidence interval [@cgh21].
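The stand-alone interval can be sketched in base R with `qf()`, which returns quantiles of the $F$ distribution. The function name `standalone_ci_count` is hypothetical, and the values $n = 62$ and $N = 617$ are borrowed from the example data set used later in this document. The sketch assumes $0 < n < N$; boundary cells would need separate handling.

```r
# Stand-alone confidence interval for a count n out of N observations.
# Sketch only: assumes 0 < n < N so that all degrees of freedom are positive.
standalone_ci_count <- function(n, N, gamma = 0.95) {
    alpha  <- 1 - gamma
    # lower and upper bounds on the proportion, from F quantiles
    p_low  <- 1 / (1 + (N - n + 1) / (n * qf(alpha / 2, 2 * n, 2 * (N - n + 1))))
    p_high <- 1 / (1 + (N - n) / ((n + 1) * qf(1 - alpha / 2, 2 * (n + 1), 2 * (N - n))))
    # convert the bounds on the proportion into bounds on the count
    N * c(low = p_low, high = p_high)
}

ci <- standalone_ci_count(n = 62, N = 617)
ci

# Cross-check against the exact binomial interval from base R,
# rescaled from proportions to counts
check <- as.numeric(binom.test(62, 617)$conf.int) * 617
check
```

The two results should agree up to numerical precision, since `binom.test()` reports the same exact interval on the proportion scale.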

More commonly, we wish to compare an observed frequency to another observed frequency, which calls for a difference-adjusted confidence interval. To obtain it, the distance between $n$ and each stand-alone limit is multiplied by 2:

$$
n_{\,\text{low}}^{*} = n - 2\,(n-n_{\,\text{low}})
$$

$$
n_{\,\text{high}}^{*} = n + 2\,(n_{\,\text{high}}-n)
$$

where the asterisk denotes difference-adjusted confidence interval limits. Thus, the interval

$$
\{ n^{*}_{\,\text{low}},\; n^{*}_{\,\text{high}} \}
$$

is the _difference-adjusted_ $100\,(1-\alpha)$% confidence interval [@b12]. The difference-adjusted confidence intervals allow comparing the frequencies pairwise rather than to a theoretical frequency.
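The widening step can be sketched as follows. The helper name `difference_adjust` is hypothetical. The stand-alone limits are taken here from base R's `binom.test()`, used as a stand-in for the formula given earlier, for the illustrative values $n = 62$ and $N = 617$.

```r
# Double the distance between the count and each stand-alone limit
difference_adjust <- function(n, low, high) {
    c(low = n - 2 * (n - low), high = n + 2 * (high - n))
}

n <- 62
N <- 617

# Stand-alone limits on the count scale (exact binomial interval times N)
standalone <- as.numeric(binom.test(n, N)$conf.int) * N

adjusted <- difference_adjust(n, standalone[1], standalone[2])
standalone
adjusted
```

Because each limit is moved away from $n$ separately, the adjusted interval keeps the asymmetry of the stand-alone interval while being twice as wide.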

## Why multiply the stand-alone CI length by 2?

The reason for the multiplication by 2 is twofold. First, to obtain a difference-adjusted confidence interval (CI), it is necessary to multiply the CI width by $\sqrt{2}$ (under the assumption of homogeneous variance). Second, as the total must necessarily be equal to $N$, the observed frequencies are correlated, and this correlation equals $-1 / (C-1)$ where $C$ is the number of classes. As this CI is meant for pairwise comparisons, $C$ is replaced by 2 in this formula, resulting in a second, correlation-based, correction of $\sqrt{1-r} = \sqrt{1 - (-1/(2-1))} = \sqrt{2}$. As usual, both corrections to the CI width are multiplicative, giving a total factor of $\sqrt{2} \times \sqrt{2} = 2$.
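The arithmetic above can be checked with a few lines of base R. The variable names are illustrative, and the simulated counts are there only to show that, with two classes and a fixed total, the two counts are perfectly negatively correlated.

```r
C <- 2                            # two classes when comparing a pair of counts
r <- -1 / (C - 1)                 # correlation between the two counts: -1

difference_adj  <- sqrt(2)        # adjustment for comparing two estimates
correlation_adj <- sqrt(1 - r)    # adjustment for the correlation
total_adj <- difference_adj * correlation_adj
total_adj

# With two classes and a fixed total N, the counts are n1 and N - n1,
# so they are perfectly anti-correlated
set.seed(42)
N  <- 100
n1 <- rbinom(1000, size = N, prob = 0.5)
cor(n1, N - n1)
```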

## One illustration

To illustrate the method on a real data set, we enter the data set found in @lm71. The data count teenagers according to their gender (first factor) and their educational vocation (second factor: the type of studies they want to complete in the future). The sample is composed of 617 teens. To generate the data set, we use the following:

```r
dta <- data.frame(
    vocation = factor(unlist(lapply(c("Secondary","Vocational","Teacher","Gymnasium","University"), function(p) rep(p,2))),
                      levels = c("Secondary","Vocational","Teacher","Gymnasium","University")),
    gender   = factor(rep(c("Boys","Girls"),5), levels=c("Boys","Girls")),
    obsfreq  = c(62,61,121,149,26,41,33,20,84,20)
)
```


The function `factor()` uses the argument `levels` to specify the order in which the items are to be plotted; otherwise, the default order is alphabetical.
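The effect of `levels` can be seen in a small base R sketch. The variable names are illustrative.

```r
vocations <- c("Secondary", "Vocational", "Teacher", "Gymnasium", "University")

# Without levels: the levels are sorted alphabetically
default_order <- levels(factor(vocations))
default_order

# With levels: the levels keep the order that was requested
custom_order <- levels(factor(vocations, levels = vocations))
custom_order
```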

If you have the data in a file, it is actually a lot easier to import the file!
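As a sketch of the file-based route, the code below reads the same counts from a CSV file with base R's `read.csv()`. The file is hypothetical: it is written to a temporary location here only so that the example runs on its own, and the object name `dta_file` is likewise illustrative.

```r
# Hypothetical file, created here only to make the sketch self-contained
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
    "vocation,gender,obsfreq",
    "Secondary,Boys,62",   "Secondary,Girls,61",
    "Vocational,Boys,121", "Vocational,Girls,149",
    "Teacher,Boys,26",     "Teacher,Girls,41",
    "Gymnasium,Boys,33",   "Gymnasium,Girls,20",
    "University,Boys,84",  "University,Girls,20"
), csv_file)

# Import the file; text columns come in as character vectors
dta_file <- read.csv(csv_file, stringsAsFactors = FALSE)

# Convert to factors, keeping the order in which the levels appear in the file
dta_file$vocation <- factor(dta_file$vocation, levels = unique(dta_file$vocation))
dta_file$gender   <- factor(dta_file$gender, levels = c("Boys", "Girls"))
str(dta_file)
```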

Here are the data in extenso:

```r
dta
```


For a quick-and-dirty plot, simply display the raw counts with no error bars:

```r
library(superb)
library(ggplot2)

plt1 <- superbPlot(
    dta,
    BSFactors = c("vocation","gender"),
    variables = "obsfreq",                      # name of the column with the counts
    statistic = "identity",                     # the raw data as is
    errorbar  = "none",                         # no error bars
    # the following is for the look of the plot
    plotStyle      = "line",                    # style of the plot
    lineParams     = list( size = 1.0)          # thicker lines as well
)
plt1
```


## Define the confidence interval function

First, we need the summary function that computes the frequency. This is actually the datum stored in the data frame, so there is nothing to compute.

```r
count <- function(x) x[1]
```


Second, we need an initializer that will fetch the total sample size $N$ and store it in the global environment for later use:

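```r
init.count <- function(df) {
    # Assumption: the counts reach the initializer in a column named DV
    totalcount <<- sum(df$DV)
}
```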
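Third, a function returning the confidence limits for a count is needed. The following is a minimal sketch under stated assumptions, not necessarily the package's own code: it combines the stand-alone formula and the multiplication by 2 described earlier, and reads the total $N$ from a global variable named `totalcount` (an assumed name, set by hand here so that the sketch runs on its own). Naming the function `CI.count`, so that it pairs with the `count` statistic, is likewise an assumption about how the error-bar function is looked up.

```r
# Stand-in for the value the initializer would store (total of all counts)
totalcount <- 617

# Sketch of a difference-adjusted confidence interval for a count.
# Assumes 0 < n < totalcount.
CI.count <- function(n, gamma = 0.95) {
    n     <- n[1]                 # the cell holds a single count
    N     <- totalcount
    alpha <- 1 - gamma
    # stand-alone limits on the proportion, from F quantiles
    plow  <- 1 / (1 + (N - n + 1) / (n * qf(alpha / 2, 2 * n, 2 * (N - n + 1))))
    phigh <- 1 / (1 + (N - n) / ((n + 1) * qf(1 - alpha / 2, 2 * (n + 1), 2 * (N - n))))
    # convert to counts, then double the distance from n to each limit
    n + 2 * (N * c(plow, phigh) - n)
}

limits <- CI.count(62)
limits
```

The function returns the lower and upper limits as a vector of length two, with the observed count lying between them.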

## In summary

Frequencies, a.k.a. counts, can be displayed with appropriate confidence intervals without any problem. They are just another regular dependent variable in the researcher's toolkit.

# References


superb documentation built on May 29, 2024, 8:51 a.m.