knitr::opts_chunk$set(message = FALSE, warning = FALSE, out.width = '70%', fig.asp = 0.5, fig.align = "center") options( htmltools.dir.version = FALSE, formatR.indent = 2, width = 55, digits = 4, tibble.print_max = 5, tibble.print_min = 5) otheme <- ggplot2::theme_set(ggplot2::theme_minimal())
R offers many tools to analyse the univariate or bivariate
distribution of series. This includes table
and prop.table
on base
R, group_by/summarise
and count
in dplyr. However,
these functions are somehow frustrating as some very common tasks,
like:
>=N
category,are tedious. Moreover, to our knowledge, R offers weak support for
numerical series for which the numerical value is not known at the
individual level, but only the fact that this value belongs to a
certain range. descstat
is intended to provide user-friendly tools
to perform these kind of operations. More specifically, descstat
provides\:
bin
class which makes computations on series that contain bins
easy,freq_table
function to construct a frequency tables, and some
methods to print, plot and compute some descriptive statistics,cont_table
to compute contingency tables from two series.These function are writen in the tidyverse style, which means that the pipe operator can be used and that the series can be selected without quotes.
library("dplyr") library("ggplot2") library("purrr") library("tidyr") library("forcats") library("tibble") #library("descstat") ra <- lapply(system("ls ~/YvesPro2/R_github/descstat/R/*.R", intern = TRUE), source) ra <- system("ls ~/YvesPro2/R_github/descstat/data/*.rda", intern = TRUE) for (i in ra) load(i)
library("descstat") library("ggplot2") library("dplyr")
bin
is a new class intended to deal easily with series that
indicates in which range of numerical values an observation belongs
to. It is basically a factor or a character, with values having a very
strict format, like [10,20)
, (20,30]
, [50,Inf)
, ie\ :
(
or [
for
respectively a bin open or closed on the left,)
or ]
for
respectively a bin open or closed on the right.This format is returned by the base::cut
function which default
method is relevant for numeric series and has a breaks
argument (a
numeric containing the breaks) and a right
argument (if TRUE
the
bin is closed on the right).
z <- c(1, 5, 10, 12, 4, 9, 8) bin1 <- cut(z, breaks = c(1, 8, 12), right = FALSE) bin2 <- cut(z, breaks = c(1, 8, 12), right = TRUE) bin3 <- cut(z, breaks = c(1, 8, 12, Inf), right = FALSE) tibble(z, bin1, bin2, bin3)
Note that\ :
[8,12)
bin when right = FALSE
and in the (1,8]
bin when right = TRUE
,NA
when right = TRUE
as the lowest
value of breaks
is 1,NA
when right = FALSE
as 12 is the
highest value of breaks
.cut
returns a factor, but sometimes series containing bins can be
provided as a character. In this case, there is a problem with the
ordering of the distinct values, as illustrated below\:
bin3chr <- as.character(bin3) bin3chr factor(bin3chr) sort(unique(bin3chr))
The problem with bins stored in characters is that the distinct values
don't appear in the correct order while sorted (ie [12,Inf)
before [8,12)
).
bin
classTo create bin
objects, an extract
method is provided for
characters and factors which create a tibble containing the different
elements of the unique bins\:
bin3chr %>% extract
The relevance of the values to represent bins is then checked and the
as_bin
function returns an object of class bin
which is a factor
with the levels in the relevant order.
bin3chr %>% as_bin
In case of incorrect values in the series, NA
's are returned.
bin4 <- c("[1,8)", "[1, 8)", "[8,12", "[12,inf)", "[1,8)", "[8,12)", "[8,12)") bin4 %>% as_bin
The real strength of the bin
class is to allow calculus on the
underlying numerical values. To perform this task, an as_numeric
function is provided which returns a numeric series. The basic use of
this function returns simply the lower bound\:
bin3 %>% as_numeric
but a pos
argument can be used to return any value in the range of
the bin, by defining the relative position, ie\:
pos = 0
(the default) for the lower bound,pos = 1
for the upper bound,pos = 0.5
for the center of the bin.bin3 %>% as_numeric(pos = 1) bin3 %>% as_numeric(pos = 0.5)
While computing some statistics on bins, it is custumory to consider
that, for all the individuals belonging to a specific bin, the value
is just the center (ie all the individual values are equal to 10 for
individuals belonging to the [8,12)
bin), or the values are
uniformaly distributed in the bin. This value is returned by
as_numeric
with pos = 0.5
, but there is a specific problem for the last bin if, as
it is the case here, the last bin is open to infinity on the
right. In this case, the default behaviour of as_numeric
is to set
the width of the last bin to the width of the just before last one. As
the width ot the [8,12)
bin is 4, the width of the last one is also
set to 4, which mean an upper bound of 16 and a center value
of 14. This behaviour can be changed either by setting the wlast
argument to a different value, which is interpreted as a multiple of
the width of the just before last bin. For example, wlast = 4
means that
the width of the last bin is set to 4 times the one of the before
last bin, which means 16, and the resulting bin is [12,28)
with a center
value of 20.
bin3 %>% as_numeric(pos = 0.5, wlast = 4)
The same result can be obtained by directly setting the center of the
last bin using the xlast
argument\:
bin3 %>% as_numeric(pos = 0.5, xlast = 20)
Finally, a specific center for the first bin can also be set using
the xfirst
argument\:
bin3 %>% as_numeric(pos = 0.5, xlast = 20, xfirst = 6)
Frequency tables summarise the univariate distribution of a
categorical or an integer series. In tidyverse
, this task can be
performed using the dplyr::count
function. For example, the rgp
data set, which is an extract of the French cencus, contains the
number of children in households:
rgp %>% count(children)
The descstat::freq_table
function performs the same task and returns
by default exactly the same tibble as the one returned by dplyr::count
:
rgp %>% freq_table(children)
but it has several further arguments which can improve the result:
f
is a character containing one or several letters and indicates
what kind of frequencies should be computed\:
n
for the count or absolute frequencies (the default),f
for the (relative) frequencies,p
for the percentage (ie $f\times 100$),N
, F
and P
for the cumulative values of n
, f
and
p
.total
a boolean, if TRUE
, a total is returned,max
, suitable for an integer series only, is an integer which is the
maximum value presented in the frequency table, eg max = 3
creates
a last line which is >=3
.The following command use all the possible letters.
rgp %>% freq_table(children, "nfpNFP")
As there are few occurrences of families with more than 3 children, we
set max = 3
and add a total by setting total
to TRUE
.
rgp %>% freq_table(children, max = 3, total = TRUE)
Note that in the printed table, the children
series is now a
character as the last two values are >=3
and Total
. Actually
freq_table
returns an object of class freq_table
which inherits
from the tbl_df
class. A look at the structure of the object:
rgp %>% freq_table(children, max = 3, total = TRUE) %>% str
indicates that we still have a tibble, with a numeric children
series for which the last two values equal to 3.34 (which is the
average of the values greater or equal to 3) and NA
(for the total).
A pre_print
function is provided with a method for freq_table
objects. It turns the children
series in a character
, with 3.34
and NA
replaced by >=3
and Total
. This pre_print
method is
included in the format
method for freq_table
objects, which is then
passed to the tbl_df
method:
descstat:::format.freq_table
The pre_print
function should be used explicitly while using
knitr::kable
, as this function doesn't use any format
method:
rgp %>% freq_table(children, max = 3, total = TRUE) %>% pre_print %>% knitr::kable()
The most natural way to plot a frequency table with ggplot
is to use
geom_col
(or equivalently geom_bar
with stat = 'identity'
).
cld <- rgp %>% freq_table(children, f = "nf", max = 3) cld %>% pre_print %>% ggplot(aes(children, f)) + geom_col(fill = "white", color = "black")
Note the use of the pre_print
method which turns the 3.34
numerical
value in >=3
.
To get more enhanced graphics, a pre_plot
method is provided, with a
plot
argument equal to stacked
or cumulative
. With plot =
"stacked"
\:
cld %>% pre_print %>% pre_plot("f", plot = "stacked")
pre_plot
returns an ypos
series which indicates the coordinates
where to write labels (here the values of the series).
bnp <- cld %>% pre_print %>% pre_plot("f", plot = "stacked") %>% ggplot(aes(x = 2, y = f, fill = children)) + geom_col() + geom_text(aes(y = ypos, label = children)) + scale_x_continuous(label = NULL) + scale_fill_brewer(palette = "Set3") + guides(fill = FALSE) bnp
using polar coordinates, we get a pie chart:
bnp + coord_polar(theta = "y") + theme_void()
changing the range of the x
values, we get a hole in the pie chart
which result in the so called donut chart:
bnp + scale_x_continuous(limits = c(1, 2.5)) + coord_polar(theta = "y") + theme_void()
The other possible value for the plot
argument of pre_plot
is
cumulative:
cld <- rgp %>% freq_table(children, "F", max = 5, total = TRUE) cld %>% pre_plot(plot = "cumulative") %>% print(n = 5)
this returns four series which have the names of the aesthetics that
are mandatory for geom_segment
; x
, xend
, y
and yend
. It
also returns a pos
series with which one can draw differently the
horizontal (hor
) and the vertical (vert
) segments, using for
example the linetype
aesthetic.
cld %>% pre_plot(plot = "cumulative") %>% ggplot() + geom_segment(aes(x = x, xend = xend, y = y, yend = yend, linetype = pos)) + guides(linetype = FALSE) + labs(x = "number of children", y = "cumulative frequency")
As an example, consider the wages
data set that contains two series
called wage
and size
which respectively indicate bins for wage and
firm size.
wages %>% print(n = 3)
Applying descstat::freq_table
for the size
series, we get\:
wages %>% freq_table(size) %>% print(n = Inf)
A breaks
argument is provided, which is a
numerical vector that can be used to reduce the number of bins:
wages %>% freq_table(size, breaks = c(20, 250))
The breaks
argument should only contain values that are bounds of the
initial set of bins. If the breaks
argument is of length 1, all the
bins containing values greater than the one specified are merged\:
wages %>% freq_table(size, breaks = 50)
Internally, the breaks
argument is passed to the cut
methods for
bin
objects.
A frequency table with bins can also be created from a continuous
numerical series, as price
in the padova
data set\:
padova %>% pull(price) %>% range padova %>% freq_table(price, breaks = c(250, 500, 750)) padova %>% freq_table(price, breaks = c(30, 250, 500, 750, 1000))
Note that in this case, the first (last) values of breaks
can be
either:
0
and the upper bound of the last bin is Inf
. The f
argument may contain further letters than for the discrete
series case\:
d
for density, which is the relative frequency divided by the
bin's width,m
for the mass frequencies (the mass is the product of the
absolute frequency and the value of the variable),M
for cumulated mass frequencies.wages %>% freq_table(size, "dmM", breaks = c(20, 100, 250))
Other values of the variable can be included in the table using the
vals
argument, which is a character including some of the following
letters:
l
for the lower bound of the bin,u
for the upper bound of the bin,a
for the width of the bins.wages %>% freq_table(size, "p", vals = "xlua", breaks = c(20, 100, 250), wlast = 2)
Relevant plots for numerical continuous series are very different from
those suitable for discrete or categorial series. ggplot
provides
three geoms to plot the distribution of a numerical series:
geom_histogram
, geom_density
and geom_freqpoly
. These three
geoms use the bin
stat, which means that they consider a raw vector
of numerical values, create bins, count the number of observations
in each bin and then plot the result:
padova %>% ggplot(aes(price)) + geom_histogram(aes(y = ..density..), color = "black", fill = "white") + geom_freqpoly(aes(y = ..density..), color = "red") + geom_density(color = "blue")
These geoms can be used when individual numerical data are available,
but not when only bins are observed. A pre_plot
method is
provided for freq_table
objects, with the plot
argument either
equal to histogram
(the default) or freqpoly
. The resulting table
contains columns called x
and y
and can be ploted using
geom_polygon
for an histogram and geom_line
for a frequency
polygon.
ftwage <- wages %>% freq_table(wage, "d", breaks = c(10, 20, 30, 40, 50)) ftwage %>% pre_plot(plot = "histogram") %>% ggplot(aes(x, y)) + geom_polygon(fill = "white", color = "black") ftwage %>% pre_plot(plot = "freqpoly") %>% ggplot(aes(x, y)) + geom_line()
Another popular plot for continuous numerical series is the Lorenz curve, which
indicates the relation between the cumulative distributions of the
frequencies and the masses of the series. The data necessary to draw
this curve are obtained using plot = "lorenz"
. Not that in this
case, the table should contain F
and M
.
lzc <- wages %>% freq_table(wage, "MF", breaks = c(10, 20, 30, 40, 50)) %>% pre_plot(plot = "lorenz") lzc
Each line in the resulting tibble indicate the coordinates (F
for
x
and M
for y
) of the points that are necessary to plot the
polygons under the Lorenz curve. pts
is a logical which indicates
on which lines of the tibble there are points that are part of the
Lorenz curve\:
lzc %>% ggplot(aes(F, M)) + geom_polygon(fill = "lightyellow", color = "black") + geom_point(data = filter(lzc, pts)) + geom_line(data = tibble(F = c(0, 1), M = c(0, 1)), color = "blue") + geom_line(data = tibble(F = c(0, 1, 1), M = c(0, 0, 1)), color = "red")
The basic plot is obtained with a call to geom_polygon
and
geom_point
and we added two calls to geom_line
to draw the two
extreme cases of a Lorenz curve\:
freq_table
is not only designed for data sets containing individual
observations but also for frequency tables, like the income
data
set\:
income
which presents the distribution of income in France with 25 bins;
inc_class
contains range of yearly income (in thousands of euros)
number
is the number of households (in millions) and tot_inc
is
the mass of income in the bin (in thousands of million). To use
freq_table
to read this frequency table, a further argument freq
is required and should indicate which column of the tibble contains
the frequencies\:
income %>% freq_table(inc_class, freq = number)
If one column of the tibble contains the mass of the variables (the
tot_inc
series in the income
tibble), it can be indicated using
the mass
argument\:
income %>% freq_table(inc_class, freq = number, mass = tot_inc)
the center of the bins are then calculated by dividing the mass by the frequency, which result, by definition, to the exact mean value for every bin.
Descriptive statistics can easily be computed applying functions to
freq_table
objects. The problem is that only a few statistical
functions of the base
and the stats
packages are generic. For
these, methods where written for freq_table
objects, for the other
ones, we had to create new functions. The following table indicates
the R
functions and the corresponding descstat
functions.
tribble(~ "R", ~ descstat, "mean", "mean", "median", "median", "quantile", "quantile", "var", "variance", "sd", "stdev", "mad", "madev", "", "modval", "", "medial", "", "gini", "", "skewness", "", "kurtosis") %>% knitr::kable()
To compute the central values statistics of the distribution of
wages
we use:
z <- wages %>% freq_table(wage) z %>% mean z %>% median z %>% modval
median
returns a value computed using a linear
interpolation. modval
returns the mode, which is a one line tibble
containing the bin, the center of the bin and the modal value.
For the dispersion statistics:
z %>% stdev z %>% variance z %>% madev
For the quantiles, the argument y
can be used to compute the
quantiles using the values of the variable (y = "value"
, the
default) or the masses (y = "mass"
):
z %>% quantile(probs = c(0.25, 0.5, 0.75)) z %>% quantile(y = "mass", probs = c(0.25, 0.5, 0.75))
The quantile of level 0.5 is the median in the first case, the medial in the second case\:
z %>% median z %>% medial
gini
computes the Gini coefficient of the series:
z %>% gini
skewness
and kurtosis
compute Fisher's shape statistics\:
z %>% skewness z %>% kurtosis
All these functions also work for counts. Of course, for categorical series, all these functions are irrelevant except the one which computes the mode\:
wages %>% freq_table(sector) %>% modval
With dplyr
, a contingency table can be computed using count
with
two categorical variables. Let's first reduce the number of classes
of size
and wage
in the wages
table.
wages2 <- wages %>% mutate(size = cut(size, c(20, 50, 100)), wage = cut(wage, c(10, 30, 50)))
wages2 %>% count(size, wage)
To get a "wide" table, with the values of one of the two variables
being the columns, we can use tidyr::pivot_wider
:
wages2 %>% count(size, wage) %>% tidyr::pivot_wider(values_from = n, names_from = size)
cont_table
The same contingency table can be obtained using
descstat::cont_table
:
wages2 %>% cont_table(wage, size)
The result is a cont_table
object, which is a tibble in "long"
format, as the result of the dplyr::count
function. The printing of
the table in "wide" format is performed by the pre_print
method,
which is included in the format
method, but should be used
explicitely while using knitr::kable
.
wages2 %>% cont_table(wage, size) %>% pre_print %>% knitr::kable()
The total
argument can be set to TRUE
to get a row and a column of
totals for the two series and, to save place, the row_name
can be
set to FALSE
so that the first column, which contains the modalities
of the first series is unnamed.
wages2 %>% cont_table(wage, size, total = TRUE) %>% print(row_name = FALSE)
A weights
argument is used to mimic the population. For example, the
employment
table contains a series of weights called weights
:
employment %>% cont_table(age, sex, weights = weights, total = TRUE)
as for freq_table
, central values for the first and the last class
can be set using arguments xfirst#
, xlast#
and wlast#
, where #
is equal to 1 or 2 for the first and the second variable indicated in
the cont_table
function.
A contingency table can be ploted using geom_point
, with the size of
the points being proportional to the count of the cells. The pre_plot
method replaces classes by values.
wages2 %>% cont_table(size, wage) %>% pre_plot %>% ggplot() + geom_point(aes(size, wage, size = n))
$n_{ij}$ is the count of the cell corresponding to the $i$^th^ modality of the first variable and the $j$^th^ modality of the second one.
The joint
, marginal
and conditional
functions return these three
distributions. The last two require an argument y
which is one of the
two series of the cont_table
object.
wht <- wages2 %>% cont_table(size, wage) wht %>% joint wht %>% marginal(size) wht %>% conditional(size)
Note that marginal
, as it returns an univariate distribution, is
coerced to a freq_table
object.
Descriptive statistics can be computed using any of the three distributions. Using the joint distribution, we get a tibble containing two columns for the two series.
wht %>% joint %>% mean wht %>% joint %>% stdev wht %>% joint %>% variance wht %>% joint %>% modval
The same (univariate) statistics can be obtained using the marginal distribution:
wht %>% marginal(size) %>% mean
or even more simply considering the univariate distribution computed
by freq_table
:
wages2 %>% freq_table(size) %>% mean
The mean
, stdev
and variance
methods are actually only usefull
when applied to a conditional distribution; in this case, considering
for example the conditional distribution of the first variable, there
are as many values returned that the number of modalities of the
second (conditioning) variable.
wht %>% conditional(wage) %>% mean wht %>% conditional(wage) %>% variance
The total variance of $X$ can be writen as the sum of:
$$ s_{x}^2 = \sum_j f_{.j} s^2_{x_j} + \sum_j f_{.j} (\bar{x}_j - \bar{\bar{x}}) ^ 2 $$
The decomposition of the variance can be computed by joining tables containing the conditional moments and the marginal distribution of the conditioning variable and then applying the formula:
cm <- wht %>% conditional(wage) %>% mean# %>% rename(mean = wage) cv <- wht %>% conditional(wage) %>% variance# %>% rename(variance = wage) md <- wht %>% marginal(size) md %>% left_join(cm) %>% left_join(cv) %>% summarise(om = sum(f * mean), ev = sum(f * (mean - om) ^ 2), rv = sum(f * variance), tv = ev + rv) -> ra
Or more simply using the anova
method for cont_table
objects:
wht_wage <- wht %>% anova("wage") wht_wage
which has a summary
method which computes the different elements of
the decomposition:
wht_wage %>% summary
and especially the correlation ratio, which is obtained by dividing the explained variance by the total variance.
The regression curve of wage
on size
can be plotted using
wht_wage
, together with error bars:
wht_wage %>% ggplot(aes(x, mean)) + geom_point() + geom_line(lty = "dotted") + geom_errorbar(aes(ymin = mean - sqrt(variance), ymax = mean + sqrt(variance))) + labs(x = "size", y = "wage")
For the joint distribution, two other functions are provided,
covariance
and correlation
(which are the equivalent of the
non-generic stats::cov
and stats::cor
functions) to compute the
covariance and the linear coefficient of correlation.
wht %>% joint %>% covariance wht %>% joint %>% correlation
The regression line can be computed using the regline
function:
rl <- regline(wage ~ size, wht) rl
which returns the intercept and the slope of the regression of wage
on size
. We can then draw the points and the regression line:
wht %>% pre_plot %>% ggplot() + geom_point(aes(size, wage, size = n)) + geom_abline(intercept = rl[1], slope = rl[2])
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.