knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE)
#devtools::load_all()

The main purpose of this vignette is to provide R code to produce graphs that feature in Sections 2.4 and 2.5 of the STAT0002 notes. All these graphs involve one variable only. See also the Chapter 2: Graphs (more than one variable).

The R code used in this vignette are available: graphs-vignette.R.

The Oxford Birth Times data

These data are available in the data frame ox_births.

library(stat1004)
birth_times <- ox_births[, "time"]
sort(birth_times)

Empirical cumulative distribution function

  1. Can you work out from the output below what the function ecdf does? Use ?ecdf to check our answer.
ox_ecdf <- ecdf(birth_times)
ox_ecdf(1)
ox_ecdf(2)
2/95
ox_ecdf(4.7)
23/95
ox_ecdf(19)
ox_ecdf(100)

The plot method for an object returned by ecdf produces a plot much like (the hollow circles in) Figure 2.3 in the STAT0002 notes.

We will label a lot of axes with "time (hours)". To make our lives a little easier we create a character variable called xlab.

xlab <- "time (hours)"
plot(ox_ecdf, main = "", xlab = xlab, ylab = "cumulative relative frequency", pch = 1)
  1. It should be obvious what the arguments xlab and ylab do. What do you think that main = "" and pch = 1 do? Use ?title and ?par (respectively) to find out more. ?par lists lots of arguments that can be used to adjust a graph.

Histogram

For an explanation of how to construct a histogram, and why it is constructed in this way, see the short How to draw a histogram video. The following code produces the plots on the right side of Figure 2.4 in the notes. Note that in the first plot the probability = TRUE (or prob = TRUE for short) argument is important: it ensures that the total area of the rectangles of the histogram is equal to 1. In the second plot it isn't necessary to include prob = TRUE because if the bin widths are unequal then (sensibly) hist uses prob = TRUE.

hist(birth_times, prob = TRUE, col = 8, main = "", xlab = xlab)
br <- c(seq(from = 0, to = 12, by = 2), 20)
hist(birth_times, breaks = br, col = 8, main = "", xlab = xlab)
  1. Can you see what the argument breaks does? ... and the function and seq?

The hist function returns an object (a list of several objects in fact) containing the information used to construct the plot. If we assign this to an R object, ox_tab say, then we can look at this information and perhaps use it. The ls function tells us the names of the objects in this list. Use ?hist to find out what these objects are.

ox_tab <- hist(birth_times, plot = FALSE)
ls(ox_tab)

We use the information in ox_tab to reproduce the plot in Figure 2.2 of the notes.

cum_rel_freq <- cumsum(c(0, ox_tab$counts)) / length(birth_times)
plot(ox_tab$breaks, cum_rel_freq, type = "b", pch = 16, ylab = "cumulative relative frequency", xlab = xlab)
abline(h = 1, lty = 2)
  1. What do you think the function cumsum does?

The following code produces the histogram-like alternative plots in the lecture slides STAT0002 lecture slides. As you can see we can do many things to change the appearance of plots. Use ?par to see a list of the graphical parameters that we can use.

# adjust plot margins, produce a 2 by 2 array of plots (fill row 1 then row 2)
par(mar = c(4, 4, 1, 1), mfrow = c(2, 2))
# no vertical axis
hist(birth_times, col = 8, prob = TRUE, axes = FALSE, xlab = xlab, ylab = "", main = "")
axis(1, line = 0.5)
# no vertical axis plus rug of points
hist(birth_times, col = 8, prob = TRUE, axes = FALSE, xlab = xlab, ylab = "",main = "")
axis(1, line = 0.5)
rug(birth_times, line = 0.5, ticksize = 0.05)
# non-shaded frequency polygon
n <- length(ox_tab$mids)
ox_tab$mids <- c(ox_tab$mids[1], ox_tab$mids, ox_tab$mids[n])
ox_tab$density <- c(0, ox_tab$density, 0)
plot(ox_tab$mids, ox_tab$density, xlab = xlab, ylab = "density", type = "l", las = 1) 
axis(1, line = 0)
# shaded frequency polygon with no vertical axis
plot(ox_tab$mids, ox_tab$density, xlab = xlab, ylab = "", type = "l", bty= "l", axes = FALSE)
axis(1, line = -0.4)
polygon(ox_tab$mids, ox_tab$density, col = 8)

Stem-and-leaf plot

The short How to draw a histogram video also describes how to draw a stem-and-leaf plot.

# The default plot
stem(birth_times)

# The plot that appears in the notes
stem(birth_times, scale = 2)
  1. Use ?stem to see what scale does.

Dotplot

It is surprisingly difficult to produce a nice-looking dotplot using standard R functions. One possibility is to use stripchart. The code below also produces the plots Figure 2.5 in the notes.

stripchart(birth_times, method = "stack", pch = 16, at = 0, offset = 2/3)
title(xlab = xlab)
x <- round(birth_times)
stripchart(x, method = "stack", pch = 16, at = 0, axes = FALSE, offset = 2/3)
title(xlab = xlab)
x_labs <- c(min(x), pretty(x), max(x))
axis(1, at = x_labs)
  1. Find out what the functions pretty and axis do.

Boxplot

The function that produces a boxplot (or box-and-whisker plot) is called boxplot. Inside boxplot the estimation of the quartiles, on which the box part of the plot is based, uses the default value of the argument type (i.e. type = 7) in a call to quantile. In order that we can change the value of type if we wish stat1004 contains a function box_plot, which is a copy of boxplot with the extra argument type added. In box_plot the default is type = 6, which corresponds to the particular way of estimating quartiles that is described in the STAT0002 lecture notes and slides. See the Descriptive Statistics vignette for more information about quantile and the argument type. In many cases we will not need to worry about the choice of the value of type. Unless the dataset is small the value of type will not have enough of an effect on the appearance of the boxplot to matter. This is certainly the case for the birth_times data.

# type = 6
b1 <- box_plot(birth_times, horizontal = TRUE, main = "type = 6", xlab = xlab)
# type = 7
b2 <- boxplot(birth_times, horizontal = TRUE, main = "type = 7")

We know from the Descriptive Statistics vignette that only the estimates of the lower quartile differ between the type = 6 and type = 7 cases. We can check this using the objects b1 and b2 returned above.

as.vector(b1$stats)
as.vector(b2$stats)
  1. Can you work out from the output and graphs above what boxplot returns in the list component stats? Check your answer using ?boxplot. How could we obtain the values of any data points that lie outside the whiskers?

If we read the documentation of boxplot carefully (see ?boxplot and ?bxp) we can have a lot of control over the graph that it produces. The following code produces the plots in Figure 2.7 of the STAT0002 notes.

par(mar = c(4, 1, 0.5, 1))
x_labs <- c(min(birth_times), pretty(birth_times), max(birth_times))
# top left
box_plot(birth_times, horizontal = TRUE, col = 8, xlab = xlab, pch = 16)
# top right
boxplot(birth_times, horizontal = TRUE, col = 8, axes = FALSE, xlab = xlab, pch = 16)
axis(1, at = x_labs, labels = x_labs)
# bottom left
boxplot(birth_times, horizontal = TRUE, axes = FALSE, xlab = xlab, pch = 16, lty = 1, range = 0, staplewex = 0)
axis(1, at = x_labs, labels = x_labs)
# bottom right
boxplot(birth_times, horizontal = TRUE, axes = FALSE, xlab = xlab, pch = 16, lty = 1, range = 0, boxcol = "white", staplewex = 0, medlty = "blank", medpch = 16)
axis(1, at = x_labs, labels = x_labs)

In the Chapter 2: Graphs (more than one variable) vignette we consider both variables in the ox_births data frame, i.e. the numeric continuous variable time and the categorical variable day, producing separate boxplots of time for each day of the week.

The Challenger O-ring data

We return to the data that feature in the Challenger Space Shuttle disaster vignette. The column \code{damaged} in the data frame \code{shuttle} contains the observed numbers of O-rings that suffer thermal distress. This is a numerical discrete variable. In fact we can be more precise than this, it is an integer variable. One way to summarize these data graphically is using a bar plot. In order to produce the bar plot we first need to tabulate the data, using the functiontable to calculate the frequencies of the numbers of distressed O-rings.

shuttle$damaged
O_tab <- table(shuttle$damaged)
O_tab

This table is OK but it does not include the zero frequencies for the categories 4, 5 and 6.

  1. Why can we not expect the output from table not include the categories 4, 5 and 6?

A digression: table and classes of R objects

An answer to question 8. is "R can't possibly know which values the variable shuttle$damaged could have". It is obvious to us, because we know what these data mean and that a space shuttle has 6 O-rings, but R's table function cannot infer this from the data alone. In this example it probably doesn't matter that some of the zero frequencies are missing. However, consider the following example. We simulate 10 values from a Poisson distribution with a mean of 5. For an introduction to simulation see the Stochastic simulation vignette. These data are not real, but imagine that they are the respective numbers of earthquakes that occur in a particular area in the years 2008 to 2017.

set.seed(47)
x <- rpois(10, 5)
table(x)

We know that these data non-negative integers. R does not know this and has included columns in the table only for non-zero frequencies, i.e. for the values 4, 5, 6, 7, 8 and 10. At the very least we might want the zero frequencies for the values 0, 1, 2, 3 and 9 to be included, and perhaps also something to indicate that there are no values greater than 10. One possibility is the following. It is not particularly elegant but it is quite effective.

table(c(x, 0:11)) - 1
  1. Can you work out what this code does? It might help to look at ?c.

If we are really fussy then we could change the name of the final heading in the table.

tab <- table(c(x, 0:11)) - 1
names(tab)[12] <- ">10"
tab

To explore another option we return to the variable shuttle$damaged. R has a system by which R objects are categorized into different classes. The effect of a given R function may depend on the class of the object provided to that function. The function class can be used to find out what class an object has.

class(shuttle$damaged)

R thinks (correctly) that shuttle$damaged is an integer variable. Unfortunately, as far as I am aware, there is no mechanism for us to tell R that this integer variable can take only the values 0, 1, 2, 3, 4, 5, or 6. However, there is a way to do this if we get R to treat shuttle$damaged as a categorical variable. Then we can provide to R information about the possible levels of the categorical variable shuttle$damaged. In R categorical variables are called factors. We use the function factor to create a new variable fac_dam that is (a) a factor, and (b) has levels 0, 1, 2, 3, 4, 5 and 6.

fac_dam <- factor(shuttle$damaged, levels = 0:6)
class(fac_dam)

Now when we use table the output reflects automatically the possible values of the variable.

O_tab_fac <- table(fac_dam)
O_tab_fac

Bar plot

The following code reproduces the graphs in Figure 2.8 of the notes.

par(mfrow=c(2,2))
par(oma=c(0,0,0,0),mar=c(4,4,1,2)+0.1)
xlab <- "number of damaged O-rings"
ylab <- "frequency"
barplot(O_tab, xlab = xlab, ylab = ylab, las = 1)
barplot(O_tab, space = 1, xlab = xlab, ylab = ylab, las = 1)
barplot(O_tab, space = 1, xlab = xlab, ylab = ylab, las = 1)
abline(h=0)
yy <- as.numeric(O_tab)
xx <- as.numeric(unlist(dimnames(O_tab),use.names=F))
plot(xx, yy, pch = c("0","1","2","3"), axes = FALSE, xlab ="", ylab = ylab, ylim = c(0, 16))
title(xlab="number of damaged o-rings",line=0.25)
axis(2, lty = 1, at = yy, labels = yy, pos = -0.3, las = 1)

If instead we use the O-tab_fac then the zero frequencies for values 4, 5 and 6 are included in the plot.

barplot(O_tab_fac, xlab = xlab, ylab = ylab, las = 1)

Blood groups

The data in Table 2.8 of the notes are available in the data frame blood_types.

Bar plot

This code produces the bar plot on the left side of Figure 2.9.

blood_types
lab <- paste(blood_types$ABO, substr(blood_types$Rh, 3, 3), sep = "")
barplot(blood_types$percentage, space = 1, xlab = "blood group", ylab = "percentage", las = 1, names.arg = lab)
abline(h=0)

Pie chart

If you have a desperate need to produce a pie chart then you can do this using the following code.

par(mar = c(1, 1, 0, 1))
slices <- rep(c("white","grey66","grey33","black"), 2)
pie(blood_types$percentage, labels = lab, col = slices)

Then read the text in the Note section at ?pie and forget that the function pie exists.

FTSE100 data

The FTSE 100 share index data described in Section 2.5.6 of the notes are available in the data frame ftse.

head(ftse)
tail(ftse)

Handling dates in R can be tricky. The main issue is to ensure that a time series, and its dates, are stored correctly. One way to achieve this is to use the function ts to create a time series object, specifying the frequency of the data (frequency = 52 means weekly because there are, approximately, 52 weeks in a year) and the start data (the 14th week of 1984). This time series object is a the vector of weekly closing prices and two attributes in which the important extra information (metadata) of tsp and class is stored. The attribute tsp contains the start date, end date and frequency of the observations and the attribute class indicates that this is a ts (time series) object.

ftse_ts <- ts(ftse$price, frequency = 52, start = c(1984, 14))
head(ftse_ts)
attributes(ftse_ts)

Time series plot

The following code produces the time series plot on the left side of Figure 2.10. Here we have used the function as.Date to ensure that R knows that the data to be plotted on the horizontal axis of the plot are dates. In a moment we will make use of the time series object ftse_ts.

plot(as.Date(ftse$date), ftse$price, type = "l", ylab = "weekly FTSE 100 share index", xlab = "year")

The following code produces the plot on the right side of Figure 2.10. The class of ftse_ts is "ts" so R knows that when we use plot(ftse_ts) we want a time series plot. The rest of the code just adjusts the appearance of the plot.

plot(ftse_ts, ylab = "FTSE 100 (in 1000s)", xlab = "year", las = 1, axes = FALSE)
q2 <- c(min(ftse_ts), (2:6) * 1000, max(ftse_ts))
axis(2, at = q2, labels = round(q2 / 1000, 1), las = 1)
axis(4, at= q2, labels = round(q2 / 1000, 1), las=1)
axis(1)
abline(h = par("usr")[3])
  1. Can you work out what each bit of the code does?

Influenza data

The mystery data plotted in Figure 2.11 of the notes are the numbers of people in the UK visiting their doctor with symptoms of influenza (’flu) during four-weekly time periods over the time period 28th January 1967 to 13th November 2004. [I hope that this doesn't spoil the surprise for you.]

head(flu)
tail(flu)

Time series plot

We create a time series object using ts and produce the plot in Figure 2.11

par(mar = c(5, 5, 4, 3) + 0.1, cex.axis = 0.8, cex.lab = 0.75)
flu_ts <- ts(flu[,2], frequency = 13, start = c(1967, 4))
plot(flu_ts, ylab="number of 'flu consultations (in 1000s)", xlab = "year", las = 1, axes=FALSE)
q2 <- c(min(flu_ts),(1:6) * 100, max(flu_ts))
axis(2, at = q2, labels = round(q2, 1), las = 1)
axis(4, at = q2, labels = round(q2, 1), las = 1)
abline(h=0)
axis(1,pos=-1)


paulnorthrop/stat1004 documentation built on Nov. 17, 2019, 3:49 a.m.