In paulnorthrop/stat1004: stat1004: Introduction to Probability and Statistics at UCL

knitr::opts_chunk$set(comment = "", prompt = TRUE, collapse = TRUE)
#devtools::load_all()

The main purpose of this vignette is to provide R code to produce graphs that involve more than one variable. We consider two general situations: (i) plotting the values of a continuous variable for different values of a categorical variable (see The Oxford Birth Times data) and (ii) scatter plots of one variable against another (see The 2000 US Presidential Election data). The latter feature in Section 2.6 of the STAT0002 notes. See also the Chapter 2: Graphs (one variable).

The R code used in this vignette are available: graphs-2-vignette.R.

The Oxford Birth Times data

These data are available in the data frame ox_births.

library(stat1004)
birth_times <- ox_births[, "time"]
day <- ox_births[, "day"]

To display these data we manipulate them into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R's missing values code NA in order that each column of the matrix has the same number of rows.

ox_mat <- matrix(NA, ncol = 7, nrow = 16)
for (i in 1:7) {
  day_i_times <- ox_births$time[which(ox_births$day == i)]
  ox_mat[1:length(day_i_times), i] <- sort(day_i_times)
  colnames(ox_mat) <- paste("day", 1:7, sep = "")
}  
ox_mat

We have a numeric continuous variable, birth_times, and a categorical variable, day. The following code produces the graphs in the STAT0002 lecture slides that contain a plot of birth_times for each day of the week.

par(mar = c(4, 4, 0.5, 1))
xlab <- "time (hours)"
x_labs <- c(min(birth_times), pretty(birth_times), max(birth_times))
# top left
box_plot(birth_times ~ day, col = 8, ylab = xlab, pch = 16, xlab = "day")
# top right
box_plot(birth_times ~ day, col = 8, horizontal = TRUE,  axes = FALSE, xlab = xlab, ylab = "day", pch = 16)
axis(1, at = x_labs, labels = x_labs)
axis(2, at = 1:7, labels = 1:7, lwd = 0, lty = 0)
# bottom left
box_plot(birth_times ~ day, axes = FALSE, ylab = xlab, pch = 16, lty = 1, range = 0, boxcol = "white", staplewex = 0, medlty = "blank", medpch = 16, xlab = "day")
axis(1, at = 1:7, labels = 1:7, lwd = 0, lty = 0)
axis(2, at = x_labs, labels = x_labs)
# bottom right
box_plot(birth_times ~ day, horizontal = TRUE, axes = FALSE, xlab = xlab, pch = 16, lty = 1, range = 0, boxcol = "white", staplewex = 0, medlty = "blank", medpch = 16)
axis(1, at = x_labs, labels = x_labs)
axis(2, at = 1:7, labels = 1:7, lwd = 0, lty = 0, las = 1)

The 2000 US Presidential Election in Florida

These data are available in the data frame USelection. See ?USelection for details.

# County identifiers and location
head(USelection[, 1:4])
# County demographic variables
head(USelection[, 5:12])
# Numbers of votes for candidates
head(USelection[, 13:22])

For the moment we simply produce some scatter plots. A separate vignette will be devoted to these data.

A plot to show the locations of the counties.

plot(-USelection[, "lon"], USelection[, "lat"], xlab = "longitude (degrees north)", ylab = "latitude (degrees east)", pch = 16)

Can you see the outline of the state of Florida?

A plot of the percentage of the vote for Buchanan against population size.

pbuch <- 100 * USelection$buch/USelection$tvot
is_PB <- USelection[, "co_names"] == "PalmBeach"
pch <- 1 + 3 * is_PB
pch
plot(USelection$npop, pbuch, xlab = "population", ylab = "Buchanan % vote", pch = pch)
which_PB <- which(is_PB)
text(USelection[which_PB, "npop"], pbuch[which_PB] + 0.1, "Palm Beach", cex = 0.8)

Can you see how the code used to identify Palm Beach on the plot works?

Pairwise scatter plots of the demographic variables.

pairs(USelection[, 5:12])

A plot of the square root of the percentage of the vote for Buchanan against population size, in thousands of people. The horizontal axis has been plotted on a log scale.

x <- USelection$npop / 1000
y <- sqrt(pbuch)
ystring <- expression(sqrt("% Buchanan vote"))
rm_PB <- which(!is_PB)
scatter(x[rm_PB], y[rm_PB], pch = 16, xlab ="Total Population (1000s)", ylab = ystring, log = "x")
points(x[which_PB], y[which_PB], pch = "X")
text(x[which_PB], y[which_PB] + 0.04, "Palm Beach", cex = 0.8)

Can you see how the different method to identify Palm Beach works?
Can you guess what the numbers on the axes are? See scatter to find out.

Similarly, we use scatter_hist to create a scatter plot in which the distribution of each variable is summarized by a histogram.

scatter_hist(x, y, log = "x", pch = 16, xlab ="Total Population (1000s)", ylab = ystring)

The plot in the lecture slides is produced by specifying particular bins for the histograms.

logx <- log(x)
xbreaks <- seq(from = min(logx), to = max(logx), len = 25)
ybreaks <- seq(from = min(y), to = max(y), len = 25)
scatter_hist(x, y, log = "x", pch = 16, xlab ="Total Population (1000s)", ylab = ystring, xbreaks = xbreaks, ybreaks = ybreaks)

paulnorthrop/stat1004 documentation built on Nov. 17, 2019, 3:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

paulnorthrop/stat1004
stat1004: Introduction to Probability and Statistics at UCL

In paulnorthrop/stat1004: stat1004: Introduction to Probability and Statistics at UCL

The Oxford Birth Times data

The 2000 US Presidential Election in Florida

R Package Documentation

Browse R Packages

We want your feedback!

paulnorthrop/stat1004 stat1004: Introduction to Probability and Statistics at UCL

In paulnorthrop/stat1004: stat1004: Introduction to Probability and Statistics at UCL

The Oxford Birth Times data

The 2000 US Presidential Election in Florida

R Package Documentation

Browse R Packages

We want your feedback!

paulnorthrop/stat1004
stat1004: Introduction to Probability and Statistics at UCL