# IPSUR: Introduction to Probability and Statistics Using R
# Copyright (C) 2018 G. Jay Kerns
#
# Chapter: Data Description
#
# This file is part of IPSUR.
#
# IPSUR is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# IPSUR is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.
options(knitr.duplicate.label = 'allow')
# Preliminary code to load before start
# This chapter's package dependencies
library(aplpack)
library(e1071)
library(lattice)
library(qcc)
In this chapter we introduce the different types of data that a statistician is likely to encounter, and in each subsection we give some examples of how to display data of that particular type. Once we see how to display data distributions, we introduce the basic properties of data distributions and qualitatively explore several data sets. With an intuitive feel for those properties in hand, we then discuss how to measure and describe them numerically with descriptive statistics.
What do I want them to know?
Loosely speaking, a datum is any piece of collected information, and a data set is a collection of data related to each other in some way. We will categorize data into five types and describe each in turn:
In each subsection we look at some examples of the type in question and introduce methods to display them.
Quantitative data are any data that measure or are associated with a measurement of the quantity of something. They invariably assume numerical values. Quantitative data can be further subdivided into two categories.
Note that the distinction between discrete and continuous data is not always clear-cut. Sometimes it is convenient to treat data as if they were continuous, even though strictly speaking they are not continuous. See the examples.
\bigskip
```{example, name="Annual Precipitation in US Cities"}
The vector `precip` \index{Data sets!precip@\texttt{precip}}
contains the average amount of rainfall (in inches) for each of 70 cities
in the United States and Puerto Rico. Let us take a look at the data:
```
```r
str(precip)
precip[1:4]
```
The output shows that `precip` is a numeric vector which has been
named, that is, each value has a name associated with it (which can
be set with the `names` \index{names@\texttt{names}} function). These
are quantitative continuous data.
\bigskip
```{example, name="Lengths of Major North American Rivers"}
The U.S. Geological Survey recorded the lengths (in miles) of several
rivers in North America. They are stored in the vector
`rivers` \index{Data sets!rivers@\texttt{rivers}} in the
`datasets` package [@datasets] (which ships with base R). See
`?rivers`. Let us take a look at the data with the `str`
\index{str@\texttt{str}} function.
```
```r
str(rivers)
```
The output says that `rivers` is a numeric vector of length 141, and
the first few values are 735, 320, 325, etc. These data are
definitely quantitative and it appears that the measurements have been
rounded to the nearest mile. Thus, strictly speaking, these are
discrete data. But we will find it convenient later to take data like
these to be continuous for some of our statistical procedures.
\bigskip
```{example, name="Yearly Numbers of Important Discoveries"}
The vector `discoveries` \index{Data sets!discoveries@\texttt{discoveries}}
contains the numbers of "great" inventions/discoveries in each year from
1860 to 1959, as reported by the 1975 World Almanac. Let us take a look
at the data:
```
```r
str(discoveries)
```
The output is telling us that `discoveries` is a time series (see
Section \@ref(sub-other-data-types) for more) of length 100. The entries are
integers, and since they represent counts this is a good example of
discrete quantitative data. We will take a closer look in the
following sections.
One of the first things to do when confronted by quantitative data (or any data, for that matter) is to make some sort of visual display to gain some insight into the data's structure. There are almost as many display types from which to choose as there are data sets to plot. We describe some of the more popular alternatives.
\index{strip chart} \index{dot plot| see{strip chart}}
These can be used for discrete or continuous data, and usually look
best when the data set is not too large. Along the horizontal axis is
a numerical scale above which the data values are plotted. We can do
it in R with a call to the `stripchart`
\index{stripchart@\texttt{stripchart}} function. There are three
available methods.
* `overplot`: plots ties covering each other. This method is good to
  display only the distinct values assumed by the data set.
* `jitter`: adds some noise to the data in the (y) direction in
  which case the data values are not covered up by ties.
* `stack`: plots repeated values stacked on top of one another. This
  method is best used for discrete data with a lot of ties;
  if there are no repeats then this method is identical to `overplot`.

See Figure \@ref(fig:stripcharts), which was produced by the following code.
```r
stripchart(precip, xlab = "rainfall")
stripchart(rivers, method = "jitter", xlab = "length")
stripchart(discoveries, method = "stack", xlab = "number")
```
The leftmost graph is a strip chart of the `precip` data. The graph
shows tightly clustered values in the middle with some others falling
balanced on either side, with perhaps slightly more falling to the
left. Later we will call this a symmetric distribution; see Section
\@ref(sub-shape). The middle graph is of the `rivers` data, a vector of
length 141. There are several repeated values in the `rivers` data, and
if we were to use the overplot method we would lose some of them in
the display. This plot shows what we will later call a right-skewed
shape, with perhaps some extreme values on the far right of the
display. The third graph strip charts the `discoveries` data, which are
literally a textbook example of a right-skewed distribution.
```r
par(mfrow = c(3,1))  # 3 plots: 3 rows, 1 column
stripchart(precip, xlab = "rainfall", cex.lab = cexlab)
stripchart(rivers, method = "jitter", xlab = "length", cex.lab = cexlab)
stripchart(discoveries, method = "stack", xlab = "number",
           ylim = c(0,3), cex.lab = cexlab)
par(mfrow = c(1,1))  # back to normal
```
(ref:cap-stripcharts) \small Three strip charts of three data sets. The first graph uses the `overplot` method, the second the `jitter` method, and the third the `stack` method.
The `DOTplot` \index{DOTplot@\texttt{DOTplot}} function in the
`UsingR` \index{R packages!UsingR@\texttt{UsingR}} package
[@UsingR] is another alternative.
These are typically used for continuous data. A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. Then vertical bars are drawn over the bins with height proportional to the number of observations that fell into the bin.
These are one of the most common summary displays, and they are often misidentified as "Bar Graphs" (see below). The scale on the (y) axis can be frequency, percentage, or density (relative frequency). The term histogram was coined by Karl Pearson in 1891; see [@Miller].
\bigskip
```{example, label="annual", name="Annual Precipitation in US Cities"}
We are going to take another look at the `precip`
\index{Data sets!precip@\texttt{precip}} data that we
investigated earlier. The strip chart in Figure \@ref(fig:stripcharts)
suggested a loosely balanced distribution; let us now look to see what
a histogram says.
```
There are many ways to plot histograms in R, and one of the easiest is with the `hist` \index{hist@\texttt{hist}} function. The following code produces the plots in Figure \@ref(fig:histograms).

```r
hist(precip, main = "")
hist(precip, freq = FALSE, main = "")
```
Notice the argument `main = ""` which suppresses the main
title from being displayed -- it would have said "Histogram of
`precip`" otherwise. The plot on the left is a frequency histogram
(the default), and the plot on the right is a relative frequency
histogram (`freq = FALSE`).
```r
par(mfrow = c(1,2))
hist(precip, main = "", cex.lab = cexlab)
hist(precip, freq = FALSE, main = "", cex.lab = cexlab)
par(mfrow = c(1,1))
```
(ref:cap-histograms) \small (Relative) frequency histograms of the `precip` data.
Please bear in mind the biggest weakness of histograms: the graph obtained strongly depends on the bins chosen. Choose another set of bins, and you will get a different histogram. Moreover, there are no definitive criteria by which bins should be defined; the best choice for a given data set is the one which illuminates the data set's underlying structure (if any). Luckily for us there are algorithms to automatically choose bins that are likely to display well, and more often than not the default bins do a good job. This is not always the case, however, and a responsible statistician will investigate many bin choices to test the stability of the display.
Recall that the strip chart in Figure \@ref(fig:stripcharts)
suggested a relatively balanced shape to the `precip` data
distribution. Watch what happens when we change the bins slightly
(with the `breaks` argument to `hist`). See Figure \@ref(fig:histograms-bins),
which was produced by the following code.
```r
hist(precip, breaks = 10)
hist(precip, breaks = 25)
hist(precip, breaks = 50)
```
```r
par(mfrow = c(1,3))
hist(precip, breaks = 10, main = "", cex.lab = cexlab)
hist(precip, breaks = 25, main = "", cex.lab = cexlab)
hist(precip, breaks = 50, main = "", cex.lab = cexlab)
par(mfrow = c(1,1))
```
(ref:cap-histograms-bins) \small More histograms of the `precip` data.
The leftmost graph (with `breaks = 10`) shows that the distribution is
not balanced at all. There are two humps: a big one in the middle and
a smaller one to the left. Graphs like this often indicate some
underlying group structure to the data; we could now investigate
whether the cities for which rainfall was measured were similar in
some way, with respect to geographic region, for example.
The rightmost graph in Figure \@ref(fig:histograms-bins) shows what happens when the number of bins is too large: the histogram is too grainy and hides the rounded appearance of the earlier histograms. If we were to continue increasing the number of bins we would eventually get all observed bins to have exactly one element, which is nothing more than a glorified strip chart.
Stem-and-leaf displays (also known as stemplots) have two basic parts: stems and leaves. The final digit of the data values is taken to be a leaf, and the leading digit(s) is (are) taken to be stems. We draw a vertical line, and to the left of the line we list the stems. To the right of the line, we list the leaves beside their corresponding stem. There will typically be several leaves for each stem, in which case the leaves accumulate to the right. It is sometimes necessary to round the data values, especially for larger data sets.
\bigskip
```{example, label="ukdriverdeaths-first", name="Driver Deaths in the United Kingdom"}
`UKDriverDeaths` \index{Data sets!UKDriverDeaths@\texttt{UKDriverDeaths}}
is a time series that contains the total car drivers killed or seriously
injured in Great Britain monthly from Jan 1969 to Dec 1984. See
`?UKDriverDeaths`. Compulsory seat belt use was introduced on January
31, 1983. We construct a stem-and-leaf diagram in R with the
`stem.leaf` \index{stem.leaf@\texttt{stem.leaf}} function from the
`aplpack` \index{R packages@\textsf{R}
packages!aplpack@\texttt{aplpack}} package [@aplpack].
```
```r
stem.leaf(UKDriverDeaths, depths = FALSE)
```
The display shows a more or less balanced mound-shaped distribution, with one or maybe two humps, a big one and a smaller one just to its right. Note that the data have been rounded to the tens place so that each datum gets only one leaf to the right of the dividing line.
Notice that the `depths` \index{depths} have been suppressed. To learn
more about this option and many others, see Section
\@ref(sec-exploratory-data-analysis). Unlike a histogram, the original
data values may be recovered from the stem-and-leaf display -- modulo
the rounding -- that is, starting from the top and working down we can
read off the data values 1050, 1070, 1110, 1130, and so forth.
Index plots are done with the `plot`
\index{plot@\texttt{plot}} function. These are
good for plotting data which are ordered, for example, when the data
are measured over time. That is, the first observation was measured at
time 1, the second at time 2, etc. It is a two-dimensional plot, in
which the index (or time) is the (x) variable and the measured value
is the (y) variable. There are several plotting methods for index
plots, and we mention two of them:
* `spikes`: draws a vertical line from the (x)-axis to the observation height.
* `points`: plots a simple point at the observation height.

\bigskip
```{example, name="Level of Lake Huron 1875-1972"}
Brockwell and Davis [@Brockwell1991] give the annual measurements
of the level (in feet) of Lake Huron from 1875--1972. The data are
stored in the time series `LakeHuron`. \index{Data
sets!LakeHuron@\texttt{LakeHuron}} See `?LakeHuron`. Figure
\@ref(fig:indpl-lakehuron) was produced with the following code:
```
```r
plot(LakeHuron)
plot(LakeHuron, type = "p")
plot(LakeHuron, type = "h")
```
The plots show an overall decreasing trend to the observations, and there appears to be some seasonal variation that increases over time.
```r
par(mfrow = c(3,1))
plot(LakeHuron, cex.lab = cexlab)
plot(LakeHuron, type = "p", cex.lab = cexlab)
plot(LakeHuron, type = "h", cex.lab = cexlab)
par(mfrow = c(1,1))
```
(ref:cap-lakehuron) \small Index plots of the `LakeHuron` data.
Kernel density estimates are smoothed alternatives to the histogram, made with the `density` \index{density@\texttt{density}} function; the default method uses a Gaussian kernel density estimate.
```r
# The Old Faithful geyser data
d <- density(faithful$eruptions, bw = "sj")
d
plot(d)
hist(precip, freq = FALSE)
lines(density(precip))
```
Qualitative data are simply any type of data that are not numerical, or do not represent numerical quantities. Examples of qualitative variables include a subject's name, gender, race/ethnicity, political party, socioeconomic status, class rank, driver's license number, and social security number (SSN).
Please bear in mind that some data look to be quantitative but are not, because they do not represent numerical quantities and do not obey mathematical rules. For example, a person's shoe size is typically written with numbers: 8, or 9, or 12, or (12\,\frac{1}{2}). Shoe size is not quantitative, however, because if we take a size 8 and combine with a size 9 we do not get a size 17.
Some qualitative data serve merely to identify the observation (such as a subject's name, driver's license number, or SSN). This type of data does not usually play much of a role in statistics. But other qualitative variables serve to subdivide the data set into categories; we call these factors. In the above examples, gender, race, political party, and socioeconomic status would be considered factors (shoe size would be another one). The possible values of a factor are called its levels. For instance, the factor of gender would have two levels, namely, male and female. Socioeconomic status typically has three levels: high, middle, and low.
Factors may be of two types: nominal \index{nominal data} and ordinal. \index{ordinal data} Nominal factors have levels that correspond to names of the categories, with no implied ordering. Examples of nominal factors would be hair color, gender, race, or political party. There is no natural ordering to "Democrat" and "Republican"; the categories are just names associated with different groups of people.
In contrast, ordinal factors have some sort of ordered structure to the underlying factor levels. For instance, socioeconomic status would be an ordinal categorical variable because the levels correspond to ranks associated with income, education, and occupation. Another example of ordinal categorical data would be class rank.
Factors have special status in R. They are represented internally by numbers, but even when they are written numerically their values do not convey any numeric meaning or obey any mathematical rules (that is, Stage III cancer is not Stage I cancer + Stage II cancer).
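As a small sketch (with made-up cancer-stage data; the variable name `stage` is hypothetical), we can see how R stores a factor internally:

```r
# A hypothetical ordinal factor: cancer stage for six patients
stage <- factor(c("I", "III", "II", "I", "II", "III"),
                levels = c("I", "II", "III"), ordered = TRUE)
stage              # shows the values and the ordering I < II < III
as.integer(stage)  # the internal integer codes: 1 3 2 1 2 3
# The codes carry no arithmetic meaning: stage[1] + stage[2] gives NA,
# with a warning that '+' is not meaningful for ordered factors.
```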
\bigskip
The `state.abb` \index{Data sets!state.abb@\texttt{state.abb}} vector gives the two letter postal abbreviations for all 50 states.
```r
str(state.abb)
```
These would be ID data. The `state.name` \index{Data
sets!state.name@\texttt{state.name}} vector lists all of the complete
names, and those data would also be ID.
\bigskip
```{example, name="U.S. State Facts and Features"}
The U.S. Department of Commerce of the U.S. Census Bureau releases all
sorts of information in the *Statistical Abstract of the United
States*, and the `state.region` \index{Data
sets!state.region@\texttt{state.region}} data lists each of the 50
states and the region to which it belongs, be it Northeast, South,
North Central, or West. See `?state.region`.
```
```r
str(state.region)
state.region[1:5]
```
The `str` \index{str@\texttt{str}} output shows that `state.region` is
already stored internally as a factor and it lists a couple of the
factor levels. To see all of the levels we printed the first five
entries of the vector in the second line.
One of the best ways to summarize qualitative data is with a table of
the data values. We may count frequencies with the `table` function or
list proportions with the `prop.table`
\index{prop.table@\texttt{prop.table}} function (whose input is a
frequency table). In the R Commander you can do it with
`Statistics` (\triangleright) `Frequency Distribution...`
Alternatively, to look at tables for all factors in the `Active data
set` \index{Active data set@\texttt{Active data set}} you can do
`Statistics` (\triangleright) `Summaries` (\triangleright) `Active
Dataset`.
```r
Tbl <- table(state.division)
Tbl
Tbl/sum(Tbl)     # relative frequencies
prop.table(Tbl)  # same thing
```
A bar graph is the analogue of a histogram for categorical data. A bar is displayed for each level of a factor, with the heights of the bars proportional to the frequencies of observations falling in the respective categories. A disadvantage of bar graphs is that the levels are ordered alphabetically (by default), which may sometimes obscure patterns in the display.
\bigskip
```{example, name="U.S. State Facts and Features"}
The `state.region` data lists each of the 50 states and the region to
which it belongs, be it Northeast, South, North Central, or West. See
`?state.region`. It is already stored internally as a factor. We make
a bar graph with the `barplot`
\index{barplot@\texttt{barplot}} function:
```
```r
barplot(table(state.region), cex.names = 1.20)
barplot(prop.table(table(state.region)), cex.names = 1.20)
```
See Figure \@ref(fig:bar-gr-stateregion). The display on the left is a frequency bar graph because the (y) axis shows counts, while the display on the right is a relative frequency bar graph. The only difference between the two is the scale. Looking at the graph we see that the largest share of the fifty states is in the South, followed by West, North Central, and finally Northeast. Over 30% of the states are in the South.
Notice the `cex.names` \index{cex.names@\texttt{cex.names}}
argument that we used, above. It expands the names on the (x) axis
by 20%, which makes them easier to read. See `?par`
\index{par@\texttt{par}} for a detailed list of additional
plot parameters.
```r
par(mfrow = c(2,1))  # 2 plots: 2 rows, 1 column
barplot(table(state.region), cex.names = 1.2)
barplot(prop.table(table(state.region)), cex.names = 1.2)
par(mfrow = c(1,1))  # back to normal
```
(ref:cap-stateregion) \small The top graph is a frequency barplot made with `table` and the bottom is a relative frequency barplot made with `prop.table`.
A Pareto diagram is a lot like a bar graph except the bars are rearranged so that they decrease in height going from left to right. The rearrangement is handy because it can visually reveal structure (if any) in how fast the bars decrease -- this is much more difficult to see when the bars are jumbled.
\bigskip
```{example, name="U.S. State Facts and Features"}
The `state.division` \index{Data
sets!state.division@\texttt{state.division}} data record the
division (New England, Middle Atlantic, South Atlantic, East South
Central, West South Central, East North Central, West North Central,
Mountain, and Pacific) of the fifty states. We can make a Pareto
diagram with either the `RcmdrPlugin.IPSUR` \index{R
packages@\textsf{R}
packages!RcmdrPlugin.IPSUR@\texttt{RcmdrPlugin.IPSUR}} package
[@RcmdrPlugin.IPSUR] or with the
`pareto.chart` \index{pareto.chart@\texttt{pareto.chart}} function from the
`qcc` \index{R packages@\textsf{R}
packages!qcc@\texttt{qcc}} package [@qcc]. See Figure
\@ref(fig:pareto-chart). The code follows.
```
```r
pareto.chart(table(state.division), ylab = "Frequency")
```
```r
pareto.chart(table(state.division), ylab = "Frequency", cex.lab = cexlab)
```
(ref:cap-pareto) \small Pareto chart of the `state.division` data.
These are a lot like a bar graph that has been turned on its side with the bars replaced by dots on horizontal lines. They do not convey any more (or less) information than the associated bar graph, but the strength lies in the economy of the display. Dot charts are so compact that it is easy to graph very complicated multi-variable interactions together in one graph. See Section \@ref(sec-comparing-data-sets). We will give an example here using the same data as above for comparison. The graph was produced by the following code.
```r
x <- table(state.region)
dotchart(as.vector(x), labels = names(x), cex.lab = cexlab)
```
(ref:cap-dotchart) \small Dot chart of the \texttt{state.region} data.
See Figure \@ref(fig:dotchart). Compare it to Figure \@ref(fig:bar-gr-stateregion).
These can be done with R and the R Commander, but they have fallen out of
favor in recent years because researchers have determined that while
the human eye is good at judging linear measures, it is notoriously
bad at judging relative areas (such as those displayed by a pie
graph). Pie charts are consequently a very bad way of displaying
information unless the number of categories is two or three. A bar
chart or dot chart is a preferable way of displaying qualitative
data. See `?pie` \index{pie@\texttt{pie}} for more information.
We are not going to do any examples of a pie graph and discourage their use elsewhere.
There is another type of information recognized by R
which does not fall into the above categories. The value is either
`TRUE` or `FALSE` (note that equivalently you can use 1 = `TRUE`,
0 = `FALSE`). Here is an example of a logical vector:
```r
x <- 5:9
y <- (x < 7.3)
y
```
Many functions in R have options that the user may or may
not want to activate in the function call. For example, the
`stem.leaf` function has the `depths` argument, which is `TRUE` by
default. We saw in Section \@ref(sub-quantitative-data) how to turn the option
off: simply enter `stem.leaf(x, depths = FALSE)` and the depths will not be
shown on the display.
We can swap `TRUE` with `FALSE` with the exclamation point `!`.
```r
!y
```
Missing data are a persistent and prevalent problem in many
statistical analyses, especially those associated with the social
sciences. R reserves the special symbol `NA` to
represent missing data.
Ordinary arithmetic with `NA` values gives `NA`'s (addition,
subtraction, etc.), and applying a function to a vector that has an
`NA` in it will usually give an `NA`.
```r
x <- c(3, 7, NA, 4, 7)
y <- c(5, NA, 1, 2, 2)
x + y
```
Some functions have a `na.rm` argument which when `TRUE` will ignore
missing data as if they were not there (such as `mean`, `var`, `sd`,
`IQR`, `mad`, ...).
```r
sum(x)
sum(x, na.rm = TRUE)
```
Other functions do not have a `na.rm` argument and will return `NA` or
an error if the argument has `NA`s. In those cases we can find
the locations of any `NA`s with the `is.na` function and remove
those cases with the `[]` operator.
```r
is.na(x)
z <- x[!is.na(x)]
sum(z)
```
The analogue of `is.na` for rectangular data sets (or data frames) is
the `complete.cases` function. See Appendix \@ref(sec-editing-data-sets).
Given that the data have been appropriately displayed, the next step is to try to identify salient features represented in the graph. The acronym to remember is C-enter, U-nusual features, S-pread, and S-hape (CUSS).
One of the most basic features of a data set is its center. Loosely
speaking, the center of a data set is associated with a number that
represents a middle or general tendency of the data. Of course, there
are usually several values that would serve as a center, and our later
tasks will be focused on choosing an appropriate one for the data at
hand. Judging from the histogram that we saw in Figure
\@ref(fig:histograms-bins), a measure of center would be about
`r round(mean(precip))`.
The spread of a data set is associated with its variability; data sets with a large spread tend to cover a large interval of values, while data sets with small spread tend to cluster tightly around a central value.
When we speak of the shape of a data set, we are usually referring to the shape exhibited by an associated graphical display, such as a histogram. The shape can tell us a lot about any underlying structure to the data, and can help us decide which statistical procedure we should use to analyze them.
A distribution is said to be right-skewed (or positively skewed) if the right tail seems to be stretched from the center. A left-skewed (or negatively skewed) distribution is stretched to the left side. A symmetric distribution has a graph that is balanced about its center, in the sense that half of the graph may be reflected about a central line of symmetry to match the other half.
We have already encountered skewed distributions: both the `discoveries`
data in Figure \@ref(fig:stripcharts) and the `precip` data in Figure
\@ref(fig:histograms-bins) appear right-skewed. The `UKDriverDeaths`
data in Example \@ref(ex:ukdriverdeaths-first) are relatively
symmetric (but note the one extreme value 2654 identified at the
bottom of the stem-and-leaf display).
Another component to the shape of a distribution is how "peaked" it is. Some distributions tend to have a flat shape with thin tails. These are called platykurtic, and an example of a platykurtic distribution is the uniform distribution; see Section \@ref(sec-the-continuous-uniform). On the other end of the spectrum are distributions with a steep peak, or spike, accompanied by heavy tails; these are called leptokurtic. Examples of leptokurtic distributions are the Laplace distribution and the logistic distribution. See Section \@ref(sec-other-continuous-distributions). In between are distributions (called mesokurtic) with a rounded peak and moderately sized tails. The standard example of a mesokurtic distribution is the famous bell-shaped curve, also known as the Gaussian, or normal, distribution, and the binomial distribution can be mesokurtic for specific choices of (p). See Sections \@ref(sec-binom-dist) and \@ref(sec-the-normal-distribution).
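Since the `e1071` package is loaded in this chapter's preamble, we can put rough numbers on these shape ideas with its `skewness` and `kurtosis` functions (a sketch; these are sample estimates of the population quantities discussed in later chapters):

```r
library(e1071)         # provides skewness() and kurtosis()
skewness(discoveries)  # positive, agreeing with the right-skewed stripchart
kurtosis(precip)       # excess kurtosis: near 0 suggests a mesokurtic shape
```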
Clusters or gaps are sometimes observed in quantitative data
distributions. They indicate clumping of the data about distinct
values, and gaps may exist between clusters. Clusters often suggest an
underlying grouping to the data. For example, take a look at the
`faithful` data which contains the duration of `eruptions` and the
`waiting` time between eruptions of the Old Faithful geyser in
Yellowstone National Park. Do not be frightened by the complicated
information at the left of the display for now; we will learn how to
interpret it in Section \@ref(sec-exploratory-data-analysis).
```r
with(faithful, stem.leaf(eruptions))
```
There are definitely two clusters of data here; an upper cluster and a lower cluster.
Extreme observations fall far from the rest of the data. Such observations are troublesome to many statistical procedures; they cause exaggerated estimates and instability. It is important to identify extreme observations and examine the source of the data more closely. There are many possible reasons underlying an extreme observation:
These are used for categorical data. The idea is that there are a number of different categories, and we would like to get some idea about how the categories are represented in the population.
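As a quick sketch with a made-up vector (the name `party` is hypothetical), frequencies and relative frequencies look like this:

```r
# A hypothetical sample of political parties for seven subjects
party <- c("Dem", "Rep", "Dem", "Ind", "Rep", "Dem", "Rep")
table(party)              # counts per category
prop.table(table(party))  # the same table as proportions summing to 1
```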
The sample mean is denoted (\overline{x}) (read "(x)-bar") and is simply the arithmetic average of the observations: \begin{equation} \overline{x}=\frac{x_{1}+x_{2}+\cdots+x_{n}}{n}=\frac{1}{n}\sum_{i=1}^{n}x_{i}. \end{equation}
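The formula agrees with what the `mean` function computes; a quick check on the `precip` data:

```r
x <- precip
sum(x) / length(x)  # the sample mean from the definition
mean(x)             # the built-in function returns the same value
```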
It is appropriate for use with data sets that are not highly skewed and do not contain extreme observations.
The sample median is another popular measure of center and is denoted (\tilde{x}). To calculate its value, first sort the data into an increasing sequence of numbers. If the data set has an odd number of observations then (\tilde{x}) is the value of the middle observation, which lies in position ((n+1)/2); otherwise, there are two middle observations and (\tilde{x}) is the average of those middle values.
One desirable property of the sample median is that it is resistant to extreme observations, in the sense that the value of (\tilde{x}) depends only on those data values in the middle, and is quite unaffected by the actual values of the outer observations in the ordered list. The same cannot be said for the sample mean. Any significant changes in the magnitude of an observation (x_{k}) results in a corresponding change in the value of the mean. Consequently, the sample mean is said to be sensitive to extreme observations.
The trimmed mean is a measure designed to address the sensitivity of the sample mean to extreme observations. The idea is to "trim" a fraction (less than 1/2) of the observations off each end of the ordered list, and then calculate the sample mean of what remains. We will denote it by (\overline{x}_{t=0.05}).
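A tiny made-up data set illustrates the sensitivity of the mean and the resistance of the median and trimmed mean:

```r
x <- c(2, 3, 4, 5, 6)
c(mean(x), median(x))  # both are 4
x[5] <- 600            # corrupt the largest observation
mean(x)                # 122.8 -- dragged far to the right
median(x)              # still 4
mean(x, trim = 0.2)    # drops the bottom and top 20% (the 2 and the 600),
                       # leaving mean(c(3, 4, 5)) = 4
```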
* You can calculate frequencies with the `table` function, and relative frequencies with `prop.table(table())`.
* You can calculate the sample mean of a data vector `x` with the command `mean(x)`.
* You can calculate the sample median of `x` with the command `median(x)`.
* You can calculate the trimmed mean with the `trim` argument; for example, `mean(x, trim = 0.05)`.

A common first step in an analysis of a data set is to sort the values. Given a data set (x_{1}), (x_{2}), ..., (x_{n}), we may sort the values to obtain an increasing sequence \begin{equation} x_{(1)}\leq x_{(2)}\leq x_{(3)}\leq\cdots\leq x_{(n)} \end{equation} and the resulting values are called the order statistics. The (k^{\mathrm{th}}) entry in the list, (x_{(k)}), is the (k^{\mathrm{th}}) order statistic, and approximately (100(k/n))% of the observations fall below (x_{(k)}). The order statistics give an indication of the shape of the data distribution, in the sense that a person can look at the order statistics and have an idea about where the data are concentrated, and where they are sparse.
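In R the order statistics are obtained by sorting; a sketch using the `rivers` data:

```r
x <- sort(rivers)     # x[k] is the kth order statistic
x[1]                  # the minimum, the first order statistic
x[length(x)]          # the maximum, the last order statistic
k <- 71
mean(rivers <= x[k])  # about 71/141, i.e. roughly 50% fall at or below x[71]
```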
The sample quantiles are related to the order
statistics. Unfortunately, there is not a universally accepted
definition of them. Indeed, R is equipped to calculate
quantiles using nine distinct definitions! We will describe the
default method (`type = 7`), but the interested reader can see the
details for the other methods with `?quantile`.
Suppose the data set has (n) observations. Find the sample quantile of order (p) ((0<p<1)), denoted (\tilde{q}_{p}), as follows:
The interpretation of (\tilde{q}_{p}) is that approximately (100p)% of the data fall below the value (\tilde{q}_{p}).
Keep in mind that there is not a unique definition of percentiles, quartiles, etc. Open a different book, and you'll find a different definition. The difference is small and seldom plays a role except in small data sets with repeated values. In fact, most people do not even notice in common use.
Clearly, the most popular sample quantile is (\tilde{q}_{0.50}), also known as the sample median, (\tilde{x}). The closest runners-up are the first quartile (\tilde{q}_{0.25}) and the third quartile (\tilde{q}_{0.75}) (the second quartile is the median).
At the command prompt we can find the order statistics of a data set stored in a vector `x` with the command `sort(x)`.
We can calculate the sample quantiles of any order (p), where (0<p<1), for a data set stored in a data vector `x` with the `quantile` function. For instance, the command `quantile(x, probs = c(0, 0.25, 0.37))` will return the smallest observation, the first quartile (\tilde{q}_{0.25}), and the 37th sample quantile (\tilde{q}_{0.37}). For (\tilde{q}_{p}) simply change the values in the `probs` argument to the value (p).
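A brief sketch with a small made-up vector (the same commands apply to any numeric data):

```r
x <- c(5, 2, 9, 4, 7, 1)
sort(x)                                   # the order statistics x_(1), ..., x_(6)
quantile(x, probs = c(0.25, 0.50, 0.75))  # quartiles, default type = 7
```

The quartiles here fall between order statistics, so `quantile` interpolates linearly between the adjacent sorted values.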
With the R Commander we can find the order statistics of a variable in the Active data set by doing `Data` (\triangleright) `Manage variables in Active data set...` (\triangleright) `Compute new variable...` In the `Expression to compute` dialog simply type `sort(varname)`, where `varname` is the variable we wish to sort.
In `Rcmdr`, we can calculate the sample quantiles for a particular variable with the sequence `Statistics` (\triangleright) `Summaries` (\triangleright) `Numerical Summaries...` We can automatically calculate the quartiles for all variables in the Active data set with the sequence `Statistics` (\triangleright) `Summaries` (\triangleright) `Active Dataset`.
The sample variance is denoted (s^{2}) and is calculated with the formula \begin{equation} s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}. \end{equation} The sample standard deviation is (s=\sqrt{s^{2}}). Intuitively, the sample variance is approximately the average squared distance of the observations from the sample mean. The sample standard deviation is used to scale the estimate back to the measurement units of the original data.
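A minimal check with a made-up vector shows that the defining formula agrees with R's `var` and `sd`:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
s2 <- sum((x - mean(x))^2) / (n - 1)  # sample variance by the formula
s2          # matches var(x)
sqrt(s2)    # the sample standard deviation, matches sd(x)
```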
We will spend a lot of time with the variance and standard deviation in the coming chapters. In the meantime, the following two rules give some meaning to the standard deviation, in that there are bounds on how much of the data can fall past a certain distance from the mean.
Fact: (Chebychev's Rule). The proportion of observations within (k) standard deviations of the mean is at least (1-1/k^{2}), i.e., at least 75%, 89%, and 94% of the data are within 2, 3, and 4 standard deviations of the mean, respectively.
Note that Chebychev's Rule does not say anything about when (k=1), because (1-1/1^{2}=0), which states that at least 0% of the observations are within one standard deviation of the mean (which is not saying much).
Chebychev's Rule applies to any data distribution, any list of numbers, no matter where it came from or what the histogram looks like. The price for such generality is that the bounds are not very tight; if we know more about how the data are shaped then we can say more about how much of the data can fall a given distance from the mean.
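As an illustration (using the built-in `precip` data, chosen here only for convenience), we can verify the (k=2) bound empirically:

```r
x <- precip  # annual precipitation in 70 US cities
# proportion of observations within 2 standard deviations of the mean
p2 <- mean(abs(x - mean(x)) <= 2 * sd(x))
p2 >= 1 - 1/2^2   # Chebychev guarantees at least 0.75
```

In practice the observed proportion is usually well above the guaranteed lower bound, which is the price Chebychev's Rule pays for its generality.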
Fact: (Empirical Rule). If data follow a bell-shaped curve, then approximately 68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard deviations of the mean, respectively.
Just as the sample mean is sensitive to extreme values, so the associated measure of spread is similarly sensitive to extremes. Further, the problem is exacerbated by the fact that the extreme distances are squared. We know that the sample quartiles are resistant to extremes, and a measure of spread associated with them is the interquartile range ((IQR)) defined by (IQR=q_{0.75}-q_{0.25}).
A measure even more robust than the (IQR) is the median absolute deviation ((MAD)). To calculate it we first get the median (\widetilde{x}), next the absolute deviations (|x_{1}-\tilde{x}|), (|x_{2}-\tilde{x}|), ..., (|x_{n}-\tilde{x}|), and the (MAD) is proportional to the median of those deviations: \begin{equation} MAD\propto\mbox{median}(|x_{1}-\tilde{x}|,\ |x_{2}-\tilde{x}|,\ldots,|x_{n}-\tilde{x}|). \end{equation} That is, the (MAD=c\cdot\mbox{median}(|x_{1}-\tilde{x}|,\ |x_{2}-\tilde{x}|,\ldots,|x_{n}-\tilde{x}|)), where (c) is a constant chosen so that the (MAD) has nice properties. The value of (c) in R is by default (c=1.4826). This value is chosen to ensure that the estimator of (\sigma) is correct, on the average, under suitable sampling assumptions (see Section \@ref(sec-point-estimation)).
We have seen three different measures of spread which, for a given data set, will give three different answers. Which one should we use? It depends on the data set. If the data are well behaved, with an approximate bell-shaped distribution, then the sample mean and sample standard deviation are natural choices with nice mathematical properties. However, if the data have an unusual or skewed shape with several extreme values, perhaps the more resistant choices among the (IQR) or (MAD) would be more appropriate.
However, once we are looking at the three numbers it is important to understand that the estimators are not all measuring the same quantity, on the average. In particular, it can be shown that when the data follow an approximately bell-shaped distribution, then on the average, the sample standard deviation (s) and the (MAD) will be approximately the same value, namely, (\sigma), but the (IQR) will be on the average 1.349 times larger than (s) and the (MAD). See \@ref(cha-sampling-distributions) for more details.
At the command prompt we may compute the sample range with `range(x)` and the sample variance with `var(x)`, where `x` is a numeric vector. The sample standard deviation is `sqrt(var(x))` or just `sd(x)`. The (IQR) is `IQR(x)` and the median absolute deviation is `mad(x)`.
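A short sketch (with a made-up vector containing one extreme value) of how the three spread measures react differently:

```r
# sketch: how the spread measures react to one extreme value
x <- c(10, 12, 13, 15, 41)
sd(x)    # inflated by the extreme observation 41
IQR(x)   # resistant
mad(x)   # resistant; equals 1.4826 * median(|x_i - median(x)|)
1.4826 * median(abs(x - median(x)))
```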
With the R Commander we can calculate the sample standard deviation with the `Statistics` (\triangleright) `Summaries` (\triangleright) `Numerical Summaries...` combination. R Commander does not calculate the (IQR) or (MAD) in any of the menu selections, by default.
The sample skewness, denoted by (g_{1}), is defined by the formula
\begin{equation}
g_{1}=\frac{1}{n}\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})^{3}}{s^{3}}.
\end{equation}
The sample skewness can be any value, (-\infty<g_{1}<\infty). The sign of (g_{1}) indicates the direction of skewness: samples with (g_{1}>0) are skewed to the right, and samples with (g_{1}<0) are skewed to the left.
We still need to know how big is "big", that is, how do we judge whether an observed value of (g_{1}) is far enough away from zero for the data set to be considered skewed to the right or left? A good rule of thumb is that data sets with skewness larger than (2\sqrt{6/n}) in magnitude are substantially skewed, in the direction of the sign of (g_{1}). See Tabachnick & Fidell [@Tabachnick2006] for details.
The sample excess kurtosis, denoted by (g_{2}), is given by the formula \begin{equation} g_{2}=\frac{1}{n}\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})^{4}}{s^{4}}-3. \end{equation} The sample excess kurtosis takes values (-2\leq g_{2}<\infty). The subtraction of 3 may seem mysterious but it is done so that mound shaped samples have values of (g_{2}) near zero. Samples with (g_{2}>0) are called leptokurtic, and samples with (g_{2}<0) are called platykurtic. Samples with (g_{2}\approx0) are called mesokurtic.
As a rule of thumb, if (|g_{2}|>4\sqrt{6/n}) then the sample excess kurtosis is substantially different from zero in the direction of the sign of (g_{2}). See Tabachnick & Fidell [@Tabachnick2006] for details.
Notice that both the sample skewness and the sample kurtosis are invariant with respect to location and scale, that is, the values of (g_{1}) and (g_{2}) do not depend on the measurement units of the data.
The `e1071` package [@e1071] has the `skewness` function for the sample skewness and the `kurtosis` function for the sample excess kurtosis. Both functions have a `na.rm` argument which is `FALSE` by default.
\bigskip
We said earlier that the `discoveries` data looked positively skewed; let's see what the statistics say:
```r
e1071::skewness(discoveries)
2*sqrt(6/length(discoveries))
```
The data are definitely skewed to the right. Let us check the sample excess kurtosis of the `UKDriverDeaths` data:

```r
kurtosis(UKDriverDeaths)
4*sqrt(6/length(UKDriverDeaths))
```

so that the `UKDriverDeaths` data appear to be mesokurtic, or at least not substantially leptokurtic.
This field was founded (mostly) by John Tukey (1915-2000). Its tools are useful when not much is known regarding the underlying causes associated with the data set, and are often used for checking assumptions. For example, suppose we perform an experiment and collect some data... now what? We look at the data using exploratory visual tools.
There are many bells and whistles associated with stemplots, and the `stem.leaf` function can do many of them. The `trim.outliers` argument (which is `TRUE` by default) will separate the extreme observations from the others and graph the stemplot without them; they are listed at the bottom (respectively, top) of the stemplot with the label `HI` (respectively `LO`).

By default, all observations with the same stem are plotted on the same line, regardless of the value of the second digit. But this gives some stemplots a "skyscraper" appearance, with too many observations stacked onto the same stem. We can often fix the display by increasing the number of lines available for a given stem. For example, we could make two lines per stem, say, `3*` and `3.`. Observations with second digit 0 through 4 would go on the upper line, while observations with second digit 5 through 9 would go on the lower line. (We could do a similar thing with five lines per stem, or even ten lines per stem.) The end result is a more spread out stemplot which often looks better. A good example of this was shown on page \pageref{exa-stemleaf-multiple-lines-stem}.

The basic command is `stem(x)`, or a more sophisticated version written by Peter Wolf called `stem.leaf(x)` in the R Commander. We will describe `stem.leaf` since that is the one used by R Commander.
WARNING: Sometimes when making a stem-and-leaf display the result will not be what you expected. There are several reasons for this:

- The leaf digit is chosen by `stem.leaf` according to an algorithm that the computer believes will represent the data well. Depending on the choice of the digit, `stem.leaf` may drop digits from the data or round the values in unexpected ways.

Let us take a look at the `rivers` data set.
```r
stem.leaf(rivers)
```
The stem-and-leaf display shows a right-skewed shape to the `rivers` data distribution. Notice that the last digit of each of the data values was dropped from the display. Notice also that there were eight extreme observations identified by the computer, and their exact values are listed at the bottom of the stemplot. Look at the scale on the left of the stemplot and try to imagine how ridiculous the graph would have looked had we tried to include enough stems to include these other eight observations; the stemplot would have stretched over several pages. Notice finally that we can use the depths to approximate the sample median for these data. The median lies in the row identified by `(18)`, which means that the median is the average of the ninth and tenth observation on that row. Those two values correspond to `43` and `43`, so a good guess for the median would be 430. (For the record, the sample median is (\widetilde{x}=425). Recall that stemplots round the data to the nearest stem-leaf pair.)
Next let us see what the `precip` data look like.
```r
stem.leaf(precip)
```
Here is an example of split stems, with two lines per stem. The final digit of each datum has been dropped for the display. The data appear to be left skewed with four extreme values to the left and one extreme value to the right. The sample median is approximately 37 (it turns out to be 36.6).
Given a data set (x_{1}), (x_{2}), ..., (x_{n}), the hinges are found by the following method:
1. Find the order statistics (x_{(1)}), (x_{(2)}), ..., (x_{(n)}).
1. The lower hinge (h_{L}) is in position (L=\left\lfloor (n+3)/2\right\rfloor / 2), where the symbol $\left\lfloor x\right\rfloor$ denotes the largest integer less than or equal to (x). If the position (L) is not an integer, then the hinge (h_{L}) is the average of the adjacent order statistics.
1. The upper hinge (h_{U}) is in position (n+1-L).
Given the hinges, the five number summary ((5NS)) is \begin{equation} 5NS=(x_{(1)},\ h_{L},\ \tilde{x},\ h_{U},\ x_{(n)}). \end{equation} An advantage of the (5NS) is that it reduces a potentially large data set to a shorter list of only five numbers, and further, these numbers give insight regarding the shape of the data distribution similar to the sample quantiles in Section \@ref(sub-order-statistics).
If the data are stored in a vector `x`, then you can compute the (5NS) with the `fivenum` function.
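For example, with the built-in `rivers` data:

```r
fivenum(rivers)  # min, lower hinge, median, upper hinge, max
```

The middle entry is the sample median 425 noted earlier, and the two hinges summarize where the central half of the data lie.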
A boxplot is essentially a graphical representation of the (5NS). It can be a handy alternative to a stripchart when the sample size is large.
A boxplot is constructed by drawing a box alongside the data axis with sides located at the upper and lower hinges. A line is drawn parallel to the sides to denote the sample median. Lastly, whiskers are extended from the sides of the box to the maximum and minimum data values (more precisely, to the most extreme values that are not potential outliers, defined below).
Boxplots are good for quick visual summaries of data sets, and the relative positions of the values in the (5NS) are good at indicating the underlying shape of the data distribution, although perhaps not as effectively as a histogram. Perhaps the greatest advantage of a boxplot is that it can help to objectively identify extreme observations in the data set as described in the next section.
Boxplots are also good because one can visually assess multiple features of the data set simultaneously: the center, the spread, the symmetry (or skewness), and the presence of outliers.
A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than (h_{L}-1.5(h_{U}-h_{L})) or greater than (h_{U}+1.5(h_{U}-h_{L})). A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.
When potential outliers are present, the whiskers of the boxplot are then shortened to extend to the most extreme observation that is not a potential outlier. If an outlier is displayed in a boxplot, the index of the observation may be identified in a subsequent plot in `Rcmdr` by clicking the `Identify outliers with mouse` option in the `Boxplot` dialog.
What do we do about outliers? They merit further investigation. The primary goal is to determine why the observation is outlying, if possible. If the observation is a typographical error, then it should be corrected before continuing. If the observation is from a subject that does not belong to the population of interest, then perhaps the datum should be removed. Otherwise, perhaps the value is hinting at some hidden structure to the data.
The quickest way to visually identify outliers is with a boxplot, described above. Another way is with the `boxplot.stats` function.
\bigskip
```{example, name="Lengths of Major North American Rivers"}
We will look for potential outliers in the `rivers` data.
```
```r
boxplot.stats(rivers)$out
```
We may change the `coef` argument to 3 (it is 1.5 by default) to identify suspected outliers.
```r
boxplot.stats(rivers, coef = 3)$out
```
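The fences can also be computed by hand from the hinges; a short sketch, which recovers the same observations that `boxplot.stats` flags:

```r
h <- fivenum(rivers)[c(2, 4)]     # lower and upper hinges
step <- 1.5 * diff(h)             # 1.5 times the width of the box
out <- rivers[rivers < h[1] - step | rivers > h[2] + step]
setequal(out, boxplot.stats(rivers)$out)  # TRUE
```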
It is sometimes useful to compare data sets with each other on a scale that is independent of the measurement units. Given a set of observed data (x_{1}), (x_{2}), ..., (x_{n}) we get (z) scores, denoted (z_{1}), (z_{2}), ..., (z_{n}), by means of the following formula [ z_{i}=\frac{x_{i}-\overline{x}}{s},\quad i=1,\,2,\,\ldots,\, n. ]
The `scale` function will rescale a numeric vector (or data frame) by subtracting the sample mean from each value (column) and/or by dividing each observation by the sample standard deviation.
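A minimal check (with a made-up vector) that `scale` matches the (z)-score formula:

```r
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)           # z scores by the formula
all.equal(as.numeric(scale(x)), z)   # scale returns a one-column matrix
```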
We have had experience with vectors of data, which are long lists of numbers. Typically, each entry in the vector is a single measurement on a subject or experimental unit in the study. We saw in Section \@ref(sub-vectors) how to form vectors with the `c` function or the `scan` function.
However, statistical studies often involve experiments where there are two (or more) measurements associated with each subject. We display the measured information in a rectangular array in which each row corresponds to a subject, and the columns contain the measurements for each respective variable. For instance, if one were to measure the height and weight and hair color of each of 11 persons in a research study, the information could be represented with a rectangular array. There would be 11 rows and three columns: each row would have the person's height in the first column, weight in the second column, and hair color in the third column.
The corresponding objects in R are called data frames, and they can be constructed with the `data.frame` function. Each row is an observation, and each column is a variable.
\bigskip
Suppose we have two vectors `x` and `y` and we want to make a data frame out of them.
```r
x <- 5:8
y <- letters[3:6]
A <- data.frame(v1 = x, v2 = y)
```
Notice that `x` and `y` are the same length. This is necessary. Also notice that `x` is a numeric vector and `y` is a character vector. We may choose numeric and character vectors (or even factors) for the columns of the data frame, but each column must be of exactly one type. That is, we can have a column for `height` and a column for `gender`, but we will get an error if we try to mix `height` (numeric) and `gender` (character or factor) information in the same column.
Indexing of data frames is similar to indexing of vectors. To get the entry in row (i) and column (j) do `A[i, j]`. We can get entire rows and columns by omitting the other index.
```r
A[3, ]
A[ , 1]
A[ , 2]
```
There are several things happening above. Notice that `A[3, ]` gave a data frame (with the same entries as the third row of `A`) yet `A[ , 1]` is a numeric vector. `A[ , 2]` is a factor vector because the default setting for `data.frame` is `stringsAsFactors = TRUE`.
Data frames have a `names` attribute and the names may be extracted with the `names` function. Once we have the names we may extract given columns by way of the dollar sign.
```r
names(A)
A$v1
```

The above is identical to `A[ , 1]`.
The sample Pearson product-moment correlation coefficient: [ r=\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-\overline{y})^{2}}} ]
It measures the strength and direction of the linear association between the two variables.
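As a check (with a small made-up pair of vectors), the formula agrees with R's `cor`:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# Pearson correlation by the defining formula
r <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
all.equal(r, cor(x, y))   # TRUE
```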
Two-Way Tables. Done with `table`, or in the R Commander by following `Statistics` (\triangleright) `Contingency Tables` (\triangleright) `Two-way Tables`. You can also enter and analyze a two-way table.
Multivariate Data Display. Done with `table`, or in the R Commander by following `Statistics` (\triangleright) `Contingency Tables` (\triangleright) `Multi-way Tables`.

```r
plot(state.region, state.division)
```
```r
barplot(table(state.division, state.region), legend.text = TRUE)
```
```r
require(graphics)
mosaicplot(HairEyeColor)
x <- apply(HairEyeColor, c(1, 2), sum)
x
mosaicplot(x, main = "Relation between hair and eye color")
y <- apply(HairEyeColor, c(1, 3), sum)
y
mosaicplot(y, main = "Relation between hair color and sex")
z <- apply(HairEyeColor, c(2, 3), sum)
z
mosaicplot(z, main = "Relation between eye color and sex")
```
Sometimes we have data from two or more groups (or populations) and we would like to compare them and draw conclusions. Some issues that we would like to address:
I am thinking here about the `Statistics` (\triangleright) `Numerical Summaries` (\triangleright) `Summarize by groups` option or the `Statistics` (\triangleright) `Summaries` (\triangleright) `Table of Statistics` option.
```r
xyplot(Petal.Width ~ Petal.Length, data = iris, group = Species)
```
(ref:cap-xyplot) \small Scatterplot of Petal width versus length in the `iris` data.
A scatterplot matrix displays the pairwise linear associations (positive and negative) among several variables at once:

```r
splom(~ cbind(Fertility, Agriculture, Examination), data = swiss)
```
Dot charts:

```r
dotplot(Admit ~ Freq | Dept, groups = Gender, data = C)
```
Mosaic plot:

```r
library(vcd)  # the mosaic function is provided by the vcd package
mosaic(~ Admit + Dept + Gender, data = UCBAdmissions)
```
Spine plots
Given two samples (x_{1}), (x_{2}), ..., (x_{n}), and (y_{1}), (y_{2}), ..., (y_{n}), we may find the order statistics (x_{(1)}\leq x_{(2)}\leq\cdots\leq x_{(n)}) and (y_{(1)}\leq y_{(2)}\leq\cdots\leq y_{(n)}). Next, plot the (n) points ((x_{(1)},y_{(1)})), ((x_{(2)},y_{(2)})), ..., ((x_{(n)},y_{(n)})).
It is clear that if (x_{(k)}=y_{(k)}) for all (k=1,2,\ldots,n), then we will have a straight line. It is also clear that in the real world, a straight line is NEVER observed, and instead we have a scatterplot that hopefully has a general linear trend. What do the rules tell us?
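A sketch of the construction (with simulated data, an assumption for illustration): for two samples of equal size, `qqplot` simply plots the sorted values against each other.

```r
set.seed(1)                # hypothetical simulated samples
x <- rnorm(50)
y <- rnorm(50, mean = 2)
# with plot.it = FALSE, qqplot returns the plotted coordinates
q <- qqplot(x, y, plot.it = FALSE)
identical(q$x, sort(x)) && identical(q$y, sort(y))  # TRUE
```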
The following types of plots are useful when there is one variable of interest and there is a factor in the data set by which the variable is categorized.
It is sometimes nice to set `lattice.options(default.theme = "col.whitebg")`.
```r
bwplot(~weight | feed, data = chickwts)
```
(ref:cap-bwplot) \small Boxplots of `weight` by `feed` type in the `chickwts` data.
```r
histogram(~age | education, data = infert)
```
(ref:cap-histg) \small Histograms of `age` by `education` level from the `infert` data.
```r
xyplot(Petal.Length ~ Petal.Width | Species, data = iris)
```
(ref:cap-xyplot-by) \small An `xyplot` of `Petal.Length` versus `Petal.Width` by `Species` in the `iris` data.
```r
coplot(conc ~ uptake | Type * Treatment, data = CO2)
```
(ref:cap-coplot) \small A `coplot` of `conc` versus `uptake` by `Type` and `Treatment`.
Open R and issue the following commands at the command line to get started. Note that you need to have the `RcmdrPlugin.IPSUR` package [@RcmdrPlugin.IPSUR] installed, and for some exercises you need the `e1071` package [@e1071].
```r
library("RcmdrPlugin.IPSUR")
data(RcmdrTestDrive)
attach(RcmdrTestDrive)
names(RcmdrTestDrive)
```
To load the data in the R Commander (`Rcmdr`), click the `Data Set` button, and select `RcmdrTestDrive` as the active data set. To learn more about the data set and where it comes from, type `?RcmdrTestDrive` at the command line.
```{block, type="xca", label="xca-summary-RcmdrTestDrive"}
Perform a summary of all variables in `RcmdrTestDrive`. You can do this with the command `summary(RcmdrTestDrive)`. Alternatively, you can do this in the `Rcmdr` with the sequence `Statistics` \(\triangleright\) `Summaries` \(\triangleright\) `Active Data Set`. Report the values of the summary statistics for each variable.
```
\bigskip

```{block, type="xca"}
Make a table of the `race` variable. Do this with `Statistics` \(\triangleright\) `Summaries` \(\triangleright\) `Frequency Distributions - IPSUR...`

1. Which ethnicity has the highest frequency?
1. Which ethnicity has the lowest frequency?
1. Include a bar graph of `race`. Do this with `Graphs` \(\triangleright\) `IPSUR - Bar Graph...`
```
\bigskip

```{block, type="xca"}
Calculate the average `salary` by the factor `gender`. Do this with `Statistics` \(\triangleright\) `Summaries` \(\triangleright\) `Table of Statistics...`

1. Which `gender` has the highest mean `salary`?
1. Report the highest mean `salary`.
1. Calculate the standard deviation of `salary` by `gender`. Which `gender` has the biggest standard deviation?
1. Make boxplots of `salary` by `gender` with the following method:
    i) On the `Rcmdr`, click `Graphs` \(\triangleright\) `IPSUR - Boxplot...`
    i) In the `Variable` box, select `salary`.
    i) Click the `Plot by groups...` box and select `gender`. Click `OK`.
    i) Click `OK` to graph the boxplot.
1. How does the boxplot compare to your answers to (1) and (3)?
```
\bigskip

```{block, type="xca"}
For this problem we will study the variable `reduction`.

1. Find the order statistics and store them in a vector `x`. *Hint:* `x <- sort(reduction)`
1. Find \(x_{(137)}\), the 137\(^{\mathrm{th}}\) order statistic.
1. Find the IQR.
1. Find the Five Number Summary (5NS).
1. Use the 5NS to calculate what the width of a boxplot of `reduction` would be.
1. Compare your answers (3) and (5). Are they the same? If not, are they close?
1. Make a boxplot of `reduction`, and include the boxplot in your report. You can do this with the `boxplot` function, or in `Rcmdr` with `Graphs` \(\triangleright\) `IPSUR - Boxplot...`
1. Are there any potential/suspected outliers? If so, list their values. *Hint:* use your answer to (a).
1. Using the rules discussed in the text, classify answers to (8), if any, as *potential* or *suspected* outliers.
```
\bigskip

```{block, type="xca"}
In this problem we will compare the variables `before` and `after`. Don't forget `library("e1071")`.

1. Which measure of center is more appropriate for `before`? (You may want to look at a boxplot.) Which measure of center is more appropriate for `after`?
1. Calculate the sample skewness and sample excess kurtosis of `before`. Based on these values, how would you describe the shape of `before`?
1. Calculate the sample skewness and sample excess kurtosis of `after`. Based on these values, how would you describe the shape of `after`?
1. Plot histograms of `before` and `after` and compare them to your answers to (2) and (3).
```

\bigskip

```{block, type="xca"}
Describe the following data sets just as if you were communicating with an alien, but one who has had a statistics class. Mention the salient features (data type, important properties, anything special). Support your answers with the appropriate visual displays and descriptive statistics.

1. Conversion rates of Euro currencies stored in `euro`.
2. State abbreviations stored in `state.abb`.
3. Areas of the world's landmasses stored in `islands`.
4. Areas of the 50 United States stored in `state.area`.
5. Region of the 50 United States stored in `state.region`.
```