knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
R was developed as a language for statistical analysis and as such has a lot of built-in tools for all kinds of procedures. More are added as they are developed. To describe them all would take a book. This is but a short introduction - not into statistics but rather into the way it is implemented in R.
Functions exist to estimate sample parameters: mean for the mean, median for the median, var for variance, sd for standard deviation, min, max and range for the extremities, quantile for quantiles, IQR for the inter-quartile range, and many more. Of course, writing functions that compute all these parameters is not especially difficult, and the difference between R and other languages in this department is quantitative rather than qualitative. Don't worry, it gets better.
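For instance, all of these functions can be applied to the same numeric vector; the random sample below is used purely for illustration:
x <- rnorm(50)
mean(x)
median(x)
sd(x)
range(x)
quantile(x, probs = c(0.25, 0.75))
IQR(x)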
There are functions that simulate distributions, e.g. rnorm generates a random sample from a normal distribution, runif from a uniform one, etc. See ?distributions for a full list.
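Each of these functions takes the sample size as its first argument, followed by the distribution parameters, e.g.:
rnorm(5, mean = 0, sd = 1)
runif(5, min = 0, max = 1)
rbinom(5, size = 10, prob = 0.5)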
A wide range of statistical tests is also implemented in built-in functions. Perhaps the most commonly used statistical test in biology is Student's t test (whether or not that is a proper course of action is another matter). The test is performed by the t.test function, which can compare two samples passed as vectors.
t.test(x = rnorm(100, 0), y = rnorm(100, 1))
It also has a formula interface, in which a column in a data frame is divided according to a factor column with two levels.
t.test(len ~ supp, data = ToothGrowth)
t.test returns an object of class htest. This is a list that contains all the information about the test and all of its results. The printed information is a complete description of the test, more than sufficient to cite in a publication.
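Since the result is a list, its elements, such as the p value or the confidence interval, can be extracted directly (the object name tt is arbitrary):
tt <- t.test(len ~ supp, data = ToothGrowth)
names(tt)
tt$p.value
tt$conf.int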
Switching to a non-parametric test like the Mann-Whitney-Wilcoxon test is super easy, as the function that runs it has the same syntax.
wilcox.test(Sepal.Length ~ Species, data = iris, subset = Species != 'setosa')
The result is also an htest object. Many other test functions return similar objects.
R also contains a number of modelling functions. They identify dependencies between variables in a data set. The simplest model to describe a two-dimensional data set is a straight line, which is most often found by the method of least squares. Briefly, different straight lines are drawn through the scatter, and their distances from the actual data on the y axis are squared and summed. The line for which the sum of squared deviations is the lowest is considered the best fit to the data.
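As a rough sketch of the idea, the sum of squared deviations for a single candidate line can be computed by hand; the intercept and slope below are arbitrary guesses, not the best fit:
# sum of squared deviations from the candidate line y = a + b * x
a <- -2
b <- 0.08
sum((faithful$eruptions - (a + b * faithful$waiting))^2)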
Linear regression is done with the lm function. It is called with a formula that has the independent variables, or predictors, on the right-hand side and the dependent variable, or response, on the left-hand side. The syntax for lm is exactly the same as for plot:
plot(eruptions ~ waiting, data = faithful)
lm(eruptions ~ waiting, data = faithful)
lm returns an object of class lm, which has a list structure. It contains the original data and all the computed information. It also stores the call used to build the model, so the model can even be modified afterwards.
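For example, the stored call can be inspected and reused with update, here refitting the model on a subset of the data (the object name fit and the cut-off are chosen purely for illustration):
fit <- lm(eruptions ~ waiting, data = faithful)
fit$call
update(fit, subset = waiting > 60)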
What is printed in the console is not the entirety of the model, only the most basic information: the formula and the coefficients that describe the straight line. This is what the print method for class lm displays. More information can be accessed with summary.
mod <- lm(eruptions ~ waiting, data = faithful)
summary(mod)
Again, there is the formula and the coefficients. However, now there are also some parameters that help judge their estimation: the t value reflects how far from zero the coefficient estimate lies, and Pr(>|t|) estimates the probability of obtaining a t value higher than the one obtained in the model.
There is a summary of the residuals, i.e. the deviations of the actual values from the model. Normality of residuals is one of the key indicators of model quality, so it helps to check at least whether the quartiles are roughly symmetric. Finally, the statistical significance of the model is reported at the bottom, as is R-squared, which corresponds to the fraction of variation in the data explained by the model. This is another indicator of model quality: the closer to 1, the better.
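The summary itself has a list structure, so these quantities can also be extracted programmatically (the object name s is arbitrary):
s <- summary(mod)
s$r.squared
s$coefficients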
Particular elements of the model can be accessed directly, as list elements. There are also convenience functions for extracting certain elements, like coef, which extracts model coefficients, or fitted, which extracts fitted values.
mod$coefficients
coef(mod)
coefficients(mod)
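Similarly, fitted and residuals return the fitted values and the residuals; head is used here only to shorten the output:
head(fitted(mod))
head(residuals(mod))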
Coefficients can be conveniently passed directly to the coef argument of abline to add the model to a plot.
plot(eruptions ~ waiting, data = faithful)
abline(coef = coef(mod))
A model can be assessed further by plotting it. The lm method for plot creates four diagnostic graphs with outlying observations marked.
layout(mat = matrix(1:4, 2, 2))
plot(mod)
As a rule, all modelling functions behave this way: they are called with a formula and they return an object of their own class, which can be summarized and used for predictions (see below). The built-in glm function (for generalized linear models) can transform the data before the fitting so that relationships other than linear can be modelled. More functions exist in external packages and they all keep to the same procedure.
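For instance, a logistic regression is built with the same kind of call; the model below, which predicts transmission type from car weight in the built-in mtcars data, is meant only to show the syntax:
glm(am ~ wt, data = mtcars, family = binomial)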
Once a model is ready, or trained, it can be used to predict the response in other data sets.
newdata <- data.frame(waiting = runif(25, 40, 100))
predict(mod, newdata)
The model describes a linear function. predict can access the formula within the model and computes the left-hand side (the response) from the right-hand side terms (the predictors) in the new data. The computed values can, of course, be added to the data set.
newdata$predicted <- predict(mod, newdata)
plot(eruptions ~ waiting, faithful)
points(predicted ~ waiting, newdata, col = 'red', pch = 16)
To be fair, linear regression is not the best example of model-driven prediction, although in principle it can be used to interpolate and extrapolate data, if the model is good enough. Prediction is something we usually do with classifiers: models whose response is to assign observations to two or more categories. Functions that build classifiers operate on the same principles.
All statistical methods work on certain assumptions, which should be observed if the method is to produce valid results. Perhaps the most frequent (and frequently ignored) assumption is that of sample normality. There are several methods to find out if a sample comes from a normal distribution.
The first way is to look at the distribution. Granted, the human eye is not very good at noticing details in distributions but gross departures from normality are readily apparent.
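A histogram is the simplest way to look; as a quick sketch, compare a normal sample to a uniform one:
layout(matrix(1:2, 1, 2))
hist(rnorm(1e3), main = 'normal sample')
hist(runif(1e3), main = 'uniform sample')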
A more systematic look can be obtained with a quantile plot. This is a statistical device for comparing distributions. Given two samples, their quantiles are computed and plotted against one another. If the relationship is linear, the two samples can, to some degree of confidence, be said to come from the same kind of distribution.
To test a sample for normality, one compares it to a random sample from a normal distribution. The method does not produce a numeric result to which a cut-off value could be applied; the user must decide whether the plot looks good enough. Naturally, real-world data will nearly always show some irregularity, slight departures from perfect normality.
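In base R this is done with qqnorm, which plots the sample quantiles against theoretical normal quantiles, and qqline, which adds a reference line:
x <- rnorm(100)
qqnorm(x)
qqline(x)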
There are also statistical tests for data normality. The most common is the Shapiro-Wilk test. The null hypothesis in this test is that the sample does come from a normal distribution. If the null hypothesis can be rejected, it can be concluded that the sample is not normal.
shapiro.test(rnorm(1e3))
shapiro.test(runif(1e3))