knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(mosaic) library(ggformula) library(mosaicCalc) library(palmerpenguins) library(knitr)
Installing
First, install R and RStudio according to the instructions given by your instructor. Opening R or RStudio will present you with an "R console."{mosaicCalc}
Then, you need to install the {mosaicCalc}
software and a lot of related software. Do this from your R console:
install.packages("mosaic", "ggformula", "mosaicCalc")
This will take some minutes, but the process is entirely automatic.
You only need to install the packages once. If you switch to another computer, however, you may need to install the packages on that computer.
Starting an R session
When you sit down to work with {mosaicCalc}
, you need to have an "R session" started. This is typically accomplished by clicking on the RStudio icon:
Alternatively, you may still have an R session open from your previous activities.
Before you can use {mosaicCalc}
, you need to tell R to access the software. This is done with
library(mosaicCalc)
If you fail to do this, when you start using {mosaicCalc}
functions you will encounter an error message in this format
Error in makeFun() : could not find function "makeFun"
To fix this, go back and run the library(mosaicCalc)
command again.
Assignment: Giving names to values
x <- 2
x
<-
is the assignment operator<-
. But never put a space between the two characters in the arrow.2 <- x
might make sense to a human, but in R it will lead to an error message, "invalid(do_set) left-hand side to assignment." x
(or any other name)
.
and _
Function application: Evaluating a function on an input
sqrt(3)
sqrt
is the name of a functionsin
. Another word for input is argument. sin(3)
causes the sine calculation to be carried out. sqrt(x)
. For this to work, x
must have been previously assigned to a value. If there is no such assignment, you'll get an error code such as "object \"x\" not found."
Writing mathematical formulas
(3 + 7 * sin(2*pi*x)) / (x^2 - 4)
(3 + 7 * sin(2*pi*x))
are grouping parentheses. Grouping parentheses do what you would expect: govern the order of the operations.sin
in sin(2*pi*x)
are function evaluation parentheses. These signal to R that you want to evaluate the function on the input contained inside the parentheses.*
character.^
character, as in x^2
x
) used in the formula. Typically you will write formulas in tilde expressions. (See below.)
Tilde expressions
A tilde expression is a special feature of R that let's you write down a mathematical expression without having the expression evaluated. We use tilde expressions to construct our own mathematical functions and for a handful of other, related purposes such as plotting.
y^2 + 3*y ~ y
y^2 + 3*y
is the left-hand side and plain, ordinary y
is on the right-hand side.+
or &
(either will do). The +
does not mean "add them," it's merely a way of saying "and." For instance, the right-hand side in `3xexp(y) ~ x + y"
Constructing your own mathematical functions
g <- makeFun(2 + 3*x - 7*x^2 ~ .)
g <- makeFun(2 + 3*x - 7*x^2 ~ .)
g
corresponding to $g(x)\equiv 2 + 3x - 7\,x^2$makeFun()
is a function constructormakeFun()
is a tilde expressionmakeFun()
the right side should be a simple period (.
) makeFun()
is case sensitive. So using makefun
(note: lower-case f
in fun
) won't work.g
, you will get the output
r
g
The first line tells you that you defined g as a function of x. The second line tells the formula for it. The third line is relevant only to advanced R programmers; you can ignore it.
Graphing functions with a single input
slice_plot(exp(x) ~ x, bounds(x = -3 : 3))
{mosaicCalc}
commands.bounds()
function to a named argument.x
and the R expression is $-3 : 3$. That R expression means "from -3 to 3." The colon (:
) is essential punctuation; it's what separates grammatically the lower end of the bounds from the upper end. (For experienced R users: If you prefer to use the form c(-3,3)
instead of the colon, that works. For specialized R programmers: The ordinary integer orientation of :
is suspended within bounds()
so the limits you give will be treated exactly.) slice_plot()
because we think of a function of one input as a slice of a multivariable surface, but we won’t get into that immediately.bounds()
will always be named. That name should match the one on the right-hand side of the tilde expression. :
to collect together the lower and upper bounds of the graphics frame. To graph the function from 0 to 10, you would replace the -3:3
with 0:10
.color="blue"
. To see the full list of named colors in R, give the command colors()
in a sandbox. (You need the empty parentheses, otherwise you'll be shown the software underlying the list of colors.) Notice that the color should always be in quotes, e.g. "green"
or "darkorchid"
.size
argument, for instance size=2
.alpha = 0.2
. The alpha value can range from 0.0 (completely invisible) to 1.0 (completely opaque).slice_plot(exp(x)~x, bounds(-3:3)))
(leaving out the name x
inside bounds()
), you get an error of, "Bounds involves 0 variables, but function has 1"slice_plot(exp(x)~x, bounds(t=-3:3))
(mismatching names in tilde expression and bounds), you get an error of, "Bounds has variable(s) t but function has argument(s) named x"slice_plot(exp(x), bounds(x=-3:3))
(forgetting the ~x
in the tilde expression), you get an error of, "object 'x' not found"slice_plot(exp(x)~x)
, you get an error of, "bounds must be specified when there is no preceeding layer." Aside from the misspelling of preceding, this is telling the user that there wasn’t a previous graph to pull the bounds from, so you need to specify the bounds as an argument to slice_plot()
.
Drawing contour plots
In MOSAIC Calculus, we generally use a contour-plot format to display graphically a function with two inputs. This is done in much the same way as slice_plot()
.
contour_plot(3 + 2*x - 4*y + 0.5*x*y ~ x & y, bounds(x=-5:5, y=0:10))
slice_plot()
: tilde expression, bounds, ...bounds()
.+
or &
. Both have the same effect.bounds()
, one for each of the two inputs to the function being plotted.bounds()
. The vertical axis is set by the second name.bounds()
statement, you can add additional named arguments, separated as always by commas, to customize your graph.skip = 0
Label every one of the contours, rather than the default of skipping one between labels.filled = FALSE
Don't color in the background. This is particularly useful when showing two or more graphical layers.contour_color = "blue"
(or any other color) Instead of using the function output level to determine the color of each contour, use just one color for all. This is useful when, for example, you want to show just the zero contour of each of two functions.
makeFun()
to define a function of two inputs, the created function will take two arguments, one for each of the inputs. Those arguments are named and it is a good idea when evaluating the function you created using the names explicitly.h <- makeFun(x + 15*y - x*y ~ .)
h(y=3, x=2)
h()
, the first argument will be x
and the second y
. So you could perform the same evaluation as in the above command with h(2, 3)
. But it's easy to forget or mistake the order of the arguments. Using the named-argument style makes it clearer which value is intended for which argument.slice_plot(h(x=x, y=3) ~ x, bounds(x=-1:1))
. At first glance, this may look silly. What it means is that the $x$ argument to $h()$ will take on the set of values inside the bounds, which is itself written in terms of $x$.Layering graphics
Sometimes you may want to compare two or more functions in the same graphics frame. You do this by drawing the individual functions in the usual way, but connecting the commands with a pipe, signified by the token %>%
slice_plot(x^2 ~ x, bounds(x=-2:2)) %>% slice_plot(sin(x) ~ x, color="blue")
%>%
symbol is pronounced "pipe"bounds()
at an early stage in the pipeline, you don't need to specify it again in the later plotting commands. But, if you want, feel free to set the bounds explicitly. This is helpful if you want the graphics bounds to be different for the different layers.r
slice_plot(x^2 ~ x, bounds(x=-2:2)) %>%
slice_plot(sin(x) ~ x, color="blue")
Data frames
Essentially all of the data we use in this course is arranged as data frames. (This is also true in statistics and data science generally.) A data frame is a rectangular array. Each column is called a variable, each row is a case. For instance, in a data frame about people, each row might be an individual person. The different variables record different aspects of that person: height, age, sex, state of residence, etc. All of the entries within a column must be the same kind of thing: a number for height, a postal abbreviation for state, and so on. There are three kinds of things you will do with data frames:
For this course, almost all data will be provided by giving you the name of dataframes. Sometimes this name will be simple and in the usual form, e.g. EbolaAll
. Other time, the name will be preceded with information about where the computer should look for the data, e.g. palmerpenguins::penguins
.
Some simple, helpful commands for orienting yourself to a data frame:
- names(EbolaAll)
tells you the names of the columns in the data frame.
- head(EbolaAll)
shows the first several rows.
- DT::datatable(EbolaAll)
will show the entire data frame interactively, allowing you to page through the data.
- help(EbolaAll)
displays documentation about the data frame.
Once you know the basic facts of a data frame---what the variable names are and what kind of thing each variable records---you are ready to use the data in your work. Two common tasks are (1) to plot one variable versus another and (2) to use the variables in some calculation, such as fitting a model. Both tasks use much the same syntax based on tilde expressions.
For instance:
gf_point(Gcases ~ Date, data = EbolaAll)
. (See the "plotting data" section of this document.) The variable names (Gcases
and Date
) are used in the tilde expression, while the name of the data frame is specified in other argument, the named argument data=
.Plotting data
There is one type of data graphic that we will be using in this course: the point plot (also known as a "scatter plot"). There are, of course, many other kinds of statistical graphics such as density plots, jittered plots, etc. which you will learn about in a statistics course, but we will not use them here.
A point plot always involves two variables from a data frame. To illustrate, consider the palmerpenguins::penguins
data frame, where each case is a individual penguin.
kable(penguins)
Suppose we want to look at flipper length versus body mass. The appropriate command is:
gf_point(flipper_length_mm ~ body_mass_g, data = palmerpenguins::penguins)
Each row in the data frame generates one dot in the plot: the $y$ and $x$ coordinates of the dot are set by the value of the variables named in the tilde expression. $y$ is always the first name, on the left-hand side of the tilde.
There is a wide variety of ways to customize the plot: size, shape, transparency of the dots, etc. A statistics course can show you why you would want to use these modalities. For our purposes in this course, there are just three sorts of customizations:
# linear axes: the default gf_point(pressure ~ volume, data = Boyle) # Log-log axes gf_point(pressure ~ volume, data = Boyle) %>% gf_refine(scale_x_log10(), scale_y_log10())
To make a semi-log plot, use the scale_y_log10()
and leave out the scale_x_log10()
argument to gf_refine(0)
.
Note: Log-log or semi-log axes are a valuable way to present data to a human reader, because the value of the variables for any point can be read directly from the axes. But sometimes your purpose is not to present data but to estimate a parameter in a power-law or exponential relationship. For the purpose of estimating parameters, better to make an ordinary plot with linear axes, but plot the log of the variable(s) rather than the variables themselves. For instance:
gf_point(log(pressure) ~ log(volume), data = Boyle)
gf_point(flipper_length_mm ~ body_mass_g, data = palmerpenguins::penguins, color = ~ sex)
Notice the tilde before the variable name in the color=
argument. You can also set the color to be a fixed values, e.g. color="magenta"
.
gf_lims()
function. For instance, the penguin graph above shows that females have somewhat smaller flipper length than males, and somewhat smaller weights as well. But because zero was not used as the lower limit of the axes, the plot overstates the sex differences. Including zero as an axis limit is often appropriate.gf_point(flipper_length_mm ~ body_mass_g, data = palmerpenguins::penguins, color = ~ sex) %>% gf_lims(y=c(0,235), x=c(0,6500))
Fitting functions to data
Finding parameters to match a function to data can be a matter of trial and error. The fitModel()
R/mosaic function can be used to polish a preliminary fit.
For instance, as every chemistry student knows, Boyle's Law states that at constant temperature, pressure and volume are inversely related: $P = a V^{-1}$. Let's see how well Boyle's data matches his law by fitting a general power-law form $P = a V^{n}$ to his data.
At the core of fitModel()
is a tilde expression that specifies the name of the function output (on the left side of the tilde) and a functional form written in terms of the names of the inputs to the function. Parameters are written using names.
mod <- fitModel(pressure ~ a*volume^n, data = Boyle)
The object created by fitModel()
is a function of the variables on the right-hand side of the tilde expression, just volume
here. You can use that mathematical function like any other that you create with makeFun()
or the like. For instance:
gf_point(pressure ~ volume, data = Boyle) %>% slice_plot(mod(volume) ~ volume, color="magenta")
But unlike other functions, you can interrogate the functions produced by fitModel(0)
to see the numerical values of the parameters. This is done with the coef()
function (as in "coefficients").
coef(mod)
If Boyle had been able to measure and control the temperature in his experiments, he would have found that the parameter a
is proportional to temperature.
Default values for parameters in functions
Suppose you were creating a function to represent the distance travelled by an object in free fall as a function of time from some initial point in time $t_0$. The mathematical relationship is $$\text{dist}(t) \equiv v_0 \left[\strut t - t_0\right] + \frac{1}{2} g \left[\strut t - t_0\right]^2$$
To evaluate this function, you need to know the three parameters $v_0, t_0,$ and $g$. On Earth, $g=-9.8$ meters/sec$^2$, but you may not know $v_0$ or $t_0$ until it comes time to use the function in some context. It's tempting to put in numerical values for the parameters to make the function easy to use. For instance, if the object starts from rest at time $t=0$, it would be tempting to make the R definition something like:
dist <- makeFun(0*(t-0) - 9.8*(t-0)^2 / 2 ~ .)
or even makeFun(-9.8*t^2 / 2 ~ .)
A better practice is to create the function with the parameter names shown explicitly:
dist <- makeFun(v0 * (t - t0) + g*(t-t0)^2/2 ~ .)
Unfortunately, the resulting function will have four arguments which have to be specified every time you use it. A nice compromise is to assign default values for the parameters. For instance $t_0 = 0$ and $v_0 = 0$ and $g = -9.8$ meters/sec$^2$ would be sensible.
You specify the default parameters by adding them as additional arguments after the tilde expression.
dist <- makeFun(v0*(t-t0) + g*(t-t0)^2 / 2 ~ ., g=-9.8, v0=0, t0=0)
This way, you can use the function simply when the default parameters are appropriate, or modify them as needed. For instance, the free-fall distance over two seconds:
on_Earth <- dist(2) on_Mars <- dist(2, g=-3.7)
Calculating derivatives
"Calculating a derivative" means to find the function that is the derivative of a specified function. In R/mosaic this can be done in exactly the same way that makeFun()
works. For instance:
df <- D(exp(t) * cos(t) ~ t) df slice_plot(df(t) ~ t, bounds(t=0:10))
The name on the right-hand side of the tilde expression will be the "with respect to" variable. If you want a second derivative, use and expression like t + t
on the right-hand side. You can also calculate mixed partial derivatives with right-hand side expressions like t & x
.
R/mosaic knows how to use both numerical and symbolic methods. To force the use of numerical methods, use numD()
instead of D()
.
Calculating anti-derivatives
Anti-derivatives can be computed with antiD()
. For example:
antiD(sin(omega*t) ~ t)
R/mosaic knows only a few symbolic anti-derivatives. When it doesn't know the symbolic form, the function produced by antiD()
will use numerical methods.
Solving (zero-finding)
"Solving" means to find an input $x^\star$ that will generate an output of $v$ from $f(x)$. That is, the answer $x^\star$ will give $f(x^\star) = v$. There may be no solutions, one solutions, several or many solutions.
In R/mosaic, solving is implemented as the Zeros()
function. Zeros()
looks for solutions where $f(x^\star) = 0$, but any solution problem, regardless of $v$, can be placed in this form.
Zeros()
is used in the same way as many other R/mosaic functions: the first argument is a tilde expression, the second is a bounds. The output will be a data-frame with the solutions found. For instance, here we find the inputs to $\sin(x)$ that produce an output of 0.5. Notice that we have created a new function ($\sin(x) - 0.5$) whose output will be zero at the solutions we seek.
solutions <- Zeros(sin(x) - 0.5 ~ x, bounds(x=0:10)) slice_plot(sin(x) ~ x, bounds(x=0:10)) %>% gf_point(0.5 ~ x, data = solutions) solutions
Optimization (with derivatives)
In optimization, you have an objective function $f(x)$ and you seek the argmin(s) or argmax(es). The textbook introduces optimization using a technique involving differentiating and solving, that is, finding $x^\star$ such that $\partial_x f(x^\star) = 0$. We'll illustrate with a made-up function:
f <- rfun(~ x, seed=943) # a random function slice_plot(f(x) ~ x, bounds(x=-5:5))
It's so easy to spot the argmins and argmaxes from a graph that it's reasonable to wonder why we need derivatives and solving to do it. The answer is that we are helping you establish a conceptual foundation for more interesting optimization problems where you can't just look at a graph. So let's step through the problem formally using derivatives.
df <- D(f(x) ~ x)
If there were symbolic parameters in your definition of $f()$, you would need to resolve them at this point, assigning the parameters definite numerical values. 2. Find the zeros of the derivative function:
dzeros <- Zeros(df(x) ~ x, bounds(x=-5:5)) dzeros
The output of Zeros()
has two columns. The first lists the solutions, the second gives the value of the function at those solutions. Note that the function output is very close to zero, there's not really any information there. And there's nothing that tells you whether a given $x$ is an argmax or an argmin. Of course, you could figure this out by referring to the graph of $f(x)$ itself, but let's be a little more formal by extracting this information from dzeros
, f()
, and the second derivative of f()
. To do this, we introduce a new function, mutate()
, that operates on data frames. And we'll need the second derivative $\partial_{xx}f()$ to establish whether the function is concave up or down at the solutions found.
ddf <- D(f(x) ~ x & x)
Here's where mutate()
comes in, allowing us to apply the functions f()
and ddf()
to the x
values in the solution. The first command might look strange. It means, "Take the old data frame dzeros
, add some new columns to it using mutate()
, and then calling the new data frame by the old name dzeros
.
dzeros <- dzeros %>% mutate(val = f(x), convexity = sign(ddf(x))) dzeros
From the sign of the convexity at the solutions $x$, you can tell immediately which rows correspond to an argmin (positive convexity) and which to an argmax (negative convexity).
For fun, we can graph our findings on top of the function $f()$ to confirm that we have things right:
slice_plot(f(x) ~ x, bounds(x=-5:5)) %>% gf_point(val ~ x, data = dzeros, color= ~ convexity)
Optimization (generally)
Optimization is such an important procedure that clever algorithms to handle all sorts of difficulties has been invented. This is an extensive topic in its own right, but for simplicity, R/mosaic provides a function argM()
that works with functions with multiple inputs, pretty much automatically. It's called argM()
because it returns both argmins and argmaxes, along with some supplemental information.
argM(f(x) ~ x, bounds(x=-5:5))
argM()
will not necessarily find all the local argmins and argmaxes. But it also works for functions with multiple inputs.
f2 <- makeFun(x*sin(sqrt(3+x))*(1+cos(y))-y ~ .) solns <- argM(f2(x, y) ~ x & y, bounds(x=c(-3,3), y=c(-3,3))) solns
contour_plot(f2(x,y) ~ x & y, bounds(x=c(-3,3), y=c(-3,3))) %>% gf_point(y ~ x, data = solns, color="red")
Vectors and matrices
To be released.
Finding solutions to the target problem
To be released.
Solving differential equations
To be released.
Basic modeling functions
identity_fun <- makeFun(x ~ x) constant_fun <- makeFun(1 ~ x) straight_line <- makeFun(m*x + b ~ x, b=0, m=1) exponential <- makeFun(exp(k * x) ~ x, k=1) power_law <- makeFun(x^p ~ x, p = 1/2) sinusoid <- makeFun(sin(2*pi*(t-t0)/P) ~ t, P=2, t0=0) logarithm <- makeFun(log(x, base=exp(1)) ~ x) gaussian <- makeFun(dnorm(x, mean, sd) ~ x, mean=0, sd=1) sigmoid <- makeFun(pnorm(x, mean, sd) ~ x, mean=0, sd=1) identity_fun(3) constant_fun(3) power_law(3)
Assembling functions
# Linear combination (example) f <- makeFun(a0 + a1*exp(k * x) ~ ., a0=30, a1=150, k=-0.5) # Product (example) g <- makeFun(dnorm(x, mean=0, sd=3) * sin(2*pi*t/P) ~ ., P=3) # Composition (example) h <- makeFun(exp(sin(2*pi*t/P) ~ x) ~ ., P = 3)
Graphing functions
slice_plot(dnorm(x, mean=1, sd=2) ~ x, bounds(x=-5:5)) contour_plot(dnorm(x, mean=1, sd=2) * pnorm(y, mean=-3, sd=1) ~ x + y, bounds(x=-5:5, y=-5:5))
Calculus operations
f <- makeFun(exp(-0.5*x) * sin(2*pi*x/3) ~ .) df <- D(f(x) ~ x) slice_plot(df(x) ~ x, bounds(x=-5:5)) %>% slice_plot(f(x) ~ x, bounds(x=-5:5), color="orange3")
f <- makeFun(dnorm(x, mean=1, sd=2) ~ .) F <- antiD(f(x) ~ x) slice_plot(F(x) ~ x, bounds(x=-5:5)) %>% slice_plot(f(x) ~ x, color="orange3") # Set "constant of integration" slice_plot(F(x) ~ x, bounds(x=-5:5)) %>% slice_plot(f(x) ~ x, color="orange3") %>% slice_plot(F(x, C=0.25) ~ x, color="green") # Definite integral F(5) - F(-5)
f <- makeFun(exp(sin(2*pi*x/3)) - 0.5 ~ .) Zeros <- findZeros(f(x) ~ x, near=0, within=5) Zeros slice_plot(f(x) ~ x, bounds(x=-5:5)) %>% gf_hline(yintercept=0, color="orange3") %>% gf_vline(xintercept= ~ x, color="dodgerblue", data=Zeros)
Fitting functions to data
Use fitModel()
to fit a function of one variable to data.
gf_point(temp ~ time, data = CoolingWater) # Eyeball half-life at 25 k0 <- -log(2)/25 mod <- fitModel(temp ~ A + B*exp(-k*time), data=CoolingWater, start=list(k=k0)) Plot <- gf_point(temp ~ time, data = CoolingWater) %>% slice_plot(mod(time) ~ time, color="dodgerblue", alpha=0.25, size=2)
Linear and quadratic approximations
Approximate a function $f(x)$ around a selected point $x=x_0$.
f <- makeFun(exp(-0.5*x)*sin(2*pi*x/3) ~ .) df <- D(f(x) ~ x) center_fun_on <- 0.9 ddf <- D(df(x) ~ x) # alternatively, D(f(x) ~ x + x) lin_approx <- makeFun(f(x0) + df(x0)*(x-x0) ~ x, x0 = center_fun_on) quad_approx <- makeFun(lin_approx(x) + 0.5*ddf(x0)*(x-x0)^2 ~ x, x0 = center_fun_on) slice_plot(f(x) ~ x, bounds(x=0:1.5), size=2) %>% slice_plot(lin_approx(x) ~ x, color="blue") %>% slice_plot(quad_approx(x) ~ x, bounds(x=0:1.5), color="orange") %>% gf_vline(xintercept = center_fun_on, alpha=0.2, color="yellow", size=3)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.