Structured datasets often consist of tables with observations recorded
as rows and variables encoded in columns. Let's assume that two of the
recorded variables, say x
and y
, are of particular interest. A typical
data-science question might then be: are the two variables related?
A straightforward approach to this problem is to compute the correlation
between x
and y
, for example using cor.test
. This is informative,
but it opens follow-up questions such as: do other factors play a
role in the relationship? Such follow-up questions are important because
they can influence discussions or possible causation mechanisms driving
the correlation. They can also influence subsequent data-collection and
analysis efforts.
Studying relations between multiple variables can be approached using multiple linear regression. A guide to the use and interpretation of multiple linear regression models is summarized, for example, in a Points of Significance column.
In R, such models are accessible through the lm
and glm
functions.
These tools build one model at a time, but in practice it
is often useful to explore data using several models. The famr
package
thus provides a set of wrappers and visualization tools to analyze
families of related multiple regression models.
Let's look at some examples that expose the functionality of the package.
This section presents a series of examples based on the generative models below.
{width=100%}
In each diagram, circles indicate random variables and arrows reveal
causal relations. In model 1, variables w1
through
w6
as well as x
are independent; values of variable y
are determined by x
(plus noise). In model 2,y
is driven by a combination
of w2
and x
, and so on. Together, the models cover some of the
common relationships encountered in multivariate datasets.
The analyses below each use a toy dataset generated using one of
these models. They build linear models of
the form y ~ (intercept) + (slope-x)x
and estimate values for the
intercept and slope from the data. At the same time, the analyses
also build families of models
y ~ (intercept) x + (slope-x)x + (slope-w)w
, with w
systematically
replaced by variables w1
through w6
.
To begin, let's load the package and a toy dataset mrdata1
(it is bundled with the package).
library(famr) data(mrdata1) head(mrdata1, 3)
This object is a data frame with 8 columns. As before, let's assume that x
and y
are of particular interest and use the other
variables, i.e. w1
through w6
, as auxiliary data.
A famr
analysis of this a dataset consists of a single command
family1 = famr(mrdata1, "y", "x")
This specifies our dataset and the two variables of interest. A preview of the output reveals that multiple regression has been performed using a family of seven related models.
family1
Before digging into the structure of the output object, let's visualize
its contents using plot
.
plot(family1, main="Dataset 1")
This shows the raw values for the modeled variables. We see that there is a linear relation. The red line is a single-variable linear regression.
But recall that the famr
analysis also contains auxiliary models.
In those models, y
is hypothesized to depends on x
as well as one of the auxiliary fields, i.e. w1
-w6
. An overview of
the results is available through summary
.
summary(family1)
The first line describes the single-variate linear regression. In this case,
the p-value associated with the x
-dependency is around 1e-14
. The other
rows in the summary describe models with additional terms. The primary
p-value refers to the contribution of x
to y
. The secondary p-value
refers to the auxiliary fields.
The package provides a tool to visualize these dependencies.
plot(family1, type="pvalues", main="Dataset 1")
The axes here display log-transformed p-values. The
multi-variate models that contain a dependency on x
and another variable
contribute one point each. The base single-variate model, which does not
contain a secondary variable, is shown as a vertical line. In this case,
all the points are very close to the line and none is associated with a
significant p-value.
The conclusion of this analysis might be that y
and x
are correlated
and that none of the auxiliary variables participate in this relationship.
Indeed, this is consistent with generative model 1.
For the next examples, it will be convenient to draw both diagrams at the same time. So let's define a wrapper function.
## wrapper for two famr plots (scatter, pvalues) plotSP = function(family, ...) { par(mfrow=c(1,2)) plot(family, type="scatter", ...) plot(family, type="pvalues", ...) }
For the next example, let's load and analyze a related dataset, mrdata2
.
data(mrdata2) family2 = famr(mrdata2, "y", "x") plotSP(family2, main="Dataset 2")
The scatter diagram (left) again suggests a linear relation between y
and x
. In addition to the thick red line, the diagram also shows some
gray lines. They represent fits of the multivariate models. These lines
were also present in the earlier example, but were not visible behind the
red line. Here, some of the multivariate models seem to impact the gradient
between y
and x
.
The p-value diagram (right) is markedly different from before. One of the
points, labeled w2
, lies far from the vertical line. The fact that this
point has a significant vertical coordinate suggests that variable w2
is
correlated with y
. The fact that including w2
improves the significance of
x
in the model suggests that x
and w2
are independent and both
relevant to modeling y
. Indeed, this is the structure of generative model 2.
The next example uses dataset mrdata3
.
data(mrdata3) family3 = famr(mrdata3, "y", "x") plotSP(family3, main="Dataset 3")
The scatter diagram (left) again shows a correlations between x
and y
. However, two of the multivariate models produce fits that are almost flat. These fits suggests that given knowledge of two of these auxiliary variables, the dependency of y
and x
disappears.
The p-value diagram (right) leads us toward a similar conclusion. Two of the
points fall to the far left of the baseline model. This suggests that
including either w3
or w4
in a model de-emphasizes the importance of x
.
Given that we identified these new interesting variables, we can
investigate further using famr
, switching focus onto w3
.
family3b = famr(mrdata3, "y", "w3") plotSP(family3b, main="Dataset 3, focus on w3")
The results here are consistent with the generative model 3.
While the previous examples demonstrated some clear-cut examples, some
patterns may be more subtle to interpret. The next example, mrdata4
,
goes in this direction.
data(mrdata4) family4 = famr(mrdata4, "y", "x") plotSP(family4, main="Dataset 4")
This output appears similar to example 1. The single-variate regression on the left-hand side seems to fit the data well and the alternate models do not greatly affect the gradient. The p-value diagram on the right does not point to a single variable that might improve or refute the model. However, the dots near the vertical line show some scatter.
We can take this as an opportunity to investigate the data further and to use another feature of the package. Instead of using a default set of alternative models, we can customize that family. To do that, we first build defining the default family.
models = famrModels(mrdata4, avoid=c("x", "y")) models[1:2]
The first command inspects the data frame and define alternative models based
on the columns in this data frame. The second argument suggests not to
include the variables x
and y
; those are part of the base model.
The output object models
is just a list. Its formating is verbose, but
the benefit is that can augment the models
object in several ways.
A new alternative model might be to include more than one secondary variable.
models[["w4w5"]] = list(w4="w4", w5="w5")
The above creates a model called w4w5
that will contribute two terms to
the multivariate regression.
Another alternative model might use a predefined combination of existing fields. We can implement that using a model based on a custom function.
f456 = function(x) { -x[,"w4"]+x[,"w5"]+0.5*x[,"w6"] } models[["w456"]] = list(w456=f456)
The combination of variables in the function seems ad-hoc at this stage, but let's carry on and see what happens. In the second line, the function is set as part of a custom model.
We can now launch the famr
analysis with our custom model family.
family4b = famr(mrdata4, "y", "x", models) plotSP(family4b, main="Dataset 4, with custom model family")
We see additional components on the plots corresponding to the new
alternative models. Variable w456
(defined by applying function f456
on the dataset) greatly reduces the importance of x
. As
in the previous example, this is a suggestion to investigate further. In
particular, we can perform a famr
analysis focusing on y
and f456
.
(Note the modified syntax using a list as the third parameter).
family4c = famr(mrdata4, "y", list(w456=f456)) plotSP(family4c, main="Dataset 4, focus on w456")
The base model now fits better than before. The p-value diagram suggests
we migth try to add w3
into the model next.
The final example uses a dataset mrdata5
with categorical fields.
data(mrdata5) head(mrdata5, 3)
We see that w5
and w6
consist of factors. However, we can proceed
with famr
analysis as before.
family5 = famr(mrdata5, "y", "x") plotSP(family5, main="Dataset 5")
The interpretation of the these plots is by now familiar. On the right hand
side, two of the points are far from the baseline model and indicate that
including a new variable to a great extent explains away the correlation
between x
and y
. In this case, the two points correspond to factors from
the categorical column w5
.
Indeed, this dataset was produced in a similar manner to example 3, whereby
both x
and y
were generated independently from values of w5
.
The famr
package provides wrappers and extensions to R's linear modeling
and plot functions. These wrappers provide a means to test relations
between variables of interest while automatically considering alternative
models.
A possible practical application is during explorative stages of data analysis. Automatic testing of families of related models can suggest alternative relations between variables. Many dynamic systems with complex interactions, for example biological gene expression, are known to create correlations between non-causally related components. The package provide tools to perform tests for such effects in an automated fashion.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.