knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette illustrates the functionality of TwinAnalysis
package for
twin analysis (the statistical analysis of the data of twins in a twin study).
This package can:
Build and run the univariate and multivariate (correlated factors) ACE and ADE models.
Report the parameters and statistics from a twin model in the tables that are ready for a paper or report.
Build and run the cross-lag ACE and ADE models.
Compute descriptive statistics and twin correlations that often supplement the results from the twin analysis.
This package cannot:
Do the analysis of categorical outcomes.
Help to choose the model or fix the model that does not fit the data.
Clean or reshape the data. (However, this vignette provides an example of reshaping the data.)
This package builds on the functionality of OpenMx
,
a package for structural equation modeling. OpenMx
is widely used for twin
analysis due to its matrix algebra tools that allow flexible model building.
This vignette only explains the concepts from OpenMx
that are essential for the
use of TwinAnalysis
. However, it is advisable to learn the syntax of OpenMx
for the fuller control of the models, particularly for the diagnostic of the
(un)fitted models.
TwinAnalysis
uses the mlth.data.frame
package for the output tables.
This package can also gather the tables across the script and write them into a
single Excel file. The script for the installation of the package is provided
below. For the details on mlth.data.frame
refer to the
package vignette).
This vignette uses
the pipe operator (%>%
)
from the dplyr
package. x %>% f
is equivalent to f(x)
. For example,
d %>% sapply(mean, na.rm = TRUE)
is equivalent to sapply(d, mean, na.rm)
.
# install.packages('devtools') library(devtools) install_github('ivanvoronin/mlth.data.frame') install_github('ivanvoronin/TwinAnalysis')
## Load the packages library(dplyr) library(mlth.data.frame) library(OpenMx) library(TwinAnalysis)
For the purpose of illustration, I will use the twin dataset from OpenMx
. This is an Australian twin data on height (ht1/2
), weight (wt1/2
) and BMI (bmi1/2
). There are two cohorts of twins: younger (zyg %in% 1:5
) and older (zyg %in% 6:10
). Details in ?twinData
.
## Load the data data(twinData) head(twinData)
Pre-processing of the data may include many steps, such as:
which are not covered in this vignette.
Twin data can be stored in a data tables in two formats:
Most functions from TwinAnalysis
operate with data in the wide format. It is assumed that the variables that differ between the twins (including phenotype variables) are named as X1
and X2
. X1
is trait X
in the first twin and X2
is trait X
in the second twin. Mind that there is no separator between variable name and the suffix 1/2
.
Below is an example of reshaping wide table twinData
to long table longData
and back.
# From wide to long longData <- twinData %>% reshape(direction = 'long', varying = c('wt1', 'wt1', 'ht1', 'ht2', # The variables that vary 'htwt1', 'htwt2', 'bmi1', 'bmi2', # within a twin pair 'age1', 'age2'), sep = '') # <- A separator between variable name and suffix head(longData) # Back from long to wide longData %>% reshape(direction = 'wide', v.names = c('wt', 'ht', 'htwt', # The variables that vary within a twin pair 'bmi', 'age'), idvar = 'fam', # Unique identifier of twin pair timevar = 'time', # Twin number (identifier of twin within pair) sep = '') %>% # Optional: a separator, default '.' head
Zygosity is a key variable for twin analysis as it labels MZ and DZ twins. A twin pair with no zygosity is excluded from the twin analysis.
In the illustration data, the zygosity
variable labels both zygosity and sex. I create the new zygosity and sex variables to use in further analysis.
twinData$zyg <- twinData$zygosity %>% as.character %>% sapply(switch, MZFF = 'MZ', MZMM = 'MZ', DZFF = 'DZss', # Same-sex DZ pairs DZMM = 'DZss', DZOS = 'DZos') %>% # Opposite-sex DZ pairs factor(levels = c('MZ', 'DZss', 'DZos')) # This is to keep the order of levels twinData$sex1 <- twinData$zygosity %>% as.character %>% sapply(switch, MZFF = 'Female', MZMM = 'Male', DZFF = 'Female', DZMM = 'Male', DZOS = 'Female') # First twin in opposite-sex pair is female twinData$sex2 <- twinData$zygosity %>% as.character %>% sapply(switch, MZFF = 'Female', MZMM = 'Male', DZFF = 'Female', DZMM = 'Male', DZOS = 'Male') # Second twin in opposite-sex pair is male # There are many alternative ways to do the same thing # I also like to check whether the manipulations with the data # did the right thing, like so: with(twinData, table(zygosity, zyg)) with(twinData, table(zygosity, sex1)) with(twinData, table(zygosity, sex2))
Finally, for this analysis I will select only the younger cohort of participants.
## It is usually a good idea to make a copy of an original ## data frame and make changes there twinData2 <- twinData[twinData$cohort == 'younger', ]
Descriptive statistics are important for twin analysis because they help us to make sure that the data is OK and that the assumptions of twin model hold. If any assumption doesn't hold, the model can have poor fit meaning that it may not descrive well the reality that we study.
One mathematical assumption requires similar distributions of the phenotypic variables within twin pairs and between zygosity groups. Namely, the means and variances shouldn't differ between twin 1 and twin 2 and between MZ and DZ twins. Another mathematical assumption constrains the similarities witnin MZ and DZ pairs: $r_{MZ} > r_{DZ}$ and $2r_{DZ} < r_{MZ}$ (ACE model) or $r_{MZ}>2r_{DZ}$ and $4r_{DZ} > r_{MZ}$ (ADE model). We also may want to compare twin similarity in DZss and DZos pairs if we want to combine them into the single group.
The TwinAnalysis
package provides functionality to compute descriptive statistics (means, SDs, twin correlations).
get_univ_descriptives(twinData2, zyg = 'zyg', vars = c('ht', 'wt', 'bmi')) ## No suffix ## We can also compare DZss and DZos correlations to check if they differ get_fisher_z(twinData2, zyg = 'zyg', vars = c('ht', 'wt', 'bmi'), compare = c('DZss', 'DZos'))
In this dataset there is noticeable difference between DZ twin 1 and DZ twin 2 in height and weight. This is a natural result of twin 1 being female and twin 2 being male in this group. The twin similarity is lower in DZos pairs comparing to DZss pairs, perhaps as the result of gender differences. Finally, $r_{MZ} > 2r_{DZ}$ for BMI which means that non-additive genetic effects may contribute to this phenotype and ADE model may be appropriate.
For now, I ignore gender differences and exclude DZos pairs from the analysis by making a new zygosity variable.
The models only accept zygosity as a factor variable with labels 'MZ'
and 'DZ'
.
twinData2$zyg2 <- twinData2$zyg %>% as.character %>% sapply(switch, MZ = 'MZ', DZss = 'DZ', DZos = NA) %>% factor(levels = c('MZ', 'DZ')) twinData2 <- twinData2[!is.na(twinData2$zyg2), ] get_univ_descriptives(twinData2, zyg = 'zyg2', vars = c('ht', 'wt', 'bmi'))
A univariate ACE/ADE is the simplest form of twin model.
Let's fit a univariate ACE model to BMI.
## Define the model bmi_univ <- univariate_ace(twinData2, zyg = 'zyg2', ph = 'bmi') ## Fit MxModel bmi_univ <- mxRun(bmi_univ) ## OpenMx summary summary(bmi_univ) ## Output tables get_output_tables(bmi_univ)
univariate_ace()
returns an MxModel object which means that you can do everything you usually do with OpenMx
models: run parameter fitting (mxRun()
, mxTryHard()
), get model summary (summary()
), add/delete matrices and algebra from a model (mxModel()
), add constraints (mxConstraint()
) etc.
bmi_univ bmi_univ$expCovMZ mxEval(expCovDZ, bmi_univ) ## Note that expected variances in MZ and DZ twins are exactly same, ## this is why the assumptions on means and variances must be met in the data
In addition to the standard model structure, such as the parameter matrices, the algebra for expected covariance matrix, the observed data, unvariate_ace()
(as well as the other functions from TwinAnalysis
that define models) adds the algebra that computes indeces that we actually will use as main results. In particular, univariate_ace()
returns the single otput table with percents of variability accounted for by A, C and E. The output tables are usually stored inside the model as MxAlgebra
.
The function get_output_tables()
returns the output tables from the model. These tables are recorded as an algebraic expression with model parametwes (MxAlgebra
object) inside the model. For example, the Variance_components
table reports the proportion of genetic and environmental contributions computed as genetic or environmental variability divided by the total variability of a trait.
To ensure that the model fits the data well, we compare it against two reference models: saturated model and independence model. In a saturated model, there are as many parameters as there are data points (in terms of variances, covariances and means), therefore this models has the best fit of all. An independence model assumes that all variables are independent. Theoretically, a good model should describe the data as well as its saturated model.
The function ref_models()
runs and appends the reference models to a target model. By default, it uses mxRefModels()
to define reference models and the user can provide custom reference models as well (see ?ref_models
).
The function twin_ref_model()
defines, runs and appends the reference models that follow the assumptions of the twin method (equal means, variances and co-variances across twins within pair and across zygosity groups).
The function fit_stats()
returns a table with fit statistics.
## bmi_univ and bmi_univ2 will be the same model with different reference models bmi_univ2 <- bmi_univ bmi_univ <- twin_ref_models(bmi_univ, run = TRUE) bmi_univ2 <- ref_models(bmi_univ2) fit_stats(bmi_univ) fit_stats(bmi_univ2)
We can check that the assumptions hold by comparing two saturated models. In this case, the constrained saturated model explains the data significantly worse that the unconstrained saturated model meaning that some assumptions might be violated.
mxCompare(get_ref_models(bmi_univ2)$Saturated, get_ref_models(bmi_univ)$Saturated)
When there are several phenotypic variables to fit a univariate twin model upon, it is convenient to iterate over the variables.
univ_models <- list() for (v in c('ht', 'wt', 'bmi')) { model <- univariate_ace(twinData2, zyg = 'zyg2', ph = v) model <- mxRun(model, intervals = TRUE) ## This time I evaluate CIs model <- ref_models(model) univ_models[[v]] <- model } ## Estimates of genetic and environmental contributions univ_models %>% lapply(get_output_tables) %>% lapply(`[[`, 'Variance_components') %>% ## same as lapply(function(x) x[['Variance_components']]) do.call(rbind, .) univ_models %>% lapply(fit_stats) %>% do.call(rbind, .)
A bivariate twin model is applied when the research goal is to find out whether the measured correlation between two traits is accounted for by genetic or environmental variability shared by these two traits. This kind of analysis relies on within-twin and cross-twin within-trait and cross-trait covariances. For example, ht1 <-> ht2
is a within-trait cross-twin correlation, ht1 <-> wt1
is a cross-trait within-twin correlation, ht1 <-> wt2
is a cross-trait cross-twin correlation. All these data will be used to find genetic and environmental contributions to height and veight and to estimate the overlap between the corresponding genetic and environmental variability.
twin_cors <- get_cross_trait_cors(twinData2, zyg = 'zyg2', vars = c('ht', 'wt')) twin_cors ## Within-twin variances-covariances ## Each table is diagonally symmetrical twin_cors %>% lapply(`[`, c('ht1', 'wt1'), c('ht1', 'wt1')) twin_cors %>% lapply(`[`, c('ht2', 'wt2'), c('ht2', 'wt2')) ## Cross-twin within-trait covariances ## Second set of tables is transposed first set of tables twin_cors %>% lapply(`[`, c('ht1', 'wt1'), c('ht2', 'wt2')) twin_cors %>% lapply(`[`, c('ht2', 'wt2'), c('ht1', 'wt1'))
When we run a bivariate ACE on the illustration data, the optimizer fails to find optimal parameter values (status code RED). We can check model identification as OpenMx suggests.
biv_model <- multivariate_ace(twinData2, zyg = 'zyg2', ph = c('ht', 'wt')) biv_model <- mxRun(biv_model) mxCheckIdentification(biv_model) get_output_tables(biv_model)
It says that there are problems with the non-shared environmental parameters (ACE.e[1,1]
, ACE.e[2,1]
, ACE.e[2,2]
). The matrix e
is actually a Cholesky decomposition of environmental variability: E = e %*% t(e)
. I will eliminate non-shared environmental correlation between height and weight by constraining e[2, 1] == 0
.
biv_model <- multivariate_ace(twinData2, zyg = 'zyg2', ph = c('ht', 'wt')) biv_model$e$values[2, 1] <- 0 ## This is not the only way to fix a parameter biv_model$e$free[2, 1] <- FALSE biv_model <- mxRun(biv_model) ## It works fine now get_output_tables(biv_model) biv_model <- ref_models(biv_model) fit_stats(biv_model)
Now it runs fine. The non-shared environmental correlation was sussessfully eliminated. The chi-square test shows significant difference between the model and the corresponding saturated model but CFI, TLI and RMSEA suggest good fit.
The output tables of a bivariate/multivatiate ACE model:
Variance_compinents
: percent of total variance of each variable accounted for by genetic, shared environmental and non-shared environmental factors.All_correlations
: genetic and environmental correlations (rA
, rC
, rE
), total (phenotypic) correlation (Total
), percent of correlation accounted for by genetic, shared environmental and non-shared environmental factors (A(%)
, C(%)
, E(%)
).A_correlations
, C_correlations
, E_correlations
, Total_correlations
: genetic, environmental and total correlation matrices. get_output_tables()
will throw an error.umx
that provides a lot of functionality for SEM and twin analysis.TwinAnalysis
and mlth.data.frame
.file.copy(from = 'vignettes/twin-analysis-workflow.html', to = 'docs/index.html', overwrite = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.