knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Intro

This vignette illustrates the functionality of TwinAnalysis package for twin analysis (the statistical analysis of the data of twins in a twin study).

This package can:

  1. Build and run the univariate and multivariate (correlated factors) ACE and ADE models.

  2. Report the parameters and statistics from a twin model in the tables that are ready for a paper or report.

  3. Build and run the cross-lag ACE and ADE models.

  4. Compute descriptive statistics and twin correlations that often supplement the results from the twin analysis.

This package cannot:

  1. Do the analysis of categorical outcomes.

  2. Help to choose the model or fix the model that does not fit the data.

  3. Clean or reshape the data. (However, this vignette provides an example of reshaping the data.)

This package builds on the functionality of OpenMx, a package for structural equation modeling. OpenMx is widely used for twin analysis due to its matrix algebra tools that allow flexible model building. This vignette only explains the concepts from OpenMx that are essential for the use of TwinAnalysis. However, it is advisable to learn the syntax of OpenMx for the fuller control of the models, particularly for the diagnostic of the (un)fitted models.

TwinAnalysis uses the mlth.data.frame package for the output tables. This package can also gather the tables across the script and write them into a single Excel file. The script for the installation of the package is provided below. For the details on mlth.data.frame refer to the package vignette).

This vignette uses the pipe operator (%>%) from the dplyr package. x %>% f is equivalent to f(x). For example, d %>% sapply(mean, na.rm = TRUE) is equivalent to sapply(d, mean, na.rm).

Installing the package

# install.packages('devtools')
library(devtools)
install_github('ivanvoronin/mlth.data.frame')
install_github('ivanvoronin/TwinAnalysis')

Loading packages and data

## Load the packages

library(dplyr)
library(mlth.data.frame)
library(OpenMx)
library(TwinAnalysis)

For the purpose of illustration, I will use the twin dataset from OpenMx. This is an Australian twin data on height (ht1/2), weight (wt1/2) and BMI (bmi1/2). There are two cohorts of twins: younger (zyg %in% 1:5) and older (zyg %in% 6:10). Details in ?twinData.

## Load the data

data(twinData)

head(twinData)

Pre-processing twin data

Pre-processing of the data may include many steps, such as:

which are not covered in this vignette.

Twin data can be stored in a data tables in two formats:

Most functions from TwinAnalysis operate with data in the wide format. It is assumed that the variables that differ between the twins (including phenotype variables) are named as X1 and X2. X1 is trait X in the first twin and X2 is trait X in the second twin. Mind that there is no separator between variable name and the suffix 1/2.

Below is an example of reshaping wide table twinData to long table longData and back.

# From wide to long
longData <- twinData %>%
  reshape(direction = 'long',
          varying = c('wt1', 'wt1', 'ht1', 'ht2',       # The variables that vary 
                      'htwt1', 'htwt2', 'bmi1', 'bmi2', # within a twin pair
                      'age1', 'age2'), 
          sep = '') #                 <- A separator between variable name and suffix

head(longData)

# Back from long to wide
longData %>%
  reshape(direction = 'wide',
          v.names = c('wt', 'ht', 'htwt', # The variables that vary within a twin pair 
                      'bmi', 'age'),      
          idvar = 'fam',             # Unique identifier of twin pair
          timevar = 'time',          # Twin number (identifier of twin within pair)
          sep = '') %>%              # Optional: a separator, default '.'
  head

Zygosity is a key variable for twin analysis as it labels MZ and DZ twins. A twin pair with no zygosity is excluded from the twin analysis.

In the illustration data, the zygosity variable labels both zygosity and sex. I create the new zygosity and sex variables to use in further analysis.

twinData$zyg <- twinData$zygosity %>%
  as.character %>%
  sapply(switch,
         MZFF = 'MZ',
         MZMM = 'MZ',
         DZFF = 'DZss',     # Same-sex DZ pairs
         DZMM = 'DZss',
         DZOS = 'DZos') %>% # Opposite-sex DZ pairs
  factor(levels = c('MZ', 'DZss', 'DZos')) # This is to keep the order of levels

twinData$sex1 <- twinData$zygosity %>%
  as.character %>%
  sapply(switch,
         MZFF = 'Female',
         MZMM = 'Male',
         DZFF = 'Female',     
         DZMM = 'Male',
         DZOS = 'Female')    # First twin in opposite-sex pair is female

twinData$sex2 <- twinData$zygosity %>%
  as.character %>%
  sapply(switch,
         MZFF = 'Female',
         MZMM = 'Male',
         DZFF = 'Female',     
         DZMM = 'Male',
         DZOS = 'Male')    # Second twin in opposite-sex pair is male

# There are many alternative ways to do the same thing
# I also like to check whether the manipulations with the data
# did the right thing, like so:
with(twinData, table(zygosity, zyg))
with(twinData, table(zygosity, sex1))
with(twinData, table(zygosity, sex2))

Finally, for this analysis I will select only the younger cohort of participants.

## It is usually a good idea to make a copy of an original
## data frame and make changes there
twinData2 <- twinData[twinData$cohort == 'younger', ]

Twin descriptive statistics

Descriptive statistics are important for twin analysis because they help us to make sure that the data is OK and that the assumptions of twin model hold. If any assumption doesn't hold, the model can have poor fit meaning that it may not descrive well the reality that we study.

One mathematical assumption requires similar distributions of the phenotypic variables within twin pairs and between zygosity groups. Namely, the means and variances shouldn't differ between twin 1 and twin 2 and between MZ and DZ twins. Another mathematical assumption constrains the similarities witnin MZ and DZ pairs: $r_{MZ} > r_{DZ}$ and $2r_{DZ} < r_{MZ}$ (ACE model) or $r_{MZ}>2r_{DZ}$ and $4r_{DZ} > r_{MZ}$ (ADE model). We also may want to compare twin similarity in DZss and DZos pairs if we want to combine them into the single group.

The TwinAnalysis package provides functionality to compute descriptive statistics (means, SDs, twin correlations).

get_univ_descriptives(twinData2, zyg = 'zyg',      
                      vars = c('ht', 'wt', 'bmi')) ## No suffix

## We can also compare DZss and DZos correlations to check if they differ
get_fisher_z(twinData2, zyg = 'zyg', 
             vars = c('ht', 'wt', 'bmi'), 
             compare = c('DZss', 'DZos'))

In this dataset there is noticeable difference between DZ twin 1 and DZ twin 2 in height and weight. This is a natural result of twin 1 being female and twin 2 being male in this group. The twin similarity is lower in DZos pairs comparing to DZss pairs, perhaps as the result of gender differences. Finally, $r_{MZ} > 2r_{DZ}$ for BMI which means that non-additive genetic effects may contribute to this phenotype and ADE model may be appropriate.

For now, I ignore gender differences and exclude DZos pairs from the analysis by making a new zygosity variable.

The models only accept zygosity as a factor variable with labels 'MZ' and 'DZ'.

twinData2$zyg2 <- twinData2$zyg %>%
  as.character %>%
  sapply(switch, 
         MZ = 'MZ', 
         DZss = 'DZ', 
         DZos = NA) %>%
  factor(levels = c('MZ', 'DZ'))

twinData2 <- twinData2[!is.na(twinData2$zyg2), ]

get_univ_descriptives(twinData2, zyg = 'zyg2',
                      vars = c('ht', 'wt', 'bmi'))

Univariate twin model

A univariate ACE/ADE is the simplest form of twin model.

Let's fit a univariate ACE model to BMI.

## Define the model
bmi_univ <- univariate_ace(twinData2, zyg = 'zyg2', ph = 'bmi')

## Fit MxModel
bmi_univ <- mxRun(bmi_univ)

## OpenMx summary
summary(bmi_univ)

## Output tables
get_output_tables(bmi_univ)

univariate_ace() returns an MxModel object which means that you can do everything you usually do with OpenMx models: run parameter fitting (mxRun(), mxTryHard()), get model summary (summary()), add/delete matrices and algebra from a model (mxModel()), add constraints (mxConstraint()) etc.

bmi_univ

bmi_univ$expCovMZ

mxEval(expCovDZ, bmi_univ)

## Note that expected variances in MZ and DZ twins are exactly same, 
## this is why the assumptions on means and variances must be met in the data

In addition to the standard model structure, such as the parameter matrices, the algebra for expected covariance matrix, the observed data, unvariate_ace() (as well as the other functions from TwinAnalysis that define models) adds the algebra that computes indeces that we actually will use as main results. In particular, univariate_ace() returns the single otput table with percents of variability accounted for by A, C and E. The output tables are usually stored inside the model as MxAlgebra.

The function get_output_tables() returns the output tables from the model. These tables are recorded as an algebraic expression with model parametwes (MxAlgebra object) inside the model. For example, the Variance_components table reports the proportion of genetic and environmental contributions computed as genetic or environmental variability divided by the total variability of a trait.

To ensure that the model fits the data well, we compare it against two reference models: saturated model and independence model. In a saturated model, there are as many parameters as there are data points (in terms of variances, covariances and means), therefore this models has the best fit of all. An independence model assumes that all variables are independent. Theoretically, a good model should describe the data as well as its saturated model.

The function ref_models() runs and appends the reference models to a target model. By default, it uses mxRefModels() to define reference models and the user can provide custom reference models as well (see ?ref_models).

The function twin_ref_model() defines, runs and appends the reference models that follow the assumptions of the twin method (equal means, variances and co-variances across twins within pair and across zygosity groups).

The function fit_stats() returns a table with fit statistics.

## bmi_univ and bmi_univ2 will be the same model with different reference models
bmi_univ2 <- bmi_univ 

bmi_univ <- twin_ref_models(bmi_univ, run = TRUE)
bmi_univ2 <- ref_models(bmi_univ2)

fit_stats(bmi_univ)
fit_stats(bmi_univ2)

We can check that the assumptions hold by comparing two saturated models. In this case, the constrained saturated model explains the data significantly worse that the unconstrained saturated model meaning that some assumptions might be violated.

mxCompare(get_ref_models(bmi_univ2)$Saturated,
          get_ref_models(bmi_univ)$Saturated)

When there are several phenotypic variables to fit a univariate twin model upon, it is convenient to iterate over the variables.

univ_models <- list()

for (v in c('ht', 'wt', 'bmi')) {
  model <- univariate_ace(twinData2, zyg = 'zyg2', ph = v)
  model <- mxRun(model, intervals = TRUE)   ## This time I evaluate CIs
  model <- ref_models(model)

  univ_models[[v]] <- model
}


## Estimates of genetic and environmental contributions
univ_models %>%
  lapply(get_output_tables) %>%
  lapply(`[[`, 'Variance_components') %>% ## same as lapply(function(x) x[['Variance_components']])
  do.call(rbind, .)

univ_models %>%
  lapply(fit_stats) %>%
  do.call(rbind, .)

Bivariate twin model

A bivariate twin model is applied when the research goal is to find out whether the measured correlation between two traits is accounted for by genetic or environmental variability shared by these two traits. This kind of analysis relies on within-twin and cross-twin within-trait and cross-trait covariances. For example, ht1 <-> ht2 is a within-trait cross-twin correlation, ht1 <-> wt1 is a cross-trait within-twin correlation, ht1 <-> wt2 is a cross-trait cross-twin correlation. All these data will be used to find genetic and environmental contributions to height and veight and to estimate the overlap between the corresponding genetic and environmental variability.

twin_cors <- get_cross_trait_cors(twinData2, zyg = 'zyg2',
                                  vars = c('ht', 'wt'))
twin_cors

## Within-twin variances-covariances
## Each table is diagonally symmetrical
twin_cors %>% 
  lapply(`[`, c('ht1', 'wt1'), c('ht1', 'wt1'))
twin_cors %>% 
  lapply(`[`, c('ht2', 'wt2'), c('ht2', 'wt2'))

## Cross-twin within-trait covariances
## Second set of tables is transposed first set of tables
twin_cors %>% 
  lapply(`[`, c('ht1', 'wt1'), c('ht2', 'wt2'))
twin_cors %>% 
  lapply(`[`, c('ht2', 'wt2'), c('ht1', 'wt1'))

When we run a bivariate ACE on the illustration data, the optimizer fails to find optimal parameter values (status code RED). We can check model identification as OpenMx suggests.

biv_model <- multivariate_ace(twinData2, zyg = 'zyg2',
                              ph = c('ht', 'wt'))
biv_model <- mxRun(biv_model)

mxCheckIdentification(biv_model)

get_output_tables(biv_model)

It says that there are problems with the non-shared environmental parameters (ACE.e[1,1], ACE.e[2,1], ACE.e[2,2]). The matrix e is actually a Cholesky decomposition of environmental variability: E = e %*% t(e). I will eliminate non-shared environmental correlation between height and weight by constraining e[2, 1] == 0.

biv_model <- multivariate_ace(twinData2, zyg = 'zyg2',
                              ph = c('ht', 'wt'))
biv_model$e$values[2, 1] <- 0     ## This is not the only way to fix a parameter
biv_model$e$free[2, 1] <- FALSE

biv_model <- mxRun(biv_model)

## It works fine now

get_output_tables(biv_model)

biv_model <- ref_models(biv_model)
fit_stats(biv_model)

Now it runs fine. The non-shared environmental correlation was sussessfully eliminated. The chi-square test shows significant difference between the model and the corresponding saturated model but CFI, TLI and RMSEA suggest good fit.

The output tables of a bivariate/multivatiate ACE model:

Notes:

  1. The models with categorical outcomes: different fit function and fit statistics may be needed.
  2. In multivariate models we can achieve genetic and environmental correlations with opposite sign. This means that one factor will contribute to phenotypic relationship positively and another negatively. In this case we have to treat percents with caution. I use absolute value of correlations to compute percents.
  3. If the model has not been fit, get_output_tables() will throw an error.
  4. Also, check the package umx that provides a lot of functionality for SEM and twin analysis.
  5. There is a presentation that explains the basics of twin method and OpenMx and demonstrates the functionality of TwinAnalysis and mlth.data.frame.
file.copy(from = 'vignettes/twin-analysis-workflow.html',
          to = 'docs/index.html',
          overwrite = TRUE)


IvanVoronin/TwinAnalysis documentation built on July 24, 2024, 9:36 p.m.