knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )

## copied from examples from the "simmer" R package ## after: https://www.enchufa2.es/archives/suggests-and-vignettes.html ## by Iñaki Úcar required <- c("bnpsd", "popkin", "kinship2", "RColorBrewer") if (!all(sapply(required, requireNamespace, quietly = TRUE))) knitr::opts_chunk$set(eval = FALSE)

The `simfam`

package is for constructing large random families---with population studies in mind---with realistic constraints including avoidance of close relative pairings but otherwise biased for closest pairs in a 1-dimensional geography.
There is also code to draw genotypes across the pedigree starting from genotypes for the founders.
Our model allows for the founders to be related or structured---which arises in practice when there are different ancestries among founders---and given known parameters for these founders (including known kinship matrices and ancestry proportions) we provide code to calculate the true kinship and expected admixture proportions of all descendants.

This vignette focuses on two examples.
One is small, necessary to actually visualize the pedigree.
The second is larger, where the pedigree cannot be fully visualized but whose population parameters are more realistic.
This vignette draws from other general packages to visualize the data (`popkin`

and `kinship2`

), as well as to construct/simulate structured founders (`bnpsd`

, which is based on the admixture model).

Let's start with a 3-generation pedigree with 15 individuals per generation. (The code can also handle variable numbers of individuals per generation, which is not demonstrated in this vignette for simplicity).

library(simfam) # data dimensions n <- 15 # number of individuals per generation G <- 3 # number of generations # the result changes in every run unless we set the seed set.seed(1) # draw the random pedigree with those properties data <- sim_pedigree( n, G ) # save these objects separately fam <- data$fam ids <- data$ids kinship_local_G <- data$kinship_local

The function returns a list with two objects.
The first element in the list is the most important: `fam`

is a table that describes the pedigree in a standard notation in genetics research: the plink FAM format.
Every row corresponds to one individual, and columns name individuals (`id`

), their parents (`pat`

and `mat`

, founders have parents set to `NA`

), and their `sex`

(`1`

=male, `2`

=female).
The FAM table format also requires a family ID (`fam`

) and a phenotype (`pheno`

) column, both of which take on dummy values in the output of `sim_pedigree`

:

# the top of the table head( fam ) # and the bottom tail( fam )

The IDs (columns `id`

, `pat`

, `mat`

) are formatted as `g-i`

, where the first number is the generation and the second number is an index within the generation (1 to `n`

).
Every individual belongs to a generation in the simulation (founders are the first generation), and only individuals in the same generation may be paired.
Sex is drawn at random for each individual prior to pairing.
Every individual has a unique index within the generation, that places them in a 1D geography---an important detail since pairings are strongly biased to be between closest individuals.
Founders are ordered as they are constructed, and children are ordered according to the mean index of their parents.
Pairings occur randomly and iteratively: while there are available male-female pairs, a male is drawn at random from the available males and he is paired with the closest unpaired female that is not a close relative (by default they must be less related than second cousins, i.e. have a pedigree kinship coefficient of less than `1/4^3`

).

The second element in the return list of `sim_pedigree`

is `ids`

, the list of IDs separated by generation.
This is very handy as often we want to copy the names of the founders (`ids[[ 1 ]]`

) to other objects, or want to filter data to contain the last generation only (`ids[[ G ]]`

).

The third element in the return list of `sim_pedigree`

is the local (i.e., pedigree-based) kinship matrix of the individuals.
This data can be recalculated from the `fam`

table alone, so it is redundant, but it is calculated because the pairing algorithm requires it to avoid close relatives, so it is returned mostly because it's there already.
By default, only individuals in the last generation are included in the matrix that is returned, so this is a `n * n`

(here 15 x 15) matrix:

kinship_local_G

This kinship matrix is later visualized more conveniently as a heatmap using the `popkin`

package.
If you're interested, a list of local kinship matrices (one per generation) is returned in place of this if `sim_pedigree`

is called with the `full = TRUE`

option.

We use the excellent `kinship2`

package to visualize our randomly-constructed pedigree:

# load the package library(kinship2) # Convert the `fam` table into a `pedigree` object required by the `kinship2` package. obj <- pedigree( id = fam$id, dadid = fam$pat, momid = fam$mat, sex = fam$sex, affected = fam$pheno ) # plot the pedigree plot( obj, cex = 0.7 )

In these kinds of pedigree plots, node shapes represent sex (box=male, circle=female), and the dashed curved lines (if any) represent individuals copied in two places (once next to its parents, the second time next to its mate) to simplify the plot (the two nodes connected by a dashed line have the same individual ID). It is immediate clear that only individuals with opposite sex were paired, and no close relatives were paired (i.e. no siblings or first or second cousins).

You will notice several things from this random pedigree. One is that some individuals do not get paired in each generation, which happens because there are too many constraints: the number of males and females is random so they may be unequal, and close-relative avoidance further limits pairings. However, this problem is exacerbated in this toy example because the population is small; in a large population most people are paired (see further below). Here, founders that weren't paired are not plotted (see message below figure). Pairings are always between people of the same generation (first number in ID), and usually but not always between people with close indexes (second number in ID).

When there are pairings, `sim_pedigree`

enforces a minimum number of children per family (default 1).
Each generation has exactly the desired number of total individuals (15 per generation in this example), which is achieved by randomly drawing family sizes from a modified Poisson distribution (the number of children in excess of the minimum is drawn from Poisson with target mean per family minus minimum size) followed by smaller random adjustments in the direction of the discrepancy until the target population size is met.

If we are only interested in the last generation, then individuals from previous generations that were unpaired are irrelevant, since for example they don't determine the genotypes of the last generation.
So for simplicity or efficiency, we may want to remove those individuals without descendants.
We do this with the function `prune_fam`

!

# replace original with pruned FAM table # containing only ancestors of last generation fam <- prune_fam( fam, ids[[ G ]] ) # inspect fam # and plot pedigree obj <- pedigree( id = fam$id, dadid = fam$pat, momid = fam$mat, sex = fam$sex, affected = fam$pheno ) plot( obj, cex = 0.7 )

Inspection of the top of the table shows that founders that were originally unpaired are now removed (the plot also doesn't report people not plotted). The plot further shows that many individuals in the second generation (childless uncles and aunts of the last generation) were also removed, but nobody in the third generation was removed (as desired).

We will draw genotypes in two ways to reflect the traditional (unrelated founders) and the intended use of `simfam`

: structured founders.
Let's do unrelated founders first.
Note that, to keep this vignette runtime low, the number of loci is much lower than any real dataset, so data will look artificially noisier here.

# number of loci to draw m <- 5000 # draw uniform ancestral allele frequencies p_anc <- runif( m ) # draw independent (unstructured) genotypes for founders (includes founders without descendants) X_1 <- matrix( rbinom( n * m, 2, p_anc ), nrow = m, ncol = n ) # `simfam` requires names for the founders on `X_1`, to validate identities and align colnames( X_1 ) <- ids[[ 1 ]]

Before moving on, it's important to understand that these founders have a kinship matrix, albeit a trivial one, with zero kinship (unrelated) between individuals and a self-kinship of 1/2 (expected for outbred individuals).
We can verify this directly using a kinship estimator applied to the genotypes we just drew.
We use `popkin`

, which is the only unbiased kinship estimator.
`popkin`

also provides heatmap visualization of kinship matrices via `plot_popkin`

.

library(popkin) # estimate kinship kinship_est_1 <- popkin( X_1 ) # true/expected kinship kinship_1 <- diag( n ) / 2 # assign same IDs as `X_1` and `fam`; `simfam` generally requires these to validate/align colnames( kinship_1 ) <- colnames( X_1 ) rownames( kinship_1 ) <- colnames( X_1 ) # plot them together for comparison plot_popkin( list( kinship_1, kinship_est_1 ), titles = c('Expected', 'Estimated'), names = TRUE, leg_width = 0.2, mar = c(3, 2) )

Now we draw genotypes across the pedigree using `geno_fam`

!
Since we have genotypes of the founders, we can simulate the random inheritance of alleles to every descendant, which produces the correct pedigree-based kinship/covariance structure at every locus.
However, this code draws loci independently from each other (there's no linkage disequilibrium).
We also construct the total expected kinship across this pedigree with a similar function, `kinship_fam`

.

X <- geno_fam( X_1, fam ) # expected kinship of entire pruned pedigree, starting from true kinship of founders. # Called "local" kinship because it ignores the ancestral population structure (will be reused) # NOTE: agrees with `kinship2::kinship( fam )`, but only because founders are unstructured kinship_local <- kinship_fam( kinship_1, fam )

We can view and validate the resulting data in several ways.
The more complete validation is to estimate kinship for all `X`

(all 3 generations) and see that it agrees with `kinship_fam`

:

# estimate kinship kinship_local_est <- popkin( X ) # plot them together for comparison plot_popkin( list( kinship_local, kinship_local_est ), titles = c('Expected', 'Estimated'), names = TRUE, names_cex = 0.8, leg_width = 0.2, mar = c(3, 2) )

Above you can see that only founders in the pruned FAM table were kept in the output.
However, in a real dataset previous generations might not be available (especially if `G`

were much larger than 3), so we might only have the genotype data of a subset of more recent relatives.
Here we subset to keep the last generation, and compare not only to `kinship_fam`

, but also to the original `kinship_local_G`

that was calculated by `sim_pedigree`

to avoid pairing close relatives.

# indexes (really logicals) marking individuals in last generation, # to subset data (which is aligned with `fam`) indexes_G <- fam$id %in% ids[[ G ]] # estimate kinship kinship_local_est <- popkin( X ) # plot them together for comparison plot_popkin( list( kinship_local[ indexes_G, indexes_G ], kinship_local_est[ indexes_G, indexes_G ], kinship_local_G ), titles = c('Expected', 'Estimated', 'Local'), names = TRUE, mar = c(3, 2) )

Now we simulate data from the much more interesting case, where founders have differing mixed ancestries!
This is much more like real human data.
For this we use another external package, `bnpsd`

, to construct the admixture parameters of the founders, which also includes calculating their true population kinship matrix.

library(bnpsd) # additional admixture parameters: # number of ancestries K <- 3 # define population structure # FST values for k=3 subpopulations inbr_subpops <- c(0.2, 0.4, 0.6) # admixture proportions from 1D geography (founders) admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 ) # add founder names to `admix_proportions_1` (will propagate to other vars of interest) rownames( admix_proportions_1 ) <- ids[[ 1 ]] # also name subpopulations simply as S1, S2, S3 colnames( admix_proportions_1 ) <- paste0( 'S', 1:K ) # calculate true kinship matrix of the admixed founders # NOTE: overwrites `kinship` for unstructured founders kinship_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) ) # draw genotypes from admixture model # NOTE: overwrites `X_1` for unstructured founders X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X

We can again verify the population kinship of the founders, which is no longer full of zeroes but rather takes on continuous values that depend on the admixture proportions of each pair of founders:

# estimate kinship (NOTE: overwrites `kinship_est_1` for unstructured founders) kinship_est_1 <- popkin( X_1 ) # plot them together for comparison plot_popkin( list( kinship_1, kinship_est_1 ), titles = c('Expected', 'Estimated'), names = TRUE, leg_width = 0.2, mar = c(3, 2) )

Note that, unlike the corresponding demonstration in the `bnpsd`

package, here the diagonal plots self-kinship values instead of inbreeding coefficients, as kinship matrices are the central focus of this `simfam`

package.

We draw genotypes across the pedigree the same way as before:

X <- geno_fam( X_1, fam ) # expected kinship of entire pruned pedigree, starting from true kinship of founders # NOTE: DOESN'T agree with `kinship2::kinship( fam )` anymore because founders are structured! kinship_total <- kinship_fam( kinship_1, fam )

We again compare the expected kinship matrix to the estimate from genotypes:

# estimate kinship kinship_total_est <- popkin( X ) # plot them together for comparison plot_popkin( list( kinship_total, kinship_total_est ), titles = c('Expected', 'Estimated'), names = TRUE, names_cex = 0.8, leg_width = 0.2, mar = c(3, 2) )

Lastly, we again subset individuals in the last generation only and compare their *total* to their *local* kinship, which are related but different things when a population is structured.
Although undoubtedly the *total* kinship (panels A-B below) capture much of the same correlation patterns as the *local* kinship (panel C), the total kinship captures additional covariance due to the structure of the founders that the local kinship ignores (the local kinship uses only the pedigree information and ignores the ancestry differences of the founders).

plot_popkin( list( kinship_total[ indexes_G, indexes_G ], kinship_total_est[ indexes_G, indexes_G ], kinship_local_G ), titles = c('Total Expected', 'Total Estimated', 'Local'), names = TRUE, mar = c(3, 2) )

In this admixture model the structure comes from fractional shared ancestry due to admixture as well as the relatedness between ancestries. We can look at these relatedness components in isolation from the pedigree and see that their contribution to the total kinship matrix is nearly additive (will explain more precisely shortly)! First propagate the admixture proportions across the pedigree, under the simple model that the ancestry of a child is the average of the ancestry of its parents:

# admixture proportions of pruned family admix_proportions <- admix_fam( admix_proportions_1, fam ) # visualize as bar plot # for nice colors library(RColorBrewer) # colors for independent subpopulations col_subpops <- brewer.pal( K, "Paired" ) # shrink default margins par_orig <- par(mar = c(4, 4, 0.5, 0) + 0.2) barplot( t( admix_proportions ), col = col_subpops, legend.text = TRUE, args.legend = list( cex = 0.7, bty = 'n' ), border = NA, space = 0, ylab = 'Admixture prop.', las = 2 ) mtext('Individuals', side = 1, line = 3) par( par_orig ) # reset `par`

The above figure shows that admixture proportions in the first generation vary a lot: for example, 1-1 has a large proportion of S1 ancestry, while 1-14 has mostly S3 instead. In contrast, in the third generation ancestry is much more averaged out, with fewer extremes. This effect will be weaker in larger populations, an example we go into shortly.

This code calculates the kinship expected from the admixture structure only, which will be visualized shortly.

# this is the admixture-only relatedness kinship_admix <- coanc_to_kinship( coanc_admix( admix_proportions, inbr_subpops ) )

We also demonstrate a good approximation of the total kinship that treats the structural (here admixture) and pedigree relatedness (in this case treating founders as unstructured) as independent effects, which roughly says that the kinship components satisfy (pseudocode):

# These quantities are independent (1 - total) = (1 - struct) * (1 - family). # solved for total, reveals near additivity: total = struct + family - struct * family.

So when both the structural and family kinship values are much smaller than 1, their sum is a good match of the total kinship.
Only when these values approach 1 does it become necessary to subtract their product, which is necessary since all of these values are probabilities and the total kinship cannot exceed 1.
The same applies to inbreeding coefficient, but if we start from kinship matrices we need to transform the diagonal values back and forth (since, in pseudocode, `kinship[i,i] = ( 1 + inbreeding[i] ) / 2`

).
We plot all of the matrices and verify that the approximation is very good, so the conceptual partitioning into structural and family effects is justified:

# this is the approximation of the total kinship: # NOTE: a bit complicated because diagonal has to be transformed. # Approximation operates on inbreeding coefficients, not self-kinship, # so `popkin::inbr_diag` converts self-kinship to inbreeding, # `bnpsd::coanc_to_kinship` converts inbreeding back to self-kinship. # Off-diagonal values are unmodified by both functions. kinship_total_approx <- coanc_to_kinship( inbr_diag( kinship_local ) + inbr_diag( kinship_admix ) - inbr_diag( kinship_admix ) * inbr_diag( kinship_local ) ) # plot all model kinship matrices (no estimates from genotypes this time) plot_popkin( list( kinship_admix, kinship_local, kinship_total, kinship_total_approx ), titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'), names = TRUE, names_cex = 0.8, leg_width = 0.2, layout_rows = 2, mar = c(3, 2) )

Here we will simulate a much larger population size and a pedigree with a larger number of generations, to produce data that may better resemble real human data.
We will make no attempt of visualize pedigrees or show individual labels.
The whole code, redundant with our earlier analysis, is presented with minor explanations.
Note, however, use of the newly-introduced functions `geno_last_gen`

, `kinship_last_gen`

, and `admix_last_gen`

to directly construct data for the last generation only, which also saves memory compared to the more general `geno_fam`

, `kinship_fam`

, and `admix_fam`

shown earlier.

library(simfam) # data dimensions n <- 500 # number of individuals per generation G <- 10 # number of generations K <- 3 # number of ancestries m <- 5000 # number of loci # draw the random pedigree with those properties. data <- sim_pedigree( n, G ) fam <- data$fam ids <- data$ids # use this local kinship calculation in plot kinship_local_G <- data$kinship_local # prune pedigree to speed up simulations/etc fam <- prune_fam( fam, ids[[ G ]] ) ### admixture model # define population structure # FST values for k=3 subpopulations inbr_subpops <- c(0.2, 0.4, 0.6) # admixture proportions from 1D geography (founders) admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 ) # add founder names to `admix_proportions_1` (will propagate to other vars of interest) rownames( admix_proportions_1 ) <- ids[[ 1 ]] # also name subpopulations simply as S1, S2, S3 colnames( admix_proportions_1 ) <- paste0( 'S', 1:K ) # draw genotypes from admixture model # admixed founders X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X # draw genotypes, returning last generation only X_G <- geno_last_gen( X_1, fam, ids ) # total kinship estimated from genotypes kinship_total_G_est <- popkin( X_G ) # calculate true total kinship matrix kinship_total_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) ) kinship_total_G <- kinship_last_gen( kinship_total_1, fam, ids ) # kinship due to admixture only admix_proportions_G <- admix_last_gen( admix_proportions_1, fam, ids ) kinship_admix_G <- coanc_to_kinship( coanc_admix( admix_proportions_G, inbr_subpops ) ) # total kinship approximation from components kinship_total_approx_G <- coanc_to_kinship( inbr_diag( kinship_local_G ) + inbr_diag( kinship_admix_G ) - inbr_diag( kinship_admix_G ) * inbr_diag( kinship_local_G ) )

The first important validation is that the total structure (estimated from genotypes) agrees with the theoretical calculations based on the pedigree and the known founder total kinship. For simplicity we visualize the last generation only:

plot_popkin( list( kinship_total_G, kinship_total_G_est ), titles = c('Expected', 'Estimated'), leg_width = 0.2, mar = c(3, 2) )

The next demonstration is that the component partitioning approximation works again (limited to last generation):

# plot all model kinship matrices (no estimates from genotypes this time) plot_popkin( list( kinship_admix_G, kinship_local_G, kinship_total_G, kinship_total_approx_G ), titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'), leg_width = 0.2, layout_rows = 2, mar = c(3, 2) )

Lastly, let's determine how many individuals in each generation were ancestors of individuals in the last generation. The bar plot below shows that, in this large simulation, most individuals in the first generation (founders), and in every subsequent generation, had descendants in the last generation.

# the IDs in `fam` (the pruned FAM table) contain the individuals of interest frac_per_gen <- vector('numeric', G) for ( g in 1 : G ) { # this is the desired fraction frac_per_gen[g] <- sum( fam$id %in% ids[[ g ]] ) / n } # visualize as a barplot barplot( frac_per_gen, names.arg = 1 : G, xlab = 'Generation', ylab = 'Frac. individuals with descendants' ) # add a line marking the maximum fraction of individuals per generation abline( h = 1, lty = 2, col = 'red' )

All previous examples had independent loci. Here we use new functions to simulate recombination breaks and therefore a realistic linkage disequilibrium (LD) structure consistent with the pedigree.

For comparison, let's start from the same pedigree used in the last "large" example. Here we shall simulate the first two human autosomes (coordinates in genome version hg38) with a real recombination map (slightly simplified) provided with this package. The steps are lengthy primarily due to the non-trivial construction of ancestral haplotypes.

# real human recombination map is also used to define chromosome numbers and lengths # however, to keep example brief we use only first two chromosomes map <- recomb_map_hg38[ 1L:2L ] # tricky part is need to define ancestral haplotypes, following the desired genetic structure # will simulate each chromosome separately haplos_anc <- vector( 'list', length( map ) ) for ( chr in 1L : length( map ) ) { # Positions matter for genetic map, so for this example choose random positions with a # desired density, here 100 per Kb on average, keep it sparse for small example pos_max <- max( map[[ chr ]]$pos ) m_loci_chr <- round( pos_max / 100000L ) pos_chr <- sample.int( pos_max, m_loci_chr ) # Draw haplotypes. Here we use admixture code to get genotypes # with same structure as before for first generation X_chr <- draw_all_admix( admix_proportions_1, inbr_subpops, m_loci_chr )$X # and randomly split genotypes into haplotypes # (there was no LD/phase so this works well enough for our purposes) H1_chr <- matrix( rbinom( X_chr, 1L, X_chr/2L ), nrow = m_loci_chr ) # second haplotype is complement of first given genotypes H2_chr <- X_chr - H1_chr # column names are required for code later, identifying ancestors (need _pat/mat suffixes) colnames( H1_chr ) <- paste0( colnames( X_chr ), '_pat' ) colnames( H2_chr ) <- paste0( colnames( X_chr ), '_mat' ) # final data all together X_chr <- cbind( H1_chr, H2_chr ) # add to structure, in a list haplos_anc[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # initialize trivial founders chromosomes (unrecombined chromosomes with unique labels) # must use IDs of first generation from pedigree as input founders <- recomb_init_founders( ids[[ 1 ]], map ) # draw recombination breaks along pedigree, with coordinates in genetic distance (centiMorgans) inds <- recomb_last_gen( founders, fam, ids ) # map recombination break coordinates to base pairs inds <- recomb_map_inds( inds, map ) # determine haplotypes of descendants given ancestral haplotypes haplos <- recomb_haplo_inds( inds, haplos_anc ) # and reduce haplotype data to genotypes, same standard matrix format as previous examples X_G_recomb <- recomb_geno_inds( haplos ) # total kinship estimated from genotypes (simulated with recombination) kinship_total_G_recomb_est <- popkin( X_G_recomb )

First we validate that the kinship structure of these genotypes with recombination matches the earlier estimate from the simulation with independent loci as well as the expected kinship from the pedigree. Below it can be seen that the coarse structure is the same, although the data with recombination appears noisier, as expected because LD increases estimation variance. Simulating more chromosomes would reduce the variance considerably.

plot_popkin( list( kinship_total_G, kinship_total_G_est, kinship_total_G_recomb_est ), titles = c('Expected', 'Estimated (indep loci)', 'Estimated (recomb)'), mar = c(3, 2) )

Now let's confirm the presence of LD. The below plot shows that genotype correlations are noisy overall (because the number of individuals is small in our toy example) but nevertheless, correlations within the two chromosomes in the simulation with recombination (two diagonal blocks in panel B below) are noticeably larger than correlations between chromosomes, as well as the correlations in the earlier simulation with independent loci (panel A).

# Notice that the number of loci is similar for both simulations # (rows of the genotype matrices) dim( X_G ) dim( X_G_recomb ) # Calculate correlation matrices between loci (hence transposition). # Exclude fixed loci to prevent warnings about zero standard deviations. ld_G <- abs( cor( t( X_G[ !fixed_loci( X_G ), ] ) ) ) ld_G_recomb <- abs( cor( t( X_G_recomb[ !fixed_loci( X_G_recomb ), ] ) ) ) # cap extreme values, to be able to see the relatively weak LD signal ld_max <- 0.3 ld_G[ ld_G > ld_max ] <- ld_max ld_G_recomb[ ld_G_recomb > ld_max ] <- ld_max # plot! plot_popkin( list( ld_G, ld_G_recomb ), titles = c('Indep loci', 'Recomb'), leg_width = 0.2, mar = c(3, 2), leg_title = 'Absolute Correlation', ylab = 'Loci' )

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.