family_sim_compare: Compare simulated and observed inferred genetic relationships

View source: R/family_sim_compare.R

family_sim_compareR Documentation

Compare simulated and observed inferred genetic relationships

Description

This function can help compare observed estimates of relatedness to expected values of relatedness for different familial relationships given a set of loci and population alelle frequencies. It returns a data table combining the simulated and observed estiamtes and a plot of density curves overlaying the distribution of observed relatedness estimates on top of simulated values.

Usage

family_sim_compare(
  simGRM,
  obsGRM,
  plotColous = NULL,
  look = "ggplot",
  legendPos = "right",
  curveAlpha = 0.7,
  curveFill = NULL,
  curveOutline = NULL,
  facetFamily = FALSE
)

Arguments

simGRM

Matrix: The simulated GRM using the individuals generated from family_sim_genos. See Details.

obsGRM

Matrix: The observed GRM using the observed individuals. See Details.

look

Character: The look of the plot. Default = 'ggplot', the typical gray background with gridlines produced by ggplot2. Alternatively, when set to 'classic', produces a base R style plot.

legendPos

Character: Where should the legend be positioned? Default is 'top', but could also be one of, 'right', 'bottom', 'left', or 'none'.

curveAlpha

Numeric: A value between 0 and 1 to set the transparency of the density curves. Default = 0.7.

curveFill

Character: A vector of colours to fill density curves, but is an optional argument. Default = NULL. If specified, must be a length of 5, with colours corresponding to the levels 'Unrelated', 'Cousins', 'Half-siblings', 'Siblings', and 'Observed', in that order.

curveOutline

Character: A vector of colours to for density curve outlines, but is an optional argument. Default = NULL. If specified, must be a length of 5, with colours corresponding to the levels 'Unrelated', 'Cousins', 'Half-siblings', 'Siblings', and 'Observed', in that order.

facetFamily

Logical: Should the plot be faceted by family relationship? Default is FALSE.

numSims

Integer: The number of simulated individuals for each family relationship. Default is 100.

Details

The function takes the output of family_sim_genos and additional calculations of the genetic relationiship matrix (GRM) that are performed by the user on the simulated and observed individuals.

The GRMs for arguments simGRM and obsGRM need to be created by the user with whatever program they want to use to calculate pairwise relatedness among individuals. The same function call should be used ob both datasets. The GRM should be a square matrix with the relatedness of individuals to themselves on the diagonal, and their relatedness to other individuals on the off-diagonal.

Value

Returns two objects. The first is data.table with the following columns:

  1. $SIM, the simulation number for pairs of simulated individuals, or 'NA' for the pairs of observed individuals.

  2. $SAMPLE1, the sample ID for the first individual.

  3. $SAMPLE2, the sample ID for the second individual.

  4. $FAMILY, the familial relationship for simulated individuals.

  5. $RELATE, the estimated relatedness.

The second is a ggplot object which plots density curves for the estimated relatedness values calculated for simulated pairs of unrelated individuals, cousins, half-siblings, and siblings, with the observed relatedness values from the user's dataset overlayed. Dashed lines are used to demarcate the theoretical expected relatedness values for unrelated individuals (0), cousins (0.125), half-siblings (0.25), and siblings (0.5).

Examples

library(genomalicious)
data(data_Genos)

# Subset Pop1 genotypes
genosPop1 <- data_Genos[POP=='Pop1', c('SAMPLE', 'LOCUS', 'GT')]

# Get the allele frequencies for Pop1
freqsPop1 <- genosPop1[, .(FREQ=sum(GT)/(length(GT)*2)), by=LOCUS]

# Simulate 100 families
simFamily <- family_sim_genos(
   freqData=freqsPop1,
   locusCol='LOCUS',
   freqCol='FREQ',
   numSims=100
)

# Create some siblings in Pop1 from two sets of parents
parentList <- list(c('Ind1_1','Ind1_2'), c('Ind1_3','Ind1_4'))
genosSibs <- lapply(1:2, function(i){
  parents <- parentList[[i]]

  child <- paste(sub('Ind1_', '', parents), collapse='.')

  gamete1 <- genosPop1[SAMPLE == parents[1]] %>%
    .[, .(GAMETE=rbinom(n=2,size=1,prob=GT/2)), by=c('SAMPLE','LOCUS')] %>%
    .[, SIB:=1:2, by=LOCUS]

  gamete2 <- genosPop1[SAMPLE == parents[2]] %>%
    .[, .(GAMETE=rbinom(n=2,size=1,prob=GT/2)), by=c('SAMPLE','LOCUS')] %>%
    .[, SIB:=1:2, by=LOCUS]

  rbind(gamete1, gamete2) %>%
    .[, .(GT=sum(GAMETE)), by=c('LOCUS','SIB')] %>%
    .[, SAMPLE:=paste0('Child_',child,'_',SIB)]
}) %>%
  do.call('rbind', .)

### THE OBSERVED GENOMIC RELATIONSHIPS MATRIX
library(AGHmatrix)

# Combine the population samples and the created siblings
# into a single genotype matrix
obsGenosMat <- rbind(genosPop1, genosSibs[, c('SAMPLE','LOCUS','GT')]) %>%
  DT2Mat_genos()

# Calculate the GRM
obsGRM <- Gmatrix(obsGenosMat, method='Yang', ploidy=2)

### THE SIMULATED GENOMIC RELATIONSHIPS MATRIX
# Convert simulated families into a genotype matrix
simGenosMat <- DT2Mat_genos(simFamily$focal.pairs)

# Calculate the GRM
simGRM <- Gmatrix(simGenosMat, method='Yang', ploidy=2)

### COMPARE THE OBSERVED AND SIMULATED
relComp <- family_sim_compare(
   simGRM=simGRM,
   obsGRM=obsGRM,
   look='classic'
)

# The data
relComp$data

# Simulated dataset
relComp$data[!is.na(SIM)]

# The observed dataset
relComp$data[is.na(SIM)]

# Plot of relatedness values. Dashed lines denote relatedness
# values of 0, 0.0625, 0.125, 0.25, and 0.5, which are the theoretical
# expectations for unrelated individuals, half-cousins, cousins, half-siblings,
# and siblings, respectively.
# You will note a large variance are the expected values, which
# is not surprising for this very small SNP dataset (200 loci).
relComp$plot

# Take a look at the "known" relationships in the observed dataset using the
# offspring we created.
# Note, siblings and parent-offspring pairs have a theoretical
# relatedness of 0.5. But you will probably find the "observed"
# relatedness might be lower.
relComp$data[SAMPLE1=='Child_1.2_1' & SAMPLE2%in%c('Child_1.2_2','Ind1_1','Ind1_2')]
relComp$data[SAMPLE1=='Child_3.4_1' & SAMPLE2%in%c('Child_3.4_2','Ind1_3','Ind1_4')]

# Now take a look at the simulated distribution: the possible values for
# siblings given this dataset are quite wide.
relComp$data[FAMILY=='Half-siblings']$RELATE %>% summary()
relComp$data[FAMILY=='Siblings']$RELATE %>% summary()


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.