Introduction to segtest Version 2.0
In segtest: Tests for Segregation Distortion in Polyploids

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(1)

library(segtest)

The segtest package offers a suite of tools for testing segregation distortion in F1 polyploid populations across diverse meiotic models. These methods support autopolyploids (full polysomic inheritance), allopolyploids (full disomic inheritance), and segmental allopolyploids (partial preferential pairing). Double reduction is optionally modeled fully in tetraploids and partially (at simplex loci only) in higher ploidies. A user-specified maximum proportion of outliers allows the method to accommodate moderate double reduction at non-simplex loci. Offspring genotypes may be known or modeled using genotype likelihoods to account for genotype uncertainty. Parent data may or may not be provided, at your option. Parents can have different (even) ploidies, at your option. Details of the methods may be found in Gerard et al. (2025a) and Gerard et al. (2025b).

Additional functions include those that generate gamete and genotype frequencies under different models of meiosis, functions that simulate genotype (log) likelihoods, and "competing" tests for segregation distortion.

The main functions are:

seg_multi(): Run the likelihood ratio test for segregation distortion in parallel at many loci.
multidog_to_g: Format the genotyping output from updog::multidog() to be compatible with the input of seg_multi().
seg_lrt(): Test for segregation distortion for any even ploidy.
gamfreq(): Gamete frequencies.
gf_freq(): Genotype frequencies of an F1 population of polyploids.
drbounds(): Upper bounds on the double reduction rate(s) based on two different extreme models of meiosis.
simgl(): Simulate genotype log-likelihoods given a vector of genotype counts.

Gamete genotype frequencies

gamfreq() will generate gamete frequencies under different models of meiosis. gf_freq() will generate genotype frequencies under the same models by convolving the output from gamfreq(). We focus on gamfreq(), as gf_freq() uses the same models but applied separately to each parent.

For autopolyploids, specify type = "polysomic" and, optionally, the amount of double reduction via alpha. alpha is a vector of length floor(ploidy / 4) where element i is the probability a gamete has i pairs of identical by double reduction alleles. The upper bounds for alpha can be found via drbounds(). E.g., for a parental octoploid with genotype 4 with no and moderate levels of double reduction:

drbounds(ploidy = 8) ## DR bounds
gamfreq(g = 4, ploidy = 8, type = "polysomic") ## no DR
gamfreq(g = 4, ploidy = 8, alpha = c(0.1, 0.01), type = "polysomic") ## Some DR

For allopolyploids, the possible gamete frequencies can be found in seg. E.g., for a parental octoploid with genotype 4, the possible gamete frequencies are

seg[seg$ploidy == 8 & seg$g == 4 & seg$mode %in% c("disomic", "both"), "p"]

Note that you also need to filter for the mode to be either "disomic" or "both" (both disomic and polysomic). The total number of possible allopolyploid distributions is n_pp_mix().

n_pp_mix(g = 4, ploidy = 8)

You can specify one of these distributions via a 1-of-3 vector. E.g.

gamfreq(g = 4, ploidy = 8, gamma = c(1, 0, 0), type = "mix")
gamfreq(g = 4, ploidy = 8, gamma = c(0, 1, 0), type = "mix")
gamfreq(g = 4, ploidy = 8, gamma = c(0, 0, 1), type = "mix")

Segmental allopolyploids are mixtures of the possible allopolyploid segregation distributions. E.g., an equal mixture of the three for an octoploid with genotype 4 is

gamfreq(g = 4, ploidy = 8, gamma = c(1, 1, 1)/3, type = "mix")

At simplex, loci, there is only one possible allopolyploid segregation distribution:

n_pp_mix(g = 1, ploidy = 8)
gamfreq(g = 1, ploidy = 8, gamma = 1, type = "mix")
n_pp_mix(g = 7, ploidy = 8)
gamfreq(g = 7, ploidy = 8, gamma = 1, type = "mix")

You can account for double reduction at these loci by including beta. The upper bound of which can be found via beta_bounds().

beta_bounds(ploidy = 8)
gamfreq(g = 1, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")
gamfreq(g = 7, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")

Simulating data

Let's suppose we have some genotype frequencies we want to simulate individual data from:

gf <- gf_freq(
  p1_g = 2, 
  p1_ploidy = 6,
  p1_gamma = c(0.7, 0.3), 
  p1_type = "mix",
  p2_g = 4,
  p2_ploidy = 6, 
  p2_gamma = c(0.5, 0.5), 
  p2_type = "mix")
plot(gf, type = "h", xlab = "Genotype", ylab = "Frequency")

To simulate genotype counts, just use multinom() from the stats package. Let's simulate data from 10 individuals.

x <- c(rmultinom(n = 1, size = 10, prob = gf))
x

To simulate genotype (log) likelihoods, insert these genotype counts into simgl().

gl <- simgl(nvec = x)
gl

Testing for segregation distortion

You can test for segregation distortion using seg_lrt(). E.g., let's test for it using the data (both known genotypes and genotype likelihoods) we simulated from the previous section:

## With known genotypes
sout1 <- seg_lrt(x = x, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout1$p_value
## With genotype likelihoods
sout2 <- seg_lrt(x = gl, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout2$p_value

My recommendation is to always use the genotype log-likelihoods. But seg_lrt() allows for known genotypes, if that situation works best for you.

The default (model = "seg") is to assume your organism is a segmental allopolyploid, and to account for possible double reduction at simplex loci. But you should absolutely use other models if you have more information on your organism:

"seg": General segmental allopolyploid with possible double reduction at simplex loci.
"allo_pp": Segmental allopolyploid with complete bivalent pairing (no double reduction).
"allo": Pure allopolyploid (disomic inheritance).
"auto": Pure autopolyploid with complete bivalent pairing (no double reduction).
"auto_dr": Pure autopolyploid with possible multivalent pairing (some double reduction).
"auto_allo": Same null hypothesis as in polymapR.

We allow for some non-valid genotypes via the ob argument. This is the upper bound on the proportion of outliers. By default, this is set to 0.03. You can set this to 0 (or set outlier = FALSE) if you want any outliers to indicate segregation distortion.

Make sure that the log-likelihoods are base $e$. If they are base 10, you'll get the wrong $p$-value:

gl10 <- gl / log(10)
seg_lrt(x = gl10, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)$p_value

Don't mess with the technical arguments (ntry, opt, optg, df_tol). These have to do with the optimization and how to approximate the degrees of freedom of the test. Except possibly ntry. You could increase that if you are seeing weird results. But then let me know, because I haven't seen any bad behavior with ntry = 3 (the default).

References

Gerard D, Thakkar M, & Ferrão LFV (2025a). "Tests for segregation distortion in tetraploid F1 populations." Theoretical and Applied Genetics, 138(30), p. 1--13. doi:10.1007/s00122-025-04816-z.

Gerard, D, Ambrosano, GB, Pereira, GdS, & Garcia, AAF (2025b). "Tests for segregation distortion in higher ploidy F1 populations." bioRxiv, p. 1--20. bioRxiv:2025.06.23.661114