knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are two main linear selection indices employed in marker-assisted selection (MAS) to predict the net genetic merit and to select individual candidates as parents for the next generation: the linear marker selection index (LMSI) and the genome-wide LMSI (GW-LMSI).
Both indices maximize the expected genetic gain per trait by combining molecular and phenotypic data. The LMSI requires a prior step of selecting markers significantly linked to the quantitative trait loci (QTL) to calculate a "marker score". The GW-LMSI is a single-stage procedure that treats information at each individual marker as a separate trait, entering all markers and phenotypic information into a single index.
In this vignette, we demonstrate how to apply these indices using the selection.index package on phenotypic (maize_pheno) and genotypic (maize_geno) data.
We use the synthetic maize datasets built into the package to calculate the phenotypic, genotypic, marker, and covariance matrices necessary for molecular index modeling.
library(selection.index) # Load the built-in maize phenotype and genotype datasets data("maize_pheno") data("maize_geno") # 1. Prepare Phenotypic Data # We select three traits for our demonstration traits <- c("Yield", "PlantHeight", "DaysToMaturity") phen_mat <- as.matrix(maize_pheno[, traits]) # Calculate Genotypic (gmat) and Phenotypic (pmat) covariance matrices gmat <- gen_varcov(maize_pheno[, traits], maize_pheno$Genotype, maize_pheno$Block) pmat <- phen_varcov(maize_pheno[, traits], maize_pheno$Genotype, maize_pheno$Block) # Aggregate phenotypic data by Genotype to match marker data dimensions agg_pheno <- aggregate(maize_pheno[, traits], by = list(Genotype = maize_pheno$Genotype), FUN = mean) phen_mat <- as.matrix(agg_pheno[, traits]) # 2. Prepare Genotypic Data # Extract only the marker columns (dropping the Genotype ID column) marker_mat <- as.matrix(maize_geno[, -1]) # 3. Define Economic Weights # Suppose we want to improve Yield strongly, decrease Plant Height, and lightly increase Days to Maturity wmat <- weight_mat(data.frame(Trait = traits, Weight = c(2, -1, 0.5)))
The Linear Marker Selection Index (LMSI) requires an explicit prior step: calculating the marker score ($\mathbf{s}$).
Let $\mathbf{y}_i = \mathbf{g}_i + \mathbf{e}_i$ be the vector of phenotypic traits. The unobservable genetic value $g_i$ can be predicted using a linear combination of the $M$ selected markers structurally linked to the QTL that affect the $i$th trait:
$$ \mathbf{s}i=\sum \limits{j=1}^M \boldsymbol{\theta}_j \mathbf{x}_j $$
Where $\mathbf{s}_i$ is the marker score, $\boldsymbol{\theta}_j$ is the regression coefficient, and $\mathbf{x}_j$ is the coded value of the $j$th markers.
The LMSI combines the phenotypic scores ($\mathbf{y}$) and marker scores ($\mathbf{s}$) to predict the net genetic merit $H$:
$$ I_M={\boldsymbol{\beta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\beta}}_s^{\prime}\mathbf{s}=\left[{\boldsymbol{\beta}}_y^{\prime}\kern0.5em {\boldsymbol{\beta}}_s^{\prime}\right]\left[\begin{array}{c}\mathbf{y}\ {}\mathbf{s}\end{array}\right] $$
To apply the LMSI, we must first estimate the marker scores for each trait. A simple way is to perform multiple linear regression of the phenotypic values on the marker data, selecting only significant markers, and using those coefficients to calculate $\mathbf{s}$.
For simplicity in this tutorial, we will use all markers in a linear model (ridge regression) to predict the traits.
# Calculate marker scores via a simple Ridge Regression # (We use MASS::lm.ridge or glmnet in practice, but for the vignette we approximate # the prediction using the marker matrix directly) library(MASS) marker_scores <- matrix(0, nrow = nrow(phen_mat), ncol = ncol(phen_mat)) colnames(marker_scores) <- colnames(phen_mat) for (i in seq_len(ncol(phen_mat))) { # Fit ridge regression to handle multicollinearity among markers # Note: A lambda must be chosen carefully in real scenarios fit <- lm.ridge(phen_mat[, i] ~ marker_mat, lambda = 10) # Calculate predicted marker scores marker_scores[, i] <- scale(marker_mat, center = fit$xm, scale = fit$scales) %*% fit$coef + fit$ym }
Now that we have $\mathbf{pmat}$, $\mathbf{gmat}$, the phenotypic matrix $\mathbf{y}$, and the estimated marker score matrix $\mathbf{s}$, we can calculate the Linear Marker Selection Index:
# Calculate the LMSI lmsi_res <- lmsi( phen_mat = phen_mat, marker_scores = marker_scores, pmat = pmat, gmat = gmat, wmat = wmat ) # View the LMSI coefficient summary print(lmsi_res$summary) # Display the expected genetic gains per trait print(lmsi_res$Delta_G)
The output yields the relative contribution weights of the phenotypes (b_y.*) versus the markers (b_s.*), and calculates the overall Expected Genetic Advance.
The genome-wide linear selection index (GW-LMSI) bypasses the two-stage regression scoring step of the LMSI. Instead, it treats information at each individual marker as a separate trait. All $m$ marker information enters together with phenotypic information into a single index.
The GW-LMSI combines the trait phenotypic values ($\mathbf{y}$) and the molecular information ($\mathbf{m}$) to predict $H$:
$$ I_W={\boldsymbol{\beta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\beta}}_m^{\prime}\mathbf{m}=\left[{\boldsymbol{\beta}}_y^{\prime}\kern0.5em {\boldsymbol{\beta}}_m^{\prime}\right]\left[\begin{array}{c}\mathbf{y}\ {}\mathbf{m}\end{array}\right] $$
Because the number of markers $m$ often wildly exceeds the number of individuals $n$ (the $p \gg n$ problem), the covariance matrix of the markers ($\mathbf{M}$) and the covariance between phenotypes and markers ($\mathbf{W}$) are generally singular.
Thus, calculating the index coefficients ${\boldsymbol{\beta}}_y$ and ${\boldsymbol{\beta}}_m$ typically requires generalized inverses or regularization (e.g. Ridge Regression / Tikhonov regularization applied to the covariance matrices).
The selection.index package manages the complex construction of the $\mathbf{M}$ and $\mathbf{W}$ covariance matrices asymptotically and handles singularity issues using Tikhonov regularization (lambda).
# We use all 500 markers in the maize_geno matrix # Since p (500) approaches n (600), we supply a ridge regularization lambda # to ensure the covariance matrices are invertible. gw_lmsi_res <- gw_lmsi( marker_mat = marker_mat, trait_mat = phen_mat, gmat = gmat, wmat = wmat, lambda = 0.05 ) # Display the phenotypic weighting coefficients print(gw_lmsi_res$b_y) # We can also check the condition number to observe the matrix stability print(gw_lmsi_res$condition_number) # And summarize the overall index statistics print(gw_lmsi_res$summary)
The GW-LMSI automatically models all the molecular variance in one step. While it requires computationally heavier matrix inversion (often mitigated by regularization), it tends to predict the net genetic merit more efficiently than standard LMSI when trait heritability is very low or when capturing many small QTL regressions requires immense sample sizes.
According to Lande and Thompson (1990), applying the LMSI correctly requires several primary statistical and biological conditions to be met in the breeding population:
Will a linear marker selection index always dramatically outperform standard phenotypic selection? Not necessarily.
Lande and Thompson (1990) showed that LMSI efficiency relative to basic phenotypic selection ($\lambda_M$) for a single trait is:
$$ \lambda_M=\frac{R_M}{R}=\sqrt{\frac{q}{h^2}+\frac{{\left(1-q\right)}^2}{1-{qh}^2}} $$
Where $q = \frac{\sigma_s^2}{\sigma_g^2}$ is the proportion of additive genetic variance definitively explained by the markers, and $h^2$ is the trait heritability.
This equation demonstrates the "LMSI Paradox" mathematically defined by Knapp (1998): as $h^2$ trends toward zero, $\lambda_M$ approaches infinity. Thus, the massive relative advantage of marker-assisted selection over phenotypic selection occurs explicitly when phenotypic trait heritability is extremely low, and the markers successfully capture a massive fraction of genetic variance. Otherwise, the LMSI offers diminishing returns and can fix QTLs at an overly fast rate.
Assuming $H$ and the index $I_M$ share a bivariate joint normal distribution, the statistical properties mirror the standard phenotypic LPSI:
Because molecular indices incorporate both the phenotypic and genotypic marker data, the LMSI accuracy will technically always be higher than the standard LPSI accuracy.
While the gw_lmsi() and lmsi() functions in the selection.index package abstract away the underlying algebra, molecular index computation routinely suffers from the $p \gg n$ covariance problem (vastly more markers than recorded individuals).
Estimating the marker score variances ($\mathbf{S}$) and the massive $\mathbf{M}$ and $\mathbf{W}$ covariance matrices using least-squares often leads to: 1. Marker covariance values erroneously overestimating the true genetic covariance ($\mathbf{C}$). 2. Singular, non-positive definite matrices where classical inverses fail.
The standard solution involves regularization techniques (like the Ridge penalty lambda built into gw_lmsi()), ensuring matrix invertibility at the cost of slight coefficient bias.
The selection.index package extends base index selection directly into the molecular realm. The LMSI (lmsi()) allows breeders to leverage an established pre-calculated marker scorings regression model. The GW-LMSI (gw_lmsi()) embraces the Genomic Selection paradigm by swallowing all marker and phenotypic data jointly in a single predictive framework, utilizing ridge regularization to bypass classical $p \gg n$ multicollinearity obstacles.
When implemented on traits exhibiting low heritability, these marker-assisted pipelines can drastically outcompete standard phenotypic indices in maximizing Expected Genetic Gain per selection cycle.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.