knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(genomicMateSelectR)
The following will be a demonstration of the core functions for predicting mate selection criteria available in genomicMateSelectR
.
I generated simulated data, included in the package (accessible via data()
), which should clarify the input format(s) required.
The example input data will be explained and used in the vignette below.
Simulation was conducted with the package AlphaSimR
.
In a subsequent section, we will dig into the details and more fully explore all the available options in genomicMateSelectR
. First, a quick demo of the predictCrosses()
function using (mostly) ready-to-go pre-prepared input data.
To predict cross variance, 4 inputs are required:
Phased haplotypes of candidate parents
Also a matrix of allele dosages
Recombination frequency matrix (derived from a genetic map)
First input is a matrix,haploMat
with two rows per individual and one column per SNP locus.
Alleles coded {0,1}
. rownames(haploMat)
contains sample IDs ("GID") with haplotypes distinguished by the mandatory suffixes "_HapA" and "_HapB". colnames(haploMat)
correspond to SNP IDs (i.e. $chr _ pos$), currently requires that chromosomes be distinguished using integers; no prefix to SNP IDs is allowed (i.e. 1_2 not chr1_2 and also not anything like A_2). In this example, the GIDs are all numbers, they are from simulated data; in practice they can be any alphanumeric string.
data(haploMat); str(haploMat)
haploMat[1:6,1:8]
For now, to use predictCrosses()
, both a haploMat
and another, related matrix (dosages
) are required as input.
The matrix dosages
has one row per individual sample and one column per SNP marker. Column-names are the same as for haploMat
. rownames are the same prefix as in haploMat
, but no suffix ("_HapA" or "_HapB"). Genotypes are coded {0, 1, 2}
counting the number of (usually) the alternative allele. Important that the counted allele is the same allele as indicated by the value 1
in the haploMat
.
data("doseMat") str(doseMat)
doseMat[1:3,1:8]
NOTE: The dual dosages
and haploMat
input requirement is because most users will be predicting cross means + variances and currently, the function predCrossMeans()
, which is used internally, requires dosages
and predCrossVars()
needs a haploMat
. Its computationally more efficient (I think) to avoid converting haplotype matrices into dosage matrices internally, thus I demand both be specified. On the improvement to-do list.
The second input, recombFreqMat
is a square-symmetric matrix, with dimension p-markers by p-markers. The elements of recombFreqMat
, should actually be 1-2*recombFreqMat
, meaning 1 minus 2 times the expected frequency of recombination between pairs of loci; a decision made for computational efficiency.
The usual starting point to create this matrix will be a centimorgan-scale genetic map. genomicMateSelectR
provides a helper function genmap2recombfreq()
to facilitate the conversion.
Users can input their genetic map as a named, numeric vector as shown below (genmap
).
data(genmap); str(genmap)
genmap[1:10]
The input format for predictCrosses()
, for reasons of computational efficiency, is actually 1-2*recombFreqMat
where recombFreqMat
is a
recombFreqMat<-genmap2recombfreq(genmap, nChr = 2) str(recombFreqMat)
recombFreqMat[1:5,1:5]
Now calc. 1-2*recombFreqMat
.
recombFreqMat<-1-(2*recombFreqMat) recombFreqMat[1:5,1:5]
This corresponds to the input to predictCrosses(recombFreqMat=)
and the example dataset included with the package: data("recombFreqMat")
The third input, for predictCrosses(snpeffs=)
are marker-effects from genome-wide marker regression e.g. RR-BLUP.
Marker-effects are supplied as a tibble
with one row per trait, one chr-type column named Trait and one or more list-type columns containing SNP-effects. Each element of each list-type SNP-effects column corresponds to effects for a single trait, and contains a matrix with a single column, rownames corresponding to SNP IDs and should match the IDs in recombFreqMat
and haploMat
.
The number and required column names for the SNP-effect list-type columns depends on the value of modelType==
. There are 3 model types implemented "A"
, "AD"
and "DirDom"
. The predictCrosses()
function is designed to be run with output produced by running the runGenomicPredictions()
function using the same modelType
setting. Both are designed to ensure / encourage the correct formulation of additive+dominance models.
For this quick example, the additive-only model: modelType="A"
. I'll introduce each model later and explain the correct inputs / expected outputs for those.
The example dataset data(snpeffsA)
is gpredsA$genomicPredOut[[1]]
where gpreds
is the output of running runGenomicPredictions()
with modelType = "A"
.
data(snpeffsA)
snpeffsA
Because snpeffsA
is the output of runGenomicPredictions()
, there are columns that are not required and will be ignored by predictCrosses()
. The unecessary columns are gblups, varcomps and fixeffs.
Required:
snpeffsA %>% dplyr::select(Trait,allelesubsnpeff)
Two rows, one for each trait. One list-column, required to be labelled allelsubsnpeff, containing a single-column matrix of SNP-effects, rownames identifying SNP IDs.
snpeffsA$allelesubsnpeff[[1]] %>% head
summary(snpeffsA$allelesubsnpeff[[1]])
Lastly, we just need to tell predictCrosses()
, which crosses to predict. The input to argument CrossesToPredict=
should be a tibble
or data.frame
with two-columns. Columns should be character-class (no factors!), column names should be sireID and damID. Names should be present in the rownames of haploMat
and the row/colnames of recombFreqMat
.
There is a helper function crosses2predict()
that takes a vector of parents as input and creates a data.frame
of all pairwise matings, including self-crosses, but excluding reciprocal crosses, i.e. use as male == use as female. Basically crosses2predict()
specifies crosses on the diagonal and in the upper-triangle of a potentially square-symmetrical mating matrix.
Here, I'll just randomly choose 5 parents and make a list of the 15 CrossesToPredict
.
set.seed(42); parents<-sample(x = snpeffsA$gblups[[1]]$GID, size = 5, replace = F) CrossesToPredict<-crosses2predict(parents) CrossesToPredict %>% head
str(CrossesToPredict)
And behold, we have the stuff we need to predict crosses.
Selection Indices:\
genomicMateSelectR
is built to facilitate mating choices based on multiple traits, particularly using a user-specified set of weights (SIwts) to produce cross merit predictions on a linear selection index (selInd
). When supplied multiple traits, predictCrosses()
will predict both the variances of those traits and their covariance in each family. When the user specifies selInd = T
and supplies a vector of weights with names matching the trait names, predictCrosses()
will add predictions for a new composite trait, SELIND to the output.
SIwts are used as-is by predictCrosses()
so the user is responsible for choosing "good" values; a topic beyond the scope of this manual / vignette.
SIwts<-c(0.75,0.25) %>% `names<-`(.,c("Trait1","Trait2")) SIwts
crossPreds<-predictCrosses(modelType="A", selInd = T, SIwts = SIwts, CrossesToPredict=CrossesToPredict, snpeffs=snpeffsA, dosages=doseMat, haploMat=haploMat, recombFreqMat=recombFreqMat)
Now we can quickly examine the output.
Produces a single-row tibble (basically a 2 element list) with two columns tidyPreds and rawPreds.
crossPreds
Most users and breeders will work with the tidyPreds. This simplified output was designed especially for those interested in selections based on SELIND.
Each row contains relevant predictions for a single trait in a single cross. Trait covariances are excluded from this output in order to show both mean (predMean), variance / standard deviation (predVar, predSD) and usefulness (predUsefulness) side-by-side.
crossPreds$tidyPreds[[1]]
crossPreds$tidyPreds[[1]] %>% str
Additional columns in tidyPreds:
Nsegsnps
: the number of SNPs expected to segregate in this cross. For compute efficiency, when predicting cross variances, genomicMateSelectR
detects which SNPs will actually cause genetic variance within the cross, and restricts the analysis to those loci.
predOf
: "BV" for breeding value (all models), "TGV" for total genetic values (models "AD" and "DirDom" only). Here only "BV" because modelType="A"
.
predUsefulness
is predMean + stdSelInt*predSD
, where stdSelInt
is the standardized selection intensity input to predictCrosses()
, stdSelInt = 2.67
by default.
Complete details of the predictions, especially including trait-trait covariance predictions are available in the two-element, named-list contained in rawPreds:
crossPreds$rawPreds[[1]] %>% str
At crossPreds$rawPreds[[1]]$predMeans
are the predictions for cross means, and at crossPreds$rawPreds[[1]]$predVars
are the predictions for cross variances.
For sake of example, we should make a plot (predMean vs. predSD), for "fun" and then make a selection of the top 5 crosses, based on their predicted usefulness on the selection index:
crossPreds$tidyPreds[[1]] %>% dplyr::filter(Trait=="SELIND") %>% ggplot2::ggplot(.,ggplot2::aes(x=predMean,y=predSD, color=predUsefulness, size=predUsefulness)) + ggplot2::geom_point() + ggplot2::theme_bw()
crossPreds$tidyPreds[[1]] %>% dplyr::filter(Trait=="SELIND") %>% dplyr::slice_max(order_by = predUsefulness, n = 5) # arrange(desc(predUsefulness)) %>% slice(1:5) # same as `slice_max`
That concludes the quick demo, focused on input formatting and output contents, using the basic additive-model in the function predictCrosses()
to get selection-index cross usefulness predictions.
In subsequent vignettes I will provide:
Learn the features for incorporating non-additive effects into predictions: modelType="AD"
, modelType="DirDom
and their correct useage. Includes a theoretical introduction, followed by a demonstration of the use of runGenomicPredictions()
and predictCrosses()
to accomplish predictions of individual and cross performances.
Use of the functions for cross-validation (runCrossVal()
, runParentWiseCrossVal()
,
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.