bilinear: bilinear function
In nsantantonio/Bilinear: Fits bilinear models for multi environment trial data

Description Usage Arguments Details Value Examples

This function will fit bilinear models, such as AMMI, GGE / SREG, EGE / GREG, designed for multienvironment trials from a plant breeding/agronomy perspective, but can be used for fitting any two factor interaction models for dimension reduction of the interaction table. A bootstrap test and an F_R test are implemented for testing the number of significant dimensions.

bilinear(x = NULL, G = NULL, E = NULL, y = NULL, block = NULL,
  model = "AMMI", errorMeanSqDfReps = NULL, f = 0.5,
  test = "bootstrap", imputePC = "sig", alpha = 0.05, B = 10000,
  nCore = 1, Bonferroni = FALSE, returnDataFrame = TRUE,
  override3col = FALSE, verbose = TRUE, ...)

`x`	matrix or data.frame containing numeric values of cell means with genotypes on rows and environments on columns (e.g. ith genotype within jth environment). Alternatively a data.frame can be provided in long format, with factor and value variable names passed to 'G', 'E' and 'y' respectively
`G`	character. Name of genotype variable in data.frame 'x' if 'x' is a data.frame in long format. Optionally, character or factor vector of genotype names (optional). If NULL (default) a matrix of cell means must be provided to 'x' argument
`E`	character. Name of environment variable in data.frame 'x' if 'x' is a data.frame in long format. Optionally, character or factor vector of environment names (optional). If NULL (default) a matrix of cell means must be provided to 'x' argument
`y`	character. Name of phenotype variable in data.frame 'x' if 'x' is a data.frame in long format. Optionally, numeric vector of phenotype values (optional). If NULL (default) a matrix of cell means must be provided to 'x' argument
`block`	character. Optional, for RCBD only. Name of block variable in data.frame 'x' if 'x' is a data.frame in long format and study is an RCBD. Optionally, character or factor vector of block names.
`model`	bilinear model to be fit. Arguments can be "AMMI", "GGE", "SREG", "EGE", "GREG". "GGE" and "SREG" are equivalent, as are "EGE" and "GREG".
`errorMeanSqDfReps`	numeric vector containing the error mean square, degrees of freedom, and the number of replications, in that order.
`f`	numeric value in (0, 1) for scaling exponent on eigenvalues for weighting the genotype scores. Environment scores are weighted 1 - f. Default is 0.5
`test`	character. Test for significant PCs. Argument can be either "Ftest" or "bootstrap". Default is "bootstrap".
`imputePC`	integer. number of PCs for imuptation of missing cells. must be an integer between 1 and min(genotypes, environments) - 2. Default is 1. Alternatively, the argument "sig" can be provided to use the number of significant dimensions from the initial fit (assuming GE_ij = 0 for missing cells). THIS HAS NOT YET BEEN IMPLEMENTED!
`alpha`	pvalue cutoff threshold for significance. Default is 0.05.
`B`	integer. Number of bootstrap samples for bootstrap test. 1e+06 or greater is recommended, but default is 1e+05 to limit first use computation time.
`nCore`	integer. Number of cpu cores to be used for parallelization of computation using foreach() and the doMC package. Not tested outside a unix environment. Default is 1.
`Bonferroni`	logical. Use a bonferroni correction for multiple testing. Default is TRUE
`returnDataFrame`	logical. Should a data frame of the resulting model fit be returned? Default is TRUE
`override3col`	logical. Overrides the 3 column numeric warning when a matrix of 3 coulmns is provided. Most users will not need to turn this off, unless the number of environments is 3 and a table of cell means are provided. Default is FALSE.
`verbose`	logical. When TRUE, details from nalysis are printed.
`...`	Additional arguments.

Bilinear models attempt to separate interaction signal from noise in a two factor interaction model by dimension reduction of the table of residuals from the additive model. Dimension reduction is accomplished by singular value decomposition of the residual table, keeping the first k significant dimensions and partitioning the remaining dimensions to the error.

The interaction model can be formulated for the genotype by environmental interaction as the following:

y = mu + G_i + E_j + G*E_ij + e_ijk where mu is the grand mean, G_i is the effect of the ith genotype, E_j is the effect of the ith genotype, G*E_ij is the genotype by environment interaction term and e is the experimental error. The G*E_ij term can be thought of as a matrix of cell means (minus the main effects), and can be decomposed as sum(u_ik * d_k * v_jk) + r_ij for $k = 1,...,K$, using singular value decomposition where u_ik is the kth left singular value of the ith genotype, v_jk is the kth right singular value of the jth genotype and d_k is the kth singular value. r is sum(u_ik * d_k * v_jk) for k = K+1, ..., M where M is min(levels(G), levels(E)) - c for c equals 1 or 2 for GGE/SREG/EGE/GREG and AMMI respectively. The genotype and environmental scores are calculated as u_ik * (d_k)^f and v_jk * (d_k)^(1-f) respectively.

"AMMI" specifies Additive Main Effects Multiplicative Interaction, where $y = mu + G_i + E_j + sum_k = 1^K u_ik * d_k * v_jk) + e$ for $k = 1,...,K$ significant multiplicative terms. "GGE" or "SREG" specifies a Sites REGression or Genotype + Genotype by Environment interaction, where $y = mu + E_j + sum_k = 1^K u_ik * d_k * v_jk) + e$ for $k = 1,...,K$ significant multiplicative terms. "EGE" or "GREG" specifies Genotype REGression or Environment + Genotype by Environment interaction,,$y = mu + G_i + sum_k = 1^K u_ik * d_k * v_jk) + e$ for $k = 1,...,K$ significant multiplicative terms.

For specifics see Zobel, R. W., Wright, M. J., & Gauch, H. G. (1988). Statistical analysis of a yield trial. Agronomy Journal, 80(3), 388-393.

Note: either the row and column variable names can specified to 'G' and 'E' respectively and the value variable names to 'y', or by column variable names provided as arguments to 'G', 'E', and 'y' If data.frame already contains variables 'G', 'E' and 'y' ('block' optional) model will be fit accordingly. Any character class in the data.frame is converted to factor. If data is provided in long format and study is replicated within environment, cell means will be used, and a warning will diplay if the data is unbalanced.

The bilinear fit object and tests for significant dimensions.

data(soy)
print(soyMeanMat)

AMMIfit <- bilinear(x = soyMeanMat)
AMMIfit[c("mu", "Eeffect", "Geffect")] 
AMMIfit["DF"] 
AMMIfit[c("pvalue", "sigPC")] # Note that the first two terms are considered significant
AMMIfit["varcomp"]
AMMIfit["svdE"]

# Same soy data in long format
# as long as dataframe or matrix column names match c("G", "E", "y") 
# exactly (order doesnt matter), the function will assume that the data is in long format
print(head(soyMeanDf))
AMMIfit <- bilinear(x = soyMeanDf)

# if column names DO NOT match c("G", "E", "y"), they must be supplied to the 'G', 'E', and 'y'
# exactly (order doesnt matter), the function will assume that the data is in long format
print(head(soyMeanDf))
AMMIfit <- bilinear(x = soyMeanDf)
# multiple CPUs for bootstrap test, not run
# AMMIfit <- bilinear(x = soyMeanMat, model = "AMMI", nCore = 2)

# Sequential F test for significance using cell means. This test is known to be too liberal and will likely result in type I errors
AMMIfit <- bilinear(soyMeanMat, test = "Ftest")

# F_R test for significance using plot level data in long format to estimate error variance for F test
print(head(soy))
AMMIfit <- bilinear(soy, test = "Ftest")

# F_R test with estimated error variance. Useful for Augmented designs where the error estimate is estimated from replicated checks
# Estimate error variance first 
soyfit <- lm(as.formula(y ~ E + G + G:E + block:E), data = soy)
errorMeanSq <- summary(soyfit)$sigma^2
errorDf <- anova(soyfit)["Residuals", "Df"]
reps <- unique(table(soy$G, soy$E))

print(c(errorMeanSq, errorDf, reps))
# F_R test with externally estimated error variance
soyF2 <- bilinear(soyMeanMat, errorMeanSqDfReps = c(errorMeanSq, errorDf, reps), test = "Ftest")

#reformat data

# as long as dataframe or matrix column names match c("G", "E", "y") 
# exactly (order doesnt matter), the function will assume that the data is in long format
dfY <- soyMeanDf
names(dfY) <- c("G","E","y") # name dfY c("G", "E", "y") 
print(head(dfY, 20))
bilinear(x = dfY, model = "AMMI", B = 10000, nCore = 2)
bilinear(x = soy[c("G","E","y")], model = "AMMI", B = 10000, nCore = 2)


matY <- as.matrix(dfY)
print(matY[1:20,])
bilinear(x = matY, model = "AMMI", B = 10000, nCore = 2) 

# if the columns do not match as above, the names of the variables must be
# passed to the 'G', 'E', and 'y' arguments as character strings
names(dfY) <- c("geno", "env", "trait") #change names dfY
print(head(dfY, 20))
bilinear(x = dfY, G = "geno", E = "env", y = "trait", model = "AMMI", B = 10000, nCore = 2)
# data can be also supplied directly as vectors to the 'G', 'E' and 'y' arguments
Gvec <- dfY$geno
Evec <- dfY$env
yvec <- dfY$trait
bilinear(G = Gvec, E = Evec, y = yvec, model = "AMMI", alpha = 0.05, B = 10000, nCore = 2)

# Impute missing data!
missMat <- soyMeanMat 
missMat[sample(1:prod(dim(missMat)), 10)] <- NA 
print(missMat)
bilinear(x = matY, model = "AMMI", B = 10000, nCore = 2) 

missMat <- soyMeanMat 
missMat[sample(1:prod(dim(missMat)), 10)] <- NA 
print(missMat) # 10 missing records
missFit <- bilinear(x = missMat, model = "AMMI", B = 10000) 

library(Bilinear)
missMeanDf <- soyMeanDf[!{soyMeanDf$E %in% unique(soyMeanDf$E)[4:6] & soyMeanDf$G %in% unique(soyMeanDf$G)[3:4]}, ]
nrow(missMeanDf) # 10 missing records
missFit2 <- bilinear(x = missMeanDf, model = "AMMI", B = 10000, maxiter = 200, Ytrue = soyMeanMat, plotMSE = TRUE, tol = 1e-4)