fanova: Fitting Functional ANOVA Models

Description Usage Arguments Details Value Note References Examples

View source: R/fanova.R

Description

This function is to fit Fanova models by B-Spline basis expansion with a penalty term on the second derivative of of the estimated functions for the smoothness of the curves. The penalty parameter, λ, is chosen by generalized cross validation (GCV).

Usage

1
2
fanova(Y.na.mat, X, tt, formula, K.int = 6,
            order = 4, d1 = 2, lower = -10, upper = 15)

Arguments

Y.na.mat

an n by t matrix of response variable (the extracted features), where each row contains the time series data of an observation, and each column contains all the observations at a given time. Here n is the number of biological units of the study (for example, in the study of plant growth curve, n is the number of plants and t is the number of observation time points. Any missing observation is filled by "NA" in the matrix. For example, if a measurement of plant is missing for plant i on date j, the (i, j) component in Y.na.mat should be filled by "NA".

X

an n by r dataframe or matrix of explanatory variables, where n is the number of observations and r is the number of explanatory variables.

tt

a t by 1 vector, with the same length as the rows in Y.na.mat. Each element implies the observation date.

formula

an object of class "formula", which specifies the model to use.

K.int

a positive integer, which refers to the number of interior knots of the B-spline.

order

a positive integer, which refers to the order of the B-spline. In default, order = 4, which implies that we use a cubic polynomial to connect each two adjacent knots.

d1

a non-negative integer, giving the order of the derivatives of the basis expansion function in the penalty matrix, Ω. See the "detail"" part. In default, d1 = 2. Note that d1 should be smaller than the order of the basis function.

lower

lower bound for the possible values of log (smoothing parameter), i.e., log (λ), used in GCV.

upper

upper bound for the possible values of log (smoothing parameter), i.e., log (λ), used in GCV.

Details

Suppose the trait is potentially affected by q factors. Denote by l_j the number of levels of the jth factor. We define X_{ij} = (x_{ij2}, …, x_{ijl_j})' as the categorical indicators of the jth factor of the ith plant. Specifically, x_{ijk} is set to one if the jth factor of the ith plant has level “k"; otherwise, x_{ijk} = 0. In convenience, let q = 2 in the help documentation. Denote by \otimes the Kronecker product of matrices. A functional ANOVA model with interactions can be written in the following form:

y_i(t) = μ(t)+ X'_{i1}a_1(t) + X'_{i2}a_2(t) + (X'_{i1}\otimes X'_{i2})a_{1,2}(t) + ε_i(t),

where a_j(t) = (a_{j2}(t), …, a_{jl_j}(t))' are the treatment effect functions of the jth factor with dimension l_j-1, a_{1, 2}(t) are the pairwise interaction effect functions between the two factors with dimension (l_1-1)(l_2-1), and ε_i(t)s are temporal dependent random processes with zero means. We have implemented this multi-factor model (q≥q 2) in our package such that researchers can specify the main and interactions effects as needed. Under the model, we can estimate the temporally varying coefficient functions. First, we approximate all of the coefficient functions with rank K splines expansions. For example, a_{1k}(t) = ∑_{v=1}^{K}β_{1k,v}B_{d,v}(t) , where \{B_{d,v}(t)\}_{v = 1}^{K} are order d B-spline basis functions, and \{β_{1k,v}\}_{v = 1}^{K} are the corresponding coefficients. Let β_{μ}, β_{1k}, β_{2k} and β_{int,k1,k2} denote the k-dimensional vector of the B-spline coefficients for the intercept, the two main factor effects (kth level), and their interaction terms (k_1th and k_2th level), respectively. Let β = (β_{μ}', β_{12}', …, β_{1l_1}', β_{22}', …, β_{2l_2}', …)' be the vector containing all these coefficients.

To estimate all those coefficients β in front of basis functions, we apply the least squares estimation with penalizations on the d1th derivatives of the rank K spline expansion functions to attain smooth estimators. Denoting y_i(t_{i,z}) as the measurement for the ith (i = 1,...,n) plant observed at the zth (z = 1,...,m) time point, we minimize the penalized error sum of squares:

∑_{i=1}^{n}∑_{z = 1}^{m}\{y_i(t_{i,z})-B(t_{i,z})'β_{μ}-∑_{k=2}^{l_1}X_{i1k}B(t_{i,z})'β_{1k}-∑_{k=2}^{l_2}X_{i2k}B(t_{i,z})'β_{2k}

-∑_{k_1 = 2}^{l_1}∑_{k_2=2}^{l_2}X_{i1k_1}X_{i2k_2}B(t_{i,z})'β_{int,k_1,k_2}\}^2

+λβ_{μ}'Ω β_{μ} +∑_{k=2}^{l_1}λ β_{1k}'Ω β_{1k} +λ ∑_{k=2}^{l_2}β_{2k}'Ωβ_{2k}+ λ∑_{k_1 = 2}^{l_1} ∑_{k_2 = 2}^{l_2} β_{int,k_1,k_2}' Ω β_{int,k_1,k_2}

where λ is the smoothing parameter, Ω = \int B^{(d1)}(t)\{B^{(d1)}(t)\}'dt is a penalty matrix, B(t) = (B_{d,1}(t), …, B_{d,K}(t))', and d1 is the order of the derivitives in the penalty term. We obtain the estimated B-spline coefficient \hat{β} by minimizing the above quantity. The smoothing parameter λ is estimated by generalized cross validation (GCV). By plugging \hat{β} back to the rank K splines expansions, we get the estimated varying coefficient functions in the model. For example, \hat{β}_{μ}(t) = B(t)'\hat{β}_{μ}, \hat{a}_{1k}(t) = B(t)'\hat{β}_{1k}.

Value

est_fun

a t by u matrix of the estimated function of coefficients, where t is the number of observation time points, and u is the number of columns of the design matrix.

For example, if the model includes two categorical variables without interaction terms, each categorical variable contains two levels, the model can be expressed as

y(t) = β_0(t) + β_1(t) X_1 + β_2(t) X_2 + ε(t).

In this case, u = 3. This means your output "est_fun" contains 3 columns, and the columns repesent \hat{β}_0(t), \hat{β}_1(t) and \hat{β}_2(t), respectively.

bhat

a (u \times K) \times 1 numeric vector, containing all the estimated ceofficients, \hat{β} for the B-spline expansion, where u is the number of columns of the design matrix, and K is the rank of the spline expansion.

design_mat

The design matrix under the model specified in the function. It can be used as reference to help users identify the contrast vector L in the function CI( ), and the values of j1 and j2 in the function CI_contrast( ).

lambda

The penalty parameter, obatined by GCV.

K

The rank of the B-spline basis function. K = number of interior knots + order of the B-spline basis function.

...

Other ouputs are for writing other functions. Users can ignore them.

Note

The rank of the B-spline basis functions, K, is equal to the order (degree+1) of the spline plus the number of interior knots. To avoid over-fitting, we choose the rank less than half of the number of observation time points, m.

References

Ramsay, James O., and Silverman, Bernard W. (2005), Functional Data Analysis, 2nd ed., Springer, New York.

Ramsay, James O., Hooker, Giles, and Graves, Spencer (2009) Functional Data Analysis in R and Matlab, Springer, New York.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#load the data
data_new = read.csv(system.file("extdata", "data.csv",
                                 package = "implant", mustWork = TRUE))
#The first three columns of data_new refer to the original position of
#the observations from the original dataset, genotype and block, respectively
Y = data_new[, -c(1:3)]
#This step is to factorize each factor
Genotype = as.factor(data_new$Genotype)
Block = as.factor(data_new$Block)
X = data.frame(Genotype, Block)
formula = "~ Genotype + Block"
tt = seq(from = 0, to = 44, by = 2)
fit = fanova(Y.na.mat = Y, X = X, tt = tt, formula, K.int = 6, order = 4)
fit$lambda
fit$est_fun

rwang14/implant documentation built on Sept. 6, 2020, 3:21 a.m.