#' @title Fit a statistical factor model using principal component analysis
#'
#' @description Fits a statistical factor model using Principal Component
#' Analysis (PCA) for one or more asset returns or excess returns. When the
#' number of assets exceeds the number of time periods, Asymptotic Principal
#' Component Analysis (APCA) is performed. An object of class \code{"sfm"} is
#' returned. This function is based on the S+FinMetric function \code{mfactor}.
#'
#' @details
#' If \code{data} is not of class \code{"xts"}, rownames must provide an
#' \code{"xts"} compatible time index. Before model fitting, incomplete cases in
#' \code{data} are removed using \code{\link[stats]{na.omit}}. Specifying
#' \code{check=TRUE}, issues a warning if any asset is found to have identical
#' observations.
#'
#' Let \code{N} be the number of columns or assets and \code{T} be the number
#' of rows or observations. When \code{N < T}, Principal Component Analysis
#' (PCA) is performed. Any number of factors less than \code{min(N,T)} can be
#' chosen via argument \code{k}. Default is 1. Refer to Zivot and Wang (2007)
#' for more details and references.
#'
#' When \code{N >= T}, Asymptotic Principal Component Analysis (APCA) is
#' performed. The user can directly specify \code{k} similar to PCA above, or a
#' method to automatically determine the number of factors can be specified:
#' \code{k="bn"} corresponds to Bai and Ng (2002) and \code{k="ck"} corresponds
#' to Connor and Korajczyk (1993). Users can choose the maximum number of
#' factors, \code{max.k}, to consider with these methods. The default for
#' \code{max.k} is set to be 10 or $T-1$, whichever is smaller.
#'
#' \code{refine} specifies whether a refinement of the APCA procedure from
#' Connor and Korajczyk (1988), that may improve efficiency, is to be used.
#'
#' When \code{corr=TRUE}, the correlation matrix of returns are used for
#' finding the principal components instead of the covariance matrix. This is
#' typically decided by practioners on a case-by-case basis. The variable with
#' the highest variance dominates the PCA when the covariance matrix is used.
#' However, this may be justified if a volatile asset is more interesting for
#' some reason and volatility information shouldn't be discarded. On the other
#' hand, using the correlation matrix standardizes the variables and makes them
#' comparable, avoiding penalizing variables with less dispersion.
#'
#' Finally, if the median of the 1st principal component is negative, all it's
#' factor realizations are automatically inverted to enable more meaningful
#' interpretation.
#'
#' @param data vector, matrix, data.frame, xts, timeSeries or zoo object with
#' asset returns. See details.
#' @param k number of factors (or) a method for determining the optimal number
#' of factors, one of "bn" or "ck". See details. Default is 1.
#' @param max.k scalar; the maximum number of factors to be considered for
#' methods "bn" or "ck". Default is \code{NULL}. See details.
#' @param refine logical; whether to use the Connor-Korajczyk refinement for
#' APCA. Default is \code{TRUE}.
#' @param sig scalar; desired level of significance when "ck" method is
#' specified. Default is 0.05.
#' @param check logical; to check if any asset has identical observations.
#' Default is \code{FALSE}.
#' @param corr logical; whether to use the correlation instead of the covariance
#' matrix when finding the principal components. Default is \code{FALSE}.
#' @param ... optional arguments passed to \code{\link[stats]{lm}}.
#'
#' @return fitTsfm returns an object of class \code{"sfm"} for which
#' \code{print}, \code{plot}, \code{predict} and \code{summary} methods exist.
#'
#' The generic accessor functions \code{coef}, \code{fitted} and
#' \code{residuals} extract various useful features of the fit object.
#' Additionally, \code{fmCov} computes the covariance matrix for asset returns
#' based on the fitted factor model
#'
#' An object of class \code{"sfm"} is a list containing the following
#' components:
#' \item{asset.fit}{fitted object of class \code{"mlm"} or \code{"lm"} from the
#' time-series LS regression of asset returns on estimated factors.}
#' \item{k}{number of factors; as input or determined by "ck" or "bn" methods.}
#' \item{factors}{T x K xts object of estimated factor realizations.}
#' \item{loadings}{N x K matrix of factor loadings estimated by
#' regressing the asset returns on estimated factors.}
#' \item{alpha}{length-N vector of estimated alphas.}
#' \item{r2}{length-N vector of R-squared values.}
#' \item{resid.sd}{length-N vector of residual standard deviations.}
#' \item{residuals}{T x N xts object of residuals from the LS regression.}
#' \item{Omega}{N x N return covariance matrix estimated by the factor model.}
#' \item{eigen}{length-N (or length-T for APCA) vector of eigenvalues of the
#' sample covariance matrix.}
#' \item{mimic}{N x K matrix of factor mimicking portfolio weights.}
#' \item{call}{the matched function call.}
#' \item{data}{T x N xts data object containing the asset returns.}
#' \item{asset.names}{length-N vector of column names from data.}
#' Where N is the number of assets, K is the number of factors, and T is the
#' number of observations.
#'
#' @author Eric Zivot, Sangeetha Srinivasan and Yi-An Chen
#'
#' @references
#' Bai, J., & Ng, S. (2002). Determining the number of factors in approximate
#' factor models. Econometrica, 70(1), 191-221.
#'
#' Connor, G., & Korajczyk, R. A. (1988). Risk and return in an equilibrium
#' APT: Application of a new test methodology. Journal of Financial Economics,
#' 21(2), 255-289.
#'
#' Connor, G., & Korajczyk, R. A. (1993). A test for the number of factors in
#' an approximate factor model. The Journal of Finance, 48(4), 1263-1291.
#'
#' Zivot, E., & Wang, J. (2007). Modeling Financial Time Series with S-PLUS
#' (Vol. 191). Springer.
#'
#'
#' @seealso The \code{sfm} methods for generic functions:
#' \code{\link{plot.sfm}}, \code{\link{predict.sfm}},
#' \code{\link{print.sfm}} and \code{\link{summary.sfm}}.
#'
#' And, the following extractor functions: \code{\link[stats]{coef}},
#' \code{\link[stats]{fitted}}, \code{\link[stats]{residuals}},
#' \code{\link{fmCov}}, \code{\link{fmSdDecomp}}, \code{\link{fmVaRDecomp}}
#' and \code{\link{fmEsDecomp}}.
#'
#' \code{\link{paFm}} for Performance Attribution.
#'
#' @examples
#'
#' # load return data
#' data(StockReturns)
#'
#' # PCA is performed on r.M and APCA on r.W
#' class(r.M)
#' dim(r.M)
#' range(rownames(r.M))
#' class(r.W)
#' dim(r.W)
#'
#' # PCA
#' args(fitSfm)
#' fit.pca <- fitSfm(r.M, k=2)
#' class(fit.pca)
#' names(fit.pca)
#' head(fit.pca$factors)
#' head(fit.pca$loadings)
#' fit.pca$r2
#' fit.pca$resid.sd
#' fit.pca$mimic
#'
#' # APCA with number of factors, k=15
#' fit.apca <- fitSfm(r.W, k=15, refine=TRUE)
#'
#' # APCA with the Bai & Ng method
#' fit.apca.bn <- fitSfm(r.W, k="bn")
#'
#' # APCA with the Connor-Korajczyk method
#' fit.apca.ck <- fitSfm(r.W, k="ck")
#'
#' @export
fitSfm <- function(data, k=1, max.k=NULL, refine=TRUE, sig=0.05, check=FALSE,
corr=FALSE, ...) {
# record the call as an element to be returned
call <- match.call()
# check input data type and coerce to xts; remove NAs
R.xts <- na.omit(checkData(data, method="xts"))
# dim and dimnames of R.mat
n <- ncol(R.xts)
obs <- nrow(R.xts)
# assign generic variable names, if they are missing
if (is.null(colnames(data))) {
colnames(R.xts) <- paste("V", 1:n, sep = ".")
}
# check input vailidity for argument k
if (is.numeric(k)) {
if (k <= 0 || round(k) != k) {
stop("Invalid argument: k, the number of factors, must be a positive
integer.")
} else if (k >= min(n,obs)) {
stop("Invalid argument: k, the number of factors, must be less than the
number of variables.")
}
} else if (is.character(k) && (n > obs)) {
if (!(k %in% c("bn","ck"))) {
stop("Invalid argument: Method for determining the number of factors for
APCA must be one of 'ck' or 'bn'.")
}
} else {
stop("Invalid argument: k, the number of factors, must either be a positive
integer or methods 'ck' or 'bn'. The latter methods are relevant only
for APCA (when, number of assets >= number of observations).")
}
# check input vailidity or assign default for argument max.k
if (is.null(max.k)) {
max.k <- min(10, obs - 1)
} else if (!is.numeric(max.k) || max.k <= 0 || round(max.k) != max.k) {
stop("Invalid argument: max.k, the maximum number of factors, must be a
positive integer.")
} else if (max.k >= obs) {
stop("Invalid argument: max.k must be less than the number of observations")
}
# check if any asset has identical observations
if (check) {
temp <- apply(data, 2, range)
if(any(abs(temp[2, ] - temp[1, ]) < .Machine$single.eps)) {
warning("Some variables have identical observations.")
}
}
# select method to estimate factors based on k and n
# in each case a partial list of return values are obtained
if (n < obs) {
result <- UsePCA(R.xts=R.xts, k=k, corr=corr, ...)
} else if (k=="ck") {
result <- UseAPCA_ck(R.xts=R.xts, max.k=max.k, refine=refine, sig=sig,
corr=corr, ...)
} else if (k=="bn") {
result <- UseAPCA_bn(R.xts=R.xts, max.k=max.k, refine=refine, corr=corr, ...)
} else {
result <- UseAPCA(R.xts=R.xts, k=k, refine=refine, corr=corr, ...)
}
# create list of return values.
input <- list(call=call, data=R.xts, asset.names=colnames(R.xts))
result <- c(result, input)
class(result) <- "sfm"
return(result)
}
### Principal Component Analysis when N < T
#
UsePCA <- function(R.xts=R.xts, k=k, corr=corr, ...) {
R.mat <- coredata(R.xts) # TxN
n <- ncol(R.mat)
obs <- nrow(R.mat)
# demean TxN matrix of returns
R.mat.d <- t(t(R.mat) - colMeans(R.mat))
# NxN return covariance matrix
Omega.N <- crossprod(R.mat.d)/obs
if (corr) {
Omega.N <- cov2cor(Omega.N)
}
# get eigen decomposition
eig.decomp <- eigen(Omega.N, symmetric=TRUE)
eig.val <- eig.decomp$values
X <- eig.decomp$vectors[, 1:k, drop=FALSE] # NxK
dimnames(X) <- list(colnames(R.xts), paste("F", 1:k, sep = "."))
# get TxK factor realizations
f <- R.mat %*% X
colnames(f) <- paste("F", 1:k, sep = ".")
# invert 1st principal component if most values are negative
if (median(f[,1]) < 0) {f[,1] <- -f[,1]}
# LS time series regression to get B: NxK matrix of factor loadings
f <- xts(f, index(R.xts))
asset.fit <- lm(R.xts ~ f, ...)
B <- t(coef(asset.fit)[-1, , drop=FALSE])
alpha <- coef(asset.fit)[1,]
# extract r2, residual SD and residuals
resid.xts <- do.call(merge, sapply(X=summary(asset.fit), FUN="[", "residuals"))
r2 <- as.numeric(sapply(X=summary(asset.fit), FUN="[", "r.squared"))
resid.sd <- as.numeric(sapply(X=summary(asset.fit), FUN="[", "sigma"))
# compute factor model return covariance: NxN
Omega.fm <- B %*% var(f) %*% t(B) + diag(resid.sd^2)
# compute factor mimicking portfolio weights: NxK
mimic <- t(t(X)/colSums(X))
# assign row and column names
names(eig.val) <- paste("F", 1:n, sep = ".")
names(r2) = names(resid.sd) = colnames(R.xts)
colnames(B) = colnames(f)
# return list
list(asset.fit=asset.fit, k=k, factors=f, loadings=B, alpha=alpha, r2=r2,
resid.sd=resid.sd, residuals=resid.xts, Omega=Omega.fm, eigen=eig.val,
mimic=mimic)
}
### Asymptotic Principal Component Analysis when N >= T
#
UseAPCA <- function(R.xts=R.xts, k=k, refine=refine, corr=corr, ...) {
R.mat <- coredata(R.xts) # TxN
n <- ncol(R.mat)
obs <- nrow(R.mat)
# demean TxN matrix of returns
R.mat.d <- t(t(R.mat) - colMeans(R.mat))
# TxT return covariance matrix
Omega.T <- tcrossprod(R.mat.d)/n
if (corr) {
Omega.T <- cov2cor(Omega.T)
}
# get eigen decomposition
eig.decomp <- eigen(Omega.T, symmetric=TRUE)
eig.val <- eig.decomp$values
# get TxK factor realizations
X <- eig.decomp$vectors[, 1:k, drop=FALSE] # TxK
dimnames(X) <- list(1:obs, paste("F", 1:k, sep = "."))
f <- xts(X, index(R.xts))
# invert 1st principal component if most values are negative
if (median(f[,1]) < 0) {f[,1] <- -f[,1]}
# LS time series regression to get B: NxK matrix of factor loadings
asset.fit <- lm(R.xts ~ f, ...)
B <- t(coef(asset.fit)[-1, , drop=FALSE])
alpha <- coef(asset.fit)[1,]
# estimate residual standard deviations
resid.sd <- as.numeric(sapply(X=summary(asset.fit), FUN="[", "sigma"))
if (refine) {
R.mat.rescaled <- t(R.mat.d)/resid.sd
Omega.T <- crossprod(R.mat.rescaled)/n
if (corr) {
Omega.T <- cov2cor(Omega.T)
}
eig.decomp <- eigen(Omega.T, symmetric=TRUE)
eig.val <- eig.decomp$values
X <- eig.decomp$vectors[, 1:k, drop=FALSE]
dimnames(X) <- list(1:obs, paste("F", 1:k, sep = "."))
f <- xts(X, index(R.xts))
if (median(f[,1]) < 0) {f[,1] <- -f[,1]}
asset.fit <- lm(R.xts ~ f, ...)
B <- t(coef(asset.fit)[-1, , drop=FALSE])
alpha <- coef(asset.fit)[1,]
resid.sd <- as.numeric(sapply(X=summary(asset.fit), FUN="[", "sigma"))
}
# compute factor model return covariance: NxN
Omega.fm <- B %*% var(f) %*% t(B) + diag(resid.sd^2)
# compute factor mimicking portfolio weights
mimic <- ginv(R.mat) %*% f
mimic <- t(t(mimic)/colSums(mimic))
# extract r2, residuals
resid.xts <- do.call(merge, sapply(X=summary(asset.fit), FUN="[", "residuals"))
r2 <- as.numeric(sapply(X=summary(asset.fit), FUN="[", "r.squared"))
# assign row and column names
names(eig.val) = paste("F", 1:obs, sep = ".")
names(r2) = names(resid.sd) = rownames(mimic) = colnames(R.xts)
colnames(B) = colnames(f)
# return list
list(asset.fit=asset.fit, k=k, factors=f, loadings=B, alpha=alpha, r2=r2,
resid.sd=resid.sd, residuals=resid.xts, Omega=Omega.fm, eigen=eig.val,
mimic=mimic)
}
### Asymptotic Principal Component Analysis using 'ck' method to determine k
#
UseAPCA_ck <- function(R.xts=R.xts, max.k=max.k, refine=refine, sig=sig,
corr=corr, ...) {
n <- ncol(R.xts)
obs <- nrow(R.xts)
idx <- 2*(1:(obs/2))
# dof-adjusted squared residuals for k=1
fit <- UseAPCA(R.xts=R.xts, k=1, refine=refine, corr=corr, ...)
eps2 <- fit$residuals^2 / (1-2/obs-1/n)
for (k in 2:max.k) {
f <- fit
mu <- rowMeans(eps2[idx-1,,drop=FALSE])
# dof-adjusted squared residuals for k
fit <- UseAPCA(R.xts=R.xts, k=k, refine=refine, corr=corr, ...)
eps2.star <- fit$residuals^2 / (1-(k+1)/obs-k/n)
mu.star <- rowMeans(eps2.star[idx,,drop=FALSE])
# cross sectional differences in sqd. errors btw odd & even time periods
delta <- mu - mu.star
# test for a positive mean value for Delta
if(t.test(delta, alternative="greater")$p.value > sig) {return(f)}
eps2 <- eps2.star
}
return(fit)
}
### Asymptotic Principal Component Analysis using 'bn' method to determine k
#
UseAPCA_bn <- function(R.xts=R.xts, max.k=max.k, refine=refine, corr=corr, ...) {
n <- ncol(R.xts)
obs <- nrow(R.xts)
# intialize sigma
sigma <- rep(NA, max.k)
for (k in 1:max.k) {
# fit APCA for k factors
fit <- UseAPCA(R.xts=R.xts, k=k, refine=refine, corr=corr, ...)
# get cross-sectional average of residual variances
sigma[k] <- mean(fit$resid.sd^2)
}
idx <- 1:max.k
# Preferred criteria PC_p1 and PC_p2
PC_p1 <- sigma[idx] + idx*sigma[max.k]*(n+obs)/(n*obs)*log((n*obs)/(n+obs))
PC_p2 <- sigma[idx] + idx*sigma[max.k]*(n+obs)/(n*obs)*log(min(n,obs))
if(order(PC_p1)[1] != order(PC_p2)[1]) {
warning("PC_p1 and PC_p2 did not yield the same result. The smaller one was
used.")
}
k <- min(order(PC_p1)[1], order(PC_p2)[1])
UseAPCA(R.xts=R.xts, k=k, refine=refine, corr=corr, ...)
}
#' @param object a fit object of class \code{sfm} which is returned by
#' \code{fitSfm}
#' @rdname fitSfm
#' @method coef sfm
#' @export
coef.sfm <- function(object, ...) {
# cbind alpha and beta
coef.mat <- cbind(object$alpha, object$loadings)
# name for alpha/intercept column
colnames(coef.mat)[1] <- "(Intercept)"
return(coef.mat)
}
#' @rdname fitSfm
#' @method fitted sfm
#' @export
fitted.sfm <- function(object, ...) {
# use residuals already computed via fitSfm function
fitted.xts <- object$data - object$residuals
return(fitted.xts)
}
#' @rdname fitSfm
#' @method residuals sfm
#' @export
residuals.sfm <- function(object, ...) {
# already computed via fitSfm function
return(object$residuals)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.