# R/bigKRLS.R

#' bigKRLS
#' 
#' Runtime and Memory Optimized Kernel Regularized Least Squares
#' Pete Mohanty (Stanford University) and Robert Shaffer (University of Texas at Austin)
#' 
#' Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2014) to minimize parametric assumptions while maintaining interpretive clarity. Here, we introduce bigKRLS, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly fit KRLS models to medium and large datasets (N > ~2500). 
#'
#' Major Updates:
#'
#' 1. C++ integration. We re-implement most major computations in the model in C++ via Rcpp and RcppArmadillo. 
#' These changes produce up to a 50\% runtime decrease compared to the original R implementation even on a single core.
#'
#' 2. Leaner algorithm. Because of the Tikhonov regularization and parameter tuning strategies used in KRLS, estimation is inherently 
#' memory-heavy (O(N^2) in the number of observations), making memory savings important even for small- and medium-sized applications. We develop and implement a new local derivatives 
#' algorithm, which reduces peak memory usage by approximately an order of magnitude, and we cut the number of computations needed to find the regularization 
#' parameter in half.
#'
#' 3. Improved memory management. Most data objects in R perform poorly in memory-intensive applications. We use a series of packages in the bigmemory 
#' environment to ease this constraint, allowing our implementation to handle larger datasets more smoothly.
#'
#' 4. Parallel Processing. We parallelize most major calculations in the model. Time savings are especially noticeable in the derivative estimation portion
#' of the algorithm.
#'
#' 5. Interactive data visualization. We include an R Shiny app that allows bigKRLS users to easily share results with collaborators or 
#' more general audiences. Simply call shiny.bigKRLS() on the outputted regression object. 
#'
#' 6. Improved uncertainty estimates. bigKRLS uses an adjusted degrees-of-freedom estimator that reflects both the regularization process and the number of 
#' predictors. For details and other options, see help(summary.bigKRLS).
#' 
#' 7. Cross-validation. crossvalidate.bigKRLS performs CV and stores performance results and parameter settings for each fold. See vignette("bigKRLS_basics") 
#' or help("crossvalidate.bigKRLS") for syntax.
#'
#' Requirements. bigKRLS is under active development. bigKRLS, as well as its dependencies, requires current versions of R and its compilers 
#' (and of RStudio, if used). For details, see \url{https://github.com/rdrr1990/code/blob/master/bigKRLS_installation.md}.
#'
#' For details on syntax, load the library and then open our vignette with vignette("bigKRLS_basics"). Because of the quadratic memory requirement, users 
#' working on a typical laptop (8-16 gigabytes of RAM) may wish to start at N = 2,500 or 5,000, particularly if the number of *x* variables is large. 
#' Once you have a sense of how bigKRLS runs on your system, you may wish to estimate only a subset of the marginal effects at N = 10-15,000 by setting 
#' bigKRLS(..., which.derivatives = c(1, 3, 5)) for the marginal effects of the first, third, and fifth x variable. 
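#'
#' For instance, a minimal sketch of that workflow (y and X stand in for your own data):
#'
#'# out <- bigKRLS(y, X, which.derivatives = c(1, 3, 5))
#'# summary(out)  # AMEs reported for the first, third, and fifth x variable only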
#' 
#' Pete Mohanty and Robert Shaffer. 2018. "Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS." 
#' Political Analysis. Cambridge University Press. Pages 1-18. DOI: 10.1017/pan.2018.33.
#' See also: Mohanty, Pete; Shaffer, Robert, 2018, 
#' "Replication Data for: Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS", 
#' https://doi.org/10.7910/DVN/CYYLOK, Harvard Dataverse, V1.
#' 
#' Hainmueller, Jens and Chad Hazlett. 2014. "Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine 
#' Learning Approach." Political Analysis. 22:143-68. \url{https://web.stanford.edu/~jhain/Paper/PA2014a.pdf} (Accessed May 20th, 2016).
#' 
#' Recent papers, presentations, and other code available at \url{github.com/rdrr1990/code/}
#' 
#' License 
#' Code released under GPL (>= 2).
#' @useDynLib bigKRLS, .registration=TRUE
#' @importFrom Rcpp evalCpp
#' @importFrom stats pt quantile cor sd var
#' @importFrom parallel detectCores
#' @importFrom grDevices palette
#' @importFrom ggplot2 aes element_blank geom_hline geom_point geom_smooth ggplot labs theme theme_minimal xlab ylab
#' @import bigalgebra biganalytics bigmemory shiny parallel
#' @docType package
#' @name bigKRLS
"_PACKAGE"

#' bigKRLS
#' @param y A vector of numeric observations on the dependent variable. Missing values not allowed. May be base R matrix or big.matrix.
#' @param X A matrix of numeric observations of the independent variables. Factors, missing values, and constant vectors not allowed. May be base R matrix or big.matrix.
#' @param sigma Bandwidth parameter, shorthand for sigma squared. Default: sigma <- ncol(X). 
#' @param derivative Logical. Estimate derivatives (as opposed to just coefficients)? Recommended for interpretability.
#' @param which.derivatives Optional. For which columns of X should marginal effects be estimated? If derivative=TRUE and which.derivatives=NULL (default), all marginal effects will be estimated.
#' @param vcov.est Logical. Estimate variance covariance matrix? Required to obtain derivatives and standard errors on predictions.
#' @param Neig Number of eigenvectors and eigenvalues to calculate. The default is to calculate all N and only use those where eigval >= 0.001 * max(eigval). Smaller values will reduce runtime, but decrease precision.
#' @param eigtrunc Eigentruncation parameter. If NULL, defaults to 0.001 if N > 3000 and 0 otherwise. eigtrunc = 0.25 keeps only those eigenvectors/values such that the eigenvalue is at least 25\% of the max. If eigtrunc == 0, all Neig are used to select lambda and to estimate variances. Larger values will reduce runtime, but decrease precision.
#' @param lambda Regularization parameter. Default: selected based (in part) on the eigenvalues of the kernel via Golden Search. Must be positive, real number. 
#' @param L Lower bound of Golden Search for lambda. 
#' @param U Upper bound of Golden Search for lambda.
#' @param tol Tolerance parameter for the Golden Search for lambda. Default: N / 1000.
#' @param model_subfolder_name If not null, will save estimates to this subfolder of your current working directory. Alternatively, use save.bigKRLS() on the outputted object.
#' @param overwrite.existing Logical. overwrite contents in folder 'model_subfolder_name'? If FALSE, appends lowest possible number to model_subfolder_name name (e.g., ../myresults3/). 
#' @param Ncores Number of processor cores to use. Default: ncol(X) or the number of available cores - 2, whichever is smaller. More than (available cores - 2) NOT recommended. Uses library(parallel) unless Ncores = 1.
#' @param acf Logical. Experimental; default == FALSE. Calculate Neffective as function of mean absolute auto-correlation in X to correct p-values?
#' @param noisy Logical. Display detailed version of progress to console (intermediate output, time stamps, etc.) as opposed to minimal display? Default: if(N > 2000) TRUE else FALSE. SSH users should use X11 forwarding to see Rcpp progress display.  
#' @param instructions Logical. Display syntax after estimation with other library(bigKRLS) functions that can be used on output?
#' @return bigKRLS Object containing slope and uncertainty estimates; summary() and predict() defined for class bigKRLS, as is shiny.bigKRLS().
#' @examples
#'# Analyzing chicken weights
#'# y <- as.matrix(ChickWeight$weight) 
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'
#'# out <- bigKRLS(y, X)
#'# out$R2
#'# summary(out, labs = c("Time", "Diet")) 
#'
#'# don't use save() unless out$has.big.matrices == FALSE
#'# save.bigKRLS(out, "exciting_results") 
#' @export
bigKRLS <- function (y = NULL, X = NULL, sigma = NULL, 
                     derivative = TRUE, which.derivatives = NULL, vcov.est = TRUE,
                     Neig = NULL, eigtrunc = NULL,
                     lambda = NULL, L = NULL, U = NULL, tol = NULL,
                     model_subfolder_name = NULL, 
                     overwrite.existing = FALSE, Ncores = NULL, 
                     acf = FALSE, noisy = NULL, instructions = TRUE)
{
  
  # Ensure RStudio is new enough for dependencies, see init.R
  check_platform()
  # Note Windows requires newer RStudio ( >= 1.1.129) than Mac or Linux ( >= 1.0.136) 
  # due to BH compatibility issues. 

  if(!is.null(model_subfolder_name)){
    stopifnot(is.character(model_subfolder_name))
    
    if(!overwrite.existing & (model_subfolder_name %in% dir())){
      
      i <- 1
      tmp.name <- paste0(model_subfolder_name, i)
      while(tmp.name %in% dir()){
        i <- i + 1
        tmp.name <- paste0(model_subfolder_name, i)
      }
      if(model_subfolder_name %in% dir()){
        warning(paste("\na subfolder named", model_subfolder_name, 
                      "exists in your current working directory.\nYour output will be saved to", tmp.name, 
                      "instead.\nTo disable this safeguard, set bigKRLS(..., overwrite.existing=TRUE) next time.\n"))
      }
      model_subfolder_name <- tmp.name
    }
    
    dir.create(model_subfolder_name, showWarnings=FALSE)
    cat("\nmodel estimates will be saved to:\n\n", model_subfolder_name, "\n\n")
    
  }
  
  # create a folder for file backings (metadata for each big matrix etc)
  # must call to.big.matrix() with
  # to.big.matrix(... path = big.meta)
  
  big.meta <- create.metadata.dir()
  
  # suppress warnings from bigmatrix
  oldw <- getOption("warn")
  options(warn = -1)
  options(bigmemory.allow.dimnames = TRUE)
  
  stopifnot(is.matrix(X) | is.big.matrix(X))
  
  w <- list()                                             # w will become bigKRLS object
  return.big.rectangles <- is.big.matrix(X)               # X matrix, derivatives -- how to return?
  return.big.squares <- is.big.matrix(X) | nrow(X) > 2500 # Kernel, variance matrices -- how to return?
  w[["has.big.matrices"]] <- return.big.squares | return.big.rectangles
  
  noisy <- if(is.null(noisy)) nrow(X) > 2000 else noisy
  stopifnot(is.logical(noisy))
  
  if(noisy){
    
    msg <- if(w[["has.big.matrices"]]) 
      "the output will contain big.matrix objects. To avoid crashing R, use save.bigKRLS() on the output, not save().\n\n" 
    else 
      "the output will consist entirely of base R objects.\n\n"
    cat("Based on sample size (whether or not N > 2,500) and input type (base R vs. 'big' matrices),", msg)
    
  }
  
  # all X columns must have labels to prevent various post-estimation nuisance errors
  colnames(X) <- if(is.null(colnames(X))) paste0("x", 1:ncol(X)) else colnames(X)
  for(i in 1:ncol(X)){
    if(nchar(colnames(X)[i]) == 0){
      colnames(X)[i] <- paste0("x", i)
    }
  }
  xlabs <- colnames(X)
  
  w[["X"]] <- if(return.big.rectangles) to.big.matrix(X, deepcopy = TRUE, path = big.meta) else X 
  # deepcopy(X) prevents pointer to X from being inadvertently standardized 
  # in AND outside of bigKRLS()
  X <- to.big.matrix(X, path = big.meta, name = "X")
  X.init.sd <- colsd(X)
  y <- to.big.matrix(y, p = 1, path = big.meta, name = "y")
  y.init <- deepcopy(y)
  
  miss.ind <- colna(X)
  if (sum(miss.ind) > 0) { 
    stop(paste("the following columns in X contain missing data, which must be removed:", 
               paste((1:length(miss.ind))[miss.ind > 0], collapse = ', '), collapse=''))
  }
  n <- nrow(X)
  p <- ncol(X)
  # correcting p values as f(pairwise correlation of rows of X) 
  # only possible + nontrivial when ncol(X) > 2 
  acf <- acf & p > 2
  
  Neig <- if(is.numeric(Neig)) min(n, as.integer(Neig)) else n
  if(is.null(eigtrunc)){
    if(n > 3000){
      eigtrunc <- 0.001
      cat('\nUsing eigentruncation = 0.001 to speed up computation. Reducing this value will improve accuracy, but increase runtime.')
    }else{
      eigtrunc <- 0
    }
  } else if(!is.numeric(eigtrunc) | eigtrunc < 0 | eigtrunc > 1){
    stop("eigtrunc must be between 0 (no truncation) and 1 (keep largest only).")
  }
  
  if(!is.null(which.derivatives)){
    
    if(!derivative)
      stop("which.derivatives requires derivative = TRUE\n\nDerivative is a logical indicating whether derivatives should be estimated (as opposed to just coefficients); which.derivatives is a vector indicating which ones (with NULL meaning all).")
    stopifnot(sum(which.derivatives %in% 1:p) == length(which.derivatives))
    if(noisy){
      cat("\nMarginal effects will be calculated for the following x variables:\n")
      cat(which.derivatives, sep=", ")
    }
  }
  
  if(min(X.init.sd) == 0) 
    stop("The following columns in X are constant and must be removed: ", 
         paste(which(X.init.sd == 0), collapse = ", "))
  if(n != nrow(y)) 
    stop("nrow(X) not equal to number of elements in y.")
  if(colna(y) > 0) 
    stop("y contains missing data.") 
  if(colsd(y) == 0) 
    stop("y is a constant.") 
  if(!is.null(lambda))
    stopifnot(is.vector(lambda), length(lambda) == 1, is.numeric(lambda), lambda > 0)
  if(!is.null(sigma))
    stopifnot(is.vector(sigma), length(sigma) == 1, is.numeric(sigma), sigma > 0)
  
  sigma <- if(is.null(sigma)) p else sigma
  
  if (is.null(tol)) { # tolerance parameter for lambda search
    tol <- n/1000
  } else {
    stopifnot(is.vector(tol), length(tol) == 1, is.numeric(tol), tol > 0)
  }
  
  stopifnot(is.logical(derivative), is.logical(vcov.est))
  if (derivative & !vcov.est) 
    stop("vcov.est is needed to get derivatives (derivative==TRUE requires vcov.est=TRUE).")
  
  x.is.binary <- apply(X, 2, function(x){length(unique(x))}) == 2 
  if(noisy & sum(x.is.binary) > 0){
    cat(paste("\nFirst differences will be computed for the following (binary) columns of X: ", 
              toString((1:p)[x.is.binary], sep=', '), sep=""), '\n\n')
  }
  
  y.init.sd <- colsd(y.init)
  y.init.mean <- colmean(y.init)
  
  for(i in 1:ncol(X)){
    X[,i] <- (X[,i] - mean(X[,i]))/sd(X[,i])
  }
  y[,1] <- (y[,1] - mean(y[,1]))/sd(y[,1])
  
  # by default, uses as many cores as there are X variables or available cores - 2, whichever is smaller
  Ncores <- if(is.null(Ncores)) min(c(parallel::detectCores() - 2, ncol(X))) else Ncores
  if(noisy) cat(Ncores, "cores will be used.\n")
  
  if(noisy) cat("\nStep 1/5: Kernel (started at ", Time(), ").", sep="")
  
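  # Gaussian kernel on the standardized data; following Hainmueller & Hazlett (2014),
  # K[i, j] = exp(-||X[i, ] - X[j, ]||^2 / sigma), with sigma playing the role of
  # the bandwidth sigma^2 (default: ncol(X))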
  K <- bGaussKernel(X, sigma)

  if(noisy) cat("\nStep 2/5: Spectral Decomposition (started at ", Time(), ").", sep="")
  
  Eigenobject <- bEigen(K, Neig, eigtrunc)

  w[["K.eigenvalues"]] <- Eigenobject$values
  w[["lastkeeper"]] <- Eigenobject$lastkeeper
  
  if (is.null(lambda)) {
    if(noisy){cat("\nStep 3/5: Golden Search for regularization parameter lambda (started at ", 
                  Time(), ").", sep="")}
    lambda <- bLambdaSearch(L = L, U = U, y = y, 
                            Eigenobject = Eigenobject, noisy = noisy)
  }else{
    if(noisy){cat("\nSkipping step 3/5, proceeding with user-inputted lambda.\n")}
  }
  
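  # effective sample size: N minus the trace of the hat matrix,
  # i.e., n - sum(eigenvalues / (eigenvalues + lambda)); used for the
  # degrees of freedom in summary(), see help(summary.bigKRLS)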
  w[["Neffective"]] <- n - sum(w[["K.eigenvalues"]]/(w[["K.eigenvalues"]] + lambda))
  if(noisy){cat("Effective Sample Size: ", w[["Neffective"]], '.', sep='')}
  
  if(noisy){cat("\n\nStep 4/5: Calculate coefficients & related estimates (started at ", 
                Time(), ").", sep="")}
  
  out <- bSolveForc(y = y, Eigenobject = Eigenobject, lambda = lambda)
  
  # bSolveForc obtains the vector of coefficients (weights) 
  # that assign importance to the similarity scores (found in K)
  if(noisy){cat("\n\nFitting values.")}
  yfitted <- K %*% to.big.matrix(out$coeffs, p = 1, path = big.meta)

  if (vcov.est == TRUE) {
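    # sigmasq estimates the residual variance of the standardized y: RSS / n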
    sigmasq <- bCrossProd(y - yfitted)[]/n
    if(noisy){cat("\nIn standardized units, sigmasq = ", round(sigmasq, 5), ".", sep='')}
    if(noisy){cat("\nCalculating variance-covariance of the coefficients.")}
    
    # subsetting from Neig < N and/or eigtrunc now handled by bEigen()
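    # vcov(c) = sigmasq * E %*% diag((values + lambda)^-2) %*% t(E),
    # where K = E %*% diag(values) %*% t(E) is the spectral decomposition above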
    m <- bMultDiag(Eigenobject$vectors, 
                   sigmasq * (Eigenobject$values + lambda)^-2)
    vcovmatc <- bTCrossProd(m, Eigenobject$vectors)
    
    remove(Eigenobject)
    remove(m)
    gc()
    if(noisy) cat("\nEstimating variance covariance of the fitted values.")
    vcovmatyhat <- bCrossProd(K, vcovmatc %*% K)
    if(!is.null(model_subfolder_name) & return.big.squares){
      vcovmatyhat <- (y.init.sd^2) * vcovmatyhat
      cat("\nsaving vcovmatyhat to", getwd())
      write.big.matrix(x = vcovmatyhat, 
                       filename = file.path(model_subfolder_name, "vcovmatyhat.txt"))
      remove(vcovmatyhat)
      cat("\nvcovmatyhat successfully saved to disk (and removed from memory for speed).\n")
    }
    
  }else {
    vcov.est.c <- NULL
    vcov.est.fitted <- NULL
  }
  if (derivative == TRUE) {
    
    if(noisy){cat("\n\nStep 5/5: Estimate marginal effects and their uncertainty (started at ", 
                  Time(), ").\n\n", sep="")}
    
    X_estimate <- if(!is.null(which.derivatives)) deepcopy(X, cols = which.derivatives) else X
    
    if(Ncores == 1){
        deriv_out <- bDerivatives(X_estimate, sigma, K, out$coeffs, vcovmatc)
    }else{
      
      X_index <- if(is.null(which.derivatives)) 1:p else which.derivatives

      # each core will need to know how to find the big matrices
      # writing their description to disk will allow each core to do that...
      
      K <- to.big.matrix(K, name = "K", path = big.meta)
      vcovmatc <- to.big.matrix(vcovmatc, name = "V", path = big.meta)
      
      if(!("cl" %in% ls())){
        cl <- if(noisy) makeCluster(Ncores, outfile='') else makeCluster(Ncores)
        clusterEvalQ(cl, suppressPackageStartupMessages(library(bigKRLS)))
      } 
      
      tmp <- parLapply(cl, X_index, function(i, sigma, coefficients, X.init.sd, path){

        # each core finds the big matrices like so...
        X <- attach.resource(dget(file.path(path, "X.desc")), path = path)
        K <- attach.resource(dget(file.path(path, "K.desc")), path = path)
        V <- attach.resource(dget(file.path(path, "V.desc")), path = path)
        
        # the description that dget obtains doesn't contain a path variable
        # which prevents attach.big.matrix() from working. but it's just a wrapper
        # for attach.resource()...
          
        x <- deepcopy(X, cols = i)
        
        output <- bDerivatives(x, sigma, K, coefficients, V)
        # can't return pointers
        list(output[[1]][], output[[2]])
        # output is small but could also use to.big.matrix for reverse direction (should if N * N needed)
      }, sigma, out$coeffs, X.init.sd, big.meta)
      stopCluster(cl) 
      remove(cl)
      
      derivs <- matrix(nrow = n, ncol = ncol(X_estimate))
      varavgderiv <- c()
      for(i in 1:ncol(X_estimate)){
        derivs[,i] <- tmp[[i]][[1]]
        varavgderiv[i] <- tmp[[i]][[2]]
      }
      deriv_out <- list()
      deriv_out[["derivatives"]] <- to.big.matrix(derivs, path = big.meta, name = "derivs") 
      deriv_out[["varavgderiv"]] <- varavgderiv
      remove(tmp, derivs, varavgderiv)
    }
    
    
    if(noisy){
      cat('\n\nFinished (', Time(), ').', sep="") 
      cat('\n\nPrepping bigKRLS output object...\n')
    }
    
    derivmat <- deriv_out$derivatives
    varavgderivmat <- deriv_out$varavgderiv
    remove(deriv_out)
    
    # Pseudo R2 using only Average Marginal Effects
    # yhat_ame <- (X_estimate[] %*% colMeans(derivmat[]))^2
    yhat_ame <- if(length(which.derivatives) == 1 || p == 1) tcrossprod(as.matrix(X_estimate[]), colMeans(as.matrix(derivmat[]))) else X_estimate[] %*% colMeans(derivmat[])
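    # (tcrossprod() guards the single-predictor case, where colMeans() drops to a scalar)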

    w[["R2AME"]] <- cor(y.init[], yhat_ame)^2
    
    derivmat <- y.init.sd * derivmat
    for(i in 1:ncol(derivmat)){
      derivmat[,i] <- derivmat[,i]/X.init.sd[i]
    }
    
    attr(derivmat, "scaled:scale") <- NULL
    avgderiv <- matrix(colmean(derivmat), nrow=1)
    attr(avgderiv, "scaled:scale") <- NULL
    
    if(is.null(which.derivatives)){
      varavgderivmat <- matrix((y.init.sd/X.init.sd)^2 * as.matrix(varavgderivmat), nrow = 1)
    }else{
      varavgderivmat <- matrix((y.init.sd/X.init.sd[which.derivatives])^2 * as.matrix(varavgderivmat), nrow=1)
    }
    
    attr(varavgderivmat, "scaled:scale") <- NULL
  }
  
  if(acf){
    if(noisy){cat('Accumulating absolute pairwise correlations within X to correct p-values; see help(bigKRLS).')}
    Neffective.acf <- bNeffective(X)
    if(noisy){cat("\nEffective Sample Size as f(absolute correlation of X): ", Neffective.acf, '.', sep='')}
  }
  
  # w is the output object
  
  w[["coeffs"]] <- out$coeffs
  w[["y"]] <- y.init[]
  w[["sigma"]] <- sigma
  w[["lambda"]] <- lambda 
  w[["binaryindicator"]] <- x.is.binary
  w[["which.derivatives"]] <- which.derivatives
  w[["xlabs"]] <- xlabs
  
  w[["yfitted"]] <- yfitted <- as.matrix(yfitted) * y.init.sd + y.init.mean
  w[["R2"]] <- 1 - (var(y.init - yfitted)/(y.init.sd^2))
  w[["Looe"]] <- out$Le * y.init.sd
  w[["Neffective.acf"]] <- if(exists("Neffective.acf")) Neffective.acf else NULL
  
  # returning base R matrices when sensible...
  w[["K"]] <- if(return.big.squares) K else K[] 
  
  if (vcov.est) {
    
    vcovmatc <- (y.init.sd^2) * vcovmatc
    
    if(return.big.squares){
      
      w[["vcov.est.c"]] <- vcovmatc
      
      if(is.null(model_subfolder_name)){
        vcovmatyhat <- (y.init.sd^2) * vcovmatyhat
        w[["vcov.est.fitted"]] <- vcovmatyhat
      } # vcovmatyhat already saved otherwise
      
    }else{
      w[["vcov.est.c"]] <- vcovmatc[]
      w[["vcov.est.fitted"]] <- vcovmatyhat[]
    }
  }
  
  w[["derivative.call"]] <- derivative
  
if(derivative){
    rownames(avgderiv) <- rownames(varavgderivmat) <- ""
    
    w[["avgderivatives"]] <- avgderiv
    w[["var.avgderivatives"]] <- varavgderivmat
    w[["derivatives"]] <- if(return.big.rectangles) derivmat else derivmat[]
    
    if(p == 1 & !return.big.rectangles){
      w$derivatives <- matrix(w$derivatives)
      w$avgderivatives <- matrix(w$avgderivatives)
    }
    colnames(w$derivatives) <- colnames(w$avgderivatives) <- if(is.null(which.derivatives)) xlabs else xlabs[which.derivatives]
  }
  
  if(!is.null(model_subfolder_name)){
    
    cat("\nsaving output to", model_subfolder_name, "\n")
    w[["path"]] <- normalizePath(model_subfolder_name)
    
    for(i in which(unlist(lapply(w, is.big.matrix)))){
      output_file = file.path(model_subfolder_name, paste0(names(w)[i], ".txt"))
      cat("\twriting", output_file, "...\n")
      write.big.matrix(x = w[[i]], filename = output_file) 
                       # col.names = !is.null(colnames(w[[i]])) 
                       # deprecating, handling with object$xlabs
    }
    
    Nbm <- sum(unlist(lapply(w, is.big.matrix))) + return.big.squares
    cat("\n\n", Nbm, "matrices saved as big matrices.\n") 
    if(Nbm == 0){
      cat(" (base R save() may be used safely in this case too).\n")
    }else{
      cat("\nto reload, use syntax like:\n\nload.bigKRLS(\"", w$path, "\")\n or\n",
          "load.bigKRLS(\"", w$path, "\", newname=\"my_estimates\")\n", sep="")}
    if(Nbm > 0){
      bigKRLS_out <- w[-which(unlist(lapply(w, is.big.matrix)))]
      class(bigKRLS_out) <- "bigKRLS"
    }else{
      bigKRLS_out <- w
    }
    stopifnot(sum(unlist(lapply(bigKRLS_out, is.big.matrix))) == 0)
    save(bigKRLS_out, file=file.path(model_subfolder_name, "estimates.RData"))
    cat("\nbase R elements of the output saved to estimates.RData.\n")
    cat("Total file size approximately", 
        round(sum(file.info(list.files(path = model_subfolder_name, full.names = TRUE))$size)/1024^2), 
        "megabytes.\n\n")
    model_subfolder_name
  }
  
  unlink(big.meta, recursive = TRUE)
  # file.remove(dir(path = big.meta, full.names = TRUE))
  # descriptions are pointers that would crash R outside of the current R session; 
  # removing their footprint
  
  if(instructions) cat("\nAll done. You may wish to use summary() for more detail, predict() for out-of-sample forecasts, or shiny.bigKRLS() to interact with results. For an alternative approach, see help(crossvalidate.bigKRLS). Type vignette(\"bigKRLS_basics\") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.\n\n")
  class(w) <- "bigKRLS" 
  options(warn = oldw) # restore warning behavior (suppressed above for bigmemory)
  return(w)
}

###################
# bigKRLS exports # 
###################

#' predict.bigKRLS
#' 
#' Predict function for bigKRLS object. crossvalidate.bigKRLS() provides additional functionality.
#' 
#' @param object bigKRLS output
#' @param newdata New data. Columns in newdata should be ordered identically to columns in X.
#' @param se.pred Calculate standard errors on predictions?
#' @param correct_SE If se.pred == TRUE, default is to use Neffective (if available) rather than model N in calculating uncertainty of predictions.
#' @param ytest Provide testing data to have it returned with the object. Optional. To automatically generate out-of-sample test statistics, use crossvalidate.bigKRLS() instead.
#' @param ... ignore
#' @return Returns bigKRLS_predicted list object.
#' @examples  
#'# set.seed(123)
#'# y <- as.matrix(ChickWeight$weight)
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'# N <- length(y)
#'
#'# train <- sample(N, 100, replace = FALSE)
#'# outOfSample <- c(1:N)[-train]
#'# test <- sample(outOfSample, 10, replace = FALSE)
#'
#'# fit <- bigKRLS(y[train], X[train,], instructions = FALSE) 
#'# p <- predict(fit, X[test,])
#' @method predict bigKRLS
#' @export
predict.bigKRLS <- function (object, newdata, se.pred = FALSE, correct_SE=TRUE, ytest = NULL, ...) 
{
  if (!inherits(object, "bigKRLS")) {
    warning("Object not of class 'bigKRLS'")
    UseMethod("predict")
    return(invisible(NULL))
  }
  if(se.pred == TRUE) {
    if (is.null(object$vcov.est.c)) {
      stop("recompute the bigKRLS object with bigKRLS(..., vcov.est = TRUE) to compute standard errors")
    }
  }
  
  # convert everything to a bigmatrix for internal usage
  
  big.meta <- create.metadata.dir()
  
  object$X <- to.big.matrix(object$X, deepcopy = TRUE, path = big.meta)
  object$K <- to.big.matrix(object$K, path = big.meta) #, description = "K")
  object$vcov.est.c <- to.big.matrix(object$vcov.est.c, path = big.meta)

  if(!is.null(object$vcov.est.fitted)){
    object$vcov.est.fitted <- to.big.matrix(object$vcov.est.fitted, path = big.meta)  
  }else{
    cat("vcov.est.fitted not found in bigKRLS object, attempting to load from object's path,\n ", object$path)
    if("vcov.est.fitted.txt" %in% dir(object$path)){
      object$vcov.est.fitted <- read.big.matrix(filename = file.path(object$path, "vcov.est.fitted.txt"),
                                                type = 'double')
      cat("\nVariance(y hat) matrix loaded successfully\n")
    }else{
      cat("\nFile not found.\n")
    }
  }
  
  # flag: return big matrices? (new kernel, etc...)
  bigmatrix.in <- is.big.matrix(newdata) | object$has.big.matrices
  
  newdata.init <- newdata
  newdata <- to.big.matrix(newdata, deepcopy = TRUE, path = big.meta)
  
  if (ncol(object$X) != ncol(newdata)) 
    stop("ncol(newdata) differs from ncol(X) from fitted bigKRLS object")
  
  Xmeans <- colmean(object$X, na.rm = TRUE)
  Xsd <- colsd(object$X, na.rm = TRUE)
  
  for(i in 1:ncol(object$X))
    object$X[,i] <- (object$X[,i] - Xmeans[i])/Xsd[i]

  for(i in 1:ncol(newdata))
    newdata[,i] <- (newdata[,i] - Xmeans[i])/Xsd[i]

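  # kernel between the standardized new observations and the training data,
  # evaluated with the bandwidth used at fit time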
  newdataK <- bTempKernel(newdata, object$X, object$sigma)
  
  ypred <- (newdataK %*% to.big.matrix(object$coeffs, path = big.meta))[]
  
  if (se.pred) {
    
    # vcov.est.c.raw <- object$vcov.est.c * (1/var(object$y))
    # vcov.est.pred <- var(object$y) * bTCrossProd(newdataK %*% vcov.est.c.raw, newdataK)
    # remove(vcov.est.c.raw)
    vcov.est.pred <- var(object$y) * bTCrossProd(newdataK %*% (object$vcov.est.c * (1/var(object$y))), newdataK)
    
    if(correct_SE & !is.null(object$Neffective))
      vcov.est.pred <- sqrt(nrow(object$X)/object$Neffective) * vcov.est.pred
    
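    # standard errors of predictions: square roots of the diagonal of vcov(yhat)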
    se.pred <- sqrt(bDiag(vcov.est.pred))
    # se.pred <- matrix(sqrt(diag(vcov.est.pred[])), ncol = 1) 
    
  }
  else {
    vcov.est.pred <- se.pred <- NULL
  }
  
  ypred <- ypred * sd(object$y) + mean(object$y)
  
  if(!bigmatrix.in){
    vcov.est.pred <- vcov.est.pred[]
    newdataK <- newdataK[]
  }
  
  out <- list(predicted = ypred, se.pred = se.pred, vcov.est.pred = vcov.est.pred, 
              newdata = newdata.init, newdataK = newdataK, 
              has.big.matrices = bigmatrix.in, # TRUE if bigKRLS returned big OR user inputted to predict()
              ytest = ytest)
  
  class(out) <- "bigKRLS_predicted"
  unlink(big.meta, recursive = TRUE)
  return(out)
  
}

#' summary.bigKRLS
#' 
#' Summary function for bigKRLS output. 
#' 
#' @param object bigKRLS output.
#' @param degrees "Neffective" (default), "N", or "acf" (experimental). If "Neffective", the degrees of freedom for the t tests reflect the degrees of freedom used to obtain the regularization parameter, lambda: Neffective = N - sum(eigenvalues/(eigenvalues + lambda)); see e.g. Hastie et al. (2015, 61-68). "N" is simply the observed sample size (note this is the default for library(KRLS)). "acf" computes an effective sample size from the mean absolute autocorrelation in X; see help(bigKRLS). Degrees of freedom for the t tests is either Neffective - P or N - P.
#' @param probs Which quantiles of the marginal effects of each x variable should be displayed?
#' @param digits Number of significant digits.
#' @param labs Optional vector of x labels.
#' @param ... ignore
#' @return Returns list with "ttests" (Average Marginal Effect estimates, standard errors, t-values, and p values) and "percentiles" (of the marginal effects).
#' @method summary bigKRLS
#' @examples
#'# y <- as.matrix(ChickWeight$weight)
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'
#'# out <- bigKRLS(y, X, Ncores=1)
#'# summary(out)
#'
#'# s = summary(out, digits = 2, labs = c("Time", "ChickWeightDiet"))
#'
#'# knitr::kable(s[["ttests"]])     # to format with RNotebook or RMarkdown
#'# knitr::kable(s[["percentiles"]])
#'
#'# Calculate p-values using uncorrected standard errors (not recommended)
#'# summary(out, degrees = "N")
#'@export
summary.bigKRLS <- function (object, degrees = "Neffective", probs = c(0.05, 0.25, 0.5, 0.75, 0.95), 
                             digits = 4, labs = NULL, ...) 
{
  if (!inherits(object, "bigKRLS")) {
    warning("Object not of class 'bigKRLS'")
    UseMethod("summary")
    return(invisible(NULL))
  }
  
  N <- n <- nrow(object$X)
  
  stopifnot(degrees %in% c("acf", "Neffective", "N"))
  
  if(degrees == "Neffective"){
    n <- object$Neffective
  }
  if(degrees == "acf"){
    if(is.null(object$Neffective.acf)){
      big.meta <- create.metadata.dir()
      n <- bNeffective(to.big.matrix(scale(object$X[]), path = big.meta))
      unlink(big.meta)
      cat("\n\n\n")
    }else{
      n <- object$Neffective.acf
    } 
  } 
  
  cat("\n\nMODEL SUMMARY:\n\n")
  cat("lambda:", round(object$lambda, digits), "\n")
  cat("N:", N, "\n")
  if(n != N) cat("N Effective:", n, "\n")
  
  p <- ncol(object$X)
  cat("R2:", round(object$R2, digits), "\n")
  
  if (is.null(object$derivatives)) {
    cat("\nrecompute with bigKRLS(..., derivative = TRUE) for estimates of marginal effects\n")
    return(invisible(NULL))
  }
  
  if(!is.null(object$R2AME))
    cat("R2AME**:", round(object$R2AME, digits), "\n\n")
  
  if(!is.null(labs)){
    stopifnot(length(labs) == p)
    colnames(object$X) <- labs
  }else{
    if(is.big.matrix(object$X)) options(bigmemory.allow.dimnames=TRUE)
    colnames(object$X) <- object$xlabs
  }
  
  if(is.null(object$which.derivatives)){
    object$which.derivatives <- 1:p
  }
  
  est <- object$avgderivatives
  se <- sqrt(object$var.avgderivatives)
  if(degrees != "Neffective"){
    se <- se*nrow(object$X)/n # scale standard errors when using N or acf-based degrees
  }
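  # two-sided t tests on the Average Marginal Effects:
  # t = AME / SE(AME) with n - p degrees of freedom (n = Neffective by default)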
  tval <- est/se
  pval <- 2 * pt(abs(tval), n - p, lower.tail = FALSE)
  AME <- t(rbind(est, se, tval, pval))
  colnames(AME) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(AME) <- colnames(object$X)[object$which.derivatives]
  if (sum(object$binaryindicator[object$which.derivatives]) > 0) {
    tmp <- rownames(AME)[object$binaryindicator[object$which.derivatives]]
    rownames(AME)[object$binaryindicator[object$which.derivatives]] <- paste(tmp, "*", sep="")
  }
  cat("Average Marginal Effects:\n\n")
  print(round(AME, digits))
  
  cat("\n\nPercentiles of Marginal Effects:\n\n")
  
  deriv <- matrix(object$derivatives[], ncol = length(object$which.derivatives))
  qderiv <- t(apply(deriv, 2, quantile, probs = probs, na.rm = TRUE))
  rownames(qderiv) <- rownames(AME)
  print(round(qderiv, digits))
  
  if (sum(object$binaryindicator) > 0) {
    cat("\n(*) Reported average and percentiles of dy/dx are for a discrete change of the dummy variable from its min to its max (usually 0 to 1).\n\n")
  }
  cat("\n(**) Pseudo-R^2 computed using only the Average Marginal Effects.") 
  if(length(object$which.derivatives) != ncol(object$X)) cat(" NOTE: if only a subset of marginal effects was estimated, the Pseudo-R^2 is calculated with that subset.")
  cat("\n\n")
  cat("\nYou may also wish to use predict() for out-of-sample forecasts or shiny.bigKRLS() to interact with results. Type vignette(\"bigKRLS_basics\") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.\n\n")
  ans <- list(ttests = AME, 
              percentiles = qderiv)
  class(ans) <- "summary.bigKRLS"
  return(invisible(ans))
  
}

#' summary.bigKRLS_CV
#' 
#' Summary function for bigKRLS crossvalidated output.
#' 
#' @param object bigKRLS_CV output.
#' @param ... Additional parameters to be passed to summary() for the training model(s). For example, summary(cv, digits = 3). See ?summary.bigKRLS for details.
#' @method summary bigKRLS_CV
#' @examples
#'# y <- as.matrix(ChickWeight$weight)
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'
#'# cv.out <- crossvalidate.bigKRLS(y, X, seed = 123, ptesting = 20)
#'# summary(cv.out)
#'
#'# cv <- summary(cv.out, labs = c("Time", "Diet"))
#'# cv$training.ttests
#'
#'# kcv.out <- crossvalidate.bigKRLS(y, X, seed = 123, Kfolds = 3)
#'# summary(kcv.out)  
#'         
#'# kcv <- summary(kcv.out) 
#'# kcv$overview
#'# kcv$training2.ttests
#' @export
summary.bigKRLS_CV <- function (object, ...) 
{
  if (!inherits(object, "bigKRLS_CV")) {
    warning("Object not of class 'bigKRLS_CV'")
    UseMethod("summary")
    return(invisible(NULL))
  }
  
  arguments = list(...)
  digits <- if("digits" %in% names(arguments)) arguments[["digits"]] else 3
  
  if(object$type == "crossvalidated"){
    
    cat("\nOverview of Model Performance\n\n")
    
    cat("N:", length(unlist(object$indices)), "\n")
    cat("Seed:", object$seed, "\n\n")
    
    overview <- matrix(ncol = 2, nrow = 6)
    colnames(overview) <- c("In Sample", "Out of Sample")
    rownames(overview) <- c("Mean Squared Error (Full Model)", 
                            "Mean Squared Error (Average Marginal Effects Only)",
                            "Pseudo-R^2 (Full Model)",
                            "Pseudo-R^2 (Average Marginal Effects Only)",
                            "",
                            "N")
    
    overview[1, 1] <- object$MSE_is
    overview[1, 2] <- object$MSE_oos
    
    overview[2, 1] <- object$MSE_AME_is
    overview[2, 2] <- object$MSE_AME_oos
    
    overview[3, 1] <- object$pseudoR2_is
    overview[3, 2] <- object$pseudoR2_oos
    
    overview[4, 1] <- object$pseudoR2AME_is
    overview[4, 2] <- object$pseudoR2AME_oos
    
    overview[6, 1] <- length(object$indices$train.set)
    overview[6, 2] <- length(object$indices$test.set)
    
    print(round(overview, digits = digits), na.print = "")
    
    cat("\n\nSummary of Training Model:\n")
    z <- summary(object$trained, ...)
    
    ans <- list(overview = overview,
                training.ttests = z$ttests, 
                training.percentiles = z$percentiles)
    class(ans) <- "summary.bigKRLS_CV"
    
  }else{
    
    stopifnot(object$type == "KfoldsCV")
    # object[lapply(object, class) == "numeric"]
    
    cat("Overview of Model Performance\n\n")
    
    cat("N:", length(unlist(object$folds)), "\n")
    cat("Kfolds:", object$Kfolds, "\n")
    cat("Seed:", object$seed, "\n\n")
    
    overview <- matrix(unlist(object[lapply(object, class) == "numeric"][-c(1:2)]), 
                       ncol=object$Kfolds, byrow = TRUE)
    colnames(overview) <- paste("Fold", 1:object$Kfolds)
    
    # somewhat cumbersome but the test stats will differ depending on whether 
    # user computes with bigKRLS(... derivative = TRUE)
    labs <- unlist(strsplit(names(unlist(object[lapply(object, class) == "numeric"][-c(1:2)])), "\\."))
    labs <- unique(labs[-grep("fold", labs)])
    labs <- gsub("_is", " (In Sample)", labs)
    labs <- gsub("_oos", " (Out of Sample)", labs)
    labs <- gsub("_", " ", labs)
    labs <- gsub("R2AME", "R2 AME", labs)
    
    rownames(overview) <- labs
    overview <- overview[match(sort(labs), labs), ]
    
    print(overview, digits = digits)
    ans <- list(overview = overview)
    class(ans) <- "summary.bigKRLS_CV"
    
    cat("\nMSE denotes Mean Squared Error. AME implies calculations done with Average Marginal Effects only.")
    
    for(k in 1:object$Kfolds){
      cat("\n\nSummary of Training Model", k , ":\n", sep="")
      z <- summary(object[[paste0("fold_", k)]][["trained"]], ...)
      ans[[paste0("training", k, ".ttests")]] <- z$ttests
      ans[[paste0("training", k, ".percentiles")]] <- z$percentiles
    }
    
  }
  
  return(invisible(ans))
  
}

#' save.bigKRLS
#' 
#' save function, recommended when bigKRLS output contains big matrices (once N > 2,500 the kernel is stored this way).
#' Base R data will be stored in a list in an .RData file, big matrices will be stored in .txt files. 
#' Call load.bigKRLS() to retrieve. 
#' 
#' @param object bigKRLS output (regression, prediction, and crossvalidation). Use load.bigKRLS(model_subfolder_name), not load().
#' @param model_subfolder_name A name of a folder where the file(s) will be written. 
#' @param overwrite.existing Logical. If TRUE, write over folders with the same name. Default == FALSE.
#' @param noisy Logical. If TRUE display progress, additional instructions. Default == TRUE.
#' @examples
#'# y <- as.matrix(ChickWeight$weight)
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'
#'# out <- bigKRLS(y, X, Ncores=1)
#'# save.bigKRLS(out, "exciting_results") 
#'
#'# don't use save() unless out$has.big.matrices == FALSE
#'# load.bigKRLS("/path/to/exciting_results") 
#' @export
save.bigKRLS <- function (object, model_subfolder_name, 
                          overwrite.existing = FALSE, noisy = TRUE) 
{
  
  bClasses <- c("bigKRLS", "bigKRLS_predicted", "bigKRLS_CV")
  if (!inherits(object, bClasses)) {
    warning("Object not a bigKRLS class.")
    UseMethod("save")
    return(invisible(NULL))
  }
  stopifnot(is.character(model_subfolder_name))
  
  object <- make_path(object, model_subfolder_name, overwrite.existing)
  
  if(inherits(object, c("bigKRLS", "bigKRLS_predicted"))){
    bSave(object, noisy)
  }else{
    
    for(k in grep("fold_", names(object))){
      
      object[[k]][["trained"]] <- make_path(object[[k]][["trained"]], 
                                            file.path(object[["model_subfolder_name"]], names(object)[k], "trained"), 
                                            overwrite.existing = TRUE) # overwrite TRUE here in keeping with above choice
      bSave(object[[k]][["trained"]], noisy)
      
      object[[k]][["tested"]] <- make_path(object[[k]][["tested"]], 
                                           file.path(object[["model_subfolder_name"]], names(object)[k], "tested"), 
                                           overwrite.existing = TRUE)
      bSave(object[[k]][["tested"]], noisy)
      
    }
    object[["dir"]] <- dir(object[["model_subfolder_name"]], recursive = TRUE)
    tmp <- class(object)
    object <- object[-grep("fold_", names(object))]
    class(object) <- tmp
    save(object, file = file.path(object[["model_subfolder_name"]], "estimates.RData"))
    if(noisy) cat("\nBase R elements, summary stats, and metadata of the entire cross-validated outputted object saved in:", 
                  file.path(object[["model_subfolder_name"]], "estimates.RData\n"))
  }
  
  if(noisy) cat("\nTotal file size approximately", 
                round(sum(file.info(list.files(path = object[["model_subfolder_name"]], full.names = TRUE, recursive = TRUE))$size)/1024^2), 
                "megabytes.\n")
  
}

#' load.bigKRLS
#' 
#' Reconstructs bigKRLS output object as list.
#' 
#' @param path Path to folder where bigKRLS object was saved. 
#' @param newname If NULL (default), bigKRLS regression and prediction output will appear as "bigKRLS_out", while crossvalidation results will appear as "object".
#' @param pos position. Default == 1 (global environment). NULL means don't assign (return only).
#' @param return_object Logical. return library(bigKRLS) object? Default == FALSE. 
#' @param noisy Logical. display updates?
#' @examples
#'# save.bigKRLS(out, "exciting_results") # don't use save()
#'# load.bigKRLS("exciting_results") # don't use load()
#' @export
load.bigKRLS <- function(path, newname = NULL, pos = 1, noisy = TRUE, return_object = FALSE){
  
  stopifnot(is.null(newname) | is.character(newname))
  
  files <- dir(path = path)
  if(!(tolower("estimates.RData") %in% tolower(files))){
    stop("estimates.RData not found. Check the path to the output folder.\n\nNote: for any files saved manually, note that load.bigKRLS() anticipates the convention used by save.bigKRLS: estimates.RData stores the base R objects in a list called bigKRLS_out, big matrices stored as text files named like they are in bigKRLS objects (object$K becomes K.txt, etc.).\n\n")
  }
  
  # bigKRLS version > 1.5 uses more standard convention .RData but loads .rdata too...
  # name = load(file.path(path, "estimates.RData"))
  name <- load(file.path(path, files[match(tolower("estimates.RData"), tolower(files))]))
  # name will be 'bigKRLS_out' for predict and regression objects
  # and 'object' for CV
  
  # equivalent to class(get(name)) == "bigKRLS" | class(get(name)) == "bigKRLS_predicted"
  if(name == "bigKRLS_out"){
    
    bigKRLS_out <- bLoad(bigKRLS_out, path, noisy) # load big.matrix objects, if any
    
  }else{
    
    stopifnot(name == "object", inherits(get(name), "bigKRLS_CV"))
    
    est <- file.path(path, object$dir[grep("estimates.RData", object$dir)])
    
    for(i in 1:length(est)){
      
      load(est[i])  # loads next estimates.RData, necessarily named bigKRLS_out
      kind <- if(i %% 2 == 0) "trained" else "tested"
      fold <- paste0("fold_", ceiling(i/2))
      object[[fold]][[kind]] <- bLoad(bigKRLS_out, path, noisy)
      
    }
    
  } 
  
  
  if(!is.null(pos)) {
    
    if(is.null(newname)){
      newname <- name
    }
    
    if(inherits(get(name), c("bigKRLS", "bigKRLS_predicted"))){
      assign(newname, bigKRLS_out, envir = as.environment(pos))
    }else{
      assign(newname, object, envir = as.environment(pos))
    }
    
    if(noisy) cat("\nNew object created named", 
                  newname, ".\n\nRun vignette(\"bigKRLS_basics\") for examples for a", 
                  class(get(name)), "object.")
  }
  
  if(return_object) return(get(name))
  
}


#' shiny.bigKRLS
#' 
#' @param out bigKRLS output. Does not require any N * N matrices.
#' @param export Logical. Instead of running Shiny, prepare the key estimates as a separate file.
#' @param main.label Main label (upper left of app)
#' @param plot.label Optional character
#' @param xlabs Optional character vector for the x variables.
#' @param font_size Font size. Default == 20. calls "ggplot2::theme_minimal(base_size = font_size)"
#' @param shiny.palette color scheme for main app. 9 colors.
#' @param hline horizontal line. Default == 0 (x axis)
#' @examples
#'# N <- 500
#'# P <- 4
#'# X <- matrix(rnorm(N*P), ncol=P)
#'
#'# b <- 1:P 
#'# y <- sin(X[,1]) + X %*% b + rnorm(N)
#'
#'# out <- bigKRLS(y, X, Ncores=1)
#'# shiny.bigKRLS(out, main.label = "The Results", xlabs = c("Frequency", "xA", "xB", "xC"))
#' @export
shiny.bigKRLS <- function(out, export=FALSE, main.label = "bigKRLS estimates", 
                          plot.label = NULL, xlabs = NULL, font_size = 20, hline = 0,
                          shiny.palette = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
                                            "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999")){
  
  if(!export){cat("export set to FALSE; set export to TRUE to prepare files for another machine.")}
  
  if(is.null(xlabs)) xlabs = out$xlabs
  
  colnames(out$X) <- xlabs
  dydxlabs <- if(is.null(out$which.derivatives)) xlabs else xlabs[out$which.derivatives]
  colnames(out$derivatives) <- names(out$avgderivatives) <- names(out$var.avgderivatives) <- out$dydxlabs <- dydxlabs
  
  palette(shiny.palette)
  
  bigKRLS_server <- shinyServer(function(input, output, session) {
    
    selectedData <- reactive({
      
      return(list(x = as.numeric(out$X[, input$xp]),
                  derivatives = as.numeric(out$derivatives[, input$dydxp])))
    })
    
    output$graph <- renderPlot({
      
      P = ggplot(NULL) 
      P = P + geom_point(aes(x = selectedData()[["x"]], y = selectedData()[["derivatives"]]), 
                         alpha = 1, size=.1, color='grey') 
      P = P +  geom_smooth(aes(x = selectedData()[["x"]], y = selectedData()[["derivatives"]]),
                           method='loess') + xlab(input$xp) 
      P = P +  ylab(paste('Marginal Effects of ', input$dydxp)) 
      P = P +  geom_hline(aes(yintercept=hline))
      P = P +  theme_minimal(base_size = font_size)
      P = P +  theme(panel.background=element_blank(),
                     panel.border=element_blank(),
                     panel.grid.major=element_blank(),
                     panel.grid.minor=element_blank(),
                     plot.background=element_blank()) 
      P = P + labs(title = plot.label)
      P
      
    })
  })
  
  bigKRLS_ui <- shinyUI(fluidPage(
    
    titlePanel(main.label),
    
    sidebarPanel(
      selectInput('dydxp', 'Marginal Effects (dy/dx)', colnames(out$derivatives)),
      selectInput('xp', 'x', colnames(out$X))
    ),
    
    mainPanel(plotOutput('graph'))
    
  ))
  
  if(export){
    
    output_baseR <- out
    output_baseR$K <- output_baseR$vcov.est.c <- output_baseR$vcov.est.fitted <- NULL
    for(i in which(unlist(lapply(output_baseR, is.big.matrix)))){
      output_baseR[[i]] <- as.matrix(output_baseR[[i]])
    }
    
    save(output_baseR, file="shiny_out.RData")
    
    cat("A re-formatted version of your output has been saved as \"shiny_out.RData\" in your current working directory:\n", getwd(),
        "\nFor a few technical reasons, the big N * N matrices have been removed and the smaller ones converted back to base R;\nthis should make your output small enough for the free version of Shiny's server.\nTo access the Shiny app later or on a different machine, load the file and call shiny.bigKRLS() on it:\n",
        "\nload(\"shiny_out.RData\")\nshiny.bigKRLS(output_baseR)\n")
  }else{
    shinyApp(ui = bigKRLS_ui, server = bigKRLS_server)
  }
}


#' crossvalidate.bigKRLS
#' 
#' @param y A vector of numeric observations on the dependent variable. Missing values not allowed. 
#' @param X A matrix of numeric observations of the independent variables. Factors, missing values, and constant vectors not allowed.
#' @param seed Randomization seed to be used when partitioning data. 
#' @param Kfolds Number of folds for cross validation. Requires ptesting == NULL. Note KRLS assumes variation in each column; rare events or rarely observed factor levels may violate this assumption if Kfolds is too large given the data.
#' @param ptesting Percentage of data to be used for testing (e.g., ptesting = 20 means 80\% training, 20\% testing). Requires Kfolds == NULL. Note KRLS assumes variation in each column; rare events or rarely observed factor levels may violate this assumption if ptesting is too small given the data.
#' @param estimates_subfolder If non-null, saves all model estimates in current working directory.
#' @param ... Additional arguments to be passed to bigKRLS() or predict(). E.g., crossvalidate.bigKRLS(y, X, derivative = FALSE) will run faster but compute fewer test stats comparing in and out of sample performance (because the marginal effects will not be estimated).
#' @return bigKRLS_CV (list) Object of estimates and summary stats; summary() is defined. For train/test, contains a bigKRLS regression object and a predict object. For Kfolds, contains a nested series of training and testing models. 
#' @examples
#'# y <- as.matrix(ChickWeight$weight)
#'# X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)
#'
#'# cv.out <- crossvalidate.bigKRLS(y, X, seed = 123, ptesting = 20)
#'# cv.out$pseudoR2_oos
#'# cv <- summary(cv.out)
#'
#'# cv$training.ttests
#'
#'# kcv.out <- crossvalidate.bigKRLS(y, X, seed = 123, Kfolds = 3)
#'# kcv <- summary(kcv.out, digits = 3) 
#'
#'# kcv$overview
#'# kcv$training2.ttests
#'
#'# save.bigKRLS(kcv.out, "myKfolds")
#'# load.bigKRLS("/path/to/myKfolds")     
#' @export 
crossvalidate.bigKRLS <- function(y, X, seed, Kfolds = NULL, ptesting = NULL, estimates_subfolder = NULL, ...){
  
  if(is.null(Kfolds) + is.null(ptesting) != 1) stop("Specify either Kfolds or ptesting but not both.")
  
  # suppressing warnings from bigmatrix
  oldw <- getOption("warn")
  options(warn = -1)
  options(bigmemory.allow.dimnames=TRUE)
  
  stopifnot(is.big.matrix(X) | is.matrix(X))
  
  arguments <- list(...)
  marginals <- TRUE
  if("derivative" %in% names(arguments)){
    marginals <- arguments[["derivative"]]
  } # flag: compute test stats that require derivatives?
  
  Noisy <- nrow(X) > 2000 
  if("noisy" %in% names(arguments)){
    Noisy <- arguments[["noisy"]]
  } # flag: make CV output in line with user wishes, bigKRLS() defaults
  
  set.seed(seed)
  N <- nrow(X)
  
  big.meta <- create.metadata.dir()
  
  if(!is.null(ptesting)){
    
    if(ptesting < 0 | ptesting > 100) stop("ptesting, the percentage of data to be used for validation, must be between 0 and 100.")
    
    Ntesting <- round(N * ptesting/100, 0)
    Ntraining <- N - Ntesting
    train.set <- sample(N, Ntraining, replace = FALSE)
    test.set <- matrix(1:N)[which(!(1:N %in% train.set))]
    
    Xtrain <- submatrix(X, train.set)
    Xtest <- submatrix(X, test.set)
    ytrain <- submatrix(y, train.set)    
    ytest <- submatrix(y, test.set)
    
    trained <- bigKRLS(ytrain, Xtrain, instructions = FALSE, ...)
    tested <- predict.bigKRLS(trained, Xtest)
    tested[["ytest"]] <- ytest
    
    cv_out <- list(trained = trained, tested = tested, type = "crossvalidated")
    cv_out[["seed"]] <- seed
    cv_out[["indices"]] <- list(train.set = train.set, test.set = test.set)
    cv_out[["pseudoR2_is"]] <- trained$R2
    cv_out[["pseudoR2_oos"]] <- cor(tested$predicted, ytest[])^2
    cv_out[["MSE_oos"]] <- mean((tested$predicted - ytest[])^2)
    cv_out[["MSE_is"]] <- mean((trained$yfitted - trained$y[])^2)
    
    if(marginals){
      
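      # fit statistics computed from the Average Marginal Effects alone:
      # yhat_AME = X %*% AME, first in sample, then out of sample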
      cv_out[["pseudoR2AME_is"]] <- trained$R2AME
      
      delta <- if(is.big.matrix(trained$X)) 
        to.big.matrix(matrix(trained$avgderivatives), p = 1, path = big.meta) else
          t(trained$avgderivatives)
      cv_out[["MSE_AME_is"]] <- mean((trained[["y"]] - (trained[["X"]] %*% delta)[])^2)
      
      delta <- if(is.big.matrix(Xtest)) 
        to.big.matrix(matrix(trained$avgderivatives), p = 1, path = big.meta) else
          t(trained$avgderivatives)
      yhat_ame <- (Xtest %*% delta)[]
      cv_out[["pseudoR2AME_oos"]] <- cor(tested[["ytest"]][], yhat_ame)^2
      cv_out[["MSE_AME_oos"]] <- mean((ytest - (Xtest %*% delta)[])^2)
      
    }
    
    cv_out[["ptesting"]] <- ptesting
    class(cv_out) <- "bigKRLS_CV" 
    # one bigKRLS object, one bigKRLS.predict object, type either "crossvalidated" or "kfolds"
    if("big.matrix" %in% lapply(trained, class) & is.null(estimates_subfolder)) 
      cat("NOTE: Outputted object contains big.matrix objects. To avoid crashing R, use save.bigKRLS(), not base R save(), to store results.")
    if(Noisy) cat("You may wish to use summary() on the outputted object.")
    unlink(big.meta, recursive = TRUE)
    options(warn = oldw)
    return(cv_out)
    
  }
  
  if(!is.null(Kfolds)){
    
    stopifnot(is.numeric(Kfolds) & Kfolds > 0 & Kfolds %% 1 == 0)
    
    # randomly places observations into (approximately) equal folds
    folds <- as.integer(cut(sample(N), breaks = Kfolds))
    
    for(k in 1:Kfolds){
      
      if(Noisy) cat("\n\n Performing pre-check of data for fold ", k, ".\n\n", sep="")
      
      Xtrain <- submatrix(X, folds != k)
      ytrain <- submatrix(y, folds != k)
      
      check_data(ytrain, Xtrain, instructions = FALSE)
      
    }
    
    out <- list(type = "KfoldsCV") # object to be returned
    class(out) <- "bigKRLS_CV"
    # out contains measures of fit, meta data, and (nested within each nested fold) bigKRLS, predict objects
    out[["Kfolds"]] <- Kfolds
    out[["seed"]] <- seed
    out[["folds"]] <- folds
    warn.big <- FALSE # dummy flag variable: warn re: big.matrix objects?
    
    # K measures of fit for each fold...
    out[["R2_is"]] <- c() # in sample R2, based on yhat = kernel[train, ] %*% coefs.hat
    out[["R2_oos"]] <- c() # out of sample R2, based on kernel(X[cbind(test, train])
    out[["MSE_is"]] <- c() # in sample mean squared error
    out[["MSE_oos"]] <- c() # out of sample mean squared error
    
    if(marginals){
      
      out[["R2AME_is"]] <- c() # in sample R2 average marginal effects, yhat = X[train, ] %*% colMeans(delta)
      out[["R2AME_oos"]] <- c() # oos R2, yhat = X[test, ] %*% colMeans(delta)
      out[["MSE_AME_is"]] <- c() # is for MSE for AMEs
      out[["MSE_AME_oos"]] <- c() # oos for MSE for AMEs
      
    }
    
    
    for(k in 1:Kfolds){
      
      if(Noisy) cat("\n\n Starting fold ", k, ".\n\n", sep="")
      
      Xtrain <- submatrix(X, folds != k)
      Xtest <- submatrix(X, folds == k)
      ytrain <- submatrix(y, folds != k)    
      ytest <- submatrix(y, folds == k)
      
      trained <- bigKRLS(ytrain, Xtrain, instructions = FALSE, ...)
      if(Noisy) summary(trained)
      tested <- predict.bigKRLS(trained, Xtest)
      tested[["ytest"]] <- ytest
      
      cv_out <- list(trained = trained, tested = tested)
      class(cv_out) <- "bigKRLS_CV"
      # cv_out contains one bigKRLS object (trained), one bigKRLS.predict object (tested)
      
      out[[paste0("fold_", k)]] <- cv_out
      
      ytest <- as.matrix(ytest) # in case of big.matrix objects...
      ytrain <- as.matrix(ytrain) # (loading vectors into R generally harmless)
      
      # measures of fit...
      out[["R2_is"]][k] <- trained$R2
      out[["R2_oos"]][k] <- cv_out[["tested"]][["pseudoR2"]] <- cor(ytest, as.matrix(tested$predicted))^2
      out[["MSE_is"]][k] <- cv_out[["trained"]][["MSE"]] <- mean((ytrain - as.matrix(trained$yfitted))^2)
      out[["MSE_oos"]][k] <- cv_out[["tested"]][["MSE"]] <- mean((ytest - tested$predicted)^2)
      
      if(marginals){
        
        out[["R2AME_is"]][k] <- trained$R2AME
        
        delta <- if(is.big.matrix(trained$X)) 
          to.big.matrix(matrix(trained$avgderivatives), p = 1, path = big.meta) else
            t(trained$avgderivatives)
        out[["MSE_AME_is"]][k] <- cv_out[["trained"]][["MSE_AME"]] <- mean((trained[["y"]] - (trained[["X"]] %*% delta)[])^2)
        
        delta <- if(is.big.matrix(Xtest)) 
          to.big.matrix(matrix(trained$avgderivatives), p = 1, path = big.meta) else
            t(trained$avgderivatives)
        yhat_ame <- (Xtest %*% delta)[]
        out[["R2AME_oos"]][k] <- cor(ytest, yhat_ame)^2
        out[["MSE_AME_oos"]][k] <- cv_out[["tested"]][["MSE_AME"]] <- mean((ytest - yhat_ame)^2)
      }
      
      warn.big <- warn.big | ("big.matrix" %in% lapply(trained, class) & is.null(estimates_subfolder)) 
      cat("\n")  
    }
    
    names(out[["R2_is"]]) <- names(out[["R2_oos"]]) <- names(out[["MSE_is"]]) <- 
      names(out[["MSE_oos"]]) <- paste0("fold", 1:Kfolds)
    
    if(marginals){
      names(out[["R2AME_is"]]) <- names(out[["R2AME_oos"]]) <- 
        names(out[["MSE_AME_is"]]) <- names(out[["MSE_AME_oos"]]) <- paste0("fold", 1:Kfolds)
    }
    
    if(warn.big) cat("NOTE: Outputted object contains big.matrix objects. To avoid crashing R, use save.bigKRLS(), not base R save() to store results.")
    
    if(!is.null(estimates_subfolder)) save.bigKRLS(out, estimates_subfolder)
    unlink(big.meta, recursive = TRUE)
    options(warn = oldw)
    
    return(out)
    
  }
  
}