ligera: LIght GEnetic Robust Association

Documented in ligera2_bed_f_multi

#' LIGERA2 BED multi: LIght GEnetic Robust Association multiscan function
#'
#' This function performs multiple genetic association scans, adding one significant locus per iteration to the model (modeled as a covariate) to increase power in the final model.
#' The function returns a tibble containing association statistics and several intermediates.
#' This version calculates p-values using an F-test, which gives calibrated statistics under both quantitative and binary traits.
#' Compared to [ligera()], which uses the faster Wald test (calibrated for quantitative but not binary traits), this F-test version is quite a bit slower, and is optimized for `m >> n`, so it is a work in progress.
#' This optimized version requires the genotypes to be in a file in BED format.
#' 
#' Suppose there are `n` individuals and `m` loci.
#'
#' @param file The path to the BED file containing the genotypes, potentially excluding the BED extension.
#' @param trait The length-`n` trait vector, which may be real valued and contain missing values.
#' @param mean_kinship An estimate of the mean kinship produced externally, to ensure internal estimates of kinship are unbiased.
#' @param q_cut The q-value threshold to admit new loci into the polygenic model.
#' @param one_per_iter If true, only the most significant locus per iteration is added to model of next iteration.  Otherwise all significant loci per iteration are added to the model of next iteration.
#' @param covar An optional `n`-by-`K` matrix of `K` covariates, aligned with the individuals.
#' @param mem_factor Proportion of available memory to use loading and processing genotypes.
#' Ignored if `mem_lim` is not `NA`.
#' @param mem_lim Memory limit in GB, used to break up genotype data into chunks for very large datasets.
#' Note memory usage is somewhat underestimated and is not controlled strictly.
#' Default in Linux and Windows is `mem_factor` times the free system memory, otherwise it is 1GB (OSX and other systems).
#' @param m_chunk_max Sets the maximum number of loci to process at the time.
#' Actual number of loci loaded may be lower if memory is limiting.
#' @param V Algorithm version (0, 1, 2).
#' Experimental features, not worth explaining.
#' @param tol Tolerance value passed to conjugate gradient method solver.
#' @param prune_ld If `TRUE` (default `FALSE`), at every iteration (including the first one) if there is more than one new locus discovered, the set of loci is pruned by removing correlated loci with the following parameters.
#' @param r2_max Maximum squared correlation coefficient between loci (ignored if `prune_ld = FALSE`).
#' @param pos_window Window size, in basepairs, for on which to apply pruning (ignored if `prune_ld = FALSE`).  If `pos_window > 0`, only pairs of loci in the same chromosome and with positions less than `pos_window` away will be pruned.  However, if `pos_window == 0L` then all variants are pruned regardless of chr/pos values.
#' 
#' @return A tibble containing the following association statistics from the last scan for non-selected loci.
#' For selected loci, these are the values from the scan before each was added to the model (as after addition they get `beta ~= 0` and `pval ~= 1`).
#'
#' - `chr`: Chromosome of locus.
#' - `id`: Locus ID.
#' - `posg`: Position in genetic distance.
#' - `pos`: Position in basepairs.
#' - `alt`: Alternative allele.
#' - `ref`: Reference allele (counted).
#' - `pval`: The p-value of the last association scan.
#' - `beta`: The estimated effect size coefficient for the trait vector at this locus.
#' - `f_stat`: The F statistic.
#' - `df`: degrees of freedom: number of non-missing individuals minus number of parameters of full model
#' - `qval`: The q-value of the last association scan.
#' - `sel`: the order in which loci were selected, or zero if they were not selected.
#'
#' @examples
#' # MISSING SAMPLE BED FILE
#'
#' @seealso
#' The `popkin` package.
#' 
#' @export
ligera2_bed_f_multi <- function(
                                file,
                                trait,
                                mean_kinship,
                                q_cut = 0.05,
                                one_per_iter = FALSE,
                                covar = NULL,
                                mem_factor = 0.7,
                                mem_lim = NA,
                                m_chunk_max = 1000,
                                V = 0,
                                tol = 1e-15,
                                prune_ld = FALSE,
                                r2_max = 0.3,
                                pos_window = 10000000L
                                ) {
    # things to initialize for loop
    # indexes of selected loci, to remember
    loci_selected <- c()
    tib_selected <- tibble::tibble() # ok to be blank
    # were there new selections?  To know if we've converged
    # initialize to TRUE so first iteration occurs
    new_selec <- TRUE

    # unlike previous versions, here we want BIM for proper filtering of
    bim <- genio::read_bim( file, verbose = FALSE )
    # infer number of loci
    m_loci <- nrow( bim )
    # we don't need fam details but we do want number of individuals
    # (could get from trait vector, but this is better validation)
    n_ind <- genio::count_lines( file, ext = 'fam', verbose = FALSE )
    
    # load genotypes using BEDMatrix, for loading selected loci as covariates
    X <- BEDMatrix::BEDMatrix( file, n = n_ind, p = m_loci )
    
    # loop while there are still significant loci
    while( new_selec ) {
        # assume that covar has been grown to include selected loci already

        # run ligera
        # NOTE: selected loci become insignificant when retested
        tib <- ligera2_bed_f(
            file = file,
            m_loci = m_loci,
            n_ind = n_ind,
            trait = trait,
            mean_kinship = mean_kinship,
            covar = covar,
            mem_factor = mem_factor,
            mem_lim = mem_lim,
            m_chunk_max = m_chunk_max,
            V = V,
            tol = tol
        )

        # add q-values
        # only to non-selected loci (selected loci all have p=1, this is terrible for q-value estimation!)
        # unfortunately this doesn't work if loci_selected has length zero
        if ( length( loci_selected ) == 0 ) {
            tib$qval <- qvalue::qvalue( tib$pval )$qvalues
        } else {
            # have to initialize qval here before doing this
            tib$qval <- 1 # all selected loci get 1
            tib$qval[ -loci_selected ] <- qvalue::qvalue( tib$pval[ -loci_selected ] )$qvalues
        }

        # decide what to add, if anything
        if ( one_per_iter ) {
            # find the most significant locus
            # which.min always returns a scalar (*first* minimum)
            indexes <- which.min( tib$pval ) # p-values may have more resolution, q-values sometimes tie, otherwise the same as q-values
            # tell the next part of the loop if this was significant (overwrite new_select from before, really updates it)
            new_selec <- tib$qval[ indexes ] < q_cut
        } else {
            # directly select subset that was newly significant on this round
            indexes <- which( tib$qval < q_cut )
            # here we move to the next stage as long as indexes was not empty
            new_selec <- length( indexes ) > 0
        }

        # if there were any newly significant loci, add to covariates and reiterate
        if ( new_selec ) {

            # filter more to avoid LD redundancy, if desired
            if ( prune_ld ) 
                indexes <- ld_prune(
                    X,
                    tib$pval,
                    indexes,
                    r2_max = r2_max,
                    bim = bim,
                    pos_window = pos_window
                )

            # add to list of selected loci
            loci_selected <- c( loci_selected, indexes )

            # remember their stats from their last round before getting added as covariates
            tib_selected <- dplyr::bind_rows(
                                       tib_selected,
                                       tib[ indexes, ]
                                   )

            # add to covariates
            # first extract genotype submatrix
            # NOTE this is matrix even if there was a single index there
            # here transposition is backwards than usual (individuals in X are usually on the column, but as covariates we need then on the rows!)
            X_i <- X[ , indexes, drop = FALSE ] # BEDMatrix orientation
            
            # add to covariates matrix now
            covar <- cbind( covar, X_i )
            
            # it seems in odd cases (like my tiny test simulations) the number of covariates grows to be more than the number of parameters!
            # check here, force stop if this is exceeded
            # (otherwise matrices inside become singular, causes confusing problems!)
            if ( nrow( covar ) < ncol( covar ) )
                new_selec <- FALSE # force stop!
        }
        # otherwise `new_selec == FALSE`, so we're done
    }

    # it'd be nice to return the last tibble, but again the selected loci will all have p=1
    # instead we should record the last p-value and other stats they had prior to being added to the model
    # first we add a new column that marks selected loci, in order
    # all non-selected loci get a zero
    tib$sel <- 0
    # when there were no selections at all, this fails, so have to test for that
    if ( length( loci_selected ) > 0 ) {
        # this sets the right values for the selected loci (they are in the order in which they were selected)
        tib_selected$sel <- 1 : nrow( tib_selected )
        # this overwrites as desired
        tib[ loci_selected, ] <- tib_selected
    }
    
    # merge BIM into table to have a more complete report
    tib <- dplyr::bind_cols( bim, tib )

    # done, return tibble!
    return( tib )
}