R/significance_cpct.R
In expss: Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

Documented in add_sig_labels significance_cpct

COMPARE_TYPE = c("subtable",
                      "first_column", "adjusted_first_column", 
                      "previous_column")

KEEP_STAT = c("percent", "cases", "means", "bases", "sd", "none")

#' Mark significant differences between columns in the table
#' 
#' \itemize{
#' \item{\code{significance_cpct}}{ conducts z-tests between column percent in
#' the result of \link{cross_cpct}. Results are calculated with the same formula 
#' as in \link[stats]{prop.test} without continuity correction.}
#' \item{\code{significance_means}}{ conducts t-tests between column means in
#' the result of \link{cross_mean_sd_n}. Results are calculated with the same formula 
#' as in \link[stats]{t.test}.}
#' \item{\code{significance_cases}}{ conducts chi-squared tests on the subtable of
#' table with counts in the result of \link{cross_cases}. Results are calculated
#' with the same formula as in \link[stats]{chisq.test}.}
#' \item{\code{significance_cell_chisq}}{ compute cell chi-square test on table
#' with column percent. The cell chi-square test looks at each table cell and
#' tests whether it is significantly different from its expected value in the
#' overall table. For example, if it is thought that variations in political
#' opinions might depend on the respondent's age, this test can be used to
#' detect which cells contribute significantly to that dependence. Unlike the
#' chi-square test (\code{significance_cases}), which is carried out on a whole
#' set of rows and columns, the cell chi-square test is carried out
#' independently on each table cell. Although the significance level of the cell
#' chi-square test is accurate for any given cell, the cell tests cannot be used
#' instead of the chi-square test carried out on the overall table. Their
#' purpose is simply to point to the parts of the table where dependencies
#' between row and column categories may exist.}}
#' For \code{significance_cpct} and \code{significance_means} there are three
#' type of comparisons which can be conducted simultaneously (argument
#' \code{compare_type}):
#' \itemize{
#' \item{\code{subtable}}{ provide comparisons between all columns inside each
#' subtable.}
#' \item{\code{previous_column}}{ is a comparison of each column of the subtable
#' with the previous column. It is useful if columns are periods or survey
#' waves.}
#' \item{\code{first_column}}{ provides comparison the table first column with
#' all other columns in the table. \code{adjusted_first_column} is also
#' comparison with the first column but with adjustment for common base. It is
#' useful if the first column is total column and other columns are subgroups of
#' this total. Adjustments are made according to algorithm in IBM SPSS
#' Statistics Algorithms v20, p. 263. Note that with these adjustments t-tests
#' between means are made with equal variance assumed (as with \code{var_equal =
#' TRUE}).}}
#' By now there are no adjustments for multiple-response variables (results of 
#' \link{mrset}) in the table columns so significance tests are rather 
#' approximate for such cases.
#' Also, there are functions for the significance testing in the sequence of
#' custom tables calculations (see \link{tables}):
#' \itemize{
#' \item{\code{tab_last_sig_cpct}, \code{tab_last_sig_means} and 
#' \code{tab_last_sig_cpct}}{ make the same tests as their analogs mentioned 
#' above. It is recommended to use them after appropriate statistic function: 
#' \link{tab_stat_cpct}, \link{tab_stat_mean_sd_n} and \link{tab_stat_cases}.
#' }
#' \item{\code{tab_significance_options}}{ With this function we can set
#' significance options for the entire custom table creation sequence.}
#' \item{\code{tab_last_add_sig_labels}}{ This function applies 
#' \code{add_sig_labels}  to the last calculated table - it adds labels (letters
#' by default) for significance to columns header. It may be useful if you want
#' to combine a table with significance with table without it.}
#' \item{\code{tab_last_round}}{ This function rounds numeric columns in the
#' last calculated table to specified number of digits. It is sometimes
#' needed if you want to combine table with significance with table without it.}
#' }
#' @param x table (class \code{etable}): result of \link{cross_cpct} with 
#'   proportions and bases for \code{significance_cpct}, result of 
#'   \link{cross_mean_sd_n} with means, standard deviations and valid N for 
#'   \code{significance_means}, and result of \link{cross_cases} with counts and
#'   bases for \code{significance_cases}.
#' @param sig_level numeric. Significance level - by default it equals to \code{0.05}.
#' @param delta_cpct numeric. Minimal delta between percent for which we mark 
#'   significant differences (in percent points) - by default it equals to zero.
#'   Note that, for example, for minimal 5 percent point difference
#'   \code{delta_cpct} should be equals 5, not 0.05.
#' @param delta_means numeric. Minimal delta between means for which we mark 
#'   significant differences  - by default it equals to zero.
#' @param min_base numeric. Significance test will be conducted if both
#'   columns have bases greater or equal to \code{min_base}. By default, it equals to \code{2}.
#' @param compare_type Type of compare between columns. By default, it is 
#'   \code{subtable} - comparisons will be conducted between columns of each 
#'   subtable. Other possible values are: \code{first_column}, 
#'   \code{adjusted_first_column} and \code{previous_column}. We can conduct
#'   several tests simultaneously.
#' @param bonferroni logical. \code{FALSE} by default. Should we use Bonferroni
#'   adjustment by the number of comparisons in each row? 
#' @param subtable_marks character. One of "greater", "both" or "less". By
#'   deafult we mark only values which are significantly greater than some other
#'   columns. For \code{significance_cell_chisq} default is "both".We can change
#'   this behavior by setting an argument to \code{less} or \code{both}.
#' @param inequality_sign logical. FALSE if \code{subtable_marks} is "less" or 
#'   "greater". Should we show \code{>} or \code{<} before significance marks of
#'   subtable comparisons.
#' @param sig_labels character vector. Labels for marking differences between
#'   columns of subtable.
#' @param sig_labels_previous_column a character vector with two elements. Labels
#'   for marking a difference with the previous column. First mark means 'lower' (by
#'   default it is \code{v}) and the second means greater (\code{^}).
#' @param sig_labels_first_column a character vector with two elements. Labels 
#'   for marking a difference with the first column of the table. First mark means
#'   'lower' (by default it is \code{-}) and the second means 'greater'
#'   (\code{+}).
#' @param sig_labels_chisq a character vector with two labels 
#'   for marking a difference with row margin of the table. First mark means
#'   'lower' (by default it is \code{<}) and the second means 'greater'
#'   (\code{>}). Only for \code{significance_cell_chisq}.
#' @param correct logical indicating whether to apply continuity correction when
#'   computing the test statistic for 2 by 2 tables. Only for
#'   \code{significance_cases} and \code{significance_cell_chisq}. For details
#'   see \link[stats]{chisq.test}. \code{TRUE} by default.
#' @param row_margin character. One of values "auto" (default), "sum_row", or
#'   "first_column". If it is "auto" we try to find total column in the subtable
#'   by \code{total_column_marker}. If the search is failed, we use the sum of
#'   each rows as row total. With "sum_row" option we always sum each row to get
#'   margin. Note that in this case result for multiple response variables in
#'   banners may be incorrect. With "first_column" option we use table first
#'   column as row margin for all subtables. In this case result for the
#'   subtables with incomplete bases may be incorrect. Only for
#'   \code{significance_cell_chisq}.
#' @param total_column_marker character. Mark for total columns in the
#'   subtables. "#" by default.
#' @param keep character. One or more from "percent", "cases", "means", "bases", 
#'   "sd" or "none". This argument determines which statistics will remain in
#'   the table after significance marking.
#' @param na_as_zero logical. \code{FALSE} by default. Should we treat
#'   \code{NA}'s as zero cases?
#' @param total_marker character. Total rows mark in the table. "#" by default.
#' @param total_row integer/character. In the case of the several totals per
#'   subtable it is a number or name of total row for the significance
#'   calculation.
#' @param var_equal a logical variable indicating whether to treat the two
#'   variances as being equal. For details see \link[stats]{t.test}.
#' @param digits an integer indicating how much digits after decimal separator 
#' @param as_spss a logical. FALSE by default. If TRUE, proportions which
#'   are equal to zero or one will be ignored. Also will be ignored categories
#'   with bases less than 2.
#' @param data data.frame/intermediate_table for \code{tab_*} functions.
#' @param mode character. One of \code{replace}(default) or \code{append}. In
#'   the first case the previous result in the sequence of table calculation
#'   will be replaced with result of significance testing. In the second case
#'   result of the significance testing will be appended to sequence of table
#'   calculation.
#' @param label character. Label for the statistic in the \code{tab_*}. Ignored
#'   if the \code{mode} is equals to \code{replace}.
#' @return \code{tab_last_*} functions return objects of class
#'   \code{intermediate_table}. Use \link{tab_pivot} to get the final result -
#'   \code{etable} object. Other functions return \code{etable} object with
#'   significant differences.
#'   
#' @seealso \link{cross_cpct}, \link{cross_cases}, \link{cross_mean_sd_n}, \link{tables},
#'   \link{compare_proportions}, \link{compare_means}, \link[stats]{prop.test},
#'   \link[stats]{t.test}, \link[stats]{chisq.test}
#' @name significance
#' @export
#'
#' @examples
#' data(mtcars)
#' mtcars = apply_labels(mtcars,
#'                       mpg = "Miles/(US) gallon",
#'                       cyl = "Number of cylinders",
#'                       disp = "Displacement (cu.in.)",
#'                       hp = "Gross horsepower",
#'                       drat = "Rear axle ratio",
#'                       wt = "Weight (lb/1000)",
#'                       qsec = "1/4 mile time",
#'                       vs = "Engine",
#'                       vs = c("V-engine" = 0,
#'                              "Straight engine" = 1),
#'                       am = "Transmission",
#'                       am = c("Automatic" = 0,
#'                              "Manual"=1),
#'                       gear = "Number of forward gears",
#'                       carb = "Number of carburetors"
#' )
#' 
#' \dontrun{
#' mtcars_table = cross_cpct(mtcars, 
#'                           list(cyl, gear),
#'                           list(total(), vs, am)
#'                           )
#' 
#' significance_cpct(mtcars_table)
#' # comparison with first column
#' significance_cpct(mtcars_table, compare_type = "first_column")
#' 
#' # comparison with first column and inside subtable
#' significance_cpct(mtcars_table, 
#'             compare_type = c("first_column", "subtable"))
#' 
#' # only significance marks
#' significance_cpct(mtcars_table, keep = "none")
#' 
#' # means
#' mtcars_means = cross_mean_sd_n(mtcars, 
#'                                list(mpg, wt, hp),
#'                                list(total(), vs, cyl))
#'                                )
#'                         
#' significance_means(mtcars_means) 
#' 
#' # mark values which are less and greater
#' significance_means(mtcars_means, subtable_marks = "both")
#' 
#' # chi-squared test
#' mtcars_cases = cross_cases(mtcars, 
#'                            list(cyl, gear),
#'                            list(total(), vs, am)
#'                            )
#'                          
#' significance_cases(mtcars_cases)  
#' 
#' # cell chi-squared test
#' # increase number of cases to avoid warning about chi-square approximation
#' mtcars2 = add_rows(mtcars, mtcars, mtcars)
#'
#' tbl = cross_cpct(mtcars2, gear, am)
#' significance_cell_chisq(tbl)
#' 
#' # table with multiple variables
#' tbl = cross_cpct(mtcars2, list(gear, cyl), list(total(), am, vs))
#' significance_cell_chisq(tbl, sig_level = .0001)  
#' 
#' # custom tables with significance
#' mtcars %>% 
#'     tab_significance_options(subtable_marks = "both") %>% 
#'     tab_cells(mpg, hp) %>% 
#'     tab_cols(total(), vs, am) %>% 
#'     tab_stat_mean_sd_n() %>% 
#'     tab_last_sig_means(keep = "means") %>% 
#'     tab_cells(cyl, gear) %>% 
#'     tab_stat_cpct() %>% 
#'     tab_last_sig_cpct() %>% 
#'     tab_pivot()   
#'     
#' # Overcomplicated examples - we move significance marks to
#' # separate columns. Columns with statistics remain numeric    
#' mtcars %>% 
#'     tab_significance_options(keep = "none", 
#'                          sig_labels = NULL, 
#'                          subtable_marks = "both",  
#'                          mode = "append") %>% 
#'     tab_cols(total(), vs, am) %>% 
#'     tab_cells(mpg, hp) %>% 
#'     tab_stat_mean_sd_n() %>% 
#'     tab_last_sig_means() %>% 
#'     tab_last_hstack("inside_columns") %>% 
#'     tab_cells(cyl, gear) %>% 
#'     tab_stat_cpct() %>% 
#'     tab_last_sig_cpct() %>% 
#'     tab_last_hstack("inside_columns") %>% 
#'     tab_pivot(stat_position = "inside_rows") %>% 
#'     drop_empty_columns()    
#' }   
significance_cpct = function(x, 
                             sig_level = 0.05, 
                             delta_cpct = 0,
                             min_base = 2,
                             compare_type ="subtable",
                             bonferroni = FALSE,
                             subtable_marks = c("greater", "both", "less"),
                             inequality_sign = "both" %in% subtable_marks,
                             sig_labels = LETTERS,
                             sig_labels_previous_column = c("v", "^"),
                             sig_labels_first_column = c("-", "+"),
                             keep = c("percent", "bases"), 
                             na_as_zero = FALSE,
                             total_marker = "#",
                             total_row = 1,
                             digits = get_expss_digits(),
                             as_spss = FALSE
                             ){
    UseMethod("significance_cpct")
}


#' @export
significance_cpct.etable = function(x, 
                                    sig_level = 0.05, 
                                    delta_cpct = 0,
                                    min_base = 2,
                                    compare_type = "subtable",
                                    bonferroni = FALSE,
                                    subtable_marks = c("greater", "both", "less"),
                                    inequality_sign = "both" %in% subtable_marks,
                                    sig_labels = LETTERS,
                                    sig_labels_previous_column = c("v", "^"),
                                    sig_labels_first_column = c("-", "+"),
                                    keep = c("percent", "bases"), 
                                    na_as_zero = FALSE,
                                    total_marker = "#",
                                    total_row = 1,
                                    digits = get_expss_digits(),
                                    as_spss = FALSE
){
    
    
    compare_type = match.arg(compare_type, choices = COMPARE_TYPE, several.ok = TRUE)
    stopif(sum(compare_type %in% c("first_column", "adjusted_first_column"))>1, 
           "mutually exclusive compare types in significance testing:  'first_column' and 'adjusted_first_column'.")
    subtable_marks = match.arg(subtable_marks)
    mark_greater = subtable_marks %in% c("greater", "both")
    mark_less = subtable_marks %in% c("both", "less")
    delta_cpct = delta_cpct/100
    keep = match.arg(keep, KEEP_STAT, several.ok = TRUE)
    keep_percent = "percent" %in% keep
    keep_bases = "bases" %in% keep
    if(as_spss) {
        min_base = max(min_base, 2)
        na_as_zero = FALSE
    }
    if(NCOL(x)>1){
        groups = header_groups(colnames(x))
        if("subtable" %in% compare_type){
            if(!is.null(sig_labels)){
                x = add_sig_labels(x, sig_labels = sig_labels)
            } 
            all_column_labels = get_category_labels(colnames(x))
        }
        sections = split_table_by_row_sections(x, total_marker = total_marker, total_row = total_row)
        res = lapply(sections, function(each_section){
            curr_base = extract_total_from_section(each_section, total_marker = total_marker, total_row = total_row)
            recode(curr_base) = lt(min_base) ~ NA
            
            total_rows_indicator = get_total_rows_indicator(each_section, total_marker = total_marker)
            empty_sig_section = each_section[!total_rows_indicator, ]
            empty_sig_section[, -1] = ""
            sig_section = empty_sig_section
            curr_props = each_section[!total_rows_indicator, ]
            curr_props[,-1] = curr_props[,-1]/100
            if(na_as_zero){
                if_na(curr_props[,-1]) = 0
            }
            if(as_spss){
                recode(curr_props[,-1]) = list(c(0, 1) ~ NA)
            }
            if(any(c("first_column", "adjusted_first_column") %in% compare_type)){
                sig_section = section_sig_first_column(sig_section = sig_section, 
                                                       curr_props = curr_props, 
                                                       curr_base = curr_base,
                                                       groups = groups,
                                                       sig_labels_first_column = sig_labels_first_column,
                                                       sig_level = sig_level,
                                                       delta_cpct = delta_cpct,
                                                       bonferroni = bonferroni,
                                                       adjust_common_base = "adjusted_first_column" %in% compare_type)
            }
            if(any(c("previous_column") %in% compare_type)){
                sig_section = section_sig_previous_column(sig_section = sig_section, 
                                                          curr_props = curr_props, 
                                                          curr_base = curr_base,
                                                          groups = groups,
                                                          sig_labels_previous_column = sig_labels_previous_column,
                                                          sig_level = sig_level,
                                                          delta_cpct = delta_cpct,
                                                          bonferroni = bonferroni)
            }
            if("subtable" %in% compare_type){
                prepend = ""
                if(mark_greater){
                    if(inequality_sign) {
                        prepend = ">"    
                    }
                    subtable_sig_section = section_sig_prop(sig_section = empty_sig_section, 
                                                            curr_props = curr_props, 
                                                            curr_base = curr_base,
                                                            groups = groups,
                                                            all_column_labels = all_column_labels,
                                                            sig_level = sig_level,
                                                            delta_cpct = delta_cpct,
                                                            bonferroni = bonferroni,
                                                            mark_greater = TRUE,
                                                            prepend = prepend
                    )
                    for(i in seq_along(sig_section)[-1]){
                        sig_section[[i]] = paste_non_empty(sig_section[[i]], 
                                                           subtable_sig_section[[i]],
                                                           sep = " "
                        )
                    }
                }
                if(mark_less){
                    if(inequality_sign) {
                        prepend = "<"    
                    }
                    subtable_sig_section = section_sig_prop(sig_section = empty_sig_section, 
                                                            curr_props = curr_props, 
                                                            curr_base = curr_base,
                                                            groups = groups,
                                                            all_column_labels = all_column_labels,
                                                            sig_level = sig_level,
                                                            delta_cpct = delta_cpct,
                                                            bonferroni = bonferroni,
                                                            mark_greater = FALSE,
                                                            prepend = prepend
                    )
                    for(i in seq_along(sig_section)[-1]){
                        sig_section[[i]] = paste_non_empty(sig_section[[i]], 
                                                           subtable_sig_section[[i]],
                                                           sep = " "
                        )
                    }
                }
            }
            each_section[,-1] = ""
            each_section[!total_rows_indicator,-1] = sig_section[,-1]
            each_section
        })
        
        res = do.call(add_rows, res)
    } else {
        res = x
    }
    total_rows_indicator = get_total_rows_indicator(x, total_marker = total_marker)
    x = round_dataframe(x, digits = digits)
    if(keep_percent){
        x[!total_rows_indicator, ] = format_to_character(x[!total_rows_indicator, ], 
                                                         digits = digits)    
        x[, -1] = paste_df_non_empty(
            x[, -1, drop = FALSE], 
            res[, -1, drop = FALSE],
            sep = " "
        )
    } else {
        x[!total_rows_indicator, -1] = res[!total_rows_indicator, -1, drop = FALSE]
    }
    if(keep_bases) {
        x
    } else {
        x[!total_rows_indicator, ]
    }
}

########################

#' @rdname significance
#' @export
add_sig_labels = function(x, sig_labels = LETTERS){
    header = colnames(x)
    groups = header_groups(header)   
    for(each_group in groups){
        if(length(each_group)>1){
            if(length(each_group)<=length(sig_labels)){
                header[each_group] = paste0(header[each_group], "|", 
                                            sig_labels[each_group - min(each_group)+1])
            } else {
                numbers = seq_len(length(each_group)/length(sig_labels) + 1)
                long_labels = rep(sig_labels, length(numbers))
                numbers = rep(numbers, each = length(sig_labels))
                long_labels = paste0(long_labels, numbers)
                header[each_group] = paste0(header[each_group], "|", 
                                            long_labels[each_group - min(each_group)+1])
                
            }
        }
    }
    remove_unnecessary_splitters(header)
    colnames(x) = header
    x
}


########################

section_sig_prop = function(sig_section, curr_props,  curr_base, groups,
                            all_column_labels, sig_level, delta_cpct, bonferroni, 
                            mark_greater, prepend) {
    for(each_group in groups){
        if(length(each_group)>1){
            if(bonferroni) {
                invalid_columns = is.na(curr_base[each_group])
                comparable_values = !is.na(curr_props[,each_group, drop = FALSE])
                comparable_values[,invalid_columns] = FALSE
                # count number of comaprisons
                valid_values_in_row = rowSums(comparable_values, na.rm = TRUE)
                number_of_comparisons_in_row = valid_values_in_row*(valid_values_in_row-1)/2
                number_of_comparisons_in_row[number_of_comparisons_in_row<0] = 0
                bonferroni_coef = number_of_comparisons_in_row # sum(number_of_comparisons_in_row, na.rm = TRUE)
                bonferroni_coef[bonferroni_coef==0] = 1
            } else {
                bonferroni_coef = 1
            } 
            for(col1 in each_group[-length(each_group)]){
                prop1 = curr_props[[col1]]
                base1 = curr_base[[col1]]
                for(col2 in (col1+1):each_group[length(each_group)]){
                    prop2 = curr_props[[col2]]
                    base2 = curr_base[[col2]]
                    pval = compare_proportions(prop1, prop2, 
                                               base1, base2)
                    if_na(pval) = 1
                    pval = pmin(pval*bonferroni_coef, 1)
                    if(mark_greater) {
                        comparison = prop1>prop2
                    } else {
                        comparison = prop2>prop1
                    }    
                    delta =  abs(prop1 - prop2)
                    sig_section[[col1]] = ifelse(comparison & pval<sig_level & delta>delta_cpct,
                                                 paste_non_empty(sig_section[[col1]],
                                                                 all_column_labels[[col2]],
                                                                 sep = " "),
                                                 sig_section[[col1]]
                    )
                    sig_section[[col2]] = ifelse(!comparison & pval<sig_level & delta>delta_cpct,
                                                 paste_non_empty(sig_section[[col2]], 
                                                                 all_column_labels[[col1]], 
                                                                 sep = " "),
                                                 sig_section[[col2]]
                    )
                    
                    
                }                        
            } 
        }
        
        
    }
    if(prepend!=""){
        recode(sig_section[,-1]) = neq("") ~ function(x) paste(prepend, x)
    }
    sig_section
}

########################

section_sig_previous_column = function(sig_section, curr_props,  curr_base, groups,
                                       sig_labels_previous_column, 
                                       sig_level, delta_cpct, bonferroni) {
    for(each_group in groups){
        if(length(each_group)>1){
            # col1 - current column
            # col2 - previous column
            if(bonferroni) {
                invalid_columns = is.na(curr_base[each_group])
                comparable_values = !is.na(curr_props[,each_group, drop = FALSE])
                comparable_values[,invalid_columns] = FALSE
                # count number of comaprisons
                number_of_comparisons_in_row = 0
                for(col1 in seq_len(ncol(comparable_values))[-1]){
                    col2 = col1  - 1
                    number_of_comparisons_in_row = number_of_comparisons_in_row + 
                        (comparable_values[ ,col2] & comparable_values[ ,col1])
                } 
                bonferroni_coef = number_of_comparisons_in_row #sum(number_of_comparisons_in_row, na.rm = TRUE)
                bonferroni_coef[bonferroni_coef==0] = 1
            } else {
                bonferroni_coef = 1
            } 
            for(col1 in each_group[-1]){
                col2 = col1  - 1
                prop1 = curr_props[[col1]]
                base1 = curr_base[[col1]]
                prop2 = curr_props[[col2]]
                base2 = curr_base[[col2]]
                pval = compare_proportions(prop1, prop2, 
                                   base1, base2)
                if_na(pval) = 1
                pval = pmin(pval*bonferroni_coef, 1)
                sig_section[[col1]] = ifelse(pval<sig_level & abs(prop1 - prop2)>delta_cpct,
                                             # previous value is greater
                                             ifelse(prop2>prop1,
                                                    paste_non_empty(sig_section[[col1]], 
                                                          sig_labels_previous_column[[1]], 
                                                          sep = " "),
                                                    # previous value is smaller
                                                    paste_non_empty(sig_section[[col1]], 
                                                                    sig_labels_previous_column[[2]], 
                                                                    sep = " ")
                                             ),
                                             sig_section[[col1]]
                )
            }        
        }
    }
    sig_section
}

########################

section_sig_first_column = function(sig_section, curr_props,  curr_base, groups,
                                    sig_labels_first_column, sig_level,
                                    delta_cpct, bonferroni,
                                    adjust_common_base = FALSE) {
    groups = unlist(groups)
    # col1 - first column
    # col2 - other columns
    col1 = groups[1]
    prop1 = curr_props[[col1]]
    base1 = curr_base[[col1]]
    if(length(groups)>1 & !is.na(base1)){
        if(bonferroni) {
            invalid_columns = is.na(curr_base[-1])
            comparable_values = !is.na(curr_props[,-1, drop = FALSE])
            comparable_values[,invalid_columns] = FALSE
            # count number of comaprisons
            valid_values_in_row = rowSums(comparable_values, na.rm = TRUE)
            number_of_comparisons_in_row = valid_values_in_row
            number_of_comparisons_in_row[number_of_comparisons_in_row<0] = 0
            bonferroni_coef = number_of_comparisons_in_row #sum(number_of_comparisons_in_row, na.rm = TRUE)
            bonferroni_coef[bonferroni_coef==0] = 1
        } else {
            bonferroni_coef = 1
        } 
        for(col2 in groups[-1]){
            prop2 = curr_props[[col2]]
            base2 = curr_base[[col2]][1]
            pval = compare_proportions(prop1, prop2, 
                               base1, base2,
                               common_base = base2*adjust_common_base)
            if_na(pval) = Inf
            pval = pmin(pval*bonferroni_coef, 1)
            sig_section[[col2]] = ifelse(pval<sig_level & abs(prop1 - prop2)>delta_cpct,
                                         # previous value is greater
                                         ifelse(prop1>prop2,
                                                paste_non_empty(sig_section[[col2]], 
                                                      sig_labels_first_column[[1]], 
                                                      sep = " "),
                                                # previous value is smaller
                                                paste_non_empty(sig_section[[col2]], 
                                                      sig_labels_first_column[[2]], 
                                                      sep = " ")
                                         ),
                                         sig_section[[col2]]
            )
        }        
    }
    sig_section
}

########################

get_category_labels = function(header){
    # "aaa|bbb|ddd" -> "ddd"
    header = gsub("\\n|\\r", " ", header, perl = TRUE)
    gsub("^.*?\\|([^\\|]*)$", "\\1", header, perl = TRUE) 
}


########################

header_groups = function(header){
    header = header[-1]
    potentially_subgroup = function(x){
        has_levels = grepl("|", x, fixed = TRUE)
        seq_dupl = x[-1L] == x[-length(x)]
        seq_dupl =c(FALSE, seq_dupl) | c(seq_dupl, FALSE)
        has_levels & !seq_dupl
    }
    if(!any(grepl("|", header, fixed = TRUE))){
        # '+ 1' because of first column with row_labels
        return(list(seq_along(header)+1))
    }
    header = gsub("\\n|\\r", " ", header, perl = TRUE)
    to_strip = potentially_subgroup(header) 
    while(any(to_strip)){
        # "aaa|bbb|ddd" -> "aaa|bbb"
        header[to_strip] = gsub("\\|[^\\|]*$", "", header[to_strip], perl = TRUE) 
        to_strip = potentially_subgroup(header) 
    }
    res = rle(header)[["lengths"]]
    res = lapply(res, seq_len)
    # '+ 1' because of first column with row_labels
    res[[1]] = res[[1]] + 1 
    for(each in seq_along(res)[-1]){
        res[[each]] = res[[each]] + res[[each-1]][length(res[[each-1]])] 
    }
    res
}


########################

split_table_by_row_sections = function(tbl, total_marker = "#", total_row = 1){
    totals = get_total_rows_indicator(tbl, total_marker)
    if_na(totals) = FALSE
    stopif(!any(totals), 
           "significance testing - total rows not found. Incorrect total marker: ","'",
           total_marker, "'")
    total_above = totals[1]
    if(total_above){
        splitter = c(FALSE, totals[-length(totals)] < totals[-1])
    } else {
        splitter = c(FALSE, totals[-length(totals)] > totals[-1])
    }
    sections = cumsum(splitter)
    unname(split(tbl, sections))
}

extract_total_from_section = function(section, total_marker = "#", total_row = 1){
    curr_totals = get_total_rows_indicator(tbl = section, total_marker = total_marker)
    total = section[curr_totals,, drop = FALSE]
    if(is.character(total_row)){
        total = total[grepl(total_row, total[[1]], perl = TRUE), , drop = FALSE]
        stopif(nrow(total)<1, "significance testing - base not found: ", total_row)
    } else {
        stopif(nrow(total)<total_row, 
               "significance testing - base not found, too large 'total_row': ", total_row)
        total = total[total_row, , drop = FALSE]
    }
    total[[1]] = NA # it is supposed to be character (row_labels) so we change it
    unlist(total[1,])  # [1,] if we by occasion select several rows 
}


get_total_rows_indicator = function(tbl, total_marker = "#"){
    grepl(total_marker, tbl[[1]], perl = TRUE)
}



########################

paste_non_empty = function(x, y, sep = ""){
    ifs(x %in% c("", NA) ~ y,
        y %in% c("", NA) ~ x, 
        TRUE ~ paste(x, y, sep = sep)
    )
}

paste_df_non_empty = function(df1, df2, sep = ""){
    for(i in seq_along(df1)){
        max_width = max(nchar( df2[[i]]), na.rm = TRUE)
        df2[[i]] = format(df2[[i]], width = max_width) # , flag = "+"
        df1[[i]] = paste_non_empty(df1[[i]], df2[[i]], sep = sep)
    }
    df1
}
Any scripts or data that you put into this service are public.
expss documentation built on July 26, 2023, 5:23 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
expss
Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

R/significance_cpct.R
In expss: Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

Defines functions paste_df_non_empty paste_non_empty get_total_rows_indicator extract_total_from_section split_table_by_row_sections header_groups get_category_labels section_sig_first_column section_sig_previous_column section_sig_prop add_sig_labels significance_cpct.etable significance_cpct

Documented in add_sig_labels significance_cpct

Try the expss package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

expss Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

R/significance_cpct.R In expss: Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

Defines functions paste_df_non_empty paste_non_empty get_total_rows_indicator extract_total_from_section split_table_by_row_sections header_groups get_category_labels section_sig_first_column section_sig_previous_column section_sig_prop add_sig_labels significance_cpct.etable significance_cpct

Documented in add_sig_labels significance_cpct

Try the expss package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

expss
Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

R/significance_cpct.R
In expss: Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics