In rmsharp/excelutilsr: Convenience Functions for Working with Excel in R

knitr::opts_chunk$set(echo = TRUE)
suppressMessages(library(XLConnect))
library(stringi)
library(excelutilsr)

Though I try to not send data to individuals within an Excel file, sometimes my colleagues use Excel extensively in their own work, are adept at using Excel, and benefit from some of the formatting that is available for highlighting specific cells, columns, and rows within worksheets. For these reasons, I have put together a small function to encapsulate many of the nice formatting features provided by the XLConnect xlconnect@mirai-solutions.com package, which is available on CRAN.

The function add_formatted_worksheet is used on one worksheet at a time, but can add any number of formats to different cells within that worksheet.

An instructor asked me about calculating student scores and current averages using R and then displaying those scores in Excel with some formating enhancements to indicate those students who needes attention. I do not remember the specifics of his grading scheme, but this is an attempt to model a fairly realistic grading scenarior that can be easily simplified or enhanced.

Let's say you are a teacher who has given your students three different ways to earn points for the class you are teaching. They are listed below:

Type | Number Possible | Percent of Final Grade ------------|-----------------|----------------------- Excercise | 10 | 30 Exam | 4 | 40 Final | 1 | 30

You want to provide yourself and your students with a way of glancing at their grades and knowing where they stand quickly so you devise a plan to indicate how things are going by using Excel's formatting tools.

Excercises are in cells with an unlined left and right borders.
Exams are in cells with a single lined left and right borders.
The final has a double lined left and right borders.
Individual grades below 65 have a red foreground.
Individual grades below 75 have a yellow foreground.
Grades of 75 and above are in a transparent clear foreground.
The cells with the student names of individuals with cumulative scores that are equal to or above 75 have a green foreground.
The cells with the student names of individuals with cumulative scores that are greater than or equal to 65 and below 75 have a yellow foreground.
The cells with the student names of individuals with cumulative scores that are below 65 have a red foreground.

We can start by identifying the columns of cells with Exam grades. Let's say the column names are of the form "Exam_1" or "Week_2_Exam" such that we can use the string "exam" to identify the correct columns. The function could be written as follows.

is_exam <- function(df) {
  sapply(names(df), function(col_name) {
    if (stri_detect_regex(tolower(col_name), pattern = "exam") &
        !stri_detect_regex(tolower(col_name), pattern = "final")) {
      rep(TRUE, nrow(df))
    } else {
      rep(FALSE, nrow(df))
    }
  })
}

I have created a small dataframe containing some imaginary students and grades that can be used to illustrate the various functions.

grade_df <- data.frame(Student = c("James", "Jane", "Jamie"), 
                      "Exam_1" = c(86, 35, 42),
                      "Exercise_1" = c(75, 0, 95),
                      "Exercise_2" = c(100, 65, 80),
                      "Exam 2" = c(92, 49, 59), 
                      "Final Exam" = c(84, 68, 95),stringsAsFactors = FALSE)
is_exam(grade_df)

It is now easy to come up with a function that detects the column with the string "final" or "exercise" embedded in the column name.

is_final <- function(df) {
  sapply(names(df), function(col_name) {
    if (stri_detect_regex(tolower(col_name), pattern = "final")) {
      rep(TRUE, nrow(df))
    } else {
      rep(FALSE, nrow(df))
    }
  })
}
is_final(grade_df)
is_exercise <- function(df) {
  sapply(names(df), function(col_name) {
    if (stri_detect_regex(tolower(col_name), pattern = "exercise")) {
      rep(TRUE, nrow(df))
    } else {
      rep(FALSE, nrow(df))
    }
  })
}
is_exercise(grade_df)

In an equivalent way we can write a function to see if a cell holds a student's name.

is_student <- function(df) {
  sapply(names(df), function(col_name) {
    if (stri_detect_regex(tolower(col_name), pattern = "student")) {
      rep(TRUE, nrow(df))
    } else {
      rep(FALSE, nrow(df))
    }
  })
}
is_student(grade_df)

Testing for individual cell values is a bit different, but using R's vectors makes it quite simple. Thus, to test for individual grades below 65 we use the following function

is_score_below_65 <- function(grade_df) {grade_df < 65}
is_score_below_65(grade_df)

After seeing how simply that was constructed, the next two functions we need are easily written with the one complexity of how to handle grades below 75 and not less than 65.

is_score_equal_or_above_65_and_below_75 <- function(grade_df) {
  grade_df < 75 &
  !grade_df < 65
}
is_score_equal_or_above_65_and_below_75(grade_df)

is_score_equal_or_above_75 <- function(grade_df) {grade_df >= 75}

is_score_equal_or_above_75(grade_df)

When calculating current averages, scores have to take into account order and weight of the scores. We can do this by calculating possible total points up to the current cell. I am choosing to put my knowledge of the number and value of each type of grade into a single function (get_grading_scheme()). r stri_c(" _", unlist(stri_split_fixed("Handling excussed missing grades is outside the scope of this example", " ")), "_", collapse = "").

#' Grading scheme

#' This function houses the number of each grade type and the points assigned 
#' to those grade types.
#'
get_grading_scheme <- function() {
  list(n_exams = 4,
       exam_pts = 40.0,
       n_exercises = 10,
       exercise_pts = 30.0,
       n_finals = 1,
       final_pts = 30.0,
       total_pts = 100)
}

The first thing we need to know to see how a student is doing is to find out what the maximum number of points could have been accrued up and including each time point. Thus, in this example, after the first exam a student could have 100 on the exam, which would be 10 points (1 * 40 / 4) for the 1 examination taken thus far, the 40 points in the final grade made up of examination scores, and the fact that there are 4 of them. That is calculated with the get_individual_possible_pts() function. `r get_individual_possible_pts(0, 1, 0))

#' Looks at the number of exercises, exams, and finals complete to return the
#' maximum number of points that could have been earned by the student at
#' any specific point.
get_individual_possible_pts <- function(exercises = 10, exams = 4, finals = 1) {
  sched <- get_grading_scheme()
  exercises / sched$n_exercises * sched$exercise_pts  + 
    exams / sched$n_exams * sched$exam_pts  + 
    finals / sched$n_finals * sched$final_pts
}

We then need to calculate the possible number of points for each time point for each student. The code does not assume the location of the column containing the students names and its location is needed several times so a function that returns its position must be created.

get_student_col <- function(df) {
  col_names <- names(df)
  (1:length(df))[
    stri_detect_regex(tolower(col_names), pattern = "student")]
}

Now we can calculate the possible points for each student.

#' Looks at the number of exercises, exams, and finals complete to return the
#' maximum number of points that could have been earned by the student at
#' any specific point.
get_possible_pts <- function(grade_df) {
  student_col <- get_student_col(grade_df)
  possible_pts <- data.frame()
  for (row in 1:nrow(grade_df)) { # students are on rows
    exercises <- 0
    exams <- 0
    finals <- 0
    values <- numeric(ncol(grade_df) - 1)
    pts <- 0
    for (col in 1:ncol(grade_df)) {
      if (col == student_col) {
        next
      } else if (is_exercise(grade_df)[row, col]) {
        exercises <- exercises + 1
        values[col] <- get_individual_possible_pts(exercises, exams, finals)
      } else if (is_exam(grade_df)[row, col]) {
        exams <- exams + 1
        values[col] <- get_individual_possible_pts(exercises, exams, finals)
      } else if (is_final(grade_df)[row, col]) {
        finals <- finals + 1
        values[col] <- get_individual_possible_pts(exercises, exams, finals)
      }
    }
    possible_pts <- rbind(possible_pts,  values)
  }

  names(possible_pts) <- names(grade_df)
  possible_pts <- possible_pts[ , -student_col]
  possible_pts <- cbind(grade_df[ , names(grade_df)[
    stri_detect_regex(tolower(names(grade_df)), "student")]], 
                        possible_pts)
  possible_pts
}

Now that we know what is possible, we calculate the number of points each student has earned at each time point in their grade history.

get_earned_pts <- function(grade_df) {
  sched <- get_grading_scheme()
  student_col <- get_student_col(grade_df)
  partial_df <- grade_df[ , -student_col]
  exam_df <- partial_df * is_exam(partial_df) * 
    sched$exam_pts / (sched$n_exams * sched$total_pts)
  exercise_df <- partial_df * is_exercise(partial_df) * 
    sched$exercise_pts / (sched$n_exercises * sched$total_pts)
  final_df <- partial_df * is_final(partial_df) * 
    sched$final_pts / (sched$n_finals * sched$total_pts)
  current_grade_df <- exam_df + exercise_df + final_df
  t_df <- data.frame(grade_df[ , student_col], current_grade_df)
  names(t_df) <- names(grade_df)
  t_df
}

The cumulative earned points is calculated using the calculated number of points each student has earned at each time point in their grade history.

get_cumulative_earned_pts <- function(df) {
  student_col <- get_student_col(df)
  t_df <- get_earned_pts(df)
  c_pts <- t(cumsum(data.frame(as.matrix(t(t_df[ , -student_col]), 
                                         ncol = nrow(df)))))
  c_pts <- data.frame(df[ , student_col], c_pts, check.names = FALSE, 
                      stringsAsFactors = FALSE, row.names = NULL)
  names(c_pts) <- names(df)
  c_pts
}

We can easily calculate the running grade average using the possible earned points and the earned points. I have called this the current grade average. The cumulative earned points is calculated using the calculated number of points each student has earned at each time point in their grade history.

get_current_avg <- function(df) {
  student_col <- get_student_col(df)
  possible_points <- get_possible_pts(df)
  earned_points <- get_cumulative_earned_pts(df)
  current_avg <- earned_points[ , -student_col] / possible_points[ , -student_col]
  current_avg <- data.frame(earned_points[ , student_col], current_avg)
  names(current_avg) <- names(df)
  current_avg
}

current_avg <- get_current_avg(grade_df)

We have all of the grade calculations needed to post grades and current grade averages. Now we can demonstrate some simple to complex functions used to determine whether or not each cell gets a specific format. Thus, for the function is_avg_equal_or_above_75()

is_avg_equal_or_above_75 <- function(grade_df) {
  is_score_equal_or_above_75(get_current_avg(grade_df))
}

Identical strategy is used for both is_avg_below_65() and is_avg_equal_or_above_65_and_below_75().

is_avg_below_65 <- function(grade_df) {
  is_score_below_65(get_current_avg(grade_df))
}

is_avg_equal_or_above_65_and_below_75 <- function(grade_df) {
  is_score_equal_or_above_65_and_below_75(get_current_avg(grade_df))
}

The function make_df_with_FALSE() is a convenience function that takes a dataframe as an argument and returns a dataframe of the same number of rows and one less column filled with FALSE in each cell. The column left out is for student names.

This function is no longer used.

make_df_with_FALSE <- function(current_avg) {
  data.frame(matrix(rep(FALSE, nrow(current_avg) *
                          (max(col(current_avg)) - 1)), 
                    nrow = nrow(current_avg),
                    ncol = ncol(current_avg) - 1))
}

The add_student_col() function takes the current_avg dataframe, the vector (score_test_col) resulting from using the appropriate score_test() function, and the student names are in and adds score_test_col in the position of the student column and vectors of FALSE in all of the other columns. The score_test function is is_score_below_65(), is_score_equal_or_above_65_and_below_75(), or is_score_equal_or_above_75().

add_student_col <- function(current_avg, score_test_col) {
  student_col <- get_student_col(current_avg)
  false_col <- rep(FALSE, nrow(current_avg))
  student_df <- data.frame(false_col)
  for (col in seq_along(current_avg)) {
    if (col != student_col) {
      student_df <- data.frame(student_df, false_col)
    } else {
      student_df <- data.frame(student_df, student = score_test_col)
    }
  }
  student_df <- student_df[ , -1]
  names(student_df) <- names(current_avg)
  student_df
}
student_avg_test <- function(grade_df, score_test) {
  if (nrow(grade_df) == 0) {
    return(NA)
  }
  current_avg <- get_current_avg(grade_df)
  student_col <- get_student_col(current_avg)
  score_test_col <- logical(nrow(grade_df))
  for (row in 1:nrow(grade_df)) {
    max_col <- max(col(as.numeric(current_avg[row, ]) & !is.na(grade_df[row, ])))
    if (score_test(current_avg[row, max_col])) {
      score_test_col[row] <- TRUE
    } else {
      score_test_col[row] <- FALSE
    }
  }
  student_df <- add_student_col(current_avg, score_test_col)
  as.matrix(student_df)
}

student_avg_below_65 <- function(grade_df) {
  student_avg_test(grade_df, is_score_below_65)
}

student_avg_below_75_and_above_65 <- function(grade_df) {
  student_avg_test(grade_df, is_score_equal_or_above_65_and_below_75)
}

student_avg_equal_or_above_75 <- function(grade_df) {
  student_avg_test(grade_df, is_score_equal_or_above_75)
}

exam_fmt <- list(
  test = is_exam,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.NO_FILL),
  border = list(side = c("left",  "right"),
                type = XLC$BORDER.MEDIUM,
                color = XLC$COLOR.BLACK)
  )
final_fmt <- list(
  test = is_final,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.NO_FILL),
  border = list(side = c("left",  "right"),
                type = XLC$BORDER.DOUBLE,
                color = XLC$COLOR.BLACK)
  )
below_65_fmt <- list(
  test = is_score_below_65,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.SOLID_FOREGROUND),
  foreground_color = as.integer(XLC$COLOR.ROSE)
  )
below_75_fmt <- list(
  test = is_score_equal_or_above_65_and_below_75,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.SOLID_FOREGROUND),
  foreground_color = as.integer(XLC$COLOR.YELLOW)
  )
avg_below_65_fmt <-  list(
  test = student_avg_below_65,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.SOLID_FOREGROUND),
  foreground_color = as.integer(XLC$COLOR.ROSE)
  )
avg_below_75_fmt <- list(
  test = student_avg_below_75_and_above_65,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.SOLID_FOREGROUND),
  foreground_color = as.integer(XLC$COLOR.YELLOW)
  )
avg_above_75_fmt <- list(
  test = student_avg_equal_or_above_75,
  wrap = TRUE,
  fill_pattern = as.integer(XLC$FILL.SOLID_FOREGROUND),
  foreground_color = as.integer(XLC$COLOR.GREEN)
  )

Finally we put together the individual format lists into another list and call the add_formatted_worksheet() function. Note again how the current implementation has subsequent formatting commands overwrite prior formatting commands. In this example, the borders of exams and the final are removed when the scores are below 75.

We need to add an option to use existing formatting features that do not conflict with new ones. That is, if there is a border in the existing cell and the new format does not specify a border, the existing border would remain. This is not trivial, because it is a cell by cell formatting command.

grading_formats <- list(exam_fmt, final_fmt, below_65_fmt, below_75_fmt,
                        avg_below_65_fmt, avg_below_75_fmt, avg_above_75_fmt)

result <- add_formatted_worksheet(grade_df,
                                  "inst/extdata/student_grades.xlsx",
                                  sheet = "grades",
                                  header = TRUE,
                                  fmt_list = grading_formats,
                                  create = TRUE)