knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(moodleStats)
library(dplyr)

Read Data

{moodleStats} comes with raw example data for a Moodle Grades Report in a CSV file. Here are the steps to read that data into R.

  1. Get the path to the example data with moodleStats_example().

  2. Read grades_report.csv into a data frame. I will use readr::read_csv().

# Get path to `grades_report.csv`
path <- moodleStats_example("grades_report.csv")

# Read to a Data Frame
grades_df <- readr::read_csv(path)

Things to consider

grades_df is a Moodle Grades Report with `r nrow(grades_df)` rows and `r ncol(grades_df)` columns. Let's look at an overview of this data frame.

head(grades_df)

There are a couple of things that need cleaning before working with this data in R.

Need to clean

Column Names

names(grades_df)

Column names contain spaces, which we will have to remove.

The column names also carry too much metadata. For example, Grade/9.00 embeds the quiz's maximum grade (9.00), and Q. 1 /1.00 embeds the question's maximum score (1.00).

This metadata (the quiz's maximum grade and each question's maximum score) should be removed from the column names and stored in another data structure, which I will present later.
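As a rough illustration of what separating this metadata involves, here is a minimal base R sketch. It works on a hand-typed vector of column names shaped like the ones above. It is only an assumption about one possible approach, not how {moodleStats} does it internally, and raw_names, max_values, and clean_names are made-up names.

```r
# Hypothetical column names shaped like those in a Moodle Grades Report
raw_names <- c("Grade/9.00", "Q. 1 /1.00", "Q. 2 /1.00")

# Keep everything after the slash as the maximum grade or score
max_values <- as.numeric(sub("^.*/", "", raw_names))

# Drop the slash and what follows, then remove spaces and dots
clean_names <- gsub("[ .]", "", sub("/.*$", "", raw_names))

# Store the metadata separately, keyed by the cleaned names
names(max_values) <- clean_names
max_values
```

Keeping the maximum values in a named vector means they stay available after the column names themselves have been simplified.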

Column Formatting

grades_df$`Started on`[1]
grades_df$Completed[1]

Started on and Completed are date-time columns that need proper formatting; however, this package only uses the Started on column.

unique(grades_df$`Grade/9.00`)
unique(grades_df$`Q. 1 /1.00`)

Grade/9.00 and the Q.* columns need numeric formatting. Some character values, such as "-" in this example, need to be converted to NA.
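The conversion itself can be sketched in base R on a small hand-typed vector. The values in raw_scores are invented for illustration, and this is one possible way to do it rather than the package's own code.

```r
# Hypothetical raw values as they might be read from the CSV as text
raw_scores <- c("1.00", "0.50", "-", "0.00")

# Replace the "-" placeholder with NA before converting to numeric,
# which avoids the coercion warning as.numeric("-") would raise
raw_scores[raw_scores == "-"] <- NA
scores <- as.numeric(raw_scores)
scores
```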

Last Row

The last row of grades_df contains an "Overall average".

grades_df[nrow(grades_df), ]

It's convenient that Moodle provides a brief summary of the average quiz grade and question scores for teachers, but we don't need it because {moodleStats} already provides functions to calculate quiz and question statistics, which I'll show you later.

Therefore, the last row will be removed.

Lastly, the first and last names of students are in the First name and Surname columns, which is OK. However, if we were to perform a data transformation grouped by student, we would have to write code to group by 2 variables.

For convenience, joining First name and Surname into a single column (i.e., Name) makes the data easier to work with, as I'll show you later.
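A minimal sketch of this join in base R, using a toy data frame with invented student names (the students object is hypothetical, and the package may build the Name column differently):

```r
# Toy data with the same two name columns as the Grades Report
students <- data.frame(
  `First name` = c("Ada", "Alan"),
  Surname = c("Lovelace", "Turing"),
  check.names = FALSE  # keep the space in "First name"
)

# Join both parts into a single Name column, separated by a space
students$Name <- paste(students$`First name`, students$Surname, sep = " ")
students$Name
```

With a single Name column, grouping by student needs only one variable.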

Filtering attempts

The quiz settings in the Moodle course behind grades_df allow students to submit answers multiple times (i.e., unlimited attempts).

If we count attempts per student, we can see that many students took the quiz multiple times.

grades_df %>% 
  count(`First name`, Surname, State, sort = TRUE) %>% 
  head()

If some students appear in more than one observation (row) and we calculate statistics from all of these rows, we would get a somewhat biased result.

Therefore, {moodleStats} provides a method to filter rows by quiz status, grade, and start time of each student in order to obtain a unique observation per student.

Prepare Data

Previously, I listed the things that need to be done. {moodleStats} provides a function to do all of that.

prep_grades_report() will do the following:

Cleaning

Filtering

Filters will be applied in this order.

  1. Filter quiz status (State) by choose_state argument.

    • "Finished": (default) choose only submitted attempts

    • "In progress": choose only non-submitted attempts

    • "all": choose all attempts (no filter)

  2. Filter student grades (Grade) by choose_grade argument.

    • "max": (default) choose the attempts with the maximum score for each student

    • "min": choose the attempts with the minimum score for each student

    • "all": choose all attempts (no filter)

  3. Filter started time (Started) by choose_time argument.

    • "first": (default) choose the first attempt of each student

    • "last": choose the last attempt of each student

    • "all": choose all attempts (no filter)
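To make the filtering order concrete, here is a sketch of the same three steps written with plain {dplyr} on a toy data frame. The attempts data and the best_attempts name are invented, and this only illustrates the logic of the default settings. It is not the package's actual implementation.

```r
library(dplyr)

# Toy attempts: student A has two tied best attempts, student B improves
attempts <- data.frame(
  Name = c("A", "A", "A", "B", "B"),
  State = c("Finished", "Finished", "In progress", "Finished", "Finished"),
  Started = as.POSIXct(
    c("2021-01-01 09:00", "2021-01-02 09:00", "2021-01-03 09:00",
      "2021-01-01 10:00", "2021-01-02 10:00"),
    tz = "UTC"
  ),
  Grade = c(7, 7, NA, 5, 8)
)

best_attempts <- attempts %>%
  # 1. Keep only submitted attempts
  filter(State == "Finished") %>%
  group_by(Name) %>%
  # 2. Keep each student's attempts with the maximum grade
  filter(Grade == max(Grade, na.rm = TRUE)) %>%
  # 3. Among those, keep the earliest attempt
  filter(Started == min(Started)) %>%
  ungroup()

best_attempts
```

The order matters: student A has two attempts tied at the maximum grade, and only the time filter in the last step reduces them to a single row.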

In the example below, I call prep_grades_report() with arguments that select, for each student, the first submitted attempt that has the maximum score.

grades_df_preped <- prep_grades_report(grades_df,
                                       sep_name = " ",
                                       choose_state = "Finished",
                                       choose_grade = "max",
                                       choose_time = "first")
names(grades_df_preped)
head(grades_df_preped)

In grades_df_preped, the column names are cleaned, column types are formatted properly, and missing values are assigned NA as previously described.

Also, each student name appears in exactly one row, with no duplication.

any(duplicated(grades_df_preped$Name))

Quiz Metadata

Now that the column names have been cleaned, how can we get the maximum grade settings of the quiz and its questions?

quiz_meta() retrieves the quiz settings and a record of the filtering that was performed previously. The result is printed nicely to the R console.

grades_df_meta <- quiz_meta(grades_df_preped)
grades_df_meta

Underneath, it is just a list, so you can subset it later as desired.

# Maximum score of each question
grades_df_meta$quiz_setting$questions_max

Quiz Summary

Now that we have prepared our data, it's time to calculate quiz summary statistics.

summary_quiz() calculates various quiz summary statistics and combines them into a 1-row data frame.

quiz_report_df <- summary_quiz(grades_df_preped)

head(quiz_report_df)

All of these parameters are calculated from the total quiz score (Grade column) of each student, except for Cronbach_Alpha, which uses the scores of all questions (Q columns), and max_quiz, which is the maximum possible score of this quiz.
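To show why Cronbach's alpha needs the per-question scores rather than only the totals, here is a sketch of the textbook formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), in base R. The cronbach_alpha() helper and the toy item matrix are hypothetical, and summary_quiz() may compute the statistic differently (for example, in how it treats missing values).

```r
# Textbook Cronbach's alpha for a matrix with one column per question
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of questions
  item_vars <- apply(items, 2, var)     # variance of each question
  total_var <- var(rowSums(items))      # variance of total scores
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Toy scores: 4 students, 2 questions that always agree
toy_items <- cbind(Q1 = c(0, 1, 0, 1), Q2 = c(0, 1, 0, 1))
cronbach_alpha(toy_items)
```

Because the two toy questions agree perfectly, the result is 1, the upper bound of the coefficient.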

For interpretation of these parameters, please see the summary_quiz() documentation.

Plot Quiz Score

Let's plot a histogram of quiz scores using {ggplot2}.

library(ggplot2)
theme_set(theme_minimal())
grades_df_preped %>% 
  ggplot(aes(Grade)) +
  geom_histogram(binwidth = 0.5, fill = "darkblue", alpha = 0.7) +

  geom_vline(xintercept = quiz_report_df$mean, color = "maroon", lty = "dashed") +
  annotate("label", 
           x = quiz_report_df$mean, y = 70,
           label = paste0("mean = ", round(quiz_report_df$mean, 2))) +

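  # Set breaks and limits in one scale; a separate xlim() call would
  # replace this scale and discard the breaks
  scale_x_continuous(breaks = 0:10, limits = c(0, 10)) +
  labs(title = "Quiz Score Distribution", 
       caption  = "Note: multiple-attempt quiz; the best score of each student is used.",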
       y = "Number of Students")

The plot shows that the distribution is non-Gaussian, with its tail on the left side, which corresponds to the negative skew we calculated (skewness = `r round(quiz_report_df$skewness, 2)`). The distribution also has quite a few outliers, with a kurtosis of `r round(quiz_report_df$kurtosis, 2)`.
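For readers who want to see how the sign of the skewness relates to a left tail, here is a sketch of the simple moment-based definitions in base R. The helper names and toy grades are made up, and summary_quiz() may use a different estimator (for instance, a bias-corrected one, or excess kurtosis instead of the raw value shown here).

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment over variance^2
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Toy grades: most students score high, one scores very low (left tail)
toy_grades <- c(2, 7, 8, 8, 9, 9, 9)
moment_skewness(toy_grades)  # negative for a left-tailed distribution
moment_kurtosis(toy_grades)
```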

Question Summary & Item Analysis

summary_questions() calculates various question summary statistics and item analysis parameters, and returns them in a data frame.

Let's see what parameters we can get.

question_report_df <- summary_questions(grades_df_preped)
names(question_report_df)

Summary Stats of Questions

Descriptive statistics of each question are in the columns from n to SD.

question_report_df %>% 
  select(Questions:SD)

Item Analysis

Parameters of item analysis are Difficulty_Index and Discrimination_Index.

question_report_df %>% 
  select(Questions, Difficulty_Index:p.value)

Note:

Data visualization often delivers insight more naturally. Let's make some plots for item analysis!

Plot Question's Score

Let's plot the distribution of each question's score. I think a violin plot is a good choice in this case.

But first, we need to transform grades_df_preped to a long format.

grades_df_long <- grades_df_preped %>% 
  # Select Q1 to Q9 columns
  select(starts_with("Q")) %>% 
  # Pivot to long
  tidyr::pivot_longer(cols = starts_with("Q"), names_to = "Questions", values_to = "Mark") %>% 
  # Calculate mean of each question
  group_by(Questions) %>% 
  mutate(Mark_mean = mean(Mark, na.rm = TRUE)) %>% 
  ungroup()

head(grades_df_long)

And then let's plot!

I will put the questions on the x-axis and the score distribution on the y-axis.

grades_df_long %>% 
  ggplot(aes(Questions, Mark, color = Mark_mean, fill = Mark_mean)) +
  # Violins
  geom_violin(alpha = 0.3, show.legend = FALSE) +
  # Dots
  stat_summary(fun = mean, geom = "point", show.legend = FALSE) +

  scale_fill_viridis_c(option = "plasma", end = 0.8, begin = 0) +
  scale_color_viridis_c(option = "plasma", end = 0.8, begin = 0) +

  labs(x = "Questions", y = "Mark", 
       title = "Question's Score Distribution", 
       caption = "Points indicate question's mean score") 

Note that the points in the plot above are the questions' mean scores, which, in this case, are equal to Difficulty_Index.

(This is because the Item Difficulty Index (p) is calculated as the mean score of each question divided by its range. In this example, every question has a maximum score of 1, so p is equal to the mean score of each question.)
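The calculation described above can be sketched in a few lines of base R. The difficulty_index() helper and the toy scores are hypothetical, and the sketch assumes the minimum possible score of a question is 0, so the range equals the maximum score. It is not taken from the package source.

```r
# Difficulty index p: mean score divided by the score range
# (min_score = 0 is an assumption made for this sketch)
difficulty_index <- function(scores, max_score, min_score = 0) {
  mean(scores, na.rm = TRUE) / (max_score - min_score)
}

# Toy scores for one question with a maximum score of 1
q_scores <- c(1, 0, 1, 1, NA)
difficulty_index(q_scores, max_score = 1)

# The same answers on a question worth 2 points give the same index
difficulty_index(q_scores * 2, max_score = 2)
```

Both calls return 0.75, which shows that dividing by the range puts questions with different maximum scores on the same 0 to 1 scale.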

Plot Item Discrimination vs Item Difficulty

Let's make a text scatter plot with Discrimination_Index on the x-axis and 1 - Difficulty_Index on the y-axis.

I will use Difficulty_Level to represent 1 - Difficulty_Index, and I will refer to Discrimination_Index as discriminative power.

question_report_df <- question_report_df %>% 
  mutate(Difficulty_Level = 1 - Difficulty_Index)
question_report_df %>% 
  ggplot(aes(Discrimination_Index, Difficulty_Level, 
             color = Difficulty_Level*Discrimination_Index)) +

  geom_text(aes(label = Questions), show.legend = FALSE) +

  #scale_y_reverse() +
  expand_limits(x = c(0,1), y = c(0,1)) +
  labs(title = "Question's difficulty vs discrimination",
       x = ("Discrimination Index (r)"),
       y = ("Difficulty Level (1 - p)")) 

This plot shows the difficulty level (1 - p) and discriminative power (r) of each question in one picture.

As we move from the lower-left corner to the upper-right corner of the graph, the difficulty level and discrimination index of the questions increase.

Looking at this graph, it seems to me that there is a positive correlation between these 2 variables: the more difficult a question is, the more discriminative power it has.

Let's test this hypothesis.

res <- cor.test(question_report_df$Difficulty_Level,
                question_report_df$Discrimination_Index,
                method = "pearson")
res

The report shows that there is a positive correlation ($r$ = `r round(res$estimate, 3)`) between the difficulty level (1 - p) and the discriminative power (r) of the questions, although it is not statistically significant (p-value = `r round(res$p.value, 3)`).


Last updated: `r format(Sys.time(), '%d %B %Y')`



Lightbridge-KS/moodleStats documentation built on April 7, 2022, 8:14 p.m.