vignettes/DistributionOfMeetupDays.md

Distribution of Meetup Days for the Portland R User Group Meetup ================ Jim Tyhurst 2020-06-16

The Question

Overheard

Meetup Member P: “It would be really great if there were things scheduled on nights other than Mondays or Tuesdays!”

Meetup Member Q: “Events are scheduled throughout the week on Mondays to Thursdays traditionally. Feel free to check out past meetups to see the variety there in terms of days of the week scheduled.”

I could not resist this invitation! What is the distribution of events by day of the week for the Portland R User Group?

Disclaimer: I am neither Member P nor Member Q. This is just for fun!

Configuration

library(codesamplerr)
library(readr)
library(dplyr)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:cowplot':
#> 
#>     stamp
#> The following object is masked from 'package:base':
#> 
#>     date
library(ggplot2)

The Data

pdxRlang-meetup-days.csv is a one-column CSV file that has the dates for the last 3 years of Events listed on the Portland R User Group web site. I created the file on 2019-03-15 by browsing the Past Events web page and typing the dates into a file manually.

Design decisions

Feature Engineering

Let’s read the data and add variables that will help us to group the data:

events <- system.file("pdxRlang-meetup-days.csv", package = "codesamplerr") %>% 
  readr::read_csv() %>% 
  dplyr::mutate(
    DayOfWeek = lubridate::wday(Day),  # numeric, 1 = Sunday
    DayOfWeekLabel = lubridate::wday(Day, label = TRUE),  # Mon, Tue, ...
    Month = lubridate::month(Day), # numeric
    MonthLabel = lubridate::month(Day, label = TRUE), # Jan, Feb, ...
    Year = lubridate::year(Day)
  )
#> Parsed with column specification:
#> cols(
#>   Day = col_date(format = "")
#> )
print(sprintf(
  "Summary: %d events from %s to %s", 
  length(events$Day), 
  min(events$Day), 
  max(events$Day)
))
#> [1] "Summary: 77 events from 2016-01-16 to 2019-03-12"
events
#> # A tibble: 77 x 6
#>    Day        DayOfWeek DayOfWeekLabel Month MonthLabel  Year
#>    <date>         <dbl> <ord>          <dbl> <ord>      <dbl>
#>  1 2019-03-12         3 Tue                3 Mar         2019
#>  2 2019-03-06         4 Wed                3 Mar         2019
#>  3 2019-02-19         3 Tue                2 Feb         2019
#>  4 2019-01-16         4 Wed                1 Jan         2019
#>  5 2019-01-03         5 Thu                1 Jan         2019
#>  6 2018-12-11         3 Tue               12 Dec         2018
#>  7 2018-12-04         3 Tue               12 Dec         2018
#>  8 2018-11-14         4 Wed               11 Nov         2018
#>  9 2018-11-07         4 Wed               11 Nov         2018
#> 10 2018-10-16         3 Tue               10 Oct         2018
#> # … with 67 more rows

Visualizing the Distribution of Meeting Days

First, we plot the entire past 3 years:

start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>% 
  ggplot(aes(DayOfWeekLabel)) +
  geom_bar(na.rm = TRUE) +
  ggtitle(sprintf(
    "Distribution of %d Meeting Days: %s to %s",
    n_events,
    start_date,
    end_date
  )) +
  xlab("Day of Week") +
  scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))

The distribution does not seem like a uniform distribution. Tuesday and Wednesday are definitely the favored days, although many events are also scheduled on Monday and Thursday.

Let’s focus on just the past year:

start_date <- max(events$Day) - dyears(1)
end_date <- max(events$Day)
recent_events <- events %>% dplyr::filter(Day >= start_date)
n_events <- length(recent_events$Day)
recent_events %>% 
  ggplot(aes(DayOfWeekLabel)) +
  geom_bar(na.rm = TRUE) +
  ggtitle(sprintf(
    "Distribution of %d Meeting Days: %s to %s",
    n_events,
    start_date,
    end_date
  )) +
  xlab("Day of Week") +
  scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))

This recent data looks much more uniform than the previous plot of all 3 years of data, although Monday has clearly been neglected in the past year.

While we are exploring and visualizing the data, let’s look at the distribution of events across months of the year:

start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>% 
  ggplot(aes(MonthLabel)) +
  geom_bar(na.rm = TRUE) +
  ggtitle(sprintf(
    "Distribution of %d Events by Month: %s to %s",
    n_events,
    start_date,
    end_date
  )) +
  xlab("Month") +
  scale_y_continuous(name = "Number of Events", limits = c(0, 10), breaks = seq(0, 10, by = 2))

And the distribution of events by year:

start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>% 
  ggplot(aes(Year)) +
  geom_bar(na.rm = TRUE) +
  ggtitle(sprintf(
    "Distribution of %d Events by Year: %s to %s",
    n_events,
    start_date,
    end_date
  )) +
  scale_x_continuous(name = "Year", limits = c(2015, 2020), breaks = seq(2015, 2020, by = 1)) +
  scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))

It appears that the Portland R User Group is stable with approximately the same number of events each year.

Statistical Analysis

If each day Monday - Thursday had an equal probability of being chosen, then we would expect roughly the same number of events for each day for a large sample of events. That would be a uniform distribution. I want to determine whether the actual distribution of events across the four days Monday - Thursday is statistically different than a uniform distribution.

A chi-square test is often used to compare two distributions for categorical variables. However, this is a very small set of observations. In particular, the observations for the past year have some cells with less than 5 observations, which violates an assumpution of a chi-square test. Therefore, Fisher’s Exact Test is a better choice for comparing the actual distribution to a a uniform distribution across Monday through Thursday for the past 3 years and for the past year.

actual_distribution <- events %>% 
  dplyr::filter(DayOfWeek >= 2, DayOfWeek <= 5) %>% 
  dplyr::group_by(DayOfWeekLabel) %>% 
  summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
print(actual_distribution)
#> # A tibble: 4 x 2
#>   DayOfWeekLabel     n
#>   <ord>          <int>
#> 1 Mon                9
#> 2 Tue               29
#> 3 Wed               22
#> 4 Thu               12
n_events <- sum(actual_distribution$n)
n_days <- length(actual_distribution$DayOfWeekLabel)
print(sprintf(
  "There are %d total events distributed across %d weekdays.",
  n_events,
  n_days
))
#> [1] "There are 72 total events distributed across 4 weekdays."

# Build uniform distribution with same number of events in each of
# the day slots for Monday - Thursday.
uniform_distribution <- ceiling(rep(n_events / n_days, n_days))

# Use Fisher's Test on a contingency table with the actual distribution compared
# to a uniform distribution.
test_result <- fisher.test(data.frame(
  x = actual_distribution$n, 
  y = uniform_distribution
))
print(test_result)
#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  data.frame(x = actual_distribution$n, y = uniform_distribution)
#> p-value = 0.06854
#> alternative hypothesis: two.sided

Fisher’s Exact Test results in a p-value = 0.0685, so the difference is not significant at a 0.05 level, meaning that events over the past 3 years are not significantly different from a uniform distribution.

Let’s use the same test restricted to only the past year of events, which is a very small sample:

actual_distribution <- recent_events %>% 
  dplyr::filter(DayOfWeek >= 2, DayOfWeek <= 5) %>% 
  dplyr::group_by(DayOfWeekLabel) %>% 
  summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
print(actual_distribution)
#> # A tibble: 4 x 2
#>   DayOfWeekLabel     n
#>   <ord>          <int>
#> 1 Mon                1
#> 2 Tue               11
#> 3 Wed                6
#> 4 Thu                5
n_events <- sum(actual_distribution$n)
n_days <- length(actual_distribution$DayOfWeekLabel)
print(sprintf(
  "There are %d total events distributed across %d days.",
  n_events,
  n_days
))
#> [1] "There are 23 total events distributed across 4 days."

# Build uniform distribution with same number of events in each of
# the day slots for Monday - Thursday.
uniform_distribution <- ceiling(rep(n_events / n_days, n_days))

# Use Fisher's Test on a contingency table with the actual distribution compared
# to a uniform distribution.
test_result <- fisher.test(data.frame(
  x = actual_distribution$n, 
  y = uniform_distribution
))
print(test_result)
#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  data.frame(x = actual_distribution$n, y = uniform_distribution)
#> p-value = 0.1725
#> alternative hypothesis: two.sided

Considering data for the past year, Fisher’s Exact Test results in a p-value = 0.1725, indicating that the difference is not significant, meaning that the distribution of events over the past 1 year are not significantly different than a uniform distribution.

Conclusion

I spent an interesting couple of hours:

The distribution of events favors Tuesdays and Wednesdays, but there are events on Mondays and Thursdays also. It appears that the Meetup Organizers are selecting a variety of days, so that people who have conflicts on certain days of the week will still be able to attend some of the Meetup events. Good job, Organizers!



jimtyhurst/codesamplerr documentation built on Aug. 13, 2021, 8:45 a.m.