Distribution of Meetup Days for the Portland R User Group Meetup ================ Jim Tyhurst 2020-06-16
Overheard
Meetup Member P: “It would be really great if there were things scheduled on nights other than Mondays or Tuesdays!”
Meetup Member Q: “Events are scheduled throughout the week on Mondays to Thursdays traditionally. Feel free to check out past meetups to see the variety there in terms of days of the week scheduled.”
I could not resist this invitation! What is the distribution of events by day of the week for the Portland R User Group?
Disclaimer: I am neither Member P nor Member Q. This is just for fun!
library(codesamplerr)
library(readr)
library(dplyr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:cowplot':
#>
#> stamp
#> The following object is masked from 'package:base':
#>
#> date
library(ggplot2)
pdxRlang-meetup-days.csv is a one-column CSV file that has the dates for the last 3 years of Events listed on the Portland R User Group web site. I created the file on 2019-03-15 by browsing the Past Events web page and typing the dates into a file manually.
DayOfWeek
, Month
, and Year
can be derived
from the date, so I did not bother to include those variables in the
data file.Let’s read the data and add variables that will help us to group the data:
events <- system.file("pdxRlang-meetup-days.csv", package = "codesamplerr") %>%
readr::read_csv() %>%
dplyr::mutate(
DayOfWeek = lubridate::wday(Day), # numeric, 1 = Sunday
DayOfWeekLabel = lubridate::wday(Day, label = TRUE), # Mon, Tue, ...
Month = lubridate::month(Day), # numeric
MonthLabel = lubridate::month(Day, label = TRUE), # Jan, Feb, ...
Year = lubridate::year(Day)
)
#> Parsed with column specification:
#> cols(
#> Day = col_date(format = "")
#> )
print(sprintf(
"Summary: %d events from %s to %s",
length(events$Day),
min(events$Day),
max(events$Day)
))
#> [1] "Summary: 77 events from 2016-01-16 to 2019-03-12"
events
#> # A tibble: 77 x 6
#> Day DayOfWeek DayOfWeekLabel Month MonthLabel Year
#> <date> <dbl> <ord> <dbl> <ord> <dbl>
#> 1 2019-03-12 3 Tue 3 Mar 2019
#> 2 2019-03-06 4 Wed 3 Mar 2019
#> 3 2019-02-19 3 Tue 2 Feb 2019
#> 4 2019-01-16 4 Wed 1 Jan 2019
#> 5 2019-01-03 5 Thu 1 Jan 2019
#> 6 2018-12-11 3 Tue 12 Dec 2018
#> 7 2018-12-04 3 Tue 12 Dec 2018
#> 8 2018-11-14 4 Wed 11 Nov 2018
#> 9 2018-11-07 4 Wed 11 Nov 2018
#> 10 2018-10-16 3 Tue 10 Oct 2018
#> # … with 67 more rows
First, we plot the entire past 3 years:
start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>%
ggplot(aes(DayOfWeekLabel)) +
geom_bar(na.rm = TRUE) +
ggtitle(sprintf(
"Distribution of %d Meeting Days: %s to %s",
n_events,
start_date,
end_date
)) +
xlab("Day of Week") +
scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))
The distribution does not seem like a uniform distribution. Tuesday and Wednesday are definitely the favored days, although many events are also scheduled on Monday and Thursday.
Let’s focus on just the past year:
start_date <- max(events$Day) - dyears(1)
end_date <- max(events$Day)
recent_events <- events %>% dplyr::filter(Day >= start_date)
n_events <- length(recent_events$Day)
recent_events %>%
ggplot(aes(DayOfWeekLabel)) +
geom_bar(na.rm = TRUE) +
ggtitle(sprintf(
"Distribution of %d Meeting Days: %s to %s",
n_events,
start_date,
end_date
)) +
xlab("Day of Week") +
scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))
This recent data looks much more uniform than the previous plot of all 3 years of data, although Monday has clearly been neglected in the past year.
While we are exploring and visualizing the data, let’s look at the distribution of events across months of the year:
start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>%
ggplot(aes(MonthLabel)) +
geom_bar(na.rm = TRUE) +
ggtitle(sprintf(
"Distribution of %d Events by Month: %s to %s",
n_events,
start_date,
end_date
)) +
xlab("Month") +
scale_y_continuous(name = "Number of Events", limits = c(0, 10), breaks = seq(0, 10, by = 2))
And the distribution of events by year:
start_date <- min(events$Day)
end_date <- max(events$Day)
n_events <- length(events$Day)
events %>%
ggplot(aes(Year)) +
geom_bar(na.rm = TRUE) +
ggtitle(sprintf(
"Distribution of %d Events by Year: %s to %s",
n_events,
start_date,
end_date
)) +
scale_x_continuous(name = "Year", limits = c(2015, 2020), breaks = seq(2015, 2020, by = 1)) +
scale_y_continuous(name = "Number of Events", limits = c(0, 30), breaks = seq(0, 30, by = 5))
It appears that the Portland R User Group is stable with approximately the same number of events each year.
If each day Monday - Thursday had an equal probability of being chosen, then we would expect roughly the same number of events for each day for a large sample of events. That would be a uniform distribution. I want to determine whether the actual distribution of events across the four days Monday - Thursday is statistically different than a uniform distribution.
A chi-square test is often used to compare two distributions for categorical variables. However, this is a very small set of observations. In particular, the observations for the past year have some cells with less than 5 observations, which violates an assumpution of a chi-square test. Therefore, Fisher’s Exact Test is a better choice for comparing the actual distribution to a a uniform distribution across Monday through Thursday for the past 3 years and for the past year.
actual_distribution <- events %>%
dplyr::filter(DayOfWeek >= 2, DayOfWeek <= 5) %>%
dplyr::group_by(DayOfWeekLabel) %>%
summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
print(actual_distribution)
#> # A tibble: 4 x 2
#> DayOfWeekLabel n
#> <ord> <int>
#> 1 Mon 9
#> 2 Tue 29
#> 3 Wed 22
#> 4 Thu 12
n_events <- sum(actual_distribution$n)
n_days <- length(actual_distribution$DayOfWeekLabel)
print(sprintf(
"There are %d total events distributed across %d weekdays.",
n_events,
n_days
))
#> [1] "There are 72 total events distributed across 4 weekdays."
# Build uniform distribution with same number of events in each of
# the day slots for Monday - Thursday.
uniform_distribution <- ceiling(rep(n_events / n_days, n_days))
# Use Fisher's Test on a contingency table with the actual distribution compared
# to a uniform distribution.
test_result <- fisher.test(data.frame(
x = actual_distribution$n,
y = uniform_distribution
))
print(test_result)
#>
#> Fisher's Exact Test for Count Data
#>
#> data: data.frame(x = actual_distribution$n, y = uniform_distribution)
#> p-value = 0.06854
#> alternative hypothesis: two.sided
Fisher’s Exact Test results in a p-value = 0.0685
, so the difference
is not significant at a 0.05 level, meaning that events over the past 3
years are not significantly different from a uniform distribution.
Let’s use the same test restricted to only the past year of events, which is a very small sample:
actual_distribution <- recent_events %>%
dplyr::filter(DayOfWeek >= 2, DayOfWeek <= 5) %>%
dplyr::group_by(DayOfWeekLabel) %>%
summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
print(actual_distribution)
#> # A tibble: 4 x 2
#> DayOfWeekLabel n
#> <ord> <int>
#> 1 Mon 1
#> 2 Tue 11
#> 3 Wed 6
#> 4 Thu 5
n_events <- sum(actual_distribution$n)
n_days <- length(actual_distribution$DayOfWeekLabel)
print(sprintf(
"There are %d total events distributed across %d days.",
n_events,
n_days
))
#> [1] "There are 23 total events distributed across 4 days."
# Build uniform distribution with same number of events in each of
# the day slots for Monday - Thursday.
uniform_distribution <- ceiling(rep(n_events / n_days, n_days))
# Use Fisher's Test on a contingency table with the actual distribution compared
# to a uniform distribution.
test_result <- fisher.test(data.frame(
x = actual_distribution$n,
y = uniform_distribution
))
print(test_result)
#>
#> Fisher's Exact Test for Count Data
#>
#> data: data.frame(x = actual_distribution$n, y = uniform_distribution)
#> p-value = 0.1725
#> alternative hypothesis: two.sided
Considering data for the past year, Fisher’s Exact Test results in a
p-value = 0.1725
, indicating that the difference is not significant,
meaning that the distribution of events over the past 1 year are not
significantly different than a uniform distribution.
I spent an interesting couple of hours:
The distribution of events favors Tuesdays and Wednesdays, but there are events on Mondays and Thursdays also. It appears that the Meetup Organizers are selecting a variety of days, so that people who have conflicts on certain days of the week will still be able to attend some of the Meetup events. Good job, Organizers!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.