knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 5,
  fig.width = 7,
  dpi = 200,
  out.height = 500,
  out.width = 700
)
options(tibble.print_max = 10)
{waxer} makes the Wikimedia Analytics Query Service (AQS) REST API available and easy to use in R. With a consistent interface and output, {waxer} facilitates working with the metrics & data available in AQS, especially when combined with modern data science frameworks like the tidyverse for data wrangling.
library(waxer)
library(dplyr)
library(purrr)
library(ggplot2)
library(lubridate)
A brief explanation of the packages used: {purrr} makes it easy to run multiple {waxer} queries programmatically, {dplyr} makes it easy to manipulate the retrieved data, {lubridate} helps with dates & times, and {ggplot2} is used for visualization. In some of these examples we will use purrr::map to apply a {waxer} function to a set of values we're interested in while keeping all the other parameters constant. Here's how map works:
fun <- function(a, b) {
  return(a + b)
}
map(-1:1, fun, b = 2)
Notice that the output is a list, which is map's default behavior. We can also specify the output type by using the different flavors of map:
map_dbl(-1:1, fun, b = 2)
map_chr(-1:1, fun, b = 2)
Since the output of {waxer}'s API-querying functions is always a tibble (an extension of a data.frame), we will mostly be using the map_dfr function, which stitches several tibbles into one (via dplyr::bind_rows).
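For a quick illustration of map_dfr (the fun_df helper below is made up just for this example), each call returns a one-row tibble and map_dfr binds the results into a single tibble:

fun_df <- function(a, b) {
  # Hypothetical helper returning a one-row tibble
  tibble(a = a, b = b, sum = a + b)
}
map_dfr(-1:1, fun_df, b = 2)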
This package uses the same API endpoint as the {pageviews} package for page view data. Similar to {pageviews}, the caveat is that the traffic data is only available from 1 August 2015. For legacy view counts, refer to the {wikipediatrend} package.
In this example we retrieve the page-views for the New Year's Eve and New Year's Day articles on English Wikipedia. Specifically, we're interested in user traffic, which excludes known spiders/bots.
pageviews <- wx_page_views( project = "en.wikipedia", page_name = c("New Year's Eve", "New Year's Day"), access_method = "all", agent_type = "user", start_date = "20191231", end_date = "20200101" )
pageviews
In this case, the New Year's Eve article was viewed much more on New Year's Eve (December 31st) than on New Year's Day. Similarly, the New Year's Day article was viewed much more on New Year's Day (January 1st) than on New Year's Eve.
Now, suppose we wanted to see if this pattern is consistent across the years (starting with 2015/2016, since the pageviews API starts from 2015-08-01). One way to do this would be to create start_date-end_date pairs across the years and use map2_dfr (not map_dfr) to iterate through the pairs:
new_years_dates <- tibble(
  start_date = as.Date("2015-12-31") + years(0:4),
  end_date = as.Date("2016-01-01") + years(0:4)
)
new_years_dates
Notice that those are Dates, not "YYYYMMDD" strings. All of the start_date and end_date parameters in {waxer}'s functions accept either, so we don't have to call as.Date when querying just once, and we don't have to call as.character on dates in situations like this.
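For example, these two calls (repeating the earlier query purely to illustrate the two accepted date formats) are equivalent:

wx_page_views("en.wikipedia", "New Year's Eve", start_date = "20191231", end_date = "20200101")
wx_page_views("en.wikipedia", "New Year's Eve", start_date = as.Date("2019-12-31"), end_date = as.Date("2020-01-01"))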
new_years_views <- map2_dfr( new_years_dates$start_date, new_years_dates$end_date, wx_page_views, project = "en.wikipedia", page_name = c("New Year's Eve", "New Year's Day"), access_method = "all", agent_type = "user", granularity = "daily", .id = "pair" )
head(new_years_views)
new_years_views <- new_years_views %>%
  mutate(
    pair = factor(pair, 1:5, paste(2015:2019, 2016:2020, sep = "/")),
    day = case_when(
      month(date) == 12 & mday(date) == 31 ~ "Eve",
      month(date) == 1 & mday(date) == 1 ~ "Day"
    ),
    day = factor(day, c("Eve", "Day"))
  )
head(new_years_views)
ggplot(new_years_views, aes(x = day, y = views)) +
  geom_line(aes(color = page_name, group = page_name), size = 1) +
  scale_y_continuous(
    minor_breaks = NULL,
    labels = scales::label_number(scale = 1e-3, suffix = "K")
  ) +
  facet_wrap(~ pair, nrow = 1) +
  labs(
    title = "User (non-bot) traffic to New Year's Eve/Day articles",
    color = "Article", x = "New Year's", y = "Pageviews"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")
MediaWiki enables users to create redirects. This is usually done for common typos and aliases, to make it easier for users (both readers and editors) to arrive at a single article. The thing is, when someone visits a redirect page, that page view is not counted towards the total view count for the destination page. To include redirects in the output of wx_page_views:
pvs_with_redirects <- wx_page_views( "en.wikipedia", c("COVID-19 pandemic", "2019–20 coronavirus pandemic"), start_date = "20200401", end_date = "20200401", include_redirects = TRUE )
Caution: this process requires finding all the redirects (within the article namespace) to the requested pages and retrieving those redirects' page views. This has a considerable impact on the speed with which page views are retrieved. However, the function is optimized to work with many pages and will query the MediaWiki API the fewest times it can (since the redirects API supports up to 50 titles per query). Other than that the same rate limits apply.
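If you plan to run many such queries back to back, one option (just a sketch using {purrr}, not something built into {waxer}) is to throttle your own calls, for example by wrapping the querying function with purrr::slowly:

# Hypothetical throttled wrapper that pauses 1 second between calls;
# use it exactly like wx_page_views(), e.g. inside map_dfr()
slow_page_views <- slowly(wx_page_views, rate = rate_delay(pause = 1))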
head(pvs_with_redirects)
On 1 April 2020, the 2019–20 coronavirus pandemic article had `r sum(!is.na(pvs_with_redirects$redirect_name))` redirects to it that received traffic (at least 1 view). The most visited redirects are:
pvs_with_redirects %>% filter(!is.na(redirect_name)) %>% top_n(10, views) %>% select(redirect_name, views) %>% arrange(desc(views))
(The difference between the target article and a very similarly named redirect is that the actual article title uses an en-dash while the redirect uses a hyphen-minus, which is much easier to type on most keyboards than the typographically correct en-dash.)
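If you ever need to tell the two characters apart (they render almost identically), comparing code points makes the difference visible; this is just base R, unrelated to {waxer}:

utf8ToInt("–") # 8211: U+2013 en-dash, as in the article title
utf8ToInt("-") # 45: U+002D hyphen-minus, as in the redirect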
If we wanted to roll up the page views to the redirects into the overall total for the article (and calculate some additional summary metrics), this is easily done within the tidyverse framework:
pvs_with_redirects %>% group_by(project, page_name, date) %>% summarize( total_views = sum(views), redirect_views = sum(views[!is.na(redirect_name)]), redirects = sum(!is.na(redirect_name)) ) %>% ungroup
For consistency, the project parameter in every {waxer} function accepts only a single value, unlike the page_name parameter in wx_page_views(). So if we want to get multiple projects' views (the total number of page views across all of a project's pages), we can use map_dfr to iterate through a named vector of projects, keeping all the other parameters the same:
projects <- c(
  "French" = "fr.wikipedia",
  "Italian" = "it.wikipedia",
  "Spanish" = "es.wikipedia"
)
project_views <- map_dfr(
  projects,
  wx_project_views,
  access_method = "desktop",
  agent_type = "user",
  granularity = "monthly",
  start_date = "20160101",
  end_date = "20201001",
  .id = "language"
)
head(project_views)
ggplot(project_views) + geom_vline(aes(xintercept = as.Date("2020-05-01")), linetype = "dashed") + geom_line(aes(x = date, y = views, color = language), size = 0.8) + geom_text( aes( x = as.Date("2020-05-01"), y = 0, label = "Automated traffic detection", vjust = "bottom", hjust = "left" ), angle = 90, nudge_x = -10 ) + scale_y_continuous( minor_breaks = NULL, labels = scales::label_number(scale = 1e-6, suffix = "M") ) + scale_x_date(date_labels = "%b\n%Y", date_breaks = "3 month", minor_breaks = NULL) + labs( title = "Monthly Wikipedia user (non-bot) traffic, by language", subtitle = "To desktop website", x = "Month", y = "Pageviews", color = "Language" ) + theme_minimal() + theme( panel.grid.major.x = element_line(color = "gray90", size = 0.2), panel.grid.major.y = element_line(color = "gray70", size = 0.5), legend.position = "bottom" )
We can also retrieve a project's pageviews at an hourly granularity. For example:
hourly_views <- wx_project_views( "is.wikipedia", agent_type = "user", granularity = "hourly", start_date = "20191230", end_date = "20200102" )
head(hourly_views)
ggplot(hourly_views) + geom_line(aes(x = time, y = views)) + geom_vline( xintercept = lubridate::ymd( c("20191230", "20191231", "20200101", "20200102", "20200103"), tz = "UTC" ), linetype = "dashed" ) + scale_x_datetime( name = "Time", date_breaks = "6 hours", date_minor_breaks = "1 hour", date_labels = "%H:00\n%d %b" ) + scale_y_continuous(breaks = NULL, minor_breaks = NULL) + labs( title = "User (non-bot) traffic to Icelandic Wikipedia", y = NULL, subtitle = "Hourly pageviews around New Year's Eve 2019, New Year's Day 2020" ) + theme_minimal()
Compared to 11PM-12AM traffic on Dec 30th and January 1st, the 11PM-12AM traffic on Dec 31st is much lower. No surprises there since we would expect many Icelanders to be celebrating and partying around that time instead of reading/editing Wikipedia.
Top 1000 viewed articles each month from Jan 2019 to March 2019 on English Wikipedia:
top_viewed <- wx_top_viewed_pages( project = "en.wikipedia", granularity = "monthly", start_date = "20190101", end_date = "20190301" )
Top 3 articles from each month:
top_viewed %>%
  # Exclude main page and other non-article pages:
  filter(page_name != "Main Page", !grepl("^(Special|Wikipedia)\\:.*", page_name)) %>%
  group_by(date) %>%
  top_n(3, views)
To obtain the monthly estimated number of unique devices that visited German Wikivoyage from January 2018 to October 2020:
unique_devices <- wx_unique_devices( project = "de.wikivoyage", granularity = "monthly", access_site = "all", start_date = "20180101", end_date = "20201031" )
head(unique_devices)
We can visualize this with a periodicity plot:
# Overlay the years by mapping every observation onto a common (2018) calendar:
unique_devices$year <- factor(year(unique_devices$date))
year(unique_devices$date) <- 2018
ggplot(unique_devices) +
  geom_line(aes(x = date, y = devices, color = year), size = 0.8) +
  scale_y_continuous(
    minor_breaks = NULL,
    labels = scales::label_number(scale = 1e-3, suffix = "K")
  ) +
  scale_x_date(date_labels = "%b", date_breaks = "1 month", minor_breaks = NULL) +
  labs(
    title = "YoY monthly unique devices to German Wikivoyage",
    subtitle = "To desktop and mobile website",
    x = "Month", y = "Unique devices", color = "Year"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_line(color = "gray90", size = 0.2),
    panel.grid.major.y = element_line(color = "gray70", size = 0.5),
    legend.position = "bottom"
  )
Suppose we wanted to get the daily number of non-bot active editors of content pages on English Wikipedia in January 2020. This is easy with {waxer}'s wx_active_editors function:
active_editors <- wx_active_editors( project = "en.wikipedia", editor_type = "user", page_type = "content", start_date = "20200101", end_date = "20200131" )
head(active_editors)
Suppose we wanted to visualize these daily counts broken down by activity level:
activity_levels <- c(
  "low" = "1-4",
  "medium" = "5-24",
  "high" = "25-99",
  "very high" = "100+"
)
active_editors_by_activity <- map_dfr(
  activity_levels,
  wx_active_editors,
  project = "en.wikipedia",
  editor_type = "user",
  page_type = "content",
  start_date = "20200101",
  end_date = "20200131",
  .id = "activity_level"
)
head(active_editors_by_activity)
active_editors_by_activity <- active_editors_by_activity %>%
  mutate(
    activity_level = factor(
      activity_level,
      names(activity_levels),
      sprintf("%s (%s edits)", names(activity_levels), activity_levels)
    )
  )
ggplot(active_editors_by_activity, aes(x = date, y = editors)) +
  geom_col(aes(fill = activity_level)) +
  scale_x_date(date_labels = "%a, %d %b") +
  scale_fill_brewer("Activity level", palette = "Set1") +
  labs(
    title = "Number of English Wikipedia article editors in January 2020",
    subtitle = "Broken down by activity level (number of edits)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Similarly, we can obtain the (monthly) totals for several Wikipedias. This time we're not breaking down by activity level (which is the default behavior for this function):
active_editors_by_wiki <- map_dfr( projects, wx_active_editors, editor_type = "user", page_type = "content", start_date = "20150101", end_date = "20201001", granularity = "monthly", .id = "language" )
head(active_editors_by_wiki)
ggplot(active_editors_by_wiki) +
  geom_line(aes(x = date, color = language, y = editors)) +
  scale_x_date(date_breaks = "1 year", minor_breaks = NULL, date_labels = "%b\n%Y") +
  scale_y_continuous(minor_breaks = NULL) +
  facet_wrap(~ language, ncol = 1, scales = "free_y") +
  labs(
    title = "Number of Wikipedia article editors, by language",
    subtitle = "Monthly totals since January 2015",
    y = "Active editors per month"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
How many new articles were created by registered users on Russian Wikipedia in December 2019?
new_pages <- wx_new_pages( "ru.wikipedia", editor_type = "user", page_type = "content", granularity = "monthly", start_date = "20191201", end_date = "20200101" )
new_pages
How has Russian Wikipedia grown over time since it started in May 2001?
total_pages <- wx_total_pages(
  "ru.wikipedia",
  editor_type = "all",
  page_type = "content", # focus on articles
  granularity = "monthly",
  start_date = "20010501",
  end_date = "20201001"
)
tail(total_pages)
ggplot(total_pages) + geom_line(aes(x = date, y = total_pages)) + scale_y_continuous( minor_breaks = NULL, labels = scales::label_number(scale = 1e-6, suffix = "M") ) + scale_x_date(date_labels = "%Y", date_breaks = "12 months", minor_breaks = NULL) + labs( title = "Growth of Russian Wikipedia", x = "Time", y = "Articles" ) + theme_minimal() + theme( panel.grid.major.y = element_line(color = "gray90", size = 0.2), panel.grid.major.x = element_line(color = "gray90", size = 0.5) )
page_edits <- wx_page_edits( "en.wikipedia", c("Coronavirus disease 2019", "COVID-19 pandemic"), start_date = "20200101", end_date = "20201031" )
head(page_edits)
ggplot(page_edits) + geom_line(aes(x = date, y = edits, color = page_name)) + labs( title = "Edits made to English Wikipedia articles on coronavirus", x = "Date", y = "Edits per day", color = "Article" ) + theme_minimal() + theme(legend.position = "bottom")
daily_edits <- map_dfr(
  projects,
  wx_project_edits,
  editor_type = "all",
  page_type = "content",
  start_date = "20200101",
  end_date = "20201001",
  granularity = "daily",
  .id = "language"
)
daily_editors <- map_dfr(
  projects,
  wx_active_editors,
  editor_type = "all",
  page_type = "content",
  start_date = "20200101",
  end_date = "20201001",
  granularity = "daily",
  .id = "language"
)
editing_activity <- daily_edits %>%
  left_join(daily_editors, by = c("project", "language", "date")) %>%
  mutate(edits_per_editor = edits / editors) %>%
  arrange(language, date)
head(editing_activity)
Using {RcppRoll} we can create a rolling 7-day average to smooth out the day-to-day variability, which will be helpful for visualization:
editing_activity %>%
  group_by(language) %>%
  mutate(
    rolling_avg = c(
      rep(NA, 3), # first 3 days
      RcppRoll::roll_mean(edits_per_editor, n = 7),
      rep(NA, 3) # last 3 days
    )
  ) %>%
  ungroup %>%
  ggplot(aes(x = date, color = language)) +
  geom_line(aes(y = edits_per_editor), alpha = 0.25) +
  geom_line(aes(y = rolling_avg)) +
  scale_y_continuous(minor_breaks = NULL) +
  scale_x_date(date_labels = "%d %b\n%Y", date_breaks = "2 weeks", minor_breaks = NULL) +
  labs(
    title = "Average article edits per editor",
    x = "Date", y = "Average edits per editor", color = "Wikipedia"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_line(color = "gray90", size = 0.2),
    panel.grid.major.x = element_line(color = "gray90", size = 0.5),
    legend.position = "bottom"
  )
What were the top 5 most edited articles on English Wikipedia from January through March of 2020?
edited_pages <- wx_top_edited_pages( "en.wikipedia", page_type = "content", granularity = "monthly", start_date = "20200101", end_date = "20200331" )
head(edited_pages)
edited_pages %>% mutate(month = month(date, label = TRUE, abbr = FALSE)) %>% group_by(month) %>% top_n(5, desc(rank)) %>% select(month, rank, page_name)