README.md

WMUtils

An internal utilities library for the Wikimedia Foundation's Research and Data team

Description

Every organisation has its own idiosyncratic data storage methods and engineering solutions, and the Wikimedia Foundation is no exception. To work around this when doing research, we have created WMUtils, a library of utility functions for handling the WMF's various data formats, stores and needs.

Domains

Log reading and database connections

Request logs are stored in two places: unsanitised in HDFS, where they are kept for 30 days, and in a sanitised and sampled form on stat1002. hive_query and sampled_logs, respectively, let you access this data and read it into R. Once you have it, you can use log_strptime to parse the timestamp format, parse_uuids to extract the UUIDs used by the Wikipedia mobile applications, or log_sieve to filter the requests down to those that are considered "pageviews". For general Hive manipulation, hive_range makes a best-guess attempt at the smallest number of date-based partitions a query must run over to cover an expected range of timestamps.
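To illustrate the idea behind date-based partition selection, here is a minimal sketch in base R. It is not the hive_range implementation: the helper name partitions_for_range is hypothetical, and the assumption that the table is partitioned by columns called year and month is ours, made only for the example.

```r
# Hypothetical helper (not part of WMUtils): list the (year, month)
# partitions needed to cover a range of dates.
# Assumption: the Hive table is partitioned by "year" and "month" columns.
partitions_for_range <- function(start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  # Step month by month, starting from the first day of the starting month
  first_of_month <- as.Date(format(start, "%Y-%m-01"))
  months <- seq(first_of_month, end, by = "month")
  data.frame(year = as.integer(format(months, "%Y")),
             month = as.integer(format(months, "%m")))
}

parts <- partitions_for_range("2014-11-20", "2015-01-05")

# Turn the partitions into a WHERE clause fragment for a query
where_clause <- paste(
  sprintf("(year = %d AND month = %d)", parts$year, parts$month),
  collapse = " OR "
)
```

Restricting a query to the partitions it actually needs keeps it from scanning the full 30 days of logs; the real function may well work at a finer granularity than months.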

The rest of our data lives in big MariaDB databases, which can be read from using mysql_query (or global_query to do it en masse), written to with mysql_write, and checked or amended with mysql_exists or mysql_delete, respectively.

MediaWiki idiosyncrasies

mw_strptime replicates log_strptime, but for MediaWiki-specific timestamps, while to_mw allows you to shift POSIX timestamps back into acceptable MediaWiki ones. For namespace matching, namespace_match localises numeric namespace values and turns them into the appropriate strings, or takes localised strings and turns them into universally-accepted numeric values.
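The timestamp conversion can be sketched in base R as follows. MediaWiki stores timestamps as 14-digit strings of the form YYYYMMDDHHMMSS; the two helper names below are hypothetical stand-ins, and the actual mw_strptime and to_mw may differ in arguments and return types.

```r
# Hypothetical stand-ins for illustration; not the WMUtils functions themselves.

# Parse a MediaWiki timestamp (YYYYMMDDHHMMSS) into a POSIXct value in UTC
parse_mw_timestamp <- function(x) {
  as.POSIXct(strptime(x, format = "%Y%m%d%H%M%S", tz = "UTC"))
}

# Format a POSIXct value back into a MediaWiki timestamp string
format_mw_timestamp <- function(x) {
  format(x, format = "%Y%m%d%H%M%S", tz = "UTC")
}

ts <- parse_mw_timestamp("20140102030405")
round_trip <- format_mw_timestamp(ts)
```

Fixing the timezone to UTC in both directions is what makes the round trip lossless, whatever the locale of the machine running the code.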

Geolocation

Through the MaxMind C API, we can take IP addresses and geolocate them. geo_country geolocates to country level and geo_city to city level, while geo_tz and geo_netspeed retrieve a tzdata-compatible timezone and a connection type, respectively.

User-agent parsing

With the assistance of tobie's ua-parser library (specifically the C++ port), we can take user agents and use ua_parse to parse them, retrieving the device, operating system, browser, and browser major/minor versions. This includes spider identification.

Once the agent is parsed, device_classifier takes the device reported by ua-parser and makes a best guess at classifying it as a phone, a tablet or other.

Session analysis

A variety of functions implemented in C++ allow for session identification and analysis. intertimes takes a set of timestamps and turns them into a series of intertime values, which can then be passed to session_length to retrieve the length of the session(s), session_pages to retrieve the number of pages within those sessions, and session_count to get the number of sessions.
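As a rough illustration of the logic, here is a base R sketch that goes from timestamps to per-session page counts and lengths. It is not the C++ implementation in the package: the function name sessionise is hypothetical, and the 3600-second inactivity threshold is an assumption chosen for the example.

```r
# Hypothetical sketch of session reconstruction; not the WMUtils code.
# Assumption: a gap of more than `threshold` seconds starts a new session.
sessionise <- function(timestamps, threshold = 3600) {
  stopifnot(length(timestamps) > 0)
  ts <- sort(as.numeric(timestamps))
  # Intertime values: seconds elapsed between consecutive events
  gaps <- diff(ts)
  # The first event opens session 1; each over-threshold gap opens a new one
  session_id <- c(1L, 1L + cumsum(gaps > threshold))
  data.frame(
    session = sort(unique(session_id)),
    # Number of events (pages) in each session
    pages = as.vector(table(session_id)),
    # Session length: time between the first and last event of the session
    length = as.vector(tapply(ts, session_id, function(x) max(x) - min(x)))
  )
}

# Six events, given as seconds since an arbitrary origin
events <- c(0, 60, 200, 10000, 10030, 50000)
sessions <- sessionise(events)
n_sessions <- nrow(sessions)
```

Note that a single-event session has a length of zero under this definition; whether the package pads such sessions with an estimated dwell time is not something this sketch assumes.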

Dependencies

  1. R (doy)
  2. The external libraries mentioned above (the MaxMind C API and the C++ port of ua-parser)
  3. data.table
  4. lubridate
  5. Rcpp
  6. jsonlite
  7. parallel


wikimedia-research/WMUtils documentation built on May 4, 2019, 5:23 a.m.