WMUtils: WMUtils - a WMF utilities package in R


Description

It slices, it dices, it handles:

RequestLogs

We don't study readers enough, and we finally have the tools to do that. WMUtils contains several functions centred on the RequestLogs. hive_query allows you to query the unsampled logs, while sampled_logs allows you to retrieve the 1:1000 sampled ones. For both data types, log_strptime turns the timestamp format used into POSIXlt timestamps, while log_sieve applies a prototype pageviews filter to the output of sampled_logs. extract_mcc allows you to retrieve MCC codes from zero-tagged requests, while host_handler and project_extractor truncate URLs to hostnames and (in the case of Wikimedia URLs) language.project pairs, respectively.
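
As a rough illustration of the URL truncation described above, the following base R sketch reduces URLs to hostnames and then to language.project pairs. The helper names url_to_host and host_to_project are hypothetical and are not the WMUtils functions; the handling of mobile ("m") and zero subdomains is an assumption made for this example only.

```r
# Hypothetical helpers for illustration; not the WMUtils implementations.
url_to_host <- function(urls) {
  # Drop the scheme (e.g. "https://"), then everything from the first
  # "/", "?", "#" or ":" onwards, leaving only the hostname.
  hosts <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", urls)
  sub("[/?#:].*$", "", hosts)
}

host_to_project <- function(hosts) {
  # Assumption: mobile and zero traffic is marked by an "m" or "zero"
  # subdomain, and Wikimedia project hostnames end in ".org".
  out <- sub("\\.(m|zero)\\.", ".", hosts)
  sub("\\.org$", "", out)
}

urls <- c("https://en.wikipedia.org/wiki/Main_Page",
          "http://de.m.wikivoyage.org/wiki/Berlin?action=view")
hosts <- url_to_host(urls)
projects <- host_to_project(hosts)
print(hosts)
print(projects)
```

Working on the hostname first keeps the two steps independent, so the project step never has to deal with paths or query strings.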

MySQL

If you study editors, our MySQL databases are where all the data lives. mysql_query allows you to query a single database on analytics-store.eqiad.wmnet, while global_query allows you to run over multiple databases. Either way, mw_strptime turns the timestamp format used in our DB into POSIXlt timestamps. And once you're done processing, use mysql_write to stream the results up to the databases again. Need to update previously written rows? No problem! mysql_delete is the function for you.
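
MediaWiki stores database timestamps as 14-character strings of the form YYYYMMDDHHMMSS, in UTC. A minimal base R sketch of the kind of conversion mw_strptime performs might look like the following; it uses strptime directly, not the WMUtils function, and the example values are made up.

```r
# Made-up example timestamps in the MediaWiki YYYYMMDDHHMMSS format.
mw_ts <- c("20140601120000", "20140601123045")

# strptime returns POSIXlt; MediaWiki timestamps are stored in UTC.
parsed <- strptime(mw_ts, format = "%Y%m%d%H%M%S", tz = "UTC")

# Once parsed, ordinary date arithmetic works.
elapsed <- as.numeric(difftime(parsed[2], parsed[1], units = "secs"))
print(class(parsed))
print(elapsed)
```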

Geodata

Thanks to MaxMind's C API, we can access geographic data associated with IP addresses: geo_country retrieves country codes, geo_region retrieves region codes or names, geo_city retrieves cities, and geo_tz retrieves tzdata-compatible timezones.

User agents

Our user-agent parsing, which uses tobie's ua-parser, is in C++. It's also now in R thanks to Rcpp. If you run into incorrectly identified user agents, poke Oliver, since he's a maintainer on the ua-parser repositories.

Session analysis

For session analysis, WMUtils contains intertimes, session_count, session_length and session_pages, all implemented in C++ for speed (improvements in some cases are up to three orders of magnitude; R handles recursion really poorly).
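
To make the session metrics concrete, here is a vectorised base R sketch of the underlying idea: events from one user are split into sessions wherever the gap between consecutive events exceeds a threshold. This is not the C++ implementation in WMUtils; the 3600-second threshold, the variable names and the sample data are all assumptions for illustration.

```r
# Made-up event times for one hypothetical user, in seconds.
timestamps <- c(0, 30, 90, 5000, 5060, 12000)
threshold <- 3600  # assumed inactivity cut-off, in seconds

ts <- sort(timestamps)
gaps <- diff(ts)  # time between consecutive events

# A new session starts after every gap longer than the threshold.
n_sessions <- 1L + sum(gaps > threshold)
session_id <- cumsum(c(TRUE, gaps > threshold))

# Per-session duration (last event minus first) and number of events.
session_lengths <- as.vector(tapply(ts, session_id, function(x) max(x) - min(x)))
pages_per_session <- tabulate(session_id)

print(n_sessions)
print(session_lengths)
print(pages_per_session)
```

A single-event session has a length of zero here; whether to pad such sessions with an average page time is a design decision this sketch leaves open.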

Namespace matching

namespace_match allows you to convert namespace numbers to localised names, or vice versa, handling the presence of namespaces in reader or editor data. The dataset is also made available as namespace_names, and can be rebuilt via namespace_match_generator.
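
The two-way conversion can be pictured as a lookup against a small table. The sketch below uses a hand-written table of four English namespace names and hypothetical helper functions; it does not use namespace_match or the namespace_names dataset, whose structure is not shown here.

```r
# Hand-written subset of English namespaces; the main namespace (0) has no prefix.
ns <- data.frame(id = c(0L, 1L, 2L, 3L),
                 name = c("", "Talk", "User", "User talk"),
                 stringsAsFactors = FALSE)

# Hypothetical helpers: look up in one column, return the other.
id_to_name <- function(ids) ns$name[match(ids, ns$id)]
name_to_id <- function(nms) ns$id[match(nms, ns$name)]

names_out <- id_to_name(c(2L, 0L, 3L))
ids_out <- name_to_id(c("Talk", "User talk"))
print(names_out)
print(ids_out)
```

Unknown inputs come back as NA from match, which makes unmatched namespaces easy to spot.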

Author(s)

Oliver Keyes <okeyes@wikimedia.org>


wikimedia-research/WMUtils documentation built on May 4, 2019, 5:23 a.m.