Description, RequestLogs, MySQL, Geodata, User agents, Session analysis, Namespace matching, Author(s)
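Description
It slices, it dices, it handles request logs, MySQL databases, geodata, user agents,
session analysis and namespace matching.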
RequestLogs
We don't study readers enough, and we finally have the tools to do so. WMUtils contains
several functions centred on the RequestLogs. hive_query allows you to query
the unsampled logs, while sampled_logs allows you to retrieve the 1:1000 sampled
ones. For both kinds of log, log_strptime turns the timestamp format used into
POSIXlt timestamps, while log_sieve applies a prototype pageviews filter to the
output of sampled_logs. extract_mcc allows you to retrieve mobile country codes (MCCs)
from zero-tagged requests, while host_handler and project_extractor truncate URLs to
hostnames and (in the case of Wikimedia URLs) language.project pairs, respectively.
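To make the idea of truncating URLs to hostnames and language.project pairs concrete, here is a minimal base-R sketch. It is not the code behind host_handler or project_extractor: the helper names url_to_host and host_to_project are hypothetical, and the handling of "m" and "zero" subdomains is an assumption about what a Wikimedia hostname may look like.

```r
# Hypothetical helpers for illustration only; not WMUtils' implementation.

# Reduce a URL to its hostname.
url_to_host <- function(urls) {
  # Drop the scheme (e.g. "https://"), if there is one.
  hosts <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", urls)
  # Drop everything from the first path, query or fragment delimiter onwards.
  hosts <- sub("[/?#].*$", "", hosts)
  tolower(hosts)
}

# Reduce a Wikimedia hostname to a language.project pair.
host_to_project <- function(hosts) {
  # Assumption: mobile and zero traffic carries an extra "m" or "zero" label.
  hosts <- sub("\\.(m|zero)\\.", ".", hosts)
  # Drop the top-level domain.
  sub("\\.org$", "", hosts)
}

urls <- c("https://en.wikipedia.org/wiki/Main_Page",
          "http://de.m.wikivoyage.org/wiki/Berlin?action=history")
hosts <- url_to_host(urls)
projects <- host_to_project(hosts)
```

Here hosts holds the bare hostnames and projects holds the language.project pairs; URLs that do not end in .org pass through host_to_project unchanged.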
MySQL
If you study editors, our MySQL databases are where all the data lives. mysql_query
allows you to query a single database on analytics-store.eqiad.wmnet, while
global_query allows you to run a query over multiple databases. Either way,
mw_strptime turns the timestamp format used in our DB into POSIXlt timestamps.
And once you're done processing, use mysql_write to stream the results up to
the databases again. Need to update previously written rows? No problem! mysql_delete
is the function for you.
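As an illustration of the timestamp conversion described above, the following base-R sketch parses MediaWiki-style database timestamps, which are stored as 14-digit YYYYMMDDHHMMSS strings. It is not the source of mw_strptime: the helper name parse_mw_timestamp is hypothetical, and treating the values as UTC is an assumption.

```r
# Hypothetical helper for illustration only; not WMUtils' mw_strptime.
parse_mw_timestamp <- function(x) {
  # MediaWiki stores timestamps as "YYYYMMDDHHMMSS"; UTC is assumed here.
  strptime(as.character(x), format = "%Y%m%d%H%M%S", tz = "UTC")
}

ts <- parse_mw_timestamp(c("20140521093045", "20140521101500"))

# The result is POSIXlt, so ordinary date arithmetic works on it.
gap_secs <- as.numeric(difftime(ts[2], ts[1], units = "secs"))
```

Strings that do not match the format come back as NA rather than raising an error, so it is worth checking for NA after parsing.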
Geodata
Thanks to MaxMind's C API, we can access geographic data associated with IP addresses:
geo_country retrieves country codes, geo_region region codes or names,
geo_city cities, and geo_tz tzdata-compatible timezones.
User agents
Our user-agent parsing, which uses tobie's ua-parser, is in C++. It's also now in R thanks to Rcpp. If you run into incorrectly identified user agents, poke Oliver, since he's a maintainer on the ua-parser repositories.
Session analysis
For session analysis, WMUtils contains intertimes, session_count,
session_length and session_pages, all implemented in C++ for speed
(in some cases the improvement is up to three orders of magnitude; R handles recursion really poorly).
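The general idea behind this kind of session analysis can be sketched in a few lines of base R. This is not the C++ code behind intertimes or session_count, and their actual arguments are not shown here: the helper names inter_times and count_sessions are hypothetical, and the rule that a gap of more than 3600 seconds starts a new session is an assumption chosen for illustration.

```r
# Hypothetical helpers for illustration only; not WMUtils' implementation.

# Gaps, in seconds, between consecutive events from one user.
inter_times <- function(timestamps) {
  diff(sort(timestamps))
}

# Number of sessions, assuming a gap above `threshold` seconds
# marks the start of a new session.
count_sessions <- function(timestamps, threshold = 3600) {
  if (length(timestamps) == 0) {
    return(0L)
  }
  1L + sum(inter_times(timestamps) > threshold)
}

# Event times for one user, as seconds since an arbitrary origin.
events <- c(0, 30, 90, 4000, 4020, 9000)
gaps <- inter_times(events)
n_sessions <- count_sessions(events)
```

With these events the gaps are 30, 60, 3910, 20 and 4980 seconds, two of which exceed the threshold, so the user is counted as having three sessions.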
Namespace matching
namespace_match allows you to convert namespace numbers to localised names, or
vice versa, handling the presence of namespaces in reader or editor data. The dataset is
also made available as namespace_names, and can be rebuilt with
namespace_match_generator.
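A two-way lookup of this kind can be sketched with a small table and match(). This is not how namespace_match is written, and the layout of namespace_names is not shown here: the table ns and the helpers ns_name and ns_id are hypothetical, and the four rows are just the first few English-language MediaWiki namespaces.

```r
# Hypothetical lookup table for illustration only; the real dataset
# covers far more namespaces and languages than this.
ns <- data.frame(
  id = c(0L, 1L, 2L, 3L),
  name = c("", "Talk", "User", "User talk"),  # "" is the main namespace
  stringsAsFactors = FALSE
)

# Namespace numbers to names.
ns_name <- function(ids) {
  ns$name[match(ids, ns$id)]
}

# Namespace names to numbers.
ns_id <- function(names) {
  ns$id[match(names, ns$name)]
}

names_out <- ns_name(c(2L, 0L))
id_out <- ns_id("User talk")
```

Values that are missing from the table come back as NA in either direction.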
Author(s)
Oliver Keyes <okeyes@wikimedia.org>