Description
It slices, it dices, it handles RequestLogs, MySQL, geodata, user agents, session analysis and namespace matching.
We don't study readers enough, and we finally have the tools to do that. WMUtils contains several functions centred on the RequestLogs. hive_query allows you to query the unsampled logs, while sampled_logs allows you to retrieve the 1:1000 sampled ones. For both data types, log_strptime turns the timestamp format used into POSIXlt timestamps, while log_sieve applies a prototype pageviews filter to the output of sampled_logs. extract_mcc allows you to retrieve MCC codes from zero-tagged requests, while host_handler and project_extractor truncate URLs to hostnames and (in the case of Wikimedia URLs) language.project pairs, respectively.
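To make the URL truncation step concrete, here is a minimal base R sketch of reducing URLs to hostnames and then to language.project pairs. It is not the WMUtils implementation: the helper names url_host and host_project, the regular expressions, and the sample URLs are all assumptions made for illustration.

```r
# Hypothetical helpers for illustration; these are NOT WMUtils functions.
url_host <- function(urls) {
  # Drop the scheme (http://, https://, or a protocol-relative //), then
  # everything from the first slash, question mark, or hash onwards.
  no_scheme <- sub("^([A-Za-z][A-Za-z0-9+.-]*:)?//", "", urls)
  tolower(sub("[/?#].*$", "", no_scheme))
}

host_project <- function(hosts) {
  # Keep "language.project" from hosts such as en.m.wikipedia.org.
  # Hosts that do not match the (assumed) pattern become NA.
  pattern <- "^([a-z-]+)\\.(m\\.|zero\\.)?(wiki[a-z]+|wiktionary)\\.org$"
  out <- rep(NA_character_, length(hosts))
  hit <- grepl(pattern, hosts)
  out[hit] <- sub(pattern, "\\1.\\3", hosts[hit])
  out
}

urls <- c("https://en.m.wikipedia.org/wiki/R_(programming_language)",
          "http://de.wiktionary.org/wiki/Haus",
          "https://example.com/page")
hosts <- url_host(urls)
projects <- host_project(hosts)
```

Returning NA for non-Wikimedia hosts keeps the output the same length as the input, which makes it easy to attach as a column to a data frame of requests.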
If you study editors, our MySQL databases are where all the data lives. mysql_query allows you to query a single database on analytics-store.eqiad.wmnet, while global_query allows you to run the same query across multiple databases. Either way, mw_strptime turns the timestamp format used in our DB into POSIXlt timestamps. And once you're done processing, use mysql_write to stream the results up to the databases again. Need to update previously written rows? No problem! mysql_delete is the function for you.
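As an illustration of the timestamp conversion described above, the following base R sketch parses MediaWiki-style database timestamps (14-character YYYYMMDDHHMMSS strings) into POSIXlt. It only shows the idea; whether mw_strptime uses this exact call is an assumption, and the sample values are made up.

```r
# MediaWiki databases store timestamps as 14-character strings: YYYYMMDDHHMMSS.
# Sample values are invented for this sketch.
mw_ts <- c("20140102030405", "20131231235959")

# strptime() returns POSIXlt; treating the values as UTC is an assumption here.
parsed <- strptime(mw_ts, format = "%Y%m%d%H%M%S", tz = "UTC")

# Human-readable form of the parsed values.
readable <- format(parsed, "%Y-%m-%d %H:%M:%S")

# Differences come back as difftime objects, so request the units explicitly.
gap_secs <- as.numeric(difftime(parsed[1], parsed[2], units = "secs"))
```

Once the values are POSIXlt, the usual date arithmetic and formatting functions in base R apply directly.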
Thanks to MaxMind's C API, we can access geographic data associated with IP addresses: geo_country retrieves country codes, geo_region retrieves region codes or names, geo_city retrieves cities, and geo_tz retrieves tzdata-compatible timezones.
Our user-agent parsing, which uses tobie's ua-parser, is in C++. It's also now in R thanks to Rcpp. If you run into incorrectly identified user agents, poke Oliver, since he's a maintainer on the ua-parser repositories.
For session analysis, WMUtils contains intertimes, session_count, session_length and session_pages, all implemented in C++ for speed (improvements in some cases are up to three orders of magnitude, since R handles recursion really poorly).
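The session metrics named above can be sketched in vectorised base R as follows. This is a rough illustration, not the C++ code in WMUtils: the helper names inter_times, count_sessions and pages_per_session, the 3600-second cutoff, and the sample event times are all assumptions.

```r
# Hypothetical base R helpers; WMUtils' own versions are implemented in C++.
inter_times <- function(timestamps) {
  # Seconds between consecutive events, after sorting them in time order.
  diff(sort(as.numeric(timestamps)))
}

count_sessions <- function(timestamps, threshold = 3600) {
  # A new session starts whenever the gap exceeds the threshold.
  # 3600 seconds is an assumed cutoff, not a documented WMUtils default.
  if (length(timestamps) == 0) return(0L)
  1L + sum(inter_times(timestamps) > threshold)
}

pages_per_session <- function(timestamps, threshold = 3600) {
  # Label each event with a running session id, then count events per id.
  if (length(timestamps) == 0) return(integer(0))
  session_id <- cumsum(c(TRUE, inter_times(timestamps) > threshold))
  as.integer(table(session_id))
}

# Invented event times for one user, in seconds since an arbitrary origin.
events <- c(0, 30, 90, 5000, 5060, 20000)
gaps <- inter_times(events)
n_sessions <- count_sessions(events)
pages <- pages_per_session(events)
```

With these sample events, the two gaps above the cutoff split the sequence into three sessions of three, two and one events.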
namespace_match allows you to convert namespace numbers to localised names, or vice versa, handling the presence of namespaces in reader or editor data. The dataset is also made available as namespace_names, and can be rebuilt with namespace_match_generator.
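The two-way conversion described above amounts to a lookup against a per-project table. The sketch below shows that idea in base R; the table ns_table, its column layout, and the helper ns_lookup are hypothetical, and the real namespace_names dataset may be shaped differently.

```r
# Hypothetical lookup table; the real namespace_names dataset may differ.
ns_table <- data.frame(
  project = c("enwiki", "enwiki", "enwiki", "dewiki", "dewiki", "dewiki"),
  number  = c(1L, 2L, 3L, 1L, 2L, 3L),
  name    = c("Talk", "User", "User talk",
              "Diskussion", "Benutzer", "Benutzer Diskussion"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: numbers to names by default, names to numbers otherwise.
ns_lookup <- function(x, project, to_names = TRUE) {
  rows <- ns_table[ns_table$project == project, ]
  if (to_names) {
    rows$name[match(x, rows$number)]
  } else {
    rows$number[match(x, rows$name)]
  }
}

de_names <- ns_lookup(c(2, 1), "dewiki")
en_number <- ns_lookup("User talk", "enwiki", to_names = FALSE)
```

Using match keeps the result aligned with the input, and unknown namespaces come back as NA rather than being dropped.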
Author(s): Oliver Keyes <okeyes@wikimedia.org>