hive_query: Run a query against the WMF hive instance
In wikimedia-research/WMUtils: Utilities for Wikimedia staffers


hive_query is a simple wrapper around the command line that makes queries against our Hive/Hadoop infrastructure more convenient.

hive_query(query, user, db = "wmf_raw", dt = TRUE, heapsize = 1024)

`query`	a query, or the location of a .hql file containing a query.
`user`	your hive username (normally your stat100* username)
`db`	the database to use. Set to wmf_raw (which contains the webrequest table) by default.
`dt`	whether to return the results as a data.table (TRUE) or as a data.frame (FALSE). TRUE by default.
`heapsize`	the HADOOP_HEAPSIZE to use. 1024 by default.

a data.table (when dt is TRUE) or a data.frame (when dt is FALSE) containing the results of the query.
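
As a minimal sketch of a typical call, assuming WMUtils is installed and you are on a machine with Hive access. The helper `build_status_query`, the username `exampleuser`, and the column and partition names (`http_status`, `year`, `month`, `day`, `hour`) are illustrative assumptions and should be checked against the table documentation before use.

```r
# Hypothetical helper: builds a query counting requests per HTTP status
# for a single hour of data. Column and partition names are assumptions.
build_status_query <- function(year, month, day, hour) {
  sprintf(
    paste(
      "SELECT http_status, COUNT(*) AS requests",
      "FROM webrequest",
      "WHERE year = %d AND month = %d AND day = %d AND hour = %d",
      "GROUP BY http_status"
    ),
    year, month, day, hour
  )
}

query <- build_status_query(2014L, 11L, 1L, 0L)

# Only attempt the call where WMUtils is available (e.g. on a stat machine).
if (requireNamespace("WMUtils", quietly = TRUE)) {
  results <- WMUtils::hive_query(
    query = query,
    user  = "exampleuser",  # hypothetical username
    db    = "wmf_raw",
    dt    = TRUE            # return a data.table
  )
}
```

Restricting the query to a single hour keeps the scan small, which is usually a sensible first step when exploring a large table.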

The webrequest table is documented on Wikitech, which also provides a set of example queries.

If you want to manipulate rows with Java before they reach R, Nuria has written a brief tutorial on loading UDFs (user-defined functions). The example it provides is a user agent parser, which lets you get the equivalent of ua_parse's output further upstream.
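
Because `query` can also be the location of a .hql file, one possible way to organise a UDF-based query is to write the Hive statements to a file and pass its path. The sketch below assumes that approach. The jar path, the Java class `org.example.UAParserUDF`, the function name `ua_parser`, the `user_agent` column and the username `exampleuser` are hypothetical placeholders, not values taken from the tutorial, and it is an assumption that hive_query accepts a file containing several statements.

```r
# Hive statements that register a UDF from a jar and then use it.
# Every path, class and function name below is a placeholder.
hql <- c(
  "ADD JAR /path/to/hypothetical-udfs.jar;",
  "CREATE TEMPORARY FUNCTION ua_parser AS 'org.example.UAParserUDF';",
  "SELECT ua_parser(user_agent) AS parsed_ua",
  "FROM webrequest",
  "WHERE year = 2014 AND month = 11 AND day = 1 AND hour = 0",
  "LIMIT 10;"
)

# Write the statements to a temporary .hql file.
hql_file <- tempfile(fileext = ".hql")
writeLines(hql, hql_file)

# Pass the file location as the query (only where WMUtils is available).
if (requireNamespace("WMUtils", quietly = TRUE)) {
  parsed <- WMUtils::hive_query(query = hql_file, user = "exampleuser")
}
```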

log_strptime for converting the "dt" column in the webrequest table to POSIXlt, parse_uuids for parsing app unique IDs out of requestlog URLs, and mysql_query and global_query for querying our MySQL databases.
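
To illustrate the kind of conversion log_strptime is described as performing, here is a rough base R equivalent. It uses `strptime` rather than log_strptime itself, whose signature is not shown on this page, and it assumes the "dt" values are ISO 8601-style strings in UTC.

```r
# Example "dt" values, assuming an ISO 8601-style format.
dt_values <- c("2014-11-01T00:00:01", "2014-11-01T13:45:30")

# Base R stand-in for log_strptime: strptime() returns a POSIXlt object.
parsed <- strptime(dt_values, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# POSIXlt exposes date-time components directly, e.g. the hour of each request.
request_hours <- parsed$hour
```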

wikimedia-research/WMUtils documentation built on May 4, 2019, 5:23 a.m.