hive_query: Run a query against the WMF Hive instance

Description

hive_query is a simple wrapper around the command line that makes queries against our Hive/Hadoop infrastructure more convenient.

Usage

hive_query(query, user, db = "wmf_raw", dt = TRUE, heapsize = 1024)

Arguments

query

a query, or the location of a .hql file containing a query.

user

your hive username (normally your stat100* username)

db

the database to use. Set to wmf_raw (which contains the webrequest table) by default.

dt

whether to return the results as a data.table (TRUE, the default) or as a data.frame (FALSE).

heapsize

the HADOOP_HEAPSIZE to use. 1024 by default.

Value

a data.frame or data.table (depending on dt) containing the results of the query.
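
As an illustration of the two forms the query argument accepts, here is a minimal sketch rather than an official example. The user name, the selected columns and the partition values are placeholders chosen for illustration, not taken from this page, and the call is guarded so that it only runs where WMUtils and a hive client are actually available.

```r
# Sketch only: user name, columns and partition values are placeholders.
query <- paste(
  "SELECT uri_host, http_status",
  "FROM webrequest",
  "WHERE year = 2015 AND month = 1 AND day = 1 AND hour = 0",
  "LIMIT 10;"
)

# The same query can be saved to a .hql file and passed by path instead.
hql_file <- tempfile(fileext = ".hql")
writeLines(query, hql_file)

# Only attempt the calls where WMUtils and the hive client are present.
if (requireNamespace("WMUtils", quietly = TRUE) && nzchar(Sys.which("hive"))) {
  # Inline query, returned as a plain data.frame (dt = FALSE).
  results <- WMUtils::hive_query(query, user = "your_username",
                                 db = "wmf_raw", dt = FALSE)
  # Query read from the .hql file, returned as a data.table (default).
  from_file <- WMUtils::hive_query(hql_file, user = "your_username")
  str(results)
}
```

Restricting the query to a narrow set of partitions and adding a LIMIT while testing keeps the job small; whether that is needed depends on the table being queried.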

Handling our Hadoop/Hive setup

The webrequest table is documented on Wikitech, which also provides a set of example queries.

If you want to manipulate rows with Java before they reach you, Nuria has written a brief tutorial on loading UDFs. The example it provides is a user agent parser, which gives you the equivalent of ua_parse's output further upstream.

See Also

log_strptime for converting the "dt" column in the webrequest table to POSIXlt, parse_uuids for parsing app unique IDs out of requestlog URLs, and mysql_query and global_query for querying our MySQL databases.
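
For readers without WMUtils to hand, the kind of conversion log_strptime is listed for can be approximated in base R. This is a sketch under the assumption that "dt" values are ISO 8601 strings of the form shown below and are in UTC; it is not log_strptime's implementation, and the sample values are made up.

```r
# Made-up sample values, assuming the "dt" column holds ISO 8601 strings.
dt_values <- c("2015-01-01T00:00:07", "2015-01-01T13:45:30")

# Base R strptime() returns a POSIXlt object; the time zone is an assumption.
parsed <- strptime(dt_values, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

class(parsed)
```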


wikimedia-research/WMUtils documentation built on May 4, 2019, 5:23 a.m.