hive_query is a simple wrapper around the command line that makes queries against our Hive/Hadoop infrastructure more convenient.
hive_query(query, user, db = "wmf_raw", dt = TRUE, heapsize = 1024)
query: a query, or the location of a .hql file containing a query.
user: your Hive username (normally your stat100* username).
db: the database to use. Set to wmf_raw (which contains the webrequest table) by default.
dt: whether to return the results as a data.table or not. TRUE by default.
heapsize: the HADOOP_HEAPSIZE to use. 1024 by default.
hive_query returns a data.table (when dt is TRUE) or a data.frame containing the results of the query.
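As a rough illustration of the usage above, the following sketch builds a query string and passes it to hive_query. The query text, the column names (uri_host and the year/month/day/hour partition fields) and the username are assumptions made up for this example and are not taken from this page. The call is guarded so that the snippet only attempts to reach Hive when hive_query is actually available in the session.

```r
# Hypothetical query: the column and partition names are assumptions for
# illustration only; check the table documentation before relying on them.
query <- "SELECT uri_host, COUNT(*) AS requests
          FROM webrequest
          WHERE year = 2014 AND month = 10 AND day = 1 AND hour = 0
          GROUP BY uri_host
          LIMIT 10;"

# Only run when hive_query is loaded, i.e. on a machine with Hive access.
if (exists("hive_query")) {
  # "your_username" is a placeholder; dt = FALSE asks for a plain data.frame.
  results <- hive_query(query, user = "your_username", db = "wmf_raw",
                        dt = FALSE, heapsize = 1024)
  print(head(results))
}
```

For longer queries, the same call can instead be given the path to a .hql file as its first argument, as described for the query argument.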
The webrequest table is documented on Wikitech, which also provides a set of example queries.
If you want to manipulate the rows with Java before they reach you, Nuria has written a brief tutorial on loading UDFs (user-defined functions) that should help. The example provided is a user agent parser, which gives you the equivalent of ua_parse's output further upstream.
See also log_strptime for converting the "dt" column in the webrequest table to POSIXlt, parse_uuids for parsing app unique IDs out of requestlog URLs, and mysql_query and global_query for querying our MySQL databases.
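To give a sense of what converting the "dt" column to POSIXlt involves, here is a minimal base R sketch. It is not the implementation of log_strptime. It assumes the timestamps arrive as ISO 8601 strings without a timezone suffix and that they should be read as UTC, and the sample values are invented for illustration.

```r
# Invented sample values in the timestamp form assumed for the "dt" column.
dt_values <- c("2014-10-01T00:00:01", "2014-10-01T13:45:30")

# Base R conversion to POSIXlt; UTC is an assumption about the log timezone.
parsed <- strptime(dt_values, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# POSIXlt exposes the date-time components directly, e.g. the hour.
parsed$hour
```

Having the hour, minute and other components available as fields is what makes POSIXlt convenient for bucketing requests by time of day.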