ResearchGroupTools provides a collection of utility functions for rapid prototyping. These functions facilitate implementation work related to advanced analytics. Specifically, the package supports data handling, preprocessing, visualization and analytics.
Using the devtools package, you can easily install the latest development version of ResearchGroupTools with
install.packages("devtools")
# Recommended option: download and install the latest version from GitHub
devtools::install_github("sfeuerriegel/ResearchGroupTools", dependencies = TRUE)
Notes:
This section shows the basic functionality and how it can accelerate data science in R. First, load the corresponding package ResearchGroupTools.
library(ResearchGroupTools)
By default, the seed for the random number generator is initialized to 0.
Some export routines require a few changes to your LaTeX document before their output compiles. The steps are documented in the help of, e.g., correlationMatrix(); below is a minimal working example:
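\documentclass{article}
\usepackage{booktabs} % provides \toprule
\usepackage{siunitx}
\newcommand{\sym}[1]{\rlap{$^{#1}$}}
\sisetup{input-symbols={()*}}
\begin{document}
\begin{tabular}{l SSS}
\toprule
\input{table_cor} % \input rather than \include, which cannot be used inside a tabular
\end{tabular}
\end{document}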
Above, we loaded siunitx, introduced a command \sym, changed the input-symbols and used custom column alignments (S).
Library() (note the capital "L") loads packages. If a package is not available, it is installed automatically.
Library("texreg")
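The load-or-install behaviour described above can be sketched in base R. This is only an illustration under assumptions: the helper name load_or_install is hypothetical, it is not the package's implementation, and the install branch presumes a CRAN mirror is configured.

```r
# Hypothetical helper (not part of ResearchGroupTools): attach a package,
# installing it first when it cannot be found.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # assumes a CRAN mirror is configured
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so no installation is triggered here.
load_or_install("stats")
```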
loadRegressionLibraries() loads common libraries for econometric purposes and installs them if needed.
loadRegressionLibraries()
%+% concatenates strings (as an alternative to paste()).
"a" %+% "b"
3 %+% 4
do.call(`%+%`, as.list(letters))
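For readers curious how an infix concatenation operator can be defined, here is a minimal sketch. The operator name %cat% is hypothetical and it is assumed to behave like paste0(); it takes exactly two operands, so it is folded over a list with Reduce() rather than called through do.call(), and it is not necessarily how %+% is implemented.

```r
# Hypothetical two-argument operator, assumed to mirror paste0().
`%cat%` <- function(a, b) paste0(a, b)

ab      <- "a" %cat% "b"                     # joins without a separator
digits  <- 3 %cat% 4                         # numbers are coerced to character
letters_joined <- Reduce(`%cat%`, as.list(letters))  # fold over a list

print(c(ab, digits, letters_joined))
```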
ceil() computes the smallest integer greater than or equal to a given numerical value. It is a wrapper for ceiling() with more consistent naming.
ceil(3.4)
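Since ceil() is described as a wrapper for ceiling(), the rounding direction can be checked with base R alone. The snippet below uses only ceiling() and floor(); the input vector is an illustrative assumption.

```r
x <- c(3.4, -3.4, 5)

up   <- ceiling(x)  # smallest integer not less than each value
down <- floor(x)    # largest integer not greater than each value

print(rbind(x = x, ceiling = up, floor = down))
```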
cumskewness(), cumkurtosis(), cumsd() (standard deviation) and cumadev() (average deviation) return a vector with the cumulative results of the respective function.
library(dplyr)
df <- data_frame(x = 1:10, y = rnorm(10))
cumsd(df$x)
df %>% mutate_all(funs("mean" = cummean, "sd" = cumsd))
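To make "cumulative results" concrete, the following base R sketch computes a cumulative standard deviation: element i is the statistic over the first i observations. The helper name cumsd_sketch is hypothetical, and details such as the value returned for the first element may differ from the package's cumsd().

```r
# Hypothetical helper: element i holds sd() of the first i observations.
cumsd_sketch <- function(x) {
  vapply(seq_along(x), function(i) sd(x[seq_len(i)]), numeric(1))
}

res <- cumsd_sketch(c(1, 2, 3, 4))
print(res)  # the first element is NA because sd() needs two observations
```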
pull(), pull_string() and pull_ith() extract single columns from a dplyr tbl object and return them as a vector.
d <- data_frame(x = 1:10, y = rnorm(10))
d %>% pull(x)
d %>% pull("x")
v <- "x"
d %>% pull_string(v)
d %>% pull_ith(1)
completeLowResolutionData() takes data in low resolution (e.g. monthly) and copies its values to match a high resolution (e.g. daily).
ts <- data.frame(Date = seq(from = as.Date("2000-01-01"), to = as.Date("2000-03-31"), by = "1 day"))
df_monthly <- data.frame(Month = c(as.Date("2000-01-31"), as.Date("2000-02-29"), as.Date("2000-03-31")), Values = 1:3)
df_daily <- completeLowResolutionData(ts$Date, df_monthly, "Month")
# example of how to bind things together
ts <- ts %>% left_join(df_daily, by = c("Date" = "Month"))
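The idea of copying low-resolution values onto a high-resolution grid can be sketched in base R by matching on a shared year-month key. This is an illustration under that assumption, not the package's algorithm; the variable names are made up for the example.

```r
dates <- seq(as.Date("2000-01-01"), as.Date("2000-03-31"), by = "1 day")
monthly <- data.frame(
  Month  = as.Date(c("2000-01-31", "2000-02-29", "2000-03-31")),
  Values = 1:3
)

# Match every day to the monthly row that shares its year-month key.
idx   <- match(format(dates, "%Y-%m"), format(monthly$Month, "%Y-%m"))
daily <- data.frame(Date = dates, Values = monthly$Values[idx])

print(table(daily$Values))  # number of days that received each monthly value
```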
lags() computes several lags of a vector.
lags(1:5, c(1, 2, 3))
lags(ts(1:5), c(1, 2, 5))
differences() calculates lagged differences of a given order. It is more convenient than diff() as it adds leading NA values, so the result has the same length as the input.
differences(1:10)
differences(c(1, 2, 4, 8, 16, 32))
differences(c(1, 2, 4, 8, 16, 32), order = 2)
differences(c(1, 2, 4, 8, 16, 32), na_padding = FALSE)
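The effect of NA padding can be reproduced with base R. The helper diff_padded below is hypothetical and uses the lag argument of diff(); it only illustrates why padded output aligns with the original vector and makes no claim about the semantics of the order argument of differences().

```r
# Hypothetical helper: diff() with leading NA values so lengths match.
diff_padded <- function(x, lag = 1) {
  c(rep(NA, lag), diff(x, lag = lag))
}

x <- c(1, 2, 4, 8, 16, 32)
plain  <- diff(x)         # one element shorter than x
padded <- diff_padded(x)  # same length as x, so it fits into a data frame

print(data.frame(x = x, d = padded))
```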
returns() calculates returns of a time series (similar to diff() for differences).
returns(1:10)
returns(c(1, 2, 4, 8, 16, 32))
returns(c(1, 2, 4, 8, 16, 32), na_padding = FALSE) # remove the NA padding
logReturns() computes log-returns (by default, with base exp(1)).
logReturns(c(1, 2, 4, 8, 16, 32), base = 2)
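As a worked illustration of the quantities involved, the base R snippet below computes simple returns and log-returns for the doubling series used above. It assumes the conventional definitions (relative change, and the difference of logarithms); whether the package pads or scales its output differently is not covered here.

```r
p <- c(1, 2, 4, 8, 16, 32)

# Simple return: (p_t - p_{t-1}) / p_{t-1}
simple_ret <- p[-1] / p[-length(p)] - 1

# Log-return with base 2: log2(p_t) - log2(p_{t-1})
log2_ret <- diff(log(p, base = 2))

print(rbind(simple = simple_ret, log2 = log2_ret))
```

Because the series doubles at every step, both rows equal 1 throughout: a 100% simple return and a log-return of one unit in base 2.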
findRowsNA() and showRowsNA(), as well as findColsNA() and showColsNA(), help find NA values within a dataset.
m <- matrix(letters[c(1, 2, NA, 3, NA, 4, 5, 6, 7, 8)], ncol = 2, byrow = FALSE)
colnames(m) <- c("x", "y")
m
anyNA(m) # use built-in routine to test for NA values
findRowsNA(m) # returns the indices of the rows with NA values
showRowsNA(m) # prints the rows with NA values
findColsNA(m) # returns the names of the columns with NA values
showColsNA(m) # prints the columns with NA values
last_non_NA() returns the last entry in a vector which is not NA. This is helpful when aggregating high-resolution data (see the second example).
last_non_NA(c(1, 2, 3, 4, NA))
values <- 1:100
values[sample(1:100, 10)] <- NA
df <- cbind(Year = c(rep(2000, 5), rep(2001, 5)), as.data.frame(matrix(values, nrow = 10)))
df %>%
  group_by(Year) %>%
  summarize_each(funs(last_non_NA)) %>%
  ungroup() %>%
  head()
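The same aggregation idea can be shown without dplyr. The helper last_non_na_sketch is hypothetical, and returning NA for an all-NA input is an assumption that the package function may handle differently.

```r
# Hypothetical helper: last value that is not NA, or NA if none exists.
last_non_na_sketch <- function(x) {
  kept <- x[!is.na(x)]
  if (length(kept) == 0) return(NA)
  kept[length(kept)]
}

single <- last_non_na_sketch(c(1, 2, 3, 4, NA))

# Aggregate to one value per year, keeping the last available observation.
year   <- c(2000, 2000, 2000, 2001, 2001)
value  <- c(1, NA, 3, 4, NA)
yearly <- tapply(value, year, last_non_na_sketch)

print(yearly)
```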
removeOutlierObservations() trims the dataset with regard to certain variables. By default, it removes outliers at the 0.5% level at both ends (or at any other threshold defined by the argument cutoff).
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), y = rnorm(200))
d_trimmed <- removeOutlierObservations(d)
dim(d_trimmed)
d_trimmed <- removeOutlierObservations(d, variables = "y", cutoff = 2.0)
dim(d_trimmed)
d_trimmed <- removeOutlierObservations(d, variables = c("x1", "x2"), cutoff = 2.0)
dim(d_trimmed)
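Trimming at both tails can be sketched with quantile() from base R. The snippet assumes that the cutoff is a percentage applied to each tail of a single variable, which matches the wording above but is not confirmed against the package source; all names are illustrative.

```r
set.seed(0)
d <- data.frame(x = rnorm(200), y = rnorm(200))

cutoff <- 2.0  # assumed to be the percentage removed at each tail
bounds <- quantile(d$y, probs = c(cutoff, 100 - cutoff) / 100)

# Keep only rows whose y lies within the two quantile bounds.
d_trimmed <- d[d$y >= bounds[1] & d$y <= bounds[2], ]
print(dim(d_trimmed))
```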
descriptiveStatistics() produces pretty summary statistics. By default, it exports the statistics into a LaTeX file. An optional parameter filename can be used to change the filename for the export.
data(USArrests)
descriptiveStatistics(USArrests)
unlink("table_descriptives.tex")
correlationMatrix() computes a pretty correlation matrix. An optional parameter filename can be used to specify a LaTeX file to which the result is exported with significance stars. Note: this requires a few changes to your LaTeX preamble.
correlationMatrix(USArrests)
correlationMatrix(USArrests, filename = "table_cor.tex") # stores output in LaTeX file
unlink("table_cor.tex")
jplot() is an alternative to ggplot() but with a journal-style layout.
library(ggplot2)
df <- data.frame(x = 1:20, y = 1:20, z = as.factor(rep(1:4, each = 5)))
jplot(df) +
  geom_line(aes(x = x, y = y, color = z, linetype = z))
# For comparison:
# ggplot(df) +
#   geom_line(aes(x = x, y = y, color = z, linetype = z))
jplot(df) +
  geom_point(aes(x = x, y = y, color = z))
# For comparison:
# ggplot(df) +
#   geom_point(aes(x = x, y = y, color = z))
linePlot() is a simple wrapper to ggplot2.
linePlot(1:10)
x <- seq(0, 4, length.out = 100)
linePlot(x, sin(x))
scientificLabels() enables a nice exponential notation in ggplot2 plots.
df <- data.frame(x = rnorm(100) / 1000, y = rnorm(100) / 1000)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(labels = scientificLabels) +
  scale_y_continuous(labels = scientificLabels)
allDigitsLabels() enforces that all digits are displayed in ggplot2 plots.
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(labels = allDigitsLabels) +
  scale_y_continuous(labels = allDigitsLabels)
makeFormula() builds formulae from strings that identify the individual variables.
makeFormula("y", "x")
makeFormula("y", c("x1", "x2", "x3"))
makeFormula("y", c("x1", "x2", "x3"), "dummies")
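For comparison, base R offers reformulate() for building a formula from character vectors. The snippet below shows that base function only; it does not cover the third (dummies) argument of makeFormula().

```r
f1 <- reformulate("x", response = "y")
f2 <- reformulate(c("x1", "x2", "x3"), response = "y")

print(f1)
print(f2)
```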
regression() is a customized, all-in-one routine for ordinary least squares with optional dummy variables. It can filter for a subset of observations, remove outliers at a certain cutoff and remove dummies that are NA. It can also change the covariance matrix internally if desired (note: this requires a different estimator from the sandwich package).
x1 <- 1:100
x2 <- sin(1:100)
clusters <- rep(c(1, 2), 50)
dummies <- model.matrix(~ clusters)
y <- x1 + x2 + clusters + rnorm(100)
d <- data.frame(x1 = x1, x2 = x2, y = y)
m_dummies <- regression(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90, dummies = "dummies", cutoff = 0.5)
summary(m_dummies)
library(sandwich)
m_dummies <- regression(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90, dummies = "dummies", cutoff = 0.5, vcov = NeweyWest)
summary(m_dummies)
regressionStepwise() is an extension that iteratively incorporates regressors one by one. The resulting list can then easily be exported. It can also change the covariance matrix internally if desired (note: this requires a different estimator from the sandwich package).
models <- regressionStepwise(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90, dummies = "dummies", cutoff = 0.5)
length(models)
library(texreg)
texreg(models, omit.coef = "dummies")
models <- regressionStepwise(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90, dummies = "dummies", cutoff = 0.5, vcov = NeweyWest)
texreg(models, omit.coef = "dummies")
showCoeftest() shows coefficient tests, but hides (dummy) variables starting with a certain string. Note: this is designed for output in the R console or within R Markdown. For exporting, prefer texreg, which has an argument named omit.coef.
showCoeftest(m_dummies, hide = "x") # leaves only the intercept
standardizeCoefficients() extracts standardized coefficients and hides (dummy) variables if needed.
library(vars)
data(Canada)
prod <- differences(as.numeric(Canada[, 2]))
production <- data.frame(Prod = prod, Lag1 = dplyr::lag(prod), Lag2 = dplyr::lag(prod, 2))
m <- lm(Prod ~ Lag1, data = production)
standardizeCoefficients(m)
library(quantreg)
data(stackloss)
qr <- rq(stack.loss ~ stack.x, 0.25)
standardizeCoefficients(qr)
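As background, one common definition of a standardized coefficient rescales each slope by sd(x) / sd(y), which is equivalent to fitting the model on z-scored data. The sketch below demonstrates that equivalence in base R with the built-in mtcars data; it is an assumption that standardizeCoefficients() follows this definition.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(m)[-1]  # slopes without the intercept

# Rescale each slope by sd(regressor) / sd(response).
std_coef <- b * sapply(mtcars[names(b)], sd) / sd(mtcars$mpg)

# Equivalent route: z-score all variables first, then fit the same model.
z   <- as.data.frame(scale(mtcars[c("mpg", "wt", "hp")]))
m_z <- lm(mpg ~ wt + hp, data = z)

print(rbind(rescaled = std_coef, zscored = coef(m_z)[-1]))
```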
extractRegressionStatistics() extracts key statistics of a regression and returns them as a data.frame (so that it can later be stacked via row-wise binding).
x <- 1:10
y <- 1 + x + rnorm(10)
m <- lm(y ~ x)
extractRegressionStatistics(m)
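The idea of a one-row data.frame per model, stacked with rbind(), can be sketched in base R. Which statistics extractRegressionStatistics() actually reports is not stated here, so the columns below (observations, R-squared, adjusted R-squared, AIC) are assumptions chosen for illustration.

```r
set.seed(0)
x <- 1:10
y <- 1 + x + rnorm(10)
m1 <- lm(y ~ x)
m2 <- lm(y ~ 1)  # intercept-only model for comparison

# Hypothetical helper: one row of summary statistics per fitted model.
stats_row <- function(model) {
  s <- summary(model)
  data.frame(Observations = nobs(model),
             R2           = s$r.squared,
             AdjR2        = s$adj.r.squared,
             AIC          = AIC(model))
}

stacked <- rbind(stats_row(m1), stats_row(m2))
print(stacked)
```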
getRowsOutlierRemoval() helps to remove outliers at the 0.5% level at both ends (or any other threshold defined by the argument cutoff).
d <- data.frame(x = 1:200, y = 1:200 + rnorm(200))
m <- lm(y ~ x, d) # fit original model
idx_rm <- getRowsOutlierRemoval(m) # identify row indices of outliers
m <- lm(y ~ x, d[-idx_rm, ]) # refit model with outliers removed
texreg_tvalues() converts the result of an ordinary least squares regression into LaTeX. Instead of reporting standard errors, it gives t-values as a common alternative in finance. An optional parameter hide can be specified, which removes certain coefficients (e.g. dummies) from the output. More than one model can be passed via a list.
texreg_tvalues(m_dummies)
texreg_tvalues(m_dummies, hide = "dummies")
texreg_tvalues(list(m, m_dummies))
qr25 <- rq(stack.loss ~ stack.x, 0.25)
qr50 <- rq(stack.loss ~ stack.x, 0.50)
qr75 <- rq(stack.loss ~ stack.x, 0.75)
texreg_tvalues(list(qr25, qr50, qr75))
testDiagnostics() checks that there is no autocorrelation, no serial correlation and no multicollinearity, and that homoskedasticity is present.
library(car)
m <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
testDiagnostics(m)
standardizeCoefficients() returns standardized coefficients.
var.2c <- VAR(Canada, p = 2, type = "none")
standardizeCoefficients(var.2c$varresult$e)
std <- standardizeCoefficients(var.2c)
std$e
adf() checks a time series for stationarity using the Augmented Dickey-Fuller (ADF) test. It returns the result in a pretty format and, if an optional argument filename is specified, it also exports it as LaTeX.
adf(USArrests, verbose = FALSE)
adf(USArrests, vars = c("Murder", "Rape"), type = "drift", filename = "adf.tex", verbose = FALSE)
unlink("adf.tex")
exportAdfDifferences() exports an ADF test in levels and in differences as a combined table.
adf_levels <- adf(USArrests)
adf_diff1 <- adf(data.frame(Murder = diff(USArrests$Murder),
                            Assault = diff(USArrests$Assault),
                            UrbanPop = diff(USArrests$UrbanPop),
                            Rape = diff(USArrests$Rape)))
exportAdfDifferences(adf_levels, adf_diff1)
unlink("adf.tex")
cointegrationTable() performs a cointegration test following the Johansen procedure. The output is written as LaTeX into the file given by the argument filename.
cointegrationTable(USArrests, vars = c("Murder", "Rape"), K = 2, filename = "cointegration_eigen.tex")
unlink("cointegration_eigen.tex")
plotIrf() returns a ggplot with a nice impulse response function in black/white.
irf <- irf(var.2c, impulse = "e", response = "prod", boot = TRUE)
plotIrf(irf, ylab = "Production")
impulseResponsePlot() combines computation and plot, thereby returning a ggplot with a nice impulse response function in black/white. If the optional argument filename is specified, the plot is automatically saved to disk.
impulseResponsePlot(var.2c, impulse = "e", response = "prod", ylab = "Production", n.ahead = 5, filename = "irf_e_prod.pdf")
unlink("irf_e_prod.pdf")
testSpecification() checks whether non-autocorrelation, normally distributed residuals and homoskedasticity are present.
testSpecification(var.2c)
The default theme of ggplot2 is changed to theme_bw().
coeftostring() from the texreg package is overwritten. This also fixes the behavior of texreg() itself.
coeftostring(-0.000001, digits = 4) # the original function would return "-.0000"
d <- data.frame(y = 1:1000 - 0.0000001, x = 1:1000)
m <- lm(y ~ x, data = d)
texreg(m) # intercept would otherwise be "-0.00"
sanitize.numbers() inside xtable is overwritten.
xtable(matrix(1:4, nrow = 2) * -0.000001) # would otherwise return "-0.00"
rebuildPackage() builds, loads and checks the package during the development process, all at once. In particular, the manual is updated.
rebuildPackage()
rebuildPackage(TRUE) # also runs README.Rmd
ResearchGroupTools is released under the MIT License.
Copyright (c) 2016 Stefan Feuerriegel