In sfeuerriegel/ResearchGroupTools: Collection of Custom Helper Functions

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

ResearchGroupTools

ResearchGroupTools provides a collection of utilitiy function for rapid prototyping. These functions facilitate implemenation works related to advanced analytics. As such, it specifically supports data handling, preprocessing, visualization and analytics.

Installation

Using the devtools package, you can easily install the latest development version of ResearchGroupTools with

install.packages("devtools")

# Recommended option: download and install latest version from "GitHub"
devtools::install_github("sfeuerriegel/ResearchGroupTools", dependencies = TRUE)

Notes:

The package will only be shipped via GitHub; CRAN support is not intended due to several hooks.

Usage

This section shows the basic functionality of how accelerate data science in R. First, load the corresponding package ResearchGroupTools.

library(ResearchGroupTools)

By default, the seed for the random number generator is initialized to 0.

Changes to LaTeX {#LaTeX}

Some export routines require a few changes to your LaTeX document in order to get it running. The steps are documented in the help of e.g. \code{correlationMatrix()}; below is a minimal working example:

\documentclass{article}
\usepackage{SIunitx}
  \newcommand{\sym}[1]{\rlap{$^{#1}$}}
  \sisetup{input-symbols={()*}}
\begin{document}

\begin{tabular}{l SSS}
\toprule
\include{table_cor}
\end{tabular}
\end{document}

Above, we included SIunitx, introduced a command \sym, changed the input-symbols and used custom column alignments (S).

Functionality

Library handling

Library() (note the capital "L") loads packages. If not available, these are automatically installed.

Library("texreg")

loadRegressionLibraries() loads and installs common libraries for econometric purposes.

loadRegressionLibraries()

Strings

%+% concatenates strings (as an alterantive to paste()).

"a" %+% "b"
3 %+% 4
do.call(`%+%`, as.list(letters))

Numerical functions

ceil() computes the largest integer less or equal given a numerical value. It is a wrapper for ceiling with a more consistent naming.

ceil(3.4)

cumskewness(), cumkurtosis(), cumsd() (standard deviation) and cumadev() (average deviation) return a vector with cumulative results of the specific function.

library(dplyr)

df <- data_frame(x = 1:10, y = rnorm(10))
cumsd(df$x)

df %>%
  mutate_all(funs("mean" = cummean, "sd" = cumsd))

Data handling

pull(), pull_string() and pull_ith() extract single columns from a dplyr tbl object and return them as a vector.

d <- data_frame(x = 1:10,
               y = rnorm(10))
d %>% pull(x)
d %>% pull("x")

v <- "x"
d %>% pull_string(v)

d %>% pull_ith(1)

completeLowResolutionData() takes data in low resolution (e.g. monthly) and copies its values to match a high resolution (e.g. daily).

ts <- data.frame(Date = seq(from = as.Date("2000-01-01"), to = as.Date("2000-03-31"), by = "1 day"))
df_monthly <- data.frame(Month = c(as.Date("2000-01-31"), as.Date("2000-02-29"), as.Date("2000-03-31")),
                         Values = 1:3)

df_daily <- completeLowResolutionData(ts$Date, df_monthly, "Month")

# example of how to bind things together
ts <- ts %>%
  left_join(df_daily, by = c("Date" = "Month"))

Time series

lags() computes several lags of a vector.

lags(1:5, c(1, 2, 3))
lags(ts(1:5), c(1, 2, 5))

differences() calculates lagged differences of a given order. It is more convenient thant diff() as it adds leading NA values.

differences(1:10)
differences(c(1, 2, 4, 8, 16, 32))
differences(c(1, 2, 4, 8, 16, 32), order = 2)
differences(c(1, 2, 4, 8, 16, 32), na_padding = FALSE)

returns() calculates returns of a time series (similar to diff() for differenes).

returns(1:10)
returns(c(1, 2, 4, 8, 16, 32))
returns(c(1, 2, 4, 8, 16, 32), na_padding = FALSE) # remove trailing NA's

logReturns() computes log-returns (by default, with base exp(1)).

logReturns(c(1, 2, 4, 8, 16, 32), base = 2)

Matrix functions (or data.frame)

findRowsNA() and showRowsNA(), as well as findColsNA() and showColsNA(), help find NA values within a dataset.

m <- matrix(letters[c(1, 2, NA, 3, NA, 4, 5, 6, 7, 8)], ncol = 2, byrow = FALSE)
colnames(m) <- c("x", "y")
m

anyNA(m)      # use built-in routine to test for NA values

findRowsNA(m) # returns indices of that rows
showRowsNA(m) # prints rows with NA values

findColsNA(m) # returns name of that columns
showColsNA(m) # print columns with NA values

last_non_NA() returns the last entry in a vector which is not NA. This is helpful when aggregating high resolution data (see example below).

last_non_NA(c(1, 2, 3, 4, NA))

values <- 1:100
values[sample(1:100, 10)] <- NA
df <- cbind(Year = c(rep(2000, 5), rep(2001, 5)),
              as.data.frame(matrix(values, nrow = 10)))

df %>%
  group_by(Year) %>%
  summarize_each(funs(last_non_NA)) %>%
  ungroup() %>%
  head()

Descriptive statistics

removeOutlierObservations() trims the dataset with regard to certain variables. It thus removes outliers at the 0.5% level at both ends (or any other threshold defined by the argument cutoff).

d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), y = rnorm(200))

d_trimmed <- removeOutlierObservations(d)
dim(d_trimmed)

d_trimmed <- removeOutlierObservations(d, variables = "y", cutoff = 2.0)
dim(d_trimmed)

d_trimmed <- removeOutlierObservations(d, variables = c("x1", "x2"), cutoff = 2.0)
dim(d_trimmed)

descriptiveStatistics() produces pretty summary statistics. By default, it exports the statistics into a LaTeX file. An optional parameter filename can be used to change the filename for the export.

data(USArrests)
descriptiveStatistics(USArrests)
unlink("table_descriptives.tex")

correlationMatrix() computes a pretty correlation matrix. An optional parameter filename can be used to specify a LaTeX file to which the result is exported with significance stars. Note: this requires a few changes to your LaTeX preamble.

correlationMatrix(USArrests)
correlationMatrix(USArrests, filename = "table_cor.tex") # stores output in LaTeX file
unlink("table_cor.tex")

Visualization

jplot() is an alternative to ggplot() but with a journal-style layout

library(ggplot2)

df <- data.frame(x = 1:20,
                 y = 1:20,
                 z = as.factor(rep(1:4, each = 5)))

jplot(df) +
  geom_line(aes(x = x, y = y, color = z, linetype = z))
# For comparison:
# ggplot(df) +
#  geom_line(aes(x = x, y = y, color = z, linetype = z))

jplot(df) +
  geom_point(aes(x = x, y = y, color = z))
# For comparison:
# ggplot(df) +
#   geom_point(aes(x = x, y = y, color = z))

linePlot() is a simple wrapper to ggplot2.

linePlot(1:10)

x <- seq(0, 4, length.out = 100)
linePlot(x, sin(x))

scientificLabels() enables a nice exponential notation in ggplot2 plots.

df <- data.frame(x=rnorm(100)/1000, y=rnorm(100)/1000)
ggplot(df, aes(x=x, y=y)) +
  geom_point() +
  scale_x_continuous(labels=scientificLabels) +
  scale_y_continuous(labels=scientificLabels)

allDigitsLabels() enforces that all digits are displayed in ggplot2 plots.

ggplot(df, aes(x=x, y=y)) +
  geom_point() +
  scale_x_continuous(labels=allDigitsLabels) +
  scale_y_continuous(labels=allDigitsLabels)

Regressions

makeFormula() lets one build formulae based on strings to identify the individual variables.

makeFormula("y", "x")
makeFormula("y", c("x1", "x2", "x3"))
makeFormula("y", c("x1", "x2", "x3"), "dummies")

regression() is a customized, all-in-one routine for ordinary least squares with optional dummy variables. It can filter for a subset of observations, remove outliers at a certain cutoff and remove dummies that are NA. It also changes to covariance matrix internally if desired (note: this requires a different estimator from sandwich).

x1 <- 1:100
x2 <- sin(1:100)
clusters <- rep(c(1, 2), 50)
dummies <- model.matrix(~ clusters)
y <- x1 + x2 + clusters + rnorm(100)
d <- data.frame(x1 = x1, x2 = x2, y = y)

m_dummies <- regression(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90,
                        dummies = "dummies", cutoff = 0.5)
summary(m_dummies)

library(sandwich)

m_dummies <- regression(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90,
                        dummies = "dummies", cutoff = 0.5, vcov = NeweyWest)
summary(m_dummies)

regressionStepwise() is an extension to iteratively incorporate regressors one by one. The resulting list can then easily be exported. It also changes to covariance matrix internally if desired (note: this requires a different estimator from sandwich).

models <- regressionStepwise(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90,
                            dummies = "dummies", cutoff = 0.5)

length(models)

library(texreg)

texreg(models, omit.coef = "dummies")

models <- regressionStepwise(formula("y ~ x1 + x2 + dummies"), data = d, subset = 1:90,
                            dummies = "dummies", cutoff = 0.5, vcov = NeweyWest)
texreg(models, omit.coef = "dummies")

showCoeftest() shows coefficient tests, but hides (dummy) variables starting with a certain string. Note: this is designed for output in the R console or within Rmarkdown. For exporting, better use texreg which has an argument named omit.coef.

showCoeftest(m_dummies, hide = "x") # leaves only the intercept

standardizeCoefficients() extracts standardized coefficients and hides (dummy) variables if needed.

library(vars)
data(Canada)

prod <- differences(as.numeric(Canada[, 2]))
production <- data.frame(Prod = prod, Lag1 = dplyr::lag(prod), Lag2 = dplyr::lag(prod, 2))

m <- lm(Prod ~ Lag1, data = production)
standardizeCoefficients(m)

library(quantreg)
data(stackloss)

qr <- rq(stack.loss ~ stack.x, 0.25)
standardizeCoefficients(qr)

extractRegressionStatistics() extracts key statistics of regression and returns them as a data.frame (so that it can later be stacked via row-wise binding).

x <- 1:10
y <- 1 + x + rnorm(10)
m <- lm(y ~ x)

extractRegressionStatistics(m)

getRowsOutlierRemoval() helps to remove outliers at the 0.5% level at both ends (or any other threshold defined by the argument cutoff).

 d <- data.frame(x = 1:200, y = 1:200 + rnorm(200))
m <- lm(y ~ x, d)                  # fit original model

idx_rm <- getRowsOutlierRemoval(m) # identify row indices of outliers
m <- lm(y ~ x, d[-idx_rm, ])       # refit model with outliers removed

texreg_tvalues() converts a the result of an ordinary least squares regression into in LaTeX. Instead of reporting standard errors, it gives t-values as a common alternative in finance. An optional parameter dummies can be specified which removes certain coefficients in the output. More than one model can be passed via a list.

texreg_tvalues(m_dummies)
texreg_tvalues(m_dummies, hide = "dummies")
texreg_tvalues(list(m, m_dummies))

qr25 <- rq(stack.loss ~ stack.x, 0.25)
qr50 <- rq(stack.loss ~ stack.x, 0.50)
qr75 <- rq(stack.loss ~ stack.x, 0.75)
texreg_tvalues(list(qr25, qr50, qr75))

testDiagnostics() checks if non-autocollreation, no serial correlation, homoskedasticity and no multicollinearity is present.

library(car)
m <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)

testDiagnostics(m)

Time series analysis

standardizeCoefficients() returns standardized coefficients.

var.2c <- VAR(Canada, p = 2, type = "none")

standardizeCoefficients(var.2c$varresult$e)

std <- standardizeCoefficients(var.2c)
std$e

adf() checks a time series for stationarity using the Augmented Dickey-Fuller (ADF) test. It returns the result in a pretty format and, if an optional argument filename is specified, it also exports it as LaTeX.

adf(USArrests, verbose = FALSE)
adf(USArrests, vars = c("Murder", "Rape"), type = "drift",
   filename = "adf.tex", verbose = FALSE)
unlink("adf.tex")

exportAdfDifferences() allows to export an ADF test in levels and in differences in a combined table.

adf_levels <- adf(USArrests)
adf_diff1 <- adf(data.frame(Murder = diff(USArrests$Murder),
                            Assault = diff(USArrests$Assault),
                            UrbanPop = diff(USArrests$UrbanPop),
                            Rape = diff(USArrests$Rape)))
exportAdfDifferences(adf_levels, adf_diff1)
unlink("adf.tex")

cointegrationTable() performs a cointegration test following the Johansen procedure. The output is written as LaTeX into a file named filename.

cointegrationTable(USArrests, vars = c("Murder", "Rape"), K = 2, filename = "cointegration_eigen.tex")
unlink("cointegration_eigen.tex")

plotIrf() returns a ggplot with a nice impulse response function in black/white.

irf <- irf(var.2c, impulse = "e", response = "prod", boot = TRUE)
plotIrf(irf, ylab = "Production")

impulseResponsePlot() combines computation and plot, thereby returning a ggplot with a nice impulse response function in black/white. If the optional argument filename is specified, the plot is automatically saved on the disk.

impulseResponsePlot(var.2c, impulse = "e", response = "prod", ylab = "Production", n.ahead = 5, filename = "irf_e_prod.pdf")
unlink("irf_e_prod.pdf")

testSpecification() checks if non-autocorrelation, normally distributed residuals and homoskedasticity is present.

testSpecification(var.2c)

Hooks to other packages

The default theme of ggplot2 is changed to theme_bw().
coeftostring() from the texreg package is overwritten. This also fixes the behavior of texreg() itself.

coeftostring(-0.000001, digits = 4) # the original function would return "-.0000"

d <- data.frame(y = 1:1000 - 0.0000001, x = 1:1000)
m <- lm(y ~ x, data = d)
texreg(m) # intercept would otherwise be "-0.00"

sanitize.numbers() inside xtable is overwritten.

xtable(matrix(1:4, nrow = 2) * -0.000001) # would otherwise return "-0.00"

Package development

rebuildPackage() builds, loads and checks package during the development process all at once. In particular, the manual is updated.

rebuildPackage()
rebuildPackage(TRUE) # also runs README.Rmd

License

ResearchGroupTools is released under the MIT License

sfeuerriegel/ResearchGroupTools documentation built on May 29, 2019, 8:01 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sfeuerriegel/ResearchGroupTools
Collection of Custom Helper Functions

In sfeuerriegel/ResearchGroupTools: Collection of Custom Helper Functions

ResearchGroupTools

Installation

Usage

Changes to LaTeX {#LaTeX}

Functionality

Library handling

Strings

Numerical functions

Data handling

Time series

Matrix functions (or data.frame)

Descriptive statistics

Visualization

Regressions

Time series analysis

Hooks to other packages

Package development

License

R Package Documentation

Browse R Packages

We want your feedback!

sfeuerriegel/ResearchGroupTools Collection of Custom Helper Functions

In sfeuerriegel/ResearchGroupTools: Collection of Custom Helper Functions

ResearchGroupTools

Installation

Usage

Changes to LaTeX {#LaTeX}

Functionality

Library handling

Strings

Numerical functions

Data handling

Time series

Matrix functions (or data.frame)

Descriptive statistics

Visualization

Regressions

Time series analysis

Hooks to other packages

Package development

License

R Package Documentation

Browse R Packages

We want your feedback!

sfeuerriegel/ResearchGroupTools
Collection of Custom Helper Functions