knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE, message = FALSE, fig.path = "README_figs/README-")
library("devtools")
library("ggplot2")
library("knitr")
load_all()
set.seed(1234)
insp_outlier()
Written by Gabriel Motta for outlier detection in time series, based on the anomalize package. S3 methods are defined for ts and data.frame objects.
Undated object x:
set.seed(1234)
x <- rpois(100, 10)
x[c(10, 30)] <- c(50, 65)
x[50] <- -10
qplot(seq_along(x), x, geom = "line", xlab = "Index", ylab = "Values")
The function call results in a data.table object. If x has no explicit dates, a default date sequence is created.
x_modif <- insp_outlier(as.ts(x))
x_modif[49:55]
x_modif %>%
  melt(id = "DATE", measure = patterns("SERIES")) %>%
  dplyr::mutate(variable = factor(variable, labels = c("Original", "Imputed"))) %>%
  {.[]} %>%
  ggplot(aes(DATE, value, color = variable)) +
  geom_line() +
  labs(x = "Index", y = "Values", color = "") +
  theme(legend.position = "bottom")
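To make the idea of "flag, then impute" concrete, here is a minimal base-R sketch. It is not the algorithm used by insp_outlier() or by anomalize: it assumes a plain IQR fence for detection and linear interpolation for imputation, and the helper name flag_and_impute and its argument k are hypothetical.

```r
# Hypothetical helper, NOT the package's method: flags points outside the
# IQR fences and replaces them by linear interpolation between neighbours.
flag_and_impute <- function(x, k = 3) {
  q <- unname(stats::quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  is_out <- x < q[1] - fence | x > q[2] + fence
  idx <- seq_along(x)
  # Interpolate over the flagged positions using only the remaining points;
  # rule = 2 carries the nearest value to the ends if an endpoint is flagged.
  imputed <- stats::approx(idx[!is_out], x[!is_out], xout = idx, rule = 2)$y
  data.frame(index = idx, original = x, imputed = imputed, outlier = is_out)
}

set.seed(1234)
x <- rpois(100, 10)
x[c(10, 30)] <- c(50, 65)
x[50] <- -10
res <- flag_and_impute(x)
res[res$outlier, ]
```

The returned table keeps both the original and the imputed series side by side, which is the same shape of information the plot above compares.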
When working with multiple time series, pay attention to the margin argument and its companion, idcol or dtcol:
- margin = 1 (series stored row-wise) and idcol = <integer specifying index column>. If there is no index column, set idcol = 0 or idcol = NULL (default);
- margin = 2 (series stored column-wise) and dtcol = <integer specifying a date column>. Likewise, set dtcol = 0 or dtcol = NULL if there is no explicit date column.
Example:
d1 <- cbind(x, x + rnorm(100)) %>%
  t() %>%
  as.data.table() %>%
  .[, ID := 1:2] %>%
  setcolorder('ID') %>%
  {.[]}
d2 <- data.table(DATE = seq(as.Date('2018-01-01'), by = 1, length.out = 100), SERIES1 = x, SERIES2 = round(rcauchy(100), 4))
d1[, 1:10]
# 'Wide' output format
insp_outlier(d1, margin = 1, idcol = 1) %>%
  dplyr::sample_n(10) %>%
  {.[order(ID, PERIOD)]}
# 'Long' output format
insp_outlier(d1, margin = 1, out_format = "long") %>%
  dplyr::sample_n(10) %>%
  {.[order(ID, PERIOD)]}
Example:
d2[1:5]
# 'Wide' output format
insp_outlier(d2, margin = 2, dtcol = 1) %>% head()
# 'Long' output format
insp_outlier(d2, margin = 2, dtcol = 1, out_format = "long") %>%
  dplyr::sample_n(10) %>%
  {.[order(SERIES, DATE)]}
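The row-wise versus column-wise distinction can be illustrated with base R's apply(). This sketch assumes that margin follows the same convention as apply()'s MARGIN argument (1 for rows, 2 for columns), which the examples above suggest but the source does not state outright. The matrix m is made up for illustration.

```r
set.seed(1234)
# Two short series stored row-wise, one series per row
m <- rbind(a = rpois(5, 10), b = rpois(5, 20))

# MARGIN = 1: the function is applied once per row (one result per series)
row_means <- apply(m, 1, mean)

# The same series stored column-wise need MARGIN = 2 to get the same result
col_means <- apply(t(m), 2, mean)

row_means
col_means
```

Both calls return one value per series; only the orientation of the input changes.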
insp_seasonality()
Written by Gabriel Motta for getting the seasonality of time series, based on the forecast package. S3 methods are defined for ts and data.frame objects.
Options for data.frame (check args(insp_seasonality.data.frame) for defaults):
- trend: T/F, whether the time series have a trend;
- margin: margin = 1 for series framed row-wise, in which case also specify the integer position of an index column in idcol; margin = 2 for series framed column-wise, in which case specify the integer position of a date column in dtcol;
- periodicity: the periodicity of the series, e.g. 'day'.
airquality dataset:
data <- airquality %>% as.data.table()
head(data)
setorder(data, Month, Day)
data[, .(Wind, Temp)] %>%
  insp_seasonality(margin = 2, dtcol = 0, periodicity = 'day')
Since 'Wind' and 'Temp' are daily observations spanning less than a year, a seasonality of 365 means that no seasonal period was found.
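For intuition about what "finding a seasonal period" means, the sketch below estimates the dominant period of a simulated series from its raw periodogram, using only base R. This is a swapped-in technique for illustration, not the procedure insp_seasonality() or forecast uses. The simulated series y and its period of 12 are assumptions.

```r
set.seed(1234)
n <- 120
# Simulated series with a known period of 12 observations plus noise
y <- sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 0.3)

# Raw periodogram; frequencies are in cycles per observation
pg <- stats::spec.pgram(y, taper = 0, detrend = TRUE, plot = FALSE)

# The dominant period is the reciprocal of the peak frequency
period <- 1 / pg$freq[which.max(pg$spec)]
period
```

A series without a clear peak in its periodogram has no dominant period to report, which is the situation described above for 'Wind' and 'Temp'.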
subs_any() and subs_na()
subs_na()
Replaces NA occurrences in variables of a table. The syntax is either of the form subs_na(data, col_1 = "?", col_2 = 999) or subs_na(data, list(col_1 = "?", col_2 = 999)), where "?" and 999 are examples of replacement values.
airquality dataset:
data <- airquality %>% as.data.table()
head(data)
data %>% subs_na(Ozone = 999, Solar.R = 999)
Regular expressions for equal value replacement and multiple column matching are supported:
# Replaces NA by 999 at columns that contain '.'
data %>% subs_na("\\." = 999)
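The regular-expression matching can be pictured as a lookup on column names followed by a per-column replacement. The following base-R sketch shows that idea only. The helper replace_na_by_pattern is hypothetical and is not how subs_na() is implemented.

```r
# Hypothetical helper: replaces NA by `value` in every column whose name
# matches the regular expression `pattern`.
replace_na_by_pattern <- function(df, pattern, value) {
  cols <- grep(pattern, names(df), value = TRUE)
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- value
  }
  df
}

# In airquality, only 'Solar.R' contains a literal dot
out <- replace_na_by_pattern(airquality, "\\.", 999)
colSums(is.na(out))
```

Columns whose names do not match the pattern, such as Ozone, keep their missing values.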
subs_any()
Replaces any value x with a value y within data.frame variables. The syntax is subs_any(data, col_1 = list(<value>, <input>), col_2 = list(<value>, <input>)).
As with subs_na(), referring to multiple columns with regular expressions is supported.
data %>% subs_any("^M|^D" = list(5, 10000))
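Assuming list(<value>, <input>) means "replace <value> with <input>", the call above can be approximated in base R as follows. The helper replace_value_by_pattern is hypothetical and only sketches the behaviour. It is not the package's code.

```r
# Hypothetical helper: in every column whose name matches `pattern`,
# replaces entries equal to `value` with `input`.
replace_value_by_pattern <- function(df, pattern, value, input) {
  cols <- grep(pattern, names(df), value = TRUE)
  for (col in cols) {
    hit <- !is.na(df[[col]]) & df[[col]] == value
    df[[col]][hit] <- input
  }
  df
}

# "^M|^D" matches the 'Month' and 'Day' columns of airquality
out <- replace_value_by_pattern(airquality, "^M|^D", 5, 10000)
head(out)
```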
clean()
Methods for cleaning headers and text variables within a data.frame. For factor objects, cleans the levels attribute. Default replacements are stored in the exported list .dict, where the names rm and undln stand for 'remove' and 'underline' actions.
.dict
to_clean <- c('Header with Nº of meas./hour(in measurement unit)', '\u00c0rtificial p\u00fcnctuat\u00ead attribute')
to_clean
# S3 method for char
clean(to_clean)
Additional arguments:
- keep: vector of chars to remain unchanged;
- add_repl: named vector of additional replacements desired.
clean(to_clean, keep = 'º')
clean(to_clean, add_repl = c('H' = 'HHH'))
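One way to think about a replacement dictionary with a keep list is as a named vector applied entry by entry, skipping the protected characters. The sketch below illustrates that reading in base R. The helper apply_replacements and the small repl dictionary are hypothetical and do not reproduce .dict or the internals of clean().

```r
# Hypothetical helper: applies each named replacement in order, skipping
# any pattern listed in `keep`. Patterns are matched literally (fixed = TRUE).
apply_replacements <- function(x, repl, keep = character()) {
  repl <- repl[setdiff(names(repl), keep)]
  for (pattern in names(repl)) {
    x <- gsub(pattern, repl[[pattern]], x, fixed = TRUE)
  }
  x
}

# Made-up dictionary: spaces and slashes become underscores, dots are removed
repl <- c(" " = "_", "." = "", "/" = "_")

cleaned <- apply_replacements("meas./hour rate", repl)
cleaned_keep_space <- apply_replacements("meas./hour rate", repl, keep = " ")
cleaned
cleaned_keep_space
```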
For data.frame, further logical options are:
- col_names: T/F for changes in the table header;
- vars: T/F for changes in the table variables;
- byref: T/F if replacements are made by reference in the table.
df_to_clean <- data.table(
  '"quoted name"' = 1,
  'text' = LETTERS[1:5] %>%
    stringr::str_replace_all(c('C' = 'Ç', 'A' = '"An extract from a book"'))
)
df_to_clean
# default: col_names = T, vars = F
df_to_clean %>%
  clean(add_repl = c('"' = '')) %>%
  clean(vars = T, col_names = F, keep = '\\s')
na_prop() and na_input()
na_prop()
Checks the total and proportion of missing values by data.frame variable.
# Default print
data %>% na_prop()
# Restricted. Useful to retrieve variables that present more than min_prop of NA.
data %>% na_prop(min_prop = 0.1)
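The quantities reported here can be reproduced in base R with colSums() and colMeans() over is.na(). The helper na_summary below is a hypothetical sketch, and its strict "greater than min_prop" filter is an assumption based on the comment above, not a description of na_prop() itself.

```r
# Hypothetical helper: total and proportion of NA per column, keeping only
# columns whose NA proportion exceeds `min_prop`.
na_summary <- function(df, min_prop = 0) {
  out <- data.frame(
    variable = names(df),
    na_total = unname(colSums(is.na(df))),
    na_prop  = unname(colMeans(is.na(df))),
    stringsAsFactors = FALSE
  )
  out[out$na_prop > min_prop, ]
}

res <- na_summary(airquality, min_prop = 0.1)
res
```

With airquality, only Ozone has more than 10% of its values missing, so it is the only row returned.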
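na_input()
Numerical imputation of vectors. Supported types include mean, median, locf, nocb, lin_interp and cub_spline, all of which appear in the graphical example.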
Graphical example:
set.seed(1234)
x <- rpois(20, 2)
x[c(1, 10, 11, 12)] <- NA
qplot(seq_along(x), x, geom = "line", xlab = "Index", ylab = "Values")
dt <- data.table(Index = seq_along(x), x)
for (type in c('mean', 'median', 'locf', 'nocb', 'lin_interp', 'cub_spline')) {
  dt[, stringr::str_to_title(type) := na_input(x, how = type)]
}
dt[, -'x'] %>%
  melt(id = 'Index') %>%
  ggplot(aes(Index, value, color = variable)) +
  geom_line(size = 1) +
  labs(y = 'Values', color = 'Imputation Type') +
  theme(legend.position = 'bottom')
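For readers unfamiliar with the abbreviations, three of the imputation types can be sketched in base R: locf (last observation carried forward), nocb (next observation carried backward) and linear interpolation. The helpers below are hypothetical and are not the implementation behind na_input(). In particular, how the package treats leading or trailing NA values is not stated in the source, so these sketches simply leave them as NA where no neighbour exists.

```r
# Hypothetical helpers, NOT the package's implementation.

# Last observation carried forward; leading NAs stay NA
locf <- function(x) {
  seen <- cumsum(!is.na(x))            # how many observed values so far
  out <- rep(NA_real_, length(x))
  out[seen > 0] <- x[!is.na(x)][seen[seen > 0]]
  out
}

# Next observation carried backward; trailing NAs stay NA
nocb <- function(x) rev(locf(rev(x)))

# Linear interpolation between observed neighbours; NA outside their range
lin_interp <- function(x) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  stats::approx(idx[ok], x[ok], xout = idx)$y
}

x <- c(NA, 10, 9, NA, NA, 6)
locf(x)
nocb(x)
lin_interp(x)
```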
Changing the window parameter:
x <- c(NA, 10, 9, 8, 7, NA, NA, 100, 50)
# Controlling window size
na_input(x, how = 'mean') # window default = Inf
na_input(x, how = 'median', window = 3)