knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE, message = FALSE, fig.path = "README_figs/README-") library("devtools") library("ggplot2") library("knitr") load_all() set.seed(1234)
insp_outlier()
Written by Gabriel Motta for outlier detection in time series, based on anomalize
package. S3 methods are defined for ts
and data.frame
objects.
Undated object x
:
set.seed(1234) x <- rpois(100, 10); x[c(10,30)] <- c(50,65) ; x[c(50)] <- -10 qplot(seq_along(x), x, geom = "line", xlab = "Index", ylab = "Values")
The function call results in a data.table
object. If x
has no explicit dates, a default date sequence is created.
x_modif <- insp_outlier(as.ts(x)) x_modif[49:55]
x_modif %>% melt(id = "DATE", measure = patterns("SERIES")) %>% dplyr::mutate(variable = factor(variable, labels = c("Original","Imputed"))) %>% {.[]} %>% ggplot(aes(DATE,value,color = variable)) + geom_line() + labs(x = "Index", y = "Values", color = "") + theme(legend.position = "bottom")
When working with multiple time series, one has to pay attention to two additional arguments: margin
and idcol, dtcol
.
margin = 1
and idcol = <integer specifying index column>
. If there is no index column, then set idcol = 0
or idcol = NULL
(default);margin = 2
and dtcol = <integer specifying a date column>
. Likewise, dtcol = 0
or dtcol = NULL
if there is not an explicit date column. Example:
d1 = cbind(x,x + rnorm(100)) %>% t() %>% as.data.table() %>% .[, ID := 1:2] %>% setcolorder('ID') %>% {.[]} d2 = data.table(DATE = seq(as.Date('2018-01-01'), by = 1, length.out = 100), SERIES1 = x, SERIES2 = round(rcauchy(100),4))
d1[,1:10] # 'Wide' output format insp_outlier(d1, margin = 1, idcol = 1) %>% dplyr::sample_n(10) %>% {.[order(ID, PERIOD)]} # 'Long' output format insp_outlier(d1, margin = 1, out_format = "long") %>% dplyr::sample_n(10) %>% {.[order(ID, PERIOD)]}
Example:
d2[1:5] # 'Wide' output format insp_outlier(d2, margin = 2, dtcol = 1) %>% head() # 'Long' output format insp_outlier(d2, margin = 2, dtcol = 1, out_format = "long") %>% dplyr::sample_n(10) %>% {.[order(SERIES, DATE)]}
insp_seasonality()
Written by Gabriel Motta for getting seasonality of time series, based on forecast
package. S3 methods are defined for ts
and data.frame
objects.
Options for data.frame
: (check args(insp_outlier.data.frame)
for defaults)
trend
T/F - if time series have trend;margin
margin = 1
for series framed row-wise. In this case, specify as well an integer to an index column on idcol
;margin = 2
for series framed col-wise. Specify an integer to an date column on dtcol
;periodicity
of the series airquality
dataset:data <- airquality %>% as.data.table() head(data) setorder(data, Month, Day) data[, .(Wind, Temp)] %>% insp_seasonality(margin = 2, dtcol = 0, periodicity = 'day')
Considering we have daily observations and a total length no longer than a year for 'Wind' and 'Temp' variables, seasonality 365 means that no periods of seasonality were found.
subs_any()
and subs_na()
subs_na()
Replaces NA occurrences in variables of a table. The syntax is either on the form subs_na(data, col_1 = "?", col_2 = 999)
or subs_na(data, list(col_1 = "?", col_2 = 999))
, where '?' and 999 are examples of replacement values.
airquality
datasetdata <- airquality %>% as.data.table() head(data) data %>% subs_na(Ozone = 999, Solar.R = 999)
Regular expressions for equal value replacement and multiple column matching are supported:
# Replaces NA by 999 at columns that contain '.' data %>% subs_na("\\." = 999)
subs_any()
Replaces any x
value for any y
within data.frame
variables. The syntax is subs_any(data, col_1 = list(<value>, <input>), col_2 = list(<value>, <input>))
.
Just like subs_na()
, refer to multiple columns with regular expressions is supported.
data %>% subs_any("^M|^D" = list(5, 10000))
clean()
Methods for cleaning headers and text variables within data.frame. For factor
objects, cleans levels attribute. Default replacements are stored in exported list .dict
, where names rm
and undln
stand for 'remove' and 'underline' actions.
.dict to_clean <- c('Header with Nº of meas./hour(in measurement unit)', '\u00c0rtificial p\u00fcnctuat\u00ead attribute') to_clean # S3 method for char clean(to_clean)
Additional arguments:
keep
- vector of chars to remain unchanged;add_repl
- named vector of additional replacements desired.clean(to_clean, keep ='º') clean(to_clean, add_repl = c('H' = 'HHH'))
For data.frame
, further logical options are:
col_names
- T/F for changes in table header;vars
- T/F for changes in table variables;byref
- T/F if replacements are made by reference in the table.df_to_clean <- data.table( '"quoted name"' = 1, 'text' = LETTERS[1:5] %>% stringr::str_replace_all( c('C'='Ç', 'A' = '"An extract from a book"')) ) df_to_clean # default: col_names = T, vars = F df_to_clean %>% clean(add_repl = c('"' = '')) %>% clean(vars = T, col_names = F, keep = '\\s')
na_prop()
and na_input()
na_prop()
Check total and proportion of missing values by data.frame
variables.
# Default print data %>% na_prop() # Restricted. Useful to retrieve variables that present more than min_prop of NA. data %>% na_prop(min_prop = 0.1)
na_input()
Numerical imputation of vectors. Supported types:
Graphical example:
set.seed(1234) x <- rpois(20, 2); x[c(1,10,11,12)] <- NA qplot(seq_along(x), x, geom = "line", xlab = "Index", ylab = "Values") dt <- data.table(Index = seq_along(x), x) for(type in c('mean','median','locf','nocb','lin_interp','cub_spline')) { dt[, stringr::str_to_title(type) := na_input(x, how = type)] } dt[, -'x'] %>% melt(id = 'Index') %>% ggplot(aes(Index, value, color = variable)) + geom_line(size = 1) + labs(y = 'Values', color = 'Imputation Type') + theme(legend.position = 'bottom')
Changing window
parameter:
x <- c(NA, 10, 9, 8, 7, NA, NA, 100, 50) # Controling window size na_input(x, how = 'mean') # window default = Inf na_input(x, how = 'median', window = 3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.