pad: Pad the datetime column of a data frame


View source: R/pad.R

Description

pad fills the gaps in an incomplete datetime variable by determining the interval of the data and which time points are missing. It inserts a record for each missing time point. All other variables in the data frame get a missing value at the padded rows.

Usage

pad(
  x,
  interval = NULL,
  start_val = NULL,
  end_val = NULL,
  by = NULL,
  group = NULL,
  break_above = 1
)

Arguments

x

A data frame containing at least one variable of class Date, POSIXct or POSIXlt.

interval

The interval of the returned datetime variable. Any character string that would be accepted by seq.Date() or seq.POSIXt(). The only exception is "DSTday", which is not accepted; pad takes care of daylight saving time when the regular "day" is used. When NULL, the interval will be equal to the interval of the datetime variable. When specified, it cannot be higher than the interval and step size of the input data. See Details.

start_val

An object of class Date, POSIXct or POSIXlt that specifies the start of the returned datetime variable. If NULL it will use the lowest value of the input variable.

end_val

An object of class Date, POSIXct or POSIXlt that specifies the end of the returned datetime variable. If NULL it will use the highest value of the input variable.

by

Only needs to be specified when x contains multiple variables of class Date, POSIXct or POSIXlt. Indicates which variable to use for padding.

group

Optional character vector that specifies the grouping variable(s). Padding will take place within the different groups. When interval is not specified, it will be determined by applying get_interval on the datetime variable as a whole, ignoring groups (see last example).

break_above

Numeric value that indicates the number of rows, in millions, above which the function will break. It is a safety net for situations where the interval is different from what was expected and padding would yield a very large data frame, possibly overflowing memory.
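The arguments above can be illustrated with a short sketch. The data frames and column names (df, df_two, wide, order_date, delivery_date) are made up for illustration and are not part of the package; the comments describe the behavior expected from the argument descriptions.

```r
library(padr)

# Illustrative data: observations every 2 or 4 days, so the interval is "2 day"
df <- data.frame(day   = as.Date(c("2016-04-01", "2016-04-03", "2016-04-07")),
                 value = c(3, 4, 5))

# interval: request a lower interval than the one found in the data
padded_day <- pad(df, interval = "day")

# start_val / end_val: extend the returned range beyond the observed dates
padded_range <- pad(df, interval = "day",
                    start_val = as.Date("2016-03-30"),
                    end_val   = as.Date("2016-04-08"))

# by: needed when the data frame holds more than one datetime variable
df_two <- data.frame(order_date    = as.Date(c("2016-04-01", "2016-04-03")),
                     delivery_date = as.Date(c("2016-04-02", "2016-04-05")),
                     value         = c(3, 4))
padded_by <- pad(df_two, interval = "day", by = "order_date")

# break_above: two timestamps a year apart padded to seconds would give
# more than 31 million rows, which exceeds the default limit of 1 million.
# tryCatch keeps the script running and stores the error message instead.
wide <- data.frame(ts    = as.POSIXct(c("2016-01-01 00:00:00",
                                        "2017-01-01 00:00:00"), tz = "UTC"),
                   value = c(1, 2))
too_big <- tryCatch(pad(wide, interval = "sec"),
                    error = function(e) conditionMessage(e))

nrow(padded_day)
nrow(padded_range)
padded_by
too_big
```

In padded_by the unused datetime column (delivery_date) is treated like any other variable and is missing at the padded row.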

Details

The interval of a datetime variable is the time unit at which the observations occur. The eight intervals in padr are, from high to low: year, quarter, month, week, day, hour, min, and sec. Since padr v0.3.0 the interval is no longer limited to a single unit (intervals like 5 minutes, 6 hours, or 10 days are possible). pad will figure out the interval and the step size of the input variable, and will fill the gaps for the instances that would be expected from the interval and step size but are missing in the input data. Note that when start_val and/or end_val are specified, they are concatenated with the datetime variable before the interval is determined.
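As a small sketch of how the interval and step size come out, get_interval can be called directly on a datetime vector. The vector names below are illustrative; the dates are taken from the Examples section.

```r
library(padr)

# Observations two days apart: unit "day" with step size 2
every_other_day <- as.Date(c("2016-04-01", "2016-04-03"))
int_days <- get_interval(every_other_day)

# Observations two months apart: unit "month" with step size 2
two_monthly <- as.Date(c("2017-01-01", "2017-03-01", "2017-05-01"))
int_months <- get_interval(two_monthly)

# A start_val is concatenated with the datetime variable before the
# interval is determined, so here the interval drops to a single month
start <- as.Date("2016-12-01")
int_with_start <- get_interval(c(start, two_monthly))
padded <- pad(data.frame(dt_var = two_monthly, val = 1:3), start_val = start)

int_days
int_months
int_with_start
padded
```

Because the start value lies one month before the first observation, the padded result runs monthly from 2016-12-01 to 2017-05-01 rather than in steps of two months.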

Rows with missing values in the datetime variables will be retained. However, they will be moved to the end of the returned data frame.
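A minimal sketch of this behavior, using a made-up data frame (with_na) in which the second record has no date. suppressWarnings is used here on the assumption that pad may warn about the missing values.

```r
library(padr)

# Illustrative data: the middle record has a missing date
with_na <- data.frame(day   = as.Date(c("2016-04-01", NA, "2016-04-03")),
                      value = c(3, 9, 4))

# Pad to daily level; the record without a date is kept
res <- suppressWarnings(pad(with_na, interval = "day"))
res
```

The record with value 9 is expected as the last row of res, after the padded dates.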

Value

The data frame x with the datetime variable padded. All non-grouping variables in the data frame will have missing values at the rows that are padded. The result will always be sorted on the datetime variable. If group is not NULL, the result is sorted on the grouping variable(s) first, then on the datetime variable.
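The sort order can be checked with a small sketch on deliberately unsorted input. The data frame (unsorted) and its columns are illustrative only.

```r
library(padr)

# Illustrative data: two groups, rows in arbitrary order
unsorted <- data.frame(id    = c(2, 1, 2, 1),
                       day   = as.Date(c("2016-04-03", "2016-04-04",
                                         "2016-04-01", "2016-04-02")),
                       value = 1:4)

# Each group is padded between its own first and last date
res <- pad(unsorted, interval = "day", group = "id")
res
```

Rows for id 1 come first, followed by id 2, each in date order. The grouping variable is filled in at the padded rows, while value is missing there.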

Examples

library(padr)
simple_df <- data.frame(day = as.Date(c('2016-04-01', '2016-04-03')),
                        some_value = c(3, 4))
pad(simple_df)
pad(simple_df, interval = "day")

library(dplyr) # for the pipe operator
month <- seq(as.Date('2016-04-01'), as.Date('2017-04-01'),
              by = 'month')[c(1, 4, 5, 7, 9, 10, 13)]
month_df <- data.frame(month = month,
                       y = runif(length(month), 10, 20) %>% round)
# forward fill the padded values with tidyr's fill
month_df %>% pad %>% tidyr::fill(y)

# or fill the padded y values with 0
month_df %>% pad %>% fill_by_value(y)

# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                       grp2 = letters[1:2],
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
 arrange(grp1, grp2, date)

# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')

# pad by two grouping vars
x_df_grp %>% pad(group = c('grp1', 'grp2'), interval = "month")

# When using the group argument, the interval is determined over all the
# observations, ignoring the groups.
x <- data.frame(dt_var = as.Date(c("2017-01-01", "2017-03-01", "2017-05-01",
                                   "2017-01-01", "2017-02-01", "2017-04-01")),
                id = rep(1:2, each = 3), val = round(rnorm(6)))
pad(x, group = "id")
# applying pad with do, the interval is determined individually for each group
x %>% group_by(id) %>% do(pad(.))

Example output

pad applied on the interval: 2 day
         day some_value
1 2016-04-01          3
2 2016-04-03          4
         day some_value
1 2016-04-01          3
2 2016-04-02         NA
3 2016-04-03          4

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

pad applied on the interval: month
        month  y
1  2016-04-01 10
2  2016-05-01 10
3  2016-06-01 10
4  2016-07-01 12
5  2016-08-01 11
6  2016-09-01 11
7  2016-10-01 15
8  2016-11-01 15
9  2016-12-01 11
10 2017-01-01 14
11 2017-02-01 14
12 2017-03-01 14
13 2017-04-01 13
pad applied on the interval: month
        month  y
1  2016-04-01 10
2  2016-05-01  0
3  2016-06-01  0
4  2016-07-01 12
5  2016-08-01 11
6  2016-09-01  0
7  2016-10-01 15
8  2016-11-01  0
9  2016-12-01 11
10 2017-01-01 14
11 2017-02-01  0
12 2017-03-01  0
13 2017-04-01 13
pad applied on the interval: month
   grp1 grp2  y       date
1     A    b 13 2016-06-01
2     A <NA> NA 2016-07-01
3     A    a 14 2016-08-01
4     A    a 17 2016-09-01
5     A <NA> NA 2016-10-01
6     A    b 16 2016-11-01
7     B    a 11 2016-01-01
8     B <NA> NA 2016-02-01
9     B <NA> NA 2016-03-01
10    B <NA> NA 2016-04-01
11    B    b 17 2016-05-01
12    B <NA> NA 2016-06-01
13    B    b 13 2016-07-01
14    B <NA> NA 2016-08-01
15    B <NA> NA 2016-09-01
16    B <NA> NA 2016-10-01
17    B    a 17 2016-11-01
18    C    b 15 2016-02-01
19    C    b 12 2016-02-01
20    C <NA> NA 2016-03-01
21    C <NA> NA 2016-04-01
22    C <NA> NA 2016-05-01
23    C    a 12 2016-06-01
24    C <NA> NA 2016-07-01
25    C <NA> NA 2016-08-01
26    C    a 12 2016-09-01
   grp1 grp2  y       date
1     A    a 14 2016-08-01
2     A    a 17 2016-09-01
3     A    b 13 2016-06-01
4     A    b NA 2016-07-01
5     A    b NA 2016-08-01
6     A    b NA 2016-09-01
7     A    b NA 2016-10-01
8     A    b 16 2016-11-01
9     B    a 11 2016-01-01
10    B    a NA 2016-02-01
11    B    a NA 2016-03-01
12    B    a NA 2016-04-01
13    B    a NA 2016-05-01
14    B    a NA 2016-06-01
15    B    a NA 2016-07-01
16    B    a NA 2016-08-01
17    B    a NA 2016-09-01
18    B    a NA 2016-10-01
19    B    a 17 2016-11-01
20    B    b 17 2016-05-01
21    B    b NA 2016-06-01
22    B    b 13 2016-07-01
23    C    a 12 2016-06-01
24    C    a NA 2016-07-01
25    C    a NA 2016-08-01
26    C    a 12 2016-09-01
27    C    b 15 2016-02-01
28    C    b 12 2016-02-01
Warning message:
datetime variable does not vary for 1 of the groups, no padding applied on this / these group(s) 
pad applied on the interval: month
      dt_var id val
1 2017-01-01  1  -2
2 2017-02-01  1  NA
3 2017-03-01  1  -2
4 2017-04-01  1  NA
5 2017-05-01  1  -1
6 2017-01-01  2  -1
7 2017-02-01  2   0
8 2017-03-01  2  NA
9 2017-04-01  2   0
pad applied on the interval: 2 month
pad applied on the interval: month
# A tibble: 7 x 3
# Groups:   id [3]
  dt_var        id   val
  <date>     <int> <dbl>
1 2017-01-01     1    -2
2 2017-03-01     1    -2
3 2017-05-01     1    -1
4 2017-01-01     2    -1
5 2017-02-01     2     0
6 2017-03-01    NA    NA
7 2017-04-01     2     0

padr documentation built on Oct. 1, 2021, 5:07 p.m.