fill_NA: 'fill_NA' function for the imputations purpose.

Description Usage Arguments Value Methods (by class) Note See Also Examples

View source: R/fill_NA.R

Description

Regular imputations to fill the missing data. Non missing independent variables are used to approximate a missing observations for a dependent variable. Quantitative models were built under Rcpp packages and the C++ library Armadillo.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
fill_NA(x, model, posit_y, posit_x, w = NULL, logreg = FALSE, ridge = 1e-06)

## S3 method for class 'data.frame'
fill_NA(x, model, posit_y, posit_x, w = NULL, logreg = FALSE, ridge = 1e-06)

## S3 method for class 'data.table'
fill_NA(x, model, posit_y, posit_x, w = NULL, logreg = FALSE, ridge = 1e-06)

## S3 method for class 'matrix'
fill_NA(x, model, posit_y, posit_x, w = NULL, logreg = FALSE, ridge = 1e-06)

Arguments

x

a numeric matrix or data.frame/data.table (factor/character/numeric/logical) - variables

model

a character - posibble options ("lda","lm_pred","lm_bayes","lm_noise")

posit_y

an integer/character - a position/name of dependent variable

posit_x

an integer/character vector - positions/names of independent variables

w

a numeric vector - a weighting variable - only positive values, Default:NULL

logreg

a boolean - if dependent variable has log-normal distribution (numeric). If TRUE log-regression is evaluated and then returned exponential of results., Default: FALSE

ridge

a numeric - a value added to diagonal elements of the x'x matrix, Default:1e-5

Value

load imputations in a numeric/logical/character/factor (similar to the input type) vector format

Methods (by class)

Note

There is assumed that users add the intercept by their own. The miceFast module provides the most efficient environment, the second recommended option is to use data.table and the numeric matrix data type. The lda model is assessed only if there are more than 15 complete observations and for the lms models if number of independent variables is smaller than number of observations.

See Also

fill_NA_N VIF

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
library(miceFast)
library(dplyr)
library(data.table)
### Data
# airquality dataset with additional variables
data(air_miss)

### Intro: data.table
# IMPUTATIONS
# Imputations with a grouping option (models are separately assessed for each group)
# taking into account provided weights
air_miss[, Solar_R_imp := fill_NA_N(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Solar.R",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]],
  k = 100
), by = .(groups)] %>%
  # Imputations - discrete variable
  .[, x_character_imp := fill_NA(
    x = .SD,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp", "groups")
  )] %>%
  # logreg was used because almost log-normal distribution of Ozone
  # imputations around mean
  .[, Ozone_imp1 := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept"),
    logreg = TRUE
  )] %>%
  # imputations using positions - Intercept, Temp
  .[, Ozone_imp2 := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  )] %>%
  # model with a factor independent variable
  # multiple imputations (average of x30 imputations)
  # with a factor independent variable, weights and logreg options
  .[, Ozone_imp3 := fill_NA_N(
    x = .SD,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 30
  )] %>%
  .[, Ozone_imp4 := fill_NA_N(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 30
  )] %>%
  .[, Ozone_imp5 := fill_NA(
    x = .SD,
    model = "lm_pred",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .SD[["weights"]],
    logreg = TRUE
  ), .(groups)] %>%
  .[, Ozone_imp6 := fill_NA_N(
    x = .SD,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .SD[["weights"]],
    logreg = TRUE,
    k = 10
  ), .(groups)] %>%

  # Average of a few methods
  .[, Ozone_imp_mix := apply(.SD, 1, mean), .SDcols = Ozone_imp1:Ozone_imp6] %>%

  # Protecting against collinearity or low number of observations - across small groups
  # Be carful when using a data.table grouping option
  # because of lack of protection against collinearity or low number of observations.
  # There could be used a tryCatch(fill_NA(...),error=function(e) return(...))

  .[, Ozone_chac_imp := tryCatch(fill_NA(
    x = .SD,
    model = "lda",
    posit_y = "Ozone_chac",
    posit_x = c(
      "Intercept",
      "Month",
      "Day",
      "Temp",
      "x_character_imp"
    ),
    w = .SD[["weights"]]
  ),
  error = function(e) .SD[["Ozone_chac"]]
  ), .(groups)]

# Sample of results
air_miss[which(is.na(air_miss[, 1]))[1:5], ]

### Intro: dplyr
# IMPUTATIONS
air_miss <- air_miss %>%
  # Imputations with a grouping option (models are separately assessed for each group)
  # taking into account provided weights
  group_by(groups) %>%
  do(mutate(., Solar_R_imp = fill_NA(
    x = .,
    model = "lm_pred",
    posit_y = "Solar.R",
    posit_x = c("Wind", "Temp", "Intercept"),
    w = .[["weights"]]
  ))) %>%
  ungroup() %>%
  # Imputations - discrete variable
  mutate(x_character_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  )) %>%
  # logreg was used because almost log-normal distribution of Ozone
  # imputations around mean
  mutate(Ozone_imp1 = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept"),
    logreg = TRUE
  )) %>%
  # imputations using positions - Intercept, Temp
  mutate(Ozone_imp2 = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  )) %>%
  # multiple imputations (average of x30 imputations)
  # with a factor independent variable, weights and logreg options
  mutate(Ozone_imp3 = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  )) %>%
  mutate(Ozone_imp4 = fill_NA_N(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  )) %>%
  group_by(groups) %>%
  do(mutate(., Ozone_imp5 = fill_NA(
    x = .,
    model = "lm_pred",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE
  ))) %>%
  do(mutate(., Ozone_imp6 = fill_NA_N(
    x = .,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 20
  ))) %>%
  ungroup() %>%
  # Average of a few methods
  mutate(Ozone_imp_mix = rowMeans(select(., starts_with("Ozone_imp")))) %>%

  # Protecting against collinearity or low number of observations - across small groups
  # Be carful when using a grouping option
  # because of lack of protection against collinearity or low number of observations.
  # There could be used a tryCatch(fill_NA(...),error=function(e) return(...))
  group_by(groups) %>%
  do(mutate(., Ozone_chac_imp = tryCatch(fill_NA(
    x = .,
    model = "lda",
    posit_y = "Ozone_chac",
    posit_x = c(
      "Intercept",
      "Month",
      "Day",
      "Temp",
      "x_character_imp"
    ),
    w = .[["weights"]]
  ),
  error = function(e) .[["Ozone_chac"]]
  ))) %>%
  ungroup()

# Sample of results
air_miss[which(is.na(air_miss[, 1]))[1:5], ]

Example output

Attaching package:dplyrThe following objects are masked frompackage:stats:

    filter, lag

The following objects are masked frompackage:base:

    intersect, setdiff, setequal, union


Attaching package:data.tableThe following objects are masked frompackage:dplyr:

    between, first, last

   Ozone Solar.R Wind Temp Day Intercept index   weights groups x_character
1:    NA      NA 14.3   56   5         1     5 0.9945367      5        <NA>
2:    NA     194  8.6   69  10         1    10 0.9950548      5   (140,210]
3:    NA      66 16.6   57  25         1    25 1.0124095      5      (0,70]
4:    NA     266 14.9   58  26         1    26 1.0047912      5   (210,280]
5:    NA      NA  8.0   57  27         1    27 0.9998296      5        <NA>
   Ozone_chac Ozone_f Ozone_high Solar_R_imp x_character_imp Ozone_imp1
1:       <NA>    <NA>         NA    88.98554          (0,70]   8.156054
2:       <NA>    <NA>         NA   194.00000       (140,210]  16.942220
3:       <NA>    <NA>         NA    66.00000          (0,70]   6.903988
4:       <NA>    <NA>         NA   266.00000       (210,280]  15.827276
5:       <NA>    <NA>         NA   108.27803          (0,70]  26.047008
   Ozone_imp2 Ozone_imp3 Ozone_imp4 Ozone_imp5 Ozone_imp6 Ozone_imp_mix
1:   6.722250   6.119828   5.315321   5.127981         16      7.906906
2:  12.727608  18.885182  20.408591  20.388899          6     15.892084
3:   6.491414   5.172505   5.029867   5.311351         11      6.651521
4:  14.427569  11.100120   9.791081   5.663351         11     11.301566
5:   3.301693   8.424999   8.425151   6.391228          4      9.431680
   Ozone_chac_imp
1:         (0,20]
2:        (40,60]
3:         (0,20]
4:         (0,20]
5:         (0,20]
# A tibble: 5 x 23
  Ozone Solar.R  Wind  Temp   Day Intercept index weights groups x_character
  <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>   <dbl> <fct>  <chr>      
1    NA      NA  14.3    56     5         1     5   0.995 5      <NA>       
2    NA     194   8.6    69    10         1    10   0.995 5      (140,210]  
3    NA      66  16.6    57    25         1    25   1.01  5      (0,70]     
4    NA     266  14.9    58    26         1    26   1.00  5      (210,280]  
5    NA      NA   8      57    27         1    27   1.00  5      <NA>       
# … with 13 more variables: Ozone_chac <chr>, Ozone_f <fct>, Ozone_high <lgl>,
#   Solar_R_imp <dbl>, x_character_imp <chr>, Ozone_imp1 <dbl>,
#   Ozone_imp2 <dbl>, Ozone_imp3 <dbl>, Ozone_imp4 <dbl>, Ozone_imp5 <dbl>,
#   Ozone_imp6 <dbl>, Ozone_imp_mix <dbl>, Ozone_chac_imp <chr>

miceFast documentation built on July 11, 2021, 1:06 a.m.