View source: R/getEncodedVar.R
| getEV | R Documentation |
Extracts a single encoded variable from a list or listenv of encoded matrices containing multiple encoded variables
getEV(en, name, ...)
| en | a (named) list or listenv of matrices, or a single matrix, each containing multiple encoded variables. See the oneHot decoder for lists and matrices containing a single encoded variable |
| name | character, length 1. Column name as found in the source data |
| ... | default, empty. Used to convert the class of the extracted matrix to 'dgCMatrix' or 'matrix' |
This function includes code from package "Matrix.utils" v 0.9.8, published under the GPL-3 license and currently removed from CRAN. With thanks to the package author!
NOTE 1: If name is a source data column name that also appears inside other column names, the extracted matrix
will combine all encoded matrices whose column names contain this name. Although the extracted matrix is a
proper matrix of encodings, it no longer represents a single encoded data column. As a result, upon decoding,
the oneHot decoder will report ambiguous decoding.
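The name collision described in NOTE 1 can be illustrated with base R alone. The sketch below assumes that the lookup behaves like a fixed substring search on column names; the column names and the `grepl()` call are illustrative assumptions, not the package's actual implementation.

```r
# Hypothetical source-data column names (not taken from any real dataset)
cols <- c("age", "stage", "age_group", "height")

# Assumption: the lookup behaves like a fixed substring search
hits <- grepl("age", cols, fixed = TRUE)
matched <- cols[hits]
print(matched)   # "age" "stage" "age_group"

# An exact comparison keeps only the intended column
exact <- cols[cols == "age"]
print(exact)     # "age"
```

Under this assumption, choosing source column names that are not substrings of one another avoids the ambiguity before encoding.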
NOTE 2: a warning reading either "single-column encoded matrix for ..." or "number of columns of
result is not a multiple of vector length (arg 2) ..." may appear when extracting an encoded categorical variable
from a list of encoded matrices. Most likely, this happens with low-cardinality encoded variables. The warning
signals that most encoded matrices associated with the respective variable contain only one category (level)
when, ideally, most of these matrices should contain a mixture of two or more categories, thus allowing
matrix row-binding by category label. One or more of the following suggestions should resolve the issue: a) shuffle
the data before encoding, b) increase the number of rows in data chunks when encoding, c) if memory allows,
opt for tileHot encoding with single-matrix output, as shown in Example 2.1, solution c.
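The condition behind the NOTE 2 warning can be checked before encoding, using base R only. The sketch below counts how many distinct 'Species' levels fall into each consecutive chunk of the iris data. The helper `levels_per_chunk()` is hypothetical, and chunking by consecutive rows is an assumption about how the data are split, not a description of tileHot internals.

```r
data(iris)

# Hypothetical helper: number of distinct levels of x
# in consecutive chunks of `rows` rows
levels_per_chunk <- function(x, rows) {
  chunk_id <- ceiling(seq_along(x) / rows)
  vapply(split(as.character(x), chunk_id),
         function(v) length(unique(v)), integer(1))
}

# Unshuffled iris is sorted by Species
unshuffled <- levels_per_chunk(iris$Species, rows = 14)
print(unshuffled)   # most chunks hold a single level

# Shuffling first (suggestion a) typically mixes levels within chunks
set.seed(327)
shuffled <- levels_per_chunk(iris$Species[sample.int(150)], rows = 14)
print(shuffled)
```

Chunks reporting a single level are the ones likely to yield single-column encoded matrices, so a quick count like this may help decide between shuffling, larger chunks, or single-matrix output.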
A dense or sparse matrix of a single encoded variable, which can be decoded with the oneHot decoder.
oneHot, tileHot
if (interactive()) {
# 1. mtcars data have all columns type "double"
data(mtcars)
a = lapply(mtcars, oneHot, encode) # encode mtcars data
print(a) # list of sparse matrices
b = getEV(a, 'cyl') # extract encoded "cyl" column
print(b) # a 32x3 sparse matrix
c = oneHot(b, decode) # revert
identical(mtcars$cyl, c) # FALSE. 'mtcars$cyl' is type "double"
isTRUE(all.equal(mtcars$cyl, c)) # TRUE
# 2. Warnings associated with low cardinality categorical variable
# See tileHot() Examples for full decoding of a dataset
# 2.1 Make 'csv' file
data(iris) # low cardinality "Species"
tempf = tempfile(fileext = '.csv')
write.table(iris, tempf, sep = ',', row.names = FALSE, quote = FALSE)
A = tileHot(readpath = tempf, rows = 14, splits = 3) # encoded tiles list
print(A[[11]][[5]]) # e.g. one-column matrix
a = getEV(A, 'Species') # warning
colSums(a) # incorrect!
# solution b
B = tileHot(readpath = tempf, rows = 60, splits = 3) # increase number of rows
b = getEV(B, 'Species') # still warning
colSums(b) # incorrect!
# Solution b) could work in combination with solution a)
# solution c
C = tileHot(tempf, rows = 14, splits = 3, orn = TRUE) # encoded matrix
c = getEV(C, 'Species') # no warning
colSums(c) # correct!
unlink(tempf)
# 2.2 Shuffled 'csv' file
tempf = tempfile(fileext = '.csv')
iris22 = iris[{ set.seed(327); sample.int(150) },] # shuffled iris data
write.table(iris22, tempf, sep = ',', row.names = FALSE, quote = FALSE)
A = tileHot(readpath = tempf, rows = 14, splits = 3) # same as above
# solution a
a = getEV(A, 'Species') # no warning
colSums(a) # correct!
unlink(tempf)
}