getEV: Extract Encoded Variables From Encoded Split Or Tiled Data

View source: R/getEncodedVar.R

getEV R Documentation

Extract Encoded Variables From Encoded Split Or Tiled Data

Description

Extracts a single encoded variable from a list or listenv of encoded matrices that contain multiple encoded variables.

Usage

getEV(en, name, ...)

Arguments

en

a (named) list or listenv of matrices, or a single matrix, each containing multiple encoded variables. For lists and matrices containing a single encoded variable, see the oneHot decoder.

name

character, length 1. Column name as found in source data

...

default, empty. Used to convert the class of the extracted matrix to 'dgCMatrix' or 'matrix'.

Details

This function includes code from package "Matrix.utils" v 0.9.8, published under the GPL-3 license and currently removed from CRAN. With thanks to the package author!

NOTE 1: If name is a source data column name that also appears inside other column names, the extracted matrix will combine all encoded matrices whose column names contain this name. Although the extracted matrix is a proper matrix of encodings, it no longer represents a single encoded data column. As a result, upon decoding, the oneHot decoder will report ambiguous decoding.
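The effect described in NOTE 1 can be illustrated with a minimal base R sketch. It does not call getEV or reproduce its internals: the column-naming convention ("source column, underscore, level"), the toy matrix `enc`, and the use of `grepl()` for name matching are all assumptions made for illustration only.

```r
# Hypothetical encoded matrix holding two source columns, "len" and "len_cm".
# The "<source column>_<level>" naming is an assumed convention, not akin's.
enc <- matrix(0, nrow = 3, ncol = 4,
              dimnames = list(NULL, c("len_a", "len_b", "len_cm_x", "len_cm_y")))
enc[cbind(1:3, c(1, 2, 1))] <- 1   # one-hot rows for "len"
enc[cbind(1:3, c(3, 3, 4))] <- 1   # one-hot rows for "len_cm"

# Substring matching on "len" also picks up the "len_cm" columns.
hits_len <- grepl("len", colnames(enc), fixed = TRUE)
combined <- enc[, hits_len, drop = FALSE]
ncol(combined)       # all four columns are selected
rowSums(combined)    # each row has two 1s, so decoding is ambiguous

# The longer name is not contained in the shorter one, so it extracts cleanly.
hits_cm <- grepl("len_cm", colnames(enc), fixed = TRUE)
only_cm <- enc[, hits_cm, drop = FALSE]
rowSums(only_cm)     # exactly one 1 per row
```

A row sum greater than 1 is a quick way to spot that an extracted matrix mixes more than one source column.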

NOTE 2: A warning reading either "single-column encoded matrix for ..." or "number of columns of result is not a multiple of vector length (arg 2) ..." may appear when extracting an encoded categorical variable from a list of encoded matrices. Most likely, this happens with low cardinality encoded variables. The warning signals that most encoded matrices associated with the respective variable contain only one category (level), whereas ideally most of these matrices should contain a mixture of two or more categories, which allows the matrices to be row-bound by category label. One or more of the following suggestions will solve the issue: a) shuffle the data before encoding, b) increase the number of rows in data chunks when encoding, c) if memory allows, opt for the tileHot single-matrix output, as shown in Example 2.1, solution c.
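To see why suggestions a) and b) help, the following base R sketch counts how many distinct categories fall into each chunk of rows. It is only an approximation of the situation in NOTE 2: the helper `levels_per_chunk()` is hypothetical, and it assumes chunks are formed from consecutive rows, which may differ from how tileHot actually splits a file.

```r
data(iris)

# Hypothetical helper: number of distinct categories in each consecutive
# chunk of `rows` rows (assumed chunking scheme, for illustration only).
levels_per_chunk <- function(species, rows) {
  chunk_id <- ceiling(seq_along(species) / rows)
  vapply(split(species, chunk_id),
         function(s) length(unique(s)),
         integer(1))
}

# iris is sorted by Species, so small chunks are mostly single-category.
ordered_14 <- levels_per_chunk(iris$Species, 14)
sum(ordered_14 == 1)     # 9 of the 11 chunks hold a single species

# Larger chunks reduce, but do not remove, single-category chunks.
ordered_60 <- levels_per_chunk(iris$Species, 60)
sum(ordered_60 == 1)     # 1 of the 3 chunks holds a single species

# Shuffling first spreads the categories across chunks.
set.seed(327)
shuffled_14 <- levels_per_chunk(iris$Species[sample.int(150)], 14)
table(shuffled_14)       # compare with table(ordered_14)
```

Under these assumptions, increasing the chunk size alone leaves one single-category chunk at the end of the sorted data, which is consistent with solution b still producing a warning in Example 2.1.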

Value

A dense or sparse matrix of a single encoded variable, which can be decoded with the oneHot decoder.
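As a rough picture of what "a matrix of a single encoded variable" looks like, the sketch below builds a dense one-hot matrix for mtcars$cyl in base R and reverts it. This is not how oneHot or getEV work internally; the construction with `outer()` and the decoding with `max.col()` are stand-in techniques chosen for illustration.

```r
data(mtcars)

# Distinct values of the source column, in sorted order.
lv <- sort(unique(mtcars$cyl))

# Dense one-hot matrix: one row per observation, one column per level.
m <- outer(mtcars$cyl, lv, "==") * 1
colnames(m) <- lv

dim(m)        # one row per car, one column per cylinder count
colSums(m)    # number of cars per cylinder count

# Decode: the position of the single 1 in each row indexes the level.
decoded <- lv[max.col(m, ties.method = "first")]
identical(decoded, mtcars$cyl)
```

Each row contains exactly one 1, which is the property that makes the matrix unambiguously decodable.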

See Also

oneHot, tileHot

Examples


if (interactive()) {

# 1. mtcars data have all columns type "double"

data(mtcars)
a = lapply(mtcars, oneHot, encode)                       # encode mtcars data
print(a)                                                 # list of sparse matrices
b = getEV(a, 'cyl')                                      # extract encoded "cyl" column
print(b)                                                 # a 32x3 sparse matrix
c = oneHot(b, decode)                                    # revert
identical(mtcars$cyl, c)                                 # FALSE. 'mtcars$cyl' is type "double"
isTRUE(all.equal(mtcars$cyl, c))                         # TRUE

# 2. Warnings associated with low cardinality categorical variable

# See tileHot() Examples for full decoding of a dataset

# 2.1 Make 'csv' file
data(iris)                                               # low cardinality "Species"
tempf = tempfile(fileext = '.csv')
write.table(iris, tempf , sep = ',', row.names = FALSE, quote = FALSE)

A = tileHot(readpath = tempf, rows = 14, splits = 3)     # encoded tiles list
print(A[[11]][[5]])                                      # e.g. one-column matrix
a = getEV(A, 'Species')                                  # warning
colSums(a)                                               # incorrect!

# solution b
B = tileHot(readpath = tempf, rows = 60, splits = 3)     # increase number of rows
b = getEV(B, 'Species')                                  # still warning
colSums(b)                                               # incorrect!

# Solution b) could work in combination with solution a)

# solution c
C = tileHot(tempf, rows = 14, splits = 3, orn = TRUE)    # encoded matrix
c = getEV(C, 'Species')                                  # no warning
colSums(c)                                               # correct!

unlink(tempf)

# 2.2 Shuffled 'csv' file
tempf = tempfile(fileext = '.csv')
iris22 = iris[{ set.seed(327); sample.int(150) },]      # shuffled iris data
write.table(iris22, tempf , sep = ',', row.names = FALSE, quote = FALSE)

A = tileHot(readpath = tempf, rows = 14, splits = 3)    # same as above

# solution a
a = getEV(A, 'Species')                                 # no warning
colSums(a)                                              # correct!

unlink(tempf)

}



akin documentation built on May 19, 2026, 5:07 p.m.