| tileHot | R Documentation |
One-hot encodes tiled data.
tileHot(readpath, rows, splits, omc = "dgCMatrix", ...)
readpath |
character, length 1. Path to source data that is readable with data.table::fread |
rows |
integer, length 1. Number of rows in each data subset. Internally, it determines the total number of subsets before the vertical split. |
splits |
integer, length 1. Number of vertical data splits in each subset, see splitV. Recommended for
very wide data frames. When |
omc |
character, length 1. Output matrix class. Default: "dgCMatrix". Other option: "matrix". |
... |
reserved for splitH function arguments, such as |
This utility reads the data in disjoint subsets, tiles them and then one-hot encodes each tile. Encoded
tiles are returned as a nested list of matrices, as a single matrix, as a data frame, or as a two-component list
holding a data frame and a sparse matrix, depending on the combination of dropcols, omc and orn values.
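The tiling step can be pictured with base R alone. The sketch below is not the package's implementation; it only illustrates, on the in-memory iris data, how a rows value of 14 and a splits value of 3 could partition a table into tiles. The helper name tile_sketch and the way columns are assigned to vertical groups are assumptions made for illustration.

```r
# Illustrative only: partition a data frame into row subsets, then into
# column groups. tile_sketch() is a hypothetical helper, not a package function.
tile_sketch <- function(d, rows, splits) {
  # Row subsets: consecutive blocks of at most `rows` rows
  row_id <- ceiling(seq_len(nrow(d)) / rows)
  subsets <- split(d, row_id)
  # Column groups: contiguous blocks of columns (an assumed assignment)
  col_id <- cut(seq_len(ncol(d)), breaks = splits, labels = FALSE)
  lapply(subsets, function(s) {
    lapply(split(names(d), col_id), function(cols) s[, cols, drop = FALSE])
  })
}

tiles <- tile_sketch(iris, rows = 14, splits = 3)
n_subsets <- length(tiles)        # ceiling(150 / 14) row subsets
n_tiles <- sum(lengths(tiles))    # row subsets times column groups
```

Under these assumptions the last row subset is shorter than the others, because 150 is not a multiple of 14.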
NOTE 1: Traceability is assured by assembling the data as character names and values from the columns marked for encoding. As a side effect, at run time the encoding is reported as being applied to "integer(ish)" values only, with no loss in accuracy. Empty source data columns gain the "NA" suffix and become single-column, single-valued matrices.
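The idea of encoding every distinct value as its own named column can be sketched in base R. This is a minimal illustration, not the package's code: the helper encode_sketch and the "name_value" label format are assumptions, and the output here is a dense matrix rather than a sparse one.

```r
# Illustrative only: one-hot encode every column by treating each distinct
# value as a character level. The "name_value" label format is an assumption.
encode_sketch <- function(d) {
  blocks <- lapply(names(d), function(n) {
    v <- as.character(d[[n]])
    v[is.na(v)] <- "NA"               # an all-missing column yields one "NA" level
    lev <- unique(v)
    m <- outer(v, lev, "==") * 1L     # rows x levels indicator matrix
    colnames(m) <- paste(n, lev, sep = "_")
    m
  })
  do.call(cbind, blocks)
}

enc <- encode_sketch(iris)
dim(enc)                              # 150 126: one column per distinct value

empty <- encode_sketch(data.frame(x = c(NA, NA)))
colnames(empty)                       # "x_NA": a single, single-valued column
```

The column count of the iris result agrees with the 150x126 matrix reported in the Examples, since every distinct value of every variable receives its own column.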
NOTE 2: This utility implements background processing. Check "Security Considerations" in the callr package documentation.
When orn = FALSE, an unnamed listenv of sparse matrices. Recommended for very large
source data files. Before proceeding with list output, read NOTE 2 in getEV documentation. See Examples 1 and 2.
When orn = TRUE, a matrix.
NOTE 3: In this case, row and column binding operations are avoided to prevent the situations described in NOTE 2 of the getEV documentation. As a result, the output matrix is gradually populated instead of being gradually expanded.
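The difference between populating and expanding a matrix can be shown with a small base R sketch. It is a generic illustration of the preallocate-then-fill pattern under assumed block sizes, using a dense numeric matrix; it does not reproduce how the package fills its own output.

```r
# Illustrative only: fill a preallocated matrix block by block instead of
# growing it with rbind(). The block size is arbitrary.
x <- as.matrix(iris[, 1:4])
n <- nrow(x); p <- ncol(x); rows <- 14L

# Allocate the full result once, with its final dimensions
out <- matrix(NA_real_, nrow = n, ncol = p, dimnames = list(NULL, colnames(x)))

for (s in seq(1L, n, by = rows)) {
  idx <- s:min(s + rows - 1L, n)   # rows covered by this block
  out[idx, ] <- x[idx, ]           # write into existing rows; no binding
}
```

Because each block is written to its own row range, the row order of the result does not depend on binding order.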
While orn = TRUE and dropcols != NULL:
When omc = 'matrix', a data.table containing encoded, as well as unencoded, dropped
columns placed in the leftmost positions.
When omc = 'dgCMatrix', a two-component listenv: a data table containing dropped,
unencoded columns and a sparse matrix containing the encoded columns. The row order in both components is identical.
See Examples below, and Example 2 in getEV documentation.
NOTE 4: In all of the above cases, specific encoded variables can be obtained with the getEV extractor. When orn = TRUE,
variables extracted from matrix outputs and decoded with oneHot are returned as named vectors that have row numbers as names.
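What a decoded, named vector looks like can be sketched in base R. The block below builds a small indicator matrix by hand and decodes it with max.col; it is an assumption-laden stand-in for the getEV and oneHot round trip, intended only to show why unname is applied in the Examples.

```r
# Illustrative only: decode a one-hot block back to its values.
v <- as.character(iris$Species)
lev <- unique(v)
block <- outer(v, lev, "==") * 1L
rownames(block) <- seq_len(nrow(block))   # row numbers as names (assumed layout)
colnames(block) <- lev

# Pick, for each row, the column that holds the 1
decoded <- setNames(colnames(block)[max.col(block, ties.method = "first")],
                    rownames(block))
head(decoded, 3)                          # a named character vector
plain <- unname(decoded)                  # drop the names before comparing
```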
splitH, splitV, oneHot, listenv, Matrix
if (interactive()) {
# 1. Shuffled data
tempf = tempfile(fileext = '.csv')
data(iris)
iris22 = iris[{ set.seed(327); sample.int(150) },] # shuffled iris data
rownames(iris22) <- NULL # remove shuffled row names
write.table(iris22, tempf, sep = ',', row.names = FALSE, quote = FALSE)
# 1.1 Output as List
# In most cases, list output requires shuffled data!
A = tileHot(readpath = tempf
, rows = 14, splits = 3, print = FALSE) # encoded data tiles
print(A) # a listenv
print(A[[1]]) # a snapshot
# 1.2 Retrieve iris22 data from encoded list output
X = sapply(names(iris22), \(n) getEV(A, n)) # extract all encoded columns
Y = lapply(
lapply(X, oneHot, decode)
, unname) # decoded columns are named vectors!
d = as.data.frame(Y)
identical(iris22, d) # TRUE
unlink(tempf)
# 2. Unshuffled data
# Make unshuffled data 'csv' file
tempf = tempfile(fileext = '.csv')
write.table(iris, tempf, sep = ',', row.names = FALSE, quote = FALSE)
# 2.1 Output as list
# List output fails for low-cardinality variables on unshuffled data.
E = tileHot(readpath = tempf
, rows = 14, splits = 3, print = FALSE) # same as above
# 2.2 Retrieve iris data from encoded list output
V = sapply(names(iris), \(n) getEV(E, n)) # warning
W = lapply(
lapply(V, oneHot, decode)
, unname) # decoded columns are named vectors!
dd = as.data.frame(W)
identical(iris, dd) # FALSE
all.equal(iris, dd) # low cardinality "Species"
# 2.3 Output as matrix
# Matrix output handles low cardinality variables. No data shuffling required.
m = tileHot(readpath = tempf # low cardinality "Species"
, rows = 14
, splits = 3
, orn = TRUE # needed for matrix output
, print = FALSE)
print(m) # 150x126 sparse matrix
# 2.4 Retrieve iris data from encoded matrix output
P = sapply(names(iris), \(n) getEV(m, n)) # extract encoded columns
Q = lapply(
lapply(P, oneHot, decode)
, unname) # decoded columns are named vectors!
R = as.data.frame(Q)
identical(iris, R) # TRUE
# 2.5 Output as "data.table" class
D = tileHot(readpath = tempf
, rows = 14
, splits = 3
, omc = 'matrix' # encoded dense matrix
, dropcols = c('Petal.Width', 'Petal.Length') # unencoded columns
, orn = TRUE # needed for matrix output
, print = FALSE)
print(head(D, 10)) # a "data.table" class
dim(D) # 150x63
# 2.6 Output as a 2-component list
Dl = tileHot(readpath = tempf
, rows = 14
, splits = 3
, omc = 'dgCMatrix' # the default class
, dropcols = c('Petal.Width', 'Petal.Length') # unencoded columns
, orn = TRUE # needed for matrix output
, print = FALSE)
print(Dl) # 2-component listenv
print(Dl[[1]]) # unencoded columns
print(Dl[[2]]) # encoded sparse matrix
# iris data can be retrieved from the Dl list in a fashion similar to that described above
unlink(tempf)
}