tileHot: One-hot Encoder Of Tiled Data

View source: R/tileHot.R

tileHot R Documentation

One-hot Encoder Of Tiled Data

Description

One-hot encodes tiled data.

Usage

tileHot(readpath, rows, splits, omc = "dgCMatrix", ...)

Arguments

readpath

character, length 1. Path to source data that is readable with data.table::fread.

rows

integer, length 1. Number of rows in each data subset. Internally, it determines the total number of subsets before the vertical split.

splits

integer, length 1. Number of vertical data splits in each subset; see splitV. Recommended for very wide data frames. When splits = 0, no vertical splitting occurs.

omc

character, length 1. Output matrix class. Default: "dgCMatrix". Other option: "matrix".

...

Reserved for splitH function arguments, such as dropcols, or orn = TRUE, which is needed for single-matrix output.

Details

This utility reads the data in disjoint subsets, tiles them and then one-hot encodes each tile. Encoded tiles are returned as a nested list of matrices, as a single matrix, as a data frame, or as a two-component list holding a data frame and a sparse matrix. The output type is decided by the combination of dropcols, omc and orn values.
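The tiling step described above can be pictured with a small base R sketch. This is not the package's implementation: tile_sketch is a hypothetical helper, and it assumes that splits counts the column groups produced within each row subset, which the documentation does not state explicitly.

```r
# Hypothetical helper, not part of the package: cut a data frame into tiles.
# rows   = number of rows in each horizontal subset
# splits = assumed number of column groups per subset (0 = no vertical split)
tile_sketch <- function(df, rows, splits = 0L) {
  # Horizontal step: consecutive, disjoint row blocks of at most `rows` rows
  row_id     <- ceiling(seq_len(nrow(df)) / rows)
  row_blocks <- split(seq_len(nrow(df)), row_id)

  # Vertical step: contiguous column groups of near-equal width
  n_groups   <- min(max(1L, as.integer(splits)), ncol(df))
  col_id     <- ceiling(seq_len(ncol(df)) * n_groups / ncol(df))
  col_blocks <- split(seq_len(ncol(df)), col_id)

  # Nested list: one element per row block, each holding its column groups
  lapply(row_blocks, function(r)
    lapply(col_blocks, function(k) df[r, k, drop = FALSE]))
}

tiles <- tile_sketch(iris, rows = 14, splits = 3)
length(tiles)            # number of row subsets
length(tiles[[1]])       # number of column groups in each subset
dim(tiles[[1]][[1]])     # dimensions of the first tile
```

Each tile is a small, self-contained data frame, so it could be encoded independently of the others.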

NOTE 1: Traceability is assured by assembling the data as character names and values from columns marked for encoding. As a side effect, at run time the encoding is reported as being applied to "integer(ish)" values only, with no loss of accuracy. Empty source data columns gain the "NA" suffix and become single-column, single-valued matrices.
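To illustrate the idea in NOTE 1, the sketch below builds encoded column names from the variable name and each value, and treats an all-missing column as a single "NA" column. It is an illustration under assumptions, not the package's code: one_hot_sketch is a hypothetical helper, and both the underscore separator and the all-ones content of the "NA" column are guesses.

```r
# Hypothetical helper, not part of the package: dense one-hot encoding
# whose column names keep the source variable name and the encoded value.
one_hot_sketch <- function(x, name, sep = "_") {
  key <- as.character(x)
  key[is.na(key)] <- "NA"          # missing values get the literal "NA" label
  lev <- unique(key)               # one output column per distinct value
  m <- matrix(0L, nrow = length(key), ncol = length(lev),
              dimnames = list(NULL, paste(name, lev, sep = sep)))
  m[cbind(seq_along(key), match(key, lev))] <- 1L   # mark each row's value
  m
}

enc <- one_hot_sketch(c("setosa", "virginica", "setosa"), "Species")
colnames(enc)                      # names carry both variable and value

empty <- one_hot_sketch(c(NA, NA, NA), "Empty")
colnames(empty)                    # a single column with the "NA" suffix
```

Because the variable name and the value are both stored in the column name, an encoded column can be traced back to its source without a separate lookup table.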

NOTE 2: This utility implements background processing. Check "Security Considerations" in the callr package documentation.

Value

  • When orn = FALSE, an unnamed listenv of sparse matrices. Recommended for very large source data files. Before proceeding with list output, read NOTE 2 in getEV documentation. See Examples 1 and 2.

  • When orn = TRUE, a matrix.

NOTE 3: In this case, row and column binding operations are avoided to prevent the situations described in NOTE 2 of the getEV documentation. As a result, the output matrix is gradually populated instead of being gradually expanded.
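The difference between populating and expanding a matrix can be sketched in base R as follows. This is a simplified illustration, not the package's code: it uses a dense matrix, assumes the full set of output column names is known before filling, and reuses a hypothetical underscore naming scheme.

```r
# Assumption: all output column names are known before any subset is written.
x        <- as.character(iris$Species)     # unshuffled, low-cardinality column
all_cols <- paste("Species", unique(x), sep = "_")

# Allocate the full output once, then fill it subset by subset
out <- matrix(0L, nrow = length(x), ncol = length(all_cols),
              dimnames = list(NULL, all_cols))

rows   <- 14L                              # rows per subset
starts <- seq(1L, length(x), by = rows)    # first row of each subset
for (s in starts) {
  idx <- s:min(s + rows - 1L, length(x))   # row positions of this subset
  col <- match(paste("Species", x[idx], sep = "_"), all_cols)
  out[cbind(idx, col)] <- 1L               # write in place, no rbind or cbind
}

dim(out)        # every subset shares the same column layout
colSums(out)    # per-level counts
```

Since each subset writes into a fixed column layout, a subset that contains only one level of a low-cardinality variable still lands in the correct columns.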

While orn = TRUE and dropcols != NULL:

  • When omc = 'matrix', a data.table containing the encoded columns as well as the dropped, unencoded columns, with the latter placed in the leftmost positions.

  • When omc = 'dgCMatrix', a two-component listenv: a data table containing dropped, unencoded columns and a sparse matrix containing the encoded columns. The row order in both components is identical. See Examples below, and Example 2 in getEV documentation.

NOTE 4: In all the above cases, specific encoded variables can be obtained with the getEV extractor. When orn = TRUE, variables extracted from matrix outputs and decoded with oneHot are returned as named vectors having row numbers as names.

See Also

splitH, splitV, oneHot, listenv, Matrix

Examples


if (interactive()) {

# 1. Shuffled data

tempf = tempfile(fileext = '.csv')
data(iris)
iris22 = iris[{ set.seed(327); sample.int(150) },]          # shuffled iris data
rownames(iris22) <- NULL                                    # remove shuffled row names
write.table(iris22, tempf, sep = ',', row.names = FALSE, quote = FALSE)

# 1.1 Output as List
# In most cases, list output requires shuffled data!

A = tileHot(readpath = tempf
          , rows = 14, splits = 3, print = FALSE)           # encoded data tiles
print(A)                                                    # a listenv
print(A[[1]])                                               # a snapshot

# 1.2 Retrieve iris22 data from encoded list output
X = sapply(names(iris22), \(n) getEV(A, n))                 # extract all encoded columns
Y = lapply(
         lapply(X, oneHot, decode)
                               , unname)                    # decoded columns are named vectors!
d = as.data.frame(Y)
identical(iris22, d)                                        # TRUE

unlink(tempf)

# 2. Unshuffled data

# Make unshuffled data 'csv' file
tempf = tempfile(fileext = '.csv')
write.table(iris, tempf, sep = ',', row.names = FALSE, quote = FALSE)

# 2.1 Output as list
# List output fails for low cardinality variables on unshuffled data.

E = tileHot(readpath = tempf
              , rows = 14, splits = 3, print = FALSE)      # same as above

# 2.2 Retrieve iris data from encoded list output
V = sapply(names(iris), \(n) getEV(E, n))                  # warning
W = lapply(
         lapply(V, oneHot, decode)
                               , unname)                   # decoded columns are named vectors!
dd = as.data.frame(W)
identical(iris, dd)                                        # FALSE
all.equal(iris, dd)                                        # low cardinality "Species"

# 2.3 Output as matrix
# Matrix output handles low cardinality variables. No data shuffling required.

m = tileHot(readpath = tempf                               # low cardinality "Species"
          , rows = 14
          , splits = 3
          , orn = TRUE                                     # needed for matrix output
          , print = FALSE)
print(m)                                                   # 150x126 sparse matrix

# 2.4 Retrieve iris data from encoded matrix output
P = sapply(names(iris), \(n) getEV(m, n))                  # extract encoded columns
Q = lapply(
           lapply(P, oneHot, decode)
                              , unname)                    # decoded columns are named vectors!
R = as.data.frame(Q)
identical(iris, R)                                         # TRUE

# 2.5 Output as "data.table" class
D = tileHot(readpath = tempf
          , rows = 14
          , splits = 3
          , omc = 'matrix'                                 # encoded dense matrix
          , dropcols = c('Petal.Width', 'Petal.Length')    # unencoded columns
          , orn = TRUE                                     # needed for matrix output
          , print = FALSE)
print(head(D, 10))                                         # a "data.table" class
dim(D)                                                     # 150x63

# 2.6 Output as a 2-component list
Dl = tileHot(readpath = tempf
          , rows = 14
          , splits = 3
          , omc = 'dgCMatrix'                              # the default class
          , dropcols = c('Petal.Width', 'Petal.Length')    # unencoded columns
          , orn = TRUE                                     # needed for matrix output
          , print = FALSE)
print(Dl)                                                  # 2-component listenv
print(Dl[[1]])                                             # unencoded columns
print(Dl[[2]])                                             # encoded sparse matrix

# iris data can be retrieved from the Dl list in a fashion similar to that described above

unlink(tempf)

}


akin documentation built on May 19, 2026, 5:07 p.m.