knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

set.seed(1)
library(dplyr)
library(smallfactor)
pc <- covr::package_coverage()
percent <- round(covr::percent_coverage(pc), 1)
usethis::use_badge(
  "",
  href = 'https://img.shields.io/badge/testcoverage-100percent-blue.svg',
  src  = 'https://img.shields.io/badge/testcoverage-100percent-blue.svg'
)
# ![](https://img.shields.io/badge/testcoverage-100percent-blue.svg)

smallfactor - store factors in bytes and bits rather than an integer

smallfactor is an experiment to see what trade-offs there might be for storing a factor in bytes or bits instead of an integer.

An R integer is 4 bytes, so an integer-backed factor could have about two billion levels, which seems like overkill.

bytefactor targets factors with up to 256 levels.

bitfactor chooses the fewest bits needed to store each value, for factors with 2 to 32768 (2^15) levels.
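The number of bits needed for a given level count can be worked out directly. The following is a minimal base R sketch, not the package's own code: `bits_needed()` is a hypothetical helper introduced here for illustration, and it assumes that no extra code is reserved for `NA`.

```r
# Hypothetical helper (not part of smallfactor): the smallest number of bits
# whose 2^bits distinct codes can cover `n_levels` levels.
# Assumption: no code is reserved for NA.
bits_needed <- function(n_levels) {
  stopifnot(n_levels >= 2, n_levels <= 32768)
  bits <- 1L
  while (2^bits < n_levels) {
    bits <- bits + 1L
  }
  bits
}

bits_needed(4)      # 2
bits_needed(100)    # 7
bits_needed(32768)  # 15
```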

What's in the box

Limitations

It seems possible to write more methods so that these smaller factors behave like regular R factor objects, but this package does not attempt this (yet).

Internally, there is currently a lot of conversion back and forth between these small factors and the standard R factor, in order to reuse the printing and subsetting capabilities of the R factor implementation. Much of this conversion could be avoided with further effort.

Note: bitfactor uses only 31 bits of each 32-bit integer in order to avoid issues around the NA_integer_ representation. This means, for example, that one integer can hold at most 15 2-bit values (30 bits). In practice, users are not expected to notice or care about this.
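To make the 31-bit budget concrete, here is a small base R sketch of packing several 2-bit codes into a single integer. It is an illustration under stated assumptions, not smallfactor's implementation: the helper names are hypothetical, codes are assumed to be zero-based, and plain arithmetic stands in for bit shifts and masks.

```r
# Illustrative sketch only (not smallfactor's implementation).
USABLE_BITS <- 31L  # top bit is never set, so a packed word is never NA_integer_

# How many `bits`-wide values fit in one integer
values_per_integer <- function(bits) USABLE_BITS %/% bits

# Pack zero-based codes into one integer. Multiplying by 2^(bits * i)
# is the arithmetic equivalent of a left shift by bits * i.
pack_codes <- function(codes, bits) {
  stopifnot(length(codes) <= values_per_integer(bits),
            all(codes >= 0), all(codes < 2^bits))
  positions <- seq_along(codes) - 1
  as.integer(sum(codes * 2^(bits * positions)))
}

# Recover the first `n` codes. Integer division acts as the right shift,
# and modulo 2^bits acts as the mask.
unpack_codes <- function(word, bits, n) {
  positions <- seq_len(n) - 1
  (word %/% 2^(bits * positions)) %% 2^bits
}

values_per_integer(2L)                # 15
codes <- c(0, 1, 2, 0, 3)             # e.g. 'a', 'b', 'c', 'a', 'd' as zero-based codes
word  <- pack_codes(codes, bits = 2L)
unpack_codes(word, bits = 2L, n = 5)  # 0 1 2 0 3
```

With 7-bit values (as needed for 100 levels), the same calculation gives `values_per_integer(7L)`, i.e. 4 values per integer.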

Installation

You can install from GitHub with:

# install.packages('remotes')
remotes::install_github('coolbutuseless/smallfactor')

bytefactor

small <- bytefactor(c('a', 'b', 'c', 'a', 'd'))
small

small[1:3]
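The idea behind a byte-sized factor can be illustrated in base R with a raw vector, which uses one byte per element. This is only a sketch of the concept under that assumption. It is not how bytefactor is implemented, `encode_bytes()` and `decode_bytes()` are hypothetical names, and missing values are not handled.

```r
# Hypothetical helpers, for illustration only.
# Codes are stored zero-based so that up to 256 levels fit in the range 0..255.
encode_bytes <- function(x, levels = sort(unique(x))) {
  idx <- match(x, levels)
  stopifnot(length(levels) <= 256, !anyNA(idx))
  list(codes = as.raw(idx - 1L), levels = levels)
}

# Map the stored byte codes back to their level labels
decode_bytes <- function(b) {
  b$levels[as.integer(b$codes) + 1L]
}

x <- c('a', 'b', 'c', 'a', 'd')
b <- encode_bytes(x)
b$codes          # 00 01 02 00 03
decode_bytes(b)  # "a" "b" "c" "a" "d"
```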

bitfactor

bitfactor() will choose an appropriate number of bits to store the given number of levels.

In the following example, there are 4 levels, so bitfactor() chooses to store each value in the factor in 2 bits.

tiny <- bitfactor(c('a', 'b', 'c', 'a', 'd'))
tiny

tiny[1:3]

In this next example, there are 100 levels in the factor, so 7 bits are needed to store all the levels (2^7 = 128).

tiny <- bitfactor(sample(100, 10), levels=1:100)
tiny

tiny[4:6]

Storing some DNA in bytefactor and bitfactor objects.

|                | character | factor | bytefactor | bitfactor |
|----------------|-----------|--------|------------|-----------|
| bits/value     | 64        | 32     | 8          | 2         |
| total size     | 8 MB      | 4 MB   | 1 MB       | 270 kB    |
| size reduction |           | 2x     | 8x         | 30x       |
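The sizes in the table can be approximated with simple arithmetic. The sketch below assumes a 64-bit build of R (8 bytes per character vector element), one million values as in the DNA example, and ignores object headers and the levels attribute, so measured sizes will differ slightly.

```r
n <- 1e6  # number of values, as in the DNA example

# Approximate payload size in bytes for each representation
bytes <- c(
  character  = n * 8,               # one 8-byte pointer per element
  factor     = n * 4,               # one 4-byte integer per element
  bytefactor = n * 1,               # one byte per element
  bitfactor  = ceiling(n / 15) * 4  # 15 two-bit values per 4-byte integer
)

bytes
round(bytes[["character"]] / bytes)  # size reduction relative to character
```

The bitfactor estimate comes out at 266668 bytes, which is consistent with the roughly 270 kB and 30x reduction shown in the table.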

library(smallfactor)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Generate some random DNA
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
character_vector_dna <- sample(c('A', 'T', 'G', 'C'), 1e6, replace = TRUE)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create a `factor`, a `bytefactor` and a `bitfactor` using the same basic syntax
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
integer_factor <- factor    (character_vector_dna, levels = c('A', 'T', 'G', 'C'))
byte_factor    <- bytefactor(character_vector_dna, levels = c('A', 'T', 'G', 'C'))
bit_factor     <- bitfactor (character_vector_dna, levels = c('A', 'T', 'G', 'C'))

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# `bytefactor` is approx 1/4, and `bitfactor` approx 1/15, the size of the factor
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lobstr::obj_size(character_vector_dna)
lobstr::obj_size(integer_factor)
lobstr::obj_size(byte_factor)
lobstr::obj_size(bit_factor)

Similar projects

Acknowledgements



coolbutuseless/smallfactor documentation built on Dec. 19, 2021, 6:04 p.m.