    knitr::opts_chunk$set(
      collapse = FALSE,
      comment = "#>",
      fig.path = "man/figures/README-",
      out.width = "100%"
    )

    set.seed(1)
    library(dplyr)
    library(smallfactor)
    pc <- covr::package_coverage()
    percent <- round(covr::percent_coverage(pc), 1)
    usethis::use_badge(
      "",
      href = 'https://img.shields.io/badge/testcoverage-100percent-blue.svg',
      src = 'https://img.shields.io/badge/testcoverage-100percent-blue.svg'
    )
    # ![](https://img.shields.io/badge/testcoverage-100percent-blue.svg)
`smallfactor` is an experiment to see what trade-offs there might be for storing a factor in bytes or bits instead of an integer.

An R integer is 4 bytes, meaning it could hold a factor with two billion levels, which seems like overkill.

`bytefactor` targets factors with up to 256 levels.

`bitfactor` will choose the fewest bits to store a factor for factors with 2 up to 32768 (2^15) levels.
- `bytefactor()` stores factor values in raw bytes rather than in integers.
- `bitfactor()` stores factor values as bits within integers, with multiple values packed into a single integer. E.g. a factor with 4 levels could be stored as just 2 bits of information, meaning 16 values could be stored in a single R integer (16 * 2 bits = 32 bits = 4 bytes).

It seems possible that you could write more methods to make these smaller factors behave similarly to regular R `factor` objects, but this package is not attempting this (yet).
Internally, there's currently a lot of transferring back-and-forth between these small factors and the standard R factor in order to make use of the printing and subsetting capabilities of the R factor implementation. Much of this back-and-forth could be avoided if effort was expended to do so.
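The round trip between a regular factor and a byte-based representation can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the variable names are made up, the zero-based offset is an assumption chosen so that 256 levels fit in one byte, and `NA` values are not handled.

```r
# Hypothetical base-R sketch; not how smallfactor is implemented internally.
f <- factor(c("a", "b", "c", "a", "d"))

# Pack: keep the levels, and store the codes as raw bytes.
# Codes are shifted to start at 0 so that 256 levels fit in 0..255 (assumption).
lvls  <- levels(f)
codes <- as.raw(as.integer(f) - 1L)

# Unpack: turn the bytes back into 1-based codes and rebuild the factor.
restored <- factor(lvls[as.integer(codes) + 1L], levels = lvls)

identical(restored, f)
```

Each element of `codes` occupies one byte, compared with four bytes for each integer code in `f`.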
Note: `bitfactor` uses only 31 bits of a 32-bit integer in order to avoid issues around the `NA_integer_` representation. This means, for example, that an integer can only hold 15 x 2-bit values. In practice the user is never expected to notice this or care about it.
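The 31-bit limit can be illustrated with a small base-R sketch that packs fifteen 2-bit codes into one integer and unpacks them again. The bit layout (lowest bits first) and the use of plain arithmetic instead of bitwise operators are assumptions made for this example; the package may lay out its bits differently.

```r
# Hypothetical packing sketch; the layout is an assumption, not the package's.
bits_per_value <- 2L
usable_bits    <- 31L                              # the top bit is left unused
values_per_int <- usable_bits %/% bits_per_value   # 15 values per integer

# Fifteen zero-based codes for a 4-level factor: 0, 1, 2, 3, 0, 1, ...
codes  <- rep(0:3, length.out = values_per_int)
shifts <- bits_per_value * (seq_along(codes) - 1L) # 0, 2, 4, ..., 28

# Pack: place each code at its own bit offset and add everything up.
# The total is at most 2^30 - 1, so it never collides with NA_integer_.
packed <- as.integer(sum(codes * 2^shifts))

# Unpack: shift each code back down and mask off the higher bits.
unpacked <- as.integer((packed %/% 2^shifts) %% 2^bits_per_value)

identical(unpacked, codes)
```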
You can install from GitHub with:
    # install.packages('remotes')
    remotes::install_github('coolbutuseless/smallfactor')
bytefactor
    small <- bytefactor(c('a', 'b', 'c', 'a', 'd'))
    small
    small[1:3]
bitfactor
`bitfactor()` will choose an appropriate number of bits to store the given number of levels.

In the following example, there are 4 levels, so `bitfactor()` chooses to store each value in the factor in 2 bits.
    tiny <- bitfactor(c('a', 'b', 'c', 'a', 'd'))
    tiny
    tiny[1:3]
In this next example, there are 100 levels in the factor, so 7 bits are needed to store every level (2^7 = 128 covers 100 levels, whereas 2^6 = 64 does not):

    tiny <- bitfactor(sample(100, 10), levels = 1:100)
    tiny
    tiny[4:6]
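The bit widths quoted in these examples follow from finding the smallest number of bits whose range covers every level. A minimal sketch, assuming one code per level and no extra code reserved for `NA`; `bits_needed()` is a hypothetical helper, not a function exported by the package:

```r
# Hypothetical helper: smallest bit width that can represent n_levels codes.
bits_needed <- function(n_levels) {
  bits <- 1L
  while (2^bits < n_levels) {
    bits <- bits + 1L
  }
  bits
}

bits_needed(4)      # 2 bits, as in the first example
bits_needed(100)    # 7 bits, as in the second example
bits_needed(32768)  # 15 bits, the largest size mentioned above

# With 31 usable bits per integer, 7-bit values pack 4 to an integer.
31L %/% bits_needed(100)
```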
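The table below compares the memory used by one million values from a 4-level factor (random DNA bases) stored as `character`, `factor`, `bytefactor` and `bitfactor` objects. Size reductions are relative to the character vector.

|                | character | factor | bytefactor | bitfactor |
|----------------|-----------|--------|------------|-----------|
| bits/value     | 64        | 32     | 8          | 2         |
| total size     | 8 MB      | 4 MB   | 1 MB       | 270 kB    |
| size reduction |           | 2x     | 8x         | 30x       |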
    library(smallfactor)

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Generate some random DNA
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    character_vector_dna <- sample(c('A', 'T', 'G', 'C'), 1e6, replace = TRUE)

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Create a `factor` and a `smallfactor` using the same basic syntax
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    integer_factor <- factor    (character_vector_dna, levels = c('A', 'T', 'G', 'C'))
    byte_factor    <- bytefactor(character_vector_dna, levels = c('A', 'T', 'G', 'C'))
    bit_factor     <- bitfactor (character_vector_dna, levels = c('A', 'T', 'G', 'C'))

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # `bytefactor` is approx 1/4 the size of the regular factor,
    # and `bitfactor` approx 1/15
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    lobstr::obj_size(character_vector_dna)
    lobstr::obj_size(integer_factor)
    lobstr::obj_size(byte_factor)
    lobstr::obj_size(bit_factor)
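The sizes in the table can be approximated with back-of-envelope arithmetic, without installing the package. This sketch assumes a 64-bit build of R (8-byte pointers per character element) and ignores object headers and attributes, so the numbers that `lobstr::obj_size()` reports will be slightly larger. The variable names are made up for this example.

```r
# Rough payload sizes in bytes for one million values of a 4-level factor.
n <- 1e6

bytes_character  <- n * 8                 # one 8-byte pointer per element
bytes_factor     <- n * 4                 # one 4-byte integer per value
bytes_bytefactor <- n * 1                 # one raw byte per value
bytes_bitfactor  <- ceiling(n / 15) * 4   # 15 two-bit values per 4-byte integer

# Size reduction relative to the character vector
round(bytes_character / c(
  factor     = bytes_factor,
  bytefactor = bytes_bytefactor,
  bitfactor  = bytes_bitfactor
))
```

Because only 15 of the 16 possible 2-bit slots in each integer are used, the `bitfactor` works out to roughly 2.13 bits per value in practice rather than exactly 2.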