NB: This vignette is work-in-progress and not yet complete.
TBD
digest() and sha1()
R FAQ 7.31 illustrates potential problems with floating point arithmetic. Mathematically the equality $x = \sqrt{x}^2$ should hold. But the precision of floating point numbers is finite, so intermediate results are rounded and the two sides are no longer identical.
An illustration:
    # FAQ 7.31
    a0 <- 2
    b <- sqrt(a0)
    a1 <- b ^ 2
    identical(a0, a1)
    a0 - a1
    a <- c(a0, a1)
    # hexadecimal representation
    sprintf("%a", a)
Although the difference is small, any difference results in a different hash when using the digest() function.
However, the sha1() function tackles this problem by taking the hexadecimal representation of the numbers and truncating that representation to a certain number of digits prior to calculating the hash.
    library(digest)
    # different hashes with digest
    sapply(a, digest, algo = "sha1")
    # same hash with sha1 with default digits (14)
    sapply(a, sha1)
    # larger digits can lead to different hashes
    sapply(a, sha1, digits = 15)
    # decreasing the number of digits gives a stronger truncation
    # the hash will change when the truncation gives a different result
    # case where truncating gives same hexadecimal value
    sapply(a, sha1, digits = 13)
    sapply(a, sha1, digits = 10)
    # case where truncating gives different hexadecimal value
    c(sha1(pi), sha1(pi, digits = 13), sha1(pi, digits = 10))
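To make the idea of truncating the hexadecimal representation more concrete, here is a minimal base R sketch. The helper `truncate_hex()` is hypothetical and deliberately simplified: it is not the implementation used by sha1(), its `n_hex` argument counts hexadecimal digits and is not the same thing as the `digits` argument of sha1(), and it assumes positive, finite numbers that sprintf("%a", x) prints with a leading `0x1`.

```r
# Hypothetical helper, not part of digest: truncate the hexadecimal mantissa
# of positive, finite doubles to `n_hex` hexadecimal digits.
truncate_hex <- function(x, n_hex) {
  hex <- sprintf("%a", x)
  # the part between "0x1." and "p"; empty when the mantissa is exactly 1
  mantissa <- sub("^0x1\\.?([0-9a-f]*)p.*$", "\\1", hex)
  exponent <- sub("^.*p", "", hex)
  # pad with zeros so that every mantissa has at least n_hex digits
  mantissa <- substr(paste0(mantissa, strrep("0", n_hex)), 1, n_hex)
  paste0(mantissa, "p", exponent)
}

a0 <- 2
a1 <- sqrt(a0) ^ 2

# keeping 13 hexadecimal digits preserves the rounding error
same_13 <- truncate_hex(a0, 13) == truncate_hex(a1, 13)
# dropping the last hexadecimal digit removes it
same_12 <- truncate_hex(a0, 12) == truncate_hex(a1, 12)
c(same_13 = same_13, same_12 = same_12)
```

Hashing the truncated strings instead of the raw numbers would then give equal hashes whenever the truncated representations are equal, which is the behaviour the sha1() examples above rely on.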
The result of floating point arithmetic on 32-bit and 64-bit systems can be slightly different. E.g. print(pi ^ 11, 22) returns 294204.01797389047 on 32-bit and 294204.01797389053 on 64-bit. Note that only the last 2 digits are different.
| command | 32-bit | 64-bit |
| - | - | - |
| print(pi ^ 11, 22) | 294204.01797389047 | 294204.01797389053 |
| sprintf("%a", pi ^ 11) | "0x1.1f4f01267bf5fp+18" | "0x1.1f4f01267bf6p+18" |
| digest(pi ^ 11, algo = "sha1") | "c5efc7f167df1bb402b27cf9b405d7cebfba339a" | "b61f6fea5e2a7952692cefe8bba86a00af3de713" |
| sha1(pi ^ 11, digits = 14) | "f3e7b9335497d791e1b5f4135e07b06f826e0f12" | "493fac65ffef084a142ee1598a844cc752e43e8c" |
| sha1(pi ^ 11, digits = 13) | "d174131fb15f8e389f3843c10444f8cae594c5f2" | "d174131fb15f8e389f3843c10444f8cae594c5f2" |
| sha1(pi ^ 11, digits = 10) | "6b9ef159b3218b0d5de6ba5b949415a53c49d726" | "6b9ef159b3218b0d5de6ba5b949415a53c49d726" |
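The two hexadecimal strings in the table can be compared directly to see where they start to disagree. The sketch below uses base R only. `mantissa_prefix()` is a hypothetical helper, and the number of hexadecimal digits it keeps is an illustration rather than the `digits` argument of sha1().

```r
# Hexadecimal representations of pi ^ 11, copied from the table above.
hex_32 <- "0x1.1f4f01267bf5fp+18"
hex_64 <- "0x1.1f4f01267bf6p+18"

# Hypothetical helper: keep the first `n` hexadecimal digits of the mantissa.
mantissa_prefix <- function(hex, n) {
  mantissa <- sub("^0x1\\.([0-9a-f]*)p.*$", "\\1", hex)
  # pad with zeros so that a shorter mantissa can be compared as well
  substr(paste0(mantissa, strrep("0", n)), 1, n)
}

# Compare the prefixes for a decreasing number of hexadecimal digits.
n_digits <- 13:10
agree <- vapply(
  n_digits,
  function(n) mantissa_prefix(hex_32, n) == mantissa_prefix(hex_64, n),
  logical(1)
)
data.frame(n_digits, agree)
```

The prefixes only agree once the trailing hexadecimal digits that differ between the two platforms are dropped, which is in line with the table: the hashes match for the smaller values of `digits` but not for `digits = 14`.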
digest() or sha1()

To create a sha1() method for another class, apply sha1() on the (list of) relevant component(s). Recommended defaults for the arguments:

- zapsmall = 7 is recommended.
- digits = 14 is recommended in case all numerics are data.
- digits = 4 is recommended in case some numerics stem from floating point arithmetic.

Let's illustrate this using the summary of a simple linear regression. Suppose that we want a hash that takes into account the coefficients, their standard error and sigma.
    # taken from the help file of lm.influence
    lm_SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
    lm_sum <- summary(lm_SR)
    class(lm_sum)
    # str() gives the structure of the lm object
    str(lm_sum)
    # extract the coefficients and their standard error
    coef_sum <- coef(lm_sum)[, c("Estimate", "Std. Error")]
    # extract sigma
    sigma <- lm_sum$sigma
    # check the class of each component
    class(coef_sum)
    class(sigma)
    # sha1() has methods for both matrix and numeric
    # because the values originate from floating point arithmetic
    # it is better to use a low number of digits
    sha1(coef_sum, digits = 4)
    sha1(sigma, digits = 4)
    # we want a single hash
    # combining the components in a list is a solution that works
    sha1(list(coef_sum, sigma), digits = 4)
    # now turn everything into an S3 method
    # - a function with name "sha1.classname"
    # - must have the same arguments as sha1()
    sha1.summary.lm <- function(x, digits = 4, zapsmall = 7){
      coef_sum <- coef(x)[, c("Estimate", "Std. Error")]
      sigma <- x$sigma
      combined <- list(coef_sum, sigma)
      sha1(combined, digits = digits, zapsmall = zapsmall)
    }
    sha1(lm_sum)
    # try an altered dataset
    LCS2 <- LifeCycleSavings[rownames(LifeCycleSavings) != "Zambia", ]
    lm_SR2 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LCS2)
    sha1(summary(lm_SR2))
The same approach works for the lm object itself. Suppose that we want a hash that takes into account the model frame and the terms of the regression.
    class(lm_SR)
    # str() gives the structure of the lm object
    str(lm_SR)
    # extract the model and the terms
    lm_model <- lm_SR$model
    lm_terms <- lm_SR$terms
    # check their class
    class(lm_model) # handled by sha1()
    class(lm_terms) # not handled by sha1()
    # define a method for formula
    sha1.formula <- function(x, digits = 14, zapsmall = 7){
      sha1(as.character(x), digits = digits, zapsmall = zapsmall)
    }
    sha1(lm_terms)
    sha1(lm_model)
    # define a method for lm
    sha1.lm <- function(x, digits = 14, zapsmall = 7){
      lm_model <- x$model
      lm_terms <- x$terms
      combined <- list(lm_model, lm_terms)
      sha1(combined, digits = digits, zapsmall = zapsmall)
    }
    sha1(lm_SR)
    sha1(lm_SR2)
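The same recipe can be applied to other classes. As a further sketch, here is a possible method for objects of class "htest" as returned by t.test(). The method name `sha1.htest` and the choice of components are assumptions made for this illustration and are not taken from the digest package. `digits = 4` is used as the default because all components stem from floating point arithmetic.

```r
library(digest)

# Hypothetical method: hash the test statistic, the p-value and the estimates.
# Same arguments as the methods defined above.
sha1.htest <- function(x, digits = 4, zapsmall = 7) {
  combined <- list(
    unname(x$statistic),
    unname(x$p.value),
    unname(x$estimate)
  )
  sha1(combined, digits = digits, zapsmall = zapsmall)
}

# t-test on the sleep dataset that ships with R
tt_full <- t.test(extra ~ group, data = sleep)
# the same test after dropping the first observation
tt_part <- t.test(extra ~ group, data = sleep[-1, ])

hash_full <- sha1(tt_full)
hash_part <- sha1(tt_part)
c(hash_full, hash_part)
```

Components such as the name of the data or the description of the method are left out on purpose, so that the hash only reacts to changes in the numerical results.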
Use case

- analyses that require a lot of computing time
- Bundle all relevant information on an analysis in a class
- calculate sha1()
    - file fingerprint ~ sha1() on the stable parts
    - status fingerprint ~ sha1() on the parts that result from the model
- Prepare analysis objects
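The file fingerprint and the status fingerprint described above could be sketched as follows. Everything in this block is an assumption made for illustration: the analysis object is a plain list rather than a formal class, and the names `make_analysis()`, `file_fingerprint()` and `status_fingerprint()` are hypothetical.

```r
library(digest)

# Hypothetical analysis object: bundles the stable parts (data, formula)
# with the fitted model.
make_analysis <- function(data, formula) {
  list(data = data, formula = formula, model = lm(formula, data = data))
}

# File fingerprint: sha1() on the stable parts only.
# digits = 14 because these numerics are data.
file_fingerprint <- function(analysis) {
  sha1(list(analysis$data, deparse(analysis$formula)), digits = 14)
}

# Status fingerprint: sha1() on the parts that result from the model.
# digits = 4 because coefficients stem from floating point arithmetic.
status_fingerprint <- function(analysis) {
  sha1(unname(coef(analysis$model)), digits = 4)
}

model_formula <- sr ~ pop15 + pop75 + dpi + ddpi
full <- make_analysis(LifeCycleSavings, model_formula)
reduced <- make_analysis(
  LifeCycleSavings[rownames(LifeCycleSavings) != "Zambia", ],
  model_formula
)

# a change in the data changes both fingerprints
file_changed <- file_fingerprint(full) != file_fingerprint(reduced)
status_changed <- status_fingerprint(full) != status_fingerprint(reduced)
c(file_changed = file_changed, status_changed = status_changed)
```

With such fingerprints, an analysis only needs to be rerun when its file fingerprint is new, which is where the saving in computing time would come from.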