The UNF Algorithm
In UNF: Tools for Creating Universal Numeric Fingerprints for Data

This vignette describes the UNF algorithm (Altman, Gill, McDonald 2003; Altman and King 2007; Altman 2008) and the R implementation thereof, which includes some peculiarities. The official specifications for various versions of UNF can be found elsewhere online. The algorithm is described in general terms here and one can also find more specific descriptions of the Version 3/4, Version 5 and Version 6 algorithms.

The current version of this package is an R implementation that relies on general implementations of the relevant hash functions provided by digest and base64 encoding provided by base64enc. Versions 1 and 2 were available in an earlier version of the UNF package authored by Micah Altman, which was built on custom C libraries, and is included in the version logs on GitHub. That package was orphaned by CRAN in 2009. The package retains the core unf() function from the earlier versions of the UNF package, but simplifies its use considerably. The package additionally implements some new helper functions.

(1) Numerics

Round numerics to k significant digits, where the default value of k is 7. (Note: In UNF versions <= 5, k was labeled n.) Then, convert those numerics to a character-class string containing exponential notation in the following form:

- A sign character
- A single leading non-zero digit
- A decimal point
- Up to k-1 remaining digits following the decimal, no trailing zeros
- A lowercase letter "e"
- A sign character
- The digits of the exponent, omitting trailing zeros

Note (a): Zero can be positive ("+0.e+") or negative ("-0.e+").

Note (b): Inf, -Inf, and NaN are represented as: "+inf", "-inf", and "+nan", respectively. At some point in time, Dataverse handled non-finites by treating them as missing.

Note (c): The Dataverse implementation of UNFv5 represents zero values (and logical FALSE values) as "+0.e-6" rather than the implied "+0.e+" (compare logical TRUE values: "+1.e+"). This can be replicated in unf5() by adding the argument dvn_zero = TRUE.
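
The formatting rules in (1) can be sketched in Python as follows. This is an illustration only, not the package's R code: the function name unf_numeric is hypothetical, and the sketch assumes that the exponent is written without padding zeros and that a zero exponent is left empty, which matches the "+0.e+" and "+1.e+" forms shown above. The dvn_zero behaviour from Note (c) is not implemented.

```python
import math


def unf_numeric(x: float, k: int = 7) -> str:
    """Format a number as a UNF-style exponential string (illustrative sketch)."""
    # Non-finite values have fixed representations, see Note (b).
    if math.isnan(x):
        return "+nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"

    # Scientific notation with k significant digits, e.g. "1.234568e+00".
    mantissa, exponent = format(x, f".{k - 1}e").split("e")

    # Sign of the number; this also distinguishes -0.0 from 0.0, see Note (a).
    sign = "-" if mantissa.startswith("-") else "+"
    lead, _, frac = mantissa.lstrip("+-").partition(".")
    frac = frac.rstrip("0")  # no trailing zeros after the decimal point

    # Assumption: exponent digits carry no padding zeros; zero becomes empty.
    exp_sign = "-" if exponent.startswith("-") else "+"
    exp_digits = exponent.lstrip("+-").lstrip("0")

    return f"{sign}{lead}.{frac}e{exp_sign}{exp_digits}"


print(unf_numeric(1.23456789))  # +1.234568e+
print(unf_numeric(-300.0))      # -3.e+2
print(unf_numeric(0.00073))     # +7.3e-4
```

The sketch relies on Python's own float formatting for the rounding step, so results for values that sit exactly on a rounding boundary may differ from other implementations.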

(2) Character Strings

Truncate character strings to l characters, where the default value of l is 128. (Note: In UNF versions <= 5, l was labeled k.)
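
As a minimal sketch of this step in Python (the function name unf_truncate is hypothetical and not part of the package), truncation is counted in characters rather than bytes, so a multi-byte character is never split:

```python
def unf_truncate(value: str, length: int = 128) -> str:
    """Keep at most `length` characters of a string (illustrative sketch)."""
    # Slicing a Python str counts Unicode code points, not encoded bytes.
    return value[:length]


print(len(unf_truncate("é" * 200)))  # 128
```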

(3) Other Data Classes

Handle other types of data in the following ways.

For UNF versions < 5, convert all non-numeric data to character and handle as in (2), above.
For UNF versions >= 5:

a. Convert logical values to numeric (TRUE is "1" and FALSE is "0") and handle as in (1), above.

b. In this package, "factor" and "AsIs" class vectors are coerced to character and handled as in (2), above.

c. Treat bits (raw) variables as base64-encoded big-endian bit sequences.

d. Handle dates, times, and datetimes as in (4), below. In this package, time-series classes ("ts" and "zoo") and "difftime" class objects are coerced to numeric.

e. Format complex numbers as A,iB, where A is the real component and B is the imaginary component and both are formatted as numeric values as in (1), above.
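
The class-specific rules for UNF versions >= 5 can be sketched in Python as a small dispatch function. This is a hedged illustration, not the package's implementation: the names normalize and _num are hypothetical, _num is a compact formatter for finite numbers only, and Python types (bool, bytes, complex) stand in for the corresponding R classes. Dates and times are left out because they are covered in (4).

```python
import base64


def _num(x: float, k: int = 7) -> str:
    """Compact formatter for finite numbers only, following (1)."""
    mantissa, exponent = format(x, f".{k - 1}e").split("e")
    sign = "-" if mantissa.startswith("-") else "+"
    lead, _, frac = mantissa.lstrip("+-").partition(".")
    exp_sign = "-" if exponent.startswith("-") else "+"
    exp_digits = exponent.lstrip("+-").lstrip("0")
    return f"{sign}{lead}.{frac.rstrip('0')}e{exp_sign}{exp_digits}"


def normalize(value) -> str:
    """Map one value to its character representation (illustrative sketch)."""
    # Check bool before numbers: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return _num(float(value))  # TRUE -> 1, FALSE -> 0
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, complex):
        return f"{_num(value.real)},i{_num(value.imag)}"
    if isinstance(value, (int, float)):
        return _num(float(value))
    # Everything else is treated as character data, as in (2).
    return str(value)


print(normalize(True))            # +1.e+
print(normalize(complex(1, -2)))  # +1.e+,i-2.e+
```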

(4) Dates, times, datetimes, intervals, and durations

Dates, times, datetimes, intervals, and durations are handled as follows:

a. Dates are converted to character strings in the form "YYYY-MM-DD", but partial dates ("YYYY" and "YYYY-MM") are permitted. (Note: Partial dates are not supported in R, but one can create character representations of partial dates in the package by specifying date_format.)

b. Times are converted to character strings using the ISO 8601 format "hh:mm:ss.fffff". "fffff" is fractions of a second and must not contain trailing zeros (as with any numeric value, see (1), above). The time should be expressed in UTC with a terminal "Z" character. (Note: Times without accompanying dates are not supported in R, and thus not implemented in the package.)

c. Datetimes may be expressed as a concatenated date (only in the form "YYYY-MM-DD") and time, separated by "T". As an example, Fri Aug 22 12:51:05 EDT 2014 is encoded as: "2014-08-22T16:51:05Z".

d. Intervals are represented as two datetimes, concatenated by a "/". (Note: Intervals are not supported in R, and thus not implemented in the package.)

Note: Because programming languages and software applications implement time zones differently, UNF signatures calculated for identical datasets in different applications may differ. For example, the UNFv6 specification notes that Stata does not implement time zones, while R always assumes a time zone. The suggested workaround is to convert variables to a string representation and handle as in (2), above.
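
The datetime rule in (c) can be illustrated with a short Python sketch. The function name unf_datetime is hypothetical, and the sketch assumes a timezone-aware input and microsecond precision for the fractional seconds; it is not the package's R code.

```python
from datetime import datetime, timedelta, timezone


def unf_datetime(dt: datetime) -> str:
    """Format a timezone-aware datetime as a UTC string (illustrative sketch)."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = dt.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    # Fractional seconds are only written when non-zero, without trailing zeros.
    fraction = f"{utc.microsecond:06d}".rstrip("0")
    if fraction:
        text += "." + fraction
    return text + "Z"


# The example from (c): 12:51:05 EDT is four hours behind UTC.
edt = timezone(timedelta(hours=-4))
example = unf_datetime(datetime(2014, 8, 22, 12, 51, 5, tzinfo=edt))
print(example)  # 2014-08-22T16:51:05Z
```

Requiring an explicit time zone in the sketch sidesteps the ambiguity described in the note above, where different tools make different assumptions about naive timestamps.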

Computing the UNF

Append an end-of-line (\n) character and a single null byte to each non-missing value. Represent each missing value as a string of three null bytes. (Note: At some point in time, Dataverse appeared to treat empty character strings "" as missing values. UNFv6 makes explicit that a missing value NA is represented by only three null bytes, while an empty character string "" is represented by an end-of-line character and a null byte.)
Convert to Unicode bit encoding. For UNF versions < 4.1, use UTF-32BE. For UNF versions >= 4.1, use UTF-8.
Concatenate all values into a single byte sequence. Compute a hash on the resulting byte sequence. For UNF versions > 3, use SHA256. For UNF version 3, use MD5.
Base64 encode the resulting hash. For UNF versions >= 5, truncate the UNF by performing base64 encoding only on the leftmost 128, 192, 196, or 256 bits, where 128 bits (16 bytes) is the default.
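
The steps above can be sketched in Python for UNF versions >= 5 (UTF-8 and SHA256). This is an illustration under stated assumptions, not the package's code: the function name unf_vector_hash is hypothetical, the input values are assumed to be already normalized to character strings (with None standing in for a missing value), and only the byte-aligned truncation lengths 128, 192, and 256 are handled.

```python
import base64
import hashlib


def unf_vector_hash(values, bits: int = 128) -> str:
    """Hash a vector of normalized strings (illustrative sketch)."""
    if bits not in (128, 192, 256):
        raise ValueError("this sketch only handles 128, 192, or 256 bits")

    chunks = []
    for value in values:
        if value is None:
            # Missing values are three null bytes.
            chunks.append(b"\x00\x00\x00")
        else:
            # Non-missing values get an end-of-line character and a null byte.
            chunks.append(value.encode("utf-8") + b"\n\x00")

    digest = hashlib.sha256(b"".join(chunks)).digest()
    # Base64-encode only the leftmost `bits` bits of the hash.
    return base64.b64encode(digest[: bits // 8]).decode("ascii")


signature = unf_vector_hash(["+1.234568e+", None, "+0.e+"])
print(len(signature))  # 24 characters for a 128-bit truncation
```

Because a missing value and an empty string produce different byte sequences, the two inputs [None] and [""] yield different hashes in this sketch, in line with the UNFv6 clarification above.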

To aggregate multiple variables:

Calculate the UNF for each variable, as above.

Note (a): For one-variable datasets, Dataverse implements the algorithm at the variable-level only, without aggregation. Thus a UNF for a one-variable dataframe is the same as the UNF for that variable alone. The standard is ambiguous in this regard and the package copies the Dataverse implementation.

Note (b): The package treats dataframes and lists identically. Matrices are coerced to dataframes before running the algorithm.

Sort the base64-encoded UNFs in POSIX locale order.
Apply the UNF algorithm to the sorted, base64-encoded UNFs, using a truncation value as large as the original, treating the UNFs as character. For UNF versions >= 5, the algorithm is applied to the truncated UNFs.
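
A minimal Python sketch of this aggregation follows. The names hash_strings and aggregate_unfs are hypothetical, the sketch assumes UTF-8, SHA256, and a byte-aligned truncation length, and it leaves to the caller whether each input string carries a version header. Python's default string sort compares code points, which for ASCII base64 text gives the same order as the POSIX locale.

```python
import base64
import hashlib


def hash_strings(strings, bits: int = 128) -> str:
    """Apply the character-string hash steps to a list of strings."""
    data = b"".join(s.encode("utf-8") + b"\n\x00" for s in strings)
    digest = hashlib.sha256(data).digest()
    return base64.b64encode(digest[: bits // 8]).decode("ascii")


def aggregate_unfs(unf_hashes, bits: int = 128) -> str:
    """Combine variable-level UNFs into one UNF (illustrative sketch)."""
    # Following Note (a), a single variable is returned without aggregation.
    if len(unf_hashes) == 1:
        return unf_hashes[0]
    # Sorting first makes the result independent of variable order.
    return hash_strings(sorted(unf_hashes), bits)


a = hash_strings(["+1.e+", "+2.e+"])
b = hash_strings(["x", "y"])
print(aggregate_unfs([a, b]) == aggregate_unfs([b, a]))  # True
```

The same function can be reused one level up, since dataset-level UNFs are combined with the same sort-and-hash procedure.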

To aggregate multiple datasets:

Calculate the UNF for each dataset, as above.
Sort the base64-encoded UNFs in POSIX locale order.
Apply the UNF algorithm to the sorted, base64-encoded UNFs, using a truncation value as large as the original, treating the UNFs as character. For UNF versions >= 5, the algorithm is applied to the truncated UNFs.

Note: Multiple datasets need to be combined based on UNFs calculated with the same version of the algorithm. Thus when calculating a study-level UNF, dataset-level UNFs need to be calculated using the same version of the algorithm. (To achieve this, Dataverse recalculates old UNFs whenever new data is added to a study.)

Reporting the UNF

The UNF is intended to be used as part of a data citation, for example:

James Druckman; Jordan Fein; Thomas Leeper, 2012, "Replication data for: A Source of Bias in Public Opinion Stability", http://hdl.handle.net/1902.1/17864 UNF:5:esVZKwuUnh5kkpDhxXKLxA==

Here, a citation to the data file includes a persistent handle URI and a UNF signature identifying a specific version of the data file available from that handle. Note that the UNF is printed with a short header indicating the algorithm version, making it easy to match any particular UNF against a data file:

UNF:[UNF version]:[UNF hash]

In UNFv5, the header might also contain details of other parameters for non-default number rounding and character string truncation, respectively:

UNF:[UNF version]:[digits],[characters]:[UNF hash]

In UNFv6, the header can contain a number rounding parameter (N), a string truncation parameter (X), and a variable-level UNF hash truncation parameter (H) (in any order):

UNF:[UNF version]:N[digits],X[characters],H[bits]:[UNF hash]

The package prints each UNF in the appropriate format, including any non-default parameters when appropriate.
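
The header formats above can be sketched in Python as follows. The function name format_unf and its argument names are hypothetical, and the sketch only mirrors the three patterns shown in this section; it is not the package's print method. In particular, it assumes that a UNFv5 header lists both parameters whenever either one is non-default.

```python
def format_unf(unf_hash: str, version: int = 6, digits: int = 7,
               characters: int = 128, bits: int = 128) -> str:
    """Build a printable UNF string with its header (illustrative sketch)."""
    params = ""
    if version == 5:
        # UNFv5: positional "digits,characters" when either is non-default.
        if digits != 7 or characters != 128:
            params = f"{digits},{characters}"
    elif version >= 6:
        # UNFv6: labelled parameters, listing only the non-default ones.
        parts = []
        if digits != 7:
            parts.append(f"N{digits}")
        if characters != 128:
            parts.append(f"X{characters}")
        if bits != 128:
            parts.append(f"H{bits}")
        params = ",".join(parts)

    if params:
        return f"UNF:{version}:{params}:{unf_hash}"
    return f"UNF:{version}:{unf_hash}"


print(format_unf("esVZKwuUnh5kkpDhxXKLxA==", version=5))
# UNF:5:esVZKwuUnh5kkpDhxXKLxA==
```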

References

Altman, Micah, Jeff Gill, and Michael P. McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons. (Describes version 3 of the algorithm)

Altman, Micah, and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib 13(3/4). http://dlib.org/dlib/march07/altman/03altman.html.

Altman, Micah. 2008. "A Fingerprint Method for Scientific Data Verification." In T. Sobh, editor, Advances in Computer and Information Sciences and Engineering, chapter 57, pp. 311-316. Springer Netherlands, Netherlands. https://link.springer.com/chapter/10.1007/978-1-4020-8741-7_57. (Describes version 5 of the algorithm)

Data Citation Synthesis Group. 2013. "Declaration of Data Citation Principles."

Altman, Micah, and Merce Crosas. 2014. "The Evolution of Data Citation: From Principles to Implementation." IASSIST Quarterly.