orderly_deduplicate: Deduplicate an orderly archive


View source: R/deduplicate.R


Deduplicate an orderly archive. Deduplicating an orderly archive replaces all files that have the same content with "hard links". This requires hard link support in the underlying operating system, which is available on all Unix-like systems (e.g. macOS and Linux) and on Windows since Vista. However, on Windows systems this might require somewhat elevated privileges. If you use this feature, it is very important that you treat your orderly archive as read-only (though you should be doing so anyway), because changing one copy of a linked file changes all the other instances of it: the files are literally the same file.


orderly_deduplicate(root = NULL, locate = TRUE, dry_run = TRUE, quiet = FALSE)



root: The path to an orderly root directory, or NULL (the default) to search for one from the current working directory if locate is TRUE.


locate: Logical, indicating if the configuration should be searched for. If TRUE and config is not given, then orderly looks in the working directory and up through its parents until it finds an orderly_config.yml file.


dry_run: Logical, indicating if the deduplication should be planned but not run.


quiet: Logical, indicating if the status should not be printed.


This function will alter your orderly archive. Ordinarily this is not something that should be done, so we try to be careful. For deduplication to be safe, it is very important that you always treat your orderly archive as read-only. If your canonical orderly archive is behind OrderlyWeb this will almost certainly be the case already.

With "hard linking", two files with the same content can be updated so that both files point at the same physical bit of data (see the Wikipedia article on hard links for more information). This is useful because, if the file is large, only one copy needs to be stored. However, it also means that a change made to one copy of the file is immediately reflected in the other, and there is nothing to indicate that the files are linked!
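The behaviour described above can be seen outside orderly with a small base R sketch. It uses only file.link() from base R in a temporary directory; the file names are made up for illustration, and the link step may fail where the file system or the user's privileges do not permit hard links.

```r
# Demonstrate hard-link behaviour in a temporary directory (base R only).
# File names here are illustrative and unrelated to any orderly archive.
dir <- tempfile("hardlink-demo-")
dir.create(dir)
original <- file.path(dir, "original.txt")
linked <- file.path(dir, "linked.txt")

writeLines("first version", original)

# file.link() creates a hard link and returns FALSE if that is not possible
# (e.g. unsupported file system or insufficient privileges).
ok <- file.link(original, linked)

if (ok) {
  # Overwrite the content through one name ...
  writeLines("second version", original)
  # ... and read it back through the other name.
  linked_content <- readLines(linked)
  print(linked_content)
}

unlink(dir, recursive = TRUE)
```

Nothing about linked.txt itself signals that it shares storage with original.txt, which is why a deduplicated archive should never be edited in place.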

This approach is worth exploring if you have large files that are outputs of one report and inputs to another, or large inputs repeatedly used in different reports, or outputs that end up being the same in multiple reports. If you run the deduplication with dry_run = TRUE, an indication of the savings will be printed.
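To get a rough feel for what such a savings estimate involves, the sketch below groups files by checksum and sums the size of the redundant copies. The helper estimate_savings() is hypothetical and is not orderly's implementation; it assumes that files with the same MD5 checksum have the same content, and it does not detect files that are already hard-linked.

```r
# Hypothetical helper (not part of orderly): estimate how many bytes could be
# saved by hard-linking files with identical content under `path`.
estimate_savings <- function(path) {
  files <- list.files(path, recursive = TRUE, full.names = TRUE)
  if (length(files) == 0L) {
    return(0)
  }
  # Assumption: equal MD5 checksums are treated as equal content.
  hash <- unname(tools::md5sum(files))
  size <- file.size(files)
  # Within each group of identical files, all but the first copy is redundant.
  # Files that are already hard-linked are not detected and would be counted.
  redundant <- duplicated(hash)
  sum(size[redundant])
}

# Small demonstration with a made-up directory layout
demo <- tempfile("dedup-demo-")
dir.create(file.path(demo, "report-a"), recursive = TRUE)
dir.create(file.path(demo, "report-b"))
writeLines("shared data", file.path(demo, "report-a", "data.csv"))
writeLines("shared data", file.path(demo, "report-b", "data.csv"))
writeLines("unique data", file.path(demo, "report-b", "other.csv"))

saved <- estimate_savings(demo)
expected <- file.size(file.path(demo, "report-a", "data.csv"))
print(saved)

unlink(demo, recursive = TRUE)
```

For a real archive, the figure printed by orderly_deduplicate() with dry_run = TRUE is the one to rely on.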


Invisibly, information about the duplication status of the archive before deduplication was run.


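# Run and commit the same report twice in an example archive
path <- orderly::orderly_example("demo")
id1 <- orderly::orderly_run("minimal", root = path)
id2 <- orderly::orderly_run("minimal", root = path)
orderly::orderly_commit(id1, root = path)
orderly::orderly_commit(id2, root = path)
# Plan the deduplication without changing any files; any error is ignored
tryCatch(
  orderly::orderly_deduplicate(path, dry_run = TRUE),
  error = function(e) NULL)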

orderly documentation built on June 17, 2021, 5:08 p.m.