fduper-package: Eliminate Duplicate Files


Description

fduper allows you to create custom workflows for removing duplicate files. Many file deduplicators already exist, but they typically don't allow for nuanced deduplication strategies. While they usually provide a means to include and ignore folders recursively, it can be difficult to cut through the noise of legitimate duplicates to focus on the real ones. Additionally, casting a wide net by searching a large number of files for all duplicates can be computationally intensive. If the task is simply to check a handful of orphaned files in a folder, but their duplicates may be anywhere on the drive, significant time can be wasted computing hashes for duplicate pairs that do not include the orphaned files in question.

Details

A solution to this is to increase the flexibility and availability of the constraints in the deduplication process. For example, it may be useful to express a workflow such as: "search the entire drive for all image files, but only flag duplicate pairs if at least one file is in ~/images. In that case, delete the other file, but not if its path contains a .git folder, because that is likely to be a legitimate duplicate."
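A workflow like the one above might be sketched as a dplyr-style pipeline. Note that this is illustrative only: the verb names below (add_path, add_hash, identify_dups) are hypothetical and may not match fduper's actual API.

```r
library(fduper)
library(dplyr)

# Hypothetical pipeline sketch -- verb names are assumptions,
# not necessarily fduper's real function names.
fduper() %>%
  add_path("/", recursive = TRUE) %>%                    # scan the whole drive
  filter(grepl("\\.(jpe?g|png|gif)$", path)) %>%         # image files only
  add_hash() %>%                                         # compute content hashes
  identify_dups() %>%                                    # pair up duplicates
  filter(any(startsWith(path, "~/images"))) %>%          # at least one in ~/images
  filter(!grepl("/\\.git/", path))                       # never touch .git contents
```

The point is that each constraint is an ordinary dplyr step on a tibble, so the order and scope of filtering is entirely under the user's control.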

At its heart, fduper extends dplyr with methods relevant to the file deduplication process. The underlying data object is a tibble of arbitrary file information (path, size, hashes, etc.). Since this structure will be familiar to any R user, it should be relatively easy to add custom steps to further extend fduper for your own workflow. The intent of fduper is not to be the fastest deduper, but to provide a readily understandable and hackable user experience. In practice, a well-crafted workflow can often avoid substantial computation that less flexible dedupers would perform unnecessarily.
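The underlying tibble can be reproduced with plain base R and dplyr, which is what makes the structure hackable. A minimal sketch, assuming a directory of files to scan:

```r
library(tibble)
library(dplyr)

# Build a tibble of file information with base R helpers.
# This mirrors the kind of data object fduper operates on.
files <- list.files("~/images", recursive = TRUE, full.names = TRUE)

dupes <- tibble(path = files) %>%
  mutate(size = file.size(path)) %>%
  group_by(size) %>%
  filter(n() > 1) %>%                 # only same-size files can be duplicates
  ungroup() %>%
  mutate(hash = tools::md5sum(path)) %>%  # hash only the remaining candidates
  group_by(size, hash) %>%
  filter(n() > 1) %>%                 # keep groups with matching content
  ungroup()
```

Grouping by size before hashing is the kind of step that saves computation: hashes are only computed for files that share a size with at least one other file.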

fduper is under development and should be used with caution.

Author(s)

Maintainer: Glen Myrland glenmyrland@gmail.com


gmyrland/fduper documentation built on May 28, 2019, 8:53 p.m.