Files are indexed by inode and device; files with the same inode and device are considered equal. If the platform does not support inode IDs, this check is skipped.
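As a rough illustration of the inode check, here is a minimal sketch. The class name InodeIndex and its check method are hypothetical and are not taken from the module's source. It also assumes that a missing or zero ino value is how an unsupported platform shows up.

```javascript
// Hypothetical sketch of an inode + device index (not file-dedupe's own code).
class InodeIndex {
  constructor() {
    this.seen = new Map(); // "dev:ino" -> first path seen with that identity
  }

  // Returns the earlier path that refers to the same underlying file,
  // or false if this identity has not been seen before.
  check(path, stat) {
    // Assumption: a missing or zero inode means the platform has no
    // inode ids, so the check is skipped.
    if (!stat.ino) return false;
    const key = `${stat.dev}:${stat.ino}`;
    const prior = this.seen.get(key);
    if (prior !== undefined) return prior;
    this.seen.set(key, path);
    return false;
  }
}
```

The stat argument can be any object with dev and ino fields, such as the result of fs.statSync(path). Two hard links to the same file share both values, so the second one is reported as equal without reading any data.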

Files are then indexed by size; only files with the same size are compared.
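The size index can be pictured as a map from byte count to paths. The helper below is an illustrative sketch, and the name candidateGroups is made up for this example. It shows why most files never need to be read at all.

```javascript
// Hypothetical helper: groups entries by size and keeps only the groups
// that could contain duplicates.
function candidateGroups(entries) {
  const bySize = new Map(); // size in bytes -> paths with that size
  for (const { path, size } of entries) {
    const group = bySize.get(size);
    if (group) group.push(path);
    else bySize.set(size, [path]);
  }
  // A size that occurs only once cannot have a duplicate, so drop it.
  return [...bySize.values()].filter((group) => group.length > 1);
}
```

Each entry is assumed to be an object with path and size fields, where size would typically come from fs.statSync(path).size.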

During comparison, the files are read in blocks whose sizes increase in powers of two, starting at 2k. Each block is hashed and the hashes are compared; on the first mismatch the comparison stops (often without having to read the full file). If all the hashes are equal, the files are considered equal.
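The comparison loop might look roughly like the sketch below. It is not the module's implementation: the function names are hypothetical, SHA-256 is an assumed hash (the text above does not say which one is used), and hashes are recomputed rather than cached to keep the example short.

```javascript
const fs = require('fs');
const crypto = require('crypto');

const FIRST_BLOCK = 2048; // 2k starting block size, as described above

// Hashes up to `size` bytes starting at `offset`. Returns a hex digest,
// or null if there is nothing left to read at that offset.
function hashBlock(fd, offset, size) {
  const hash = crypto.createHash('sha256'); // assumed algorithm
  const chunk = Buffer.alloc(Math.min(size, 65536)); // bounded read buffer
  let done = 0;
  while (done < size) {
    const want = Math.min(chunk.length, size - done);
    const n = fs.readSync(fd, chunk, 0, want, offset + done);
    if (n === 0) break; // end of file
    hash.update(chunk.subarray(0, n));
    done += n;
  }
  return done === 0 ? null : hash.digest('hex');
}

// Compares two files block by block, doubling the block size each round.
function sameContent(pathA, pathB) {
  const fdA = fs.openSync(pathA, 'r');
  try {
    const fdB = fs.openSync(pathB, 'r');
    try {
      let offset = 0;
      let size = FIRST_BLOCK;
      for (;;) {
        const a = hashBlock(fdA, offset, size);
        const b = hashBlock(fdB, offset, size);
        if (a !== b) return false; // mismatch: stop early
        if (a === null) return true; // both files ended, all blocks matched
        offset += size;
        size *= 2;
      }
    } finally {
      fs.closeSync(fdB);
    }
  } finally {
    fs.closeSync(fdA);
  }
}
```

Because the first block is small, two files that differ near the start are told apart after reading only 2k from each.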

Hashes are only computed when needed and cached in memory. Since the hash block size increases in powers of two, only a few dozen hashes are needed even for large files (reducing memory usage compared to a fixed hash block size).
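To make the memory claim concrete, the sketch below counts how many hashes each scheme needs for a given file size. Both function names are hypothetical, and the 2048-byte defaults simply mirror the 2k starting size mentioned above.

```javascript
// Number of hashes needed to cover a file when block sizes double,
// starting from `first` bytes.
function doublingBlockCount(fileSize, first = 2048) {
  let count = 0;
  let covered = 0;
  let size = first;
  while (covered < fileSize) {
    covered += size;
    size *= 2;
    count += 1;
  }
  return count;
}

// Number of hashes needed with a fixed block size, for comparison.
function fixedBlockCount(fileSize, blockSize = 2048) {
  return Math.ceil(fileSize / blockSize);
}
```

For a 1 GiB file, doublingBlockCount(2 ** 30) gives 20, while fixedBlockCount(2 ** 30) gives 524288. Even a 1 TiB file needs only 30 hashes with doubling blocks.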

findup is quite fast - it is within 2x of the fastest duplicate finders written in C/C++. Based on the V8 profiler output, about 40% of the time is spent on I/O, 13% on crypto and 11% on file traversal, so any further gains in performance will need to come from I/O optimizations rather than code optimizations.

BTW, you may notice that file-dedupe defaults to sync I/O. This is because async I/O seems to have significant overhead for typical FS tasks. You can test this on your own system by passing the --async flag.

new Dedupe({ async: false }): creates a new instance, which holds all the cached metadata. Options:

async: whether to use async or sync I/O for hashing files. Defaults to sync, which is usually faster.

dedupe.find(file, [stat], onDone): calls onDone(err, result), where result is either false or the full path to a file that was previously deduplicated. You can optionally pass in an fs.Stats object so that dedupe does not have to make another fs.stat call.
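A possible way to drive this API over a list of files is sketched below. The helpers findAsync and listDuplicates are hypothetical wrappers written for this example. They rely only on the find(file, [stat], onDone) signature described above and work with any object that exposes it. How the Dedupe class is imported is not covered here, so that part is left as a comment.

```javascript
// Assumption: a Dedupe instance is created elsewhere, for example
//   const dedupe = new Dedupe({ async: false });
// The require() path for Dedupe is not shown because it is not documented here.

// Wraps the callback-style find() in a Promise.
function findAsync(dedupe, file, stat) {
  return new Promise((resolve, reject) => {
    const onDone = (err, result) => (err ? reject(err) : resolve(result));
    if (stat) dedupe.find(file, stat, onDone); // reuse an existing fs.Stats
    else dedupe.find(file, onDone); // let dedupe stat the file itself
  });
}

// Checks files one at a time and collects those reported as duplicates.
async function listDuplicates(dedupe, files) {
  const duplicates = [];
  for (const file of files) {
    const original = await findAsync(dedupe, file);
    if (original !== false) duplicates.push({ file, original });
  }
  return duplicates;
}
```

Files are processed sequentially so that each call sees the metadata cached by the earlier ones. The first file with a given content yields false, and later copies yield the path of that first file.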