CDC-issues-reported-2025

CDC chunker parameter attack

There was recent news about attacks on content-defined chunking (as used by borgbackup, restic, tarsnap and likely all other backup tools that use CDC). These attacks can extract the chunking parameters (including the randomly chosen chunker secret):

https://www.daemonology.net/blog/chunking-attacks.pdf
Truong, K. T., Merz, S.-P., Scarlata, M., Günther, F., and Paterson, K. G. Breaking and fixing content-defined chunking. Unpublished manuscript (submitted), January 2025.

The purpose of the chunker secret is to counteract fingerprinting attacks via the sizes of the produced chunks, so that a repo-side attacker cannot easily determine which known files you have by looking at the chunk sizes in the repository. So an attacker potentially being able to determine the chunker secret, as described in the paper, is bad news.

For example, an attacker could determine which mp3 files of a well-known mp3 collection you have backed up to the repo.
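To make the role of the chunker secret concrete, here is a minimal sketch of a buzhash-like CDC chunker in Python. It is a toy and not borg's actual chunker: the window length, the cut mask, the way the table is derived from the secret, and all names (`chunk_sizes`, `make_table`, `known_file`) are assumptions chosen for illustration. It only shows that the same known file yields a different sequence of chunk sizes under a different secret, and the same sequence again once the secret is known.

```python
import random

WINDOW = 16            # rolling window length in bytes (toy value)
MASK = (1 << 8) - 1    # cut where the low 8 hash bits are zero

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def make_table(secret):
    """Derive a 256-entry table from the secret (not cryptographically secure)."""
    rng = random.Random(secret)
    return [rng.getrandbits(32) for _ in range(256)]

def chunk_sizes(data, secret):
    """Return the plaintext chunk sizes this toy CDC chunker produces."""
    table = make_table(secret)
    sizes, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rotl32(h, 1) ^ table[byte]
        if i - start >= WINDOW:
            # remove the contribution of the byte that just left the window
            h ^= rotl32(table[data[i - WINDOW]], WINDOW)
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            sizes.append(i - start + 1)
            start, h = i + 1, 0
    if start < len(data):
        sizes.append(len(data) - start)   # trailing partial chunk
    return sizes

# A "well-known file" the attacker also has a copy of (pseudo-random stand-in).
data_rng = random.Random(0)
known_file = bytes(data_rng.getrandbits(8) for _ in range(65536))

sizes_a = chunk_sizes(known_file, secret=1)   # victim's secret
sizes_b = chunk_sizes(known_file, secret=2)   # a wrong guess
```

Without the secret, the attacker cannot predict `sizes_a`. With it, they can compute the size sequence of any known file offline and look for that sequence in the repository.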

Panic?

Don’t panic, it’s likely not as bad as it sounds at first:

- the attacker would need to be able to put data into your backup source files AND have access to your repository files.
- the first paper also mentions the possibility of fine-grained traffic analysis. not sure whether that would be possible for borg, as we use one long-lived ssh session as the transport.
- depending on how much else (besides the attacker’s files) gets backed up, it might not be easy for the attacker to find the chunks corresponding to “his” probing files.
- the attacker cannot probe faster than you do backup runs (that does not protect you, but it could mean that an attack takes rather long).
- the original (plaintext) chunk sizes are not directly visible in the repository:
  - there are multiple compression algorithms/levels, so an attacker might have to try them all to find a match. of course, an attacker would first try the default, though. also, an attacker would try incompressible random data.
  - the chunkers and their parameters are configurable, so an attacker might have to try a lot of combinations. of course, an attacker would first try the default, though.
  - the chunk size fingerprinting attack is not new to us, it is the reason why we added the “obfuscate” pseudo compressor in Dec 2020 (released in borg 1.2.0 in Feb 2022). it has random-based additive and multiplicative methods to obfuscate the chunk size visible in the repository. if you have reason to fear high-profile attacks, that was made for you! it adds some space overhead though (configurable over a wide range via its level parameter), which is why it is not on by default. See “borg help compression”.
  - alternatively, there is also the “fixed” chunker, which uses a fixed block size, leaking less fingerprinting-worthy information. the fixed chunker is simpler, has better performance and supports sparse files, but does not deduplicate well if data changes position within a file (e.g. bytes get inserted or removed and previously existing content gets shifted).
- the attack is a fingerprinting attack (the attacker might be able to learn that you have some known file(s) in your backup), but it does not let them decrypt your repository data.
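The trade-off of a fixed block size mentioned in the list above can be sketched in a few lines of Python. This is an illustration only, not borg's “fixed” chunker: the 4096-byte block size, the use of plain SHA-256 as a chunk id, and the helper names are assumptions. It shows that an in-place change touches only one block, while inserting a single byte at the front shifts every block boundary and leaves nothing to deduplicate against.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical block size for this sketch

def fixed_chunks(data, block_size=BLOCK_SIZE):
    """Cut data into blocks of block_size bytes (last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def chunk_ids(data):
    """Set of ids (here: plain SHA-256) of all fixed-size chunks of data."""
    return {hashlib.sha256(c).digest() for c in fixed_chunks(data)}

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4 * BLOCK_SIZE))

# Overwrite one byte in place: the length and all block boundaries stay the same.
modified = bytearray(original)
modified[5000] ^= 0xFF
modified = bytes(modified)

# Insert one byte at the front: all following content is shifted by one.
shifted = b"X" + original

ids_original = chunk_ids(original)
ids_modified = chunk_ids(modified)
ids_shifted = chunk_ids(shifted)

print(len(ids_original & ids_modified), "of", len(ids_original), "blocks reused after overwrite")
print(len(ids_original & ids_shifted), "of", len(ids_original), "blocks reused after insert")
```

On the fingerprinting side, every block except the last has the same size, so in this sketch the chunk sizes reveal little beyond the total length of the data.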

Other fingerprinting and countermeasures issues

There might be no easy solution to all size-related fingerprinting problems:

- if an attacker has repo access, they might always be able to guess how much data you have and how much new data you have in each backup (depends on whether and how much you use “obfuscate”).
- This is not just a chunker issue. To avoid overhead, borg’s buzhash CDC chunker avoids producing very small chunks, so relatively small source files won’t get cut into multiple chunks, but end up in a single chunk, so there could be a file-size fingerprinting attack on sets of small files. “obfuscate” would help here, too, of course.
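The small-file case can be illustrated with a toy size-matching check in Python. It is a simplification under stated assumptions: it pretends the attacker can observe (or reconstruct) plaintext chunk sizes, ignoring compression and encryption overhead, and the function name and the example sizes are made up for illustration.

```python
from collections import Counter

def match_fraction(known_file_sizes, observed_chunk_sizes):
    """Fraction of known small-file sizes that also occur as chunk sizes.

    Uses multiset intersection, so a size that occurs twice in the known
    collection must also occur twice among the observed chunks to count twice.
    """
    if not known_file_sizes:
        return 0.0
    known = Counter(known_file_sizes)
    observed = Counter(observed_chunk_sizes)
    matched = sum((known & observed).values())
    return matched / len(known_file_sizes)

# Hypothetical sizes (bytes) of a known collection of small files ...
collection = [1200, 3400, 5600, 7800]
# ... and hypothetical chunk sizes seen in a repository.
repo_chunks = [3400, 999, 7800, 1200, 42]

print(match_fraction(collection, repo_chunks))  # 0.75
```

A high fraction over a large enough set of distinctive sizes would hint that the collection is in the backup, regardless of how the chunker places cut points in larger files.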

Any ideas about how to improve the situation are welcome, but please consider that any change to how chunking works (or works by default) would impact the deduplication on existing repos, potentially doubling the amount of repository storage needed and also causing a very slow first backup after the change.

Switching on obfuscation does not have these negative effects, because the chunk id is computed from the plaintext (before obfuscation is added). It’s not without overhead though and the desired amount of obfuscation depends a lot on the situation, so it’s the users’ choice how much obfuscation overhead they want to add. If you switch on obfuscation, please note that it will only affect NEW chunks written to the repo.
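Why obfuscation leaves deduplication intact can be sketched as follows. This is not borg's on-disk format or its actual “obfuscate” algorithm: the HMAC-SHA256 id, the zlib compression, the 4-byte length header, the purely additive random padding and all names are assumptions for illustration, and encryption (which would come after this step) is left out. The point is only that the id depends on the plaintext alone, while the stored size varies from run to run.

```python
import hashlib
import hmac
import secrets
import zlib

def chunk_id(id_key, plaintext):
    """Id computed from the plaintext only, before any size obfuscation."""
    return hmac.new(id_key, plaintext, hashlib.sha256).digest()

def obfuscated_payload(plaintext, max_pad):
    """Compress, then append 0..max_pad zero bytes of padding.

    A 4-byte length header records the compressed size so that the
    padding can be stripped again on read.
    """
    compressed = zlib.compress(plaintext)
    pad_len = secrets.randbelow(max_pad + 1)
    return len(compressed).to_bytes(4, "big") + compressed + bytes(pad_len)

def recover(payload):
    """Strip the padding and decompress."""
    n = int.from_bytes(payload[:4], "big")
    return zlib.decompress(payload[4:4 + n])

id_key = b"k" * 32                       # hypothetical id key
plaintext = b"some chunk content " * 100

first = obfuscated_payload(plaintext, max_pad=1024)
second = obfuscated_payload(plaintext, max_pad=1024)

# Same id both times, so the second copy would be deduplicated,
# even though the two payloads will usually differ in length.
same_id = chunk_id(id_key, plaintext) == chunk_id(id_key, plaintext)
```

In this sketch the padding bound `max_pad` plays the role of the overhead knob: a larger bound hides the size better and costs more space.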

Chunker implementation notes

Changing the buzhash chunker in borg is not easy:

- it’s written in C and some Cython.
- it’s a mix of reading, chunking, buffer management.
- if the chunker generates different chunks than before, that will break deduplication and need a lot of space (for a potentially very long time, until all “old” chunks are deleted).
- there is no easy access from the C code to borg’s python code (e.g. the internal crypto API)

OTOH, changing the “fixed” chunker is easier:

- it’s written in Cython.
- it’s a simple mix of reading and chunking, but gets a bit more complex by supporting sparse files.

borg 1.2.x and 1.4.x

Considering these are stable release series, there won’t be big changes in there. The risk of breaking something seems higher than the fingerprinting risk. Small ideas with little risk and few side effects that improve security are welcome, though.

Guess users can either live with the fingerprinting risk or use the existing “obfuscate” pseudo compressor (some may already do so; it has existed since 1.2.0).

The first paper suggests that using compression improves the security. borg by default uses lz4 (and this is usually faster than none). Guess some non-default algorithm with a non-default level might even add a bit more security.

borg 2 and potential improvements

borg2 will be a breaking release and data will need to be transferred to a new repo anyway, so that could be a good time to implement changes in chunking while avoiding the above-mentioned “doubling repo space” issue. It would make “borg transfer” more complex and slower, though.

But the question is whether we need changes there or whether it is better to just use the obfuscation.

The first paper seems to suggest using a 64-bit buzhash instead of the current 32-bit implementation. That only fixes the CDC issue, though, not the fingerprinting of sets of small files.

The second paper suggests encrypting the output of the buzhash function with AES. Same here: this does not fix the fingerprinting of sets of small files.

The second paper also suggests padding the chunks. Guess this has a similar effect to the already existing “obfuscate” functionality.