
docs/data-structures: tie CDC back into dedup rationale

enkore, 3 years ago
commit 79cb4e43e5
1 changed file with 5 additions and 1 deletion

docs/internals/data-structures.rst (+5 −1)

@@ -626,7 +626,11 @@ The idea of content-defined chunking is assigning every byte where a
 cut *could* be placed a hash. The hash is based on some number of bytes
 (the window size) before the byte in question. Chunks are cut
 where the hash satisfies some condition
-(usually "n numbers of trailing/leading zeroes").
+(usually "n trailing/leading zeroes"). This causes chunks to be cut
+at the same locations relative to the file's contents, even if bytes are
+inserted or removed before/after a cut, as long as the bytes within the
+window stay the same. As a result, a single cluster of changes to a file
+is likely to produce only one or two new chunks, which aids deduplication.
 
 Using normal hash functions this would be extremely slow,
 requiring hashing ``window size * file size`` bytes.
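
To make the mechanism concrete, here is a minimal sketch of content-defined
chunking built on a rolling hash. It is illustrative only, not Borg's actual
buzhash chunker: the 48-byte window, the polynomial hash, and the 12-bit mask
(the "n trailing zeroes" condition, giving roughly 4 KiB average chunks) are
assumptions chosen for the example. Because the hash is rolled one byte at a
time, each position costs O(1) rather than rehashing the whole window:

    # Minimal content-defined chunking sketch (illustrative; not Borg's buzhash).
    WINDOW_SIZE = 48            # assumed window size
    BASE = 257                  # polynomial base for the rolling hash
    MASK = (1 << 12) - 1        # "n trailing zeroes" cut condition, n = 12
    MOD = 1 << 32               # keep the hash within 32 bits

    def chunk_boundaries(data: bytes):
        """Yield cut offsets; the hash covers the WINDOW_SIZE bytes before each."""
        if len(data) <= WINDOW_SIZE:
            yield len(data)
            return
        top = pow(BASE, WINDOW_SIZE - 1, MOD)   # weight of the outgoing byte
        h = 0
        for b in data[:WINDOW_SIZE]:            # hash of the first full window
            h = (h * BASE + b) % MOD
        for i in range(WINDOW_SIZE, len(data)):
            if (h & MASK) == 0:                 # condition met: cut here
                yield i
            # Slide the window: drop data[i - WINDOW_SIZE], add data[i].
            h = ((h - data[i - WINDOW_SIZE] * top) * BASE + data[i]) % MOD
        yield len(data)                         # final chunk ends at EOF

    def split(data: bytes):
        """Split data into chunks at the content-defined boundaries."""
        chunks, start = [], 0
        for cut in chunk_boundaries(data):
            chunks.append(data[start:cut])
            start = cut
        return chunks

Inserting or deleting bytes early in the input shifts every absolute offset,
but once the window contents realign with the old data, the same cut condition
fires at the same positions relative to the content, so only the chunks around
the edit differ, which is exactly the deduplication property described above.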