10 years ago · b5bdb52b6a
--- a/docs/internals.rst
+++ b/docs/internals.rst
@@ -168,13 +168,27 @@ A chunk is stored as an object as well, of course.
 
				 Chunks
			
 
				 ------
			
 
				 
			
 
				-|project_name| uses a rolling hash computed by the Buzhash_ algorithm, with a
			
 
				-window size of 4095 bytes (`0xFFF`), with a minimum chunk size of 1024 bytes.
			
 
				-It triggers (chunks) when the last 16 bits of the hash are zero, producing
			
 
				-chunks of 64kiB on average.
			
 
				+The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
			
 
				+It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
			
 
				+producing chunks of 2^HASH_MASK_BITS Bytes on average.
			
 
				+
			
 
				+create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE
			
 
				+can be used to tune the chunker parameters; the defaults are:
			
 
				+
			
 
				+- CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
			
 
				+- CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
			
 
				+- HASH_MASK_BITS = 16 (statistical average chunk size ~= 2^16 B = 64 kiB)
			
 
				+- HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)
			
 
				+
			
 
				+The default parameters are OK for relatively small backup data volumes and
			
 
				+repository sizes, given plenty of available memory (RAM) and disk space for the
			
 
				+chunk index. If that does not apply, you are advised to tune these parameters
			
 
				+to keep the chunk count lower than with the defaults.
			
 
				 
			
 
				 The buzhash table is altered by XORing it with a seed randomly generated once
			
 
				-for the archive, and stored encrypted in the keyfile.
			
 
				+for the archive, and stored encrypted in the keyfile. This is to prevent
				+chunk-size-based fingerprinting attacks on your encrypted repo contents (to guess
			
 
				+what files you have based on a specific set of chunk sizes).
			
 
				 
			
 
				 
			
 
				 Indexes / Caches
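
The chunker parameters introduced in the hunk above can be illustrated with a small Python sketch. It only decodes the four ``--chunker-params`` values into sizes and shows the trigger rule (the low HASH_MASK_BITS bits of the hash being zero). The function names ``describe_chunker_params`` and ``triggers`` are hypothetical helpers made up for this illustration, not part of the project's code, and no rolling hash is computed here.

```python
# Hypothetical helpers for illustration only; not the project's implementation.

def describe_chunker_params(chunk_min_exp, chunk_max_exp, hash_mask_bits, hash_window_size):
    """Translate the four --chunker-params values into byte sizes and a bit mask."""
    return {
        "min_chunk_size": 2 ** chunk_min_exp,        # smallest chunk, in bytes
        "max_chunk_size": 2 ** chunk_max_exp,        # largest chunk, in bytes
        "average_chunk_size": 2 ** hash_mask_bits,   # statistical average, in bytes
        "hash_mask": (1 << hash_mask_bits) - 1,      # selects the low HASH_MASK_BITS bits
        "window_size": hash_window_size,             # rolling hash window, in bytes
    }


def triggers(hash_value, hash_mask_bits):
    """Return True when the low hash_mask_bits bits of hash_value are all zero."""
    return hash_value & ((1 << hash_mask_bits) - 1) == 0


defaults = describe_chunker_params(10, 23, 16, 4095)
print(defaults)

# Out of 2**18 consecutive hash values, exactly 2**18 / 2**16 = 4 trigger,
# which is why the average distance between cut points is 2**HASH_MASK_BITS.
trigger_count = sum(triggers(h, 16) for h in range(2 ** 18))
print(trigger_count)
```

With the defaults this yields a minimum of 1 kiB, a maximum of 8 MiB and an average of 64 kiB, matching the list in the hunk.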
			
@@ -243,7 +257,7 @@ Indexes / Caches memory usage
 
				 
			
 
				 Here is the estimated memory usage of |project_name|:
			
 
				 
			
 
				-  chunk_count ~= total_file_size / 65536
			
 
				+  chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
			
 
				 
			
 
				   repo_index_usage = chunk_count * 40
			
 
				 
			
@@ -252,20 +266,32 @@ Here is the estimated memory usage of |project_name|:
 
				   files_cache_usage = total_file_count * 240 + chunk_count * 80
			
 
				 
			
 
				   mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
			
 
				-             = total_file_count * 240 + total_file_size / 400
			
 
				+             = chunk_count * 164 + total_file_count * 240
			
 
				 
			
 
				 All units are Bytes.
			
 
				 
			
 
				-It is assuming every chunk is referenced exactly once and that typical chunk size is 64kiB.
			
 
				+It is assuming every chunk is referenced exactly once (if you have a lot of
			
 
				+duplicate chunks, you will have fewer chunks than estimated above).
			
 
				+
			
 
				+It is also assuming that typical chunk size is 2^HASH_MASK_BITS (if you have
			
 
				+a lot of files smaller than this statistical average chunk size, you will have
			
 
				+more chunks than estimated above, because 1 file is at least 1 chunk).
			
 
				 
			
 
				 If a remote repository is used the repo index will be allocated on the remote side.
			
 
				 
			
 
				-E.g. backing up a total count of 1Mi files with a total size of 1TiB:
			
 
				+E.g. backing up a total count of 1Mi files with a total size of 1TiB.
			
 
				+
			
 
				+a) with create --chunker-params 10,23,16,4095 (default):
			
 
				 
			
 
				-  mem_usage  =  1 * 2**20 * 240  +  1 * 2**40 / 400  =  2.8GiB
			
 
				+  mem_usage  =  2.8GiB
			
 
				 
			
 
				-Note: there is a commandline option to switch off the files cache. You'll save
			
 
				-some memory, but it will need to read / chunk all the files then.
			
 
				+b) with create --chunker-params 10,23,20,4095 (custom):
			
 
				+
			
 
				+  mem_usage  =  0.4GiB
			
 
				+
			
 
				+Note: there is also the --no-files-cache option to switch off the files cache.
			
 
				+You'll save some memory, but it will then need to read / chunk all the files, as
				+it cannot skip unmodified files.
			
 
				 
			
 
				 
			
 
				 Encryption
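
The two memory figures quoted in the hunk above (2.8GiB and 0.4GiB) can be reproduced with a short Python sketch of the stated formula. It assumes, as the text does, that every chunk is referenced exactly once and that the typical chunk size is 2^HASH_MASK_BITS. The function name ``estimate_mem_usage`` is a hypothetical helper for this illustration; the constants 164 and 240 are taken from the formula in the diff.

```python
# Hypothetical helper for illustration; constants come from the formula in the docs.

GIB = 2 ** 30


def estimate_mem_usage(total_file_count, total_file_size, hash_mask_bits=16):
    """Estimate memory usage in bytes: chunk_count * 164 + total_file_count * 240."""
    chunk_count = total_file_size // 2 ** hash_mask_bits
    return chunk_count * 164 + total_file_count * 240


# 1Mi files with a total size of 1TiB, as in the example from the docs.
file_count = 2 ** 20
file_size = 2 ** 40

usage_default = estimate_mem_usage(file_count, file_size, hash_mask_bits=16) / GIB
usage_custom = estimate_mem_usage(file_count, file_size, hash_mask_bits=20) / GIB

print(f"a) --chunker-params 10,23,16,4095: {usage_default:.1f} GiB")
print(f"b) --chunker-params 10,23,20,4095: {usage_custom:.1f} GiB")
```

Raising HASH_MASK_BITS from 16 to 20 cuts the chunk count by a factor of 16, so the per-file term (total_file_count * 240) becomes the dominant part of the estimate.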
			
@@ -291,6 +317,7 @@ Encryption keys are either derived from a passphrase or kept in a key file.
 
				 The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
			
 
				 or prompted for interactive usage.
			
 
				 
			
 
				+
			
 
				 Key files
			
 
				 ---------
			
 
				 
			
@@ -355,4 +382,10 @@ representation of the repository id.
 
				 Compression
			
 
				 -----------
			
 
				 
			
 
				-Currently, compression is disabled by default. Zlib compression can be enabled by passing ``--compression level`` on the command line. Level can be anything from 0 (no compression, fast) to 9 (high compression, slow).
			
 
				+|project_name| currently always pipes all data through a zlib compressor which
			
 
				+supports compression levels 0 (no compression, fast) to 9 (high compression, slow).
			
 
				+
			
 
				+See ``borg create --help`` about how to specify the compression level and its default.
			
 
				+
			
 
				+Note: zlib level 0 creates a little bit more output data than it gets as input,
			
 
				+due to zlib protocol overhead.
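
The note about level 0 can be checked with Python's standard ``zlib`` module. This is a minimal sketch that uses an arbitrary 100000-byte sample buffer (an assumption made for this illustration); it does not show how the project itself invokes the compressor.

```python
import zlib

# Arbitrary sample input for illustration: 100000 bytes of repeating data.
data = bytes(range(256)) * 390 + bytes(160)

stored = zlib.compress(data, 0)   # level 0: no compression, data is only framed
packed = zlib.compress(data, 9)   # level 9: highest compression, slowest

print(len(data), len(stored), len(packed))

# Level 0 output is slightly larger than the input because of the zlib
# header, per-block framing and trailing checksum.
overhead = len(stored) - len(data)
print(overhead)

# Both levels decompress back to the original bytes.
roundtrip_ok = zlib.decompress(stored) == data and zlib.decompress(packed) == data
print(roundtrip_ok)
```

The overhead at level 0 is a small, roughly fixed cost per block, so it matters mostly for very small inputs.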