
reorder to be more logical and more gentle

Antoine Beaupré, 10 years ago
commit 688ba109ef
1 changed file with 55 additions and 49 deletions

+ 55 - 49
docs/internals.rst

@@ -33,43 +33,25 @@ index.%d
   cache of the file indexes. Those files can be regenerated with
   ``check --repair``
 
-Indexes and memory usage
-------------------------
-
-Repository index
-  40 bytes x N ~ 200MB (If a remote repository is
-  used this will be allocated on the remote side)
-
-Chunk lookup index
-  44 bytes x N ~ 220MB
+Repository config file
+----------------------
 
-File chunk cache
-  probably 80-100 bytes x N ~ 400MB
+Each repository has a ``config`` file which is an ``INI``-formatted
+file that looks like this::
 
-The chunk lookup index (chunk hash -> reference count, size, ciphered
-size; in file ``cache/chunk``) and the repository index (chunk hash ->
-segment, offset; in file ``repo/index.%d``) are stored in a sort of hash
-table, directly mapped in memory from the file content, with only one
-slot per bucket, spreading collisions to the following buckets. As a
-consequence the hash is just a start position for a linear search: if
-the element is not in its bucket, the index is scanned linearly until
-an empty bucket is found. When the table becomes 90% full its size is
-doubled; when it drops to 25% full its size is halved. Operations on it
-therefore have a variable complexity between constant and linear with a
-low factor, and the memory overhead varies between 10% and 300%.
+  [repository]
+  version = 1
+  segments_per_dir = 10000
+  max_segment_size = 5242880
+  id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
 
-The file chunk cache (file path hash -> age, inode number, size,
-mtime_ns, chunk hashes; in file ``cache/files``) is stored as a Python
-associative array of Python objects, which generates a lot of
-overhead. It takes around 240 bytes per file without the chunk list,
-compared to at most 64 bytes of real data (depending on data
-alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
-of ~250 bytes even if there is only one chunk hash. The inode number is
-stored to make sure we distinguish between different files, as a single
-path may not be unique across different archives in different setups.
+This is where the ``repository.id`` is stored. It is a unique
+identifier for the repository. It will not change if you move the
+repository around, so you can make a local transfer first and then
+decide to move the repository to another (even remote) location at a
+later time.
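
As a rough illustration only (this is not taken from the |project_name|
sources, and the repository path is a placeholder), the values can be
read back with Python's standard ``configparser``::

  # Illustration only, not actual source code; the path is a placeholder.
  from configparser import ConfigParser

  config = ConfigParser()
  config.read("/path/to/repo/config")

  repo_id = config["repository"]["id"]
  version = int(config["repository"]["version"])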
 
-The ``index.%d`` files are random access, but they can be
-recreated if damaged or lost using ``check --repair``.
+|project_name| will take a POSIX read lock on that file when operating
+on the repository.
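
A POSIX shared ("read") lock of that kind can be sketched with the
standard ``fcntl`` module; the exact calls |project_name| makes may
differ::

  # Sketch of a POSIX shared ("read") lock on the config file.
  # The path is a placeholder and the exact calls made may differ.
  import fcntl

  with open("/path/to/repo/config", "rb") as fd:
      fcntl.lockf(fd, fcntl.LOCK_SH)   # blocks while another process holds an exclusive lock
      # ... operate on the repository ...
      fcntl.lockf(fd, fcntl.LOCK_UN)   # also released automatically when fd is closed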
 
 Repository structure
 --------------------
@@ -115,7 +97,7 @@ beyond the ``MAX_OBJECT_SIZE`` barrier of 20MB.
 A chunk is an object as well, of course, and its id is the hash of its
 (unencrypted and uncompressed) content.
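
For illustration only: the snippet below uses a plain SHA-256 over the
plaintext; the actual id function is not specified in this section and
depends on the key material in use::

  # Illustration only: derive a chunk id from the unencrypted,
  # uncompressed content. Plain SHA-256 is an assumption here; the real
  # id function depends on the key type in use.
  import hashlib

  def chunk_id(plaintext: bytes) -> bytes:
      return hashlib.sha256(plaintext).digest()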
 
-Hints are stored in a file (repo/hints) and contain: version, list of
+Hints are stored in a file (``repo/hints``) and contain: version, list of
 segments, compact.
 
 Chunks
@@ -128,25 +110,49 @@ average. All these parameters are fixed. The buzhash table is altered
 by XORing it with a seed randomly generated once for the archive, and
 stored encrypted in the keyfile.
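
The seeding step can be pictured as follows; the table contents, the
seed width and all names are assumptions made for the sake of the
example::

  # Sketch: every entry of a 256-entry table of 32-bit values is XORed
  # with a single random 32-bit seed generated once per archive.
  # BASE_TABLE and the 32-bit seed width are assumptions for illustration.
  import os

  BASE_TABLE = [(i * 0x9E3779B1) & 0xFFFFFFFF for i in range(256)]  # stand-in table

  seed = int.from_bytes(os.urandom(4), "little")         # generated once per archive
  seeded_table = [entry ^ seed for entry in BASE_TABLE]  # table the chunker then uses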
 
-Repository config file
-----------------------
+Indexes
+-------
 
-Each repository has a ``config`` file which is an ``INI``-formatted
-file that looks like this::
+The chunk lookup index (chunk hash -> reference count, size, ciphered
+size; in file ``cache/chunk``) and the repository index (chunk hash ->
+segment, offset; in file ``repo/index.%d``) are stored in a sort of hash
+table, directly mapped in memory from the file content, with only one
+slot per bucket, spreading collisions to the following buckets. As a
+consequence the hash is just a start position for a linear search: if
+the element is not in its bucket, the index is scanned linearly until
+an empty bucket is found. When the table becomes 90% full its size is
+doubled; when it drops to 25% full its size is halved. Operations on it
+therefore have a variable complexity between constant and linear with a
+low factor, and the memory overhead varies between 10% and 300%.
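
As a rough sketch of the behaviour just described (plain Python, with a
simplified bucket layout and invented names, not the memory-mapped
on-disk format)::

  # Simplified sketch: open addressing with linear probing,
  # growing at 90% load and shrinking at 25% load.
  class ProbingIndex:
      def __init__(self, capacity=8):
          self.buckets = [None] * capacity   # each slot: (key, value) or None
          self.used = 0

      def _slot(self, key):
          # The hash is only a start position; probe linearly from there.
          i = hash(key) % len(self.buckets)
          while self.buckets[i] is not None and self.buckets[i][0] != key:
              i = (i + 1) % len(self.buckets)
          return i

      def __setitem__(self, key, value):
          i = self._slot(key)
          if self.buckets[i] is None:
              self.used += 1
          self.buckets[i] = (key, value)
          if self.used > 0.9 * len(self.buckets):        # 90% full: double
              self._resize(len(self.buckets) * 2)

      def __getitem__(self, key):
          entry = self.buckets[self._slot(key)]
          if entry is None:
              raise KeyError(key)
          return entry[1]

      def __delitem__(self, key):
          i = self._slot(key)
          if self.buckets[i] is None:
              raise KeyError(key)
          self.buckets[i] = None
          self.used -= 1
          # Rebuild so entries that probed past this slot stay reachable,
          # halving the table when it drops to 25% load.
          if self.used < 0.25 * len(self.buckets):
              self._resize(max(8, len(self.buckets) // 2))
          else:
              self._resize(len(self.buckets))

      def _resize(self, capacity):
          live = [e for e in self.buckets if e is not None]
          self.buckets = [None] * capacity
          self.used = 0
          for key, value in live:
              self[key] = value

Looking up a missing key probes until the first empty bucket, which is
why the complexity varies between constant and (rarely) linear, as
described above.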
 
-  [repository]
-  version = 1
-  segments_per_dir = 10000
-  max_segment_size = 5242880
-  id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
+The file chunk cache (file path hash -> age, inode number, size,
+mtime_ns, chunk hashes; in file ``cache/files``) is stored as a Python
+associative array of Python objects, which generates a lot of
+overhead. It takes around 240 bytes per file without the chunk list,
+compared to at most 64 bytes of real data (depending on data
+alignment), and around 80 bytes per chunk hash (vs 32), with a minimum
+of ~250 bytes even if there is only one chunk hash. The inode number is
+stored to make sure we distinguish between different files, as a single
+path may not be unique across different archives in different setups.
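
Conceptually, one entry of that cache looks roughly like this (field
names and values are made up for illustration)::

  # Rough shape of a single files-cache entry; field names and values
  # are invented for illustration.
  files_cache = {
      b"<16-byte path hash>": {
          "age": 0,                      # roughly: how long since the entry was last seen (assumed meaning)
          "inode": 1234567,              # distinguishes different files sharing a path
          "size": 65536,
          "mtime_ns": 1400000000000000000,
          "chunks": [b"<32-byte chunk hash>", b"<32-byte chunk hash>"],
      },
  }

Each such entry is a full Python object graph, which is where the
roughly 240 bytes of per-file overhead quoted above come from.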
 
-This is where the ``repository.id`` is stored. It is a unique
-identifier for the repository. It will not change if you move the
-repository around, so you can make a local transfer first and then
-decide to move the repository to another (even remote) location at a
-later time.
+The ``index.%d`` files are random access, but they can be
+recreated if damaged or lost using ``check --repair``.
 
-|project_name| will take a POSIX read lock on that file when operating
-on the repository.
+Indexes memory usage
+--------------------
+
+Here is the estimated memory usage of |project_name| when using those
+indexes:
+
+Repository index
+  40 bytes x N ~ 200MB (If a remote repository is
+  used this will be allocated on the remote side)
+
+Chunk lookup index
+  44 bytes x N ~ 220MB
+
+File chunk cache
+  probably 80-100 bytes x N ~ 400MB
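
A back-of-the-envelope helper, using the per-entry sizes from the list
above (N here is just the entry count those estimates assume; about
five million entries reproduces the quoted figures)::

  # Back-of-the-envelope estimate using the per-entry sizes listed above.
  # N is the entry count those estimates assume; ~5 million entries
  # reproduces the 200/220/400 MB figures.
  def index_memory_mb(entries, bytes_per_entry):
      return entries * bytes_per_entry / 1e6

  N = 5 * 1000 * 1000
  print(index_memory_mb(N, 40))   # repository index   -> ~200 MB
  print(index_memory_mb(N, 44))   # chunk lookup index -> ~220 MB
  print(index_memory_mb(N, 80))   # file chunk cache   -> ~400 MB (80-100 bytes each)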
 
 Encryption
 ----------