Browse Source

docs: internals: edited HashIndex

Marian Beermann 8 years ago
parent
commit
d33929a24d

+ 20 - 12
docs/internals/data-structures.rst

@@ -327,7 +327,7 @@ The archive object itself further contains some metadata:
   When :ref:`borg_check` rebuilds the manifest (e.g. if it was corrupted) and finds
   When :ref:`borg_check` rebuilds the manifest (e.g. if it was corrupted) and finds
   more than one archive object with the same name, it adds a counter to the name
   more than one archive object with the same name, it adds a counter to the name
   in the manifest, but leaves the *name* field of the archives as it was.
   in the manifest, but leaves the *name* field of the archives as it was.
-* *items*, a list of chunk IDs containing item metadata (size: count * ~33B)
+* *items*, a list of chunk IDs containing item metadata (size: count * ~34B)
 * *cmdline*, the command line which was used to create the archive
 * *cmdline*, the command line which was used to create the archive
 * *hostname*
 * *hostname*
 * *username*
 * *username*
@@ -339,8 +339,7 @@ The archive object itself further contains some metadata:
 
 
 .. _archive_limitation:
 .. _archive_limitation:
 
 
-Note about archive limitations
-++++++++++++++++++++++++++++++
+.. rubric:: Note about archive limitations
 
 
 The archive is currently stored as a single object in the repository
 The archive is currently stored as a single object in the repository
 and thus limited in size to MAX_OBJECT_SIZE (20MiB).
 and thus limited in size to MAX_OBJECT_SIZE (20MiB).
@@ -435,18 +434,16 @@ The cache
 The **files cache** is stored in ``cache/files`` and is used at backup time to
 The **files cache** is stored in ``cache/files`` and is used at backup time to
 quickly determine whether a given file is unchanged and we have all its chunks.
 quickly determine whether a given file is unchanged and we have all its chunks.
 
 
-The files cache is in memory a key -> value mapping (a Python *dict*) and contains:
+In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:
 
 
-* key:
-
-  - full, absolute file path id_hash
+* key: id_hash of the encoded, absolute file path
 * value:
 * value:
 
 
   - file inode number
   - file inode number
   - file size
   - file size
   - file mtime_ns
   - file mtime_ns
-  - list of file content chunk id hashes
   - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
   - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
+  - list of chunk ids representing the file's contents
 
 
 To determine whether a file has not changed, cached values are looked up via
 To determine whether a file has not changed, cached values are looked up via
 the key in the mapping and compared to the current file attribute values.
 the key in the mapping and compared to the current file attribute values.
@@ -572,8 +569,9 @@ HashIndex
 The chunks cache and the repository index are stored as hash tables, with
 The chunks cache and the repository index are stored as hash tables, with
 only one slot per bucket, spreading hash collisions to the following
 only one slot per bucket, spreading hash collisions to the following
 buckets. As a consequence the hash is just a start position for a linear
 buckets. As a consequence the hash is just a start position for a linear
-search, and if the element is not in the table the index is linearly crossed
-until an empty bucket is found.
+search. If a key is looked up that is not in the table, then the hash table
+is searched from the start position (the hash) until the first empty
+bucket is reached.
 
 
 This particular mode of operation is open addressing with linear probing.
 This particular mode of operation is open addressing with linear probing.
 
 
@@ -582,14 +580,24 @@ emptied to 25%, its size is shrinked. Operations on it have a variable
 complexity between constant and linear with low factor, and memory overhead
 complexity between constant and linear with low factor, and memory overhead
 varies between 33% and 300%.
 varies between 33% and 300%.
 
 
-Further, if the number of empty slots becomes too low (recall that linear probing
+If an element is deleted, and the slot behind the deleted element is not empty,
+then the element will leave a tombstone, a bucket marked as deleted. Tombstones
+are only removed by insertions using the tombstone's bucket, or by resizing
+the table. They present the same load to the hash table as a real entry,
+but do not count towards the regular load factor.
+
+Thus, if the number of empty slots becomes too low (recall that linear probing
 for an element not in the index stops at the first empty slot), the hash table
 for an element not in the index stops at the first empty slot), the hash table
-is rebuilt. The maximum *effective* load factor is 93%.
+is rebuilt. The maximum *effective* load factor, i.e. including tombstones, is 93%.
 
 
 Data in a HashIndex is always stored in little-endian format, which increases
 Data in a HashIndex is always stored in little-endian format, which increases
 efficiency for almost everyone, since basically no one uses big-endian processors
 efficiency for almost everyone, since basically no one uses big-endian processors
 any more.
 any more.
 
 
+HashIndex does not use a hashing function, because all keys (save manifest) are
+outputs of a cryptographic hash or MAC and thus already have excellent distribution.
+Thus, HashIndex simply uses the first 32 bits of the key as its "hash".
+
 The format is easy to read and write, because the buckets array has the same layout
 The format is easy to read and write, because the buckets array has the same layout
 in memory and on disk. Only the header formats differ.
 in memory and on disk. Only the header formats differ.
 
 

BIN
docs/internals/object-graph.png


BIN
docs/internals/object-graph.vsd