
docs: data structures: demingle cache and repo index

Marian Beermann 8 years ago
Parent
Current commit
2b0e7bc924
1 file changed, with 65 insertions and 44 deletions

+ 65 - 44
docs/internals/data-structures.rst

@@ -104,12 +104,37 @@ to the file containing the object id and data. If an object is deleted
 a ``DELETE`` entry is appended with the object id.
 
 A ``COMMIT`` tag is written when a repository transaction is
-committed.
+committed. The segment number of the segment containing
+a commit is the **transaction ID**.
 
 When a repository is opened any ``PUT`` or ``DELETE`` operations not
 followed by a ``COMMIT`` tag are discarded since they are part of a
 partial/uncommitted transaction.
 
+Index, hints and integrity
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The **repository index** is stored in ``index.<TRANSACTION_ID>`` and is used to
+determine an object's location in the repository. It is a HashIndex_,
+a hash table using open addressing. It maps object keys_ to two
+unsigned 32-bit integers; the first integer gives the segment number,
+the second indicates the offset of the object's entry within the segment.
+
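The value layout described above can be illustrated with a minimal sketch. It assumes the two integers are packed back to back in little-endian order (the HashIndex section of this document says index data is little-endian); the helper names ``pack_location`` and ``unpack_location`` are hypothetical and not part of Borg.

```python
import struct

# Assumed layout of one index value: two unsigned 32-bit little-endian
# integers, (segment number, offset within the segment).
ENTRY_VALUE = struct.Struct("<II")


def pack_location(segment, offset):
    """Pack a (segment, offset) pair into 8 bytes (hypothetical helper)."""
    return ENTRY_VALUE.pack(segment, offset)


def unpack_location(raw):
    """Unpack 8 bytes back into a (segment, offset) tuple (hypothetical helper)."""
    return ENTRY_VALUE.unpack(raw)
```

With this layout each value occupies 8 bytes, so the lookup of an object's location reduces to finding its key in the hash table and decoding two integers.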
+The **hints file** is a msgpacked file named ``hints.<TRANSACTION_ID>``.
+It contains:
+
+* version
+* list of segments
+* compact
+
+The **integrity file** is a msgpacked file named ``integrity.<TRANSACTION_ID>``.
+It contains checksums of the index and hints files and is described in the
+:ref:`Checksumming data structures <integrity_repo>` section below.
+
+If the index or hints are corrupted, they are re-generated automatically.
+If they are outdated, segments are replayed from the index state to the currently
+committed transaction.
+
 Compaction
 ~~~~~~~~~~
 
@@ -384,13 +409,13 @@ For some more general usage hints see also ``--chunker-params``.
 
 .. _cache:
 
-Indexes / Caches
-----------------
+The cache
+---------
 
 The **files cache** is stored in ``cache/files`` and is used at backup time to
 quickly determine whether a given file is unchanged and we have all its chunks.
 
-The files cache is a key -> value mapping and contains:
+In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:
 
 * key:
 
@@ -438,6 +463,10 @@ Borg can also work without using the files cache (saves memory if you have a
 lot of files or not much RAM free), then all files are assumed to have changed.
 This is usually much slower than with files cache.
 
+The on-disk format of the files cache is a stream of msgpacked tuples (key, value).
+Loading the files cache involves reading the file, one msgpack object at a time,
+unpacking it, and msgpacking the value (in an effort to save memory).
+
 The **chunks cache** is stored in ``cache/chunks`` and is used to determine
 whether we already have a specific chunk, to count references to it and also
 for statistics.
@@ -453,46 +482,7 @@ The chunks cache is a key -> value mapping and contains:
   - size
   - encrypted/compressed size
 
-The chunks cache is a hashindex, a hash table implemented in C and tuned for
-memory efficiency.
-
-The **repository index** is stored in ``repo/index.%d`` and is used to
-determine a chunk's location in the repository.
-
-The repo index is a key -> value mapping and contains:
-
-* key:
-
-  - chunk id_hash
-* value:
-
-  - segment (that contains the chunk)
-  - offset (where the chunk is located in the segment)
-
-The repo index is a hashindex, a hash table implemented in C and tuned for
-memory efficiency.
-
-
-Hints are stored in a file (``repo/hints.%d``).
-
-It contains:
-
-* version
-* list of segments
-* compact
-
-hints and index can be recreated if damaged or lost using ``check --repair``.
-
-The chunks cache and the repository index are stored as hash tables, with
-only one slot per bucket, but that spreads the collisions to the following
-buckets. As a consequence the hash is just a start position for a linear
-search, and if the element is not in the table the index is linearly crossed
-until an empty bucket is found.
-
-When the hash table is filled to 75%, its size is grown. When it's
-emptied to 25%, its size is shrinked. So operations on it have a variable
-complexity between constant and linear with low factor, and memory overhead
-varies between 33% and 300%.
+The chunks cache is a HashIndex_.
 
 .. _cache-memory-usage:
 
@@ -556,6 +546,35 @@ b) with ``create --chunker-params 19,23,21,4095`` (default):
    You'll save some memory, but it will need to read / chunk all the files as
    it can not skip unmodified files then.
 
+HashIndex
+---------
+
+The chunks cache and the repository index are stored as hash tables, with
+only one slot per bucket, spreading hash collisions to the following
+buckets. As a consequence the hash is just a start position for a linear
+search, and if the element is not in the table the table is scanned linearly
+until an empty bucket is found.
+
+This particular mode of operation is open addressing with linear probing.
+
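A toy sketch of open addressing with linear probing, as described above. This is an illustration only, not Borg's C implementation: the class name ``LinearProbingTable`` is hypothetical, and the sketch omits deletion and resizing.

```python
class LinearProbingTable:
    """Toy open-addressing hash table with linear probing (no deletion, no resizing)."""

    def __init__(self, num_buckets=8):
        # One slot per bucket; None marks an empty bucket.
        self.buckets = [None] * num_buckets

    def _probe(self, key):
        """Yield bucket indices starting at the hash position, wrapping around once."""
        start = hash(key) % len(self.buckets)
        for step in range(len(self.buckets)):
            yield (start + step) % len(self.buckets)

    def put(self, key, value):
        for i in self._probe(key):
            slot = self.buckets[i]
            # Take the first empty bucket, or overwrite an existing entry for key.
            if slot is None or slot[0] == key:
                self.buckets[i] = (key, value)
                return
        raise RuntimeError("table full; a real implementation grows before this happens")

    def get(self, key):
        for i in self._probe(key):
            slot = self.buckets[i]
            if slot is None:
                # First empty bucket: the key cannot be stored further along.
                return None
            if slot[0] == key:
                return slot[1]
        return None
```

In an 8-bucket table, two keys whose hashes are equal modulo 8 start at the same bucket, and the second one inserted lands in the next free bucket. A lookup for a missing key walks the same sequence and stops at the first empty bucket, which is why the table must never become completely full.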
+When the hash table is filled to 75%, it is grown. When it's
+emptied to 25%, it is shrunk. Operations on it have a variable
+complexity between constant and linear with a low factor, and memory overhead
+varies between 33% and 300%.
+
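The 33% and 300% figures can be reproduced with a short calculation, assuming "memory overhead" means unused buckets relative to used buckets (this reading is an assumption; the text does not define the term). The function name ``memory_overhead`` is hypothetical.

```python
def memory_overhead(load_factor):
    """Unused buckets relative to used buckets, for a table filled to load_factor."""
    return (1 - load_factor) / load_factor


print(f"{memory_overhead(0.75):.0%}")  # at the grow threshold: 33%
print(f"{memory_overhead(0.25):.0%}")  # at the shrink threshold: 300%
```

Under this reading, a table that is 75% full carries one empty bucket for every three used ones, and a table that is 25% full carries three empty buckets for every used one.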
+Further, if the number of empty slots becomes too low (recall that linear probing
+for an element not in the index stops at the first empty slot), the hash table
+is rebuilt. The maximum *effective* load factor is 93%.
+
+Data in a HashIndex is always stored in little-endian format, which increases
+efficiency for almost everyone, since basically no one uses big-endian processors
+any more.
+
+The format is easy to read and write, because the buckets array has the same layout
+in memory and on disk. Only the header formats differ.
+
+.. todo:: Describe HashHeader
+
 Encryption
 ----------
 
@@ -862,6 +881,8 @@ which writes the integrity data to a separate ".integrity" file.
 Integrity errors result in deleting the affected index and rebuilding it.
 This logs a warning and increases the exit code to WARNING (1).
 
+.. _integrity_repo:
+
 .. rubric:: Repository index and hints
 
 The repository associates index and hints files with a transaction by including the