
Merge pull request #2606 from enkore/docs/internals-rewrite

docs: internals editing
enkore 8 years ago
parent
commit
71495b90e0

+ 153 - 78
docs/internals/data-structures.rst

@@ -6,6 +6,10 @@
 Data structures and file formats
 ================================
 
+.. todo:: Clarify terms, perhaps create a glossary.
+          ID (client?) vs. key (repository?),
+          chunks (blob of data in repo?) vs. object (blob of data in repo, referred to from another object?),
+
 .. _repository:
 
 Repository
@@ -79,10 +83,6 @@ strong hash or MAC.
 Segments
 ~~~~~~~~
 
-A |project_name| repository is a filesystem based transactional key/value
-store. It makes extensive use of msgpack_ to store data and, unless
-otherwise noted, data is stored in msgpack_ encoded files.
-
 Objects referenced by a key are stored inline in files (`segments`) of approx.
 500 MB size in numbered subdirectories of ``repo/data``.
 
@@ -104,12 +104,37 @@ to the file containing the object id and data. If an object is deleted
 a ``DELETE`` entry is appended with the object id.
 
 A ``COMMIT`` tag is written when a repository transaction is
-committed.
+committed. The segment number of the segment containing
+a commit is the **transaction ID**.
 
 When a repository is opened any ``PUT`` or ``DELETE`` operations not
 followed by a ``COMMIT`` tag are discarded since they are part of a
 partial/uncommitted transaction.
 
+Index, hints and integrity
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The **repository index** is stored in ``index.<TRANSACTION_ID>`` and is used to
+determine an object's location in the repository. It is a HashIndex_,
+a hash table using open addressing. It maps object keys_ to two
+unsigned 32-bit integers; the first integer gives the segment number,
+the second indicates the offset of the object's entry within the segment.
+
+The **hints file** is a msgpacked file named ``hints.<TRANSACTION_ID>``.
+It contains:
+
+* version
+* list of segments
+* compact
+
+The **integrity file** is a msgpacked file named ``integrity.<TRANSACTION_ID>``.
+It contains checksums of the index and hints files and is described in the
+:ref:`Checksumming data structures <integrity_repo>` section below.
+
+If the index or hints are corrupted, they are re-generated automatically.
+If they are outdated, segments are replayed from the index state to the currently
+committed transaction.
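
The commit rule described above (operations not followed by a ``COMMIT`` are discarded) can be illustrated with a minimal Python sketch. The tuple-based log and the ``replay`` function are hypothetical stand-ins invented for this example; they are not Borg's segment format or API.

```python
def replay(entries):
    """Return the key -> value state after applying only committed entries.

    `entries` is a hypothetical in-memory stand-in for a segment log:
    tuples of ("PUT", key, value), ("DELETE", key) or ("COMMIT",).
    """
    committed = {}  # state as of the most recent COMMIT
    pending = {}    # working copy that also holds uncommitted changes
    for entry in entries:
        tag = entry[0]
        if tag == "PUT":
            pending[entry[1]] = entry[2]
        elif tag == "DELETE":
            pending.pop(entry[1], None)
        elif tag == "COMMIT":
            # Everything seen so far becomes the durable state.
            committed = dict(pending)
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    # Changes made after the last COMMIT live only in `pending` and are dropped.
    return committed


log = [
    ("PUT", "a", 1),
    ("PUT", "b", 2),
    ("COMMIT",),
    ("DELETE", "a"),  # uncommitted, discarded on open
    ("PUT", "c", 3),  # uncommitted, discarded on open
]
state = replay(log)
```

Here ``state`` contains only the keys written before the ``COMMIT``; the trailing ``DELETE`` and ``PUT`` have no effect.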
+
 Compaction
 ~~~~~~~~~~
 
@@ -253,10 +278,21 @@ If the quota shall be enforced accurately in these cases, either
 - edit the msgpacked ``hints.N`` file (not recommended and thus not
   documented further).
 
+The object graph
+----------------
+
+On top of the simple key-value store offered by the Repository_,
+Borg builds a much more sophisticated data structure that is essentially
+a completely encrypted object graph. Objects, such as archives_, are referenced
+by their chunk ID, which is cryptographically derived from their contents.
+More on how this helps security in :ref:`security_structural_auth`.
+
+.. figure:: object-graph.png
+
 .. _manifest:
 
 The manifest
-------------
+~~~~~~~~~~~~
 
 The manifest is an object with an all-zero key that references all the
 archives. It contains:
@@ -278,24 +314,32 @@ each time an archive is added, modified or deleted.
 .. _archive:
 
 Archives
---------
+~~~~~~~~
 
-The archive metadata does not contain the file items directly. Only
-references to other objects that contain that data. An archive is an
-object that contains:
-
-* version
-* name
-* list of chunks containing item metadata (size: count * ~40B)
-* cmdline
-* hostname
-* username
-* time
+Each archive is an object referenced by the manifest. The archive object
+itself does not store any of the data contained in the archive it describes.
+
+Instead, it contains a list of chunks which form a msgpacked stream of items_.
+The archive object itself further contains some metadata:
+
+* *version*
+* *name*, which might differ from the name set in the manifest.
+  When :ref:`borg_check` rebuilds the manifest (e.g. if it was corrupted) and finds
+  more than one archive object with the same name, it adds a counter to the name
+  in the manifest, but leaves the *name* field of the archives as it was.
+* *items*, a list of chunk IDs containing item metadata (size: count * ~34B)
+* *cmdline*, the command line which was used to create the archive
+* *hostname*
+* *username*
+* *time* and *time_end* are the start and end timestamps, respectively
+* *comment*, a user-specified archive comment
+* *chunker_params* are the :ref:`chunker-params <chunker-params>` used for creating the archive.
+  This is used by :ref:`borg_recreate` to determine whether a given archive needs rechunking.
+* Some other pieces of information related to recreate.
 
 .. _archive_limitation:
 
-Note about archive limitations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. rubric:: Note about archive limitations
 
 The archive is currently stored as a single object in the repository
 and thus limited in size to MAX_OBJECT_SIZE (20MiB).
@@ -324,10 +368,10 @@ also :issue:`1452`.
 .. _item:
 
 Items
------
+~~~~~
 
-Each item represents a file, directory or other fs item and is stored as an
-``item`` dictionary that contains:
+Each item represents a file, directory or other file system item and is stored as a
+dictionary created by the ``Item`` class that contains:
 
 * path
 * list of data chunks (size: count * ~40B)
@@ -336,12 +380,12 @@ Each item represents a file, directory or other fs item and is stored as an
 * uid
 * gid
 * mode (item type + permissions)
-* source (for links)
-* rdev (for devices)
+* source (for symlinks, and for hardlinks within one archive)
+* rdev (for device files)
 * mtime, atime, ctime in nanoseconds
 * xattrs
-* acl
-* bsdfiles
+* acl (various OS-dependent fields)
+* bsdflags
 
 All items are serialized using msgpack and the resulting byte stream
 is fed into the same chunker algorithm as used for regular file data
@@ -356,7 +400,7 @@ A chunk is stored as an object as well, of course.
 .. _chunker_details:
 
 Chunks
-------
+~~~~~~
 
 The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
 It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
@@ -384,24 +428,22 @@ For some more general usage hints see also ``--chunker-params``.
 
 .. _cache:
 
-Indexes / Caches
-----------------
+The cache
+---------
 
 The **files cache** is stored in ``cache/files`` and is used at backup time to
 quickly determine whether a given file is unchanged and we have all its chunks.
 
-The files cache is a key -> value mapping and contains:
+In memory, the files cache is a key -> value mapping (a Python *dict*) and contains:
 
-* key:
-
-  - full, absolute file path id_hash
+* key: id_hash of the encoded, absolute file path
 * value:
 
   - file inode number
   - file size
   - file mtime_ns
-  - list of file content chunk id hashes
   - age (0 [newest], 1, 2, 3, ..., BORG_FILES_CACHE_TTL - 1)
+  - list of chunk ids representing the file's contents
 
 To determine whether a file has not changed, cached values are looked up via
 the key in the mapping and compared to the current file attribute values.
@@ -438,6 +480,10 @@ Borg can also work without using the files cache (saves memory if you have a
 lot of files or not much RAM free), then all files are assumed to have changed.
 This is usually much slower than with files cache.
 
+The on-disk format of the files cache is a stream of msgpacked tuples (key, value).
+Loading the files cache involves reading the file, one msgpack object at a time,
+unpacking it, and msgpacking the value (in an effort to save memory).
+
 The **chunks cache** is stored in ``cache/chunks`` and is used to determine
 whether we already have a specific chunk, to count references to it and also
 for statistics.
@@ -453,53 +499,18 @@ The chunks cache is a key -> value mapping and contains:
   - size
   - encrypted/compressed size
 
-The chunks cache is a hashindex, a hash table implemented in C and tuned for
-memory efficiency.
-
-The **repository index** is stored in ``repo/index.%d`` and is used to
-determine a chunk's location in the repository.
-
-The repo index is a key -> value mapping and contains:
-
-* key:
-
-  - chunk id_hash
-* value:
-
-  - segment (that contains the chunk)
-  - offset (where the chunk is located in the segment)
-
-The repo index is a hashindex, a hash table implemented in C and tuned for
-memory efficiency.
-
-
-Hints are stored in a file (``repo/hints.%d``).
-
-It contains:
-
-* version
-* list of segments
-* compact
-
-hints and index can be recreated if damaged or lost using ``check --repair``.
-
-The chunks cache and the repository index are stored as hash tables, with
-only one slot per bucket, but that spreads the collisions to the following
-buckets. As a consequence the hash is just a start position for a linear
-search, and if the element is not in the table the index is linearly crossed
-until an empty bucket is found.
-
-When the hash table is filled to 75%, its size is grown. When it's
-emptied to 25%, its size is shrinked. So operations on it have a variable
-complexity between constant and linear with low factor, and memory overhead
-varies between 33% and 300%.
+The chunks cache is a HashIndex_. Due to some restrictions of HashIndex,
+the reference count of each given chunk is limited to a constant, MAX_VALUE
+(introduced below in HashIndex_), approximately 2**32.
+If a reference count hits MAX_VALUE, decrementing it yields MAX_VALUE again,
+i.e. the reference count is pinned to MAX_VALUE.
 
 .. _cache-memory-usage:
 
 Indexes / Caches memory usage
 -----------------------------
 
-Here is the estimated memory usage of |project_name| - it's complicated:
+Here is the estimated memory usage of |project_name| - it's complicated::
 
   chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
 
@@ -513,13 +524,12 @@ Here is the estimated memory usage of |project_name| - it's complicated:
              = chunk_count * 164 + total_file_count * 240
 
 Due to the hashtables, the best/usual/worst cases for memory allocation can
-be estimated like that:
+be estimated like that::
 
   mem_allocation = mem_usage / load_factor  # l_f = 0.25 .. 0.75
 
   mem_allocation_peak = mem_allocation * (1 + growth_factor)  # g_f = 1.1 .. 2
 
-
 All units are Bytes.
 
 It is assuming every chunk is referenced exactly once (if you have a lot of
@@ -556,6 +566,69 @@ b) with ``create --chunker-params 19,23,21,4095`` (default):
    You'll save some memory, but it will need to read / chunk all the files as
    it can not skip unmodified files then.
 
+HashIndex
+---------
+
+The chunks cache and the repository index are stored as hash tables, with
+only one slot per bucket, spreading hash collisions to the following
+buckets. As a consequence the hash is just a start position for a linear
+search. If a key is looked up that is not in the table, then the hash table
+is searched from the start position (the hash) until the first empty
+bucket is reached.
+
+This particular mode of operation is open addressing with linear probing.
+
+When the hash table is filled to 75%, its size is grown. When it's
+emptied to 25%, its size is shrunk. Operations on it have a variable
+complexity between constant and linear with low factor, and memory overhead
+varies between 33% and 300%.
+
+If an element is deleted, and the slot behind the deleted element is not empty,
+then the element will leave a tombstone, a bucket marked as deleted. Tombstones
+are only removed by insertions using the tombstone's bucket, or by resizing
+the table. They present the same load to the hash table as a real entry,
+but do not count towards the regular load factor.
+
+Thus, if the number of empty slots becomes too low (recall that linear probing
+for an element not in the index stops at the first empty slot), the hash table
+is rebuilt. The maximum *effective* load factor, i.e. including tombstones, is 93%.
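
The lookup, insert and tombstone rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the C implementation: the ``ProbingTable`` class and its sentinels are hypothetical, resizing and rebuilding are omitted, and the start position is assumed to be the first four key bytes read as a little-endian integer modulo the bucket count.

```python
EMPTY = object()    # sentinel: bucket never used (or safely cleared)
DELETED = object()  # sentinel: tombstone left behind by a deletion


class ProbingTable:
    """Open addressing with linear probing; fixed size, no resizing."""

    def __init__(self, num_buckets=8):
        self.buckets = [EMPTY] * num_buckets

    def _start(self, key):
        # Assumption: the first 32 bits of the key serve as the "hash".
        return int.from_bytes(key[:4], "little") % len(self.buckets)

    def lookup(self, key):
        n = len(self.buckets)
        idx = self._start(key)
        for _ in range(n):
            slot = self.buckets[idx]
            if slot is EMPTY:
                return None  # probing stops at the first empty bucket
            if slot is not DELETED and slot[0] == key:
                return slot[1]
            idx = (idx + 1) % n  # tombstones and other keys are skipped
        return None

    def insert(self, key, value):
        n = len(self.buckets)
        idx = self._start(key)
        free = None  # first reusable bucket (tombstone or empty) on the path
        for _ in range(n):
            slot = self.buckets[idx]
            if slot is EMPTY:
                if free is None:
                    free = idx
                break
            if slot is DELETED:
                if free is None:
                    free = idx
            elif slot[0] == key:
                self.buckets[idx] = (key, value)  # overwrite existing entry
                return
            idx = (idx + 1) % n
        if free is None:
            raise RuntimeError("table full; a real table would have resized")
        self.buckets[free] = (key, value)

    def delete(self, key):
        n = len(self.buckets)
        idx = self._start(key)
        for _ in range(n):
            slot = self.buckets[idx]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                following = self.buckets[(idx + 1) % n]
                # A tombstone is only needed if a probe sequence may continue
                # past this bucket, i.e. the following bucket is not empty.
                self.buckets[idx] = EMPTY if following is EMPTY else DELETED
                return True
            idx = (idx + 1) % n
        return False
```

Keys whose first four bytes agree modulo the bucket count collide and end up in consecutive buckets, so deleting one from the middle of such a run leaves a tombstone, while deleting the last one simply empties its bucket.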
+
+Data in a HashIndex is always stored in little-endian format, which increases
+efficiency for almost everyone, since basically no one uses big-endian processors
+any more.
+
+HashIndex does not use a hashing function, because all keys (save manifest) are
+outputs of a cryptographic hash or MAC and thus already have excellent distribution.
+Thus, HashIndex simply uses the first 32 bits of the key as its "hash".
+
+The format is easy to read and write, because the buckets array has the same layout
+in memory and on disk. Only the header formats differ. The on-disk header is
+``struct HashHeader``:
+
+- First, the HashIndex magic, the eight byte ASCII string "BORG_IDX".
+- Second, the signed 32-bit number of entries (i.e. buckets which are not deleted and not empty).
+- Third, the signed 32-bit number of buckets, i.e. the length of the buckets array
+  contained in the file, and the modulus for index calculation.
+- Fourth, the signed 8-bit length of keys.
+- Fifth, the signed 8-bit length of values. This has to be at least four bytes.
+
+All fields are packed.
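
As a rough illustration of the header layout listed above, the fields can be packed and unpacked with Python's ``struct`` module. The helper names ``pack_header`` and ``unpack_header`` are hypothetical, and the format string assumes the header integers are little-endian like the rest of the index data.

```python
import struct

# "<" selects little-endian with no padding, matching "all fields are packed":
# 8-byte magic, int32 entry count, int32 bucket count, int8 key size, int8 value size.
HASH_HEADER = struct.Struct("<8siibb")
MAGIC = b"BORG_IDX"


def pack_header(num_entries, num_buckets, key_size, value_size):
    """Build the on-disk header bytes (hypothetical helper)."""
    if value_size < 4:
        raise ValueError("value size has to be at least four bytes")
    return HASH_HEADER.pack(MAGIC, num_entries, num_buckets, key_size, value_size)


def unpack_header(data):
    """Parse header bytes and return the four numeric fields."""
    magic, num_entries, num_buckets, key_size, value_size = HASH_HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a HashIndex file: bad magic")
    return num_entries, num_buckets, key_size, value_size


header = pack_header(num_entries=3, num_buckets=1031, key_size=32, value_size=12)
fields = unpack_header(header)
```

With this layout the header occupies 18 bytes (8 + 4 + 4 + 1 + 1), and the buckets array would follow directly after it.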
+
+The HashIndex is *not* a general purpose data structure.
+The value size must be at least 4 bytes, and these first bytes are used for in-band
+signalling in the data structure itself.
+
+The constant MAX_VALUE (defined as 2**32-1025 = 4294966271) defines the valid range for
+these 4 bytes when interpreted as an uint32_t from 0 to MAX_VALUE (inclusive).
+The following reserved values beyond MAX_VALUE are currently in use (byte order is LE):
+
+- 0xffffffff marks empty buckets in the hash table
+- 0xfffffffe marks deleted buckets in the hash table
+
+HashIndex is implemented in C and wrapped with Cython in a class-based interface.
+The Cython wrapper checks every passed value against these reserved values and
+raises an AssertionError if they are used.
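
The in-band signalling described above can be sketched in Python as follows. The function names are hypothetical and only mirror the described behaviour (reserved markers above MAX_VALUE, an AssertionError for out-of-range values); they are not the Cython wrapper's actual interface.

```python
import struct

MAX_VALUE = 2**32 - 1025       # largest value a caller may store
EMPTY_MARKER = 0xFFFFFFFF      # marks empty buckets
DELETED_MARKER = 0xFFFFFFFE    # marks deleted buckets (tombstones)


def classify_bucket(value_bytes):
    """Classify a bucket by the first 4 value bytes, read as little-endian uint32."""
    (first,) = struct.unpack_from("<I", value_bytes)
    if first == EMPTY_MARKER:
        return "empty"
    if first == DELETED_MARKER:
        return "deleted"
    if first <= MAX_VALUE:
        return "used"
    return "reserved"  # above MAX_VALUE but not currently assigned a meaning


def validate_value(first_field):
    """Reject values that would collide with the reserved range."""
    if not 0 <= first_field <= MAX_VALUE:
        raise AssertionError(f"value {first_field:#x} is outside 0..MAX_VALUE")
    return first_field
```

Keeping a gap of 1024 values between MAX_VALUE and 2**32 - 1 leaves room for further markers beyond the two in use.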
+
 Encryption
 Encryption
 ----------
 
@@ -862,6 +935,8 @@ which writes the integrity data to a separate ".integrity" file.
 Integrity errors result in deleting the affected index and rebuilding it.
 This logs a warning and increases the exit code to WARNING (1).
 
+.. _integrity_repo:
+
 .. rubric:: Repository index and hints
 
 The repository associates index and hints files with a transaction by including the

BIN
docs/internals/object-graph.png


BIN
docs/internals/object-graph.vsd


+ 2 - 0
docs/internals/security.rst

@@ -37,6 +37,8 @@ Under these circumstances Borg guarantees that the attacker cannot
 The attacker can always impose a denial of service per definition (he could
 forbid connections to the repository, or delete it entirely).
 
+.. _security_structural_auth:
+
 Structural Authentication
 -------------------------
 

BIN
docs/internals/structure.png


BIN
docs/internals/structure.vsd