| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448 | 
							- .. include:: global.rst.inc
 
- .. _internals:
 
- Internals
 
- =========
 
- This page documents the internal data structures and storage
 
- mechanisms of |project_name|. It is partly based on `mailing list
 
- discussion about internals`_ and also on static code analysis.
 
- Repository and Archives
 
- -----------------------
 
- |project_name| stores its data in a `Repository`. Each repository can
 
- hold multiple `Archives`, which represent individual backups that
 
- contain a full archive of the files specified when the backup was
 
- performed. Deduplication is performed across multiple backups, both on
 
- data and metadata, using `Chunks` created by the chunker using the Buzhash_
 
- algorithm.
 
- Each repository has the following file structure:
 
- README
 
-   simple text file telling that this is a |project_name| repository
 
- config
 
-   repository configuration
 
- data/
 
-   directory where the actual data is stored
 
- hints.%d
 
-   hints for repository compaction
 
- index.%d
 
-   repository index
 
- lock.roster and lock.exclusive/*
 
-   used by the locking system to manage shared and exclusive locks
 
- Lock files
 
- ----------
 
- |project_name| uses locks to get (exclusive or shared) access to the cache and
 
- the repository.
 
- The locking system is based on creating a directory `lock.exclusive` (for
 
- exclusive locks). Inside the lock directory, there is a file indication
 
- hostname, process id and thread id of the lock holder.
 
- There is also a json file `lock.roster` that keeps a directory of all shared
 
- and exclusive lockers.
 
- If the process can create the `lock.exclusive` directory for a resource, it has
 
- the lock for it. If creation fails (because the directory has already been
 
- created by some other process), lock acquisition fails.
 
- The cache lock is usually in `~/.cache/borg/REPOID/lock.*`.
 
- The repository lock is in `repository/lock.*`.
 
- In case you run into troubles with the locks, you can just delete the `lock.*`
 
- directory and file IF you first make sure that no |project_name| process is
 
- running on any machine that accesses this resource. Be very careful, the cache
 
- or repository might get damaged if multiple processes use it at the same time.
 
- Config file
 
- -----------
 
- Each repository has a ``config`` file which which is a ``INI``-style file
 
- and looks like this::
 
-     [repository]
 
-     version = 1
 
-     segments_per_dir = 10000
 
-     max_segment_size = 5242880
 
-     id = 57d6c1d52ce76a836b532b0e42e677dec6af9fca3673db511279358828a21ed6
 
- This is where the ``repository.id`` is stored. It is a unique
 
- identifier for repositories. It will not change if you move the
 
- repository around so you can make a local transfer then decide to move
 
- the repository to another (even remote) location at a later time.
 
- Keys
 
- ----
 
- The key to address the key/value store is usually computed like this:
 
- key = id = id_hash(unencrypted_data)
 
- The id_hash function is:
 
- * sha256 (no encryption keys available)
 
- * hmac-sha256 (encryption keys available)
 
- Segments and archives
 
- ---------------------
 
- A |project_name| repository is a filesystem based transactional key/value
 
- store. It makes extensive use of msgpack_ to store data and, unless
 
- otherwise noted, data is stored in msgpack_ encoded files.
 
- Objects referenced by a key are stored inline in files (`segments`) of approx.
 
- 5MB size in numbered subdirectories of ``repo/data``.
 
- They contain:
 
- * header size
 
- * crc
 
- * size
 
- * tag
 
- * key
 
- * data
 
- Segments are built locally, and then uploaded. Those files are
 
- strictly append-only and modified only once.
 
- Tag is either ``PUT``, ``DELETE``, or ``COMMIT``. A segment file is
 
- basically a transaction log where each repository operation is
 
- appended to the file. So if an object is written to the repository a
 
- ``PUT`` tag is written to the file followed by the object id and
 
- data. If an object is deleted a ``DELETE`` tag is appended
 
- followed by the object id. A ``COMMIT`` tag is written when a
 
- repository transaction is committed.  When a repository is opened any
 
- ``PUT`` or ``DELETE`` operations not followed by a ``COMMIT`` tag are
 
- discarded since they are part of a partial/uncommitted transaction.
 
- The manifest
 
- ------------
 
- The manifest is an object with an all-zero key that references all the
 
- archives.
 
- It contains:
 
- * version
 
- * list of archive infos
 
- * timestamp
 
- * config
 
- Each archive info contains:
 
- * name
 
- * id
 
- * time
 
- It is the last object stored, in the last segment, and is replaced
 
- each time.
 
- The Archive
 
- -----------
 
- The archive metadata does not contain the file items directly. Only
 
- references to other objects that contain that data. An archive is an
 
- object that contains:
 
- * version
 
- * name
 
- * list of chunks containing item metadata
 
- * cmdline
 
- * hostname
 
- * username
 
- * time
 
- The Item
 
- --------
 
- Each item represents a file, directory or other fs item and is stored as an
 
- ``item`` dictionary that contains:
 
- * path
 
- * list of data chunks
 
- * user
 
- * group
 
- * uid
 
- * gid
 
- * mode (item type + permissions)
 
- * source (for links)
 
- * rdev (for devices)
 
- * mtime
 
- * xattrs
 
- * acl
 
- * bsdfiles
 
- ``ctime`` (change time) is not stored because there is no API to set
 
- it and it is reset every time an inode's metadata is changed.
 
- All items are serialized using msgpack and the resulting byte stream
 
- is fed into the same chunker used for regular file data and turned
 
- into deduplicated chunks. The reference to these chunks is then added
 
- to the archive metadata.
 
- A chunk is stored as an object as well, of course.
 
- Chunks
 
- ------
 
- The |project_name| chunker uses a rolling hash computed by the Buzhash_ algorithm.
 
- It triggers (chunks) when the last HASH_MASK_BITS bits of the hash are zero,
 
- producing chunks of 2^HASH_MASK_BITS Bytes on average.
 
- create --chunker-params CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE
 
- can be used to tune the chunker parameters, the default is:
 
- - CHUNK_MIN_EXP = 10 (minimum chunk size = 2^10 B = 1 kiB)
 
- - CHUNK_MAX_EXP = 23 (maximum chunk size = 2^23 B = 8 MiB)
 
- - HASH_MASK_BITS = 16 (statistical medium chunk size ~= 2^16 B = 64 kiB)
 
- - HASH_WINDOW_SIZE = 4095 [B] (`0xFFF`)
 
- The default parameters are OK for relatively small backup data volumes and
 
- repository sizes and a lot of available memory (RAM) and disk space for the
 
- chunk index. If that does not apply, you are advised to tune these parameters
 
- to keep the chunk count lower than with the defaults.
 
- The buzhash table is altered by XORing it with a seed randomly generated once
 
- for the archive, and stored encrypted in the keyfile. This is to prevent chunk
 
- size based fingerprinting attacks on your encrypted repo contents (to guess
 
- what files you have based on a specific set of chunk sizes).
 
- Indexes / Caches
 
- ----------------
 
- The **files cache** is stored in ``cache/files`` and is indexed on the
 
- ``file path hash``. At backup time, it is used to quickly determine whether we
 
- need to chunk a given file (or whether it is unchanged and we already have all
 
- its pieces).
 
- It contains:
 
- * age
 
- * file inode number
 
- * file size
 
- * file mtime_ns
 
- * file content chunk hashes
 
- The inode number is stored to make sure we distinguish between
 
- different files, as a single path may not be unique across different
 
- archives in different setups.
 
- The files cache is stored as a python associative array storing
 
- python objects, which generates a lot of overhead.
 
- The **chunks cache** is stored in ``cache/chunks`` and is indexed on the
 
- ``chunk id_hash``. It is used to determine whether we already have a specific
 
- chunk, to count references to it and also for statistics.
 
- It contains:
 
- * reference count
 
- * size
 
- * encrypted/compressed size
 
- The **repository index** is stored in ``repo/index.%d`` and is indexed on the
 
- ``chunk id_hash``. It is used to determine a chunk's location in the repository.
 
- It contains:
 
- * segment (that contains the chunk)
 
- * offset (where the chunk is located in the segment)
 
- The repository index file is random access.
 
- Hints are stored in a file (``repo/hints.%d``).
 
- It contains:
 
- * version
 
- * list of segments
 
- * compact
 
- hints and index can be recreated if damaged or lost using ``check --repair``.
 
- The chunks cache and the repository index are stored as hash tables, with
 
- only one slot per bucket, but that spreads the collisions to the following
 
- buckets. As a consequence the hash is just a start position for a linear
 
- search, and if the element is not in the table the index is linearly crossed
 
- until an empty bucket is found.
 
- When the hash table is almost full at 90%, its size is doubled. When it's
 
- almost empty at 25%, its size is halved. So operations on it have a variable
 
- complexity between constant and linear with low factor, and memory overhead
 
- varies between 10% and 300%.
 
- Indexes / Caches memory usage
 
- -----------------------------
 
- Here is the estimated memory usage of |project_name|:
 
-   chunk_count ~= total_file_size / 2 ^ HASH_MASK_BITS
 
-   repo_index_usage = chunk_count * 40
 
-   chunks_cache_usage = chunk_count * 44
 
-   files_cache_usage = total_file_count * 240 + chunk_count * 80
 
-   mem_usage ~= repo_index_usage + chunks_cache_usage + files_cache_usage
 
-              = chunk_count * 164 + total_file_count * 240
 
- All units are Bytes.
 
- It is assuming every chunk is referenced exactly once (if you have a lot of
 
- duplicate chunks, you will have less chunks than estimated above).
 
- It is also assuming that typical chunk size is 2^HASH_MASK_BITS (if you have
 
- a lot of files smaller than this statistical medium chunk size, you will have
 
- more chunks than estimated above, because 1 file is at least 1 chunk).
 
- If a remote repository is used the repo index will be allocated on the remote side.
 
- E.g. backing up a total count of 1Mi files with a total size of 1TiB.
 
- a) with create --chunker-params 10,23,16,4095 (default):
 
-   mem_usage  =  2.8GiB
 
- b) with create --chunker-params 10,23,20,4095 (custom):
 
-   mem_usage  =  0.4GiB
 
- Note: there is also the --no-files-cache option to switch off the files cache.
 
- You'll save some memory, but it will need to read / chunk all the files then as
 
- it can not skip unmodified files then.
 
- Encryption
 
- ----------
 
- AES_ is used in CTR mode (so no need for padding). A 64bit initialization
 
- vector is used, a `HMAC-SHA256`_ is computed on the encrypted chunk with a
 
- random 64bit nonce and both are stored in the chunk.
 
- The header of each chunk is : ``TYPE(1)`` + ``HMAC(32)`` + ``NONCE(8)`` + ``CIPHERTEXT``.
 
- Encryption and HMAC use two different keys.
 
- In AES CTR mode you can think of the IV as the start value for the counter.
 
- The counter itself is incremented by one after each 16 byte block.
 
- The IV/counter is not required to be random but it must NEVER be reused.
 
- So to accomplish this |project_name| initializes the encryption counter to be
 
- higher than any previously used counter value before encrypting new data.
 
- To reduce payload size, only 8 bytes of the 16 bytes nonce is saved in the
 
- payload, the first 8 bytes are always zeros. This does not affect security but
 
- limits the maximum repository capacity to only 295 exabytes (2**64 * 16 bytes).
 
- Encryption keys are either derived from a passphrase or kept in a key file.
 
- The passphrase is passed through the ``BORG_PASSPHRASE`` environment variable
 
- or prompted for interactive usage.
 
- Key files
 
- ---------
 
- When initialized with the ``init -e keyfile`` command, |project_name|
 
- needs an associated file in ``$HOME/.borg/keys`` to read and write
 
- the repository. The format is based on msgpack_, base64 encoding and
 
- PBKDF2_ SHA256 hashing, which is then encoded again in a msgpack_.
 
- The internal data structure is as follows:
 
- version
 
-   currently always an integer, 1
 
- repository_id
 
-   the ``id`` field in the ``config`` ``INI`` file of the repository.
 
- enc_key
 
-   the key used to encrypt data with AES (256 bits)
 
-   
 
- enc_hmac_key
 
-   the key used to HMAC the encrypted data (256 bits)
 
- id_key
 
-   the key used to HMAC the plaintext chunk data to compute the chunk's id
 
- chunk_seed
 
-   the seed for the buzhash chunking table (signed 32 bit integer)
 
- Those fields are processed using msgpack_. The utf-8 encoded passphrase
 
- is processed with PBKDF2_ (SHA256_, 100000 iterations, random 256 bit salt)
 
- to give us a derived key. The derived key is 256 bits long.
 
- A `HMAC-SHA256`_ checksum of the above fields is generated with the derived
 
- key, then the derived key is also used to encrypt the above pack of fields.
 
- Then the result is stored in a another msgpack_ formatted as follows:
 
- version
 
-   currently always an integer, 1
 
- salt
 
-   random 256 bits salt used to process the passphrase
 
- iterations
 
-   number of iterations used to process the passphrase (currently 100000)
 
- algorithm
 
-   the hashing algorithm used to process the passphrase and do the HMAC
 
-   checksum (currently the string ``sha256``)
 
- hash
 
-   the HMAC of the encrypted derived key
 
- data
 
-   the derived key, encrypted with AES over a PBKDF2_ SHA256 key
 
-   described above
 
- The resulting msgpack_ is then encoded using base64 and written to the
 
- key file, wrapped using the standard ``textwrap`` module with a header.
 
- The header is a single line with a MAGIC string, a space and a hexadecimal
 
- representation of the repository id.
 
- Compression
 
- -----------
 
- |project_name| supports the following compression methods:
 
- - none (no compression, pass through data 1:1)
 
- - lz4 (low compression, but super fast)
 
- - zlib (level 0-9, level 0 is no compression [but still adding zlib overhead],
 
-   level 1 is low, level 9 is high compression)
 
- - lzma (level 0-9, level 0 is low, level 9 is high compression).
 
- Speed:  none > lz4 > zlib > lzma
 
- Compression: lzma > zlib > lz4 > none
 
- Be careful, higher zlib and especially lzma compression levels might take a
 
- lot of resources (CPU and memory).
 
- The overall speed of course also depends on the speed of your target storage.
 
- If that is slow, using a higher compression level might yield better overall
 
- performance. You need to experiment a bit. Maybe just watch your CPU load, if
 
- that is relatively low, increase compression until 1 core is 70-100% loaded.
 
- Even if your target storage is rather fast, you might see interesting effects:
 
- while doing no compression at all (none) is a operation that takes no time, it
 
- likely will need to store more data to the storage compared to using lz4.
 
- The time needed to transfer and store the additional data might be much more
 
- than if you had used lz4 (which is super fast, but still might compress your
 
- data about 2:1). This is assuming your data is compressible (if you backup
 
- already compressed data, trying to compress them at backup time is usually
 
- pointless).
 
- Compression is applied after deduplication, thus using different compression
 
- methods in one repo does not influence deduplication.
 
- See ``borg create --help`` about how to specify the compression level and its default.
 
 
  |