
add --json-lines option to diff command (#5710)

diff: add --json-lines option

Co-authored-by: Robert Blenis <r.blenis@visionxyz.com>
Robert Blenis 4 years ago
parent
commit
5c7e2857ad
4 changed files with 203 additions and 24 deletions
  1. 79 7
      docs/internals/frontends.rst
  2. 9 1
      docs/usage/diff.rst
  3. 37 16
      src/borg/archiver.py
  4. 78 0
      src/borg/testsuite/archiver.py

+ 79 - 7
docs/internals/frontends.rst

@@ -231,11 +231,16 @@ Standard output
 *stdout* is different and more command-dependent than logging. Commands like :ref:`borg_info`, :ref:`borg_create`
 and :ref:`borg_list` implement a ``--json`` option which turns their regular output into a single JSON object.
 
+Some commands, like :ref:`borg_list` and :ref:`borg_diff`, can produce *a lot* of JSON. Since many JSON implementations
+don't support a streaming mode of operation, which is pretty much required to deal with this amount of JSON, these
+commands implement a ``--json-lines`` option which generates output in the `JSON lines <http://jsonlines.org/>`_ format,
+which is simply a number of JSON objects separated by new lines.
+
 Dates are formatted according to ISO 8601 in local time. No explicit time zone is specified *at this time*
 (subject to change). The equivalent strftime format string is '%Y-%m-%dT%H:%M:%S.%f',
 e.g. ``2017-08-07T12:27:20.123456``.
 
-The root object at least contains a *repository* key with an object containing:
+The root object of '--json' output will contain at least a *repository* key with an object containing:
 
 id
     The ID of the repository, normally 64 hex characters
@@ -438,12 +443,7 @@ The same archive with more information (``borg info --last 1 --json``)::
 File listings
 +++++++++++++
 
-Listing the contents of an archive can produce *a lot* of JSON. Since many JSON implementations
-don't support a streaming mode of operation, which is pretty much required to deal with this amount of
-JSON, output is generated in the `JSON lines <http://jsonlines.org/>`_ format, which is simply
-a number of JSON objects separated by new lines.
-
-Each item (file, directory, ...) is described by one object in the :ref:`borg_list` output.
+Each archive item (file, directory, ...) is described by one object in the :ref:`borg_list` output.
 Refer to the *borg list* documentation for the available keys and their meaning.
 
 Example (excerpt) of ``borg list --json-lines``::
@@ -451,6 +451,78 @@ Example (excerpt) of ``borg list --json-lines``::
     {"type": "d", "mode": "drwxr-xr-x", "user": "user", "group": "user", "uid": 1000, "gid": 1000, "path": "linux", "healthy": true, "source": "", "linktarget": "", "flags": null, "mtime": "2017-02-27T12:27:20.023407", "size": 0}
     {"type": "d", "mode": "drwxr-xr-x", "user": "user", "group": "user", "uid": 1000, "gid": 1000, "path": "linux/baz", "healthy": true, "source": "", "linktarget": "", "flags": null, "mtime": "2017-02-27T12:27:20.585407", "size": 0}
 
+Archive Differencing
+++++++++++++++++++++
+
+Each archive difference item (file contents, user/group/mode) output by :ref:`borg_diff` is represented by an *ItemDiff* object.
+The properties of an *ItemDiff* object are:
+
+path:
+    The filename/path of the *Item* (file, directory, symlink).
+
+changes:
+    A list of *Change* objects describing how the item differs between the two archives. For example,
+    there will be two changes if both the contents of a file and its ownership are changed.
+
+The *Change* object can contain a number of properties depending on the type of change that occurred.
+If a 'property' is not required for the type of change, it is not output.
+The possible properties of a *Change* object are:
+
+type:
+  The **type** property is always present. It identifies the type of change and will be one of these values:
+  
+  - *modified* - file contents changed.
+  - *added* - the file was added.
+  - *removed* - the file was removed.
+  - *added directory* - the directory was added.
+  - *removed directory* - the directory was removed.
+  - *added link* - the symlink was added.
+  - *removed link* - the symlink was removed.
+  - *changed link* - the symlink target was changed.
+  - *mode* - the file/directory/link mode was changed. Note that this can also indicate a change from one
+    item type to another (file/directory/link), for example when a file is deleted and replaced
+    with a directory of the same name.
+  - *owner* - user and/or group ownership changed.
+
+size:
+    If **type** == '*added*' or '*removed*', then **size** provides the size of the added or removed file.
+
+added:
+    If **type** == '*modified*' and chunk ids can be compared, then **added** and **removed** indicate the amount
+    of data 'added' and 'removed'. If chunk ids can not be compared, then **added** and **removed** properties are
+    not provided and the only information available is that the file contents were modified.
+
+removed:
+    See **added** property.
+    
+old_mode:
+    If **type** == '*mode*', then **old_mode** and **new_mode** provide the mode and permissions changes.
+
+new_mode:
+    See **old_mode** property.
+ 
+old_user:
+    If **type** == '*owner*', then **old_user**, **new_user**, **old_group** and **new_group** provide the user
+    and group ownership changes.
+
+old_group:
+    See **old_user** property.
+ 
+new_user:
+    See **old_user** property.
+ 
+new_group:
+    See **old_user** property.
+    
+
+Example (excerpt) of ``borg diff --json-lines``::
+
+    {"path": "file1", "changes": [{"type": "modified", "added": 17, "removed": 5}, {"type": "mode", "old_mode": "-rw-r--r--", "new_mode": "-rwxr-xr-x"}]}
+    {"path": "file2", "changes": [{"type": "modified", "added": 135, "removed": 252}]}
+    {"path": "file4", "changes": [{"type": "added", "size": 0}]}
+    {"path": "file3", "changes": [{"type": "removed", "size": 0}]}
+
+
 .. _msgid:
 
 Message IDs
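
Note on consuming the documented format: the *ItemDiff* / *Change* description above can be read one line at a time, which is the point of JSON Lines. The following is a minimal sketch, not part of this commit; the helper names ``iter_item_diffs`` and ``summarize`` and the summary text format are assumptions made for illustration. It relies only on the properties documented above and treats a *modified* change without **added**/**removed** as "contents changed, sizes unknown".

```python
import json
from typing import Iterable, Iterator


def iter_item_diffs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one ItemDiff dict per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(item_diff: dict) -> str:
    """Build a short summary of one ItemDiff (hypothetical text format)."""
    parts = []
    for change in item_diff["changes"]:
        kind = change["type"]
        if kind == "modified" and "added" in change:
            # added/removed are only present when chunk ids could be compared
            parts.append(f"modified (+{change['added']} B, -{change['removed']} B)")
        elif kind in ("added", "removed"):
            parts.append(f"{kind} ({change['size']} B)")
        elif kind == "mode":
            parts.append(f"mode {change['old_mode']} -> {change['new_mode']}")
        elif kind == "owner":
            parts.append(
                f"owner {change['old_user']}:{change['old_group']}"
                f" -> {change['new_user']}:{change['new_group']}"
            )
        else:
            # e.g. "added directory", "changed link", or "modified" without sizes
            parts.append(kind)
    return f"{item_diff['path']}: {', '.join(parts)}"
```

Because ``iter_item_diffs`` accepts any iterable of strings, it can be fed ``sys.stdin`` or the ``stdout`` of a subprocess without loading the whole diff into memory.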

+ 9 - 1
docs/usage/diff.rst

@@ -16,6 +16,7 @@ Examples
     $ echo "something" >> file2
     $ borg create ../testrepo::archive2 .
 
+    $ echo "testing 123" >> file1
     $ rm file3
     $ touch file4
     $ borg create ../testrepo::archive3 .
@@ -26,11 +27,18 @@ Examples
        +135 B    -252 B file2
 
     $ borg diff testrepo::archive2 archive3
+        +17 B      -5 B file1
     added           0 B file4
     removed         0 B file3
 
     $ borg diff testrepo::archive1 archive3
-    [-rw-r--r-- -> -rwxr-xr-x] file1
+        +17 B      -5 B [-rw-r--r-- -> -rwxr-xr-x] file1
        +135 B    -252 B file2
     added           0 B file4
     removed         0 B file3
+
+    $ borg diff --json-lines testrepo::archive1 archive3
+    {"path": "file1", "changes": [{"type": "modified", "added": 17, "removed": 5}, {"type": "mode", "old_mode": "-rw-r--r--", "new_mode": "-rwxr-xr-x"}]}
+    {"path": "file2", "changes": [{"type": "modified", "added": 135, "removed": 252}]}
+    {"path": "file4", "changes": [{"type": "added", "size": 0}]}
+    {"path": "file3", "changes": [{"type": "removed", "size": 0}]}

+ 37 - 16
src/borg/archiver.py

@@ -1087,11 +1087,11 @@ class Archiver:
             # regular file is replaced with a link or vice versa, it is
             # indicated in compare_mode instead.
             if item1.get('deleted'):
-                return 'added link'
+                return ({"type": 'added link'}, 'added link')
             elif item2.get('deleted'):
-                return 'removed link'
+                return ({"type": 'removed link'}, 'removed link')
             elif 'source' in item1 and 'source' in item2 and item1.source != item2.source:
-                return 'changed link'
+                return ({"type": 'changed link'}, 'changed link')
 
         def contents_changed(item1, item2):
             if item1.get('deleted') != item2.get('deleted'):
@@ -1111,35 +1111,41 @@ class Archiver:
         def compare_content(path, item1, item2):
             if contents_changed(item1, item2):
                 if item1.get('deleted'):
-                    return 'added {:>13}'.format(format_file_size(sum_chunk_size(item2)))
+                    sz = sum_chunk_size(item2)
+                    return ({"type": "added", "size": sz}, 'added {:>13}'.format(format_file_size(sz)))
                 if item2.get('deleted'):
-                    return 'removed {:>11}'.format(format_file_size(sum_chunk_size(item1)))
+                    sz = sum_chunk_size(item1)
+                    return ({"type": "removed", "size": sz}, 'removed {:>11}'.format(format_file_size(sz)))
                 if not can_compare_chunk_ids:
-                    return 'modified'
+                    return ({"type": "modified"}, "modified")
                 chunk_ids1 = {c.id for c in item1.chunks}
                 chunk_ids2 = {c.id for c in item2.chunks}
                 added_ids = chunk_ids2 - chunk_ids1
                 removed_ids = chunk_ids1 - chunk_ids2
                 added = sum_chunk_size(item2, added_ids)
                 removed = sum_chunk_size(item1, removed_ids)
-                return '{:>9} {:>9}'.format(format_file_size(added, precision=1, sign=True),
-                                            format_file_size(-removed, precision=1, sign=True))
+                return ({"type": "modified", "added": added, "removed": removed},
+                        '{:>9} {:>9}'.format(format_file_size(added, precision=1, sign=True),
+                        format_file_size(-removed, precision=1, sign=True)))
 
         def compare_directory(item1, item2):
             if item2.get('deleted') and not item1.get('deleted'):
-                return 'removed directory'
+                return ({"type": 'removed directory'}, 'removed directory')
             elif item1.get('deleted') and not item2.get('deleted'):
-                return 'added directory'
+                return ({"type": 'added directory'}, 'added directory')
 
         def compare_owner(item1, item2):
             user1, group1 = get_owner(item1)
             user2, group2 = get_owner(item2)
             if user1 != user2 or group1 != group2:
-                return '[{}:{} -> {}:{}]'.format(user1, group1, user2, group2)
+                return ({"type": "owner", "old_user": user1, "old_group": group1, "new_user": user2, "new_group": group2},
+                        '[{}:{} -> {}:{}]'.format(user1, group1, user2, group2))
 
         def compare_mode(item1, item2):
             if item1.mode != item2.mode:
-                return '[{} -> {}]'.format(get_mode(item1), get_mode(item2))
+                mode1 = get_mode(item1)
+                mode2 = get_mode(item2)
+                return ({"type": "mode", "old_mode": mode1, "new_mode": mode2}, '[{} -> {}]'.format(mode1, mode2))
 
         def compare_items(output, path, item1, item2, hardlink_masters, deleted=False):
             """
@@ -1167,17 +1173,26 @@ class Archiver:
                 changes.append(compare_owner(item1, item2))
                 changes.append(compare_mode(item1, item2))
 
+            # changes is a list of (json, text) tuples, one per detected change:  [({"type": ..}, 'text'), ..]
             changes = [x for x in changes if x]
             if changes:
-                output_line = (remove_surrogates(path), ' '.join(changes))
+                output_line = (remove_surrogates(path), changes)
 
+                # if sorting, save changes for later, otherwise go ahead and output the results as they are generated.
                 if args.sort:
                     output.append(output_line)
+                elif args.json_lines:
+                    print_json_output(output_line)
                 else:
-                    print_output(output_line)
+                    print_text_output(output_line)
 
-        def print_output(line):
-            print("{:<19} {}".format(line[1], line[0]))
+        def print_text_output(line):
+            path, diff = line
+            print("{:<19} {}".format(' '.join([txt for j, txt in diff]), path))
+
+        def print_json_output(line):
+            path, diff = line
+            print(json.dumps({"path": path, "changes": [j for j, txt in diff]}))
 
         def compare_archives(archive1, archive2, matcher):
             def hardlink_master_seen(item):
@@ -1243,6 +1258,10 @@ class Archiver:
                 assert hardlink_master_seen(item2)
                 compare_items(output, item1.path, item1, item2, hardlink_masters)
 
+            print_output = print_json_output if args.json_lines else print_text_output
+
+            # if we wanted sorted output (args.sort is true), then results are collected in 'output' and
+            # need to be sorted before printing. Otherwise results were already printed and 'output' is empty.
             for line in sorted(output):
                 print_output(line)
 
@@ -3649,6 +3668,8 @@ class Archiver:
                                help='Override check of chunker parameters.')
         subparser.add_argument('--sort', dest='sort', action='store_true',
                                help='Sort the output lines by file path.')
+        subparser.add_argument('--json-lines', action='store_true',
+                               help='Format output as JSON Lines.')
         subparser.add_argument('location', metavar='REPO::ARCHIVE1',
                                type=location_validator(archive=True),
                                help='repository location and ARCHIVE1 name')
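
Note on the pattern used in this file: each ``compare_*`` helper now returns a ``(json_dict, text)`` pair, and the two printers pick one half of each pair. The following is a simplified, standalone sketch of that idea and not the code from the commit: ``compare_mode`` takes plain mode strings here instead of borg items, and ``format_text`` / ``format_json`` return strings instead of printing. Sorting by an explicit path key is a choice of this sketch, so that the dict payloads are never compared.

```python
import json


def compare_mode(old_mode, new_mode):
    """Return a (json_dict, text) pair, or None when the mode is unchanged."""
    if old_mode != new_mode:
        return ({"type": "mode", "old_mode": old_mode, "new_mode": new_mode},
                "[{} -> {}]".format(old_mode, new_mode))


def format_text(line):
    """Render the text halves of all changes, followed by the path."""
    path, diff = line
    return "{:<19} {}".format(" ".join(txt for _, txt in diff), path)


def format_json(line):
    """Render the JSON halves of all changes as one JSON Lines record."""
    path, diff = line
    return json.dumps({"path": path, "changes": [j for j, _ in diff]})


def sort_lines(output):
    """Sort (path, changes) lines by path only (assumption of this sketch)."""
    return sorted(output, key=lambda line: line[0])
```

Keeping both representations in one tuple means the comparison logic runs once per item, and the output format is decided only at print time.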

+ 78 - 0
src/borg/testsuite/archiver.py

@@ -3731,9 +3731,87 @@ class DiffArchiverTestCase(ArchiverTestCaseBase):
             if are_hardlinks_supported():
                 assert 'input/hardlink_target_replaced' not in output
 
+        def do_json_asserts(output, can_compare_ids):
+            def get_changes(filename, data):
+                chgsets = [j['changes'] for j in data if j['path'] == filename]
+                assert len(chgsets) < 2
+                # return a flattened list of changes for given filename
+                return [chg for chgset in chgsets for chg in chgset]
+
+            # convert output to list of dicts
+            joutput = [json.loads(line) for line in output.split('\n') if line]
+
+            # File contents changed (deleted and replaced with a new file)
+            expected = {'type': 'modified', 'added': 4096, 'removed': 1024} if can_compare_ids else {'type': 'modified'}
+            assert expected in get_changes('input/file_replaced', joutput)
+
+            # File unchanged
+            assert not any(get_changes('input/file_unchanged', joutput))
+
+            # Directory replaced with a regular file
+            if 'BORG_TESTS_IGNORE_MODES' not in os.environ:
+                assert {'type': 'mode', 'old_mode': 'drwxr-xr-x', 'new_mode': '-rwxr-xr-x'} in \
+                    get_changes('input/dir_replaced_with_file', joutput)
+
+            # Basic directory cases
+            assert {'type': 'added directory'} in get_changes('input/dir_added', joutput)
+            assert {'type': 'removed directory'} in get_changes('input/dir_removed', joutput)
+
+            if are_symlinks_supported():
+                # Basic symlink cases
+                assert {'type': 'changed link'} in get_changes('input/link_changed', joutput)
+                assert {'type': 'added link'} in get_changes('input/link_added', joutput)
+                assert {'type': 'removed link'} in get_changes('input/link_removed', joutput)
+
+                # Symlink replacing or being replaced
+                assert any(chg['type'] == 'mode' and chg['new_mode'].startswith('l') for chg in
+                    get_changes('input/dir_replaced_with_link', joutput))
+                assert any(chg['type'] == 'mode' and chg['old_mode'].startswith('l') for chg in
+                    get_changes('input/link_replaced_by_file', joutput))
+
+                # Symlink target removed. Should not affect the symlink at all.
+                assert not any(get_changes('input/link_target_removed', joutput))
+
+            # The inode has two links and the file contents changed. Borg
+            # should notice the changes in both links. However, the symlink
+            # pointing to the file is not changed.
+            expected = {'type': 'modified', 'added': 13, 'removed': 0} if can_compare_ids else {'type': 'modified'}
+            assert expected in get_changes('input/empty', joutput)
+            if are_hardlinks_supported():
+                assert expected in get_changes('input/hardlink_contents_changed', joutput)
+            if are_symlinks_supported():
+                assert not any(get_changes('input/link_target_contents_changed', joutput))
+
+            # Added a new file and a hard link to it. Both links to the same
+            # inode should appear as separate files.
+            assert {'type': 'added', 'size': 2048} in get_changes('input/file_added', joutput)
+            if are_hardlinks_supported():
+                assert {'type': 'added', 'size': 2048} in get_changes('input/hardlink_added', joutput)
+
+            # check if a diff between non-existent and empty new file is found
+            assert {'type': 'added', 'size': 0} in get_changes('input/file_empty_added', joutput)
+
+            # The inode has two links and both of them are deleted. They should
+            # appear as two deleted files.
+            assert {'type': 'removed', 'size': 256} in get_changes('input/file_removed', joutput)
+            if are_hardlinks_supported():
+                assert {'type': 'removed', 'size': 256} in get_changes('input/hardlink_removed', joutput)
+
+            # Another link (marked previously as the source in borg) to the
+            # same inode was removed. This should not change this link at all.
+            if are_hardlinks_supported():
+                assert not any(get_changes('input/hardlink_target_removed', joutput))
+
+            # Another link (marked previously as the source in borg) to the
+            # same inode was replaced with a new regular file. This should not
+            # change this link at all.
+            if are_hardlinks_supported():
+                assert not any(get_changes('input/hardlink_target_replaced', joutput))
+
         do_asserts(self.cmd('diff', self.repository_location + '::test0', 'test1a'), True)
         # We expect exit_code=1 due to the chunker params warning
         do_asserts(self.cmd('diff', self.repository_location + '::test0', 'test1b', exit_code=1), False)
+        do_json_asserts(self.cmd('diff', self.repository_location + '::test0', 'test1a', '--json-lines'), True)
 
     def test_sort_option(self):
         self.cmd('init', '--encryption=repokey', self.repository_location)