Scrub
Scrubbing backups is needed to ensure data consistency over time.
Reasons for Scrubbing
Benji divides the data you back up into blocks. These blocks are referenced by the metadata stored in the database backend. When restoring images, these blocks are read and restored to form the original image. Because Benji also deduplicates data, a single invalid block can affect multiple versions and therefore multiple image backups.
Invalid blocks can occur for the following reasons (probably incomplete):
Bit rot / data degradation (https://en.wikipedia.org/wiki/Data_degradation)
Software failure when writing the block for the first time
OS errors and bugs
Human error: Deleting or modifying blocks by accident
Software errors in Benji and in other tools involved
Scrubbing Methods
Benji implements three different scrubbing methods. Each of these methods accepts the --block-percentage (short form -p) option. With it you can limit the scrubbing to a randomly selected percentage of the blocks.
Attention
When using the --block-percentage option with a value of less than 100 percent with any of the deep scrubbing commands, a version that has been marked as invalid in the past won’t be marked as valid again. Only a full, successful deep-scrub will do that.
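The following Python sketch illustrates the idea behind --block-percentage and the rule from the note above. It is not Benji's implementation: the function names select_blocks and may_revalidate are hypothetical, and the per-block coin flip is only one plausible way to draw a random sample.

```python
import random


def select_blocks(block_uids, block_percentage, rng=None):
    """Return a random subset covering roughly block_percentage percent of the blocks."""
    if not 0 < block_percentage <= 100:
        raise ValueError("block_percentage must be greater than 0 and at most 100")
    if block_percentage == 100:
        return list(block_uids)
    if rng is None:
        rng = random.Random()
    # Assumption: each block is kept independently with the given probability.
    return [uid for uid in block_uids if rng.random() * 100 < block_percentage]


def may_revalidate(block_percentage, all_checked_blocks_ok):
    """Only a full, successful deep-scrub may mark an invalid version as valid again."""
    return block_percentage == 100 and all_checked_blocks_ok
```

With this rule a partial scrub can still detect new damage, but it can never clear an existing invalid mark, because unchecked blocks might be the damaged ones.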
Consistency and Checksum
$ benji deep-scrub --help
usage: benji deep-scrub [-h] [-s SOURCE] [-p BLOCK_PERCENTAGE] version_uid
positional arguments:
  version_uid           Version UID
optional arguments:
  -h, --help            show this help message and exit
  -s SOURCE, --source SOURCE
                        Additionally compare version against source URL
                        (default: None)
  -p BLOCK_PERCENTAGE, --block-percentage BLOCK_PERCENTAGE
                        Check only a certain percentage of blocks (default:
                        100)
For each block in a version Benji reads the block’s metadata (UID and checksum) from the database backend, reads the actual block by its UID from the storage, calculates its checksum and compares it to the originally recorded checksum. If the checksums are not the same, the block is marked as invalid and won’t be used for deduplication anymore. The scrubbed version itself and all other versions that reference this block are also marked as invalid.
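The procedure described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Benji's code: the storage and the block references are plain dictionaries, the function names are hypothetical, and SHA-256 merely stands in for whatever hash function is configured.

```python
import hashlib


def checksum(data):
    # Assumption: SHA-256 stands in for the configured hash function.
    return hashlib.sha256(data).hexdigest()


def deep_scrub(version_blocks, storage, versions_by_block):
    """Verify every block of one version.

    version_blocks:    list of (block_uid, recorded_checksum) from the database
    storage:           dict mapping block_uid to the stored bytes
    versions_by_block: dict mapping block_uid to the set of version UIDs using it
    Returns (invalid_blocks, invalid_versions).
    """
    invalid_blocks = set()
    invalid_versions = set()
    for uid, recorded in version_blocks:
        data = storage.get(uid)
        # A missing block or a checksum mismatch invalidates the block ...
        if data is None or checksum(data) != recorded:
            invalid_blocks.add(uid)
            # ... and every version that references it, because of deduplication.
            invalid_versions |= versions_by_block.get(uid, set())
    return invalid_blocks, invalid_versions
```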
Using the Backup Source
benji deep-scrub --source <snapshot> <version_uid>
In addition to the consistency and checksum checks, Benji can also compare the backup data to the original backup source when you specify the --source option. The comparison is done byte by byte. Although this is an additional safeguard against data corruption, it requires that the backup source is still present, and it puts additional load on the backup source.
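As an illustration of what a byte-by-byte comparison involves, the sketch below compares two binary streams chunk by chunk and reports the offset of the first difference. The function first_difference is hypothetical and not part of Benji, and it assumes streams such as regular files or io.BytesIO that return full chunks until the end of the data.

```python
import io


def first_difference(backup, source, chunk_size=4096):
    """Return the offset of the first differing byte, or None if both streams match."""
    offset = 0
    while True:
        a = backup.read(chunk_size)
        b = source.read(chunk_size)
        if a != b:
            for i, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    return offset + i
            # No differing byte in the common prefix: one stream ended early.
            return offset + min(len(a), len(b))
        if not a:
            return None  # both streams are exhausted and identical
        offset += len(a)
```

Both streams have to be read in full, which is why this check puts load on the backup source.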
Consistency Only
$ benji scrub --help
usage: benji scrub [-h] [-p BLOCK_PERCENTAGE] version_uid
positional arguments:
  version_uid           Version UID
optional arguments:
  -h, --help            show this help message and exit
  -p BLOCK_PERCENTAGE, --block-percentage BLOCK_PERCENTAGE
                        Check only a certain percentage of blocks (default:
                        100)
With this command Benji only checks the metadata consistency between the metadata saved in the database and the metadata accompanying each block on the storage. It also checks if the block exists and has the right length as reported by the storage provider. The actual data is not checked in this case.
This mode of operation can be a useful addition to deep-scrubs if you pay for data downloads from the storage provider or your bandwidth is limited. It is not a replacement for deep-scrubs, but it lets you reduce their frequency.
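A consistency-only check of a single block could look roughly like the following sketch. The BlockMeta fields and the scrub_block function are hypothetical and only model the three checks named above (existence, matching metadata, matching length); the real on-storage metadata format is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    # Hypothetical metadata record; the field names are assumptions.
    uid: str
    size: int
    checksum: str


def scrub_block(db_meta, storage_meta, stored_length):
    """Return a list of problems; an empty list means the block passed.

    storage_meta and stored_length are None if the block does not exist.
    The block data itself is never downloaded or hashed here.
    """
    if storage_meta is None or stored_length is None:
        return ["block missing on storage"]
    problems = []
    if storage_meta != db_meta:
        problems.append("metadata mismatch")
    if stored_length != db_meta.size:
        problems.append("length mismatch")
    return problems
```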
Batch Scrubbing
Benji also supports two commands to facilitate batch scrubbing of versions: benji batch-scrub and benji batch-deep-scrub:
$ benji batch-scrub --help
usage: benji batch-scrub [-h] [-p BLOCK_PERCENTAGE] [-P VERSION_PERCENTAGE]
                         [-g GROUP_LABEL]
                         [filter_expression]
positional arguments:
  filter_expression     Version filter expression (default: None)
optional arguments:
  -h, --help            show this help message and exit
  -p BLOCK_PERCENTAGE, --block-percentage BLOCK_PERCENTAGE
                        Check only a certain percentage of blocks (default:
                        100)
  -P VERSION_PERCENTAGE, --version-percentage VERSION_PERCENTAGE
                        Check only a certain percentage of versions (default:
                        100)
  -g GROUP_LABEL, --group_label GROUP_LABEL
                        Label to find related versions (default: None)
$ benji batch-deep-scrub --help
usage: benji batch-deep-scrub [-h] [-p BLOCK_PERCENTAGE]
                              [-P VERSION_PERCENTAGE] [-g GROUP_LABEL]
                              [filter_expression]
positional arguments:
  filter_expression     Version filter expression (default: None)
optional arguments:
  -h, --help            show this help message and exit
  -p BLOCK_PERCENTAGE, --block-percentage BLOCK_PERCENTAGE
                        Check only a certain percentage of blocks (default:
                        100)
  -P VERSION_PERCENTAGE, --version-percentage VERSION_PERCENTAGE
                        Check only a certain percentage of versions (default:
                        100)
  -g GROUP_LABEL, --group_label GROUP_LABEL
                        Label to find related versions (default: None)
Both accept an optional filter expression. All versions matching the expression will be scrubbed. If you don’t specify an expression, all versions will be checked.
To limit the selection further, for example to versions carrying a certain label, add a condition such as labels["priority"] == "high" to the filter expression. Several conditions can be combined with or.
By default all matching versions will be scrubbed. But you can also randomly select a certain sample of these versions with --version-percentage (short form -P). A version’s size isn’t taken into account when selecting the sample; every version is equally eligible.
The batch scrubbing commands also accept the --block-percentage (short form -p) option.
Unlike benji deep-scrub, benji batch-deep-scrub doesn’t support the --source option.
This is a good use case for labels: you could mark your versions with labels denoting the importance of the backed-up data. Then you could scrub each class of versions differently:
# 14% of the versions are deep scrubbed for data of high importance
$ benji batch-deep-scrub --version-percentage 14 'labels["priority"] == "high"'
# 7% of the versions are deep scrubbed for data of medium importance
$ benji batch-deep-scrub --version-percentage 7 'labels["priority"] == "medium"'
# 3% of the versions are deep scrubbed for data of low importance
$ benji batch-deep-scrub --version-percentage 3 'labels["priority"] == "low"'
# 3% of the versions are scrubbed when they contain reproducible scratch data or don't have a priority label
$ benji batch-scrub --version-percentage 3 'labels["priority"] == "scratch" or not labels["priority"]'
If you ran this schedule every day, you’d statistically scrub the important data completely about every seven days, data of medium importance about every fourteen days, and low-priority data about every month. Scratch data would also be scrubbed completely about every month, but only metadata consistency and block existence would be checked.
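The intervals above follow from simple arithmetic, shown in the sketch below. It assumes that every daily run draws a fresh, independent random sample in which each version is equally likely to be picked; the function names are hypothetical.

```python
def expected_days_between_scrubs(version_percentage):
    """Average number of daily runs until a given version is selected."""
    return 100 / version_percentage


def probability_unscrubbed(version_percentage, days):
    """Chance that a given version was not selected in any of `days` daily runs."""
    return (1 - version_percentage / 100) ** days


for label, pct in [("high", 14), ("medium", 7), ("low", 3)]:
    print(label, round(expected_days_between_scrubs(pct), 1))
```

Under these assumptions the averages are about 7, 14 and 33 days. Because the sample is random, individual versions can go considerably longer without a scrub: with 14 percent per day, roughly 35 percent of the versions have not been selected after seven days.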
Scrubbing Failures
If scrubbing finds invalid blocks, these blocks are marked as invalid in the metadata store. However, such blocks will persist and not be deleted.
Also, the versions affected by such invalid blocks are marked as invalid. Such versions can no longer be the base (i.e. benji backup -f, see Differential Backup) for differential backups; Benji will throw an error if you try.
However, invalid versions can still be restored, so a single invalid block will not break the restore process. Instead, the log output will clearly state that invalid data was restored.
You can find invalid versions by looking at the output of benji ls:
$ benji ls
INFO: $ benji ls
+---------------------+-------------+------+----------+----------+------------+-------+-----------+------+
| date                | uid         | name | snapshot | size     | block_size | valid | protected | tags |
+---------------------+-------------+------+----------+----------+------------+-------+-----------+------+
| 2018-06-07T12:51:19 | V0000000001 | test |          | 41943040 | 4194304    | False | False     |      |
+---------------------+-------------+------+----------+----------+------------+-------+-----------+------+
Note
Multiple versions can be affected by a single block as Benji does deduplication and one block can belong to multiple versions, even to different images.
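If you want to pick out invalid versions automatically, a small script can parse the table printed by benji ls. The sketch below assumes the table layout shown above, with a valid column containing True or False; the function invalid_versions is hypothetical and not part of Benji.

```python
def invalid_versions(ls_output):
    """Return the UIDs of all versions whose 'valid' column reads False."""
    header = None
    uids = []
    for line in ls_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, log lines and the echoed command
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first table row carries the column names
            continue
        row = dict(zip(header, cells))
        if row.get("valid") == "False":
            uids.append(row["uid"])
    return uids
```

You could feed it the captured output of the command, for example from subprocess.run(["benji", "ls"], capture_output=True, text=True).stdout.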