Some cryptographic details distributed backup
As a followup to my previous note on a distributed backup system, I give here some details on its cryptography. Most of this note comes from Niels Ferguson and Bruce Schneier Practical Cryptography book. Feel free to criticize my choices.;)
Cryptographic primitives
I intend to use Niels and Schneier Fortuna random number generator. Compared to their original design, I won't implement my entropy pool system but use the/dev/random facility of
Linux. That's not perfect but certainly much easier to implement. This
random number generator is used to generate per file (or per backup
phase) specific key, each time a file is saved. This gives perfect
forward secrecy in case a key is compromised.
AES 256 bits is used as block cipher. I could also use Serpent 256 bits,
or even both of them cascated, but I don't know if this has any value.
As block chaining mode, use Cipher Block Chaining (CBC) mode with a
Nonce-Generated Initialization Vector (IV). The nonce is generated by
encrypting with the file key the 64-bits file counter, incremented each
time a file is saved (i.e. a new encryption is made).
(* + is xor addition *)
C0 := E(K, 64-bits-counter)
Ci := E(K, Pi + C{i-1}) for i = 1, .. k (k = number of blocks)
As cryptograhic hash function, use two call to SHA256d defined from
SHA-256 as:
SHA256d(m) := SHA-256(SHA-256(m))
As Message Authentication Code, use HMAC constructed with SHA-256
(K is the MAC key, m the message, a and
b are two specified constants):
(* + is xor operation *)
HMAC(K,m) := SHA-256((K + a) || SHA-256((K + b) || m))
Protection of master index
The master index stores the list of saved file with their encryption key and other associated information. This master index is protected by a password, known by the user. As such a password has not enough entropy, so a salting and streching technique is used to reinforce password strength. Firstly, a 256-bits random salt is generated and stored alongside the master index. Then, a cryptographic hash function SHA256d is used repeatedly as follow:
x0 := 0
xi := SHA256d(x{i-1} || password || salt) for i = 1, .., r
K := xr
r is the maximum number of iterations that can be made during
200-1000 ms. K is the final master key for the master index.
Each time the master index is encrypted, a new Initialization Vector is
needed. For this, we use as nonce the global 64-bits file counter by
taking a value from it.
To check that the master index as been correctly decrypted once the user
has entered his password or that the master index has not been modified,
the compressed master index is added a Message Authentication Code
before being encrypted. The MAC key and encryption/decryption keys are
generated from the derived password key with following formulas, where
"string" is just a simple 16-bytes string concatenated to the
key:
MAC_key := SHA256d(password-derived-K || "authentication01")
encryption_key := SHA256d(password-derived-K || "encryptionabcdef")
Before being stored on remote nodes, the master index is thus
compressed, added a MAC and encrypted.
Encryption and authentication of files
In my initial design, each file is encrypted with a freshly generated 256 bits key, the nonce used for Initialization Vector being the freshly allocated 64-bits file identifier. As that might be too costly, I think it would be easier to use a unique key for the set of files making a backup phase. So I indent to use following algorithm:
For each backup phase:
1. generate a random encryption key
2. generate a random Message Authentication Code key
3. for each file
a. take a value and increment the global 64-bits counter
b. compress the file
c. compute file MAC with MAC key
d. encrypt the file and its MAC with IV and encryption key
d. store an entry in master index, with file identifier and
used keys (encryption and MAC)
We store for each file the used keys, so in there is no need to make
cleanup in the index file, as each file can be re-encrypted with a
different key in a latter backup phase.
As said in my previous note, it is not necessary to encrypt directories
and empty files, except putting an entry for them in the master
index. As an optimisation, if the compressed file is smaller than 32
bytes (the size of an encryption and MAC keys), it is more efficient
effective to store it directly in the master index. The real question
is: are there so many files of that type?
The main issue with above design is that file size is immediately
visible to attackers. A refinement of the design would be to use a fixed
block size (e.g. 8 or 16 Kbytes) and partition a file or a set of files
amongst several blocks. For a complete backup phase, following algorithm
would be used:
For each backup phase:
1. compress all files and sort them according to there size
2. starting from the biggest one:
if the file is bigger that a block:
split it in the needed amount of blocks and pad the last
block with dummy data (zeros)
else (the file is smaller than a block):
put it in a new block and try to fit as much as possible
other files in the same block
3. for each generated block, give it a 64-bits block identifier
and encrypt it
For each file, store in the master index the used keys (encryption and
MAC), the list of block identifiers where parts of the file is in,
compressed file size and offset of the file within the first block.
For very big files, it would be impractical to compress them in advance,
so one would compress, split them in blocks and compute MAC and
encryption on each block on the fly.
To conclude
I've used fairly standard cryptography, following the advice of others and being pretty conservative. However, I might have missed something: please tell it to me!;)
A side note: to improve the backup program latency, one should
seperate encryption and block segmentation on one side from remote
backup phase and use on the other side using a set of threads to do the
job in parallel. Moreover, in order to improve network usage, one should
allow to send several blocks/files in the same RPC call.
2006-03-20T22:42:40Z [] permanent link
