A software to securely backup your hard-disk on friends' hard-disk
The idea
Making backup in a cost effective way is usually difficult with current technology for indivuals (who can't support the cost of a tape backup system). Considering that hard-disk are quite cheap and have more and more capacity, one solution would be to save the content of your hard-disk on the disk of your friends (and of course, they would do the same on yours). With enough duplication, a crash of a disk would not impact your backups. Of course, you don't want your friends to read your emails, so saved data should be properly encrypted.Specification (or at least something close to it :-)
Note: following details are freely inspired by encfs and TrueCrypt. A backup system is made as a set of nodes, each node running on the machine of an individual and saving parts of his files (probably as a Unix daemon-like program). Each node has an SSL-like private/public key and communicates with other nodes over SSL connections. At system setup, each node knows the public key of other nodes through out-of-band means (email). Each node has a master password, used to generate derivative keys. Each node is configured to save a part of the local file hierachy, with exclusion patterns (for example ~ files). Each node is in charge of the encryption and the copying of its files onto remote nodes. So a node will take care of copying each file of the backup in at least n encrypted copies. Each file copied on a remote node is compressed and encrypted with an individual key, generated from a master key and some random each time the file is copied. A set of rules is enforced by each node so that total available capacity is equally shared between nodes. At launch time, each node knows the total space reserved for the backup system on local hard-disk. It agrees with other node of the group total capacity and the best way to use it. Each saved file is indexed by a 64-bits integer, generated by the source node. On the source node, a global counter is used, initially set to zero and incremented each time a file is copied on a remote node. That way, remote nodes only knows a file number and its size and cannot deduce the file hierarchy on source node, when files are modified (well, maybe some information can be deduced from file size but I consider this a minor issue), etc. On the source node, a master index is maintained. This index stores, for each saved file:- a 64-bits identifier of the saved file on remote nodes;
- cryptographic hash of the original and encrypted file, for coherency check;
- the attributes of the file (creation date, modification date, rights, etc.);
- the original path of the file;
- the set of nodes on which this file is saved;
- information specific to the backup system: total or incremental backup, reference file if incremental, etc.
Transactional system
One issue of the system is that the master index and saved files should be coherent, even in case of crash of one or more nodes at any step of a backup phase. So we need to use a transactional-like system to ensure this. The API between nodes offers a way to start a transaction, and then close it or abort it. If the SSL connection supporting the transaction is interrupted (remote node crash for example), the transaction is aborted. When saving a set of files (e.g. a directory), the source node:- starts a transaction;
- saves a file on a remote node, allocating to it a new identifier;
- repeat previous step for all files of the directory;
- updates locally the index;
- saves the index on remote nodes, as a new one;
- closes successfully the transaction;
- starts a second transaction;
- removes the old files on previous nodes.
- removes the old index on all remote nodes.
- closes successfully the transaction;
- removes locally the old index.
API between nodes
One could use ONC RPC (as defined by IETF) to communicate between nodes. The data stored (files and index) on remote node should be defined in the XDR format. We could use following API:
/* transactions */
start_transaction() -> transaction_id
close_transaction(transaction_id)
abort_transaction(transaction_id)
/* to handle files */
save_file(transaction_id, 64bits_id, file_content)
erase_file(transaction_id, 64bits_id)
get_file(transaction_id, 64bits_id) -> file_content
/* to handle master index */
save_index(transaction_id, version)
erase_index(transaction_id, version)
get_index(transaction_id, version) -> index_content
/* to manage groups */
enter_group() -> computed_group_parameters
leave_group()
pushed_group_update() -> computed_group_parameters
Each API call is identified by the source node, from the connection
used. So each node can uniquely store a source node files into a
separate directory.
Possible user interface for the program
I think it would be nice if the program would have a very simple user interface, like the following:
distrib-backup --initialize-start-group # to initiate a new group
distrib-backup --initialize-join-group a-remote-node # to join an existing group
distrib-backup # usual start
distrib-backup --restore-after-crash a-remote-node # to reconstruct after a crash
The backup software should have a local user interface: either a command
line one (shell like) or a web interface, whichever is the easiest to
implement.
Implementation
In OCaml of course! :) OCaml as already the bindings or libraries for: And of course, the OCaml language has all the properties to make a safe system (garbage collection, strong typing, etc.).Remaining issues
- Define precisely the crytographic parts: which algorithm to use, how to generate keys, how to generate random, etc. Look at TrueCrypt user documentation which has useful information;
- Check the transactional mechanism indeed works;
- Refine the set of rules to allocate disk space;
- See if it possible to avoid bandwith hog, both locally and on remote nodes.
2006-03-18T20:11:59Z [] permanent link
