yet another blog : en
Sat, 03 Jun 2006
As part of my
diseba
project, I'm looking at ways to implement a secure backup on a remote
machine. It appears that there is a simple way to ensure that the backup
can only be read by yourself: encrypt it with gpg using your public key
(
gpg --encrypt --recipient is your friend). Only you with
access to your private key can decrypt the backup. Of course, as a good
friend told me, you'd better save your private key in a safe place
outside your backup! :-)
Such a system could be implemented in existing softwares like
backup2l or
BackupPC.
2006-06-02T22:04:19Z [/ideas]
permanent link
Wed, 29 Mar 2006
When implementing
ocamlscript, I
realized that argument passing for a shebang script (a Unix script
starting with
#!) was rather undocumented. So here is a
tiny documentation of this rather contrived argument passing convention.
Suppose you have the OCaml program:
let _ =
for i = 0 to Array.length Sys.argv - 1 do
Format.printf "argv(%d): \"%s\"@\n" i Sys.argv.(i)
done
Compiled with:
ocamlopt -o argv-test argv-test.ml.
And suppose you have following
script-with-args shebang
script:
#!./argv-test intarg1 intarg2 intarg3
and following
script-without-args shebang script:
#!./argv-test
As a reference, if
argv-test is called directly, you have:
$ ./argv-test arg1 arg2 arg3
argv(0): "./argv-test"
argv(1): "arg1"
argv(2): "arg2"
argv(3): "arg3"
Now, let's compare the different ways of calling the shebang script:
$ ./script-without-args
argv(0): "./argv-test"
argv(1): "./script-without-args"
$ ./script-without-args arg1 arg2 arg3
argv(0): "./argv-test"
argv(1): "./script-without-args"
argv(2): "arg1"
argv(3): "arg2"
argv(4): "arg3"
$ ./script-with-args
argv(0): "./argv-test"
argv(1): "intarg1 intarg2 intarg3"
argv(2): "./script-with-args"
$ ./script-with-args arg1 arg2 arg3
argv(0): "./argv-test"
argv(1): "intarg1 intarg2 intarg3"
argv(2): "./script-with-args"
argv(3): "arg1"
argv(4): "arg2"
argv(5): "arg3"
You see the pattern? :) If the shebang script is called with arguments
given on the shebang line, those arguments are available in a single
string as argv(1). Otherwise, argv(1) is equal to the name of the
shebang script itself. If any argument in given on the command line to
the shebang script, they are available as normal argument, after binary
name, optional internal arguments and script name. To determine if the
shebang script has internal arguments: if the called binary has at least
three arguments, one should test if argv(1) is the name of an real file
(and thus the script has not argument) or not (and thus the script has
arguments).
As a side note, if one wants to use the
env utility:
$ cat ./env-script
#!/usr/bin/env ./argv-test
$ ./env-script
argv(0): "./argv-test"
argv(1): "./env-script"
$ ./env-script arg1 arg2 arg3
argv(0): "./argv-test"
argv(1): "./env-script"
argv(2): "arg1"
argv(3): "arg2"
argv(4): "arg3"
However
env does not support arguments on the shebang line:
$ cat ./env-script-with-args
#!/usr/bin/env ./argv-test intarg1 intarg2 intarg3
$ ./env-script-with-args
/usr/bin/env: ./argv-test intarg1 intarg2 intarg3: No such file or directory
2006-03-29T19:38:24Z [/misc]
permanent link
Tue, 28 Mar 2006
Sat, 25 Mar 2006
Continuing on the idea of a solution for distributed and secure backup,
I have now a more clear view of the software. Compared to my
first
original idea, I changed some details. However the
cryptographic
details could be kept as is. And I have found a name for the project:
diseba. :)
Overview
The general idea is the same as in my first note: save the content of
your hard disk on your friends' hard-disks, in such a way that only you
can read those copies (through strong encryption). Moreover, the network
of friends is closed (i.e. you must agree with the others to enter the
"club").
The main idea is to divide the backup into two main steps: storing the
content to backup into a set of encrypted blocks and then distributing
those encrypted blocks onto the hard-disks of your friends. More
specifically, a backup phase is divided into four steps:
- Determine which files are new or updated;
- Cut them in fixed size and encrypted blocks, on local disk, along
with a Master Index that contains all needed information to
restore a backup;
- Save those blocks on the remote hard-disks, distributing enough
copies to ensure long term availability. Two methods could be used:
rsync or a bittorrent-like distribution;
- Send to remote nodes the list of blocks and master index we want
to keep, all other blocks or master indexes for this node being
erased.
The main disadvantage of this approach is that it might use, in the
worse case, twice the disk space of the saved space. But its big
advantage is it implementation simplicity and quite good security.
At any point in the above steps, the local software stores on disk its
state so it can restart from where it was stopped.
In steps 1 and 2, all files, links and directories of the directories to
backup are examined:
- If the file belongs to an exclusion list (with glob operators, like
"*~"), it is excluded from the list of files to save;
- The last modification date and size is compared to the last
backup to see if it needs a new backup;
- If the file is guessed to be already compressed from its suffix, like
images (.png, .jpg), compressed files (.gz, .zip, .tgz) or video and
music (.mp3, .mp4, .ogg, .flac), it is not compressed;
- Each file or link is compressed, added a MAC (Message
Authentication Code), cut into fixed size blocks (1024 bytes) and all
blocks are encrypted with an IV (Initialization Vector) at the
begining of the first block.
As result, a file is saved with following structure:
+------------------+ +------------------+ +------------------+
|IV| file content | | file content | | file content |MAC|
+------------------+ +------------------+ +------------------+
encrypted block encrypted block encrypted block
Content of the Master Index
The
Master Index is a password encrypted compressed file that
contains all the information needed to restore a backup from a set of
encrypted blocks. More specifically, it contains:
- local node number;
- private and public SSL key of this node;
- a reference to a possible previous Master Index;
- a master key that allows to derive both encryption/decryption and
MAC keys. This master key is generated freshly for each new master
index;
- a 64-bits counter used to number blocks and master indexes. This
counter is copied from one master index to the other (i.e. its value is
kept along the lifetime of the backups);
- a table of saved files, index by file/directory path. For
directories, links and tiny files (a few bytes), their content is
stored directly in the Master Index. Small files (compressed size
smaller than a block) are stored as several grouped in a block. Otherwise
a regular file is divided into a range of blocks. Each line of this
table contains, depending on the type of the file:
- for directories:
- Access rights;
- User id of the owner;
- Group ID of the directory's group;
- Last access time;
- Last modification time;
- Last status change time;
- for tiny files and links (a few bytes, to small to gain size from
compression, so they are stored within master index):
- Access rights;
- User id of the owner;
- Group ID of the file or link's group;
- Last access time;
- Last modification time;
- Last status change time;
- File or link content with its type;
- for small files or links (compressed size smaller that a block,
so several of them can be saved into a single block):
- Access rights;
- User id of the owner;
- Group ID of the file or link's group;
- Last access time;
- Last modification time;
- Last status change time;
- An entry number into the table of grouped file/link blocks (see
below);
- for regular files:
- Access rights;
- User id of the owner;
- Group ID of the file's group;
- Last access time;
- Last modification time;
- Last status change time;
- Range of blocks identifiers containing the file;
- File's compressed size;
- MAC of the file;
- a table of grouped file/link blocks, for blocks containing several
files. Each entry in this table stores:
- block identifier of the block containing the files or links;
- MAC of the group of files/links;
- The list of all files into this block;
- For each file or link: type, offset and length within the
block;
The Master Index is added a MAC, encrypted and stored on disk with salt
and number of hash iterations for decryption.
On disk backup structure
The backup on the local disk has following hierarchy:
node0/master_index/1 <-- saved master index #1
node0/master_index/5
node0/backup_state <-- local status of the software in case of interruption
node0/blocks/0/40
node0/blocks/0/80 <-- encrypted blocks as 1 KB files
:
node0/blocks/f/67f
|
+---- the set of blocks are divided into 16 directories, 0 to f
node1/ <-- same directory structure as for node0
:
node5/
Remote copy of blocks and master indexes
Once a new backup (blocks and master index) has been generated on local
disk, one needs to save it on remote nodes. We propose to use two
strategies:
- rsync synchronisation of the disk directory structure. This would
be not efficient with several remote nodes but it could be quite
useful on a local network (and simple to implement ;) ;
- a bittorrent like approach.
The bittorrent-like approach is quite simple: use the copy made on other
nodes to reduce the amount of data that an initial node should
send. More precisely, for each block that should be copied on
N
remote nodes:
- first copy: complete copy from node 0 to node 1;
- second copy: 1/2 copy from node 0 to node 2, 1/2 copy from node 1
to node 2;
- ...
- Nth copy: 1/N copy from node 0 to node N-1, ..., 1/N copy
from node N-2 to node N-1.
Each node is in charge of making the copies of its own blocks and master
index on other nodes. To implement above algorithm, one would need
following network API:
trigger_block_copy([list of source nodes], [list of block ids])
=> returns success or failure for each block
trigger_master_index_copy([list of source nodes], master_index_id)
=> returns success or failure
pull_block_parts([list of (block id, offset, length)])
=> returns, for each requested part, failure or block content
pull_master_index_part(master index id, offset, length)
=> returns failure or master index content
erase_block_list([list of block id])
=> returns success or failure
erase_master_index(master index id)
=> returns success or failure
A source node would call
trigger_blocks_copy to request to
one or more nodes a copy of a list of blocks. Each one of the nodes
receiving this call would in turn use
pull_block_parts to
get the needed parts from other nodes having a copy of the needed
block. After all new blocks are copied on remote nodes, the source node
can call
erase_block_list to erase outdated blocks. The
procedure is the same for master indexes.
In case a destination node crashes in the middle of a copy, the source
node should consider that the copy has failed and it should elect a new
destination node. In case a sending node crashes in the middle of a
copy, the receiving node should erase the corresponding block and report
a failure to the source node.
As the source node sends at backup step 4 the complete list of blocks
and master index it wants to keep, we avoid keeping stale blocks or
partially copied blocks in case of nodes' crashes.
It might appear that a source node cannot save enough copies onto remote
nodes (because they are unavailable): it should attempt to retry those
copies at a latter time.
Clique management
To create a new clique (group of friends making backups between each
other), one should:
- Ask for a password;
- Create private and public SSL keys;
- Create an empty master index with block counter set to one and
relevant crypto material.
To join an existing clique, a new node should do:
- Ask for a password;
- Create private and public SSL keys;
- Create an empty master index with block counter set to one and
relevant crypto material;
- Let the user send public key of this node to the owner of remote
nodes;
- Open connection to remote nodes.
In order to have the same node numbering on each node, the public keys
of the clique nodes are sorted and each node is given a number equal to
its public key rank. A local unencrypted configuration file contains
public SSL keys of other nodes and local directories to backup.
Few notes on implementation
In order to answer remote node requests asynchronously, the diseba
program should be multi-threaded: one thread to make local backup and
trigger local to remote copies, one or more threads to fulfil remote
nodes' copy requests.
I still intend to implement this program in OCaml. ;) RPC over SSL would
be used for communication between nodes.
Any opinion on this design?
2006-03-25T14:27:40Z [/ideas]
permanent link
Tue, 21 Mar 2006
I wanted to know how files are distributed in my home directory, so I
wrote a quick
OCaml
program to do this analysis. This program gives following result on
my home directory (it counts links as files):
$ ./filesystem-analysis.ml ~
Total file size: 6120592065 bytes
Minimum non empty file size: 1.00 B
Maximum file size: 163.77 MB
Average file size: 28.81 KB
Total number of files: 207476
of which are empty files: 544
Total number of directories: 22932
File size distribution:
[ 1.00 B - 1.00 B ] 592 files - total average of 592.00 B
[ 2.00 B - 3.00 B ] 574 files - total average of 1.68 KB
[ 4.00 B - 7.00 B ] 772 files - total average of 4.52 KB
[ 8.00 B - 15.00 B ] 2205 files - total average of 25.84 KB
[ 16.00 B - 31.00 B ] 4077 files - total average of 95.55 KB
[ 32.00 B - 63.00 B ] 15371 files - total average of 720.52 KB
[ 64.00 B - 127.00 B ] 7598 files - total average of 712.31 KB
[ 128.00 B - 255.00 B ] 6258 files - total average of 1.15 MB
[ 256.00 B - 511.00 B ] 10922 files - total average of 4.00 MB
[ 512.00 B - 1023.00 B ] 21493 files - total average of 15.74 MB
[ 1.00 KB - 2.00 KB] 24063 files - total average of 35.25 MB
[ 2.00 KB - 4.00 KB] 35981 files - total average of 105.41 MB
[ 4.00 KB - 8.00 KB] 27669 files - total average of 162.12 MB
[ 8.00 KB - 16.00 KB] 16737 files - total average of 196.14 MB
[ 16.00 KB - 32.00 KB] 16654 files - total average of 390.33 MB
[ 32.00 KB - 64.00 KB] 7057 files - total average of 330.80 MB
[ 64.00 KB - 128.00 KB] 4190 files - total average of 392.81 MB
[ 128.00 KB - 256.00 KB] 2905 files - total average of 544.69 MB
[ 256.00 KB - 512.00 KB] 1036 files - total average of 388.50 MB
[ 512.00 KB - 1024.00 KB] 738 files - total average of 553.50 MB
[ 1.00 MB - 2.00 MB] 301 files - total average of 451.50 MB
[ 2.00 MB - 4.00 MB] 115 files - total average of 345.00 MB
[ 4.00 MB - 8.00 MB] 104 files - total average of 624.00 MB
[ 8.00 MB - 16.00 MB] 38 files - total average of 456.00 MB
[ 16.00 MB - 32.00 MB] 14 files - total average of 336.00 MB
[ 32.00 MB - 64.00 MB] 8 files - total average of 384.00 MB
[ 64.00 MB - 128.00 MB] 1 files - total average of 96.00 MB
[ 128.00 MB - 256.00 MB] 3 files - total average of 576.00 MB
Returned results are pretty expected. Over the 6GB of files in my home
directory, I have about 200,000 files of an average size of 30KB and
23,000 directories (about 10% of the number of files). The vast majority of
files are below 32KB but a small number of big files are taking the vast
majority of my disk space.
A more interesting point is that a non negligible number of files (10%)
is smaller than 64 bytes. In the case of the distributed backup system,
it would be time and space saving to save them within the master index.
2006-03-21T20:07:50Z [/misc]
permanent link
Mon, 20 Mar 2006
As a followup to my
previous
note on a distributed backup system, I give here some details on its
cryptography. Most of this note comes from Niels Ferguson and Bruce
Schneier
Practical
Cryptography book. Feel free to criticize my
choices.
;)
Cryptographic primitives
I intend to use Niels and Schneier Fortuna random number
generator. Compared to their original design, I won't implement my
entropy pool system but use the
/dev/random facility of
Linux. That's not perfect but certainly much easier to implement. This
random number generator is used to generate per file (or per backup
phase) specific key, each time a file is saved. This gives perfect
forward secrecy in case a key is compromised.
AES 256 bits is used as block cipher. I could also use Serpent 256 bits,
or even both of them cascated, but I don't know if this has any value.
As block chaining mode, use Cipher Block Chaining (CBC) mode with a
Nonce-Generated Initialization Vector (IV). The nonce is generated by
encrypting with the file key the 64-bits file counter, incremented each
time a file is saved (i.e. a new encryption is made).
(* + is xor addition *)
C0 := E(K, 64-bits-counter)
Ci := E(K, Pi + C{i-1}) for i = 1, .. k (k = number of blocks)
As cryptograhic hash function, use two call to SHA256d defined from
SHA-256 as:
SHA256d(m) := SHA-256(SHA-256(m))
As Message Authentication Code, use HMAC constructed with SHA-256
(
K is the MAC key,
m the message,
a and
b are two specified constants):
(* + is xor operation *)
HMAC(K,m) := SHA-256((K + a) || SHA-256((K + b) || m))
Protection of master index
The master index stores the list of saved file with their encryption key
and other associated information. This master index is protected by a
password, known by the user. As such a password has not enough entropy,
so a salting and streching technique is used to reinforce password
strength. Firstly, a 256-bits random salt is generated and stored
alongside the master index. Then, a cryptographic hash function
SHA256d is used repeatedly as follow:
x0 := 0
xi := SHA256d(x{i-1} || password || salt) for i = 1, .., r
K := xr
r is the maximum number of iterations that can be made during
200-1000 ms.
K is the final master key for the master index.
Each time the master index is encrypted, a new Initialization Vector is
needed. For this, we use as nonce the global 64-bits file counter by
taking a value from it.
To check that the master index as been correctly decrypted once the user
has entered his password or that the master index has not been modified,
the compressed master index is added a Message Authentication Code
before being encrypted. The MAC key and encryption/decryption keys are
generated from the derived password key with following formulas, where
"string" is just a simple 16-bytes string concatenated to the
key:
MAC_key := SHA256d(password-derived-K || "authentication01")
encryption_key := SHA256d(password-derived-K || "encryptionabcdef")
Before being stored on remote nodes, the master index is thus
compressed, added a MAC and encrypted.
Encryption and authentication of files
In my initial design, each file is encrypted with a freshly generated
256 bits key, the nonce used for Initialization Vector being the freshly
allocated 64-bits file identifier. As that might be too costly, I think
it would be easier to use a unique key for the set of files making a
backup phase. So I indent to use following algorithm:
For each backup phase:
1. generate a random encryption key
2. generate a random Message Authentication Code key
3. for each file
a. take a value and increment the global 64-bits counter
b. compress the file
c. compute file MAC with MAC key
d. encrypt the file and its MAC with IV and encryption key
d. store an entry in master index, with file identifier and
used keys (encryption and MAC)
We store for each file the used keys, so in there is no need to make
cleanup in the index file, as each file can be re-encrypted with a
different key in a latter backup phase.
As said in my previous note, it is not necessary to encrypt directories
and empty files, except putting an entry for them in the master
index. As an optimisation, if the compressed file is smaller than 32
bytes (the size of an encryption and MAC keys), it is more efficient
effective to store it directly in the master index. The real question
is: are there so many files of that type?
The main issue with above design is that file size is immediately
visible to attackers. A refinement of the design would be to use a fixed
block size (e.g. 8 or 16 Kbytes) and partition a file or a set of files
amongst several blocks. For a complete backup phase, following algorithm
would be used:
For each backup phase:
1. compress all files and sort them according to there size
2. starting from the biggest one:
if the file is bigger that a block:
split it in the needed amount of blocks and pad the last
block with dummy data (zeros)
else (the file is smaller than a block):
put it in a new block and try to fit as much as possible
other files in the same block
3. for each generated block, give it a 64-bits block identifier
and encrypt it
For each file, store in the master index the used keys (encryption and
MAC), the list of block identifiers where parts of the file is in,
compressed file size and offset of the file within the first block.
For very big files, it would be impractical to compress them in advance,
so one would compress, split them in blocks and compute MAC and
encryption on each block on the fly.
To conclude
I've used fairly standard cryptography, following the advice of others
and being pretty conservative. However, I might have missed something:
please tell it to me!
;)
A side note: to improve the backup program latency, one should
seperate encryption and block segmentation on one side from remote
backup phase and use on the other side using a set of threads to do the
job in parallel. Moreover, in order to improve network usage, one should
allow to send several blocks/files in the same RPC call.
2006-03-20T22:42:40Z [/ideas]
permanent link
Sat, 18 Mar 2006
The idea
Making backup in a cost effective way is usually difficult with current
technology for indivuals (who can't support the cost of a tape backup
system).
Considering that hard-disk are quite cheap and have more and more
capacity, one solution would be to
save the content of your
hard-disk on the disk of your friends (and of course, they would do
the same on yours). With enough duplication, a crash of a disk would not
impact your backups. Of course, you don't want your friends to read your
emails, so
saved data should be properly encrypted.
Specification (or at least something close to it :-)
Note: following details are freely inspired by
encfs and
TrueCrypt.
A backup system is made as a set of nodes, each node running on the
machine of an individual and saving parts of his files (probably as a
Unix daemon-like program). Each node has an SSL-like private/public key
and communicates with other nodes over SSL connections. At system setup,
each node knows the public key of other nodes through out-of-band means
(email). Each node has a master password, used to generate derivative
keys.
Each node is configured to save a part of the local file hierachy, with
exclusion patterns (for example ~ files).
Each node is in charge of the encryption and the copying of its files
onto remote nodes. So a node will take care of copying each file of the
backup in at least
n encrypted copies. Each file copied on a
remote node is compressed and encrypted with an individual key,
generated from a master key and some random each time the file is
copied.
A set of rules is enforced by each node so that total available capacity
is equally shared between nodes. At launch time, each node knows the
total space reserved for the backup system on local hard-disk. It agrees
with other node of the group total capacity and the best way to use it.
Each saved file is indexed by a 64-bits integer, generated by the source
node. On the source node, a global counter is used, initially set to
zero and incremented each time a file is copied on a remote node. That
way, remote nodes only knows a file number and its size and cannot
deduce the file hierarchy on source node, when files are modified (well,
maybe some information can be deduced from file size but I consider this
a minor issue), etc.
On the source node, a master index is maintained. This index stores, for
each saved file:
- a 64-bits identifier of the saved file on remote nodes;
- cryptographic hash of the original and encrypted file, for
coherency check;
- the attributes of the file (creation date, modification date,
rights, etc.);
- the original path of the file;
- the set of nodes on which this file is saved;
- information specific to the backup system: total or incremental
backup, reference file if incremental, etc.
In case of a directory, saved information in the index are indentical
except that no file identifier is necessary. The master index also
stores the latest value of the 64-bits file identifier counter.
The master index of a source node is saved on remote nodes explicitely
(i.e. they know this is the master index of the source node) and
encrypted with a key generated from the node master password. In case of
a crash on the source node, the master index can be recovered from
remote nodes, in order to reconstruct the disk from saved files indexed
by the master index (using the master password).
To support incremental backup, a local copy of each saved file is
necessary in order to compute the difference. This local copy could be
used as an added local backup, on a different disk for example.
Transactional system
One issue of the system is that the master index and saved files should
be coherent, even in case of crash of one or more nodes at any step of a
backup phase. So we need to use a transactional-like system to ensure
this.
The API between nodes offers a way to start a transaction, and then
close it or abort it. If the SSL connection supporting the transaction
is interrupted (remote node crash for example), the transaction is
aborted.
When saving a set of files (e.g. a directory), the source node:
- starts a transaction;
- saves a file on a remote node, allocating to it a new identifier;
- repeat previous step for all files of the directory;
- updates locally the index;
- saves the index on remote nodes, as a new one;
- closes successfully the transaction;
- starts a second transaction;
- removes the old files on previous nodes.
- removes the old index on all remote nodes.
- closes successfully the transaction;
- removes locally the old index.
Previous algorithm should probably be tailored, depending on the number
of files to save. Maybe it would be better to saved a fixed amount of
files (say 10 to 50 files) in each transaction.
If a remode node sees an opened transaction as aborted (either
voluntarily or because the source node has disappeared), it erases all
files created during the transaction. If a source node sees a remote
node crash, it can elect a new remote node and use it in replacement of
the previous one.
When a file is saved on a remote node, the remode node replies with a
success only when the file is effectively stored on the remote node
local disk (of course, real write of data depends of the underlying
file-system and mounting parameters).
I should probably do a
SPIN model of this
algorithm. :)
API between nodes
One could use ONC RPC (as defined by IETF) to communicate between
nodes. The data stored (files and index) on remote node should be
defined in the XDR format.
We could use following API:
/* transactions */
start_transaction() -> transaction_id
close_transaction(transaction_id)
abort_transaction(transaction_id)
/* to handle files */
save_file(transaction_id, 64bits_id, file_content)
erase_file(transaction_id, 64bits_id)
get_file(transaction_id, 64bits_id) -> file_content
/* to handle master index */
save_index(transaction_id, version)
erase_index(transaction_id, version)
get_index(transaction_id, version) -> index_content
/* to manage groups */
enter_group() -> computed_group_parameters
leave_group()
pushed_group_update() -> computed_group_parameters
Each API call is identified by the source node, from the connection
used. So each node can uniquely store a source node files into a
separate directory.
Possible user interface for the program
I think it would be nice if the program would have a very simple user
interface, like the following:
distrib-backup --initialize-start-group # to initiate a new group
distrib-backup --initialize-join-group a-remote-node # to join an existing group
distrib-backup # usual start
distrib-backup --restore-after-crash a-remote-node # to reconstruct after a crash
The backup software should have a local user interface: either a command
line one (shell like) or a web interface, whichever is the easiest to
implement.
Implementation
In OCaml of course! :) OCaml as already the bindings or libraries for:
And of course, the OCaml language has all the properties to make a safe
system (garbage collection, strong typing, etc.).
Remaining issues
- Define precisely the crytographic parts: which algorithm to use,
how to generate keys, how to generate random, etc. Look at TrueCrypt
user documentation which has useful information;
- Check the transactional mechanism indeed works;
- Refine the set of rules to allocate disk space;
- See if it possible to avoid bandwith hog, both locally and on
remote nodes.
What do you think of it?
2006-03-18T20:11:59Z [/ideas]
permanent link
Sat, 11 Feb 2006
For a long time, I've been interested in literate programming, an
idea originaly developed by Donald. E Knuth in
its CWEB
system when programming the TeX system. His idea is quite simple: write
a program not as a piece of code but as a book, where actual code is
interleaved with pieces of documentation about that code. For example,
for more than two years now I've been using Noweb to write the
code of demexp.
Noweb is very nice. Most notably, it is language agnostic so it can be
used on all the different pieces of source you have in a project.
The main issue with literate programming is a technical one: the
mixed source code and documentation is written into a file,
e.g. file.ml.nw, from which are derived compilable code
(file.ml) and documentation
(file.ml.tex). This approach is not compatible with
proprietary IDE environments that your are forced to use for very
specific programming (e.g. Intel's IXP SDK for IXP2x00 Network
Processors or Xilinx's ISE for FPGA). Moreover, the more I think about
it, the more I believe that compilable source code should be the
reference point:
- it can thus be shared easily with other programmers that don't want
to use literate programming tools;
- program order is important in certain languages (e.g. OCaml) and
should be kept;
- most substitution mechanisms in literate programming tools I know
about (Noweb, original Knuth's CWEB) is a very fragile macro systems
that can lead to bugs.
I have thus started to think about embedding literate programming
chunks as language comments. A special processing on source code
extracts documentation from literate markup.
My original idea was quite simple: use a Wiki-like syntax to write the
documentation, with a few set of tags:
- references and links within one source file or between source
file;
- usual emphasized, bold, italics, code, etc. markups;
- inclusion of images;
- (optionally) syntax coloring and index creation.
One could even consider transforming an existing Wiki system storing
its data in text file, like DokuWiki, to
write the literate parts of the code.
More recently, Felix BREUER has proposed to use TeXmacs as a literate programming tool. I
have proposed
him my idea of documentation as source comments and he seems to like
it, so we will
try to experiment that idea.
To conclude, while I'm pretty sure that literate programming as
source code comments is the way to go, I'm not sure of the right path
between a wiki-like syntax and a TeXmacs approach. While TeXmacs approach
is very seducing because TeXmacs is a full fledge scientific editor,
with image creation and inclusion, formulas, references and links, its
file format is not very readable and could clutter a lot the source
code, making it unreadable. Felix
has proposed an original solution to this issue: put the TeXmacs
code at the end of the source file (albeit still embedded in it) and use
a system of references for edition within TeXmacs. On the other hand, a
wiki-like system is always readable, even for people not using the
literate programming approach. Time and experiment will tell what is the
right approach.
2006-02-11T14:14:26Z [/ideas]
permanent link
Wed, 25 Jan 2006
For those having some spare time and who are willing to dig into tricky
(but interesting) areas, I have a nice idea of project to do: write the
(free) software of a voting machine where all (or at least most of) the
code is formally proven.
The voting machine I consider would be a very classical one: no network
connection (so no remote electronic voting), a touch screen with the
different voting options, votes are stored on a local hard drive. One
can use a standard PC as reference hardware (with for example a
graphical screen with mouse click to emulate the touch screen), even if
for a real voting machine a temper-proof hardware would be mandatory.
The purpose of the software would be:
- Before the vote phase, to take the set of vote option and the list
of voters identifiers (an unique random integer for each one of them,
that would emulate the use of a smartcard);
- During the vote itself, authenticate a voter (through his/her
unique identifier), take his/her vote and store it;
- After the vote phase, output the vote result.
The software would be written in plain C, using information (and maybe
code) from
Simple Operating System
(SOS) for hardware interface (mouse and screen control). Using
Why proof obligation producer, one would
produce the formal version of the program for the
Coq proof assistant, that would be used
for the formal proofs.
Of course, all of those software are available as free software and the
resulting code and formal proofs should be also available as free
software. The net result of such a project would not be a complete
voting machine (this is much more complex than that) but a first step
towards a trustable voting machine.
What do you think of it?
2006-01-25T18:03:05Z [/ideas]
permanent link
Fri, 30 Dec 2005
I switched from
blosxom to
pyblosxom to make my blog:
the permanent link of blog entries was changing with modification time!
Pyblosxom has a sensible way of managing permanent links. Additional
bonus: the user documentation of pyblosxom is very nice.
2005-12-30T17:56:39Z []
permanent link
Wed, 28 Dec 2005
I'm currently using Sage as a RSS
reader for Firefox. It is pretty basic but suits my needs: to know what
is new is a newsfeed (dynamic bookmarks of Firefox lack this
functionnality). And being a Firefox extension, it is very simple to
install and is integrated with the browser: clicking on a news item
opens the right link (and even a vote on demexp if
Firefox is properly configured).
2005-12-28T11:02:40Z []
permanent link
Tue, 27 Dec 2005
I recently discovered
PGP pathfinder & key statistics
a system to gather statistics about your GPG key and determine the trust
path between you and another person.
For example,
the trust path between me and Branden Robinson (Debian Project Leader)
is, in the shortest case:
0 A3AD7A2A stats David MENTRE dmentre.at.linux-france.org #13401 signs
1 D0AB4075 stats Pascal Terjan (Professional) pterjan.at.mandriva.com #2488 signs
2 29499F61 stats Sam Hocevar sam.at.zoy.org #77 signs
3 2B46A27C stats Branden Robinson branden.at.deadbeast.net #478
The longest being:
0 A3AD7A2A stats David MENTRE dmentre.at.linux-france.org #13401 signs
1 6DEAAEFC stats Jean-Marie Favreau (jm) jeanmarie.favreau.at.free.fr #16593 signs
2 85C79389 stats Damien Raude-Morvan (Mail) drazzib.at.drazzib.com #24021 signs
3 C3567689 stats Emmanuel Berre manu.at.ouvaton.org #11053 signs
4 23706F87 stats Pierre Machard pmachard.at.debian.org #1075 signs
5 3FCC2A90 stats Amaya Rodrigo Sastre amaya.at.debian.org #55 signs
6 2B46A27C stats Branden Robinson branden.at.deadbeast.net #478
You can see the Web of Trust at work. Quite interesting isn't it?
The author of this script, Henk P. Penning, also provides an analysis
of the strong set in the PGP web of trust which I haven't read yet.
2005-12-27T15:38:14Z []
permanent link
I read in latest Debian Weekly News an analysis of Debian Source code
made by Spanish students. In fact, they just ran the SLOCCount utility of David
A. Wheeler. But I find always interesting to have up-to-date figures: it
gives some hints on the accomplishment made by the free software
community.
You'll find the report at:
http://www.upgrade-cepis.org/issues/2005/3/up6-3Amor.pdf
About 60% of software in Sarge are written in C, 17% in C++. OCaml is
not counted explicitely, but ML is at 0.31%. :(
The biggest programs are:
- OpenOffice.org (1.1.3): more than 5 millions of lines;
- Linux kernel (2.6.8): 4M;
- NVU: ~2.5M;
- Mozilla (1.7.7): ~2.5M;
- GCC (3.4.3): 2.4M;
- XFS-XTT (1.4.1): 2.3M;
- Xfree86 (4.3.0): 2.3M.
I'm quite surprised by the size of NVU, but I suppose this is because it
is based on Mozilla: most lines of code are probably duplicates of
Mozilla's code. XFS-XTT size is also surprising.
I like those figures because they give some hints on the complexity of
modern softwares. We need to design programming languages and
programming environments that help to write software of more than 10
million lines of code. None of the languages I've seen so far help much
in managing so complex programs. Of course, counting lines of code is a
rough metric and programs written in high level languages (O'Caml,
Haskell) need less lines of code for the same functionnality.
2005-12-27T15:25:17Z [/free-software]
permanent link
I've made a few slides to present GnuPG and Enigmail extension to
Thundirbird and use them for securing email exchanges. Here there are in
SXI
format.
2005-12-27T15:12:24Z []
permanent link
