Amazon S3 versus rsync
Posted by matijs 08/06/2006 at 11h15
JungleDisk is a new tool that uses
Amazon’s S3 as a storage device but appears to the user as just another
disk. It is a closed-source application built on open standards. Locally it
appears as a WebDAV server, so it works with most desktop file managers.
I can’t get it for my Linux-on-iBook system, but I suppose if JungleDisk
catches on, someone will come along and do a Free reimplementation (source
code for retrieving files can already be downloaded from JungleDisk’s
site).
The question then becomes: Do I want to use it?
I myself use rsync to back up my files to a friend’s machine on the other
side of the ocean. JungleDisk doesn’t do rsync. It’s just a disk, so you
have to consciously copy your files there. That’s fine if you just want to
store some files on-line. For backups, on the other hand, I want an
automated solution.
The following scheme might work: Store md5 and/or sha-1 hashes of all the
files sent to S3. The files sent there are simply indexed by hash, and we
store a mapping of hashes to directory nodes. This way, a file that is moved
or renamed doesn’t have to be uploaded again. The mapping has to be uploaded
to S3 as well, of course. For more fine-grained upload control, files can
even be divided into equal-sized blocks, and only changed blocks will have
to be re-uploaded.
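
The scheme above can be sketched roughly as follows. This is only a local illustration, not a working S3 client: the function names (`build_index`, `plan_upload`, `block_hashes`, `changed_blocks`) are made up for this example, and `remote_hashes` stands in for whatever list of already-uploaded hashes a real tool would keep or fetch.

```python
import hashlib
import os


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_index(root):
    """Map each path (relative to root) to the hash of its content."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            index[os.path.relpath(full, root)] = file_sha1(full)
    return index


def plan_upload(index, remote_hashes):
    """Return the content hashes that are not stored remotely yet.

    remote_hashes is a hypothetical set of hashes already uploaded.
    """
    return set(index.values()) - set(remote_hashes)


def block_hashes(data, block_size):
    """Hash equal-sized blocks of data (the last block may be shorter)."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old_hashes, new_hashes):
    """Return indexes of blocks that are new or differ from the old version."""
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]
```

Because content is keyed by hash, moving a file changes only its entry in the index, so `plan_upload` returns nothing new for it. Note that fixed-size blocks only help when a file changes in place: inserting a byte near the start shifts every later block, which is the case rsync's rolling checksum is designed to handle.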
As an aside, I wonder if S3 allows you to check md5 or sha-1 hashes of the files
stored there, or if there is some other way to check that the files there
are the same as the files here.
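
One possible way to do this check, as a sketch: as far as I know, S3 reports an ETag for each object, and for an object uploaded in a single plain PUT that value is the hex MD5 of the content. Treat that as an assumption to verify against the S3 documentation. The `remote_etag` argument below is a hypothetical input holding whatever value the service reported, and no request to S3 is made here.

```python
import hashlib


def local_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a local file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_remote(path, remote_etag):
    """Compare a local file against a remotely reported ETag.

    Assumes the ETag is a plain MD5 hex digest, possibly wrapped in
    double quotes as HTTP ETag headers usually are.
    """
    return local_md5(path) == remote_etag.strip('"').lower()
```

If the ETag turns out not to be a plain MD5 for some objects, the hashes stored in the index from the scheme above would still allow a check, at the cost of downloading the file again.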