In the last couple of months I have been trying to come up with
a coherent data hosting and backup strategy. I have come up with
something I am fairly happy with (though I haven't quite
implemented it all yet):
For large media (music, TV, movies), store it all in one place
on a 300 GB or 500 GB drive, and back it up once a week to an
external USB drive of the same size. Exchange the USB drive with
a similar one once a year or so. (So you have a weekly full
backup, and an annual snapshot.)
For smaller but more important data (e.g. the CVS repository that
holds my web site, currently about 30 GB), store it in that same
master space but also create redundant backups of it with various
online services, like Amazon S3 and anywhere else you have space.
(I'm currently using Dreamhost where I have 400 GB of space for
$16/mo, and my disk quota increases faster than I'll use it.)
http://www.dreamhost.com/r.cgi?227949
I haven't implemented the first part of my plan yet, but a
couple of months ago I ran out of room on the server where I was
backing up my CVS repository, so that motivated me to figure out
how to implement the second part, for smaller data.
I wanted to be able to back up my data in a way that doesn't
let the backup host (e.g. Amazon or Dreamhost) read it; there's
nothing especially secret in there now, but there might be one
day, and Amazon or whoever would probably happily give up my
data if it were subpoenaed for some reason.
Also, I didn't want to rely on any particular data hosting
service; I don't expect S3 to go away any time soon but I don't
think they've committed to offering it indefinitely, and I'd like
to use multiple storage services in case one fails.
A few things I tried in order to accomplish these goals:
- Brackup
http://brad.livejournal.com/tag/brackup
Sounds like exactly what I need, but after a couple of hours
of struggle I was unable to get it to work. I don't remember
what the problems were, and didn't send bug reports (it was
over a month ago that I tried). I'm planning to try it again
after it matures a bit more.
- Write my own code to encrypt each file in my repository and
upload it to Amazon S3. I expected storing a file in S3 would
be extremely simple, but after hours of struggle with
software libraries in all kinds of different languages I gave
up on this too.
I am mystified that this would be so difficult; I just want a
program in any language that lets me say
"put this file to S3, return an error if you can't", and
"get this file from S3, return an error if you can't".
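For the record, what seems to make a raw S3 put/get nontrivial
is the authentication step: the S3 REST API wants every request
signed with your secret key (an HMAC-SHA1 over a canonical
string, base64-encoded into the Authorization header). A sketch
of just that signing step in Python, with a made-up key and
object name:

```python
import base64
import hmac
from hashlib import sha1

def sign_s3_request(secret_key, method, resource, date, content_type=""):
    # S3 REST auth: HMAC-SHA1 over the canonical "string to sign"
    # (method, Content-MD5, Content-Type, Date, resource), base64-encoded.
    string_to_sign = "\n".join([method, "", content_type, date, resource])
    mac = hmac.new(secret_key.encode(), string_to_sign.encode(), sha1)
    return base64.b64encode(mac.digest()).decode()

# Made-up credentials and object name, purely for illustration:
sig = sign_s3_request("not-a-real-secret", "PUT",
                      "/mybucket/cvsroot-backup.tar.gpg",
                      "Tue, 27 Mar 2007 19:36:42 +0000")
```

The actual request is then an ordinary HTTP PUT or GET with
"Authorization: AWS <access-key>:<signature>" attached; getting
that string exactly right appears to be where most of the
library pain comes from.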
- Duplicity
http://duplicity.nongnu.org/
I was able to get this to work with only minor effort. I used
it to back up my 30 GB CVS repository to my shell account on
Dreamhost, and tested it by restoring a file. I haven't really
examined duplicity closely, so I wouldn't rely on it as my only
backup, but it seems to work.
Notes about using duplicity:
1. The first full backup takes a long time, and if it's
interrupted in the middle (e.g. due to a network hiccup) it
seems to have to start over from scratch. My first few
attempts to back up 30 GB to Dreamhost failed after a few GB
and I had to start over again hours later. I worked around
this problem with the command-line argument:
--scp-command 'scp -o ConnectionAttempts=1800'
to tell scp to keep retrying; at roughly one attempt per
second, 1800 attempts is about half an hour.
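The same keep-retrying idea, as a generic sketch for tools that
don't have an option like ConnectionAttempts (a hypothetical
helper, not part of duplicity or ssh):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=1800, delay=1.0):
    # Re-run cmd until it exits 0, up to `attempts` times,
    # pausing `delay` seconds between tries -- roughly the
    # effect of 'scp -o ConnectionAttempts=1800' on connects.
    for _ in range(attempts):
        if subprocess.call(cmd) == 0:
            return True
        time.sleep(delay)
    return False
```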
2. Duplicity generates a big file with signatures of all the
files it backs up; for my 30 GB of data (145674 files) this
file is 480 MB, which filled up my /tmp the first time. I
worked around this with:
TMP=$HOME/tmp
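For scale, that signature file works out to roughly 3.5 KB per
backed-up file, which is why a repository full of small files
makes it balloon:

```python
sig_bytes = 480 * 1024 * 1024   # 480 MB signature file
n_files = 145674                # files in the 30 GB repository
per_file = sig_bytes / n_files  # bytes of signature data per file
print(round(per_file))          # roughly 3455 bytes, ~3.4 KB
```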
3. By default it breaks the backup data into encrypted chunks of
5 MB each and scp's each chunk to the remote site one at a
time; to speed this up I installed openssh v4 to benefit from
its connection reuse features (and started a master ssh
session between the two systems inside 'screen')
http://larve.net/people/hugo/2005/blog/2006/04/20/speeding-up-ssh-fsh-vs-openssh-v4/
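The chunking behavior itself is simple enough to sketch (an
illustration of the idea, not duplicity's actual code):

```python
import io

def split_volumes(stream, volume_size=5 * 1024 * 1024):
    # Read the backup stream and yield fixed-size "volumes",
    # the way duplicity cuts its output into ~5 MB pieces
    # before encrypting and scp'ing each one.
    while True:
        chunk = stream.read(volume_size)
        if not chunk:
            break
        yield chunk

# Tiny demo with a 3-byte volume size:
volumes = list(split_volumes(io.BytesIO(b"0123456789"), volume_size=3))
# -> chunks of 3, 3, 3 and 1 bytes
```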
4. Incremental backups seem to work but are really inefficient,
because the first thing duplicity does each time is download
the massive 480 MB signature file from the remote server. I
haven't checked if there's some way to prevent that, e.g. by
caching the signature file locally. This is probably less of
an issue if you aren't backing up as many small files as I
am, because the signature file will be smaller.
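The caching I have in mind would look something like this (a
sketch with a hypothetical fetch callback standing in for the
scp download; I don't know whether duplicity can be taught to
do this):

```python
import os

def cached_signatures(remote_path, cache_path, fetch):
    # Only download the big signature file if we don't already
    # have a local copy; fetch(remote, local) stands in for
    # whatever actually transfers it (scp, etc.).
    if not os.path.exists(cache_path):
        fetch(remote_path, cache_path)
    return cache_path
```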
5. It encrypts everything with GPG, which I haven't yet figured
out how to run non-interactively in a good way. I found
gpg-agent, which presumably works much like ssh-agent, but
haven't tried it yet.
The commands I ended up using to back up my data were:
For the first full backup:
TMP=$HOME/tmp duplicity \
--scp-command 'scp -o ConnectionAttempts=1800' \
$HOME/cvsroot \
scp://[email protected]/duplicity/bubbles
To do an incremental against the last full backup:
TMP=$HOME/tmp duplicity --incremental \
--scp-command 'scp -o ConnectionAttempts=1800' \
$HOME/cvsroot \
scp://[email protected]/duplicity/bubbles
(the --incremental isn't strictly needed; it just tells
duplicity to abort if an incremental backup isn't possible,
rather than falling back to a full one)
To restore a single file:
TMP=$HOME/tmp duplicity \
--file-to-restore www/people/gerald/index.html,v \
scp://[email protected]/duplicity/bubbles \
/tmp/restored-index.html,v
Various related stuff I had bookmarked:
http://www.debian-administration.org/articles/209
(hmm, haven't read that yet, looks very useful)
http://aws.amazon.com/s3
http://www.jungledisk.com/
https://files.dreamhost.com/
http://joseph.randomnetworks.com/archives/2006/10/03/amazon-s3-vs-dreamhost/
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=44471
http://www.muellerware.org/projects/s3u/index.html
http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=47
http://s3.amazonaws.com/ServEdge_pub/s3sync/README_s3cmd.txt
http://jeremy.zawodny.com/blog/archives/007641.html
http://web.mit.edu/~emin/www/source_code/dibs/index.html
http://backuppc.sourceforge.net/
--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/