The basics – how this system works
In a nutshell, the backup system presented by Rubel creates a number of read-only backup directories in the designated backup location, labeled backup0, backup1 . . . backupN. The directory labeled backup0 contains the most recent and only full backup. Each subsequent directory contains an incremental backup for a previous day: backup1 is the incremental backup from one day ago, backup2 from two days ago, and so on.
The backup system presented here is nearly identical to the one developed by Rubel, and relies on a simple bash script and an external storage device of sufficiently large size. For maximum data protection, we recommend that this script be executed daily as a cron job, which is discussed in detail later. Each time the script is executed, all incremental backups are shifted, a new incremental backup is created, and a single full backup is performed. Here are the basic operations which will be discussed in more detail below (albeit in the reverse order):
- Create N backup directories in the designated backup location, if not already present
- Delete the oldest backup directory
# rm -rf backupN
- Shift the backup directories
# mv backup[N-1] backupN
...
# mv backup1 backup2
- Create an incremental backup
# cp -al backup0 backup1
- Create a full backup using rsync
# rsync -a --delete source_directory backup_directory/backup0/
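The sequence above can be sketched as a small function that only prints the commands it would run for a given number of backups. This is an illustration rather than part of the backup script: the function name print_rotation_plan is our own, and it follows the script's numbering, where N backups are stored as backup0 through backup(N-1).

```shell
#!/bin/bash
# Print (but do not run) the rotation commands for n backups: backup0 .. backup(n-1).
# print_rotation_plan is a hypothetical helper used only for illustration.
print_rotation_plan() {
    local n=$1 i
    echo "rm -rf backup$(( n - 1 ))"                 # delete the oldest backup
    for (( i = n - 1; i > 1; i-- )); do              # shift the remaining ones up
        echo "mv backup$(( i - 1 )) backup$i"
    done
    echo "cp -al backup0 backup1"                    # new incremental backup
    echo "rsync -a --delete source_directory backup_directory/backup0/"
}

print_rotation_plan 5
```

Printing the plan first is a cheap way to confirm that the oldest directory is removed before anything is shifted into its place.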
Each individual operation must be executed in this order, although it's far more logical to discuss them in the reverse order, which is what we're going to do. The full backup is handled by a simple rsync operation. Rsync contains algorithms for efficiently synchronizing data between two locations. The actual "locations" can be almost anywhere or anything, including different directories on the same disk, different disks or partitions on the same host, different disks on different hosts, etc. Any disk that can be mounted directly or accessed via a network is fair game. We recommend backing up to an external disk that is connected via USB for maximum protection and reliability. This ensures that the backups will be on a separate physical disk that is directly connected to the host, thus avoiding potential network problems or outages. Here’s the basic command:
# rsync -a -z --stats --delete source_directory dest_directory/backup0/
The ‘-a’ option puts rsync into archive mode, which preserves permissions, ownership, timestamps, symbolic links, devices, etc. during the synchronization, while the ‘-z’ option causes rsync to compress all data during the transfer. The ‘--stats’ option prints statistics about the file transfer. The ‘--delete’ option forces rsync to remove from the destination any files that have been deleted on the source. In essence, rsync compares the source and destination locations and copies or deletes only the data it needs to. As a result, once the first full backup has been completed (the very first time the script is executed) the daily backup process should be relatively quick, as only new or changed files will actually be copied from the source to the destination.
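To see what rsync would copy or delete before trusting it with real data, the same options can be tried on throwaway directories. The sketch below is only an illustration: demo_root, src, and dst are made-up names, and the source is given with a trailing slash so that rsync copies the directory's contents rather than the directory itself, which differs from the command shown above. The -n (dry run) and -i (itemize changes) options make the first call print its plan without touching anything.

```shell
#!/bin/bash
# Sketch: preview, then perform, a full backup into backup0 using throwaway directories.
# demo_root, src and dst are hypothetical names used only for this demonstration.
demo_root=$(mktemp -d)
src="$demo_root/source_directory"
dst="$demo_root/dest_directory"
mkdir -p "$src" "$dst/backup0"
echo "version 1" > "$src/config.txt"
echo "stale" > "$dst/backup0/old.txt"      # exists only in the backup

# Dry run: list what would be copied or deleted, without changing anything.
rsync -a -n -i --delete "$src/" "$dst/backup0/"

# Real run: copy new or changed files and delete files that are gone from the source.
rsync -a --stats --delete "$src/" "$dst/backup0/"

ls "$dst/backup0"
```

After the real run, backup0 should hold config.txt and no longer contain old.txt.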
The real elegance of this script is that it creates ‘N’ incremental or ‘snapshot’ backups that look and behave like full duplicates of the original source directories but occupy only a fraction of the space. The mechanism used to create these incremental backups is actually very simple. Every time this script is run, only a single incremental backup is actually created (located in backup1), while incremental backups for previous days are shifted into backup2 through backupN. An incremental backup is created using the venerable ‘cp’ command with a slight twist, namely use of the powerful ‘-al’ options:
# cp -al destination_directory/backup0 destination_directory/backup1
The -a option again forces archive mode, while the -l option creates hard links instead of copying file contents. A hard link is an additional directory entry that refers to the same data on the physical storage device, rather than a separate copy of that data. As a result, backup1 looks and feels identical to backup0 but does not contain a second copy of the data, only links to it, and therefore occupies very little additional disk space. Copying with hard links is the perfect mechanism for an automated backup system like this because it allows multiple copies of a file or directory tree to be stored without utilizing a significant amount of additional disk space.
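The effect of cp -al is easy to check on a scratch directory. The following is a minimal sketch, assuming GNU cp (as found on typical Linux systems); demo_dir and results.txt are made-up names.

```shell
#!/bin/bash
# Sketch: show that cp -al creates hard links rather than second copies of the data.
# demo_dir is a throwaway directory created only for this demonstration.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/backup0"
echo "router test results" > "$demo_dir/backup0/results.txt"

cp -al "$demo_dir/backup0" "$demo_dir/backup1"

# -ef is true when both names refer to the same file (same device and inode).
if [ "$demo_dir/backup0/results.txt" -ef "$demo_dir/backup1/results.txt" ]; then
    echo "results.txt is hard-linked, not copied"
fi

# The link count shown by ls -l (second column) is now 2 for the shared file.
ls -l "$demo_dir/backup0/results.txt"
```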
By default, the total number of backups maintained by this script is five: the current full backup plus four additional incremental backups (this number can be easily changed). Incremental backups are simply shifted from backup1 to backup2, backup3, and backup4 (or backupN) before finally being deleted. The backup0 … backupN directories are basically just containers. The shifting process is carried out with the mv command:
# mv destination_directory/backup1 destination_directory/backup2
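It is important to note that backup0 is not shifted as the other backup directories are. The contents of backup0 are copied to backup1 using the cp -al command as discussed earlier, which is the mechanism that creates the incremental backups. When rsync subsequently updates backup0, its default behavior is to replace each changed file with a new copy rather than overwrite it in place, so the hard links in backup1 continue to point to the previous versions. Use of the cp -al and mv commands together is what creates the shifting incremental backup functionality of this system.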
The above operations must be performed in the correct order as outlined earlier (otherwise data will be lost). All incremental backups must be shifted before the new incremental backup is created. The rsync operation is the last step in the process.
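The ordering can be tried safely in a scratch directory. The sketch below is an illustration only: rotate_once and demo_dst are our own names, and the rsync step is simulated by writing a new file and renaming it over the old one, which is how rsync replaces changed files by default.

```shell
#!/bin/bash
# Sketch: run rotation cycles in a throwaway directory to show why the order matters.
# rotate_once and demo_dst are hypothetical names; the rsync step is simulated.
rotate_once() {
    local dst=$1 num=$2 new_content=$3 i
    rm -rf "$dst/backup$(( num - 1 ))"                 # 1. drop the oldest snapshot
    for (( i = num - 1; i > 1; i-- )); do              # 2. shift the rest up by one
        if [ -d "$dst/backup$(( i - 1 ))" ]; then
            mv "$dst/backup$(( i - 1 ))" "$dst/backup$i"
        fi
    done
    cp -al "$dst/backup0" "$dst/backup1"               # 3. hard-link snapshot of backup0
    # 4. Stand-in for rsync: write a new file and rename it over the old one,
    #    so the file linked from backup1 keeps the previous content.
    echo "$new_content" > "$dst/backup0/data.txt.tmp"
    mv "$dst/backup0/data.txt.tmp" "$dst/backup0/data.txt"
}

demo_dst=$(mktemp -d)
mkdir "$demo_dst/backup0"
echo "day 0" > "$demo_dst/backup0/data.txt"

rotate_once "$demo_dst" 3 "day 1"
rotate_once "$demo_dst" 3 "day 2"

for d in backup0 backup1 backup2; do
    echo "$d: $(cat "$demo_dst/$d/data.txt")"
done
```

After two cycles, backup0 holds the newest content ("day 2"), backup1 the previous one ("day 1"), and backup2 the oldest ("day 0"). Running the simulated sync before the cp -al step would instead leave backup1 identical to backup0, losing the previous day's version.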
There are a few things that must be configured before this script can be put to use: the full path of the directory containing the data to be backed up, the mount point and full path of the directory that will contain the backups, the total number of backups to maintain, and the full path and name of a log file to be used by the backup script. Once this information has been specified, the script is ready to use. These variables are discussed in the "Let's get this thing set up" section.
A simple bash script for backing up your CDRouter-related data
The following script implements the backup system discussed above. The core operations are only a few lines; however, we've included additional error checking and logging information, which adds considerably to the overall length.
The script provided here is designed to back up a single source directory to a single destination directory. To back up multiple source and destination directories, just create a separate script for each source and/or destination directory.
#!/bin/bash
# Example backup script using rsync with incremental backups
# Created August 23, 2007, QA Cafe
# Adapted from Mike Rubel's article: www.mikerubel.org/computers/rsync_snapshots/
# External drives must be formatted as ext3 for this script to work properly.
# This script must also be run as root.
SRC_DIR=/usr/buddyweb
DST_DIR=/media/disk
NUM_BACKUPS="5"
LOGFILE=$DST_DIR/backup.log
# Remount destination directory with RW permissions
mount -o remount,rw "$DST_DIR"
# Redirect all output to LOGFILE
exec >>$LOGFILE 2>&1
###### Make sure that we are running as root
if [ "$(whoami)" != "root" ]; then
    echo "Only root can run this script!"
    exit 1
fi
###### Make sure the source directory (the directory to be backed up) exists
if [ ! -d "$SRC_DIR" ]; then
    echo "Error: '$SRC_DIR' does not exist!"
    echo "Please check the path and restart the backup process."
    exit 1
fi
###### Make sure the directory to back up to exists
if [ ! -d "$DST_DIR" ]; then
    echo "Error: '$DST_DIR' does not exist!"
    echo "Please check the path and restart the backup process."
    exit 1
fi
###### Backup Script
echo "###### CDRouter backup process started $(date +%F-%H:%M:%S) ######"
# Create incremental backup directories, if needed
echo "Creating $NUM_BACKUPS backup directories"
i=0
while [ $i -lt $NUM_BACKUPS ]; do
    if [ -d "$DST_DIR/backup$i" ]; then
        echo "Will write to existing $DST_DIR/backup$i folder"
    else
        echo "Creating new folder $DST_DIR/backup$i"
        mkdir "$DST_DIR/backup$i"
    fi
    i=$(( i + 1 ))
done
# Delete the oldest incremental backup directory
echo "Removing oldest incremental backup: backup$(( NUM_BACKUPS - 1 ))"
rm -rf "$DST_DIR/backup$(( NUM_BACKUPS - 1 ))"
# Rotate the incremental backup directories
i=$(( NUM_BACKUPS - 1 ))
while [ $i -gt 1 ]; do
    echo "Rotating incremental backups: $DST_DIR/backup$(( i - 1 )) to $DST_DIR/backup$i"
    mv "$DST_DIR/backup$(( i - 1 ))" "$DST_DIR/backup$i"
    i=$(( i - 1 ))
done
# Create the new incremental backup (backup1) as a hard-linked copy of backup0
echo "Creating incremental backup: $DST_DIR/backup1"
cp -al "$DST_DIR/backup0" "$DST_DIR/backup1"
# Use rsync to synchronize the local directory with the main backup directory (backup0)
echo "Synchronizing local and backup directories"
rsync -a -z --stats --delete "$SRC_DIR" "$DST_DIR/backup0/"
echo "Backup0 contains a full backup of $SRC_DIR as of $(date +%F)"
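echo -e "###### CDRouter backup process finished $(date +%F-%H:%M:%S) ######\n"
# Stop writing to LOGFILE first: a filesystem that still has a file open
# for writing cannot be remounted read-only
exec >/dev/null 2>&1
# Remount destination directory with read-only permissions
mount -o remount,ro "$DST_DIR"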
Let's get this thing set up
This script is very easy to run and set up. There are only four variables (located at the top of the script) that need to be configured before the script can be used: SRC_DIR, DST_DIR, NUM_BACKUPS, and LOGFILE. The SRC_DIR variable points to the location of the source directory which contains the data to be backed up. In many cases this will be /usr/buddyweb, although any directory on the CDRouter host machine can be configured, including the “/” directory, provided there is adequate storage capacity on the backup device (note that if you choose to back up the “/” directory, you may get errors regarding certain system subdirectories which cannot be backed up; this has been tested and all other subdirectories should be backed up properly, despite any errors).
The DST_DIR variable contains the location of the actual backups. These will typically be on an external storage device that is connected via USB. Backup devices must be accessible and formatted properly (see the “Selecting and setting up an external USB drive” section) to be used with this script. In many cases, a USB storage device will be mounted in the /media directory as ‘disk’. Because the script as written remounts DST_DIR itself, DST_DIR must be the mount point of the device (/media/disk in this case); to keep the backups in a subdirectory such as /media/disk/cdrouter_backups, the two mount commands in the script would have to reference the mount point separately. Note that this script could be easily modified to back up to another directory on the host machine or to a network accessible server or PC (the details for either of these modifications are not provided in this article).
The NUM_BACKUPS variable determines the total number of backups that are stored. By default this value is five, resulting in one full backup plus [ NUM_BACKUPS - 1 ], or four, incremental backups, although any number can be specified provided there is adequate storage space on the backup device. The LOGFILE variable is the location and name of the file that will be used for status information and logging. By default the log file is named ‘backup.log’ and is located in DST_DIR. The log file can be used to determine if there were any errors as well as the overall duration of the backup process.
Selecting and setting up an external USB drive
This script has been tested on SanDisk cruzer micro USB drives and Seagate FreeAgent external USB drives with success, although any external disk that connects via USB should work. For long term storage we recommend a USB hard drive, as typical USB flash drives have a limited life span. We also recommend that you have at least two times the size of the source directory for dedicated backup storage. It is important to ensure that whatever backup storage device is used is configured and connected properly. In many cases the operating system will automatically mount a USB storage device when it is connected.
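The "two times the size of the source directory" guideline can be checked before the first backup. The sketch below is one possible way to do it, not part of the backup script; required_kb and has_enough_space are hypothetical helper names, and the paths in the final call are the article's defaults.

```shell
#!/bin/bash
# Sketch: check the "twice the source size" guideline before the first backup.
# required_kb and has_enough_space are hypothetical helpers, not part of the script.
required_kb() {
    # Twice the size of the source directory, in kilobytes.
    local src_kb
    src_kb=$(du -sk "$1" | awk '{print $1}')
    echo $(( src_kb * 2 ))
}

has_enough_space() {
    local src=$1 dst=$2 avail_kb
    # df -P prints one POSIX-format line per filesystem; column 4 is the available KB.
    avail_kb=$(df -Pk "$dst" | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$(required_kb "$src")" ]
}

if has_enough_space /usr/buddyweb /media/disk; then
    echo "Enough space for backups"
else
    echo "Warning: backup device may be too small (or a path is missing)"
fi
```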
Typically these devices are located somewhere in the /dev directory and are mounted to /media/disk (this is true for both Ubuntu and newer releases of Fedora Core, provided only a single storage device is connected). To sort this all out and confirm the device name and mount point for your chosen storage device, you can examine the output of the ‘df’ command:
# df
The ‘df’ command will list the disk usage for all of the file systems that are currently mounted. In some cases the operating system will not automatically mount the external storage device (it will not show up in the output of the ‘df’ command). In these cases, you may have to use the ‘dmesg’ command to figure out the name of your external storage device:
# dmesg
The ‘dmesg’ or ‘diagnostic message’ command displays the kernel’s message buffer. Whenever you connect a device to the host machine, it should generate a diagnostic message. Regardless of which command you use to determine the name of your external storage device, be sure that you are referencing the correct device by verifying that the size of the disk is what you expect it to be (if you are connecting a 100 GB disk, you should see that a certain device in the ‘df’ or ‘dmesg’ output has a capacity of 100 GB).
On a typical Ubuntu or Fedora Core machine, an external USB storage device will show up as device ‘sda1’ or ‘sdb1’. Once the name of your backup device has been determined, you must reformat all or part of it as an ext3 partition. CAUTION: reformatting your backup device will erase all data. Reformatting can be accomplished with the mkfs command (note that you must unmount the filesystem before reformatting):
# umount /media/disk
# mkfs.ext3 /dev/sdb1
If your external storage device is not automatically mounted by the operating system, there are a few additional steps that you will have to complete. Namely, you will have to first reformat the external storage device, create a mount point, and add a line to your /etc/fstab file to have it auto-mounted:
# mkfs.ext3 /dev/sdb1
# mkdir /media/disk
To auto-mount the external storage device, simply edit your /etc/fstab file and add the following line:
/dev/sdb1 /media/disk ext3 defaults 0 0
Now just mount (or re-connect) your backup device and verify that it is accessible and operable:
# mount -o ro /dev/sdb1 /media/disk
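A quick way to confirm that the device is really mounted is sketched below. It assumes the mountpoint utility from util-linux is installed, which is common on Linux but not guaranteed; is_mounted is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: confirm that the backup device is mounted before relying on it.
# is_mounted is a hypothetical helper built on mountpoint(1) from util-linux.
is_mounted() {
    mountpoint -q "$1"
}

if is_mounted /media/disk; then
    df -h /media/disk      # show size and free space of the mounted device
else
    echo "/media/disk is not a mount point; check the device and /etc/fstab"
fi
```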
Your external storage device should now be ready to use. To test out the backup script, simply navigate to the /etc/cron.daily directory and run it (as root):
# cd /etc/cron.daily
# ./backup
Depending on the amount of data in your source directory, this could take some time. In our tests it took approximately five minutes for the script to process 700 MB of data. Subsequent backups will take much less time, as only new or changed files will be copied.
The final step - remove all human intervention
We recommend setting this script up to execute automatically using cron. This will vary a little depending on your Linux distribution, but is usually very straightforward. If your system is running Anacron, just verify that the script is executable and move or copy it to the /etc/cron.daily directory:
# chmod 755 backup
# cp backup /etc/cron.daily
The /etc/cron.daily directory is a container for scripts that are executed automatically each day. These scripts are executed as root, so you don’t have to worry about permissions. All scripts in the /etc/cron.daily directory will be executed automatically at the time specified in the system configuration, as uptime permits (/etc/crontab is the system-wide cron configuration file; on systems where Anacron drives the daily jobs, the schedule is set in /etc/anacrontab). We do not recommend editing the /etc/crontab file. If you would like to have the backup script executed at a specific time, we recommend that you manually edit the crontab file for root (the backup script must be run as root, which is why we need to edit root’s crontab file). To do this, enter the following command when logged in as root (the -e option opens crontab with the default system editor):
# crontab -e
Next just add a line to the crontab file in the standard cron format:
# m h dom mon dow command
00 20 * * * ~qa/Desktop/scripts/backup
The above entry will execute the script ‘backup’ located in the ~qa/Desktop/scripts directory every day at 8:00 pm. Refer to the cron man pages for additional information.
Once set up using Anacron or individual crontabs, the backup utility should run once per day. We recommend that you periodically analyze the backup process log file for errors that may go unnoticed otherwise.
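Checking the log can itself be scripted. The following is a rough sketch, assuming the default log location from the script; check_backup_log is a hypothetical helper, and the search patterns are only a starting point that should be adjusted to the messages that actually appear in your log.

```shell
#!/bin/bash
# Sketch: scan a backup log and print any lines that look like errors.
# check_backup_log is a hypothetical helper; the patterns are a starting point only.
check_backup_log() {
    local logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "Cannot read log file: $logfile"
        return 2
    fi
    # grep prints matching lines and succeeds only if at least one line matched.
    if grep -i -E 'error|failed|cannot|No such file' "$logfile"; then
        return 1
    fi
    echo "No errors found in $logfile"
}

check_backup_log /media/disk/backup.log || echo "Review the log before trusting the backups"
```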
My hard disk crashed and I lost ALL of my CDRouter data!
No fear, this is why you implemented a backup system! To recover your CDRouter data, you must first get the host machine back up and running in a stable state, which may require replacement of a hard disk and re-installation of the operating system and CDRouter. Regardless, once things are back to normal, you can simply copy data from saved backups to your host machine. Remember that you have multiple backups saved, so you first have to choose which set of backups to restore. Also remember that ‘backup0’ contains the most recent set of backups, whereas ‘backup2’ contains backups from two periods before (if the script runs daily, ‘backup2’ will contain backup data as of two days prior to the last full backup).
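Assuming you were originally backing up the /usr/buddyweb directory, you could easily restore all of this data from your most recent set of backups to its original location using a basic ‘cp’ or copy command. Because the script passes the source directory to rsync without a trailing slash, each backup directory contains a buddyweb subdirectory rather than the bare contents of /usr/buddyweb:
# cp -a /media/disk/backup0/buddyweb/. /usr/buddyweb/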
The locations and exact files to restore can be changed as needed. If only certain files or directories are needed, they can be copied individually.
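Restoring a single file from an older snapshot can be sketched as follows. This is an illustration under stated assumptions: restore_file is a hypothetical helper, the path in the usage comment is made up, and the snapshot layout (a buddyweb subdirectory inside each backup directory) is what the script above produces with its default settings.

```shell
#!/bin/bash
# Sketch: restore a single file or directory from a chosen snapshot.
# restore_file is a hypothetical helper; rel_path is relative to the snapshot root.
restore_file() {
    local backup_root=$1 snapshot=$2 rel_path=$3 target_root=$4
    local from="$backup_root/backup$snapshot/$rel_path"
    if [ ! -e "$from" ]; then
        echo "Not found in backup$snapshot: $rel_path" >&2
        return 1
    fi
    # Recreate the parent directory if it is missing, then copy with attributes.
    mkdir -p "$(dirname "$target_root/$rel_path")"
    cp -a "$from" "$target_root/$rel_path"
}

# Hypothetical usage: bring back one file as it was two backups ago.
# restore_file /media/disk 2 buddyweb/configs/example.conf /usr
```

Copying out of a snapshot rather than moving or editing files inside it keeps the backup directories untouched for later restores.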
References
- Mike Rubel’s excellent article "Easy Automated Snapshot-Style Backups with Linux and Rsync":
http://www.mikerubel.org/computers/rsync_snapshots/
- Basically Tech's article "Using a USB external hard disk for backups with Linux":
http://www.basicallytech.com/blog/index.php?/archives/73-Using-a-USB-external-hard-disk-for-backups-with-Linux.html
- Kevin Korb's article "Backups using rsync":
http://www.sanitarium.net/golug/rsync_backups.html
- net shift media's article "Mounting an External USB Drive in Linux":
http://www.netshiftmedia.com/netshift/archives/2006/09/15/mounting-external-usb-drive-linux.php
- Unix man pages for cp, rsync, mv, mkfs, df, cron
Questions or comments about this article?
Please contact QA Cafe Support: support@qacafe.com
www.qacafe.com
© 2007 QA Cafe