#!/usr/bin/perl -w ######################################################## =head1 RIPS Bedundancy Bnformation B

rotection Bystem Version 0.11 =head1 Purpose Have you ever teared out your hair because one damaged block of one of your disks screwed your whole archive ? Have you ever tried to restore an old tape backup with really important data and cursed yourself for using tar -cB and gzip stopped after the second block which is the only broken one at the whole tape? Well then this is probably what you need. RIPS can protect your files in a manner that minor corruptions can be balanced out by adding some redundancy to your file. =head1 Usage =head2 rips [options] =over 4 =item B<-v> or B<--verbose> Be noisy. Default is not verbose. =item B<-d> or B<--decode> Decode given input file. Default is encoding. =item B<-t> or B<--test> Test archive; the same as specifying /dev/null as output. =item B<-c> or B<--continue> Continue even if an archive can't be restored. =item B<-h> or B<--help> Show this help. =item B<-i> F or B<--input=>F Specify the input file. If nothing or the file '-' is given, C is used. Filesize must be greater 0 or nothing will be written. =item B<-o> F or B<--output=>F Specify the output file. If nothing or the file '-' is given, C is used. =item B<-b> I or B<--blocksize=>I Defines the size of one slice. Default is 65536 bytes. Must comply with 0 < b < 2^32 . =item B<-s> I or B<--slices=>I Defines the number of slices-1 that will be combined to one subarchive. Default are four (plus one) slices, which means your file will be roughly extended by one fourth of it's original size. Valid values range from 1-255. =back =head1 Background and examples To efficiently protect your data against loss it is crucial to choose the right values for blocksize and the number of slices per archive. To be able to do that you should have some experience with the media that you are using which is error prone. For example if you are going to transport 20MB or more using standard PC (HD) floppies you probably know that although the smallest entity floppies are accessed by are blocks (512 bytes) if an error occurs it usually affects the whole track (18 blocks). In most cases it even affects several tracks on both sides. Let's assume our usual error is a track failure of 1-5 tracks, effectivly (5*18*512) 46080 bytes. That means that choosing anything below 46080 bytes as blocksize would be a stupid idea if intended to protect from such failures. What we have to accomplish is to have this error relatively sure encapsulated in one block. Ok, now it becomes math (which I've never been particulary good at, so I tried to keep it as simple as possible for my own sake;). Let's assume those 46080 bytes are our error of the size B and our chosen blocksize is B. The number of possible combinations B

(permutations) with at least one byte of B in B is p = m + n - 1 The number of combination where m is not entirely in n is B (failed): f = ( m - 1 ) * 2 Thereof the number of combination where m is entirely in n is B (ok): o = p - f Now let's label another variable B (safety) which should reflect our safety. The definition should be: A safety value of 5 means chances are 5:1 (in 5 out of 6 cases, thus very likely) that m is completely within n. We can use this to set f and o in relation: o = sf Using all of above and bit of transposing we reach the following equation: n = 2sm - 2s + m - 1 n = 2s ( m - 1 ) + ( m - 1 ) For our example m equals 46080 and s equals 5 therefore n would be 506869. So we should chose a blocksize of about that size. Choosing a bigger blocksize has two negative side affects. The first is that the memory during creation and restoring and archive is increased and the second is that archive size can be increased. Now we have to select the number of slices. The number of slices determines the redundancy within the collection and in conjuction with the blocksize the size of the archives within the collection. More than choosing the right the blocksize, choosing the right value for the slices is a tradeoff between security and required space. Let's use our example to explain in detail. Assuming our floppies aren't the best or the worst kind with an average failure on every 10th disk. This gives us a probability B

for a good disk of 0.9 or in other words (160*18*512) 1474560 consecutive bytes b (mediasize) are good with a probability of 0.9. So the probability for one byte B to be good is (nearly 1 but more exactly): b = p ^ (1/m) b = 0.9 ^ (1/1474560) or 'simply' the 1474560th root of 0.9 If we choose the same safety as for our blocksize calculations (s=5) we get a probability threshold B of 5/6 or 0.833 . Now the question is how many consecutive bytes B can we have before our probability for good bytes drops below our treshold: b ^ x >= t ln t t = 5/6 = 0.833 (trust) = > x <= ---- * m m = 1474560 (mediasize) ln p p = 0.9 (error probability / m) Which gives x lower or equal to 2551658.68. This means we should try to keep the size per archive below that number. If we devide x by n (506869) we get 5.03 so we should make sure our archive does not include more than 5 blocks, keeping in mind that each archive contains slices+1 blocks we should specify 4 slices per archive. Choosing a lower number increases the redundancy but of course also the final size of the collection. So the complete commandline to create the collection could look like this: rips -b 506869 -s 4 -i foo.tgz -o foo.tgz.rip and to decode: rips -d -i foo.tgz.rip -o foo.tgz =head1 Requirements Rips uses Getopt::Std, Getopt::Long, Digest::MD5 which are usually part of the standard perl installation and should not bother you. However older perl installations may not include MD5 thus may require require the system administrator or you to install it manually. Methods of install may vary depending on your OS, if you have a linux distro check the repository, if you have NetBSD try B, on FreeBSD try B, if you lack any of these options you can try B to install the package. Of course rips also requires some memory to run in. Encoding requires less memory than decoding if rips encounters an error in the archive during decoding. The minimum memory requirement is 2 * blocksize + the memory perl requires to run. The maximum memory requirement can be one complete archive + 1 extra block + the memory perl requires to run. Keep this in mind when creating archives with little redundancy/many slices per archive. =head1 Archive format A rips collection consists of several rips archives concatenated, each of these archives has I+1 chunks, each chunk has the size of I+16 (md5) + 4 bytes I. Each archive has a header: 4 bytes ID ("RIPS") + 1 byte archive version + 1 byte # of slices in archive. So a standard archive would be: (4+1) * (65536+20) + 6 = 327786 bytes long. =head1 ToDo Make archive decoding continue on more errors. =head1 Bugs Currently RIPS is not able to recover from errors in it's metadata. =head1 Author Alexander Kuehn EIE L =cut ############ # Ok, for those of you that can't get enough of the math # here a few more lines from the transposing.. # for the blocksize, the equations: # p = m + n - 1 # f = 2 ( m - 1 ) # o = p - f = sf # the transition: # sf = p - f # ( s + 1 ) f = p # ( s + 1 ) f = ( m + n ) - 1 # 2 ( s + 1 ) ( m - 1 ) = m + n - 1 # ( 2s + 2 ) ( m - 1 ) = m + n - 1 # 2sm - 2s + 2m - 2 = m + n - 1 # 2sm - 2s + m - 1 = n # # for the slices, the equations: # b = p ^ (1/m) (p<1, m>0) # b ^ x >= t # the transition: # (p ^ (1/m)) ^ x >= t # p ^ (x/m) >= t # x / m <= ln t / ln p or otherwise lg t (iirc) # p # x <= m ( ln t / ln p ) # # package main; # includes use strict; use Getopt::Std; use Getopt::Long; use Digest::MD5 qw(md5 md5_hex md5_base64); local $::header="RIPS"; local $::caller=$0; local $::defarchversion=1; # should always be the most recent version (0 \$verbose, # --verbose "decode" => \$decode, # --decode "test" => \$testmode, # --test "continue" => \$continue, # --test "help" => \$help, # --help "input:s" => \$input, # --input= "output:s" => \$output, # --output= "blocksize:i" => \$blocksize, # --blocksize= "slices:i" => \$slices, # --slices "archversion:i" => \$archversion); # --archversion GetOptions(%args); use vars qw/ $opt_v $opt_d $opt_t $opt_h $opt_i $opt_o $opt_b $opt_s $opt_c $opt_a/; getopts('vdthi:o:b:s:a:'); # normalize parameters, long format parameters overwrite short format parameters $verbose=($opt_v || $verbose ? 1 : 0); $decode=($opt_d || $decode ? 1 : 0); $testmode=($opt_t || $testmode ? 1 : 0); $help=($opt_h || $help ? 1 : 0); unless ($input) { $input=($opt_i ? $input : "-"); } unless ($output) { $output=($opt_o ? $output : "-"); } unless ($blocksize) { $blocksize=($opt_b ? $opt_b : 32768); } unless ($slices) { $slices=($opt_s ? $opt_s : 4); } $continue=($opt_c || $continue ? 1 : 0); unless ($archversion) { $archversion=($opt_a ? $opt_a : $::defarchversion); } my $msg_rderr="unexpected end of archive."; my $msg_archerr="input file does not look like a RIPS archive."; my $msg_unkarch="unknown archive version, try to get a newer version."; my $msg_archbroken="found second broken bock in archive giving up, sorry."; local $::totalrsize=0; local $::totalwsize=0; verbprt ("%args:", map { "$_ = ${$args{$_}}" } keys %args); # I'm sick of all the the if's, let's see if a state machine does better.. # # determine initial state local $::state="init"; local $::returncode=0; if ($help) { $::state="help"; } elsif ($decode) { $::state="decode"; } else { $::state="encode"; } # state - subroutine assigment my %states = ( "help" => \&printhelp, "decode" => \&decode, "encode" => \&encode, "encode_v_one" => \&encode_v_one, "encode_v_two" => \&encode_v_two ); # check parameters # set up in- and output path unless ($::state eq "help") { if ($blocksize < 1 || $blocksize > 0xffffffff ) { errlog("Blocksize ($blocksize) out of range (1-4294967295)!"); exit(-1); } if ($slices < 1 || $slices > 0xff ) { errlog("Number of slices ($slices) out of range (1-255)!"); exit(-1); } if ($archversion < 1 || $archversion > $::defarchversion ) { # errlog("Requested archive version ($archversion) unsupported by this version (1-$::defarchversion)!"); exit(-1); } $output="/dev/null" if ($testmode); open (INFILE, "< $input") || die ("Can't open $input for reading: $!\n"); binmode(INFILE); open (OUTFILE, "> $output") || die ("Can't open $output for reading: $!\n"); binmode(OUTFILE); } while ($::state ne "done") { if ($states{$::state}) { $states{$::state}->(); } else { errlog("State $::state unknown " . '?-|'); $::state="done"; } } close(INFILE); close(OUTFILE); exit($::returncode); ############################################################################### # print help/usage ############################################################################### sub printhelp { system ("perldoc $::caller"); $::state="done"; } ############################################################################### # encode # archive creation # reads from the input until the end and spits out a rips archive # in the process ############################################################################### sub encode { if ( $archversion eq 1 ) { $::state="encode_v_one"; } elsif ( $archversion eq 2 ) { $::state="encode_v_two"; } else { errlog("Unsupported archversion specified! Archversion may not be higher than $::defarchversion."); $::state="done"; } } ############################################################################### # encode_v_one # archive v1 creation # reads from the input until the end and spits out a rips archive # in the process ############################################################################### sub encode_v_one { my ($red, $wrote, $bufferA, $bufferB, $digest); my $slice=0; $red = rips_read($bufferA, $blocksize); while ($red eq $blocksize) { if ($slice eq 0) { rips_write($::header . (pack "CC", $archversion, $slices)); $bufferB=$bufferA; } else { $bufferB^=$bufferA; } rips_write( (pack "N", $blocksize) . md5($bufferA) ); rips_write( $bufferA ); $slice++; if ($slice eq $slices) { #last slice rips_write( (pack "N", $blocksize) . md5($bufferB) ); #save current slice rips_write( $bufferB ); $slice=0; } $red = rips_read($bufferA, $blocksize); } if ($red gt 0) { if ($slice eq 0) { rips_write( $::header . (pack "CC", 1, 1) ); #version 1, 1+1 slices $bufferB=$bufferA; $blocksize=$red; } else { $bufferB^=$bufferA; } rips_write( (pack "N", $red) . md5($bufferA)); rips_write( $bufferA ); rips_write( (pack "N", $blocksize) . md5($bufferB) ); rips_write( $bufferB ); } verbprt("Red $::totalrsize bytes, wrote $::totalwsize bytes."); $::state="done"; } ############################################################################### # encode_v_two # archive v2 creation # reads from the input until the end and spits out a rips archive # in the process ############################################################################### sub encode_v_two { my ($red, $wrote, $bufferA, $bufferB, $bufferC, $md5cacheA, $md5cacheB, $md5cacheC, $digest); # buffers: A: last slice red my $slice=0; # B: redudancy slice $red = rips_read($bufferA, $blocksize); # C: next slice to write $md5cacheA=md5($bufferA); while ($red eq $blocksize) { if ($slice eq 0) { rips_write($::header . (pack "CC", $archversion, $slices)); $bufferB=$bufferA; $digest=""; $bufferC=$bufferA; $md5cacheC=$md5cacheA; } elsif ($slice eq $slices) { #last slice $digest^=$md5cacheC; rips_write( (pack "N", $blocksize) . $md5cacheC ); rips_write( $bufferC ); $bufferB^=$bufferA; $md5cacheB=md5($bufferB); $digest^=$md5cacheB; rips_write( (pack "N", $blocksize) . $md5cacheB ); #save redundancy slice rips_write( $bufferB ); $digest^=$md5cacheA; rips_write( (pack "N", $blocksize) . $md5cacheA ); rips_write( $bufferA ); rips_write( $digest ); $slice=0; } else { $bufferB^=$bufferA; $digest^=$md5cacheC; rips_write( (pack "N", $blocksize) . $md5cacheC ); rips_write( $bufferC ); $bufferC=$bufferA; $md5cacheC=$md5cacheA; } $slice++; $red = rips_read($bufferA, $blocksize); $md5cacheA=md5($bufferA); } if ($red eq 0) { #filesize was multiple of blocksize unless ($slice eq 0) { # !filesize zero or filesize was multiple of (blocksite*slices) $md5cacheB=md5($bufferB); $digest^=$md5cacheB; rips_write( (pack "N", $blocksize) . $md5cacheB ); #save redundancy slice rips_write( $bufferB ); $digest^=$md5cacheC; rips_write( (pack "N", $blocksize) . $md5cacheC ); # save last block rips_write( $bufferC ); rips_write( $digest ); } } elsif ($red gt 0) { if ($slice eq 0) { rips_write( $::header . (pack "CC", $archversion, 1) ); #version, 1+1 slices $blocksize=$red; $digest=$md5cacheA; rips_write( (pack "N", $blocksize) . $md5cacheA ); rips_write( $bufferA ); # our new redundancy block $digest^=$md5cacheA; rips_write( (pack "N", $blocksize) . $md5cacheA ); rips_write( $bufferA ); # our data block rips_write( $digest ); # should be 0 } else { $digest^=$md5cacheC; rips_write( (pack "N", $blocksize) . $md5cacheC ); # save last block rips_write( $bufferC ); $bufferB^=$bufferA; $md5cacheB=md5($bufferB); $digest^=$md5cacheB; rips_write( (pack "N", $blocksize) . $md5cacheB ); #save redundancy slice rips_write( $bufferB ); $digest=$md5cacheA; rips_write( (pack "N", $blocksize) . $md5cacheA ); rips_write( $bufferA ); # our new redundancy block rips_write( $digest ); } } verbprt("Red $::totalrsize bytes, wrote $::totalwsize bytes."); $::state="done"; } ############################################################################### # decode ############################################################################### sub decode { my ($red, $wrote, $bufferA, $bufferB, $digest); my $slice=0; $red = rips_read($bufferA, 4); while ($red eq 4) { if ($bufferA eq $::header) { # simple check passed $red = rips_read($bufferA, 1); if ($red eq 1) { if (unpack ("CC", $bufferA ) eq 1) { verbprt("This archive is type 1."); $red = rips_read($slices, 1); if ($red eq 1) { $slices=unpack("CC", $slices); verbprt("This collection uses $slices slices per archive."); $slice=0; while ( $slice le $slices ) { $red = rips_read($bufferA, 4); # size if ($red eq 4) { $blocksize=unpack("N", $bufferA); verbprt("Slice $slice is $blocksize bytes."); my $md5_sum; $red = rips_read($md5_sum, 16); # md5 if ($red eq 16) { $red = rips_read($bufferA, $blocksize); # block if (md5($bufferA) eq $md5_sum) { if ($slice eq 0) { $bufferB=$bufferA; } else { $bufferB^=$bufferA; } if ($slice ne $slices) { #not last slice rips_write( $bufferA ); } } else { # oops broken block! verbprt("broken slice: $slice!"); my $tailsize=$blocksize; if ($slice eq $slices) { #phu, nothing important hit verbprt("redundancy block corrupt - no worries."); } else { # red alert verbprt("trying to recover. :-o"); my $bufferC; # let's collect the rest of the archive here if ($slice eq 0) { $bufferB="\0" x $blocksize; } while ($slice le $slices) { $slice++; $red = rips_read($bufferA, 4); # size if ($red eq 4) { $blocksize=unpack("N", $bufferA); verbprt("Blocksize is $blocksize."); $tailsize+=$blocksize; my $md5_sum; $red = rips_read($md5_sum, 16); # md5 if ($red eq 16) { $bufferA=""; $red = rips_read($bufferA, $blocksize); # block if (md5($bufferA) eq $md5_sum) { verbprt("Slice $slice ok."); $bufferB^=$bufferA; $bufferC.=$bufferA; if ($slice eq $slices) { $bufferC=$bufferB . $bufferC; $tailsize-=$blocksize; # substract redundancy block verbprt("Horray, archive repaired!"); $bufferC = substr($bufferC, 0 , $tailsize); rips_write( $bufferC ); $bufferC=""; $slice++; # ok, we're through } } else { # the heavens are broken errlog($msg_archbroken); exit(-1) unless($continue); } } else { errlog($msg_rderr); } } else { errlog($msg_rderr); } } } } } else { errlog($msg_rderr); } } else { errlog($msg_rderr); } $slice++; } } else { errlog($msg_rderr); } } else { errlog($msg_unkarch); } } else { errlog($msg_rderr); } } else { errlog($msg_archerr); } $red = rips_read($bufferA, 4); } verbprt("Red $::totalrsize bytes, wrote $::totalwsize bytes."); $::state="done"; } ############################################################################### # reads the given amount of bytes into the given buffer # from INFILE # spit out proper error message if an error occurs ############################################################################### sub rips_read(\$; $) { my $bufferref = shift @_; my $toread = shift @_; my $red = read(INFILE, $$bufferref, $toread); if (!(defined $red)) { errlog("error while reading from $input: $!"); unless ($::continue) { $::state="done"; } } elsif ($red eq 0) { verbprt("could not read anymore bytes from $input: $! (probably end of collection)"); } elsif ($red ne $toread) { verbprt("could only read $red of $toread bytes from $input: $!"); } $::totalrsize+=$red; return($red); } ############################################################################### # writes the given amount of bytes from the given buffer # to OUTFILE # spit out proper error message if an error occurs ############################################################################### sub rips_write(@) { my $flag=$True; for (@_) { $::totalwsize+=length; $flag = print OUTFILE $_; unless($flag) { errlog("error while writing to $output: $!"); unless ($::continue) { $::state="done"; } last; } } } ############################################################################### # sets $dberror if an error in a query occurs # called by error handler ############################################################################### sub catcherror { errlog($_[0]); $help=1; } ############################################################################## # print error messages to stderr # parameters: text for output ############################################################################## sub errlog (@) { print STDERR "$::caller: $_[0]\n"; } ############################################################################## # print verbose messages to stderr # parameters: text for output ############################################################################## sub verbprt (@) { for (@_) { print STDERR "$_\n" if $verbose; } }