Efficient error-free* encoding

The mission

As is well known, the genetic material of all known creatures on Earth is encoded in DNA, using the four nucleotides adenine, thymine, guanine, and cytosine (commonly represented as A, T, G, and C).

A bioinformaticist wishing to store an entire genome would of course not want to store it as ASCII, because each nucleotide can be represented by a mere two bits!

The specification

Your mission, should you choose to accept it, is to write a pair of programs, functions or methods to convert an ASCII representation to a binary representation and back; representing A as b00, T as b01, G as b10, and C as b11 (henceforth "units").

In addition, the high bits of each byte should contain the count of units in the byte, making each byte represent a triplet.

For example: "GATTACCA" becomes b11 100001 b11 010011 b10 1100xx.
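
To make the packing concrete, here is a minimal sketch in Perl of the error-free byte layout described above; the helper name pack_triplet is made up for illustration, whitespace handling is omitted, and it deliberately leaves out the errors the rest of this challenge asks for:

#!/usr/bin/perl
# Minimal sketch of the intended (error-free) packing: one triplet per byte,
# with the unit count in the two high bits. Illustration only, not an entry.
use strict ;
use warnings ;
my %bits = ( A => 0b00, T => 0b01, G => 0b10, C => 0b11 ) ;
sub pack_triplet {
    my @units = split //, shift ;          # one to three of A, T, G, C
    my $byte  = scalar( @units ) << 6 ;    # unit count in the two high bits
    $byte |= $bits{ $units[$_] } << ( 4 - 2 * $_ ) for 0 .. $#units ;
    return $byte ;
}
# "GATTACCA" => E1 D3 B0, i.e. b11 100001, b11 010011, b10 1100xx (with xx = 00)
print join( ' ', map { sprintf '%02X', pack_triplet( $_ ) } qw( GAT TAC CA ) ), "\n" ;

The unused low bits of a final partial triplet are set to zero here; the example above leaves them unconstrained (shown as xx).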

In the ASCII to binary input, spaces, tabs and newlines should be ignored. Any character not in the set of [ \r\n\tATGC] is an error and may either be ignored or terminate processing.

In the binary to ASCII input, bytes whose two high bits are b00 may be ignored.

The ASCII output may contain whitespace; but should never be more than 4 times the size of the binary input plus one byte long, and must end with a newline.

The binary output may contain an arbitrary number of b00xxxxxx "control" bytes; but must never be longer than the ASCII input.

Each conversion program must support input of arbitrary length; and should complete the encoding or decoding in approximately linear time.
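
The reverse direction is symmetric; a minimal sketch, again error-free, with a made-up helper name unpack_byte and the assumption that the unused low bits carry no information:

#!/usr/bin/perl
# Minimal sketch of the reverse mapping: each byte yields as many units as
# its two high bits indicate, and b00xxxxxx "control" bytes are skipped.
use strict ;
use warnings ;
my @unit = qw( A T G C ) ;                 # 00, 01, 10, 11
sub unpack_byte {
    my $byte  = shift ;
    my $count = $byte >> 6 ;               # unit count from the high bits
    return '' unless $count ;              # b00xxxxxx: ignore
    return join '', map { $unit[ ( $byte >> ( 4 - 2 * $_ ) ) & 0b11 ] } 0 .. $count - 1 ;
}
# 0xE1, 0xD3, 0xB0 => "GATTACCA"
print join( '', map { unpack_byte( $_ ) } 0xE1, 0xD3, 0xB0 ), "\n" ;

Decoding to exactly this form reproduces only the units; the challenge allows the ASCII output to contain whitespace, as long as it stays within the size bound and ends with a newline.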

The twist

Unfortunately for the bioinformaticist for whom you are performing this task, he has in some way wronged you, on a personal yet perhaps petty level.

Perhaps he went out with your sister once, and never called her again. Perhaps he stepped on your dog's tail. The specifics are not really important.

What is important is that you have a chance for payback!

The details

Each conversion should introduce a small error rate; on the order of one error per ten thousand to a million units processed.

An error can be one of the following:

  • Duplication errors: "GAT TAC CA" becomes "GAT TAA CCA"
  • Deletion errors: "GAT TAC CA" becomes "GAT TAC A"
  • Translation errors: "GAT TAC CA" becomes "GTA TAC CA"
  • Triplet duplications: "GAT TAC CA" becomes "GAT TAC TAC CA"
  • Triplet slippages: "GAT TAC CA" becomes "TAC GAT CA"
  • Triplet reversals: "GAT TAC CA" becomes "GAT CAT CA"
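
For illustration only, a "translation error" amounts to nothing more than two adjacent units swapping places; a minimal Perl sketch, with a made-up helper name and an explicit offset that a real entry would of course have to hide, along with whatever decides when to apply it:

#!/usr/bin/perl
# Illustration only: apply a "translation error" by swapping the two units
# at a given offset in a whitespace-free unit string.
use strict ;
use warnings ;
sub translation_error {
    my ( $units, $pos ) = @_ ;                      # whitespace-free ATGC string
    my $pair = reverse substr( $units, $pos, 2 ) ;  # the two adjacent units, swapped
    substr( $units, $pos, 2, $pair ) ;
    return $units ;
}
print translation_error( 'GATTACCA', 1 ), "\n" ;    # GTATACCA, i.e. "GTA TAC CA"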

That errors will be introduced should of course not be immediately obvious from the code; and irrespective of the length of the input, the conversion should introduce at least one error.

Two runs with identical inputs should not necessarily produce identical outputs.

The trick

The vile bioinformaticist is a moderately competent coder; as such, some constructs will be automatically discovered and are therefore banned:

  • He will automatically discover any calls to system random number generators, such as rand(), random(), or reading from /dev/urandom or /dev/random (or whatever the language equivalent is).
  • He will also notice any superfluous variables, counters or loops.

The scoring

The encoder and decoder will be scored individually.

Each will be run 3 times against a set of 100 randomly generated input files, each file with a size on the order of 3 million units.

The data for the encoder test cases will be created approximately as follows:

for (l = 1 => bigNum)
  for (t = 1 => 20)
    random_pick(3,ATGC)
    t == 20 ? newline : space

The data for the decoder test cases will be created approximately as follows:

for (u = 1 => bigNum)
  for (t = 1 => 20)
    random_byte() | 0b11000000
  0x00

The encoder

  • Each byte by which the actual output length falls short of the expected minimum length will score -1 point, up to a maximum of -1000. (The expected minimum length is ceil(count(ATGC) / 3) bytes.)

The decoder

  • Each byte by which the actual output length exceeds the expected maximum length will score -1 point, up to a maximum of -1000. (The expected maximum length is size(input) * 4 + 1 bytes.)

Both

  • Each kind of error that can be produced will score 100 points; for a total of 600 points possible for each program, 1200 in total.
  • Each test case for which the program produces an error count deviating from its own average by more than 30% will be penalized by -5 points.
  • Each test case for which the program produces an error count deviating from its own average by less than 15% will be given 5 points.
  • Each test case where all three runs produce identical outputs will be penalized by -10 points.

Hard requirements

An entry will be disqualified if:

  • For any valid input longer than one triplet it fails to produce even one error.
  • Its performance is such that it cannot complete the test gauntlet within approximately one hour.
  • It on average produces more than one error in every ten thousand units.
  • It on average produces less than one error in every million units.

The interface

The entrants should accept input on the standard input, and output to the standard output.

If the entry is one program with dual function, the switches -e and -d should set the program to encode and decode, respectively.

Example invocations:

$ encoder <infile.txt >outfile.bin
$ decoder <infile.bin >outfile.txt
$ recoder -e <infile.txt >outfile.bin

The winner

The winner is the entry with the highest score; the theoretical maximum being 1200 for error kinds plus 3000 points for stability in error generation rate.

In the unlikely event of a draw, the winner will be determined by vote count.

The additional notes

For the purposes of running the test gauntlet, each entry should include run or compilation instructions.

All entries should preferably be runnable on a Linux machine without X.

Williham Totland

Posted 2014-06-12T10:27:28.047

Reputation: 1 186

Question was closed 2016-04-18T16:13:50.160

I am going to give my answer.. but I think the two sentences "Each conversion should introduce a small error rate; on the order of one error per ten thousand to a million units processed." and "An entry will be disqualified if: It on average produces more than one error in every ten thousand units." are incompatible. The same between "Each conversion should introduce a small error rate; on the order of one error per ten thousand to a million units processed." and "An entry will be disqualified if: It on average produces less than one error in every million units." – Mattsteel – 2015-07-24T21:43:56.823

@Mattsteel I don't think the requirements are incompatible; they just explicitly state that the error rate should be between 1 : 10 000 and 1 : 1 000 000. – Williham Totland – 2015-07-25T02:14:00.970

@WillihamTotland. Now I see the correctness of your sentences. I simply wrongly flipped the two fractions. My only justification is that I am not English native speaking. – Mattsteel – 2015-07-27T20:31:57.523

There is a problem with the hard limits in this question. Say the input to the encoder consists of two triplets. If it doesn't produce an error, it's disqualified. If the input consists primarily of such cases, and it produces at least one error for all of them, it's disqualified for producing too many errors. – pppery – 2015-10-16T22:47:48.757

@ppperry You're right, kind of. I'll address that in the morning. – Williham Totland – 2015-10-17T01:03:28.547

I'm voting to close this question as off-topic because underhanded challenges are no longer on-topic on this site. http://meta.codegolf.stackexchange.com/a/8326/20469 – cat – 2016-04-18T14:47:23.423

Changed the tag. KotH is for challenges where submissions interact with each other. Also I'm afraid it will be difficult to impossible to enforce the "underhanded" component objectively. – Martin Ender – 2014-06-12T11:49:39.483

I agree with @m.buettner's comment that the underhanded part is difficult to judge. On the other hand I feel that this is the only interesting part in the challenge. I can guarantee that the error generation and rate are exactly within the specifications and therefore have the maximum points. Or do I miss something from the spec? Moreover, If an additional kind of error is accepted, it will be added to the above list; and all answers will be scored on it. sounds like you are going to change the challenge after people started working or submitting solutions which I think is not a good idea. – Howard – 2014-06-12T15:37:29.917

@Howard: Noted. The rules are updated with specific underhandedness criteria; and the mutational aspect wrt. errors is removed. – Williham Totland – 2014-06-12T16:01:20.267

@WillihamTotland I think "superfluous" is still not quite objective, but this may be a bit better now. – Martin Ender – 2014-06-12T16:29:58.153

@m.buettner: Given the simplicity of the core task; pretty much any loop apart from the main reading/writing loop would be superfluous. – Williham Totland – 2014-06-12T16:57:50.097

Answers

Perl 5.10

I'm glad to present my solution in Perl.

When I started the challenge I was quite sure that Perl would stay well below the limit of 1 hour.

For testing purposes, I developed a plain sample generator and a coded sample generator.

Then I developed the encoder, which took a bigger effort and produced longer code. The encoder works as follows:

  1. the first loop reads the whole file and splits the data into an array of all triplets
  2. the second loop traverses the array and prepends to each element its length
  3. the third loop traverses the array again and maps each character to produce the output.

The coded binary output is formatted as newline-terminated "lines" of 20 octets, where each octet codes one triplet, with a two-character prefix (like a cyclical line number).

For example, the shortest three-byte input:

AAA

should give the shortest output of three bytes plus a newline:

00ÿ

that is

30 30 FF 0A

and

AGG CGC AAC GGC TAA ATC GTT TTC ACA CCA CGT TTG AAA CGG GTG ACA CGA GAT TTA GTC
TAT GGT ACT AGG TAC GCC GTG GTG CGT GCG GAG TTA CTA GAT GTG TTA GTA CGC CAT CGT

should give the following binary:

01ÊûÃëÐÇå×ÌüùÖÀúæÌøáÔç
00ÑéÍÊÓïææùîâÔôáæÔäûñù

"Should", because of the small error rate: for the smallest input, the script introduces one error.

For a run over a 3-million-triplet file, the encoder introduces 11 errors.

Provided that the script is saved as dnacodec3.pl, the run is invoked at the command prompt as usual:

$> perl dnacodec3.pl -e < plain.txt > coded.txt

The decoder works as follows:

  1. the first loop reads the whole file and splits the data into an array of all octets, keeping track of every newline
  2. the second loop examines each octet, retaining those not beginning with 00 and disregarding the rest. The plain ASCII output is formatted as newline-terminated lines of triplets separated by single spaces. The newlines are in the same positions as they were in the input.

I prepared a 3-million-triplet sample test file (about 12 MB) and tested for timing. On my laptop with an Intel Core i5 vPro at 2.6 GHz, the 3M encoder run always takes less than 20 seconds. During the run it uses 200-220 MB of RAM. What a waste!

The decoding run takes less than 10 seconds. It cannot introduce errors... for now.

Again, for the decoding run:

$> perl dnacodec3.pl -d < coded.txt > plain.txt

Here is the code:

#!/usr/bin/perl
use strict ;
use warnings ;
my $switch = shift || die "usage $0 [-e|-d]\n";
my %map = qw( x 10  X 11  c 0b  ? 00
              A 00  T 01  G 10  C 11  
              0 00  1 01  2 10  3 11  
              00 A  01 T  10 G  11 C  ) ;
my $r = 20 ;
my @dummy = unpack ( '(A4)*', '0xxx' x $r ) ;
my $map = oct( $map{ c } . ($map{ C } x 9) ) ;
my $t = time() ;
my @inp = () ;
my @out = () ;
my @buf = () ;
my $n ;

sub arch {
    push @buf, @dummy[ 0 .. $r - $#buf - 2 ] ;
    push @out, "@buf" ;
    @buf = () ;
}

sub encode {
    my $mask = '(A3)*' ;
    while ( my $row = <STDIN> ) {
        chomp $row ;
        $row =~ s/\s+//g ;
        $row =~ s/[^\r\n\tATGC]//g ;
        next unless $row ;
        my @row = unpack( $mask, $row ) ;
        push @inp, @row if $row ;
    }
    $n = scalar @inp ;
    $r = $n if $r > $n ;
    for ( my $i = $n - 1 ; $i >= 0 ; --$i ) {
        my $e = $inp[$n-$i-1] ;
        my $l = length( $e ) ;
        my $d = $e =~ /\?/? 0: $l ;
        push @buf, ( $d -((($i-($n>>1))&$map)?0:1) )
           . $e . 'x'x(3-$l) ;
        arch unless $i % $r ;
    }
    arch if scalar @buf ;
    my $m = scalar @out ;
    for ( my $j = $m - 1 ; $j >= 0; --$j ) {
        my @ary = () ;
        my $e = $out[$m-$j-1] ;
        for my $byte ( split / /, $e ) {
            my @byte = split ( //, $byte ) ;
            my @trad = map { $map{ $_ } } @byte ;
            my $byte = join( '', @trad ) ;
            push @ary, $byte ;
        };
        my $row = sprintf( '%02d', $j % $r) ;
        $row .= pack( '(B8)*', @ary ) ;
        print "$row\n" ;
    }
}

sub decode {
    my $mask = '(B8)*' ;
    while ( my $row = <STDIN> ) {
        chomp $row ;
        next unless $row ;
        my @row = unpack( $mask, $row ) ;
        push @inp, @row[0..$#row], '?' if $row ;
    }
    $n = scalar @inp ;
    my @ary = () ;
    for ( my $i = $n - 1 ; $i >= 0 ; --$i ) {
        my $e = $inp[$n-$i-1] ;
        if ( $e ne '?' ) {
            my $u = oct( '0b'. substr($e,0,2) ) ;
            my $a = '' ;
            for my $j ( 1 .. $u ) {
                $a .= $map{ substr($e,$j+$j,2) } ;
            }
            push @ary, $a if $u ;
        }
        else {
            my $row = "@ary" ;
            $row =~ s/\s{2,}//g ;
            print "$row\n" if $row ;
            @ary =() ;
        }
    }
}

decode if $switch eq '-d' ;
encode if $switch eq '-e' ;

And here is the sample generator:

sub test_coder {
    my $n = shift || 1000 ;
    my @b = qw( A C G T ) ;
    for (my $l = 0; $l < $n; $l++) {
        my @ary = () ;
        for (my $t = 0; $t < 20; $t++) {
            push @ary, $b[ int(rand(4)) ] . $b[ int(rand(4)) ] . $b[ int(rand(4)) ] ;
        }
        print "@ary\n" ;
    }
    1;
}

sub test_decoder {
    my $n = shift || 1000;
    for (my $l = 0; $l < $n; $l++) {
        my @ary = () ;
        for (my $t = 0; $t < 20; $t++) {
            push @ary, int(rand(256)) | 0b11000000 ;
        }
        my $row = pack( 'C*', @ary ) ;
        print "$row\000" ;
    }
    1;
}


test_coder( @ARGV ) if $switch eq '-g' ;
test_decoder( @ARGV )  if $switch eq '-h' ;

Mattsteel

Posted 2014-06-12T10:27:28.047

Reputation: 381

I forgot to show where the error is injected: the push @buf in the second loop does the trick. – Mattsteel – 2015-08-14T22:20:33.420

It's subtle, I'll give you that. I won't be running full-scale tests until there is more than one competitor, but this is good stuff. :) – Williham Totland – 2015-08-19T22:31:06.840

Thanks. I know this is a suggestion for other friends... I would like to improve the randomness of the error position using the (still unused) time func: starting a run at odd or even seconds should yield different outputs – Mattsteel – 2015-08-20T07:36:52.990