Longest reverse palindromic DNA substring

11

As you may know, in DNA there are four bases — adenine (A), cytosine (C), guanine (G) and thymine (T). Typically A bonds with T and C bonds with G, forming the "rungs" of the DNA double helix structure.

We define the complement of a base to be the base it bonds to — i.e. the complement of A is T, the complement of T is A, the complement of C is G and the complement of G is C. We can also define the complement of a DNA string to be the string with each base complemented, e.g. the complement of GATATC is CTATAG.

Because of the double-stranded structure of DNA, the bases on one strand are complementary to the bases on the other strand. However DNA has a direction, and DNA transcription occurs in opposite directions on the two strands. Hence molecular biologists are often interested in the reverse complement of a DNA string — quite literally the reverse of the complement of the string.

To extend our previous example, the reverse complement of GATATC is CTATAG backwards, so GATATC. As you may have noticed, in this example the reverse complement is equal to the original string — we call such a string a reverse palindrome.*

Given a string of DNA, can you find the longest substring which is a reverse palindrome?

*I use the term "reverse palindrome", taken from Rosalind, to differentiate from the usual meaning of palindrome.


Input

Input will be a single string consisting of only the characters ACGT in upper case. You may write either a function or a full program for this challenge.

Output

You may choose to output via either printing or returning (the latter choice is only available in the case of a function).

Your program should output the longest reverse palindromic substring of the input string, if there is a unique solution. If multiple solutions exist, then you may either output any single one of them, or all of them (your choice). Duplicates are okay if you choose to output all of them.

The input is guaranteed to have a solution of at least length 2.

Worked example

ATGGATCCG -> GGATCC

The reverse complement of GGATCC is itself (GGATCC --complement--> CCTAGG --reverse--> GGATCC), so GGATCC is a reverse palindrome. GATC is also a reverse palindome, but it is not the longest one.

Test cases

AT -> AT
CGT -> CG
AGCA -> GC
GATTACA -> AT, TA
ATGGATCCG -> GGATCC
CCCCCGGGGG -> CCCCCGGGGG
ACATATATAGACT -> ATATAT, TATATA
ATTCGATCTATGTAAAGAGG -> TCGA, GATC
CGCACGTCTACGTACCTACGTAG -> CTACGTAG
TCAATGCATGCGGGTCTATATGCAT -> ATGCAT, GCATGC [, ATGCAT]
CGCTGAACTTTGCCCGTTGGTAGAACGGACTGATGTGAACGAGTGACCCG -> CG, GC, TA, AT [, GC, CG, CG, CG, CG]
CTCGCGTTTGCATAACCGTACGGGCGGAACAGTCGGCGGTGCCTCCCAGG -> CCGTACGG

Scoring

This is code golf, so the solution in the fewest bytes wins.

Sp3000

Posted 2014-12-05T04:18:02.107

Reputation: 58 729

It would have been nicer if printing all of them had some sort of bonus. – Optimizer – 2014-12-05T06:03:53.867

@Optimizer isn't printing just the longest more difficult than printing all of them? – trichoplax – 2014-12-06T15:03:40.550

Or do you mean printing all of the longest ones? – trichoplax – 2014-12-06T15:04:09.173

@githubphagocyte yes, your second comment. – Optimizer – 2014-12-06T15:12:13.783

Answers

6

Pyth, 37 36 28 24 bytes

ef&}TzqmaCd6T_mx4aCk6Tyz

Combining the tips from FryAmTheEggman and the reverse palindrome check trick from Peter, this is a super short version.

However, this only works with Pyth 3.0.1 which you can download from this link and run like

python3 pyth.py -c "ef&}TzqmaCd6T_mx4aCk6Tyz" <<< "ATTCGATCTATGTAAAGAGG"

(linux bash only. On windows, press Enter instead of the <<< and then type the input)


This is my previous submission - 28 bytes solution

J"ACGT"ef&}TzqTjk_m@_JxJdTyz

Thanks to FryAmTheEggman for this version. This one creates all possible subsets of the input DNA string, filters the subsets on the condition that the subset is a substring of input and the reverse of transform is equal to the subset itself.

Due to all possible subset creation, this takes up even more memory than Peter's answer.


This is my first submission - 36 byte solution.

J"ACGT"eolNfqTjk_m@_JxJdTm:zhkek^Uz2

This is the exact translation of my CJam answer. I was hoping that this would be much smaller but turns out that lack of translation method made it almost similar size (still 2 bytes smaller though)

Try it online here

Optimizer

Posted 2014-12-05T04:18:02.107

Reputation: 25 836

Uz is equivalent to Ulz. – isaacg – 2014-12-05T10:40:25.387

1J"ACGT"eolNf&}TzqTjk_m@_JxJdTyz Using y for subsets and then filtering out strings that aren't substrings of z is shorter :) – FryAmTheEggman – 2014-12-05T14:10:37.880

1Oh, and if you do that, you don't need to sort because y is already sorted by length. You can just do ef... – FryAmTheEggman – 2014-12-05T14:13:28.837

5

GolfScript (35 34 bytes)

]{{..(;\);}%)}do{{6&}%.{4^}%-1%=}?

For testing purposes you may wish to use

]{{..(;\);}%.&)}do{{6&}%.{4^}%-1%=}?

which adds a .& to reduce the duplicated effort.

Dissection

]{         # Gather string into an array and do-while...
  {        #   Map over each string in the array
    ..     #     Make a couple of copies of the string
    (;     #     Remove the first character from one of them
    \);    #     Remove the last character from the other
  }%
  )        #   Extract the last string from the array
}do        # Loop until that last string is ''
           # Because of the duplication we now have an array containing every substring
           # of the original string, and if we filter to the first occurrence of each
           # string then they're in descending order of length
{          # Find the first element in the string satisfying the condition...
  {6&}%    #   Map each character in the string to its bitwise & with 6
  .{4^}%   #   Duplicate, and map each to its bitwise ^ with 4
           #   This serves to test for A <-> T, C <-> G
  -1%=     #   Reverse and test for equality
}?

Peter Taylor

Posted 2014-12-05T04:18:02.107

Reputation: 41 901

q{]{__(;\);}%~}h]{:c:i6f&_4f^W%=}= in CJam. Same size. Do not try it in the online compiler for anything bigger than 7 length input – Optimizer – 2014-12-05T12:29:57.897

4

CJam, 39 38 bytes

I am sure this can be golfed further...

q:Q,,_m*{~Q<>}%{,~}${_"ACGT"_W%erW%=}=

Takes the DNA string from STDIN and outputs the longest reverse palindromic DNA to STDOUT

Try it online here

(Explanation soon) (Saved 1 byte thanks to Peter)

Optimizer

Posted 2014-12-05T04:18:02.107

Reputation: 25 836

4

Python 3, 125 chars

S=input()
l=[]
while S:
 s=_,*S=S
 while s:l+=[s]*all(x+y in"ATA CGC"for x,y in zip(s,s[::-1]));*s,_=s
print(*max(l,key=len))

Look ma, no indexing! (Well, except to reverse the string, that doesn't count.)

Iterating over the substrings is done by taking off chars from the front and end using starred assignment. The outer loop removes characters for the start of S, and for each such suffix, s loops over all prefixes of it, testing them one by one.

Testing for reverse palindrome is done by the code

all(x+y in"ATA CGC"for x,y in zip(s,s[::-1]))

which checks that each symbol and its reversed-string counterpart are one of "AT", "TA", "CG", and "GC". I also found a set-based solution to be one character shorter, but loses two chars on requiring outer parens when used.

set(zip(s,s[::-1]))<=set(zip("ACTG","TGAC"))

This still feels like it can be shortened.

Finally, the longest palindrome is printed.

print(*max(l,key=len))

I hope space-separated outputs are OK. If a list also fine, the the star could be removed. I had tried tracking the running max in the loop instead, as well as cramming the inner loops into a list comprehension so I could take the max directly without constructing l, and both turned out slightly longer. But, it was close enough that it's hard to tell which approach is actually best.

xnor

Posted 2014-12-05T04:18:02.107

Reputation: 115 687

I wanted to be more flexible with this question so I didn't specify an exact output format for tied solutions. If it's clear what the solutions are then it's fine, so a list is okay. – Sp3000 – 2014-12-06T22:57:57.370

3

Perl - 59 bytes

#!perl -p
$_=$_[~!map$_[length]=$_,/((.)(?R)?(??{'$Q5'^$+.-$+}))/gi]

Counting the shebang as one, input is taken from STDIN.

Sample usage:

$ echo CTCGCGTTTGCATAACCGTACGGGCGGAACAGTCGGCGGTGCCTCCCAGG | perl dna.pl
CCGTACGG

primo

Posted 2014-12-05T04:18:02.107

Reputation: 30 891

3

J (45)

{.@(\:#&.>)@,@(('ACGT'&(|.@]-:[{~3-i.)#<)\\.)

This is a function that takes a string:

   {.@(\:#&.>)@,@(('ACGT'&(|.@]-:[{~3-i.)#<)\\.) 'ATGGATCCG'
┌──────┐
│GGATCC│
└──────┘

Explanation:

{.@(\:#&.>)@,@(('ACGT'&(|.@]-:[{~3-i.)#<)\\.) 

              (                          \\.)  for each prefix of each suffix
               (                      #<)      include the argument if,
                        |.@]                      its reverse
                            -:                    is equal to
                'ACGT'&(      [{~3-i.)            the complement
            ,@                                 ravel
   (\:#&.>)@                                   sort by length of item
{.@                                            take the first one   

marinus

Posted 2014-12-05T04:18:02.107

Reputation: 30 224

3

Python 2 - 177 bytes

s=raw_input()
r,l,o=range,len(s),[]
for a in[s[i:j+1]for i in r(l)for j in r(i,l)]:q=['TC GA'.index(c)-2for c in a];o+=[a if[-n for n in q][::-1]==q else'']
print max(o,key=len)

Simple brute force. The actual "reverse palindromic" check is the only interesting part. Here it is written more readably:

check = ['TC GA'.index(c)-2 for c in substring]
if [-n for n in check][::-1] == check:
    # substring is reverse palindromic

I do that on every possible substring and put them in a list if it's true. If it's false, I put in an empty string instead. When all the checks are done I output the longest element of the list. I used an empty string because it saves bytes over putting nothing in, but it also means the program won't choke if there's no solution. It outputs an empty line and exits gracefully.

undergroundmonorail

Posted 2014-12-05T04:18:02.107

Reputation: 5 897

1This seems to be shorter if you smoosh everything into a single list uncomprehension. I had to change the logic a bit, but I got 162 with s=raw_input();r,l,g=range,len(s),'TGCA';print max([a for a in[s[i:j+1]for i in r(l)for j in r(i,l)]if[g[n]for n in[~g.find(c)for c in a]]==list(a)[::-1]],key=len). Also, for strings, use find over index :) – FryAmTheEggman – 2014-12-05T16:03:26.780