Parse RNA into codons

18

3

Introduction

RNA is the less famous cousin of DNA. Its main purpose is to control the production of proteins in cells through a process called translation. In this challenge, your task is to implement a part of this process where the RNA is split into codons.

This challenge is thematically related, but concentrates on another part of the translation process.

Codons

We will think of RNA as a long string over the alphabet of base pairs, AUCG. In translation, RNA is split into non-overlapping chunks of three base pairs, called codons. The process begins at a start codon, AUG, and ends at a stop codon, one of UAA, UAG or UGA. Each codon (except the stop codons) corresponds to an amino acid, and the resulting string of amino acids forms the protein.

Input

Your input is a non-empty string of RNA.

Output

Your output is the list of codons in which the RNA is split, in any reasonable format. In this simplified model, the process begins at the leftmost start codon AUG, which is included in the output. It ends when a stop codon is encountered, or when we run out of RNA. If the input contains no start codon, the output shall be an empty list.

Examples

Consider the input sequence

ACAUGGAUGGACUGUAACCCCAUGC

The parsing begins at the leftmost occurrence of AUG, at index 2. It continues as follows:

AC AUG GAU GGA CUG UAA CCCCAUGC
   *   ^   ^   ^   +

The codon marked with * is the start codon, and those marked with ^ are also part of the output. The stop codon is marked with +. The correct output is

AUG,GAU,GGA,CUG

For the shorter input

ACAUGGAUGGACUGU

the process goes

AC AUG GAU GGA CUG U
   *   ^   ^   ^

This time, a stop codon is not encountered, so the process stops when we run out of base pairs. The output is the same as above.
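For readers who want an ungolfed baseline to compare answers against, the rules above can be sketched in Python 3 roughly as follows. The names `parse_codons` and `STOP_CODONS` are made up for this illustration and are not part of the challenge.

```python
# Hypothetical reference sketch of the challenge rules (not a golfed answer).
STOP_CODONS = {"UAA", "UAG", "UGA"}

def parse_codons(rna):
    # Parsing begins at the leftmost start codon, if there is one.
    start = rna.find("AUG")
    if start == -1:
        return []
    codons = []
    # Walk in steps of three; the range bound skips an incomplete trailing chunk.
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break  # the stop codon itself is not part of the output
        codons.append(codon)
    return codons

print(",".join(parse_codons("ACAUGGAUGGACUGUAACCCCAUGC")))
```

Both example inputs above give the same four codons with this sketch, and an input without AUG gives an empty list.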

Rules and scoring

You can write a full program or a function. The lowest byte count wins, and standard loopholes are disallowed.

Test cases

GGUACGGAUU -> 
GGCGAAAUCGAUGCC -> AUG
ACAUGGAUGGACUGU -> AUG,GAU,GGA,CUG
AUGACGUGAUGCUUGA -> AUG,ACG
UGGUUAGAAUAAUGAGCUAG -> AUG,AGC
ACAUGGAUGGACUGUAACCCCAUGC -> AUG,GAU,GGA,CUG
CUAAGAUGGCAUGAGUAAUGAAUGGAG -> AUG,GCA
AAUGGUUUAAUAAAUGUGAUAUGAUGAUA -> AUG,GUU
UGUCACCAUGUAAGGCAUGCCCAAAAUCAG -> AUG
UAUAGAUGGUGAUGAUGCCAUGAGAUGCAUGUUAAU -> AUG,GUG,AUG,AUG,CCA
AUGCUUAUGAAUGGCAUGUACUAAUAGACUCACUUAAGCGGUGAUGAA -> AUG,CUU,AUG,AAU,GGC,AUG,UAC
UGAUAGAUGUAUGGAUGGGAUGCUCAUAGCUAUAAAUGUUAAAGUUAGUCUAAUGAUGAGUAGCCGAUGGCCUAUGAUGCUGAC -> AUG,UAU,GGA,UGG,GAU,GCU,CAU,AGC,UAU,AAA,UGU

Zgarb

Posted 2016-01-16T01:19:19.097

Reputation: 39 083

13 The relationship of DNA to RNA to protein was once explained to me in computing terms that I could understand: DNA equates to a program on a hard disk; RNA equates to that program loaded into memory; and protein equates to the output data produced as a result of that program running. – Digital Trauma – 2016-01-16T01:42:47.753

4 The Dogma of molecular biology is "DNA makes RNA makes protein." So DNA is fairly rare, and RNA is less famous, but far more common. Protein is most common of all. – Level River St – 2016-01-16T01:53:27.270

1 @DigitalTrauma: As a geneticist I need to point out that this analogy is woefully inadequate to describe the reality of how DNA works. DNA is not some dead thing waiting to be transcribed into RNA so it can do something. – Jack Aidley – 2016-01-16T10:08:51.070

What actually occurs in practice if a piece of mRNA terminates before a stop codon (as in the simple example), meaning no stop triplet for a release factor to bind to? – Reinstate Monica - ζ-- – 2016-01-16T11:52:11.003

1 @Jack hard disk contents are not necessarily dead things either - upgrades, auto updates, etc, though of course not self-healing to the extent I understand DNA to be. But you're right - It is a weak analogy. However I think it got my non-geneticist self a little closer to a layman understanding – Digital Trauma – 2016-01-16T17:31:33.533

The Dogma is also horribly wrong when it comes to some types of viruses. – ApproachingDarknessFish – 2016-06-12T05:11:33.773

@DigitalTrauma DNA -> RNA -> protein chain -> protein folding -> goes off and does awesome nature – noɥʇʎԀʎzɐɹƆ – 2016-06-12T20:13:57.353

@DigitalTrauma My analogy would be DNA = github repo online, RNA = downloaded source code, protein chain = after ./.configure, protein folding = compiled program (it's super duper complicated, our computers can't even fold programming) – noɥʇʎԀʎzɐɹƆ – 2016-06-12T20:17:31.940

Answers

9

Retina, 39 38 32 30 bytes

M!`AUG|\B\G...
U(AA|AG|GA)\D*

The trailing linefeed is significant.

Output as a linefeed-separated list.

Try it online.

Explanation

M!`AUG|\B\G...

This is a match stage, which turns the input into a linefeed-separated list of all matches (due to the !). The regex itself matches every codon starting from the first AUG. We achieve this with two alternatives. AUG matches unconditionally, so that it can start the list of matches. The second alternative can be any codon (... matches any three characters), but the \G is a special anchor which ensures that it can only match right after another match. The only problem is that \G also matches at the beginning of the string, which we don't want. Since the input consists only of word characters, we use \B (any position that is not a word boundary) to ensure that this alternative is not used at the beginning of the input.

U(AA|AG|GA)\D*

This finds the first stop codon, matched as U(AA|AG|GA), together with everything after it, and removes the whole match from the string. Since the first stage split the codons into separate lines, we know that this match is properly aligned with the start codon. We use \D (non-digits) to match any character, since . wouldn't go past the linefeeds, and the input won't contain digits.
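For anyone who wants to experiment with the two-stage idea outside Retina, here is a rough Python 3 sketch. Python's built-in `re` module has no `\G` anchor, so the first stage is approximated with `str.find` plus `re.findall`; only the second stage reuses the regex from the answer unchanged. The function name `codons_two_stage` is invented for this illustration.

```python
import re

def codons_two_stage(rna):
    # Stage 1 (approximation of M!`AUG|\B\G...): list every complete
    # three-character chunk from the first AUG onwards, one per line.
    start = rna.find("AUG")
    if start < 0:
        return []
    listed = "\n".join(re.findall(r"...", rna[start:]))
    # Stage 2 (same regex as the answer): delete the first stop codon and
    # everything after it. \D also matches linefeeds, unlike the dot.
    trimmed = re.sub(r"U(AA|AG|GA)\D*", "", listed)
    return trimmed.split()

print(codons_two_stage("AUGACGUGAUGCUUGA"))
```

Because every line holds exactly three characters, the three-character stop pattern can only match a whole line, which is the alignment argument made above.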

Martin Ender

Posted 2016-01-16T01:19:19.097

Reputation: 184 808

5

Haskell, 115 112 bytes

import Data.Lists
fst.break(\e->elem e["UAA","UAG","UGA"]||length e<3).chunksOf 3.snd.spanList((/="AUG").take 3)

Usage example:

*Main> ( fst.break(\e->elem e["UAA","UAG","UGA"]||length e<3).chunksOf 3.snd.spanList((/="AUG").take 3) ) "AUGCUUAUGAAUGGCAUGUACUAAUAGACUCACUUAAGCGGUGAUGAA"
["AUG","CUU","AUG","AAU","GGC","AUG","UAC"]

How it works:

                spanList((/="AUG").take 3)  -- split input at the first "AUG"
             snd                            -- take 2nd part ("AUG" + rest)
     chunksOf 3                             -- split into 3 element lists
fst.break(\e->                              -- take elements from this list
           elem e["UAA","UAG","UGA"]||      -- as long as we don't see end codons
           length e<3)                      -- or run out of full codons 
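The same pipeline reads fairly naturally in Python 3 as well. This is only an analogous sketch, not a translation of the Data.Lists functions: `str.find` stands in for `spanList`, a generator for `chunksOf 3`, and `itertools.takewhile` for `fst.break`. The name `codons_pipeline` is made up here.

```python
from itertools import takewhile

STOP = ("UAA", "UAG", "UGA")

def codons_pipeline(rna):
    # Split the input at the first "AUG" and keep the second part;
    # with no "AUG" at all, the second part is empty.
    start = rna.find("AUG")
    tail = rna[start:] if start >= 0 else ""
    # Split into three-element chunks (the last one may be shorter).
    chunks = (tail[i:i + 3] for i in range(0, len(tail), 3))
    # Keep chunks until a stop codon or an incomplete chunk shows up.
    return list(takewhile(lambda c: len(c) == 3 and c not in STOP, chunks))

print(codons_pipeline("AUGCUUAUGAAUGGCAUGUACUAAUAGACUCACUUAAGCGGUGAUGAA"))
```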

nimi

Posted 2016-01-16T01:19:19.097

Reputation: 34 639

1

Python 2, 185 bytes

i=input()
o=[]
if i.find('AUG')>=0:i=map(''.join,zip(*[iter(i[i.find('AUG'):])]*3))
else:print "";exit()
for j in i:
 if j not in['UGA','UAA','UAG']:o+=[j]
 else:break
print ','.join(o)

Explanation: Set i to the input. Take the substring from the first 'AUG' to the end. Split it into strings of three. Collect codons until a stop codon is found, then stop.
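The "split into strings of three" step relies on the zip(*[iter(...)]*3) idiom, which may look cryptic at first. A small Python 3 sketch of what it does, with the hypothetical helper name `group_in_threes`:

```python
def group_in_threes(s):
    # One shared iterator passed three times: each tuple that zip builds
    # pulls three consecutive characters from the same stream.
    it = iter(s)
    # zip stops as soon as the iterator runs dry, so an incomplete
    # trailing group is silently dropped.
    return ["".join(triple) for triple in zip(it, it, it)]

print(group_in_threes("AUGGAUGGACUGU"))
```

Dropping the incomplete group is exactly what the challenge needs when the RNA runs out mid-codon.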

Try it here

TanMath

Posted 2016-01-16T01:19:19.097

Reputation: 1 431

1

JavaScript 88 82 70 69 chars

s=>/AUG(...)+?(?=(U(AA|AG|GA)|$))/.exec(s)[0].match(/.../g).join(",")

Usage Example:

(s=>/AUG(...)+?(?=(U(AA|AG|GA)|$))/.exec(s)[0].match(/.../g).join(","))("ACAUGGAUGGACUGUAACCCCAUGC")

Benjamin Gruenbaum

Posted 2016-01-16T01:19:19.097

Reputation: 219

Doesn't this fail for input without a stop codon? – Flambino – 2016-01-16T23:11:19.200

1 You're right, I didn't see that was an option, fixing – Benjamin Gruenbaum – 2016-01-16T23:13:52.797

Try s=>/AUG(...)+?(?=(U(AA|AG|GA)|$))/.exec(s)[0].match(/.../g). – Mama Fun Roll – 2016-01-17T02:43:03.783

Still fails for stop codons. (Try test case 3) – user81655 – 2016-01-17T02:55:20.987

1

MATL, 57 bytes

j'AUG(...)*?(?=(UAA|UAG|UGA|.?.?$))'XXtn?1X)tnt3\-:)3[]e!

This uses the current version (9.3.1) of the language/compiler.

Input and output are through stdin and stdout. The output is separated by linebreaks.

Example

>> matl
 > j'AUG(...)*?(?=(UAA|UAG|UGA|.?.?$))'XXtn?1X)tnt3\-:)3[]e!
 >
> ACAUGGAUGGACUGUAACCCCAUGC
AUG
GAU
GGA
CUG

EDIT (June 12, 2016): to adapt to changes in the language, [] should be removed. The link below includes that modification

Try it online!

Explanation

The code is based on the regular expression

AUG(...)*?(?=(UAA|UAG|UGA|.?.?$))

This matches substrings starting with AUG, containing groups of three characters (...) and ending in either UAA, UAG, or UGA; or ending at the end of the string, and in this case there may be one last incomplete group (.?.?$). Lookahead ((?=...)) is used so that the stop codons are not part of the match. The matching is lazy (*?) in order to finish at the first stop codon found, if any.

j                                     % input string
'AUG(...)*?(?=(UAA|UAG|UGA|.?.?$))'   % regex
XX                                    % apply it. Push cell array of matched substrings
tn?                                   % if non-empty
1X)                                   % get first substring
tnt3\-:)                              % make length the largest possible multiple of 3
3[]e!                                 % reshape into rows of 3 columns
                                      % implicit endif
                                      % implicit display
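The regular expression is not specific to MATL. As a rough cross-check, the same pattern can be tried with Python 3's `re` module; the sketch below assumes the input contains no linefeeds, and the name `codons_lookahead` is invented for this illustration.

```python
import re

# Same pattern as in the answer: lazy groups of three after AUG, ended by a
# lookahead for a stop codon or for at most two leftover characters.
PATTERN = re.compile(r"AUG(...)*?(?=(UAA|UAG|UGA|.?.?$))")

def codons_lookahead(rna):
    match = PATTERN.search(rna)
    if match is None:
        return []  # no start codon in the input
    body = match.group(0)
    # Cut the matched substring into codons of three characters.
    return [body[i:i + 3] for i in range(0, len(body), 3)]

print(codons_lookahead("ACAUGGAUGGACUGUAACCCCAUGC"))
```

In this Python version the match is always AUG followed by whole groups of three, so the slicing never produces a short chunk.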

Luis Mendo

Posted 2016-01-16T01:19:19.097

Reputation: 87 464

0

Ruby, 97 95 78 75 62 bytes

->(r){r.scan(/AUG|\B\G.../).join(?,).sub(/,U(AA|AG|GA).*/,'')}

I don't golf much, so I'm sure it can be improved.

Edit: Stole Borrowed Martin Büttner's excellent \B\G trick

Flambino

Posted 2016-01-16T01:19:19.097

Reputation: 1 001