mtDNA mutation tree

13

4

Background:

MtDNA is a part of the human DNA that is passed from a mother to a child and it rarely mutates. Since this is true for all humans it is possible to create a huge tree that visualize how all humans are related to each other through their maternal ancestry all the way back to the hypothetical EVE. Every mutation in the MtDNA when a child is born creates a new sub-branch from its parent branch in the tree.

Find out more about Mitochondrial DNA (mtDNA) here: https://en.wikipedia.org/wiki/Mitochondrial_DNA

Objective:

Your program is served a list of mtDNA tree branches mutation count and your program should deliver a tree view of it

Example input and output:

The input is a 3-column semi-colon separated table with a line for each branch. Example:

L0a'b'f'k;L0;14
L0a'b'f;L0a'b'f'k;23
L0;mtEVE;10
L0a'b;L0a'b'f;30
L0a;L0a'b;38
L0a1'4;L0a;39
L0a1;L0a1'4;40
L0a1a;L0a1;42
L0a1a NL;L0a1a;43
L0a1a1;L0a1a NL;44
L0a1a2;L0a1a NL;45
L0a1a3;L0a1a NL;44
L0a1 NL;L0a1;41
L0a1b;L0a1 NL;44
L0a1b NL;L0a1b;45
L0a1b1;L0a1b NL;46
L0a1b1a;L0a1b1;47
L0a1b1a1;L0a1b1a;48
L0a1b2;L0a1b NL;48
L0a1b2a;L0a1b2;50
L0a1c;L0a1 NL;45
L0a1d;L0a1 NL;44
L0a4;L0a1'4;55
L0a2;L0a;47
L0a2a;L0a2;49
L0a2a1;L0a2a;50
L0a2a1a;L0a2a1;51
L0a2a1a1;L0a2a1a;53
L0a2a1a2;L0a2a1a;53
L0a2a2;L0a2a;53
L0a2a2a;L0a2a2;54
L0a2b;L0a2;57
L0a2b1;L0a2b;58
L0a2c;L0a2;60
L0a2d;L0a2;49
L0a3;L0a;53
L0b;L0a'b;48
L0f;L0a'b'f;37
L0f1;L0f;61
L0f2;L0f;41
L0f2a;L0f2;46
L0f2a1;L0f2a;59
L0f2b;L0f2;63
L0k;L0a'b'f'k;39
L0k1;L0k;48
L0k2;L0k;54
L0d;L0;21
L0d1'2;L0d;25
L0d1;L0d1'2;30
L0d1 NL;L0d1;31
L0d1a;L0d1 NL;38
L0d1a1;L0d1a;41
L0d1c;L0d1 NL;39
L0d1c1;L0d1c;45
L0d1c1a;L0d1c1;46
L0d1c1b;L0d1c1;46
L0d1b;L0d1 NL;36
L0d1b1;L0d1b;40
L0d2;L0d1'2;31
L0d2a'b;L0d2;32
L0d2a;L0d2a'b;42
L0d2a1;L0d2a;43
L0d2b;L0d2a'b;46
L0d2c;L0d2;45
L0d3;L0d;39

Your program should output a left-to-right tree view including some numbers based on the input. Based on the example input, this is valid output:

  0│ ┐                                                               mtEVE               [  0][ 63]
 10│ └♦♦♦♦♦♦♦♦♦┬────────────────┬─────────────────────────────────── L0                  [ 10][ 63]
 21│           │                └♦♦♦♦♦♦♦♦♦♦┬──────┬───────────────── L0d                 [ 11][ 46]
 39│           │                           │      └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0d3                [ 18][ 39]
 25│           │                           └♦♦♦┐                     L0d1'2              [  4][ 46]
 30│           │                               ├♦♦♦♦┬─────────────── L0d1                [  5][ 46]
 31│           │                               │    └┬────┬┐         L0d1 NL             [  1][ 46]
 36│           │                               │     │    │└♦♦♦♦┬─── L0d1b               [  5][ 40]
 40│           │                               │     │    │     └♦♦♦ L0d1b1              [  4][ 40]
 38│           │                               │     │    └♦♦♦♦♦♦┬── L0d1a               [  7][ 41]
 41│           │                               │     │           └♦♦ L0d1a1              [  3][ 41]
 39│           │                               │     └♦♦♦♦♦♦♦┬────── L0d1c               [  8][ 46]
 45│           │                               │             └♦♦♦♦♦┬ L0d1c1              [  6][ 46]
 46│           │                               │                   ├ L0d1c1a             [  1][ 46]
 46│           │                               │                   └ L0d1c1b             [  1][ 46]
 31│           │                               └♦♦♦♦♦┬┬───────────── L0d2                [  6][ 46]
 45│           │                                     │└♦♦♦♦♦♦♦♦♦♦♦♦♦ L0d2c               [ 14][ 45]
 32│           │                                     └┬──┐           L0d2a'b             [  1][ 46]
 42│           │                                      │  └♦♦♦♦♦♦♦♦♦┬ L0d2a               [ 10][ 43]
 43│           │                                      │            └ L0d2a1              [  1][ 43]
 46│           │                                      └♦♦♦♦♦♦♦♦♦♦♦♦♦ L0d2b               [ 14][ 46]
 14│           └♦♦♦┬────────┐                                        L0a'b'f'k           [  4][ 63]
 39│               │        └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦┬─────┬──────── L0k                 [ 25][ 54]
 48│               │                                 │     └♦♦♦♦♦♦♦♦ L0k1                [  9][ 48]
 54│               │                                 └♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0k2                [ 15][ 54]
 23│               └♦♦♦♦♦♦♦♦┬──┐                                     L0a'b'f             [  9][ 63]
 30│                        │  └♦♦♦♦♦♦┬───────────┐                  L0a'b               [  7][ 60]
 48│                        │         │           └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0b                 [ 18][ 48]
 38│                        │         └♦♦♦♦♦♦♦┬────┬─┬────────────── L0a                 [  8][ 60]
 53│                        │                 │    │ └♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0a3                [ 15][ 53]
 39│                        │                 │    └┬────┐           L0a1'4              [  1][ 55]
 40│                        │                 │     │    └┬────┬──── L0a1                [  1][ 50]
 42│                        │                 │     │     │    └♦┬── L0a1a               [  2][ 45]
 43│                        │                 │     │     │      └┬┐ L0a1a NL            [  1][ 45]
 44│                        │                 │     │     │       │├ L0a1a1              [  1][ 44]
 44│                        │                 │     │     │       │└ L0a1a3              [  1][ 44]
 45│                        │                 │     │     │       └♦ L0a1a2              [  2][ 45]
 41│                        │                 │     │     └┬────┬┐   L0a1 NL             [  1][ 50]
 44│                        │                 │     │      │    │└♦♦ L0a1d               [  3][ 44]
 45│                        │                 │     │      │    └♦♦♦ L0a1c               [  4][ 45]
 44│                        │                 │     │      └♦♦┬───── L0a1b               [  3][ 50]
 45│                        │                 │     │         └┬─┐   L0a1b NL            [  1][ 50]
 46│                        │                 │     │          │ └┬─ L0a1b1              [  1][ 48]
 47│                        │                 │     │          │  └┬ L0a1b1a             [  1][ 48]
 48│                        │                 │     │          │   └ L0a1b1a1            [  1][ 48]
 48│                        │                 │     │          └♦♦┬─ L0a1b2              [  3][ 50]
 50│                        │                 │     │             └♦ L0a1b2a             [  2][ 50]
 55│                        │                 │     └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0a4                [ 16][ 55]
 47│                        │                 └♦♦♦♦♦♦♦♦┬─┬───┬────┬─ L0a2                [  9][ 60]
 49│                        │                          │ │   │    └♦ L0a2d               [  2][ 49]
 49│                        │                          │ │   └♦┬┬─── L0a2a               [  2][ 54]
 50│                        │                          │ │     │└┬── L0a2a1              [  1][ 53]
 51│                        │                          │ │     │ └┬─ L0a2a1a             [  1][ 53]
 53│                        │                          │ │     │  ├♦ L0a2a1a1            [  2][ 53]
 53│                        │                          │ │     │  └♦ L0a2a1a2            [  2][ 53]
 53│                        │                          │ │     └♦♦♦┬ L0a2a2              [  4][ 54]
 54│                        │                          │ │         └ L0a2a2a             [  1][ 54]
 57│                        │                          │ └♦♦♦♦♦♦♦♦♦┬ L0a2b               [ 10][ 58]
 58│                        │                          │           └ L0a2b1              [  1][ 58]
 60│                        │                          └♦♦♦♦♦♦♦♦♦♦♦♦ L0a2c               [ 13][ 60]
 37│                        └♦♦♦♦♦♦♦♦♦♦♦♦♦┬─┬─────────────────────── L0f                 [ 14][ 63]
 61│                                      │ └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0f1                [ 24][ 61]
 41│                                      └♦♦♦┬───┬───────────────── L0f2                [  4][ 63]
 46│                                          │   └♦♦♦♦┬──────────── L0f2a               [  5][ 59]
 59│                                          │        └♦♦♦♦♦♦♦♦♦♦♦♦ L0f2a1              [ 13][ 59]
 63│                                          └♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦ L0f2b               [ 22][ 63]

Input: Details

The input table is not sorted in any particular order. If we randomly reorder the input lines, the output should remain the same.

Each line in the input represents a mtDNA tree branch or a hypothetical tree branch. The input table could be any number of lines in length.

Input: Details - Column A (branch name):

The first column is the actual branch name. The name divides the input lines into 2 groups of line types that should be handled different (explained later) from each other:

  • Type 1: Name consists of any ' or the suffix NL
  • Type 2: Name doesn't consist of any ' or the suffix NL.

The name can be up to 20 characters in length.

Input: Details - Column B (parent branch name):

The second column contains a pointer to the parent branch name. Several lines (branches) can share the same parent. There is always exactly 1 distinct parent branch name in the input table that points to a parent that isn't represented among the input lines, that parent branch name is the root for the tree. In the example input that is the third line pointing to the root: mtEVE. If the input has more than one root, or endless loops, it is an invalid input.

Input: Details - Column C (# of mutations):

The third column is the total number of mutations that particular branch has had counting from the root. Human mtDNA hasn't mutated more than 100 times in a single line from the hypothetical maternal root (Human/chimp ancestor EVE), but your program should be able to handle 3 digit # of mutations, up to 999.

From the input you can calculate a branch # of unique mutations by substracting its # of mutations from its parent # of mutations.

Output: Details

Your program should output 1 out of 3 different error messages if the input is invalid according to the input description.

  • Error message 1, if the input has more than one root: ERROR: Multiple roots
  • Error message 2, if the input parent pointers loops: ERROR: Endless loop
  • Error message 3, anything else invalid about the input: ERROR: Invalid input

If the input contains no errors, your program should output the tree according to the following constraints: Each line consists of 5 parts A, B, C, D, and E:

  • A: 5 characters, 3 char right-aligned # of mutations, a vertical line character: | and 1 blankspace
  • B: [max # of mutations] characters wide tree + 1 blankspace
  • C: 20 characters, left-aligned branch name
  • D: 5 characters, 3 character right-aligned # of unique mutations for the branch encapsulated between [ and ]. (Unique mutations will be explained below).
  • E: 5 characters, 3 character right-aligned max # of total mutations for this branch and all child-branches encapsulated between [ and ].

A branch # of unique mutations is the difference in # of mutations the current branch has from the # of mutations its parent branch has. The first line is the root and it should be represented with 0 for # of mutations and # of unique mutations.

Output: Details - line order/sortation

If two or more sub-branches are sharing the same parent, the branches are ordered by the sub-branches maximum # of total mutations in descending order. In our example L0a1'4, L0a3 and L0a2 shares the parent: L0a.

In the tree view the order from top to bottom is, sub-branches maximum # of total mutations within parenthesis: L0a3 (53), L0a1'4 (55), L0a2 (60).

If two or more sub-branches shares the same maximum # of mutations on child branches, they are vertically aligned and branched from their parent from the same spot, line ordering among those sub-branches is alphabetical.

Output: Details - tree (part B)

The tree should be drawn with the following ascii characters: , , , , , ,

The logic of the tree is that all mutations should be represented. A branch from a parent branch: or represents 1 mutation. Additional unique mutations on the same branch are represented by: and they must be left-aligned and placed before the first sub-branch.

Sub-branches are branched from their parent along the x-axis and the position is determined by the maximum # of mutations among all subsequent child branches.

As hinted before the input has 2 different type of input lines. Type 1 with any ' characters or the NL suffix in the branch name, should not fill the horizontal line to the far right on their line but end with a on the last sub-branch. In the example it is applied to the following branches:

L0a'b'f'k;L0;14
L0a'b'f;L0a'b'f'k;23
L0a'b;L0a'b'f;30
L0a1'4;L0a;39
L0a1a NL;L0a1a;43
L0a1 NL;L0a1;41
L0a1b NL;L0a1b;45
L0d1'2;L0d;25
L0d1 NL;L0d1;31
L0d2a'b;L0d2;32

Hopefully the example input and output answers any additional questions about how the tree should be drawn, consider it part of the challenge to figure out the logic.

Inspiration

You are welcome to try out my (un-golfed) javascript version for inspiration: http://artificial.se/DNA/mtDNAmutationTree3.html (it lacks error checking, and some statistics is added that is not part of this particular challenge).

A complete mtDNA-tree version [based on http://www.phylotree.org/ mtDNA tree Build 16 (19 Feb 2014)] can be found here:

http://artificial.se/DNA/mtDNAfull.html

The data-file used for the full tree:

http://artificial.se/DNA/mtDNA_full.txt

This is a code-golf challenge.

Plarsen

Posted 2015-06-21T20:45:35.250

Reputation: 1 740

L0d1 shouldn't be placed before L0d2, according to the sorting rule : "... descending order ..." – guy777 – 2015-06-25T15:41:55.630

L0a1'4 is not (55) but (39), L0a2 is not (60) but (47) ... Could you clarify this ? – guy777 – 2015-06-25T15:49:57.893

L0d1 and L0d2 are both 46, therefor alphabetical order is applied – Plarsen – 2015-06-25T17:22:58.153

L0a4 55 and a child of L0a1'4 so maximum mutations for L0a1'4 is 55 – Plarsen – 2015-06-25T17:25:42.907

I have a few questions: 1) Is this a real project? I have the impression that something like this could be worth actual money. 2) How did you get the example output? 3) Why does part A have 8 characters instead of 5? 4) Why does part D have 6 characters instead of 5? 5) Why does "L0a1 NL" have "4" in part D? – aditsu quit because SE is EVIL – 2015-06-27T18:14:45.803

@aditsu 1) Not sure what you mean by 'real project'. Well, I have my own working (un-golfed) 'project' here: http://artificial.se/DNA/mtDNAmutationTree3.html save the input as .txt-file and upload to create a tree. If you make money out of it, please send some to me ;). 2) From my own javascipt application. 3) Bug in my question, accidently added 4 extra blankspaces in front, will fix! 4) Same as 3.

– Plarsen – 2015-06-27T19:21:32.360

@aditsu 5) Another typo from me, it should read 1 in unique mutations. Will fix! – Plarsen – 2015-06-27T19:38:14.170

I have been working on a program to do this all day, but it is unfinished and not golfed at all. I was disheartened when I saw that you already have an ungolfed solution because it means that I probably wasted my day off. Would you give the bounty to an ungolfed answer, or should I stop now? Also, sorry for asking for clarification instead of answering. I am not permitted to comment yet and have no other way to reach Plarsen. – Mike Bufardeci – 2015-06-27T20:57:54.020

Okay, sorry about that Mike Bufardeci. Bounty goes to first answer, golfed or un-golfed (that is not a copy of my version). I cannot guarantee how other users vote on your answer though, because this is code-golf afterall. – Plarsen – 2015-06-27T21:29:47.183

Thanks; 1) I thought maybe some company or university doing research on this stuff needed to analyze data and represent it this way, I guess I was wrong :p @MikeBufardeci should probably be more disheartened that I had my solution >90% finished before I first commented (that's how I knew to ask questions 3-5, especially 5), but anyway, I have to go for a few hours; we'll see what happens :) – aditsu quit because SE is EVIL – 2015-06-27T23:43:51.803

I found one more problem: "L0a1 NL" has 45 in part A, should be 41. And another note: the character codes you mentioned for those symbols seem to refer to code page 437; they're totally different in unicode, and missing from ISO 8859-1/2 and Windows-1252 – aditsu quit because SE is EVIL – 2015-06-28T06:24:26.513

@aditsu You are correct, fixed! Great answer btw, the bounty is yours – Plarsen – 2015-06-28T10:41:25.997

Answers

6

Python 3, 925 bytes

Yay, under 1 KB! Probably still room for golfing...

import sys
class L:
 def __init__(x,**k):x.__dict__.update(k)
m={}
def e(x):print('ERROR: '+x);exit()
try:
 for x in sys.stdin:a,b,c=x.split(';');m[a]=L(s=a,p=b,m=int(c),l=0)
except:e('Invalid input')
a=set()
def k(x):
 if x.l<0:e('Endless loop')
 if x.l<1:y=m.get(x.p);x.l=-1;k(y)if y else a.add(x.p);x.l=1
for x in m:k(m[x])
r=L(s=a.pop(),p=0,m=0)
if a:e('Multiple roots')
m[r.s]=r
c={}
def u(x):
 c[x.s]=[m[y]for y in m if m[y].p==x.s];x.x=x.m
 for y in c[x.s]:u(y);x.x=max(x.x,y.x)
u(r)
o=[]
def f(p,x,o=o):
 d=x.m-p.m;j=p.m+r.x-x.x;s=x.s;x.i=len(o);l=sorted(c[s],key=lambda t:(t.x,t.s));q=' '*j+'└'+'♦'*(d-1);z='─'
 if"'"in s or s[-2:]=='NL'or x==r:q+=z*(x.x-l[0].x);z=' '
 o+=list("%3d│ "%x.m+q+z*(r.x-len(q))+' %-20s[%3d][%3d]'%(s,d,x.x)),;j+=5;o[p.i][j]='┐┬'[o[p.i][j]in'─┬']
 for i in range(p.i+1,x.i):o[i][j]='├│'[o[i][j]in' │']
 for y in l:f(x,y)
f(r,r)
print('\n'.join(''.join(x)for x in o))

aditsu quit because SE is EVIL

Posted 2015-06-21T20:45:35.250

Reputation: 22 326