Rosetta Stone Challenge: Gene Mapping

11

7

The goal of a Rosetta Stone Challenge is to write solutions in as many languages as possible. Show off your programming multilingualism!

The Challenge

Your challenge is to implement a program that will map some genes using cross-over frequencies, in as many programming languages as possible. You are allowed to use any sort of standard library function that your language has, since this is mostly a language showcase.

What is "gene mapping?"

Gene mapping is the process of locating the relative position of genes on chromosomes. This is done by measuring the crossing-over frequency of pairs of genes, equal to the percent of offspring in which that the pair is not inherited together. Distance is measured in map units with one map unit equal to one percent of crossing over. For example, if genes C & D have a crossing-over frequency of 11%, then gene C is a distance of 11 map units away from gene D.

Gene mapping is performed with multiple pairs of genes to determine their relative order. For example, the data (A,B,12) (D,B,7) (A,D,5) (D,H,2) (H,B,9) produces the following map:

A..H.D......B

You may have noticed that B......D.H..A is also a valid map. This is true, because it is not possible to distinguish between mirror opposites. Your program can pick which one to output. Although the input may not include every possible pair, there will always be enough information to reconstruct the entire map (so there will never be more than 2 valid outputs). In addition, the numbers will always work out (unlike actual biology), meaning that you won't have stuff like (A,B,3) (B,C,4) (A,C,13).

Input

Input will begin with a number n followed by a list of genes (uppercase letters). There will then be n triplets of data. Each set will consist of a pair of genes and their crossing over frequency (distance).

3,P,H,I
P,H,3
H,I,1
P,I,4

7,A,B,G,Q,U
B,Q,4
A,B,10
G,U,13
Q,U,10
A,G,9
G,Q,3
A,Q,6

Input is not rigidly defined, because different languages may have restrictions on what is feasible. For example, you may change the delimiters to something other than commas and newlines. Input formatting is largely up to you.

Output

Output will be a rendition of the gene map. It will consist of the genes (capital letters) spaced out by periods such that the distances are accurately portrayed. Here are the outputs for the above examples.

P..HI  *or*  IH..P

BG..Q.....A...U  *or*  U...A.....Q..GB

This also isn't a completely rigid requirement. For example you could use something other than periods, like commas or spaces.

The Objective Winning Criterion

As for an objective winning criterion, here it is: Each language is a separate competition as to who can write the shortest entry, but the overall winner would be the person who wins the most of these sub-competitions. This means that a person who answers in many uncommon languages can gain an advantage. Code-golf is mostly a tiebreaker for when there is more than one solution in a language: the person with the shortest program gets credit for that language.

Rules, Restrictions, and Notes

Your program can be written in any language that existed prior to December 20th, 2013. I will also have to rely on the community to validate some responses written in some of the more uncommon/esoteric languages, since I am unlikely to be able to test them.


Current Leaderboard

This section will be periodically updated to show the number of languages and who is leading in each.

  • AutoHotkey (632) - Avi
  • dj (579) - rubik

Current User Rankings

  1. Avi (1): AutoHotkey (632)
  2. rubik (1): dj (579)

PhiNotPi

Posted 2013-12-31T00:24:50.303

Reputation: 26 739

Should we include code to read the input? Or should we just assume input is passed as first argument of the function? – Shoe – 2013-12-31T10:32:55.363

@Jefffrey I suppose either one is fine. – PhiNotPi – 2013-12-31T15:05:30.627

.. the leaderboard ? :-) – Avi – 2014-01-02T07:49:12.237

1What are the input bounds? Not so much n, but primarily the bounds for the crossing over frequency (distance). Can we assume it will always be, say, less than 1000? – rubik – 2014-03-11T19:20:44.223

@PhiNotPi: can you provide a couple more test cases? I've almost finished mine and I'd like to test it more. – rubik – 2014-03-13T16:09:46.030

@PhiNotPi I added my solution, the leaderboard should be edited. – rubik – 2014-03-16T20:25:45.373

Sorry for the delay, I've been extremely busy lately. I've updated it. – PhiNotPi – 2014-03-16T20:31:44.293

Answers

2

AutoHotkey (632)

f(i){
o:={},f:={},n:=0
loop,parse,i,`n
{
a:=A_LoopField
if A_index!=1
{
@:=Substr(a,1,1),#:=Substr(a,3,1),n+=($:=Substr(a,5))
if !IsObject(o[@])
o[@]:={}
if !IsObject(o[#])
o[#]:={}
o[@][#]:=o[#][@]:=$
}
}
f[n+1]:=@,f[@]:=n+1,a:=""
while !a
{
a:=0
for k,v in o
{
if !f[k]
{
c1:=c2:=s:=0
for k1,v1 in v
{
if f[k1]
if s
{
if (r1==f[k1]-v1)or(r1==f[k1]+v1)
c1:=r1
else r1:=c1:=""
if (r2==f[k1]-v1)or(r2==f[k1]+v1)
c2:=r2
else r2:=c2:=""
}
else
c1:=r1:=f[k1]+v1,c2:=r2:=f[k1]-v1,s:=1
}
if c1
f[c1]:=k,f[k]:=c1,a:=1
else if c2
f[c2]:=k,f[k]:=c2,a:=1
}
} 
}
loop % 2*n+1
{
v:=f[A_index]
if v
z:=1
r.=z?(!v?".":v):v
}
return Rtrim(r,".")
}

The code can be more shortened by renaming all vars to 1 character .. It should then be about 610 characters.

Test Cases

v := "
(7,A,B,G,Q,U
B,Q,4
A,B,10
G,U,13
Q,U,10
A,G,9
G,Q,3
A,Q,6 
)"

msgbox % f(v)
msgbox % f("3,P,H,I`nP,H,3`nH,I,1`nP,I,4")

Avi

Posted 2013-12-31T00:24:50.303

Reputation: 261

1

Python 311

import sys,random
d=sys.stdin.readlines()
u=[]
r=g=0
m={}
l=d[0].split()[1:]
for a in l:m[a]=g;g+=1
for v in d[1:]:i=v.split();u+=[i];r+=int(i[2])
j=len(l)
y=range(j)
while any(abs(y[m[t]]-y[m[w]])!=int(p) for t,w,p in u):y=random.sample(range(r),j)
o=["."]*r
for a in m:o[y[m[a]]]=a
print "".join(o).strip(".")

My first code-golf :D

(Im not sure with the counting, i just postet it online in a character count)

The idea of the algorithm is pretty bad, but it is short. Try randomly all positions for the Symbols until they satisfy all constraints. The Input is with whitespace for example

3 P H I
P H 3
H I 1
P I 4

Hit after that CTRL+D in the console to end the reading.

Here is the original code which still uses ',' as delimiter.

import sys, random
#data = sys.stdin.readlines()
data = [
"3,P,H,I",
"P,H,3",
"H,I,1",
"P,I,4"
]
container = []
max_range = 0
map = {}
map_counter = 0

line_split = data[0].split(',')[1:]
count = len(line_split) # Number of genes
for symbol in line_split:
    map[symbol] = map_counter
    map_counter += 1

for line in data[1:]:
    line_split = line.split(',')
    container.append(line.split(','))
    max_range += int(line_split[2])

restart = True
while restart == True:
    positions = random.sample(range(max_range), count) # Since this loop will take like forever, but some day it will produce the correct positions
    restart = False
    for symbol1, symbol2, distance in container:
        if abs(positions[map[symbol1]] - positions[map[symbol2]]) != int(distance):
            restart = True
            break

output = ["."] * max_range
for symbol in map:
    output[positions[map[symbol]]] = symbol
print "".join(output).strip(".") # Strip . to make it more pretty

nvidia

Posted 2013-12-31T00:24:50.303

Reputation: 327

0

dg - 717 579 bytes

A Python one is incoming.

import '/sys'
w,o=list,tuple
p=g a b m->
 b in g=>a,b=b,a
 i,l,k=g.index a,w$g,w$g
 l!!(i+m),k!!(i-m)=b,b
 g!!(i+m)=='.'=>yield$o$l
 g!!(i-m)=='.'=>yield$o$k
g=t->
 d=sorted key:(i->snd i)$map((a,b,i)->((a,b),int i))$filter fst$map(i->i.split ',')$t.split '\n'
 (a,b),i=d.pop!
 g=w$('.',)*i*4
 g!!i,g!!(i+i)=a,b
 s=set'$o g
 while d=>
  d.sort key:((k,v)->set k&(set$fst$w s))
  n,(a,b),i=set! :+d.pop!
  for r in s=>
   if(a in r and b in r=>i==abs(r.index a-r.index b)=>n.add r)(1=>n.update$p r a b i)
   s = n
 '\n'.join$map(l->(''.join l).strip '.')s
print$g sys.stdin.read!

Examples:

$ echo """P,H,3
H,I,1
P,I,4""" | dg dna.dg
P..HI
$ echo """B,Q,4
A,B,10
G,U,13              
Q,U,10
A,G,9
G,Q,3
A,Q,6""" | dg dna.dg
BG..Q.....A...U

rubik

Posted 2013-12-31T00:24:50.303

Reputation: 222

0

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <malloc.h>

struct Gene
{
    char a1 , a2 ;
    int d ;
};
typedef struct Gene gene ;

struct Set
{
    int appr_id ;
    char CMN_char ;
};
typedef struct Set set ;

gene *stack;
int cp_id1 , cp_id2 , N=0 , cid , *used , n ;
char ucmn_char , *cmp1 , *cmp2 , *base , ep[15] ;                       
set ap_set ;


void randomize(void)
{   int i;
    Set temp;
    for(i=0;i<(n-1);i++)
    {
        temp=stack[i];
        stack[i]=stack[i+1];
        stack[i+1]=temp;
    }

    return;

}
void populate_ep ( char ucmn_char )
{
    int i;
    for ( i=0 ; ep[i] != '\0' ; i ++ );
        ep[ i ] = ucmn_char ;
}

set find_appr ( void )
{
    int i , j ;
    set s ;
    for ( i = 0 ; i < n ; i++ )
    {
        if ( used[ i ] == 1 )
            continue ;
        else
        {
            for ( j = 0 ; ep[ j ] != '\0' ; j++ )
            {
                if ( ep[ j ] == stack[ i ].a1 || ep[ j ] == stack[ i ].a2 )
                {
                    s.appr_id = i ;
                    s.CMN_char = ep[ j ] ;
                    return s ;
                }
            }
        }
    }
}

void destroy ( int id )
{
    used[ id ] = 1 ;
}

int get_center_id ( char a )
{
    int i ;
    for ( i = 0 ; i < N * 2 ; i++ )
        if ( base[ i ] == a )
            return i ;
}

int get_comparer ( void )
{
    int i , j , k ;
    for ( i = 0 ; i < n ; i ++ )
    {
        if ( used[ i ] == 0 )
        for ( j = 0 ; ep[ j ] != '\0' ; j ++ )
            if ( stack[ i ].a1 == ep[ j ])
                for ( k = 0 ; k < 15 ; k ++ )
                    if ( stack[ i ].a2 == ep[ k ] )
                        return i ;
    }
    printf ( "\nWrong set of genes....\n" ) ;
    exit ( 0 ) ;
}

void compare_and_merge ( int cid, int cp_id1, int cp_id2 )
{
    int base_cp_id , i ;
    char temp = ( ucmn_char == stack[ cid ].a1 ) ? stack[ cid ].a2 : stack[ cid ].a1 ;
    for ( i = 0 ; i < N * 2 ; i ++ )
        if ( base[ i ] == temp )
            base_cp_id = i ;
    if ( stack[ cid ].d == ( sqrt ( pow ( ( cp_id1 - base_cp_id ) , 2 ) ) ) )
    {   
        base[ cp_id1 ] = cmp1[ cp_id1 ] ;
        return ;
    }
    else
    {
        base[ cp_id2 ] = cmp2[ cp_id2 ] ;
        return ;
    }
}

void show_stack ( void )
{
    int i ;
    printf ( "The gene sets you entered are: \n" ) ;
    printf ( "____________\n" ) ;
    for ( i = 0 ; i < n ; i ++ )
        if ( used[ i ] == 0 )
            printf ( "%c %c %d\n" , stack[i].a1, stack[i].a2, stack[i].d ) ;
    printf ( "____________\n" ) ;
}

int main ( void )
{
    printf ( "Enter number of gene sets: " ) ;
    scanf ( "%d" , &n ) ;
    stack = ( gene* ) calloc ( n , sizeof ( gene ) ) ;
    used = ( int* ) calloc ( n , sizeof ( int ) ) ;
    int i ;
    N = 0 ;
    for ( i = 0 ; i < n ; i ++ )
    {
        char y[ 2 ] ;
        scanf ( "%s" , y ) ;
        stack[ i ].a1 = y[ 0 ] ;
        scanf ( "%s" , y ) ;
        stack[ i ].a2 = y[ 0 ] ;
        scanf ( "%d" , &stack[ i ].d ) ;
        N += stack[ i ].d ;
        used[ i ] = 0 ;
        fflush ( stdin ) ;
    }   
    randomize();
    show_stack ( ) ;
    int ff ;
    strcpy ( ep , " " ) ;
    cmp1 = ( char* ) calloc ( N * 2 , sizeof ( char ) ) ;
    cmp2 = ( char* ) calloc ( N * 2 , sizeof ( char ) ) ;
    base = ( char* ) calloc ( N * 2 , sizeof ( char ) ) ;
    for ( i = 0 ; i < N * 2 ; i ++ )
        base[ i ] = cmp1[ i ] = cmp2[ i ] = '=' ;
    base[ N ] = stack[ 0 ].a1 ;
    base[ N + stack[ 0 ].d ] = stack[ 0 ].a2 ;
    destroy ( 0 ) ;
    ep[ 0 ] = stack[ 0 ].a1 ;
    ep[ 1 ] = stack[ 0 ].a2 ;
    for ( ff = 0 ; ff < n / 2  ; ff ++ )
    {
        ap_set = find_appr ( ) ;
        cmp1[ get_center_id ( ap_set.CMN_char ) ] = ap_set.CMN_char ;
        cmp2[ get_center_id ( ap_set.CMN_char ) ] = ap_set.CMN_char ;
        ucmn_char = ( stack[ ap_set.appr_id ].a1 == ap_set.CMN_char ) ? stack[ ap_set.appr_id ].a2 : stack[ ap_set.appr_id ].a1;
        cmp1[ cp_id1 = get_center_id ( ap_set.CMN_char ) + stack[ ap_set.appr_id ].d ] = ucmn_char ;
        cmp2[ cp_id2 = get_center_id ( ap_set.CMN_char ) - stack[ ap_set.appr_id ].d ] = ucmn_char ;
        populate_ep ( ucmn_char ) ;
        destroy ( ap_set.appr_id ) ;
        cid = get_comparer ( ) ;
        compare_and_merge ( cid , cp_id1 , cp_id2 ) ;
        destroy ( cid ) ;
    }
    int start , end ;
    for ( i = 0 ; i < N * 2 ; i ++ )
        if ( base[ i ] != '=' )
        {
            start = i ;
            break ;
        }
    for ( i = N * 2 - 1 ; i >= 0 ; i -- )
        if ( base[ i ] != '=' )
        {
            end = i ;
            break ;
        }
        for ( i = start ; i <= end ; i ++ )
            printf( "%c" , base[ i ] ) ;
    printf( "\n\n" ) ;
}

Ankit Kumar

Posted 2013-12-31T00:24:50.303

Reputation: 1

3Welcome to PPCG! This is code golf, so please show some effort to solve the problem in the minimal amount of code. For a start you could remove all unnecessary whitespace and use single-letter variable, struct and function names. Please also include the language and the total byte-count at the top of your answer. – Martin Ender – 2015-02-06T13:53:04.207