Naming non-cyclic carbon chains

31

5

(I'm not a chemist! I might be wrong at some stuff, I'm writing what I've learned in high-school)

Carbon atoms have a special attribute: They can bind to 4 other atoms (which is not that special) and they stay stable even in long chains, which is very unique. Because they can be chained and combined in a lot of different ways, we need some kind of naming convention to name them.

This is the smallest molecule we can make:

CH4

It's called methane. It consists of only one carbon and 4 hydrogen atoms. The next one is:

CH3 - CH3

This is called ethane. It's made up of 2 carbon and 6 hydrogen atoms.

The next 2 are:

CH3 - CH2 - CH3
CH3 - CH2 - CH2 - CH3

They are propane and butane. The problems start with the chains with 4 carbon atoms, as it can be built in 2 different ways. One is shown above and the other is:

CH3 - CH - CH3
       |
      CH3

This is obviously not the same as the other. The number of atoms and the bindings are different. Of course just folding bindings and rotating the molecule won't make it a different one! So this:

CH3 - CH2 - CH2 - CH3

And this:

CH3 - CH2
       |
CH3 - CH2

Are the same (If you are into graph theory, you may say that if there is isomorphism between 2 molecules; they are the same). From now on I won't write out hydrogen atoms as they are not essential for this challenge.

As you hate organic chemistry and you have a lot of different carbon atoms to name, you decide to write a program that does this for you. You don't have too much space on your hard-drive tho, so the program must be as small as possible.

The challenge

Write a program that takes in a multi-line text as input (a carbon chain) and outputs the name of the carbon chain. The input will only contain spaces, uppercase 'c' characters and '|' and '-' which represents a binding. The input chain will never contain cycles! Example:

Input:

C-C-C-C-C-C
  |   |
  C   C-C

Output:

4-ethyl-2-methylhexane

Any output is acceptable as long as it's human-readable and essentially the same (so you can use different separators for example if you wish).

The naming convention:

(See: IUPAC rules)

  1. Identify the longest carbon chain. This chain is called the parent chain.

  2. Identify all of the substituents (groups appending from the parent chain).

  3. Number the carbons of the parent chain from the end that gives the substituents the lowest numbers. When comparing a series of numbers, the series that is the "lowest" is the one which contains the lowest number at the occasion of the first difference. If two or more side chains are in equivalent positions, assign the lowest number to the one which will come first in the name.

  4. If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.).

  5. If there are two or more different substituents they are listed in alphabetical order using the base name (ignore the prefixes). The only prefix which is used when putting the substituents in alphabetical order is iso as in isopropyl or isobutyl. The prefixes sec- and tert- are not used in determining alphabetical order except when compared with each other.

  6. If chains of equal length are competing for selection as the parent chain, then the choice goes in series to:

    • the chain which has the greatest number of side chains.
    • the chain whose substituents have the lowest- numbers.
    • the chain having the greatest number of carbon atoms in the smallest side chain.
    • the chain having the least branched side chains (a graph having the least number of leaves).

For the parent chain, the naming is:

Number of carbons   Name
1                  methane
2                  ethane
3                  propane
4                  butane
5                  pentane
6                  hexane
7                  heptane
8                  octane
9                  nonane
10                 decane
11                 undecane
12                 dodecane

No chains will be longer than 12, so this will be enough. For the sub-chains it's the same but instead of 'ane' at the end we have 'yl'.

You can assume that the Cs are in the odd columns and the bindings (| and - characters) are 1 long between carbon atoms.

Test cases:

Input:

C-C-C-C

Output:

butane

Input:

C-C-C
  |
  C

Output:

2-methylpropane

Input:

C-C-C-C
  |
  C
  |
  C-C

Output:

3-methylhexane

Input:

C-C-C-C-C
  |
  C
  |
  C

Output:

3-methylhexane

Input:

    C
    |
    C
    |
C-C-C-C
  |
  C-C-C
  |
  C-C

Output:

3,4-dimethyl-5-ethylheptane

Edit: Sorry for the wrong examples. I wasn't a good student :( . They should be fixed now.

Peter Lenkefi

Posted 2015-10-15T16:23:10.563

Reputation: 1 577

Comments are not for extended discussion; this conversation has been moved to chat.

– Dennis – 2017-06-09T07:22:38.093

2According to this rule, If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.)., shouldn't the last example be called 3,4-dimethyl-5-ethylheptane? (we're just starting organic chemistry, I might be wrong :P) – NieDzejkob – 2017-11-07T17:33:18.877

@NieDzejkob I would agree, as there are two methyl chains. – Jonathan Frech – 2017-11-07T19:17:15.693

@NieDzejkob Indeed, fixed. – Peter Lenkefi – 2017-11-07T19:24:35.417

Answers

18

Python 2, 1876 1871 1870 1859 1846 1830 1826 1900 1932 1913 1847 1833 1635 1613 1596 bytes

s=input().split('\n')
W=enumerate
J=len
Y=sorted
l=J(s[0])
s=''.join(s)
S=set
M=max
A=min
p=map
f=lambda k:[(x/l,x%l)for x,V in W(s)if V==k]
g=lambda x,i,h=lambda x,i,j:x[:i]+(x[i]+j,)+x[i+1:]:[(h(q,i,-1),h(q,i,1))for q in x]
v=f('C');e=g(f('-'),1)+g(f('|'),0)
E=[V for V in v if sum(e,()).count(V)==1]
o=lambda v:[E[~E.index(v)]for E in e if v in E]
T=lambda a:lambda b:z((a,b))
Z=lambda a:p(T(a[0]),a[1])
n=lambda R:'mepbphhondudetrueeeco nothotnxptn ddh p t t'[R-1::12].strip()+(R>9)*'ec'
G=lambda K:[H[i]for i,V in W(K)if V==A(K)]
q=lambda x:[`k[0]`for k in H if k[1]==x]
B='-'.join
def z(n,c=[]):k=[x for x in S(o(n[0]))-S(c)];p=[z((j,n[1]),c+k)for j in k];return 1-~-(n[0]==n[1])*(p and A(p)or J(v))
C=[(a,b)for a in E for b in E]
a=p(z,C)
s=[(k,[E for E in v if~-z((k[0],E))+z((k[1],E))==z((k[0],k[1]))])for k in[C[x]for x,V in W(a)if V==M(a)]]
H=[]
R=0
for k,_ in s:R=M(J(_),R);_.sort(key=T(k[0]));a=sum([list(S(o(k))-S(_))for k in _],[]);H+=zip(p(lambda a:Z((a,_)).index(2),a),p(Z,[(O,[x for x in S(v)-S(_)if z((x,O),_)<J(v)])for O in a])),
X=n(R)
U=any(H)
if U:H=G([[h[0]for h in Q]for Q in H if J(Q)==M(p(J,H))]);K=[[J(Q[1])for Q in j]for j in H];H=[H[i]for i,V in W(K)if A(V)==A(sum(K,[]))];K=[J([Q[1]for Q in j if J(S(Q[1]))-J(Q[1])])for j in H];H=[[p[0]+1,n(M(p[1]))+[['isopropyl','butyl-tert','butyl-sec','isobutyl'][J(p[1])+p[1].count(3)-3],'yl'][Y(p[1])==range(1,1+M(p[1]))]]for p in G(K)[0]]
print(U and B([','.join(q(x))+'-'+'dttphhondireeeecoe itnxptnc  rtataaa  aa a '[J(q(x))-2::9].strip()+B(x.split('-')[::-1])for x in Y(list(S(zip(*H)[1])))])+X or[X,'meth']['t'==X])+'ane'

Try it online!

Well there you go. Certainly not the golfiest but it works (I hope) :D

Took me about 10 hours, maybe? Probably my longest golf in both size and time, and that's saying something considering I used to use Java D:

Logic:

  1. Convert from the ASCII representation to graph representation with each carbon atom as a node and each bond as an edge represented in adjacency form.
  2. Find all leaves; that is, nodes with only one bond. The longest chain is guaranteed to be from one of these to another.
  3. Find the dyadic product of the leaves; that is, all pairs of edge nodes. Then, take the length of all of these chains.
  4. For each chain, find its subchains.
  5. Do stuff to pick the right chain. If there are ties, then it doesn't really matter. Fun fact: There will always be a tie because each chain is counted twice, once in reverse.
  6. Print it properly.

EDIT: Fixed bug where it used to cause errors if there were no side chains.

EDIT: Thanks to MD XF for noticing a few extra spaces (indentation for the for loop).

EDIT: I completely forgot about the prefix for having the same substituent.

NOTE: Each line needs to be the same width for this to work. That is, trailing spaces are required.

Fun fact: most cyclic hydrocarbons will be determined as "methane"

Fun fact: If you do C-C-...-C-C with 13 Cs, it will give ethane, then thane for 14, ropane for 15, etc.

-79 bytes thanks to Jonathan Frech
-119 bytes thanks to NieDzejkob
-17 bytes thanks to ovs

HyperNeutrino

Posted 2015-10-15T16:23:10.563

Reputation: 26 575