(I'm not a chemist! I might be wrong about some of this; I'm writing what I learned in high school.)

Carbon atoms have a special property: they can bond to 4 other atoms (which is not that special), and they stay stable even in long chains, which is unusual. Because they can be chained and combined in many different ways, we need a naming convention for the resulting molecules.

This is the smallest molecule we can make:

CH4

It's called methane. It consists of only one carbon and 4 hydrogen atoms. The next one is:

CH3 - CH3

This is called ethane. It's made up of 2 carbon and 6 hydrogen atoms.

The next 2 are:

CH3 - CH2 - CH3
CH3 - CH2 - CH2 - CH3

They are propane and butane. The problems start with chains of 4 carbon atoms, as they can be built in 2 different ways. One is shown above and the other is:

CH3 - CH - CH3
       |
      CH3

This is obviously not the same as the other: the number of atoms is the same, but the bonds are different. Of course, just bending bonds and rotating the molecule won't make it a different one! So this:

CH3 - CH2 - CH2 - CH3

And this:

CH3 - CH2
       |
CH3 - CH2

Are the same (if you are into graph theory, you may say that two molecules are the same if there is an isomorphism between them). From now on I won't write out hydrogen atoms, as they are not essential for this challenge.
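To make the "same molecule" idea concrete, here is a minimal Python 3 sketch that compares carbon skeletons by a canonical string. The helper names `graph_from_bonds` and `canonical_form` are made up for this illustration, and the skeletons are typed in by hand as lists of bonds rather than parsed from a drawing. It roots the tree at every atom and keeps the smallest encoding, which is a simple (not the fastest) way to test tree isomorphism.

```python
def graph_from_bonds(bonds):
    """Build an adjacency dict {atom: set of neighbours} from (a, b) pairs."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def canonical_form(graph):
    """Return a string that is equal for two trees exactly when they are isomorphic."""
    def encode(node, parent):
        # Encode each subtree, then sort so that child order does not matter.
        children = sorted(encode(n, node) for n in graph[node] if n != parent)
        return "(" + "".join(children) + ")"

    # Try every atom as the root and keep the smallest encoding.
    # Quadratic, but the chains here have at most a dozen or so atoms.
    return min(encode(root, None) for root in graph)


# Straight butane, the "folded" drawing of butane, and the branched isomer.
straight = graph_from_bonds([(1, 2), (2, 3), (3, 4)])
folded = graph_from_bonds([("a", "b"), ("b", "d"), ("d", "c")])
branched = graph_from_bonds([(1, 2), (2, 3), (2, 4)])

print(canonical_form(straight) == canonical_form(folded))    # True
print(canonical_form(straight) == canonical_form(branched))  # False
```

The atom labels are arbitrary; only the bond structure affects the result.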

As you hate organic chemistry and you have a lot of different carbon chains to name, you decide to write a program that does this for you. You don't have much space on your hard drive though, so the program must be as small as possible.

The challenge

Write a program that takes a multi-line text as input (a carbon chain) and outputs the name of the carbon chain. The input will only contain spaces, uppercase 'C' characters, and '|' and '-' characters, which represent bonds. The input chain will never contain cycles! Example:

Input:

C-C-C-C-C-C
  |   |
  C   C-C

Output:

4-ethyl-2-methylhexane

Any output is acceptable as long as it's human-readable and essentially the same (so you can use different separators for example if you wish).

The naming convention:

(See: IUPAC rules)

Identify the longest carbon chain. This chain is called the parent chain.
Identify all of the substituents (groups attached to the parent chain).
Number the carbons of the parent chain from the end that gives the substituents the lowest numbers. When comparing a series of numbers, the series that is the "lowest" is the one which contains the lowest number at the first point of difference. If two or more side chains are in equivalent positions, assign the lowest number to the one which will come first in the name.
If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.).
If there are two or more different substituents they are listed in alphabetical order using the base name (ignore the prefixes). The only prefix which is used when putting the substituents in alphabetical order is iso as in isopropyl or isobutyl. The prefixes sec- and tert- are not used in determining alphabetical order except when compared with each other.
If chains of equal length are competing for selection as the parent chain, then the choice goes in series to:
- the chain which has the greatest number of side chains.
- the chain whose substituents have the lowest numbers.
- the chain having the greatest number of carbon atoms in the smallest side chain.
- the chain having the least branched side chains (a graph having the least number of leaves).
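The first steps of the rules above (find a longest chain, then number it from the better end) can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: the function names are invented here, the skeleton is given as a hand-written list of bonds with at least one bond, and the sketch picks just one longest chain, so it does not implement the tie-breaking rules for competing chains of equal length.

```python
from collections import deque


def graph_from_bonds(bonds):
    """Build an adjacency dict {atom: set of neighbours} from (a, b) pairs."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def farthest_path(graph, start):
    """Breadth-first search from start; return the path to one farthest atom."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()  # the last atom dequeued is a farthest one
        for nxt in sorted(graph[last]):
            if nxt not in parent:
                parent[nxt] = last
                queue.append(nxt)
    path = []
    while last is not None:
        path.append(last)
        last = parent[last]
    return path[::-1]


def longest_chain(graph):
    """Return one longest path in a tree (two BFS passes)."""
    start = next(iter(graph))
    end = farthest_path(graph, start)[-1]
    return farthest_path(graph, end)


def substituent_positions(graph, chain):
    """Locants of the side chains, numbered from whichever end is 'lowest'."""
    on_chain = set(chain)
    forward = [i for i, atom in enumerate(chain, 1)
               for nb in graph[atom] if nb not in on_chain]
    backward = sorted(len(chain) + 1 - i for i in forward)
    # Lists compare element by element, i.e. at the first point of difference.
    return min(forward, backward)


# Skeleton of the example from the challenge text: atoms 1..6 form the top
# row, atom 7 hangs off atom 2, and atoms 8-9 hang off atom 4.
example = graph_from_bonds([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
                            (2, 7), (4, 8), (8, 9)])
chain = longest_chain(example)
print(len(chain), substituent_positions(example, chain))  # 6 [2, 4]
```

The locants 2 and 4 agree with the numbers in 4-ethyl-2-methylhexane; deciding which substituent sits at which locant, and breaking ties between equally long chains, is left out of this sketch.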

For the parent chain, the naming is:

Number of carbons   Name
1                   methane
2                   ethane
3                   propane
4                   butane
5                   pentane
6                   hexane
7                   heptane
8                   octane
9                   nonane
10                  decane
11                  undecane
12                  dodecane

No chains will be longer than 12, so this will be enough. For the sub-chains it's the same but instead of 'ane' at the end we have 'yl'.
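As a rough illustration of how the table and the prefix rules fit together, here is a small Python 3 sketch that assembles a name once the parent length and the substituents are already known. The function name `format_name` and its input format (a list of `(position, carbon_count)` pairs) are assumptions made for this example, and it only handles unbranched side chains, so nothing like isopropyl or tert-butyl.

```python
# Stems from the table above: index 0 is one carbon, index 11 is twelve.
STEMS = ["meth", "eth", "prop", "but", "pent", "hex",
         "hept", "oct", "non", "dec", "undec", "dodec"]
COUNT_PREFIXES = {1: "", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def format_name(parent_length, substituents):
    """Build a name from the parent chain length and (position, size) pairs."""
    groups = {}
    for position, size in substituents:
        groups.setdefault(STEMS[size - 1] + "yl", []).append(position)

    parts = []
    # Sort by base name (ethyl before methyl); di/tri/... is added afterwards,
    # so it does not influence the alphabetical order.
    for name in sorted(groups):
        positions = sorted(groups[name])
        locants = ",".join(str(p) for p in positions)
        parts.append(f"{locants}-{COUNT_PREFIXES[len(positions)]}{name}")

    return "-".join(parts) + STEMS[parent_length - 1] + "ane"


print(format_name(4, []))                # butane
print(format_name(3, [(2, 1)]))          # 2-methylpropane
print(format_name(6, [(2, 1), (4, 2)]))  # 4-ethyl-2-methylhexane
```

Finding the parent chain and the positions is the hard part of the challenge; this helper only covers the final string assembly.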

You can assume that the Cs are in the odd columns and that the bonds (| and - characters) between carbon atoms are exactly 1 character long.
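Given those guarantees, reading the drawing into a graph is straightforward. The following ungolfed Python 3 sketch shows one possible way; `parse_chain` is a name invented for this illustration, and it relies on the stated assumption that every bond character sits directly between two Cs.

```python
def parse_chain(text):
    """Return {(row, col): set of neighbouring (row, col)} for every 'C'."""
    rows = text.split("\n")
    atoms = {(r, c) for r, line in enumerate(rows)
             for c, ch in enumerate(line) if ch == "C"}
    graph = {atom: set() for atom in atoms}

    for r, line in enumerate(rows):
        for c, ch in enumerate(line):
            if ch == "-":    # horizontal bond: atoms to the left and right
                a, b = (r, c - 1), (r, c + 1)
            elif ch == "|":  # vertical bond: atoms above and below
                a, b = (r - 1, c), (r + 1, c)
            else:
                continue
            if a in atoms and b in atoms:
                graph[a].add(b)
                graph[b].add(a)
    return graph


drawing = "C-C-C-C-C-C\n  |   |\n  C   C-C"
graph = parse_chain(drawing)
bonds = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), bonds)  # 9 8
```

Nine atoms and eight bonds is what you would expect from an acyclic chain: a tree always has one bond fewer than it has atoms.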

Test cases:

Input:

C-C-C-C

Output:

butane

Input:

C-C-C
  |
  C

Output:

2-methylpropane

Input:

C-C-C-C
  |
  C
  |
  C-C

Output:

3-methylhexane

Input:

C-C-C-C-C
  |
  C
  |
  C

Output:

3-methylhexane

Input:

    C
    |
    C
    |
C-C-C-C
  |
  C-C-C
  |
  C-C

Output:

3,4-dimethyl-5-ethylheptane

Edit: Sorry for the wrong examples. I wasn't a good student :( . They should be fixed now.

Peter Lenkefi

Posted 2015-10-15T16:23:10.563

Reputation: 1 577

According to this rule, "If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.)", shouldn't the last example be called 3,4-dimethyl-5-ethylheptane? (we're just starting organic chemistry, I might be wrong :P) – NieDzejkob – 2017-11-07T17:33:18.877

@NieDzejkob I would agree, as there are two methyl chains. – Jonathan Frech – 2017-11-07T19:17:15.693

@NieDzejkob Indeed, fixed. – Peter Lenkefi – 2017-11-07T19:24:35.417

Naming non-cyclic carbon chains

Answers

Python 2, 1876 1871 1870 1859 1846 1830 1826 1900 1932 1913 1847 1833 1635 1613 1596 bytes