(I'm not a chemist! I might be wrong about some things; I'm writing what I learned in high school.)
Carbon atoms have a special property: they can bond to 4 other atoms (which is not that special), and they stay stable even in long chains, which is unusual. Because they can be chained and combined in many different ways, we need a naming convention for the resulting molecules.
This is the smallest molecule we can make:
CH4
It's called methane. It consists of only one carbon and 4 hydrogen atoms. The next one is:
CH3 - CH3
This is called ethane. It's made up of 2 carbon and 6 hydrogen atoms.
The next 2 are:
CH3 - CH2 - CH3
CH3 - CH2 - CH2 - CH3
They are propane and butane. The problems start with chains of 4 carbon atoms, as they can be built in 2 different ways. One is shown above and the other is:
CH3 - CH - CH3
|
CH3
This is obviously not the same as the other. The number of atoms is the same, but the bonds are arranged differently. Of course, just folding bonds and rotating the molecule won't make it a different one! So this:
CH3 - CH2 - CH2 - CH3
And this:
CH3 - CH2
|
CH3 - CH2
are the same. (If you are into graph theory, you may say that two molecules are the same if there is an isomorphism between them.) From now on I won't write out hydrogen atoms, as they are not essential for this challenge.
As you hate organic chemistry and you have a lot of different carbon chains to name, you decide to write a program that does this for you. You don't have much space on your hard drive, though, so the program must be as small as possible.
The challenge
Write a program that takes a multi-line text as input (a carbon chain) and outputs the name of the carbon chain. The input will only contain spaces, uppercase 'C' characters, and '|' and '-' characters, which represent bonds. The input chain will never contain cycles! Example:
Input:
C-C-C-C-C-C
| |
C C-C
Output:
4-ethyl-2-methylhexane
Any output is acceptable as long as it's human-readable and essentially the same (so you can use different separators for example if you wish).
The naming convention:
(See: IUPAC rules)
Identify the longest carbon chain. This chain is called the parent chain.
Identify all of the substituents (groups branching off the parent chain).
Number the carbons of the parent chain from the end that gives the substituents the lowest numbers. When comparing a series of numbers, the series that is the "lowest" is the one which contains the lowest number at the first point of difference. If two or more side chains are in equivalent positions, assign the lowest number to the one which will come first in the name.
If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.).
If there are two or more different substituents they are listed in alphabetical order using the base name (ignore the prefixes). The only prefix which is used when putting the substituents in alphabetical order is iso as in isopropyl or isobutyl. The prefixes sec- and tert- are not used in determining alphabetical order except when compared with each other.
If chains of equal length are competing for selection as the parent chain, then the choice goes in series to:
- the chain which has the greatest number of side chains.
- the chain whose substituents have the lowest numbers.
- the chain having the greatest number of carbon atoms in the smallest side chain.
- the chain having the least branched side chains (a graph having the least number of leaves).
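The first rule (find the longest chain) can be sketched in Python as follows. This is only a minimal illustration, assuming the molecule is already available as an adjacency mapping `graph` from each carbon to the set of its neighbours; the names `farthest` and `longest_chain` are made up for this example. Because the input never contains cycles, two breadth-first searches are enough: the node farthest from any start is one end of a longest path, and the node farthest from that end is the other. The sketch returns just one longest chain and does not apply the tie-breaking rules listed above.

```python
from collections import deque

def farthest(graph, start):
    """Breadth-first search; return (a farthest node, parent links)."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        # Nodes leave the queue in order of distance, so the last one
        # popped is at maximum distance from `start`.
        last = queue.popleft()
        for nxt in graph[last]:
            if nxt not in parent:
                parent[nxt] = last
                queue.append(nxt)
    return last, parent

def longest_chain(graph):
    """Return one longest path (as a list of nodes) in an acyclic graph."""
    end_a, _ = farthest(graph, next(iter(graph)))
    end_b, parent = farthest(graph, end_a)
    path = []
    node = end_b
    while node is not None:  # walk the parent links back to end_a
        path.append(node)
        node = parent[node]
    return path
```

For a four-carbon chain with one extra carbon attached to the second atom, `longest_chain` returns a path of 4 carbons; the remaining carbon is then a substituent.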
For the parent chain, the naming is:
Number of carbons Name
1 methane
2 ethane
3 propane
4 butane
5 pentane
6 hexane
7 heptane
8 octane
9 nonane
10 decane
11 undecane
12 dodecane
No chains will be longer than 12, so this will be enough. For the sub-chains it's the same but instead of 'ane' at the end we have 'yl'.
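Putting the table and the numbering rules together, a name could be assembled roughly like this in Python. It is a sketch under stated assumptions, not a full solution: the function name `alkane_name` and its input format (the parent chain length plus a list of `(locant, carbons)` pairs numbered from one end) are invented for illustration, only unbranched substituents are handled, and the multiplier list beyond `tetra` is my addition.

```python
STEMS = ["meth", "eth", "prop", "but", "pent", "hex",
         "hept", "oct", "non", "dec", "undec", "dodec"]
# Index = how many times the substituent occurs (entries past "tetra"
# are an addition of this sketch, not taken from the rules above).
MULTIPLIERS = ["", "", "di", "tri", "tetra", "penta", "hexa", "hepta", "octa"]

def alkane_name(parent_length, substituents):
    """Name an alkane from its parent length and [(locant, carbons), ...]."""
    def key(subs):
        # Primary: the sorted locants, compared at the first difference
        # (Python list comparison does exactly that).
        # Tie-break: locants in the order the names will be listed.
        by_name = sorted(subs, key=lambda s: (STEMS[s[1] - 1], s[0]))
        return sorted(pos for pos, _ in subs), [pos for pos, _ in by_name]

    # Try numbering from the other end too and keep the lower series.
    flipped = [(parent_length + 1 - pos, size) for pos, size in substituents]
    chosen = min(substituents, flipped, key=key)

    # Collect the locants of each substituent name, e.g. {"methyl": [3, 4]}.
    groups = {}
    for pos, size in sorted(chosen):
        groups.setdefault(STEMS[size - 1] + "yl", []).append(pos)

    # Alphabetical by base name; the di/tri prefixes are ignored for sorting.
    parts = [",".join(map(str, locs)) + "-" + MULTIPLIERS[len(locs)] + name
             for name, locs in sorted(groups.items())]
    return "-".join(parts) + STEMS[parent_length - 1] + "ane"
```

With a 6-carbon parent chain, an ethyl group on carbon 3 and a methyl group on carbon 5 (counted from one end), `alkane_name(6, [(3, 2), (5, 1)])` renumbers from the other end and gives `4-ethyl-2-methylhexane`, matching the example in the challenge.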
You can assume that the C's are in the odd columns and that the bonds (| and - characters) are 1 character long between carbon atoms.
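Given that layout guarantee, reading the drawing into a graph is straightforward. The Python sketch below is one possible approach, not a golfed answer: the helper name `parse_chain` is hypothetical, and it assumes the input keeps its leading spaces so that columns line up. Each `C` is identified by its (row, column) position; a `-` directly to its right links it to the carbon two columns over, and a `|` directly below links it to the carbon two rows down.

```python
def parse_chain(text):
    """Return {(row, col): set of neighbouring (row, col)} for every 'C'."""
    rows = text.split("\n")

    def at(r, c):
        # Anything outside the drawing counts as a space.
        if 0 <= r < len(rows) and 0 <= c < len(rows[r]):
            return rows[r][c]
        return " "

    graph = {}
    for r, line in enumerate(rows):
        for c, ch in enumerate(line):
            if ch != "C":
                continue
            graph.setdefault((r, c), set())
            # Look right and down only; each bond is recorded in both
            # directions when it is found.
            for (br, bc), (nr, nc), bond in (((r, c + 1), (r, c + 2), "-"),
                                             ((r + 1, c), (r + 2, c), "|")):
                if at(br, bc) == bond and at(nr, nc) == "C":
                    graph[(r, c)].add((nr, nc))
                    graph.setdefault((nr, nc), set()).add((r, c))
    return graph
```

For the three-line input `"C-C-C\n  |\n  C"`, the result has four carbons, and the middle one at `(0, 2)` is bonded to the other three.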
Test cases:
Input:
C-C-C-C
Output:
butane
Input:
C-C-C
|
C
Output:
2-methylpropane
Input:
C-C-C-C
|
C
|
C-C
Output:
3-methylhexane
Input:
C-C-C-C-C
|
C
|
C
Output:
3-methylhexane
Input:
C
|
C
|
C-C-C-C
|
C-C-C
|
C-C
Output:
3,4-dimethyl-5-ethylheptane
Edit: Sorry for the wrong examples. I wasn't a good student :( . They should be fixed now.
Comments are not for extended discussion; this conversation has been moved to chat. – Dennis – 2017-06-09T07:22:38.093
According to this rule, "If the same substituent occurs more than once, the location of each point on which the substituent occurs is given. In addition, the number of times the substituent group occurs is indicated by a prefix (di, tri, tetra, etc.).", shouldn't the last example be called 3,4-dimethyl-5-ethylheptane? (we're just starting organic chemistry, I might be wrong :P) – NieDzejkob – 2017-11-07T17:33:18.877
@NieDzejkob I would agree, as there are two methyl chains. – Jonathan Frech – 2017-11-07T19:17:15.693
@NieDzejkob Indeed, fixed. – Peter Lenkefi – 2017-11-07T19:24:35.417