24
5
Background
Incident is a fairly unusual programming language, in that its list of tokens is not predetermined, but rather inferred from the input. As such, tokenising an Incident program can be fairly difficult, especially if you want to do so efficiently. This task is about doing that yourself.
The task
Your program will be given a string as input. Here's the algorithm that Incident uses to tokenise it:
- Identify all strings that occur as a substring of the input in exactly three ways (i.e. there are exactly three occurrences of that string within the input).
- Discard any of these strings that are a substring of another such string (e.g. for input
ababab
, the only remaining string would beab
, nota
orb
, becausea
andb
are both substrings ofab
). - Discard any strings that overlap within the input. (For example,
aaaa
contains exactly three copies ofaa
, but these copies overlap at the second and third characters, so would be discarded. Likewise, inabababa
, there are three copies ofab
and three copies ofba
, but the second to sixth characters are each at the overlap of anab
and aba
, so bothab
andba
would be discarded). - Any strings that remain at this point are the tokens used by the program. Tokenise the original input into a sequence of these tokens (due to the discarding in the previous step, there will only be one way to do it). Any characters in the input that aren't part of any token are treated as comments and discarded.
Your program has to take a string as input, and return the corresponding tokenisation of the string (a list of tokens, each of which are expressed as strings) as output. Additionally, this has to be done at least moderately efficiently; specifically, the program has to run in quadratic time ("O(n²)") or better. (Incidentally, it's almost certainly possible to go faster than quadratic, but this is not fastest-algorithm, so feel free to use the tersest algorithm you can find that fits within the complexity bounds.)
Clarifications
- Although Incident programs can in theory contain any of the 256 octets, it's acceptable for the purpose of this challenge for your program to handle only inputs formed out of printable ASCII (including space), plus newline and tab. (All known Incident programs restrict themselves to this subset). Note that space/newline/tab aren't special and can appear in the middle of tokens; Incident treats all 256 octets as opaque.
- The definition of "quadratic time" is "if the size of the input is doubled, the program will run slower by no more than a constant plus a factor of 4", i.e. if t(x) is the maximum time your program takes to process an input of size x, then there must be some constant k such that t(2 x) < 4 t(x) + k for all x. Bear in mind that comparing strings takes time proportional to the length of the strings.
- Your program should theoretically be able to handle input programs of any length if run in a (possibly hypothetical) variant of your language that has unlimited memory and uses unbounded integers (it's OK if the program fails to achieve this goal when run in practice due to the language's integers or memory actually being finitely large). You may assume (for the purpose of calculating the complexity) that integers that are no greater than the length of the input can be compared in constant time (although bear in mind that if you use larger values, e.g. due to converting the input into a single integer, they will take a length of time to compare proportional to the number of digits they have).
- You can use any algorithm that fits within the complexity bounds, even if it doesn't follow the same steps as the algorithm posted above, so long as it produces the same results.
- This puzzle is about tokenising the input, not really about formatting the output. If the most natural way to output a list in your language involves an ambiguous format (e.g. newline-separated when the strings contain literal newlines, or without delimiters between the strings), don't worry about the fact that the output ends up ambiguous (so long as the list is actually constructed). You may want to make a second version of your submission that produces unambiguous output, to aid in testing, but the original version is the version that counts for the scoring.
Test case
For the following input string:
aaabcbcbcdefdfefedghijghighjkllkklmmmmonono-nonppqpq-pqprsrsrstststuvuvu
your program should produce the following output list:
a a a bc bc bc d e f d f e f e d gh gh gh k l l k k l pq pq pq u u u
Victory condition
This is code-golf, so the shortest valid (i.e. correct input/output behaviour and sufficiently fast to execute) program, measured in bytes, wins.
For people who can see deleted posts: the Sandbox post was here.
– None – 2017-01-20T17:24:44.53016
Howe many languages have you created? ... Wait, 35?!
– Luis Mendo – 2017-01-20T17:31:26.947