Write an Incident tokeniser

24

5

Background

Incident is a fairly unusual programming language, in that its list of tokens is not predetermined, but rather inferred from the input. As such, tokenising an Incident program can be fairly difficult, especially if you want to do so efficiently. This task is about doing that yourself.

The task

Your program will be given a string as input. Here's the algorithm that Incident uses to tokenise it:

  1. Identify all strings that occur as a substring of the input in exactly three ways (i.e. there are exactly three occurrences of that string within the input).
  2. Discard any of these strings that are a substring of another such string (e.g. for input ababab, the only remaining string would be ab, not a or b, because a and b are both substrings of ab).
  3. Discard any strings that overlap within the input. (For example, aaaa contains exactly three copies of aa, but these copies overlap at the second and third characters, so would be discarded. Likewise, in abababa, there are three copies of ab and three copies of ba, but the second to sixth characters are each at the overlap of an ab and a ba, so both ab and ba would be discarded).
  4. Any strings that remain at this point are the tokens used by the program. Tokenise the original input into a sequence of these tokens (due to the discarding in the previous step, there will only be one way to do it). Any characters in the input that aren't part of any token are treated as comments and discarded.

Your program has to take a string as input, and return the corresponding tokenisation of the string (a list of tokens, each of which are expressed as strings) as output. Additionally, this has to be done at least moderately efficiently; specifically, the program has to run in quadratic time ("O(n²)") or better. (Incidentally, it's almost certainly possible to go faster than quadratic, but this is not , so feel free to use the tersest algorithm you can find that fits within the complexity bounds.)

Clarifications

  • Although Incident programs can in theory contain any of the 256 octets, it's acceptable for the purpose of this challenge for your program to handle only inputs formed out of printable ASCII (including space), plus newline and tab. (All known Incident programs restrict themselves to this subset). Note that space/newline/tab aren't special and can appear in the middle of tokens; Incident treats all 256 octets as opaque.
  • The definition of "quadratic time" is "if the size of the input is doubled, the program will run slower by no more than a constant plus a factor of 4", i.e. if t(x) is the maximum time your program takes to process an input of size x, then there must be some constant k such that t(2 x) < 4 t(x) + k for all x. Bear in mind that comparing strings takes time proportional to the length of the strings.
  • Your program should theoretically be able to handle input programs of any length if run in a (possibly hypothetical) variant of your language that has unlimited memory and uses unbounded integers (it's OK if the program fails to achieve this goal when run in practice due to the language's integers or memory actually being finitely large). You may assume (for the purpose of calculating the complexity) that integers that are no greater than the length of the input can be compared in constant time (although bear in mind that if you use larger values, e.g. due to converting the input into a single integer, they will take a length of time to compare proportional to the number of digits they have).
  • You can use any algorithm that fits within the complexity bounds, even if it doesn't follow the same steps as the algorithm posted above, so long as it produces the same results.
  • This puzzle is about tokenising the input, not really about formatting the output. If the most natural way to output a list in your language involves an ambiguous format (e.g. newline-separated when the strings contain literal newlines, or without delimiters between the strings), don't worry about the fact that the output ends up ambiguous (so long as the list is actually constructed). You may want to make a second version of your submission that produces unambiguous output, to aid in testing, but the original version is the version that counts for the scoring.

Test case

For the following input string:

aaabcbcbcdefdfefedghijghighjkllkklmmmmonono-nonppqpq-pqprsrsrstststuvuvu

your program should produce the following output list:

a a a bc bc bc d e f d f e f e d gh gh gh k l l k k l pq pq pq u u u

Victory condition

This is , so the shortest valid (i.e. correct input/output behaviour and sufficiently fast to execute) program, measured in bytes, wins.

user62131

Posted 2017-01-20T17:18:44.900

Reputation:

For people who can see deleted posts: the Sandbox post was here.

– None – 2017-01-20T17:24:44.530

16

Howe many languages have you created? ... Wait, 35?!

– Luis Mendo – 2017-01-20T17:31:26.947

Answers

14

C (gcc), 324 bytes

The function f takes a null-terminated string and prints the tokens to stdout. All newlines can be removed from the code below.

f(char*s){
int n=strlen(s),b=0,z[n-~n],F[n+1],u,a,x=0,l,m,*t=z+n;
int K(i){~m&&s[i]^s[a+m]?m=t[m],K(i):++m;}
for(;b<2*n;){
for(a=b++%n,m=l=-1;a+l<n;K(a+l))t[++l]=m;
for(l=0;l<n;++F[m])K(l++),F[l]=z[a]*=b>n?m^z[a]||~(m=t[z[l-m]]):0;
for(printf("%.*s",z[a],s+a);n/b*l&&a+l>x;l--)F[l]^3?F[t[l]]+=F[l]:(a<x?z[u]=0:(z[u=a]=l),x=a+l);
}
}

This older 376-byte version is a little easier to read; the explanation below applies to it.

*t,m;
char*p;
K(c){for(;~m&&c^p[m];)m=t[m];++m;}
k(i){for(*t=m=-1;p[i];t[++i]=m)K(p[i]);m=0;}
f(char*s){
int n=strlen(s),z[n-~n],F[n+1],u,*Z=z,a=0,x=0,l;
for(t=z+n;a<n;a++){
p=s+a;
for(k(l=z[a]=0);l<n;++F[m])K(s[l++]),F[l]=0;
for(;l&&a+l>x;l--)F[l]^3?F[t[l]]+=F[l]:(a<x?z[u]=0:(z[u=a]=l),x=a+l);
}
for(p=s;*p;printf("%.*s",*Z++,p++))
for(k(x=0);x<n;m==*Z?*Z*=!!z[x-m],m=t[m]:0)
K(s[x++]);
}

k(0) generates the table t for pattern p for the Knuth–Morris–Pratt algorithm. K(c) process the next character c of the search string and updates m, the length of the largest prefix of p which can be found ending at the most recently processed character.

In the first for loop, for each index a in the string, we count the number of times each possible value of m occurs when searching in the entire string for the substring beginning at a. Then we look for the largest l such that the length-l substring starting at a occurred exactly 3 times. If it's short enough to be entirely contained by a string found for a previous a, we ignore it. If it overlaps, we delete the previous string from z, the array recording which tokens will be kept. Otherwise, its length is stored in z.

Then, we use KMP again to search the string for the tokens recorded in z. If one of them is found at a location with a 0 entry in z, we know that this token was deleted due to overlap. If the token was not deleted, it is printed.

feersum

Posted 2017-01-20T17:18:44.900

Reputation: 29 566

1What is the time complexity of this? Has to be O(n^2) or faster. And why is there !! at !!z[x-m]? – Yytsi – 2017-01-22T13:28:28.137

2@TuukkaX It's exactly O(n^2). *Z is the length of the next token which needs to become 0 if any of the other occurrences of the token have the value 0 at their index in the array, or keep the same value otherwise (in that case !!z[x-m] should be 1. – feersum – 2017-01-22T18:13:44.330

Alright. But I still don't understand why the !! is there. !!x should still be x, or does it invoke a trick I don't know of? – Yytsi – 2017-01-23T09:20:07.980

@TuukkaX Well, !!x makes x a boolean representing its "truthiness". So, !!1 == true and !!0 == false. I don't know C specifically, but that's how it usually goes – Conor O'Brien – 2017-01-23T12:16:00.233

7

JavaScript, 878 867 842 825 775 752 717 712 704 673 664 650 641 bytes

Thanks to @Kritixi Lithos for helping golf the code
Thanks to @User2428118 for golfing off 14 bytes

(Will not work in IE7) (Newlines should be entered as "\n" and tab as "\t" in the input string, any unicode characters should be entered as \u####)

w=>{for(a=[],b=[],c=[],d=[],f=[],e=[],k=0;k<(g=w.length);a[k++]=h)for(b[R='push']([]),h=[d[k]=f[k]=j=i=0];i++<g-k;){while(j&&w[k+i]!=w[k+j])j=h[j-1];w[k+i]==w[k+j]&&j++,h[R](j)}for(k=0;k<g;k++)for(j=i=0;i<g;i++)if(w[i]!=w[k+j]){while(j&&w[i]!=w[k+j])j=a[k][j-1];w[i]==w[k+j]?i--:b[k][R](j)}else b[k][R](++j);for(k=0;k<g;c[k++]=l){for(h=f.map(Q=>i=l=0);i<g;)h[b[k][i++]]++;for(;i;)h[i]==3?(l=i,i=0):a[k][--i]?h[a[k][i]]+=h[i+1]:0}for(k=0;g>k++;)for(i=0;(S=c[k])&&i<g;)b[k][i++]==S?d[i-S]=S:0;for(k=0;k<g;k++)for(e[R](w.slice(k,(S=d[k])+k)),i=1;i<S;)f[k+i]=1,f[k]|=S<d[k+i]+i++;f.map((X,i)=>(P=e[i],X?e=e.map(Y=>P==Y?"":Y):0));return e.join``}

Try it Online

Explanation of how it works and ungolfed code

First, the program generates Knuth Morris Pratt arrays for every possible substring;

for(index=0;index<word.length;index++){
  kmpArray=[0];
  j=0;
  for(i=1;i<word.length-index;i++){
    while(j&&word.charAt(index+i)!=word.charAt(index+j)){
      j=kmpArray[j-1];
    }
    if(word.charAt(index+i)==word.charAt(index+j)){
      j++;
    }
    kmpArray.push(j);
  }
  kmpArrays.push(kmpArray);
}

Next, the program finds the maximum matching lengths at every single index in the word with each substring. (this is O(n^2) time)

for(index=0;index<word.length;index++){
  j=0;
  matchLength=[];
  for(i=0;i<word.length;i++){
    if(word.charAt(i)!=word.charAt(index+j)){
      while(j&&word.charAt(i)!=word.charAt(index+j)){
        j=kmpArrays[index][j-1];
      }
      if(word.charAt(i)==word.charAt(index+j)){
        i--;
      }else{
        matchLength.push(j);
      }
    }else{
      j++;
      matchLength.push(j);
      if(j==kmpArrays[index].length){
        j=kmpArrays[index][j-1];
      }
    }
  }
  matchLengths.push(matchLength);
}

The program uses this data to find the longest substrings that appear three times for each starting character in the string.

for(index=0;index<word.length;index++){
  counts=[]
  max=0;
  for(i=0;i<=word.length;i++){
    counts.push(0);
  }
  for(i=0;i<word.length;i++){
    counts[matchLengths[index][i]]++;
  }
  for(i=word.length-1;i>0;i--){
    if(counts[i]==3){
      max=i;
      break;
    }
    if(kmpArrays[index][i-1]){ //if this value has a smaller value it could be as well
      counts[kmpArrays[index][i]]+=counts[i-1];
    }
  }
  maxLengths.push(max);
}

The program uses this data to eliminate all substrings that do not appear exactly three times and all substrings of the longest valid substrings.

for(index=0;index<word.length;index++){
  if(!maxLengths[index])
    continue;
  for(i=0;i<word.length;i++){
    if(matchLengths[index][i]==maxLengths[index]){
      tokens[i-maxLengths[index]+1]=maxLengths[index];
    }
  }
}

Next, the program sets all overlapping or partial substrings as to be removed.

for(index=0;index<word.length;index++){
  sStrs.push(word.substring(index,tokens[index]+index));
  for(i=1;i<tokens[index];i++){
    toRemove[index+i]=1;
    if(tokens[index]<tokens[index+i]+i){
      toRemove[index]=1;
    }
  }
}

For each of the to be removed values, all equivalent substrings are removed as well.

for(index=0;index<word.length;index++){
  if(toRemove[index]){
    removal=sStrs[index];
    for(i=0;i<3;i++){
      indxOf=sStrs.indexOf(removal);
      sStrs[indxOf]="";
      toRemove[indxOf]=0;
    }
  }
}

Finally, the program joins the array of substrings together and outputs it.

fəˈnɛtɪk

Posted 2017-01-20T17:18:44.900

Reputation: 4 166

1You have some while and if blocks that have only 1 statement inside of them. You can remove the braces {} around those statement. For example, if(word.charAt(index+i)==word.charAt(index+j)){j++;} can become if(word.charAt(index+i)==word.charAt(index+j))j++; – user41805 – 2017-01-31T16:55:15.793

I used &&s to replace if statements, I moved statements around in loops so that they would end up with one statement under them so that I could remove braces. I used ternaries to replace some if-statements. I moved stuff around and ended up at 946 bytes. If you don't understand something I did, feel free to ask me :)

– user41805 – 2017-01-31T17:41:25.877

At this point my main problem with golfing it is trying to understand what I wrote there. I also don't know what optimisations I can do for golfing in javascript. – fəˈnɛtɪk – 2017-01-31T18:20:06.373

Do you want to discuss this in a separate chatroom? – user41805 – 2017-01-31T18:22:42.933

Saved 14 bytes. – user2428118 – 2017-02-21T12:17:55.800