Key length of Vigenère cipher

6

The Vigenère cipher is a substitution cipher where the encryption of each letter of the plaintext depends on a different character in a keyword. This stops you using simple methods of frequency analysis to guess the correct substitution. However, the keyword is repeated if it is shorter than the plaintext. This is a weakness. If the length of the keyword is known (n, say), however, and is much shorter than the message, then you can decompose the message into separate sections by taking every nth letter. You can then use frequency analysis on these.

The Kasiski examination is a method to work out the most likely key lengths. It does this by considering identical substrings which happen to be a multiple of the key length apart. They are therefore encrypted in the same way. Turning this round, identical substrings in the ciphertext are likely to be a multiple of the key length apart. If you identify these distances, then the key length is likely to be a common factor of all these distances.

Your job is to write a function or program that can guess the most likely length of the keyword, given some encrypted text. To be precise, the function or program (just called function below) will satisfy the following:

  • Inputs consist of a string representing the ciphertext and a positive integer (len, for the purposes of this spec) representing the shortest common substring to look at.
  • The ciphertext will consist of upper case A-Z characters only. Any length is permitted.
  • You can assume valid inputs without checking.
  • The inputs may be passed in, or read from stdin.
  • The function will identify all pairs of identical substrings in the ciphertext which have a length of at least len.
  • Substrings in each pair may not overlap but substrings in different pairs can overlap.
  • The function will return, or write to stdout, the highest common factor of the distances between the starts of these identical pairs.
  • If there are no identical substrings then the function will return 0.

For example:

"AAAAAAAAAAAAAAAAAA", n returns 1 for any n from 1 to 8, 9 for n = 9 (as the only possible non-overlapping substrings of length 9 happen to be the same) and 0 otherwise.

"ABCDEFGHIJKL", n returns 0 for any n

"ABCDEABCDE", n returns 5 for n<=5 and 0 otherwise

"ABCABABCAB", 2 returns 1 because "AB" pairs are separated by 2,3,5 & 8

"ABCABABCAB", 3 returns 5

"VHVSSPQUCEMRVBVBBBVHVSURQGIBDUGRNICJQUCERVUAXSSR", 4 returns 6 because the repeating "VHVS" are 18 characters apart, while "QUCE" is separated by 30 characters

Naturally, lowest byte count wins and avoid standard loopholes.

Alchymist

Posted 2015-02-13T11:26:04.587

Reputation: 544

I followed this up to the examples, but the first one appears to completely contradict the spec. Surely "AAAAAAAAAAAAAAAAAA", n should return 1 for any n from 1 to 8, because that's the highest common factor of n and n+1? – Peter Taylor – 2015-02-13T12:20:25.087

@PeterTaylor Quite correct. I've edited this in and improved the formatting. – Alchymist – 2015-02-13T14:06:45.797

Answers

2

CJam, 67 bytes

Definitely still room for improvement, but I'll post what I have so far.

Try it online

0r:T,,ri>{T,L(-,_m*{~:XT>L<T@LX+e>,{0t}/X>\#_W>\2$?{\1$_!+%}h;}/}fL

Runer112

Posted 2015-02-13T11:26:04.587

Reputation: 3 636

I don't know how CJam works but it gives the right answers. Can you explain it? – Alchymist – 2015-02-14T23:42:30.770

1

JavaScript (ES6) 117

F=(s,n,G=(a,b)=>b?G(b,a%b):a)=>(i=>{
  for(f=i;s[(j=i+n)-1];i++)
    for(;p=~s.indexOf(s.substr(i,n),j++);)
      f=G(~p-i,f)
})(0)|f

Ungolfed

F=(s,n)=>
{
  var G = (a,b) => b ? G(b,a%b) : a; // GCD function
  var p,w,i,j, f = 0; // f starting divisor
  for(i = 0; s[i + n - 1]; ++i) // scan s for each substring of lenght n
  {
     w = s.substr(i,n);  
     for (j=i+n; (p = s.indexOf(w,j)) != -1; ++j) // find all occurrencies of substring in the rest f the string
        f = G(p - i, f); // p-i is the distance between pairs
  }
  return f
}

Test In Firefox/FireBug console

;["AAAAAAAAAAAAAAAAAA","ABCDEFGHIJKL","ABCDEABCDE","ABCABABCAB",
  "VHVSSPQUCEMRVBVBBBVHVSURQGIBDUGRNICJQUCERVUAXSSR"]
.forEach(s=>{
  for(o=[],i=1;i<=10;i++) o.push(i+':'+F(s,i));
  console.log(s+' '+o)
})

Output

AAAAAAAAAAAAAAAAAA 1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:9,10:0
ABCDEFGHIJKL 1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0
ABCDEABCDE 1:5,2:5,3:5,4:5,5:5,6:0,7:0,8:0,9:0,10:0
ABCABABCAB 1:1,2:1,3:5,4:5,5:5,6:0,7:0,8:0,9:0,10:0
VHVSSPQUCEMRVBVBBBVHVSURQGIBDUGRNICJQUCERVUAXSSR 1:1,2:1,3:6,4:6,5:0,6:0,7:0,8:0,9:0,10:0

edc65

Posted 2015-02-13T11:26:04.587

Reputation: 31 086