Decryption by pattern analysis

You are given an encrypted string, encrypted using a very simple substitution cipher.

Problem

You do not know what the cipher is but you do know the ciphertext is English and that the most frequent letters in English are etaoinshrdlucmfwypvbgkqjxz in that order. The only allowed characters are uppercase letters and spaces. You can do basic analysis - starting from single letters, but you can migrate onto more complex multi-letter analysis - for example, U almost always follows Q, and only certain letters can come twice in a row.

Examples

clear : SUBMARINE TO ATTACK THE DOVER WAREHOUSE AND PORT ON TUESDAY SUNRISE
cipher: ZOQ DUPAEYSRYDSSDXVYSHEYNRBEUYLDUEHROZEYDANYKRUSYRAYSOEZNDMYZOAUPZE

clear : THE QUICK BROWN FOX BEING QUITE FAST JUMPED OVER THE LAZY DOG QUITE NICELY
cipher: TNAEPDHIGEMZQJLEVQBEMAHL EPDHTAEVXWTEODYUASEQKAZETNAERXFCESQ EPDHTAELHIARC

clear : BUFFALO BUFFALO BUFFALO BUFFALO BUFFALO BUFFALO BUFFALO
cipher: HV  WRPDHV  WRPDHV  WRPDHV  WRPDHV  WRPDHV  WRPDHV  WRP

Challenges

See if you can decrypt the text in each of these ciphers:

SVNXIFCXYCFSXKVVZXIHXHERDXEIYRAKXZCOFSWHCZXHERDXBNRHCXZR RONQHXORWECFHCUH
SOFPTGFIFBOKJPHLBFPKHZUGLSOJPLIPKBPKHZUGLSOJPMOLEOPWFSFGJLBFIPMOLEOPXULBSIPLBP KBPBPWLIJFBILUBKHPGKISFG
TMBWFYAQFAZYCUOYJOBOHATMCYNIAOQW Q JAXOYCOCYCHAACOCYCAHGOVYLAOEGOTMBWFYAOBFF ACOBHOKBZYKOYCHAUWBHAXOQW XITHJOV WOXWYLYCU
FTRMKRGVRFMHSZVRWHRSFMFLMBNGKMGTHGBRSMKROKLSHSZMHKMMMMMRVVLVMPRKKOZRMFVDSGOFRW

I have the substitution matrices and cleartext for each, but I will only reveal them if it becomes too difficult or someone doesn't figure it out.

The solution which can decrypt the most messages successfully is the winner. If two solutions are equally good, they will be decided by vote counts.

Thomas O

Posted 2011-01-30T23:58:29.030

Reputation: 3 044

3What defines »most elegant«? I think that's the same thing that Chris objected in 99 bottles already. It's a subjective criterion that's quite hard to judge. – Joey – 2011-01-31T00:07:49.703

@Joey Most upvotes? Let the community decide. – Thomas O – 2011-01-31T00:14:43.953

Re "most upvotes": I'm unhappy to see this become a popularity contest post, not least because the post is otherwise excellent; see http://meta.codegolf.stackexchange.com/questions/110/a-modest-proposal-the-popularity-contest-tag for my thoughts on the whole matter.

– Chris Jester-Young – 2011-01-31T01:12:34.530

2What does "elegant" mean here? Best big-O performance? – gnibbler – 2011-01-31T03:31:05.253

@Chris, what do you suggest? – Thomas O – 2011-01-31T08:11:23.727

@Thomas: I think mootinator's response to my post is a good middle ground. If nothing else, it will help you articulate clearly what you are expecting. Personally, I prefer to use a cut-and-dried winning criterion, simply because that has less potential for dispute, but, the fact that mootinator responded the way he had says that reasonable people can disagree on the level of black-and-whiteness required. Which I can live with---just be upfront with what you want. – Chris Jester-Young – 2011-01-31T08:36:16.277

@Thomas: In other words, if you want "elegant", at least define clearly what "elegant" means, so we're not dealing with blind men and an elephant.

– Chris Jester-Young – 2011-01-31T08:40:16.880

@chris, I've changed it to a challenge then. – Thomas O – 2011-01-31T09:04:06.720

Can we use a dictionary? – st0le – 2011-01-31T12:52:59.517

@st0le you can use anything you need. – Thomas O – 2011-01-31T13:15:32.707

@Thomas: Thanks for the revision; ability to decrypt most messages seems to be easier to measure. – Chris Jester-Young – 2011-01-31T14:23:38.417

The last cipher has 5 M's in a row, but there are no words that I could find with more than 4 repeating letters. Is this an error in the cipher? – Kevin Brown – 2011-02-01T22:25:16.757

1@Bass5098, nope. It's just a difficult ciphertext which has been tainted to make it more resilient to frequency analysis. – Thomas O – 2011-02-01T22:42:25.547

Answers

Python

I've figured out all the secret phrases, but I won't post them here. Run the code if you care.

The code works by picking a space character, enumerating all possible substitutions for each word, then searching for compatible substitutions. It also allows for some out-of-lexicon words to deal with misspellings in the cleartext :)

I used a large lexicon (~500K words) from http://wordlist.sourceforge.net/.

import sys,re

# get input
message = sys.argv[1]

# read in lexicon of words
# download scowl version 7.1
# mk-list english 95 > wordlist
lexicon = set()
roman_only = re.compile('^[A-Z]*$')
for word in open('wordlist').read().upper().split():
  word=word.replace("'",'')
  if roman_only.match(word): lexicon.add(word)

histogram={}
for c in message: histogram[c]=0
for c in message: histogram[c]+=1
frequency_order = map(lambda x:x[1], sorted([(f,c) for c,f in histogram.items()])[::-1])

# returns true if the two maps are compatible.
# they are compatible if the mappings agree wherever they are defined,
# and no two different args map to the same value.
def mergeable_maps(map1, map2):
  agreements = 0
  for c in map1:
    if c in map2:
      if map1[c] != map2[c]: return False
      agreements += 1
  return len(set(map1.values() + map2.values())) == len(map1) + len(map2) - agreements

def merge_maps(map1, map2):
  m = {}
  for (c,d) in map1.items(): m[c]=d
  for (c,d) in map2.items(): m[c]=d
  return m

def search(map, word_maps, outside_lexicon_allowance, words_outside_lexicon):
  cleartext = ''.join(map[x] if x in map else '?' for x in message)
  #print 'trying', cleartext

  # pick a word to try next
  best_word = None
  best_score = 1e9
  for (word,subs) in word_maps.items():
    if word in words_outside_lexicon: continue
    compatible_subs=0
    for sub in subs:
      if mergeable_maps(map, sub): compatible_subs += 1
    unassigned_chars = 0
    for c in word:
      if c not in map: unassigned_chars += 1  #TODO: duplicates?
    if compatible_subs == 0: score = 0
    elif unassigned_chars == 0: score = 1e9
    else: score = 1.0 * compatible_subs / unassigned_chars   # TODO: tweak?
    if score < best_score:
      best_score = score
      best_word = word
  if not best_word:  # no words with unset characters, except possibly the outside lexicon ones
    print cleartext,[''.join(map[x] if x in map else '?' for x in word) for word in words_outside_lexicon]
    return True

  # use all compatible maps for the chosen word
  r = False
  for sub in word_maps[best_word]:
    if not mergeable_maps(map, sub): continue
    r |= search(merge_maps(map, sub), word_maps, outside_lexicon_allowance, words_outside_lexicon)

  # maybe this word is outside our lexicon
  if outside_lexicon_allowance > 0:
    r |= search(map, word_maps, outside_lexicon_allowance - 1, words_outside_lexicon + [best_word])
  return r

for outside_lexicon_allowance in xrange(3):
  # assign the space character first
  for space in frequency_order:
    words = [w for w in message.split(space) if w != '']
    if reduce(lambda x,y:x|y, [len(w)>20 for w in words]): continue  # obviously bad spaces

    # find all valid substitution maps for each word
    word_maps={}
    for word in words:
      n = len(word)
      maps = []
      for c in lexicon:
        if len(c) != n: continue
        m = {}
        ok = 1
        for i in xrange(n):
          if word[i] in m:                      # repeat letter
            if m[word[i]] != c[i]: ok=0; break  # repeat letters map to same thing
          elif c[i] in m.values(): ok=0; break  # different letters map to different things
          else: m[word[i]]=c[i]
        if ok: maps.append(m);
      word_maps[word]=maps

    # look for a solution
    if search({space:' '}, word_maps, outside_lexicon_allowance, []): sys.exit(0)

print 'I give up.'

Keith Randall

Posted 2011-01-30T23:58:29.030

Reputation: 19 865

PHP (Incomplete)

This is an incomplete PHP solution which works using the letter frequency information in the question plus a dictionary of words matched with regular expressions based on the most reliable letters in the given word.

At present the dictionary is quite small but with the appropriate expansion I anticipate that results would improve. I have considered the possibility of partial matches but with the current dictionary this results in degradation rather than improvement to the results.

Even with the current, small dictionary I think I can quite safely say what the fourth message encodes.

#!/usr/bin/php
<?php

    if($argv[1]) {

        $cipher = $argv[1];

        // Dictionary
        $words = explode("/", "the/to/on/and/in/is/secret/message");
        $guess = explode("/", "..e/t./o./a../i./.s/.e..et/.ess..e");

        $az = str_split("_etaoinshrdlucmfwypvbgkqjxz");

        // Build table
        for($i=0; $i<strlen($cipher); $i++) {
            $table[$cipher{$i}]++;
        }
        arsort($table);

        // Do default guesses
        $result = str_replace("_", " ", str_replace(array_keys($table), $az, $cipher));

        // Apply dictionary
        $cw = count($words);
        for($i=0; $i<$cw*2; $i++) {
            $tokens = explode(" ", $result);
            foreach($tokens as $t) {
                if(preg_match("/^" . $guess[$i%$cw] . "$/", $t)) {
                    $result = deenc($words[$i%$cw], $t, $result);
                    echo $t . ' -> ' . $words[$i%$cw] . "\n";
                    break;
                }
            }
        }

        // Show best guess
        echo $result . "\n";

    } else {

        echo "Usage: " . $argv[0] . " [cipher text]\n";

    }

    // Quick (non-destructive) replace tool
    function deenc($word, $enc, $string) {
        $string = str_replace(str_split($enc), str_split(strtoupper($word)), $string);
        $string = str_replace(str_split($word), str_split($enc), $string);
        return strtolower($string);
    }

?>

jtjacques

Posted 2011-01-30T23:58:29.030

Reputation: 1 055

Try using /usr/share/dict/words if you're on a system that has it. – Keith Randall – 2011-01-31T03:59:35.963