Decompose words into other words (e.g., “afterglow” = “aft” + “erg” + “low”)

Here's one for all you wordsmiths out there! Write a program or function which takes a list of words and produces a list of all possible concatenative decompositions for each word. For example:

(Note: This is only a small sampling for illustrative purposes. Actual output is far more voluminous.)

afterglow = after + glow
afterglow = aft + erg + low
alienation = a + lie + nation
alienation = a + lien + at + i + on
alienation = a + lien + at + ion
alienation = alien + at + i + on
alienation = alien + at + ion
archer = arc + her
assassinate = ass + as + sin + ate
assassinate = ass + ass + in + ate
assassinate = assassin + ate
backpedalled = back + pedal + led
backpedalled = back + pedalled
backpedalled = backpedal + led
goatskin = go + at + skin
goatskin = goat + skin
goatskin = goats + kin
hospitable = ho + spit + able
temporally = tempo + rally
windowed = win + do + wed
windowed = wind + owed
weatherproof = we + at + her + pro + of
yeasty = ye + a + sty

Ok, you get the idea. :-)

Rules

  • Use any programming language of your choosing. Shortest code by character count for each language wins, so there is one winner for each language used. The overall winner is simply the shortest code of all submissions.
  • The input list can be a text file, standard input, or any list structure your language provides (list, array, dictionary, set, etc.). The words can be English or any other natural language. (If the list is of English words, you'll want to ignore or pre-filter out single-letter items except for "a" and "i". Similarly, for other languages, you'll want to ignore nonsensical items if they appear in the file.)
  • The output list can be a text file, standard output, or any list structure your language uses.
  • You can use any input dictionary you like, but you'll probably want to use one that provides sensible words rather than one that provides too many obscure, arcane, or obnubilated words. This is the file I used: The Corncob list of more than 58000 English words. (A minimal reference sketch of the task follows just after these rules.)
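For concreteness, here's a minimal, non-golfed Python sketch of the task (purely illustrative, not a competing entry; wordlist.txt is a placeholder for whatever dictionary file you pick, one word per line). Note that it also emits the trivial decomposition word = word whenever the whole word appears in the list.

def decompositions(word, dictionary):
    # Yield every way to split word into a sequence of dictionary words.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in dictionary:
            for rest in decompositions(word[i:], dictionary):
                yield [prefix] + rest

dictionary = set(open('wordlist.txt').read().split())    # placeholder filename, one word per line
for word in ['afterglow', 'alienation', 'archer']:
    for parts in decompositions(word, dictionary):
        print(word + ' = ' + ' + '.join(parts))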

Questions

This challenge is primarily about writing the code to accomplish the task, but it's also fun to comb through the results...

  1. What subwords occur most commonly?
  2. What word can be decomposed into the greatest number of subwords?
  3. What word can be decomposed the most different ways?
  4. What words are composed of the largest subwords?
  5. What decompositions did you find that were the most amusing?

Todd Lehman

Posted 2014-09-01T23:51:55.380

Reputation: 1 723

@Geobits — Ah, thank you! I missed two decompositions of alienation when I cut & pasted that. Fixed now. In terms of the others, the list above is only a small sampling. My test program generated tens of thousands of answers when given the Corncob list. – Todd Lehman – 2014-09-02T00:56:06.043

1"What subwords occurmost commonly?"

Gonna throw a wild guess out there and say 'a' might be near the top. – Sellyme – 2014-09-02T04:20:27.850

@SebastianLamerichs — I dunno... Might be, might not be. :) – Todd Lehman – 2014-09-02T04:57:08.497

@ToddLehman that sentence contains exactly 0 subwords, so 'a' is still equal first :P – Sellyme – 2014-09-02T04:58:35.387

@SebastianLamerichs if you were referring to Todd's response to you, "dunno" can be split into "dun" + "no". ;) – i alarmed alien – 2014-09-02T12:14:29.007

@ialarmedalien Is "dunno" in the word list, though? It doesn't count if it's not a word itself. – Sellyme – 2014-09-02T12:26:28.377

@SebastianLamerichs I dunno. Depends on the word list used! – i alarmed alien – 2014-09-02T13:01:13.667

What if I use Chinese as input text… – Ray – 2014-09-02T15:02:27.437

@Ray — I suppose that might result in very few decompositions... unless you figure out a way to decompose the glyphs! – Todd Lehman – 2014-09-02T17:09:15.760

Answers

Python 186

a=open(raw_input()).read().split()
def W(r):
 if r:
    for i in range(1,len(r)+1):
     if r[:i]in a:
        for w in W(r[i:]):yield[r[:i]]+w
 else:yield[]
while 1:
 for f in W(raw_input()):print f

Not particularly efficient, but actually not terribly slow. It just naively checks whether sub-words are in the Corncob dictionary and recursively finds as many decompositions as it can (I suppose it's possible, though I think unlikely, that Python does some clever optimizations). Of course, this dictionary is pretty extensive, and you could try one that doesn't include various abbreviations and acronyms (which lead to things like bedridden: be dr id den). Also, the linked dictionary did not seem to have 'A' or 'I' listed as words, so I manually added them in.

Edit:

Now the first input is the filename of the dictionary to use, and every additional one is a word.
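For anyone decoding the golf, here's a readable Python 2 equivalent of the program above (identifier names are mine; behaviour should match: the first line of input names the dictionary file, and every further line is a word to decompose):

words = open(raw_input()).read().split()        # first input line: dictionary filename

def decompose(rest):
    # Yield lists of dictionary words that concatenate to rest.
    if rest:
        for i in range(1, len(rest) + 1):
            if rest[:i] in words:               # linear scan over the list; set(words) would be faster
                for tail in decompose(rest[i:]):
                    yield [rest[:i]] + tail
    else:
        yield []                                # the empty string decomposes into the empty list

while 1:                                        # loops until raw_input() hits end of input (EOFError)
    for parts in decompose(raw_input()):
        print parts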

KSab

Posted 2014-09-01T23:51:55.380

Reputation: 5 984

It's worth stating that this is Python 2; the code does not run in Python 3, where print f would have to be print(f) – None – 2014-09-02T10:21:47.477

Also, how do I run this? echo archer|python2 filename.py outputs an EOFError for the last line – None – 2014-09-02T10:25:01.853

Some things you could still change (I haven't tested these, but I'm pretty sure it would work): for f in W(raw_input()):print f => ''.join(W(raw_input()) ; a=open('c').read().split('\n') => a=open('c').readlines() – ɐɔıʇǝɥʇuʎs – 2014-09-02T14:04:11.560

@ɐɔıʇǝɥʇuʎs Your first one would work but readlines keeps the newline characters at the end of lines which is why I did it like I did. – KSab – 2014-09-02T18:11:19.907

@ɐɔıʇǝɥʇuʎs Oh actually it seems join requires all the elements to be strings and I can't get it in a form smaller than what I already have. – KSab – 2014-09-02T18:35:52.240

Cobra - 160

sig Z(x,y)
def f(b)
    c as Z=do(x,y)
        if x.length<1,print y
        for z in File.readLines('t'),if z==x[:e=z.length].toLower,c(x[e:],y+' '+z)
    for t in b,c(t,'[t]:')

This is a function (sort-of two functions) that takes a List<of String>* and prints the strings containing the possible sub-word arrangements for each string in the argument list.

* the type is actually List<of dynamic?>, but providing anything other than List<of String> will probably break it.
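Since Cobra is fairly obscure, here is a rough Python rendering of what the code above does (my own transliteration, so treat it as approximate; like the Cobra, it expects the dictionary in a file named t, one lowercase word per line):

def f(words_to_decompose):
    # The Cobra code re-reads the file on every call of the closure; hoisted here for clarity.
    dictionary = open('t').read().split('\n')

    def c(remaining, found_so_far):
        if len(remaining) < 1:
            print(found_so_far)                 # nothing left to match: print the accumulated line
        for entry in dictionary:
            # does this dictionary entry match the start of what's left? (skip blank lines)
            if entry and remaining[:len(entry)].lower() == entry:
                c(remaining[len(entry):], found_so_far + ' ' + entry)

    for word in words_to_decompose:
        c(word, word + ':')                     # Cobra's '[t]:' interpolates the word into the string

f(['afterglow', 'goatskin'])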

Οurous

Posted 2014-09-01T23:51:55.380

Reputation: 7 916

Scala, 132 129

Edit: slightly shorter as a loop reading from stdin than a function

while(true)print(readLine.:\(Seq(List(""))){(c,l)=>l.flatMap{m=>Seq(c+""::m,c+m.head::m.tail)}}filter(_.forall(args contains _)))

run as

scala decompose.scala aft after erg glow low

(or use a longer word list :) )

Original:

def f(s:Seq[String])=s.map{_.:\(Seq(List(""))){(c,l)=>l.flatMap{m=>Seq(c+""::m,c+m.head::m.tail)}}filter(_.forall(args contains _))}

Function from Seq[String] to Seq[Seq[List[String]]]. Takes the dictionary as command-line arguments.

Ungolfed:

def decompose(wordList: Seq[String]) =
  wordList.map{ word =>                              // for each word
    word.foldRight(Seq(List(""))){ (char, accum) =>  // for each character
      accum.flatMap{list =>
        Seq(char+""::list,char+list.head::list.tail) // add it both as a new list and
      }                                              // to the start of the first list
    }.filter(_.forall(args contains _))              // filter out lists w/ invalid words
  }

The approach is to generate all the possible lists of substrings and filter out those that contain a string not in the dictionary. Note that some of the generated lists contain an extra empty string; I assume the empty string will not be in the dictionary (there's no way to pass it in on the command line anyway).

paradigmsort

Posted 2014-09-01T23:51:55.380

Reputation: 66