Compounding English


A compound word is a word that contains two or more other words in it. We can do better than that, though: we need you to create one (nonsensical) word that contains every word.

However, we want this word to be as short as possible. We can use overlapping letters to achieve this.

For example, if your word list was ["cat", "atom", "a"], you would want to return "catom".
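The overlap merge can be sketched in a few lines (a hypothetical helper, not part of any answer below):

```python
def merge(a, b):
    # Append b to a, reusing the longest run where a's tail equals b's head.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b

print(merge("cat", "atom"))  # catom ("a" is already contained in it)
```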

Input/Output

Your program will need to take a list of words as input, and return a compound word as output.

The word list you will be using is the top 10000 words in English, according to Google (If this list turns out to be too easy, I may change it to a longer one). For reference, simply appending each word gives you a score of 65888.

Your score is the number of letters in your final word, lower is better. Tie breaker goes to the first poster.

Nathan Merrill

Posted 2016-08-01T18:31:51.030

Reputation: 13 591

Related. Related. – Martin Ender – 2016-08-01T18:35:02.013

Is there any limit as to how long the program can run? – Loovjo – 2016-08-01T18:40:07.857

@Loovjo no, but if it ends up that bruteforcing is fast enough to run, then I'll change the word list to make it longer. – Nathan Merrill – 2016-08-01T18:41:39.667

@muddyfish correct, edited. – Nathan Merrill – 2016-08-01T18:42:05.560

So, for instance, the input list ["hello","ok","kid"] should return hellokid? – R. Kap – 2016-08-01T19:22:16.623

I hope we're not expected to put the compounded word directly in our answer... – Patrick Roberts – 2016-08-01T19:23:54.070

@R.Kap yes, that string is valid, but any string that contains the three words is valid (e.g. kidokhello ) – Nathan Merrill – 2016-08-01T19:27:14.280

@PatrickRoberts If it fits in your answer, props to you :) A pastebin/gist link would be great, but isn't required. – Nathan Merrill – 2016-08-01T19:28:46.090

For consistency, you should include the actual length of all the words concatenated together if overlapping letters were not deleted. The length is 65888. – R. Kap – 2016-08-01T20:19:39.940

Hmm, who knows a good asymmetric travelling salesman heuristic? – Dave – 2016-08-01T20:48:19.063

I assume that there is no wrapping? A word cannot start at the end of the compound word and finish at its beginning? – trichoplax – 2016-08-01T21:54:16.263

Does the code need to be deterministic (PRNG with consistent seed)? – trichoplax – 2016-08-01T22:12:45.563

No wrapping, and yes to deterministic. – Nathan Merrill – 2016-08-01T22:14:37.380

Relevant Youtube video: Portmantout – justhalf – 2016-08-03T06:29:08.147

Given the size of the other programs compared to my PowerShell one, kinda wishing that code length was a factor in scoring, especially now that an optimal length word has been found. ;-) – AdmBorkBork – 2016-08-03T14:01:49.063

I personally think credit should go to the first one who found the optimal solution, not just somebody who can golf their solution. :) – Nathan Merrill – 2016-08-03T14:04:35.390

@NathanMerrill Oh, no doubt. I'm just tooting my own horn for giggles. – AdmBorkBork – 2016-08-03T14:09:22.027

Is the optimal solution 28272 (mentioned in the bounty) or 38272 (in the title of the accepted answer) chars? – TecBrat – 2016-08-04T16:06:19.593

Answers


C++, final word length: 38272

(optimised version took about 20 minutes)

#include <algorithm> // std::min, std::sort
#include <iostream>
#include <string>
#include <vector>

std::size_t calcOverlap(const std::string &a, const std::string &b, std::size_t limit, std::size_t minimal) {
    std::size_t la = a.size();
    for(std::size_t p = std::min(std::min(la, b.size()), limit + 1); -- p > minimal; ) {
        if(a.compare(la - p, p, b, 0, p) == 0) {
            return p;
        }
    }
    return 0;
}

int main() {
    std::vector<std::string> words;

    // Load all words from input
    while(true) {
        std::string word;
        std::getline(std::cin, word);
        if(word.empty()) {
            break;
        }
        words.push_back(word);
    }

    std::cerr
        << "Input word count: " << words.size() << std::endl;

    // Remove all fully subsumed words

    for(auto p = words.begin(); p != words.end(); ) {
        bool subsumed = false;
        for(auto i = words.begin(); i != words.end(); ++ i) {
            if(i == p) {
                continue;
            }
            if(i->find(*p) != std::string::npos) {
                subsumed = true;
                break;
            }
        }
        if(subsumed) {
            p = words.erase(p);
        } else {
            ++ p;
        }
    }

    std::cerr
        << "After subsuming checks: " << words.size()
        << std::endl;

    // Sort words longest-to-shortest (not necessary but doesn't hurt. Makes finding maxlen a tiny bit easier)
    std::sort(words.begin(), words.end(), [](const std::string &a, const std::string &b) {
        return a.size() > b.size();
    });

    std::size_t maxlen = words.front().size();

    // Repeatedly combine most-compatible words until there is only one left
    std::size_t bestPossible = maxlen - 1;
    while(words.size() > 1) {
        auto bestA = words.begin();
        auto bestB = -- words.end();
        std::size_t bestOverlap = 0;
        for(auto p = ++ words.begin(), e = words.end(); p != e; ++ p) {
            if(p->size() - 1 <= bestOverlap) {
                continue;
            }
            for(auto q = words.begin(); q != p; ++ q) {
                std::size_t overlap = calcOverlap(*p, *q, bestPossible, bestOverlap);
                if(overlap > bestOverlap) {
                    bestA = p;
                    bestB = q;
                    bestOverlap = overlap;
                }
                overlap = calcOverlap(*q, *p, bestPossible, bestOverlap);
                if(overlap > bestOverlap) {
                    bestA = q;
                    bestB = p;
                    bestOverlap = overlap;
                }
            }
            if(bestOverlap == bestPossible) {
                break;
            }
        }
        std::string newStr = std::move(*bestA);
        newStr.append(*bestB, bestOverlap, std::string::npos);

        if(bestA == -- words.end()) {
            words.pop_back();
            *bestB = std::move(words.back());
            words.pop_back();
        } else {
            *bestB = std::move(words.back());
            words.pop_back();
            *bestA = std::move(words.back());
            words.pop_back();
        }

        // Remove any words which are now in the result
        for(auto p = words.begin(); p != words.end(); ) {
            if(newStr.find(*p) != std::string::npos) {
                std::cerr << "Now subsumes: " << *p << std::endl;
                p = words.erase(p);
            } else {
                ++ p;
            }
        }

        std::cerr
            << "Words remaining: " << (words.size() + 1)
            << " Latest combination: (" << bestOverlap << ") " << newStr
            << std::endl;

        words.push_back(std::move(newStr));
        bestPossible = bestOverlap; // Merging existing words will never make longer merges possible
    }

    std::string result = words.front();

    std::cout
        << result
        << std::endl;
    std::cerr
        << "Word size: " << result.size()
        << std::endl;
    return 0;
}

Verification bash one-liner:

cat commonwords.txt | while read p; do grep "$p" merged.txt >/dev/null || echo "Not found: $p"; done
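The same check can be sketched in Python, with toy data standing in for the actual files:

```python
def find_missing(words, merged):
    # Return every input word that does not occur in the merged string.
    return [w for w in words if w not in merged]

print(find_missing(["cat", "atom", "a"], "catom"))  # []
```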

It also produced some pretty cool in-progress words. Here are some of my favourites:

  • polyesterday (synthetic nostalgia)
  • afghanistanbul (something [insert politician you dislike] would say)
  • togethernet (the friendly internet)
  • elephantom (a large ghost)
  • thundergroundwaterproof (as in "I don't know why they felt the need to make it thundergroundwaterproof, but it's making me nervous")

And:

  • codescribingo (maybe an upcoming challenge on this site?)

The final output is on pastebin here: http://pastebin.com/j3qYb65b

Dave

Posted 2016-08-01T18:31:51.030

Reputation: 7 519

An observation which might be useful to others looking to get the optimal solution: the problem can be reduced to a non-euclidean, asymmetric travelling salesman problem: define a matrix where element i, j = max_word_length - overlap(word[i], word[j]) (where overlap checks the overlap from the right of the first argument to the left of the second). Solving this (good luck!) then cutting the resulting loop at the highest cost (lowest overlap) will give an ordered list of words which can be merged to give an optimal solution. – Dave – 2016-08-01T22:53:05.217
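A minimal sketch of that cost matrix, using a hypothetical three-word list (`overlap` checks the right end of the first word against the left end of the second, as described in the comment):

```python
def overlap(a, b):
    # Longest k such that the last k letters of a equal the first k of b.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

words = ["cat", "atom", "tomb"]
max_word_length = max(len(w) for w in words)
cost = [[max_word_length - overlap(a, b) for b in words] for a in words]
# Note the matrix is asymmetric: overlap("cat", "atom") is 2, overlap("atom", "cat") is 0.
```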

Impressive. Is this checking every word in the current list against every other, each time through? I was considering this but assumed I'd need to just check a random sample to make it run in a reasonable time. – trichoplax – 2016-08-02T01:08:07.020

I wonder if it'd be possible to "cache" the distances between pairs in a way that you don't have to re-check them with each other every time, and only have to check them against the new megaword? At any rate, this is incredible and I wonder if it's completely optimal or not. – Value Ink – 2016-08-02T02:10:26.053

@ValueInk yes caching would be a big performance boost. An earlier version had that, but it adds a lot of complexity, so when I adapted some logic I had to re-write the lot. Instead I chose to drop caching. Also no, this isn't completely optimal. It's a greedy algorithm so it can't judge trade-offs, and it's unable to choose between "equally good" options. See my TSP comment for the (NP-Hard) optimal solution. – Dave – 2016-08-02T06:53:41.480

@trichoplax yup, that's what it's doing. Running time is O(n^3), which isn't so bad for this sample size. With caching it could be reduced to O(n^2). The inner check is also very parallelisable, so even for larger samples it could run in reasonable time with threading / distributed computation. Also this gets a big speed boost from knowing the range of possible overlap widths for each step, which cut the runtime by a factor of 10. – Dave – 2016-08-02T07:02:45.840

This might not be as hard as general TSP, because we have the funny constraint that overlap(a, b) ≥ min{overlap(a, d), overlap(c, d), overlap(c, b)} for all a, b, c, d. – Anders Kaseorg – 2016-08-02T11:20:13.680


C++11, 38272 letters, proven optimal

This algorithm is guaranteed to provide a lower bound on the solution. In this case, it is able to achieve the lower bound and output an optimal 38272 letter solution. (This matches the solution found by Dave’s greedy algorithm. I was surprised and a little disappointed to discover that it’s optimal, but, there we are.)

It works by solving the minimum-cost flow problem on the network built as follows.

  • First, any words contained in other words are redundant; discard them.
  • For every word w, draw two nodes w_0 and w_1, where w_0 is a source with capacity 1 and w_1 is a sink with capacity 1.
  • For every (strict) prefix or suffix a of any word, draw a node a.
  • For every suffix a of w, draw an arc from w_0 to a with capacity 1 and cost 0.
  • For every prefix a of w, draw an arc from a to w_1 with capacity 1 and cost length(w) − length(a).

Any string of length n that contains every word can be converted into a flow on this network with cost at most n. Therefore, the minimum cost flow on this network is a lower bound on the length of the shortest such string.
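As a toy illustration of the construction above (plain Python, not the LEMON code below; the node names are hypothetical):

```python
words = ["cat", "atom"]

arcs = []  # (from_node, to_node, cost); every arc has capacity 1
for w in words:
    for i in range(1, len(w) + 1):   # strict suffixes of w, including ""
        arcs.append((w + "_0", w[i:], 0))
    for i in range(len(w)):          # strict prefixes of w, including ""
        arcs.append((w[:i], w + "_1", len(w) - i))

# e.g. routing flow cat_0 -> "at" -> atom_1 costs len("atom") - len("at") = 2
```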

If we’re lucky—and in this case we are—then after we redirect the flow coming into w_1 back out of w_0, we will find an optimal flow that has just one connected component and that passes through the node for the empty string. If so, it will contain an Eulerian circuit starting and ending there. Such an Eulerian circuit can be read back out as a string of optimal length.

If we weren’t lucky, we would add some extra arcs between the empty string and the shortest strings in the other connected components to ensure that an Eulerian circuit exists. The string would no longer necessarily be optimal in that case.

I use the LEMON library for its min-cost flow and Eulerian circuit algorithms. (This was my first time using this library, and I was impressed—I will definitely use it again for future graph algorithm needs.) LEMON comes with four different minimum-cost flow algorithms; you can try them here with --net, --cost, --cap, and --cycle (default).

The program runs in 0.5 seconds, producing this output string.

#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <lemon/core.h>
#include <lemon/connectivity.h>
#include <lemon/euler.h>
#include <lemon/maps.h>
#include <lemon/list_graph.h>
#include <lemon/network_simplex.h>
#include <lemon/cost_scaling.h>
#include <lemon/capacity_scaling.h>
#include <lemon/cycle_canceling.h>

using namespace std;

typedef lemon::ListDigraph G;

struct Word {
    G::Node suffix, prefix;
    G::Node tour_node;
};

struct Edge {
    unordered_map<string, Word>::iterator w;
    G::Arc arc;
};

struct Affix {
    vector<Edge> suffix, prefix;
    G::Node node;
    G::Node tour_node;
};

template<class MCF>
bool solve(const G &net, const G::ArcMap<int> &lowerMap, const G::ArcMap<int> &upperMap, const G::ArcMap<int> &costMap, const G::NodeMap<int> &supplyMap, int &totalCost, G::ArcMap<int> &flowMap)
{
    MCF mcf(net);
    if (mcf.lowerMap(lowerMap).upperMap(upperMap).costMap(costMap).supplyMap(supplyMap).run() != mcf.OPTIMAL)
        return false;
    totalCost = mcf.totalCost();
    mcf.flowMap(flowMap);
    return true;
}

int main(int argc, char **argv)
{
    clog << "Reading dictionary from stdin" << endl;
    unordered_map<string, Affix> affixes;
    unordered_map<string, Word> words;
    unordered_set<string> subwords;
    G net, tour;
    G::ArcMap<int> lowerMap(net), upperMap(net), costMap(net);
    G::NodeMap<int> supplyMap(net);
    string new_word;
    while (getline(cin, new_word)) {
        if (subwords.find(new_word) != subwords.end())
            continue;
        for (auto i = new_word.begin(); i != new_word.end(); ++i) {
            for (auto j = new_word.end(); j != i; --j) {
                string s(i, j);
                words.erase(s);
                subwords.insert(s);
            }
        }
        words.emplace(new_word, Word());
    }
    for (auto w = words.begin(); w != words.end(); ++w) {
        w->second.suffix = net.addNode();
        supplyMap.set(w->second.suffix, 1);
        w->second.prefix = net.addNode();
        supplyMap.set(w->second.prefix, -1);
        for (auto i = w->first.begin(); ; ++i) {
            affixes.emplace(string(w->first.begin(), i), Affix()).first->second.prefix.push_back(Edge {w});
            affixes.emplace(string(i, w->first.end()), Affix()).first->second.suffix.push_back(Edge {w});
            if (i == w->first.end())
                break;
        }
        w->second.tour_node = tour.addNode();
    }
    for (auto a = affixes.begin(); a != affixes.end();) {
        if (a->second.suffix.empty() || a->second.prefix.empty() ||
            (a->second.suffix.size() == 1 && a->second.prefix.size() == 1 &&
             a->second.suffix.begin()->w == a->second.prefix.begin()->w)) {
            affixes.erase(a++);
        } else {
            a->second.node = net.addNode();
            supplyMap.set(a->second.node, 0);
            for (auto &e : a->second.suffix) {
                e.arc = net.addArc(e.w->second.suffix, a->second.node);
                lowerMap.set(e.arc, 0);
                upperMap.set(e.arc, 1);
                costMap.set(e.arc, 0);
            }
            for (auto &e : a->second.prefix) {
                e.arc = net.addArc(a->second.node, e.w->second.prefix);
                lowerMap.set(e.arc, 0);
                upperMap.set(e.arc, 1);
                costMap.set(e.arc, e.w->first.length() - a->first.length());
            }
            a->second.tour_node = lemon::INVALID;
            ++a;
        }
    }

    clog << "Read " << words.size() << " words and found " << affixes.size() << " affixes; ";
    clog << "created network with " << countNodes(net) << " nodes and " << countArcs(net) << " arcs" << endl;

    int totalCost;
    G::ArcMap<int> flowMap(net);
    bool solved;
    if (argc > 1 && string(argv[1]) == "--net") {
        clog << "Using network simplex algorithm" << endl;
        solved = solve<lemon::NetworkSimplex<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
    } else if (argc > 1 && string(argv[1]) == "--cost") {
        clog << "Using cost scaling algorithm" << endl;
        solved = solve<lemon::CostScaling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
    } else if (argc > 1 && string(argv[1]) == "--cap") {
        clog << "Using capacity scaling algorithm" << endl;
        solved = solve<lemon::CapacityScaling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
    } else if ((argc > 1 && string(argv[1]) == "--cycle") || true) {
        clog << "Using cycle canceling algorithm" << endl;
        solved = solve<lemon::CycleCanceling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
    }

    if (!solved) {
        clog << "error: no solution found" << endl;
        return 1;
    }
    clog << "Lower bound: " << totalCost << endl;

    G::ArcMap<string> arcLabel(tour);
    G::Node empty = tour.addNode();
    affixes.find("")->second.tour_node = empty;
    for (auto &a : affixes) {
        for (auto &e : a.second.suffix) {
            if (flowMap[e.arc]) {
                if (a.second.tour_node == lemon::INVALID)
                    a.second.tour_node = tour.addNode();
                arcLabel.set(tour.addArc(e.w->second.tour_node, a.second.tour_node), "");
            }
        }
        for (auto &e : a.second.prefix) {
            if (flowMap[e.arc]) {
                if (a.second.tour_node == lemon::INVALID)
                    a.second.tour_node = tour.addNode();
                arcLabel.set(tour.addArc(a.second.tour_node, e.w->second.tour_node), e.w->first.substr(a.first.length()));
            }
        }
    }

    clog << "Created tour graph with " << countNodes(tour) << " nodes and " << countArcs(tour) << " arcs" << endl;

    G::NodeMap<int> compMap(tour);
    int components = lemon::stronglyConnectedComponents(tour, compMap);
    if (components != 1) {
        vector<unordered_map<string, Affix>::iterator> breaks(components, affixes.end());
        for (auto a = affixes.begin(); a != affixes.end(); ++a) {
            if (a->second.tour_node == lemon::INVALID)
                continue;
            int c = compMap[a->second.tour_node];
            if (c == compMap[empty])
                continue;
            auto &b = breaks[compMap[a->second.tour_node]];
            if (b == affixes.end() || b->first.length() > a->first.length())
                b = a;
        }
        int offset = 0;
        for (auto &b : breaks) {
            if (b != affixes.end()) {
                arcLabel.set(tour.addArc(empty, b->second.tour_node), b->first);
                arcLabel.set(tour.addArc(b->second.tour_node, empty), "");
                offset += b->first.length();
            }
        }
        clog << "warning: Found " << components << " components; solution may be suboptimal by up to " << offset << " letters" << endl;
    }

    if (!lemon::eulerian(tour)) {
        clog << "error: failed to make tour graph Eulerian" << endl;
        return 1;
    }

    for (lemon::DiEulerIt<G> e(tour, empty); e != lemon::INVALID; ++e)
        cout << arcLabel[e];
    cout << endl;

    return 0;
}

Anders Kaseorg

Posted 2016-08-01T18:31:51.030

Reputation: 29 242

While I can't claim to understand how min flow works, if this really is optimal, well done! – Nathan Merrill – 2016-08-03T12:38:03.247

Sorry to disappoint :P I hadn't thought of a flow network; that's pretty elegant. Took me a while to understand how you were linking the words together in your network before I finally realised "For every (strict) prefix or suffix a of any word, draw a node a" meant that the nodes "a" can be shared between multiple words. – Dave – 2016-08-03T17:55:18.853

Also your solution is about 2,000 times faster than mine! – Dave – 2016-08-03T17:56:51.103


Maybe help this guy (http://www.cs.cmu.edu/~tom7/portmantout/) out with his attempt at a similar thing?

– Oliver Daugherty-Long – 2016-08-06T00:02:58.980


@OliverDaugherty-Long Done! (For real this time.) The best previously known bounds were 520732 ≤ optimal length ≤ 537136, and I believe I have improved both bounds to 536186.

– Anders Kaseorg – 2016-08-07T10:30:59.177


Java 8, ~5 Minutes, Length of 39,279

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class Words {

    public static void main(String[] args) throws Throwable {
        File file = new File("words.txt");
        List<String> wordsList = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new FileReader(file));
        String line;
        while ((line = reader.readLine()) != null) {
            wordsList.add(line);
        }
        reader.close();

        Set<String> words = new HashSet<>();

        System.out.println("Finished I/O");

        for (int i = 0; i < wordsList.size(); i++) { //Step 1: remove any words that occur in other words
            boolean in = false;
            for (int j = 0; j < wordsList.size(); j++) {
                if (i != j && wordsList.get(j).contains(wordsList.get(i))) {
                    in = true;
                    break;
                }
            }
            if (!in) {
                words.add(wordsList.get(i));
            }
        }

        System.out.println("Removed direct containers");

        List<String> working = words.stream().sorted((c, b) -> Integer.compare(c.length(), b.length())).collect(Collectors.toList()); //Sort by length, shortest first
        StringBuilder result = new StringBuilder();
        result.append(working.get(0));
        while (!working.isEmpty()) {
            Optional<String> target = working.stream().sorted((c, b) -> Integer.compare(firstLastCommonality(result.toString(), b), firstLastCommonality(result.toString(), c))).findFirst(); //Find the string that has the greatest in common with the end of 'result'
            if(target.isPresent()) { //It really should be present, but just in case
                String s = target.get();
                working.remove(s);
                int commonality = firstLastCommonality(result.toString(), s);
                s = s.substring(commonality);
                result.append(s);
            }
        }

        System.out.println("Finished algorithm");

        String r = result.toString();
        System.out.println("The string: \n" + r);
        System.out.println("Length: \n" + r.length());
        System.out.println("Verified: \n" + !wordsList.stream().filter(s -> !r.contains(s)).findFirst().isPresent());
    }

    private static int firstLastCommonality(String a, String b) {
        int count = 0;
        int len = b.length();
        while (!a.endsWith(b) && !b.equals("")) {
            b = cutLastChar(b);
            count++;
        }
        return len - count;
    }

    private static String cutLastChar(String string) {
        if (string.length() - 1 < 0) {
            return string;
        } else {
            return string.substring(0, string.length() - 1);
        }
    }

}

Input:

  • a file called 'words.txt' in the working directory, in the exact same format as the file in the main post

Output:

Finished I/O
Removed direct containers
Finished algorithm
The string: 
[Moved to pastebin](http://pastebin.com/iygyR3zL)
Length: 
39279
Verified: 
true

Socratic Phoenix

Posted 2016-08-01T18:31:51.030

Reputation: 1 629

FGITW, and in Java no less. You sir have my vote. – Patrick Roberts – 2016-08-01T19:30:05.413

Nice! You got rid of 26,609 characters. – R. Kap – 2016-08-01T20:21:18.407

@R.Kap go figure! I didn't even think to calculate that... There must be a smarter algorithm though, this is just the first thing that came to mind... – Socratic Phoenix – 2016-08-01T20:26:13.543


Python 2, 39254 characters

Takes 1-2 minutes to run on my machine. It works by taking the longest word and then repeatedly appending the word that has the most letters in common with the end of the result string. (Before that, all words that are substrings of other words are removed to prevent unnecessary additions to the string.)

Update: Tried looking in both directions, but that doesn't do any better. (Maybe it's using up words that could be used better later?)

Link to the word on pastebin.

first 100 chars:

telecommunicationsawayneillegallynnbabyenbcjrxltdxmlbsrcwvtxxxboxespnycdsriconsentencessexyrsslipodc

Code:

import re
import urllib

def suffix_dist(w1,w2):
    for i in range(min(map(len,[w1,w2])))[::-1]:
        if w1[-i:]==w2[:i]:
            return i
    return 0

url="https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt"
s=urllib.urlopen(url).read()
words=s.split()
words.sort(key=len,reverse=True)

## remove words that are substrings of other words anyway
for i in range(len(words))[::-1]:
    if any(words[i] in w for w in words[:i]):
        words.pop(i)

print len(words)

words.sort(key=len)
w1=words.pop(-1)
result=w1
while words:
    ## get the word with longest shared suffix/prefix
    w2=max(words,key=lambda x:suffix_dist(w1,x))
    print w2
    words.pop(words.index(w2))
    if w2 in result:
        break
    result+=w2[suffix_dist(w1,w2):]
    w1=w2


print result[:100]
print len(result)
print "Test:", all(w in result for w in s.split())

KarlKastor

Posted 2016-08-01T18:31:51.030

Reputation: 2 352

Welp, I've been beaten by 25 characters... +1 for that – Socratic Phoenix – 2016-08-01T20:28:38.667

Nice work! I had a similar idea but you already had an answer up. My version starts with a small word instead of a big word, plus some other pruning that causes it to drastically lose out on the time factor, taking up to 3x the amount of time to run. – Value Ink – 2016-08-02T02:15:20.350


Ruby, 39222 characters

Uses a similar approach to @KarlKastor's Python answer, but the starting string is one of the smallest words instead of the largest. Another optimization (I don't know how much it helps) is that between each addition, it prunes any words that may already have been included in the string due to overlapping words.

Runs in a little over 4 minutes on my machine, not counting the web request to retrieve the list of words, but not quite 4:20.

The word on Pastebin.

require 'net/http'

puts "Obtaining word list..."
data = Net::HTTP.get(URI'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt')
puts "Word list obtained!"

puts "Starting calculation..."
time = Time.now

words = data.split.sort_by(&:size)
words.reject!{|w| words.find{|x| w!=x && x.include?(w)}}

string = words.shift

def merge_dist(prefix, suffix)
    [prefix.size,suffix.size].min.downto(0).find{|i| prefix.end_with?(suffix[0,i])}
end

while words.length > 0
    suffix = words.max_by{|w| merge_dist string, w}
    string += suffix[merge_dist(string, suffix)..-1]
    words.reject!{|w| string.include? w}
end

delta = Time.now - time

puts "Calculation completed in #{delta} seconds!"
puts "Word is #{string.size} chars long."

open("word.txt", 'w') << string

puts "Word saved to word.txt"

Value Ink

Posted 2016-08-01T18:31:51.030

Reputation: 10 608


PowerShell v2+, 46152 characters

param([System.Collections.ArrayList]$n)$n=$n|sort length -des;while($n){$x=$n.Count;$o+=$n[0];0..$x|%{if($o.IndexOf($n[$_])-ge0){$n.RemoveAt($_)}}}$o

Takes the input as a list, casts it into an ArrayList (so we can manipulate it). We sort it by length in -descending order. Then, while we still have words in our input array, do a loop. Each iteration, set helper $x to be equal to how many we have left, tack on the next item in the list to our output $o, and then trawl through everything still in our list. If the .IndexOf is not equal to -1 (i.e., the word was found somewhere in $o), we remove that word from our list of remaining words. Finally, at the end, output $o.

I don't have access to a Pastebin or similar, so here's the beginning and end of the word in the meantime -- telecommunicationscharacterizationresponsibilitiessublimedirectory...fcmxvtwvfxwujmjsuhjjrxjdbkdxqc. Which I guess has shaved about 20,000 characters off of the input, so not that bad, I suppose.

I'm working on refinements.

AdmBorkBork

Posted 2016-08-01T18:31:51.030

Reputation: 41 581


PHP, 46612 characters

This is just a start. I hope to improve it. All I've done so far is remove any word that is a sub-string of another word. I'm working on 3 copies of the array, but the memory doesn't seem to be an issue.

<?php
set_time_limit(3000);

$words=file('https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt');
$words = array_map('trim', $words);

usort($words, function ($a, $b)
{
    if (strlen($a) == strlen($b) ){
        return 0;
    }
    return ( strlen($a) < strlen($b) )? -1 : 1;
});

$final_array=$words;
$comparison_array=$words;


  foreach($words as $key => $word){
    echo $word."<br>\n";
      foreach($comparison_array as $nestedKey => $nestedWord){
          if (strlen($nestedWord) <= strlen($word)) {
            unset($comparison_array[$nestedKey]);
            continue;
          }
          if( strpos($nestedWord,$word) !== FALSE ){
              unset($final_array[$key]);
              $restart=true;
              break;
          } 
      }    
  }


sort($final_array);
$compound='';
$compound = implode($final_array);
echo $compound;
echo "  <br><br>\n\n". strlen($compound);

TecBrat

Posted 2016-08-01T18:31:51.030

Reputation: 222