C++11, 38272 letters, proven optimal
This algorithm is guaranteed to provide a lower bound on the solution. In this case, it is able to achieve the lower bound and output an optimal 38272 letter solution. (This matches the solution found by Dave’s greedy algorithm. I was surprised and a little disappointed to discover that it’s optimal, but, there we are.)
It works by solving the minimum-cost flow problem on the network built as follows.
- First, any words contained in other words are redundant; discard them.
- For every word w, draw two nodes w_0 and w_1, where w_0 is a source with capacity 1 and w_1 is a sink with capacity 1.
- For every (strict) prefix or suffix a of any word, draw a node a.
- For every suffix a of w, draw an arc from w_0 to a with capacity 1 and cost 0.
- For every prefix a of w, draw an arc from a to w_1 with capacity 1 and cost length(w) − length(a).
Any string of length n that contains every word can be converted into a flow on this network with cost at most n. Therefore, the minimum cost flow on this network is a lower bound on the length of the shortest such string.
If we’re lucky—and in this case we are—then after we redirect the flow coming into w_1 back out of w_0, we will find an optimal flow that has just one connected component and that passes through the node for the empty string. If so, it will contain an Eulerian circuit starting and ending there. Such an Eulerian circuit can be read back out as a string of optimal length.
If we weren’t lucky, add some extra arcs between the empty string and the shortest strings in the other connected components in order to ensure that an Eulerian circuit exists. The string would no longer necessarily be optimal in that case.
I use the LEMON library for its min-cost flow and Eulerian circuit algorithms. (This was my first time using this library, and I was impressed—I will definitely use it again for future graph algorithm needs.) LEMON comes with four different minimum-cost flow algorithms; you can try them here with --net
, --cost
, --cap
, and --cycle
(default).
The program runs in 0.5 seconds, producing this output string.
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <lemon/core.h>
#include <lemon/connectivity.h>
#include <lemon/euler.h>
#include <lemon/maps.h>
#include <lemon/list_graph.h>
#include <lemon/network_simplex.h>
#include <lemon/cost_scaling.h>
#include <lemon/capacity_scaling.h>
#include <lemon/cycle_canceling.h>
using namespace std;
typedef lemon::ListDigraph G;
struct Word {
G::Node suffix, prefix;
G::Node tour_node;
};
struct Edge {
unordered_map<string, Word>::iterator w;
G::Arc arc;
};
struct Affix {
vector<Edge> suffix, prefix;
G::Node node;
G::Node tour_node;
};
template<class MCF>
bool solve(const G &net, const G::ArcMap<int> &lowerMap, const G::ArcMap<int> &upperMap, const G::ArcMap<int> &costMap, const G::NodeMap<int> &supplyMap, int &totalCost, G::ArcMap<int> &flowMap)
{
MCF mcf(net);
if (mcf.lowerMap(lowerMap).upperMap(upperMap).costMap(costMap).supplyMap(supplyMap).run() != mcf.OPTIMAL)
return false;
totalCost = mcf.totalCost();
mcf.flowMap(flowMap);
return true;
}
int main(int argc, char **argv)
{
clog << "Reading dictionary from stdin" << endl;
unordered_map<string, Affix> affixes;
unordered_map<string, Word> words;
unordered_set<string> subwords;
G net, tour;
G::ArcMap<int> lowerMap(net), upperMap(net), costMap(net);
G::NodeMap<int> supplyMap(net);
string new_word;
while (getline(cin, new_word)) {
if (subwords.find(new_word) != subwords.end())
continue;
for (auto i = new_word.begin(); i != new_word.end(); ++i) {
for (auto j = new_word.end(); j != i; --j) {
string s(i, j);
words.erase(s);
subwords.insert(s);
}
}
words.emplace(new_word, Word());
}
for (auto w = words.begin(); w != words.end(); ++w) {
w->second.suffix = net.addNode();
supplyMap.set(w->second.suffix, 1);
w->second.prefix = net.addNode();
supplyMap.set(w->second.prefix, -1);
for (auto i = w->first.begin(); ; ++i) {
affixes.emplace(string(w->first.begin(), i), Affix()).first->second.prefix.push_back(Edge {w});
affixes.emplace(string(i, w->first.end()), Affix()).first->second.suffix.push_back(Edge {w});
if (i == w->first.end())
break;
}
w->second.tour_node = tour.addNode();
}
for (auto a = affixes.begin(); a != affixes.end();) {
if (a->second.suffix.empty() || a->second.prefix.empty() ||
(a->second.suffix.size() == 1 && a->second.prefix.size() == 1 &&
a->second.suffix.begin()->w == a->second.prefix.begin()->w)) {
affixes.erase(a++);
} else {
a->second.node = net.addNode();
supplyMap.set(a->second.node, 0);
for (auto &e : a->second.suffix) {
e.arc = net.addArc(e.w->second.suffix, a->second.node);
lowerMap.set(e.arc, 0);
upperMap.set(e.arc, 1);
costMap.set(e.arc, 0);
}
for (auto &e : a->second.prefix) {
e.arc = net.addArc(a->second.node, e.w->second.prefix);
lowerMap.set(e.arc, 0);
upperMap.set(e.arc, 1);
costMap.set(e.arc, e.w->first.length() - a->first.length());
}
a->second.tour_node = lemon::INVALID;
++a;
}
}
clog << "Read " << words.size() << " words and found " << affixes.size() << " affixes; ";
clog << "created network with " << countNodes(net) << " nodes and " << countArcs(net) << " arcs" << endl;
int totalCost;
G::ArcMap<int> flowMap(net);
bool solved;
if (argc > 1 && string(argv[1]) == "--net") {
clog << "Using network simplex algorithm" << endl;
solved = solve<lemon::NetworkSimplex<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
} else if (argc > 1 && string(argv[1]) == "--cost") {
clog << "Using cost scaling algorithm" << endl;
solved = solve<lemon::CostScaling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
} else if (argc > 1 && string(argv[1]) == "--cap") {
clog << "Using capacity scaling algorithm" << endl;
solved = solve<lemon::CapacityScaling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
} else if ((argc > 1 && string(argv[1]) == "--cycle") || true) {
clog << "Using cycle canceling algorithm" << endl;
solved = solve<lemon::CycleCanceling<G>>(net, lowerMap, upperMap, costMap, supplyMap, totalCost, flowMap);
}
if (!solved) {
clog << "error: no solution found" << endl;
return 1;
}
clog << "Lower bound: " << totalCost << endl;
G::ArcMap<string> arcLabel(tour);
G::Node empty = tour.addNode();
affixes.find("")->second.tour_node = empty;
for (auto &a : affixes) {
for (auto &e : a.second.suffix) {
if (flowMap[e.arc]) {
if (a.second.tour_node == lemon::INVALID)
a.second.tour_node = tour.addNode();
arcLabel.set(tour.addArc(e.w->second.tour_node, a.second.tour_node), "");
}
}
for (auto &e : a.second.prefix) {
if (flowMap[e.arc]) {
if (a.second.tour_node == lemon::INVALID)
a.second.tour_node = tour.addNode();
arcLabel.set(tour.addArc(a.second.tour_node, e.w->second.tour_node), e.w->first.substr(a.first.length()));
}
}
}
clog << "Created tour graph with " << countNodes(tour) << " nodes and " << countArcs(tour) << " arcs" << endl;
G::NodeMap<int> compMap(tour);
int components = lemon::stronglyConnectedComponents(tour, compMap);
if (components != 1) {
vector<unordered_map<string, Affix>::iterator> breaks(components, affixes.end());
for (auto a = affixes.begin(); a != affixes.end(); ++a) {
if (a->second.tour_node == lemon::INVALID)
continue;
int c = compMap[a->second.tour_node];
if (c == compMap[empty])
continue;
auto &b = breaks[compMap[a->second.tour_node]];
if (b == affixes.end() || b->first.length() > a->first.length())
b = a;
}
int offset = 0;
for (auto &b : breaks) {
if (b != affixes.end()) {
arcLabel.set(tour.addArc(empty, b->second.tour_node), b->first);
arcLabel.set(tour.addArc(b->second.tour_node, empty), "");
offset += b->first.length();
}
}
clog << "warning: Found " << components << " components; solution may be suboptimal by up to " << offset << " letters" << endl;
}
if (!lemon::eulerian(tour)) {
clog << "error: failed to make tour graph Eulerian" << endl;
return 1;
}
for (lemon::DiEulerIt<G> e(tour, empty); e != lemon::INVALID; ++e)
cout << arcLabel[e];
cout << endl;
return 0;
}
1Related. Related. – Martin Ender – 2016-08-01T18:35:02.013
Is there any limit as to how long the program can run? – Loovjo – 2016-08-01T18:40:07.857
1@Loovjo no, but if it ends up that bruteforcing is fast enough to run, then I'll change the word list to make it longer. – Nathan Merrill – 2016-08-01T18:41:39.667
@muddyfish correct, edited. – Nathan Merrill – 2016-08-01T18:42:05.560
So, for instance, the input list
["hello","ok","kid"]
should returnhellokid
? – R. Kap – 2016-08-01T19:22:16.623I hope we're not expected to put the compounded word directly in our answer... – Patrick Roberts – 2016-08-01T19:23:54.070
@R.Kap yes, that string is valid, but any string that contains the three words is valid (e.g.
kidokhello
) – Nathan Merrill – 2016-08-01T19:27:14.2801@PatrickRoberts If it fits in your answer, props to you :) A pastebin/gist link would be great, but isn't required. – Nathan Merrill – 2016-08-01T19:28:46.090
For consistency, you should include the actual length of all the words concatenated together if overlapping letters were not deleted. The length is
65888
. – R. Kap – 2016-08-01T20:19:39.9401Hmm, who knows a good asymmetric travelling salesman heuristic? – Dave – 2016-08-01T20:48:19.063
I assume that there is no wrapping? A word cannot start at the end of the compound word and finish at its beginning? – trichoplax – 2016-08-01T21:54:16.263
Does the code need to be deterministic (PRNG with consistent seed)? – trichoplax – 2016-08-01T22:12:45.563
2No wrapping, and yes to deterministic. – Nathan Merrill – 2016-08-01T22:14:37.380
Relevant Youtube video: Portmantout – justhalf – 2016-08-03T06:29:08.147
Given the size of the other programs compared to my PowerShell one, kinda wishing that code length was a factor in scoring, especially now that an optimal length word has been found. ;-) – AdmBorkBork – 2016-08-03T14:01:49.063
1I personally think credit should go to the first one who found the optimal solution, not just somebody who can golf their solution. :) – Nathan Merrill – 2016-08-03T14:04:35.390
@NathanMerrill Oh, no doubt. I'm just tooting my own horn for giggles.
– AdmBorkBork – 2016-08-03T14:09:22.027Is the optimal solution 28272(Mentioned in bounty) or 38272(In title of accepted answer) chars? – TecBrat – 2016-08-04T16:06:19.593