Maximal Substring Construction

18

1

In this challenge, you are passed two things:

  1. A string length, N
  2. A list of strings, L, each with an assigned point value. Any string that is not passed in has a point value of 0

You need to construct a string of length N such that the sum of all substring points is as large as possible.

For example:

5 [("ABC", 3), ("DEF", 4), ("CDG", 2)]

should output

ABCDG

Because the two substrings with points (ABC and CDG) have a total of 5 points, and no other possible construction can give 5 or more points.

Substrings can be used multiple times in a string, and can overlap. You can assume that the points will always be positive, the substring lengths will be between 1 to N characters longs, and that N > 0. If multiple constructions are maximal, print any one of them.

Your program must run in a reasonable amount of time (no more than a minute for each of the examples):

1 [("A", 7), ("B", 4), ("C", 100)]     => C
2 [("A", 2), ("B", 3), ("AB", 2)]      => AB
2 [("A", 1), ("B", 2), ("CD", 3)]      => BB
2 [("AD", 1), ("B", 2), ("ZB", 3)]     => ZB
3 [("AB", 2), ("BC", 1), ("CA", 3)]    => CAB
3 [("ABC", 4), ("DEF", 4), ("E", 1)]   => DEF
4 [("A", 1), ("BAB", 2), ("ABCD", 3)]  => AAAA or ABAB or BABA or ABCD
5 [("A", 1), ("BAB", 2), ("ABCD", 3)]  => BABCD or BABAB
5 [("ABC", 3), ("DEF", 4), ("CDG", 2)] => ABCDG
5 [("AB", 10), ("BC", 2), ("CA", 2)]   => ABCAB
6 [("AB", 10), ("BC", 2), ("CA", 2)]   => ABABAB
8 [("AA", 3), ("BA", 5)]               => BAAAAAAA
10 [("ABCDE", 19), ("ACEBD",  18), ("ABEDC", 17), ("BCEDA", 16), ("EBDAC", 15), ("BACD", 14), ("CADB", 13), ("ABDC", 12), ("CABD", 11), ("EBDC", 10), ("ACE", 9), ("CBA", 8), ("AEC", 7), ("BE", 6), ("AE", 5), ("DC", 4), ("BA", 3), ("A", 2), ("D", 1)]
=> ACEBDACEBD

This is a , so prepare your shortest answer in your favorite language!

Nathan Merrill

Posted 2015-12-24T00:41:31.653

Reputation: 13 591

Do we have to use your exact input format or can we use any convenient input format for our language? – flawr – 2015-12-24T12:25:22.373

@flawr any convenient method of input is fine (dictionary, stdin, function params) – Nathan Merrill – 2015-12-24T14:00:13.277

The DEF tuple missing a comma – isaacg – 2015-12-25T00:54:37.897

I have a perfectly nice solution, but it's too slow for the last test case. – isaacg – 2015-12-25T02:50:46.353

@isaacg post it, I'll see if my time limit needs to change. – Nathan Merrill – 2015-12-25T17:45:56.470

@NathanMerrill No, it's a pure brute-force solution. The time-limit is fine. – isaacg – 2015-12-25T18:23:12.600

@isaacg yeah that was the purpose of my last test case. I'm trying to get a smart solution, which I'm hoping exists. – Nathan Merrill – 2015-12-25T18:23:57.413

1@isaacg I have an elegant solution, sadly this comment is too small to contain it. (I don't, just wanted to cite Fermat.) – orlp – 2015-12-26T07:31:59.133

Answers

1

Python 2.7, 503 Bytes

This isn't particularly golfed, nor is it particularly efficient; it's just a relatively condensed enumeration of feasible strings that are brute-forced. I don't think it would be too hard to create an admissible heuristic to use with A*, which would probably be a bit quicker. I didn't bother doing that, though, because this method solves all of the examples in about 47 seconds on my laptop.

import re
v=lambda l,s:sum([len([m.start() for m in re.finditer('(?=%s)'%u,s)])*v for u,v in l])
def m(n,l,e):
 if len(e)==n:
  yield e
 else:
  u=1
  for s,v in sorted(l,key=lambda j:j[1],reverse=True):
   for i in range(len(s)-1,0,-1):
    if len(e)+len(s)-i <= n and e.endswith(s[:i]):
     for r in m(n,l,e+s[i:]):
      u=0;yield r
   if len(e)+len(s)<=n:
    for r in m(n,l,e+s):
     u=0;yield r
  if u:
   yield e
def p(n,l):
 b,r=-1,0
 for s in m(n,l,''):
  a=v(l,s)
  if a>b:b,r=a,s
 return r

To test it:

if __name__ == "__main__":
    for n, l in [
            (1, [("A", 7), ("B", 4), ("C", 100)]),     # => C
            (2, [("A", 2), ("B", 3), ("AB", 2)]),      # => AB
            (2, [("A", 1), ("B", 2), ("CD", 3)]),      # => BB
            (2, [("AD", 1), ("B", 2), ("ZB", 3)]),     # => ZB
            (3, [("AB", 2), ("BC", 1), ("CA", 3)]),    # => CAB
            (3, [("ABC", 4), ("DEF", 4), ("E", 1)]),   # => DEF
            (4, [("A", 1), ("BAB", 2), ("ABCD", 3)]),  # => AAAA or ABAB or BABA or ABCD
            (5, [("A", 1), ("BAB", 2), ("ABCD", 3)]),  # => BABCD or BABAB
            (5, [("ABC", 3), ("DEF", 4), ("CDG", 2)]), # => ABCDG
            (5, [("AB", 10), ("BC", 2), ("CA", 2)]),   # => ABCAB
            (6, [("AB", 10), ("BC", 2), ("CA", 2)]),   # => ABABAB
            (8, [("AA", 3), ("BA", 5)]),               # => BAAAAAAA
            (10, [("ABCDE", 19), ("ACEBD",  18), ("ABEDC", 17), ("BCEDA", 16), ("EBDAC", 15), ("BACD", 14), ("CADB", 13), ("ABDC", 12), ("CABD", 11), ("EBDC", 10), ("ACE", 9), ("CBA", 8), ("AEC", 7), ("BE", 6), ("AE", 5), ("DC", 4), ("BA", 3), ("A", 2), ("D", 1)]) # => ACEBDACEBD
    ]:
        print p(n, l)

Explanation

The v function simply calculates the value of a given string by searching for all occurrences of the substrings in L. The m function enumerates all strings of length n with prefix e that can be generated from the substrings in l. m recursively calls itself; the first call should pass in '' for e. For example:

>>> for s in m(4, [("A", 1), ("BAB", 2), ("ABCD", 3)], ''):print s
ABCD
BABA
ABCD
ABAB
AAAA

The p function simply loops through all possible strings (as enumerated by m) and returns the one with highest value (as determined by v).

Note that the m function actually prioritizes the enumeration by sorting based on substrings' values. This turns out to be unnecessary to ensure optimality, and it actually hinders efficiency a bit; I could save ~20 or so bytes by removing it.

ESultanik

Posted 2015-12-24T00:41:31.653

Reputation: 1 078