Shortest Unique Substring

14

Given (on STDIN, as command line arguments, or as function arguments) two distinct non-empty strings, find and return the shortest substring of the first string which is not a substring of the second. If no such substring exists, you may return the empty string, return any string which isn't a substring of the original string, or throw an exception. If you are returning from a function, you may also return null (or undefined, None, etc.) in this case. If multiple such substrings are tied for the shortest, you may return any one of them.

Strings can consist of any printable ascii characters.

Input given on STDIN will be given with one string on each line. At your request, a single empty line may be added at the end of the input.

This is code golf, so the shortest valid program wins.

SOME TEST CASES

INPUT:

STRING ONE
STRING TWO

OUTPUT:

E

INPUT:

A&&C
A&$C

VALID OUTPUTS:

&&
&C

INPUT:

(Two randomly-generated 80-letter strings)

QIJYXPYWIWESWBRFWUHEERVQFJROYIXNKPKVDDFFZBUNBRZVUEYKLURBJCZJYMINCZNQEYKRADRYSWMH
HAXUDFLYFSLABUCXUWNHPSGQUXMQUIQYRWVIXGNKJGYUTWMLLPRIZDRLFXWKXOBOOEFESKNCUIFHNLFE

ALL VALID OUTPUTS:

AD
BJ
BR
CZ
DD
EE
ER
EY
EY
FF
FJ
FW
FZ
HE
IJ
IN
IW
JC
JR
JY
KL
KP
KR
KV
LU
MH
MI
NB
NQ
OY
PK
PY
QE
QF
QI
RA
RB
RF
RO
RV
RY
RZ
SW
UE
UH
UN
UR
VD
VQ
VU
WB
WE
WI
WU
XN
XP
YI
YK
YK
YM
YS
YW
YX
ZB
ZJ
ZN
ZV

SuperJedi224

Posted 2016-04-29T15:07:58.027

Reputation: 11 342

1shortest or longest? – Leaky Nun – 2016-04-29T15:12:49.077

@FryAmTheEggman Then should I still post my solution... – Leaky Nun – 2016-04-29T15:15:04.327

"One string on each line" with or without quotes? – Leaky Nun – 2016-04-29T15:21:50.610

1Can we take an array of strings? – Dennis – 2016-04-29T17:16:57.963

is "B" a substring of "aBc" ? – downrep_nation – 2016-04-29T17:58:21.793

Can we return something else than a string (0, None, undefined, etc.) if there is no unique substring? – Dennis – 2016-04-29T18:50:33.050

@downrep_nation ...yes. – SuperJedi224 – 2016-04-29T19:12:53.480

so any non empty character string2 doesnt contain is fine... i see

please add test cases tho – downrep_nation – 2016-04-29T19:14:36.007

-1 for lack of test cases. – cat – 2016-04-30T15:32:51.817

Answers

4

Brachylog, 23 bytes

:1foh.,{,.[A:B]hs?'~sB}

Works on the old Java transpiler. Expects the two strings in a list as input, unifies the output with the substring. If no substring is found, returns false.

Unfortunately I have not yet coded the subset built-in in the new Prolog transpiler.

Explanation

:1f               Find all bindings which satisfy predicate 1 with that binding as input and
                  with the Input of the main predicate as output.
   oh.,           Order that list of bindings, and unify the output with the first one.

{
 ,.[A:B]          Unify the output with the list [A,B]
        hs?       Unify the input with a subset of A
           '~sB   Check that no subset of B can be unified with the input
               }

Fatalize

Posted 2016-04-29T15:07:58.027

Reputation: 32 976

4

Python, 119 115 91

lambda a,b:[a[m:m+n]for n in range(1,len(a)+1)for m in range(len(a))if a[m:m+n]not in b][0]

Test cases:

| Input 1  | Input 2     | Output        |
|----------+-------------+---------------|
| 'abcd'   | 'abc'       |  'd'          |
| 'abcd'   | 'dabc'      |  'cd'         |
| 'abcd'   | 'dcbabbccd' |  'abc'        |
| 'abcdf'  | 'abcdebcdf' |  'abcdf'      |
| 'abc'    | 'abc'       |  (IndexError) |

Working on making it shorter, but this is my brain instinct. Not really much of a golfer yet.

Thanks to @user81655 and @NonlinearFruit for the extra bytes.

Edit:

Dang. Tried this code:

def z(a,b):
 for s in [a[m:m+n]for n in range(1,len(a)+1)for m in range(len(a)-n+1)]:
  if s not in b:return s
 return''

Thought it was a few bytes shorter. Turns out it was 1 byte longer than what I had before the edit.

Taylor Lopez

Posted 2016-04-29T15:07:58.027

Reputation: 411

I don't know much python, but maybe you can do (r=range)(1,len(a)+1) then use r? – Conor O'Brien – 2016-04-29T16:03:02.660

@CᴏɴᴏʀO'Bʀɪᴇɴ Can't do it quite that way. If I assign range to r in the line above, it actually adds a byte. Good idea, though. There's probably a shorter way to iterate through the substrings. – Taylor Lopez – 2016-04-29T16:07:35.047

range(1,len(a)) and range(len(a)-1) should work shouldn't it? Also I think using a tab character for the two space indent would save a byte. – user81655 – 2016-04-29T16:17:58.127

No, with range(1,len(a)), the 4th test cast fails because it won't try the full string; it'll only go to the length of the string - 1. And with range(len(a)-1), the 1st test case fails returning 'cd' instead of just 'd'. There may be something there, though. – Taylor Lopez – 2016-04-29T16:22:27.033

Sorry, I'm not familiar with Python and I assumed the ranges were inclusive. In that case try range(1,len(a)+1) and range(len(a)). – user81655 – 2016-04-29T16:34:36.923

That works. I hate double-checking substrings I've already checked for the sake of a few bytes, but the challenge is for shortest, so... Thanks! – Taylor Lopez – 2016-04-29T16:38:38.010

You can save a couple bytes (maybe make it a one-liner) if you compact your for loops and if statement. For example, [ n+m for m in range(...) for n in range(...) if n%2==0] – NonlinearFruit – 2016-04-29T17:38:46.650

@NonlinearFruit Maybe I'm not understanding your suggestion. In the code in the Edit section of my post, I tried doing that, but it ended up being a byte longer. Is your example the suggested change? Because I'm not understanding the even-check at the end. Or adding n and m together. – Taylor Lopez – 2016-04-29T17:59:51.487

I was just making stuff up for a short example, for your case it would be [a[m:m+n] for m in range(...) for n in range(...) if a[m:m+n]not in b]. This would give an array of possible answers, so return the 0th one unless it is empty, then return '' – NonlinearFruit – 2016-04-29T18:15:07.093

Aha! Much better. Man, code golfing takes a certain kind of problem solving for sure. It's like a completely different thought process. – Taylor Lopez – 2016-04-29T18:20:26.300

You can save 2 more bytes by doing a[m:m+n+1]for n in range(len(a))... – Blue – 2016-04-29T20:44:42.960

Then a couple more by doing lambda a,b,r=range,l=len:..., then call r() and l() instead of range() and len(). This might actually beat Dennis's answer – Blue – 2016-04-29T20:48:43.267

@Blue regarding the first suggestion: but then I would have to increment the second n later on too, which would result in no improvement. Regarding the second suggestion: the range aliasing results in no improvement, and the len aliasing actually adds two bytes. If I called the functions more often, then it would have helped. That's probably why @Dennis used the aliasing for enumerate since it's a longer word lol. It actually had a positive effect. – Taylor Lopez – 2016-04-29T20:53:51.413

@iAmMortos the first one works because it is equivalent to what you are doing now. As for the other 2, yeah, I tested and realized. Sorry – Blue – 2016-04-29T21:22:14.750

@Blue nah, it happens. It always surprises me how those bytes add up, especially when I find some way to optimize something, and it turns out to be longer. lol – Taylor Lopez – 2016-04-29T21:30:47.687

3

Python, 87 86 bytes

lambda s,t,e=enumerate:[s[i:i-~j]for j,_ in e(s)for i,_ in e(s)if(s[i:i-~j]in t)<1][0]

If it exists, this will return the leftmost of all shortest unique substrings.

If there is no unique substring, an IndexError is raised.

Test it on Ideone.

Dennis

Posted 2016-04-29T15:07:58.027

Reputation: 196 637

There it is. I was waiting for somebody to kill my non-lambda implementation. nice lol – Taylor Lopez – 2016-04-29T18:03:03.333

I think you can make this shorter by supplying the optional second argument to enumerate to start j at i+1. – user2357112 supports Monica – 2016-04-29T23:25:42.827

@user2357112 That throws a NameError, unfortunately. The code defines j first, then i. – Dennis – 2016-04-29T23:39:41.557

@Dennis: Yeah, but it doesn't need to. You could switch the loop order. – user2357112 supports Monica – 2016-04-29T23:41:04.810

1@user2357112 If I switch the loop order, the first unique substring it finds may not be the shortest. Simply swapping the order returns 'ab' for input 'abc','aaa'. – Dennis – 2016-04-29T23:44:35.083

Ah, you're right. It didn't register to me that the list comprehension includes substrings of non-minimal length. – user2357112 supports Monica – 2016-04-29T23:47:01.353

2

JavaScript (ES6), 79 bytes

f=
(a,b)=>[...a].some((_,i,c)=>c.some((_,j)=>b.indexOf(s=a.substr(j,i+1))<0))?s:''
<div oninput=o.textContent=f(a.value,b.value)><input id="a"/><input id="b"/><pre id=o>

If returning false is acceptable, save 2 bytes by using &&s instead of ?s:''.

Neil

Posted 2016-04-29T15:07:58.027

Reputation: 95 035

2

Python, 82 bytes

g=lambda u:{u}|g(u[1:])|g(u[:-1])if u else{''}
f=lambda s,t:min(g(s)-g(t),key=len)

Usage: f('A&&C', 'A&$C') -> returns '&&'

Raises ValueError if there is no suitable substring.

Explanation:

g=lambda u:{u}|g(u[1:])|g(u[:-1])if u else{''} recursively creates a set of the substrings of u f=lambda s,t:min(g(s)-g(t),key=len) takes the shortest substring from the set difference

RootTwo

Posted 2016-04-29T15:07:58.027

Reputation: 21

1

Pyth, 11 bytes

Jwhf!}TJ.:z

Try it online!

Leaky Nun

Posted 2016-04-29T15:07:58.027

Reputation: 45 011

1

JavaScript (Firefox), 80 bytes

solution=

a=>b=>[for(_ of(i=0,a))for(_ of(j=!++i,a))if(b.includes(s=a.substr(j++,i)))s][0]

document.write("<pre>"+
[ [ "test", "best" ], [ "wes", "west" ], [ "red", "dress" ] ]
.map(c=>c+": "+solution(c[0])(c[1])).join`\n`)

Test works only in Firefox. Returns undefined if there is no substring.

user81655

Posted 2016-04-29T15:07:58.027

Reputation: 10 181

Strings could contain printable ASCII characters such as \ or other RegExp metacharacters, but if you're limiting yourself to Firefox, why not use b.includes instead? – Neil – 2016-04-29T18:31:17.127

@Neil The question didn't say that the strings could be any character before but thanks for letting me know! Updated to use includes. – user81655 – 2016-04-29T18:34:04.443

1The test snippet throws a SyntaxError: unexpected token 'for' – NoOneIsHere – 2016-04-29T19:35:52.630

@NoOneIsHere That's the error you'll get if you're not using Firefox... – user81655 – 2016-04-29T20:30:03.093

1

Retina, 37 bytes

M!&`\G(.+?)(?!.*¶.*\1)
O$#`.+
$.&
G1`

Output is empty if no valid substring is found in A.

Try it online! (Slightly modified to run several test cases at once. The input format is actually linefeed separated, but test suites are easiest to write with one test case per line. The test framework turns the space into a linefeed before the actual code starts.)

Explanation

M!&`\G(.+?)(?!.*¶.*\1)

For each possible starting position in A, match the shortest substring which does not appear in B. The & is for overlapping matches, such that we actually try every starting position, even if a match is longer than one character. The \G ensures that we don't skip any positions - in particular, this way we have to stop at the linefeed, such that we don't get additional matches from B itself. The reason this doesn't mess things up is actually quite subtle: because if there's a starting position in A where we can't find any valid substring, then that's also a failure which will cause \G to stop checking any further positions. However, if (from the current starting position) all substrings appear in B, so will all substrings that start further right of the current position, so discarding those is not an issue (and actually improves performance).

Due to the M! configuration, all of these matches will be returned from the stage, joined with linefeeds.

O$#`.+
$.&

This sorts the lines of the previous result by length. This is done by matching the line with .+. Then $ activates a form of "sort-by", such that the match is substituted with $.& for determining sort order. The $.& itself replaces the match with its length. Finally, the # option tells Retina to sort numerically (otherwise, it would treat the resulting numbers as strings and sort them lexicographically).

G1`

Finally, we simply keep only first line, by using a grep stage with an empty regex (which always matches) and a limit of 1.

Martin Ender

Posted 2016-04-29T15:07:58.027

Reputation: 184 808

1

Perl, 87 85

sub{(grep{$_[1]!~/\Q$_/}map{$}=$_;map{substr($_[0],$_,$})}@}}(@}=0..length$_[0]))[0]}

This is an anonymous function which returns the first (by position) of the shortest substrings of $_[0] that do not occur in $_[1], or undef if no such substring exists.

Test program with strings taken from @iAmMortos's answer, tested with Perl 5.22.1:

#!/usr/bin/perl -l
use strict;
use warnings;

my $f = <see above>;
print $f->('abcd', 'abc');
print $f->('abcd', 'dabc');
print $f->('abcd', 'dcbabbccd');
print $f->('abcdf', 'abcdebcdf');
print $f->('abc', 'abc');

hvd

Posted 2016-04-29T15:07:58.027

Reputation: 3 664

1

Haskell, 72 bytes

import Data.Lists
a#b=argmin length[x|x<-powerslice a,not$isInfixOf x b]

Usage example: "abcd" # "dabc"-> "cd".

A straightforward implementation: build all substrings of a and keep those which do not appear in b. argmin returns an element of a list which minimizes the function given a the 2nd argument, here: length.

nimi

Posted 2016-04-29T15:07:58.027

Reputation: 34 639

I didn't know about argmin! It seems extremely useful. – Zgarb – 2016-04-30T16:58:16.113

0

Burlesque - 26 bytes

Right now the shortest way I can come up with is:

lnp^sujbcjz[{^p~[n!}f[-][~

mroman

Posted 2016-04-29T15:07:58.027

Reputation: 1 382

0

Japt, 14 bytes

Êõ!ãU c k!èV g

Try it online!

Returns undefined if there is no valid substring. This is distinct from returning the string "undefined", though the difference is only visible because of the -Q flag.

Explanation:

Ê                 :Length of the first input
 õ                :For each number in the range [1...length]:
  !ãU             : Get the substrings of the first input with that length
      c           :Flatten to a single array with shorter substrings first
        k         :Remove ones which return non-zero to:
         !èV      : Number of times that substring appears in second input
             g    :Return the shortest remaining substring

Kamil Drakari

Posted 2016-04-29T15:07:58.027

Reputation: 3 461

0

Japt -h, 11 bytes

à f@øX «VøX

Try it

                :Implicit input of strings U & V
à               :All combinations of U
  f@            :Filter each as X
    øX          :  Does U contain X?
       «        :  Logical AND with the negation of
        VøX     :  Does V contain X?
                :Implicit output of last element

Shaggy

Posted 2016-04-29T15:07:58.027

Reputation: 24 623

0

Pyth - 9 6 bytes

h-Fm.:

Try it online here.

Maltysen

Posted 2016-04-29T15:07:58.027

Reputation: 25 023

Crossed out 9 is still 9 – cat – 2016-04-30T15:33:58.223

I'd love to know how this works. – mroman – 2018-11-25T14:02:59.173

@mroman the .: with one arg is all substrs. So I map that over both strings, then fold setwise diff, so I have all substrs of first that arnt of the second, then I pick the first one, which is smallest cuz .: is sorted. – Maltysen – 2018-11-25T19:03:28.537

0

C#, 152 bytes

string f(string a,string b){int x=a.Length;for(int i=1;i<=x;i++)for(int j=0;j<=x-i;j++){var y=a.Substring(j,i);if(!b.Contains(y))return y;}return null;}

downrep_nation

Posted 2016-04-29T15:07:58.027

Reputation: 1 152

0

Ruby, 70 bytes

Collects all substrings of a certain length from the first string, and if there is one that isn't in the second string, return it.

->a,b{r=p;(1..l=a.size).map{|i|(0...l).map{|j|b[s=a[j,i]]?0:r||=s}};r}

Value Ink

Posted 2016-04-29T15:07:58.027

Reputation: 10 608