Algorithm for finding a regex for a list of "words"

-2

I just discovered this website and was amazed. Somehow this challenge popped into my mind: Provided a very long list of words; Write or describe an algorithm which finds a regex to match them all.

Ex.: A very long list of Dates, model-numbers, CD-Codes, ISBNs et cetera.

I hope it's not too big or does not belong here. Please refer to downvote in that case but write a comment and I will delete this.

Edit: Sorry at my clumsy attempt. I fear there is not a very sharp goal. Of course .* is not what I look for :) Here are some specifications:

  • It has to match all words
  • It has to group characters and find out what kind of characters are in that group, e.g. identify that an american date always consits of one to four numbers, then a slash etc.
  • It does not need to be able to identify transverses etc
  • There is a huge list of not-to-match-examples available as well

Winner is the one who can write a code which does what I described for one of my examples (or a similar one). In case that no code is committed, the winner is the one who can convince me, that the described algorithm could work.

3244611user

Posted 2014-11-21T20:15:24.830

Reputation: 101

Question was closed 2014-11-21T23:35:26.650

4something like .* ? – tigrou – 2014-11-21T20:22:12.867

2What's the winning criterion? Shortest code? Shortest regex? Something else? Does the regex need to match only the words? – marinus – 2014-11-21T20:22:54.423

2I'm afraid your spec still isn't very solid. I don't know what you mean by your third bullet point. How long is a "huge list" (which can btw easily satisfied for any finite definition of "huge", with the expression ^.{n}$, where n is the longest input string). And how should the grouping work? If my input is foo, bar, abc I can always form a pattern [fba][oab][orc], but that doesn't make a lot of sense. Or [abcfor]{3}. It's very unclear what a valid solution has to do. – Martin Ender – 2014-11-21T21:01:27.797

Your problem is "regex golf". Search this site for it, or google it. – Sparr – 2014-11-21T21:16:37.920

What about removing not-to-match list and asking for the shortest regex on some hidden set of tests? – Qwertiy – 2014-11-21T21:28:10.563

1You might be interested in this. – Martin Ender – 2014-11-21T22:35:50.690

Answers

2

In Java 8:

public class Reg{

    public static void main(String... args){
        System.out.println(args[0].replaceAll(",", "|"));
    }

}

Matches only the strings inputed. Expects input to be of form:

goodWord1,goodWord2,goodWord3...(space)badWord1,badWord2,badWord3...

Edit: Edited for new requirements. Now accepts a list of bad words also. It also now works in Java 7.

TheNumberOne

Posted 2014-11-21T20:15:24.830

Reputation: 10 855

Good quick answer. But not what I looked for. Sorry. I gave some more details. – 3244611user – 2014-11-21T20:43:43.387

@3244611user It matches only the words in the input sequence. It matches no other words. You don't even need a list of not-to-match examples. See next program for one that finds the shortest answer. – TheNumberOne – 2014-11-21T21:06:46.140