Regex Golf: Regions of Italy vs. States of USA

We've already got a meta-regex-golf problem, inspired by the xkcd comic:

copyright 2013 Randall Munroe

But, this regex golf looks fun, too! I want to distinguish between the states of the US and the regions of Italy. Why? I'm a citizen of both countries, and I always have trouble with this*.

The regions of Italy are

Abruzzo, Valle d'Aosta, Puglia, Basilicata, Calabria, Campania, Emilia-Romagna, Friuli-Venezia Giulia, Lazio, Liguria, Lombardia, Marche, Molise, Piemonte, Sardegna, Sicilia, Trentino-Alto Adige/Südtirol, Toscana, Umbria, Veneto

and the states of the USA are

Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming

Your job is to write a program which distinguishes these lists with a regular expression. This is a new game, so here's the

Rules

  • Distinguishing between lists must be done with a single matching regular expression.
  • Your score is the length of that regular expression, smaller is better.

To be clear: all work must be done by the regular expression -- no filtering, no replacements, no nothing... even if those are also done with regular expressions. That is, the input should be passed directly into a regular expression, and only the binary answer (match / no match) can be used by later parts of the code. The input should never be inspected or changed by anything but the matching expression. Exception: eating a newline with something akin to Ruby's chomp is fine.

Your program should take a single entry (optionally followed by \n or EOF if it makes things easier) from either list from stdin, and print to stdout the name of that list. In this case, our lists are named Italy and USA.

To test your code, simply run both lists through it. Behavior may be undefined for strings which do not occur in the list.
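One way to run both lists through a candidate regex is sketched below in Python. The helper names `classify` and `misclassified` are illustrative assumptions, not part of the challenge, and the naive pattern (a literal alternation of every Italian region name) is only a stand-in to show the harness working, not a competitive entry.

```python
import re

ITALY = [
    "Abruzzo", "Valle d'Aosta", "Puglia", "Basilicata", "Calabria",
    "Campania", "Emilia-Romagna", "Friuli-Venezia Giulia", "Lazio",
    "Liguria", "Lombardia", "Marche", "Molise", "Piemonte", "Sardegna",
    "Sicilia", "Trentino-Alto Adige/Südtirol", "Toscana", "Umbria", "Veneto",
]

USA = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]


def classify(pattern, entry):
    """Return "Italy" if the regex matches the entry, else "USA"."""
    # Only the binary match / no-match result is used, as the rules require.
    return "Italy" if re.search(pattern, entry) else "USA"


def misclassified(pattern):
    """List every entry from either list that the pattern labels wrongly."""
    wrong = [name for name in ITALY if classify(pattern, name) != "Italy"]
    wrong += [name for name in USA if classify(pattern, name) != "USA"]
    return wrong


# Stand-in pattern (an assumption for illustration): every region name, escaped.
naive = "|".join(re.escape(name) for name in ITALY)

print(misclassified(naive))  # an empty list means the pattern separates the lists
print(len(naive))            # the score is the length of the regex itself
```

Swapping `naive` for a golfed pattern, and flipping the two labels in `classify` for a pattern that matches the states instead, gives a quick check before posting a score.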

Scoring Issues

This might have to be done on a language-by-language basis. In Perl,

m/foobarbaz/

is a matching regular expression. However, in Python,

import re
re.compile('foobarbaz')

does the same thing. We wouldn't count the quotes for Python, so I say we don't count the m/ and final / in Perl. In both languages, the above should receive a score of 9.

To clarify a point raised by Abhijit, the actual length of the matching expression is the score, even if you generate it dynamically. For example, if you found a magical expression m,

n="foo(bar|baz)"
m=n+n

then you should not report a score of 12: m has length 24. And just to be extra clear, the generated regular expression can't depend on the input. That would be reading the input before passing it into the regular expression.
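As a small sketch of that scoring rule in Python (the `score` helper is a hypothetical name used only for illustration), the length that counts is that of the string actually handed to the regex engine:

```python
import re


def score(pattern):
    """Score is the character length of the final pattern string."""
    return len(pattern)


n = "foo(bar|baz)"
m = n + n  # the expression that is actually compiled and matched

print(score(n))  # 12: length of the building block only
print(score(m))  # 24: length of the generated expression, which is the score
re.compile(m)    # m, not n, is what does the matching
```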

Example Session

input> Calabria
Italy
input> New Hampshire
USA
input> Washington
USA
input> Puglia
Italy

* Actually, that's a lie. I have never had any trouble with this at all.

boothby

Posted 2014-01-07T04:03:51.550

Reputation: 9 038

Can you please explain, what you mean by "no filtering, no replacements, no nothing... even if those are also done with regular expressions.". Just to clarify, does it mean filtering, replacements of the list of states/regions or the focus is wider? – Abhijit – 2014-01-07T04:29:59.003

@Abhijit edited. Is that clearer? – boothby – 2014-01-07T04:34:32.780

Hopefully, let me post an answer and see if you feel it violates the rule in any way :-) – Abhijit – 2014-01-07T04:35:17.400

How are flags counted? Do we get case-insensitive for free? – John Dvorak – 2014-01-07T05:58:06.290

@JanDvorak Good thinking! No: flags cost extra, just the m// is free. – boothby – 2014-01-07T06:00:21.333

@Boothby, aren't you forgetting District of Columbia? – WallyWest – 2014-01-07T10:07:52.380

@Eliseod'Annunzio: DC is not a state – Kyle Kanos – 2014-01-07T14:21:51.610

"Behavior may be undefined for strings which do not occur in the list." this rule is broken: it allows one to return USA in case of such a string, hence you would just have to check Italian regions, and return USA otherwise. – o0'. – 2014-01-14T18:59:28.487

@Lohoris "This rule is broken" is an opinion. Codegolf tends to encourage cutting corners like this. – boothby – 2014-01-14T19:23:52.120

@boothby well, no, it's simple logic: it is basically asking only for a regexp to match Italian regions, but needlessly worded in a much more complicated way. The whole point about American states is totally irrelevant to the actual question asked, thanks to this bug. This also makes the question much less interesting. – o0'. – 2014-01-14T20:52:04.277

@Lohoris A cursory perusal of the answers below indicates that your "simple logic" is still just an opinion, and it may be better to match states of the US instead. – boothby – 2014-01-14T22:12:06.940

@Eliseod'Annunzio: If we include DC, we should also include Guam, CNMI, American Samoa, the US Virgin Islands, and nine uninhabited atolls. – Mechanical snail – 2014-01-15T19:37:29.660

Fair enough... just checking... – WallyWest – 2014-01-15T23:59:06.040

@Eliseod'Annunzio The internet is made for dogpiling. – boothby – 2014-01-16T00:11:36.063

@boothby Please tell me there is a definition for "dogpiling" in this context...? Very 404 here... – WallyWest – 2014-01-16T02:51:45.620

@Eliseod'Annunzio https://www.wordnik.com/words/dogpile – boothby – 2014-01-16T03:32:27.137

Answers

10

Perl - 36 bytes (for regex), down from 51

print<>=~/.A|ise|net|te|z.o|[cp]a|[lr]ia|r[cd]/?"Italy
":"USA
"

This started as a 51-byte regex, posted because it differed from the other 51-byte solution. It is now 15 bytes shorter, which I think wins for now.

Konrad Borowski

Posted 2014-01-07T04:03:51.550

Reputation: 11 185

7

Perl, 40 chars

Approaching this from the other direction, i.e. matching the U.S. states:

[DNIOWy]|ss|M.n|^A.*a|or|[aguh]i|[sth]\b

The only Perl/PCRE-specific feature in the regexp is the \b word boundary anchor, which I used instead of the $ end-of-string anchor to let it match "South Carolina".

Here's the regexp in a Perl one-liner for testing:

perl -nE 'say /[DNIOWy]|ss|M.n|^A.*a|or|[aguh]i|[sth]\b/ ? "USA" : "Italy"'
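The difference between the two anchors can be sketched in Python, used here only as a stand-in for Perl on the assumption that `\b` and `$` behave the same way for these ASCII names. The variable names are illustrative.

```python
import re

with_boundary = r"[sth]\b"  # s, t, or h at the end of any word
with_dollar = r"[sth]$"     # s, t, or h only at the end of the string

for name in ["South Carolina", "Texas", "Marche"]:
    hit_boundary = bool(re.search(with_boundary, name))
    hit_dollar = bool(re.search(with_dollar, name))
    print(name, hit_boundary, hit_dollar)
```

"South Carolina" is caught only by the `\b` form, because the "h" of "South" ends a word but not the string. "Texas" is caught by both, and "Marche" by neither, since its "h" is followed by another letter.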

Ilmari Karonen

Posted 2014-01-07T04:03:51.550

Reputation: 19 513

This is a more golfy test harness:

perl -pe '$_=/re/?"USA\n":"Italy\n"' – Pseudonym – 2014-01-07T23:09:11.150

@Pseudonym: meh. As long as it doesn't count in the score, might as well keep it readable. – Ilmari Karonen – 2014-01-07T23:29:46.797

5

Ruby (plain regex), 44

$_ = gets.chomp
puts /'|-|(([^gn]i|gn|at)a|[hst]e|to|zo)$|To|La|pa/ ? "Italy" : "USA"

You know what? Case sensitivity is the best start-of-word anchor.

I'm not sure, but I think I owe the pa to Hax0r778's answer.
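A rough illustration of the case-sensitivity point, sketched in Python rather than Ruby (the `cases` list is made up for this example): a capitalised fragment such as `To` or `La` can only match where a name or word begins, so it behaves like a free start-of-word anchor, and ignoring case throws that away.

```python
import re

# (fragment, name) pairs chosen for illustration
cases = [
    ("To", "Toscana"),
    ("To", "Washington"),
    ("La", "Lazio"),
    ("La", "Maryland"),
]

for fragment, name in cases:
    sensitive = bool(re.search(fragment, name))
    insensitive = bool(re.search(fragment, name, re.IGNORECASE))
    print(fragment, name, sensitive, insensitive)
```

The case-sensitive search hits only the Italian names, while the case-insensitive one also hits "Washington" and "Maryland".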

John Dvorak

Posted 2014-01-07T04:03:51.550

Reputation: 9 048

3

Perl - 51

(<STDIN> =~ m/'|-|ru|pu|at|pa|az|gu|mb|rc|ie|rd|ci|os|abr|mol|ven/i)?printf("Italy\n"):printf("USA\n");

Hax0r778

Posted 2014-01-07T04:03:51.550

Reputation: 39

3

JavaScript 42

alert(/at|gn|pa|ca|-|'|((zi?|t)o|[hts]e|[lrd]ia)$/.test(prompt())?"Italy":"USA")

I was initially going to work this out from the USA side, since eliminating K, W, X, and Y strips away a lot of the states... but Italy had it beaten by a good 17 characters.

With fat arrow notation we can reduce this to a simple function that returns the list name.

r=s=>/at|gn|pa|ca|-|'|((zi?|t)o|[hts]e|[lrd]ia)$/.test(s)?"Italy":"USA"

> r("South Dakota") // USA
> r("Puglia") // Italy

WallyWest

Posted 2014-01-07T04:03:51.550

Reputation: 6 949