Remove the Salutations

11

Challenge

Hi, given a string as input, remove any salutations found at the start of the string.

The program which performs the most correct substitutions in under 50 bytes wins.

Salutations

Hey, a salutation is defined as one of the following words:

  • hi
  • hey
  • hello
  • dear
  • greetings
  • hai
  • guys
  • hii
  • howdy
  • hiya
  • hay
  • heya
  • hola
  • hihi
  • salutations

The first letter may be capitalised.

There will always be a comma and/or a single space following the salutation which must also be removed. The comma and the space may be in any order (,<space> or <space>,) and both should be removed.

The greeting and the following word will only ever be separated by a comma and/or single space.

You must then capitalise the first letter of the word which would have followed the salutation. Even if no replacement has taken place, you should still capitalise the first word of the output.

Capitalisation only applies to lowercase alphabetical characters (abcdefghijklmnopqrstuvwxyz). You should leave any other character as it was.

The salutation will always be at the start of the string. You should not replace a salutation which is not at the start.

There may not always be a salutation.

Your code must be under 50 bytes.

Examples

Input > Output

Salutations, what's going on? > What's going on?
hello i have quetions how does juice an avocado > I have quetions how does juice an avocado
How d'you do > How d'you do
Hey,You! > You!
hola cows eat hay > Cows eat hay
hey Hi there! > Hi there!
hihi ,guys > Guys
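The rules and examples above can be sketched as an ungolfed reference in Python. This is only an illustration of the specification, not a competing entry (it is far over 50 bytes); the names `SALUTATIONS` and `remove_salutation` are made up for this sketch.

```python
import re

# The 15 salutations listed in the challenge.
SALUTATIONS = [
    "hi", "hey", "hello", "dear", "greetings", "hai", "guys", "hii",
    "howdy", "hiya", "hay", "heya", "hola", "hihi", "salutations",
]

# Only the first letter may be capitalised, so each word becomes e.g. "[hH]ello".
_WORDS = "|".join(f"[{w[0]}{w[0].upper()}]{w[1:]}" for w in SALUTATIONS)

# A salutation at the start, followed by ",", ", ", " " or " ,".
_SALUTATION = re.compile(rf"^(?:{_WORDS})(?:, ?| ,?)")


def remove_salutation(text):
    """Strip one leading salutation, then capitalise the first character."""
    text = _SALUTATION.sub("", text, count=1)
    # Capitalisation applies only to a-z; anything else is left untouched.
    if text and "a" <= text[0] <= "z":
        text = text[0].upper() + text[1:]
    return text
```

Because a separator is required after the word, a short salutation such as hi cannot consume the front of hihi or of an ordinary word.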

Test battery

Hola, there are 1000 different inputs in total:

A Bash command to retrieve both files (the inputs, inputs.txt, and the expected outputs, replaced.txt) is

wget https://raw.githubusercontent.com/beta-decay/Remove-Substitutions-Battery/master/{inputs,replaced}.txt

Winning

Howdy, the program with the most correct substitutions from the 1000 inputs above wins.

You must put the percentage of the inputs your program handles correctly in your header like so:

# Language Name, percentage%
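As a rough sketch of how such a percentage could be computed, the following Python compares a candidate transformation against the expected outputs line by line. The function names `score` and `header` are hypothetical, and it assumes the two files have already been read into equal-length lists of lines.

```python
def score(transform, inputs, expected):
    """Percentage of inputs for which transform(input) equals the expected line."""
    if not inputs or len(inputs) != len(expected):
        raise ValueError("inputs and expected must be non-empty and the same length")
    correct = sum(transform(i) == e for i, e in zip(inputs, expected))
    return 100.0 * correct / len(inputs)


def header(language, percentage):
    """Format the answer header in the style requested by the challenge."""
    return f"# {language}, {percentage:g}%"
```

Passing the identity function as `transform` gives the baseline score of a program that returns its input unchanged.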

I'm not completely sure why Jeff made this a thing, but it makes a nice challenge nevertheless.

Beta Decay

Posted 2017-05-30T11:06:50.340

Reputation: 21 478

s=>System.Text.RegularExpressions.Regex.Replace(); 50 bytes before even a pattern is specified, that's C# out then. (With a regex approach of course) – TheLethalCoder – 2017-05-30T12:01:18.500

Python is also out (with regex) :( – Gábor Fekete – 2017-05-30T12:50:15.643

You can always return the given input for a score of 31.3%. – Ian Miller – 2017-05-30T15:57:18.443

Urge to edit out the salutation at the beginning of the challenge rising. ;) – Draco18s no longer trusts SE – 2017-05-30T17:02:38.547

Fun anecdote: I originally started my first post on PPCG with "Hello, world! :)" but noticed as soon as I posted it that SE removed the entirety of that line except the ":)". I was of course mortified that I had done something wrong and immediately removed the smiley as well. Not a trace was left in the revision history, and to this day you and I are the only ones who know about it... – ETHproductions – 2017-05-30T17:03:12.907

@ETHproductions That was your first post? Wow, I congratulate you :D – Beta Decay – 2017-05-30T17:04:57.217

Answers

8

GNU sed, 78% 100%

/^\w*[wd]\b/!s/^[dghs][eruaio]\w*\W\+//i
s/./\U&/

(49 bytes)

The test battery is quite limited: we can count which words appear first on each line:

$ sed -e 's/[ ,].*//' inputs.txt | sort | uniq -ic
 40 aight
 33 alright
 33 dear
 33 g'd
 41 good
 36 greetings
 35 guys
 31 hai
 33 hay
 27 hello
 33 hey
 37 heya
 43 hi
 34 hihi
 29 hii
 35 hiya
 45 hola
 79 how
 37 howdy
 33 kowabunga
 39 salutations
 32 speak
 34 sweet
 40 talk
 36 wassup
 34 what's
 38 yo
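For readers without a shell to hand, the same first-word tally can be approximated in Python. This is a sketch only: `first_word_counts` is a made-up name, and lowercasing the word stands in for the case-insensitive grouping of `uniq -i`.

```python
import re
from collections import Counter


def first_word_counts(lines):
    """Count the first word of each line, cutting at the first space or comma."""
    counts = Counter()
    for line in lines:
        first = re.split(r"[ ,]", line, maxsplit=1)[0]
        counts[first.lower()] += 1
    return counts
```

Calling it on the lines of inputs.txt and sorting the result by key should give a listing comparable to the one above.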

The salutations to be removed begin with d, g, h or s (or uppercase versions thereof); the non-salutations beginning with those letters are

 33 g'd
 41 good
 79 how
 32 speak
 34 sweet

Ignoring lines where they appear alone, that's 220 false-positives. So let's just remove initial words beginning with any of those four letters.

When we see an initial word beginning with any of those (/^[dghs]\w*), case-insensitively (/i), and followed by at least one non-word character (\W\+), then replace with an empty string. Then, replace the first character with its uppercase equivalent (s/./\U&/).

That gives us

s/^[dghs]\w*\W\+//i
s/./\U&/

We can now refine this a bit:

  • The largest set of false-positives is how, so we make the substitution conditional by prefixing with a negative test:

     /^[Hh]ow\b/!
    
  • We can also filter on the second letter, to eliminate g'd, speak and sweet:

    s/^[dghs][eruaio]\w*\W\+//i
    
  • That leaves only good as a false positive. We can adjust the prefix test to eliminate words ending in either w or d:

    /^\w*[wd]\b/!
    

Demonstration

$ diff -u <(./123478.sed inputs.txt) replaced.txt | grep ^- | wc -l
0
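To try the final script's logic on individual lines without GNU sed, it can be mirrored in Python. This is an approximate port, not the answer itself: `sed_port` is a hypothetical name, `re.ASCII` is an assumption standing in for sed's locale-dependent `\w`, and `str.upper` replaces `\U&`.

```python
import re


def sed_port(line):
    """Approximate Python port of the two-line sed script."""
    # Guard (case-sensitive, like the sed address): skip lines whose first
    # word ends in "w" or "d", such as "how" and "good".
    if not re.search(r"^\w*[wd]\b", line, flags=re.ASCII):
        # Remove a first word starting with d/g/h/s whose second letter is
        # one of e, r, u, a, i, o, together with the separators after it.
        line = re.sub(r"^[dghs][eruaio]\w*\W+", "", line, count=1,
                      flags=re.IGNORECASE | re.ASCII)
    # Uppercase the first character, as s/./\U&/ does.
    return line[:1].upper() + line[1:]
```

The guard keeps how and good, the second-letter class keeps g'd, speak and sweet, and words such as howdy still pass through to the substitution because they do not end in w or d.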

Toby Speight

Posted 2017-05-30T11:06:50.340

Reputation: 5 058

9

Retina, 72.8% (old test battery, was 68%), 77.5% (new test battery, was 74.8%)

i`^h(a[iy]|eya?|i(h?i|ya|)|ello)[ ,]+

T`l`L`^.

Try it online!

Edit: Gained 4.8% (old) and 2.7% (new) coverage with help from @MartinEnder's tips.
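A rough Python equivalent of the two Retina stages may help in seeing which salutations this pattern covers. It is a sketch under the assumption that the first stage is a case-insensitive replacement with the empty string and the second uppercases a leading a-z; `retina_port` is a made-up name.

```python
import re

# Only salutations beginning with "h" are covered, and "hola" and "howdy" are not.
_PATTERN = re.compile(r"^h(a[iy]|eya?|i(h?i|ya|)|ello)[ ,]+", re.IGNORECASE)


def retina_port(line):
    """Remove a matching h-salutation, then uppercase a leading a-z letter."""
    line = _PATTERN.sub("", line, count=1)
    if line and "a" <= line[0] <= "z":
        line = line[0].upper() + line[1:]
    return line
```

Inputs that start with dear, greetings, guys, hola, howdy or salutations are left in place apart from the capitalisation, which accounts for part of the gap to 100%.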

Neil

Posted 2017-05-30T11:06:50.340

Reputation: 95 035

I think you can do [ ,]+ to squeeze out a few more bytes. You can also extract the h from the alternation. – Martin Ender – 2017-05-30T11:36:48.593

not sure but i`^h(a[iy]|eya?|i(h?i?|ya))[ ,]+ might work, meaning you have 8 bytes to spare – ASCII-only – 2017-05-30T11:40:02.977

@ASCII-only h?i? saves nothing over h?i| and it would match hih (although I don't know whether that's even in the test cases). – Martin Ender – 2017-05-30T11:41:09.200

Actually, it does save a byte if you do ih?i?|iya. – Martin Ender – 2017-05-30T11:42:03.430

Maybe i`^h(a[iy]|eya?|ih?i|iya|ola|ello)[ ,]+ then – ASCII-only – 2017-05-30T11:44:19.043

Would i`^([dghs]\w+)[ ,]+ be cheating? – manatwork – 2017-05-30T11:46:37.540

@manatwork How many false positives does that yield? – Martin Ender – 2017-05-30T11:47:55.867

@MartinEnder, didn't count, but the result matches replaced.txt by MD5. – manatwork – 2017-05-30T11:49:31.517

@manatwork That's a bad set of test cases then. You don't even need the parentheses. – Martin Ender – 2017-05-30T11:50:21.803

Never mind, but maybe i`^h(i|a[iy]|eya?|ih?i|iya|ello)[ ,]+ will be slightly better – ASCII-only – 2017-05-30T11:54:49.383

Hey, I've changed to a new test battery, can you update your answer with your new score? Thanks – Beta Decay – 2017-05-30T18:13:59.373

@BetaDecay Updated. – Neil – 2017-05-30T18:59:49.010

6

PHP, 60.6%

50 Bytes

<?=ucfirst(preg_replace("#^[dh]\w+.#i","",$argn));

Try it online!

PHP, 59.4%

49 Bytes

<?=ucfirst(preg_replace("#^h\w+,? #i","",$argn));

Try it online!

PHP, 58.4%

50 Bytes

<?=ucfirst(preg_replace("#^[gh]\w+.#i","",$argn));

Try it online!
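The behaviour of the three patterns can be explored with a small Python stand-in for `ucfirst(preg_replace(...))`. This is only an approximation for ASCII input: `php_port` and the `P1` to `P3` names are hypothetical, and `re.ASCII` is assumed to match PCRE's default `\w`.

```python
import re

P1 = r"^[dh]\w+."   # 60.6% variant
P2 = r"^h\w+,? "    # 59.4% variant
P3 = r"^[gh]\w+."   # 58.4% variant


def php_port(pattern, line):
    """Case-insensitive removal at the start, then uppercase the first character."""
    stripped = re.sub(pattern, "", line, count=1, flags=re.IGNORECASE | re.ASCII)
    return stripped[:1].upper() + stripped[1:]
```

The trailing `.` consumes exactly one separator character, so a salutation followed by both a comma and a space leaves one of them behind, and any ordinary first word starting with the chosen letters (such as How) is removed as well.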

Jörg Hülsermann

Posted 2017-05-30T11:06:50.340

Reputation: 13 026

60.1%: #^[gh]\w+.# – manatwork – 2017-05-30T12:56:47.700

Hey, I've changed to a new test battery, can you update your answer with your new score? Thanks – Beta Decay – 2017-05-30T18:13:36.260

@BetaDecay is updated – Jörg Hülsermann – 2017-05-30T18:29:56.533

4

Vim, 55.4% 44.4%

df,<<vgU

Explanation:

df,    Delete until and including the first comma
<<     Remove leading spaces
vgU    Uppercase first letter

BlackCap

Posted 2017-05-30T11:06:50.340

Reputation: 3 576

Hey, I've changed to a new test battery, can you update your answer with your new score? Thanks – Beta Decay – 2017-05-30T18:13:30.590