What is the most frequent word?

Given a sentence, your program must make its way through it, counting the frequency of each word, then output the most used word. Because a sentence has no fixed length and so can get very long, your code must be as short as possible.

Rules/Requirements

  • Each submission should be either a full program or a function. If it is a function, it must be runnable by adding only the function call to the bottom of the program. Anything else (e.g. headers in C) must be included.
  • There must be a free interpreter/compiler available for your language.
  • If it is possible, provide a link to a site where your program can be tested.
  • Your program must not write anything to STDERR.
  • Your program should take input from STDIN (or the closest alternative in your language).
  • Standard loopholes are forbidden.
  • Your program must be case-insensitive (tHe, The and the all contribute to the count of the).
  • If there is no most frequent word (see test case #3), your program should output nothing.

Definition of a 'word':

You get the list of words by splitting the input text on spaces. The input will never contain any other type of whitespace than plain spaces (in particular no newlines). However, the final words should only contain alphanumerics (a-z, A-Z, 0-9), hyphens (-) and apostrophes ('). You can achieve this by removing all other characters or by replacing them with spaces before doing the word splitting. To remain compatible with previous versions of the rules, apostrophes are not required to be included.
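As a rough, ungolfed illustration of this definition, the sketch below splits on spaces and strips every character that is not allowed in a word. The function name `split_words` and the `keep_apostrophes` switch are made up for this example; they are not part of the challenge.

```python
import re

def split_words(text, keep_apostrophes=True):
    """Split on spaces, then strip characters that may not appear in a word."""
    # Allowed: alphanumerics, optionally apostrophes, and hyphens (kept last
    # in the character class so it is read as a literal hyphen).
    allowed = "A-Za-z0-9" + ("'" if keep_apostrophes else "") + "-"
    disallowed = re.compile("[^" + allowed + "]")
    words = (disallowed.sub("", token) for token in text.split(" "))
    # Tokens that consisted only of punctuation become empty and are dropped.
    return [word for word in words if word]
```

Here the "remove all other characters" option is used; replacing them with spaces instead would split a token such as `a.b` into two words rather than one.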

Test Cases

The man walked down the road.
==> the

-----

Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy.
==> he

-----

This sentence has no most frequent word.
==> 

-----

"That's... that's... that is just terrible!" he said.
==> that's / thats

-----

The old-fashioned man ate an old-fashioned cake.
==> old-fashioned

-----

IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses.
==> IPv6

-----

This sentence with words has at most two equal most frequent words.
==>

Note: The third and seventh test cases have no output; for the fourth, you may output either form.
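For checking a golfed submission against these test cases, an ungolfed reference can help. The following is only a sketch of one possible reading of the rules (apostrophes kept, output lowercased); the name `most_frequent_word` is hypothetical.

```python
import re
from collections import Counter

def most_frequent_word(sentence):
    """Return the unique most frequent word, or '' when there is a tie."""
    # Lowercase first (case-insensitive counting), then remove everything
    # except alphanumerics, apostrophes, spaces and hyphens.
    cleaned = re.sub(r"[^a-z0-9' -]", "", sentence.lower())
    counts = Counter(cleaned.split())
    if not counts:
        return ""
    best = max(counts.values())
    winners = [word for word, count in counts.items() if count == best]
    # More than one word with the top count means "no most frequent word".
    return winners[0] if len(winners) == 1 else ""
```

Because case in the output is irrelevant, this sketch returns `ipv6` for the sixth test case.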

Scoring

Programs are scored in bytes. The usual character set is UTF-8; if you are using another, please specify.

When the challenge finishes, the program with the fewest bytes (it's called code-golf) will win.

Submissions

To make sure that your answer shows up, please start your answer with a headline, using the following Markdown template:

# Language Name, N bytes

where N is the size of your submission. If you improve your score, you can keep old scores in the headline, by striking them through. For instance:

# Ruby, <s>104</s> <s>101</s> 96 bytes

If you want to include multiple numbers in your header (e.g. because your score is the sum of two files or you want to list interpreter flag penalties separately), make sure that the actual score is the last number in the header:

# Perl, 43 + 2 (-p flag) = 45 bytes

You can also make the language name a link which will then show up in the leaderboard snippet:

# [><>](http://esolangs.org/wiki/Fish), 121 bytes
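To see why the actual score has to be the last number, here is a small sketch of how such a headline could be parsed. It works on the Markdown source and is not the regex the leaderboard snippet uses (that one runs on the rendered HTML); `parse_header` is a hypothetical helper.

```python
import re

def parse_header(header):
    """Return (language, score) from a '# Language, N bytes' headline, or None."""
    text = header.lstrip("# ").strip()
    language, _, rest = text.partition(",")
    # Drop struck-through scores so they cannot be mistaken for the current one.
    rest = re.sub(r"<s>.*?</s>", "", rest)
    numbers = re.findall(r"\d+", rest)
    if not language or not numbers:
        return None
    # The rule from the challenge: the last number in the header is the score.
    return language.strip(), int(numbers[-1])
```

With the headers above, `43 + 2 (-p flag) = 45 bytes` yields 45 because 45 is the last number, which is exactly what the rule asks for.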

Leaderboard

Here is a Stack Snippet to generate both a regular leaderboard and an overview of winners by language.

/* Configuration */

var QUESTION_ID = 79576; // Obtain this from the url
// It will be like https://XYZ.stackexchange.com/questions/QUESTION_ID/... on any question page
var ANSWER_FILTER = "!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe";
var COMMENT_FILTER = "!)Q2B_A2kjfAiU78X(md6BoYk";
var OVERRIDE_USER = 53406; // This should be the user ID of the challenge author.

/* App */

var answers = [], answers_hash, answer_ids, answer_page = 1, more_answers = true, comment_page;

function answersUrl(index) {
  return "https://api.stackexchange.com/2.2/questions/" +  QUESTION_ID + "/answers?page=" + index + "&pagesize=100&order=desc&sort=creation&site=codegolf&filter=" + ANSWER_FILTER;
}

function commentUrl(index, answers) {
  return "https://api.stackexchange.com/2.2/answers/" + answers.join(';') + "/comments?page=" + index + "&pagesize=100&order=desc&sort=creation&site=codegolf&filter=" + COMMENT_FILTER;
}

function getAnswers() {
  jQuery.ajax({
    url: answersUrl(answer_page++),
    method: "get",
    dataType: "jsonp",
    crossDomain: true,
    success: function (data) {
      answers.push.apply(answers, data.items);
      answers_hash = [];
      answer_ids = [];
      data.items.forEach(function(a) {
        a.comments = [];
        var id = +a.share_link.match(/\d+/);
        answer_ids.push(id);
        answers_hash[id] = a;
      });
      if (!data.has_more) more_answers = false;
      comment_page = 1;
      getComments();
    }
  });
}

function getComments() {
  jQuery.ajax({
    url: commentUrl(comment_page++, answer_ids),
    method: "get",
    dataType: "jsonp",
    crossDomain: true,
    success: function (data) {
      data.items.forEach(function(c) {
        if (c.owner.user_id === OVERRIDE_USER)
          answers_hash[c.post_id].comments.push(c);
      });
      if (data.has_more) getComments();
      else if (more_answers) getAnswers();
      else process();
    }
  });  
}

getAnswers();

var SCORE_REG = /<h\d>\s*([^\n,]*[^\s,]),.*?(\d+)(?=[^\n\d<>]*(?:<(?:s>[^\n<>]*<\/s>|[^\n<>]+>)[^\n\d<>]*)*<\/h\d>)/;

var OVERRIDE_REG = /^Override\s*header:\s*/i;

function getAuthorName(a) {
  return a.owner.display_name;
}

function process() {
  var valid = [];
  
  answers.forEach(function(a) {
    var body = a.body;
    a.comments.forEach(function(c) {
      if(OVERRIDE_REG.test(c.body))
        body = '<h1>' + c.body.replace(OVERRIDE_REG, '') + '</h1>';
    });
    
    var match = body.match(SCORE_REG);
    if (match)
      valid.push({
        user: getAuthorName(a),
        size: +match[2],
        language: match[1],
        link: a.share_link,
      });
    
  });
  
  valid.sort(function (a, b) {
    var aB = a.size,
        bB = b.size;
    return aB - bB
  });

  var languages = {};
  var place = 1;
  var lastSize = null;
  var lastPlace = 1;
  valid.forEach(function (a) {
    if (a.size != lastSize)
      lastPlace = place;
    lastSize = a.size;
    ++place;
    
    var answer = jQuery("#answer-template").html();
    answer = answer.replace("{{PLACE}}", lastPlace + ".")
                   .replace("{{NAME}}", a.user)
                   .replace("{{LANGUAGE}}", a.language)
                   .replace("{{SIZE}}", a.size)
                   .replace("{{LINK}}", a.link);
    answer = jQuery(answer);
    jQuery("#answers").append(answer);

    var lang = a.language;
    if (/<a/.test(lang)) lang = jQuery(lang).text();
    
    languages[lang] = languages[lang] || {lang: a.language, user: a.user, size: a.size, link: a.link};
  });

  var langs = [];
  for (var lang in languages)
    if (languages.hasOwnProperty(lang))
      langs.push(languages[lang]);

  langs.sort(function (a, b) {
    if (a.lang > b.lang) return 1;
    if (a.lang < b.lang) return -1;
    return 0;
  });

  for (var i = 0; i < langs.length; ++i)
  {
    var language = jQuery("#language-template").html();
    var lang = langs[i];
    language = language.replace("{{LANGUAGE}}", lang.lang)
                       .replace("{{NAME}}", lang.user)
                       .replace("{{SIZE}}", lang.size)
                       .replace("{{LINK}}", lang.link);
    language = jQuery(language);
    jQuery("#languages").append(language);
  }

}
body { text-align: left !important}

#answer-list {
  padding: 10px;
  width: 290px;
  float: left;
}

#language-list {
  padding: 10px;
  width: 290px;
  float: left;
}

table thead {
  font-weight: bold;
}

table td {
  padding: 5px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/codegolf/all.css?v=83c949450c8b">
<div id="answer-list">
  <h2>Leaderboard</h2>
  <table class="answer-list">
    <thead>
      <tr><td></td><td>Author</td><td>Language</td><td>Size</td></tr>
    </thead>
    <tbody id="answers">

    </tbody>
  </table>
</div>
<div id="language-list">
  <h2>Winners by Language</h2>
  <table class="language-list">
    <thead>
      <tr><td>Language</td><td>User</td><td>Score</td></tr>
    </thead>
    <tbody id="languages">

    </tbody>
  </table>
</div>
<table style="display: none">
  <tbody id="answer-template">
    <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr>
  </tbody>
</table>
<table style="display: none">
  <tbody id="language-template">
    <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr>
  </tbody>
</table>

George Gibson

Posted 2016-05-08T07:17:45.953

Reputation: 2 369

2

Comments are not for extended discussion; this conversation has been moved to chat.

– Doorknob – 2016-05-11T11:28:25.267

So given your new definition of 'word', what is the most common word here: don't d'ont dont a a? Would it be dont? – James – 2016-05-11T15:34:27.463

@DrGreenEggsandHamDJ If you have a submission that does remove apostrophes, dont. If not, a. but most submissions do, and so dont is a correct answer. – George Gibson – 2016-05-11T15:36:38.683

Is the output case-sensitive? So is ipv6 valid output for the last test case? – kirbyfan64sos – 2016-05-13T22:16:19.913

@kirbyfan64sos Case in the output is irrelevant. – George Gibson – 2016-05-14T06:46:26.860

An extra test case may be of use: "This sentence with words has at most two equal most frequent words." --> <nothing> – philcolbourn – 2016-05-14T07:12:27.487

Just to clarify: if I'm writing a function, I can't take the input as a parameter? It must be from the stdin? – Carcigenicate – 2017-01-31T16:51:47.063

@Carcigenicate Nah, I think that's just a mistake in the wording. – mbomb007 – 2017-01-31T17:34:51.433

Answers

6

Pyke, 26 25 bytes

l1dcD}jm/D3Sei/1qIi@j@
(;

Try it here!

Or 23 22 bytes (noncompeting; adds a node that kills the stack if false)

l1cD}jm/D3Sei/1q.Ii@j@

Try it here!

Or with punctuation, 23 bytes (I think this competes? Commit was before the edit)

l1.cD}jm/D3Sei/1q.Ii@j@

Try it here!

Or 12 bytes (definitely noncompeting)

l1.cj.#jR/)e

Try it here!

l1           -     input.lower()
  .c         -    punc_split(^)
    j        -   j = ^
     .#   )  -  sort(V(i) for i in ^)
       jR/   -   j.count(i)
           e - ^[-1]

Blue

Posted 2016-05-08T07:17:45.953

Reputation: 26 661

Your 23 byte answer would compete if the only punctuation preserved was - and ' (hyphen and apostrophe). – George Gibson – 2016-05-18T19:42:52.580

It only preserves punctuation that isn't at the end of a word – Blue – 2016-05-18T19:54:09.577

Oh, OK (I don't understand Pyke). I guess it competes then... – George Gibson – 2016-05-19T05:43:23.073

@GeorgeGibson I'm pretty sure the 23 byte version doesn't compete - it could come under standard loopholes. Also I don't expect (m)any people to understand Pyke, I'm making it as my own language – Blue – 2016-05-19T07:03:16.443

Alright then. I think you still win anyway, so it doesn't really matter. – George Gibson – 2016-05-19T07:04:32.633

Oh yeah, I posted earlier than the Jelly answer – Blue – 2016-05-19T07:05:48.693

I'm getting errors for all of the ones except for the 12 byte version, I presume this was an update to the interpreter? – Okx – 2017-01-31T16:02:57.433

@Okx almost certainly. There's been about 60 commits since I last edited this so I wouldn't be surprised if at least one of them was breaking – Blue – 2017-01-31T16:50:06.077

14

Jelly, 25 bytes

ṣ⁶f€ØB;”-¤Œl©Qµ®ċЀĠṪịµẋE

Try it online! or verify all test cases.

Dennis

Posted 2016-05-08T07:17:45.953

Reputation: 196 637

11

Pyth - 23 30 bytes

There has to be a better way to include digits and hyphens, but I just want to fix this right now.

Kc@s+++GUTd\-rzZ)I!tJ.M/KZ{KhJ

Test Suite.

Maltysen

Posted 2016-05-08T07:17:45.953

Reputation: 25 023

The revised rules require preserving digits and hyphens. – Dennis – 2016-05-09T00:52:18.803

@GeorgeGibson fixed. – Maltysen – 2016-05-16T19:01:30.570

6

Octave, 115 94 bytes

[a,b,c]=unique(regexp(lower(input('')),'[A-z]*','match'));[~,~,d]=mode(c); try disp(a{d{:}})

Accounts for the case with no most frequent word by using try. In this case it outputs nothing, and "takes a break" until you catch the exception.

Saved 21(!) bytes thanks to Luis Mendo's suggestion (using the third output from mode to get the most common word).
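The idea of asking for all tied modes and only printing when there is exactly one can be sketched outside Octave as well. The Python below is an illustration of that idea only, not a translation of the golfed code; `tied_modes` and `single_mode` are made-up names.

```python
from collections import Counter

def tied_modes(values):
    """Return every value that shares the highest count, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

def single_mode(values):
    """Return the mode only if it is unique, otherwise None."""
    modes = tied_modes(values)
    return modes[0] if len(modes) == 1 else None
```

For the input `[1, 1, 2, 2, 1, 2, 3, 4, 5, 5]`, `tied_modes` gives `[1, 2]`, so `single_mode` gives `None`, which corresponds to printing nothing.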


The rules have changed quite a bit since I posted my original answer. I'll look into the regex later.

Stewie Griffin

Posted 2016-05-08T07:17:45.953

Reputation: 43 471

you beat me to it, gonna think for something else now. – Abr001am – 2016-05-08T09:05:09.957

Apply mode on c maybe? Its third output gives all tied values, if I recall correctly – Luis Mendo – 2016-05-08T20:47:41.123

I count 115 bytes. – Conor O'Brien – 2016-05-08T23:34:55.507

I believe your regex should be ['\w\d] because you have to preserve apostrophes and digits. Unless those are between upper and lower case in ASCII, in which case ignore me because I don't have a table handy. – Fund Monica's Lawsuit – 2016-05-09T13:25:01.643

@StewieGriffin [~, ~, out] = mode([1 1 2 2 1 2 3 4 5 5]) gives out = {1 2} – Luis Mendo – 2016-05-09T23:26:14.880

Ah, misunderstood you the last time. Thanks Luis! Saved a lot of bytes! =) – Stewie Griffin – 2016-05-10T05:13:39.180

Doesn't try have to be enclosed by an end? And about the mode function: doesn't it return the smallest most common number? – Abr001am – 2016-05-11T16:04:09.050

5

Perl 6, 80 bytes

{$_>1&&.[0].value==.[1].value??""!!.[0].key given .lc.words.Bag.sort:{-.value}}

Let's split the answer into two parts...

given .lc.words.Bag.sort:{-.value}

given is a control statement (like if or for). In Perl 6, they're allowed as postfixes. (a if 1, or like here, foo given 3). given puts its topic (right-hand side) into the special variable $_ for its left-hand side.

The "topic" itself lowercases (lc), splits by word (words), puts the values into a Bag (a set with the number of occurrences), then sorts by value (DESC). Since sort only knows how to operate on lists, the Bag is transformed into a List of Pairs here.

$_>1&&.[0].value==.[1].value??""!!.[0].key

A simple conditional (Perl 6 uses ?? !! instead of ? :).

$_ > 1

Just checks that the list has more than one element.

.[0].value==.[1].value

Accesses to $_ can be shortened by not specifying the variable: .a is exactly like $_.a. So this is effectively "do both top elements have the same number of occurrences?" If so, then we print '' (the empty string).

Otherwise, we print the top element's key (the word itself; its value is the count): .[0].key.
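For readers who do not know Perl 6, the same pipeline can be sketched in Python, with `Counter` standing in for the Bag. This is only an approximation of the steps described above; `top_word` is a hypothetical name, and the sketch assumes the input contains at least one word.

```python
from collections import Counter

def top_word(sentence):
    # .lc.words.Bag: lowercase, split on whitespace, count occurrences.
    bag = Counter(sentence.lower().split())
    # .sort:{-.value}: a list of (word, count) pairs, highest count first.
    pairs = sorted(bag.items(), key=lambda pair: -pair[1])
    # $_>1 && .[0].value==.[1].value ?? "" !! .[0].key
    if len(pairs) > 1 and pairs[0][1] == pairs[1][1]:
        return ""
    return pairs[0][0]
```

Like the golfed answer as explained here, the sketch splits on whitespace only and does not strip punctuation.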

Ven

Posted 2016-05-08T07:17:45.953

Reputation: 3 382

It's like half English, half line-noise. Amazing. – cat – 2016-05-08T15:58:47.537

it's funny how it's the OO-style features that look english-y :P – Ven – 2016-05-08T16:02:41.867

Also manages to be less readable than Perl 5 while containing more English than Perl 5. D: – cat – 2016-05-08T16:08:55.423

@cat fixed it -- should be totally unreadable now – Ven – 2016-05-08T20:19:00.623

value??!! (I know that's a ternary operator, it's just entertaining) – cat – 2016-05-08T20:26:53.553

4

Ruby, 94 92 102 bytes

Gotta go fast (FGITW answer). Returns the word in all uppercase, or nil if there is no most frequent word.

Now updated to new specs, I think. However, I did manage to golf down a little so the byte count is the same!

->s{w=s.upcase.tr("_'",'').scan /[-\w]+/;q=->x{w.count x};(w-[d=w.max_by(&q)]).all?{|e|q[e]<q[d]}?d:p}

Value Ink

Posted 2016-05-08T07:17:45.953

Reputation: 10 608

Gotta go fast? – cat – 2016-05-08T12:37:50.480

@cat yeah, 'cuz I was FGITW this time – Value Ink – 2016-05-09T03:02:13.923

4

05AB1E, 30 bytes

Code:

lžj¨„ -«Ãð¡©Ùv®yQOˆ}®¯MQÏDg1Q×

Uses CP-1252 encoding. Try it online!.

Adnan

Posted 2016-05-08T07:17:45.953

Reputation: 41 965

hmm? – TessellatingHeckler – 2016-05-10T02:26:43.000

@TessellatingHeckler It only takes one line of input. Unless you repeatedly use the I command, 05AB1E will only take as much as it needs. – George Gibson – 2016-05-10T06:05:41.980

4

JavaScript (ES6), 155 bytes

s=>(m=new Map,s.toLowerCase().replace(/[^- 0-9A-Z]/gi,'').split(/\ +/).map(w=>m.set(w,-~m.get(w))),[[a,b],[c,d]]=[...m].sort(([a,b],[c,d])=>d-b),b==d?'':a)

Based on @Blue's Python answer.

Neil

Posted 2016-05-08T07:17:45.953

Reputation: 95 035

Your regex replace looks like it drops numbers, and will break the IPv6 test case, is that right? – TessellatingHeckler – 2016-05-10T02:16:54.403

@TessellatingHeckler The definition of word changed since I originally read the question, but I've updated my answer now. – Neil – 2016-05-10T08:00:21.903

4

Python 3.5, 142 137 134 112 117 110 127 bytes:

(+17 bytes, because apparently if several words are more frequent than the rest but share the same frequency, nothing should be returned.)

def g(u):import re;q=re.findall(r"\b['\-\w]+\b",u.lower());Q=q.count;D=[*map(Q,{*q})];return['',max(q,key=Q)][1in map(D.count,D)]

Should now satisfy all conditions. This submission assumes that at least 1 word is input.

Try It Online! (Ideone)

Also, if you want one, here is another version of my function devoid of any regular expressions at the cost of about 43 bytes, though this one is non-competitive anyways, so it does not really matter. I just put it here for the heck of it:

def g(u):import re;q=''.join([i for i in u.lower()if i in[*map(chr,range(97,123)),*"'- "]]).split();Q=q.count;D=[*map(Q,{*q})];return['',max(q,key=Q)][1in map(D.count,D)]

Try this New Version Online! (Ideone)

R. Kap

Posted 2016-05-08T07:17:45.953

Reputation: 4 730

From the challenge comments "if there are two words that are more frequent than the rest, but with the same frequency", the output is 'nothing'. – RootTwo – 2016-05-09T04:11:42.160

@RootTwo Fixed! :) – R. Kap – 2016-05-09T06:46:25.093

@TessellatingHeckler Those are different words though. That's is a contraction for that is whereas thats is not really a word. – R. Kap – 2016-05-10T03:33:52.770

@TessellatingHeckler Can you give me some proof of this comment? Because I am going through all the comments on the post and see no such comment. – R. Kap – 2016-05-10T03:41:22.607

3

Pyth, 32 bytes

p?tlJeM.MhZrS@Ls++\-GUTcrz0d8ksJ

Test suite.

Leaky Nun

Posted 2016-05-08T07:17:45.953

Reputation: 45 011

3

Sqlserver 2008, 250 bytes

DECLARE @ varchar(max) = 'That''s... that''s... that is just terrible!" he said.';

WITH c as(SELECT
@ p,@ x
UNION ALL
SELECT LEFT(x,k-1),STUFF(x,1,k,'')FROM
c CROSS APPLY(SELECT patindex('%[^a-z''-]%',x+'!')k)k
WHERE''<x)SELECT max(p)FROM(SELECT top 1with ties p
FROM c WHERE p>''GROUP BY p
ORDER BY count(*)DESC
)j HAVING count(*)=1

Try it online!

Sqlserver 2016, 174 bytes

Unable to handle data like this example (it counts the equals signs as 3 words):

DECLARE @ varchar(max) = 'That''s... that''s... that is just terrible!" he said. = = ='

SELECT max(v)FROM(SELECT TOP 1WITH TIES value v
FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(@,'"',''),',',''),'.',''),' ')GROUP
BY value ORDER BY count(*)DESC)x HAVING count(*)=1

t-clausen.dk

Posted 2016-05-08T07:17:45.953

Reputation: 2 874

I don't like the variable approach because it is kind of cheating :) One input -> nothing or something. With a set-based approach it has to be longer, because you need to add an additional GROUP BY, LEFT JOIN, or PARTITION BY. Anyway, SQL Server has a built-in SPLIT function. Ungolfed demo; feel free to make it as short as possible.

– lad2025 – 2016-05-09T11:59:05.693

@lad2025 thanks a lot, didn't know any features from 2016. STRING_SPLIT surely is a long overdue feature. I tried to golf the script down using split and got it down to 174; however, it will not be able to filter out text like "= = =" – t-clausen.dk – 2016-05-11T08:33:14.870

3

PostgreSQL, 246, 245 bytes

WITH z AS(SELECT DISTINCT*,COUNT(*)OVER(PARTITION BY t,m)c FROM i,regexp_split_to_table(translate(lower(t),'.,"''',''),E'\\s+')m)
SELECT t,CASE WHEN COUNT(*)>1 THEN '' ELSE MAX(m)END
FROM z WHERE(t,c)IN(SELECT t,MAX(c)FROM z GROUP BY t)
GROUP BY t  

Input if anyone is interested:

CREATE TABLE i(t TEXT);

INSERT INTO i(t)
VALUES ('The man walked down the road.'), ('Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy.'),
       ('This sentence has no most frequent word.'), ('"That''s... that''s... that is just terrible!" he said. '), ('The old-fashioned man ate an old-fashioned cake.'), 
       ('IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses.'), ('a   a            a b b b c');


Normally I would use MODE() WITHIN GROUP(...) and it will be much shorter, but it will violate:

If there is no most frequent word (see test case #3), your program should output nothing.


EDIT:

Handling ':

WITH z AS(SELECT DISTINCT*,COUNT(*)OVER(PARTITION BY t,m)c FROM i,regexp_split_to_table(translate(lower(t),'.,"!',''),E'\\s+')m)
SELECT t,CASE WHEN COUNT(*)>1 THEN '' ELSE MAX(m)END
FROM z WHERE(t,c)IN(SELECT t,MAX(c)FROM z GROUP BY t)
GROUP BY t  

SqlFiddleDemo

Output:

╔═══════════════════════════════════════════════════════════════════════════════════════════════╦═══════════════╗
║                                              t                                                ║      max      ║
╠═══════════════════════════════════════════════════════════════════════════════════════════════╬═══════════════╣
║ a a a b b b c                                                                                 ║               ║
║ The old-fashioned man ate an old-fashioned cake.                                              ║ old-fashioned ║
║ IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses.  ║ ipv6          ║
║ This sentence has no most frequent word.                                                      ║               ║
║ "That's... that's... that is just terrible!" he said.                                         ║ that's        ║
║ The man walked down the road.                                                                 ║ the           ║
║ Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy.        ║ he            ║
╚═══════════════════════════════════════════════════════════════════════════════════════════════╩═══════════════╝

lad2025

Posted 2016-05-08T07:17:45.953

Reputation: 379

Could not get as low as you; sqlserver doesn't have a built-in split yet. However, the select part is shorter. – t-clausen.dk – 2016-05-09T11:17:37.140

@GeorgeGibson Sure, fixed + added live demo. – lad2025 – 2016-05-11T15:30:03.997

@lad2025 By common agreement in chat, what you did is no longer necessary, feel free to revert back. – George Gibson – 2016-05-11T15:32:47.587

@GeorgeGibson Yup, edit will be much clearer. Live demo is working now; when I wrote the answer, sqlfiddle was not responding. – lad2025 – 2016-05-11T15:37:06.627

3

JavaScript (ES6), 99 bytes

F=s=>(f={},w=c='',s.toLowerCase().replace(/[\w-']+/g,m=>(f[m]=o=++f[m]||1)-c?o>c?(w=m,c=o):0:w=''),w)
#input { width: 100%; }
<textarea id="input" oninput="output.innerHTML=F(this.value)"></textarea>
<div id="output"></div>

George Reith

Posted 2016-05-08T07:17:45.953

Reputation: 2 424

2

Retina, 97 bytes

The rules keep changing...

T`L`l
[^-\w ]

O`[-\w]+
([-\w]+)( \1\b)*
$#2;$1
O#`[-\w;]+
.*\b(\d+);[-\w]+ \1;[-\w]+$

!`[-\w]+$

Try it online!

Test suite.

Leaky Nun

Posted 2016-05-08T07:17:45.953

Reputation: 45 011

Fails for this input. – Conor O'Brien – 2016-05-08T23:24:59.210

@CᴏɴᴏʀO'Bʀɪᴇɴ Thanks, fixed. – Leaky Nun – 2016-05-08T23:39:06.827

And you golfed it 11 bytes ._. impressive – Conor O'Brien – 2016-05-08T23:41:06.507

Also fails for "The old-fashioned man ate an old-fashioned cake." – t-clausen.dk – 2016-05-09T10:27:35.103

This doesn't look right either (expecting a to be the most common word there) – TessellatingHeckler – 2016-05-10T02:24:30.547

@TessellatingHeckler I don't accept newline as separator

– Leaky Nun – 2016-05-10T02:51:49.967

2

Python, 132 bytes

import collections as C,re
def g(s):(a,i),(b,j)=C.Counter(re.sub('[^\w\s-]','',s.lower()).split()).most_common(2);return[a,''][i==j]

Above code assumes that input has at least two words.
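If the two-word assumption is a concern, one ungolfed way around it is to pad the result of `most_common(2)` before unpacking. This is only a sketch of that idea, not a byte-count improvement; the regex is kept as in the answer, so underscores are still treated as word characters.

```python
import re
from collections import Counter

def g(s):
    words = re.sub(r"[^\w\s-]", "", s.lower()).split()
    # Pad with dummy (None, 0) entries so that unpacking also works when the
    # input has zero or one distinct word.
    padded = Counter(words).most_common(2) + [(None, 0)] * 2
    (a, i), (b, j) = padded[:2]
    # Empty input or a tie between the top two words means no output.
    return "" if a is None or i == j else a
```

A single-word input such as `hello` now returns that word instead of raising a `ValueError` on unpacking.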

RootTwo

Posted 2016-05-08T07:17:45.953

Reputation: 1 749

Got to love that regex, tho. – Blue – 2016-05-09T01:52:15.517

This is incorrect. The character class \w includes underscores. – mbomb007 – 2017-01-31T17:39:46.443

2

R, 115 bytes

function(s)if(sum(z<-(y=table(tolower((x=strsplit(s,"[^\\w']",,T)[[1]])[x>""])))==max(y))<2)names(which(z))else NULL

This is a function that accepts a string and returns a string if a single word appears more often than others and NULL otherwise. To call it, assign it to a variable.

Ungolfed:

f <- function(s) {
    # Create a vector of words by splitting the input on characters other
    # than word characters and apostrophes
    v <- (x <- strsplit(s, "[^\\w']", perl = TRUE)[[1]])[x > ""]

    # Count the occurrences of each lowercased word
    y <- table(tolower(v))

    # Create a logical vector such that elements of `y` which occur most
    # often are `TRUE` and the rest are `FALSE`
    z <- y == max(y)

    # If a single word occurs most often, return it, otherwise `NULL`
    if (sum(z) < 2) {
        names(which(z))
    } else {
        NULL
    }
}

Alex A.

Posted 2016-05-08T07:17:45.953

Reputation: 23 761

1

05AB1E, 22 21 20 bytes

žK„- JÃl#{D.MDgiJëõ?

Explanation:

žK                     # Push [a-zA-Z0-9]
  „-                   # Push 2-char string containing a hyphen and a space
     J                 # Join the stack into a single element
      Ã                # Removes all characters from implicit input except those specified above
       l               # Converts to lowercase
        #              # Split string by spaces
         {             # Sorts array
          D            # Duplicates
           .M          # Finds most common element
             Dg        # Gets length of string without popping
                 iJ    # If length == 1, then convert the array to a string (otherwise the output would be ['example'] instead of example)
                   ëõ? # Else push an empty string.

Note: If you're fine with trailing newlines in the output for when you're not supposed to output anything, remove the ? at the end to save a byte.

Note #2: The program will not work with a single word, but I doubt this would be a problem. If you want to fix this, replace # with ð¡ for an extra byte.

05AB1E uses CP-1252 as the charset, not UTF-8.

Try it online!

Okx

Posted 2016-05-08T07:17:45.953

Reputation: 15 025

1

Python 2, 218 bytes

Assumes more than 2 words. Getting rid of punctuation destroyed me...

import string as z
def m(s):a=[w.lower()for w in s.translate(z.maketrans('',''),z.punctuation).split()];a=sorted({w:a.count(w)for w in set(a)}.items(),key=lambda b:b[1],reverse=1);return a[0][0]if a[0][1]>a[1][1]else''

Blue

Posted 2016-05-08T07:17:45.953

Reputation: 1 986

Does this strip ',- etc? – Tim – 2016-05-08T16:50:13.170

@Tim No, I did this challenge before the rules were fully fleshed. Will change. – Blue – 2016-05-08T16:58:46.117

Can you assign the result of sorted to a tuple rather than having to index into the array manually? – Neil – 2016-05-08T19:18:08.723

@Neil you mean just get the first and second items for comparison instead of the entire array? I don't know how to do that – Blue – 2016-05-08T19:19:47.887

1

Matlab (225)

  • Rules changed :/

      function c=f(a),t=@(x)feval(@(y)y(y>32),num2str(lower(x)-0));f=@(x)num2str(nnz(x)+1);e=str2num(regexprep(a,'([\w''-]+)',' ${t($1)} ${f($`)} ${f([$`,$1])}'));[u,r,d]=mode(e);try c=find(e==d{:});c=a((e(c(1)+1)):(e(c(1)+2)));end
  • Toolbox is necessary to run this.

  • How it works: one of the nicest features of regex replace in Matlab is that it can execute expressions on captured tokens by calling functions from the enclosing environment, parameterized by those tokens. So any sequence "Word_A Word_B .." is replaced by integers "A0 A1 A2 B0 B1 B2 ...", where the first integer is the numerical ASCII signature of the word, the second is its starting index, and the third is its ending index. These last two integers are never duplicated anywhere in the sequence, so I took advantage of this: convert the result to an array, take its mode, then search for the mode in that array; the starting and ending indices follow it directly.

  • Edit: after changing some details, the program is now a function called with a string parameter.
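The signature/start/end trick may be easier to follow in a more conventional form. The Python sketch below keeps one (key, start, end) triple per word, finds the unique most frequent key, and slices the original text with the stored indices. It illustrates the idea only and is not a port of the Matlab code; `most_frequent_span` is a hypothetical name, and the lowercased word stands in for the numeric signature.

```python
import re
from collections import Counter

def most_frequent_span(text):
    # One (key, start, end) triple per word, using the same word pattern.
    triples = [(m.group().lower(), m.start(), m.end())
               for m in re.finditer(r"[\w'-]+", text)]
    counts = Counter(key for key, _, _ in triples)
    if not counts:
        return ""
    top = max(counts.values())
    winners = [key for key, count in counts.items() if count == top]
    if len(winners) != 1:
        return ""
    # The indices stored next to the first matching key locate the word
    # in the original, non-lowercased text.
    _, start, end = next(t for t in triples if t[0] == winners[0])
    return text[start:end]
```

Because the result is cut out of the original string, the word keeps the casing of its first occurrence.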


20 bytes saved thanks to @StewieGriffin, 30 bytes added in response to commonly agreed loopholes.

Abr001am

Posted 2016-05-08T07:17:45.953

Reputation: 862

You'll have my upvote when you (or someone else) show that this actually works, both for inputs that have a most common word, and for inputs that don't. =) (I can't test it, unfortunately) – Stewie Griffin – 2016-05-09T11:46:27.080

@StewieGriffin I think the program misbehaves with sentences with equi-frequency words; I will fix that – Abr001am – 2016-05-09T13:29:44.500

1

PHP, 223 bytes

$a=array_count_values(array_map(function($s){return preg_replace('/[^A-Za-z0-9]/','',$s);},explode(' ',strtolower($argv[1]))));arsort($a);$c=count($a);$k=array_keys($a);echo($c>0?($c==1?$k[0]:($a[$k[0]]!=$a[$k[1]]?$k[0]:'')):'');

MonkeyZeus

Posted 2016-05-08T07:17:45.953

Reputation: 461

1

Perl, 60 56 55 54 bytes

Includes +3 for -p

#!/usr/bin/perl -p
s/[\pL\d'-]+/$;[$a{lc$&}++]++or$\=$&/eg}{$\x=2>pop@

If a word cannot be just a number you can also drop the a for a score of 53.

Ton Hospel

Posted 2016-05-08T07:17:45.953

Reputation: 14 114

Does the hyphen in the -anE not count? It does on the other answer (+2 bytes for -p flag)... – George Gibson – 2016-05-10T17:48:03.417

@GeorgeGibson No, see http://meta.codegolf.stackexchange.com/questions/273/on-interactive-answers-and-other-special-conditions. The hyphen, the space and the E do not count. The other answer would normally only have to do +1 bytes for -p, but his solution has ' so it cannot be seen as an extension of -e or -E. So he should in fact count +3 (not +2) since he should count the space and the hyphen (but every extra option would only be +1).

– Ton Hospel – 2016-05-10T20:03:06.823

@TomHospel Oh, right. – George Gibson – 2016-05-11T06:06:53.483

Is this considered valid given the apostrophe rule? [\pL\d-] looks like it could be shrunken down to [\w-] (unless we care about underscores) but either version will report that instead of that's or thats for test 4. Otherwise, you need to add 4 bytes to insert \x27 in that character class (unless you have a better way of adding an apostrophe). – Adam Katz – 2018-01-22T17:08:47.813

@AdamKatz The definition of 'word' changed quite a bit while this was running and I never fully adopted the last version. But to keep you happy I created a fixed (and shorter) version :-). And yes, I do care about underscores – Ton Hospel – 2018-01-28T10:17:36.363

0

Python 3, 76 98 100 bytes

import re,statistics as S
try:print(S.mode(re.split("([a-z0-9-]+)",input().lower())[1::2]))
except:1

Try it online

Outputs the most common word as lowercase. Does not include apostrophes because "apostrophes are not required to be included."

statistics.mode requires Python 3.4

Unfortunately, no output to stderr is allowed, or it'd be much shorter.
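One caveat for anyone running this today: from Python 3.8 on, `statistics.mode` no longer raises on a tie and returns the first mode encountered instead, so the `try`/`except` trick only behaves as described on Python 3.4 to 3.7. A non-golfed sketch that works on 3.8+ could use `statistics.multimode`; `most_frequent` is a made-up name and the word pattern is the one from the answer.

```python
import re
from statistics import multimode

def most_frequent(sentence):
    # Same word pattern as the answer: lowercase letters, digits and hyphens.
    words = re.findall(r"[a-z0-9-]+", sentence.lower())
    # multimode returns every value that shares the highest count
    # (an empty list for empty input).
    modes = multimode(words)
    return modes[0] if len(modes) == 1 else ""
```

This trades the exception for an explicit length check, so it is longer than the golfed version but does not depend on the old error behaviour.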

mbomb007

Posted 2016-05-08T07:17:45.953

Reputation: 21 944

You're not allowed to print to STDERR, unless this program doesn't produce any error output? – Okx – 2017-01-31T17:17:01.217

Your new program doesn't support hyphens! I tried the input i- test i- – Okx – 2017-01-31T17:38:10.407

Fixed it all. Still short. – mbomb007 – 2017-01-31T17:54:18.523

0

R, 96 bytes

19 bytes shorter than the existing R answer, with a somewhat different approach.

t=table(gsub("[^a-z0-9'-]","",tolower(scan(,''))))
`if`(sum(t==max(t))-1,'',names(which.max(t)))

Reads from stdin, so the input is automatically separated by spaces. We convert to lowercase and use gsub to remove all non-alphanumerics (except - and '). We count the instances of each word with table and save the result to t. Next, we check whether there is more than one maximum in t (by seeing if more than one element is equal to max(t)). If so, we return the empty string ''. If not, we return the word corresponding to the maximum in t.
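
For readers who don't know R, the same steps can be sketched in Python. This is only an illustrative translation, not the author's code: the function name most_frequent is made up, and dropping tokens that become empty after stripping is an added assumption.

```python
import re
from collections import Counter

def most_frequent(sentence):
    # Lowercase, split on whitespace, then strip everything except a-z, 0-9, ' and -
    words = [re.sub(r"[^a-z0-9'-]", "", w) for w in sentence.lower().split()]
    # Counter plays the role of R's table(); empty tokens are dropped (assumption)
    t = Counter(w for w in words if w)
    if not t:
        return ""
    top = max(t.values())
    # Same test as sum(t==max(t))-1: non-zero means the maximum is shared
    if sum(c == top for c in t.values()) - 1:
        return ""
    # Equivalent of names(which.max(t))
    return max(t, key=t.get)
```

Calling most_frequent("The man walked down the road.") gives the, and a sentence in which every word occurs once gives the empty string.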

rturnbull

Posted 2016-05-08T07:17:45.953

Reputation: 3 689

0

Java 7, 291 bytes

import java.util.*;Object c(String s){List w=Arrays.asList(s.toLowerCase().split("[^\\w'-]+"));Object r=w;int p=0,x=0;for(Object a:w){p=Collections.frequency(w,r);if(Collections.frequency(w,a)>p)r=a;if(p>x)x=p;}for(Object b:w)if(!b.equals(r)&Collections.frequency(w,b)==p)return"";return r;}

The rule that it should output nothing when multiple words share the highest count took quite a bit of extra code.

Ungolfed:

import java.util.*;
Object c(String s){
  List w = Arrays.asList(s.toLowerCase().split("[^\\w'-]+"));
  Object r = w;
  int p = 0,
      x = 0;
  for(Object a : w){
    p = Collections.frequency(w, r);
    if(Collections.frequency(w, a) > p){
      r = a;
    }
    if(p > x){
      x = p;
    }
  }
  for(Object b : w){
    if(!b.equals(r) & Collections.frequency(w, b) == p){
      return "";
    }
  }
  return r;
}

Test code:

Try it here.

import java.util.*;
class M{
  static Object c(String s){List w=Arrays.asList(s.toLowerCase().split("[^\\w'-]+"));Object r=w;int p=0,x=0;for(Object a:w){p=Collections.frequency(w,r);if(Collections.frequency(w,a)>p)r=a;if(p>x)x=p;}for(Object b:w)if(!b.equals(r)&Collections.frequency(w,b)==p)return"";return r;}

  public static void main(String[] a){
    System.out.println(c("The man walked down the road."));
    System.out.println(c("Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy."));
    System.out.println(c("This sentence has no most frequent word."));
    System.out.println(c("\"That's... that's... that is just terrible!\" he said."));
    System.out.println(c("The old-fashioned man ate an old-fashioned cake."));
    System.out.println(c("IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses."));
    System.out.println(c("This sentence with words has at most two equal most frequent words."));
  }
}

Output:

the
he
     (nothing)
that's
old-fashioned
ipv6
     (nothing)

Kevin Cruijssen

Posted 2016-05-08T07:17:45.953

Reputation: 67 575

0

Python, 158 bytes

def g(s):import collections as c,re;l=c.Counter(re.sub('[^\w\s-]',"",s.lower()).split());w,f=l.most_common(1)[0];return[w,""][all(f==i[1]for i in l.items())]

Takes its input like this:

g("Bird is the word")

Should match all the requirements, although it does fail on empty strings. Is it necessary to check for those? Sorry for the delay.

Advice / feedback / black magic tips for saving bytes are always welcome

Wouldn't You Like To Know

Posted 2016-05-08T07:17:45.953

Reputation: 9

Hi, and welcome to PPCG! We score [tag:code-golf] challenges by the number of bytes in the answer. I went ahead and edited it for you with the correct information. – Rɪᴋᴇʀ – 2016-05-09T13:56:31.767

Welcome to PPCG! Unfortunately, your submission does not satisfy all the requirements of this challenge as, first of all, it's NOT case insensitive. For instance, it will NOT count occurrences of the word That as occurrences of the word that since the former begins with an uppercase T and the latter begins with a lowercase t. Also, this does NOT remove all other forms of punctuation except hyphens (-) and, optionally, apostrophes (') and as a result, this would NOT work for the fourth test case given in the question. – R. Kap – 2016-05-09T17:08:55.697

Also, this does NOT output nothing if there is no most frequent word. For instance, using the third test case (This sentence has no most frequent word.) as an example, your function outputs [('This', 1)], when it should instead be outputting nothing. I could go on and on about more issues, so I would recommend fixing them as soon as you can. – R. Kap – 2016-05-09T18:02:27.920

Will do soon, when I have time – Wouldn't You Like To Know – 2016-05-11T08:20:09.283

This is incorrect. The character class \w includes underscores. – mbomb007 – 2017-01-31T17:40:49.937

0

Lua, 232 199 175 bytes

w,m,o={},0;io.read():lower():gsub("[^-%w%s]",""):gsub("[%w-]+",function(x)w[x]=(w[x]or 0)+1 end)for k,v in pairs(w)do if m==v then o=''end if(v>m)then m,o=v,k end end print(o)

Blab

Posted 2016-05-08T07:17:45.953

Reputation: 451

if not w[x]then w[x]=0 end w[x]=w[x]+1 end -> w[x]=(w[x]or0)+1 – Leaky Nun – 2016-05-10T10:29:17.753

if m==v then o=''end -> o=m==v and '' or o – Leaky Nun – 2016-05-10T13:47:17.300

0

Rexx, 109 128 122 bytes

pull s;g.=0;m=0;do i=1 to words(s);w=word(s,i);g.w=g.w+1;if g.w>=m then do;m=g.w;g.m=g.m+1;r=w;end;end;if g.m=1 then say r

Pretty printed...

pull s
g.=0
m=0
do i=1 to words(s)
  w=word(s,i)
  g.w=g.w+1
  if g.w>=m
  then do
    m=g.w
    g.m=g.m+1
    r=w
  end
end
if g.m=1 then say r

aja

Posted 2016-05-08T07:17:45.953

Reputation: 141

I don't think this handles all cases of tied most frequent words - see (new) last test case - I made similar mistake. – philcolbourn – 2016-05-15T21:43:27.987

Hopefully, that's fixed it now – aja – 2016-05-16T11:58:12.317

0

PowerShell (v4), 117 bytes

$y,$z=@($input-replace'[^a-z0-9 \n-]'-split'\s'|group|sort Count)[-2,-1]
($y,($z,'')[$y.Count-eq$z.Count])[!!$z].Name

The first part is easy enough:

  • $input is ~= stdin
  • Regex replace irrelevant characters with nothing, keeping newlines so we don't mash the last word of one line and the first word of the next line into one by mistake. (Nobody else has discussed multiple lines; this could be golfed by 2 bytes if the input is always a single line.)
  • Regex split, Group by frequency (~= Python's collections.Counter), Sort to put most frequent words at the end.
  • PowerShell is case insensitive by default for everything.

Handling if there isn't a most frequent word:

  • Take the last two items [-2,-1] into $y and $z;
  • an N-item list, where N>=2, makes $y and $z the last two items
  • a 1-item list makes $y the last item and $z null
  • an Empty list makes them both null

Use the bool-as-array-index fake-ternary-operator golf (0,1)[truthyvalue], nested, to choose "", $z or $y as output, then take .Name.
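
The three cases above (two or more groups, one group, no groups) can be spelled out in Python. This is a rough sketch, not a translation of the golfed PowerShell: the name last_two_winner is hypothetical, and it takes an already cleaned list of words rather than reading stdin.

```python
from collections import Counter

def last_two_winner(words):
    # Group case-insensitively and sort by count so the most frequent group is last
    groups = sorted(Counter(w.lower() for w in words).items(), key=lambda kv: kv[1])
    if not groups:
        return ""            # empty list: nothing to report
    if len(groups) == 1:
        return groups[0][0]  # a single group is trivially the most frequent
    (_, runner_up), (word, best) = groups[-2], groups[-1]
    # A tie between the last two groups means there is no unique winner
    return "" if runner_up == best else word
```

Only the last two groups need to be inspected, because after sorting any tie for first place must show up there.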

PS D:\> "The man walked down the road."|.\test.ps1
The

PS D:\> "Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy."|.\test.ps1
he

PS D:\> "`"That's... that's... that is just terrible!`" he said."|.\test.ps1
Thats

PS D:\> "The old-fashioned man ate an old-fashioned cake."|.\test.ps1
old-fashioned

PS D:\> "IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses."|.\test.ps1
IPv6

TessellatingHeckler

Posted 2016-05-08T07:17:45.953

Reputation: 2 412

0

Perl 5, 96 92 84 + 2 (-p flag) = 86 bytes

++$h{+lc}for/\w(?:\S*\w)?/g}{$m>$e[1]||$e[1]>$m&&(($_,$m)=@e)||($_="")while@e=each%h

Using:

> echo "The man walked down the road." | perl -p script.pl

Denis Ibaev

Posted 2016-05-08T07:17:45.953

Reputation: 876

Your -p flag should incur a penalty of 3 bytes. The rules are roughly: each commandline flag is +1 byte, since that is how many extra bytes you need to extend your free -e'code' style commandline. So normally -p is only +1 byte. But here your code contains ', so it cannot be run simply from the commandline without escaping. That means it cannot be combined with -e, and the - and the space before the p are extra and must be counted too. – Ton Hospel – 2016-05-10T12:00:22.310

@TonHospel Fixed. – Denis Ibaev – 2016-05-10T13:34:38.687

This is actually 84 + 1 (-p flag) if you invoke it on the command line as perl -pe'…' (made available by removing the ' as noted in the first comments) – Adam Katz – 2018-01-22T16:39:23.997

0

bash, 153 146 131 154 149 137 bytes

declare -iA F
f(){ (((T=++F[$1])==M))&&I=;((T>M))&&M=$T&&I=$1;}
read L
L=${L,,}
L=${L//[^- a-z0-9]}
printf -vA "f %s;" $L
eval $A;echo $I

Operation:

declare an associative array F of integers (declare -iA F)

f is a function that, given a word parameter $1, increments frequency count for this word (T=++F[$1]) and compares to max count so far (M).

If equal, then we have a tie, so we will not consider this word to be most frequent (I=)

If greater than max count so far (M), then set max count so far to frequency count of this word so far (M=$T) and remember this word (I=$1)

End function f

Read a line (read L)

Make it lowercase (L=${L,,})

Remove any character except a-z, 0-9, dash (-) and space (L=${L//[^- a-z0-9]})

Make a sequence of bash statements that calls f for each word (printf -vA "f %s;" $L). This is saved to variable A.

eval A and print the result (eval $A;echo $I)
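
The single-pass tie tracking done by f may be easier to follow in a longer form. Below is a minimal Python sketch of the same idea. The name running_winner is invented for illustration, and it assumes the words have already been lowercased and cleaned.

```python
def running_winner(words):
    freq = {}     # plays the role of the associative array F
    best = 0      # max count so far (M)
    winner = ""   # current candidate (I); empty while there is a tie
    for w in words:
        freq[w] = freq.get(w, 0) + 1
        t = freq[w]
        if t == best:
            # Another word has caught up with the leader: no unique winner for now
            winner = ""
        elif t > best:
            # This word has strictly overtaken everything seen so far
            best, winner = t, w
    return winner
```

Because a count only ever grows by one, a word has to equal the current maximum before it can exceed it, so the candidate is always cleared on a tie and set again only when one word pulls ahead.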

Output:

This quick brown fox jumps over this lazy dog.
-->this
This sentence with the words has at most two equal most frequent the words.
-->
The man walked down the road.
-->the
This sentence has no most frequent word.
-->
Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy.
-->he
"That's... that's... that is just terrible!" he said.
-->thats
The old-fashioned man ate an old-fashioned cake.
-->old-fashioned
IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses.
-->ipv6

Bug: FIXED I have a bug that is not revealed in these test cases. If input is

This sentence with words has at most two equal most frequent words.

then my code should output nothing.

I have a fix but I seem to have hit a bash bug... I get very odd behaviour if M is not declared an integer: ++F[$1]==M (after a few repeated words) increments both F[$1] and M!! - my mistake.

philcolbourn

Posted 2016-05-08T07:17:45.953

Reputation: 501

0

Tcl 8.6, 196 bytes

lmap s [join [read stdin] \ ] {dict incr d [regsub -all {[^\w-]} [string tol $s] {}]}
set y [dict fi $d v [lindex [lsort [dict v $d]] end]]
if {[llength $y]!=2} {set y {}}
puts "==> [lindex $y 0]"

(Alas, I can't figure out how to get it any smaller than that...)

Explanation

It uses several obscure Tcl idioms to do stuff.

  • [join [read stdin] " "] — input string→list of whitespace-separated words
  • lmap ... — iterate over every element of that list. (Shorter than foreach and effectively identical since the result is discarded.)
  • [regsub ... [string tolower ...]] — Convert the string to lowercase and strip all characters except for word characters and the hyphen.
  • [dict incr d ...] — Create/modify a dictionary/word→count histogram.
  • set y ... — Sort the dictionary values, take the largest one, and return all (key,value) pairs corresponding to it.
  • if... — There must be exactly two elements: a single (key,value) pair, else there is nothing to print.
  • puts... — Print the key in the key value pair, if any. (No word has spaces.)
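
The "exactly two elements" check in the list above relies on the filtered dictionary being a flat key/value list. A small Python sketch of that check follows. The function name unique_max_key is hypothetical, and it takes a ready-made word-to-count dict and compares the counts numerically.

```python
def unique_max_key(hist):
    """Return the key with the largest count, or '' if that count is shared."""
    if not hist:
        return ""
    top = max(hist.values())
    # Flatten the matching (key, value) pairs into one list: k1, v1, k2, v2, ...
    flat = [x for k, v in hist.items() if v == top for x in (k, v)]
    # Exactly two elements means exactly one (key, value) pair matched
    return flat[0] if len(flat) == 2 else ""
```

For example, unique_max_key({"the": 2, "man": 1}) returns the, while a histogram with two words on the same top count returns the empty string.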

You can play with it using CodeChef.

Dúthomhas

Posted 2016-05-08T07:17:45.953

Reputation: 541

182 – sergiol – 2018-06-09T23:46:37.080

0

Python 3, 106 bytes

def f(s):s=s.split();z=sorted([s.count(i)for i in set(s)]);return("",max(set(s),key=s.count))[z[-2]<z[-1]]

Hunter VL

Posted 2016-05-08T07:17:45.953

Reputation: 321

You can use s.split() – mbomb007 – 2017-01-31T17:41:36.313

split method by default uses spaces, so you can save 3 bytes by changing s.split(" ") to s.split() – sagiksp – 2017-02-01T12:00:29.787

0

Shell, 89 86 82 bytes

grep -Po "[\w'-]+"|sort -f|uniq -ci|sort -nr|awk 'c>$1{print w}c{exit}{c=$1;w=$2}'

This lists all words in the input, then sorts them with counts from most common to least common. The awk call merely ensures that the #2 word doesn't have the same count as the #1 word.

Unwrapped:

grep -Po "[\w'-]+"      # get a list of the words, one per line
  |sort -f              # sort (case insensitive, "folded")
  |uniq -ci             # count unique entries while still ignoring case
  |sort -nr             # sort counted data in descending order
  |awk '
    count > $1 {        # if count of most common word exceeds that of this line
      print word        # print the word saved from it
    }
    count {             # if we have already saved a count (-> we are on line 2)
      exit              # we always exit on line 2 since we have enough info
    }
    {                   # if true (run on line 1 only due to the above exit)
      count = $1        # save the count of the word on this first line
      word = $2         # save the word itself
    }'

grep -o is the magic tokenizer here. It takes each word (as defined by a regex accepting word characters (letters, numbers, underscore), apostrophe, or hyphen using PCRE given -P) and puts it on its own line. This accepts underscores, as do many other answers here. To disallow underscores, add four characters to turn this portion into grep -oi "[a-z0-9'-]*"

alias cnt='sort -f |uniq -ci |sort -nr' is an old standby of mine. Without regard to case, it alphabetizes (erm, asciibetizes) the lines of the input, counts occurrences of each entry, then reverse-sorts by the numeric occurrences so the most popular is first.

awk only looks at the first two lines of that descending ranked list. On line one, count is not yet defined, so it is evaluated as zero and therefore the first two stanzas are skipped (zero == false). The third stanza sets count and word. On the second line, awk has a defined and nonzero value for count, so it compares that count to the second best count. If it's not tied, the saved word is printed. Regardless, the next stanza exits for us.
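
The pipeline as a whole (tokenize, count, rank, compare the top two) can be sketched in Python. This is an illustration under stated assumptions, not a port: top_word is an invented name, and it uses the underscore-free character class from the variant above.

```python
import re
from collections import Counter

def top_word(text):
    # grep -o equivalent: pull out every run of allowed characters
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    # sort | uniq -c | sort -nr equivalent: the two highest-ranked entries
    ranked = Counter(words).most_common(2)
    if not ranked:
        return ""
    if len(ranked) == 1:
        # The awk one-liner only prints once it sees a second line;
        # returning the lone word here is this sketch's own choice.
        return ranked[0][0]
    (word, count), (_, second) = ranked
    # Same test as c>$1: print only if the leader strictly beats the runner-up
    return word if count > second else ""
```

As in the awk script, only the first two ranked entries matter, since any tie for first place shows up between them.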

Test implemented as:

for s in "The man walked down the road." "Slowly, he ate the pie, savoring each delicious bite. He felt like he was truly happy." "This sentence has no most frequent word." "\"That's... that's... that is just terrible\!\" he said." "The old-fashioned man ate an old-fashioned cake." "IPv6 looks great, much better than IPv4, except for the fact that IPv6 has longer addresses." "This sentence with words has at most two equal most frequent words."; do printf "%s\n==> " "$s"; echo "$s" |grep -io "[a-z0-9'-]*"|sort -f|uniq -ci|sort -nr|awk 'c>$1{print w}c{exit}{c=$1;w=$2}'; echo; done

Adam Katz

Posted 2016-05-08T07:17:45.953

Reputation: 306