Find the frequency of triplets in a phrase

6

1

For context, this problem is based on a old chat-bot project I did.

Problem:

Given a string of words containing any of the characters:

" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

Find the frequency of each triplet of words. All non-alphanumeric characters should be ignored, and input/output will be case-insensitive.

For this challenge, the "triplets" of a phrase are each consecutive chunk of 3 words along the string.

For example, in the string

"Oh hi there guy. What's up? Oh hi there."

The "triplets" of the string are

[["oh", "hi", "there"], ["hi", "there", "guy"], ["there", "guy", "whats"], ["guy", "whats", "up"],
 ["whats", "up", "oh"], ["up", "oh", "hi"], ["oh", "hi", "there"]]

The frequency of each triplet is 1, except for ["oh", "hi", "there"], which appears twice.

Input

Input will be a string of space-delimited "words" that may contain any of the characters mentioned above. Although punctuation is to be ignored, it must be handled.

You can assume the input will always contain at least 3 words, and that there won't be consecutive whitespace.

Output

Output can be anything that shows the frequency of each triplet.

For the string "Oh hi there guy.", possible outputs could be:

{"oh hi there":1, "hi there guy":1}

["oh hi there", 1, "hi there guy", 1]

"oh hi there|1 hi there guy|1"
            ^ Or any other delimiter

Test Cases (Output order doesn't matter):

"Oh hi there guy. What's up? Oh hi there."
{["oh" "hi" "there"] 2,
 ["hi" "there" "guy"] 1,
 ["there" "guy" "whats"] 1,
 ["guy" "whats" "up"] 1,
 ["whats" "up" "oh"] 1,
 ["up" "oh" "hi"] 1}

"aa aa aa aa"
{["aa" "aa" "aa"] 2}

"aa bb a bb a bb a cc a bb a"
{["aa" "bb" "a"] 1,
 ["bb" "a" "bb"] 2,
 ["a" "bb" "a"] 3,
 ["bb" "a" "cc"] 1,
 ["a" "cc" "a"] 1,
 ["cc" "a" "bb"] 1}

"99 bottles of beer"
{["99" "bottles" "of"] 1,
 ["bottles" "of" "beer"] 1}

"There are two main types of chatbots, one functions based on a set of rules, and the other more advanced version uses artificial intelligence. The chatbots based on rules, tend to be limited in functionality, and are as smart as they are programmed to be. On the other end, a chatbot that uses artificial intelligence, understands language, not just commands, and continuously gets smarter as it learns from conversations it has with people."
{["main" "types" "of"] 1,
 ["rules" "and" "the"] 1,
 ["of" "chatbots" "one"] 1,
 ["to" "be" "limited"] 1,
 ["artificial" "intelligence" "understands"] 1,
 ["it" "has" "with"] 1,
 ["chatbots" "based" "on"] 1,
 ["smarter" "as" "it"] 1,
 ["the" "chatbots" "based"] 1,
 ["other" "more" "advanced"] 1,
 ["commands" "and" "continuously"] 1,
 ["chatbots" "one" "functions"] 1,
 ["tend" "to" "be"] 1,
 ["a" "chatbot" "that"] 1,
 ["continuously" "gets" "smarter"] 1,
 ["advanced" "version" "uses"] 1,
 ["functionality" "and" "are"] 1,
 ["are" "two" "main"] 1,
 ["based" "on" "rules"] 1,
 ["on" "a" "set"] 1,
 ["there" "are" "two"] 1,
 ["the" "other" "more"] 1,
 ["just" "commands" "and"] 1,
 ["the" "other" "end"] 1,
 ["that" "uses" "artificial"] 1,
 ["based" "on" "a"] 1,
 ["limited" "in" "functionality"] 1,
 ["smart" "as" "they"] 1,
 ["are" "as" "smart"] 1,
 ["from" "conversations" "it"] 1,
 ["other" "end" "a"] 1,
 ["intelligence" "the" "chatbots"] 1,
 ["functions" "based" "on"] 1,
 ["in" "functionality" "and"] 1,
 ["intelligence" "understands" "language"] 1,
 ["chatbot" "that" "uses"] 1,
 ["more" "advanced" "version"] 1,
 ["gets" "smarter" "as"] 1,
 ["rules" "tend" "to"] 1,
 ["on" "rules" "tend"] 1,
 ["as" "it" "learns"] 1,
 ["are" "programmed" "to"] 1,
 ["and" "the" "other"] 1,
 ["understands" "language" "not"] 1,
 ["and" "are" "as"] 1,
 ["of" "rules" "and"] 1,
 ["has" "with" "people"] 1,
 ["end" "a" "chatbot"] 1,
 ["set" "of" "rules"] 1,
 ["and" "continuously" "gets"] 1,
 ["as" "they" "are"] 1,
 ["they" "are" "programmed"] 1,
 ["as" "smart" "as"] 1,
 ["two" "main" "types"] 1,
 ["a" "set" "of"] 1,
 ["uses" "artificial" "intelligence"] 2, # <----- 2 Here
 ["it" "learns" "from"] 1,
 ["be" "limited" "in"] 1,
 ["programmed" "to" "be"] 1,
 ["types" "of" "chatbots"] 1,
 ["conversations" "it" "has"] 1,
 ["one" "functions" "based"] 1,
 ["be" "on" "the"] 1,
 ["not" "just" "commands"] 1,
 ["version" "uses" "artificial"] 1,
 ["learns" "from" "conversations"] 1,
 ["artificial" "intelligence" "the"] 1,
 ["to" "be" "on"] 1,
 ["on" "the" "other"] 1,
 ["language" "not" "just"] 1}

Your submission can be a function or full program, and can take input via stdin, or as an argument. It may output by returning, or printing to the stdout.

This is code golf, so the shortest number of bytes wins.

Carcigenicate

Posted 2017-03-04T23:16:06.187

Reputation: 3 295

Should digits be removed or retained in the output? – Martin Ender – 2017-03-04T23:22:38.613

Retained. Note the 4th test case. Do I have contradictory information somewhere? I changed that from when it was in the sandbox. Might have missed updating something. – Carcigenicate – 2017-03-04T23:24:38.653

Nevermind, overlooked that test case. – Martin Ender – 2017-03-04T23:26:20.233

Are repeated spaces ever going to occur in the input, and if so how should they be treated? (At least one current solution would parse "spam eggs ham" (with double spaces, which markdown removes) as ["spam", "", "eggs", "", "ham"] and at least one as ["spam", "eggs", "ham"]) – Jonathan Allan – 2017-03-05T07:38:16.530

You can assume consecutive spaces won't exist. Updated input specification. – Carcigenicate – 2017-03-06T00:37:43.143

Given that you are not lost in another planet: do you think that 10 out of 13 answers do not even deserve an upvote? – edc65 – 2017-03-06T15:46:30.830

@edc65 I've worked the past 2 nights, while dealing with an infected wisdom tooth, while packing for a trip. I glanced down the answers, which is why I commented on yours. Once I have a sec I'll go over and check and upvote the answers I can. – Carcigenicate – 2017-03-06T15:56:02.253

@edc65 Unless everyone needs exactly 10 rep now, I didn't think it would be an issue. – Carcigenicate – 2017-03-06T15:57:47.193

Of course it won't. Seeing so many answers with no feedback just gives a bad feeling. Best wishes for a speedy recovery – edc65 – 2017-03-06T16:12:26.373

@edc65 Thanks. I should have internet on my computer tonight when I get to the hotel, so I'll go over them then. I'm low on data so I don't want to push my luck on the way there. – Carcigenicate – 2017-03-06T16:15:11.180

Answers

2

Jelly, 18 13 bytes

ŒlḲf€ØBṡ3ṢŒr'

A monadic link (function) that returns a list, each entry contains a list of 3 lists of characters (the words) and a count.

Try it online! - the footer formats the results (as a full program everything just gets smushed together by the implicit print).

How?

ŒlḲf€ØBṡ3ṢŒr' - Main link: s
Œl            - convert s to lowercase
  Ḳ           - split on spaces
     ØB       - base digit yield "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
   f€         - filter keep for €ach
       ṡ3     - all overlapping slices of length 3
         Ṣ    - sort
            ' - call the previous monadic link without vectorising
          Œr  -     run length encode

Jonathan Allan

Posted 2017-03-04T23:16:06.187

Reputation: 67 804

4

Perl 6,  67  57 bytes

*.words.map({[~] .lc.comb(/\w/)}).rotor(3=>-2)».join(' ').Bag.perl

Try it

*.words.rotor(3=>-2).map({lc S:g/<-:L-:N-:Z>//}).Bag.perl

Try it

Expanded:

*\                # WhateverCode lambda (this is the parameter)
.words            # get a list of words
.rotor( 3 => -2 ) # grab 3, back up 2, repeat

.map({            # take those list of 3 elements

  lc              # lower case the following

  S               # remove
  :global         # globally
  /
    <- :L         # not a letter
     - :N         # not a number
     - :Z         # not a space  (inserted when the current list is stringified)
    >
  //

})
.Bag              # turn it into a Bag
.perl             # return the structure as an `EVAL`able Str

returns something like

("bb a cc"=>1,"aa bb a"=>1,"a cc a"=>1,"a bb a"=>3,"bb a bb"=>2,"cc a bb"=>1).Bag

Brad Gilbert b2gills

Posted 2017-03-04T23:16:06.187

Reputation: 12 713

3

Retina, 50 bytes

T`lLd p`lld _
M!&`(\b\w+ ??){3}
O`
$
¶
D`
¶+
:$.&¶

Try it online!

Explanation

T`lLd p`lld _

Replace all uppercase letters with lowercase ones, keep other letters, digits and spaces unchanged and delete other characters.

M!&`(\b\w+ ??){3}

Find all possible matches of three words in a row.

O`

Sort the matches, so equal triplets end next to each other.

$
¶

Add a newline at the end.

D`

Remove duplicate lines (this will keep the newlines at the end of them).

¶+
:$.&¶

Sequences of newlines are now the counts we need, replace them with a delimiter (:) followed by the count and a newline.

Leo

Posted 2017-03-04T23:16:06.187

Reputation: 8 482

T\w p`_dll _` saves a byte. – Martin Ender – 2017-03-07T14:09:52.450

1

Mathematica, 75 bytes

Tally@Partition[ToLowerCase@#~StringSplit~RegularExpression@"[_\\W]+",3,1]&

Pure function taking a string as input and returning a list of ordered triples of strings together with their number of appearances. For example, on the input "aa bb a bb a bb a cc a bb a" the output is {{{"aa", "bb", "a"}, 1}, {{"bb", "a", "bb"}, 2}, {{"a", "bb", "a"}, 3}, {{"bb", "a", "cc"}, 1}, {{"a", "cc", "a"}, 1}, {{"cc", "a", "bb"}, 1}}. The expression RegularExpression@"[_\\W]+" recognizes all runs of one or more non-word characters (that's what \\W does, except it calls _ a word character whereas we don't want to). ToLowerCase@#~StringSplit~ then splits the input string at all of those runs, after converting all letters to lowercase. Partition[...,3,1] partitions the resulting list into all triples of consecutive words, and Tally does exactly what we want to that list of triples.

Greg Martin

Posted 2017-03-04T23:16:06.187

Reputation: 13 940

1The string shouldn't be split around non-word characters but only around spaces. Other non-word characters should simply be deleted. See e.g. the test case containing What's. – Martin Ender – 2017-03-05T11:06:39.787

StringSplit splits around spaces by default. – LegionMammal978 – 2017-03-05T16:16:53.503

1

JavaScript (ES6), 200 198 187 bytes

(-2 bytes thanks to Luke)

(-11 bytes thanks to edc65)

H=(h,k=[],i=0)=>i+2<(z=h.toLowerCase().replace(/[^\w ]|_/g,'').split` `).length?H(h,k.concat(z.slice(i,i+3)+''),-~i):new Set(k.map(a=>a+':'+k.filter(b=>b==a).length)).forEach(m=>alert(m))

A recursive solution. Takes input as a single string and outputs dialog boxes containing each triplet's count in the format <Triplet>:<Count>. As always, golfing tips are greatly appreciated.

Test Snippet

H=(h,k=[],i=0)=>i+2<(z=h.toLowerCase().replace(/[^\w ]|_/g,'').split` `).length?H(h,k.concat(z.slice(i,i+3)+''),-~i):new Set(k.map(a=>a+':'+k.filter(b=>b==a).length)).forEach(m=>alert(m))
<input type="text" id="i" value="Oh hi there guy._ What's up? Oh hi there."></input>
<input type="button" value="Submit" onclick="H(document.getElementById('i').value)"></input>

R. Kap

Posted 2017-03-04T23:16:06.187

Reputation: 4 730

split(" ") can become split\``. I don't really understand your Regex. Don't you want to replace /\W|_/ (anything that is (not a word character) or (an underscore))? The .forEach() at the end can probably be replaced by map(). – Luke – 2017-03-05T06:51:56.543

@Luke About the Regex, I also don't want to replace spaces (which are apparently referred to as non-word characters) so I can use .split(" ") on the result. As for your map() advice, the Set() object does not seem to have Set.prototype.map() defined according to Set.prototype, so that won't work. Still, thanks for your first tip. – R. Kap – 2017-03-05T08:29:33.257

[z[i],z[i+1],z[i+2]].join' ' -> z[i]+' '+z[i+1]+' '+z[i+2], but I don't see a reason to use a space to separate words, so why not z.slice(i,i+3)+'' – edc65 – 2017-03-05T22:23:46.137

@edc65 Because I am just starting out with JavaScript and did not really know/remember that built-in existed. Anyways, thanks for the tip! :) – R. Kap – 2017-03-05T23:51:14.040

1

Python 2, 164 159 bytes

s=filter(lambda k:k.isalnum()or k==' ',input()).lower().split()
k=[tuple(s[i:i+3])for i in range(len(s)-3)]
l={}
for i in k:l[i]=l[i]+1if i in l else 1
print l

Not a really great solution, I know. Pretty straightforward.

Thanks to @ovs for saving 5 bytes overall.

HyperNeutrino

Posted 2017-03-04T23:16:06.187

Reputation: 26 575

Some improvements: ' ' is the default parameter of split, you don't need the newline and tab after for and you can replace your conditional expression with l[i]=l.get(i,0)+1 Tio

– ovs – 2017-03-05T07:35:00.330

@ovs Oh, nice. I didn't notice. Thanks! – HyperNeutrino – 2017-03-08T14:54:20.533

1

Java 8, 376 372 bytes

import java.util.*;public interface T{static void main(String[]k)throws Exception{String s="";int c;while((c=System.in.read())>10)if(c==32|(""+(char)c).matches("[A-Za-z]"))s+=(char)c;k=s.toLowerCase().split(" ");Map<String,Integer>o=new HashMap<>();for(int i=0;i<k.length-2;i++){s=k[i]+" "+k[i+1]+" "+k[i+2];o.put(s,(o.containsKey(s)?o.get(s):0)+1);}System.out.print(o);}}

Ungolfed:

import java.util.HashMap;
import java.util.Map;

public interface T {
    static void main(String[] k) throws Exception {
        String s = "";
        int c;
        while ((c = System.in.read()) > 10)
            if (c == 32 | ("" + (char) c).matches("[A-Za-z]"))
                s += (char) c;
        k = s.toLowerCase().split(" ");
        Map<String, Integer> o = new HashMap<>();
        for (int i = 0; i < k.length - 2; i++) {
            s = k[i] + " " + k[i + 1] + " " + k[i + 2];
            o.put(s, (o.containsKey(s) ? o.get(s) : 0) + 1);
        }
        System.out.print(o);
    }
}

Never use Java for actual golfing.

EDIT: Saved 4 bytes by abusing the fact that the main method has a String[] as its parameter so I don't have to initialize one >:-)

HyperNeutrino

Posted 2017-03-04T23:16:06.187

Reputation: 26 575

I'm getting 376 bytes still. I think you forgot to update your code. (String[]a->String[]k and delete ,k[]) – CAD97 – 2017-03-05T03:52:26.653

@CAD97 oh whoops, you're right. That was stupid. Thanks! – HyperNeutrino – 2017-03-05T03:53:16.953

Submitting this as a lambda Function<String, String> that takes s as input and returns o.toString() should be shorter by a lot of bytes. So import java.util.*; & s->{String[]k=s.toLowerCase().split(" "); etc. – CAD97 – 2017-03-05T04:03:39.000

1

Python 3, 151 149 bytes

s="".join(c for c in input()if c.isalnum()or c==" ").lower().split()
k=[tuple(s[i:i+3])for i in range(len(s)-2)]
print(*((x,k.count(x))for x in{*k}))

Try it online!

Trelzevir

Posted 2017-03-04T23:16:06.187

Reputation: 987

1

05AB1E, 23 20 bytes

žLðìÃ#Œ3ù©Ùvyðý®y¢‚ˆ

Try it online!

Explanation

žL                    # push a string of alphanumeric characters
  ðì                  # prepend a space
    Ã                 # keep only those characters from input
     #                # split on spaces
      Œ3ù             # get all sublists of length 3
         ©            # save a copy in register
          Ù           # remove duplicates
           v          # for each unique triplet
            yðý       # join the triplet by spaces
               ®y¢    # count the triplets occurrences in the list saved to register
                  ‚   # pair the string with the count
                   ˆ  # push it to global list
                      # output global list at the end of the program

Emigna

Posted 2017-03-04T23:16:06.187

Reputation: 50 798

1

Python, 143 bytes

lambda s:Counter(tuple(''.join(filter(str.isalnum,w))for w in s.lower().split())[i:i+3]for i in range(s.count(' ')-1))
from collections import*

Try it online!

Unnamed function which returns a dictionary (specifically a Counter object) which has keys that are tuples of the triples (in the correct order) and values that are their counts.

Jonathan Allan

Posted 2017-03-04T23:16:06.187

Reputation: 67 804

1You can assume multiple consecutive spaces won't exist. – Carcigenicate – 2017-03-05T13:35:46.937

Thanks; best to update the question to specify this I think. – Jonathan Allan – 2017-03-05T13:38:14.253

1will do. I literally just woke up though. – Carcigenicate – 2017-03-05T13:38:44.847

1

Bash + Unix utilities, 117 bytes

s=[^\ ]*
p=s/^$s\ //
tr A-Z a-z|tr -cd \ a-z0-9|sed "p;$p;p;$p"|sed "s/\($s $s $s\) /\1\n/g"|grep ' .* '|sort|uniq -c

Try it online!

Input on stdin, output on stdout.

Sample run:

Input:

Oh hi there guy. What's up? Oh hi there.

Output:

      1 guy whats up
      1 hi there guy
      2 oh hi there
      1 there guy whats
      1 up oh hi
      1 whats up oh

Mitchell Spector

Posted 2017-03-04T23:16:06.187

Reputation: 3 392

1

JavaScript (ES6), 124

s=>s.toLowerCase().replace(/\S+/g,x=>t.push(x.replace(/[\W_]/g,''))>2&&(k[t]=-~k[t],t.shift()),t=[],k={})&&JSON.stringify(k)

Note that the output shows the frequency of each triplet, as requested. If the output could just contain the frequency data in a machine usable format, I could save 16 bytes of JSON.stringify()

Less golfed

s => {
   s = s.toLowerCase() // force to lowercase
   t = [] // init current group (or triplet)
   k = {} // init frequency list    
   s.replace(/\S+/g, x => ( // execute for each sequence of non space chars
     x = x.replace(/[\W_]/g,''), //remove non alphanumeric chars
     t.push(x) > 2 // add to end of current group
     && (          // if there are 3 elements in group (it's a triplet)
       k[t] = -~k[t], // add to frequency list incrementing count
       t.shift() // remove first element of current group
     ) 
   )
   return JSON.stringify(k) // return in human readable format
}

Test

let F=
s=>s.toLowerCase().replace(/\S+/g,x=>t.push(x.replace(/[\W_]/g,''))>2&&(k[t]=-~k[t],t.shift()),t=[],k={})&&JSON.stringify(k)

;["Oh hi there guy. What's up? Oh hi there."
,"aa aa aa aa"
,"aa bb a bb a bb a cc a bb a"
,"99 bottles of beer"
,"There are two main types of chatbots, one functions based on a set of rules, and the other more advanced version uses artificial intelligence. The chatbots based on rules, tend to be limited in functionality, and are as smart as they are programmed to be. On the other end, a chatbot that uses artificial intelligence, understands language, not just commands, and continuously gets smarter as it learns from conversations it has with people."]
.forEach(t=>{
  O.textContent += t + '\n'+ F(t).replace(/,"/g,'\n"')+'\n\n'
})
<pre id=O></pre>

edc65

Posted 2017-03-04T23:16:06.187

Reputation: 31 086

A JSON should be an acceptable output unless it contains a bunch of other irrelevant data. – Carcigenicate – 2017-03-06T14:56:59.800

@Carcigenicate JSON is a way to encode data in a stream of characters, and it' human readable. And JSON is exactly the output right now, and as you can see, it only contains relevant data. – edc65 – 2017-03-06T15:42:38.717

1

CJam, 25 bytes

qel"␡{a:0!":,:^-S/3ew$e`p

␡ represents a Delete character (ASCII 127).

Try it online!

Explanation

q                          Read all input
 el                        Convert it to all lowercase
   "␡{a:0!"                Push this string
           :,              Map each character in the string to the range of chars up to,
                            but excluding it
             :^            Reduce the array of ranges with symmetric set differences
                           The result will be an array containing all disallowed characters
               -           Remove all instances of these characters from the input
                S/         Split the input on spaces
                  3ew      Take overlapping slices of length 3
                     $     Sort them
                      e`   Take the run-length encoding
                        p  Print it

Business Cat

Posted 2017-03-04T23:16:06.187

Reputation: 8 927

1

R, 114 bytes

x=readline();y=strsplit(tolower(gsub('[[:punct:]]','',x)),' ')[[1]];s=3:length(y);table(paste(y[s-2],y[s-1],y[s]))

Reads input, removes punctuation, splits it into words and counts triplets.

Robert Hacken

Posted 2017-03-04T23:16:06.187

Reputation: 371