Generate a Texting Dictionary

6

3

this challenge was based on a sandboxed post by @dmckee

Old fashioned cell phones were and are used heavily as texting platforms, unfortunately the keyboard has only 12 + a few buttons, and is therefor ill-suited to naive text entry.

Common solutions make use of the traditional telephone coding (generally with a modest extension for 'q' and 'z'):

2: a,b,c
3: d,e,f
4: g,h,i
5: j,k,l
6: m,n,o
7: p(,q),r,s
8: t,u,v
9: w,x,y(,z)

In this case we are interested in the one-press-per-letter schemes that offer up words matching words on a frequency basis. That is if the user type "4663" the phone will offer up words like "good", "home", "gone", "hood", "hoof", etc that code to 4663 in order of their frequency.

Your task is to read one or more input files and generate a frequency sorted dictionary that could be used to implement this scheme.

Inputs

  • All input will be plain ASCII text.
  • You may accept input from named files, the standard input, or other conventional source supported by your implementation language, but there must be some mechanism to process several files in a single run.
  • Non-letter, non-whitespace characters should be stripped from the input, this means that contractions will be coded without their apostrophe: "don't" --> "dont" --> 3668
  • Words are a sequence of letters separated surrounded by whitespace
  • All whitespace (space, tab, newline, beginning-of-file, end-of-file) is equivalent.
  • Input will be all lowercase or caps your preference.

Output

  • The output consists of a ASCII text stream
  • You may direct this to a file, the standard output, or any other conventional destination supported by your implementation language
  • The output should consist of one line for each string of digits corresponding to at least one word in the input.
  • Each line should consist of the digit string and a list of all corresponding words sorted by frequency with the most frequent coming first. A sample line of output would be

    4663 good home gone hood hoof
    
  • Tokens (digit strings and corresponding words) should be separated by a single space.

  • Equally frequent words may be sorted in any order.

The winner is the shortest code.

Test Cases

potato, potato, damd, fjefje, djfaejri

Top 1000 words used.

The potato was a down goat goat down game gill pont pint point.

More actual test cases will be given in about a week to prevent any sort of hard-coding action.

Christopher

Posted 2017-05-31T17:29:16.827

Reputation: 3 428

What should functionality be when there is nothing of the correct length in the input file(s)? – Stephen – 2017-05-31T17:33:05.583

@StephenS you can bug out. – Christopher – 2017-05-31T17:33:51.777

in the bigO notation, O(n) = O(2n) – Felipe Nardi Batista – 2017-05-31T17:34:19.770

@FelipeNardiBatista oh yeah :P I am really bad at bigO stuff – Christopher – 2017-05-31T17:34:58.893

@Christopher you should add O(1), O(logN) and O(N*logN). the order would be O(1), O(logN), O(N), O(N*logN) and O(N^2) – Felipe Nardi Batista – 2017-05-31T17:36:49.190

2Can you provide a better explanation of "Big O"? I'm left wondering how you accurately measure efficiency across different languages and machines. – Shaggy – 2017-05-31T17:39:29.067

@Shaggy Well the Big O is the same everywhere, It is the rate of time as the input (n) gets bigger – Christopher – 2017-05-31T17:42:10.157

2"Top 1000 words will be used with each word repeated x times based on the position on the list (word one is put 1000 times two is 999... ect [sic])." is this regarding the test case or the input file, and what does it mean exactly? – Jonathan Allan – 2017-05-31T17:43:34.277

by All whitespace is equivalent do you mean that we have to support all whitespace as separators or only what we choose? – dzaima – 2017-05-31T17:50:30.317

@JonathanAllan that is a test case. You would get word one 1000 times word two 999 times, word three 998 times... – Christopher – 2017-05-31T17:55:00.463

@dzaima all whitespace – Christopher – 2017-05-31T17:55:14.043

There seems to be 2 parts to this challenge. (1) read in a dictionary and organize the words by frequency (2) apply this dictionary to an input of digits. It's not clear to me how you have divided these tasks in the challenge. "Your task is to ... generate a frequency sorted dictionary" but the test case involves potato twice and 3 non-words. Wut??! – Octopus – 2017-05-31T18:05:57.383

"You would get word one 1000 times word two 999 times, word three 998 times" what's the point of that? Not at all clear to me what you want us to do here. – Octopus – 2017-05-31T18:08:00.463

@Octopus why are they divided? They are one task – Christopher – 2017-05-31T18:31:11.227

Competing by big O doesn't make sense here. The output is exponentially long in the worst case. – xnor – 2017-05-31T18:32:27.360

@xnor done with it – Christopher – 2017-06-01T20:24:28.070

I would suggest that you focus on writing your own challenges... This taking old challenges and posting isn't working out: there's a reason why they were left derelict for ages – Beta Decay – 2017-06-04T15:10:04.227

@Christopher I wouldn't say its about rep, but more about reducing the number of low quality questions on PPCG – Beta Decay – 2017-06-04T15:24:52.543

@beta is this low quality? – Christopher – 2017-06-04T15:29:39.720

@Christopher I'm not sure – Beta Decay – 2017-06-04T15:33:05.197

Your test cases include capitalized words, like The. What to do with it ? – Setop – 2017-12-13T12:27:45.780

@Setop umm they are different iirc – Christopher – 2017-12-13T19:35:57.913

Your test cases should show the expected output – mbomb007 – 2017-12-13T20:20:30.083

@mbomb007 good point, I uhh didn't know how to do this problem so i checked and the pearl problem gets them right – Christopher – 2017-12-13T20:50:54.020

Answers

2

Jelly, 42 41 bytes

>”q+91N+O:3«9µ€Ḍ’0x
ċ@Þ³ṚQµÇ€⁸ṭ"Zḟ€0Ėṛ/Ðf

Try it online!

Monadic link take a list of strings as argument.

The footer is for formatting the output to a form suitable for outputting. Without it, the monadic link returns:

[[3, ["d", "f", "e"]], [4628, ["goav", "goat"]], [6346, ["mego", "mdim"]]]

(that is, a list, each has the format [number, list of words]).

The "Zḟ€0 idea was taken from this answer.


Explanation:

The first link, which given a string and output its telephone number, is mostly a port from my Python answer.

>”q+91N+O:3«9µ€Ḍ’0x     Link.
                        Implicit input, call it (st).
             µ€         Monadic link map for each character (ch):
>”q                     Larger than character 'q'? If yes, get 1, else get 0.
   +91N                 add 91, and then negate. Current value = -((ch>'q')+91)
       +O               Add the ord (ASCII value) of ch. Get ch-(ch>'q')-91.
         :3             Integer division by 3.
           «9           If the result is larger than 9, set it to 9. (« : min)
                        With the list of digits obtained by 
                        mapping the previous chain to characters:
               Ḍ        Convert from decimal digits to number.
                ’       Decrement.
                 0x     Repeat that many digits of zero.

Then, the second link:

  Þ                         Sort by...
ċ@                            count. (frequency)
   ³                        I have no idea what is this used for.
    Ṛ                       Reverse. (so that words with max count appear first)
     Q                      Unique.
      µ                     With that value,
       Ç                    Map the previous link over this word list.
         ṭ"                 Vectorize tack with...
        ⁸                     the word list.
           Z                Zip (transpose).
            ḟ€0             Filter out 0 from each element.
               Ė            Enumerate.
                  Ðf        Filter, keep elements with truthy...
                ṛ/            last element.

user202729

Posted 2017-05-31T17:29:16.827

Reputation: 14 620

5

Mathematica, 202 bytes

{g[#],##}~r~" "&@@@GatherBy[#&@@@SortBy[Tally@StringReplace[StringSplit@#,x_/;!LetterQ@x->""],-Last@#&],g=""<>ToString/@(\[LeftFloor]If[#<113,(#-91)/3,(#-88.2)/3.4]\[RightFloor]&/@ToCharacterCode@#)&]~(r=Riffle)~"
"<>""&

Note that \[LeftFloor] and \[RightFloor] are 3 bytes each.

user202729

Posted 2017-05-31T17:29:16.827

Reputation: 14 620

Well since nobody answered besides you :/ – Christopher – 2017-07-19T01:25:13.890

4

C++, 566 533 bytes

-33 bytes thanks to Zacharý ( about 20 bytes ) and user202729 ( about 12 bytes )

#include<map>
#include<string>
#include<fstream>
#include<regex>
#include<set>
using namespace std;using s=string;struct z{s d;int f;int operator<(z a)const{return f<=a.f&&d<a.d;}};regex r("[^a-z]");void g(){fstream e("a"),o("d",ios::out);map<s,int>m;while(!e.eof()){s t;e>>t;t=regex_replace(t,r,"");++m[t];}map<s,set<z>>t;for(auto a:m){s k;for(auto c:a.first)k+="22233344455566677778889999"[c-97];t[k].insert({a.first,a.second});}for(auto&a:t){o<<'\n'<<a.first<<' ';for(auto m=a.second.rbegin();m!=a.second.rend();++m)o<<m->d<<' ';}}

Uses a set to ensure that the words in the z struct are ordered, so a reverse iterator loop can be done.

HatsuPointerKun

Posted 2017-05-31T17:29:16.827

Reputation: 1 891

If you are using g++ you can use import and bits/stdc++.h. – user202729 – 2017-12-14T11:46:16.147

Also I think s("...")[k-97] can be replaced by "..."[k-97] (indexing a char[] also works); and there is a redundant space. / Why is const z& necessary? Just z is enough (although the performance would be bad) – user202729 – 2017-12-14T11:49:25.277

You don't need a a space here: return (f<=a.f&&d<a.d);=>return(f<=a.f&&d<a.d); – Zacharý – 2017-12-14T13:19:38.327

Also, using namespace std should save a few bytes if you get rid of all the std::s – Zacharý – 2017-12-14T13:22:42.867

@user202729 I use MSVC++. For the const required, see this answer. I can change it to a non constant value parameter though.

– HatsuPointerKun – 2017-12-14T15:19:05.497

Looks like you can use int instead of bool in operator<. Also probably this will work...

– user202729 – 2017-12-14T15:22:31.270

Using #define B a.second should save some bytes. – Zacharý – 2018-12-04T18:46:47.133

3

Perl 5, 123 bytes + 1 (-a) = 124 bytes

y/A-Z//dc&$k{y/A-Z/22233344455566677778889999/r}{$_}++for@F}{$,=$";say$_,sort{$k{$_}{$b}-$k{$_}{$a}}keys%{$k{$_}}for keys%k

Try it online!

Takes input as uppercase.

Xcali

Posted 2017-05-31T17:29:16.827

Reputation: 7 671

3

Clean, 245 ... 214 210 bytes

import StdEnv
r=removeDup
$s=toString[48+min((toInt c-91-toInt c/113)/3)9\\c<-:s]
?f=foldr(\s-> \l=[(a+if(s==b)1 0,b)\\(a,b)<-l])(zip([0,0..],r f))f
@f=r[(s,reverse(map snd(sort(?[l\\l<-f| $l==s]))))\\s<-map$f]

Try it online!

Οurous

Posted 2017-05-31T17:29:16.827

Reputation: 7 916

1

Röda, 218 bytes

{g={|n|chars n|[("2223334445556667777888999"/"")[ord(_)-97]]|concat}m=new map
replace`[^\w\s]|_`,""|(_/`\s+`)|sort|count|m[g(_)]+=[-_,_1]if m[g(_1)]=[]unless[m[g(_1)]?]
keys m|sort|{|k|[k];m[k]|sort|_|[` $_2`];[`
`]}_}

Try it online!

fergusq

Posted 2017-05-31T17:29:16.827

Reputation: 4 867

1

Java 8, 480 + 19 = 499 bytes

A Runnable lambda. Byte count includes lambda and required import. Input and output are as specified through standard in/out. Input is in all lowercase. Dictionary items are not printed in any particular order.

import java.util.*;

()->{Map<String,Integer>f=new HashMap();new Scanner(System.in).useDelimiter("\\s+").forEachRemaining(n->f.put(n=n.replaceAll("[^a-z]",""),f.getOrDefault(n,0)+1));Map<String,Set>n=new HashMap();f.forEach((k,v)->{String c=k;for(int i=0;i<8;)c=c.replaceAll("["+"abc def ghi jkl mno pqrs tuv wxyz".split(" ")[i]+"]",++i+1+"");n.putIfAbsent(c,new HashSet());n.get(c).add(k);});n.forEach((k,v)->System.out.println(v.stream().sorted((y,z)->f.get(z)-f.get(y)).reduce(k,(y,z)->y+" "+z)));}

Try It Online

Ungolfed lambda

() -> {
    Map<String, Integer> f = new HashMap();
    new Scanner(System.in)
        .useDelimiter("\\s+")
        .forEachRemaining(
            n -> f.put(
                n = n.replaceAll("[^a-z]", ""),
                f.getOrDefault(n, 0) + 1
            )
        )
    ;

    Map<String, Set> n = new HashMap();
    f.forEach((k, v) -> {
        String c = k;
        for (int i = 0; i < 8;)
            c = c.replaceAll(
                "[" + "abc def ghi jkl mno pqrs tuv wxyz".split(" ")[i] + "]",
                ++i + 1 + ""
            );
        n.putIfAbsent(c, new HashSet());
        n.get(c).add(k);
    });

    n.forEach((k, v) ->
        System.out.println(
            v.stream()
                .sorted((y, z) -> f.get(z) - f.get(y))
                .reduce(k, (y, z) -> y + " " + z)
        )
    );
}

f is my frequency map, and n collects words based on their numeric encoding. As usual with programs this long, this is probably far from optimal. Improvements are welcome.

I could have replaced ++i+1+"" in the number encoding step with ""+-~++i for the same number of bytes. But I respect you all too much to do that.

Jakob

Posted 2017-05-31T17:29:16.827

Reputation: 2 428

this does not handle uppercase or capitalized words – Setop – 2017-12-13T12:22:00.457

1@Setop Taking just uppercase or just lowercase was allowed in the challenge specification. – Xcali – 2017-12-13T14:09:50.187

1

PHP, 273 bytes

<?$f=str_split;$w=explode(" ",preg_replace(['/\s+/','/[^a-z ]/is'],[' ',''],$argv[1]));$l=$f('22233344455566677778889999');foreach($w as$x){$z='';foreach($f($x)as$y){$z.=$l[ord($y)-97];}$a[$z][$x]++;}foreach($a as$n=>$b){arsort($b);echo$n." ".join(" ",array_keys($b))."
";}

Try it online!

Jo.

Posted 2017-05-31T17:29:16.827

Reputation: 974

1

Python 3, 179 156 155 bytes

Thanks Rod for -23 bytes!
Thanks anonymous user suggestion for -1 byte!

def f(l):
	d={}
	for w in l:k=(*[min(27,ord(x)-91-(x>'q'))//3for x in w],);d[k]=k in d and{w}|d[k]or{w}
	for k in d:print(k,sorted(d[k],key=l.count)[::-1])

Try it online!

user202729

Posted 2017-05-31T17:29:16.827

Reputation: 14 620

what is the purpose of min(27,...) ? – Setop – 2017-12-13T12:34:00.430

@Setop Otherwise z will become 10, not 9. – user202729 – 2017-12-13T12:54:54.640

Based on your solution, I had a try with defaultdict(Counter). I expected to save some bytes by avoiding count, but ended up spending a lot of bytes printing the result. – Setop – 2017-12-13T14:14:04.230