N-gram histogram

6

Given a text and a number N, output, in a human readable format, the set of the character N-grams with their respective percentages.

The smallprint

  • The text can be read from a file, stdin, passed as string parameter to a function or any other common way i forgot to mention
  • The number N can be hardcoded, but only if it needs to be changed in one single place to change the functionality, the number counts as one byte regardless of the number of digits (T=input();N=100;f(T,N) would count as 20 bytes, not 22)
  • No case sensitivity and all non-letter characters are treated the same. Inputs Abra Cadabra! and aBRa#caDAbrA5 should produce the same output. The character for non-letter can be any printable non-letter, not necessarily contained in the input
  • You may assume N < text.length
  • Output can be written to a file or displayed on screen. If you use a function you can return the result instead of printing as long as it is human readable (a JSON object or array like in the examples below is fine, a custom class or a sparse matrix with 27^N entries - not so much)

Examples with text = 'Abra Kadabra!'
N = 1

a 0.385
b 0.154
r 0.154
_ 0.154
k 0.076923
d 0.076923

N = 12

abra_kababra 0.5
bra_kadabra_ 0.5

N = 2 (dont worry about number format or rounding as long as it sums up to 0.99 < x < 1.01)

ab 0.1666666
br 0.16666
ra 0.1666
a. 0.166667
.k 0.0833333
ka 0.0833333
ad 8.33333e-2
da 0.0833333

a tree structure is fine too

{'a':{'b':0.166, '.':0.166, 'd': 0.0833},
 'b':{'r':0.166},
 'r':{'a':0.166}, 
 '.':{'k':0.0833}, 
 'k':{'a':0.0833}, 
 'd':{'a':0.0833}}

or a list of lists

[['A', 'B', 0.166], ['B', 'R', 0.166], ...]

DenDenDo

Posted 2015-03-27T11:43:30.037

Reputation: 2 811

4I'd suggest explaining exactly what N-grams are instead of just linking. – Calvin's Hobbies – 2015-03-27T11:54:32.563

If we hardcode N how do we count the bytes? Without the value of N? Or for a single-digit N? – Martin Ender – 2015-03-27T12:53:21.273

Another question: can the character representing non-letters be anything? Including a space? – Martin Ender – 2015-03-27T12:57:02.777

And more interestingly, can it be a letter? – FUZxxl – 2015-03-27T14:34:32.433

@FUZxxl If your language can only print letters, then you can use an uppercase letter while all "real" letters are lowercase or the other way round – DenDenDo – 2015-03-27T15:37:36.223

Answers

4

Pyth, 40 35 27 character

KCm>m?k}kGNrz0dQ{m,bc/KblKK

Try it online: Pyth Compiler/Executor

For the input Abra Kadabra!, 2 it prints something like (the order is each time different, because of the behavior of set):

{(('d', 'a'), 0.08333333333333333), (('r', 'a'), 0.16666666666666666), (('b', 'r'), 0.16666666666666666), (('"', 'k'), 0.08333333333333333), (('a', 'b'), 0.16666666666666666), (('k', 'a'), 0.08333333333333333), (('a', '"'), 0.16666666666666666), (('a', 'd'), 0.08333333333333333)}

I use " as special character.

Explanation

m?k}kGNrz0         implicit: z = raw_input() # input_string
m      rz0         map each character k of z.lower() to:
  k   N                k                     '"'
 ? }kG                    if k in "a-z" else 
We call the result X. 

KCm>XdQ             implicit: Q = input() # number N
  m   Q             map each d in [0, 1, 2, ..., Q - 1] to:
   >Xd                  X[d:] # all characters of X starting at position d
 C                  zip - creates all N-grams
K                   store the results in variable K

{m,bc/KblKK
 m        K         map each b in K to:
  ,b                    the pair (b,                    )
     /Kb                             K.count(b)
    c   lK                                      / len(K)
{             set - remove duplicates

Printing in a 'human readable format': 35 character

KCm>m?k}kG\_rz0dQjbo_eN{m,sbc/KblKK

Try it online: Pyth Compiler/Executor

For the input: Abra Kadabra!, 2 it prints:

('a_', 0.16666666666666666)
('ab', 0.16666666666666666)
('ra', 0.16666666666666666)
('br', 0.16666666666666666)
('ka', 0.08333333333333333)
('da', 0.08333333333333333)
('_k', 0.08333333333333333)
('ad', 0.08333333333333333)

Explanation (differences)

\_ instead of N in m?k}kGNrz0
Uses '_' instead of '"'

s in {m,sbc/KblKK
This combines the individual characters

jbo_eN
  o                 order the elements N by:
   _eN                  the value -N[-1] (by the negative percentage)
jb                  join by newlines - prints each pair in row

Jakube

Posted 2015-03-27T11:43:30.037

Reputation: 21 462

all three versions are valid, and it does not need to be sorted – DenDenDo – 2015-03-27T15:42:34.907

@DenDenDo O.k. thanks. I shortened it a lot. – Jakube – 2015-03-27T16:15:01.663

2

J, 57 55 48 characters.

As a dyadic verb, N goes left, text goes right.

[:((a.{~97+~.);"#:#/.~%#)((u:97+i.26)i.tolower)\

Example

2 ([:((a.{~97+~.);"#:#/.~%#)((u:97+i.26)i.tolower)\) 'Abra Kadabra!'

yields:

┌──┬─────────┐
│ab│0.166667 │
├──┼─────────┤
│br│0.166667 │
├──┼─────────┤
│ra│0.166667 │
├──┼─────────┤
│a{│0.166667 │
├──┼─────────┤
│{k│0.0833333│
├──┼─────────┤
│ka│0.0833333│
├──┼─────────┤
│ad│0.0833333│
├──┼─────────┤
│da│0.0833333│
└──┴─────────┘

Thanks to randomra for a nine character improvement.

Explanation

This is the solution with spaces inserted in the appropriate places:

[: ((a. {~ 97 + ~.) ;"#: #/.~ % #) ((u: 97 + i. 26) i. tolower)\

First, observe that "#: is shorthand for "1 0 and that u: 97 + i. 26 is just 'abcdefghijklmnopqrstuvwxyz', so without these two tricks we get

[: ((a. {~ 97 + ~.) ;"1 0 #/.~ % #) ('abcdefghijklmnopqrstuvwxyz' i. tolower)\

And this is how this works. In the following explanation, x and y are the left and right arguments, u and v are verbs, and m and n are nouns.

  1. [: u v—a capped fork, x ([: u v) y is equal to u (x v y). This is used instead of u@v because v is an adverbial phrase and parentheses were needed. In this code, u is (a. {~ 97 + ~.) ;"#: #/.~ % # and v is ((u: 97 + i. 26) i. tolower)\.

  2. x u\ y—apply u to each infix of y of length x. For example, 3 ]\ 'abcdefg' where ] is the identity function yields

    abc
    bcd
    cde
    def
    efg
    
  3. n u v—a fork with a noun as the left argument. (n u v) y is equal to n u (v y). This is used multiple times in the code.

  4. tolower y–convert all upper case letters in y to lower case.

  5. x i. y–find the index of y, which is in this case for each item in y the lowest index into x at which that item can be found or the number of items in x if such an item does not exist. In our use case, x is the lower case alphabet. ('abcdefghijklmnopqrstuvwxyz' i. tolower) y translated each (upper case or lower case) letter in y into a number from 0 to 25. Non-letters are translated into 26. Notice that the character immediately following z in ASCII is {.

  6. x u/. y–applies u to y keyed by x. That is, u is applied to groups of items of y with the same key in x. Noticeably, if x and y are equal, y u/. y applies u to groups of equal items of y and can be used to generate frequency statistics.

  7. # y–the tally of y, that is, the number of items in y.

  8. u~–the reflexive of u, that is, u~ y is equal to y u y.

  9. u v w–a fork, (u v w) y is equal to (u y) v (w y). This, too, is used multiple times in the code.

  10. x % yx divided by y.

  11. #/.~ % #–the frequency count we want sans what we count.

  12. ~. y–the nub of y, that is, each distinct item of y. For instance, ~. 1 2 3 3 4 6 6 6 yields 1 2 3 4 6.

  13. (a. {~ 97 + ~.) y–the nub of y converted from indices into the alphabet back to letters; 26 is converted into { implicitly.

  14. x ; yx linked to y, that is, both put into boxes and the boxes concatenated. For instance, 'abc' ; 1 2 3 4 yields

    ┌───┬───────┐
    │abc│1 2 3 4│
    └───┴───────┘
    

    This is needed because in J, arrays are homogeneous; all items in an array need to have same type. A box hides what is in it and turns heterogeneous arrays into arrays of boxes.

  15. u"nu ranked n. Rank is a slightly complex concept so I won't explain it here, but in this case, x ;"1 0 y (for which x ;"#: y is just a shorthand) applies ; to each string in x and the corresponding number in y as opposed to the entire of x and the entire of y. This is the difference between

    ┌──┬───────────────────────────────────────────────────────────────────────────┐
    │ab│0.166667 0.166667 0.166667 0.166667 0.0833333 0.0833333 0.0833333 0.0833333│
    │br│                                                                           │
    │ra│                                                                           │
    │a{│                                                                           │
    │{k│                                                                           │
    │ka│                                                                           │
    │ad│                                                                           │
    │da│                                                                           │
    └──┴───────────────────────────────────────────────────────────────────────────┘
    

    and the correct output shown above.

FUZxxl

Posted 2015-03-27T11:43:30.037

Reputation: 9 656

[]\(u:97+i.26)i.tolower@] => ((u:97+i.26)i.tolower)\ for 2 bytes. – randomra – 2015-03-27T13:14:37.523

Getting rid of the adverse (::) and boxing/unboxing (&.>) saves 7 more bytes! [:((a.{~97+~.);"#:#/.~%#)((u:97+i.26)i.tolower)\ – randomra – 2015-03-27T14:12:26.307

This is really cool. – FUZxxl – 2015-03-27T14:26:30.747

2

CJam, 53 bytes

q~el_'{,97>-'_er_,,\f>1$(W*<f<:L__&_@f{\a/,(}L,df/]z`

Without having a look at the other CJam answer, here is a shorter one.

Input goes like:

2 "Abra Kadabra!"

Output is like:

[["ab" 0.16666666666666666] ["br" 0.16666666666666666] ["ra" 0.16666666666666666] ["a_" 0.16666666666666666] ["_k" 0.08333333333333333] ["ka" 0.08333333333333333] ["ad" 0.08333333333333333] ["da" 0.08333333333333333]]

Explanation soon.

Try it online here

Optimizer

Posted 2015-03-27T11:43:30.037

Reputation: 25 836

1

CJam, 61 59 bytes

I think the input transformation is quite nifty, but the actual tallying needs work.

l~qeu'~),_el&'_er_,,2$f{2$@><\}\W*<:L__&\f{1$a/,(L,d/][};]`

Reads N on the first line and then treats the remainder of STDIN as the text:

2
Abra Kadabra!

The result is printed as a nested CJam-style array with upper case letters and _ for non-letters:

[["AB" 0.16666666666666666] ["BR" 0.16666666666666666] ["RA" 0.16666666666666666] ["A_" 0.16666666666666666] ["_K" 0.08333333333333333] ["KA" 0.08333333333333333] ["AD" 0.08333333333333333] ["DA" 0.08333333333333333]]

Test it here.

Martin Ender

Posted 2015-03-27T11:43:30.037

Reputation: 184 808

1

Clip 10, 53

\[tm[b{(b/)b)mlxt`}q[bm[juj+jtb}R)mlbt]a[!cA}m.U`xS]n

Example:

C:\Users\~~~~~\Documents>java -jar Clip10.jar ngram.clip
Abra Kadabra!
2
{{"AB", 0.166666666666666}, {"A ", 0.166666666666666}, {"RA", 0.166666666666666}, {"BR", 0.166666666666666}, {"DA", 0.083333333333333}, {" K", 0.083333333333333}, {"KA", 0.083333333333333}, {"AD", 0.083333333333333}}

Explanation

 [t                                               ]n .- Assign the second input to t -.
                                            m.U`x    .- Convert the input to upper case -.
                                      a[!cA}     S   .- Replace all non-alphabetic characters with spaces -.
                   [b                ]               .- Assign the modified string to b -.
                     m[juj+jtb}R)mlbt                .- All substrings of length t -.
                  q                                  .- Frequencies (in the form {"A", 1} )-.
   m[b{(b/)bmlxt`}                                   .- Use proportional frequencies
\                                                    .- Make the array pretty -.

Ypnypn

Posted 2015-03-27T11:43:30.037

Reputation: 10 485

0

APL (Dyalog Classic) with MiServer's Strings.lc, 30 bytes

Prompts for text, then for N. Prints N-grams and frequencies.

(∪,⊢∘≢⎕U2338÷≢)⎕,/'\W'⎕R'_'lc⍞

Try it online!

Only 25 characters (but 44 bytes) in Dyalog Unicode:

(∪,⊢∘≢⌸÷≢)⎕,/'\W'⎕R'_'lc⍞

… because ⎕U2338, but is not included in the Classic character set.

Explanation

 get text input

lclowercase

'\W'⎕R'_' PCRE Replace non-word characters with underscores

,/ concatenate the characters in each possible position of a sliding window of size

 number-input

() apply the following anonymous tacit function on that (the N-grams):

 the unique (N-grams)

, followed by

⊢∘≢⌸ the number occurrences for each unique element (of the N-grams)

÷ divided by

 the tally (of N-grams)

Adám

Posted 2015-03-27T11:43:30.037

Reputation: 37 779

0

Haskell, 164 bytes

import Data.Char
import Data.List
l=genericLength
c c|isAlpha c=toLower c|1<2='_'
n%i=[(s!!0,l s/l p)|s<-group p]where p=sort[take n s|s<-tails$map c i,length s>=n]

Usage: 12 % "Abra Kadabra!" -> [("abra_kadabra",0.5),("bra_kadabra_",0.5)]

This is longer than I thought. Does anyone know a shorter version of a "split list into sublists of length n" function (my attempt: [take n s|s<-tails$ i,length s>=n])?

How it works:

  • map all characters in the input list to lowercase or '_' if not a letter
  • split into a list of substrings of length n
  • sort, name the list p
  • group equal elements (e.g. for n=2 -> [["_k"],["a_","a_"],["ab","ab"],["ad"],["br","br"],["da"],["ka"],["ra","ra"]]
  • for each sublist take one of it's elements and pair it with the length divided by the length of p

nimi

Posted 2015-03-27T11:43:30.037

Reputation: 34 639

0

Mathematica, 124 119 bytes

Could probably golf further.

#[[1]]->#[[2]]&/@#/Length@#&@Tally@Partition[Characters@StringReplace[ToLowerCase@#,Except@LetterCharacter->"_"],#2,1]&

LegionMammal978

Posted 2015-03-27T11:43:30.037

Reputation: 15 731

1The doesn't ask for percent so you don't need the 100. It also doesn't even specify that the results should be floating point ("dont worry about number format or rounding as long as it sums up to 0.99 < x < 1.01") so I think you can actually report the results as rationals and skip the N@. Also, I think what you're trying to do with GroupBy and Length is what Tally is for. – Martin Ender – 2015-03-29T16:33:00.850

@MartinBüttner Thanks! The spec wasn't exactly clear... EDIT: Dangit my Mathematica trial expired... guess I'll have to try to read my own mess D: – LegionMammal978 – 2015-03-31T10:42:56.147