Compute the histogram entropy estimation of a string

19

1

Write a program or function that estimates the Shannon entropy of a given string.

If a string has n characters, d distinct characters, xi is the i th distinct character, and P(xi) is the probability of that character occuring in the string, then our Shannon entropy estimate for that string is given by:

H = -n\sum\limits_{i=1}^d P(x_i) \log_2 P(x_i)

For the estimation in this challenge we assume that the probability of a character occurring in a string is simply the number of times it occurs divided by the total number of characters.

Your answer must be accurate to at least 3 digits after the period.


Test cases:

"This is a test.", 45.094
"00001111", 8.000
"cwmfjordbankglyphsvextquiz", 122.211
"             ", 0.0

orlp

Posted 2016-04-25T17:19:08.600

Reputation: 37 067

Opposed to my usual challenges, this one looks complicated, but is actually quite simple :) – orlp – 2016-04-25T17:28:45.297

Related: http://codegolf.stackexchange.com/q/24316

– msh210 – 2016-04-25T17:56:39.067

Is it safe to assume printable ASCII for the input string? – AdmBorkBork – 2016-04-25T18:00:07.240

@TimmyD No. Any string that your language's string type supports. – orlp – 2016-04-25T18:02:34.223

Unfortunately, Mathematica's Entropy counts bits per character, not total for the string; oh well... – 2012rcampion – 2016-04-26T02:47:48.203

Answers

2

Jelly, 11 8 bytes

ċЀ÷Ll.S

Try it online!

Dennis

Posted 2016-04-25T17:19:08.600

Reputation: 196 637

Can I ask, how you enter those characters? With copy and paste? – Bálint – 2016-04-26T09:49:27.840

At least on Linux, they can all be typed on the US international keyboard. – Dennis – 2016-04-26T14:44:40.653

11

Python 3.3+, 64 bytes

import math
lambda s:sum(math.log2(len(s)/s.count(c))for c in s)

Got math.log2 from mbomb007's solution.

xnor

Posted 2016-04-25T17:19:08.600

Reputation: 115 687

So @orlp didn't give us a fully simplified formula, eh...? – mbomb007 – 2016-04-25T20:37:55.290

@mbomb007 Depends for what purpose you're simplifying. Writing it in terms of probabilities and distinct characters is natural as a definition, but for golfing it's shorter to work with counts and iterate over all characters. – xnor – 2016-04-25T20:41:02.010

2

APL, 18 14 bytes

+/2⍟≢÷(+/∘.=⍨)

This is an unnamed, monadic function train that accepts a string on the right and returns a real.

Like all good things in life, this uses xnor's formula. We get a matrix of booleans corresponding to the occurrences of each character in the string using ∘.=⍨, sum this along the first axis (+/) to get the number of occurrences of each character, divide the length of the string by each, then take log base 2 (2⍟) and sum.

Try it here

Saved 4 bytes thanks to Dennis!

Alex A.

Posted 2016-04-25T17:19:08.600

Reputation: 23 761

1

Perl 5, 58 bytes

A subroutine:

{for$a(@a=split'',pop){$t+=(log@a/grep/\Q$a/,@a)/log 2}$t}

A tip of my hat to xnor for the formula.

msh210

Posted 2016-04-25T17:19:08.600

Reputation: 3 094

-F doesn't work (in Strawberry, anyway) because it includes the $/. – msh210 – 2016-04-25T17:59:10.397

1

MATL, 17 bytes

S4#Y'ts/tZl*sGn_*

Try it online!

beaker

Posted 2016-04-25T17:19:08.600

Reputation: 2 349

You may be able to save some bytes with Ym – Luis Mendo – 2016-04-25T21:55:11.613

1

J - 18 16 14 bytes

1#.2^.#%1#.=/~

Shortened using the idea in Dennis' method.

Usage

   f =: 1#.2^.#%1#.=/~
   f 'This is a test.'
45.0936
   f '00001111'
8
   f 'cwmfjordbankglyphsvextquiz'
122.211
   f '             '
0

Explanation

1#.2^.#%1#.=/~  Input: string S
           =/~  Create a table testing for equality
        1#.     Convert each row from a list of base 1 digits to decimal
                This is equivalent to taking the sum and forms a list of tallies
      #         Get the length of S
       %        Divide the length by each tally
   2^.          Log base 2 of each
1#.             "Sum" those values and return

miles

Posted 2016-04-25T17:19:08.600

Reputation: 15 654

1I don't think this counts as a function. If you assign the code to a variable, it does something entirely different. – Dennis – 2016-04-26T06:00:58.047

@Dennis From what I gather, it appears J interprets it as a chain of compositions, using 3 : '... y' with the same syntax would be a valid way to define it as a function. J states that it evaluates from right-to-left, so I've refactored my code as a train. I don't like caps [: but I can't find any other way to make a train. – miles – 2016-04-26T09:19:21.710

1

JavaScript (ES6), 67 bytes

s=>[...s].map(c=>t+=Math.log2(s.length/~-s.split(c).length),t=0)&&t

I need to use ~-s.split because that accepts strings rather than regexps. As usual, map beats reduce by a byte.

s=>[...s].reduce((t,c)=>t+Math.log2(s.length/~-s.split(c).length),0)

Neil

Posted 2016-04-25T17:19:08.600

Reputation: 95 035

1

MATL, 14 bytes

!Gu=stGn/Zl*s|

Try it online!

!      % transpose implicit input into column vector
Gu     % row vector with unique elements of input
=      % test for equality, element-wise with broadcast
s      % sum of each column
tGn/   % duplicate. Divide by number of input characters
Zl     % binary logarithm
*      % element-wise multiplication
s      % sum of array
|      % absolute value. Display implicitly

Luis Mendo

Posted 2016-04-25T17:19:08.600

Reputation: 87 464

1

Julia, 37 bytes

x->sum(log2(endof(x)./sum(x.==x',1)))

Takes a character array as input. Try it online!

Dennis

Posted 2016-04-25T17:19:08.600

Reputation: 196 637

0

Pyth - 17 bytes

*_lQsm*FlBc/QdlQ{

Try it online here.

Maltysen

Posted 2016-04-25T17:19:08.600

Reputation: 25 023

0

Jolf, 26 bytes

_*liuΜGμiEd*γ/l miLeHlimzγ

Try it here! (Note that the test suite function is borked.)

Explanation

_*liuΜGμiEd*γ/l miLeHlimzγ
       μi                   unique members of i
      G  E                  split on ""
     Μ    d                 map over function
               _miLeH       match i with regex escaped member
             /l      li     divide length of (^) by length of i
            γ               γ = (^)
           *           mzγ  (^) * log_2(γ)
 *li                        (^) * length of i
_                           negate

Conor O'Brien

Posted 2016-04-25T17:19:08.600

Reputation: 36 228

0

Python 3.3+, 95 91 89 85 bytes

Simple solution. Version 3.3 is required to use math.log2.

import math
def f(s):C=s.count;return-sum(C(x)*math.log2(C(x)/len(s))for x in set(s))

Try it online

mbomb007

Posted 2016-04-25T17:19:08.600

Reputation: 21 944

Do you think there's anything unnecessary here? n*sum(s.count(c)/n – orlp – 2016-04-25T20:11:43.723

@orlp Thanks. I originally had a separate function for finding the probability, but had pasted it inside twice and deleted it to save chars. – mbomb007 – 2016-04-25T20:16:40.360

You don't have to store n in a variable now that you only use it once. – Maltysen – 2016-04-25T20:32:10.520

0

Java 7, 207 bytes

double C(String x,Map<Character,Integer>f){double H=0,g;for(char c:x.toCharArray())f.put(c,f.containsKey(c)?f.get(c)+1:1);for(char c:f.keySet()){g=f.get(c);H+=g*Math.log(g/x.length())/Math.log(2);}return-H;}

Detailed try online

double log2(double d) { return Math.log(d) / Math.log(2); }

double C(String x, Map<Character,Integer>f)
{
    double H=0,g;

    // frequency
    for(char c : x.toCharArray())
    {
        f.put(c, f.containsKey(c) ? f.get(c)+1 : 1);
    }

    // calculate entropy
    for(char c : f.keySet())
    {
        g = f.get(c);
        H += g * log2(g / x.length());
    }

    return -H;
}

Khaled.K

Posted 2016-04-25T17:19:08.600

Reputation: 1 435

0

Factor, 98 bytes

[ [ length ] [ dup [ [ = ] curry dupd count ] { } map-as nip ] bi [ / log 2 log / ] with map sum ]

This is a direct translation of this Python answer. I'll add an explanation over dinner.

cat

Posted 2016-04-25T17:19:08.600

Reputation: 4 989

0

Racket, 130 bytes

:c

#lang racket
(require math)(λ(S)(let([s(string->list S)])(sum(map(λ(c)(/(log(/(length s)(count(λ(x)(char=? c x))s)))(log 2)))s))))

Translation of my Factor answer, so it's an indirect translation of Kenny Lau's Python answer.

cat

Posted 2016-04-25T17:19:08.600

Reputation: 4 989

0

k (32 bytes)

{-+/c*(log c%n:+/c:#:'=x)%log 2}

Or in q, the translation is not all that short but clearer:

{neg sum c*2 xlog c%n:sum c:count each group x}

skeevey

Posted 2016-04-25T17:19:08.600

Reputation: 4 139

0

Mathematica, 45 bytes

Tr[Log[2,Tr@#/#]#]&@Values@CharacterCounts@#&

Usage

This returns exact results so we approximate them with N.

  f = Tr[Log[2,Tr@#/#]#]&@Values@CharacterCounts@#&
  f["This is a test."]//N
45.0936
  f["00001111"]//N
8.
  f["cwmfjordbankglyphsvextquiz"]//N
122.211
  f["             "]//N
0.

miles

Posted 2016-04-25T17:19:08.600

Reputation: 15 654

0

R, 67 bytes

l=length(i<-strsplit(readline(),"")[[1]]);-sum(log2(l/table(i)[i]))

Explanation

Take input from stdin and split it into a list of characters. (This clunky syntax is why string golf challenges are so tough in R...)

         i<-strsplit(readline(),"")[[1]])

This assignment is hidden inside of a length command, so we get two assignments for the price of one. We have i, the list of characters, and l, its length.

l=length(i<-strsplit(readline(),"")[[1]]);

Now we calculate the entropy. R has a nice function table which returns the counts of all unique values. For input This is a test, table(i) returns

> table(i)
i
  . a e h i s t T 
3 1 1 1 1 2 3 2 1

This is indexed by characters, which is nice, as we can then use i as an index to get the count of each character, like so:

> table(i)[i]
i
T h i s   i s   a   t e s t . 
1 1 2 3 3 2 3 3 1 3 2 1 3 2 1 

The rest of the code is then a simple implementation of the entropy formula, flipped around a little.

                                           -sum(log2(l/table(i)[i]))

rturnbull

Posted 2016-04-25T17:19:08.600

Reputation: 3 689

Save two bytes (also your submission does not work on TIO) – JayCe – 2018-06-05T16:05:17.793

61 chars using utf8ToInt – JayCe – 2018-06-05T16:12:47.060

0

C#, 159 bytes

Golfed:

string f(string s){var l=s.Length;double sum=0;foreach(var item in s.GroupBy(o=>o)){double p=(double)item.Count()/l;sum+=p*Math.Log(p,2);}return(sum*=-l)+"";}}

Ungolfed:

string f(string s)
{
  var l = s.Length;
  double sum = 0;
  foreach (var item in s.GroupBy(o => o))
  {
    double p = (double)item.Count() / l;
    sum += p * Math.Log(p, 2);
  }
  return (sum *= -l) + "";
}

Test:

var codeGolf = new StringHistogramEntropyEstimation();
    Console.WriteLine(codeGolf.f("This is a test.")); //45.0935839298008
    Console.WriteLine(codeGolf.f("00001111")); //8
    Console.WriteLine(codeGolf.f("cwmfjordbankglyphsvextquiz")); //122.211432671668
    Console.WriteLine(codeGolf.f("             ")); //0

Pete Arden

Posted 2016-04-25T17:19:08.600

Reputation: 1 151

0

Groovy, 100 Bytes

{a->n=a.size();a.toList().unique().collect{p=a.count(it)/n;p*(Math.log(p)/Math.log(2.0f))}.sum()*-n}

Tests:

This is a test. = 45.09358393449714
00001111 = 8.0
cwmfjordbankglyphsvextquiz = 122.21143275636976
aaaaaaaa = -0.0

Magic Octopus Urn

Posted 2016-04-25T17:19:08.600

Reputation: 19 422