How many characters per character?

15

1

At http://shakespeare.mit.edu/ you can find the full text of each of Shakespeare's plays on one page (e.g. Hamlet).

Write a script that reads the URL of a play from stdin, such as http://shakespeare.mit.edu/hamlet/full.html, and writes to stdout the number of text characters each play character spoke, sorted from the character who spoke the most to the one who spoke the least.

The play/scene/act titles obviously do not count as dialogue, nor do the character names. Italicized text and [square bracketed text] are not actual dialogue and should not be counted. Spaces and punctuation within dialogue should be counted.

(The format for the plays looks very consistent though I have not looked at them all. Tell me if I've overlooked anything. Your script does not have to work for the poems.)

Example

Here is a simulated section from Much Ado About Nothing to show what I expect for output:

More Ado About Nothing

Scene 0.

Messenger

I will.

BEATRICE

Do.

LEONATO

You will never.

BEATRICE

No.

Expected output:

LEONATO 15
Messenger 7
BEATRICE 6
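
As a sanity check on the counting rule, here is a minimal Python sketch (not a golfed answer and not part of the original challenge) that tallies the simulated section above. The speeches are typed in by hand rather than fetched from the site, and the helper names `tally` and `report` are made up for illustration.

```python
from collections import Counter

# (speaker, line) pairs transcribed by hand from the simulated section above.
speeches = [
    ("Messenger", "I will."),
    ("BEATRICE", "Do."),
    ("LEONATO", "You will never."),
    ("BEATRICE", "No."),
]

def tally(pairs):
    """Sum the length of every spoken line per speaker (spaces and punctuation count)."""
    counts = Counter()
    for speaker, line in pairs:
        counts[speaker] += len(line)
    return counts

def report(counts):
    """Return 'NAME COUNT' strings, largest count first."""
    return [f"{name} {n}" for name, n in counts.most_common()]

if __name__ == "__main__":
    print("\n".join(report(tally(speeches))))
```

BEATRICE speaks twice, so her two three-character lines add up to 6, which puts her below Messenger.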

Scoring

This is code golf. The smallest program in bytes will win.

Calvin's Hobbies

Posted 2014-07-25T20:29:54.950

Reputation: 84 000

What if someone did this Shakespeare challenge in Shakespeare? It would be amazing if that was even possible... – fuandon – 2014-07-25T20:37:25.063

Can we assume we have a list of the characters in the play? Or must we infer the characters from the text? The latter is very difficult given that some characters (e.g. Messenger) have a mix of upper and lower case letters. Others have names with only upper case letters (e.g. LEONATO); and some of those are compound names. – DavidC – 2014-07-25T21:54:32.760

Yes you should infer the names. They are formatted very differently than the dialogue so given the html differentiating them shouldn't be too tricky. – Calvin's Hobbies – 2014-07-25T22:40:02.877

Yes, perhaps if one works directly with the HTML... – DavidC – 2014-07-25T22:54:14.387

Should 'All' be considered as a separate character? – es1024 – 2014-07-25T23:09:59.710

@DavidCarraher Well I'm not sure why you wouldn't use the html unless you couldn't... I'll ignore the names rule iff you are using the Shakespeare language. – Calvin's Hobbies – 2014-07-25T23:58:29.277

@es1024 Yes. Any play character with a unique title is considered separate, even if the result does not exactly make sense. – Calvin's Hobbies – 2014-07-26T00:01:44.387

Doesn't really qualify, but playing with JavaScript just for fun -> FIDDLE – adeneo – 2014-07-26T00:53:29.503

Answers

4

PHP (240 characters)

Splits the HTML into chunks (using '/bl' as a delimiter), then runs a couple of regular expressions on each chunk to extract the speaker name and the words spoken. The length of the words spoken is accumulated in an array keyed by name. Golfed:

<?@$p=preg_match_all;foreach(explode('/bl',implode(file(trim(fgets(STDIN)))))as$c)if($p('/=s.*?b>(.*?):?</',$c,$m)){$p('/=\d.*?>(.*?)</',$c,$o);foreach($m[1]as$n)@$q[$n]+=strlen(implode($o[1]));}arsort($q);foreach($q as$n=>$c)echo"$n $c\n";

Ungolfed:

<?php
$html = implode(file(trim(fgets(STDIN))));
$arr = explode('/bl',$html);
foreach($arr as $chunk){
    if(preg_match_all('/=s.*?b>(.*?):?</',$chunk,$matches)){
        $name = $matches[1];
        preg_match_all('/=\d.*?>(.*?)</',$chunk,$matches);
        foreach($name as $n)
            @$names[$n] += strlen(implode($matches[1]));
    }
}
arsort($names);
foreach($names as $name=>$count)
    echo "$name $count\n";

Note: This considers 'All' to be a separate character.
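
For readers who do not use PHP, the same split-then-regex idea can be sketched in Python. This is only an illustration: the `SAMPLE` markup is a hand-written assumption modelled on the patterns the answers match against (`<A NAME=speech…><b>…</b></a>` for names, `<A NAME=1.1.1>…</a>` for dialogue), not text fetched from the site, and `count_characters` is a hypothetical helper name. Like the PHP version, it makes no attempt to strip [square bracketed text].

```python
import re
from collections import defaultdict

# Hand-written sample imitating the assumed page markup (an assumption, not real site output).
SAMPLE = """<A NAME=speech1><b>Messenger</b></a>
<blockquote>
<A NAME=1.1.1>I will.</a><br>
</blockquote>

<A NAME=speech2><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.2>Do.</a><br>
<p><i>Exit</i></p>
</blockquote>

<A NAME=speech3><b>LEONATO</b></a>
<blockquote>
<A NAME=1.1.3>You will never.</a><br>
</blockquote>

<A NAME=speech4><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.4>No.</a><br>
</blockquote>
"""

# Same patterns as the PHP answer: speaker anchors start with "=s", dialogue anchors with "=<digit>".
NAME_RE = re.compile(r"=s.*?b>(.*?):?<")
LINE_RE = re.compile(r"=\d.*?>(.*?)<")

def count_characters(html):
    """Return (name, count) pairs sorted by count, largest first."""
    totals = defaultdict(int)
    # Splitting on '/bl' cuts the page at each closing blockquote tag.
    for chunk in html.split("/bl"):
        names = NAME_RE.findall(chunk)
        if not names:
            continue  # chunk holds no speaker, e.g. the page footer
        spoken = "".join(LINE_RE.findall(chunk))
        for name in names:
            totals[name] += len(spoken)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for name, count in count_characters(SAMPLE):
        print(name, count)
```

The italic stage direction in the second speech is skipped only because it is not wrapped in a numbered anchor, which is the same reason the PHP regular expression ignores it.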

Example:

$php shakespeare.php <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 60063
KING CLAUDIUS 21461
LORD POLONIUS 13877
HORATIO 10605
LAERTES 7519
OPHELIA 5916
QUEEN GERTRUDE 5554
First Clown 3701
ROSENCRANTZ 3635
Ghost 3619
MARCELLUS 2350
First Player 1980
OSRIC 1943
Player King 1849
GUILDENSTERN 1747
Player Queen 1220
BERNARDO 1153
Gentleman 978
PRINCE FORTINBRAS 971
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 330
FRANCISCO 287
LUCIANUS 272
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 94
All 94
Danes 75
Servant 49
CORNELIUS 45

es1024

Posted 2014-07-25T20:29:54.950

Reputation: 8 953

Kindly show some example(s) of output. – DavidC – 2014-07-25T23:51:17.497

@DavidCarraher An example has been added. – es1024 – 2014-07-26T00:09:57.533

3

Rebol - 556 527

t: complement charset"<"d: charset"0123456789."m: map[]parse to-string read to-url input[any[(s: 0 a: copy[])some["<A NAME=speech"some d"><b>"copy n some t</b></a>(append a trim/with n":")some newline]<blockquote>newline any["<A NAME="some d">"copy q some t</a><br>newline(while[f: find q"["][q: remove/part f next find f"]"]s: s + length? trim head q)|<p><i>some t</i></p>newline][</blockquote>|</body>](foreach n a[m/:n: either none? m/:n[s][s + m/:n]])| skip]]foreach[x y]sort/reverse/skip/compare to-block m 2 2[print[x y]]

This could probably be golfed further; however, it's unlikely to get below the answer(s) already provided :(

Ungolfed:

t: complement charset "<"
d: charset "0123456789."
m: map []

parse to-string read to-url input [
    any [
        (s: 0 a: copy [])

        some [
            "<A NAME=speech" some d "><b>" copy n some t </b></a>
            (append a trim/with n ":")
            some newline
        ]

        <blockquote> newline
        any [
            "<A NAME=" some d ">" copy q some t </a><br> newline (
                while [f: find q "["] [
                    q: remove/part f next find f "]"
                ]
                s: s + length? trim head q
            )
            | <p><i> some t </i></p> newline
        ]
        [</blockquote> | </body>]
        (foreach n a [m/:n: either none? m/:n [s] [s + m/:n]])

        | skip
    ]
]

foreach [x y] sort/reverse/skip/compare to-block m 2 2 [print [x y]]

This program removes [square bracketed text] and also trims surrounding whitespace from the dialogue. Without these two steps the output is identical to es1024's answer.
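
The bracket removal and trimming step can be sketched in Python as follows. This is an approximation of what the Rebol `while` loop and `trim` do, not a translation of the whole program; `spoken_length` is a hypothetical helper name and the sample lines in the test calls are invented. One difference to note: an unmatched "[" is simply left in place here.

```python
import re

# Matches one [bracketed] span, brackets included, with no nesting.
BRACKETED = re.compile(r"\[[^\]]*\]")

def spoken_length(line):
    """Length of a dialogue line after dropping [bracketed] spans and outer whitespace."""
    return len(BRACKETED.sub("", line).strip())

if __name__ == "__main__":
    # Invented lines, only to show the effect of each step.
    for sample in ["Do.", "[Aside] No.", "  You will never. "]:
        print(repr(sample), spoken_length(sample))
```

Only leading and trailing whitespace is trimmed, so a bracketed span in the middle of a line leaves both of its neighbouring spaces in the count.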

Example:

$ rebol -q shakespeare.reb <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 59796
KING CLAUDIUS 21343
LORD POLONIUS 13685
HORATIO 10495
LAERTES 7402
OPHELIA 5856
QUEEN GERTRUDE 5464
First Clown 3687
ROSENCRANTZ 3585
Ghost 3556
MARCELLUS 2259
First Player 1980
OSRIC 1925
Player King 1843
GUILDENSTERN 1719
Player Queen 1211
BERNARDO 1135
Gentleman 978
PRINCE FORTINBRAS 953
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 312
FRANCISCO 287
LUCIANUS 269
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 89
All 76
Danes 51
Servant 49
CORNELIUS 45

draegtun

Posted 2014-07-25T20:29:54.950

Reputation: 1 592

0

Common Lisp - 528

(use-package :plump)(lambda c(u &aux(h (make-hash-table))n r p)(traverse(parse(drakma:http-request u))(lambda(x &aux y)(case p(0(when(and n(not(ppcre:scan"speech"(attribute x"NAME"))))(setf r t y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))(dolist(w n)(incf(gethash w h 0)(length y)))))(1(if r(setf n()r()))(push(intern(text(aref(children x)0)))n)))):test(lambda(x)(and(element-p x)(setf p(position(tag-name x)'("A""b"):test #'string=)))))(format t"~{~a ~a~^~%~}"(alexandria:hash-table-plist h)))

Explanation

This is a slightly modified version which adds some debug printing (see paste).

(defun c (u &aux
                 (h (make-hash-table)) ;; hash-table
                 n ;; list of the most recently seen character name(s)
                 r p
                 )
      (traverse                 ;; traverse the DOM generated by ...
       (parse                   ;; ... parsing the text string
        (drakma:http-request u) ;; ... resulting from http-request to link U
        )

       ;; call this anonymous function for each traversed element
       (lambda (x &aux y)
         (case p
           (0 ;a
            (when(and n(not(alexandria:starts-with-subseq"speech"(attribute x "NAME"))))
              (setf r t)
              (setf y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))
              (format t "~A ~S~%" n y) ;; debugging
              (dolist(w n)
                (incf
                    (gethash w h 0) ;; get values in hash, with default value 0
                    (length y)))) ;; length of text
            )
           (1 ;b
            (if r(setf n()r()))
            (push (intern (text (aref (children x)0)))n))))

       ;; but only for elements that satisfy the test predicate
       :test
       (lambda(x)
         (and (element-p x) ;; must be an element node
              (setf p(position(tag-name x)'("A""b"):test #'string=)) ;; either <a> or <b>; save result of "position" in p
              )))

        ;; finally, iterate over the elements of the hash table, as a
        ;; plist, i.e. a list of alternating key values (k1 v1 k2 v2 ...),
        ;; and print them as requested. ~{ ~} is an iteration control format.
  (format t "~&~%~%TOTAL:~%~%~{~a ~a~^~%~}" (alexandria:hash-table-plist h)))

Notes

  • I remove bracketed text as well as the "aside: " occurrence that is not enclosed in brackets (I also trim whitespace characters). Here is a trace of execution with the text being matched and the total for each character, for Hamlet.

  • As in the other answers, All is assumed to be a character. It could be tempting to add the value of All to every other character, but this would be incorrect, since "All" refers to the characters actually present on stage. Handling that would require keeping track of who is present (following the "exit", "exeunt" and "enter" indications). This is not done.

coredump

Posted 2014-07-25T20:29:54.950

Reputation: 6 292