Finding the most 'unique' word

12

4

Using your language of choice, write the shortest function/script/program you can that will identify the word with the highest number of unique letters in a text.

  • Unique letters should include any distinct character using UTF-8 encoding.
    • Upper and lower case versions of the same character are different and distinct; 'a' != 'A'
  • Words are bound by any whitespace character.
  • 'Letters' are any symbol which can be represented by a single unicode character.
  • The text document must be read in by your code -- no preloading/hard-coding of the text allowed.
  • The output should be the word, followed by the count of unique letters.
    • llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 18
    • Any delimiter/spacing between the two values is up to you, so long as there is at least one character to separate them.
  • In the event more than one word exists with the highest count, print all words for that count, with one new line delimiting.
    superacknowledgement - 16
    pseudolamellibranchiate - 16
  • This is code golf, so shortest code wins.
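For reference, an ungolfed sketch of the rules above might look like this in Python 3. The names `most_unique_words` and `report` are hypothetical, and printing a repeated word only once is an assumption, since the challenge does not say how duplicates should be handled.

```python
import io


def most_unique_words(text):
    """Return (words, count) for the words with the most distinct characters."""
    # str.split() with no argument splits on any run of Unicode whitespace.
    words = text.split()
    if not words:
        return [], 0
    # set(word) holds distinct code points, so 'a' and 'A' count separately.
    best = max(len(set(word)) for word in words)
    # Assumption: keep first-seen order and list a repeated word only once.
    winners = list(dict.fromkeys(w for w in words if len(set(w)) == best))
    return winners, best


def report(stream):
    """Read all text from a file-like object and return the output lines."""
    winners, best = most_unique_words(stream.read())
    return [f"{word} - {best}" for word in winners]


# Example with an in-memory stream; pass sys.stdin or an open file for real input.
sample = io.StringIO("superacknowledgement cat\npseudolamellibranchiate")
print("\n".join(report(sample)))
```

With the two example words from the rules, this prints both of them with a count of 16, one per line.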

This answer on English.SE inspired me to create this challenge. The example uses just a word list, but any text should be able to be processed.

Gaffi

Posted 2013-02-05T14:03:13.340

Reputation: 3 411

1How are words separated? You say unique letters are any UTF-8 character, but that would imply that the entire file is just one word. – cardboard_box – 2013-02-05T14:54:46.307

@cardboard_box, my interpretation is that it's left flexible, so that you can decide to use code point 10 as a word separator, or code point 32, or treat any non-empty sequence of characters drawn from the 26 Unicode whitespace characters as a word separator. – Peter Taylor – 2013-02-05T15:15:23.100

@PeterTaylor Correct. I'll update the question to make note of the whitespace. – Gaffi – 2013-02-05T15:32:02.803

1How are you defining letters here? As I've just been and pointed out on one of the English.SE answers LlanfairPG is a Welsh word and contains letters from the Welsh alphabet - ll and ch are both single letters in the Welsh language. – Gareth – 2013-02-05T16:12:32.447

If the function is required to read the input, why require it be a (parameterless) function? It seems that a script would be more than adequate. – primo – 2013-02-05T16:19:50.657

1@Gareth I was not aware of that distinction, my mistake. Are there unicode representations of those two 'letters'? For the purposes of this challenge, each individual unicode character is a letter. – Gaffi – 2013-02-05T16:20:40.510

@Gaffi No. Because they're easily represented as two characters from the English alphabet, there's no point in having a separate Unicode character for them. I wasn't having a go, I just wanted to be sure of what you meant by 'letter'; if you mean Unicode character, that's perfectly clear. – Gareth – 2013-02-05T16:22:36.950

@primo Fair enough. Any code implementation will work - script, function, complete program, what have you - so long as the code reads in the text. – Gaffi – 2013-02-05T16:23:33.410

@Gareth That's all it is, Uni char. :-) – Gaffi – 2013-02-05T16:24:42.383

Can you clarify "The text document must be read in by your code"? Can the input/document be a function parameter? Do you expect us to load a filestream? What about a prompt box in javascript? – Shmiddty – 2013-02-05T20:40:02.087

@Shmiddty All of the above are ok. Essentially, I meant for that rule to mean 'no cheating'. I intentionally left it vague, since I know some implementations will be smaller with reading STDIN vs. passing an argument vs. opening/scanning a text file. – Gaffi – 2013-02-05T21:08:44.133

One last question, do upper and lower-case characters count separately? – Shmiddty – 2013-02-05T21:27:05.790

Yes, they are different and distinct. Updating question... – Gaffi – 2013-02-05T21:33:48.533

1So abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!@#$%^&*()_+~\<>/\?'";:{}[],.` is a valid "word"? – Shmiddty – 2013-02-05T21:44:15.387

2Off-topic, but apparently there used to be single letters for LL and ll in Welsh. At least Unicode has U+1EFA and U+1EFB for those; "Middle-Welsh" it calls them. There is no titlecase Ll though. – Mr Lister – 2013-02-06T09:15:06.410

@Shmiddty Yep, that's valid. – Gaffi – 2013-02-06T12:36:19.477
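The "one Unicode character is one letter" rule settled in the comments can be illustrated with a small Python 3 sketch. The helper name `unique_letters` is made up for this example; it relies on the fact that iterating a Python 3 string yields code points rather than bytes.

```python
def unique_letters(word):
    # A set of the string's code points; case variants are distinct members.
    return len(set(word))


# The digraph "ll" is two copies of the same character, so it adds one letter.
print(unique_letters("llan"))

# U+1EFB (the Middle-Welsh ll mentioned above) is a single code point.
print(unique_letters("\u1efban"))

# Upper and lower case count separately.
print(unique_letters("aA"))

# The same character occupies several bytes in UTF-8, which is why counting
# bytes instead of characters would give a different answer.
print(len("\u1efb".encode("utf-8")))
```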

Answers

7

APL (56)

{⎕ML←3⋄⊃{⍵,⍴∪⍵}¨W[⍙]⍴⍨↑+/∆∘.=∆←∆[⍙←⍒∆←↑∘⍴∘∪¨W←⍵⊂⍨⍵≠' ']}

This is a function (question says that's allowed) that takes a string and returns a matrix of words and unique lengths.

Usage:

      {⎕ML←3⋄⊃{⍵,⍴∪⍵}¨W[⍙]⍴⍨↑+/∆∘.=∆←∆[⍙←⍒∆←↑∘⍴∘∪¨W←⍵⊂⍨⍵≠' ']}'The quick brown fox jumps over the lazy dog.'
quick 5
brown 5
jumps 5

Explanation:

  • ⎕ML←3: set migration level to 3 (so that ⊂ is partition instead of enclose)
  • W←⍵⊂⍨⍵≠' ': store in W the given string, where each partition consists of non-whitespace characters.
  • ⍙←⍒∆←↑∘⍴∘∪¨W: get the amount (⍴) of unique (∪) elements in each part (¨) of W, and store these in ∆, then get the sort order when sorted downwards on this (⍒) and store that in ⍙.
  • ∆[⍙...]: sort ∆ by ⍙, so now we have the unique lengths in order.
  • ∆∘.=∆←∆: store the sorted ∆ back in ∆, and see which elements of ∆ are equal.
  • ↑+/: sum the rows (now we know how many elements are equal to each element) and then take the first item (now we know how many elements are equal to the first element, i.e. how many of the words are tied for first place.)
  • W[⍙]⍴⍨: sort W by ⍙, and take the first N, where N is the number we just calculated.
  • {⍵,⍴∪⍵}¨: for each of these, get the word itself and the amount of unique characters in the word
  • ⊃: format as matrix
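For readers who do not know APL, the same pipeline can be sketched in Python 3. This is a loose transliteration of the steps listed above rather than a character-for-character translation; the function name `apl_style` is hypothetical, and, like the APL code, it splits on the space character only.

```python
def apl_style(text):
    # W: partition the string on spaces, dropping empty parts.
    words = [w for w in text.split(" ") if w]
    # Unique-character count for each word.
    counts = [len(set(w)) for w in words]
    # Grade down: indices that sort the counts in descending order (stable).
    order = sorted(range(len(words)), key=lambda i: -counts[i])
    ranked = [counts[i] for i in order]
    # First row of the equality table: how many counts equal the largest one.
    ties = sum(1 for c in ranked if c == ranked[0]) if ranked else 0
    # Take the first `ties` words in ranked order, paired with their counts.
    return [(words[i], counts[i]) for i in order[:ties]]


print(apl_style("The quick brown fox jumps over the lazy dog."))
```

On the usage sentence above, this returns the same three words (quick, brown, jumps), each with a count of 5.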

marinus

Posted 2013-02-05T14:03:13.340

Reputation: 30 224

4

Mathematica 96 115

Edit: code now finds all words with the maximum number of unique characters. I refuse to treat commas as word characters.

f@t_ := With[{r = {#, Length@Union@Characters@#} & /@ 
StringSplit[t,RegularExpression@"\\W+"]},  Cases[r, {_, Max[r[[All, 2]]]}]]

Examples

f@"It was the best of times,...of comparison only."

or

f@Import["t1.txt"]

{{"incredulity", 10}, {"superlative", 10}}


f@"Lorem ipsum... vitae augue."

or

f@Import["t2.txt"]

{"Vestibulum", 9}


Longer Examples

f@Import["ShakespearesSonnets.txt"]
f@Import["OriginOfSpecies.txt"]
f@Import["DeclarationOfIndependence.txt"]
f@Import["DonQuixoteISpanish.txt"]
f@Import["AliceInWonderland.txt"]
f@Import["UNHumanRightsGerman.txt"]
f@Import["GenesisKJV.txt"]

Surprise: The most "unique" word in the Declaration of Independence is also the most unique word in Alice in Wonderland!

{"prognosticate", 11}
{"undiscoverable", 13}
{"uncomfortable", 12}
{"regocijadamente", 12}
{"uncomfortable", 12}
{"Verpflichtung", 13}
{"buryingplace", 12}

DavidC

Posted 2013-02-05T14:03:13.340

Reputation: 24 524

does this only return a single most-unique word? It should return them all. e.g. "superlative, incredulity, 10" – Shmiddty – 2013-02-06T00:23:01.793

@Shmiddty I addressed your criticism. (It cost 19 bytes.) – DavidC – 2013-02-06T02:18:54.043

4

Python 2 (110 (98 using file input))

import sys
f=lambda x:len(set(x))
a=sys.stdin.read().split()
c=max(map(f,a))
for i in a:
 if f(i)==c:print i,c

.

f=lambda x:len(set(x))
a=file('a').read().split()
c=max(map(f,a))
for i in a:
 if f(i)==c:print i,c

Things to improve: printing (33 characters)

Punctuation is considered letters.

beary605

Posted 2013-02-05T14:03:13.340

Reputation: 3 904

Python 2.7.3: NameError: global name 'r' is not defined. After adding single quotes around the r: AttributeError: 'file' object has no attribute 'split'. Python 3.3.0: SyntaxError: invalid syntax 'print i,c'. – primo – 2013-02-06T08:49:46.460

Oops, I didn't test it. Thanks for saying that, I would have never seen that. As for Python 3: doesn't work. – beary605 – 2013-02-06T18:07:12.073

4

This is my first codegolf, I'm so excited :) Also that means it is probably not any good.

Groovy 127 117 112 105

Edit: Since functions seem to be allowed, here is one in 105. I also renamed the variables to make the first column read ACDC, because that is important in any kind of source code:

A={e={it.toSet().size()}
C=it.text.tokenize()
D=e(C.max{e(it)})
C.grep{e(it)==D}.each{println"$it $D"}}

You would call it like this:

A(new File("words.txt"))

Without function using standard input in 112:

a={it.toSet().size()}
b=System.in.getText().tokenize()
c=a(b.max{a(it)})
b.grep{a(it)==c}.each{println "$it $c"}

a={it.toSet().size()}
b=System.in.getText().tokenize().sort{-a(it)}
c=a(b[0])
b.grep{a(it)==c}.each{println "$it $c"}

a={it.toSet().size()}
System.in.getText().tokenize().sort({-a(it)}).groupBy{a(it)}.take(1).each{k,v->v.each{println "$it $k"}}

Input: Lorem Ipsum Text from primo

All scripts output:

consequat 9
ullamcorper 9
Vestibulum 9

Anyone got an idea how to make them more groovy?

Fels

Posted 2013-02-05T14:03:13.340

Reputation: 488

3

Perl 78 bytes

map{push$_[keys{map{$_,1}/./g}]||=[],$_}split for<>;print"$_ $#_
"for@{$_[-1]}

I'm interpreting the restriction "The text document must be read in by your code" to mean that command line options that read and parse the input are not allowed. As with the PHP solution below, only characters 10 and 32 are considered to be word delimiters. Input and output are also taken in the same manner.
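The Perl one-liner groups words into buckets indexed by their unique-character count and then prints the last bucket. A rough Python 3 sketch of that idea follows; the name `bucket_by_unique_count` is hypothetical, a dictionary stands in for Perl's auto-extending array, and `str.split()` splits on any whitespace rather than only characters 10 and 32.

```python
from collections import defaultdict


def bucket_by_unique_count(text):
    # Map each unique-character count to the words that have it, in input order.
    buckets = defaultdict(list)
    for word in text.split():
        buckets[len(set(word))].append(word)
    if not buckets:
        return [], 0
    # The highest key plays the role of the last array index in the Perl code.
    top = max(buckets)
    return buckets[top], top


words, count = bucket_by_unique_count("it was the epoch of incredulity, it was")
for w in words:
    print(w, count)
```

Because the comma stays attached to the word, the sample prints `incredulity, 11`.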


PHP 128 bytes

<?foreach(split(~߃õ,fread(STDIN,1e6))as$s){$w[count(count_chars($s,1))][]=$s;}krsort($w)?><?=join($f=~ß.key($w).~õ,pos($w)),$f;

The only characters considered to be word delimiters are character 10 and character 32. The rest, including punctuation, are considered to be part of the word.

This contains a few binary characters, which saves quotation marks, but as a result needs to be saved with an ANSI encoding in order to function properly. Alternatively, this version can be used, which is 3 bytes heavier:

<?foreach(split(' |
',fread(STDIN,1e6))as$s){$w[count(count_chars($s,1))][]=$s;}krsort($w)?><?=join($f=' '.key($w).'
',pos($w)),$f;

Sample I/O:

input 1:

It was the best of times, it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief, it was the epoch of
incredulity, it was the season of Light, it was the season of Darkness, it was
the spring of hope, it was the winter of despair, we had everything before us,
we had nothing before us, we were all going direct to Heaven, we were all going
direct the other way - in short, the period was so far like the present period,
that some of its noisiest authorities insisted on its being received, for good
or for evil, in the superlative degree of comparison only.

output 1:

$ php most-unique.php < input1.dat
incredulity, 11

input 2:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec mollis, nisl sit
amet consequat fringilla, justo risus iaculis justo, vel ullamcorper dui tellus
ut enim. Suspendisse lectus risus, molestie sed volutpat nec, eleifend vitae
ligula. Nulla porttitor elit vel augue pretium cursus. Donec in turpis lectus.
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia
Curae; Quisque a lorem eu turpis viverra sodales. Pellentesque justo arcu,
venenatis nec hendrerit a, molestie vitae augue.

output 2:

$ php most-unique.php < input2.dat
consequat 9
ullamcorper 9
Vestibulum 9

primo

Posted 2013-02-05T14:03:13.340

Reputation: 30 891

incredulity has 10 unique letters, not 11. – DavidC – 2013-02-05T23:01:52.077

@DavidCarraher his code is including the comma, which is theoretically allowed via the rules. – Shmiddty – 2013-02-06T00:37:37.853

The explanation is absolutely incredulous. – DavidC – 2013-02-06T02:04:25.083

2Not only is it 'theoretically allowed', but given the wording of the question (specifically points 2 and 3), it seems to be a requirement. – primo – 2013-02-06T04:02:12.177

@DavidCarraher Yes, punctuation are valid characters. Anything other than whitespace is valid. – Gaffi – 2013-02-06T12:38:58.290

3

GoRuby 2.0.0 – 66 chars

The solutions below didn't actually find all matches but only one. Here's my final version:

a=$<.r.sp.m{|x|[x,x.ch.u.sz]};a.m{|x|s x*' - 'if x.l==a.m_(&:l).l}

Examples:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec mollis, nisl sit amet consequat fringilla, justo risus iaculis justo, vel ullamcorper dui tellus ut enim. Suspendisse lectus risus, molestie sed volutpat nec, eleifend vitae ligula. Nulla porttitor elit vel augue pretium cursus. Donec in turpis lectus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque a lorem eu turpis viverra sodales. Pellentesque justo arcu, venenatis nec hendrerit a, molestie vitae augue.

produces:

$ ruby golf.rb < input.txt
consequat - 9
ullamcorper - 9
Vestibulum - 9

GoRuby 2.0.0 – 29 chars (not exact output format)

s$<.sp.m{|x|[x.ch.u.sz,x]}.mx

Expects input from stdin. The output format is a little bit different, though. For example:

$ ruby golf.rb < british.1
14
manoeuvrability

GoRuby 2.0.0 – 42 40 chars

s$<.r.sp.m{|x|[x.ch.u.sz,x]}.mx.rv*' - '

expects input from stdin

Ruby 1.9.3 - 69 65 chars

puts$<.read.split.map{|x|[x.chars.uniq.size,x]}.max.reverse*' - '

expects input from stdin (same as above, but without GoRuby abbreviations)
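The shorter variants take the maximum of `[count, word]` pairs, which is why they report a single word: ties on the count are broken by comparing the words themselves. The Python 3 sketch below mirrors that behaviour and contrasts it with collecting every tied word; both function names are made up for illustration.

```python
def single_max(text):
    # Tuples compare element by element, so equal counts fall back to
    # comparing the words, and only one pair survives.
    return max((len(set(w)), w) for w in text.split())


def all_max(text):
    pairs = [(len(set(w)), w) for w in text.split()]
    best = max(count for count, _ in pairs)
    # Keep every pair that reaches the best count, in input order.
    return [(count, w) for count, w in pairs if count == best]


sample = "consequat ullamcorper Vestibulum"
print(single_max(sample))
print(all_max(sample))
```

All three sample words have 9 unique characters, so `single_max` keeps only the one that sorts last as a string, while `all_max` keeps all three.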

Patrick Oscity

Posted 2013-02-05T14:03:13.340

Reputation: 765

2

Javascript 163 155 152 162 bytes

This is about as short as I can get it:

prompt(x=[]).split(/\s/).forEach(function(a){b={};c=0;a.split('').forEach(function(d){b[d]?1:b[d]=++c});x[c]?x[c].push(a):x[c]=[a]});alert((l=x.length-1)+':'+x[l])
prompt(x=[]).split(/\b/).map(function(a){b={};c=0;a.split('').map(function(d){b[d]?1:b[d]=++c});x[c]?x[c].push(a):x[c]=[a]});alert((l=x.length-1)+':'+x[l])
prompt(x=[]).split(/\s/).map(function(a){b=[c=0];a.split('').map(function(d){b[d]?1:b[d]=++c});x[c]=(x[c]||[]).concat(a)});alert((l=x.length-1)+':'+x[l])

prompt(x=[]).split(/\s/).map(function(a){b=[c=0];a.split('').map(function(d){b[d]?1:b[d]=++c});x[c]=(x[c]||[]).concat(a)});alert((l=x.length-1)+':'+x[l].join('\n'))

In this version, /\s/ separates words based on whitespace, so punctuation (commas, periods, etc.) is included as part of words. This is easily changed to /\b/ to exclude it.
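The effect of the delimiter choice can be illustrated with a short Python 3 sketch. It compares whitespace splitting with extracting runs of word characters via `\w+`, which is related to, but not the same operation as, splitting on `/\b/` in JavaScript; the helper name `best` is hypothetical.

```python
import re


def best(words):
    # Return the highest unique-character count and the words that reach it.
    top = max(len(set(w)) for w in words)
    return top, [w for w in words if len(set(w)) == top]


text = "it was the epoch of incredulity, it was the season of Light,"

# Whitespace splitting keeps trailing punctuation attached to the word.
print(best(text.split()))

# Runs of word characters drop the punctuation entirely.
print(best(re.findall(r"\w+", text)))
```

The first call counts the trailing comma and reports 11 for `incredulity,`; the second reports 10 for `incredulity`, which is the discrepancy discussed in the comments on the Perl/PHP answer.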

I'll see what I can do with for-loops instead of forEaches in a bit.

I/O:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

11:incredulity,

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec mollis, nisl sit amet consequat fringilla, justo risus iaculis justo, vel ullamcorper dui tellus ut enim. Suspendisse lectus risus, molestie sed volutpat nec, eleifend vitae ligula. Nulla porttitor elit vel augue pretium cursus. Donec in turpis lectus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque a lorem eu turpis viverra sodales. Pellentesque justo arcu, venenatis nec hendrerit a, molestie vitae augue.

9:consequat
ullamcorper
Vestibulum

A little tired, perhaps. But I feel peaceful. Your success in the ring this morning was, to a small degree, my success. Your future is assured. You will live, secure and safe, Wilbur. Nothing can harm you now. These autumn days will shorten and grow cold. The leaves will shake loose from the trees and fall. Christmas will come, and the snows of winter. You will live to enjoy the beauty of the frozen world, for you mean a great deal to Zuckerman and he will not harm you, ever. Winter will pass, the days will lengthen, the ice will melt in the pasture pond. The song sparrow will return and sing, the frogs will awake, the warm wind will blow again. All these sights and sounds and smells will be yours to enjoy, Wilbur—this lovely world, these precious days…

10:Wilbur—this

Nearly all children nowadays were horrible. What was worst of all was that by means of such organizations as the Spies they were systematically turned into ungovernable little savages, and yet this produced in them no tendency whatever to rebel against the discipline of the Party. On the contrary, they adored the Party and everything connected with it... All their ferocity was turned outwards, against the enemies of the State, against foreigners, traitors, saboteurs, thought-criminals. It was almost normal for people over thirty to be frightened of their own children.

15:thought-criminals.

Shmiddty

Posted 2013-02-05T14:03:13.340

Reputation: 1 209

There is a potential awkwardness with the output: if there are multiple words in the output and one of the words ends with a comma, it could show two commas in a row which would be confusing. – Shmiddty – 2013-02-06T16:43:25.417

From the spec, In the event more than one word exists with the highest count, print all words for that count, **with one new line delimiting**. – Gaffi – 2013-02-06T19:38:30.863

@Gaffi should be fixed now. 10 bytes >.< – Shmiddty – 2013-02-06T20:03:03.640

2

Scala 129 chars:

def f{
val l=readLine.split(" ").map(s=>(s,s.distinct.length)).sortBy(_._2)
println(l.filter(x=>x._2==l.last._2).mkString)}

user unknown

Posted 2013-02-05T14:03:13.340

Reputation: 4 210

2

R - 106 characters
As a function with the input text as parameter:

f=function(t){
s=strsplit
a=sapply
t=s(t," ")[[1]]
w=a(a(s(t,""),unique),length)
n=(w==max(w))
cbind(t[n],w[n])
}

And a few examples:

f("It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.")
     [,1]           [,2]
[1,] "incredulity," "11"

f("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec mollis, nisl sit amet consequat fringilla, justo risus iaculis justo, vel ullamcorper dui tellus ut enim. Suspendisse lectus risus, molestie sed volutpat nec, eleifend vitae ligula. Nulla porttitor elit vel augue pretium cursus. Donec in turpis lectus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Quisque a lorem eu turpis viverra sodales. Pellentesque justo arcu, venenatis nec hendrerit a, molestie vitae augue.")
     [,1]          [,2]
[1,] "consequat"   "9" 
[2,] "ullamcorper" "9" 
[3,] "Vestibulum"  "9"

Or R - 100 characters
As a function with the path to the text file as parameter:

f=function(t){
t=scan(t,"")
a=sapply
w=a(a(strsplit(t,""),unique),length)
n=(w==max(w))
cbind(t[n],w[n])
}

Usage:

f("t1.txt")
Read 120 items
     [,1]           [,2]
[1,] "incredulity," "11"

plannapus

Posted 2013-02-05T14:03:13.340

Reputation: 8 610

I think this is missing "The text document must be read in by your code". – Steven Rumbalski – 2013-02-08T15:47:25.677

@StevenRumbalski this is corrected. – plannapus – 2013-02-08T16:12:04.027

1

Python 176 168

w = "".join((open('c')).readlines()).replace("\n", " ").split(" ")
l = sorted(zip([len(set(w[i])) for i in range(len(w))],w,))
print([x for x in l if l[-1][0] == x[0]])

Raufio

Posted 2013-02-05T14:03:13.340

Reputation: 216

1

Python3 119

Reads from a file called a.

r={w:len(set(w))for w in open("a").read().split()};print("\n".join(str((k,v))for k,v in r.items()if v==max(r.values())))

Tested with the input texts from @primo:

Input 1:
    ('incredulity,', 11)

Input 2:
    ('Vestibulum', 9)
    ('consequat', 9)
    ('ullamcorper', 9)

gcq

Posted 2013-02-05T14:03:13.340

Reputation: 251

0

VBScript - 430 / VBA - 420

VBScript:

Function r(t)
d="Scripting.Dictionary"
Set w=CreateObject(d)
c=1
Do Until c>Len(t)
p=InStr(c,t," ")
i=InStr(c,t,vbCr)
If p<i Then s=i Else s=p
If s=0 Then s=Len(t)+1
f=Mid(t,c,s-c)  
If Not w.Exists(f) Then 
Set x=CreateObject(d)
For l=1 To Len(f)
n=Mid(f,l,1)
If Not x.Exists(n) Then x.Add n,n
Next
w.Add f,f
y=x.Count
If m=y Then z=f &vbCr &z
If m<y Then m=y:z=f
End If
c=s+1
Loop
r=z &" " &m
End Function

VBA:

Function r(t)
d="Scripting.Dictionary"
Set w=CreateObject(d)
c=1
Do Until c>Len(t)
p=InStr(c,t," ")
i=InStr(c,t,vbCr)
s=IIf(p<i,i,p)
If s=0 Then s=Len(t)+1
f=Mid(t,c,s-c)  
If Not w.Exists(f) Then 
Set x=CreateObject(d)
For l=1 To Len(f)
n=Mid(f,l,1)
If Not x.Exists(n) Then x.Add n,n
Next
w.Add f,f
y=x.Count
If m=y Then z=f &vbCr &z
If m<y Then m=y:z=f
End If
c=s+1
Loop
r=z &" " &m
End Function

Gaffi

Posted 2013-02-05T14:03:13.340

Reputation: 3 411