Fun with strings and string lengths

7

1

Introduction

Inspired by Dyalog Ltd.'s 2016 student competition. The challenge there is to write good APL code, but this is a and it is open to all languages. Derived from this closed challenge.

The real challenge here is to combine the various (quite trivial) parts of the problem in a byte-efficient way by reusing values and/or algorithms.

If you find this challenge interesting and you use Dyalog APL (free download here), you may also want to participate in the original competition, but note that the objective there is to submit quality code – not necessarily short code. If you are a student, you can win USD 100 for Phase I. Phase II (not related to this code golf challenge) can win you up to USD 2000 and a free trip to Scotland!

Task

Take two strings (of any 7-bit ASCII characters) by any means and order (but please include input format in your answer). You may take them as separate arguments/inputs or as a single list of strings. One string contains the delimiters in the other. E.g. given ',;/' and 'hello,hello;there;my/yes/my/dear,blue,world', the sub-strings are ['hello', 'hello', 'there', 'my', 'yes', 'my', 'dear', 'blue', and 'world'].

You will do some statistics and ordering based on the lengths of those sub-strings that are not duplicated in the input. (You may chose to consider casing or not.) In the example given, those are ['there', 'yes', 'dear', 'blue', 'world'].

Give (by any means) a list of four elements in the overall structure [Number_1, Number_2, [List_3], [[List_4_1], [List_4_2]]]:

  1. The mean of the lengths of the non-duplicated strings. In our example, that is 4.2
  2. The median of the lengths of the non-duplicated strings. In our example, that is 4.
  3. The mode(s) of the lengths of the non-duplicated strings. In our example, that is [4, 5].
  4. The original sub-strings (including duplicates), divided into two lists; the first are those that are not one of the above mode lengths, and the second are those that are. In our example: [['my', 'my', 'yes'], ['dear', 'blue', 'hello', 'hello', 'there', 'world']]

The combined final answer of our example case is: [4.2, 4, [4, 5], [['my', 'my', 'yes'], ['dear', 'blue', 'hello', 'hello', 'there', 'world']]]

Considerations

  1. Don't forget to reuse code and values. E.g. the median is the average of the 1 or 2 middle elements, and you'll need the mode(s) to reorder the original list.

  2. Watch out for division by zero; an empty input means one string with length zero, so the mean length is zero. If there are no non-unique sub-strings, return zero for both mean and median, an empty list for modes, and put all original sub-strings onto the non-mode list.

  3. The list of modes, and the sub-lists of non-mode-length strings, and mode-length strings, do not need to be sorted in any way.

  4. If all non-duplicated sub-strings have different lengths, then all the lengths are considered modes.

  5. Wherever the term list is used, you may use any collection of multiple data pieces that your language supports, e.g. tuples, arrays, dictionaries, etc.

  6. The numeric values only need to be within 1% of the actual value.

Example cases

Example A

',;/' 'hello,hello;there;my/yes/my/dear,blue,world'
gives
[4.2, 4, [4, 5], [['my', 'my', 'yes'], ['dear', 'blue', 'hello', 'hello', 'there', 'world']]

Example B

'+|' ''
gives
[0, 0, [0], [[], ['']]]

Note: while there are two delimiter characters, none of them occur in the data string, so this is just one empty string. Mean, median, and mode are therefore zero There are no strings in the non-mode list and just the single empty string in the mode list.

Example C

'' 'ppcg.se.com'
gives
[11, 11, [11], [[], ['ppcg.se.com']]]

Note: no delimiters, so the entire input is one string.

Example D

',;' ',math,,math;'
gives
[0, 0, [], [['', 'math', '', 'math', ''], []]]

Note: there are three empty sub-strings, and since all sub-strings have at least one duplicate, the mean and median default to zero, as per Considerations, above.

Example E

',;' 'new,math,math;!;'
gives
[1.333, 1, [0,1,3], [['math', 'math'], ['new', '!', '']]]

Note: there is only one of each length among the non-duplicated sub-strings, so all three lengths are modes.


You can verify my explicit permission to post this challenge here by contacting the author of the original competition.

Adám

Posted 2016-08-11T10:30:53.357

Reputation: 37 779

I presume the delimiters are always single characters? And that when you say "list" you also include tuples, in particular for the sake of strongly typed languages which require all elements of a list to have the same type? – Peter Taylor – 2016-08-11T10:41:04.153

@PeterTaylor The inputs are two plain strings, so the delimiters can only be single characters. I'm not sure I understand what the problem about tuples vs lists. But if you want to use tuples instead of lists, go ahead. – Adám – 2016-08-11T10:48:04.343

Ugh. The additional handling required for supporting empty substrings (like Example D) is brutal. :-/ – AdmBorkBork – 2016-08-11T16:30:10.740

The first number of Example A should be 4 – Leaky Nun – 2016-08-11T17:55:22.427

@TimmyD It is 2 not 3 – Leaky Nun – 2016-08-11T18:12:27.670

@TimmyD It is 'oh' not 'yes' – Leaky Nun – 2016-08-11T18:45:22.947

@LeakyNun Thanks, fixed. It was supposed to be the same as in the walk-through, but I changed the walk-through and forgot to update the example. – Adám – 2016-08-11T19:55:13.797

Could ^, ], - or \ be delimiters? – Neil – 2016-08-11T23:32:58.827

@Neil any 7-bit ASCII. I've edited that in. Thanks. – Adám – 2016-08-12T05:04:43.447

Answers

4

PowerShell v4+, 453 455 435 427 417 bytes

param([char[]]$a,$b)if(!$b){0;0;0;'','';exit}$a|%{$b=$b-split$_};$c=@{};if($z=($b|group|?{$_.count-eq1}).Name){$z|%{$m+=$c[$_]=$_.length}}if($l=$c.count){$c=$c.GetEnumerator()|sort v}else{$l=2}$m/$l;if($l%2){$c[[math]::Floor($l/2)].Value}else{($c[$l/2].value+$c[$l/2-1].value)/2}if($x=($c|group value|sort c -des)|%{if($_.count-ge$i){$i=$_.count;$_.name}}){$m=$b|?{$_.length-in$x}}else{$m=@()}$x;($b|?{$_-notin$m}),$m

(Saved 7 bytes by eliminating $d and reusing $c instead. Saved 13 21 31 bytes thanks to @Joey)

What a mess. Below is a slightly more readable version, broken down into sections. I also included -join on the two potentially-array outputs to make the output more clear visually, as well as surrounding the last one in distinct delimiters #-> and <-# to indicate the sub-arrays. I did this since PowerShell's Write-Output for multiple array return values can get a little confusing when just looking at them on the console.

param([char[]]$a,$b)if(!$b){0;0;0;'','';exit}$a|%{$b=$b-split$_}
$c=@{}
if($z=($b|group|?{$_.count-eq1}).Name){$z|%{$m+=$c[$_]=$_.length}}
if($l=$c.count){$c=$c.GetEnumerator()|sort v}else{$l=2}
$m/$l
if($l%2){$c[[math]::Floor($l/2)].Value}else{($c[$l/2].value+$c[$l/2-1].value)/2}
if($x=($c|group value|sort c -des)|%{if($_.count-ge$i){$i=$_.count;$_.name}}){$m=$b|?{$_.length-in$x}}else{$m=@()}
$x-join','
"#->" + ((($b|?{$_-notin$m})-join','),($m-join',')-join'<-# #->') + "<-#"

Takes input $a and $b (separators and string), and explicitly casts $a right away as a char-array. The if(!$b) clause handles the special case where $b is an empty string - it dumps zeros and empty strings onto the pipeline and exits. Otherwise we loop through $a with $a|%{...}, each loop -splitting $b by the current character and re-saving into $b.

The next line sets $c as an empty hashtable. This is used as the temporary storage location for the non-duplicated values in $b and their corresponding lengths.

That's the next line. We take $b, pump it through Group-Object to put like words together, pipe that to Where-Object to select only those where the .count is -equal to 1. We then take the .Name of those (here's where the v4+ requirement comes into play), and store those .Names (i.e., the words) into $z. We then take the .count of that, and that number forms our if clause -- so long as it's non-zero (i.e., we have at least one unique word), we pump those words through a loop. Each iteration we accumulate $m by, and also store into $c, the corresponding word's .length.

Next line sorts $c on lengths (that's the |sort value in the middle) and stores that back to $c, but we need to have some logic to account for cases where $z was empty (i.e., every word is a duplicate). If that's the situation, $c is already an empty hashtable, but we set $l equal to 2 to handle the mean calculation later.

Now, we're to our outputs. The first $m/$l just calculates the average length. The next line calculates the median based on either taking the exact middle value of $c, or taking the two above and below the middle and splitting the difference.

The next line is very complicated, and was where the longest troubleshooting was. It's calculating the mode. We start by Grouping $c based on value (i.e., the word lengths), then Sort-Object them by their count in -descending order (i.e., the most-common lengths will be first). That list is fed into a loop. Each iteration, if the current item's $_.count is greater-than-or-equal-to $i (which initializes to $null or 0, here), meaning that we've got a length that's the most-common, set $i equal to how many of that length there are, and add the length ($_.name) to the pipeline. That resulting array of most-common lengths is stored into $x.

Provided that we have at least one most-common length, the if triggers and we pipe $b through another Where-Object. This time, we're selecting all of the words where the .length is -in $x (i.e., all of the words that have a length that's in the mode), and store them into $m (yes, I reused a variable name). Else, we don't have a most-common length, so the strings of the mode is empty.

The next line $x-join',' outputs the lengths of the mode. Then, the words in $b that aren't in $m and the words in $m, as two sub-arrays.


Test suite

PS C:\Tools\Scripts\golfing> @(@(',;/','hello,hello;there;my/yes/my/dear,blue,world'),@('+|',''),@('','ppcg.se.com'),@(',;',',math,,math;'),@(',;','new,math,math;!;'))|%{"Testing: '"+$_[0]+"' '"+$_[1]+"'";.\fun-with-strings.ps1 $_[0] $_[1];"---------------------------------------------------------------------------------------"}
Testing: ',;/' 'hello,hello;there;my/yes/my/dear,blue,world'
4.2
5
4,5
#->my,yes,my<-# #->hello,hello,there,dear,blue,world<-#
---------------------------------------------------------------------------------------
Testing: '+|' ''
0
0
0


---------------------------------------------------------------------------------------
Testing: '' 'ppcg.se.com'
11
11
11
#-><-# #->ppcg.se.com<-#
---------------------------------------------------------------------------------------
Testing: ',;' ',math,,math;'
0
0

#->,math,,math,<-# #-><-#
---------------------------------------------------------------------------------------
Testing: ',;' 'new,math,math;!;'
1.33333333333333
1
3,1,0
#->math,math<-# #->new,!,<-#
---------------------------------------------------------------------------------------

AdmBorkBork

Posted 2016-08-11T10:30:53.357

Reputation: 41 581

I can't really tell the structure from your output. Does the output have four or five parts? – Adám – 2016-08-11T20:44:27.363

Give (by any means) a list of four elements: The fifth point on that list was supposed to be under Considerations (fixed now), but your 4th and 5th need to be two elements of one list in order to be fully compliant with the specs. – Adám – 2016-08-11T20:55:02.850

@Adám Easy enough to fix, at the cost of two parens and changing a ; to a ,. Updated. – AdmBorkBork – 2016-08-12T13:30:25.687

@Joey Oh yeah. But don't need the * since it's unambiguous anyway. – AdmBorkBork – 2016-08-12T18:42:26.500

You do, since neither c nor v match a property; the wildcards do. – Joey – 2016-08-12T18:50:42.557

@Joey Works for me without the *. PowerShell v4 ISE on Win 8.1 – AdmBorkBork – 2016-08-12T18:51:51.860

Your tests give different results for me. PowerShell v5, Win 10. In any case, that behaviour for Sort-Object has been that way for ages. It takes a string[] of property names which can also contain wildcards. – Joey – 2016-08-12T18:53:07.507

3

Pyth, 63 bytes

Ju.ncRHGQ]E.OKlMfq1/JTJ.O<.<S*2KtlK2=N
S{.x.M/KZKY=Y
f}lTNJ.-JY

Test suite.

Leaky Nun

Posted 2016-08-11T10:30:53.357

Reputation: 45 011

I can't really tell the structure from your output. Does the output have four or five parts? – Adám – 2016-08-11T20:44:54.160

Denote your format to be [A, B, C, [D, E]]. My output is A↵B↵C↵D↵E – Leaky Nun – 2016-08-12T02:03:27.543

Done, but I'd accept A↵B↵C↵DE if that's a natural list format for Pyth. However, A↵B↵C↵D↵E doesn't quite follow the specs. – Adám – 2016-08-12T05:06:53.190

@Adám Oh, so you moved the goalpost – Leaky Nun – 2016-08-12T05:14:12.947

How so? According to the revision history, the very first version said Give (by any means) a list of four elements:. – Adám – 2016-08-12T05:48:59.177

1

Python 3, 341 chars, 343 bytes

from statistics import*
def f(d,s,l=len):
 w=sorted(''.join(x if not x in d else'ÿ'for x in s).split('ÿ'),key=l)
 L=[l(x)for x in w if w.count(x)<2]
 if L==[]:return 0,0,L,[s,L]
 a,q=mean(L),median(L)
 D={x:L.count(x)for x in L}
 m=[x for x in D if D[x]==max(D.values())]
 i=[l(x)for x in w].index(m[0])
 return a,q,m,[w[:i],w[i:]]

As built-ins are not forbidden, using the statistics module is fine for mean and median, but its mode does not allow multiples.

  • Edit1: check for non-uniqueness
  • Edit2: shorter list comprehension on m
  • Edit3: l=len and L==[]
  • Edit4: for the empty L case, use it instead of []
  • Edit5: figured empty strings were not handled well, introduced non-7-bit-ASCII sentinel ÿ

Karl Napf

Posted 2016-08-11T10:30:53.357

Reputation: 4 131

1

JavaScript (ES2017), 346 bytes

f=(p,q,a=p?q.split(p[0]):[q])=>p[1]?f(p.slice(1),a.join(p[1])):(s=a.filter(e=>a.indexOf(e)==a.lastIndexOf(e)).map(e=>e.length).sort((a,b)=>a-b),t=l=0,r=[],s.map(e=>r[t+=e,l++,e]=-~r[e]),[l&&t/l,l&&(s[l>>1]+s[l-1>>1])/2,[...m=new Set(s.filter(e=>r[e]==Math.max(...Object.values(r))))],[a.filter(e=>!m.has(e.length)),a.filter(e=>m.has(e.length))]])

For ES6 replace Object.values(r) with Object.keys(r).map(k=>r[k]) at a cost of 11 bytes. Explanation:

f=(p,q,a=p?q.split(p[0]):[q])=>         split on first seprator
 p[1]?f(p.slice(1),a.join(p[1])):       recurse for each separator
  (s=a.filter(e=>                       search through list for
     a.indexOf(e)==a.lastIndexOf(e))    unique strings only
    .map(e=>e.length)                   get lengths of each string
    .sort((a,b)=>a-b),                  sort lengths into order
   t=l=0,                               total and length
   r=[],                                frequency table
   s.map(e=>r[t+=e,l++,e]=-~r[e]),      build frequency table and total
   [l&&t/l                              mean = total / length
   ,l&&(s[l>>1]+s[l-1>>1])/2            median = mean of middle elements
   ,[...m=new Set(s.filter(e=>r[e]==    unique elements matching
    Math.max(...Object.values(r))))]    maximum length frequency
   ,[a.filter(e=>!m.has(e.length))      filter on non-modal length
    ,a.filter(e=>m.has(e.length))]])    filter on modal length

Neil

Posted 2016-08-11T10:30:53.357

Reputation: 95 035

0

q/kdb+, 134 bytes

Solution:

{(avg;med;{:M::(&)m=max m:(#:)@/:(=)x};{(y(&)(~)C;y(&)C:((#:)@/:y)in M)}[;a])@\:(#:)@/:(&)1=(#:)@/:(=)a:"\000"vs?[any y=/:x;"\000";y]}

Examples:

q)/assign function to 'f' to make this tidier
q)f:{(avg;med;{:M::(&)m=max m:(#:)@/:(=)x};{(y(&)(~)C;y(&)C:((#:)@/:y)in M)}[;a])@\:(#:)@/:(&)1=(#:)@/:(=)a:"\000"vs?[any y=/:x;"\000";y]}
q)f[",;/";"hello,hello;there;my/yes/my/dear,blue,world"]
4.2
4f
5 4
(("my";"yes";"my");("hello";"hello";"there";"dear";"blue";"world"))
q)f["+|";""]
0f
0f
,0
(();,"")
q)f["";"ppcg.se.com"]
11f
11f
,11
(();,"ppcg.se.com")
q)f[",;";",math,,math;"]
()
()
()
(("";"math";"";"math";"");())
q)f[",;";"new,math,math;!;"]
1.333333
1f
3 1 0
(("math";"math");("new";,"!";""))

Explanation:

Ungolfed version. Can likely be golfed down a fair bit more.

f:{(avg;                                           / built-in average
    med;                                           / built-in median
    {:M::where m=max m:count each group x};        / get max counts for group, save to global M
    {
     (y where not C;                               / y is original list,
      y where C:(count each y) in M)               / so these are ones that match the modes (M)
    }[;a]                                          / pass parameter a to the function
    )
    @\:                                            / apply each-left with items on the right
    count each                                     / get the lengths of each string
    where 1=count each group                       / remove the duplicates
    a:"\000" vs                                    / split on NUL and save in a
    ?[any y=/:x;"\000";y]}                         / replace any occurance of delimiter with NUL

streetster

Posted 2016-08-11T10:30:53.357

Reputation: 3 635