Sorting files by “line content” frequency - print duplicates

-1

Imagine there is a file:

a
b
b
b
b
c
c
d
d
d

I want the output sorted by frequency, most frequent first, with the duplicate lines printed as well:

b
b
b
b
d
d
d
c
c
a

GeekyGeek

Posted 2018-08-25T19:23:45.410

Reputation: 3

Answers

3

With GNU Awk:

gawk '
   { arr[$0]++ }
   END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (ln in arr) for (i = 1; i <= arr[ln]; i++) print ln
       }
   '

The trick is to use an array and @val_num_desc. Every encountered line becomes an index, and the associated value is incremented each time the line appears. At the end we scan the entire array in a specific order:

"@val_num_desc"
[…] the element values, treated as numbers, are ordered from high to low.

source

So the outer (first) for is responsible for retrieving lines and their frequencies in the desired order; the inner (second) for is just to print the currently picked line the right number of times.

Note:

  • Every character matters. A line and the same line with an extra trailing space are different.
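Where gawk is not available, the same counting idea can be sketched with POSIX awk plus sort. This is only an illustration under stated assumptions, not part of the answer above: the function name freq_sort is hypothetical, and it assumes input lines contain no embedded NUL bytes. The first awk counts each distinct line, sort orders the "count TAB line" records by count (descending), and the second awk prints each line as many times as it was counted.

```shell
# Hypothetical helper (name is an assumption): reads lines on stdin and
# prints them grouped by frequency, most frequent first.
freq_sort() {
    # Pass 1: every distinct line becomes an array index; emit "count<TAB>line".
    awk '{ count[$0]++ } END { for (ln in count) print count[ln] "\t" ln }' |
        # Order by count (numeric, descending), then by line content for ties.
        sort -t "$(printf '\t')" -k1,1nr -k2 |
        # Pass 2: strip the "count<TAB>" prefix and repeat the line count times.
        awk '{ n = $1; sub(/^[0-9]+\t/, ""); for (i = 0; i < n; i++) print }'
}

# Usage (input.txt is a placeholder file name):
#   freq_sort < input.txt
```

Unlike the PROCINFO["sorted_in"] approach, the ordering here is done by the external sort, so lines with equal counts come out grouped and ordered by their content.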

Kamil Maciorowski

Posted 2018-08-25T19:23:45.410

Reputation: 38 429

PROCINFO["sorted_in"] - awesome, just what I was looking for to make an awk example too, thanks! – Attie – 2018-08-25T22:43:32.100

@Attie I think it may not work in plain awk, unless your awk is gawk in disguise. On my Debian, awk used to be too limited, so I had to install gawk. Now both commands understand this because awk is (indirectly) symlinked to gawk. – Kamil Maciorowski – 2018-08-25T22:49:49.547

3

The following will do what you're after, though there are many other ways to achieve this, for example with gawk, as per Kamil's answer.

  • The first sort orders the data by line content, so identical lines become neighbours
  • uniq -c counts the number of matching occurrences (they must be neighbours)
  • sort -nr sorts by the number of occurrences, in reverse (descending) order
  • The while loop iterates over each line
    • read ingests the count into n, and the line data into l
    • The for loop iterates n times
      • echo "${l}" outputs the line data

(
    sort \
        | uniq -c \
        | sort -nr \
        | while read -r n l; do
            # -r keeps backslashes in the line data literal
            for i in $(seq "${n}"); do
                echo "${l}"
            done
        done
) <<"EOF"
a
b
b
b
b
c
c
d
d
d
EOF
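To see what the loop receives, the first three stages can be run on their own. The sketch below is only illustrative; the function name count_lines is hypothetical. It prints one "count line" pair per distinct line, most frequent first. Note that read strips leading and trailing whitespace, so this approach assumes the line data does not depend on it.

```shell
# Hypothetical helper (name is an assumption): show the count/line pairs
# produced by sort | uniq -c | sort -nr, before they are expanded again.
count_lines() {
    sort | uniq -c | sort -nr | while read -r n l; do
        # n holds the count, l holds the rest of the line
        printf '%s %s\n' "$n" "$l"
    done
}

printf 'a\nb\nb\nb\nb\nc\nc\nd\nd\nd\n' | count_lines
# prints:
# 4 b
# 3 d
# 2 c
# 1 a
```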

Attie

Posted 2018-08-25T19:23:45.410

Reputation: 14 841