16

We have a set of directories containing Lucene indexes. Each index is a mix of different file types (differentiated by extension), e.g.:

0/index/_2z6.frq
0/index/_2z6.fnm
..
1/index/_1sq.frq
1/index/_1sq.fnm
..

(it's about 10 different extensions)

We'd like to get a total by file extension, e.g.:

.frq     21234
.fnm     34757
..

I've tried various combinations of du/awk/xargs but finding it tricky to do exactly this.

Ladadadada
barnybug

9 Answers

22

For any given extension you can use

find /path -name '*.frq' -exec ls -l {} \; | awk '{ Total += $5} END { print Total }'

to get the total file size for that type.
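Parsing ls output is a little fragile; with GNU find, -printf '%s\n' emits each file's size in bytes directly. A minimal sketch, assuming GNU findutils:

find /path -name '*.frq' -printf '%s\n' | awk '{ total += $1 } END { print total }'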

And after some thinking

#!/bin/bash

# collect the unique extensions of every file under the current directory
ftypes=$(find . -type f | grep -E ".*\.[a-zA-Z0-9]*$" | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq)

for ft in $ftypes
do
    echo -n "$ft "
    # sum the size column (5th field) of ls -l for files with this extension
    find . -name "*${ft}" -exec ls -l {} \; | awk '{total += $5} END {print total}'
done

Which will output the size in bytes of each file type found.
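If awk prints the larger totals in scientific notation (its default output format for print is %.6g), forcing fixed-point output avoids it. A minimal tweak of the find line inside the loop:

find . -name "*${ft}" -exec ls -l {} \; | awk '{total += $5} END {printf "%.0f\n", total}'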

user9517
  • Thanks, was looking for something that summarized by any extension (as it'd be handy to then sort for example) – barnybug Mar 07 '12 at 14:10
  • Check my update. – user9517 Mar 07 '12 at 14:11
  • great thanks. awk produces scientific output for some of the numbers, can this be disabled: .fdt 3.15152e+10 – barnybug Mar 07 '12 at 14:15
  • slightly tweaked to just give plain integer numbers: find . -name "*${ft}" -print0 | xargs -0 du -c | grep total | awk '{print $1}' – barnybug Mar 07 '12 at 14:17
  • Might want to use `-iname` to make the file extension search case insensitive. – Aaron Copley Mar 07 '12 at 15:16
  • @barnybug: I don't have a version of awk that prints the total in scientific notation. What version/OS etc are you using ? – user9517 Mar 07 '12 at 15:17
  • Ubuntu Lucid - it's just on the larger numbers (Gigabytes in size), perhaps over an int32. – barnybug Mar 07 '12 at 15:59
  • Note that this script has horrible performance issues (still, I upvoted it). It scans input directory X times (where X is number of extensions found in the directory). If you need something faster check @glenn jackman answer – matt May 25 '15 at 15:34
6

With bash version 4, you just need to call find; ls and awk are not necessary:

declare -A ary

# find prints "name<TAB>size"; accumulate the sizes per extension
while IFS=$'\t' read name size; do 
  ext=${name##*.}
  ((ary[$ext] += size))
done < <(find . -type f  -printf "%f\t%s\n")

for key in "${!ary[@]}"; do 
  printf "%s\t%s\n" "$key" "${ary[$key]}"
done
glenn jackman
  • This script doesn't work well with file names containing a tab character. Changing `read name size` to `read size name` and `-printf "%f\t%s\n"` to `-printf "%s\t%f\n"` should fix it. – matt May 25 '15 at 15:36
  • Note also that this script doesn't work well with files without an extension: it will treat the whole file name as the extension. Add `if [ "$name" == "$ext" ]; then ext="*no_extension*"; fi` after `ext=${name##*.}` if you need to prevent it. This puts all files without an extension into a `*no_extension*` group. – matt May 25 '15 at 15:39
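Putting both comment fixes together (size printed first so tabs in names can't shift the fields, plus a bucket for files with no extension), a sketch of the adjusted loop:

declare -A ary

while IFS=$'\t' read -r size name; do
  ext=${name##*.}
  # a name with no dot leaves ext equal to the whole name
  if [ "$name" == "$ext" ]; then ext="*no_extension*"; fi
  ((ary[$ext] += size))
done < <(find . -type f -printf "%s\t%f\n")

for key in "${!ary[@]}"; do
  printf "%s\t%s\n" "$key" "${ary[$key]}"
done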
4

The second column is split on `.` and the last part (the extension) is used as the array key:

#!/bin/bash

find . -type f -printf "%s\t%f\n" | awk '
{
 split($2, ext, ".")
 e = ext[length(ext)]
 size[e] += $1
}

END{
 for(i in size)
   print size[i], i
}' | sort -n

Then you get the total size in bytes for every extension:

60055 gemspec
321991 txt
2075312 html
2745143 rb
13387264 gem
47196526 jar
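If you'd rather see human-readable sizes, GNU numfmt can post-process the first column. A sketch, assuming GNU coreutils is installed:

find . -type f -printf "%s\t%f\n" | awk '
{
 split($2, ext, ".")
 e = ext[length(ext)]
 size[e] += $1
}

END{
 for(i in size)
   print size[i], i
}' | sort -n | numfmt --to=iec --field=1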
Selman Ulug
1

Extending Iain's script with a faster version for working with a large number of files.

#!/bin/bash

ftypes=$(find . -type f | grep -E ".*\.[a-zA-Z0-9]*$" | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq)

for ft in $ftypes
do
    echo -ne "$ft\t"
    find . -name "*${ft}" -exec du -bcsh '{}' + | tail -1 | sed 's/\stotal//'
done
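One caveat: with very large file lists, `-exec … +` can invoke du more than once, so `tail -1` only sees the last batch's total. Summing every per-batch "total" line in awk sidesteps that (dropping `-h` so the sizes stay numeric). A sketch of a replacement for the find line inside the loop:

find . -name "*${ft}" -exec du -bcs {} + | awk '$2 == "total" {sum += $1} END {print sum}'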
MilesF
0

For macOS:

#!/bin/bash

for ft in $(find "$1/" -type f | { export GREP_OPTIONS="--color=never" && grep -E ".*\.[a-zA-Z0-9]*$"; } | sed -E 's/.*(\.[^\.]*)$/\1/' | sort | uniq)
do
    find "$1/" -name "*$ft" -exec stat -f%z {} \; | awk '{total += $1} END {printf "%s\t",total}'
    echo " $ft"
done | sort -hr

$ bash temp.sh assets
1622995  .monstertype
1279175  .frames
756855   .npctype
706087   .projectile
573611   .head
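Since BSD stat prints one line per operand, the per-file `-exec … \;` can be batched with `+` to avoid spawning one stat process per file. A sketch of the changed line:

find "$1/" -name "*$ft" -exec stat -f%z {} + | awk '{total += $1} END {printf "%s\t",total}'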
Nakilon
0

This counts the number of files per extension (rather than their total size):

find . -type f | grep -E ".*\.[a-zA-Z0-9]*$" | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq -c | sort -n

Solution originally posted on this post: Get all extensions and their respective file count in a directory

Blueicefield
0

I solved it using these two commands:

# collect the matching files into an array (note: breaks on names with whitespace)
FILES=($(find . -name '*.c'))
stat -c %s "${FILES[@]}" | awk '{ sum += $1 } END { print ".c" " " sum }'
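A whitespace-safe equivalent that skips the intermediate variable, assuming GNU stat:

find . -name '*.c' -exec stat -c %s {} + | awk '{ sum += $1 } END { print ".c", sum }'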
c4f4t0r
0

My version of an answer to the question:

#!/bin/bash

date >  get_size.log
# Lists all files
find . -type f -printf "%s\t%f\n" | grep -E ".*\.[a-zA-Z0-9]*$" | sort -h | awk  '
{
        split($2, ext, ".")
        e = ext[length(ext)]
        # Check that an extension could be found
        if(length(e) < length($2)) {
                # Check that the file size is bigger than 0
                if($1 > 0) {
                        # Check that the extension is not just an integer
                        if(!(e ~/^[0-9]+$/)) {
                                size[e] += $1
                        }
                }
        }
        # "blandat" (Swedish for "mixed") collects files without an extension
        if(length(e) == length($2)) {
                size["blandat"] += $1
        }
}

END{
 for(i in size)
   print size[i], i
}' | sort -n >> get_size.log
echo
echo
echo "The results are in the file get_size.log"
0

Try Crab (http://etia.co.uk/) - it's a command-line utility that allows you to query the filesystem using SQL.