bash: loop over 20000 files slow - why?

Question

A simple loop over a lot of files runs only half as fast on one system as on the other.

Using bash, I did something like:

for f in ./*
do
   something here
done

Using "time" I was able to confirm that on system2 the "something here" part runs faster than on system1. Nevertheless, the whole loop takes twice as long on system2 as on system1. Why? ...and how can I speed this up?

There are about 20000 (text) files in the directory. Reducing the number of files to about 6000 significantly speeds things up. These findings stay the same regardless of the looping method (replacing the for loop with a find command, or even putting the filenames in an array first).

System1: Debian (in an openvz-vm, using reiserfs)
System2: Ubuntu (native, faster Processor than System1, faster Raid5 too, using ext3 and ext4 - results stay the same)

So far I should have ruled out: hardware (System2 should be way faster), userland-software (bash, grep, awk, find are the same versions) and .bashrc (no spiffy config there).

So is it the filesystem? Can I tweak ext3/4 so that it gets as fast as reiserfs?

Thanks for your recommendations!

Edit: Ok, you're right, I should have provided more info. Now I have to reveal my beginner's bash mumble but here we go:

declare -a UIDS NAMES TEMPS ANGLEAS ANGLEBS
ELEM=0
for i in *html
do
    # get UID (kept in ID, because UID itself is a read-only variable in bash)
    ID=${i%-*html}
    UIDS[$ELEM]=$ID

    # get Name
    NAME=`awk -F, '/"name":"/ { lines[last] = $0 } END { print lines[last] }' "${i}" | awk '{ print $2 }'`
    NAME=${NAME##\[*\"}
    NAMES[$ELEM]=$NAME

    echo "getting values for [$ID] ($ELEM of $ELEMS)"

    # each value: last matching line, tags stripped, punctuation removed, third word
    TEMPS[$ELEM]=`awk -F, '/Temperature/ { lines[last] = $0 } END { print lines[last] }' "${i}" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }'`
    ANGLEAS[$ELEM]=`awk -F, '/Angle A/ { lines[last] = $0 } END { print lines[last] }' "${i}" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }'`
    ANGLEBS[$ELEM]=`awk -F, '/Angle B/ { lines[last] = $0 } END { print lines[last] }' "${i}" | sed 's/<[^>]*>//g' | tr -d '[:punct:]' | awk '{ print $3 }'`
    ### about 20 more lines like these ^^^
    ((ELEM++))
done

Yes, the problem is that I have to read the file 20 times, but putting the content of the file in a variable (FILE=(cat $i)) removes the line breaks and I can't use awk anymore...? Maybe I did that wrong, so if you have a suggestion for me, I'd be grateful.

Still, the problem remains that reading a file in that directory just takes too long...

To the hardware-question: well, system1 runs on over 5 year-old hardware, system2 is 2 months old. Yes, the specs are quite different (other mainboards, processors etc.) but system2 is way faster in every other aspect and raw write/read rates to the filesystem are faster too.
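
On the lost line breaks mentioned in the edit: FILE=(cat $i) without the $ creates an array holding the words "cat" and the file name, not the file's content. With FILE=$(cat "$i") the inner newlines are kept; they only vanish when the variable is later expanded without double quotes. A minimal sketch, assuming bash (the helper name last_match is made up for illustration and is not part of the original script):

```shell
#!/usr/bin/env bash
# last_match FILE REGEX  (hypothetical helper, for illustration only)
# Reads FILE once into a variable and prints the last line matching REGEX.
last_match() {
    local content
    content=$(cat "$1")    # inner newlines are preserved in the variable
    # The double quotes around "$content" keep the newlines intact;
    # an unquoted $content would be word-split and joined with spaces.
    awk -v pat="$2" '$0 ~ pat { last = $0 } END { print last }' <<< "$content"
}
```

Called as last_match "$i" 'Temperature'. Note that this still starts one awk process per value, so it would only save re-reading the file, not the process start-up cost discussed in the comments.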

There may be techniques to minimize those differences. If you describe in detail what you're trying to accomplish and show your actual script, we might be able to help further. — Dennis Williamson, Sep 23 '10 at 11:37
You said you ruled out hardware, but how? As in, what hard disks do you have? Are they identical drive technologies and caches? Did you run drive benchmarks to get some idea of drive performance? I'd do that next, personally, to see if there's something weird with drive access. — Bart Silverstrim, Sep 23 '10 at 11:44
So your problem is now very visible. You are executing awk, sed and a plethora of other tools again and again and again. Even if awk etc. start up in no time at all, it starts to pile up during this kind of operation. I would use Perl for this kind of stuff; it's made for this kind of text processing. I'm about to go elsewhere for a while, so no time for typing up a sample script now, but I will do that later if needed. – Janne Pikkarainen, Sep 23 '10 at 12:35
Again: I know that using sed/awk/tr is not the wisest of choices, but these tools are running *faster* on system2! I timed that. The problem lies in the for-loop, looping over 20k files... Thanks for your offer to help with Perl (I'll gladly look into it), but I was looking for a solution to get the looping quicker... – brengo, Sep 23 '10 at 12:41
Some trickery with awk etc. might make your current script faster, but the truth is that right now your script needs to execute those tools hundreds of thousands of times, or more. Even if starting up each one of those took only a millisecond, it would still pile up to hundreds of thousands of milliseconds just for starting up those tools. That kind of overhead is huge and can and should be reduced a lot. Reducing the need to constantly re-execute those tools is the only sane way to make your loop faster. – Janne Pikkarainen, Sep 23 '10 at 12:54

score 1 · Answer 1 · answered Sep 23 '10 at 10:59


Depends what you're doing exactly, but yes, ext file systems get slow when you've got a lot of files in one directory. Splitting the files into e.g. numbered subdirectories is one common way round this.
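
A minimal sketch of such a split, assuming bash and files matching *.html. The function name split_into_subdirs, the three-digit directory names and the default of 1000 files per directory are illustrative choices, not something from the question. Whether this actually helps on the asker's system is untested.

```shell
#!/usr/bin/env bash
# split_into_subdirs DIR [PER_DIR]  (hypothetical helper)
# Moves the *.html files in DIR into numbered subdirectories
# 000, 001, ... with at most PER_DIR files in each (default 1000).
split_into_subdirs() {
    local src=$1 per_dir=${2:-1000}
    local count=0 f sub
    # The glob is expanded once, before any file is moved.
    for f in "$src"/*.html; do
        [ -f "$f" ] || continue              # glob matched nothing
        if (( count % per_dir == 0 )); then  # start a new subdirectory
            printf -v sub '%03d' $(( count / per_dir ))
            mkdir -p -- "$src/$sub"
        fi
        mv -- "$f" "$src/$sub/"
        count=$(( count + 1 ))
    done
}
```

After split_into_subdirs . 1000, the processing loop would iterate over */*.html instead of *.html.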


pjc50


hnnf. I feared that answer :) ...well, I'm grepping values from these files, putting them in variables and sorting them afterwards. I can't merge these files and I can't access them online well enough (or do you recommend a tool for that, instead of wget, to download the files?) – brengo Sep 23 '10 at 11:20
Looking at the code you've now added, I suggest rewriting it as a Perl program that reads each line of each file only once, filling in the variables as it finds them. – pjc50 Sep 27 '10 at 11:38

score 1 · Answer 2 · answered Sep 23 '10 at 14:57

It's not necessary to use arrays in awk for what you're doing. You don't seem to be making use of the comma as a field separator since you're printing $0.

AWK can do what you have sed and tr doing.

It would be helpful to see what your data looks like.

One approach might be something like this (although it's pretty ugly to look at):

for f in *.html
do
    # one awk process per file; gensub() requires GNU awk (gawk)
    read -r array1[i] array2[i] array3[i] array4[i] . . . <<< "$(
        awk '
            /selector1/ {var1 = $2}
            /selector2/ {split($0,temparray,"<[^>]*>"); split(temparray[2],temparray); var2 = gensub("[[:punct:]]","","g",temparray[3])}
            /selector3/ {split($0,temparray,"<[^>]*>"); split(temparray[2],temparray); var3 = gensub("[[:punct:]]","","g",temparray[3])}
            . . .
            END { print var1, var2, var3, var4 . . . }' "$f"
    )"
    ((i++))
done

The choice of array subscripts in the awk script is dictated by the actual layout of your data. There may be better approaches, but this one eliminates about 1,600,000 spawned processes (20,000 files * 20 vars * 4 processes/var), so that only about 20,000 (one per file) remain.

You didn't say what kinds of execution times you were getting, but with this optimization it may be fast enough that you can take your time investigating the problem in your newer system.
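
For illustration, here is a self-contained variant of the one-awk-per-file idea that runs as written. Since the real data was never shown, the input layout is an assumption: each value sits on a line such as "<p>Angle A: 45</p>", so that it is the third word once tags and punctuation are stripped, which mirrors the sed, tr and awk '{ print $3 }' steps of the question. The function names extract_fields and collect are made up, and only three of the roughly twenty fields are shown.

```shell
#!/usr/bin/env bash
# extract_fields FILE  (hypothetical helper)
# A single awk run prints "temp angleA angleB" for FILE,
# using "-" for any value that was not found.
extract_fields() {
    awk '
        # Strip tags, drop everything except letters, digits and blanks
        # (for plain ASCII text this removes the punctuation),
        # then return the third word.
        function clean(s,   parts) {
            gsub(/<[^>]*>/, "", s)
            gsub(/[^A-Za-z0-9 \t]/, "", s)
            split(s, parts, " ")
            return (parts[3] == "" ? "-" : parts[3])
        }
        BEGIN         { temp = a = b = "-" }
        /Temperature/ { temp = clean($0) }   # a later match overwrites
        /Angle A/     { a = clean($0) }
        /Angle B/     { b = clean($0) }
        END           { print temp, a, b }
    ' "$1"
}

# collect DIR: fill the global arrays with one awk process per file.
collect() {
    local dir=$1 f i=0
    TEMPS=() ANGLEAS=() ANGLEBS=()
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue
        read -r "TEMPS[i]" "ANGLEAS[i]" "ANGLEBS[i]" < <(extract_fields "$f")
        i=$(( i + 1 ))
    done
}
```

The "-" placeholder matters because read splits on whitespace: an empty value would otherwise shift all later values one array to the left.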

score 0 · Answer 3 · answered Sep 23 '10 at 11:38

Your description is so vague that it's difficult to give you advice. Anyway, 20k files in a single directory is a lot, but not THAT much.

Many times it's possible to speed things up by rethinking what you do. What currently happens during your loop? Does your script need to read through 20 000 files 20 000 times? If so, would it be possible to modify your script to read through the 20 000 files only once and do the comparison 20 000 times? I mean: 1) read a file, 2) perform all the possible comparisons against that file, 3) proceed to the next file.

You mentioned filenames in an array, but what does that mean in this case? Does the script still need to perform 20 000 * 20 000 read operations instead of 20 000 read operations?
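
The "read a file, run every check against it, proceed to the next file" idea can be taken one step further by letting a single awk process walk through all files. This is a sketch under assumptions, not a drop-in replacement: the input layout (the value is the third word of a matching line once tags and punctuation are removed) is guessed from the pipelines in the question, the function name scan_dir is made up, only three fields are shown, and -maxdepth is an extension found in GNU and BSD find.

```shell
#!/usr/bin/env bash
# scan_dir DIR  (hypothetical helper)
# Prints one tab-separated line per non-empty *.html file in DIR:
#   filename, temperature, angle A, angle B   ("-" if not found)
# find hands the files to awk in large batches, so only a few awk
# processes are started for the whole directory.
scan_dir() {
    find "$1" -maxdepth 1 -name '*.html' -exec awk '
        function clean(s,   parts) {
            gsub(/<[^>]*>/, "", s)            # strip tags
            gsub(/[^A-Za-z0-9 \t]/, "", s)    # strip punctuation (ASCII)
            split(s, parts, " ")
            return (parts[3] == "" ? "-" : parts[3])
        }
        # Print the values gathered for the file that just ended.
        function emit() { if (file != "") print file, temp, a, b }
        BEGIN         { OFS = "\t" }
        FNR == 1      { emit(); file = FILENAME; temp = a = b = "-" }
        /Temperature/ { temp = clean($0) }
        /Angle A/     { a = clean($0) }
        /Angle B/     { b = clean($0) }
        END           { emit() }
    ' {} +
}
```

The output could then be read back into bash arrays with a single while IFS=$'\t' read -r loop. The lines come out in whatever order find visits the files.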
