Earlier this week I had a 'perfect storm' moment on my servers: two backup jobs (one for each RAID10 array on the system) had been humming along for 18 hours when we had a sustained spike in traffic on my I/O-intensive application. The result was unacceptably slow performance, and I had to force our administrator to cancel the backups. (He was not happy about this... not at all. "I'm not responsible if...")
The end result was lots of stress, unhappy customers, and a very grouchy Stu.
The bottleneck was disk utilization: once the jobs were canceled, everything worked just fine. What can I suggest to my administrators to lessen the impact of backups on my servers?
Here are some of the gory details:
The backup command itself (I got this out of ps, but really don't know what it means):
bpbkar -r 1209600 -ru root -dt 0 -to 0 -clnt xtx-le00 -class F_Full_on_Thursday
-sched Incr_Fri_to_Wed -st INCR -bpstart_to 300 -bpend_to 300 -read_to 300
-blks_per_buffer 127 -stream_count 8 -stream_number 8 -jobgrpid 223932 -tir -tir_plus
-use_otm -use_ofb -b svr_1259183136 -kl 28 -fso
The system
- RHEL4 64-bit
- 4GB RAM (~half used by applications)
- DL380G5 with two attached SAS RAID10 partitions, ~550GB and ~825GB
The data
- 1TB
- ~10 million files
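For a sense of scale, the two figures above imply a fairly small average file size. This is only a back-of-the-envelope sketch: it assumes "1TB" means 10^12 bytes and treats "~10 million" as exactly 10,000,000 files, and the function name is just an illustrative helper.

```python
# Rough average file size from the figures listed above.
# Assumptions (not measured): 1 TB = 10**12 bytes, exactly 10,000,000 files.
TOTAL_BYTES = 10**12
FILE_COUNT = 10_000_000


def average_file_size_kb(total_bytes: int, file_count: int) -> float:
    """Return the mean file size in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / file_count / 1000


avg_kb = average_file_size_kb(TOTAL_BYTES, FILE_COUNT)
print(f"average file size: about {avg_kb:.0f} KB")
```

So the backup is walking through millions of files averaging on the order of 100 KB each, rather than streaming a handful of large ones.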
The application
- busy from 0900 to 2300 on weekdays
- I/O-intensive (99% read), mostly focused on a few hundred MB of files