
Problem: backup throughput for a DB2 database on an HP-UX server suddenly dropped from 1 TB+ per hour to about 350 GB per hour. The backup runs through Commvault backup software to the MediaAgent over a 10G network.

Troubleshooting done so far:

  1. Database. I ran a native DB2 backup using the same parallelism, number of buffers, and buffer size that Commvault uses, and got 1 TB+ per hour. So I don't think the database or its settings are the issue.

  2. Network. The network team confirmed the port utilization is very low, under 0.5% of 10G, with no errors reported on the switch. The throughput shown in HPE Intelligent Management Center tallies with what Commvault reports.

  3. OS. During the backup window I noticed CPU was constantly around 8% and memory around 83%, so I'm not sure whether there is a resource bottleneck or not.

  4. Backup software (Commvault). Other backup clients using the same disk library, same storage policy, and same MediaAgent get higher throughput. So I don't think the backup software is the issue.
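One quick cross-check on step 1: DB2 sizes its backup buffers in 4 KB pages, so you can compute how much buffer memory each run actually used and confirm the native and Commvault runs were truly comparable. A minimal sketch (the buffer counts and sizes below are examples, not the poster's actual settings):

```python
def backup_buffer_memory(num_buffers: int, buffer_size_4k_pages: int) -> int:
    """Total DB2 backup buffer memory in bytes (BUFFER is given in 4 KB pages)."""
    return num_buffers * buffer_size_4k_pages * 4096

# Example only: 16 buffers of 4096 pages (16 MiB each) -> 256 MiB of backup buffers.
# Run this for both the native and the Commvault settings and compare.
mem = backup_buffer_memory(16, 4096)
print(f"{mem / 2**20:.0f} MB")
```

If the two runs resolve to very different totals, the comparison in step 1 isn't apples to apples.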

I'm not sure what to check next, and I'd really appreciate some advice. I have a feeling the bottleneck is on either the network or the OS side, but I escalated to both teams and each came back saying everything is fine on their end. So I have no choice but to troubleshoot it myself.

Thank you so much for your help!

Tommy

2 Answers


Tommy, I just found this thread and wonder if you ever found the culprit / solution to this issue.
We are experiencing the same problem at our site (DB2 ESE multi-node on Linux/RHEL 7): only 300-400 MB of throughput for DB2, whereas we get 1-2 TB for Oracle PDBs! If you can share your findings, it would help us a lot to orient our research. Thanks in advance.

Denis

First, determine whether anything has changed. Your post indicates multiple teams are involved in managing this infrastructure, and they probably don't share information well with each other. Pin down exactly when the throughput drop happened and ask around (if you haven't already).

Next, let's start at the bottom of the OSI model and work our way up. Map out how things are connected first so you know what to check. Does this connection go through a physical switch or a virtual switch on some server? If one port shows low utilization, what about overall utilization? Is some other backup or sync job running at the same time?
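Putting "utilization" in numbers is straightforward: it's just bits moved per second divided by link capacity. A small sketch using the figures from the question (350 GB per hour on the stated 10G link):

```python
def link_utilization(bytes_delta: int, interval_s: float, link_bps: float = 10e9) -> float:
    """Percent utilization of a link, given bytes transferred over interval_s seconds."""
    return (bytes_delta * 8) / (interval_s * link_bps) * 100

# 350 GB moved in one hour on a 10G link:
util = link_utilization(350 * 10**9, 3600)
print(f"{util:.2f}%")  # prints 7.78%
```

Note the result: the backup stream alone should register roughly 7.8% on a 10G port, so a reading of "under 0.5%" suggests the team may be looking at the wrong port, or averaging over too long a window.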

After that, look for packet loss along the path and other problems with the protocol transporting this data. I assume the connection is TCP, so watch the big three factors that affect throughput: TCP window size, round-trip time, and available bandwidth. Packet loss causes TCP to back off and send less data per window. Higher latency means lower potential throughput (every millisecond spent waiting for an ACK is time not spent sending data). tcpdump is your friend: capture a slice of traffic and examine it.
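The window/RTT relationship can be sketched directly: on a loss-free path, a single TCP stream can never exceed window size divided by round-trip time. The window and RTT values below are illustrative, not measured from this environment:

```python
def tcp_throughput_ceiling(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-stream TCP throughput in bits/s: window / RTT."""
    return window_bytes * 8 / rtt_s

# Example: a 64 KiB window at 5 ms RTT caps one stream far below 10G.
bps = tcp_throughput_ceiling(64 * 1024, 0.005)
print(f"{bps / 1e6:.0f} Mbit/s")  # prints 105 Mbit/s
```

If the capture shows small advertised windows or an RTT jump around the time the slowdown began, this formula tells you immediately whether that alone explains the drop.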

Next, check the two endpoints of the connection and re-check that neither is bottlenecking on RAM or CPU load.

Lastly, some sanity checks:

1) When backups are not running, can other protocols move data faster between the same endpoints? SMB? FTP?

2) Is there some history here in this environment of poor backup performance?

3) Open a ticket with the vendor, if you have support.
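For sanity check 1, if you don't have iperf handy, a raw TCP throughput probe takes only a few lines. This is a sketch: in practice you'd run the listener on the MediaAgent and the sender on the HP-UX host; the localhost demo below just shows the mechanics, and loopback numbers will be far higher than any real link's.

```python
import socket
import threading
import time

CHUNK = 1024 * 1024      # 1 MiB per send
TOTAL = 64 * CHUNK       # 64 MiB test payload; raise this for real links

def sink(sock: socket.socket) -> None:
    """Accept one connection and discard everything it sends."""
    conn, _ = sock.accept()
    with conn:
        while conn.recv(65536):
            pass

def measure(host: str, port: int) -> float:
    """Send TOTAL bytes to host:port and return throughput in MB/s."""
    payload = b"\x00" * CHUNK
    with socket.create_connection((host, port)) as c:
        start = time.perf_counter()
        sent = 0
        while sent < TOTAL:
            c.sendall(payload)
            sent += CHUNK
    return TOTAL / (time.perf_counter() - start) / 1e6

# Demo against a local listener; point measure() at the real MediaAgent
# host/port to test the actual path.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=sink, args=(srv,), daemon=True).start()
rate = measure("127.0.0.1", srv.getsockname()[1])
srv.close()
print(f"{rate:.0f} MB/s")
```

If this probe between the same two hosts also tops out around 100 MB/s (roughly the 350 GB/hour you're seeing), the problem is the path or the TCP stack, not Commvault or DB2.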

It seems likely that the network is involved here, assuming nothing else changed in between.

manbearpig