Load is one of the most frequently misunderstood values on Linux.
On Linux it is a measure of all tasks in the running (or runnable) state or the uninterruptible sleep state.
Note that this counts tasks, not processes, so threads are included in the value.
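To see what is being counted, here is a minimal Python sketch that walks /proc and tallies every thread by state. It assumes a Linux /proc layout (/proc/<pid>/task/<tid>/stat, with the state letter right after the parenthesised command name); the function names are my own, and it is only a rough approximation of what the kernel counts, not how the kernel does it.

```python
import glob
import os
from collections import Counter


def task_state(stat_text):
    """Return the one-letter state from the text of a /proc/<pid>/task/<tid>/stat file."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST closing parenthesis.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_task_states(proc_root="/proc"):
    """Count every thread under proc_root by scheduler state letter."""
    counts = Counter()
    pattern = os.path.join(proc_root, "[0-9]*", "task", "[0-9]*", "stat")
    for stat_path in glob.iglob(pattern):
        try:
            with open(stat_path, errors="replace") as handle:
                counts[task_state(handle.read())] += 1
        except (OSError, IndexError):
            # Tasks can exit between listing and reading; skip them.
            continue
    return counts


if __name__ == "__main__":
    states = count_task_states()
    print("running or runnable (R):", states["R"])
    print("uninterruptible sleep (D):", states["D"])
```

The R and D counts added together are roughly what a single load sample would see at that instant.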
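Load is calculated by the kernel every five seconds as an exponentially weighted moving average. At each sample the previous value is decayed by a factor of e^(-5/60) for the one-minute figure, e^(-5/300) for the five-minute figure and e^(-5/900) for the fifteen-minute figure, and the current task count makes up the remainder.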
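As a rough illustration of that weighting, here is a sketch of the update step in Python. It uses floating point where the kernel uses fixed-point arithmetic, and the names (update_load, simulate) are made up for this example, so treat it as a model of the idea rather than the kernel's code.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples
WINDOWS = {"1min": 60.0, "5min": 300.0, "15min": 900.0}


def update_load(previous, active_tasks, window):
    """Fold one sample of the active task count into the running average."""
    decay = math.exp(-SAMPLE_INTERVAL / window)
    return previous * decay + active_tasks * (1.0 - decay)


def simulate(start, active_tasks, seconds, window):
    """Apply a constant task count for the given number of seconds."""
    load = start
    for _ in range(int(seconds // SAMPLE_INTERVAL)):
        load = update_load(load, active_tasks, window)
    return load


if __name__ == "__main__":
    # One permanently runnable task on an idle box, observed after a minute.
    for name, window in WINDOWS.items():
        print(name, round(simulate(0.0, 1, 60, window), 2))
```

In this model, a single always-runnable task on an otherwise idle machine only brings the one-minute figure to about 0.63 after a full minute, which is why the number lags behind what the system is doing right now.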
Generally speaking, load as a pure number has little value without a point of reference and I consider the value often misrepresented.
Misconception 1: Load as a Ratio
In other words, how can I know what maximum load average a machine can support before performance starts to degrade?
This is the most common mistake people make about load on Linux: that it can be used to measure CPU performance against some fixed ratio. This is not what load gives you.
To elaborate: people have an easy time understanding CPU utilization. This is utility over time. You take work done, then divide it by work possible.
Work possible in this regard is a fixed, known value, normally represented as a percentage out of 100. That's your fixed ratio.
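For comparison, the utilization ratio can be sketched from two readings of the aggregate "cpu" line in /proc/stat. This assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and treats iowait as idle time, which is a choice on my part; the function names are hypothetical.

```python
def cpu_times(stat_line):
    """Return (busy, total) jiffies from an aggregate 'cpu' line of /proc/stat."""
    # Assumed field order: user nice system idle iowait irq softirq steal
    fields = [int(value) for value in stat_line.split()[1:9]]
    idle = fields[3] + fields[4]  # idle + iowait
    total = sum(fields)
    return total - idle, total


def utilization(before, after):
    """Work done divided by work possible between two samples (0.0 to 1.0)."""
    busy_before, total_before = cpu_times(before)
    busy_after, total_after = cpu_times(after)
    elapsed = total_after - total_before
    return (busy_after - busy_before) / elapsed if elapsed else 0.0


if __name__ == "__main__":
    first = "cpu  100 0 100 800 0 0 0 0 0 0"
    second = "cpu  200 0 150 850 0 0 0 0 0 0"
    print("utilization:", utilization(first, second))  # 150 busy of 200 elapsed
```

The denominator is what makes this a ratio: there are only so many jiffies between two samples, so the result can never exceed 100%.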
Load, however, has no such constraint. There is no fixed maximum, which is why you are having difficulty understanding what to measure against.
To clarify, what load samples does have a maximum, but not a fixed one: it is the total number of tasks present in the system when the sample is taken (which has no real bearing on what CPU work is being done).
Load as it's calculated has no fixed maximum, because each sample is folded into a weighted average and the total number of tasks at sampling time is not recorded.
Because I like food, an analogy you could give is that utilization is how fast you can eat your plate and load is - on average - how many plates you have left to devour.
So the difference between CPU utilization and load is subtle but important. CPU utilization is a measure of work being done, and load is a measure of work that needs to be done.
Misconception 2: Load is an Instant Measurement
The second fallacy is that load is a granular measurement: that you can read a single number and understand the system's state.
Load is not granular but represents the general long-term condition of the system. Not only is it sampled every five seconds (so it misses tasks that run entirely within the 5-second window), it is also reported as averages over 1, 5 and 15 minutes.
You can't use it as an instant measure of capacity, only as a general sense of a system's burden over a longer period.
The 1-minute load can be 100 and then around 10 just over two minutes later. It's a value you have to keep watching to work with.
What can Load tell you?
It can give you an idea of the system's working trend. Is it being given more than it can cope with, or less?
- If the load is less than the number of CPUs you have, this (normally) indicates you have more CPU capacity than work.
- If the load is greater than or equal to the number of CPUs and is trending upwards, it's an indication that the system has more work than it can handle.
- If the load is greater than or equal to the number of CPUs and is trending downwards, it's an indication that the system is getting through the work faster than you are giving it new work.
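The three cases above can be sketched as a small Python check. It uses the ordering of the 1, 5 and 15 minute figures as a stand-in for the trend, which is my own simplification; describe_load is a made-up name, and the labels are only a reading aid.

```python
import os


def describe_load(load1, load5, load15, cpus):
    """Classify the load averages against the CPU count and their trend."""
    if load1 < cpus:
        return "spare capacity"
    if load1 > load5 > load15:
        return "overloaded and rising"
    if load1 < load5 < load15:
        return "busy but draining"
    return "busy, no clear trend"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one, five, fifteen = os.getloadavg()
    print(describe_load(one, five, fifteen, os.cpu_count() or 1))
```

Graphing the value over time gives a far better picture of the trend than any single comparison like this.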
The uninterruptible sleep state does muddy load as a pure scheduling measure of work, but it gives you some indication of how much demand there is on the disk (technically, that is still work that needs to be done).
Load also offers clues to anomalies on a system. If you see the load at 50+, it suggests something is amiss.
Load can also cause people to be concerned without reason.
- As is commonly known, disk activity can inflate load.
- The load can be artificially inflated if lots of processes are bound to one CPU and are all waiting on it.
- Tasks with a very low priority (high niceness) will often wait a long time, each inflating load by 1 while it waits.
In Summary
I find load a very woolly value, precisely because there are no absolutes with it. A measurement you get on one system is often meaningless when compared against another.
It's probably one of the first things I'd look at in top, purely to check for an obvious anomaly. Basically I'm using it almost like a thermometer: an indication of the general condition of a system only.
I find its sampling period way too long for most workloads I throw at my systems (which run in the order of seconds generally, not minutes). I suppose it makes sense for systems that execute long-running intensive tasks, but I don't really do much of that.
The other thing I use it for is long-term capacity management. It's a nice thing to graph over long periods of time (months), as you can use it to understand how much more work you are handling compared to a few months ago.
Finally, to answer your question about what to do in your scenario.
Quite honestly, the best suggestion I can offer is this: rather than using load as a factor in deciding when to run, use nice to execute your process so that other processes get priority over it. This is good for a few reasons.
- Your process gets only a small amount of CPU time when other processes are busy.
- If nothing else is on the CPU, or a CPU is idle, your task gets 100% of the time on it.
- All the processes your command spawns inherit the same niceness.
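If the pipeline is launched from a script, the same effect can be had without the nice command. Here is a minimal Python sketch, assuming a POSIX system; run_niced is a hypothetical helper, and the command in the demo is only a stand-in that prints its own niceness.

```python
import os
import subprocess
import sys


def run_niced(command, increment=10, **kwargs):
    """Run command with its niceness raised by `increment`; its children inherit it."""
    return subprocess.run(
        command,
        # Runs in the child between fork and exec (POSIX only, and best
        # avoided in multi-threaded programs).
        preexec_fn=lambda: os.nice(increment),
        **kwargs,
    )


if __name__ == "__main__":
    # Stand-in for the real pipeline: the child reports its own niceness.
    child = [sys.executable, "-c", "import os; print(os.nice(0))"]
    result = run_niced(child, increment=10, capture_output=True, text=True)
    print("child niceness:", result.stdout.strip())
```

From a shell, prefixing the command with nice -n 10 does the same job.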
With a niceness of 0 (the default) each process gets a weight of 1024. The lower the weight, the less time on the CPU is offered to the process. Here is a table of this behaviour.
Nice Weight
0 1024
1 820
2 655
3 526
4 423
5 335
6 272
7 215
8 172
9 137
10 110
11 87
12 70
13 56
14 45
15 36
16 29
17 23
18 18
19 15
So to compare, in a scenario where you have 2 processes waiting to run on the same CPU: if you renice a process to +10 it gets approximately 1/10th of the CPU time a niceness 0 process gets (a weight of 110 against 1024). If you renice it to +19 it gets roughly 1/70th (15 against 1024).
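The arithmetic behind that comparison can be sketched in a few lines of Python. It assumes the scheduler divides CPU time in proportion to weight and that all the tasks stay runnable on the same CPU; cpu_shares is a made-up helper and the weights are copied from the table above.

```python
# Weights for niceness 0 to 19, copied from the table above.
NICE_WEIGHTS = {
    0: 1024, 1: 820, 2: 655, 3: 526, 4: 423,
    5: 335, 6: 272, 7: 215, 8: 172, 9: 137,
    10: 110, 11: 87, 12: 70, 13: 56, 14: 45,
    15: 36, 16: 29, 17: 23, 18: 18, 19: 15,
}


def cpu_shares(*nice_values):
    """Fraction of one CPU each competing task receives, in argument order."""
    weights = [NICE_WEIGHTS[nice] for nice in nice_values]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    for nice in (10, 19):
        normal, niced = cpu_shares(0, nice)
        print("nice 0 vs nice %d: %.1f%% / %.1f%%" % (nice, normal * 100, niced * 100))
```

With only two tasks competing, the niced one ends up with a little under 10% of the CPU at +10 and about 1.4% at +19. With nothing else runnable it still gets the whole CPU.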
Note that you'll probably see your load sitting at 1 or more for the duration of your pipeline.
I imagine this would be a more elegant solution to your problem.