1

Edit 09/20/2012

I made this way too complicated before. I believe that these commands are actually the simpler way, while still formatting everything nicely.

    RHEL 5
    du -x / | sort -n |cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -n|tail -10|cut -f2|xargs du -sxh

    Solaris 10
    du -d / | sort -n |cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -n|tail -10|cut -f2|xargs du -sdh

Edit: The command has been updated to properly make use of du -x or du -d on RHEL5 or Solaris 10, respectively.
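
For anyone trying to follow the shorter RHEL 5 one-liner above, here is a stage-by-stage reading of it. This is only an annotated sketch, not a replacement: the function name `top_level_dirs` is made up for illustration, and it stops before the final `xargs du -sxh` so that it can be fed canned `du` output.

```shell
# top_level_dirs (hypothetical name): reads `du -x /` style lines
# ("SIZE<TAB>/path/...") on stdin and prints up to ten top-level
# paths, ordered from smallest to largest.
top_level_dirs() {
    sort -n |             # order by size, as in the original one-liner
    cut -d/ -f1-2 |       # truncate every path to its first component
    sort -k2 -k1,1nr |    # group by path, largest size first within a group
    uniq -f1 |            # keep one line per path (the largest one)
    sort -n |             # order the surviving paths by size, ascending
    tail -10 |            # keep the ten largest
    cut -f2               # drop the size column, leaving only the path
}

# Usage on RHEL 5 (on Solaris 10, swap in `du -d` and `du -sdh`):
#   du -x / | top_level_dirs | xargs du -sxh
```

Note that "/" itself survives the `cut` step as its own entry, which is why the total for the root filesystem shows up in the final listing.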

RHEL5

du -x /|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-3|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -sxh|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"|sed '$d'

Solaris

du -d /|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-3|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -sdh|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"|sed '$d'

This will return directories over 50 MB within the "/" filesystem in ascending, recursive, human-readable format, and in a reasonably short amount of time.

Request: Can you help make this one-liner more effective, faster, or efficient? How about more elegant? If you understand what I did there then please read on.

The problem is that it can be difficult to quickly discern which directories under "/" are contributing to the capacity of the "/" filesystem, because all other filesystems are mounted somewhere under root.

This will exclude all non / filesystems when running du on Solaris 10 or Red Hat el5 by basically munging an egrepped df from a second pipe-delimited egrep regex subshell exclusion that is naturally further excluded upon by a third egrep in what I would like to refer to as "the whale." The munge-fest frantically escalates into some xargs du recycling where du -x/-d is actually useful (see bottom comments), and a final, gratuitous egrep spits out a list of relevant, high-capacity directories that are exclusively contained within the "/" filesystem, but not within other mounted filesystems. It is very sloppy.
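
The df-munging step described above might be reduced along the following lines. This is a sketch under assumptions, not what the one-liner actually does: it relies on `df -P` (the POSIX output format), which prints one line per filesystem with the mount point in the sixth column, so the wrapped NFS lines and the keyword-exclusion egrep are not needed. `non_root_mounts` is a hypothetical name, mount points containing spaces are not handled, and on Solaris 10 I am assuming `-P` requires `/usr/xpg4/bin/df`.

```shell
# non_root_mounts (hypothetical name): reads `df -P` output on stdin
# and prints every mount point except "/" itself, one per line.
non_root_mounts() {
    # NR > 1 skips the header line; column 6 is "Mounted on".
    awk 'NR > 1 && $6 != "/" { print $6 }'
}

# Usage:
#   df -P | non_root_mounts
```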

Linux platform example: xargs du -shx

pwd = /

du *|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -shx|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"

This is running against these filesystems:

            Linux builtsowell 2.6.18-274.7.1.el5 #1 SMP Mon Oct 17 11:57:14 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

            df -kh

            Filesystem            Size  Used Avail Use% Mounted on
            /dev/mapper/mpath0p2  8.8G  8.7G  90M   99% /
            /dev/mapper/mpath0p6  2.0G   37M  1.9G   2% /tmp
            /dev/mapper/mpath0p3  5.9G  670M  4.9G  12% /var
            /dev/mapper/mpath0p1  494M   86M  384M  19% /boot
            /dev/mapper/mpath0p7  7.3G  187M  6.7G   3% /home
            tmpfs                  48G  6.2G   42G  14% /dev/shm
            /dev/mapper/o10g.bin   25G  7.4G   17G  32% /app/SIP/logs
            /dev/mapper/o11g.bin   25G   11G   14G  43% /o11g
            tmpfs                 4.0K     0  4.0K   0% /dev/vx
            lunmonster1q:/vol/oradb_backup/epmxs1q1
                                  686G  507G  180G  74% /rpmqa/backup
            lunmonster1q:/vol/oradb_redo/bisxs1q1
                                  4.0G  1.6G  2.5G  38% /bisxs1q/rdoctl1
            lunmonster1q:/vol/oradb_backup/bisxs1q1
                                  686G  507G  180G  74% /bisxs1q/backup
            lunmonster1q:/vol/oradb_exp/bisxs1q1
                                  2.0T  1.1T  984G  52% /bisxs1q/exp
            lunmonster2q:/vol/oradb_home/bisxs1q1
                                   10G  174M  9.9G   2% /bisxs1q/home
            lunmonster2q:/vol/oradb_data/bisxs1q1
                                   52G  5.2G   47G  10% /bisxs1q/oradata
            lunmonster1q:/vol/oradb_redo/bisxs1q2
                                  4.0G  1.6G  2.5G  38% /bisxs1q/rdoctl2
            ip-address1:/vol/oradb_home/cspxs1q1
                                   10G  184M  9.9G   2% /cspxs1q/home
            ip-address2:/vol/oradb_backup/cspxs1q1
                                  674G  314G  360G  47% /cspxs1q/backup
            ip-address2:/vol/oradb_redo/cspxs1q1
                                  4.0G  1.5G  2.6G  37% /cspxs1q/rdoctl1
            ip-address2:/vol/oradb_exp/cspxs1q1
                                  4.1T  1.5T  2.6T  37% /cspxs1q/exp
            ip-address2:/vol/oradb_redo/cspxs1q2
                                  4.0G  1.5G  2.6G  37% /cspxs1q/rdoctl2
            ip-address1:/vol/oradb_data/cspxs1q1
                                  160G   23G  138G  15% /cspxs1q/oradata
            lunmonster1q:/vol/oradb_exp/epmxs1q1
                                  2.0T  1.1T  984G  52% /epmxs1q/exp
            lunmonster2q:/vol/oradb_home/epmxs1q1
                                   10G   80M   10G   1% /epmxs1q/home
            lunmonster2q:/vol/oradb_data/epmxs1q1
                                  330G  249G   82G  76% /epmxs1q/oradata
            lunmonster1q:/vol/oradb_redo/epmxs1q2
                                  5.0G  609M  4.5G  12% /epmxs1q/rdoctl2
            lunmonster1q:/vol/oradb_redo/epmxs1q1
                                  5.0G  609M  4.5G  12% /epmxs1q/rdoctl1
            /dev/vx/dsk/slaxs1q/slaxs1q-vol1
                                  183G   17G  157G  10% /slaxs1q/backup
            /dev/vx/dsk/slaxs1q/slaxs1q-vol4
                                  173G   58G  106G  36% /slaxs1q/oradata
            /dev/vx/dsk/slaxs1q/slaxs1q-vol5
                                   75G  952M   71G   2% /slaxs1q/exp
            /dev/vx/dsk/slaxs1q/slaxs1q-vol2
                                  9.8G  381M  8.9G   5% /slaxs1q/home
            /dev/vx/dsk/slaxs1q/slaxs1q-vol6
                                  4.0G  1.6G  2.2G  42% /slaxs1q/rdoctl1
            /dev/vx/dsk/slaxs1q/slaxs1q-vol3
                                  4.0G  1.6G  2.2G  42% /slaxs1q/rdoctl2
            /dev/mapper/appoem     30G  1.3G   27G   5% /app/em

This is the result:

Linux:

            54M     etc/gconf
            61M     opt/quest
            77M     opt
            118M    usr/  ##===\
            149M    etc
            154M    root
            303M    lib/modules
            313M    usr/java  ##====\
            331M    lib
            357M    usr/lib64  ##=====\
            433M    usr/lib  ##========\
            1.1G    usr/share  ##=======\
            3.2G    usr/local  ##========\
            5.4G    usr   ##<=============Ascending order to parent
            94M     app/SIP   ##<==\
            94M     app   ##<=======Were reported as 7gb and then corrected by second du with -x.

=============================================

Solaris Platform example: xargs du -shd

pwd = /

du *|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -sh|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"

This is running against these filesystems:

            SunOS solarious 5.10 Generic_147440-19 sun4u sparc SUNW,SPARC-Enterprise

            Filesystem             size   used  avail capacity  Mounted on
             kiddie001Q_rpool/ROOT/s10s_u8wos_08a  8G   7.7G    1.3G    96%    / 
            /devices                 0K     0K     0K     0%    /devices
            ctfs                     0K     0K     0K     0%    /system/contract
            proc                     0K     0K     0K     0%    /proc
            mnttab                   0K     0K     0K     0%    /etc/mnttab
            swap                    15G   1.8M    15G     1%    /etc/svc/volatile
            objfs                    0K     0K     0K     0%    /system/object
            sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
            fd                       0K     0K     0K     0%    /dev/fd
            kiddie001Q_rpool/ROOT/s10s_u8wos_08a/var    31G   8.3G   6.6G    56%    /var
            swap                   512M   4.6M   507M     1%    /tmp
            swap                    15G    88K    15G     1%    /var/run
            swap                    15G     0K    15G     0%    /dev/vx/dmp
            swap                    15G     0K    15G     0%    /dev/vx/rdmp
            /dev/dsk/c3t4d4s0      320G   279G    41G    88%    /fs_storage
            /dev/vx/dsk/oracle/ora10g-vol1  292G   214G    73G    75%     /o10g
            /dev/vx/dsk/oec/oec-vol1    64G    33G    31G    52%    /oec/runway
            /dev/vx/dsk/oracle/ora9i-vol1   64G    33G    31G   59%    /o9i
            /dev/vx/dsk/home     23G    18G   4.7G    80%    /export/home
            /dev/vx/dsk/dbwork/dbwork-vol1 292G   214G    73G    92%    /db03/wk01
            /dev/vx/dsk/oradg/ebusredovol   2.0G   475M   1.5G    24%    /u21
            /dev/vx/dsk/oradg/ebusbckupvol   200G    32G   166G    17%    /u31
            /dev/vx/dsk/oradg/ebuscrtlvol   2.0G   475M   1.5G    24%    /u20
            kiddie001Q_rpool       31G    97K   6.6G     1%    /kiddie001Q_rpool
            monsterfiler002q:/vol/ebiz_patches_nfs/NSA0304   203G   173G    29G    86%    /oracle/patches
            /dev/odm                 0K     0K     0K     0%    /dev/odm

This is the result:

Solaris:

            63M     etc
            490M    bb
            570M    root/cores.ric.20100415
            1.7G    oec/archive
            1.1G    root/packages
            2.2G    root
            1.7G    oec

==============

How could one more effectively deal with "/" aka "root" filesystem full issues across multiple platforms that have a devastating number of mounts?

On Red Hat el5, du -x apparently avoids traversal into other filesystems. While this may be so, it does not appear to do anything if run from the / directory.

On Solaris 10, the equivalent flag is du -d, which apparently packs no surprises.

(I'm hoping I've just been doing it wrong.)

Guess what? It's really slow.

nice_line
  • FWIW: The `du` packaged with RHEL is from GNU coreutils. Any other Linux that uses that edition of `du` will have the same `-x` flag. – Charles Aug 27 '12 at 19:42
  • Have you considered `fdisk /`? *"Oops, it all got formatted. Oh well, guess we'll have to rebuild it **right** this time."* That's definitely what I'd [get one of my more gullible coworkers to] do. – HopelessN00b Aug 27 '12 at 20:02
  • @Charles: regarding human-readable, the newer versions of coreutils allow for this: du -h * | sort -h. Another option, that I don't have, is this: du -BM | sort -nr – nice_line Aug 27 '12 at 20:13
  • You are unwilling to make any changes to your environment, but you want us to make it better for you? You can't have it both ways -- fixing this disaster requires changing *something* (your monitoring software or your whole disaster of an environment) -- tell us which you want to do, but if you're unwilling or unable to do either, this is just a rant cleverly masquerading as a question. – voretaq7 Aug 27 '12 at 20:20
  • Totally a rant :) – ewwhite Aug 27 '12 at 20:27
  • @voretaq7: I simply encourage anyone brave enough to help refine what I can work with. I requested no fix, and I hardly expect that the community would magically make my environment better. What I want to do is turn my bloated one liner into something less embarrassing, and I should hope this would aid those in the community and at large who find themselves in a similar mess, which I reckon is not so unusual. – nice_line Aug 27 '12 at 20:35

4 Answers

4

Your problem, as I understand it, is that du is descending into other filesystems (some of which are network or SAN mounts, and take a long time to count up utilization on).

I respectfully submit that if you're trying to monitor filesystem utilization, du is the wrong tool for the job. You want df (which you apparently know about, since you included its output).

Parsing the output from df can help you target specific filesystems in which you should be running du to determine which directories are chewing up all your space (or if you're lucky the full filesystem has a specific responsible party who you can tell to figure it out for themselves). In either case at least you will know a filesystem is filling up before it's full (and the output is easier to parse).

In short: Run df first, then, if you have to, run du on any filesystem that df identified as having utilization over (say) 85% to get more specific details.
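
A rough sketch of that "df first, du second" workflow, under some assumptions: it parses `df -P` output (one line per filesystem, capacity in column 5, mount point in column 6), treats "over 85%" as "at or above the threshold", and does not cope with mount points containing spaces. The name `mounts_over` is invented for this example.

```shell
# mounts_over LIMIT (hypothetical name): reads `df -P` output on stdin
# and prints the mount points whose capacity is at or above LIMIT percent.
mounts_over() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)              # "99%" -> "99"
            if (use + 0 >= limit + 0) print $6
        }'
}

# Usage on RHEL 5 (use `du -d` instead of `du -x` on Solaris 10):
#   df -P | mounts_over 85 | while read -r mount; do
#       du -x "$mount" | sort -n | tail -10
#   done
```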


Moving on into your script, the reason du isn't respecting your -d (or -x) flag is because of the question you're asking:

 # pwd   
 /
 # du * (. . .etc. . .)

You are asking du to run on everything under / (du -x /bin /home /sbin /usr /tmp /var etc.), and du is then doing exactly what you asked: giving you the usage of each of those things. If one of the arguments happens to be a filesystem root, du assumes you know what you're doing and gives the usage of that filesystem up to the first sub-mount it finds.

This is critically different from du -x / ("Tell me about / and ignore any sub-mounts").

To fix your script, don't cd into the directory you are analyzing -- instead just run
du /path/to/full/disk | [whatever you want to feed the output through]
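
To see why the two forms differ, it can help to print what the shell actually hands to du once the `*` has been expanded. A minimal sketch follows; `expanded_du_command` is a name made up for the demonstration and it only echoes the command, it does not run du.

```shell
# expanded_du_command DIR (hypothetical name): print the command line
# that `cd DIR; du -x *` really executes after glob expansion.
# The body runs in a subshell so the caller's directory is untouched.
expanded_du_command() (
    cd "$1" || exit 1
    set -- *            # the shell, not du, turns * into a list of names
    echo "du -x $*"
)

# Usage:
#   expanded_du_command /
```

Every name in that list is its own starting point, so a mount point such as /var is measured in full: `-x` only stops du from leaving the filesystem that each argument starts on.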


This (or any other suggestion you may get) doesn't solve your two core problems:

  1. Your monitoring system is ad-hoc
    If you want to catch problems before they bite you in the genitals you really need to deploy a decent monitoring platform. If you're having trouble getting your management team to buy into this remind them that proper monitoring lets you avoid downtime.

  2. Your environment (as you've rightly surmised) is a mess
    Not much to be done here except rebuild the thing - It's your job as the SA to stand up and make a very clear, very LOUD business case for why the systems need to be taken down one at a time and rebuilt with a structure that can be managed.

You seem to have a pretty decent handle on what needs to be done, but if you have questions by all means ask them and we'll try to help as much as we can (we can't do your architecture for you, but we can answer conceptual questions or practical "How do I do X with monitoring tool Y?" type stuff).

voretaq7
  • I appreciate your thoughtful response. There is a centralized monitoring platform in place that makes it quite clear when various issues are occurring depending on whatever thresholds are set. Detection is not the problem. The same challenge remains when tasked with cleaning a root aka "/" filesystem surrounded by hordes of insular filesystems, regardless of when a determination or detection was made. When one runs `df` and one sees >=85% on "/" and `du` has to be run to get more specific details on the contents of a skewed "/" filesystem one too many times...he is me and I am here. – nice_line Aug 27 '12 at 21:06
  • Also @voretaq7, I'm not sure you completely understand. The initial problem is not the speed, it is that EVERYTHING other than "/" falls under "/". The challenge is quickly discerning which filesystems, represented by folder names, should be ignored when trying to determine the disk usage of "/", which all other filesystems fall under. – nice_line Aug 27 '12 at 21:13
  • @nice_line Ah - I think I see your problem -- see if my edit clears things up. – voretaq7 Aug 27 '12 at 21:22
  • Ah, yes. I was hoping it was something simple like that. I'll update my code in the post and give you credit. – nice_line Aug 28 '12 at 00:15
3

Simple answer: install an infrastructure monitoring tool (such as Zenoss, Zabbix, etc.).

If you're looking for something custom, perhaps you need some sort of abstraction layer to handle weird per-machine differences rather than managing that by hand every time?

MikeyB
  • I suppose I failed to appreciate that someone might actually think a digital janitor such as myself could simply choose to submit/approve/spearhead/purchase/design/implement/utilize a costly high-level enterprise solution to resolve my abject, powerless position. – nice_line Aug 27 '12 at 20:25
  • @nice_line Not all monitoring systems are costly. Many are free (only your time to set them up). If your position is really so powerless as to be unable to make *any* substantive changes to the environment to alleviate your problems you need to send your boss to this site. One of our core assumptions is that you at least have the power to recommend solutions (and have those recommendations listened to by management)... – voretaq7 Aug 27 '12 at 20:37
  • @voretaq7 As I mentioned elsewhere, monitoring is in place. The issue is not detection, it regards the swift handling of identifying which directories under the "/" filesystem are consuming space, while ignoring other, irrelevant filesystem paths dynamically. I am not looking for an "enterprise-wide solution" that would span 4 continents and multiple data centers regarding my "please help me with my how can I clean disk space better" request. – nice_line Aug 27 '12 at 21:19
  • @nice_line Something as simple as a filesystem check using [Monit](http://mmonit.com/monit/) would allow you to set a generic threshold on any filesystem mount (say, alert me at 80%). That's free and available within the distribution. If you have SNMP capabilities, [OpenNMS](http://www.opennms.org/) is smart enough to detect sudden increases/decreases in disk utilization. – ewwhite Aug 27 '12 at 21:47
  • Your question was vague and ranty enough that it was not nearly clear exactly for what you were asking. Installing monitoring is *not* quite the drastic job that you seem to think it is and there are good, free solutions available (two of which I mentioned). – MikeyB Aug 28 '12 at 16:56
1

I make this recommendation often. The tool I advocate for ad-hoc disk usage calculations is the ncdu utility. There is an --exclude flag that can be specified multiple times.

There are packaged versions for Solaris (CSWncdu), or you can compile it from source. It simplifies much of what you're doing.

ewwhite
  • "Tell me I should download forbidden 3rd party software." Check! – nice_line Aug 27 '12 at 19:45
  • Sounds like a political issue, then. What's your goal? Fixing the root problem in the environment? Or is it reacting to an outage or near-emergency? Are the "growth" directories on these systems not predictable, e.g. `/opt/app/logs`? – ewwhite Aug 27 '12 at 19:57
  • Fixing the root problem is far beyond my control. It is purely reactionary, and yes, the "growth" directories are frequently unpredictable due to a lack of design, build, and maintenance consistency spanning several decades. Even applications implemented across blades that share a chassis, that were deployed at the same time for different clients, are frequently installed in different directories. – nice_line Aug 27 '12 at 20:09
1

I think what you are looking for is something like ncdu. That will let you stop from traversing into directories, while still being able to find where the disk is being consumed.

I will echo the other answers by saying that this is the tool you use after your monitoring systems have detected a problem - it's not the sort of tool you would want to use non-interactively. In fact, because it's ncurses based, doing so would be a kludge. Any systems administrator worth their salt will let you download a vetted and simple tool to prevent resource-hungry, hacked-together bash monstrosities like the one you've described. Such a script will use far more memory, far more I/O, and be far more dangerous than that "forbidden" software.

Scrivener