Setup
- Ubuntu 20.04
- Dell PowerEdge R820
- [PERC H710] 2x Virtual Drives (RAID-1 Boot, RAID-0 Work Drive)
- Everything had been fine for 6 months
- No preceding event; the drive just suddenly filled up
Details...
This machine is used for plotting Chia (cryptocurrency) - it's been working away for months without issue.
I noticed the plotting process (bladebit) had crashed, which is pretty uncommon (it happens maybe once every 2 months), so I went to fire it back up and immediately started getting device-full types of errors.
I fired off a quick df -h to see what was going on, and got this:
Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  2.9M   51G   1% /run
/dev/sda2       549G  512G  8.7G  99% /
tmpfs           252G  4.0K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           252G     0  252G   0% /sys/fs/cgroup
/dev/sda1       511M  5.3M  506M   2% /boot/efi
tmpfs            51G     0   51G   0% /run/user/1000
<... SNIP ...>
/dev/sda2 is the boot drive - it's actually a RAID-1 (2-disk) Virtual Disk handled by the H710 RAID card in the server, but I don't think that's terribly relevant.
NORMALLY this drive is 3% full; it only has a bootable Ubuntu Server 20.04 install on it and nothing else.
I had to erase the temp file in root and a few other garbage files to free up enough space to get things functioning again, but it's still sitting at dang near full.
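For reference, the kind of cleanup I mean was just deleting things by hand; the paths here are placeholders, not the exact files I removed:
$ sudo rm /some-leftover-file.tmp   # placeholder for the temp file sitting in root
$ sudo rm -r /tmp/old-junk          # placeholder for the other garbage files
$ df -h /                           # re-check how much space came back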
I followed countless "find the biggest file on your server" tips from here and around the web, for example this one, with the command sudo du -a / 2>/dev/null | sort -n -r | head -n 20 returning:
$ sudo du -a / 2>/dev/null | sort -n -r | head -n 20
[sudo] password for user:
1010830919685 /
1010823681740 /mnt
<...SNIP...>
OK, so something huge is sitting in / apparently? A simple ls shows nothing of interest in there:
$ ls -lFa /
total 84
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ./
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ../
lrwxrwxrwx    1 root root     7 Aug 24 08:41 bin -> usr/bin/
drwxr-xr-x    4 root root  4096 Jan  6 06:22 boot/
drwxr-xr-x    2 root root  4096 Sep 28 14:04 cdrom/
drwxr-xr-x   21 root root  6920 Jan  5 16:05 dev/
drwxr-xr-x  105 root root  4096 Jan  5 01:54 etc/
drwxr-xr-x    3 root root  4096 Sep 28 14:18 home/
lrwxrwxrwx    1 root root     7 Aug 24 08:41 lib -> usr/lib/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib32 -> usr/lib32/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib64 -> usr/lib64/
lrwxrwxrwx    1 root root    10 Aug 24 08:41 libx32 -> usr/libx32/
drwx------    2 root root 16384 Sep 28 14:03 lost+found/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 media/
-rw-r--r--    1 root root  6678 Jan  9 00:59 MegaSAS.log
drwxr-xr-x   64 root root  4096 Jan  5 01:48 mnt/
drwxr-xr-x    3 root root  4096 Nov 30 18:14 opt/
dr-xr-xr-x 1356 root root     0 Jan  3 04:40 proc/
drwx------    7 root root  4096 Nov 30 18:07 root/
drwxr-xr-x   34 root root  1100 Jan 12 08:04 run/
lrwxrwxrwx    1 root root     8 Aug 24 08:41 sbin -> usr/sbin/
drwxr-xr-x    9 root root  4096 Sep 28 22:06 snap/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 srv/
dr-xr-xr-x   13 root root     0 Jan  3 04:40 sys/
drwxrwxrwt   13 root root  4096 Jan 12 17:15 tmp/
drwxr-xr-x   15 root root  4096 Aug 24 08:46 usr/
drwxr-xr-x   13 root root  4096 Aug 24 08:47 var/
Using sudo ncdu -x / shows nothing interesting, oddly enough:
    2.4 GiB [##########] /usr
    1.5 GiB [######    ] /var
  732.5 MiB [##        ] /home
  202.8 MiB [          ] /boot
    5.5 MiB [          ] /opt
    5.4 MiB [          ] /etc
    1.9 MiB [          ] /root
  168.0 KiB [          ] /tmp
<...SNIP...>
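Since the big du number earlier includes /mnt (presumably just the separately mounted work drives, because plain du -a crosses filesystem boundaries), the apples-to-apples check is a du limited to the root filesystem with -x. A sketch of what I mean, which tells the same story as ncdu above:
$ sudo du -xa / 2>/dev/null | sort -n -r | head -n 20   # -x stays on the / filesystem instead of descending into /mnt and the other mounts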
Where is this ~510GB of used space sitting?
Firing off a sudo lsof | grep deleted to see if there was some giant file being held onto gave me this:
systemd-j 1134 root 36u REG 8,2 134217728 5246838 /var/log/journal/771d7f1addf64a7b930191976176149e/system@ae2f8b2397c441f7a286d25144be755f-0000000000315312-0005d4e51ab8f8e9.journal (deleted)
unattende 3932 root 3w REG 8,2 113 5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)
unattende 3932 3943 gmain root 3w REG 8,2 113 5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)
OK, so it's holding onto a 134 MB journal file, but that still doesn't explain why 510 GB of the drive is suddenly taken up.
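For what it's worth, lsof can do that filtering itself; a sketch of the variant I mean (+L1 limits the listing to files whose link count is zero, i.e. deleted but still open, and the trailing / restricts it to the root filesystem):
$ sudo lsof -nP +L1 /   # deleted-but-still-open files on /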
I also tried some additional searches, like this one, and they turned up nothing helpful either.
I eventually used megacli to pull the SMART data off the 2 drives in the RAID-0 array, and they report 0 errors, so it doesn't seem like the array got damaged.
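The exact invocations depend on which MegaCLI build is installed, so treat these as a sketch (the binary name and device numbers are examples, not necessarily what's on this box):
$ sudo megacli -PDList -aALL | grep -iE 'error|predictive'   # controller's per-physical-drive error counters
$ sudo smartctl -a -d megaraid,0 /dev/sda                    # SMART data for one drive behind the PERC; the megaraid,N id varies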
Any ideas or additional digging tricks I might try to figure out what is sucking up that space?
UPDATE #1 - I noticed when I ran top that buff/cache was almost exactly the number of GB being consumed on the root drive. I know that space isn't counted as used, but I decided to fire off a quick:
sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
which took about 3 minutes to run but eventually returned. top now shows buff/cache as < 1k, BUT df -h shows no change in disk usage.
I had hoped it was a mystery cache file on disk or something like that.
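For completeness, the two numbers I'm trying to reconcile can be pulled side by side like this (a sketch; both commands are restricted to the root filesystem and report bytes):
$ df -B1 --output=used /        # what the filesystem says is allocated
$ sudo du -sxB1 / 2>/dev/null   # what the files reachable from / actually add up to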