Setup
- Ubuntu 20.04
- Dell PowerEdge R820
- [PERC H710] 2x Virtual Drives (RAID-1 Boot, RAID-0 Work Drive)
- Everything had been fine for 6 months
- No preceding event; the drive just suddenly filled up
Details...
This machine is used for plotting Chia (cryptocurrency) - it's been working away for months without issue.
I noticed the plotting process (bladebit) had crashed, which is pretty uncommon (it happens maybe once every 2 months), so I went to fire it back up and immediately started getting device-full types of errors.
I fired off a quick df -h to see what was going on, and got this:
Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  2.9M   51G   1% /run
/dev/sda2       549G  512G  8.7G  99% /
tmpfs           252G  4.0K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           252G     0  252G   0% /sys/fs/cgroup
/dev/sda1       511M  5.3M  506M   2% /boot/efi
tmpfs            51G     0   51G   0% /run/user/1000
<... SNIP ...>
/dev/sda2 is the boot drive - it's actually a RAID-1 (2-disk) Virtual Disk handled by the H710 RAID card in the server, but I don't think that's terribly relevant.
NORMALLY this drive is 3% full; it only has a bootable Ubuntu Server 20.04 install on it and nothing else.
I had to erase the temp file in root and a few other garbage files to free up enough space to get things functioning again, but it's still sitting at dang near full.
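For reference, the kind of cleanup I mean was just deleting things by hand; the paths here are placeholders, not the exact files I removed:
$ sudo rm /some-leftover-file.tmp   # placeholder for the temp file sitting in root
$ sudo rm -r /tmp/old-junk          # placeholder for the other garbage files
$ df -h /                           # re-check how much space came back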
I followed countless "find the biggest file on your server" tips from here and around the web, for example this one, with the command sudo du -a / 2>/dev/null | sort -n -r | head -n 20 returning:
$ sudo du -a / 2>/dev/null | sort -n -r | head -n 20
[sudo] password for user:
1010830919685 /
1010823681740 /mnt
<...SNIP...>
OK, so something huge is sitting in / apparently? A simple ls shows nothing of interest in there:
$ ls -lFa /
total 84
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ./
drwxr-xr-x   20 root root  4096 Jan 12 17:45 ../
lrwxrwxrwx    1 root root     7 Aug 24 08:41 bin -> usr/bin/
drwxr-xr-x    4 root root  4096 Jan  6 06:22 boot/
drwxr-xr-x    2 root root  4096 Sep 28 14:04 cdrom/
drwxr-xr-x   21 root root  6920 Jan  5 16:05 dev/
drwxr-xr-x  105 root root  4096 Jan  5 01:54 etc/
drwxr-xr-x    3 root root  4096 Sep 28 14:18 home/
lrwxrwxrwx    1 root root     7 Aug 24 08:41 lib -> usr/lib/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib32 -> usr/lib32/
lrwxrwxrwx    1 root root     9 Aug 24 08:41 lib64 -> usr/lib64/
lrwxrwxrwx    1 root root    10 Aug 24 08:41 libx32 -> usr/libx32/
drwx------    2 root root 16384 Sep 28 14:03 lost+found/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 media/
-rw-r--r--    1 root root  6678 Jan  9 00:59 MegaSAS.log
drwxr-xr-x   64 root root  4096 Jan  5 01:48 mnt/
drwxr-xr-x    3 root root  4096 Nov 30 18:14 opt/
dr-xr-xr-x 1356 root root     0 Jan  3 04:40 proc/
drwx------    7 root root  4096 Nov 30 18:07 root/
drwxr-xr-x   34 root root  1100 Jan 12 08:04 run/
lrwxrwxrwx    1 root root     8 Aug 24 08:41 sbin -> usr/sbin/
drwxr-xr-x    9 root root  4096 Sep 28 22:06 snap/
drwxr-xr-x    2 root root  4096 Aug 24 08:42 srv/
dr-xr-xr-x   13 root root     0 Jan  3 04:40 sys/
drwxrwxrwt   13 root root  4096 Jan 12 17:15 tmp/
drwxr-xr-x   15 root root  4096 Aug 24 08:46 usr/
drwxr-xr-x   13 root root  4096 Aug 24 08:47 var/
Using sudo ncdu -x / shows nothing interesting, oddly enough:
    2.4 GiB [##########] /usr
    1.5 GiB [######    ] /var
  732.5 MiB [##        ] /home
  202.8 MiB [          ] /boot
    5.5 MiB [          ] /opt
    5.4 MiB [          ] /etc
    1.9 MiB [          ] /root
  168.0 KiB [          ] /tmp
<...SNIP...>
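Since the big du number earlier includes /mnt (presumably just the separately mounted work drives, because plain du -a crosses filesystem boundaries), the apples-to-apples check is a du limited to the root filesystem with -x. A sketch of what I mean, which tells the same story as ncdu above:
$ sudo du -xa / 2>/dev/null | sort -n -r | head -n 20   # -x stays on the / filesystem instead of descending into /mnt and the other mounts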
Where is this ~510GB of used space sitting?
Firing off a sudo lsof | grep deleted to see if there was some giant file being held onto gave me this:
systemd-j 1134 root 36u REG 8,2 134217728 5246838 /var/log/journal/771d7f1addf64a7b930191976176149e/system@ae2f8b2397c441f7a286d25144be755f-0000000000315312-0005d4e51ab8f8e9.journal (deleted)
unattende 3932 root 3w REG 8,2 113 5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)
unattende 3932 3943 gmain root 3w REG 8,2 113 5246631 /var/log/unattended-upgrades/unattended-upgrades-shutdown.log.1 (deleted)
OK, so it's holding onto a 134 MB journal file, but that still doesn't explain why 510 GB of the drive is suddenly taken up.
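For what it's worth, lsof can do that filtering itself; a sketch of the variant I mean (+L1 limits the listing to files whose link count is zero, i.e. deleted but still open, and the trailing / restricts it to the root filesystem):
$ sudo lsof -nP +L1 /   # deleted-but-still-open files on /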
I also tried some additional searches, like this one, and they turned up nothing helpful either.
I eventually used megacli to pull the SMART data off the 2 drives in the RAID-0 array, and they report 0 errors, so it doesn't seem like the array got damaged.
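The exact invocations depend on which MegaCLI build is installed, so treat these as a sketch (the binary name and device numbers are examples, not necessarily what's on this box):
$ sudo megacli -PDList -aALL | grep -iE 'error|predictive'   # controller's per-physical-drive error counters
$ sudo smartctl -a -d megaraid,0 /dev/sda                    # SMART data for one drive behind the PERC; the megaraid,N id varies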
Any ideas or additional digging tricks I might try to figure out what is sucking up that space?
UPDATE #1 - I noticed when I ran top that buff/cache was almost exactly the number of GB being consumed on the root drive. I know that space isn't counted as used, but I decided to fire off a quick:
sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
which took about 3 minutes to run but eventually returned. top now shows buff/cache as < 1k, BUT df -h shows no change in disk usage.
I had hoped it was a mystery cache file on disk or something like that.
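For completeness, the two numbers I'm trying to reconcile can be pulled side by side like this (a sketch; both commands are restricted to the root filesystem and report bytes):
$ df -B1 --output=used /        # what the filesystem says is allocated
$ sudo du -sxB1 / 2>/dev/null   # what the files reachable from / actually add up to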