
I can't understand ceph raw space usage.

I have 14 HDDs (14 OSDs) on 7 servers, 3 TB each, so ~42 TB of raw space in total.

ceph -s 
     osdmap e4055: 14 osds: 14 up, 14 in
      pgmap v8073416: 1920 pgs, 6 pools, 16777 GB data, 4196 kobjects
            33702 GB used, 5371 GB / 39074 GB avail

I created 4 block devices, 5 TB each:

df -h
 /dev/rbd1       5.0T  2.7T  2.4T  54% /mnt/part1
/dev/rbd2       5.0T  2.7T  2.4T  53% /mnt/part2
/dev/rbd3       5.0T  2.6T  2.5T  52% /mnt/part3
/dev/rbd4       5.0T  2.9T  2.2T  57% /mnt/part4

df shows that 10.9 TB is used in total, while ceph shows that 33702 GB is used. With 2 copies, that should be ~22 TB, but I have 33.7 TB used, so ~11 TB is unaccounted for.

ceph osd pool get archyvas size
size: 2


ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    39074G     5326G       33747G         86.37
POOLS:
    NAME          ID     USED      %USED     MAX AVAIL     OBJECTS
    data          0          0         0         1840G           0
    metadata      1          0         0         1840G           0
    archyvas      3      4158G     10.64         1840G     1065104
    archyvas2     4      4205G     10.76         1840G     1077119
    archyvas3     5      3931G     10.06         1840G     1006920
    archyvas4     6      4483G     11.47         1840G     1148291

Both the block devices and the OSD filesystems use XFS.

virgism

2 Answers


One possible source of confusion is GB vs. GiB/TB vs. TiB (base 10/base 2), but that cannot explain all of the difference here.
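
Rough arithmetic on that point: 14 × 3 TB = 42 × 10^12 bytes, which is about 39,100 GiB, roughly the 39074G that ceph reports as SIZE. So the base-10/base-2 conversion does explain why 42 TB of raw disk shows up as ~39 TB, just not the used-space gap.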

Ceph/RBD will try to "lazily" allocate space for your volumes. This is why, although you created four 5 TB volumes, it reports 16 TB used, not 20 TB. But 16 TB is more than the sum of the "active" contents of your RBD-backed filesystems, which is only around 11 TB, as you say. Several things to note:

When you delete files in your RBD-backed filesystems, the filesystems will internally mark the blocks as free, but usually not try to "return" them to the underlying block device (RBD). If your kernel RBD version is recent enough (3.18 or newer), you should be able to use fstrim to return freed blocks to RBD. I suspect that you have created and deleted other files on these file systems, right?
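
If your kernel is recent enough, a manual pass over each mounted RBD filesystem would look roughly like this (mount points taken from your df output; -v just reports how much was discarded):

sudo fstrim -v /mnt/part1
sudo fstrim -v /mnt/part2
sudo fstrim -v /mnt/part3
sudo fstrim -v /mnt/part4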

There is also some file system overhead beyond the net data usage that is shown by df. Besides "superblocks" and other filesystem-internal data structures, some overhead is to be expected from the granularity at which RBD allocates data. I think RBD will always allocate 4MB chunks, even when only a portion of that is used.
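
You can check both of these with the rbd tool: rbd info shows the object (chunk) size, and summing the extents reported by rbd diff gives a rough idea of how much of an image is actually allocated. This is only a sketch; archyvas/some-image is a placeholder, since I don't know your image names:

rbd info archyvas/some-image     # look for something like "order 22 (4096 kB objects)"
rbd diff archyvas/some-image | awk '{ sum += $2 } END { print sum/1024/1024 " MB allocated" }'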

sleinen
  • And I agree with Simon. I guess both our answers together make one complete answer. By the way, damn you: a 20-hour-old question and you beat me to answering by 35 seconds? :D – Fox Apr 18 '15 at 13:47
  • Thank you both for your answers. Now I understand where my problem is and how to solve it. – virgism Apr 20 '15 at 05:51
  • Possible options: 1. upgrade to a Linux kernel > 3.18 and mount with the discard option (I tested with kernel 3.19.0-1.el6.elrepo.x86_64, but had deadlocks every day); 2. recreate the block devices with size < 5 TB (XFS can't be shrunk); 3. add HDDs and create additional OSDs. – virgism Apr 20 '15 at 06:00
  • Can confirm this works fine. Upgraded my Ceph client machine's kernel to 3.19 last weekend in Ubuntu LTS 14.04.3 (`sudo apt-get install --install-recommends linux-generic-lts-vivid`), rebooted, re-mapped and mounted my rbd volumes, ran an `fstrim` on each of them, and collectively recovered 450GB on a small 25TB cluster. Once you upgrade, be sure you start mounting your rbd volumes with the `discard` option. – Brian Cline Aug 10 '15 at 02:30

I am no Ceph expert, but let me guess a little.

The block devices are most likely mounted without the discard option. So any data you write and then delete no longer shows up in the filesystem's usage (/mnt/part1), but since it was once written and never trimmed, it still occupies space on the underlying block device.
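
If so, remounting with discard should let freed blocks flow back to RBD (assuming your kernel's RBD driver supports TRIM). Roughly, for each device from your df output:

umount /mnt/part1
mount -o discard /dev/rbd1 /mnt/part1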

If you look at USED for your pools and add those together, you get 16777 GB, which equals what ceph -s shows as data. And if you multiply that by two (two copies), you get 33554 GB, which is pretty much the raw space used.
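
A quick way to re-check that arithmetic against your ceph df output (a rough one-liner; the awk field number assumes the column layout shown in your question):

ceph df | awk '/archyvas/ { gsub("G", "", $3); sum += $3 } END { print sum " GB of data, ~" sum*2 " GB raw with size=2" }'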

Fox
  • I agree with Fox's response (which was written at the same time as mine below :-). `discard` and "trim" are basically different words for the same mechanism that can be used to return unused blocks to a block device. Mounting with the `discard` option should have the desired effect. Some people prefer to periodically run `fstrim` to avoid the overhead of continuous discards by the filesystem. Note that for any of this to work, your RBD driver needs to support TRIM/discard. As I said, the RBD kernel driver does this since Linux 3.18 – see http://tracker.ceph.com/issues/190. – sleinen Apr 18 '15 at 12:51