
I have been asked to come up with fio benchmark results for this test dataset: 1048576x1MiB. So, the overall size is 1TiB, and the set contains 2^20 1MiB files. The server runs CentOS Linux release 7.8.2003 (Core). It has sufficient RAM:

[root@tbn-6 src]# free -g
              total        used        free      shared  buff/cache   available
Mem:            376           8         365           0           2         365
Swap:             3           2           1

It's actually not a physical server. Instead, it's a Docker container with the following CPU:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
[...]

Why Docker? We are working on a project that evaluates the appropriateness of using containers instead of physical servers. Back to the fio issue.

I remember I had trouble before with fio dealing with datasets consisting of many small files. So, I did the following checks:

[root@tbn-6 src]# ulimit -Hn
8388608
[root@tbn-6 src]# ulimit -Sn
8388608
[root@tbn-6 src]# cat /proc/sys/kernel/shmmax
18446744073692774399
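
The same checks can be scripted for reproducibility. A minimal Python sketch (assuming a Linux host, as on the server above):

```python
# Check the open-files limit and the SysV shared memory ceiling from Python.
# /proc/sys/kernel/shmmax is Linux-specific.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile: soft={soft} hard={hard}")

try:
    with open("/proc/sys/kernel/shmmax") as f:
        print(f"shmmax: {f.read().strip()}")
except FileNotFoundError:
    print("shmmax: /proc entry not available on this system")
```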

All looked OK to me. I had also compiled the latest fio as of this writing (3.23) with GCC 9.

[root@tbn-6 src]# fio --version
fio-3.23

Here is the job file:

[root@tbn-6 src]# cat testfio.ini 
[writetest]
thread=1
blocksize=2m
rw=randwrite
direct=1
buffered=0
ioengine=psync
gtod_reduce=1
numjobs=12
iodepth=1
runtime=180
group_reporting=1
percentage_random=90
opendir=./1048576x1MiB

Note: of the above, the following can be taken out:

[...]
gtod_reduce=1
[...]
runtime=180
group_reporting=1
[...]

The rest MUST be kept. In our view, the job file should be set up so that it emulates the application's interactions with storage as closely as possible, even knowing fio != the application.

I did the first run like so:

[root@tbn-6 src]# fio testfio.ini
smalloc: OOM. Consider using --alloc-size to increase the shared memory available.
smalloc: size = 368, alloc_size = 388, blocks = 13
smalloc: pool 0, free/total blocks 1/524320
smalloc: pool 1, free/total blocks 8/524320
smalloc: pool 2, free/total blocks 10/524320
smalloc: pool 3, free/total blocks 10/524320
smalloc: pool 4, free/total blocks 10/524320
smalloc: pool 5, free/total blocks 10/524320
smalloc: pool 6, free/total blocks 10/524320
smalloc: pool 7, free/total blocks 10/524320
fio: filesetup.c:1613: alloc_new_file: Assertion `0' failed.
Aborted (core dumped)

OK, so time to use --alloc-size:

[root@tbn-6 src]# fio --alloc-size=776 testfio.ini
smalloc: OOM. Consider using --alloc-size to increase the shared memory available.
smalloc: size = 368, alloc_size = 388, blocks = 13
smalloc: pool 0, free/total blocks 1/524320
smalloc: pool 1, free/total blocks 8/524320
smalloc: pool 2, free/total blocks 10/524320
smalloc: pool 3, free/total blocks 10/524320
smalloc: pool 4, free/total blocks 10/524320
smalloc: pool 5, free/total blocks 10/524320
smalloc: pool 6, free/total blocks 10/524320
smalloc: pool 7, free/total blocks 10/524320
smalloc: pool 8, free/total blocks 8/524288
smalloc: pool 9, free/total blocks 8/524288
smalloc: pool 10, free/total blocks 8/524288
smalloc: pool 11, free/total blocks 8/524288
smalloc: pool 12, free/total blocks 8/524288
smalloc: pool 13, free/total blocks 8/524288
smalloc: pool 14, free/total blocks 8/524288
smalloc: pool 15, free/total blocks 8/524288
fio: filesetup.c:1613: alloc_new_file: Assertion `0' failed.
Aborted (core dumped)

Back to square one :(

I must be missing something. Any help is much obliged.

foss4me
  • I know it's "unrealistic" but for the sake of diagnosing the issue, is there any more of the job file that can be removed? After you get past the issue you can add the options back in. Also, did you see the suggestion about using a much bigger `--alloc-size` than above? – Anon Oct 26 '20 at 17:57
  • Hi Anon again, as to Docker's limitation on `/dev/shm`, I did talk to a sysadmin about the possibility. This is what he told me: "these are privileged containers we should be able to increase this limit no problem, in fact I just updated nofile soft/hard limits to 536870912 in the containers." – foss4me Oct 27 '20 at 18:17
  • We always setup `fio` in such a way that as much as possible, it simulates our application's interactions with the storage. As such, these lines can be removed from the job file `gtod_reduce=1`, `runtime=180`, `group_reporting=1`, `percentage_random=90`. The rest, I am afraid, are mandatory due to the way we use `fio`. – foss4me Oct 27 '20 at 18:21
  • One more: after reviewing the fio source, we have decided to use `--alloc-size=1048576` (just a big number) together with `openfiles=4096` and `file_service_type=roundrobin`. Now `fio` runs. Note: without `alloc-size`, or with its value set small (e.g. your `53248` suggestion), `fio` still core dumps. – foss4me Oct 27 '20 at 18:33
  • I based them on what I read in your debug output, but I can well believe my calculations were wrong – sorry about that. If I update the answer to include that information, would you accept it? – Anon Oct 28 '20 at 07:18
  • Anon, 100%. Please update your answer. I will accept it. To me, it's a delightful surprise that you provided such detailed responses so far. Much obliged! – foss4me Oct 29 '20 at 00:04
  • I think we got lucky: Your question is unusually detailed (but not excessively), (to me) it's eye catching (you're "new" but you spent time formatting it properly?!), it's an interesting problem (to me), you already used a bleeding edge version and you said what it was, you followed up quickly, you're technical enough to understand what was posted (you read the source for yourself), you were willing to work with an answer that didn't quite work. I do wonder what made you do these things straight off the bat... Sometimes the type of response you get is proportional to "invisible" indicators... – Anon Oct 29 '20 at 08:18
  • Hi Anon, as you can tell, I am not new to `fio`. Using it for Lots of Small Files (LOSF) has always been a bother to me. I write in raw markdown syntax routinely, so the formatting is easy. Building software from source is also something I am quite comfortable with. Yes, my team and I do read source, knowing that freeware authors don't usually have time to write docs with good coverage. In other words, I am probably more motivated than most to find a solution to this problem :) Thanks for updating your answer. It's accepted. Much obliged. – foss4me Oct 29 '20 at 20:20

1 Answer


(TL;DR: setting --alloc-size to a big number helps)

I bet you can simplify this job down and still reproduce the problem (which will be helpful for whoever looks at this because there are fewer places to look). I'd guess the crux is that opendir option and the fact that you say the directory contains "2^20 1MiB files"...

If you read the documentation of --alloc-size you will notice it mentions:

If running large jobs with randommap enabled, fio can run out of memory.

By default fio distributes random I/O evenly across a file (each block is written once per pass), but to do so it needs to keep track of the areas it has written, which means it has to keep a data structure per file. OK, you can see where this is going...
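fio's real random map is a compact structure, but the cost can be illustrated with a deliberately naive per-file bitmap. This is a sketch of the idea only, not fio's actual implementation:

```python
# Naive illustration: to write each block exactly once per pass, the
# generator must remember every block it has already covered, i.e. it
# needs per-file state that grows with the file's block count.
import random

class RandomMap:
    def __init__(self, nr_blocks):
        self.covered = [False] * nr_blocks   # one flag per block

    def next_offset(self):
        free = [i for i, done in enumerate(self.covered) if not done]
        if not free:
            return None                      # pass complete
        block = random.choice(free)
        self.covered[block] = True
        return block

m = RandomMap(4)
offsets = [m.next_offset() for _ in range(4)]
print(sorted(offsets))   # [0, 1, 2, 3] - every block hit exactly once
print(m.next_offset())   # None - the pass is finished
```

With 2^20 files, even a few hundred bytes of such state per file adds up quickly.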

Memory pools are set aside for certain data structures (because they have to be shared between jobs). Initially there are 8 pools (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L22 ) and by default each pool is 16 megabytes in size (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L21 ).

Each file that does random I/O requires a data structure to go with it. Based on your output, let's guess that each file forces the allocation of a 368-byte data structure plus a header (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L434 ), which combined comes to 388 bytes. Because the pool works in allocations of 32 bytes (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L70 ), this means we actually take a bite of 13 blocks (416 bytes) out of a pool per file.
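The arithmetic above can be checked directly. A sketch using the sizes from the debug output and the smalloc constants cited above:

```python
import math

ALLOC_SIZE = 388                 # 368-byte structure + header (debug output)
BLOCK_SIZE = 32                  # smalloc allocation granularity
POOL_BYTES = 16 * 1024 * 1024    # default pool size (16 MiB)
DEFAULT_POOLS = 8

blocks_per_file = math.ceil(ALLOC_SIZE / BLOCK_SIZE)
bytes_per_file = blocks_per_file * BLOCK_SIZE
print(blocks_per_file, bytes_per_file)         # 13 416

# Roughly how many files fit in the default 8 x 16 MiB pools?
files_that_fit = DEFAULT_POOLS * (POOL_BYTES // bytes_per_file)
print(files_that_fit, files_that_fit < 2**20)  # 322632 True
```

So the default pools hold state for only about 320k files, far short of the 2^20 files in the dataset.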

Out of curiosity I have the following questions:

  • Are you running this in a container?
  • What is the maximum size that your /tmp can be?

I don't think the above are germane to your issue, but it would be good to rule them out.

Update: by default, docker limits the amount of IPC shared memory (also see its --shm-size option). It's unclear if it was a factor in this particular case but see the "original job only stopped at 8 pools" comment below.

So why didn't setting --alloc-size=776 help? Looking at what you wrote, it seems odd that your blocks per pool didn't increase, right? I notice your pools grew to the maximum of 16 (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L24 ) the second time around. The documentation for --alloc-size says this:

--alloc-size=kb Allocate additional internal smalloc pools of size kb in KiB. [...] The pool size defaults to 16MiB. [emphasis added]

You used --alloc-size=776... isn't 776 KiB smaller than 16 MiB? That would make each pool smaller than the default and may explain why it tried to grow the number of pools to the maximum of 16 before giving up in your second run.

(2 ** 20 * 416) / 8 / 1024 = 53248 (but see the update below)

The above arithmetic suggests you want each pool to be approximately 52 megabytes in size if you are going to have 8 of them for a sum total of approximately 416 megabytes of RAM. What happens when you use --alloc-size=53248?
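To double-check that number, the same arithmetic in Python (constants as estimated above):

```python
FILES = 2**20            # files in the dataset
BYTES_PER_FILE = 416     # 13 x 32-byte smalloc blocks per file (estimate)
POOLS = 8

total_bytes = FILES * BYTES_PER_FILE          # ~416 MiB of map state overall
per_pool_kib = total_bytes // POOLS // 1024   # --alloc-size takes KiB per pool
print(per_pool_kib)  # 53248
```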

Update: the calculated number above was too low. In a comment the question asker reports that using a much higher setting of --alloc-size=1048576 was required.

(I'm a little concerned that the original job only stopped at 8 pools (128 MiB) though. Doesn't that suggest that trying to grow to a ninth 16 MiB pool was problematic?)

Finally, the fio documentation seems to hint that these data structures are allocated when you ask for a particular distribution of random I/O. This suggests that if the I/O is sequential, or if the I/O uses random offsets but DOESN'T have to adhere to a distribution, then maybe those data structures don't have to be allocated... What happens if you use norandommap?
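For example, a hypothetical variant of your job file for testing that (an untested sketch; with norandommap the same block may be written more than once per pass, so the results are not directly comparable to your original run):

```ini
[writetest]
thread=1
blocksize=2m
rw=randwrite
direct=1
ioengine=psync
numjobs=12
iodepth=1
norandommap=1
opendir=./1048576x1MiB
```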

(Aside: blocksize=2M but your files are 1MiB big - is that correct?)

This question feels too big and specialist for a casual serverfault answer and may get a better answer from the fio project itself (see https://github.com/axboe/fio/blob/fio-3.23/REPORTING-BUGS , https://github.com/axboe/fio/blob/fio-3.23/README#L58 ).

Good luck!

Anon
  • Thanks for your response. You wrote something quite interesting. Last first: yes, I was already pondering posting the case on https://github.com/axboe/fio/issues. Most of the use cases posted on the Internet are "toy" cases, and that's what most `fio` users do – not useful for real production-grade benchmarking IMHO. Next, answering your questions in detail may take more than a comment, so I will stop here and edit my own post instead. – foss4me Oct 25 '20 at 15:09