3

I noticed a strange issue when benchmarking random read I/O on files in Linux (2.6.18). The benchmark is my own program; it simply keeps reading 16KB from a file at a random offset.
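
For readers who want to reproduce the access pattern, here is a minimal sketch of such a benchmark loop. It is not the original program: the function name `random_reads`, the 16KB-aligned offsets, and the `seed` parameter are assumptions made for illustration. Run it against a file much larger than RAM if the reads should miss the page cache.

```python
import os
import random

BLOCK = 16 * 1024  # 16KB per read, as described in the question


def random_reads(path, count, block=BLOCK, seed=None):
    """Issue `count` preads of `block` bytes at random block-aligned offsets.

    Returns the total number of bytes read. Hypothetical helper, not the
    original benchmark.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size < block:
        raise ValueError("file is smaller than one read")
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        for _ in range(count):
            # Pick an offset that is a multiple of `block` and leaves room
            # for a full read before the end of the file.
            offset = rng.randrange(0, size - block + 1, block)
            total += len(os.pread(fd, block, offset))
    finally:
        os.close(fd)
    return total
```

Calling `random_reads("/path/to/large/file", 1000)` while the trace is running should produce one pread per iteration.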

I traced the I/O behavior at the system call level and at the SCSI level with SystemTap, and noticed that a single 16KB pread results in two SCSI I/Os, as follows.

SYSPREAD random(8472) 3, 0x16fc5200, 16384, 128137183232 
SCSI random(8472) 0 1 0 0 start-sector: 226321183 size: 4096 bufflen 4096 FROM_DEVICE 1354354008068009
SCSI random(8472) 0 1 0 0 start-sector: 226323431 size: 16384 bufflen 16384 FROM_DEVICE 1354354008075927
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 21807710208 
SCSI random(8472) 0 1 0 0 start-sector: 1889888935 size: 4096 bufflen 4096 FROM_DEVICE 1354354008085128
SCSI random(8472) 0 1 0 0 start-sector: 1889891823 size: 16384 bufflen 16384 FROM_DEVICE 1354354008097161
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 139365318656 
SCSI random(8472) 0 1 0 0 start-sector: 254092663 size: 4096 bufflen 4096 FROM_DEVICE 1354354008100633
SCSI random(8472) 0 1 0 0 start-sector: 254094879 size: 16384 bufflen 16384 FROM_DEVICE 1354354008111723
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 60304424960 
SCSI random(8472) 0 1 0 0 start-sector: 58119807 size: 4096 bufflen 4096 FROM_DEVICE 1354354008120469
SCSI random(8472) 0 1 0 0 start-sector: 58125415 size: 16384 bufflen 16384 FROM_DEVICE 1354354008126343

As shown above, one 16KB pread issues two SCSI I/Os. (I traced SCSI I/O dispatching with the scsi.iodispatching probe. Please ignore all values except start-sector and size.)
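
A trace in this format can be summarized with a small script. The sketch below is an illustration only: the function name `group_scsi_by_pread` is made up, and it assumes that each SCSI line belongs to the SYSPREAD line immediately before it, which holds for the output above but is not guaranteed in general.

```python
import re

# Matches the two fields of interest in a SCSI trace line.
SCSI_RE = re.compile(r"start-sector:\s*(\d+)\s+size:\s*(\d+)")


def group_scsi_by_pread(lines):
    """Group SCSI requests under the preceding SYSPREAD line.

    Assumption: a SCSI line is attributed to the most recent SYSPREAD.
    Returns a list of dicts with the pread offset, length, and a list of
    (start_sector, size) tuples.
    """
    groups = []
    for line in lines:
        line = line.strip()
        if line.startswith("SYSPREAD"):
            fields = line.split(",")
            groups.append({
                "offset": int(fields[-1]),
                "length": int(fields[-2]),
                "scsi": [],
            })
        elif line.startswith("SCSI") and groups:
            match = SCSI_RE.search(line)
            if match:
                groups[-1]["scsi"].append(
                    (int(match.group(1)), int(match.group(2)))
                )
    return groups


def sector_gap(group):
    """Distance in sectors between the first and last SCSI request."""
    sectors = [start for start, _ in group["scsi"]]
    return sectors[-1] - sectors[0]
```

Applied to the trace above, every pread has exactly two SCSI requests, and the 4KB request always starts fewer than 8200 sectors before the 16KB one.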

One of the SCSI I/Os is the 16KB read requested by the application, which is expected. The problem is the other, 4KB I/O: I don't know why Linux issues it.

Naturally, the extra 4KB I/O degrades I/O performance, which is a problem for me. I also tried fio (a well-known I/O benchmark tool) and saw the same behavior, so it is not caused by my application.

Can anyone explain this to me?

voretaq7
hiroyuki
    What OS, scheduler, I/O elevator and filesystem are you using? – ewwhite Dec 01 '12 at 16:15
  • I use CentOS 5.7 and 5.8 (Linux 2.6.18). The I/O scheduler is set to noop so that the kernel does not do anything unexpected. The filesystem is ext3 on LVM. I saw the same issue with ext3 without LVM, so I don't think LVM is the cause. I also tried ext2 and the same issue occurred. – hiroyuki Dec 02 '12 at 01:09
  • Is this a 4KB sector drive by any chance? If it is see my answer below about updating access time data... – voretaq7 Dec 02 '12 at 01:56
  • Thank you. Sector size is 512B and I set noatime mount option. – hiroyuki Dec 02 '12 at 02:03

3 Answers

1

This may be a stupid/obvious thing that you've already checked, but is your filesystem mounted with the noatime flag?

If you did not specify noatime then Linux needs to update the inode every time a file is accessed (to set the access time), which means it has to read the area of the disk containing the inode, and write it back out. (Incidentally this is why performance-critical read-intensive filesystems are supposed to be mounted with noatime - the I/O for updating inodes constantly is substantial and can be a measurable performance hit).

voretaq7
  • Thank you for the comment. Yes, the filesystem is mounted with the noatime flag. (At first it was not, but I suspected that might be the cause and changed the option. I still get the same issue.) – hiroyuki Dec 02 '12 at 02:00
  • Hmm, so much for that theory then -- I eagerly await someone with more Linux filesystem & I/O subsystem knowledge to enlighten us! – voretaq7 Dec 02 '12 at 05:04
0

I figured out what is going on, but I don't know what it is for.

The ext3 filesystem places a 4KB block before each 4096KB (8192 sectors) of data. Visually, the data is laid out as follows.

|4KB|4096KB|4KB|4096KB|4KB|4096KB| ...

Only the 4096KB areas are accessible to application programs. When a 4096KB area is accessed for the first time, the OS first reads the 4KB block just before it and then reads the requested data within the 4096KB area.

When a file that is large compared to the DRAM size is accessed randomly, each I/O has little chance of hitting the page cache, so every I/O request is accompanied by a 4KB I/O.

The question is: what is the 4KB data for? Is it location metadata for the filesystem? Is there any way to eliminate it? Or is there any way to clear only the 4096KB area?

Any comments and advice are appreciated.

(I tested on many machines with many kernel versions; this happens on all of them.)

Thanks.

hiroyuki
0

I figured it out. It's caused by ext3 indirect block mapping. (Ext3 has a block of block pointers for every 1024 blocks.)
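
The arithmetic behind this can be sketched as follows. This is an illustration under stated assumptions, not kernel code: it assumes a 4KB filesystem block size (consistent with the 4KB I/Os in the trace) and the classic ext2/ext3 inode layout of 12 direct block pointers followed by single, double, and triple indirect pointers, with 4-byte block numbers. The function name `indirection_level` is made up.

```python
BLOCK_SIZE = 4096                      # assumed filesystem block size
PTR_SIZE = 4                           # ext3 block numbers are 32-bit
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE
DIRECT_BLOCKS = 12                     # direct pointers held in the inode

# Amount of file data addressed by one leaf indirect block.
LEAF_SPAN_BYTES = PTRS_PER_BLOCK * BLOCK_SIZE


def indirection_level(file_offset):
    """Return how many indirect levels map the block at `file_offset`.

    0 = direct pointer in the inode, 1 = single indirect,
    2 = double indirect, 3 = triple indirect.
    """
    lblk = file_offset // BLOCK_SIZE
    if lblk < DIRECT_BLOCKS:
        return 0
    lblk -= DIRECT_BLOCKS
    for level in (1, 2, 3):
        span = PTRS_PER_BLOCK ** level
        if lblk < span:
            return level
        lblk -= span
    raise ValueError("offset beyond the triple-indirect range")
```

With these assumptions one indirect block holds 1024 pointers and therefore covers 1024 × 4KB = 4096KB of data, which matches the |4KB|4096KB| pattern observed on disk. The offsets in the trace (tens of gigabytes into the file) fall in the triple-indirect range; the upper-level blocks are few and probably stay cached, while the leaf indirect block for a random offset usually has to be read from disk.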

Changing the filesystem to ext4 makes the issue disappear. (Ext4 has a more efficient scheme for block addressing.)

Thank you all.

hiroyuki