Fastest yes in the west

31

3

There has been a question about yes in the past, but this one is slightly different. You see, despite yes having a rather long source code, it is incredibly fast. Running the following command:

timeout 1s yes > out

yields a file that is 1.9 GB. Compared to this program:

int puts();main(){while(1){puts("y");}}

which yields a measly 150 MB.

The goal is to write a program that outputs y\n (that's y with a newline) repeatedly forever, in as few bytes as possible, producing the biggest out file in as short a time as possible.

The scoring works with the equation: b * y/y'. The value b is the size of your program in bytes. The value y is the size in bytes of the out file after running the following command: timeout 1s yes > out. The value y' is the size in bytes of the out file after running the following command: timeout 1s PROG > out, where PROG is your program, e.g. ./a.out. Your program's output must exactly match that of the GNU coreutils yes program run with no arguments, though you may output y at whatever speed you like. Outputting zero bytes would be a divide by zero in the original equation, so it counts as a score of inf for these purposes.

Lowest score wins.

Edit:

I will run all code on my computer, please show compilation commands and any setup required.

Specs (in case you want to optimize or something):

  • OS: Arch Linux 5.5.2-arch1-1
  • CPU: AMD Ryzen 2600X
  • SSD: Samsung 970 EVO NVMe
  • RAM: 16GB 3200MHz

Here is a script to calculate the score, since it seems people are having trouble:

b=$(wc -c < $1)
timeout 1s yes > out
y=$(wc -c < out)
timeout 1s $2 > out
y2=$(wc -c < out)
rm out
echo "scale=3;$b*$y/$y2" | bc

Example, given that the script is called score.sh and your program is yes.py:

./score.sh yes.py "python yes.py"

hLk

Posted 2020-02-14T21:40:15.447

Reputation: 427

Interesting idea, but I'm downvoting because this seems highly dependent on the host running the code. If all entries were run on the same host, then it could be an interesting comparison. – Xcali – 2020-02-14T22:25:41.677

I will run all code on my computer, and comment the score. Probably should mention that in the question huh :) – hLk – 2020-02-14T22:55:08.203

The value of y is the size in bytes of the out file after running the following command: timeout 1s yes > out. The value of y' is the size in bytes of the out file after running the following command: timeout 1s PROG > out where PROG is your program, eg. ./a.out. Your program should match the output of the gnu coreutils program yes ran with no arguments exactly, albeit outputting y at any speed you like. Outputting zero bytes would be a divide by zero in the original equation, resulting in a score of inf for these purposes. – hLk – 2020-02-15T00:15:39.757

I see. Thanks for clarifying. Maybe the challenge could use a better wording, along the lines of your comment – Luis Mendo – 2020-02-15T00:44:33.463

4

Related: https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/

– Joseph Sible-Reinstate Monica – 2020-02-15T05:52:46.787

I think some users are getting confused here :) Should the output be strictly an infinite "yes\n"? Or could "y\n" also be accepted? – Bilel – 2020-02-15T15:20:46.423

No, it must be strictly y\n – hLk – 2020-02-15T15:31:33.577

Size of source code or size of program? Does it have to be a full program? – S.S. Anne – 2020-02-15T18:51:11.777

Size of source (byte count) – hLk – 2020-02-16T03:50:37.477

waiting for someone to make a very short x86 answer using https://stackoverflow.com/questions/8201613/printing-a-character-to-standard-output-in-assembly-x86

– qwr – 2020-02-16T08:39:57.073

This might be more interesting if you weren't bottlenecking the output by writing it to a disk. Your CPU's memory bandwidth (and Linux pipe throughput) is much higher than 2GB/s. Although it's still probably just going to come down to making write system calls on a large-enough buffer to amortize overhead but not create too many L3 cache misses in the kernel's copy_from_user (basically memcpy into the pagecache or pipe buffer). – Peter Cordes – 2020-02-16T22:01:22.797

So then the part that varies in user-space is implementing memset for a 2-byte pattern to create the buffer in the first place. In asm / machine code it's a choice between SIMD vpbroadcastw + 128-bit or 256-bit stores, vs. rep stosw or stosd, and how those perform on Zen1 to decide whether the extra code size for SIMD is worth the cost, how much faster is it than the fast strings microcode for rep stos.) – Peter Cordes – 2020-02-16T22:03:48.397

1

How much RAM do you have? Assuming default Linux tuning settings, some significant fraction of your free RAM can be used for dirty buffers holding pages of the file that haven't made it to disk yet, but have been successfully "written" before yes is killed. Also, be careful when benchmarking: amount of free RAM could affect your measured result! Limit Linux background flush (dirty pages) has some benchmarks of dd of /dev/zero to a file.

– Peter Cordes – 2020-02-16T22:54:25.470

@PeterCordes I have 8gb 3200Mhz ram – hLk – 2020-02-17T02:00:17.940

You should run this on BSD where they have a trivial yes. That way we might be able to beat it. – S.S. Anne – 2020-02-17T16:02:15.067

Answers

36

C, 112 bytes, 28 TB/s, score ≈ 0.008

long b[9<<21];main(){for(b[2]=write(*b=1,wmemset(b+8,'\ny\ny',1<<25),1<<27)/2;b[3]=b[2]*=2;ioctl(1,'@ ”\r',b));}

(If you’re having trouble copying and pasting this, replace the multicharacter constant '@ ”\r' with '@ \x94\r' or 1075876877.)

This writes 128 MiB of y\ns to stdout, and then repeatedly invokes the FICLONERANGE ioctl to double the file’s length by adding new references to the same disk blocks. Requires Linux x86-64 and a filesystem supporting FICLONERANGE, such as Btrfs or XFS; stdout must be opened in read-write mode (e.g. ./prog 1<> out in bash).
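
For readers who don't want to decode the golfed line, here is a rough ungolfed sketch of the same idea (my reconstruction, so treat the details as assumptions rather than the author's exact code; FICLONERANGE and struct file_clone_range come from linux/fs.h):

#include <unistd.h>
#include <wchar.h>
#include <sys/ioctl.h>
#include <linux/fs.h>                    /* FICLONERANGE, struct file_clone_range */

static wchar_t buf[1 << 25];             /* 128 MiB once filled with "y\ny\n..." */

int main(void) {
    wmemset(buf, '\ny\ny', 1 << 25);     /* little-endian: bytes land as "y\ny\n" */
    long n = write(1, buf, sizeof buf);  /* stdout must be a regular file opened read-write */
    struct file_clone_range r = { .src_fd = 1, .src_offset = 0,
                                  .src_length = n, .dest_offset = n };
    /* Each successful ioctl reflinks [0, len) to [len, 2*len), doubling the file. */
    while (ioctl(1, FICLONERANGE, &r) == 0) {
        r.src_length  *= 2;
        r.dest_offset *= 2;
    }
}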

Self-contained testing script

This script loop-mounts a freshly created filesystem from a memory-backed image file and runs the program there. (I get about 43 TB/s in-memory, but the ratio with yes(1) is similar.) It works with either Btrfs or XFS, but Btrfs seems to give better scores.

#!/usr/bin/env bash
set -eux
dir="$(mktemp -dp /dev/shm)"
cd "$dir"
trap 'umount fs || true; rm -r "$dir"' EXIT
printf "long b[9<<21];main(){for(b[2]=write(*b=1,wmemset(b+8,'\\\\ny\\\\ny',1<<25),1<<27)/2;b[3]=b[2]*=2;ioctl(1,'@ \\x94\\\\r',b));}" > prog.c
b="$(stat -c %s prog.c)"
gcc -O3 prog.c -o prog
dd of=fs.img bs=1 count=0 seek=8G

# Pick one:
mkfs.btrfs fs.img
#mkfs.xfs -m reflink=1 fs.img

mkdir fs
mount fs.img fs
sync
timeout 1 yes 1<> fs/out || true
y="$(stat -c %s fs/out)"
rm fs/out
sync
timeout 1 ./prog 1<> fs/out || true
y1="$(stat -c %s fs/out)"
echo "score = $b * $y / $y1 = $(bc -l <<< "$b * $y / $y1")"

Anders Kaseorg

Posted 2020-02-14T21:40:15.447

Reputation: 29 242

Wow! I want to run it myself but I'll admit I'm slightly terrified of that number . Unfortunately, I think this breaks the rules, but I definitely was not clear enough. Does this output to stdout, like the yes program does? – hLk – 2020-02-15T06:14:37.390

1

@hLk The default rules clearly allow outputting to a file, but I’ve edited this to output to stdout, with the requirement that stdout be opened in read-write mode.

– Anders Kaseorg – 2020-02-15T06:26:03.557

Oh whoops, was not aware of those. I'll give it a shot in the morning. – hLk – 2020-02-15T06:30:21.937

I can't get it to work. I am on an XFS partition, and I have tried with gcc and clang. Is there an option that I'm missing? Or is there some way to see what might be wrong? – hLk – 2020-02-15T16:14:46.357

@hLk Run strace on it and see if there are any lines with -1 E... (an error code) in them. – S.S. Anne – 2020-02-15T22:59:06.287

@S.S.Anne EINVAL, ENOENT, and EFAULT. Those are the errors, not sure the reason. Maybe I'll try on btrfs – hLk – 2020-02-16T03:56:54.237

1@hLk Was the XFS partition created with -m reflink=1? (Check with xfs_info.) – Anders Kaseorg – 2020-02-16T11:05:58.750

I think it was (xfs_info has an entry that says reflink=1) – hLk – 2020-02-17T02:05:50.980

@hLk I’ve added a self-contained testing script that creates and uses a fresh memory-backed filesystem; it works for me with either Btrfs or XFS. Maybe that will help you figure out what was different about your test. – Anders Kaseorg – 2020-02-17T03:02:54.237

I think it requires kernel support. What system are you running on? – S.S. Anne – 2020-02-17T16:03:20.593

@S.S.Anne I am on Arch Linux, with kernel 5.5.2-arch1-1 – hLk – 2020-02-17T20:52:48.960

@hLk No, not your OS. Anders's. – S.S. Anne – 2020-02-17T21:34:36.973

@S.S.Anne I’m running NixOS, also with kernel v5.5.2, which is current modulo a couple of stable point releases last week. This should work with any kernel since v2.6.29 on Btrfs or v4.9 on XFS. – Anders Kaseorg – 2020-02-17T21:59:59.040

28

Bash, 3 bytes, 1.9 GB/s

yes

Try it online!

Admittedly this is a troll solution, but the rules do not explicitly forbid it, and it should get you a value close to 3, which is not bad.

Yonatan N

Posted 2020-02-14T21:40:15.447

Reputation: 497

Haha, completely didn't realize this solution. Almost certainly the winner unless I change the rules – hLk – 2020-02-15T06:10:44.783

10

C (clang), 88 63 bytes, 2.5GB/s

b[2048];main(){for(wmemset(b,'\ny\ny',2048);write(1,b,8192););}

Try it online! Edit: Saved 25 bytes thanks to @ceilingcat by assuming 4-byte wide characters.

Neil

Posted 2020-02-14T21:40:15.447

Reputation: 95 035

Score just using -O4, clang 9.0.1: 88 * 1595658240/1665597440 = 84.3048397. Pretty good, what were the compilation options you used? And score on your comp? – hLk – 2020-02-14T23:07:07.550

@hLk: I have no idea how to score, but the timeout command created me a 2.5GB file. I didn't even specify any compilation options. – Neil – 2020-02-15T01:21:32.537

Mostly interested to see speed difference on write() size call... – spuck – 2020-02-15T05:28:06.667

@spuck That doesn’t work; wchar_t is 4 bytes on Linux. – Anders Kaseorg – 2020-02-15T06:06:50.777

Well, at least he managed to write the right buffer size for his system... – Neil – 2020-02-15T10:37:44.930

So the high order bytes are nulls. No code golf rules saying it has to be good code. :) – spuck – 2020-02-15T23:34:03.227

8

x86-64 machine code (Linux system calls), 29B * 4.7/6.6 = ~20.6 on tmpfs on Skylake

Yes, this runs faster than GNU yes on tmpfs, the widely-used Linux ramdisk-like filesystem backed by the pagecache. I made sure to align my buffer to at least a cache-line boundary (and in fact a 4k page boundary) so the kernel's copy_from_user memcpy-like function using rep movsb would be most efficient.

I don't expect this will outperform yes writing to an SSD on a fast Ryzen CPU which isn't the bottleneck; posted score is tentative. However, writing 2 or 4 bytes at a time is unusably slow so we need a buffer to avoid a huge score penalty (like a factor of 2000 on tmpfs!). But probably this will score about 29 running at the same speed as GNU yes if they both fill the page-cache to max capacity with dirty pages and max out I/O bandwidth during the 1s window.

It's break-even for code size to use a buffer in the BSS instead of reserving stack space, in a Linux position-dependent executable where a static address can be put in a register with 5-byte mov r32, imm32. Using more space in the BSS doesn't increase the size of an executable that already has a BSS section/segment. But in this case it does make the executable larger, so I'm not totally sure it's justified to not count anything for executable metadata for the BSS. But as usual, I'm counting only the size of the .text section. (And there's no initialized static data; all the constants are immediate.)

Without aligning the stack by 64 or 4k, sub esp, edx; mov edi, esp would be only 4 bytes, and safe in a Linux x32 executable: 32-bit pointers in 64-bit mode. (For efficient syscall instead of slow int 0x80 or cumbersome sysenter in 32-bit mode). Otherwise break-even for sub rsp, rdx. So if not counting the BSS bothers you, save a byte and use the stack for buffers up to almost 8MiB (with default ulimit -s).

We fill a buffer that's 4x larger than we actually pass to write; rep stosd needs a repeat count of dword chunks and it's convenient to use the same number as a byte count for the system call.
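
In C terms, the golfed machine code behaves roughly like the following paraphrase (an illustration of the approach with SIZEPOW=17 for readers who don't read asm; it is not the submitted program):

#include <unistd.h>
#include <wchar.h>

static wchar_t buf[1 << 17];             /* 512 KiB in the BSS; the asm additionally page-aligns it */

int main(void) {
    wmemset(buf, '\ny\ny', 1 << 17);     /* rep stosd: fill 2^17 dwords with the "y\ny\n" pattern */
    for (;;)
        write(1, buf, 1 << 17);          /* but write only 2^17 bytes = 128 KiB per syscall */
}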

Try it online! (NASM source which assembles to this answer's machine code.) It works on TIO; it doesn't depend on the output being a regular file or anything. Below is a NASM listing of machine code hexdump & source.

 6                                  global _start
 7                                  _start:
 8                                  ;; All regs zeroed at the top of a Linux static executable (except RSP)
 9                                  
10                                  SIZEPOW equ 17
13 00000000 0FBAE911                   bts  ecx, SIZEPOW  ; 1<<SIZEPOW x 4 bytes to fill.
14 00000004 89CA                       mov  edx, ecx      ; size arg for sys_write in bytes; 1/4 of the actual buffer size
15                                  
16 00000006 B8790A790A                 mov  eax, `y\ny\n`
25 0000000B BF[00000000]               mov  edi, buf        ; mov r32, imm32 static address; 00... is before linking
26 00000010 89FE                       mov  esi, edi        ; syscall arg
27                                  
28                                     ;shr  ecx, 2         ; just fill 4x as much memory as needed
29 00000012 F3AB                       rep  stosd           ; wmemset(rdi, eax, rcx)  4*rcx bytes
30                                     
31 00000014 8D7901                     lea   edi, [rcx + 1]  ;   mov  edi, 1
32                                  .loop:
33 00000017 89F8                       mov  eax, edi         ; __NR_write = stdout fileno
34 00000019 0F05                       syscall
36 0000001B EBFA                       jmp  .loop

47                                  section .bss
48                                   align 4096
49 00000000 <res 00080000>           buf: resd 1<<SIZEPOW

0x1b + 2 bytes for last instruction is 29 bytes total.

mov cx, imm16 would also be 4 bytes, same as BTS, but is limited to 0..65535. vs. BTS r32, imm8 creating any power of 2. (Both depend on a zeroed RCX to start with, to produce a result zero-extended into RCX for rep stos)

mov ax, `y\n` would be 4 bytes, 1 fewer than mov eax, imm32, but then we'd need rep stosw, which costs an extra operand-size prefix. This would have kept the ratio of filled buffer to used buffer at 2:1 instead of 4:1, so I should have done that to save a few pagefaults at startup, but I'm not redoing the benchmarking now. (rep stosw is still fast on Skylake I think; not sure if Zen+ might suffer; but more than a factor of 2 in fill bandwidth + page fault cost is unlikely.)

A static buffer makes alignment to a cache line (or even the page size) not cost instructions; having the stack only 16-byte aligned was a slowdown. I recorded times in comments on my original code that aligned the stack pointer by 64kiB or not after reserving space for a buffer. Alignment to 64k instead of 16B made a ~3% difference in overall speed for writing a 1GiB file (to tmpfs, exiting on ENOSPC) with SIZEPOW=16, and led to a reduction in instructions retired as well. (Measured with perf stat ./yes > testd/yesout)

     ;;;; instead of  mov  edi, buf
    sub  rsp, rdx
    ;   and  rsp, -65536   ; With: 631.4M cycles 357.6M insns.   Without:  651.3M cycles 359.4M instructions.  (times are somewhat noisy but there's a real difference)
 ;;  mov  rdi, rsp   ; 3 bytes
    push  rsp
    pop   rdi        ; copy a 64-bit reg in 2 bytes
    push  rsp
    pop   rsi

For other benchmarking purposes (with a smaller 1GiB tmpfs), it was convenient to use a loop that exited when write() failed (with -ENOSPC) instead of looping until killed. I used this as the bottom of the loop

35                                  %if 0
36 0000001B EBFA                       jmp  .loop
37                                  %else
38                                     test eax,eax      ;;;;;;;;;;; For benchmarking purposes: abort on write fail
39                                     jge  .loop
40                                  
41                                     mov eax, 231    ; not golfed
42                                     xor edi, edi
43                                     syscall         ; sys_exit_group(0)
44                                  %endif
45                                  

Testing

I tested using tmpfs because I don't want to wear out my SSD repeatedly testing this, and because keeping up with a consumer-grade SSD is trivial with any reasonable buffer size. More interesting (to me) is to find the CPU/memory-bound sweet spot between system call overhead from making too many small write system calls vs. L2 cache misses from re-reading too large a buffer every time, and also to minimize the time spent filling a buffer before even starting to make system calls.

(Note that with write-behind I/O buffering for filesystems backed by real disk, there's no advantage to using a smaller buffer to get disk writes started sooner. Write-back to actual physical media wouldn't start until dirty pages hit the Linux kernel's high water mark. The default dirty timeout is 500 centisecs, and powertop suggests raising that to 1500 (15 seconds), so the timeout isn't going to come into play, just the high water mark. So what really matters is getting to the high water mark ASAP for a physical disk, and potentially pushing as many dirty pages into the pagecache as possible within the 1-second window, to finish writeback after yes dies. So this 1-second test probably depends on how much RAM your machine has (and even how much free RAM) if you're using a physical disk instead of tmpfs.)

write inside the kernel is basically a memcpy from the provided buffer into the pagecache; specifically Linux's copy_from_user function which uses rep movsb on systems with ERMSB (Intel since IvB) or rep movsq when it's fast (PPro and later, including non-Intel vendors.)

According to perf record / perf report output, with a 128k buffer size, 45% of the counts for hardware "cycles" were in clear_page_erms (on rep stosb) and then 18.4% in copy_user_enhanced_fast_string on rep movsb. (Makes some sense: clearing is touching a cold page, copy is copying over a just-cleared buffer, presumably hitting in L2 cache for src and destination. Or L1d for dst if it clears 4k at a time. Or only L3 cache if it's clearing whole hugepages :/) Next highest was iov_iter_fault_in_readable at 3.8%. But anyway, only ~63% of total CPU time was spent on the "real work" of actually copying into the pagecache. And that happens inefficiently, zeroing before copying.

I tried with tmpfs with huge=never (which is the default) and only got 4.8GiB for 1s with the 128kiB buffer version that gets 6.6GiB on hugepages. So clearly hugepages are worth it overall. perf record counts for cycles were: 14%: clear_page_erms and 10% copy_user_enhanced_fast_string, with many other kernel functions taking low single digits percentages, like try_charge at 4%, __pagevec_lru_add_fn, get_mem_cgroup_from_mm, __this_cpu_preempt_check, and _raw_spin_lock_irq at 3 to 2%. Managing memory in 4kiB chunks obviously costs a lot more than in 2MiB chunks.

Test setup

  • i7-6700k @ 3.9GHz with 0xd6 microcode update (Nov 2019). (quad core Skylake-client microarchitecture, per-core caches: L1d 32k, L2 256k. Shared L3 = 8MiB) vs. Ryzen having 512kiB L2 caches.
    During this test: /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference = balance_performance, not the full performance EPP setting (so this is closer to the normal bootup machine state of balance_power). I ran a warm-up run right before the main test to make sure the CPU speed was at full before the timed 1 second started: timeout 1s ./yes > testd/yesout; perf stat -d timeout 1s yes > testd/yesout.

    With EPP at performance (max turbo = 4.2GHz, near-instant ramp-up), in practice it runs at 4.1GHz average for the test. I get up to 6.9GiB written instead of 6.6, for the 128k buffer version. (And up to 7.3GiB with the writev version.)

  • 16 GiB of DDR4-2666 (2x8GB DIMMs, dual channel)
  • OS = Arch GNU/Linux, kernel = Linux 5.4.13-arch1-1
  • Filesystem = tmpfs using transparent hugepages:
    sudo mount -t tmpfs -o size=10G,huge=always tmpfs /tmp/test. Much of the file is using transparent hugepages: from /proc/meminfo after writing a 6.4G file: ShmemHugePages: 6723584 kB (6566MB), and goes down to 83968 kB (82MB) after deleting it. It's using x86-64 2MiB hugepages instead of 4k normal pages to reduce TLB misses.
  • System idle except for Chromium using about 3% of one core at idle clock speed.
  • free memory between tests (tmpfs won't actually page anything out to swapspace during the test):

    $ free -m   # show numbers in megabytes.  Test output deleted; tmpfs nearly empty
            total        used        free      shared  buff/cache   available
    Mem:    15820        2790       12164         236         864       12497
    Swap:    2047         930        1117
    
    $ cat /proc/sys/vm/swappiness 
    6
    
  • Spectre / Meltdown mitigations (big overhead per system call, above the ~100 cycles each for syscall / sysret to get in/out of the kernel):

    $ grep . /sys/devices/system/cpu/vulnerabilities/*
    /sys/devices/system/cpu/vulnerabilities/itlb_multihit:KVM: Vulnerable
    /sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion
    /sys/devices/system/cpu/vulnerabilities/mds:Mitigation: Clear CPU buffers; SMT vulnerable
    /sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
    /sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    /sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
    /sys/devices/system/cpu/vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
    

Results writing to tmpfs for 1 second: best case 6.6GiB

Timed with this as a shell one-liner

 asm-link -nd yes.asm &&    # assemble with NASM + link
 taskset -c 3 timeout 1s ./yes > testd/yesout;     # warm-up run
 time taskset -c 3  timeout 1s ./yes > testd/yesout;
 ls -lh testd/yesout && rm testd/yesout

Up-arrow to recall that string of commands a few times and take the best case. (Assume that lower values meant the CPU speed didn't ramp up right away, time was spent allocating hugepages, or there was other spurious interference.)

I intentionally limited myself to only looking at 2 significant figures of file size (letting ls round to x.y GiB) because I know there's going to be noise, so perhaps keeping the data simpler might be good.

Results for various buffer sizes writing to tmpfs: 6.6GiB

(I did see 6.7G in some early testing with SIZEPOW=16 or 17, and I think even 6.8 or 9, but couldn't reproduce it once things settled down. Maybe some lucky early arrangement of hugepages that didn't last into a stable state?)

  • SIZEPOW=1, buffer size: 2 (just 'y\n'), best-case size: 2.9 MiB
    real 0m1.002s, user 0m0.431s, sys 0m0.570s
  • SIZEPOW=2, buffer size: 4, best-case size: 5.8MiB
    real 0m1.002s, user 0m0.381s, sys 0m0.620s

  • ...

  • SIZEPOW=10, buffer size: 1024, best-case size: 1.3GiB
    real 0m1.005s, user 0m0.343s, sys 0m0.661s

  • SIZEPOW=11, buffer size: 2048, best-case size: 2.3GiB
    real 0m1.008s, user 0m0.270s, sys 0m0.737s. (Smaller than 1x 4k page = bad)

  • SIZEPOW=12, buffer size: 4096, best-case size: 3.6GiB
    real 0m1.012s, user 0m0.237s, sys 0m0.772s
  • SIZEPOW=13, buffer size: 8192 (same as GNU yes), best-case size: 4.8GiB
    real 0m1.016s, user 0m0.180s, sys 0m0.834s
  • SIZEPOW=14, buffer size: 16384, best-case size: 5.6GiB
    real 0m1.018s, user 0m0.090s, sys 0m0.926s
  • SIZEPOW=15, buffer size: 32768, best-case size: 6.2GiB
    real 0m1.019s, user 0m0.057s, sys 0m0.959s
  • SIZEPOW=16, buffer size: 64kiB, best-case size: 6.5GiB
    real 0m1.021s, user 0m0.023s, sys 0m0.993s
  • SIZEPOW=17, buffer size: 128kiB (1/2 L2 cache size), best-case size: 6.6GiB
    real 0m1.021s, user 0m0.017s, sys 0m1.002s
  • SIZEPOW=18, buffer size: 256kiB (= L2 cache size), best-case size: 6.6GiB
    real 0m1.020s, user 0m0.013s, sys 0m1.005s
  • SIZEPOW=19, buffer size: 512kiB (2x L2 cache size), best-case size: 6.4GiB
    real 0m1.021s, user 0m0.000s, sys 0m1.019s (User time getting meaningless / too small for Linux to measure accurately.)
  • SIZEPOW=20, buffer size: 1024kiB, best-case size: 6.2GiB
  • SIZEPOW=21, buffer size: 2MiB, best-case size: 5.7GiB
  • SIZEPOW=22, buffer size: 4MiB (1/2 L3 size), best-case size: 5.0GiB
  • SIZEPOW=23, buffer size: 8MiB (L3 cache size), best-case size: 4.9GiB
  • SIZEPOW=24, buffer size: 16MiB (2x L3 cache size), best-case size: 4.9GiB
  • SIZEPOW=25, buffer size: 32MiB, best-case size: 4.8GiB
  • ...
  • SIZEPOW=28, buffer size: 256MiB, best-case size: 4.4GiB

GNU coreutils 8.31 yes: buffer size 8192, best case size: 4.7GiB

(vs. 4.8GiB for my 8k buffer version. Perhaps because of lower startup overhead: my statically linked executable makes literally no system calls before write, just a BSS pagefault or two, whereas GNU yes, being dynamically linked, has more overhead before it gets going. And it's not quite as tight a loop around syscall, probably involving indirect-call/ret into libc's write wrapper, so at least two different pages of code in user-space are touched between system calls. Two chances for iTLB or i-cache misses. Making a system call tends to invalidate a lot, especially with Spectre / Meltdown / MDS mitigations enabled, but even without that if a lot of kernel code runs.)

GNU yes's buffer is 64-byte aligned (cache line) but not 4k-page aligned. IDK if that makes any difference or leads to more dTLB misses (from spanning 3 pages instead of 2). Being an odd multiple of 64 bytes (address ending in 0x440) is not ideal for the L2 spatial prefetcher (which tries to complete 128B-aligned pairs of lines) but that's probably insignificant; 8kiB is small enough that we get lots of L1d hits and very few L2 misses overall, including both user and kernel space. (perf stat -d)

The fall-off at the high end of size is I think due to cache and TLB misses. You can run perf stat -d instead of time to see LLC-misses. (And LLC-loads is an indication of L2 load misses. Indeed, small buffers get mostly L2 hits, larger buffers get some misses.)

Note that the real/user/system times are for the timeout process itself, not for just the yes workload. And Linux's time accounting is somewhat granular, I think. perf stat gives more details, but I didn't want even the tiny overhead of the occasional HW performance-counter event interrupt.


Further speedup ideas:

Spawning multiple threads (all writing with O_APPEND) might be a minor win, depending on how much they serialize each other when appending. (I'm hoping they just reserve space and then copy, so a 2nd call can reserve more space while an earlier one is still copying. Otherwise we might need vmsplice + splice(2) or something to do the page-dirtying in user-space and give pages to the kernel without further copying (SPLICE_F_GIFT + SPLICE_F_MOVE)). But making a clone system call would probably take significantly more code.

Intel desktop CPUs can nearly saturate their memory bandwidth with a single core, unlike many-core Xeons with higher memory / uncore latency; ironically/paradoxically worse single-core memory bandwidth despite more memory controller channels for higher aggregate bandwidth than dual/quad-core client chips. With significant time spent in the kernel not bottlenecked on DRAM bandwidth, multiple threads could still help.
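
As a rough illustration of that idea (a hedged sketch of the approach, not the golfed asm, and untested against this benchmark): several threads write() the same buffer to stdout, which would have to be opened in append mode (e.g. ./prog >> out in the shell) so each write lands at the current end of file. Every write is a whole number of "y\n" pairs, so the output stays valid regardless of interleaving.

#include <pthread.h>
#include <unistd.h>
#include <wchar.h>

static wchar_t buf[1 << 15];             /* 128 KiB of "y\ny\n..." */

static void *writer(void *unused) {
    for (;;)
        write(1, buf, sizeof buf);       /* stdout opened with O_APPEND: each write goes to the current EOF */
    return NULL;
}

int main(void) {                         /* build: gcc -O2 -pthread */
    wmemset(buf, '\ny\ny', 1 << 15);
    for (int i = 0; i < 3; i++) {        /* three extra writer threads */
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
    }
    writer(NULL);                        /* the main thread writes too; never returns */
}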

Using writev(2) to pass the same buffer multiple times with one system call could give the best of both worlds: small buffer for L1 / L2 hits while copying to the pagecache, but lots of work done per syscall. Maybe a few % speedup. (Tried it: got a 7.0G output with 20 iovecs for an 8kiB buffer, up from 6.6G. Kernel CPU time is now 49.5% clear_page_erms, 13.2% copy_user_enhanced_fast_string. So yes, less time spent copying, more time spent just clearing. 37 bytes, up from 29, score 26.3. Source on Try it online!)

As expected it doesn't come close to paying for itself in user-space code size to create the array of ptr,length pairs, although it was possible with a loop around 2 push instructions (4 bytes), but extra instructions outside the loop to get the args into place were necessary. Also, the call number __NR_writev is 20 so I used an array of 20 iovs, allowing the same trick as with fd = __NR_write to save a byte vs. lea.

There's still overhead per iovec (200x 2k buffer is slower than 20x 8k buffer). The sweet spot is around 4k or 8k buffers, with very minor gains for using a big vector (20x 8k seems enough). IOV_MAX is "only" 1024, and isn't slower but is barely faster. (Try it online! - a flexible iov size version for perf experiments, which also needs only a couple of changes to flip back to plain write instead of writev.)
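
In C, the writev trick looks roughly like this (a sketch of the technique under the same 20 x 8 KiB assumption, not the golfed asm):

#include <sys/uio.h>
#include <wchar.h>

static wchar_t b[2048];                  /* one 8 KiB buffer of "y\ny\n..." */

int main(void) {
    wmemset(b, '\ny\ny', 2048);
    struct iovec iov[20];
    for (int i = 0; i < 20; i++) {       /* point every iovec at the same buffer */
        iov[i].iov_base = b;
        iov[i].iov_len  = sizeof b;
    }
    for (;;)
        writev(1, iov, 20);              /* 20 * 8 KiB = 160 KiB copied per system call */
}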

Peter Cordes

Posted 2020-02-14T21:40:15.447

Reputation: 2 810

"a loop that exited when write() failed" — You're just rubbing it in, aren't you? Amazing answer. Thanks for the benchmarks. – S.S. Anne – 2020-02-17T15:54:14.277

1

@S.S.Anne: No, the part about debugging in a 1GiB tmpfs is the rubbing it in part. :P Especially after hearing your cautionary tale. And thanks, I definitely enjoy the chance to answer the occasional code-golf question with a performance component, like this and Extreme Fibonacci, where I can apply my performance tuning skills and there's something interesting to say about things.

– Peter Cordes – 2020-02-17T21:14:54.963

I finally got my stuff back. I just had to sift through my disk image with grep, no big deal. – S.S. Anne – 2020-02-29T02:18:32.620

7

Bash, 16 bytes, 16TB output, score ~0.0018554687

Thoroughly abuses the rules

trap '' TERM
yes

trap '' TERM sets SIGTERM to ignored (an empty trap string means ignore the signal, and the child yes inherits that disposition), so it continues beyond the 1 second that the benchmark script intended to set. This will fill your disk unless you kill it with a signal that can't be ignored (e.g. SIGKILL) or set a quota or other size limit.

Joshua

Posted 2020-02-14T21:40:15.447

Reputation: 3 043

Boy am I glad I ran this on my partition...I'd say the score is basically zero if we're being honest – hLk – 2020-02-16T03:44:20.773

@hLk: Maximum file size in the filesystem controls. If your filesystem supports bigger than 16TB files it can go a lot lower. – Joshua – 2020-02-16T05:23:18.433

Could you explain the trick please. It looks like on SIGTERM it does nothing. It then runs yes. Why does this get a score below 3? – Anush – 2020-02-16T09:43:31.097

I would guess that it ignores the timeout's kill (running an empty string) and so continues beyond the 1s – DeathIncarnate – 2020-02-16T12:25:02.970

@DeathIncarnate Aha! Thank you :) – Anush – 2020-02-16T21:11:50.590

@DeathIncarnate: yup, looks that way. "malicious" / dangerous answers should really explain themselves better; I updated this one since it was already community wiki. – Peter Cordes – 2020-02-16T22:08:36.147

@PeterCordes: It's CW because that's the standard for rules abusers. – Joshua – 2020-02-16T22:31:03.293

The standard for "rules abusers" is deletion, not misuse of Community Wiki status. – pppery – 2020-02-19T17:06:36.623

@pppery: No, that's the standard for rule breakers. This follows the rules as written, ignoring their intent. – Joshua – 2020-02-19T17:12:32.330

3

Perl 5 (cperl), 26 bytes

while(1){print"y\n"x 9**4}

Try it online!

R, 18 bytes

repeat{cat('y\n')}

Try it online!

Improved thanks to @JDL

Squirrel, 24 bytes

while(1){::print("y\n")}

Try it online!

JavaScript (Node.js), 26 bytes

while(1){console.log('y')}

Try it online!

Python 3, 28 bytes

while 1:print(end='y\n'*9**4)

Try it online!

Julia 1.0, 27 bytes

while(1==1) println('y')end

Try it online!

WHILE Programming Language

Performant but not participating because of many bytes...

You can't try it online! Sorry :) But here are Docs, doc, and a Java-based IDE

Bilel

Posted 2020-02-14T21:40:15.447

Reputation: 141

The output should be y followed by a newline and not yes followed by a newline – Mukundan – 2020-02-15T05:27:50.543

Javascript console adds a new line by default :) that's why I've chosen this language. – Bilel – 2020-02-15T15:09:30.343

you need to output y and not yes so you need to do console.log('y') instead of console.log('yes') – Mukundan – 2020-02-15T15:11:58.653

That's what you say :) I just asked the question to @hlk about such a rule. It's being confused among answers here some use y and some use a strict 'yes'. Thanks for the advice but I'm willing to update it if a short 'y' is accepted. – Bilel – 2020-02-15T15:28:15.220

The Bash answer confused me! Clever one :) – Bilel – 2020-02-15T16:11:04.663

You know, you can go to the little chain link at the top of the TIO page and copy the Code Golf submission by clicking the clipboard on the right. Then you can just paste it into your answer. – S.S. Anne – 2020-02-15T20:55:31.883

thx @S.S.Anne ! I didn't notice the custom template for Code Golf ! easier :) – Bilel – 2020-02-15T21:18:22.497

@Bilel It still confuses me! – Anush – 2020-02-16T09:45:03.473

for the R one, it is probably quicker to use repeat than while(1) as you don't have to evaluate the 1 (it's also fewer characters) – JDL – 2020-02-17T09:53:50.730

It would be great if you could add the score in each case. – Anush – 2020-02-18T06:36:20.563

3

C (gcc), 67 bytes

b[1<<16];main(){for(wmemset(b,'\ny\ny',1<<16);~write(1,b,1<<18););}

Your times may vary. I'm running this inside a VM on a weak computer.

  • b[1<<16]; is an integer array of \$2^{18}\$ bytes.
  • wmemset(b,'\ny\ny',1<<16); sets that array to a pattern of y\ny\n. The characters in the constant appear reversed because x86 is little-endian (see the quick check after this list).
  • ~write(1,b,1<<18) writes that array over and over until the write fails. (The ~ makes a -1 error return -> false; without it we iterate until the process is killed, saving 1 byte.)
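
A quick way to convince yourself of the byte-order claim (a standalone check added for illustration, not part of the answer):

#include <stdio.h>
#include <string.h>

int main(void) {
    int v = '\ny\ny';                    /* GCC multicharacter constant: 0x0A790A79 */
    char s[5] = {0};
    memcpy(s, &v, 4);                    /* little-endian memory order: 79 0A 79 0A */
    printf("%s", s);                     /* prints "y\ny\n" */
}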

May be faster on other systems if I increase the buffer size to 1 megabyte, but there's no noticeable difference on mine. (The timing really should be an average of several runs.)

Try it online!

S.S. Anne

Posted 2020-02-14T21:40:15.447

Reputation: 1 161

A buffer size less than half L2 cache size is probably good so the copy_from_user (kernel memcpy) inside each write can hopefully hit in L2 cache, instead of evicting the data during a loop over a larger buffer. The OP's Zen+ L2 cache size is 512kiB private per-core. https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#Memory_Hierarchy (same as Zen and Zen2). Unless a huge write system call can complete the whole size even if a SIGTERM arrives part way into it? But no, write is interruptible. Anyway, large enough to amortize syscall overhead (including Spectre mitigation) = good

– Peter Cordes – 2020-02-16T22:33:04.550

Note that write returns -1 on error, not 0, so it loops forever / until killed, not "until write fails". – Peter Cordes – 2020-02-16T22:35:54.240

Yeah. I learned that so my filesystem has been screwed. – S.S. Anne – 2020-02-16T22:36:15.630

Hmm this is consistently getting a score of ~67.5 on my comp – hLk – 2020-02-17T02:13:23.247

@hlk I said my situation was crappy. I have a better version akin to Joshua's bash solution whenever I recover it from my VM's files. – S.S. Anne – 2020-02-17T02:15:56.707

1That's interesting, I wonder why it beats yes in the VM... – hLk – 2020-02-17T02:17:47.167

In case you're curious about buffer sizes, I posted a machine-code answer with benchmark results from an i7-6700k Skylake for a range of power-of-2 buffer sizes. – Peter Cordes – 2020-02-17T11:22:34.397

@hLk I think it was because I was doing some other stuff while yes was running. – S.S. Anne – 2020-02-17T16:06:00.083

I'm curious how filling a filesystem hosed your FS in a VM. Did you fill tmpfs and get the VM itself killed by the OOM killer, so write ordering wasn't respected for the VM's persistent FS committing to its backing file? Or did you fill the persistent FS? Which filesystem (ext4?) in what VM? Sounds like a bug that could get reported upstream. Or did your VM image grow huge and get restricted by host limits, resulting in it stopping behaving the way the guest expected? – Peter Cordes – 2020-02-29T20:46:46.980

@PeterCordes I think it's BTRFS because of how the backups work, but you can check containers_and_vms for more info. It's known upstream, but it's not (currently, maybe at all) being fixed. I heard something about it being caused by 9p. The container still ran after it filled up, but it was read-only. After that it stopped starting up (presumably because lxc couldn't write log-files).

– S.S. Anne – 2020-02-29T20:47:58.430

2

Python 3, 34 bytes, 1.9GB/s, score ≈ 32.711

a='y\n'*2**17
while 1:print(end=a)

Try it online!

Mukundan

Posted 2020-02-14T21:40:15.447

Reputation: 1 188

I'd like to see the difference in results between 9999 and int(9e9). 4 extra characters, but a much higher number per print – mazunki – 2020-02-16T17:42:13.893

@mazunki: Linux uses large enough I/O buffers that ~8kiB might keep up with the average throughput of the OP's disk, on the OP's Ryzen CPU. A buffer size that's a multiple of 4kiB might be better though, so all the copies to the pagecache are whole pages, leading to the kernel using aligned memory accesses. 9e9 is so large that it's probably worse, leading to TLB misses and cache misses, and taking a lot of time to prepare in memory before even making one system call. I'd try 8e4 (80kiB, comfortably smaller than L2 cache size, and a multiple of 64 = CPU cache line size). Or 64<<9 – Peter Cordes – 2020-02-16T22:24:12.140

Yes this is definitely faster on my comp now. Score is 31.507 – hLk – 2020-02-17T02:16:29.883

@mazunki: I posted a machine-code answer with benchmarks for various buffer sizes on a Skylake CPU. Presumably Python (with high interpreter overhead = more potential TLB and cache misses after returning from a syscall) will benefit from somewhat larger buffers, maybe cache-blocking for L3 instead of L2 size. – Peter Cordes – 2020-02-17T11:19:51.997

1

Japt, 6 bytes

No clue how the scoring in this challenge works.

@Opy}a

Try it online!

@          :Function
 Opy       :  Print "y"
    }      :End function
     a     :Call repeatedly until it returns a truthy value
           :(Japt's O.p() method returns undefined)

Shaggy

Posted 2020-02-14T21:40:15.447

Reputation: 24 623

You gotta be fast. – Joshua – 2020-02-17T04:23:48.600

1

x86-32 Linux, 26 bytes (ungolfed), 1.5M

write syscall test. Produces a very underwhelming result.

main:
    push    $0x0A79     # "y\n"
    mov     $1, %ebx    # write to stdout (fd=1)
    mov     %esp, %ecx  # use chars on stack
    mov     $2, %edx    # write 2 chars

loop:
    mov     $4, %eax    # sys_write call number 
    int     $0x80
    jmp     loop

qwr

Posted 2020-02-14T21:40:15.447

Reputation: 8 929

It's just called x86. You can improve the result by buffering. – S.S. Anne – 2020-02-16T17:09:57.377

It has to be compiled as 32 bit on Linux to make use of the ABI. – qwr – 2020-02-16T18:54:20.363

You could save 1 byte with pushw (costs operand-size prefix, saves 2 bytes of immediate). Obviously a slow int $0x80 for every 2 bytes is garbage for performance. (Worse than sysenter or 64-bit syscall; you could have made a 64-bit version of this with the same code-size; push rsp; pop rsi.) Also, with Spectre mitigation and so on, every syscall has quite a lot of extra overhead (even on AMD where Meltdown / MDS mitigation isn't needed). So the optimal buffer size to amortize syscall overhead vs. cache-miss cost might be higher. – Peter Cordes – 2020-02-16T22:42:16.007

rep stosw or rep stosd seem obviously good for filling a buffer of 8k or 128k or something on the stack. And BTW, x86-64 __NR_write is 1, same as the stdout FD, so you just need a 2-byte mov reg,reg – Peter Cordes – 2020-02-16T22:44:12.290

work in progress I'll post after supper: https://godbolt.org/z/J358ax x86-64 NASM uses the stack for a buffer, or BSS is smaller. Power of 2 buffer sizes between 65536 and 262144 seem to be the sweet spot on i7-6700k (DDR4-2666, EPP=balance_power: 3.9GHz) for writing into a 1GB tmpfs until ENOSPC (sudo mount -t tmpfs -o size=1G,huge=always tmpfs /tmp/testd to enable tmpfs hugepages). Should correlate to how (non) CPU-bound it is on Ryzen writing to SSD.

– Peter Cordes – 2020-02-17T00:48:19.820

yeah I wondered how slow the naive solution of writing one line at a time is. It is better for golfing but obviously very slow. But I got too lazy to golf properly. – qwr – 2020-02-17T04:26:09.863

Ok, posted an answer which outperforms GNU coreutils yes on tmpfs. I included benchmark results for buffer sizes of 2 and 4 bytes, and a range of powers of 2 from 1024B to 256MiB.

– Peter Cordes – 2020-02-17T11:08:55.213

1

Rust, 49 bytes \$\times\,\frac{31952896}{772800512}\approx2.025\$

Plain Rust, compiled with -C target-cpu=native -C opt-level=3. Clearly not the winner, only for reference; not bad, btw :)

fn main(){loop{print!("{}","y\n".repeat(8192));}}

Try it online!

PieCot

Posted 2020-02-14T21:40:15.447

Reputation: 1 039

1

05AB1E (legacy) / 05AB1E, 4 bytes, score: waiting for OP

['y,

Or alternatively:

'y[=

Try it online.

Not sure which combination is the fastest, so for now this answer has four possible variations:

The 05AB1E (legacy) version is compiled and run in Python 3, and the 05AB1E version is compiled and run in Elixir. Based on performance on TIO for most challenges I did, I assume the legacy version will score much better.

I will leave just one of these four combinations with the largest output after OP's local test. The links provided contain instructions on how to install and run both 05AB1E and 05AB1E (legacy) locally.

Explanation:

[     # Loop indefinitely:
 'y  '#  Push string "y"
   ,  #  Pop and output it with trailing newline

'y   '# Push string "y"
  [   # Loop indefinitely:
   =  #  Output with trailing newline (without popping)

Kevin Cruijssen

Posted 2020-02-14T21:40:15.447

Reputation: 67 575

Lazy but the shortest ! – Bilel – 2020-02-18T01:44:19.287

1

Pyth, 10 bytes \$\times\frac{1907040256}{1248223232}\approx15.27\$

#pp*^T4"y

Explanation:

#            :  Loop till error
     T       :  The integer 10
    ^ 4      :  Evaluates 10 ** 4 (10000)
   *   "y\n  :  Concatenates "y\n" to itself 10000 times
  p          :  Print without trailing newline and return it
 p           :  Print without trailing newline and return it

Try it online!

Mukundan

Posted 2020-02-14T21:40:15.447

Reputation: 1 188

0

Python 3, 18 bytes \$\times\frac{16834568}{622668}\approx486.65\$

while 1:print('y')

Try it online!

Noodle9

Posted 2020-02-14T21:40:15.447

Reputation: 2 776

0

C (clang), 57 54 bytes \$\times\frac{44152832}{45940736}\approx51.8984\$

Saved 3 bytes thanks to @S.S.Anne!!!

#define p putchar_unlocked
main(){for(;;p(10))p('y');}

Try it online!

This is on my ancient tablet, probably better on my laptop that's far away right now - on holiday! :))))

Noodle9

Posted 2020-02-14T21:40:15.447

Reputation: 2 776

54 bytes – S.S. Anne – 2020-02-16T21:52:40.400

@S.S.Anne Oh, very good with for golf - thanks! :-) – Noodle9 – 2020-02-16T22:41:25.033

Compact, but I'm not confident even putchar_unlocked or fputs_unlocked("y", stdout) could manage 1.9GB/s, even on a fast CPU like the OP's Zen. So it's a question of how much slower than just re-writing the same large buffers this is, and which is higher in the scoring system vs. the other C answers that create an 8kB or so buffer. – Peter Cordes – 2020-02-16T22:59:39.990

@PeterCordes Yup, was thinking that's got to be better. :-) – Noodle9 – 2020-02-16T23:01:34.153

Yes this does poorly on my comp: 275.913. Buffers are really that critical it seems – hLk – 2020-02-17T02:21:39.547

@hLk Makes sense, this shows that buffers are doing their job: providing a fast way to store output and using an expensive (timewise) system call to write only occasionally (when the buffer's full). – Noodle9 – 2020-02-17T09:50:18.257

0

chevron, 6 bytes

>y
->1

Superloach

Posted 2020-02-14T21:40:15.447

Reputation: 71

This isn't code-golf; you need to specify your score (how fast it is) – pppery – 2020-02-21T04:42:18.430

0

brainfuck, 30*971956224/621731~46899.2

>+[>+[<]>->+]++++++++++<[.>.<]

Try it online!

The BF JIT that I'm using prints each character in a separate syscall. An implementation that coalesces multiple . commands into one syscall would likely perform better.

Thanks to the list of BF constants.

ceilingcat

Posted 2020-02-14T21:40:15.447

Reputation: 5 503