Benchmarking bash commands with time and tee

I have a directory containing more than 80GB of simple text file databases that I anticipate needing to grep through often. For this reason, I'm trying to create some tests to compare GNU grep with what, as far as I can tell, is the fastest alternative to it currently out in the wild - ripgrep - in order to determine which will work the fastest with my data.

The first test will consist of three for loops that run grep, rg and grep -F on a 15GB text file, and the second test will be a series of the same commands run on the whole of the data. After a few days of constant cycling between employing my own limited bash knowledge, looking up solutions and troubleshooting errors, I've managed to hack together the following for the first test (which will also be repurposed for the second test):

for i in {1..15}; do
    (time LC_ALL=C grep -i "ajndoandajskaskaksnaodnasnakdnaosnaond" "15gbfile.txt") 2>&1 |
        tee -a "../grep Test 1.txt"
done

for i in {1..15}; do
    (time rg -i "ajndoandajskaskaksnaodnasnakdnaosnaond" "15gbfile.txt") 2>&1 |
        tee -a "../ripgrep Test 1.txt"
done

for i in {1..15}; do
    (time LC_ALL=C grep -Fi "ajndoandajskaskaksnaodnasnakdnaosnaond" "15gbfile.txt") 2>&1 |
        tee -a "../grep -F Test 1.txt"
done

It's ugly, but it works exactly as intended. It executes all three for loops one after the other, each one grepping 15 times for a long string that will never be found, and then printing the output of time for each grep to both STDOUT and a file.

However, because I'm benchmarking, I want to make sure that the code is suitable to accurately test the (relative) speeds of my use cases on a POSIX/bash/Cygwin system, and that there's nothing I'm overlooking that would skew the results I get - in particular, things like caching, disk I/O, and other considerations I'm not aware of. I would also welcome any suggestions that would make the code more robust or less ugly.

Hashim

Posted 2018-09-24T23:08:19.227

What about caching? Part of the 15GB file would still be in memory after the first loop, going into the second, which may make the second one artificially faster. It would be interesting to run with and without caching to see what difference it makes: https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/

– Paul – 2018-09-24T23:24:18.133

@Paul That's most of the reason I ran grep 15 times for each, in the belief that caching would only make a time difference to the first one or two runs of it. Is this not the case? – Hashim – 2018-09-24T23:27:08.593

This benchmark is definitely not consistent with the problem you're trying to solve. Firstly, if you're searching 80GB of files, then it's likely some large fraction of that will need to be read from disk. grep and ripgrep will do this at about the same speed because they are both likely bottlenecked by I/O speed for simple patterns. Secondly, ripgrep will crawl a directory in parallel by default while grep -r will not. This could result in better search times that won't be captured by searching a single file. – BurntSushi5 – 2018-09-25T12:09:28.537
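
(For the directory-wide test, the two recursive invocations being compared would presumably look something like the sketch below; the path is a placeholder:)

LC_ALL=C grep -ri "ajndoandajskaskaksnaodnasnakdnaosnaond" /path/to/databases/   # single-threaded recursive search
rg -i "ajndoandajskaskaksnaodnasnakdnaosnaond" /path/to/databases/               # crawls the directory in parallel by default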

@BurntSushi5 Might be worth disclosing that you're the developer of ripgrep, but I see your point. As mentioned in the post, this is simply the first test I'm intending to run. The second test will be to run the commands on the entire directory. Most of my intention here was to simply ensure there were no failings in the code itself so that it could also be repurposed for the second test. – Hashim – 2018-09-26T00:02:54.160

@Paul A thought that's just occurred to me: do all of those apply on Cygwin running on top of a Windows 7 system? I'm not sure whether Cygwin bash does any caching. – Hashim – 2018-09-26T00:06:17.233

@Hashim I don't know for sure, but I doubt it. Regarding your other question, at the entrance of the first loop, the file will be uncached, whereas at the start of the ripgrep loop it potentially will be cached. So that gives the ripgrep loop an advantage. – Paul – 2018-09-26T00:14:31.997

@Paul Would running each command successively in a single for loop, as xenoid suggests in his answer, solve that problem? Something like this: https://pastebin.com/L2ua3ihP – Hashim – 2018-09-26T00:48:32.403

@Hashim Sorry, but I'm not going to say I'm the author of ripgrep every single time I want to talk about ripgrep on the Internet. I'll disclose it when I think it's prudent to, but I otherwise think it's easy to discover if people care. Cygwin isn't the thing that will do caching; the OS will. If you want a basic "first" test, then pick a smaller file, or find a way to guarantee that your 15GB file will always be in memory (by sticking it on a ramdisk). Otherwise, your test is just going to be susceptible to the OS's caching strategy. – BurntSushi5 – 2018-09-26T11:04:13.253
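
(As a sketch of the ramdisk option on Linux, assuming a tmpfs mount point of /mnt/ramdisk and enough free RAM to hold the file; Cygwin on Windows would need a third-party ramdisk driver instead:)

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=20g tmpfs /mnt/ramdisk   # RAM-backed filesystem
cp 15gbfile.txt /mnt/ramdisk/                        # searches of this copy never touch disk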

@Hashim If you want a more sophisticated way of benchmarking command line tools, then consider using Hyperfine: https://github.com/sharkdp/hyperfine --- Otherwise, establishing a "basic" first test isn't clearly useful to me if your second benchmark is going to exercise a completely different type of search with different behavior. Benchmarking the case where everything is in memory vs the case where you need to read from disk requires two different strategies, and they in turn depend on what it is you want to measure. – BurntSushi5 – 2018-09-26T11:06:19.907
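
(For reference, a minimal hyperfine invocation covering the three commands from the question might look like the sketch below; --warmup runs each command a few times before measuring, and --ignore-failure keeps hyperfine from aborting because grep exits non-zero when the pattern is never found:)

hyperfine --warmup 3 --ignore-failure \
    "LC_ALL=C grep -i 'ajndoandajskaskaksnaodnasnakdnaosnaond' 15gbfile.txt" \
    "rg -i 'ajndoandajskaskaksnaodnasnakdnaosnaond' 15gbfile.txt" \
    "LC_ALL=C grep -Fi 'ajndoandajskaskaksnaodnasnakdnaosnaond' 15gbfile.txt"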

@BurntSushi5 Regarding your first comment - so the aim is not to eliminate caching, but to ensure that it's consistent throughout the tests? – Hashim – 2018-09-27T18:56:35.747

@Hashim In the ideal sense, sure. But I don't see how that's possible, since it's a transparent thing handled by the OS. In practice, you have two choices: either ensure everything is in cache or ensure nothing is in cache. The former can be generally achieved by an ample amount of warmup or putting the input on a ramdisk, assuming your input fits into memory. (Does it? 15GB probably won't be entirely cached on a system with 16GB of memory.) The latter is usually possible, although I only know how to do it in Linux: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'. – BurntSushi5 – 2018-09-28T11:18:55.437
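
(A sketch of the cold-cache variant on Linux, dropping the page cache before every timed run; it needs root, has no direct Cygwin/Windows equivalent, and the output file name is illustrative:)

for i in {1..15}; do
    sync                                              # flush dirty pages to disk first
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'    # evict page cache, dentries and inodes
    (time LC_ALL=C grep -i "ajndoandajskaskaksnaodnasnakdnaosnaond" "15gbfile.txt") 2>&1 |
        tee -a "../grep cold Test.txt"
done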

@BurntSushi5 Would using a sufficiently large file ensure that nothing is in cache? Say a 30GB file on a system with 16GB of memory? Or would the file simply be cached in parts? – Hashim – 2018-09-29T22:35:00.363

I think I gave you the only two options that I know to be reliable enough for serious and reproducible benchmarking. I see no reason to assume that a file is either completely cached or completely uncached. – BurntSushi5 – 2018-09-30T14:21:52.100

Answers

IMHO your test is biased, because you are running the three commands at vastly different times. You should have a single loop that runs the grep, rg and grep -F commands in succession, and if you can make that order random, that would be even better.
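
A sketch of that interleaved approach in bash (the run_* helpers and output file names are illustrative; shuf -e prints its arguments in random order, re-shuffled on every iteration):

pattern="ajndoandajskaskaksnaodnasnakdnaosnaond"
file="15gbfile.txt"

run_grep()  { LC_ALL=C grep -i  "$pattern" "$file"; }
run_rg()    { rg -i             "$pattern" "$file"; }
run_grepF() { LC_ALL=C grep -Fi "$pattern" "$file"; }

for i in {1..15}; do
    for name in $(shuf -e run_grep run_rg run_grepF); do
        (time "$name") 2>&1 | tee -a "../$name Test 1.txt"
    done
done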

On the other hand, performance isn't everything: I would require significantly better performance before switching to a specific command, and a far better performer would show up even in biased benchmarks.

xenoid

Posted 2018-09-24T23:08:19.227

In addition, the tee command is probably taking more time than the grep itself. – matzeri – 2018-09-25T08:39:00.473

I see your point in the first paragraph, but what effect would this have on eliminating the benefits of caching? My initial intention in running the same commands successively was the belief that caching would only apply to the first few instances of the command, and so those first few runs could be ignored. Also, do Linux concepts of caching/disk I/O even apply to bash running in Cygwin on top of Windows 7? – Hashim – 2018-09-26T00:10:50.867

File caching isn't a "Linux concept." – BurntSushi5 – 2018-09-26T11:07:09.760

@BurntSushi5 I never claimed it was, but the two OSes likely have different implementations/approaches to it, and I was asking whether bash running in Cygwin would use Linux's or Windows'. – Hashim – 2018-09-27T18:01:16.193