Why is reading a FILE faster than reading a VARIABLE?

I don't understand the results of a simple performance test I ran using two basic scripts (running on a high-end server):

perfVar.zsh:

#!/bin/zsh -f

MYVAR=`cat $1`
for i in {1..10}
do
  echo $MYVAR
done

perfCat.zsh:

#!/bin/zsh -f

for i in {1..10}
do
  cat $1
done

Performance test result:

> time ./perfVar.zsh BigTextFile > /dev/null
./perfVar.zsh FE > /dev/null  6.86s user 0.32s system 100% cpu 7.177 total
> time ./perfCat.zsh BigTextFile > /dev/null
./perfCat.zsh FE > /dev/null  0.01s user 0.10s system 91% cpu 0.118 total

I would have thought that accessing a VARIABLE was way faster than reading a FILE on the file system... Why this result? Is there a way to optimize the perfCat.zsh script by reducing the number of accesses to the file system?

Sébastien

Posted 2011-05-04T17:41:52.920

How big is BigTextFile? And how much RAM is in the computer? – Heath – 2011-05-04T18:01:56.513

Is that a typo above? As written both scripts will cat the first command-line argument ($1) rather than the loop variable ($i). – CarlF – 2011-05-04T19:08:33.233

@CarlF No, it is not a typo. I don't use the value of $i; I just want to repeat the operation 10 times (reading the file $1). – Sébastien – 2011-05-05T07:42:16.863

@Heath The file is 50MB and the server has 48GB of RAM (1GB free). Results are similar on another server with lower memory usage. – Sébastien – 2011-05-05T08:07:02.223

As an aside, if you use perl/ruby/python (or something similar) instead of a shell language, you'll probably see much more comparable results. – Brian Vandenberg – 2011-05-05T16:15:50.603

Answers

I was able to reproduce the same behavior in Bash. The main problem here is that you're using shell variables in a way they weren't designed for, and therefore aren't optimized for. When you do 'echo $HUGEVAR', the shell has to build a command line containing the entire contents of $HUGEVAR (even though 'echo' is a built-in command, there is still a command line to build).

So the shell expands HUGEVAR into a large string which is then parsed again to split it on whitespace into a list of individual arguments to the echo command. (Note that this will have the effect of collapsing consecutive whitespace characters in the input file to single space characters). Clearly, this process is not very efficient with large strings.
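
You can see this collapsing directly (a minimal sketch in Bash; MYVAR and its contents are just illustrative):

#!/bin/bash
MYVAR=$'a   b\nc'   # an embedded run of spaces and a newline
echo $MYVAR         # unquoted: word-split and rejoined, prints "a b c"
echo "$MYVAR"       # quoted: contents preserved exactly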

You should just stick with running 'cat bigfile' multiple times and let the OS's file system cache do its job of speeding up repeated access to the big file. You also avoid the subtle (possibly unwanted) modification to the string that the shell makes when you use echo; plus, the 'cat' method works with binary files, where the shell method could break on binary data.
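
If the variable version must be kept, quoting sidesteps the re-splitting, though it doesn't make building the huge command line free (a sketch, not a benchmarked fix):

#!/bin/zsh -f

MYVAR=$(cat "$1")
for i in {1..10}
do
  printf '%s\n' "$MYVAR"   # quoted, so no word splitting or whitespace collapse
done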

Heath

In bash and tcsh, choosing the variable approach ...

#!/usr/bin/env bash
MYVAR=`cat $1`

#!/usr/bin/env tcsh
set myvar=`cat $1`

... will cause the shell to execute the cat command and also perform whatever interpretation of the text may apply: for example, character-set handling if the environment variable LANG is set to a UTF-8 locale, or the conversion of newlines into spaces. Finally, the shell needs to allocate space to store the result of the cat.

By contrast, script #2 just cats the file and is done with it. In fact, since it's writing to /dev/null, that will probably improve performance as well.

Try writing to a file instead of /dev/null and re-time it. The cat version will almost certainly still be faster, but the timings may be more in line with each other.
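
For example (the output paths here are just placeholders):

time ./perfVar.zsh BigTextFile > /tmp/perfVar.out
time ./perfCat.zsh BigTextFile > /tmp/perfCat.out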

Lastly, time just the loop instead of the entire script. If what you want to compare is reading from a variable versus reading from a file, then timing the whole script (variable assignment included) isn't measuring that properly.

Edit:

For timing, rather than using the time command, I'd recommend doing this:

#!/usr/bin/env bash

# do some stuff
date --rfc-3339=ns
for (( i = 0; i < 10; i++ )); do
  # Some more stuff
done;
date --rfc-3339=ns

This will output the current date & time, accurate to the nanosecond. (Note that --rfc-3339=ns is a GNU date option.)
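
To turn the two timestamps into an elapsed figure, one option (a sketch assuming GNU date and bc are available) is to capture seconds-since-epoch with nanosecond precision and subtract:

#!/usr/bin/env bash

start=$(date +%s.%N)
for (( i = 0; i < 10; i++ )); do
  cat "$1" > /dev/null      # the work being measured
done
end=$(date +%s.%N)
echo "elapsed: $(echo "$end - $start" | bc) seconds"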

Brian Vandenberg

I just tried piping the output of the cat and of the echo to a grep. This way, even in the "cat case" the data has to be allocated in RAM before being grepped.

The results are similar: reading the file is still much faster.

Is there a way to prevent zsh from doing "any interpretation of the text"? – Sébastien – 2011-05-05T09:12:31.200

Nice. I kind of expected that. Did you also make it only time the actual work being done? Previously, you were also timing it doing the VAR=$(cat $1), which would have skewed the results a little. – Brian Vandenberg – 2011-05-05T09:14:57.630

I haven't managed to time it properly. I surrounded the loop with time { }, but it does not work – Sébastien – 2011-05-05T12:02:59.110
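
In zsh, time is a reserved word that applies to a pipeline rather than a { } block, so one workaround (a sketch) is to run the loop in a subshell:

time ( for i in {1..10}; do echo $MYVAR > /dev/null; done )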

I'm going to edit my answer to provide a suggestion for that. – Brian Vandenberg – 2011-05-05T14:36:10.633

Variable assignment (in contrast to most other script built-ins) is an expensive operation. The reason you are seeing such a drastic performance difference is the size of the data you are working with. On the surface it appears as if you are only assigning the data once, to a single variable (MYVAR), but in reality zsh is assigning the data to a temporary location (mapping and unmapping memory) on every echo call. Normally this is not a problem, but when working with a large chunk of data, it becomes noticeable.
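
A rough way to see this per-echo cost (a sketch in zsh; it assumes the big file is passed as $1) is to time one echo of a big variable against one echo of a small string:

#!/bin/zsh -f

BIGVAR=$(<"$1")                    # zsh can read a file into a variable without cat
time ( echo $BIGVAR > /dev/null )  # slow: the whole contents become one command line
time ( echo hello > /dev/null )    # fast: a tiny argument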

The reason the cat loop is superior is twofold: the size of the data, and file system caching.

h0tw1r3

More tests show that, indeed, each echo call has more or less the same cost as the initial variable assignment.

Why does zsh need to do memory mapping when it is just asked to read a variable?

Also, you are right about file system caching impacting performance, but it has a small impact compared to this variable assignment and access problem – Sébastien – 2011-05-05T08:59:32.570