How to run sed on over 10 million files in a directory?

16

5

I have a directory that has 10144911 files in it. So far I've tried the following:

  • for f in `ls`; do sed -i -e 's/blah/blee/g' $f; done

Crashed my shell. (The ls is in backticks; the formatting here kept eating them.)

  • ls | xargs -0 sed -i -e 's/blah/blee/g'

Too many args for sed

  • find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork any more processes: no more memory.

Any other ideas on how to build this kind of command? The files don't need to communicate with each other. ls | wc -l works (though very slowly), so it must be possible.

Sandro

Posted 2011-03-14T02:03:12.980

Reputation: 499

It would be faster if you can avoid invoking sed for each file. I'm not sure if there's a way to open, edit, save, and close a series of files in sed; if speed is essential you may want to use a different program, perhaps perl or python. – intuited – 2011-03-14T05:40:56.193

@intuited: It would be even faster to not do anything to the files at all... seriously? If you want to change a pattern in a set of files, you have to look into each file to see whether the pattern is there. If you know in advance that you can skip 'some' files, then it's obviously faster to not even touch them. And the startup time for sed is probably faster than launching python or perl as well, unless you do everything inside that interpreter. – akira – 2011-03-14T09:41:16.067

@akira: Are you saying that launching perl or python once for as many files as will fit on a command line is more expensive than launching sed once for each of those files? I would be really surprised if that were the case. I guess you didn't understand that my suggestion is to invoke (start) the editing program once (or at least fewer times; see my answer), and have it open, modify and resave each of the files in turn, rather than invoking the editing program separately for each of those files. – intuited – 2011-03-14T17:21:37.440

Your first comment does not reflect what you really wanted to say: "replace sed by python/perl". By just doing that, and looking at the command line the OP has given, an innocent reader could assume that "find . -exec python" is faster than "find . -exec sed", which is obviously not the case. In your own answer you call python much more often than is actually needed. – akira – 2011-03-14T20:26:01.807

I think that akira misinterpreted your (intuited) suggestion. I believe that you were suggesting to bunch files together. I tried that with my xargs attempt, time to try it again :) – Sandro – 2011-03-14T20:47:02.767

@Sandro: Your 'xargs -0 sed -i' already calls sed on a batch of files rather than once per file. I find @intuited's first comment misleading because it provides only half of what he has in mind, and his answer leaves out the interesting part (for others) as well. – akira – 2011-03-14T21:58:51.350

Sandro: Crazy! I think for the benefit of the community, you should explain how you ended up in this situation. How big is the directory entry itself? Probably several hundred megs. What filesystem are you using? The xargs option might work if you use -n to limit the number of args per sed run. – deltaray – 2011-03-15T00:41:14.613

Answers

19

Give this a try:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

It will only feed one filename to each invocation of sed. That will solve the "too many args for sed" problem. The -P option should allow multiple processes to be forked at the same time. If 0 doesn't work (it's supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the number.
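If one filename per sed process turns out to be the bottleneck, a possible middle ground (my sketch, not part of this answer) is to let xargs batch the filenames with -n while still running several seds in parallel with -P; the 1000 and 4 below are arbitrary starting points to tune, not measured recommendations:

# Batch up to 1000 files per sed invocation and run 4 seds in parallel.
find . -name '*.txt' -print0 | xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g'

Each sed then starts once per thousand files instead of once per file, which addresses the per-file startup cost discussed in the comments above.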

Paused until further notice.

Posted 2011-03-14T02:03:12.980

Reputation: 86 075

Probably, it will need to be find . -name \*.txt -print0 to avoid having the shell expand the glob and try to allocate space for 10 million arguments to find. – Chris Johnsen – 2011-03-14T06:38:35.963

@ChrisJohnsen: Yes, that's correct. I rushed posting my answer and missed including those essential parts. I've edited my answer with those corrections. Thanks. – Paused until further notice. – 2011-03-14T07:37:01.483

Trying it now... crosses fingers – Sandro – 2011-03-14T20:02:11.427

7

I've tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I've now included a quad-core run of the 'find | xargs' method (still without 'sed'; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done  
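For the actual replacement job, the same loop would call sed instead of echo. A sketch along those lines (my addition; it was not part of the timed runs):

# Step 3 (sketch, not timed). Run sed over the array, one file per invocation.
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do sed -i -e 's/blah/blee/g' "${names[ix]}" ; done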

Here is a summary of how the provided answers fared when run against the test data mentioned above. These results cover only the basic overheads; i.e., 'sed' was not called. The sed step will almost certainly be the most time-consuming, but I thought it would be interesting to see how the bare methods compared.

Dennis's 'find | xargs' method, using a single core, took 4 hours 21 minutes longer than the bash array method on a no-sed run... However, the multi-core advantage offered by xargs' -P option should outweigh the time differences shown once sed is actually called to process the files...

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet
                               |       It started processing almost immediately (because of xargs I suppose),  
                               |       but it runs significantly slower than the only other working answer  
                               |       (again, probably because of xargs), but if the multi-core feature works,  
                               |       and I would think that it does, then it could make up the deficit in a 'sed' run.   
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
             cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 GiB
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+----------------------------------------------------- 

Peter.O

Posted 2011-03-14T02:03:12.980

Reputation: 2 743

2

Another option, using find in a completely safe way:

# Read NUL-delimited paths from find so that any filename is handled safely.
while IFS= read -rd $'\0' path
do
    # Canonicalise the path; the appended 'x' protects a trailing newline in the
    # name from being stripped by the command substitution, and is then removed.
    file_path="$(readlink -fn -- "$path"; echo x)"
    file_path="${file_path%x}"
    sed -i -e 's/blah/blee/g' -- "$file_path"
done < <( find "$absolute_dir_path" -type f -print0 )   # set absolute_dir_path to the target directory

l0b0

Posted 2011-03-14T02:03:12.980

Reputation: 6 306

1

This is mostly off-topic, but you could use

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 python -c '
import fileinput
for line in fileinput.input(inplace=True):
    print line.replace("blah", "blee"),
'

The main benefit here (over ... xargs ... -I {} ... sed ...) is speed: you avoid invoking sed 10 million times. It would be faster still if you could avoid using Python (since python is kind of slow, relatively), so perl might be a better choice for this task. I'm not sure how to do the equivalent conveniently with perl.
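For what it's worth, here is a sketch of one way to do it in perl (my addition, not from the original answer); perl's -i and -p flags do the in-place, line-by-line editing, so each perl process handles a whole batch of files:

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 perl -i -pe 's/blah/blee/g'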

The way this works is that xargs will invoke Python with as many arguments as it can fit on a single command line, and keep doing that until it runs out of arguments (which are supplied here by the find command). The number of arguments to each invocation will depend on the length of the filenames and the system's argument-length limit. The fileinput.input function yields successive lines from the files named in each invocation's arguments, and the inplace option tells it to magically "catch" the output and use it to replace each line.

Note that Python's string replace method doesn't use regexps; if you need those, you have to import re and use print re.sub("blah", "blee", line), instead. Python's regexps are Perl-style, which are sort of heavily fortified versions of the ones you get with sed -r.

edit

As akira mentions in the comments, the original version using a glob (ls -f *.txt) in place of the find command wouldn't work because globs are processed by the shell (bash) itself. This means that before the command is even run, 10 million filenames will be substituted into the command line. This is pretty much guaranteed to exceed the maximum size of a command's argument list. You can use xargs --show-limits for system-specific info on this.

The maximum size of the argument list is also taken into account by xargs, which limits the number of arguments it passes to each invocation of python according to that limit. Since xargs will still have to invoke python quite a few times, akira's suggestion to use os.path.walk to get the file listing will probably save you some time.
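Roughly, letting Python list the directory itself so the interpreter starts only once might look like the sketch below (my addition; it uses os.listdir for this single flat directory rather than os.path.walk, which is the recursive variant akira named):

python -c '
import fileinput, os
# List the flat directory once and keep only the .txt files.
txt_files = [name for name in os.listdir(".") if name.endswith(".txt")]
# One pass over every file, rewriting each line in place.
for line in fileinput.input(txt_files, inplace=True):
    print line.replace("blah", "blee"),
'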

intuited

Posted 2011-03-14T02:03:12.980

Reputation: 2 861

What's the point of using the glob operator (which will fail for that many files anyway) ... and then feeding the files to python, which has os.path.walk()? – akira – 2011-03-14T09:34:29.427

@akira: The glob operator is there to avoid trying to replace the contents of '.' and '..'. Certainly there are other ways to do that (e.g. find), but I'm trying to stick as closely as possible to what the OP understands. This is also the reason for not using os.path.walk. – intuited – 2011-03-14T17:16:17.070

@akira: Good suggestion, though; that would probably be considerably faster. – intuited – 2011-03-14T17:27:23.677

I think that the OP will understand os.path.walk quite easily. – akira – 2011-03-14T20:27:24.830

0

Try:

ls | while read file; do (something to $file); done
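Filled in for this particular task, and with the tweaks from geekosaur's comment below (ls -f to skip the sort, IFS= read -r to keep filenames intact, and a -f test to skip the . and .. entries that ls -f includes), a sketch might look like:

ls -f | while IFS= read -r file; do [ -f "$file" ] && sed -i -e 's/blah/blee/g' -- "$file"; done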

Reuben L.

Posted 2011-03-14T02:03:12.980

Reputation: 942

ls -f would be better; do you really want to wait around for it to stat() and sort that many files? – geekosaur – 2011-03-14T02:47:00.313

Right now I'm trying: for f in *.txt; do blah; done. I'll give that a whack if it fails. Thank you! – Sandro – 2011-03-14T03:47:48.663