Spawning multiple parallel wgets and storing results in a bash array to be pretty printed when all wgets are done

5

I have a long list of URLs on my own website, listed in a carriage-return-separated text file. So for instance:

  • http://www.mysite.com/url1.html
  • http://www.mysite.com/url2.html
  • http://www.mysite.com/url3.html

I need to spawn a number of parallel wgets to hit each URL twice, check for and retrieve a particular header and then save the results in an array which I want to output in a nice report.

I have part of what I want by using the following xargs command:

xargs -x -P 20 -n 1 wget --server-response -q -O - --delete-after < ./urls.txt 2>&1 | grep Caching

The question is how do I run this command twice and store the following:

  1. The URL hit
  2. The 1st result of the grep against the Caching header
  3. The 2nd result of the grep against the Caching header

So the output should look something like:

=====================================================
http://www.mysite.com/url1.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

=====================================================
http://www.mysite.com/url2.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

And so forth.

The order in which the URLs appear isn't necessarily a concern, as long as the headers stay associated with their URL.

Because of the number of URLs, I need to hit them in parallel, not serially; otherwise it will take way too long.

The trick is how to run multiple parallel wgets AND store the results in a meaningful way. I'm not married to using an array if there is a more logical way of doing this (maybe writing to a log file?).

Do any bash gurus have any suggestions for how I might proceed?

Brad

Posted 2013-06-10T12:15:51.817

Reputation: 185

Are your entries really separated by carriage returns (\r), not new lines (\n) or windows style (\r\n)? Is this a file from an old Mac? – terdon – 2013-06-10T16:53:45.340

You may want to experiment with GNU parallel. In particular, the manpage mentions "GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially." – kampu – 2013-06-11T04:16:38.243

Answers

3

Make a small script that does the right thing given a single URL (based on terdon's code):

#!/bin/bash
# Check the Caching header for the single URL given as $1, hitting it twice.
url=$1
echo "======================================="
echo "$url"
echo "======================================="
echo -n "First Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo -n "Second Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo ""

Then run this script in parallel (say, 500 jobs at a time) using GNU Parallel:

cat urls.txt | parallel -j500 my_script

GNU Parallel will make sure the output from two processes is never mixed - a guarantee xargs does not give.
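As a quick usage sketch (assuming the script above has been saved as my_script and made executable; the jobs.log file name is just an example of mine, and --joblog is entirely optional):

chmod +x my_script
# --joblog writes one line per job with its exit status and runtime,
# which makes it easy to spot URLs that failed.
parallel -j500 --joblog jobs.log ./my_script :::: urls.txt

The :::: urls.txt form simply reads the arguments from the file; it is equivalent to piping the file in with cat as above.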

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

wget -O - pi.dk/3 | sh 

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange

Posted 2013-06-10T12:15:51.817

Reputation: 3 034

Ah, yes, should have thought of that, +1. – terdon – 2013-06-11T11:33:34.973

0

One trivial solution would be to log the output from each of the wget commands to a separate file and then use cat to merge them afterwards.
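As a sketch of how that could look, using the xargs parallelism from the question; the results/ directory, the fetch helper, the md5sum-based file names and report.txt are all illustrative choices of mine, not something this answer specifies:

#!/bin/bash
mkdir -p results

fetch() {
  url=$1
  # Hash the URL so every URL gets a unique, filesystem-safe file name.
  out="results/$(printf '%s' "$url" | md5sum | cut -d' ' -f1).txt"
  {
    echo "======================================="
    echo "$url"
    echo "======================================="
    echo -n "First Hit: Caching: "
    if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
    echo -n "Second Hit: Caching: "
    if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
    echo ""
  } > "$out"
}
export -f fetch

# 20 parallel fetches, as in the xargs command from the question; each URL
# writes only to its own file, so nothing gets interleaved.
xargs -P 20 -I{} bash -c 'fetch "$1"' _ {} < urls.txt

# Merge the per-URL files into one report.
cat results/*.txt > report.txt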

l0b0

Posted 2013-06-10T12:15:51.817

Reputation: 6 306

I have 22,000 URLs. I suppose I could create 22,000 text files and then try and merge and delete them afterwards but I must admit I'm not terribly fond of generating all that I/O. – Brad – 2013-06-10T18:25:45.597

22,000 files isn't much in my book, but I guess it comes with the territory. time for i in {1..22000}; do echo "Number $i" > $i; done - 1.7 seconds. Removing them: Less than a second. – l0b0 – 2013-06-10T20:40:34.723

0

I will assume that your file is newline-separated, not carriage-return-separated, because the command you give will not work with a \r-separated file.

If your file is using \r instead of \n for line endings, change it to use \n by running this:

perl -i -pe 's/\r/\n/g' urls.txt 

If you are using Windows style (\r\n) line endings, use this:

perl -i -pe 's/\r//g' urls.txt 
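(A side note of mine, not part of the original answer: if you are not sure which kind of line endings the file actually has, the file utility on most Linux systems will tell you.)

# "with CRLF line terminators" means Windows-style endings, "with CR line
# terminators" means old-Mac-style \r only, and plain "ASCII text" means
# the file is already newline (\n) separated.
file urls.txt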

Now, once you have your file in Unix form, if you don't mind your jobs not being run in parallel, you can do something like this:

while read -r url; do
  echo "======================================="
  echo "$url"
  echo "======================================="
  echo -n "First Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo -n "Second Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo ""
done < urls.txt

UPDATE in response to your comment:

If you have 22,000 URLs, I can indeed understand why you want to do this in parallel. One thing you could try is creating tmp files:

(while read -r url; do
  (
    echo "======================================="
    echo "$url"
    echo "======================================="
    echo -n "First Hit: Caching: "
    if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
    echo -n "Second Hit: Caching: "
    if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
    echo ""
  ) > "$(mktemp urltmpXXX)" 2>/dev/null &
done < urls.txt)

Two subshells are launched there. The first, (while ... < urls.txt), is just there to suppress job-completion messages. The second, ( echo "=== ... ) > "$(mktemp urltmpXXX)", is there to collect all the output for a given URL into one file.

The script above will create 22,000 tmp files called urltmpXXX, where XXX is replaced by three random characters. Since each tmp file will contain 6 lines of text once its URL has been processed, you can monitor progress (and merge the files into one output file when everything is done) with this command:

b=$(wc -l < urls.txt)
while true; do
  a=$(cat urltmp* | wc -l)
  if [ "$a" -eq $((6 * b)) ]; then cat urltmp* > urls.out; break
  else sleep 1; fi
done

Now the other problem is that this will launch 22,000 jobs at once. Depending on your system, this may or may not be a problem. One way around it is to split your input file and then run the above loop once for each chunk.
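A rough sketch of that split-and-wait idea (the chunk size of 100 and the chunk_ prefix are example choices of mine, not something the answer prescribes):

split -l 100 urls.txt chunk_
for file in chunk_*; do
  while read -r url; do
    (
      echo "======================================="
      echo "$url"
      echo "======================================="
      echo -n "First Hit: Caching: "
      if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
      echo -n "Second Hit: Caching: "
      if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
      echo ""
    ) > "$(mktemp urltmpXXX)" 2>/dev/null &
  done < "$file"
  # Wait for this chunk's background jobs before starting the next chunk,
  # so at most ~100 wget pairs are in flight at any time.
  wait
done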

terdon

Posted 2013-06-10T12:15:51.817

Reputation: 45 216

Thanks, I already have a script that runs serially, i.e. one URL at a time. The issue is that we have 22,000 URLs to hit. Running through them serially takes too long. I need a solution that executes in parallel to reduce the time to run the script. The trouble is, once you execute in parallel, how do you record the results in a way that can be turned into a sensible report afterwards? – Brad – 2013-06-10T18:24:13.527

@Brad I have updated my answer with a (perhaps absurdly convoluted) way of running it in parallel. – terdon – 2013-06-10T19:10:09.490

Actually this brought my server to its knees. Oops! I guess I need to break this up / throttle it somehow. – Brad – 2013-06-11T02:49:38.967

@Brad yeah, I did warn you :). Try splitting the file into, say, 100-line chunks: split -l 100 urls.txt, then run the loop on each file: for file in x*; do (while read url; do ... ;done < $file); done. Here, <$file replaces <urls.txt. – terdon – 2013-06-11T02:58:00.283