Multiple reads from a txt file in bash (parallel processing)

3

Here is a simple bash script that checks HTTP status codes:

while read -r url
    do
        urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5)
        echo "$url  $urlstatus" >> urlstatus.txt
    done < "$1"

I am reading the URLs from a text file, but it processes only one at a time, taking too much time. GNU parallel and xargs also process one line at a time (tested).

How can I process URLs simultaneously to improve the timing? In other words, how do I thread over the URL file rather than over bash commands (which is what GNU parallel and xargs do)?

As per the answer below, this code works fine, except that it doesn't process some of the last URLs:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &

Maybe adding wait would help... any suggestions?

user7423959

Posted 2017-01-18T12:13:56.070

Reputation: 45

You could look into sub-processes for this. That would mean you could start an individual shell/thread for each curl. As for your solution using xargs/parallel, it would be worth including it, since you might have just done something wrong. Just reading the file should be fast enough (except if it's really large), but waiting for the answer is probably what your problem is. – Seth – 2017-01-18T12:41:59.973

Actually, after using parallel it processes a single URL at a time, with the same timing as the normal bash script. – user7423959 – 2017-01-18T13:37:01.050

Why would a single URL be any faster? With a single URL you could do all the parallelization you want, it won't get faster. With multiple URLs, on the other hand, you could request a set of URLs at a time. So the issue might be how you've called/used parallel. Hence it could be useful to include how you actually tried to use it. – Seth – 2017-01-18T13:44:19.520

Here is an example: cat abc.txt | parallel -j100 --pipe /root/bash.sh abc.txt. Now you get some idea; n1 is also used. It processes one URL at a time, not in parallel, consuming the same time. – user7423959 – 2017-01-18T13:47:55.757

Answers

3

In bash, you can use the & symbol to run programs in the background. Example:

for i in {1..100}; do
  echo "$i" >> numbers.txt &
done

EDIT: Sorry, but the answer to your question in the comments was wrong, so I have edited the answer. Suggestion with respect to the code:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &
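
Put into the loop from your question, the whole thing would look roughly like the sketch below (untested on a large list). Each curl runs in the background, and the wait at the end is optional: it just makes the script block until every backgrounded job has finished writing its line.

while read -r url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5) && echo "$url  $urlstatus" >> urlstatus.txt &
    # note: because of &&, the echo is skipped when curl exits non-zero (e.g. on a --max-time timeout)
done < "$1"
wait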

me_alok

Posted 2017-01-18T12:13:56.070

Reputation: 362

Can you give a suggestion w.r.t. the code, as adding this symbol (&) doesn't improve the timing. – user7423959 – 2017-01-18T13:38:14.457

Try this

urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5 ) & – me_alok – 2017-01-18T13:59:37.537

Already tried that. – user7423959 – 2017-01-18T14:39:23.757

This worked for me. – ninja – 2017-01-18T14:58:31.507

It works, I tested it before editing the answer. – me_alok – 2017-01-18T15:02:50.403

Your code works fine, but there is one problem: it doesn't process some of the last URLs. It might need a wait somewhere in the code. Any suggestion on this? – user7423959 – 2017-01-19T04:15:11.470

Actually it misses a lot of URLs; only some are shown. – user7423959 – 2017-01-19T04:56:39.793

Adding wait at the end of the file is also not working. – user7423959 – 2017-01-19T05:14:12.317

There is no need to add a wait command here unless you want to limit the number of threads, and in that case it should be inside the while loop. – me_alok – 2017-01-19T07:41:30.490

For the missing URL issue, what's the output in urlstatus.txt? Is it just the status code that's missing, or the entire URL and status? – me_alok – 2017-01-19T07:43:42.267

The missing URLs are all the ones whose status code is 000, so that is not an issue. I want thread control in this script, as a very long text file hangs my system for a while (although it produces results). Any suggestions on adding thread control to this code? – user7423959 – 2017-01-20T05:15:33.547

Can you produce a sample input and output? – me_alok – 2017-01-20T06:37:29.290

Yeah, the output is properly produced. Any suggestion on thread control in this script? – user7423959 – 2017-01-20T09:00:59.430

Can you post a sample output (both stdout and urlstatus.txt)? – me_alok – 2017-01-20T10:40:47.600

1. Here is the input file: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip 2. I am saving your script as bash.sh and executing it from the terminal as ./bash.sh top1m.txt (after unzipping the above). 3. It then produces results in the urlstatus.txt file. 4. I want thread control in this script (you may test with some smaller input file). 5. There are many more files; this is the big one, the others are around 100 to 500 kB. 6. Your answer is working, I am just asking if thread control is possible. – user7423959 – 2017-01-20T12:51:54.090

    Well, multithreading is working here; use the 'top' command to see this. For thread control, let me see what I can do. – me_alok – 2017-01-21T08:22:31.093
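
    One rough way to get that thread control is to cap the number of background jobs inside the loop, something like the sketch below (untested; MAXJOBS is an arbitrary limit you would tune to your machine):

    MAXJOBS=50
    while read -r url
    do
        urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5) && echo "$url  $urlstatus" >> urlstatus.txt &
        # pause while the number of running background jobs is at the cap
        while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
            sleep 1
        done
    done < "$1"
    wait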

    2

    GNU parallel and xargs also process one line at a time (tested)

    Can you give an example of this? If you use -j then you should be able to run much more than one process at a time.

    I would write it like this:

    doit() {
        url="$1"
        urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
        echo "$url  $urlstatus"
    }
    export -f doit
    cat input.txt | parallel -j0 -k doit
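
    If you want the results collected in urlstatus.txt as in the question, redirect the output of the pipeline, for example:

    cat input.txt | parallel -j0 -k doit > urlstatus.txt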
    

    Based on the input.txt:

    Input file is txt file and lines are separated  as
    ABC.Com
    Bcd.Com
    Any.Google.Com
    Something  like this
    www.google.com
    pi.dk
    

    I get the output:

    Input file is txt file and lines are separated  as  000
    ABC.Com  301
    Bcd.Com  301
    Any.Google.Com  000
    Something  like this  000
    www.google.com  302
    pi.dk  200
    

    Which looks about right:

    000 if domain does not exist
    301/302 for redirection
    200 for success
    

    I must say I am a bit surprised if the input lines you have provided really are part of the input you actually use. None of these domains exist, and domain names with spaces in them probably never will exist - ever:

    Input file is txt file and lines are separated  as
    Any.Google.Com
    Something  like this
    

    If you have not given input from your actual input file, you really should do that instead of making up stuff - especially if the made up stuff does not resemble the real data.

    Edit

    Debugging why it does not work for you.

    Please do not write a script, but run this directly in the terminal:

    bash # press enter here to make sure you are running this in bash
    doit() {
        url="$1"
        urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
        echo "$url  $urlstatus"
    }
    export -f doit
    echo pi.dk | parallel -j0 -k doit
    

    This should give:

    pi.dk  200
    

    Ole Tange

    Posted 2017-01-18T12:13:56.070

    Reputation: 3 034

    Hey, I got the same status code 000. Can you tell me how you are executing your script from the terminal? It may help. – user7423959 – 2017-01-19T04:33:03.413

    I put the input lines above into the file input.txt. Then I run the exact lines that are written above. My shell is bash. – Ole Tange – 2017-01-19T07:49:50.097

    Let me explain the whole process: 1. I copied your bash script, saved it as bash.sh, and gave it execution permissions. 2. My input file is a big file, but I also tested on a small 10-line file; here is the list: www.yahoo.com, www.google.com, facebook.com, amazon.com, bing.com, apple.com, www.microsoft.com, www.windows.com, all separated by lines and saved as top.txt. 3. Then I go to the terminal and type ./bash.sh top.txt. 4. It gives the result 000 for each. 5. Can you assist me further with where I am wrong? Thanks. – user7423959 – 2017-01-19T09:19:15.363

    This works fine – user7423959 – 2017-01-20T05:12:03.917

    slower than xargs and consumes all PC resources – ajcg – 2019-09-04T20:12:01.803
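
    For reference, an xargs version of the same check might look something like the sketch below (untested; -P 50 is an arbitrary number of parallel curls, and input.txt/urlstatus.txt are the file names used above):

    xargs -P 50 -I {} sh -c \
        'echo "$1  $(curl -o /dev/null --silent --head --write-out "%{http_code}" --max-time 5 "$1")"' _ {} \
        < input.txt > urlstatus.txt

    The -P option is what actually makes xargs run the curls in parallel; without it xargs defaults to one process at a time, which may be why it looked like it processed one line at a time.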