Multiple reads from a txt file in bash (parallel processing)

3

Here is a simple bash script that checks HTTP status codes:

while read -r url
    do
        urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5)
        echo "$url  $urlstatus" >> urlstatus.txt
    done < "$1"

I am reading the URLs from a text file, but it processes only one at a time, taking too much time. GNU parallel and xargs also process one line at a time (tested).

How can I process URLs simultaneously to improve the timing? In other words, how do I thread over the URL file rather than over bash commands (which is what GNU parallel and xargs do)?

As per the answer below, this code works fine, except that it doesn't process some of the last URLs:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &

Maybe adding wait would help... any suggestions?

user7423959

Posted 2017-01-18T12:13:56.070

Reputation: 45

You could look into sub-processes for this. That would mean you could start an individual shell/thread for each curl. As for your solution using xargs/parallel, it would be worth including it, since you might have just done something wrong. Just reading the file should be fast enough (except if it's really large), but waiting for the answer is probably what your problem is. – Seth – 2017-01-18T12:41:59.973

Actually, after using parallel it processes a single URL at a time, with the same timing as the normal bash script. – user7423959 – 2017-01-18T13:37:01.050

Why would a single URL be any faster? With a single URL you could do all the parallelization you want, it won't get faster. With multiple URLs, on the other hand, you could request a set of URLs at a time. So the issue might be how you've called/used parallel. Hence it could be useful to include how you actually tried to use it. – Seth – 2017-01-18T13:44:19.520

Here is an example: cat abc.txt | parallel -j100 --pipe /root/bash.sh abc.txt. Now you get some idea; n1 is also used. It processes one URL at a time, not in parallel, consuming the same time. – user7423959 – 2017-01-18T13:47:55.757

Answers

3

In bash, you can use the & symbol to run programs in the background. Example:

for i in {1..100}; do
  echo "$i" >> numbers.txt &
done

EDIT: Sorry, but the answer to your question in the comments was wrong, so I have edited the answer. Suggestion with respect to the code:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &
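
Put into the loop from your question, the whole thing would look roughly like the sketch below (untested on a large list). Each curl runs in the background, and the wait at the end is optional: it just makes the script block until every backgrounded job has finished writing its line.

while read -r url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5) && echo "$url  $urlstatus" >> urlstatus.txt &
    # note: because of &&, the echo is skipped when curl exits non-zero (e.g. on a --max-time timeout)
done < "$1"
wait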

me_alok

Posted 2017-01-18T12:13:56.070

Reputation: 362

Can you give a suggestion w.r.t. the code, as adding this symbol (&) doesn't improve the timing. – user7423959 – 2017-01-18T13:38:14.457

Try this

urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5 ) & – me_alok – 2017-01-18T13:59:37.537

Already tried that. – user7423959 – 2017-01-18T14:39:23.757

This worked for me. – ninja – 2017-01-18T14:58:31.507

It works, I tested it before editing the answer. – me_alok – 2017-01-18T15:02:50.403

Your code works fine, but there is one problem: it doesn't process some of the last URLs. It might need a wait somewhere in the code. Any suggestion on this? – user7423959 – 2017-01-19T04:15:11.470

Actually it misses a lot of URLs; only some are shown. – user7423959 – 2017-01-19T04:56:39.793

Adding wait at the end of the file is also not working. – user7423959 – 2017-01-19T05:14:12.317

There is no need to add a wait command here unless you want to limit the number of threads, and in that case it should be inside the while loop. – me_alok – 2017-01-19T07:41:30.490

For the missing URL issue, what's the output in urlstatus.txt? Is it just the status code that's missing, or the entire URL and status? – me_alok – 2017-01-19T07:43:42.267

The missing URLs are all the ones whose status code is 000, so that is not an issue. I want thread control in this script, as a very long text file hangs my system for a while (although it produces results). Any suggestions on adding thread control to this code? – user7423959 – 2017-01-20T05:15:33.547

Can you produce a sample input and output? – me_alok – 2017-01-20T06:37:29.290

Yeah, the output is properly produced. Any suggestion on thread control in this script? – user7423959 – 2017-01-20T09:00:59.430

Can you post a sample output (both stdout and urlstatus.txt)? – me_alok – 2017-01-20T10:40:47.600

1. Here is the input file: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip 2. I am saving your script as bash.sh and executing it from the terminal as ./bash.sh top1m.txt (after unzipping the above). 3. It then produces results in the urlstatus.txt file. 4. I want thread control in this script (you may test with some smaller input file). 5. There are many more files; this is the big one, the others are around 100 to 500 kB. 6. Your answer is working, I am just asking if thread control is possible. – user7423959 – 2017-01-20T12:51:54.090

    Well, multithreading is working here; use the 'top' command to see this. For thread control, let me see what I can do. – me_alok – 2017-01-21T08:22:31.093
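
    One rough way to get that thread control is to cap the number of background jobs inside the loop, something like the sketch below (untested; MAXJOBS is an arbitrary limit you would tune to your machine):

    MAXJOBS=50
    while read -r url
    do
        urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5) && echo "$url  $urlstatus" >> urlstatus.txt &
        # pause while the number of running background jobs is at the cap
        while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
            sleep 1
        done
    done < "$1"
    wait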

    2

    GNU parallel and xargs also process one line at a time (tested)

    Can you give an example of this? If you use -j then you should be able to run much more than one process at a time.

    I would write it like this:

    doit() {
        url="$1"
        urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
        echo "$url  $urlstatus"
    }
    export -f doit
    cat input.txt | parallel -j0 -k doit
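
    If you want the results collected in urlstatus.txt as in the question, redirect the output of the pipeline, for example:

    cat input.txt | parallel -j0 -k doit > urlstatus.txt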
    

    Based on the input.txt:

    Input file is txt file and lines are separated  as
    ABC.Com
    Bcd.Com
    Any.Google.Com
    Something  like this
    www.google.com
    pi.dk
    

    I get the output:

    Input file is txt file and lines are separated  as  000
    ABC.Com  301
    Bcd.Com  301
    Any.Google.Com  000
    Something  like this  000
    www.google.com  302
    pi.dk  200
    

    Which looks about right:

    000 if domain does not exist
    301/302 for redirection
    200 for success
    

    I must say I am a bit surprised if the input lines you have provided really are part of the input you actually use. None of these domains exist, and domain names with spaces in them probably never will exist - ever:

    Input file is txt file and lines are separated  as
    Any.Google.Com
    Something  like this
    

    If you have not given input from your actual input file, you really should do that instead of making up stuff - especially if the made up stuff does not resemble the real data.

    Edit

    Debugging why it does not work for you.

    Please do not write a script, but run this directly in the terminal:

    bash # press enter here to make sure you are running this in bash
    doit() {
        url="$1"
        urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
        echo "$url  $urlstatus"
    }
    export -f doit
    echo pi.dk | parallel -j0 -k doit
    

    This should give:

    pi.dk  200
    

    Ole Tange

    Posted 2017-01-18T12:13:56.070

    Reputation: 3 034

    Hey, I got the same status code 000. Can you tell me how you are executing your script from the terminal? It may help. – user7423959 – 2017-01-19T04:33:03.413

    I put the input lines above into the file input.txt. Then I run the exact lines that are written above. My shell is bash. – Ole Tange – 2017-01-19T07:49:50.097

    Let me explain the whole process: 1. I copied your bash script, saved it as bash.sh, and gave it execution permissions. 2. My input file is a big file, but I also tested on a small 10-line file; here is the list: www.yahoo.com, www.google.com, facebook.com, amazon.com, bing.com, apple.com, www.microsoft.com, www.windows.com, all separated by lines and saved as top.txt. 3. Then I go to the terminal and type ./bash.sh top.txt. 4. It gives the result 000 for each. 5. Can you assist me further with where I am wrong? Thanks. – user7423959 – 2017-01-19T09:19:15.363

    This works fine – user7423959 – 2017-01-20T05:12:03.917

    slower than xargs and consumes all PC resources – ajcg – 2019-09-04T20:12:01.803
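
    For reference, an xargs version of the same check might look something like the sketch below (untested; -P 50 is an arbitrary number of parallel curls, and input.txt/urlstatus.txt are the file names used above):

    xargs -P 50 -I {} sh -c \
        'echo "$1  $(curl -o /dev/null --silent --head --write-out "%{http_code}" --max-time 5 "$1")"' _ {} \
        < input.txt > urlstatus.txt

    The -P option is what actually makes xargs run the curls in parallel; without it xargs defaults to one process at a time, which may be why it looked like it processed one line at a time.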