retrieve and extract links (Linux/Windows)

2

I have a "source.txt" file which contains a list of URLs. For example:

source.txt:    
http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846

I want to retrieve the page behind each URL in "source.txt", search its HTML, and extract every link that contains "/gp/product". The extracted links should be stored in an "extracted.txt" file, which would look similar to:

extracted.txt:
http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
...

I am using Windows 7 (64 bit) and Cygwin, so I can run Linux commands as well.

Si14

Posted 2013-04-02T01:50:18.487

Reputation: 73

You might want to rephrase your question.  The first time I read it, I thought a simple grep (as in ssmy’s answer) was what you wanted.  Now I’ve read your question 2½ times, and I guess you mean that you want to retrieve each of the web pages whose URLs you have, and then search through the HTML for “/gp/product”.  Is that what you mean?  If so, I believe you should look at wget. – Scott – 2013-04-02T03:53:14.143

I modified the question. Yes, I mean retrieve and search through the URLs in "source.txt". – Si14 – 2013-04-02T04:21:45.013

1You can use "wget -qO- -i source.txt | grep /gp/product" but that will output the lines containing "/gp/product" with all html-tags etc. – FSMaxB – 2013-04-02T12:13:31.647

@FSMaxB Thank you. I tried it and you are right. The output consists of lines containing href="/gp/product/..., while what I need are full links starting with "http://www.amazon.com/gp/product". Any suggestions on how to modify this? – Si14 – 2013-04-02T14:11:11.867

1@Si14 Maybe you can use awk or sed to extract the actual links from this list, but I don't know how to do it. At least it's a first step. – FSMaxB – 2013-04-02T15:44:16.990
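
Following up on the awk/sed idea in the comment above, here is a minimal sketch of how the href values could be pulled out of the wget output and turned into full URLs. It is not a tested solution for Amazon's actual markup. Assumptions: the function name extract_product_links and the BASE variable are made up for illustration, BASE is taken from the example URLs in the question, the links appear as double-quoted href attributes, and grep supports the -o option (GNU grep, as shipped with Cygwin, does). Matching HTML with a regular expression is crude, so links written in another form (single quotes, protocol-relative "//" URLs) would be missed or mangled.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin, print absolute /gp/product links.
# BASE is an assumption based on the example URLs in the question.
BASE="http://www.amazon.com"

extract_product_links() {
    # grep -o prints only the matching href attributes, one per line.
    # The first sed strips the leading href=" and the trailing quote.
    # The second sed prefixes site-relative paths (starting with /) with BASE.
    # sort -u removes duplicate links.
    grep -o 'href="[^"]*/gp/product/[^"]*"' |
        sed -e 's/^href="//' -e 's/"$//' |
        sed -e "s|^/|$BASE/|" |
        sort -u
}

# Usage (needs network access; the wget flags are the ones from the comment above):
# wget -qO- -i source.txt | extract_product_links > extracted.txt
```

Links that are already absolute pass through unchanged, because the second sed only rewrites lines that start with a single leading slash.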

Answers

-1

In a bash shell you can use grep:

grep "/gp/product/" source.txt > extracted.txt

ssmy

Posted 2013-04-02T01:50:18.487

Reputation: 1 250

The above command only searches for that keyword in source.txt itself; it does not open the URLs listed in it and search their contents. I tried it in Cygwin. I am not sure how to test it in a bash shell. Do you have any suggestions? – Si14 – 2013-04-02T03:24:13.513