I have a "source.txt" file which contains a list of URLs. For example:
source.txt:
http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
I want to retrieve each link inside "source.txt", search through the HTML of each page, extract all links that contain "/gp/product", and then store them in an "extracted.txt" file, which would look similar to:
extracted.txt:
http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
...
I am using Windows 7 (64 bit) and Cygwin, so I can run Linux commands as well.
You might want to rephrase your question. The first time I read it, I thought a simple grep (as in ssmy's answer) was what you wanted. Now I've read your question 2½ times, and I guess you mean that you want to retrieve each of the web pages whose URLs you have, and then search through the HTML for "/gp/product". Is that what you mean? If so, I believe you should look at wget. – Scott – 2013-04-02T03:53:14.143

I modified the question. Yes, I mean retrieve and search through the URLs in "source.txt". – Si14 – 2013-04-02T04:21:45.013
You can use "wget -qO- -i source.txt | grep /gp/product", but that will output the lines containing "/gp/product" with all the HTML tags etc. – FSMaxB – 2013-04-02T12:13:31.647
@FSMaxB Thank you. I tried it and you are right. The output is the lines with "href="/gp/product/", while it should be "http://www.amazon.com/gp/product". Any suggestions on how to modify this? – Si14 – 2013-04-02T14:11:11.867

@Si14 Maybe you can use awk or sed to extract the actual links from this list, but I don't know how to do it. At least it's a first step. – FSMaxB – 2013-04-02T15:44:16.990
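Building on FSMaxB's suggestion, one possible sketch of the full pipeline uses grep -o to cut out just the href values and sed to clean them up. This assumes GNU grep and sed (both ship with Cygwin), and assumes the product links on the fetched pages appear as site-relative href="/gp/product/..." attributes, so the Amazon host is prefixed onto them:

```shell
# Fetch every page listed in source.txt, keep only href values that
# contain "/gp/product", strip the href="..." wrapper, turn relative
# paths into absolute Amazon URLs, and de-duplicate into extracted.txt.
wget -qO- -i source.txt \
  | grep -oE 'href="[^"]*/gp/product[^"]*"' \
  | sed -e 's/^href="//' -e 's/"$//' \
        -e 's|^/|http://www.amazon.com/|' \
  | sort -u > extracted.txt
```

The -o flag makes grep print only the matching part of each line instead of the whole line, which is what removes the surrounding HTML tags; links that are already absolute pass through the last sed expression unchanged.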