I have a "source.txt" file which contains a list of URLs. For example:
source.txt:
http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
I want to retrieve each link inside "source.txt", search through the HTML of each page, extract all links that contain "/gp/product", and then store them in an "extracted.txt" file, which would look similar to:
extracted.txt:
http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
...
I am using Windows 7 (64 bit) and Cygwin, so I can run Linux commands as well.
You might want to rephrase your question. The first time I read it, I thought a simple grep (as in ssmy's answer) was what you wanted. Now I've read your question 2½ times, and I guess you mean that you want to retrieve each of the web pages whose URLs you have, and then search through the HTML for "/gp/product". Is that what you mean? If so, I believe you should look at wget. – Scott – 2013-04-02T03:53:14.143

I modified the question. Yes, I mean retrieve and search through the URLs in "source.txt". – Si14 – 2013-04-02T04:21:45.013
You can use "wget -qO- -i source.txt | grep /gp/product", but that will output the lines containing "/gp/product" with all the HTML tags etc. – FSMaxB – 2013-04-02T12:13:31.647
@FSMaxB Thank you. I tried it and you are right. The output is the lines with "href="/gp/product/", while it should be "http://www.amazon.com/gp/product". Any suggestions on how to modify this? – Si14 – 2013-04-02T14:11:11.867

@Si14 Maybe you can use awk or sed to extract the actual links from this list, but I don't know how to do it. At least it's a first step. – FSMaxB – 2013-04-02T15:44:16.990
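Building on FSMaxB's suggestion, one possible sketch of the full pipeline uses grep -o to cut out just the href values and sed to clean them up. This assumes GNU grep and sed (both ship with Cygwin), and assumes the product links on the fetched pages appear as site-relative href="/gp/product/..." attributes, so the Amazon host is prefixed onto them:

```shell
# Fetch every page listed in source.txt, keep only href values that
# contain "/gp/product", strip the href="..." wrapper, turn relative
# paths into absolute Amazon URLs, and de-duplicate into extracted.txt.
wget -qO- -i source.txt \
  | grep -oE 'href="[^"]*/gp/product[^"]*"' \
  | sed -e 's/^href="//' -e 's/"$//' \
        -e 's|^/|http://www.amazon.com/|' \
  | sort -u > extracted.txt
```

The -o flag makes grep print only the matching part of each line instead of the whole line, which is what removes the surrounding HTML tags; links that are already absolute pass through the last sed expression unchanged.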