How to get the page source of a specific Google search result page?

0

I want to write a shell script that prints the names of the characters in a TV series or movie. I plan to do that by extracting the page source of the Google search result page, for which I'll need the page source of the links, e.g. this link. I tried using wget directly, but it fails with error code 8, and curl -L returns the "wrong" page source.

juggernauthk108

Posted 2016-11-02T16:07:31.667

Reputation: 127

Are you sure that it is the "wrong" page source? Google likely uses client-side code (JavaScript) to populate the character data once the page loads, i.e., the page source you receive won't look like the source shown in the browser, because the browser's copy has been changed by JavaScript after page load. wget and curl do not execute any JavaScript. – varlogtim – 2016-11-15T20:31:14.550

Answers

0

If you look at the wget log messages, you will see that you finally get "403 Forbidden" from Google.
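The "error code 8" from the question is wget's exit status, and the wget manual lists 8 as "server issued an error response", which matches the 403. A minimal sketch for making that visible in a script follows; the function name describe_wget_status is hypothetical, and the messages are paraphrased from the exit-status list in the wget manual.

```shell
#!/bin/sh
# describe_wget_status: hypothetical helper that maps a wget exit status
# to a short description (paraphrased from the wget manual).
describe_wget_status() {
    case $1 in
        0) echo "no problems occurred" ;;
        1) echo "generic error" ;;
        2) echo "parse error (command line or .wgetrc)" ;;
        3) echo "file I/O error" ;;
        4) echo "network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "username/password authentication failure" ;;
        7) echo "protocol error" ;;
        8) echo "server issued an error response" ;;
        *) echo "unknown exit status" ;;
    esac
}

# Usage (needs network access, so it is left commented out here).
# -S (--server-response) prints the HTTP status line and headers,
# which is where "403 Forbidden" shows up:
#   wget -S -O results.html "$url"
#   describe_wget_status "$?"
```

Calling describe_wget_status 8 prints "server issued an error response"; the exact HTTP status still has to be read from the -S output or the wget log.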

So feel invited to look at this Stack Overflow answer. Google doesn't want its search results page to be used in an automated way, and I suppose they've got pretty good reasons.

If you want to do this anyway, you can set a different User-Agent string with wget: wget --user-agent=Chrome -O results.html 'https://www.google.com/search?hl=en&q=iron%20man%20character%20names'

However, the response you then get from Google is not easy to parse. Maybe you can use a movie database for this task?
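For the wget approach above, the search URL can be built from an arbitrary title instead of being typed by hand. This is only a sketch: the function names urlencode, build_search_url and fetch_results are hypothetical, the encoder assumes plain ASCII input, and Google may still refuse or change what it returns for automated requests.

```shell
#!/bin/sh
# urlencode: percent-encode a string for use in a URL query.
# Assumes ASCII input. The body runs in a subshell "( ... )" so the
# helper variables do not leak into the caller.
urlencode() (
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# build_search_url: same URL shape as in the wget command above.
build_search_url() {
    printf 'https://www.google.com/search?hl=en&q=%s\n' "$(urlencode "$1")"
}

# fetch_results QUERY OUTFILE: download the result page with a
# browser-like User-Agent, as suggested in the answer.
fetch_results() {
    wget --user-agent=Chrome -O "$2" "$(build_search_url "$1")"
}

# Usage (needs network access, so it is left commented out here):
#   fetch_results 'iron man character names' results.html
```

Keeping the URL construction separate from the download makes it easy to check the generated URL before sending any request.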

u_Ltd.

Posted 2016-11-02T16:07:31.667

Reputation: 213

That worked, and indeed the output is not something easy to parse. As for using a movie DB, what I actually want to make is more generic, and this was one piece of the mosaic that was troubling me (which you solved)... – juggernauthk108 – 2016-11-29T12:59:07.427