
For a recent example, type into Google (with quotes) "Reggie Kray as a character witness" and your first result should be this article from The Times. Even with all of my best JavaScript tricks, I can only read up to the start of the fourth paragraph. However, the Google search result clearly shows a part of the article further on than I could read.

What sorcery is this? I don't find it likely that Google is paying for a newspaper subscription or that they're using some underhanded trick to bypass paywalls, so how have they got through this security?

J. Mini

2 Answers


If this happens, the web server is usually detecting Googlebot and delivering a different result to Google than to everyone else.

There were websites that checked only the User-Agent header, so you could simply change that in your client and read the full article.

Proper implementations will at least check that the IP is also actually one used by the Google crawler.

If they have it properly implemented, you cannot get the content unless you are Google.
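That "proper" check is the one Google itself documents: reverse-resolve the connecting IP, confirm the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal sketch in Python (the function names are my own, not from any framework):

```python
import socket

# Domains that, per Google's documentation, verified crawler hosts resolve to.
GOOGLE_CRAWL_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """Pure check: does a reverse-DNS name belong to Google's crawl domains?"""
    return hostname.endswith(GOOGLE_CRAWL_SUFFIXES)

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm.

    A spoofed User-Agent fails here: the attacker's IP either has no
    PTR record under googlebot.com/google.com, or the forward lookup
    of that hostname will not map back to the attacker's IP.
    """
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse (PTR) lookup
        if not hostname_is_google(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
        return ip in forward_ips
    except OSError:                                         # no PTR record, DNS failure
        return False
```

This also explains the Compute Engine loophole mentioned below: a whitelist keyed on IP ranges merely owned by Google is looser than this reverse-then-forward verification, which only passes for the designated crawler hosts.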

I know one newspaper that obviously whitelists all Google IPs, so it is possible to bypass the paywall by using a Google Compute Engine VM, which uses an IP address owned by Google but not one of the designated crawler IPs.

Josef
    +1. Also, after Google has indexed a paywall-protected page, it may be possible to access the page through googleweblight. See https://support.google.com/webmasters/answer/6211428?hl=en. – mti2935 Aug 06 '20 at 15:36

The spider Google uses to access the page identifies itself as "Googlebot" and comes from a Google IP address. Most sites only check the User-Agent ("Googlebot" here), while others check the IP too.

You can install an extension that allows you to change the User-Agent, change it to GoogleBot, and try to access the site again. If they aren't checking the IP, you will be able to view the entire page.
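You don't even need a browser extension; any HTTP client can send the header. A sketch with Python's standard library (the User-Agent string is the one Googlebot publishes; the function names are just illustrative):

```python
import urllib.request

# Googlebot's published desktop User-Agent string.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def googlebot_request(url):
    """Build a request that claims to be Googlebot via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

def fetch_as_googlebot(url):
    """Fetch a page while impersonating Googlebot.

    Only works against sites that check nothing but the header; sites
    that also verify the source IP will still serve the paywalled version.
    """
    with urllib.request.urlopen(googlebot_request(url)) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

If the site verifies the crawler's IP as well, this request is indistinguishable from any other spoofed one and gets the normal paywalled response.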

ThoriumBR