How do paywalled sites get their pages into Google?

First, Googlebot tries to index the entire web, paywalled sites included. Even my completely insignificant personal website is indexed by Google all the time.

Google can only index what the website allows it to see. It makes no attempt to bypass security or access files that are not volunteered to it.

If the website feeds Google a paywall, Google indexes that and stops there, because that is all that is available. There are also HTML meta tags that suggest whether something should be cached or not (the example below comes from a question about browser caching). Google probably respects those.

https://stackoverflow.com/questions/1341089/using-meta-tags-to-turn-off-caching-in-all-browsers

<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />

Each bot, including Google's, downloads robots.txt from each website for further instructions on what to do.

Let's look at the nytimes.com robots.txt:

User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /reuters/
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*

Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
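
To see how a crawler would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser on a few lines copied from the file above. The example paths are made up for illustration. Note that Python's parser applies the first matching rule, which is not necessarily how Googlebot resolves Allow/Disallow conflicts, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the nytimes.com robots.txt shown above.
RULES = """\
User-agent: *
Allow: /ads/public/
Disallow: /ads/
Disallow: /archives/
Disallow: /paidcontent/
Disallow: /register
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if `agent` may crawl `path` under RULES."""
    # The host name is only a placeholder; just the path is matched.
    return parser.can_fetch(agent, "http://www.nytimes.com" + path)


if __name__ == "__main__":
    for path in (
        "/2017/04/17/example-story.html",  # hypothetical article URL
        "/ads/public/banner.html",
        "/archives/1999/",
        "/paidcontent/report",
    ):
        print(path, "->", "crawl" if allowed(path) else "skip")
```

Ordinary article paths are not disallowed, which is why the crawler requests them at all. What it gets back is up to the server.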

Now let's look at the tnooz robots.txt:

User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10

Not a single Disallow restriction is to be found in their file, only a crawl delay for four named bots.
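
As a quick check of that reading, the sketch below feeds the tnooz file shown above to Python's urllib.robotparser. The path is a made-up placeholder, and crawl_delay() needs Python 3.6 or later.

```python
from urllib.robotparser import RobotFileParser

# The tnooz robots.txt shown above, verbatim.
TNOOZ_RULES = """\
User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10
"""

tnooz = RobotFileParser()
tnooz.parse(TNOOZ_RULES.splitlines())

# Any path is crawlable, for listed and unlisted bots alike.
googlebot_ok = tnooz.can_fetch("Googlebot", "/any/article")  # hypothetical path
bingbot_ok = tnooz.can_fetch("bingbot", "/any/article")

# Only the four named bots are asked to wait between requests.
bing_delay = tnooz.crawl_delay("bingbot")
google_delay = tnooz.crawl_delay("Googlebot")

if __name__ == "__main__":
    print("Googlebot may crawl:", googlebot_ok)
    print("bingbot may crawl:", bingbot_ok)
    print("bingbot delay:", bing_delay)
    print("Googlebot delay:", google_delay)
```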

qz.com only has a couple of restrictions:

# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see https://developer.wordpress.com/docs/firehose/ for more details.

Sitemap: https://qz.com/news-sitemap.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Sitemap archive
Sitemap: https://qz.com/sitemap.xml

Disallow: /wp-login.php
Disallow: /activate/ # har har
Disallow: /cgi-bin/ # MT refugees
Disallow: /mshots/v1/
Disallow: /next/
Disallow: /public.api/

User-agent: IRLbot
Crawl-delay: 3600

Some sites offer Googlebot sample or partial articles, and Google will cache the parts offered to it.
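
One way a site could do this is to choose the response body based on the request's User-Agent header. The sketch below is purely illustrative: the function names, the preview length, and the subscriber flag are all my assumptions, not how any of the sites above actually work. A real site would also need to verify that a request claiming to be Googlebot really comes from Google, since the header is trivial to fake.

```python
PREVIEW_CHARS = 200  # hypothetical length of the free sample

# Substrings that well-known crawlers include in their User-Agent header.
CRAWLER_TOKENS = ("googlebot", "bingbot")


def is_crawler(user_agent):
    """Naive check: trusts the User-Agent header, which anyone can spoof."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def body_for(user_agent, is_subscriber, article_text, crawler_gets_full=False):
    """Pick what to send: the full article, or only the opening sample."""
    if is_subscriber:
        return article_text
    if crawler_gets_full and is_crawler(user_agent):
        return article_text
    # Everyone else, including crawlers by default, sees only the sample.
    return article_text[:PREVIEW_CHARS]
```

With crawler_gets_full left at False, the crawler sees the same sample as an anonymous visitor, which matches the "partial article" case. Setting it to True models a site that hands the whole article to search engine bots.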

Source for the quote below: https://yoast.com/ultimate-guide-robots-txt/

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to index that page and find the noindex tag, so the page should not be blocked by robots.txt.
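
To make the quote concrete, here is a small sketch of how a crawler might look for that tag once it has been allowed to fetch the page. It uses only Python's standard html.parser. The class and function names are mine, and real search engines also honor other signals (for example an X-Robots-Tag response header) that this does not cover.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def is_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

If robots.txt disallowed the page, the crawler would never download the HTML, so this check could never run. That is the point the quote is making.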

https://support.google.com/webmasters/answer/66356?hl=en&visit_id=1-636280385333935278-3996937908&rd=1

cybernard

Posted 2017-04-17T14:10:20.803

Reputation: 11 200

The Google search bot advertises itself as a bot, so it is trivial for a webmaster to present paid content to a search engine bot. – Ramhound – 2017-04-17T15:30:01.103

Tyler Durden, you might ask the question "How do I ensure Google searches show my content pages on my paywalled site?" on one of our sister sites like https://webmasters.stackexchange.com/. A website can control stuff like this through its robots.txt. – Christopher Hostage – 2017-04-17T18:44:23.893
