How do paywalled sites get their pages into Google?

-1

I notice that paywalled sites like the New York Times come up in Google searches, but if you try to click the link you run into a paywall. Also, I notice that Google does NOT have a cache of the paywalled sites. For example, here are some search results:

[Screenshot of Google search results]

So, here you can see that the last two links have small green triangles leading to the cached contents but the NYT links above do NOT have the green triangle. Is this the result of some kind of dirty deal that the NYT has to give secret access to the content to Google in exchange for promoting their paywalled content? Obviously Google has access to the paywalled pages since they indexed them. Why don't they give access to their cache of the page?

Tyler Durden

Posted 2017-04-17T14:10:20.803

Reputation: 4 710

Question was closed 2017-04-17T14:50:47.057

1 The Google search bot advertises itself as a bot; it is trivial for a webmaster to present paid content to a search engine bot. – Ramhound – 2017-04-17T15:30:01.103

Tyler Durden, you might ask the question "How do I ensure Google searches show my content pages on my paywalled site?" on one of our sister sites like https://webmasters.stackexchange.com/ . A website can control stuff like this through its robots.txt. – Christopher Hostage – 2017-04-17T18:44:23.893

Answers

2

How do paywalled sites get their pages into Google?

First, Googlebot tries to index the entire web, paywalled sites included. Even my completely insignificant personal website is indexed by Google all the time.

Google can only index what the website allows it to see; it makes no attempt to bypass security or to access files that are not volunteered to it.
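As a rough illustration of how a site might decide what a crawler gets to see, here is a minimal sketch in Python. The function names, the list of bot tokens, and the subscriber flag are all hypothetical; nothing here is claimed to be how the NYT or any other site actually does it.

```python
def looks_like_search_bot(user_agent: str) -> bool:
    """Return True when the User-Agent header claims to be a known crawler."""
    known_bot_tokens = ("googlebot", "bingbot")  # illustrative list, not exhaustive
    ua = user_agent.lower()
    return any(token in ua for token in known_bot_tokens)


def select_body(user_agent: str, is_subscriber: bool, article: str, teaser: str) -> str:
    """Pick what to send: the full article, or only the teaser behind the paywall."""
    if is_subscriber or looks_like_search_bot(user_agent):
        return article
    return teaser


# Example: a crawler-style header versus an ordinary browser header.
bot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
print(select_body(bot_ua, False, "FULL ARTICLE", "TEASER"))
print(select_body(browser_ua, False, "FULL ARTICLE", "TEASER"))
```

The User-Agent header is self-reported and easy to fake, so a real site would need some additional way to confirm that a request really comes from the crawler it claims to be.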

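If the website feeds Google a paywall, Google indexes that and stops there, because that is all that is available. There are also HTML meta tags that suggest whether something should be cached. Google probably respects those. Note that the tags in the example below are aimed at browser caches; the robots meta value noarchive is the one that asks a search engine not to offer a cached copy.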

https://stackoverflow.com/questions/1341089/using-meta-tags-to-turn-off-caching-in-all-browsers

<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />

Each bot, Google's included, downloads robots.txt from each website for further instructions on what it may crawl.

Let's look at nytimes: robots.txt

User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /reuters/
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*

Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
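To see how a crawler might interpret rules like these, here is a small sketch using Python's standard urllib.robotparser on a few lines copied from the listing above. The helper name allowed and the article path in the last call are made up for illustration. Python's parser applies the first matching rule in file order, which is not necessarily how Googlebot resolves conflicts between Allow and Disallow.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt listing above.
EXCERPT = """\
User-agent: *
Allow: /svc/news/v3/all/pshb.rss
Disallow: /archives/
Disallow: /paidcontent/
Disallow: /svc
"""

rules = RobotFileParser()
rules.parse(EXCERPT.splitlines())


def allowed(path: str) -> bool:
    """Ask whether a bot calling itself Googlebot may fetch this path."""
    return rules.can_fetch("Googlebot", "http://www.nytimes.com" + path)


print(allowed("/archives/1999/story.html"))       # blocked by Disallow: /archives/
print(allowed("/svc/news/v3/all/pshb.rss"))       # explicitly allowed
print(allowed("/2017/04/17/world/example.html"))  # hypothetical path, no rule matches
```

Ordinary article paths are not covered by any Disallow line in the excerpt, so nothing in robots.txt stops a crawler from requesting them.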

Now let's look at tnooz: robots.txt

User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10

Not a single Disallow rule is to be found in their file; it only asks a few named bots to slow down.
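The same standard-library parser can read the Crawl-delay line. This sketch feeds it the tnooz rules exactly as quoted; the path /any/page is a placeholder chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

TNOOZ_RULES = """\
User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10
"""

tnooz = RobotFileParser()
tnooz.parse(TNOOZ_RULES.splitlines())

# Bots named in the file are asked to wait between requests.
print(tnooz.crawl_delay("bingbot"))

# Googlebot is not named and there is no "User-agent: *" group,
# so the file imposes no delay and no path restriction on it.
print(tnooz.crawl_delay("Googlebot"))
print(tnooz.can_fetch("Googlebot", "/any/page"))
```

Crawl-delay is a request rather than an enforcement mechanism, and not every crawler pays attention to it.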

qz.com has only a few restrictions:

# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see https://developer.wordpress.com/docs/firehose/ for more details.

Sitemap: https://qz.com/news-sitemap.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Sitemap archive
Sitemap: https://qz.com/sitemap.xml

Disallow: /wp-login.php
Disallow: /activate/ # har har
Disallow: /cgi-bin/ # MT refugees
Disallow: /mshots/v1/
Disallow: /next/
Disallow: /public.api/

User-agent: IRLbot
Crawl-delay: 3600

Some sites offer Googlebot sample or partial articles, and Google will cache the parts offered to it.

Source of the quote below: https://yoast.com/ultimate-guide-robots-txt/

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to index that page and find the noindex tag, so the page should not be blocked by robots.txt.
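The reasoning in that quote can be sketched in a few lines of Python: a crawler that obeys robots.txt never downloads a blocked page, so it never gets the chance to read a noindex tag inside it. The class and function names and the example.com URLs are hypothetical, and this is only a simplified model of a crawler, not a description of Googlebot itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            for item in (attr.get("content") or "").split(","):
                self.directives.add(item.strip().lower())


def crawler_sees_noindex(robots_txt: str, url: str, html: str) -> bool:
    """Return True only if the page may be fetched and carries noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("Googlebot", url):
        return False  # page is never fetched, so the tag is never read
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


ROBOTS = "User-agent: *\nDisallow: /private/"
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(crawler_sees_noindex(ROBOTS, "http://example.com/public/a.html", PAGE))
print(crawler_sees_noindex(ROBOTS, "http://example.com/private/a.html", PAGE))
```

In the second call the tag is present in the page but goes unseen, which is why the quote says a page meant to carry noindex should not also be blocked by robots.txt.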

https://support.google.com/webmasters/answer/66356?hl=en&visit_id=1-636280385333935278-3996937908&rd=1

cybernard

Posted 2017-04-17T14:10:20.803

Reputation: 11 200