Why can't search engines such as Google, Yahoo and Bing reach the dark web and archive its content to display it in their results?
-
Because if Google indexed it, it wouldn't, by definition, belong to the dark web, which is part of the deep web. Your question is tautological, similar to: "why can't we see invisible things?" – entrop-x Dec 13 '17 at 08:37
-
@entrop-x The "deep web" does not mean that it is hidden. This is a common myth. And the idea that the "dark web" is part of the "deep web" is also incorrect. The dark web is any networked resource that requires special tools to access. The deep web is any networked resource which is not indexed by default (such as YouTube comments, or private FTP servers, or forum PMs). – forest Dec 13 '17 at 08:45
-
@forest: these are nevertheless the definitions given on Wikipedia: https://en.wikipedia.org/wiki/Deep_web "The deep web,[1] invisible web,[2] or hidden web[3] are *parts of the World Wide Web whose contents are not indexed by standard web search engines for any reason*" and https://en.wikipedia.org/wiki/Dark_web: "The dark web forms a small part of the deep web" – entrop-x Dec 13 '17 at 09:53
-
@entrop-x I don't think the question is tautological. It could be rephrased as "Why is the dark web a part of the deep web?" That it should be does not logically follow from the definition in the dark web Wikipedia article. – Anders Dec 13 '17 at 11:53
-
Both those Wikipedia articles (and their talk sections) are cringeworthy. I'm surprised they haven't been semi-protected. Regardless, their definition is (partially) incorrect, as you can trivially see certain "dark web" sites on search engines, even if they are indexed at a lesser rate. – forest Dec 13 '17 at 12:14
-
@Anders: then what *is* the definition of "the dark web"? I would think that the "deep web" is the set of those resources that are not indexed, whether by choice of the indexing sites or by demand of the resource (say, via robots.txt files), but which may still be accessible with normal web tools (say, browsers), and that the "dark net" is the subset of the deep web that is moreover only accessible with special protocols/tools (Tor, I2P, Freenet...) and/or protected by access walls (intranets/VPNs/...). – entrop-x Dec 13 '17 at 12:34
-
@forest: why are those indexed sites then called "dark web" sites? What's the defining aspect that makes them "dark web" sites if they are on a search index and accessible without special passwords/...? – entrop-x Dec 13 '17 at 12:37
-
@forest: would you call the Bitcoin network part of the dark web, because it needs special tools (a Bitcoin client) to access? And OpenBazaar? – entrop-x Dec 13 '17 at 12:39
-
Those are not part of the "web" any more than BitTorrent is. Just because something involves networking does not make it part of the WWW. The web has hyperlinks, the web has HTML, etc. Crap that is called the dark web is often called such because it sounds scary and cool. Seriously. – forest Dec 13 '17 at 12:42
-
Probably the main interest of this question is that it makes us think about what exactly we call the "dark web", and we see that it bites us everywhere. What if you torrent zipped HTML files to consult them locally in a browser? What about FTP services on Tor? What about Freenet files (HTML format or not)? – entrop-x Dec 13 '17 at 12:58
-
I wouldn't call zipped HTML part of the "dark web". FTP hidden services, sure (though I think they're not compatible with the Tor network, due to the way FTP works), and Freesites (websites on Freenet) would also count. – forest Dec 14 '17 at 04:12
-
The "dark web" term was introduced by journalists trying to highlight the negative side of the unindexed web. In general, it is only the deep web that cannot be indexed, because search engines cannot harvest onion addresses. – defalt Dec 14 '17 at 04:51
-
The Tor Project even refuses to say "hidden services" because the media and journalists spin the word negatively instead of understanding the technology. So they now say "onion services", which I find more appropriate. – defalt Dec 14 '17 at 04:54
-
Yeah I noticed they started doing that recently. I find it a silly change since "onion" has a negative connotation as well. That's just my opinion though. Maybe I just don't like change. – forest Dec 14 '17 at 05:09
-
@forest The change was necessary. [They explain why here](https://www.youtube.com/watch?v=VmsFxBEN3fc). The media and bloggers post false material about the word **hidden** when they hear "*hidden* service protocol". – defalt Dec 14 '17 at 05:34
-
Yeah I suppose that makes sense. – forest Dec 14 '17 at 05:37
-
Is there anything else you want me to add to my answer? – forest Jan 04 '20 at 02:44
1 Answer
Assuming you are talking about Tor hidden services, then the answer is they can, but only indirectly. There are various "portal" sites which provide a gateway to hidden services. These gateways are normal websites with regular domains, but are running tor2web software, which uses the Tor client to relay traffic between non-Tor and Tor users (but note that they provide no anonymity). These can be indexed at will.
There are several reasons why Tor hidden services are not indexed frequently:
- As someone else pointed out earlier, it's very disjointed. Very few sites link to each other, limiting crawlers' ability to find new sites and new pages. It's like the open internet from the 90s.
- It uses its own protocol, so without portal/gateway sites, crawlers would be unable to connect. Try connecting to a `.onion` domain in a normal browser: you'll see it won't even resolve.
- There are not that many hidden services out there. The myth that the hidden web is "vast" is ill-founded, based on a misunderstanding of the terminology. In reality, it is really quite small.
- Some sites are blocked by portal/gateway sites for legal reasons, so they can only be accessed using the Tor protocol. As search engine crawlers don't use this, they can't access the sites.
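The "won't even resolve" point above can be demonstrated outside a browser: `.onion` is a reserved special-use top-level domain (RFC 7686), so ordinary DNS resolvers will not resolve it, and a standard crawler fails before it can even open a connection. A minimal sketch (the `example.onion` name is a placeholder, not a real service):

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves via ordinary DNS."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        # NXDOMAIN (or no resolver at all) lands here; for .onion
        # names, RFC 7686 requires resolvers to fail the lookup.
        return False

print(resolves("example.onion"))
```

A Tor-aware client avoids this entirely by handing the unresolved name to the Tor client over its SOCKS proxy, rather than asking DNS.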
There is no single "database" of hidden services as there is for regular domains (the root nameservers). A hidden service address is an encoded, truncated hash of the server's public key. The client uses the service's domain name to look up the hidden service's descriptor in a semi-public database; the descriptor contains the service's public key and a list of Introduction Points (relays chosen by the server). The client selects a random relay as a Rendezvous Point and sends that relay's ID to the hidden service via an Introduction Point. The server and client then meet through the Rendezvous Point over their own three-hop circuits.
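The "encoded, truncated hash" can be shown concretely. For (now-legacy) v2 hidden services, the address is the base32 encoding of the first 80 bits of the SHA-1 hash of the service's DER-encoded RSA public key. A sketch, using a dummy byte string in place of a real key:

```python
import base64
import hashlib

def v2_onion_address(pubkey_der: bytes) -> str:
    """Derive a v2-style onion address: base32 of the first
    10 bytes (80 bits) of SHA-1 over the DER-encoded public key."""
    digest = hashlib.sha1(pubkey_der).digest()
    return base64.b32encode(digest[:10]).decode("ascii").lower() + ".onion"

# Illustration only -- any bytes stand in for a real RSA key:
addr = v2_onion_address(b"not a real RSA public key")
print(addr)  # 16 base32 characters followed by ".onion"
```

Current v3 onion services instead embed the full ed25519 public key (plus a checksum and a version byte) in the address, which is why v3 addresses are 56 characters long rather than 16.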
Through a complex protocol, the client and server thus manage to form a connection without either of them needing to reveal their real IP. Because there is no IP address that the domain resolves to, a regular search engine can't reach it using standard HTTP with TCP/IP. In order for a search engine crawler to connect to these sites, it would have to use this protocol. That is not very practical for them.
- 64,616
- 20
- 206
- 257
-
Good point about #3. I really hate when I see that iceberg picture explaining the size of the deep web. The Tor Project and researchers have never stated that the deep web is vast. This myth is exaggerated by the media and laypeople to make it look scarier. – defalt Dec 14 '17 at 05:14
-
Yup. The iceberg explanation is the bane of my existence. The sad thing is, one of the most common versions out there nowadays is actually a _parody_ of the original, talking about quantum mumbo jumbo and making up silly words, yet it's disseminated widely. I'm amazed that people don't realize that a 1000x larger "web" would require roughly 1000x more users on it. Do they think that they're in the 0.1% of internet users who are _not_ on this super hidden dark deep spooky hacker web? /minirant – forest Dec 14 '17 at 05:17
-
@defalt An iceberg is a perfect analogy for the deep web, but not the dark web. The deep web is just the parts of the world wide web which are not indexed, which includes anything behind a login. – Marie Aug 16 '19 at 00:03
-
@Marie There is no such thing as the dark web. That term was created by the media. We call the non-indexable part of the web the deep web. – defalt Aug 16 '19 at 03:09