Is it because Yahoo crawls private emails (seems really improbable)?
That's highly unlikely. E.g., Yahoo surely wouldn't index a password reset link that you get mailed to your private Yahoo inbox.
Is it because these emails were somehow leaked on the web and Yahoo crawled them?
That's a plausible explanation. It's sometimes hard to work out exactly how a search engine discovered a piece of content, but it must have appeared somewhere on the public Internet or otherwise been made accessible to the search engine. Another possibility that comes to mind is that the site's framework automatically lists all content in a sitemap (e.g. at yoursite.example/sitemap.xml) - it's something that WordPress often does. Also, are you sure the content isn't visible via a directory listing or a publicly accessible database dump?
Also I was thinking, is it possible that clients copied/pasted the URL directly in Yahoo search, and then Yahoo searched for that and kept it?
I don't know if Yahoo automatically indexes the URLs you search for - that certainly sounds risky from a security perspective. But I find it unlikely that several of your customers would paste their URLs into the Yahoo search bar.
If you have request logs available, you could check which documents the Yahoo crawler has accessed in the past (look for a user-agent containing "Yahoo! Slurp").
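As a rough illustration of that log check, here is a minimal Python sketch. It assumes your server writes the common "combined" access log format; the regular expression, the function name yahoo_crawler_hits and the two sample log lines (including the exact user-agent string) are made up for the example, so adjust them to whatever your logs actually contain.

```python
import re

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def yahoo_crawler_hits(lines, marker="Yahoo! Slurp"):
    """Return (time, path, status) for requests whose user-agent contains marker."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and marker in match.group("agent"):
            hits.append(
                (match.group("time"), match.group("path"), int(match.group("status")))
            )
    return hits


# Hypothetical sample lines, only here to show the expected input shape.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /app/secretcontent.ext?token=abc HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] '
    '"GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for hit in yahoo_crawler_hits(sample):
    print(hit)
```

In practice you would pass an open log file instead of the sample list. Keep in mind that a user-agent header can be spoofed, so a match is a hint rather than proof that the request came from Yahoo.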
Countermeasures
To prevent search engines from indexing sensitive content, you can add a Disallow entry to your robots.txt file with a * wildcard, like this:
Disallow: /app/secretcontent.ext?token=*
But note that not all search engines respect Disallow directives, and robots.txt is one of the first places attackers look when gathering information, so listing a path there also advertises that it exists.
More generally, you might want to let links to sensitive content expire quickly (or even turn them into one-shot links that can only be accessed once, if appropriate).
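To make the expiry and one-shot idea concrete, here is a minimal sketch in Python. The class name OneShotLinks, the 15-minute default lifetime and the in-memory dictionary are all assumptions for illustration; a real site would keep the tokens in its database so that they survive restarts and are shared between servers.

```python
import secrets
import time


class OneShotLinks:
    """Hypothetical in-memory store for tokens that expire and work only once."""

    def __init__(self, ttl_seconds=900, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._expiry_by_token = {}

    def issue(self):
        """Create an unguessable token and remember when it expires."""
        token = secrets.token_urlsafe(32)
        self._expiry_by_token[token] = self._clock() + self._ttl
        return token

    def redeem(self, token):
        """Return True only on the first use of an unexpired, known token."""
        # pop() removes the token, so any second attempt finds nothing.
        expires_at = self._expiry_by_token.pop(token, None)
        return expires_at is not None and self._clock() < expires_at


links = OneShotLinks(ttl_seconds=900)
token = links.issue()
print(links.redeem(token))  # first access
print(links.redeem(token))  # same link again
```

With this scheme, a URL that ends up in a search index is useless once the legitimate user has opened it or the lifetime has passed.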
Another approach could be sending users the link to yoursite.example/secretcontent and the token separately. The site would then present a form where the user has to enter the token. This form would submit the token via POST so that the token is never visible in the URL and hence can't be indexed.
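A rough sketch of the server-side part of such a form, using only the Python standard library. The form markup, the field name token and the helper names are assumptions for the example; how the request body reaches these functions depends on your web framework.

```python
import hmac
from urllib.parse import parse_qs

# Hypothetical form: method="post" keeps the token in the request body,
# not in the URL, so it never shows up in links or referrers.
FORM_HTML = """<form method="post" action="/secretcontent">
  <input type="password" name="token" autocomplete="off">
  <button type="submit">Open</button>
</form>"""


def token_from_post_body(body: bytes) -> str:
    """Extract the 'token' field from a form-encoded POST body ('' if absent)."""
    fields = parse_qs(body.decode("utf-8"))
    values = fields.get("token", [])
    return values[0] if values else ""


def is_valid_token(submitted: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))


submitted = token_from_post_body(b"token=s3cret")
print(is_valid_token(submitted, "s3cret"))
```

The expected token would come from wherever you stored it when the email was sent; the literal "s3cret" above is just a placeholder.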