
I recently asked about 301 redirection on ServerFault and didn't get a proper solution to my problem, but now I have a new idea: use robots.txt to keep certain URLs on my site from being crawled.

My problem was simple: after a migration from a proprietary, customised CMS to WordPress, we had a lot of URLs that Google couldn't find on the new site and that went to a 404 page. This is bad for our PageRank and our search results, because Google still thinks those pages are alive.

We have a list of the URLs that don't work, and I tried to redirect them to the good ones. The problem is that there are 20,000 of them, and there's no chance of solving this with a regular expression. We had to set up the 301 redirects ONE BY ONE, and that was a hell of a task.
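To give an idea of what "one by one" means in practice, each dead URL ends up as its own line in Apache's .htaccess, roughly like this (just a sketch; the paths and targets are invented, and it assumes mod_alias is available):

    # one mod_alias Redirect line for every dead URL
    Redirect 301 /old-cms/products/widget-123 http://www.example.com/widget-123
    Redirect 301 /old-cms/news/2008/some-article http://www.example.com/some-article
    # ...and so on, 20,000 times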

But I was wondering: could we just list all those bad URLs in our robots.txt with the "Disallow:" directive, so Google does not index them? Is this a bad idea?
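For reference, what I have in mind is something like this in robots.txt (a sketch; the paths are invented):

    User-agent: *
    # one Disallow line per dead URL
    Disallow: /old-cms/products/widget-123
    Disallow: /old-cms/news/2008/some-article
    # ...and so on for the whole list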

javipas

2 Answers


If Google thinks that your 404 page is valid then you need to be returning a 404 response code on that page. Fix that, and the rest will be fine.
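A quick way to verify what the page is actually sending is to look at the response headers, for example with curl (replace the URL with one of the dead ones; example.com is just a placeholder):

    # HEAD request: the first line of output shows the status code the server returns
    curl -I http://www.example.com/some-missing-page

    # or print just the status code
    curl -s -o /dev/null -w "%{http_code}\n" http://www.example.com/some-missing-page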

Ignacio Vazquez-Abrams
  • I'm sorry, but I don't understand your tip... Could you explain this with a little more detail, maybe some example? – javipas Dec 10 '10 at 12:18
  • Your 404 page is not telling the client that it's a 404 page. Fix this. – Ignacio Vazquez-Abrams Dec 10 '10 at 12:20
  • Ok, but... after doing that... should I do something with my robots.txt and disallow all the URLs that give a 404, or do I leave everything as is? BTW, thanks for your quick answers :P – javipas Dec 10 '10 at 12:57
  • If the pages properly return a 404, they won't be indexed anyway; there is no need to add them to robots.txt. – JamesRyan Dec 10 '10 at 13:21
  • When the bots get the 404, they will mark the page as not found. – Ignacio Vazquez-Abrams Dec 10 '10 at 13:21
  • Good to know. We've added the 404 response code, and tried the header response with a Firefox extension. Seems to be working, I'll wait until tomorrow to see if Google has stopped indexing all those pages. – javipas Dec 10 '10 at 13:33

To put it simply: yes, this would be a bad idea. By blocking Google from seeing those pages, it can't determine what is on them, and in some instances it can treat them as suspicious, since you appear to be hiding things that don't need to be hidden.

What you should do is redirect any relevant old pages to the new ones.

Example:

"domain-old.com/a" and "domain-old.com/b" might be redirected to "domain-new.com/a-b"

This works because the content of /a and /b now lives on /a-b: there is relevance, so the redirect makes sense.
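In Apache terms that kind of mapping is one mod_alias line per old URL, something like this sketch in the old domain's .htaccess or vhost config:

    # both old pages point to the single new page that now holds their content
    Redirect 301 /a http://domain-new.com/a-b
    Redirect 301 /b http://domain-new.com/a-b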

If the content were irrelevant, redirecting would be considered bad:

"domain-old.com/a", "domain-old.com/b" and "domain-old.com/c" redirected to "domain-new.com/a-b"

In this case redirecting /c makes no sense, as /a-b has no relevance to the content that was on /c.

/c would be left to return a 404.

It's important to note that pages that return a 404 will lose whatever traffic they were getting.

RonanW.