Spider/crawl a website and get each URL and page title in a CSV file

1

I am moving from an old ASP shopping cart site to a Drupal/Ubercart site. Part of this move is to ensure that old links will redirect to the new ones. To do that, all I need is some way to get a list of all the links on the old site.

Preferably the results would include the page title, and ideally I could give it some way to return other data from the page (e.g., via a CSS selector).

I would prefer something that runs on OS X, but I can use Windows apps too.

I have tried Integrity, but its output is nearly impossible to decipher, and it doesn't seem to work well anyway.

Tyler Clendenin

Posted 2012-08-02T05:54:31.787

Reputation: 251

R can handle this, but I'm not sure how to do it for an entire website. Here's an example of parsing one page: http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r

– Brandon Bertelsen – 2012-08-02T06:44:05.027

Answers

0

If you don't mind writing Perl scripts ...

This module implements a configurable web traversal engine, for a robot or other web agent. Given an initial web page (URL), the Robot will get the contents of that page, and extract all links on the page, adding them to a list of URLs to visit.
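That description appears to come from the module's CPAN documentation. As a rough sketch of the same idea, here is a minimal same-host crawler that emits the URL/title CSV the question asks for, built on the common LWP::UserAgent, HTML::LinkExtor, and URI modules rather than the traversal engine quoted above; the starting URL is a placeholder you would replace:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    # Placeholder: replace with the old site's root URL.
    my $start = 'http://www.example.com/';
    my $host  = URI->new($start)->host;

    my $ua    = LWP::UserAgent->new;
    my %seen;
    my @queue = ($start);

    print qq{"URL","Title"\n};

    while (my $url = shift @queue) {
        next if $seen{$url}++;

        # Fetch the page; skip failures and non-HTML responses.
        my $resp = $ua->get($url);
        next unless $resp->is_success && $resp->content_type eq 'text/html';
        my $html = $resp->decoded_content;

        # Pull the <title> and escape it for CSV output.
        my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}si;
        $title = '' unless defined $title;
        $title =~ s/\s+/ /g;
        $title =~ s/"/""/g;
        print qq{"$url","$title"\n};

        # Queue every <a href> that stays on the same host.
        my $extor = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' && defined $attr{href};
            my $abs = URI->new_abs($attr{href}, $url);
            $abs->fragment(undef);    # drop #anchors so pages aren't revisited
            push @queue, $abs->as_string
                if $abs->scheme =~ /^https?$/ && $abs->host eq $host;
        });
        $extor->parse($html);
    }

Saved as crawl.pl and run as perl crawl.pl > links.csv, it writes the CSV to a file. LWP::UserAgent and HTML::LinkExtor ship with many Perl installs; if yours lacks them, running cpan LWP::UserAgent HTML::LinkExtor URI from a terminal should install them.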

RedGrittyBrick

Posted 2012-08-02T05:54:31.787

Reputation: 70 632

I am horrible with Perl, and I cannot figure out how to install a module from CPAN =p – Tyler Clendenin – 2012-08-02T15:38:18.880